Bulk Loading Documents into Couchbase
This blog post is an example of how one might load data into Couchbase as JSON. For the purposes of this post, data extracted from an RDBMS as CSV will be converted to JSON. DBAs and admins familiar with relational databases such as Oracle, SQL Server, and MySQL are often looking for a way to experiment with and test NoSQL, and often the first step is to convert whatever data you have into JSON.
Couchbase supports JSON and binary data, but for the purposes of this post we are looking at the richer data type, JSON. This is important to understand because documents loaded into Couchbase in any format other than JSON are stored as binary, which limits your flexibility when building views or indexes. With that, let's get on with loading some documents. There are two ways to achieve this, but for the purposes of this post I'm assuming that any document you want loaded is already in JSON format, either compressed or uncompressed. The next section describes one possible path to ensure you are loading JSON data.
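To make the target shape concrete, here is a minimal sketch (the rows are made up) of the kind of RDBMS-style CSV extract this post deals with, converted to JSON with only the Ruby standard library:

```ruby
require 'csv'
require 'json'

# A made-up two-row CSV extract, as it might come out of an RDBMS
csv_data = "id,product,issue\n1001,Mortgage,Late fees\n1002,Credit card,Billing dispute\n"

# One JSON object per row: the rich document shape Couchbase can index and view
rows = CSV.parse(csv_data, :headers => true).map { |row| row.to_hash }
puts JSON.pretty_generate(rows)
```

Each CSV row becomes a self-describing JSON object, which is what makes views and indexes possible later on.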
Couchbase also has ETL partners, such as Talend, which offers a connector for Couchbase. This is useful if you want a GUI, don't want to deal with CSV files, need to reorder your data before committing it to CSV, or need to ETL data from several sources before storing it in Couchbase Server. Talend can map and store documents directly as JSON files prior to loading them into Couchbase, if desired.
This guide assumes you have some familiarity with Linux or Mac, package managers, and Ruby.
For additional SDK setup information please visit: developer.couchbase.com
The steps used to prep and load the data are as follows:
- Prepare the data: look at a couple of example tools to convert CSV to JSON.
- Load the data: examine a few methods to load the data into Couchbase via Ruby scripts.
Prerequisites for Linux and Mac: Requires a functional build environment!
- libcouchbase: since these scripts use Ruby, you must have libcouchbase installed prior to installing the couchbase gem.
- RubyGems: be sure you have RubyGems available so you can install the Ruby couchbase wrapper.
- Gems that I used are
- ‘ruby-progressbar’
- ‘couchbase’
- ‘yaji’
- ‘optparse’
- Yajl parser: this must also be installed as a prerequisite to YAJI.
- Install Couchbase Gem: gem install couchbase
If the setup has been successful, the Ruby scripts I've provided should run. You may want to pass -h to the stream loader and ensure you get the syntax message. Lastly, don't forget to install the gems listed above: yaji, optparse, couchbase, and ruby-progressbar. Links are provided at the bottom of this post.
Prepare the Data
Data Prep Method 1: Simple, speedy and consistent: csvtojson NodeJS script
- It can be found via a quick Google search, and there are other similar tools too.
- Installs via npm:
npm install -g csvtojson
Here is an example conversion for reference:
Austins-MBP:complaintstemp austin$ npm info csvtojson
{ name: 'csvtojson',
  description: 'A tool concentrating on converting csv data to JSON with customised parser supporting',
  'dist-tags': { latest: '0.3.21' },
Austins-MBP:complaintstemp austin$ which csvtojson
/usr/local/bin/csvtojson
Austins-MBP:complaintstemp austin$ csvtojson data.gov_Consumer_Complaints.csv
Data Prep Method 2: Write a Ruby Script: csv2json.rb
The time to complete the conversion will vary, because Ruby is single-threaded.
An additional note on this script: I am using the YAJL parser instead of the default JSON module, because the default module doesn't handle streaming. The script below shows the only change required, which improves memory use during conversion. If you haven't installed YAJL before, you can simply run 'gem install yajl-ruby'.
#!/usr/bin/env ruby
require 'rubygems'
require 'csv'
require 'pathname'
#require 'JSON'
require 'yajl/json_gem'

Dir["./**/*.csv"].each do |csv_file_path|
  puts csv_file_path
  file_name = Pathname.new(csv_file_path).basename(".csv").to_s
  File.open("#{file_name}.json", 'w') do |json_file|
    jsonData = CSV.read(csv_file_path, :headers => true, :header_converters => :symbol).map { |csv_row| csv_row.to_hash }
    json_file.write(JSON.pretty_generate(jsonData))
  end
end
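Note that CSV.read in the script above pulls the entire CSV into memory before converting. For very large extracts, a streaming variant can write one row at a time so memory use stays flat. This is a sketch under that assumption, using only the stdlib CSV and JSON modules; the helper name csv_to_json_stream is hypothetical:

```ruby
require 'csv'
require 'json'

# Hypothetical helper: stream-convert a CSV file to a JSON array, one row at a
# time, writing the array brackets and separators ourselves as we go.
def csv_to_json_stream(csv_path, json_path)
  File.open(json_path, 'w') do |out|
    out.write("[\n")
    first = true
    CSV.foreach(csv_path, :headers => true, :header_converters => :symbol) do |row|
      out.write(",\n") unless first
      first = false
      out.write(JSON.generate(row.to_hash))
    end
    out.write("\n]\n")
  end
end
```

The trade-off is that you lose pretty-printing, but the output is still valid JSON and the peak memory used is one row rather than the whole file.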
Post Conversion Steps
Once conversion is complete, it is time to compress the file with zip:
Austins-MBP:complaintstemp austin$ ls
Consumer_Complaints.csv    Consumer_Complaints.json    csv2json.rb
Austins-MBP:complaintstemp austin$ zip Consumer_Complaints.zip Consumer_Complaints.json
  adding: Consumer_Complaints.json (deflated 95%)
Place the zip file(s) into a directory. I used ~/Downloads/json_files/ in my home directory.
Austins-MBP:complaintstemp austin$ cp Consumer_Complaints.zip ~/Downloads/json_files/
Once data has been prepared you are ready to start loading. The following examples will touch on a couple of common ways to get your data into Couchbase in bulk.
The Couchbase install comes with a built-in tool called cbdocloader. It takes individual document files, up to 20 MB in size, either zipped up or within a directory, and loads them. At the time of this writing, cbdocloader requires multiple JSON-formatted files contained in a directory. Second, I will discuss a tool I wrote in Ruby which employs the YAJI Ruby gem from Couchbase's own Sergey Avseyev. The code referenced is free to use and can be rewritten in any language you are comfortable with.
Doc Loading Method 1: Using cbdocloader
This method uses a set of individual files within a directory. It is a common use case, but it depends on the structure of the files and directories to be imported reflecting the desired documents as they will appear once loaded. To help preserve that structure, we recommend packaging the files and directories to be loaded within a .zip file.
The document ID key names will be based on the names of the document files provided.
Note: this method is not ideal for large consolidated document files. For large monolithic files, I will show how they are loaded in Method 2, below.
Then load the file or files with the following command:
cbdocloader -u Administrator -p s3kre7Pa55 -b MyBucketToLoad -n 127.0.0.1:8091 -s 1000 ~/json_files/beer-sample.zip
Note:
The '-s 1000' flag is the memory size (in MB) for the bucket; you'll need to adjust this value for your bucket.
Also, the bucket does not need to exist beforehand, as cbdocloader will create it. Be aware of your resource utilization before setting the '-s' flag, to make sure you have the RAM available.
If everything was successful, you'll see output stating whether documents were loaded, how many bytes were transferred, and so on.
bucket: 2014-10-02-ny-json.zip, msgs transferred...
       :                total |       last |    per sec
 byte  :            242446488 |  242446488 |  3115728.2
Here is a brief script to load up a lot of .zip files in a given directory:
#!/bin/bash
JSON_Dir=~/json_files/
for ZipFile in "$JSON_Dir"/*.zip ; do
  # The install path contains a space, so it must be quoted
  "/Applications/Couchbase Server.app/Contents/Resources/couchbase-core/bin/cbdocloader" \
    -u Administrator -p s3kre7Pa55 -b MyBucketToLoad -n 127.0.0.1:8091 -s 1000 "$ZipFile"
done
Doc Loading Method 2: streamloadjson
The other method is to load all documents, comma separated, from a single monolithic file.
To accomplish this, I have prepared a small but effective script that uses the YAJI JSON stream parser; I called it streamloadjson.
The options for this program are:
~$ ruby streamloadjson.rb -h
ruby json loader
    -f, --jsonfile jsonfile          JSON formatted filename
    -b, --bucketname BUCKETNAME      bucket name
    -n, --hostname hostname          hostname/ip address
    -r, --root JSONroot              JSON root to parse
    -d, --docid SearchKey            JSON Key ID
    -h, --help                       This menu
To load documents from a test JSON file such as fathers.json.txt, referenced below, it can be called like so:
ruby streamloadjson.rb -b TestBucket -f fathers.json -n localhost -d id -r /fathers/
The script should provide output like below:
Reading from fathers.json
JSON path processed is /fathers/
Couchbase node is localhost
29999 Time: 00:00:07 |>>---=---=---=---=---| Doc Loading
One major advantage of the YAJI parser is its very low memory consumption. This means you could potentially paginate the input data and break it up into multiple streams to load into Couchbase. Since Ruby is single-threaded, this means spawning discrete processes, though another language could be used for multi-threading. Examples of this are in the Couchbase Labs GitHub repository.
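As a sketch of that multi-process idea (the chunk file names and worker count here are made up, and the real loader invocation is only indicated in a comment), pre-split chunks of a monolithic file could be divided among forked workers:

```ruby
# Hypothetical chunk files; in practice this might be Dir["./chunks/*.json"]
chunk_files = (1..8).map { |i| "chunk_#{i}.json" }
workers = 4

# Divide the chunks evenly among the workers
batches = chunk_files.each_slice((chunk_files.size.to_f / workers).ceil).to_a

pids = batches.map do |batch|
  fork do
    batch.each do |file|
      # Real use would shell out to the loader for each chunk, e.g.:
      #   system("ruby", "streamloadjson.rb", "-b", "MyBucketToLoad", "-f", file)
      puts "worker #{Process.pid} assigned #{file}"
    end
  end
end
pids.each { |pid| Process.wait(pid) }
```

Because each worker is a separate OS process, MRI's single-threaded nature stops mattering; the trade-off is one Couchbase connection per process.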
A couple of things to note: this tool only loads monolithic document files, it will try to create an ID automatically if one isn't provided with '-d', and it may require some fine-tuning of the "root" with '-r' if no documents load.
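For illustration, this is the key-derivation behaviour just described, extracted from the loader's logic (the sample document is made up):

```ruby
doc = { "id" => 42, "name" => "Riley" }  # a made-up document
i = 0                                     # position in the stream

# With -d id, the key is "<field>:<value>"
docid = "id"
key_from_docid = "#{docid}:#{doc[docid]}"

# Without -d, a key is synthesized from the index and the document's hash
key_auto = "Doc_#{i}:#{doc.hash.abs}"

puts key_from_docid  # => "id:42"
```

Supplying '-d' with a field that is unique in your data is preferable, since the synthesized keys are not stable across runs.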
Newer code for the loader is available in my GitHub repository, but I have also provided it inline below:
#!/usr/bin/env ruby
# An example JSON file for testing is at:
# https://github.com/ysharplanguage/FastJsonParser/raw/master/JsonTest/TestData/fathers.json.txt
require 'rubygems'
require 'ruby-progressbar'
require 'couchbase'
require 'yaji'
require 'optparse'

options = {}
optparse = OptionParser.new do |opts|
  opts.banner = 'ruby json loader'
  opts.on('-f', '--jsonfile jsonfile', 'JSON formatted filename') do |jsonfile|
    options[:jsonfile] = jsonfile
  end
  opts.on('-b', '--bucketname BUCKETNAME', 'bucket name') do |bucketname|
    options[:bucketname] = bucketname
  end
  opts.on('-n', '--hostname hostname', 'hostname/ip address') do |hostname|
    options[:hostname] = hostname
  end
  opts.on('-r', '--root JSONroot', 'JSON root to parse') do |root|
    options[:root] = root
  end
  opts.on('-d', '--docid SearchKey', 'JSON Key ID') do |docid|
    options[:docid] = docid
  end
  opts.on('-h', '--help', 'This menu') { puts opts; exit }
end
optparse.parse!

json_file = options[:jsonfile]
bucket    = options[:bucketname]
host      = options[:hostname]
root      = options[:root]
docid     = options[:docid]

# Set some defaults if not passed
host ||= 'localhost'
root ||= '/'

puts "Reading from #{json_file}"
puts "JSON path processed is #{root}"
puts "Couchbase node is #{host}"

# Connect to the Couchbase host and bucket provided on the command line
client = Couchbase.connect(:bucket => bucket, :host => host)

# Create the progress bar
progressbar = ProgressBar.create(:title => "Doc Loading",
                                 :starting_at => 0,
                                 :total => nil,
                                 :throttle_rate => 0.01,
                                 :format => '%a |%b>>%i| %p%% %t',
                                 :length => 50)

parser = YAJI::Parser.new(File.open(json_file))
parser.each(root.to_s).with_index do |doc, i|
  if docid
    # Key is "<field>:<value>" when a key field is supplied with -d
    prikey = "#{docid}:#{doc["#{docid}"]}"
  else
    # Otherwise synthesize a key from the index and the document's hash
    prikey = "Doc_#{i}:#{doc.hash.abs}"
  end
  client.set(prikey, doc)
  print("\rProcessed #{i} Documents")
  progressbar.increment
end
progressbar.finish
puts
Finishing the Job
Once data has been loaded, log in to the Couchbase console and begin working with development views for queries and indexing.
If you are using Couchbase Server 4.0 with N1QL, you will want to create a primary index (for example, CREATE PRIMARY INDEX ON MyBucketToLoad; from the cbq shell) so you can explore Couchbase's SQL-like interface immediately and start taking advantage of the power of N1QL queries through our SDKs!
Many thanks to the great folks in the open source community for providing the YAJL gem, and to Sergey Avseyev for the YAJI parser. Sergey is a very knowledgeable Couchbase resource responsible for the Ruby SDK work, and I would also encourage any of you to try our JRuby SDK and provide feedback.
Links:
CB Examples Github – https://github.com/agonyou/cb-examples/
YAJI Stream Parser – https://github.com/avsej/yaji
YAJL JSON Gem – https://github.com/brianmario/yajl-ruby
csv2json Gem – https://rubygems.org/gems/csv2json/
Couchbase Server 4 with N1QL – http://www.couchbase.com/nosql-databases/downloads
If I execute \’./cbdocloader -u tito -p foobar -b test -n 192.168.1.4:8091 -s 1000 /Users/tito/Desktop/sample.zip \’ and the sample file contains a JSON array with just two documents, the entire file content is imported as one document. In other words, cbdocloader does not seem to realize that the document is an array of JSON objects. Also, the editor shows \’Warning: JSON should not be an array\’. How is cbdocloader supposed to work? Thanks.
By the way, if I use the sample JSON found in \’http://www.rubydoc.info/gems/c…, I get the very same issue. In both cases, the file contents are valid JSON. It\’s just seems that cbdocloader is not cooperating. :-/
Well put together, you could also use https://sqlify.io/convert/csv/to/json to convert to JSON and then load the documents as normal instead of rolling your own script.