Bulk Loading Documents into Couchbase

This blog post walks through one way to load data into Couchbase as JSON. For the purpose of this post, data extracted from an RDBMS as CSV will be converted to JSON. DBAs and admins familiar with Oracle, SQL Server, MySQL, etc. are probably looking for a way to experiment and test with NoSQL. Often the first step in using NoSQL is to convert whatever you have into JSON.

Couchbase supports both JSON and binary data, but this post focuses on the richest data type, JSON. This matters because documents loaded in any format other than JSON are stored as binary, which may limit your flexibility when building views or indexes. With that, let’s get on with loading some documents. There are two ways to do this, and for both I’m assuming that any document you want loaded is already in JSON format, either compressed or uncompressed. The next section describes one possible path to ensure you are loading JSON data.

Some Couchbase ETL partners, such as Talend, offer a connector for Couchbase. This may be a better fit if you want a GUI, don’t want to deal with CSV files, want to reorder your data before committing it to CSV, or need to ETL data from several sources before storing it in Couchbase Server. Talend can also map and store documents directly as JSON files prior to loading into Couchbase, if desired.

This guide assumes you have some familiarity with Linux or Mac, package managers, and Ruby.

For additional SDK setup information please visit: www.couchbase.com/developers/

The steps used to prep and load the data are as follows:

  1. Prepare the data: Look at a couple of example tools to convert the CSV to JSON.
  2. Load the data: Examine a few methods to load the data into Couchbase via Ruby scripts.

Prerequisites for Linux and Mac: Requires a functional build environment!

  1. Libcouchbase: Since these scripts use Ruby, you must have libcouchbase installed before installing the couchbase gem.
  2. RubyGems: Be sure that you have RubyGems available to install the Ruby couchbase wrapper.
    • Gems that I used are
      • ‘ruby-progressbar’
      • ‘couchbase’
      • ‘yaji’
      • ‘optparse’
  3. Yajl parser: this must also be installed as a prerequisite to YAJI.
  4. Install Couchbase Gem: gem install couchbase

If the setup was successful, the Ruby scripts I’ve provided should run. You may want to pass -h to the stream loader and confirm that you get the syntax message. Lastly, don’t forget to install the gems listed above as well: yaji, optparse, couchbase, and ruby-progressbar. Links are provided at the bottom of this post.

Prepare the Data

Data Prep Method 1: Simple, speedy and consistent: csvtojson NodeJS script

  • Can be found with a quick Google search, and there are other similar tools too.
  • Installs via npm, e.g. npm install -g csvtojson

Here is an example conversion for reference:

Data Prep Method 2: Write a Ruby Script: csv2json.rb 

The time to complete the process will vary because Ruby is single threaded.

An additional note for this script is that I am using the YAJL parser instead of the default JSON module which doesn’t handle streaming data into Couchbase.

The script below shows the only change required. This will improve memory use during conversion. If you haven’t installed YAJL before you can simply do this: ‘gem install yajl-ruby’
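For illustration, here is a minimal sketch of the CSV-to-JSON step using only Ruby’s standard csv and json libraries rather than YAJL. It is not the csv2json.rb script itself: the function name csv_file_to_json and the sample data are hypothetical. It reads the CSV one row at a time and writes each row out as one JSON object in a comma-separated array, so memory use should stay low even for large exports.

```ruby
require 'csv'
require 'json'
require 'tmpdir'

# Stream a CSV file (first row = column names) into a JSON array file.
# Rows are read and written one at a time, so the whole file is never
# held in memory. Returns the number of rows converted.
def csv_file_to_json(csv_path, json_path)
  count = 0
  File.open(json_path, 'w') do |out|
    out.puts '['
    CSV.foreach(csv_path, headers: true) do |row|
      out.puts ',' unless count.zero?      # separator between documents
      out.write JSON.generate(row.to_h)    # one JSON object per CSV row
      count += 1
    end
    out.puts "\n]"
  end
  count
end

# Usage with hypothetical sample data standing in for an RDBMS export.
Dir.mktmpdir do |dir|
  csv_path  = File.join(dir, 'people.csv')
  json_path = File.join(dir, 'people.json')
  File.write(csv_path, "id,name,city\n1,Alice,Austin\n2,Bob,Boston\n")

  written = csv_file_to_json(csv_path, json_path)
  puts "Converted #{written} rows"
  puts File.read(json_path)
end
```

Note that every value comes out as a string in this sketch. If you need numbers preserved, the csv library’s converters: :numeric option is one place to start.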

Post Conversion Steps

Once conversion is completed it is time to compress the file with ZIP:

Place the zip file(s) into a directory. I used ~/Downloads/json_files/ in my home directory.

Once data has been prepared you are ready to start loading. The following examples will touch on a couple of common ways to get your data into Couchbase in bulk.

The Couchbase install comes with a built-in tool called cbdocloader. It takes individual document files, up to 20MB in size, either zipped up or within a directory, and loads them. At the time of this writing, cbdocloader requires multiple JSON-formatted files contained in a directory. I will cover it first, and then discuss a tool I wrote in Ruby that uses the YAJI Ruby gem by Couchbase’s own Sergey Avseyev. The code referenced is free to use and can be rewritten in any language you are comfortable using.

Doc Loading Method 1: Using cbdocloader

This method loads a set of individual files within a directory. It is a common use case, but it depends on the structure of the files and directories being imported reflecting the desired documents as they will appear once loaded. To help preserve that structure, we recommend packaging the files and directories to be loaded within a .zip file.

The document ID key names will be based off the document files provided.
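If your converted data is a single JSON array, one way to produce the individual files this method expects is to write each array element to its own file. The sketch below is only an illustration under stated assumptions: split_into_files and the sample ‘id’ field are hypothetical names, and using the ID as the file name with a .json extension is an assumption you should check against your version of the loader.

```ruby
require 'json'
require 'fileutils'
require 'tmpdir'

# Write each document from a JSON array to its own file in out_dir.
# The file name is taken from id_field, on the assumption that the
# loader derives each document key from its file name. IDs are assumed
# to be safe to use as file names.
def split_into_files(json_text, out_dir, id_field)
  docs = JSON.parse(json_text)
  raise ArgumentError, 'expected a JSON array of objects' unless docs.is_a?(Array)

  FileUtils.mkdir_p(out_dir)
  docs.map do |doc|
    # fetch raises KeyError if a document has no ID field
    path = File.join(out_dir, "#{doc.fetch(id_field)}.json")
    File.write(path, JSON.generate(doc))
    path
  end
end

# Usage with hypothetical sample documents.
Dir.mktmpdir do |dir|
  sample = '[{"id":"father_1","name":"Al"},{"id":"father_2","name":"Bo"}]'
  paths = split_into_files(sample, File.join(dir, 'docs'), 'id')
  puts paths.map { |p| File.basename(p) }
end
```

Because this sketch parses the whole array in memory, it only suits files of moderate size.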

Note: this method is not ideal for large consolidated document files. I will show how large monolithic files are loaded in Method 2, below.

Then load the file or files by the following command:

Note:
The ‘-s 1000’ flag sets the memory size for the bucket (in MB). You’ll need to adjust this value for your bucket.
Also, the bucket does not need to exist, as cbdocloader will create it. Be aware of your resource utilization before setting the ‘-s’ flag, to make sure you have enough RAM available.

If everything was successful, you’ll see output stating whether documents were loaded, how many bytes, etc.

Here is a brief script to load up a lot of .zip files in a given directory:
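As a rough sketch of that idea in Ruby, the code below builds one cbdocloader command per .zip file and runs them in turn. It assumes cbdocloader is on your PATH and takes the -u, -p, -b, -n and -s flags in the form used elsewhere on this page. The helper names (docloader_commands, load_all) and every credential, host, and bucket value are placeholders, not part of any Couchbase tooling.

```ruby
require 'tmpdir'
require 'fileutils'

# Build one cbdocloader argument list per .zip file found in dir.
# All values passed in are placeholders to replace with your own.
def docloader_commands(dir, user:, password:, bucket:, node:, ram_mb:)
  Dir.glob(File.join(dir, '*.zip')).sort.map do |zip|
    ['cbdocloader', '-u', user, '-p', password, '-b', bucket,
     '-n', node, '-s', ram_mb.to_s, zip]
  end
end

# Print each command and, unless dry_run is set, execute it.
# Returns the list of zip files whose load command failed.
def load_all(commands, dry_run: true)
  failed = commands.reject do |cmd|
    puts cmd.join(' ')
    dry_run || system(*cmd)   # system is falsy when the command fails
  end
  failed.map(&:last)
end

# Usage: a dry run against a temporary directory of empty stand-in files.
Dir.mktmpdir do |dir|
  FileUtils.touch(File.join(dir, 'batch1.zip'))
  FileUtils.touch(File.join(dir, 'batch2.zip'))
  commands = docloader_commands(dir, user: 'Administrator', password: 'password',
                                     bucket: 'test', node: '127.0.0.1:8091',
                                     ram_mb: 1000)
  load_all(commands) # prints the commands without running them
end
```

To load for real, point docloader_commands at your own directory (for example ~/Downloads/json_files/) and call load_all with dry_run: false.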

Doc Loading Method 2: streamloadjson

The other method is to load all documents, comma separated, from a single monolithic file.

To accomplish this, I prepared a small but effective script, which I called streamjsonload, that uses the YAJI JSON stream parser.

The options for this program are:

To load documents with a test JSON file such as fathers.json.txt from below, it can be called like so:

The script should provide output like below:

One major advantage of the YAJI parser is that it consumes very little memory. This means you could potentially paginate the input data and break it up into multiple streams to load into Couchbase. Doing so would mean spawning discrete processes, since Ruby is single threaded, but another language could also be used for multi-threading. Examples of these are in the Couchbase Labs GitHub repository.

A few things to note: this tool only loads monolithic document files, will try to create an ID automatically if one isn’t provided with ‘-d’, and will require some fine-tuning of the “root” with ‘-r’ if no documents load.
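To make the ID behaviour more concrete, here is one way such a fallback could work. This is a guess at the general idea, not the logic streamjsonload actually uses: the document_id helper, the field lookup, and the UUID fallback are all assumptions made for illustration.

```ruby
require 'securerandom'

# Pick a document ID: use the value of id_field when the document has a
# non-empty one, otherwise fall back to a generated UUID.
# Illustrative only; the real loader may build IDs differently.
def document_id(doc, id_field = nil)
  value = id_field && doc[id_field]
  (value.nil? || value.to_s.empty?) ? SecureRandom.uuid : value.to_s
end

# Usage with hypothetical documents: one has an ID field, one does not.
docs = [{ 'id' => 42, 'name' => 'Al' }, { 'name' => 'Bo' }]
docs.each { |doc| puts "#{document_id(doc, 'id')} => #{doc['name']}" }
```

Generated IDs differ on every run, so reloading the same file with this approach would create new documents rather than overwrite existing ones.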

Newer code for the loader is available on my GitHub repository, but I have also provided it inline below:

Finishing the Job

Once the data has been loaded, log in to the Couchbase console and begin working with development views for queries and indexing.

If you are using Couchbase Server 4.0 with N1QL you will want to create a primary index so you can explore the Couchbase SQL-like interface immediately and start taking advantage of the power of N1QL query through our SDKs!

Many thanks to the great folks in the Open Source community for providing the YAJL gem and to Sergey Avseyev for the YAJI parser. Sergey is a very knowledgeable Couchbase resource responsible for Ruby SDK work and I would also like to encourage any of you to try our JRuby SDK and provide feedback.

Links:

CB Examples Github – https://github.com/agonyou/cb-examples/
YAJI Stream Parser – https://github.com/avsej/yaji
YAJL JSON Gem – https://github.com/brianmario/yajl-ruby
csv2json Gem – https://rubygems.org/gems/csv2json/
Couchbase Server 4 with N1QL – http://www.couchbase.com/nosql-databases/downloads

Author

Posted by Austin Gonyou, Solutions Engineer, Couchbase

Austin Gonyou has been a Solutions Engineer at Couchbase for the past 4 years. Austin brings technical solutions for the Couchbase NoSQL document database server and mobile products to conversations facilitated by inside, mid-level, and enterprise sales staff for our prospects and customers.

3 Comments

  1. If I execute ‘./cbdocloader -u tito -p foobar -b test -n 192.168.1.4:8091 -s 1000 /Users/tito/Desktop/sample.zip’ and the sample file contains a JSON array with just two documents, the entire file content is imported as one document. In other words, cbdocloader does not seem to realize that the document is an array of JSON objects. Also, the editor shows ‘Warning: JSON should not be an array’. How is cbdocloader supposed to work? Thanks.

    1. By the way, if I use the sample JSON found at http://www.rubydoc.info/gems/c…, I get the very same issue. In both cases, the file contents are valid JSON. It just seems that cbdocloader is not cooperating. :-/

  2. Well put together, you could also use https://sqlify.io/convert/csv/to/json to convert to JSON and then load the documents as normal instead of rolling your own script.
