Couchbase Hadoop Connector 1.2

Introduction

The Couchbase Hadoop Connector allows you to connect to Couchbase Server 2.5 or 3.0 to stream keys into HDFS or Hive for processing with Hadoop. If you have used Sqoop with other databases before, this connector should feel straightforward, since it uses a similar command-line argument structure. Some arguments will look slightly different, however, because Couchbase has a very different structure from a typical RDBMS.

Installation

Installing the Couchbase Hadoop Connector is simple. The download contains a set of files that need to be moved into your Sqoop installation, along with a script that will move them for you if you provide the path to your Sqoop installation. These files, along with a short description of why each is needed, are listed below.

  • couchbase-hadoop-plugin-1.2.0-beta.jar — This jar file contains the Couchbase Hadoop Connector for Sqoop itself.

  • couchbase-config.xml — A property file used to register a ManagerFactory for the Couchbase Hadoop Connector with Sqoop.

  • couchbase-manager.xml — A property file that tells Sqoop which jar contains the ManagerFactory defined in couchbase-config.xml.

  • couchbase-client-1.4.4.bundled.jar — A library dependency of the Sqoop connector. It handles the basic communication with the Couchbase cluster.

  • spymemcached-2.11.4.jar — A library dependency of the Couchbase Client. It provides networking and core protocol handling for transferring data.

  • jettison-1.1.jar — A dependency of the Couchbase Client.

  • netty-3.5.5.Final.jar — A dependency of the Couchbase Client.

  • install.sh — A script to assist with the installation of the Couchbase Hadoop Connector.

Script Based Installation

Script-based installation uses the install.sh script included in the connector download. The script takes one argument: the path to your Sqoop installation. For example:

shell> chmod 755 install.sh
shell> ./install.sh path_to_sqoop_home

Manual Installation

Manual installation of the Couchbase Hadoop Connector requires copying the files in the zip distribution into your Sqoop installation. Below is a list of the files contained in the connector and the directory in your Sqoop installation to copy each file to.

  • couchbase-hadoop-plugin-1.2.0-beta.jar — lib

  • couchbase-client-1.4.4.bundled.jar — lib

  • spymemcached-2.11.4.jar — lib

  • jettison-1.1.jar — lib

  • netty-3.5.5.Final.jar — lib

  • couchbase-config.xml — conf

  • couchbase-manager.xml — conf/managers.d

Uninstall

Uninstalling the connector requires removing all of the files that were added to Sqoop during installation. To do this, cd into your Sqoop home directory and execute the following command:

shell> rm lib/couchbase-hadoop-plugin-1.2.0-beta.jar lib/couchbase-client-1.4.4.bundled.jar \
    lib/spymemcached-2.11.4.jar lib/jettison-1.1.jar lib/netty-3.5.5.Final.jar \
    conf/couchbase-config.xml conf/managers.d/couchbase-manager.xml

Using Sqoop

The Couchbase Hadoop Connector can be used with a variety of command line tools provided by Sqoop. In this section we discuss the usage of each tool.

Tables

Since Sqoop is built around a relational model, it requires the user to specify a table to import from or export to. The Couchbase Hadoop Connector repurposes the --table option to specify the type of data stream used when importing from or exporting to Couchbase.

For exports, the user must supply a value for the --table option, though the connector does not actually use what is entered.

For imports, the --table option accepts two values; with any other input the connector exits and reports an error.

  • DUMP — Causes all keys currently in Couchbase to be read into HDFS. Any data items received by the Couchbase cluster while this command is running are also passed along by the connector, so new or changed items become part of the dump. However, items removed while the dump is running are not removed from the output.

  • BACKFILL_## — Streams all key mutations for a given amount of time (in minutes). This is best used to sample a bucket in a cluster for a period of time.

For BACKFILL, replace the ## in the --table value with a time in minutes. For example, BACKFILL_5 streams key mutations from the Couchbase server for 5 minutes and then stops the stream.
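As a concrete sketch, the snippet below assembles a five-minute backfill command line. It only echoes the command rather than running it, so no live cluster is needed; the host address simply reuses the one from the examples later in this document.

```shell
# Build the BACKFILL table name from a sampling window in minutes, then
# print the resulting import command (echoed, not executed).
MINUTES=5
TABLE="BACKFILL_${MINUTES}"
CMD="sqoop import --connect http://10.2.1.55:8091/pools --table ${TABLE}"
echo "$CMD"
```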

Connect String

A connect string is required in order to connect to Couchbase. Specify it with the --connect argument to the sqoop command. Below are two examples of connect strings.

http://10.2.1.55:8091/pools
http://10.2.1.55:8091/pools,http://10.2.1.56:8091/pools

When creating your connect strings, simply replace the IP address above with the hostname or IP address of one or more nodes of your Couchbase cluster. If you have multiple servers, list them comma-separated.

Connecting to Different Buckets

By default, the Couchbase Hadoop Connector connects to the default bucket. To connect to a different bucket, specify the bucket name with the --username option. If the bucket has a password, pass it with the --password option.
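As a sketch, an import from a password-protected bucket might be assembled as below. The bucket name "recipes" and password "s3cret" are made-up values for illustration, and the command is echoed rather than executed.

```shell
# The bucket name goes in --username and the bucket password in --password.
# "recipes" and "s3cret" are hypothetical values; echoed, not executed.
BUCKET="recipes"
BUCKET_PW="s3cret"
CMD="sqoop import --connect http://10.2.1.55:8091/pools --table DUMP --username ${BUCKET} --password ${BUCKET_PW}"
echo "$CMD"
```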

Importing

Importing data from your cluster requires the Sqoop import command together with the --connect and --table parameters. Below are some example imports.

shell> sqoop import --connect http://10.2.1.55:8091/pools --table DUMP

This will dump all items from Couchbase into HDFS. Since the Couchbase Java Client has support for a number of different data types, all values are normalized to strings when being written to a Hadoop text file.

shell> sqoop import --connect http://10.2.1.55:8091/pools --table BACKFILL_10

This will stream all item mutations from Couchbase into HDFS for a period of 10 minutes.

Sqoop provides many more options to the import command than we cover in this document. Run sqoop help import for a list of all options, and see the Sqoop documentation for more details about these options.

Some options that may be important in your import are those that define the delimiters Sqoop uses when writing records. The default is the comma (,) character. Through the sqoop command you can specify a different delimiter if, for instance, an item's key or value is likely to contain a comma.
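For example, Sqoop's standard --fields-terminated-by option switches the delimiter. The sketch below picks a tab and only echoes the assembled command, so no cluster is needed.

```shell
# Write tab-delimited records instead of comma-delimited ones, useful when
# keys or values may themselves contain commas. Echoed, not executed.
CMD="sqoop import --connect http://10.2.1.55:8091/pools --table DUMP --fields-terminated-by '\t'"
echo "$CMD"
```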

When the import job executes, it also generates a .java source file that can facilitate reading and writing the imported records from other Hadoop MapReduce jobs. If, for instance, the job was a DUMP, Sqoop generates a DUMP.java source file.

Exporting

Exporting data to your cluster requires the sqoop export command together with the --connect, --export-dir, and --table parameters. Below is an example export.

shell> sqoop export --connect http://10.2.1.55:8091/pools --table couchbaseExportJob --export-dir data_for_export

This will export all records from the files in the HDFS directory specified by --export-dir into Couchbase.

Sqoop provides many more options to the export command than we cover in this document. Run sqoop help export for a list of all options, and see the Sqoop documentation for more details about these options.

Some options that may be important in your export are those that define the delimiters Sqoop uses when reading the records from the Hadoop text file to export to Couchbase. The default is the comma (,) character. Through the sqoop command you can specify a different delimiter.
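For example, Sqoop's standard --input-fields-terminated-by option tells an export how the input records are delimited. The sketch below assumes tab-delimited records in HDFS and only echoes the assembled command.

```shell
# Read tab-delimited records from HDFS during the export. Echoed, not
# executed, so no cluster or HDFS directory is required.
CMD="sqoop export --connect http://10.2.1.55:8091/pools --table couchbaseExportJob --export-dir data_for_export --input-fields-terminated-by '\t'"
echo "$CMD"
```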

When the export job executes, it also generates a .java source file showing how the data was read. If, for instance, the job was run with --table couchbaseExportJob, Sqoop generates a couchbaseExportJob.java source file.

List Tables

Sqoop has a tool called list-tables. As noted in previous sections, Couchbase does not have a notion of tables; DUMP and BACKFILL_## are only used as values to the --table option.

Since the list-tables command serves no real purpose with the Couchbase Hadoop Connector, using this argument to sqoop is not recommended.

Import All Tables

Sqoop has a tool called import-all-tables. As noted in previous sections, Couchbase does not have a notion of tables.

Since the import-all-tables command serves no real purpose with the Couchbase Hadoop Connector, using this argument to sqoop is not recommended.

Limitations

While the connector supports importing and exporting data between Couchbase and Hadoop, there is some Sqoop functionality it does not implement. These are the known limitations:

  • Querying: You cannot run queries on Couchbase. All tools that attempt to do this will fail with a NotSupportedException. Querying will be added to future Couchbase products designed to integrate with Hadoop.

  • list-databases tool: Even though Couchbase is a multi-tenant system that allows for multiple buckets (which are analogous to databases), there is no way of listing these buckets from sqoop. The list of buckets is available through the Couchbase Cluster web console.

  • eval-sql tool: Couchbase does not use SQL, so this tool is not appropriate.

One other known limitation at this time is that the Couchbase Hadoop Connector does not automatically handle certain classes of failures in a Couchbase cluster, or changes to cluster topology, while the sqoop task is running.

Licenses

This documentation and associated software is subject to the following licenses.

Documentation License

This documentation in any form, software or printed matter, contains proprietary information that is the exclusive property of Couchbase. Your access to and use of this material is subject to the terms and conditions of your Couchbase Software License and Service Agreement, which has been executed and with which you agree to comply. This document and information contained herein may not be disclosed, copied, reproduced, or distributed to anyone outside Couchbase without prior written consent of Couchbase or as specifically provided below. This document is not part of your license agreement nor can it be incorporated into any contractual agreement with Couchbase or its subsidiaries or affiliates.

Use of this documentation is subject to the following terms:

You may create a printed copy of this documentation solely for your own personal use. Conversion to other formats is allowed as long as the actual content is not altered or edited in any way. You shall not publish or distribute this documentation in any form or on any media, except if you distribute the documentation in a manner similar to how Couchbase disseminates it (that is, electronically for download on a Web site with the software) or on a CD-ROM or similar medium, provided however that the documentation is disseminated together with the software on the same medium. Any other use, such as any dissemination of printed copies or use of this documentation, in whole or in part, in another publication, requires the prior written consent from an authorized representative of Couchbase. Couchbase and/or its affiliates reserve any and all rights to this documentation not expressly granted above.

This documentation may provide access to or information on content, products, and services from third parties. Couchbase Inc. and its affiliates are not responsible for and expressly disclaim all warranties of any kind with respect to third-party content, products, and services. Couchbase Inc. and its affiliates will not be responsible for any loss, costs, or damages incurred due to your access to or use of third-party content, products, or services.

The information contained herein is subject to change without notice and is not warranted to be error free. If you find any errors, please report them to us in writing.