I would like to know if there is any “pipelined” system or product that makes it easy to bridge data between Couchbase (buckets) and Hadoop HDFS. The goal is the best possible performance and an easy connection between the two environments.
Please share your opinion.
@couchbwiss there are some ways to do it. You can use our Hadoop connector, go through Kafka or even Spark. See our connectors for more information: http://docs.couchbase.com/connectors/intro/index.html
If you can give us more context on your setup, that would help too!
Thanks @daschl .
I have millions of documents in a Couchbase bucket and I want to process the JSON in a Spark cluster (on top of an HDFS cluster). Large data set analysis will be on the Spark side.
The Couchbase cluster and the Spark cluster run on separate servers in the same DC.
My goal is to achieve the best performance when loading data from Couchbase to HDFS and vice versa, ideally a pipelined solution that gets close to real time.
1- Does the Spark connector provide such a pipelined solution?
2- The examples in the documentation are given only in Scala; are there any Java examples I can follow?
@couchbwiss you are in luck! Two days ago I released the beta version, which includes a Java API too… Please give us feedback and feel free to ask questions as you move along: http://blog.couchbase.com/couchbase-spark-connector-1.0-beta-release http://docs.couchbase.com/connectors/spark-1.0/spark-intro.html
Btw, yes the connector pushes and loads RDDs on demand from the workers, so you should get great performance out of it.
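To make the on-demand loading concrete, here is a minimal sketch of what a Couchbase-to-HDFS job could look like with the Java API of the 1.0 beta connector. Treat the specifics as assumptions: the config key names (`com.couchbase.nodes`, `com.couchbase.bucket.<name>`), the `CouchbaseSparkContext.couchbaseContext(...)` wrapper, the `couchbaseGet(...)` signature, and all host/bucket/path names are taken from the linked beta docs and may differ in your connector version, so verify them against the documentation above. It also needs a running Spark cluster, the connector jar, and reachable Couchbase and HDFS clusters, so it is not runnable standalone.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import com.couchbase.client.java.document.JsonDocument;
import com.couchbase.spark.japi.CouchbaseSparkContext;

public class CouchbaseToHdfs {
    public static void main(String[] args) {
        // Point the connector at the Couchbase cluster and bucket.
        // Key names follow the 1.0 beta docs; host and bucket are placeholders.
        SparkConf conf = new SparkConf()
                .setAppName("couchbaseToHdfs")
                .set("com.couchbase.nodes", "couchbase-host")   // assumed Couchbase node
                .set("com.couchbase.bucket.default", "");       // bucket name -> password

        JavaSparkContext sc = new JavaSparkContext(conf);

        // Wrap the plain Java context to get the Couchbase operations.
        CouchbaseSparkContext csc = CouchbaseSparkContext.couchbaseContext(sc);

        // Fetch documents by ID. The workers talk to Couchbase directly,
        // so the data is partitioned across the Spark cluster as it loads
        // instead of funneling through the driver.
        JavaRDD<JsonDocument> docs = csc.couchbaseGet(
                Arrays.asList("doc-id-1", "doc-id-2"));

        // Persist the raw JSON content to HDFS using core Spark.
        docs.map(doc -> doc.content().toString())
            .saveAsTextFile("hdfs://namenode:8020/data/couchbase-export");

        sc.stop();
    }
}
```

The HDFS-to-Couchbase direction works the same way in reverse: build an RDD of documents on the workers and save it back through the connector, so neither direction serializes the whole data set through a single node.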
Thanks man @daschl
I will try it and give you my feedback.
Can you send me your email address in private so I can follow up with you on the results of my tests?
@couchbwiss sent you a private message here in the forums.
Received, with big thanks @daschl