Integration with a search engine (solved)
Hi there!
First of all, I'm making my first steps with couchbase/membase/couchdb, so probably I will be asking some trivial questions...
With the new features provided by couchbase 2.0, a couple of questions arose:
PROBLEM:
I will need to perform search operations over several fields of the JSON documents stored in the database.
For example: String matching over all fields of a JSON document:
{
"type" : "CarType01",
"engine" : "SomeEngine",
"description" : "something",
"owner" : "someone"
}
--> Query: give me all documents that contain "some" in any of their fields.
QUESTIONS:
1) Is it advisable to integrate a search engine like Solr? Is there any documentation about it? Maybe using the "TAP stream" protocol?
2) Should I use advanced views instead?
3) Are those views stored in memory as well as on disk?
4) Can I designate a particular server to handle the search operations (handle the views), so I can give it more hardware capacity? (Maybe I'm misunderstanding the way views work.)
5) Is there a way to split reading operations, writing operations and search operations among different servers?
What I mean:
- use several servers for reading with high memory capacity.
- use fewer servers for writing with better disk I/O.
- use some servers to perform search operations with high memory and processor capabilities.
6) any suggestions about a better solution?
Thanks a lot in advance.
Hi, thanks for the answers.
1) Understood :), I am downloading Solr.
2) I did not know how advanced the scripts for the views could be, but with your explanations I can now see they are limited to basic operations.
3) OK.
4/5/6) When you say "Picking which nodes you use for View Merging and UI/Stats would likely help the health/strength of the cluster overall."... do you mean splitting data into different buckets, putting searchable data in one bucket and non-searchable data in another, so views will be created for the first bucket only?
regards.
You can use Views for complex or advanced work, but it really depends on what sort of results you need and what keys you have to use. Fulltext searching is really a different animal from key-based index result retrieval. This feedback (like any) is helpful for us to determine what things customers are focusing their time and energy on solving, so thanks again for asking about this here.
For 4/5/6, I wouldn't split any data between buckets, but rather focus your SDK requests via a specific IP address--the node you want to be the aggregation node for the View Merging. You could give this node more power (memory, CPU, etc). The related data, though, can (and should) be stored in a single bucket. Couchbase will take care of splitting things up between vBuckets (the internal "shard" system Couchbase uses).
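As a rough illustration of pointing view requests at one chosen node, here is a minimal Python sketch that only builds the request URL. The node IP, bucket, design document, and view names are hypothetical placeholders, and the port (8092) and path layout are assumed to follow the Couchbase 2.0 view REST convention; check them against your own cluster before relying on this.

```python
import json
from urllib.parse import urlencode

# Hypothetical address of the node chosen to do the view merging.
AGGREGATION_NODE = "10.0.0.5"

def view_url(node_ip, bucket, design_doc, view, startkey, endkey, port=8092):
    """Build a view query URL aimed at one specific node (assumed URL layout)."""
    # View keys are JSON values, so string keys are JSON-encoded (quoted).
    query = urlencode({
        "startkey": json.dumps(startkey),
        "endkey": json.dumps(endkey),
    })
    return f"http://{node_ip}:{port}/{bucket}/_design/{design_doc}/_view/{view}?{query}"

# "cars", "search" and "by_value" are made-up names for this example.
url = view_url(AGGREGATION_NODE, "cars", "search", "by_value", "some", "someZ")
print(url)
```

Every view request built this way goes through the same node, so the merging cost lands on the machine you provisioned for it, while the data itself stays in a single bucket.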
Hope that keeps you moving forward. :)
Hi! You have cleared up all the doubts I had about this topic :)
I'm now analyzing the TAP stream, to use it with Solr.
Thank you!
Glad I could help! See you around the forums. :)
Were you able to get this up and running? If so, do you have any suggestions on where to start? I'm interested in hooking solr up to the TAP stream to create my own indexes of the data.
Thanks!
Hi warptrosse, jpoloney,
TAP can be complicated to work with, so we are looking at building an adapter that works with ElasticSearch, a distributed full-text search engine built on Lucene. I think that's exactly what you need. We will be pushing out an early version of this transport to git soon. In fact, we are talking about the integration at CouchConf SF.
If you are local or want to make a trip to SF, take a look: http://www.couchbase.com/couchconf-san-francisco
We have just published an integration with ElasticSearch. Take a look.
http://blog.couchbase.com/couchbase-and-full-text-search-couchbase-trans...
warptrosse,
For the "problem" you presented, you would need to hook in something like Solr, Lucene, or ElasticSearch to do any search beyond a "begins with" or "ends with" style query. Things similar to the SQL "LIKE %some%" are not possible with Couchbase MapReduce alone, so augmenting it with a fulltext index (like one of the above) would do the trick. Using the TAP stream protocol and doing a custom integration is likely a good place to start at this point.
Answers to your other questions (2-6) are below:
2) Should I use advanced views instead?
I'm not sure what you mean by "advanced views" (there's only one type of view in Couchbase 2.0). However, the closest you can get to your example is to output individual values (or words from the values) as keys in the view index and then use something like:
?startkey="some"&endkey="someZ"
That would return the above document 3 times, one index "row" per matching value, if the keys emitted are the values found in documents similar to the one above.
3) Those views are stored in memory and also in disk?
Views are stored exclusively on disk.
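To make the startkey/endkey range query from answer 2 concrete, here is a minimal Python sketch that simulates it in memory. It is not Couchbase code: a real view would be a JavaScript map function calling emit(), and the index would be ordered by the server's Unicode collation, which is only approximated here with case-insensitive comparison. The function names and the document ID "car1" are made up for the example.

```python
from bisect import bisect_left, bisect_right

def collation_key(text):
    # Rough stand-in for the view's collation: compare case-insensitively.
    return text.casefold()

def build_index(docs):
    """Emit one (key, doc_id) row per string field value, sorted by key."""
    rows = []
    for doc_id, doc in docs.items():
        for value in doc.values():
            if isinstance(value, str):
                rows.append((value, doc_id))
    rows.sort(key=lambda row: collation_key(row[0]))
    return rows

def range_query(rows, startkey, endkey):
    """Return the rows whose key lies between startkey and endkey (inclusive)."""
    keys = [collation_key(key) for key, _ in rows]
    lo = bisect_left(keys, collation_key(startkey))
    hi = bisect_right(keys, collation_key(endkey))
    return rows[lo:hi]

docs = {
    "car1": {
        "type": "CarType01",
        "engine": "SomeEngine",
        "description": "something",
        "owner": "someone",
    }
}

index = build_index(docs)
matches = range_query(index, "some", "someZ")
for key, doc_id in matches:
    print(key, doc_id)
```

The query returns three rows for the same document, one per value that starts with "some". A value such as "awesome" would sort before "some" and fall outside the range, which is why a "contains" search needs a fulltext index rather than a view.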
4) Can I designate a particular server to handle the search operations (handle the views), so I can give it more hardware capacity? (Maybe I'm misunderstanding the way views work.)
vBuckets are stored in individual CouchDB databases on each node. The view indexes for the data stored in a vBucket are stored alongside it. When the View is requested from any node, the results of the indexes (one per node) are aggregated and sent back to you via the node from which you requested them.
Each node handles the index creation for the data stored on it. The aggregation happens at the time of request on the node through which you request the view results.
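The aggregation step can be pictured as a merge of already-sorted result lists. The following Python sketch only illustrates that idea and is not how Couchbase implements it internally; the node contents and the function name are invented for the example.

```python
import heapq

def merge_view_results(per_node_rows):
    """Merge sorted (key, doc_id) row lists, one per node, into one sorted list."""
    return list(heapq.merge(*per_node_rows, key=lambda row: row[0]))

# Each node has indexed only the documents stored in its own vBuckets.
node_a = [("alpha", "doc1"), ("some", "doc4")]
node_b = [("beta", "doc2"), ("something", "doc3")]

# The node that received the request does this merging work.
merged = merge_view_results([node_a, node_b])
print(merged)
```

Index building is spread across all nodes, but the merge happens on whichever node you send the request to, so that is the one place where routing requests to a stronger machine can help.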
5) Is there a way to split reading operations, writing operations and search operations among different servers?
What I mean:
- use several servers for reading with high memory capacity.
- use fewer servers for writing with better disk I/O.
- use some servers to perform search operations with high memory and processor capabilities.
You could consistently use a certain node (or nodes) to request views through (thereby consolidating the view merging "cost") and possibly use a certain node for UI/stats aggregation (which uses a similar pattern to view merging: a single node does the aggregation/presentation work). Beyond that, though, there's no way to designate which servers are primarily memory caches and which serve as your persistence/disk storage. A key benefit of Couchbase is that you can fail over nodes, and if the nodes were too distinct from one another, failing over certain nodes would have a much higher cost than failing over others.
6) any suggestions about a better solution?
Picking which nodes you use for View Merging and UI/Stats would likely help the health/strength of the cluster overall. Fulltext indexing is an open issue, so your input and exploration of it would be helpful.