Evaluating Full Text Search

Hi,

I am evaluating CB FTS and Elasticsearch for my search feature. While doing that, I have a few questions regarding CB FTS, listed below:

CB version is 6.0.5, with a data-index-query node configuration of 5-5-8, and the dataset holds 500 million documents. I already use KV and N1QL queries with GSI indexes.

  1. Should I add search nodes to the same cluster, or should I have a different cluster for search nodes? How will each option impact the existing performance of the system?
  2. An FTS index tends to explode in size. How can I control that?
  3. Will the FTS index be rebalanced automatically once there is a write to the dataset? How much time does that take? In Elasticsearch we need to rebalance explicitly.
  4. If there is a high write load, how will the indexes be rebalanced efficiently?
  5. Can we update the FTS index by adding/removing a field? How much time does it take? In Elasticsearch, we need to re-index the data.

  1. Should I add search nodes to the same cluster, or should I have a different cluster for search nodes? How will each option impact the existing performance of the system?

You will need to add search nodes to the same cluster. Search will use data nodes within the same cluster as the source of truth.

  2. An FTS index tends to explode in size. How can I control that?

This depends on how you’ve configured your FTS index definition. If you configure it to index everything, the index could get much larger than the data footprint, because of the various data structures that FTS sets up for fast querying. For production use cases we advise tight index definitions that cover only what you query, and we’ve documentation on it here …
https://docs.couchbase.com/server/6.6/fts/fts-creating-indexes.html

  3. Will the FTS index be rebalanced automatically once there is a write to the dataset? How much time does that take? In Elasticsearch we need to rebalance explicitly.

If data is written to the Couchbase bucket that the FTS index is pulling data from, these mutations are automatically shipped to FTS and indexed; you don’t need to do anything explicitly.
Here at Couchbase we use the term “rebalance” differently: it is when you add more search nodes to the cluster so that your partitioned index is distributed among them, with users experiencing no downtime.

  4. If there is a high write load, how will the indexes be rebalanced efficiently?

As mentioned earlier, data nodes automatically ship all mutations to FTS. If there is a high workload on the data nodes, you will notice your FTS service consuming more resources to index all the mutations.

  5. Can we update the FTS index by adding/removing a field? How much time does it take? In Elasticsearch, we need to re-index the data.

It is the same with Couchbase FTS. Adding or removing a field triggers an automatic index rebuild. The time this takes depends directly on the amount of data that FTS needs to ingest from the Couchbase bucket.

Thanks @abhinav for the above resolutions.

One thing I don’t understand is the high-workload case. Will search queries be affected because the search nodes use more resources?

Apart from that, I have a few feature-related questions:

My data structure: (3 types of documents)

{
  "_class" : "com.abc.snapshot.OneSnapshot",
  "aggregateId" : "uuid-1",
  "entity" : {
    "_class" : "com.abc.one.One",
    "fieldOne" : "field-name-1",
    "hierarchyOne" : {
      "fieldTwo" : "field-name-2"
    },
    "hierarchyOneOne" : {
      "hierarchyTwo" : {
        "fieldThree" : "field-name-3"
      }
    },
    "dateFieldOne" : 4545454545554, // timestamp
    "dateFieldTwo" : 45454545455554
  }
}

{
  "_class" : "com.abc.snapshot.TwoSnapshot",
  "aggregateId" : "uuid-1",
  "entity" : {
    "_class" : "com.abc.two.Two",
    "fieldOne" : "field-name-1",
    "hierarchyOne" : {
      "fieldTwo" : "field-name-2"
    },
    "hierarchyOneOne" : {
      "hierarchyTwo" : {
        "fieldThree" : "field-name-3"
      }
    },
    "dateFieldOne" : 4545454545554, // timestamp
    "dateFieldTwo" : 45454545455554
  }
}

{
  "_class" : "com.abc.data.Data",
  "aggregateId" : "uuid-1",
  "entity" : {
    "_class" : "com.abc.one.One",
    "fieldOne" : "field-name-1",
    "fieldTwo" : "field-name-2",
    "fieldThree" : "field-name-3",
    "dateFieldOne" : 4545454545554, // timestamp
    "dateFieldTwo" : 45454545455554
  }
}

I want to run the following types of queries, only on OneSnapshot and TwoSnapshot. What should my index look like for them?

  1. find by aggregateId
  2. find by fieldOne and fieldTwo (exact value)
  3. find by fieldThree (can be partial value)
  4. find by fieldOne and date range on dateFieldOne and dateFieldTwo
  5. order by dateFieldTwo Desc/Asc
  6. paginated data (along with the total count) - kind of faceted search

It’s possible. Query time could be affected by a high mutation ingest rate, depending on the resources available, especially for queries involving scatter-gather. With multiple search nodes in your cluster, the node that receives a query takes the role of coordinating node: it scatters the request to the rest of the nodes in the cluster, gathers all the results, merges them, and returns them to the user.
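To make the scatter-gather step concrete, here is a small conceptual sketch in Python. It is not Couchbase's implementation; the node names, the hits, and the `query_node` stand-in are all made up for illustration. It only shows why a coordinating node has to wait for every node before it can merge and return the top hits.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-node results: each node owns some index partitions and
# returns its own hits as (doc_id, score), sorted by descending score.
NODE_HITS = {
    "search-node-1": [("doc-7", 2.4), ("doc-2", 1.1)],
    "search-node-2": [("doc-9", 3.0), ("doc-4", 0.7)],
    "search-node-3": [("doc-1", 1.9)],
}


def query_node(node, size):
    """Stand-in for the network call to one search node (an assumption)."""
    return NODE_HITS[node][:size]


def scatter_gather(nodes, size):
    """Coordinator logic: fan out, wait for all nodes, merge, trim."""
    # Scatter: ask every node for its local top `size` hits in parallel.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = list(pool.map(lambda n: query_node(n, size), nodes))
    # Gather: merge the already-sorted partial lists by descending score.
    merged = heapq.merge(*partials, key=lambda hit: hit[1], reverse=True)
    total = sum(len(NODE_HITS[n]) for n in nodes)
    return list(merged)[:size], total


top_hits, total_hits = scatter_gather(sorted(NODE_HITS), size=3)
print(top_hits, total_hits)
```

Because the merge cannot finish until the slowest node has answered, one node that is busy indexing mutations can hold up the whole query.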

We’ve a configurable timeout setting for queries that defaults to 10s, meaning a query that takes longer than that is canceled by the server. This setting should come in handy when dealing with overloaded servers.
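As a rough sketch of how the timeout can be set per request: the search request body accepts a `ctl` object whose `timeout` is given in milliseconds. The field name and value in the query are placeholders taken from the sample documents, and the `with_timeout` helper is hypothetical.

```python
import json


def with_timeout(query, timeout_ms):
    """Wrap an FTS query in a search request carrying a per-request timeout."""
    return {"query": query, "ctl": {"timeout": timeout_ms}}


# Placeholder query; allow this request 3 seconds instead of the default.
request = with_timeout({"field": "entity.fieldOne", "match": "field-name-1"}, 3000)
print(json.dumps(request, indent=2))
```

The printed body is what would be sent to the index's query endpoint.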

I want to run the following types of queries, only on OneSnapshot and TwoSnapshot. What should my index look like for them?

  1. find by aggregateId
  2. find by fieldOne and fieldTwo (exact value)
  3. find by fieldThree (can be partial value)
  4. find by fieldOne and date range on dateFieldOne and dateFieldTwo
  5. order by dateFieldTwo Desc/Asc
  6. paginated data (along with the total count) - kind of faceted search

I’m not sure I understand what you mean by OneSnapshot and TwoSnapshot here.
To satisfy all these kinds of queries, I’d define an index that would …

  • Index aggregateId, fieldOne, fieldTwo as text fields using the keyword analyzer
  • Depends what you mean by a partial value for fieldThree, but assuming prefix or substrings, you could look into defining a custom analyzer with the edge ngram token filter to help you here. I’ve a blog here on how to set up a custom analyzer that you may find useful → Text Analysis within a Full-Text Search Engine - The Couchbase Blog
  • Index dateFieldOne and dateFieldTwo as datetime fields
  • Sorting (Order-by) and Pagination are things you can do at query time using size/from/sort. Here’s documentation on it → Search Request | Couchbase Docs
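The query-time side of the list above could look roughly like the following sketch, which builds the request body in Python. A few assumptions: the fields are indexed under their nested paths (for example `entity.fieldOne`), and because the sample documents store the dates as numeric epoch timestamps, the sketch uses a numeric range (`min`/`max`) rather than a date range (`start`/`end`), which would be the choice if the values were date strings indexed as datetime. The helper names are hypothetical.

```python
import json


def term(field, value):
    """Exact match on a field indexed with the keyword analyzer."""
    return {"field": field, "term": value}


def numeric_range(field, low, high):
    """Inclusive range over a field indexed as a number (epoch timestamps here)."""
    return {"field": field, "min": low, "max": high,
            "inclusive_min": True, "inclusive_max": True}


def page(query, page_number, page_size, sort):
    """Search request for one page; `from` is the offset of the first hit."""
    return {"query": query, "from": page_number * page_size,
            "size": page_size, "sort": sort}


# Query 4 from the list: fieldOne plus ranges on both date fields (AND-ed).
query = {"conjuncts": [
    term("entity.fieldOne", "field-name-1"),
    numeric_range("entity.dateFieldOne", 4545454545000, 4545454546000),
    numeric_range("entity.dateFieldTwo", 45454545455000, 45454545456000),
]}

# Third page of 20 hits, newest dateFieldTwo first ("-" means descending).
request = page(query, page_number=2, page_size=20, sort=["-entity.dateFieldTwo"])
print(json.dumps(request, indent=2))
```

The search response carries a `total_hits` value next to the page of hits, which covers the "total count" part of the pagination requirement.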

Thanks @abhinav for the detailed information. Basically, I just want to index the documents whose _class value is OneSnapshot or TwoSnapshot, and ignore the third one while indexing (a kind of partial index, as in GSI). Please suggest how to do that.

Regarding indexing documents, Elasticsearch is far easier to understand from its documentation, but I find the CB FTS indexing documentation difficult. I am unable to understand type, mapping, field, child field, etc. Can you please help me understand them using the scenarios mentioned in my previous reply? I am referring to this article: Creating Indexes | Couchbase Docs

Type mappings allow FTS users to configure their index to index content of a specific type. For example, if you want to index documents that contain "_class" = "OneSnapshot" or "_class" = "TwoSnapshot", you would first need to set the “JSON Type Field” within the “Type Identifier” in the index definition to “_class”. (See “Specifying Type Identifiers” within Creating Indexes | Couchbase Docs).

Now create 2 type mappings, one with the name “OneSnapshot” and another with the name “TwoSnapshot”. You can now choose to either index specific fields within documents that match these criteria (I’ve highlighted the fields you’ll need for your queries in my previous comment) or index everything from the documents that satisfy the _class criteria. (See “Specifying Type Mappings” within Creating Indexes | Couchbase Docs)
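Put together, an index definition along these lines might look like the sketch below, built as a Python dict so it can be dumped to JSON. Several things here are assumptions rather than anything stated in this thread: the index and bucket names are placeholders, the type mapping names use the full `_class` values from the sample documents (on the assumption that a mapping name has to equal the type field's value), the date fields are mapped as `number` because the samples hold epoch timestamps, and `my_edge_ngram` is a hypothetical custom analyzer that would still have to be defined in the index's analysis settings.

```python
import json

FIELD_THREE_ANALYZER = "my_edge_ngram"  # hypothetical; must be defined separately


def field(name, field_type, analyzer=None):
    """Leaf mapping that indexes a single field without storing it."""
    spec = {"name": name, "type": field_type, "index": True,
            "store": False, "include_in_all": False}
    if analyzer:
        spec["analyzer"] = analyzer
    return {"enabled": True, "dynamic": False, "fields": [spec]}


def child(properties):
    """Child mapping for a nested JSON object; only listed keys are indexed."""
    return {"enabled": True, "dynamic": False, "properties": properties}


snapshot_mapping = child({
    "aggregateId": field("aggregateId", "text", "keyword"),
    "entity": child({
        "fieldOne": field("fieldOne", "text", "keyword"),
        "hierarchyOne": child({
            "fieldTwo": field("fieldTwo", "text", "keyword"),
        }),
        "hierarchyOneOne": child({
            "hierarchyTwo": child({
                "fieldThree": field("fieldThree", "text", FIELD_THREE_ANALYZER),
            }),
        }),
        "dateFieldOne": field("dateFieldOne", "number"),
        "dateFieldTwo": field("dateFieldTwo", "number"),
    }),
})

index_definition = {
    "name": "snapshot_index",        # placeholder index name
    "type": "fulltext-index",
    "sourceType": "couchbase",
    "sourceName": "my_bucket",       # placeholder bucket name
    "params": {
        # Use the document's _class field to pick the type mapping.
        "doc_config": {"mode": "type_field", "type_field": "_class"},
        "mapping": {
            # Disabling the default mapping skips every other document type.
            "default_mapping": {"enabled": False, "dynamic": True},
            "types": {
                "com.abc.snapshot.OneSnapshot": snapshot_mapping,
                "com.abc.snapshot.TwoSnapshot": snapshot_mapping,
            },
        },
    },
}

print(json.dumps(index_definition, indent=2))
```

With the default mapping disabled, documents whose `_class` is `com.abc.data.Data` match no type mapping and are left out of the index, which gives the partial-index behaviour asked about.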

Thanks @abhinav for guiding me on creating the index, but the following is still unclear to me:

When setting the JSON type identifier, what should the value in its text field be? Should it be “type”? What other values are possible in that text field?

When creating type mappings, what should the value in the type name text field be? What value should be selected from the drop-down?

Can you guide me with some screenshots for the data structure I provided?

Please ignore this question. I am able to solve it using your blog post. https://blog.couchbase.com/full-text-search-indexing-best-practices-by-use-case/

A few more questions:

  1. How do I add child fields for keys inside keys? For example, fieldOne, fieldTwo and fieldThree in my scenarios, which sit at different levels inside the structure.
  2. How do I write queries with AND, OR conditions?
  3. How do I write queries for fields nested inside the structure, as mentioned in point 1?
  4. Where can I try search queries without using curl or HTTP tools? Elasticsearch provides Kibana for this.

hi @Nitesh_Gupta ,

Please try to read a bit on the documentation on related topics as it helps while creating indexes and querying.

  1. Check the “inserting-a-child-mapping” sub title here. Creating Indexes | Couchbase Docs
  2. Check the “compound-queries” part here Query Types | Couchbase Docs
  3. The reviews field example from the travel-sample bucket is similar to your use case. The compound query examples show how to query nested fields, e.g. field: “fieldOne.fieldTwo.FieldThree” etc.
  4. You could try query string queries from the index definition pages. Currently, there is no other easy way to explore queries apart from using curl commands.
    ref - Searching from the UI | Couchbase Docs

(N1QL query workbench is an option for FTS queries too but then it has its own learning curve of two services and their interactions)
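For points 2 and 3, a compound query is built from `conjuncts` (AND) and `disjuncts` (OR), and a nested field is addressed by its dotted path. The sketch below builds such a request body in Python; the field paths come from the sample documents earlier in the thread and assume those fields were indexed through child mappings, and the helper names are hypothetical.

```python
import json


def match(field, text):
    """Match query scoped to one field, addressed by its dotted path."""
    return {"field": field, "match": text}


def all_of(*queries):
    """AND: every sub-query must match."""
    return {"conjuncts": list(queries)}


def any_of(*queries):
    """OR: at least one sub-query must match."""
    return {"disjuncts": list(queries)}


# fieldOne AND (fieldTwo OR fieldThree), all nested under "entity".
query = all_of(
    match("entity.fieldOne", "field-name-1"),
    any_of(
        match("entity.hierarchyOne.fieldTwo", "field-name-2"),
        match("entity.hierarchyOneOne.hierarchyTwo.fieldThree", "field-name-3"),
    ),
)
request = {"query": query}
print(json.dumps(request, indent=2))
```

Compound queries nest freely, so an OR group can sit inside an AND group as it does here.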

Hi @abhinav ,

Thanks for providing the above information. While I work through all of it, can you tell me about the option of using N1QL queries to do FTS? And what is the performance impact of using N1QL queries versus direct FTS queries?

Thanks
Nitesh

Hi @Nitesh_Gupta ,

More about N1QL - FTS integration can be found here - Search & Rescue: 7 Reasons N1QL (SQL) developers use Search
Introducing FTS with N1QL | The Couchbase Blog

https://docs.couchbase.com/server/current/n1ql/n1ql-language-reference/searchfun.html

Searching FTS from N1QL is the less performant option compared with direct FTS access, and it lacks a few features (like facets).
It becomes useful when someone wants to leverage the FTS capabilities (array indexing, fuzzy, geo, language-aware search) with native N1QL as GSI can’t support those features.
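As a minimal sketch of what the N1QL side looks like, the helper below assembles a statement around the `SEARCH()` function described in the linked reference, which takes a keyspace alias, an FTS query, and an optional options object. The bucket name, index name, and the helper itself are placeholders, and the statement only runs on a server version that has the N1QL integration.

```python
import json


def n1ql_search_statement(keyspace, fts_query, index_name=None):
    """Build a N1QL statement that filters documents with SEARCH()."""
    args = ["t", json.dumps(fts_query)]
    if index_name:
        # Optional third argument: pin the query to a specific FTS index.
        args.append(json.dumps({"index": index_name}))
    return (f"SELECT META(t).id FROM `{keyspace}` AS t "
            f"WHERE SEARCH({', '.join(args)})")


statement = n1ql_search_statement(
    "my_bucket",                                            # placeholder bucket
    {"field": "entity.fieldOne", "match": "field-name-1"},
    index_name="snapshot_index",                            # placeholder index
)
print(statement)
```

The FTS query passed as the second argument has the same shape as the `query` object of a direct search request.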

Hi @abhinav ,
So it means we can’t use N1QL+FTS on versions < 6.5? And can you let me know whether there are any major differences between FTS 6.0.2 and FTS 6.6?

Thanks
Nitesh

Hey @Nitesh_Gupta ,

You are right, the N1QL integration of FTS happened in 6.5.
Many things changed between these two releases. If you are interested in a particular area, we may be able to help you there.
(N1QL integration, performance improvements, new polygon queries, rebalance, replica management improvements, score: “none” queries, pagination improvements, and the list goes on.)

Cheers!
Sreekanth

Hi @sreeks ,
Can you please provide a line or two on each (where they relate to FTS), so that we can decide on that basis whether to use FTS on 6.0.2, as we are still on 6.0.2?

Thanks
Nitesh

These are significant updates that will definitely help any active FTS customer, and hence an upgrade is generally recommended.

But in this case, IMHO, since you haven’t stated any field/SLA requirements,
you could potentially wait for the upcoming Couchbase Server 7.0 release and upgrade to that
(it has all the latest and greatest changes, including significant index size reductions for FTS, better N1QL integration, etc.).

https://www.couchbase.com/downloads

Cheers!

Thanks @sreeks

Actually, it is not so easy to upgrade at our end. That’s why I am trying to figure out the differences between 6.0.2 and 6.6. Only if those differences don’t matter for our use will we opt for CB FTS. Right now, the only advantage for us is avoiding any data sync, compared with using another technology.

Can you please elaborate on the major differences between these 2 versions regarding FTS?

Thanks
Nitesh

You should be able to look at the release notes for various versions here …

Thanks @abhinav. I am already reading them. While reading, I have a few questions:

  1. Regarding issues https://issues.couchbase.com/browse/MB-38303, https://issues.couchbase.com/browse/MB-27429 and https://issues.couchbase.com/browse/MB-38142: what is the overall impact of these fixes? The tickets contain a lot of internal technical detail that isn’t meant for me. If I use version 6.0.2, what functional or non-functional issues will I face without these fixes?
  2. These 3 look functional. Can you please explain them? https://issues.couchbase.com/browse/MB-39838, https://issues.couchbase.com/browse/MB-38957, https://issues.couchbase.com/browse/MB-39592
  3. https://issues.couchbase.com/browse/MB-39887, https://issues.couchbase.com/browse/MB-41854 Please explain these as well.
  4. Also, I am now concerned about support for FTS on 6.0.2. Will I get proper support on 6.0.2 if we opt to use it?
  5. If I opt for the latest version, 6.6, I will need another cluster whose data nodes are kept in sync with the older cluster through XDCR.
    a. Will I be able to XDCR from 6.0.2 to 6.6?
    b. Will the data nodes in the 6.6 cluster come at a high cost, since they are extra nodes in my system? If the cost is high, I will be forced to choose Elasticsearch, which is reliable and already a cost-effective solution in the market.

Thanks
Nitesh

Hey @Nitesh_Gupta ,

While you wait for further responses here:

I am not sure about your approach.
Evaluating your upgrade decisions and their implications isn’t something to be solved over a couple of forum ping-pongs. You should talk to the support or solutions engineering team through your organizational-level contacts.
They can help you assess the current bottlenecks/troubles in your system and suggest the best approach to a cluster upgrade roadmap that aligns with your requirements.

Apart from that,
the release notes don’t reflect the magnitude of the fixes and feature changes that went into these releases for the Search service. One thing is certain: upgrading to 6.6.X would be the best option if you need the best CB Server software available at the moment (and not just for FTS).

Hi @abhinav
This is regarding the index and query:

I am able to get the list of document IDs from the query. How do I get the full documents, or specific fields, in the results?

I am using the query below:

{
  "explain": false,
  "highlight": {},
  "query": {
    "conjuncts": [
      {
        "field": "field1",
        "match": "abc"
      }
    ]
  }
}

What extra thing do I need to add to the above?

Thanks
Nitesh
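For reference, one way this is commonly handled (a sketch, not a reply from the thread): the search request accepts a `fields` array naming the fields to return with each hit, and `"*"` asks for every field available. Only fields that were stored in the index (the "store" option in the field mapping) can be returned this way; anything else has to be fetched from the data service by document ID. The `with_fields` helper is hypothetical.

```python
import json


def with_fields(request, fields):
    """Return a copy of a search request that also asks for stored fields."""
    updated = dict(request)
    updated["fields"] = list(fields)
    return updated


# The query from the post above, unchanged.
original = {
    "explain": False,
    "highlight": {},
    "query": {"conjuncts": [{"field": "field1", "match": "abc"}]},
}

# Ask for every stored field; a list of specific names such as ["field1"]
# narrows the result instead.
request = with_fields(original, ["*"])
print(json.dumps(request, indent=2))
```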