FTS: Using Dynamic Type Mapped Index for documents containing dynamic fields

Hi CB Team,

We have a scenario where we want to index (using FTS) documents that contain both static and dynamic fields.
Static fields: Common fields across all the documents. Ex: make, model, country.
Dynamic fields: They are not common across all the documents. Ex: mpg (car), rateOfClimb (aircraft), stroke (ship).

Sample document to illustrate the scenario:

{   
    "id": "vehicle::1000",
    "type": "vehicle",
    "category": "car",
    "make": "Hyundai",
    "model": "Sonata",
    "country": "South Korea",

    "airbags": true,
    "engineType": "gasoline",
    "horsepower": 500,
    "mpg": 30
}

{   
    "id": "vehicle::1001",
    "type": "vehicle",
    "category": "aircraft",
    "make": "Boeing",
    "model": "747",
    "country": "USA",
    
    "engine": "Rolls-Royce",
    "thrust": 59450,
    "range": 4000,
    "rateOfClimb": 6000
}   

{
    "id": "vehicle::1002",
    "type": "vehicle",
    "category": "ship",
    "make": "Marine Shipbuilding Co",
    "model": "Coral Princess",
    "country": "France",

    "stroke": 2500,
    "propeller": "screw",
    "pistonSpeed": 10
}   

Note: We will be using each and every field in the index creation.
Note: We will be running search queries that are text, number or range based on any field.
Note: We would like to keep the JSON document as flat as possible. i.e. no nested fields.

In our research, we found that the “default type mapped” index would be a perfect fit for all of our functional requirements. Our concern is on the non-functional side (performance and scalability). According to Couchbase’s FTS best practices and optimization guide - “The default dynamic mapping produces larger indexes and is potentially unsuitable for production deployments.” (Refer: Full-Text Search Indexing Best Practices & Tips - Part 1).

Questions:

  1. Is the “default type mapped” index a production-worthy solution when you have dynamic fields, have to index every field, and have a few tens of millions of documents in the bucket?

  2. If not, what is the best approach to deal with this situation?

  3. If yes, do you have any performance numbers or metrics that you could share with us?

Thanks,
Vishnu

Before I get around to answering your questions, let me give you some background on the default type mapped dynamic index, which is supported in order to showcase several capabilities of full text search …

  • A default type mapped dynamic index indexes content from all fields.
  • It also saves a copy of all this content into a composite field (that we call _all). Indexing content into the composite field lets users search without specifying a field. For example, for your document examples above - you can simply search for “vehicle” as opposed to “type:vehicle”. This increases the size of the index.
  • The analyzer you set within “default_analyzer” in your index definition is applied to all textual fields. So you will not be able to use a different analyzer for each of your fields.
  • Numeric data is recognized automatically, which I suppose is good for your use case.
  • Term vectors (array positions) will be recorded for all textual fields. If you’re not using highlighting or phrase searching, this is unnecessary. This increases the size of the index.
  • DocValues for all the fields will be stored. This is needed when you want to do sorting or faceting. This increases the size of the index.

A larger index could mean that queries have to read more data, which could affect performance a bit - this depends strictly on the nature of your data.
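To make the points above concrete, here is a rough sketch of the mapping section that a default dynamic index corresponds to, written as a Python dict and printed as JSON. The flag values are assumptions chosen to mirror the behaviour described above (everything indexed, DocValues kept, _all as the default field); defaults differ between server versions, so compare against the definition your own cluster shows.

```python
import json

# Sketch of the "mapping" section of a default dynamic index.
# Values are assumptions mirroring the behaviour described in this thread;
# check the definition generated by your own server version.
default_dynamic_mapping = {
    "default_analyzer": "standard",   # applied to every textual field
    "default_field": "_all",          # composite field searched when no field is given
    "default_mapping": {
        "enabled": True,
        "dynamic": True,              # index every field found in the document
    },
    "index_dynamic": True,            # dynamically discovered fields are indexed
    "store_dynamic": False,           # ...but their original values are not stored
    "docvalues_dynamic": True,        # DocValues kept for sorting and faceting
}

print(json.dumps(default_dynamic_mapping, indent=2))
```

Each of the flags set to true here is a contributor to index size, which is what the explicit approach tries to trim.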

The way I see your use case - what you call dynamic fields - are just fields that appear in a few documents and are missing in a few. If you agree with me on this, I would recommend -

  • Explicitly indexing only fields of interest under a single type mapping - even if you think that some of them do not appear in all the documents.
  • Setting only those options that you deem necessary for each of the fields.
  • If you intend to specify the field while searching for a term, I recommend against checking the “include in _all” option.
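As a minimal sketch of these recommendations, the snippet below builds the params section of an index definition with a single "vehicle" type mapping and a handful of explicitly mapped fields. The helper name child_field, the chosen field list, and the option values are illustrative assumptions, not a definition taken from a real cluster; adjust them to the fields and options you actually need.

```python
import json

def child_field(name, field_type, *, in_all=False, term_vectors=False, docvalues=False):
    """Build one explicitly mapped child field (hypothetical helper)."""
    return {
        "enabled": True,
        "dynamic": False,
        "fields": [{
            "name": name,
            "type": field_type,                    # "text" or "number" here
            "index": True,
            "store": False,
            "include_in_all": in_all,              # off: queries name the field
            "include_term_vectors": term_vectors,  # only for highlighting/phrases
            "docvalues": docvalues,                # only for sorting/faceting
        }],
    }

# Fields of interest, including ones that appear in only some documents.
FIELDS = {
    "make": "text", "model": "text", "country": "text",
    "mpg": "number", "rateOfClimb": "number", "stroke": "number",
}

params = {
    # Pick the type mapping from the document's "type" field.
    "doc_config": {"mode": "type_field", "type_field": "type"},
    "mapping": {
        "default_analyzer": "standard",
        "default_mapping": {"enabled": False, "dynamic": False},
        "types": {
            "vehicle": {
                "enabled": True,
                "dynamic": False,  # index only the fields listed below
                "properties": {n: child_field(n, t) for n, t in FIELDS.items()},
            }
        },
    },
}

print(json.dumps(params, indent=2))
```

Fields such as mpg or stroke can be listed even though most documents lack them, and turning on term vectors or DocValues per field keeps the cost limited to where it is needed.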

Note that if any of the fields you’ve specified in your index definition are missing in some/any of the incoming documents, then there is simply nothing to index for that field in that document.

For example, if I include “mpg” as numeric in the index and search for mpg:>=30, only the first document, which satisfies the condition, will be returned.
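Here is a small sketch of that search, assuming “mpg” is mapped as a number. It builds two request bodies that should be equivalent (a query string and a numeric range query) and then mimics the matching rule locally in plain Python on trimmed copies of the three sample documents. The local filter is only an illustration of the semantics, not how FTS evaluates the query, and the size value is an arbitrary choice.

```python
import json

# Two request bodies for the same search.
query_string_request = {
    "query": {"query": "mpg:>=30"},
    "size": 10,
}
numeric_range_request = {
    "query": {"field": "mpg", "min": 30, "inclusive_min": True},
    "size": 10,
}

# Trimmed copies of the sample documents from this thread.
docs = [
    {"id": "vehicle::1000", "category": "car", "mpg": 30},
    {"id": "vehicle::1001", "category": "aircraft", "rateOfClimb": 6000},
    {"id": "vehicle::1002", "category": "ship", "stroke": 2500},
]

# Local illustration only: a document without "mpg" has nothing indexed
# for that field, so it can never match a condition on it.
matches = [d["id"] for d in docs if "mpg" in d and d["mpg"] >= 30]

print(json.dumps(numeric_range_request))
print(matches)  # ['vehicle::1000']
```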

Here is documentation on each of the field options.

A lean index definition will keep the index’s size reasonable and will help your query performance.

Thanks @abhinav for your elaborate explanation.
Regarding - “The way I see your use case - what you call dynamic fields - are just fields that appear in a few documents and are missing in a few.”
You’re right in a certain way, but please note that these fields are created dynamically by users at a later stage, i.e. they are unknown when the index is created, so we cannot include them in the index definition. We would like these fields to be searchable in an _all style way. Apart from that, we have (static) fields which are known when the index is created and on which we want to run targeted searches (ex: +make:Boeing). So in short, we need an index that satisfies both requirements; that’s the main reason why we would like to use the “default type mapped” index.

Thanks!