FTS is slow when the potential number of results is big, no matter that the sorts are on indexed fields

Hi @abhinav , do you have a hint for the following?

I have a FTS index on 40 million documents

If I search for something where the potential results set is small (100 results from 1000 total potential results) the query time is super fast (less than 5 ms).

But if the number of potential matches is higher let’s say 2 millions, no matter how small is the LIMIT the results are returned in 4-6 seconds.

The query contains some sorts, but all are on indexed fields excluding the _score which is calculated at the query time.

Is there a trick that I could use to get the results faster when my LIMIT is around 10 results from x millions?

@flaviu Believe most of that latency could be coming from sorting, given the large number of hits. FTS automatically sorts on -_score if you don’t specify anything otherwise.

Would you share a sample query to look into before we share any advise/recommendation here?

I tried without sorting and the response is still around 2 seconds.

Below is an example document and how the query is made:

{
       'id' => "123fd-fds45-fdt4-5435",
       'pp' => "word1 word2, word3",
       'np' => "word4 word5",
       'cd' => 167322240,
       'pv' => 7,
       'nv' => 1,
       'dl' => 3,
       'iw' => 512,
       'ih' => 512,
       'mi' => 4,
       'st' =>true,
       'ii' => "asd-asd-fdsf-fgdfg-gfd",
       'm'  => array(
           't'  => 'fts',
           'it' => unixtimestamp
           'v'  => 4 #version
       )
}

the query is a conjunction of

MatchPhraseSearchQuery for the fields pp and np
NumericRangeSearchQuery for the fields mi,nv,iw,ih
MatchSearchQuery for the fields ii and im

the sort is: '-_score', '-pv', '+nv', '-dl', 'cd'

there is a limit of 10 results and on offest which could go up to 300

the index defintion is like this:

{
 "name": "image_prompt_fts-v2",
 "type": "fulltext-index",
 "params": {
  "doc_config": {
   "docid_prefix_delim": "",
   "docid_regexp": "",
   "mode": "type_field",
   "type_field": "m.t"
  },
  "mapping": {
   "default_analyzer": "en",
   "default_datetime_parser": "dateTimeOptional",
   "default_field": "_all",
   "default_mapping": {
    "dynamic": true,
    "enabled": false
   },
   "default_type": "_default",
   "docvalues_dynamic": false,
   "index_dynamic": false,
   "store_dynamic": false,
   "type_field": "_type",
   "types": {
    "fts": {
     "dynamic": false,
     "enabled": true,
     "properties": {
      "cd": {
       "enabled": true,
       "dynamic": false,
       "fields": [
        {
         "index": true,
         "name": "cd",
         "type": "number"
        }
       ]
      },
      "dl": {
       "enabled": true,
       "dynamic": false,
       "fields": [
        {
         "index": true,
         "name": "dl",
         "type": "number"
        }
       ]
      },
      "ih": {
       "enabled": true,
       "dynamic": false,
       "fields": [
        {
         "index": true,
         "name": "ih",
         "type": "number"
        }
       ]
      },
      "ii": {
       "enabled": true,
       "dynamic": false,
       "fields": [
        {
         "analyzer": "keyword",
         "index": true,
         "name": "ii",
         "type": "text"
        }
       ]
      },
      "im": {
       "enabled": true,
       "dynamic": false,
       "fields": [
        {
         "analyzer": "keyword",
         "index": true,
         "name": "im",
         "type": "text"
        }
       ]
      },
      "iw": {
       "enabled": true,
       "dynamic": false,
       "fields": [
        {
         "index": true,
         "name": "iw",
         "type": "number"
        }
       ]
      },
      "mi": {
       "enabled": true,
       "dynamic": false,
       "fields": [
        {
         "index": true,
         "name": "mi",
         "type": "number"
        }
       ]
      },
      "mt": {
       "enabled": true,
       "dynamic": false,
       "fields": [
        {
         "index": true,
         "name": "mt",
         "store": true,
         "type": "number"
        }
       ]
      },
      "np": {
       "enabled": true,
       "dynamic": false,
       "fields": [
        {
         "analyzer": "en",
         "include_term_vectors": true,
         "index": true,
         "name": "np",
         "type": "text"
        }
       ]
      },
      "nv": {
       "enabled": true,
       "dynamic": false,
       "fields": [
        {
         "index": true,
         "name": "nv",
         "type": "number"
        }
       ]
      },
      "pi": {
       "enabled": true,
       "dynamic": false,
       "fields": [
        {
         "index": true,
         "name": "pi",
         "store": true,
         "type": "boolean"
        }
       ]
      },
      "pp": {
       "enabled": true,
       "dynamic": false,
       "fields": [
        {
         "analyzer": "en",
         "include_term_vectors": true,
         "index": true,
         "name": "pp",
         "type": "text"
        }
       ]
      },
      "pv": {
       "enabled": true,
       "dynamic": false,
       "fields": [
        {
         "index": true,
         "name": "pv",
         "type": "number"
        }
       ]
      },
      "s": {
       "enabled": true,
       "dynamic": false,
       "fields": [
        {
         "index": true,
         "name": "s",
         "type": "number"
        }
       ]
      },
      "st": {
       "enabled": true,
       "dynamic": false,
       "fields": [
        {
         "index": true,
         "name": "st",
         "type": "boolean"
        }
       ]
      }
     }
    }
   }
  },
  "store": {
   "indexType": "scorch",
   "segmentVersion": 15
  }
 },
 "sourceType": "gocbcore",
 "sourceName": "images_fts",
 "sourceUUID": "808e0bb49eb0ae702ced856f2824d649",
 "sourceParams": {},
 "planParams": {
  "maxPartitionsPerPIndex": 342,
  "indexPartitions": 3,
  "numReplicas": 0
 },
 "uuid": "163a5123f8ba96bf"
}

search index stats looks like this:


as an example without any sort:

and the servers load is around 2%

FTS ram used by the index is around 30GB per server and allocated is around 72GB per server

Would you try disabling score calculation as well - which is done at query time.
So your query would look like this -

{"query": {"match_phrase": "girl", "field":"pp"}, "score": "none"}

it is a little better - 500ms but still, I need to figure out how to make it work with the sorting

I need to figure out how to make it work with the sorting

Right. Looking at your query you’re sorting on 4 indexed fields after -_score. Multiple sort keys come in handy when the previous sort values are missing or equal with a few others. You must include doc values for those values you want to be able to sort on for efficient sorting - this should apply to pv, nv, dl and cd.

Also, a different more efficient pagination scheme worth looking into with search_after (Pagination | Couchbase Docs) you should be able to achieve faster responses especially where deep pagination is involved.

tried your recommendation and the results are similar. It takes 6.488936901092 for a query. This is getting the results without any offset.

This is how the index definition looks like:

{
  "type": "fulltext-index",
  "name": "image_fts-v2-test",
  "uuid": "6f906c8966cb3132",
  "sourceType": "gocbcore",
  "sourceName": "images_fts",
  "sourceUUID": "808e0bb49eb0ae702ced856f2824d649",
  "planParams": {
    "maxPartitionsPerPIndex": 342,
    "indexPartitions": 3
  },
  "params": {
    "doc_config": {
      "docid_prefix_delim": "",
      "docid_regexp": "",
      "mode": "type_field",
      "type_field": "m.t"
    },
    "mapping": {
      "analysis": {},
      "default_analyzer": "en",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "default_mapping": {
        "dynamic": true,
        "enabled": false
      },
      "default_type": "_default",
      "docvalues_dynamic": false,
      "index_dynamic": false,
      "store_dynamic": false,
      "type_field": "_type",
      "types": {
        "fts": {
          "dynamic": false,
          "enabled": true,
          "properties": {
            "cd": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "docvalues": true,
                  "index": true,
                  "name": "cd",
                  "type": "number"
                }
              ]
            },
            "dl": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "docvalues": true,
                  "index": true,
                  "name": "dl",
                  "type": "number"
                }
              ]
            },
            "ih": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "index": true,
                  "name": "ih",
                  "type": "number"
                }
              ]
            },
            "ii": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "keyword",
                  "index": true,
                  "name": "ii",
                  "type": "text"
                }
              ]
            },
            "im": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "keyword",
                  "index": true,
                  "name": "im",
                  "type": "text"
                }
              ]
            },
            "iw": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "index": true,
                  "name": "iw",
                  "type": "number"
                }
              ]
            },
            "mi": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "index": true,
                  "name": "mi",
                  "type": "number"
                }
              ]
            },
            "mt": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "index": true,
                  "name": "mt",
                  "store": true,
                  "type": "number"
                }
              ]
            },
            "np": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "en",
                  "include_term_vectors": true,
                  "index": true,
                  "name": "np",
                  "type": "text"
                }
              ]
            },
            "nv": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "docvalues": true,
                  "index": true,
                  "name": "nv",
                  "type": "number"
                }
              ]
            },
            "pi": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "index": true,
                  "name": "pi",
                  "store": true,
                  "type": "boolean"
                }
              ]
            },
            "pp": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "en",
                  "include_term_vectors": true,
                  "index": true,
                  "name": "pp",
                  "type": "text"
                }
              ]
            },
            "pv": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "docvalues": true,
                  "index": true,
                  "name": "pv",
                  "type": "number"
                }
              ]
            },
            "s": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "index": true,
                  "name": "s",
                  "type": "number"
                }
              ]
            },
            "st": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "index": true,
                  "name": "st",
                  "type": "boolean"
                }
              ]
            }
          }
        }
      }
    },
    "store": {
      "indexType": "scorch",
      "segmentVersion": 15
    }
  },
  "sourceParams": {}
}

This is how the index stats looks like:

As the servers load (3 servers) is very low (2%) would it make sense to make more index partitions? would this somehow make it work in parallel and faster? Now I have 3 partitions, would it work faster if I make 30 partitions?

I am doing something wrong?

Your index’s definition looks good to me.

As the servers load (3 servers) is very low (2%) would it make sense to make more index partitions? would this somehow make it work in parallel and faster? Now I have 3 partitions, would it work faster if I make 30 partitions?

Raising the number of partitions would increase the rate of data ingest for the indexes. Now, your queries may show some improvement in latency as well - but I don’t except it to be of the same magnitude.

Here’s how a scatter gather request works - the search request received at any node (coordinator) is forwarded to every other node - so your pagination, sort and other filters are applied at each node and these results are sent back to the coordinator where they’re merged and the filters are re-applied over the merged result set before responding back to the user.

So let me ask you a question here - what’s your rationale behind the multiple sort keys?
I see _score is your first sort key which is generated for every document. Multiple sort keys come in handy when either sort values or missing or are identical for certain hits.

let me explain the reason behind multiple sorting

I have a list of texts, many of these text are exactly repeating across multiple documents (in groups of 10 to 1000 identical texts per group of documents).

these texts are a description of what is inside of different pictures, for example multiple images with the text description being “cat” or “girl” or “dog”.

In the same time, in each document I have some additional information like for example positive votes (pv) or negative votes (nv) (number) or date of document creation (number). Some images are more beautiful than others.

I need to be able to search for “girl” or “cat” or “dog”, etc and order the first documents being the ones with the most positive votes and then the ones with less negative votes, and then by the date of creation, etc.

In the end I will end up with a list of documents sorted by the score, positive votes, negative votes, downloads, creation date, etc.

Does it make sense?

Would you recommend a combination of N1QL + FTS or something else?

Yes that makes sense, thanks!

I don’t expect you’d get any much higher performance with N1QL+FTS in this situation. If you do end up trying this - you’ll want to defer sorting to N1QL in which case FTS would stream the results to the query engine which would collect and sort the streamed results. You’d have to make sure that you design your search index in such a way so it would cover the query’s requirement and avoid any bucket fetches - for example you’ll need to store all the fields you want to sort over and fetch them along side the hit IDs. Your query would end up looking something like this:

SELECT meta().id from `keyspace` as ks
LET smeta=SEARCH_META()
WHERE (ks, {"query":{"match_phrase": "girl", "field":"pp"},"fields":["*"]})
ORDER BY smeta.score DESC, smeta.fields.pv DESC, smeta.fields.nv ASC, smeta.fields.dl DESC, smeta.fields.cd ASC;

You mention you have 40mil documents in this index, what’s the memory quota you’ve assigned to the search service in your cluster?

The available memory and CPU are significant factors during query execution, especially while dealing with large results sets that involve sorting.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.