FTS: slow sort if sorted by indexed field

gizmo74 · September 15, 2017, 12:04pm

Hi,

I have a fulltext index on a bucket with about 250k documents in the following JSON format:

{
  "title": "foo bar",
  "timestamp": 1234,
  "lead": "foo bar",
  "type": "story",
  "text": "foo and bar"
}

A query like

{"size": 100, "sort": ["_score"], "query": {"query": "and"}}

returns in < 20 ms with about 75k results (yes, I know “and” is a bad example, but it is a good test to see what happens with a large result set).

If I change to a custom sort field like

{"size": 100, "sort": ["timestamp"], "query": {"query": "and"}}

the query needs > 1 s, so it is at least 50 times slower.

Is there a way to make this faster? Or at least a reason why a custom sort is so much slower than a sort with _score or _id?

mschoch · September 15, 2017, 12:44pm

Hi,

The first thing to note is that “and” is a stop word, removed by both the “en” and “standard” analyzers. So I’m assuming you’ve already adjusted the mapping to use a different analyzer.

Why is custom sort so much slower than _score or _id? The reason is that for every hit matching the search (i.e., all 75k) we have to load (often from disk) additional information we don’t have (in this case the timestamp). Even though you only want the top 100, we have to do these loads on all 75k matches to find the top 100 timestamp values. In contrast, for both _score and _id, we already have this information readily available, and no additional loads are required.
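
To make the cost difference concrete, here is a minimal Python sketch. It is a toy model and not bleve’s actual code; the function names, the `load_field` callback and the sample data are all made up for illustration. It only shows that sorting by a stored field needs one lookup per matching hit, even when just the top 100 are returned.

```python
import heapq

def top_by_score(hits, size):
    """hits: (doc_id, score) pairs as produced by the matcher."""
    # The score is already in hand for each hit, so no extra lookups are needed.
    best = heapq.nlargest(size, hits, key=lambda h: h[1])
    return [doc_id for doc_id, _ in best]

def top_by_field(hits, size, load_field):
    """Sort ascending by a stored field that must be fetched per hit."""
    loads = 0
    keyed = []
    for doc_id, _ in hits:
        # One load per matching hit (hypothetical stand-in for a disk read).
        keyed.append((load_field(doc_id), doc_id))
        loads += 1
    top = heapq.nsmallest(size, keyed)
    return [doc_id for _, doc_id in top], loads

# Hypothetical data: 75,000 matches, each with a stored timestamp.
timestamps = {f"doc{i}": 1_000_000 - i for i in range(75_000)}
hits = [(doc_id, 1.0) for doc_id in timestamps]

ids, loads = top_by_field(hits, 100, timestamps.__getitem__)
print(len(ids), loads)
```

The number of loads grows with the number of matches, not with the requested page size.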

The only tip to make it faster is to try to reduce the number of documents matching the search (i.e., the 75k, not the 100). In your example, if you already knew that you would get more than 100 hits from “today”, you could add an additional search clause to filter the matches by timestamp as well. This would reduce the number of hits for which we have to load the timestamp.
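
A rough sketch of what such a filtered request could look like, built in Python. The `conjuncts` and numeric range (`field`, `min`, `inclusive_min`) keys follow the FTS query JSON as I understand it, so please check them against the documentation for your server version. The helper name `filtered_query`, the cutoff value, and the choice of `-timestamp` (newest first) are assumptions for illustration.

```python
import json

def filtered_query(term, min_timestamp, size=100):
    """Build an FTS request that matches `term` only in recent documents."""
    return {
        "size": size,
        # Leading "-" requests descending order (assumed: newest first).
        "sort": ["-timestamp"],
        "query": {
            "conjuncts": [
                {"query": term},
                # Numeric range clause: shrinks the set of hits to be sorted.
                {"field": "timestamp", "min": min_timestamp, "inclusive_min": True},
            ]
        },
    }

# Hypothetical cutoff; in practice this would be the start of "today".
request = filtered_query("and", 1505433600)
print(json.dumps(request))
```

This only helps when the range really leaves fewer matches; if the cutoff is too recent you may get fewer than 100 hits and have to widen it.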

marty

gizmo74 · September 15, 2017, 1:07pm

Hi Marty,

I did it in German and just translated everything for the forum post here. So the German “und” was not a stop word using the standard analyzer.

Thanks for your answer, very interesting. Do you know why Elasticsearch (yes I know, bleve is always compared to this ;-)) has similar response times, no matter which sort option I choose?

Actually I am just playing with Couchbase FTS, and except for this issue I’m quite happy with it. Maybe it can replace ES for us at a later time; reducing complexity and separate servers is always good.

Thanks, Pascal

mschoch · September 15, 2017, 1:19pm

Lucene has two capabilities, FieldCache and IndexDocValues, that are used to make this faster. I’m not exactly sure how ES uses these without some explicit configuration (automatically building them for all fields would seem to waste RAM and not be what you want).

Here are two articles explaining how they work:

marty

gizmo74 · September 15, 2017, 1:24pm

Do you plan to implement something like this in Bleve? A manual configuration of the fields used for sorting would be cool.
Thanks for the links, now I have something to read on this rainy weekend.

mschoch · September 15, 2017, 1:32pm

Yes, we’d like to do something similar, probably initially with something in the mapping to give us a hint that you plan to sort on a field.

marty

gizmo74 · August 24, 2018, 8:04am

A year ago we decided to stay with Elasticsearch because fast sort by date was too important for us. Now I have been testing FTS again. First with moss storage, with the same result as a year ago. After a hint from @sreeks I tested the same with scorch, and… the results are amazing. More than 10 times faster than moss for these sort queries, and comparable with ES for my test set of now 200k results. So for everyone who needs custom sort, scorch is definitely worth trying.

@mschoch, do you have any information on how large the overhead of custom sort is now, compared to native _id or _score? There is still the idea of a workaround: using a timestamp as _id, which gives the correct sort order, if that’s faster and more CPU friendly.
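
A small sketch of how such IDs could be built, assuming _id is compared as a plain string (so the timestamp has to be zero-padded to a fixed width for lexicographic order to match numeric order). The helper `make_doc_id`, the `::` separator and the width of 12 digits are my own choices for illustration, not anything FTS prescribes.

```python
def make_doc_id(timestamp, suffix, width=12):
    """Prefix the key with a zero-padded timestamp so string order == time order."""
    if timestamp < 0 or len(str(timestamp)) > width:
        raise ValueError("timestamp does not fit the fixed width")
    # The suffix keeps keys unique when two documents share a timestamp.
    return f"{timestamp:0{width}d}::{suffix}"

# Hypothetical timestamps, deliberately given out of order.
stamps = [1500000000, 999, 1234]
doc_ids = [make_doc_id(t, "story") for t in stamps]

# Plain string sorting now yields chronological order.
print(sorted(doc_ids))
```

Without the padding, "999" would sort after "1500000000" as a string, which is exactly the wrong order.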

sreeks · August 24, 2018, 9:06am

Hi Pascal, as my time zone overlaps more with yours, let me offer some insights.

Yes, scorch has a newer implementation (its own versions of docValues/FieldCaches). But the bleve implementation is not as elaborate and optimised as Lucene’s.

The overall custom sort flow described earlier in this thread still remains the same.
Hence the overhead is still higher for custom fields than for native _score/_id.

To me, the workaround you suggested seems plausible and should in theory bring some improvement; it may be worth a try.

Cheers,
Sreekanth
