FTS Scoring is inconsistent among identical search term results

The_Cimmerian · July 21, 2020, 4:27pm

@sreeks @abhinav - You gentlemen have been very helpful with FTS.

I am seeking a better understanding of why different scores would be returned for identical terms.

My environment:
3-node 6.0.4 Cluster
Node SDK 2.6.9 AND REST API

The issue:
Searching against index on “name” field in “entity” type docs:
Search term: “YAMA AUTOMOTIVE”

Results returned include nearly 100 matches, all with “xxx AUTOMOTIVE” contained within the name value. Many of the matches return names spelled identically to one another.

Given what we know about our data, we expected this sampling:

5 matches bear the name “JUMBO AUTOMOTIVE”
2 bear the name “SLIMS AUTOMOTIVE”.
0 matches expected to match our search term identically or even highly similarly

What was returned was not a surprise. That is, the results include 5, 2, & 0 matches respectively of those above.

What we did not expect was a wide variance in score between identically named results.
For example:

“SLIMS AUTOMOTIVE” was the 6th ranked term with a 2.6122 score. However, the term is found a second time ranked 48th with a 1.8993 score.

We are confident there is a reason that explains this, but we would like to know what it is. Most importantly, we need to know how to adjust our query so that the results make more sense to the user. This is especially troublesome when the user expects the same results against the same set of data and finds that some terms near the cut-off point appear only intermittently.

We know we can sort these by name but that defeats the purpose of the scoring value. We expected identical terms to have largely matching numeric values, even if not perfect.

Thank you for your assistance.

JG

sreeks · July 22, 2020, 7:38am

hi @The_Cimmerian,

Can you confirm whether the name field of the second hit (the 48th item among the 100 hits when you searched for “YAMA AUTOMOTIVE”) contained exactly the text “SLIMS AUTOMOTIVE” and nothing else, or whether “SLIMS AUTOMOTIVE” was only part of that document's name, with extra text around it?

You may also enable the “explain” option in the search request to see how the score/rank was arrived at for each of the hits.
https://docs.couchbase.com/server/6.5/fts/fts-response-object-schema.html#request
Sharing the explain output would then help in checking this further.

In the meantime, you may also explore how scoring works in FTS here:

https://docs.couchbase.com/server/6.5/fts/fts-troubleshooting.html

Cheers!

The_Cimmerian · July 22, 2020, 12:17pm

Hello @sreeks!

That second hit is the exact text stored in the “name” field of our “entity” document type. There is no other “extra” text, nor any non-visible characters.

To rule out unseen characters or a BOM that might be polluting the score, we put each result, and then each source value, through a byte analysis. Each was verified as identical, with no punctuation, special characters, or other non-visible characters.
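A byte analysis of this kind can be sketched as follows. This is only an illustration, not the exact procedure we ran: the helper names `hidden_characters` and `same_bytes` are made up for the example, and the "polluted" string is a constructed test value, not one of our records.

```python
import unicodedata


def hidden_characters(text):
    """Return (index, Unicode name) for every character outside printable ASCII."""
    found = []
    for i, ch in enumerate(text):
        if not (" " <= ch <= "~"):
            # Fall back to the code point when the character has no Unicode name.
            found.append((i, unicodedata.name(ch, f"U+{ord(ch):04X}")))
    return found


def same_bytes(a, b):
    """Compare two strings by their UTF-8 byte sequences."""
    return a.encode("utf-8") == b.encode("utf-8")


clean = "SLIMS AUTOMOTIVE"
polluted = "\ufeffSLIMS AUTOMOTIVE"  # constructed example with a leading BOM

print(hidden_characters(clean))
print(hidden_characters(polluted))
print(same_bytes(clean, polluted))
```

Two values that look identical on screen but differ in bytes would show up here either as a non-empty list or as a failed byte comparison.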

All our indexing throughout our application, thus far, is always scoped to specific fields. None of our indexes are flagged to include in “_all”.

In this case, the index is isolated to the “name” field. Both the 6th-ranked and the 48th-ranked match are on the same field, “name”.

If you would provide a non-public means to provide our REST API results, I will provide it to you. The results and the request contain real, private information so I am not at liberty to share in a public forum. Although the names used to illustrate our issue are highly similar to the real names in our case, they are not the real names. The REST API results I can provide you privately are the actual results.

Before posting my own question, I did search for similar issues and found and read your post, “Understanding FTS score value”. I don’t believe it provided any insight I could use to address or resolve our issue.

Meanwhile, I will read the “fts-troubleshooting” reference you provided.

I look forward to resolving this issue.

Thank you

JG

The_Cimmerian · July 22, 2020, 12:47pm

@sreeks - Below are the actual explained hits, with the name changed to protect the privacy of the entity referenced.

The 6th ranked result:

{
    "index": "entity_name_only_123",
    "id": "abc::app::entity::123",
    "score": 1.6713294483132843,
    "explanation": {
        "value": 1.6713294483132843,
        "message": "sum of:",
        "children": [
            {
                "value": 1.6713294483132843,
                "message": "product of:",
                "children": [
                    {
                        "value": 3.3426588966265687,
                        "message": "sum of:",
                        "children": [
                            {
                                "value": 3.3426588966265687,
                                "message":"weight(name:automotive^1.000000 in �), product of:",
                                "children": [
                                    {
                                        "value": 0.6263527651208354,
                                        "message": "queryWeight(name:automotive^1.000000), product of:",
                                        "children": [
                                            {
                                                "value": 1,
                                                "message": "boost"
                                            },
                                            {
                                                "value": 7.5472383777016825,
                                                "message": "idf(docFreq=18, maxDocs=13249)"
                                            },
                                            {
                                                "value": 0.0829909874016163,
                                                "message": "queryNorm"
                                            }
                                        ]
                                    },
                                    {
                                        "value": 5.336703344770428,
                                        "message":"fieldWeight(name:automotive in �), product of:",
                                        "children": [
                                            {
                                                "value": 1,
                                                "message": "tf(termFreq(name:automotive)=1"
                                            },
                                            {
                                                "value": 0.7071067690849304,
                                                "message":"fieldNorm(field=name, doc=�)"
                                            },
                                            {
                                                "value": 7.5472383777016825,
                                                "message": "idf(docFreq=18, maxDocs=13249)"
                                            }
                                        ]
                                    }
                                ]
                            }
                        ]
                    },
                    {
                        "value": 0.5,
                        "message": "coord(1/2)"
                    }
                ]
            }
        ]
    },
    "sort": [
        "_score"
    ],
    "fields": {
        "name": "SLIMS AUTOMOTIVE"
    }
}

And, here is the 48th ranked result:

{
    "index": "entity_name_only_789",
    "id": "abc::app::entity::456",
    "score": 1.468544659925635,
    "explanation": {
        "value": 1.468544659925635,
        "message": "sum of:",
        "children": [
            {
                "value": 1.468544659925635,
                "message": "product of:",
                "children": [
                    {
                        "value": 2.93708931985127,
                        "message": "sum of:",
                        "children": [
                            {
                                "value": 2.93708931985127,
                                "message":"weight(name:automotive^1.000000 in �), product of:",
                                "children": [
                                    {
                                        "value": 0.586172353500337,
                                        "message": "queryWeight(name:automotive^1.000000), product of:",
                                        "children": [
                                            {
                                                "value": 1,
                                                "message": "boost"
                                            },
                                            {
                                                "value": 7.086092676186764,
                                                "message": "idf(docFreq=29, maxDocs=13191)"
                                            },
                                            {
                                                "value": 0.08272151950117786,
                                                "message": "queryNorm"
                                            }
                                        ]
                                    },
                                    {
                                        "value": 5.010624097694811,
                                        "message":"fieldWeight(name:automotive in �), product of:",
                                        "children": [
                                            {
                                                "value": 1,
                                                "message": "tf(termFreq(name:automotive)=1"
                                            },
                                            {
                                                "value": 0.7071067690849304,
                                                "message":"fieldNorm(field=name, doc=�)"
                                            },
                                            {
                                                "value": 7.086092676186764,
                                                "message": "idf(docFreq=29, maxDocs=13191)"
                                            }
                                        ]
                                    }
                                ]
                            }
                        ]
                    },
                    {
                        "value": 0.5,
                        "message": "coord(1/2)"
                    }
                ]
            }
        ]
    },
    "sort": [
        "_score"
    ],
    "fields": {
        "name": "SLIMS AUTOMOTIVE"
    }
}

sreeks · July 22, 2020, 1:20pm

I do see a variation in the maxDocs parameter in the explanation across results…

"value": 7.086092676186764,
"message": "idf(docFreq=29, maxDocs=13191)"

vs.

"value": 7.5472383777016825,
"message": "idf(docFreq=18, maxDocs=13249)"
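
As a rough check of where the two scores come from, the sketch below recomputes them from the numbers in the two explain outputs. The idf formula, 1 + ln(maxDocs / (docFreq + 1)), is an assumption inferred from the values shown (it reproduces both 7.5472 and 7.0861), and the function names are made up for illustration. The only inputs that differ between the two hits are docFreq, maxDocs and queryNorm. The tf, fieldNorm, boost and coord values are identical.

```python
import math


def idf(doc_freq, max_docs):
    # Assumed formula, inferred from the explain output:
    # idf = 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def score(doc_freq, max_docs, query_norm, field_norm, tf=1.0, boost=1.0, coord=0.5):
    """Recombine the factors in the same order as the explain tree."""
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm   # "queryWeight" node
    field_weight = tf * field_norm * term_idf      # "fieldWeight" node
    return coord * query_weight * field_weight     # "coord(1/2)" applied last


FIELD_NORM = 0.7071067690849304  # same value in both hits

# Statistics copied from the two explain outputs.
hit_6 = score(18, 13249, query_norm=0.0829909874016163, field_norm=FIELD_NORM)
hit_48 = score(29, 13191, query_norm=0.08272151950117786, field_norm=FIELD_NORM)

print(hit_6, hit_48)
```

With the document-side factors equal, the gap between the two scores comes entirely from the per-hit corpus statistics.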

What's the page size of your query? That is, are you fetching pages of size 10, or getting all the hits with a single search request of page size 100? A sample curl query would be highly appreciated.

Dropping a private message is an option in these forums, but going through the CBSE channel is highly recommended, as it helps others facing similar problems.

Cheers!

The_Cimmerian · July 22, 2020, 2:21pm

@sreeks,

I see the variances in the docFreq and maxDocs you highlighted. What steps shall I take to cure that? How may I bring those into sync?

In the meantime, I attempted to paste in and then upload the full, actual results but it exceeded the size limits. If you still need the original, please provide a means to upload them to you.

JG

sreeks · July 22, 2020, 3:21pm

hey @The_Cimmerian,

The most likely reason is that these two hits (6 and 48) are coming from different index partitions. This should be visible from the “index” field of the hits in the response. It happens because the score/rank is computed at the level of each index partition, using that partition's own docFreq and maxDocs, and the document hits are then gathered on a query processing node. So my hunch, from the info shared so far, is that ranking is prone to certain correctness errors here.

How big is your data? If the whole data set can be put into a single-partition FTS index, then my current thinking is that this problem could be solved. But that depends on a lot of factors like scaling, data load, SLAs, etc.
You may try a single-partition index for the above data set to verify this and confirm your SLA compliance.
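
To check the partition theory against a full response, something like the sketch below could group the hits by their “index” field and pull the idf statistics out of each explanation. It is only a sketch: the helper names are hypothetical, the two sample hits are trimmed-down versions of the ones posted above, and it reads only the first idf(...) node it finds, which may belong to a different term when a query matches on several terms.

```python
import re
from collections import defaultdict

IDF_PATTERN = re.compile(r"idf\(docFreq=(\d+), maxDocs=(\d+)\)")


def find_idf_stats(explanation):
    """Depth-first search of an explanation tree for the first idf(...) message."""
    match = IDF_PATTERN.search(explanation.get("message", ""))
    if match:
        return int(match.group(1)), int(match.group(2))
    for child in explanation.get("children") or []:
        stats = find_idf_stats(child)
        if stats:
            return stats
    return None


def stats_by_partition(hits):
    """Map each partition name to the set of (docFreq, maxDocs) pairs seen in its hits."""
    grouped = defaultdict(set)
    for hit in hits:
        grouped[hit["index"]].add(find_idf_stats(hit["explanation"]))
    return dict(grouped)


# Trimmed-down versions of the two hits shared earlier in this thread.
hits = [
    {
        "index": "entity_name_only_123",
        "explanation": {
            "message": "sum of:",
            "children": [{"message": "idf(docFreq=18, maxDocs=13249)"}],
        },
    },
    {
        "index": "entity_name_only_789",
        "explanation": {
            "message": "sum of:",
            "children": [{"message": "idf(docFreq=29, maxDocs=13191)"}],
        },
    },
]

for partition, stats in stats_by_partition(hits).items():
    print(partition, stats)
```

If every partition reports its own (docFreq, maxDocs) pair, that would support the idea that the scores are not comparable across partitions.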

As I mentioned earlier, there is more to this ticket than can easily be tracked over forum ping-pong.

Cheers!

The_Cimmerian · July 22, 2020, 4:00pm

Sample CURL query:

curl --location --request POST 'http://13.41.100.111:8094/api/index/entity_name_only/query?limit=100&offset=0' \
--header 'Content-Type: application/json' \
--header 'Authorization: Basic xxx=' \
--data-raw '{
    "query": {
        "conjuncts": [
            {
                "match": "yama automotive",
                "field": "name"
            }
        ]
    },
    "fields": [
        "*"
    ],
    "sort": [
        "-_score"
    ],
    "size": 100,
    "explain": true
}'
