FTS with PhraseSearch

Hi @abhinav ,

I need some advice. I am trying to use FTS in a project and have run into some limitations. I was hoping you could suggest a solution.

I have some documents with a content similar to:

“word1 word2 word3 word4 word5”

I would like to be able to search for “word1 word2 word3 word4 word5” and receive the above document.

If I search just for “word1 word2 word3”, I don’t want the above document to be returned.

At the same time, the order of the words is important. For example, I don’t want the above document to be returned if I search for “word1 word3 word2 word5 word4”. Because of this, I think the only option I have is to use PhraseSearch.

But how do I make sure that the document doesn’t have additional words when I search for “word1 word2 word3”?

Is there a way to calculate what the score would be for a document containing just “word1 word2 word3”, so that I could filter out the documents that also contain “word4” and “word5” (because they would have a different score)?

Do you have another suggestion?

Hi @abhinav
We have several products in several languages. If you can say which you are using, we’ll probably be able to provide a more specific answer.

@flaviu To support phrase search in the necessary order, you will need to “include term vectors” for the field in question; you can choose the analyzer as needed but for your example text I’ll go with the standard analyzer so the numbers don’t get dropped.

With the above settings, the term dictionary will also include array positions for your text. Here’s a sample …

dictionary:
 word1 - 204 (cc) posting byteSize: 20 cardinality: 2
 word2 - 249 (f9) posting byteSize: 20 cardinality: 2
 word3 - 294 (126) posting byteSize: 20 cardinality: 2
 word4 - 331 (14b) posting byteSize: 18 cardinality: 1
 word5 - 366 (16e) posting byteSize: 18 cardinality: 1

Now you can perform a match_phrase query over this that will take into account the order of the criteria. Note that “term vectors” are a requirement for the match_phrase query.

Here’re queries that would work …

  • {"query": {"field": "fieldX", "match_phrase": "word1 word2 word3"}}
  • {"query": {"field": "fieldX", "match_phrase": "word1 word2 word3 word4 word5"}}

and here’re those that won’t …

  • {"query": {"field": "fieldX", "match_phrase": "word5 word1"}}
  • {"query": {"field": "fieldX", "match_phrase": "word1 word2 word3 word5 word4"}}

Remember a match_phrase query is an analytic query, so the analyzer for the text field (from the index definition) is applied on the search criteria before executing the search.

If you have 2 documents with these contents in fieldX:

  • "word1 word2 word3"
  • "word1 word2 word3 word4 word5"

Then running a match_phrase search for word1 word2 word3 will return both documents as hits, scoring the first above the second because it is an exact match.
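
To make the ordering behaviour above concrete, here is a rough Python sketch of what a phrase match checks. It is only an illustration, not how FTS implements match_phrase: `tokenize` and `phrase_matches` are hypothetical helpers, and lowercasing plus whitespace splitting is an assumed stand-in for the analyzer.

```python
def tokenize(text):
    # Assumed stand-in for an analyzer: lowercase and split on whitespace.
    return text.lower().split()


def phrase_matches(field_text, phrase):
    """Return True if the phrase tokens occur consecutively, in order, in field_text."""
    doc_tokens = tokenize(field_text)
    phrase_tokens = tokenize(phrase)
    if not phrase_tokens:
        return False
    n = len(phrase_tokens)
    # Slide a window of n tokens over the document and compare it to the phrase.
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


docs = ["word1 word2 word3", "word1 word2 word3 word4 word5"]
queries = [
    "word1 word2 word3",
    "word1 word2 word3 word4 word5",
    "word5 word1",
    "word1 word2 word3 word5 word4",
]
for query in queries:
    # One boolean per document, in the same order as docs.
    print(query, [phrase_matches(doc, query) for doc in docs])
```

Note that the shorter phrase matches both documents, which is exactly why the phrase match alone cannot tell an exact match from a partial one.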

Alternatively, you could look into applying a custom analyzer with a shingle token filter while indexing your data and using a non-analytic query such as term to search for your data.

Here’s the definition of a custom analyzer …

"analysis": {
    "analyzers": {
     "temp_shingle": {
      "token_filters": [
       "shingle_min_5_max_5"
      ],
      "tokenizer": "whitespace",
      "type": "custom"
     }
    },
    "token_filters": {
     "shingle_min_5_max_5": {
      "filler": "",
      "max": 5,
      "min": 5,
      "output_original": false,
      "separator": " ",
      "type": "shingle"
     }
    }
   }

With this definition, the index will NOT index text whose shingle length is less than or greater than 5, meaning the text word1 word2 word3 is not even indexed. Here’s a sample term dictionary for the above 2 documents …

dictionary:
 word1 word2 word3 word4 word5 - 9223372039002259457 (8000000080000001) -- docNum: 1, norm: 0.000000

Remember to hook this analyzer to your fieldX while defining the index. Now here’s a query …

{"query": {"field": "fieldX", "term": "word1 word2 word3 word4 word5"}}
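
For illustration, here is a minimal Python sketch of what a shingle filter with min 5 and max 5 would emit. It is not the FTS implementation: `shingles` is a hypothetical helper, and it assumes whitespace tokenization and a sliding window of exactly `size` tokens.

```python
def shingles(text, size=5, separator=" "):
    """Return every run of `size` consecutive tokens, joined by `separator`."""
    tokens = text.split()  # assumed equivalent of the whitespace tokenizer
    return [
        separator.join(tokens[i:i + size])
        for i in range(len(tokens) - size + 1)
    ]


for text in ["word1 word2 word3", "word1 word2 word3 word4 word5"]:
    # Fewer than `size` tokens yields an empty list, so nothing would be indexed.
    print(repr(text), "->", shingles(text))
```

Under the same sliding-window assumption, a field with more than 5 tokens would produce several 5-token shingles, so it is worth checking how longer text behaves in your own index before relying on this.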

Hope this helps.

Thanks for the comprehensive answer, but I am not sure it answers my question.

The problem I have is this: if I have the text “maine coon cat black”, I would like to be able to search for “maine coon cat” (without the “black”), but it would be great to know whether the result is an exact match or a partial match.

If it is a partial match I need to take one action, and if it is a full match I need to take another. So for me it is important either to not return the result at all, or to return the result but somehow know that a full match will have a certain score and a partial match a lower one… but from what I understand there is no way to know what the score for a full phrase match would be.

Does it make sense?

I think this question was referring to me.

I am testing Couchbase Enterprise Edition 7.1.3 build 3479 with Full Text Search enabled

the problem I have is that if I have a text: “maine coon cat black” I would like to be able to search “maine coon cat” (without the “black”) but would be great to know if the result is an exact match or a partial match.

Ok, so you should “include term vectors” for the field when indexing and use the match_phrase query from my earlier comment.

Now about hit scores - scoring is relative. An exact match will score higher than a partial match, but using the score alone you cannot determine whether a hit was exact or partial.

I’ll recommend an approach here where your application will have to determine whether it is an exact match or a partial, after the search engine returns to you the hits for your query.

  1. Index field of interest with store and include term vectors enabled.
  2. Run match_phrase query using the fields options to obtain the actual field content of the hit.
{
  "query": {
    "field": "fieldX",
    "match_phrase": "maine coon cat"
  },
  "fields": [
    "fieldX"
  ]
}

This query would produce hits, where alongside the hit ID, you will also see the field fragment. Using this fragment, your application should be able to determine whether the hit was an exact match or not.

Here’s a sample of how your search response would look …

  "hits": [
    {
      "index": "...",
      "id": "doc1",
      "score": ...,
      "sort": [
        "_score"
      ],
      "fields": {
        "fieldX": "maine coon cat black"
      }
    }
  ]
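
The application-side check could look something like the sketch below. It is only an illustration under stated assumptions: `normalize` and `classify_hit` are hypothetical helpers, the response is a hand-written dict shaped like the sample above (doc2 is made up for the example), and lowercasing plus whitespace collapsing is an assumed approximation of the analyzer; adjust it to match the analyzer you actually use.

```python
def normalize(text):
    # Assumed approximation of the analyzer: lowercase, collapse whitespace.
    return " ".join(text.lower().split())


def classify_hit(hit, field, phrase):
    """Label a hit 'exact' if the stored field equals the phrase, else 'partial'."""
    stored = hit.get("fields", {}).get(field)
    if stored is None:
        # The field was not stored or not requested via "fields".
        return "unknown"
    return "exact" if normalize(stored) == normalize(phrase) else "partial"


# Hand-written response in the shape of the sample; doc2 is hypothetical.
response = {
    "hits": [
        {"id": "doc1", "fields": {"fieldX": "maine coon cat black"}},
        {"id": "doc2", "fields": {"fieldX": "maine coon cat"}},
    ]
}

for hit in response["hits"]:
    print(hit["id"], classify_hit(hit, "fieldX", "maine coon cat"))
```

This keeps the decision in the application, so each kind of match can trigger its own action without relying on the relative score.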

Wow… Sorry for the confusion.

Thanks for the fast answer

Wouldn’t this make the index enormous? I may reach 10 billion documents, each with between 2000 and 4000 characters of text.

The “stored” content is compressed, but yes, it would take up a significant portion of your final index size.

So, basically, there is no way either to exclude a result that is not a full match or to know that it is a partial match, other than storing the text and then comparing the response with the search query. Did I understand correctly?

Yes, that’s correct. You can rely on scoring (tf-idf) to determine which is the closest match, but you’ll need to verify for yourself which is the exact match.

Thank you, I really appreciate your help.

Would it be possible to add this functionality (full phrase match) to Couchbase FTS?

I’ve created a ticket to add it to our roadmap: https://issues.couchbase.com/browse/MB-55479
