Full text search with boosting and regexp

saisrikarmutya · March 6, 2018, 6:31pm

Hi,

My use case is a search that takes a phrase as input and ranks the results according to a query like the one below.

spiderman^6 OR spiderman\ *^5 OR spiderman*^4 OR *\ spiderman\ *^3 OR *spiderman*^2 OR *spiderman^1

1. Title which exactly matches the word spiderman should have the highest preference
2. Title which has spiderman as the first word
3. Title which starts with spiderman
4. Title which has spiderman as a word in between
5. Title which has spiderman anywhere in between
6. Title which ends with spiderman

But it seems that boosting is supported only by the Query String Query, and that it doesn’t support wildcard/regexp, according to this link:

https://developer.couchbase.com/documentation/server/5.0/fts/fts-query-types.html

So is there any alternative for me in FTS to do this?

Thanks

mschoch · March 6, 2018, 7:04pm

So, there is a lot to cover here. First, boosting is supported by all query types, and query string does support wildcard and regexp now; that page of documentation is incomplete.

But, based on your requirements, I don’t think we’ll be able to satisfy all of them.

1. When you say the title “exactly matches” the “word” spiderman, this is already something that can be solved in two very different ways. Option 1 is to tokenize the input but do no analysis on the terms, then do a term search for ‘spiderman’ on the resulting terms. You have to disable analysis for this approach because otherwise things like stemmers will transform words and you won’t get exact matches. The downside of this approach is that “exact match” is still subject to tokenization not always doing what you want (hyphenated words, numbers, abbreviations, periods, etc.). Option 2 is to not tokenize the input at all, and index the entire field as a single term. Typically we don’t do this because it throws away all the FTS features, and you’re basically stuck doing regexp or wildcard queries within the value.
2. There is no primitive in FTS today to express this concept. The closest approximation is to use Option 2 described above, and use a regexp starting with ‘spiderman’.
3. I’m not clear on the difference from number 2. Isn’t spiderman as the first word the same as starting with spiderman?
4. Again, there is no FTS primitive today to do this; you’d have to use regexp.
5. Not clear on the difference from 4. Maybe you mean also including ‘spidermanagement101’?
6. Again, you’d have to use regexp. And you might want to check your rules, as it seems that, depending on how this one is defined, it also satisfies others like 5.
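
To make the difference between Option 1 and Option 2 in item 1 concrete, here is a minimal sketch in plain Java. It does not use the FTS engine: the class and method names are made up for illustration, and the whitespace split is only a rough stand-in for a real tokenizer.

```java
import java.util.Arrays;
import java.util.List;

final class ExactMatchOptions {

    // Option 1 (simplified): split on whitespace only, with no stemming or
    // lowercasing, then look for the search term among the resulting tokens.
    static boolean matchesAsToken(String title, String term) {
        List<String> tokens = Arrays.asList(title.trim().split("\\s+"));
        return tokens.contains(term);
    }

    // Option 2: the entire field value is one term, so a term search only
    // matches when the whole title equals the search term.
    static boolean matchesAsWholeField(String title, String term) {
        return title.equals(term);
    }

    public static void main(String[] args) {
        String[] titles = {"spiderman", "the amazing spiderman", "spider-man"};
        for (String title : titles) {
            System.out.println(title
                + " -> token=" + matchesAsToken(title, "spiderman")
                + ", wholeField=" + matchesAsWholeField(title, "spiderman"));
        }
    }
}
```

With Option 1, "the amazing spiderman" counts as a match because one of its tokens is the search term; with Option 2 only the title "spiderman" itself does, and anything looser needs a regexp or wildcard over the whole value.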

So, taking all this together, it looks like you’ll need to index the entire field as is (using what’s called the keyword analyzer). Then use regular expressions for all of these rules. You can do this in a query string, but I’d recommend against it. Query strings are meant to help users key in queries manually with some ease, not for building complex queries. I’d recommend you use the SDK to build these more complex queries.

Finally, just remember that when you index the field this way, you basically can only do regexp queries on that field now. If you want fuzzy matching, or stemming of words, that is incompatible, and you’d have to index the field multiple times.
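
As a rough illustration of what "regular expressions for all of these rules" could look like, the sketch below writes the six rules as patterns that must match the entire title, which is how a field indexed as a single term would be seen. It uses java.util.regex rather than the FTS engine, the rule names and the class are hypothetical, and the exact regexp syntax accepted by FTS may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

final class TitleRules {

    // Builds one whole-title pattern per rule, in the priority order of the question.
    static Map<String, Pattern> rulesFor(String word) {
        String w = Pattern.quote(word);
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("exact", Pattern.compile(w));
        rules.put("firstWord", Pattern.compile(w + " .*"));
        rules.put("startsWith", Pattern.compile(w + ".*"));
        rules.put("wordInBetween", Pattern.compile(".* " + w + " .*"));
        rules.put("anywhere", Pattern.compile(".*" + w + ".*"));
        rules.put("endsWith", Pattern.compile(".*" + w));
        return rules;
    }

    // Returns the names of all rules whose pattern matches the whole title.
    static List<String> matchingRules(String title, String word) {
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, Pattern> rule : rulesFor(word).entrySet()) {
            if (rule.getValue().matcher(title).matches()) {
                matched.add(rule.getKey());
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        String[] titles = {"spiderman", "spiderman returns", "the amazing spiderman"};
        for (String title : titles) {
            System.out.println(title + " -> " + matchingRules(title, "spiderman"));
        }
    }
}
```

Running it shows the overlap mentioned above: most titles satisfy more than one rule at once, so the rules as written are not mutually exclusive.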

saisrikarmutya · March 7, 2018, 7:14am

Hi @mschoch ,

Thanks for the immediate reply. As you suggested, I have created the index using the keyword analyzer.

{
  "type": "fulltext-index",
  "name": "movies-index",
  "uuid": "35f5501ea4ee643e",
  "sourceType": "couchbase",
  "sourceName": "movies",
  "sourceUUID": "b83aaf0f430524843b1add3f33effac3",
  "planParams": {
    "maxPartitionsPerPIndex": 171
  },
  "params": {
    "doc_config": {
      "mode": "type_field",
      "type_field": "type"
    },
    "mapping": {
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "default_mapping": {
        "dynamic": true,
        "enabled": false
      },
      "default_type": "_default",
      "index_dynamic": true,
      "store_dynamic": false,
      "types": {
        "movie": {
          "default_analyzer": "keyword",
          "dynamic": false,
          "enabled": true,
          "properties": {
            "title": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "include_in_all": true,
                  "include_term_vectors": true,
                  "index": true,
                  "name": "title",
                  "store": true,
                  "type": "text"
                }
              ]
            }
          }
        }
      }
    },
    "store": {
      "kvStoreName": "mossStore"
    }
  },
  "sourceParams": {}
}

I also used a combination of DisjunctionQuery and QueryStringQuery:

public static void findDocByTextMatch(String searchText) {

    String           title  = "title:";
    QueryStringQuery query1 = new QueryStringQuery(title.concat(searchText)).boost(16);
    QueryStringQuery query2 = new QueryStringQuery(title.concat(searchText).concat("*")).boost(8);
    QueryStringQuery query3 = new QueryStringQuery(title.concat("*").concat(searchText).concat("*")).boost(4);
    QueryStringQuery query4 = new QueryStringQuery(title.concat("*").concat(searchText)).boost(2);

    DisjunctionQuery disjunctionQuery = new DisjunctionQuery();
    disjunctionQuery.or(query1, query2, query3, query4);

    SearchQueryResult result = bucket.query(
        new SearchQuery("movies-index", disjunctionQuery));

    for (SearchQueryRow hit : result.hits()) {
        System.out.println("****** score := " + hit.score() + " and content := "
            + bucket.get(hit.id()).content().get("title"));
    }
}

where my search text is spiderman.

The problem I am facing is that I get the same response regardless of the boost values. Even if I reverse the boost order, the response is the same.

saisrikarmutya · March 12, 2018, 10:00am

hi @mschoch…

Did you get a chance to look into the above problem?

Thanks

mschoch · March 12, 2018, 3:32pm

Here are a few more suggestions.

1. Simplify your requirements temporarily to 2 cases, get that working first, then expand it to more cases.
2. Since you are already building the disjunction query in code, I recommend that you not use the query string query at all. It is designed for humans to type in. In cases where you’re building a query in code, there is always a more direct way to get exactly what you want, and avoid parsing issues or other unintended behavior. In this case, the query string queries you’re building are ultimately just doing wildcard searches. So I would replace them with direct creation of wildcard queries.
3. One complication I pointed out in the original message remains, and that is that your wildcards will sometimes match multiple cases. For example, “spiderman” matches “spider*” AND “spider”. The disjunction query also boosts results that satisfy multiple clauses, which is not what you want in this case where you’re trying to get fine-grained control over the relative scores. Currently there is no way to disable this, so this may make your requirements difficult or impossible to satisfy.
4. It would be helpful to create a very small dataset (say 2 docs), share those here, and then show the exact queries and results that aren’t what you want. This gives us a concrete example to work with and try to reproduce, and/or make suggestions.
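
The overlap between wildcard clauses can be seen with a toy model in plain Java. This is only a sketch under a strong assumption: a title's score is taken to be the sum of the boosts of every clause it matches. Real FTS scoring involves more factors, and the class, the `toyScore` helper and the `CLAUSES` array are hypothetical names introduced here.

```java
import java.util.regex.Pattern;

final class DisjunctionOverlap {

    // The four wildcard clauses from the code posted earlier in the thread.
    static final String[] CLAUSES = {"spiderman", "spiderman*", "*spiderman*", "*spiderman"};

    // Translates a wildcard ('*' = any run of characters) into a whole-string regex.
    static Pattern wildcardToRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            regex.append(c == '*' ? ".*" : Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(regex.toString());
    }

    // Toy score (assumption): sum of the boosts of all clauses the title matches.
    static double toyScore(String title, String[] wildcards, double[] boosts) {
        double score = 0;
        for (int i = 0; i < wildcards.length; i++) {
            if (wildcardToRegex(wildcards[i]).matcher(title).matches()) {
                score += boosts[i];
            }
        }
        return score;
    }

    public static void main(String[] args) {
        double[] original = {16, 8, 4, 2};
        double[] reversed = {2, 4, 8, 16};
        String[] titles = {"spiderman", "spiderman returns", "the amazing spiderman"};
        for (String title : titles) {
            System.out.println(title
                + " original=" + toyScore(title, CLAUSES, original)
                + " reversed=" + toyScore(title, CLAUSES, reversed));
        }
    }
}
```

In this toy model the title "spiderman" satisfies all four clauses, so it collects every boost and stays on top even when the boost order is reversed. That is the kind of interaction that makes fine-grained control over relative scores hard with overlapping clauses.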

mschoch · March 12, 2018, 3:34pm

It appears the forums here are doing some sort of markup, which changed my examples in item 3.

"spiderman" matches both "spider*" AND "*spider*"