Searching on value containing a - (hyphen)

stb3 · June 4, 2021, 8:16pm

Hi,

I’m having trouble with what seems like a simple FTS query.

Given the data:

{ "id": "sim_current-anthropology_1989-02_30_1" }
{ "id": "sim_current-anthropology_1989-03_30_1" }
{ "id": "sim_current-anthropology_1989-04_30_1" }
etc…

This FTS works (many results):

{"size": 10, "explain": false, "fields": ["id"], "query":{"wildcard": "sim_current*"}}

This does not (0 results):

{"size": 10, "explain": false, "fields": ["id"], "query":{"wildcard": "sim_current-*"}}

The difference is a “-” at the end of “current”.

I tried with “prefix” instead of “wildcard”.

I tried escaping the hyphen with a backslash but it errors with “invalid character ‘-’ in string escape code”.

Tried regexp:

{"size": 10, "explain": false, "fields": ["id"], "query":{"regexp": "sim_current[-].+"}}

Tried a non-FTS query with LIKE "sim_current-anthropology%", but it is slow even when indexed (16M records, of which 1.5M begin with "sim_").

Is there something I am missing? Thanks.

vsr1 · June 5, 2021, 5:52am

If you are using N1QL: the underscore is a wildcard in LIKE, so you need to escape it as shown below.
https://docs.couchbase.com/server/current/n1ql/n1ql-language-reference/comparisonops.html

create index ix1 ON default(id);
SELECT d.*
FROM default AS d
WHERE d.id LIKE "sim\\_current-%"

Check the EXPLAIN output and verify that the spans pass enough information to the indexer for it to produce fewer items.

    {
        "exact": true,
        "range": [
            {
                "high": "\"sim_current.\"",
                "inclusion": 1,
                "low": "\"sim_current-\""
            }
        ]
    }
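
To illustrate why the span above ends at "sim_current.", here is a minimal Python sketch of how a prefix LIKE can be turned into a range scan. The helper name `like_prefix_to_range` and the second and third sample ids are hypothetical, and plain code-point comparison stands in for the index collation; it is not the indexer's actual implementation.

```python
from typing import Tuple


def like_prefix_to_range(prefix: str) -> Tuple[str, str]:
    """Return (low, high) such that low <= value < high holds for
    every value that starts with the given prefix."""
    if not prefix:
        raise ValueError("prefix must be non-empty")
    # Bump the last character by one code point to get the
    # exclusive upper bound: '-' (45) becomes '.' (46).
    high = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return prefix, high


low, high = like_prefix_to_range("sim_current-")
print(low, high)

# Hypothetical sample ids, only the first comes from this thread.
ids = [
    "sim_current-anthropology_1989-02_30_1",
    "sim_current_events_1990",
    "sim_american-antiquity_1990-01_55_1",
]
matches = [v for v in ids if low <= v < high]
print(matches)
```

With the underscore escaped, the whole prefix is literal, so the scan can be limited to that narrow range instead of everything starting with "sim".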

stb3 · June 5, 2021, 6:25am

Thank you. I’m using the API not N1QL. Tried escaping the underscore and/or dash with double \ with no change.

Here is the full command:

curl -u priv:priv -X POST -H "Content-Type: application/json" http://localhost:8094/api/index/index_2/query -d '{"size": 10, "explain": true, "fields": ["id"], "query":{"wildcard": "sim_current-*"}}'

The output:

{"status":{"total":6,"failed":0,"successful":6},"request":{"query":{"wildcard":"sim_american-*"},"size":10,"from":0,"highlight":null,"fields":["id"],"facets":null,"explain":true,"sort":["-_score"],"includeLocations":false,"search_after":null,"search_before":null},"hits":[],"total_hits":0,"max_score":0,"took":710955,"facets":null}

As noted earlier, changing to this:

{"wildcard": "sim_current*"}

…it returns the correct results. But this does not work:

{"wildcard": "sim_current-*"}

I need to include the hyphen to narrow the results, as there are too many without it.

sreeks · June 5, 2021, 7:59am

@stb3 ,

The problem here mostly stems from the analyzer used for the field.
The default standard analyzer splits the text at the hyphen and drops the hyphen itself, so no indexed term contains “sim_current-” and the wildcard has nothing to match.
You can explore that here: http://bleveanalysis.couchbase.com/analysis

Also, since the above query doesn’t specify a target field, the search is applied to the default _all field.
So you could either change the default analyzer in the index definition, or
change the analyzer for the field of interest and use that field as the target field in the query.

The keyword analyzer would be one choice, but the right analyzer depends on your exact search requirements.

https://docs.couchbase.com/server/current/fts/fts-using-analyzers.html

Cheers
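
The effect described in this reply can be sketched in Python. This is only a rough simulation: `standard_like_terms` approximates the standard analyzer by splitting on anything that is not a letter, digit or underscore and lowercasing, which is an assumption and not bleve's actual tokenizer, and `fnmatchcase` stands in for the wildcard query.

```python
import re
from fnmatch import fnmatchcase


def standard_like_terms(text):
    # Rough stand-in for the standard analyzer: split on anything that is
    # not a letter, digit or underscore, then lowercase each piece.
    return [t.lower() for t in re.split(r"[^\w]+", text) if t]


def keyword_terms(text):
    # The keyword analyzer keeps the whole value as one term.
    return [text]


def wildcard_hits(pattern, terms):
    # Wildcard queries are matched against indexed terms, not the raw text.
    return [t for t in terms if fnmatchcase(t, pattern)]


doc_id = "sim_current-anthropology_1989-02_30_1"
std = standard_like_terms(doc_id)
kw = keyword_terms(doc_id)

print(std)
print(wildcard_hits("sim_current*", std))   # matches the term before the hyphen
print(wildcard_hits("sim_current-*", std))  # no term contains the hyphen
print(wildcard_hits("sim_current-*", kw))   # the single keyword term matches
```

Under these assumptions the hyphenated pattern only matches once the whole id is indexed as a single term.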

stb3 · June 5, 2021, 4:07pm

Hello Sreeks - yes that worked! I had to change the default analyzer, since the index had already been built with the ‘standard’ analyzer. Thank you! I will need to learn more about analyzers.

idangazit · August 24, 2021, 1:13pm

@sreeks Hi,
I have an FTS index, and on one of the fields I set up the keyword analyzer (to support ‘-’ in my value):

 "connection_name": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "keyword",
                  "docvalues": true,
                  "include_in_all": true,
                  "include_term_vectors": true,
                  "index": true,
                  "name": "connection_name",
                  "type": "text"
                }
              ]
            },

Now, if I am searching for values such as idan-connector (search ida/idan/idan-/idan-c) I get the expected results, but if I am searching for values such as idanCon (search ida/idanC/idanCon) I get 0 results.

abhinav · August 24, 2021, 11:03pm

@idangazit A keyword analyzer generates a single token for the entire text field, so for "full text search" the only token generated is ["full text search"].

Taking your example, if your documents contained: "connection_name": "idan-connector", the token indexed for the field would be idan-connector. And you should be able to match all those documents if you use the following wildcards over field connection_name …

ida*
idan*
idan-c*

If there do exist documents that contain connection_name:idanCon, searching for the following wildcards should match them …

ida*
idan*
idanC*

If I have misunderstood your question, sharing the entire index definition and your exact search requests would help.
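
As a sketch of what "wildcards over field connection_name" looks like in a request body, the Python below builds the JSON for the REST query endpoint used earlier in this thread. The helper name `wildcard_request` is hypothetical; the `field` key sits inside the query object, the same shape as the prefix query in the curl example later in this thread. Without it the query runs against the default _all field.

```python
import json


def wildcard_request(pattern, field, size=10):
    """Build an FTS request body for a wildcard query scoped to one field."""
    return {
        "size": size,
        "fields": [field],
        # "field" limits the wildcard to this field instead of _all.
        "query": {"wildcard": pattern, "field": field},
    }


body = json.dumps(wildcard_request("idan-c*", "connection_name"))
print(body)
```

The resulting string can be passed to curl with -d against /api/index/<indexName>/query.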

idangazit · August 25, 2021, 12:55pm

Thanks @abhinav,
The thing is that when I search for ida/idan/idanC, it can’t find any documents (even though there are documents with connection_name idanCon).
Here is my index definition:

{
  "type": "fulltext-index",
  "name": "something_search_v2",
  "uuid": "1aa46fa7d0e524e7",
  "sourceType": "couchbase",
  "sourceName": "somethings",
  "planParams": {
    "maxPartitionsPerPIndex": 171,
    "indexPartitions": 6
  },
  "params": {
    "doc_config": {
      "docid_prefix_delim": "",
      "docid_regexp": "",
      "mode": "type_field",
      "type_field": "type"
    },
    "mapping": {
      "analysis": {},
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "default_mapping": {
        "dynamic": true,
        "enabled": false
      },
      "default_type": "_default",
      "docvalues_dynamic": true,
      "index_dynamic": true,
      "store_dynamic": false,
      "type_field": "_type",
      "types": {
        "something": {
          "dynamic": false,
          "enabled": true,
          "properties": {
            "something_id": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "include_in_all": true,
                  "include_term_vectors": true,
                  "index": true,
                  "name": "case_id",
                  "store": true,
                  "type": "text"
                }
              ]
            },
            "connection_name": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "keyword",
                  "docvalues": true,
                  "include_in_all": true,
                  "include_term_vectors": true,
                  "index": true,
                  "name": "connection_name",
                  "type": "text"
                }
              ]
            },
            "something2": {
              "default_analyzer": "web",
              "dynamic": true,
              "enabled": true
            },
            "something2_type": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "include_in_all": true,
                  "include_term_vectors": true,
                  "index": true,
                  "name": "threat_type",
                  "type": "text"
                }
              ]
            },
            "updated_at": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "include_in_all": true,
                  "include_term_vectors": true,
                  "index": true,
                  "name": "updated_at",
                  "type": "number"
                }
              ]
            }
          }
        }
      }
    },
    "store": {
      "indexType": "scorch"
    }
  },
  "sourceParams": {}
}

abhinav · August 25, 2021, 8:54pm

@idangazit you are using the keyword analyzer for the field connection_name.
So for documents carrying “connection_name:idanCon”, the token indexed is idanCon.

If you’re running a match/term query over it, you’ll need to search for “idanCon” to match those documents.
If you’re running a wildcard query, like I mentioned earlier - you can use …

ida*
idan*
idanC*

idangazit · August 26, 2021, 11:03am

@abhinav thanks,
Our search includes prefix query and match phrase query:

qq.Or(search.NewPrefixQuery(searchTxt).Field("connection_name"))
qq.Or(search.NewMatchPhraseQuery(searchTxt).Field("connection_name"))

And when we search for ida/idan/idanC there are zero hits

sreeks · August 26, 2021, 12:11pm

Your prefix queries should have worked from the info so far.
Are you sure that there are no whitespace characters before idanCon?

idangazit · August 26, 2021, 1:30pm

hi @sreeks , thanks.
There is no whitespace there.
When searching for:
idan-c: it finds documents with connection_name: idan-connector
devc: it does not find documents with connection_name: devconnector
devconnector: it does not find documents with connection_name: devconnector
test-c: it finds documents with connection_name: test-connector
moshe-h: it finds documents with connection_name: moshe-haim
testc: it does not find documents with connection_name: testconn
testconn: it does not find documents with connection_name: testconn

Any ideas on what might happen here?

abhinav · August 26, 2021, 1:46pm

@idangazit I don’t see why any of the above situations wouldn’t work if you are using the keyword analyzer and the documents are actually present.

What version of couchbase server are you using?

idangazit · August 26, 2021, 1:49pm

@abhinav I am using community-6.5.0

sreeks · August 26, 2021, 3:51pm

Just trying to narrow down the issue:
Does the same query work if you attempt a direct curl call at the FTS node?
Can you please check with a direct prefix query?

curl -XPOST -H "Content-Type: application/json" -u <userName:password> \
  http://<node>:8094/api/index/<indexName>/query -d '{"query": {"prefix": "devc", "field": "connection_name"}}'

idangazit · August 29, 2021, 7:51am

@sreeks thanks,
I think I understand the problem - the keyword analyzer preserves the field as it is, including upper case.
Given the current definition of my index (above), how do you suggest I modify the index to use the keyword analyzer with ‘to_lower’? (I don’t want to change the default analyzer to keyword, since most of the fields in the index use the standard analyzer.)

sreeks · August 30, 2021, 3:42am

You could create a custom analyzer under the Analyzers section of the index definition’s Edit page and use it for the field “connection_name”.
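
A sketch of what such a custom analyzer might look like, built as a Python dict that can be dumped into the index definition's "analysis" section. It assumes bleve's `single` tokenizer (one token for the whole value, as the keyword analyzer does) and `to_lower` token filter; the analyzer name `keyword_lower` is hypothetical. The `analyze` helper only simulates the resulting term and is not the real analysis code.

```python
import json


def keyword_lower_analyzer():
    # Custom analyzer: whole value as one token, then lowercased.
    return {
        "type": "custom",
        "char_filters": [],
        "tokenizer": "single",
        "token_filters": ["to_lower"],
    }


def analyze(text):
    # Simulation of the analyzer above: one lowercased term per value.
    return [text.lower()]


# Goes under params.mapping.analysis; "keyword_lower" is a made-up name.
analysis = {"analyzers": {"keyword_lower": keyword_lower_analyzer()}}
print(json.dumps(analysis, indent=2))

term = analyze("idanCon")[0]
# Prefix queries are not analyzed, so the search text must be lowercased too.
print(term.startswith("idanC"), term.startswith("idanC".lower()))
```

The connection_name field would then reference it with "analyzer": "keyword_lower" in place of "keyword", leaving the default analyzer as standard for the other fields. Since prefix, wildcard and term queries are not analyzed, the application would also need to lowercase the search text before sending it.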
