Searching on value containing a - (hyphen)

Hi,

I’m having trouble with what seems like a simple FTS query.

Given the data:

{ "id": "sim_current-anthropology_1989-02_30_1" }
{ "id": "sim_current-anthropology_1989-03_30_1" }
{ "id": "sim_current-anthropology_1989-04_30_1" }
etc…

This FTS works (many results):

{"size": 10, "explain": false, "fields": ["id"], "query":{"wildcard": "sim_current*"}}

This does not (0 results):

{"size": 10, "explain": false, "fields": ["id"], "query":{"wildcard": "sim_current-*"}}

The difference is a “-” at the end of “current”.

I tried with “prefix” instead of “wildcard”.

I tried escaping the hyphen with a backslash but it errors with “invalid character ‘-’ in string escape code”.

Tried regexp:

{"size": 10, "explain": false, "fields": ["id"], "query":{"regexp": "sim_current[-].+"}}

Tried a non-FTS query with LIKE "sim_current-anthropology%", but it is slow even when indexed (16M records, of which 1.5M begin with "sim_").

Is there something I am missing? Thanks.

If you are looking for a N1QL solution: the underscore is a wildcard in LIKE, so you need to escape it as shown below.
https://docs.couchbase.com/server/current/n1ql/n1ql-language-reference/comparisonops.html

create index ix1 ON default(id);
SELECT d.*
FROM default AS d
WHERE d.id LIKE "sim\\_current-%"

Check the EXPLAIN output and verify that the spans pass enough information to the indexer for it to produce fewer items.

{
    "exact": true,
    "range": [
        {
            "high": "\"sim_current.\"",
            "inclusion": 1,
            "low": "\"sim_current-\""
        }
    ]
}
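To see where those low/high values come from, here is a rough Python sketch of how a LIKE pattern with a literal prefix can be turned into a range. It is only an illustration of the idea, not Couchbase's actual planner code, and the function name like_prefix_span is made up for this example. The backslash escape is written once here because the Python raw string is not unescaped a second time.

```python
from typing import Tuple


def like_prefix_span(pattern: str) -> Tuple[str, str]:
    """Return (low, high) bounds for a LIKE pattern with a literal prefix.

    Hypothetical helper for illustration only: every string that starts
    with the literal prefix satisfies low <= s < high.
    """
    prefix = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Escaped character: take the next character literally.
            prefix.append(pattern[i + 1])
            i += 2
        elif ch in "%_":
            # The first unescaped wildcard ends the literal prefix.
            break
        else:
            prefix.append(ch)
            i += 1
    low = "".join(prefix)
    if not low:
        raise ValueError("pattern has no literal prefix")
    # Upper bound: bump the last character of the prefix by one code point.
    high = low[:-1] + chr(ord(low[-1]) + 1)
    return low, high


# Escaped underscore: the whole "sim_current-" prefix is usable.
print(like_prefix_span(r"sim\_current-%"))  # ('sim_current-', 'sim_current.')
# Unescaped underscore: the usable prefix stops at "sim".
print(like_prefix_span("sim_current-%"))    # ('sim', 'sin')
```

With the unescaped pattern the range only narrows to "sim", which would explain a scan over all 1.5M "sim_" records.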

Thank you. I’m using the API not N1QL. Tried escaping the underscore and/or dash with double \ with no change.

Here is the full command:

curl -u priv:priv -X POST -H "Content-Type: application/json" http://localhost:8094/api/index/index_2/query -d '{"size": 10, "explain": true, "fields": ["id"], "query":{"wildcard": "sim_current-*"}}'

The output:

{"status":{"total":6,"failed":0,"successful":6},"request":{"query":{"wildcard":"sim_american-*"},"size":10,"from":0,"highlight":null,"fields":["id"],"facets":null,"explain":true,"sort":["-_score"],"includeLocations":false,"search_after":null,"search_before":null},"hits":[],"total_hits":0,"max_score":0,"took":710955,"facets":null}

As noted earlier, changing to this:

{"wildcard": "sim_current*"}

…it returns the correct results. But this does not work:

{"wildcard": "sim_current-*"}

I need to include the hyphen to narrow the results, as there are too many without it.

@stb3 ,

The problem here mostly stems from the analyzer in use for the field being searched.
The default standard analyzer splits the text at the hyphen, so the hyphen never ends up in an indexed token and hence is not searchable.
You can explore that here: Bleve Text Analysis Wizard
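If you can't get to the wizard, a rough Python approximation shows the effect. Note the assumptions: the regex below is only a stand-in for the standard analyzer's tokenizer (it is not Bleve's real Unicode segmentation), and the helper names are invented for this sketch.

```python
import fnmatch
import re


def standard_like_tokens(text):
    """Approximate the standard analyzer: split on anything that is not a
    letter, digit, or underscore, then lowercase. An assumption-laden
    stand-in, not Bleve's actual tokenizer."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def keyword_tokens(text):
    """The keyword analyzer keeps the whole value as a single token."""
    return [text]


def wildcard_hits(pattern, tokens):
    """Match a wildcard pattern against whole tokens, case-sensitively."""
    return [t for t in tokens if fnmatch.fnmatchcase(t, pattern)]


value = "sim_current-anthropology_1989-02_30_1"

std = standard_like_tokens(value)
print(std)  # ['sim_current', 'anthropology_1989', '02_30_1']

print(wildcard_hits("sim_current*", std))   # ['sim_current']
print(wildcard_hits("sim_current-*", std))  # []

kw = keyword_tokens(value)
print(wildcard_hits("sim_current-*", kw))   # ['sim_current-anthropology_1989-02_30_1']
```

No token produced by the split contains a hyphen, so a pattern with a hyphen in it has nothing to match against. With a single keyword token the same pattern matches.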

Also, since you haven’t specified a target field in the above query, the search is applied to the default _all field.
So you could either change the default analyzer in the index definition, or
change the analyzer for the field of interest and name that field as the target field in the query.

The keyword analyzer would be one choice, but the final analyzer depends on what exactly your search requirements are.
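For the second option, a request body that names the target field could be built like the sketch below. The "field" key sits inside the query object; the top-level "fields" list only controls which stored fields come back in the hits. The helper name wildcard_request is made up, and this assumes the id field has been mapped with an analyzer that keeps the hyphen.

```python
import json


def wildcard_request(pattern, field, size=10):
    """Build an FTS request body whose wildcard query targets one field.

    Hypothetical helper for illustration; it only builds the JSON body
    and does not send anything.
    """
    return {
        "size": size,
        "explain": False,
        "fields": [field],
        # "field" restricts the search itself to this field instead of _all.
        "query": {"wildcard": pattern, "field": field},
    }


body = wildcard_request("sim_current-*", "id")
print(json.dumps(body))
# {"size": 10, "explain": false, "fields": ["id"], "query": {"wildcard": "sim_current-*", "field": "id"}}
```

The printed JSON can be passed to curl with -d in the same way as the earlier requests.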

https://docs.couchbase.com/server/current/fts/fts-using-analyzers.html

Cheers

Hello Sreeks - yes, that worked! I had to change the default analyzer, since the index had already been built with the ‘standard’ analyzer. Thank you! I will need to learn more about analyzers.

@sreeks Hi,
I have an FTS index, and for one of the fields I set up the keyword analyzer (to support ‘-’ in my values):

 "connection_name": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "keyword",
                  "docvalues": true,
                  "include_in_all": true,
                  "include_term_vectors": true,
                  "index": true,
                  "name": "connection_name",
                  "type": "text"
                }
              ]
            },

Now, if I am searching for values such as idan-connector (search ida/idan/idan-/idan-c) I get the expected results, but if I am searching for values such as idanCon (search ida/idanC/idanCon) I get 0 results.

@idangazit A keyword analyzer generates a single token for the entire text field, so for "full text search" the only token generated is ["full text search"].

Taking your example, if your documents contained: "connection_name": "idan-connector", the token indexed for the field would be idan-connector. And you should be able to match all those documents if you use the following wildcards over field connection_name

ida*
idan*
idan-c*

If there do exist documents that contain connection_name:idanCon, searching for the following wildcards should match them …

ida*
idan*
idanC*

If I have misunderstood your question, sharing the entire index definition and your exact search requests would help.

Thanks @abhinav,
The thing is that when I search for ida/idan/idanC, it can’t find any documents (even though there are documents with connection_name idanCon).
Here is my index definition:

{
  "type": "fulltext-index",
  "name": "something_search_v2",
  "uuid": "1aa46fa7d0e524e7",
  "sourceType": "couchbase",
  "sourceName": "somethings",
  "planParams": {
    "maxPartitionsPerPIndex": 171,
    "indexPartitions": 6
  },
  "params": {
    "doc_config": {
      "docid_prefix_delim": "",
      "docid_regexp": "",
      "mode": "type_field",
      "type_field": "type"
    },
    "mapping": {
      "analysis": {},
      "default_analyzer": "standard",
      "default_datetime_parser": "dateTimeOptional",
      "default_field": "_all",
      "default_mapping": {
        "dynamic": true,
        "enabled": false
      },
      "default_type": "_default",
      "docvalues_dynamic": true,
      "index_dynamic": true,
      "store_dynamic": false,
      "type_field": "_type",
      "types": {
        "something": {
          "dynamic": false,
          "enabled": true,
          "properties": {
            "something_id": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "include_in_all": true,
                  "include_term_vectors": true,
                  "index": true,
                  "name": "case_id",
                  "store": true,
                  "type": "text"
                }
              ]
            },
            "connection_name": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "analyzer": "keyword",
                  "docvalues": true,
                  "include_in_all": true,
                  "include_term_vectors": true,
                  "index": true,
                  "name": "connection_name",
                  "type": "text"
                }
              ]
            },
            "something2": {
              "default_analyzer": "web",
              "dynamic": true,
              "enabled": true
            },
            "something2_type": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "include_in_all": true,
                  "include_term_vectors": true,
                  "index": true,
                  "name": "threat_type",
                  "type": "text"
                }
              ]
            },
            "updated_at": {
              "dynamic": false,
              "enabled": true,
              "fields": [
                {
                  "include_in_all": true,
                  "include_term_vectors": true,
                  "index": true,
                  "name": "updated_at",
                  "type": "number"
                }
              ]
            }
          }
        }
      }
    },
    "store": {
      "indexType": "scorch"
    }
  },
  "sourceParams": {}
}

@idangazit you are using the keyword analyzer for the field connection_name.
So for documents carrying “connection_name:idanCon”, the token indexed is idanCon.

If you’re running a match/term query over it, you’ll need to search for “idanCon” to match those documents.
If you’re running a wildcard query, like I mentioned earlier - you can use …

ida*
idan*
idanC*

@abhinav thanks,
Our search includes prefix query and match phrase query:

qq.Or(search.NewPrefixQuery(searchTxt).Field("connection_name"))
qq.Or(search.NewMatchPhraseQuery(searchTxt).Field("connection_name"))

And when we search for ida/idan/idanC there are zero hits

Your prefix queries should have worked from the info so far.
Are you sure that there are no other white space characters before idanCon ?

hi @sreeks , thanks.
There are no white spaces there.
When searching for these:
idan-c: it finds documents with connection_name: idan-connector
devc: it does not find documents with connection_name: devconnector
devconnector: it does not find documents with connection_name: devconnector
test-c: it finds documents with connection_name: test-connector
moshe-h: it finds documents with connection_name: moshe-haim
testc: it does not find documents with connection_name: testconn
testconn: it does not find documents with connection_name: testconn

Any ideas on what might happen here?

@idangazit I don’t see why any of the above situations wouldn’t work if you are using the keyword analyzer and the documents are actually present.

What version of couchbase server are you using?

@abhinav I am using community-6.5.0

Just trying to narrow down the issue,
Does the same query work if you attempt a direct curl call at the FTS node?
Can you please check with a direct prefix query?

curl -XPOST -H "Content-Type: application/json" -u <userName:password> \
  http://<node>:8094/api/index/<indexName>/query -d '{"query": {"prefix": "devc", "field": "connection_name"}}'

@sreeks thanks,
I think I understand the problem: the keyword analyzer preserves the field as it is, including upper case.
Given the current definition of my index (above), how would you suggest I modify it to use the keyword analyzer with ‘to_lower’? (I don’t want to change the default analyzer to keyword, since most of the fields in the index use the standard analyzer.)

You could create a custom analyzer under the Analyzers section of the index definition’s Edit page and use it for the field “connection_name”.
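In index-definition terms, such an analyzer might look roughly like the sketch below, written as Python dicts so it can be printed as JSON. Assumptions to check against your own cluster: the analyzer name keyword_lower is made up, "single" and "to_lower" are (as far as I know) the names of the whole-value tokenizer and the lowercase filter, and prefix queries are not run through the analyzer, so the search text is lowercased on the client side here.

```python
import json

# Hypothetical analyzer name; any unused name should work.
ANALYZER_NAME = "keyword_lower"

# Would go under params.mapping.analysis in the index definition.
analysis = {
    "analyzers": {
        ANALYZER_NAME: {
            "type": "custom",
            "char_filters": [],
            "tokenizer": "single",          # whole value becomes one token
            "token_filters": ["to_lower"],  # then the token is lowercased
        }
    }
}

# The connection_name field from the index above, pointed at the new analyzer.
connection_name_field = {
    "analyzer": ANALYZER_NAME,
    "docvalues": True,
    "include_in_all": True,
    "include_term_vectors": True,
    "index": True,
    "name": "connection_name",
    "type": "text",
}


def index_token(value):
    """Simulate what the sketched analyzer would index: one lowercased token."""
    return value.lower()


def prefix_matches(search_text, value):
    """Simulate a prefix query, lowercasing the search text in the client
    (assumption: the prefix query itself is not analyzed)."""
    return index_token(value).startswith(search_text.lower())


print(json.dumps({"analysis": analysis}, indent=2))
print(prefix_matches("idanC", "idanCon"))          # True
print(prefix_matches("idan-c", "idan-connector"))  # True
```

Only connection_name changes; the other fields keep using the default standard analyzer. The index has to be rebuilt after the change for existing documents to be re-analyzed.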
