Finding cafes & gmail

househippo · April 4, 2017, 2:59pm

How do I answer this question: “Find everyone with Gmail who likes coffee”?

{
  "type": "profile",
  "email": "ted@gmail.com",
  "friends": [{"name": "Bob"}, {"name": "Kevin"}],
  "bio": "I like long walks on the beach and listening to music in cafes."
}

In N1QL I would do a very complex and inefficient query like this:

SELECT * FROM `bucket` WHERE
    email LIKE "%gmail.com" AND
    (REGEXP_CONTAINS(bio, "cafe+.*") OR
     REGEXP_CONTAINS(bio, "caffe+.*") OR
     REGEXP_CONTAINS(bio, "café+.*"));

mschoch · April 4, 2017, 4:14pm

Great question. There are really two main parts to this.

Matching cafe-like things:

The typical full-text approach is to use a language-specific analyzer that transforms the various forms of a word into a single common form. I did a quick check with the “en” analyzer that FTS ships with today, and there are a few things to note:
a) words starting with cafe- and caffe- get stemmed to different tokens
b) é does not get transformed to e

For (a), we could just search for multiple terms; for (b), we’d probably recommend building a custom analyzer that does some sort of ASCII folding.

Matching email domain:

You could index the entire email field as-is and use a regular expression match, like you do with N1QL, but this will be just as inefficient in FTS. Instead, I’d recommend building a custom analyzer that indexes only the domain portion. Then you can do exact matching for "@gmail.com", which will be very fast.

The custom analyzer could use our existing “regexp” tokenizer with the pattern “@.*” which will produce a single token with just the domain portion of the email.
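To see what that pattern yields, here is a small Python sketch that imitates a regexp tokenizer by emitting every match of the pattern as a token. It only mimics the behavior for illustration; `regexp_tokenize` is a made-up helper name, not part of FTS.

```python
import re

def regexp_tokenize(pattern, text):
    """Emit each non-overlapping match of `pattern` as one token.

    Hypothetical helper that imitates a regexp tokenizer; not FTS code.
    """
    return [m.group(0) for m in re.finditer(pattern, text)]

# The sample document's email yields one token: the "@" plus the domain.
tokens = regexp_tokenize(r"@.*", "ted@gmail.com")
print(tokens)
```

Because the local part ("ted") never becomes a token, an exact term lookup on "@gmail.com" is all the query needs.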

All of the previous steps would be done before building the index, in the FTS index mapping. Once the index is built, you would run a query like:

{
  "conjuncts": [
    {
      "term": "@gmail.com",
      "field": "emailDomain"
    },
    {
      "disjuncts": [
        {
          "match": "cafe",
          "field": "bio"
        },
        {
          "match": "caffe",
          "field": "bio"
        }
      ]
    }
  ]
}
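If the list of spelling variants grows, a query body of this shape could be generated rather than written by hand. A minimal sketch in Python, assuming the same field names as the query above; `build_query` is a hypothetical helper, and it only builds the JSON, it does not send it anywhere.

```python
import json

def build_query(domain, bio_terms):
    """Build a conjunction: exact domain term AND any one of the bio matches."""
    return {
        "conjuncts": [
            # Exact, unanalyzed lookup against the domain-only field.
            {"term": domain, "field": "emailDomain"},
            # One analyzed match clause per spelling variant.
            {"disjuncts": [{"match": t, "field": "bio"} for t in bio_terms]},
        ]
    }

query = build_query("@gmail.com", ["cafe", "caffe"])
print(json.dumps(query, indent=2))
```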

The only piece missing above is that I don’t think we offer a built-in ASCII folding token filter today. I’ll look at adding that. Meanwhile, you could work around it with a character filter, which simply does character-for-character replacements for the characters you care about.
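The effect of such a character-for-character replacement can be sketched in Python. This only illustrates the idea, not how FTS implements character filters; the mapping and the name `fold_chars` are assumptions, and the table would need an entry for each accented character you care about.

```python
# Hypothetical replacement table: each key is a single precomposed character.
FOLD_TABLE = str.maketrans({
    "\u00e9": "e",  # é
    "\u00e8": "e",  # è
    "\u00ea": "e",  # ê
})

def fold_chars(text):
    """Replace each mapped character with its plain counterpart."""
    return text.translate(FOLD_TABLE)

print(fold_chars("caf\u00e9s"))
```

Applied before tokenizing, this would make "café" and "cafe" reach the analyzer as the same text, so a single match on "cafe" covers both.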

Let me know if you have more questions.
