Finding cafes & gmail

How do I answer this question: “Find everyone with a Gmail address who likes coffee”?

{
  "type": "profile",
  "email": "ted@gmail.com",
  "friends": [{"name": "Bob"},
              {"name": "Kevin"}],
  "bio": "I like long walks on the beach and listening to music in cafes."
}

In N1QL I would do a very complex and inefficient query like this:

SELECT * FROM `bucket` WHERE
    email LIKE "%gmail.com" AND
    (REGEXP_CONTAINS(bio, "cafe+.*") OR
     REGEXP_CONTAINS(bio, "caffe+.*") OR
     REGEXP_CONTAINS(bio, "café+.*"));

Great question. There are really two main parts to this.

  1. Matching cafe-like things.

The typical full-text approach is to use a language-specific analyzer that transforms the various forms of a word into a single common form. I did a quick check with the “en” analyzer that FTS ships with today, and there are a few things to note:
a) Words starting with cafe- and caffe- get stemmed to different tokens.
b) é does not get transformed to e.

For (a), we could just search for multiple terms. For (b), we’d probably recommend building a custom analyzer that does some sort of ASCII folding.
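To make the ASCII folding idea concrete, here is a minimal Python sketch of what such a filter would do to a token. This is only an illustration of the transformation, not FTS code; the function name `ascii_fold` is made up for this example.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Strip diacritics: decompose each character, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# "caf\u00e9" is "café" written with an escape so the accent is unambiguous.
print(ascii_fold("caf\u00e9"))
```

With folding applied at both index time and query time, a search for "cafe" would also cover documents that spell it with an accent.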

  2. Matching the email domain.

You could index the entire email field as-is and use a regular expression match, like you do with N1QL, but this will be just as inefficient in FTS. Instead, I’d recommend building a custom analyzer that only indexes the domain portion. Then you can do exact matching for "@gmail.com", which will be very fast.

The custom analyzer could use our existing “regexp” tokenizer with the pattern “@.*”, which will produce a single token containing just the domain portion of the email (including the leading @).
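As a rough illustration of what that pattern extracts, here is a small Python sketch that applies the same regular expression to an email address. It only mimics the tokenizer's effect on the input; the helper name `domain_tokens` is hypothetical, and the real tokenization happens inside the FTS analyzer.

```python
import re

# Same pattern as suggested for the regexp tokenizer.
DOMAIN_PATTERN = re.compile(r"@.*")


def domain_tokens(email: str) -> list[str]:
    """Return every match of the pattern, one token per match."""
    return DOMAIN_PATTERN.findall(email)


print(domain_tokens("ted@gmail.com"))
```

Because the token keeps the leading @, the query term has to include it as well.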

All of the previous steps would be done before building the index, in the FTS index mapping. Once the index is built, you would run a query like:

{
  "conjuncts": [
    {
      "term": "@gmail.com",
      "field": "emailDomain"
    },
    {
      "disjuncts": [
        {
          "match": "cafe",
          "field": "bio"
        },
        {
          "match": "caffe",
          "field": "bio"
        }
      ]
    }
  ]
}
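If you end up needing more spelling variants, it may be easier to generate this query than to write it by hand. A minimal Python sketch, assuming the same field names as above ("emailDomain" and "bio"); the helper `build_query` is hypothetical and only builds the JSON body, it does not send it anywhere.

```python
import json


def build_query(email_domain: str, bio_terms: list[str]) -> dict:
    """Build a conjunction: exact domain term AND any one of the bio terms."""
    return {
        "conjuncts": [
            {"term": email_domain, "field": "emailDomain"},
            {"disjuncts": [{"match": term, "field": "bio"} for term in bio_terms]},
        ]
    }


query = build_query("@gmail.com", ["cafe", "caffe"])
print(json.dumps(query, indent=2))
```

Adding another variant is then just one more entry in the list of terms.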

The only piece missing above is that I don’t think we offer a built-in ASCII folding token filter today. I’ll look at adding that. Meanwhile, you could work around it with a character filter, which simply does character-for-character replacements for the characters you care about.
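The character-for-character idea looks roughly like this in Python. It is a sketch of the replacement logic only, with a hand-picked table of three accented characters as an assumption; the names `CHAR_REPLACEMENTS` and `apply_char_filter` are made up for the example, and in FTS the equivalent would live in the index mapping.

```python
# Explicit replacement table: only the characters you care about.
CHAR_REPLACEMENTS = str.maketrans({
    "\u00e9": "e",  # e with acute accent
    "\u00e8": "e",  # e with grave accent
    "\u00ea": "e",  # e with circumflex
})


def apply_char_filter(text: str) -> str:
    """Replace each listed character; everything else passes through unchanged."""
    return text.translate(CHAR_REPLACEMENTS)


print(apply_char_filter("caf\u00e9"))
```

Unlike a general folding filter, this only handles the characters listed in the table, so anything you forget to list stays as it was.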

Let me know if you have more questions.
