Index from SUFFIXES by whitespace

Hi,

I created an array index using SUFFIXES, but its size is over 160 MiB, which is too big.

CREATE INDEX idx_name_suffixes ON `sample` (
  DISTINCT ARRAY v
  FOR v IN SUFFIXES(LOWER(name))
  END
);
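
For context, the queries this index serves follow the usual array-index pattern, roughly like this (my sketch of a search for typed text at the start of any suffix):

SELECT s.name
FROM `sample` AS s
WHERE ANY v IN SUFFIXES(LOWER(s.name))
      SATISFIES v LIKE "john%" END;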

To reduce the index size, instead of SUFFIXES I am looking for a SUFFIXES-like function that tokenises by whitespace, such as:

SPACE_SUFFIXES("my name is john.") = [
  "my",
  "my name",
  "my name is",
  "my name is john."
]

Is there any function like this?
Or do I need to create a custom function? If so, can I use it in an index?

Many thanks in advance.
Jin

There isn’t such a function built in, and UDFs are not indexable, so off-hand I don’t have a solution for this requirement.

Your example seems to be more prefixes than suffixes. (For suffixes I’d expect the results to be “john.”, “is john.”, “name is john.”, “my name is john.”.)

What’s the use case? Are you looking to implement auto-completion? (In which case LIKE pattern matching will be simplest on a basic field index.)
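
A minimal sketch of that approach, assuming the same `sample` keyspace and name field as above:

CREATE INDEX idx_name_lower ON `sample` (LOWER(name));

SELECT s.name
FROM `sample` AS s
WHERE LOWER(s.name) LIKE "my nam%";

With the wildcard only at the end, the predicate can be evaluated against the index.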


@dh Thanks for the reply.
Yes, it is for real-time autocomplete.
The requirement is, for example: to find a record containing “my name is john.”, users say they will search by “name is”, “john”, or “my nam”, but not by “y name” or “onh”.

Unfortunately the index from SUFFIXES(“my name is john.”) is too big: over 160 MiB per field.
So I just want to create the array index as [“my”, “my name”, “my name is”, “my name is john.”]. The reverse order would also be fine: [“my name is john.”, “my name is”, “my name”, “my”].

Is there any solution or recommendation?
Many thanks in advance

See FTS search (match_phrase: Match Phrase Query | Couchbase Docs) in case it helps. cc @abhinav
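
For example (a sketch; the FTS index name idx_fts is hypothetical), a match_phrase query can be run from N1QL with the SEARCH() function:

SELECT s.name
FROM `sample` AS s
WHERE SEARCH(s, {"match_phrase": "name is", "field": "name"}, {"index": "idx_fts"});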

@jinchong as @vsr1 noted, FTS is probably the most likely to have an appropriate solution.

( You could build an array from SPLIT() results and index that (SPLIT() rather than TOKENS(), so as to retain order), but some brief experimentation suggests only limited evaluation via such an index; a sketch follows below. At this point it is unlikely to meet your needs, so I wouldn’t pursue it further.

That said, if the words supplied could be in any order - i.e. match both “my name is john” and “john is my name” with [“is”, “john”] as the filter (note: individual tokens, not a single string - but this could be generated in the statement) - then this could potentially work. The index may still be large, though.

If they would always provide the prefix, then LIKE would be the way to go, but with leading wildcards there is no index evaluation. )
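
For the record, the SPLIT()-based sketch mentioned above would look roughly like this (only a sketch, given the caveats on index evaluation; token matching is unordered, with one ANY clause per supplied token):

CREATE INDEX idx_name_words ON `sample` (
  DISTINCT ARRAY w
  FOR w IN SPLIT(LOWER(name), " ")
  END
);

SELECT s.name
FROM `sample` AS s
WHERE ANY w IN SPLIT(LOWER(s.name), " ") SATISFIES w = "is" END
  AND ANY w IN SPLIT(LOWER(s.name), " ") SATISFIES w = "john" END;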


@dh Thanks for your advice.
I’m trying FTS as well.

I’d recommend looking into creating a custom analyzer using the shingle token filter, with whitespace as the separator and appropriate min and max settings.

This should help tokenize “my name is john” into …

  • my
  • name
  • my name
  • is
  • name is
  • my name is
  • john
  • is john
  • name is john
  • my name is john

For auto-complete we typically recommend the ngram token filter, which does the same as the above at the character level. You can test out all this analyzer capability here: https://bleveanalysis.couchbase.com/analysis

There are instructions on how to create a custom token filter here: Create a Custom Token Filter | Couchbase Docs
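
To illustrate how the shingles get used at query time: since “my name” is indexed as a single term, an FTS prefix query can match a partially typed phrase. A sketch from N1QL (the index name idx_fts_name, and the assumption that the name field is mapped to the custom shingle analyzer, are mine):

SELECT s.name
FROM `sample` AS s
WHERE SEARCH(s, {"prefix": "my nam", "field": "name"}, {"index": "idx_fts_name"});

Note that prefix queries are not analyzed, so the typed text needs to match the case of the indexed terms.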
