Character Filter

gizmo74 · September 20, 2018, 2:18pm

Hi,

I’m playing with character filters. The idea is to filter some German umlauts and other characters. I can’t use the standard “de” filter, because I need to use prefix/wildcard search. So the idea is to filter them in Couchbase via a character filter (ü -> u etc.) and manually do the same with the query string (because FTS doesn’t apply the analyzer to wildcard/prefix queries).
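To make the query-side part concrete, here is a minimal sketch of what I mean by doing the same replacement manually on the query string. The function name `fold_query` and the mapping table are only my own illustration (not anything from Couchbase); the table would have to mirror whatever character filters are defined on the index.

```python
import unicodedata

# Hypothetical mapping; it has to match the character filters configured on the index.
FOLD_MAP = str.maketrans({"ü": "u", "ö": "o", "ä": "a"})

def fold_query(term: str) -> str:
    """Apply the same replacements to a query term that the index-side filter applies."""
    # Normalize first so that "u" + combining diaeresis becomes the single character "ü".
    composed = unicodedata.normalize("NFC", term)
    return composed.translate(FOLD_MAP)

print(fold_query("müll*"))
```

The folded term is then what gets sent as the wildcard/prefix query instead of the raw user input.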

I indexed some documents with texts like “hello mister müller”.

Standard (no filter): wildcard query with müll* works.
Character filter with “regular expression = ü, replace = u”: wildcard query with mull* does NOT work.
Character filter with “regular expression = ü, replace = [empty]”: wildcard query with mll* works.
Character filter with “regular expression = e, replace = a”: wildcard query with hall* works.

So something seems to be wrong with the ü -> u replacement, while ü -> empty and e -> a work perfectly.

Am I doing something wrong? Or could it be a UTF-8 problem, because ü is a 2-byte character while u is a 1-byte character?
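For reference, this is the size difference I’m talking about, checked quickly in Python. The helper name `utf8_len` is just something I made up for the check. It only shows that the encoded lengths differ, not whether that is actually the cause.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

# Single characters from the three filter experiments.
for ch in ("ü", "u", "e", "a"):
    print(repr(ch), utf8_len(ch))

# Whole tokens before and after each replacement.
for token in ("müller", "muller", "mller"):
    print(token, utf8_len(token))
```

Both ü -> u and ü -> empty change the byte length of the token, while e -> a does not.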

Thanks, Pascal

gizmo74 · September 21, 2018, 3:13pm

I have now created an index with the edge_ngram token filter. It works as expected with a match query and is also faster than prefix queries… I’ll continue testing that for my use case.
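As I understand it, the idea behind edge n-grams is that the prefixes of each token are stored in the index, so a typed prefix can match an indexed term directly instead of needing a prefix scan. A rough sketch of the concept follows; the function `edge_ngrams` and the min/max lengths are my own assumptions for illustration, not the actual Couchbase implementation or its defaults.

```python
def edge_ngrams(token: str, min_len: int = 2, max_len: int = 5) -> list[str]:
    """Return the leading substrings of token with lengths from min_len to max_len."""
    longest = min(max_len, len(token))
    return [token[:n] for n in range(min_len, longest + 1)]

# Prefixes that would be indexed for one (already folded) token.
print(edge_ngrams("muller"))
```

With these assumed settings, a search for "mull" finds the document because "mull" is itself one of the indexed terms.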
