Have you found yourself asking the question, “What is fuzzy matching?” Fuzzy matching allows you to identify non-exact matches of your target item. It is the foundation stone of many search engine frameworks and one of the main reasons why you can get relevant search results even if your query contains a typo or a different verb tense.
As you might expect, there are many fuzzy searching algorithms that can be used for text, but virtually all search engine frameworks (including bleve) rely primarily on the Levenshtein distance for fuzzy string matching:
Levenshtein Distance
Also known as edit distance, it is the minimum number of transformations (deletions, insertions, or substitutions) required to transform a source string into the target one. As a fuzzy search example, if the target term is “book” and the source is “back”, you need to change the first “o” to “a” and the second “o” to “c”, which gives a Levenshtein distance of 2. Edit distance is very easy to implement, and it is a popular challenge during coding interviews (you can find Levenshtein implementations in JavaScript, Kotlin, Java, and many other languages here).
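If you are curious how it works under the hood, here is a minimal dynamic-programming sketch in Java. The class and method names are illustrative only and are not part of the movie-search code used later in this post:

// Minimal dynamic-programming sketch of the Levenshtein distance (illustrative only).
public class LevenshteinExample {

    public static int levenshtein(String source, String target) {
        int[][] d = new int[source.length() + 1][target.length() + 1];
        for (int i = 0; i <= source.length(); i++) d[i][0] = i; // i deletions
        for (int j = 0; j <= target.length(); j++) d[0][j] = j; // j insertions
        for (int i = 1; i <= source.length(); i++) {
            for (int j = 1; j <= target.length(); j++) {
                int cost = source.charAt(i - 1) == target.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1,      // deletion
                                 d[i][j - 1] + 1),     // insertion
                        d[i - 1][j - 1] + cost);       // substitution
            }
        }
        return d[source.length()][target.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("back", "book")); // prints 2
    }
}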
Additionally, some frameworks also support the Damerau-Levenshtein distance:
Damerau-Levenshtein distance
It is an extension of the Levenshtein distance that allows one extra operation: transposition of two adjacent characters.
Ex: TSAR to STAR
Damerau-Levenshtein distance = 1 (switching the positions of S and T costs only one operation)
Levenshtein distance = 2 (replace the T with S and the S with T)
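For comparison, a rough sketch of the restricted Damerau-Levenshtein variant (also known as optimal string alignment) needs only one extra rule in the same dynamic-programming table. This method could sit next to the Levenshtein sketch above; again, it is an illustration, not code from any search framework:

// Sketch of the restricted Damerau-Levenshtein (optimal string alignment) distance.
// damerauLevenshtein("TSAR", "STAR") == 1, while levenshtein("TSAR", "STAR") == 2
public static int damerauLevenshtein(String source, String target) {
    int[][] d = new int[source.length() + 1][target.length() + 1];
    for (int i = 0; i <= source.length(); i++) d[i][0] = i;
    for (int j = 0; j <= target.length(); j++) d[0][j] = j;
    for (int i = 1; i <= source.length(); i++) {
        for (int j = 1; j <= target.length(); j++) {
            int cost = source.charAt(i - 1) == target.charAt(j - 1) ? 0 : 1;
            d[i][j] = Math.min(
                    Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                    d[i - 1][j - 1] + cost);
            // extra rule: transposition of two adjacent characters
            if (i > 1 && j > 1
                    && source.charAt(i - 1) == target.charAt(j - 2)
                    && source.charAt(i - 2) == target.charAt(j - 1)) {
                d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
            }
        }
    }
    return d[source.length()][target.length()];
}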
Fuzzy matching and relevance
Fuzzy matching has one big side effect: it messes with relevance. Although Damerau-Levenshtein is the fuzzy matching algorithm that covers most common user misspellings, it can also produce a significant number of false positives, especially in a language with an average of just 5 letters per word, such as English. That is why most search engine frameworks prefer to stick with the Levenshtein distance. Let’s look at a real fuzzy matching example:
First, we are going to use this movie catalog dataset. I highly recommend it if you want to play with full-text search. Then, let’s search for movies with “book” in the title. A simple code snippet would look like the following:
String indexName = "movies_index";
MatchQuery query = SearchQuery.match("book").field("title");
SearchQueryResult result = movieRepository.getCouchbaseOperations().getCouchbaseBucket().query(
        new SearchQuery(indexName, query).highlight().limit(6));
printResults(result);
The code above will bring the following results:
1) The Book Thief
score: 4.826942606027416
Hit Locations:
field:title term:book fragment:[The <mark>Book</mark> Thief]
-----------------------------
2) The Book of Eli
score: 4.826942606027416
Hit Locations:
field:title term:book fragment:[The <mark>Book</mark> of Eli]
-----------------------------
3) The Book of Life
score: 4.826942606027416
Hit Locations:
field:title term:book fragment:[The <mark>Book</mark> of Life]
-----------------------------
4) Black Book
score: 4.826942606027416
Hit Locations:
field:title term:book fragment:[Black <mark>Book</mark>]
-----------------------------
5) The Jungle Book
score: 4.826942606027416
Hit Locations:
field:title term:book fragment:[The Jungle <mark>Book</mark>]
-----------------------------
6) The Jungle Book 2
score: 3.9411821308689627
Hit Locations:
field:title term:book fragment:[The Jungle <mark>Book</mark> 2]
-----------------------------
By default, the results are case-insensitive, but you can easily change this behavior by creating new indexes with different analyzers.
Now, let’s add fuzzy matching capability to our query by setting fuzziness to 1 (Levenshtein distance 1), which means that “book” and “look” will have the same relevance.
String indexName = "movies_index";
MatchQuery query = SearchQuery.match("book").field("title").fuzziness(1);
SearchQueryResult result = movieRepository.getCouchbaseOperations().getCouchbaseBucket().query(
        new SearchQuery(indexName, query).highlight().limit(6));
printResults(result);
And here is the fuzzy search result:
1) Hook
score: 1.1012248063242538
Hit Locations:
field:title term:hook fragment:[<mark>Hook</mark>]
-----------------------------
2) Here Comes the Boom
score: 0.7786835148361213
Hit Locations:
field:title term:boom fragment:[Here Comes the <mark>Boom</mark>]
-----------------------------
3) Look Who's Talking Too
score: 0.7047266634351538
Hit Locations:
field:title term:look fragment:[<mark>Look</mark> Who's Talking Too]
-----------------------------
4) Look Who's Talking
score: 0.7047266634351538
Hit Locations:
field:title term:look fragment:[<mark>Look</mark> Who's Talking]
-----------------------------
5) The Book Thief
score: 0.5228811753737184
Hit Locations:
field:title term:book fragment:[The <mark>Book</mark> Thief]
-----------------------------
6) The Book of Eli
score: 0.5228811753737184
Hit Locations:
field:title term:book fragment:[The <mark>Book</mark> of Eli]
-----------------------------
Now, the movie called “Hook” is the very first search result, which might not be exactly what the user is expecting in a search for “book”.
How to minimize false positives during fuzzy lookups
In an ideal world, users would never make any typos while searching for something. However, that is not the world we live in, and if you want your users to have a pleasant experience, you have to handle at least an edit distance of 1. Therefore, the real question is: how can we do fuzzy string matching while minimizing relevance loss?
We can take advantage of one characteristic of most search engine frameworks: a match with a lower edit distance will usually score higher. That characteristic allows us to combine two queries with different fuzziness levels into one:
String indexName = "movies_index";
String word = "book";
MatchQuery titleFuzzy = SearchQuery.match(word).fuzziness(1).field("title");
MatchQuery titleSimple = SearchQuery.match(word).field("title");
DisjunctionQuery ftsQuery = SearchQuery.disjuncts(titleSimple, titleFuzzy);
SearchQueryResult result = movieRepository.getCouchbaseOperations().getCouchbaseBucket().query(
        new SearchQuery(indexName, ftsQuery).highlight().limit(20));
printResults(result);
Here is the result of the fuzzy query above:
1) The Book Thief
score: 2.398890092610849
Hit Locations:
field:title term:book fragment:[The <mark>Book</mark> Thief]
-----------------------------
field:title term:book fragment:[The <mark>Book</mark> Thief]
-----------------------------
2) The Book of Eli
score: 2.398890092610849
Hit Locations:
field:title term:book fragment:[The <mark>Book</mark> of Eli]
-----------------------------
field:title term:book fragment:[The <mark>Book</mark> of Eli]
-----------------------------
3) The Book of Life
score: 2.398890092610849
Hit Locations:
field:title term:book fragment:[The <mark>Book</mark> of Life]
-----------------------------
field:title term:book fragment:[The <mark>Book</mark> of Life]
-----------------------------
4) Black Book
score: 2.398890092610849
Hit Locations:
field:title term:book fragment:[Black <mark>Book</mark>]
-----------------------------
field:title term:book fragment:[Black <mark>Book</mark>]
-----------------------------
5) The Jungle Book
score: 2.398890092610849
Hit Locations:
field:title term:book fragment:[The Jungle <mark>Book</mark>]
-----------------------------
field:title term:book fragment:[The Jungle <mark>Book</mark>]
-----------------------------
6) The Jungle Book 2
score: 1.958685557004688
Hit Locations:
field:title term:book fragment:[The Jungle <mark>Book</mark> 2]
-----------------------------
field:title term:book fragment:[The Jungle <mark>Book</mark> 2]
-----------------------------
7) National Treasure: Book of Secrets
score: 1.6962714808368062
Hit Locations:
field:title term:book fragment:[National Treasure: <mark>Book</mark> of Secrets]
-----------------------------
field:title term:book fragment:[National Treasure: <mark>Book</mark> of Secrets]
-----------------------------
8) American Pie Presents: The Book of Love
score: 1.517191317611584
Hit Locations:
field:title term:book fragment:[American Pie Presents: The <mark>Book</mark> of Love]
-----------------------------
field:title term:book fragment:[American Pie Presents: The <mark>Book</mark> of Love]
-----------------------------
9) Hook
score: 0.5052232518679681
Hit Locations:
field:title term:hook fragment:[<mark>Hook</mark>]
-----------------------------
10) Here Comes the Boom
score: 0.357246781294941
Hit Locations:
field:title term:boom fragment:[Here Comes the <mark>Boom</mark>]
-----------------------------
11) Look Who's Talking Too
score: 0.32331663301992025
Hit Locations:
field:title term:look fragment:[<mark>Look</mark> Who's Talking Too]
-----------------------------
12) Look Who's Talking
score: 0.32331663301992025
Hit Locations:
field:title term:look fragment:[<mark>Look</mark> Who's Talking]
-----------------------------
As you can see, this result is much closer to what the user might expect. Note that we are now using a class called DisjunctionQuery; disjunctions are the equivalent of the “OR” operator in SQL.
What else could we improve to reduce the negative side effects of a fuzzy matching algorithm? Let’s reanalyze our problem to see whether it needs further improvement:
We already know that fuzzy lookups can produce some unexpected results (e.g. Book -> Look, Hook). However, a single-term search is usually a terrible query anyway, as it barely gives us a hint of what exactly the user is trying to accomplish.
Even Google, which has arguably one of the most highly developed fuzzy search algorithms in use, does not know exactly what I’m looking for when I search for “table”.
So, what is the average number of keywords in a search query? To answer this question, I will refer to a graph from Rand Fishkin’s 2016 presentation (he is one of the best-known gurus in the SEO world).
According to that data, roughly 80% of search queries have two or more keywords, so let’s try to search for the movie “Black Book” using fuzziness 1:
String indexName = "movies_index";
MatchQuery query = SearchQuery.match("Black Book").field("title").fuzziness(1);
SearchQueryResult result = movieRepository.getCouchbaseOperations().getCouchbaseBucket().query(
        new SearchQuery(indexName, query).highlight().limit(12));
printResults(result);
Result:
1) Black Book
score: 0.6946442424065404
Hit Locations:
field:title term:book fragment:[<mark>Black</mark> <mark>Book</mark>]
-----------------------------
field:title term:black fragment:[<mark>Black</mark> <mark>Book</mark>]
-----------------------------
2) Hook
score: 0.40139742528039857
Hit Locations:
field:title term:hook fragment:[<mark>Hook</mark>]
-----------------------------
3) Attack the Block
score: 0.2838308365090324
Hit Locations:
field:title term:block fragment:[Attack the <mark>Block</mark>]
-----------------------------
4) Here Comes the Boom
score: 0.2838308365090324
Hit Locations:
field:title term:boom fragment:[Here Comes the <mark>Boom</mark>]
-----------------------------
5) Look Who's Talking Too
score: 0.25687349813115684
Hit Locations:
field:title term:look fragment:[<mark>Look</mark> Who's Talking Too]
-----------------------------
6) Look Who's Talking
score: 0.25687349813115684
Hit Locations:
field:title term:look fragment:[<mark>Look</mark> Who's Talking]
-----------------------------
7) Grosse Pointe Blank
score: 0.2317469073782136
Hit Locations:
field:title term:blank fragment:[Grosse Pointe <mark>Blank</mark>]
-----------------------------
8) The Book Thief
score: 0.19059065534780495
Hit Locations:
field:title term:book fragment:[The <mark>Book</mark> Thief]
-----------------------------
9) The Book of Eli
score: 0.19059065534780495
Hit Locations:
field:title term:book fragment:[The <mark>Book</mark> of Eli]
-----------------------------
10) The Book of Life
score: 0.19059065534780495
Hit Locations:
field:title term:book fragment:[The <mark>Book</mark> of Life]
-----------------------------
11) The Jungle Book
score: 0.19059065534780495
Hit Locations:
field:title term:book fragment:[The Jungle <mark>Book</mark>]
-----------------------------
12) Back to the Future
score: 0.17061000968368
Hit Locations:
field:title term:back fragment:[<mark>Back</mark> to the Future]
-----------------------------
Not bad. We got the movie we were searching for as the first result. However, a disjunction query would still bring a better set of results, as sketched below.
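For instance, here is a sketch of that disjunction variant for the two-keyword query, reusing the same hypothetical movies_index and printResults helper from the earlier snippets:

// exact and fuzzy versions of the same two-keyword query,
// combined so that the exact match naturally scores higher
String indexName = "movies_index";
String searchTerm = "Black Book";
MatchQuery titleSimple = SearchQuery.match(searchTerm).field("title");
MatchQuery titleFuzzy = SearchQuery.match(searchTerm).field("title").fuzziness(1);
DisjunctionQuery ftsQuery = SearchQuery.disjuncts(titleSimple, titleFuzzy);
SearchQueryResult result = movieRepository.getCouchbaseOperations().getCouchbaseBucket().query(
        new SearchQuery(indexName, ftsQuery).highlight().limit(12));
printResults(result);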
Still, it looks like we have a nice new property here: the side effect of fuzzy matching decreases slightly as the number of keywords increases. A search for “Black Book” with fuzziness 1 can still match results like “back look” or “lack cook” (combinations with edit distance 1), but those are unlikely to be real movie titles.
Even a search for “book eli” with fuzziness 2 would still bring “The Book of Eli” as the third result:
String indexName = "movies_index";
MatchQuery query = SearchQuery.match("book eli").field("title").fuzziness(2);
SearchQueryResult result = movieRepository.getCouchbaseOperations().getCouchbaseBucket().query(
        new SearchQuery(indexName, query).highlight().limit(12));
printResults(result);
1) Ed Wood
score: 0.030723793900031805
Hit Locations:
field:title term:wood fragment:[<mark>Ed</mark> <mark>Wood</mark>]
-----------------------------
field:title term:ed fragment:[<mark>Ed</mark> <mark>Wood</mark>]
-----------------------------
2) Man of Tai Chi
score: 0.0271042761982626
Hit Locations:
field:title term:chi fragment:[Man of <mark>Tai</mark> <mark>Chi</mark>]
-----------------------------
field:title term:tai fragment:[Man of <mark>Tai</mark> <mark>Chi</mark>]
-----------------------------
3) The Book of Eli
score: 0.02608335441670756
Hit Locations:
field:title term:eli fragment:[The <mark>Book</mark> of <mark>Eli</mark>]
-----------------------------
field:title term:book fragment:[The <mark>Book</mark> of <mark>Eli</mark>]
-----------------------------
4) The Good Lie
score: 0.02439822770591834
Hit Locations:
field:title term:lie fragment:[The <mark>Good</mark> <mark>Lie</mark>]
-----------------------------
field:title term:good fragment:[The <mark>Good</mark> <mark>Lie</mark>]
-----------------------------
However, as the average English word is only 5 letters long, I would NOT recommend using an edit distance greater than 2 unless the user is searching for long words that are easy to misspell, like “Schwarzenegger”, for instance (at least for non-Germans or non-Austrians).
Conclusion
In this article, we discussed fuzzy matching and how to overcome its major side effect without hurting relevance too much. Mind you, fuzzy matching is just one of the many features you should take advantage of while implementing a relevant and user-friendly search. We are going to discuss some of the others during this series: N-grams, stopwords, stemming, shingles, elision, etc.
Also check out Part 1 and Part 2 of this series.
In the meantime, if you have any questions, just tweet me at @deniswsrosa.
How does the score returned from the index relate to a percentage match? How would one construct a query for which the relevancy score of the results represents a relevancy of > 50%, for example?
If you want to understand more about how documents are scored, both Lucene and bleve have an “explain” option. In Couchbase you can set explain(true) to see exactly how the score is calculated. The results are sorted by score by default, so the most relevant ones should be the first in the list. What are you trying to achieve exactly?
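For reference, a small sketch of what that could look like with the same SDK and the same hypothetical movies_index used in the examples above:

// ask the engine to include a scoring breakdown for every hit
MatchQuery query = SearchQuery.match("book").field("title").fuzziness(1);
SearchQueryResult result = movieRepository.getCouchbaseOperations().getCouchbaseBucket().query(
        new SearchQuery("movies_index", query)
                .explain(true)
                .limit(6));
// each hit's explanation describes how its score was computed;
// note that scores are relative to the query and index, not percentages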