{"id":5816,"date":"2018-09-12T08:10:08","date_gmt":"2018-09-12T15:10:08","guid":{"rendered":"https:\/\/www.couchbase.com\/blog\/?p=5816"},"modified":"2025-06-13T20:59:13","modified_gmt":"2025-06-14T03:59:13","slug":"how-analyzers-tokenizers-filters-work-fts-part-2","status":"publish","type":"post","link":"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/","title":{"rendered":"Building a Shazam-like app to understand how Tokenizers and Filters work | FTS Part 2"},"content":{"rendered":"<p>In the previous blog post, we talked about <a href=\"https:\/\/www.couchbase.com\/blog\/why-you-should-avoid-like-deep-dive-on-fts-part-1\/\">why full-text search is a better solution at scale to implement a well-designed search in your application<\/a>. In this second part, we are going to take a deep dive into the inverted index and explore how analyzers, tokenizers, and filters might shape the result of your searches.<\/p>\n<p>Full-text search is all about searching text; therefore, it does not matter whether you are indexing and searching logs, genes in a DNA sequence, your own data structure, or, of course, natural language. They all work in essentially the same way.<\/p>\n<p>To give you an example of how you can use FTS even when you have your own custom structure, let\u2019s leverage the fact that Apple finally bought Shazam and build an imaginary Shazam-like app. However, instead of listening to a small fragment of music like Shazam does, we will ask the user to whistle it.<\/p>\n<p>&nbsp;<\/p>\n<h2><strong>Wait\u2026 why do I need Full-text Search for it?<\/strong><\/h2>\n<p>As the user might wrongly whistle some parts of the song, we will need to split it into \u201csmall blocks of melody\u201d and then try to match them against our library. 
Assuming that our library will have thousands or even millions of songs (Apple and Spotify libraries have over 30 million songs), a simple LIKE \u201c%melody%\u201d stands no chance of returning results in a reasonable amount of time.<\/p>\n<p>An inverted index seems to be the right tool for the job as we can easily find all songs that contain a given block of melody. If you are not familiar with this concept yet, please check out <a href=\"https:\/\/www.couchbase.com\/blog\/why-you-should-avoid-like-deep-dive-on-fts-part-1\/\">my previous blog post<\/a> about it.<\/p>\n<p>&nbsp;<\/p>\n<h2><strong>The Parsons Code<\/strong><\/h2>\n<p>The first thing we need to do is convert our song library to text. We can achieve that by using the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Parsons_code\">Parsons code<\/a>, which is a notation used to identify a piece of music\u00a0according to movements of the\u00a0<u><a href=\"https:\/\/en.wikipedia.org\/wiki\/Pitch_(music)\">pitch<\/a><\/u>\u00a0up and down:<\/p>\n<p>&nbsp;<\/p>\n<ul>\n<li>* = first tone as reference,<\/li>\n<li>u = &#8220;up&#8221;, for when the note is higher than the previous note,<\/li>\n<li>d = &#8220;down&#8221;, for when the note is lower than the previous note,<\/li>\n<li>r = &#8220;repeat&#8221;, for when the note has the same pitch as the previous note.<\/li>\n<\/ul>\n<p>Using Parsons code, a song like &#8220;<u><a href=\"https:\/\/en.wikipedia.org\/wiki\/Twinkle_Twinkle_Little_Star\">Twinkle Twinkle Little Star<\/a><\/u>&#8221; will be converted to <strong>*rururddrdrdrdurdrdrdurdrdrddrururddrdrdrd<\/strong>.<\/p>\n<p>Here is the whole song:<\/p>\n<!--[if lt IE 9]><script>document.createElement('audio');<\/script><![endif]-->\n<audio class=\"wp-audio-shortcode\" id=\"audio-5816-1\" preload=\"none\" style=\"width: 100%;\" controls=\"controls\"><source type=\"audio\/mpeg\" src=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2018\/09\/Twinkle_Twinkle_Little_Star_plain.mp3?_=1\" \/><a 
href=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2018\/09\/Twinkle_Twinkle_Little_Star_plain.mp3\">https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2018\/09\/Twinkle_Twinkle_Little_Star_plain.mp3<\/a><\/audio>\n<p>And here is its visualization using Parsons code:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-5819\" src=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2018\/09\/Screen-Shot-2018-09-12-at-4.34.00-PM.png\" alt=\"\" width=\"640\" height=\"374\" srcset=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Screen-Shot-2018-09-12-at-4.34.00-PM.png 640w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Screen-Shot-2018-09-12-at-4.34.00-PM-300x175.png 300w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Screen-Shot-2018-09-12-at-4.34.00-PM-20x12.png 20w\" sizes=\"auto, (max-width: 640px) 100vw, 640px\" \/><\/p>\n<h2><strong>Analyzers<\/strong><\/h2>\n<p>In order to create our inverted index, we need to prepare our text first, for example by breaking it into smaller parts, converting it to lower case, removing irrelevant words, etc. The preparation\/analysis phase usually runs during <a href=\"https:\/\/docs.couchbase.com\/server\/current\/n1ql\/n1ql-language-reference\/createindex.html\">index creation<\/a> and again before a query is executed. This way, we can guarantee that both the target text and the term being matched go through the exact same transformations.<\/p>\n<p>The code responsible for this transformation is called an analyzer, and roughly speaking, an analyzer is assembled from two main kinds of components: tokenizers and filters.<\/p>\n<p>&nbsp;<\/p>\n<h3><strong>Tokenizers<\/strong><\/h3>\n<p>When we are dealing with language, the standard tokenizer splits a text into words. 
The tokenization strategy will change slightly according to the language, as we should also consider characters other than just white spaces, like l&#8217;amour in French or \u201cI\u2019m\u201d in English.<\/p>\n<p>In Couchbase FTS, the standard tokenizer works out-of-the-box most of the time, but we also provide tokenizers for <a href=\"https:\/\/docs.couchbase.com\/server\/5.5\/fts\/fts-using-analyzers.html\">HTML and a few other data structures<\/a>. Therefore, it\u2019s always worth checking that you are using the most appropriate one.<\/p>\n<p>Ideally, in our Shazam-like app, we should create a custom n-gram tokenizer, but to keep things simple, let\u2019s try to leverage the default one. To do that, we will need to slightly change the Parsons code by inserting a space after every 5 letters. The reason is that I\u2019m assuming that if the user can whistle at least 5 notes correctly in a row, I will consider that a \u201cblock of melody\u201d and try to match it against our inverted index.<\/p>\n<p>As such, our &#8220;<u><a href=\"https:\/\/en.wikipedia.org\/wiki\/Twinkle_Twinkle_Little_Star\">Twinkle Twinkle Little Star<\/a><\/u>&#8221; will be stored as <strong>*rurur ddrdr drdur drdrd urdrd rddru rurdd rdrdr d<\/strong>.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<h3><strong>Filters<\/strong><\/h3>\n<p>&nbsp;<\/p>\n<p>Couchbase FTS also comes with <a href=\"https:\/\/docs.couchbase.com\/server\/5.5\/fts\/fts-using-analyzers.html\">a variety of filters<\/a>. The three most popular ones are probably <strong>to_lower<\/strong>, <strong>stop_tokens<\/strong>, and <strong>stemmer<\/strong>:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-5820\" src=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2018\/09\/Screen-Shot-2018-09-12-at-4.44.25-PM.png\" alt=\"\" width=\"393\" height=\"687\" 
srcset=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Screen-Shot-2018-09-12-at-4.44.25-PM.png 533w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Screen-Shot-2018-09-12-at-4.44.25-PM-172x300.png 172w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Screen-Shot-2018-09-12-at-4.44.25-PM-300x524.png 300w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Screen-Shot-2018-09-12-at-4.44.25-PM-11x20.png 11w\" sizes=\"auto, (max-width: 393px) 100vw, 393px\" \/><\/p>\n<p>&nbsp;<\/p>\n<ul>\n<li><strong>to_lower<\/strong>: Converts all characters to lower case. For example,\u00a0HTML\u00a0becomes\u00a0html.<\/li>\n<li><strong>stop_tokens<\/strong>: Removes tokens considered unnecessary for a full-text search from the stream, for example\u00a0and,\u00a0is, and\u00a0the.<\/li>\n<li><strong>stemmer<\/strong>: Uses\u00a0<a href=\"https:\/\/snowball.tartarus.org\/\">libstemmer<\/a> to reduce tokens to word stems. For example, words\u00a0like <em>fishing<\/em>,\u00a0<em>fished<\/em>, and\u00a0<em>fisher<\/em>\u00a0are reduced to <em>fish<\/em>.<\/li>\n<\/ul>\n<p>Ideally, you should have multiple indexes for the same data, where each index uses a composition of filters focused on highlighting a specific characteristic. 
We will discuss this further in upcoming articles.<\/p>\n<p>For our Shazam-like app, filters might not be necessary, but if we want to improve our results, we could also add some sort of custom <strong>stop_tokens<\/strong> or custom character filter.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-5821\" src=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2018\/09\/Screen-Shot-2018-09-12-at-4.46.34-PM.png\" alt=\"\" width=\"473\" height=\"385\" srcset=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Screen-Shot-2018-09-12-at-4.46.34-PM.png 531w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Screen-Shot-2018-09-12-at-4.46.34-PM-300x244.png 300w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Screen-Shot-2018-09-12-at-4.46.34-PM-20x16.png 20w\" sizes=\"auto, (max-width: 473px) 100vw, 473px\" \/><\/p>\n<p>For instance, in most pop songs, the singer might hold an \u201c<strong>Ahhhhhhh<\/strong>\u201d or \u201c<strong>Ohhhhhh<\/strong>\u201d for a few seconds. In Parsons code, this translates to a series of <strong>r<\/strong> (\u201crepeat\u201d, for when the note has the same pitch as the previous note). So, our stop_tokens\/custom character filter might remove any sequence of ten or twenty consecutive \u201c<strong>r<\/strong>\u201d.<\/p>\n<p><strong>Ex:\u00a0<\/strong><strong>*rururddrdrdrdurdrdrdurdrdrddrururddrdrdrdrrrrrrrrrr <\/strong>becomes\u00a0<strong>*rururddrdrdrdurdrdrdurdrdrddrururddrdrdrd<\/strong><\/p>\n<p>This way, the song will be identified by its core melody instead of by a sequence of repeated notes, which would potentially return wrong results.<\/p>\n<p>&nbsp;<\/p>\n<h2><strong>Querying the data<\/strong><\/h2>\n<p>Now that we have our song library properly indexed, all we need to do is record the user\u2019s whistle, convert it to Parsons code, and finally query the database. 
FTS will automatically transform our query term using the same tokenizers and filters we used to index the data.<\/p>\n<p>For now, let\u2019s just assume that the query will simply return results ordered by the total number of matches.<\/p>\n<p><strong>Ex:<\/strong><\/p>\n<p>A query like <strong>rurur<\/strong><strong> ddrdr <\/strong>will potentially return the &#8220;<u><a href=\"https:\/\/en.wikipedia.org\/wiki\/Twinkle_Twinkle_Little_Star\">Twinkle Twinkle Little Star<\/a><\/u>&#8221; song, as its melody contains 4 matches:<\/p>\n<p>*<span style=\"color: #0000ff\"><strong>rurur<\/strong><\/span><span style=\"color: #ff0000\"><strong>ddrdr<\/strong><\/span><strong>drdurdrdrdurdrdrdd<span style=\"color: #ff0000\"><span style=\"color: #0000ff\">rurur<\/span>ddrdr<\/span>drd<\/strong><\/p>\n<p><strong>\u00a0<\/strong><strong>\u00a0<\/strong><\/p>\n<h2><strong>Where is the demo?<\/strong><\/h2>\n<p>We are going to build another type of application during this blog series, but if you are interested in trying a real application that implements something similar to what I have described here, check out <a href=\"https:\/\/beta.midomi.com\/\">Midomi<\/a>.<\/p>\n<p><a href=\"https:\/\/beta.midomi.com\/\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-5822\" src=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2018\/09\/Screen-Shot-2018-09-12-at-5.06.36-PM.png\" alt=\"\" width=\"651\" height=\"221\" srcset=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Screen-Shot-2018-09-12-at-5.06.36-PM.png 651w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Screen-Shot-2018-09-12-at-5.06.36-PM-300x102.png 300w, https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Screen-Shot-2018-09-12-at-5.06.36-PM-20x7.png 20w\" sizes=\"auto, (max-width: 651px) 100vw, 651px\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<h2><strong>Conclusion<\/strong><\/h2>\n<p>The goal of this 
article was to show the importance of tokenizers and filters even when we are dealing with other types of structures. I highly recommend reading the official documentation to understand the best use case for each of them.<\/p>\n<p>If you already have a good knowledge of FTS, you might have noticed some potential problems with our Shazam-like app: as the user usually won\u2019t start whistling the song from the beginning, we might tokenize the whistle at a different point than where we tokenized the original song. As we are grouping the song in tokens of 5 notes, the chances of tokenizing both the music and the query term at the same point are only 1 in 5.<\/p>\n<p><strong>Ex:<\/strong><\/p>\n<p>&#8220;<u><a href=\"https:\/\/en.wikipedia.org\/wiki\/Twinkle_Twinkle_Little_Star\">Twinkle Twinkle Little Star<\/a><\/u>&#8221;: <strong>rururddrdrdrdurdrdrdurdrdrddrururddrdrdrd<\/strong><\/p>\n<p>Tokenized &#8220;<u><a href=\"https:\/\/en.wikipedia.org\/wiki\/Twinkle_Twinkle_Little_Star\">Twinkle Twinkle Little Star<\/a><\/u>&#8221;<strong>: rurur ddrdr drdur drdrd urdrd rddru rurdd rdrdr d<\/strong><\/p>\n<p>User\u2019s whistle:\u00a0<strong>rdrdrdurdrdrdurdrd<\/strong> (a random part in the middle of the song)<\/p>\n<p>Tokenized user\u2019s whistle:\u00a0<strong>rdrdr durdr drdur drd<\/strong><\/p>\n<p>&nbsp;<\/p>\n<p>In the example above, we had 2 matches (<strong>rdrdr<\/strong> and\u00a0<strong>drdur<\/strong>) by chance, but as they are out of order, the score of this song will be seriously compromised, which can lead to unexpected results.<\/p>\n<p>&nbsp;<\/p>\n<h4><strong>Full-Text Search Series<\/strong><\/h4>\n<ul>\n<li><a href=\"https:\/\/www.couchbase.com\/blog\/why-you-should-avoid-like-deep-dive-on-fts-part-1\/\">Why you should avoid LIKE %<\/a> &#8211; Part 1<\/li>\n<li><a href=\"https:\/\/www.couchbase.com\/blog\/fuzzy-matching\/\">Fuzzy Matching<\/a>\u00a0&#8211; Part 3<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>We 
will see how to solve this problem and a few others in the next articles of this series. In the meantime if you have any questions, just tweet me at <a href=\"https:\/\/twitter.com\/deniswsrosa\">@deniswsrosa<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In the previous blog post, we talked about why full-text search is a better solution at scale to implement a well-designed search in your application. In this second part, we are going to deep-dive on the Inverted Index and explore [&hellip;]<\/p>\n","protected":false},"author":8754,"featured_media":5817,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"inline_featured_image":false,"footnotes":""},"categories":[2165],"tags":[],"ppma_author":[9059],"class_list":["post-5816","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-full-text-search"],"acf":[],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v25.7.1 (Yoast SEO v25.7) - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Explore how analyzers, tokenizers, and filters works<\/title>\n<meta name=\"description\" content=\"This post focuses on the Inverted Index and also explore how analyzers, tokenizers, and filters might shape the result of your searches.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Building a Shazam-like app to understand how Tokenizers and Filters work | FTS Part 2\" \/>\n<meta property=\"og:description\" content=\"This post focuses on the Inverted Index and also explore how analyzers, tokenizers, and filters might shape the result of your 
searches.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/\" \/>\n<meta property=\"og:site_name\" content=\"The Couchbase Blog\" \/>\n<meta property=\"article:published_time\" content=\"2018-09-12T15:10:08+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-06-14T03:59:13+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Couchbase-FTS-Part2.png\" \/>\n\t<meta property=\"og:image:width\" content=\"728\" \/>\n\t<meta property=\"og:image:height\" content=\"210\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Denis Rosa, Developer Advocate, Couchbase\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@deniswsrosa\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Denis Rosa, Developer Advocate, Couchbase\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/\"},\"author\":{\"name\":\"Denis Rosa, Developer Advocate, Couchbase\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/fe3c5273e805e72a5294611a48f62257\"},\"headline\":\"Building a Shazam-like app to understand how Tokenizers and Filters work | FTS Part 2\",\"datePublished\":\"2018-09-12T15:10:08+00:00\",\"dateModified\":\"2025-06-14T03:59:13+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/\"},\"wordCount\":1324,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/#organization\"},\"image\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Couchbase-FTS-Part2.png\",\"articleSection\":[\"Full-Text Search\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/\",\"url\":\"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/\",\"name\":\"Explore how analyzers, tokenizers, and filters 
works\",\"isPartOf\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Couchbase-FTS-Part2.png\",\"datePublished\":\"2018-09-12T15:10:08+00:00\",\"dateModified\":\"2025-06-14T03:59:13+00:00\",\"description\":\"This post focuses on the Inverted Index and also explore how analyzers, tokenizers, and filters might shape the result of your searches.\",\"breadcrumb\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/#primaryimage\",\"url\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Couchbase-FTS-Part2.png\",\"contentUrl\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Couchbase-FTS-Part2.png\",\"width\":728,\"height\":210},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.couchbase.com\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Building a Shazam-like app to understand how Tokenizers and Filters work | FTS Part 2\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#website\",\"url\":\"https:\/\/www.couchbase.com\/blog\/\",\"name\":\"The Couchbase 
Blog\",\"description\":\"Couchbase, the NoSQL Database\",\"publisher\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.couchbase.com\/blog\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#organization\",\"name\":\"The Couchbase Blog\",\"url\":\"https:\/\/www.couchbase.com\/blog\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2023\/04\/admin-logo.png\",\"contentUrl\":\"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2023\/04\/admin-logo.png\",\"width\":218,\"height\":34,\"caption\":\"The Couchbase Blog\"},\"image\":{\"@id\":\"https:\/\/www.couchbase.com\/blog\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/fe3c5273e805e72a5294611a48f62257\",\"name\":\"Denis Rosa, Developer Advocate, Couchbase\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/image\/be0716f6199cfb09417c92cf7a8fa8d6\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/f8d1f5c13115122cab89d0f229b904480bfe20d3dfbb093fe9734cda5235d419?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/f8d1f5c13115122cab89d0f229b904480bfe20d3dfbb093fe9734cda5235d419?s=96&d=mm&r=g\",\"caption\":\"Denis Rosa, Developer Advocate, Couchbase\"},\"description\":\"Denis Rosa is a Developer Advocate for Couchbase and lives in Munich - Germany. He has a solid experience as a software engineer and speaks fluently Java, Python, Scala and Javascript. 
Denis likes to write about search, Big Data, AI, Microservices and everything else that would help developers to make a beautiful, faster, stable and scalable app.\",\"sameAs\":[\"https:\/\/x.com\/deniswsrosa\"],\"url\":\"https:\/\/www.couchbase.com\/blog\/author\/denis-rosa\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"Explore how analyzers, tokenizers, and filters works","description":"This post focuses on the Inverted Index and also explore how analyzers, tokenizers, and filters might shape the result of your searches.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/","og_locale":"en_US","og_type":"article","og_title":"Building a Shazam-like app to understand how Tokenizers and Filters work | FTS Part 2","og_description":"This post focuses on the Inverted Index and also explore how analyzers, tokenizers, and filters might shape the result of your searches.","og_url":"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/","og_site_name":"The Couchbase Blog","article_published_time":"2018-09-12T15:10:08+00:00","article_modified_time":"2025-06-14T03:59:13+00:00","og_image":[{"width":728,"height":210,"url":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Couchbase-FTS-Part2.png","type":"image\/png"}],"author":"Denis Rosa, Developer Advocate, Couchbase","twitter_card":"summary_large_image","twitter_creator":"@deniswsrosa","twitter_misc":{"Written by":"Denis Rosa, Developer Advocate, Couchbase","Est. 
reading time":"7 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/#article","isPartOf":{"@id":"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/"},"author":{"name":"Denis Rosa, Developer Advocate, Couchbase","@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/fe3c5273e805e72a5294611a48f62257"},"headline":"Building a Shazam-like app to understand how Tokenizers and Filters work | FTS Part 2","datePublished":"2018-09-12T15:10:08+00:00","dateModified":"2025-06-14T03:59:13+00:00","mainEntityOfPage":{"@id":"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/"},"wordCount":1324,"commentCount":0,"publisher":{"@id":"https:\/\/www.couchbase.com\/blog\/#organization"},"image":{"@id":"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/#primaryimage"},"thumbnailUrl":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Couchbase-FTS-Part2.png","articleSection":["Full-Text Search"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/","url":"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/","name":"Explore how analyzers, tokenizers, and filters 
works","isPartOf":{"@id":"https:\/\/www.couchbase.com\/blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/#primaryimage"},"image":{"@id":"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/#primaryimage"},"thumbnailUrl":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Couchbase-FTS-Part2.png","datePublished":"2018-09-12T15:10:08+00:00","dateModified":"2025-06-14T03:59:13+00:00","description":"This post focuses on the Inverted Index and also explore how analyzers, tokenizers, and filters might shape the result of your searches.","breadcrumb":{"@id":"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/#primaryimage","url":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Couchbase-FTS-Part2.png","contentUrl":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/sites\/1\/2018\/09\/Couchbase-FTS-Part2.png","width":728,"height":210},{"@type":"BreadcrumbList","@id":"https:\/\/www.couchbase.com\/blog\/how-analyzers-tokenizers-filters-work-fts-part-2\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.couchbase.com\/blog\/"},{"@type":"ListItem","position":2,"name":"Building a Shazam-like app to understand how Tokenizers and Filters work | FTS Part 2"}]},{"@type":"WebSite","@id":"https:\/\/www.couchbase.com\/blog\/#website","url":"https:\/\/www.couchbase.com\/blog\/","name":"The Couchbase Blog","description":"Couchbase, the NoSQL 
Database","publisher":{"@id":"https:\/\/www.couchbase.com\/blog\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.couchbase.com\/blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/www.couchbase.com\/blog\/#organization","name":"The Couchbase Blog","url":"https:\/\/www.couchbase.com\/blog\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/logo\/image\/","url":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2023\/04\/admin-logo.png","contentUrl":"https:\/\/www.couchbase.com\/blog\/wp-content\/uploads\/2023\/04\/admin-logo.png","width":218,"height":34,"caption":"The Couchbase Blog"},"image":{"@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/fe3c5273e805e72a5294611a48f62257","name":"Denis Rosa, Developer Advocate, Couchbase","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.couchbase.com\/blog\/#\/schema\/person\/image\/be0716f6199cfb09417c92cf7a8fa8d6","url":"https:\/\/secure.gravatar.com\/avatar\/f8d1f5c13115122cab89d0f229b904480bfe20d3dfbb093fe9734cda5235d419?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/f8d1f5c13115122cab89d0f229b904480bfe20d3dfbb093fe9734cda5235d419?s=96&d=mm&r=g","caption":"Denis Rosa, Developer Advocate, Couchbase"},"description":"Denis Rosa is a Developer Advocate for Couchbase and lives in Munich - Germany. He has a solid experience as a software engineer and speaks fluently Java, Python, Scala and Javascript. 
Denis likes to write about search, Big Data, AI, Microservices and everything else that would help developers to make a beautiful, faster, stable and scalable app.","sameAs":["https:\/\/x.com\/deniswsrosa"],"url":"https:\/\/www.couchbase.com\/blog\/author\/denis-rosa\/"}]}},"authors":[{"term_id":9059,"user_id":8754,"is_guest":0,"slug":"denis-rosa","display_name":"Denis Rosa, Developer Advocate, Couchbase","avatar_url":"https:\/\/secure.gravatar.com\/avatar\/f8d1f5c13115122cab89d0f229b904480bfe20d3dfbb093fe9734cda5235d419?s=96&d=mm&r=g","author_category":"","last_name":"Rosa, Developer Advocate, Couchbase","first_name":"Denis","job_title":"","user_url":"","description":"Denis Rosa is a Developer Advocate for Couchbase and lives in Munich - Germany. He has a solid experience as a software engineer and speaks fluently Java, Python, Scala and Javascript. Denis likes to write about search, Big Data, AI, Microservices and everything else that would help developers to make a beautiful, faster, stable and scalable 
app."}],"_links":{"self":[{"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/posts\/5816","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/users\/8754"}],"replies":[{"embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/comments?post=5816"}],"version-history":[{"count":0,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/posts\/5816\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/media\/5817"}],"wp:attachment":[{"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/media?parent=5816"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/categories?post=5816"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/tags?post=5816"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.couchbase.com\/blog\/wp-json\/wp\/v2\/ppma_author?post=5816"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}