Index types in Full Text Search (FTS) – refers to the collective specification of the inverted index format and it’s storage aspects. Scorch is the advanced and evolving index type. But before delving further into the details of scorch, its imperative to skim over the former indexing type known as the upside_down.
In upside_down indexing format, the storage aspects of the backing index are offloaded to a key value store. In essence what this means is that, all the rudimentary information retrieval data structures like terms dictionary, postings lists etc are all get stored in form of keys and values in a generic key value store. Like many similar systems, the keys were decorated with more details to identify what the value represents. Let’s skip the details here, as the main topic under discussion is scorch. The upside_down was the default index type until Couchbase server 5.5.0 release.
The carefully designed bleve(indexing library) interfaces helped us to experiment and performance benchmark inverted index representation on different key-value systems (eg: levelDB, RockDB, mossStore etc). But sooner the key value representation proved to be a bottleneck in scaling the inverted index.
Few of the major issues observed are as below,
- huge index size amplification mainly due to the naive data representation in KV formats.
- data deduplication potentials weren’t tapped.
- less friendly representation to serve natural language queries like fuzzy/edit distance.
These flaws snowballed into bigger scalability worries with massive data sets and that led to the decision to redesign the inverted index representation and the disk storage. And those efforts conceived the newer index type called scorch.
Many fundamental design concepts of scorch are inspired from that of Lucene.
Scorch follows a segment based index architecture. Each index is a collection of segments and each segment is a self sufficient immutable sub index which is capable of serving the query.
With scorch, we decided to represent documents using numeric identifiers internally. This opened up huge optimisation opportunities in the inverted index representation.
Now, let’s take a deeper look into the segment architecture. Each segment is composed of these major building blocks.
Term Dictionary – Arguably the most important part of an inverted index. It stores every indexed term, along with its document frequency and a pointer to the posting lists per field.
Scorch uses Finite State Transducers (FST) for implementing this term dictionaries. FST based implementations helps in memory savings and query optimisations for advanced queries like edit distance or regex based queries by leveraging the DFA property of FSTs.
Postings Lists – Gives the list of documents on which a search term occurs. As we have given numeric IDs for representing documents internally, this can be best represented as a bitmap. Bitmap representations helps in saving space, faster lookups and SSE optimised code.
Frequency Norms/Location Details – Frequency and norm details of indexed terms are used while scoring. Location or positional details are needed for performing phrase queries or highlighting the results etc. So, these unordered numeric values are varint encoded and stored using proprietary chunking logic.
Stored Fields – Helps users to store unanalysed field values in the index and retrieve them as a part of the search results. This helps to avoid further data lookup trips to the KV. Scorch uses compression techniques on the proprietary chunking logic to represent this data.
DocValues – performs reverse lookups in the index, for eg: find all values indexed for a given document. This helps in powering queries like facets or sorting on custom fields etc. Scorch uses a columnar representation on top of the proprietary compressed chunking logic to represent this portion of the index.
Scorch uses a newer binary format called “zap” for representing these byte flattened segment contents on disk. And these disk segment bytes are mmap’ed.
TLDR; – Above optimisations in scorch brought index size reduction of up to 4X and query performance improvements(on latency and throughput) of up to 20X for many queries in our internal benchmarking.
Scorch is the default index type since Couchbase server release 6.0.0 and more features are getting added and road mapped for scorch in upcoming FTS releases.
So, it’s highly recommended to upgrade the indexes to scorch index formats, if you have a genuine FTS use case.