This post kicks off a multi-part series on composite vector indexing in Couchbase. We will start by building intuition, then progressively dive into internals, execution optimizations, and performance.
The series will cover:
- Why composite vector indexes matter, including concepts, terminology, and developer motivation. A Smart Grocery Recommendation System will be used as a running example.
- How composite vector indexes are implemented inside the Couchbase Indexing Service.
- How ORDER BY pushdown works for composite vector queries.
- Real-world performance behavior and benchmarking results.
Smart Grocery Recommendation System With Filtered ANN
Imagine you’re building a grocery-recommendation app.
A user opens it on a Sunday morning and types:
“I love dark chocolate spread, but I’m trying to cut sugar and add more protein. What else should I buy?”
At this moment, your system needs to understand the user’s intent, compare food items semantically, and apply strict nutritional filters.
This is exactly where Filtered Approximate Nearest Neighbor (Filtered ANN) comes in:
- Your ANN layer first finds semantically similar items/foods that “feel like” dark chocolate spread in flavor profile, texture, or category.
- Then your filtering layer steps in to remove anything with high sugar, keep items above a certain protein threshold, and maybe enforce dietary preferences (vegan, keto, nut-free).
The result? A recommendation engine that understands both meaning and constraints, just like a smart store associate who knows your taste and considers your goals.
Before We Get to FANN, Let’s Build Intuition
NN (Nearest Neighbor): Finding the most similar thing to what you have. It’s like asking, “Which food in my list tastes most like this chocolate spread?”
ANN (Approximate Nearest Neighbor): Finding something very similar, but faster. It’s like saying, “I don’t need the perfect match, just something that’s close enough quickly.”
FANN (Filtered Approximate Nearest Neighbor): Finding something close enough, but only among items that meet certain rules. It’s like saying, “Show me foods similar to chocolate spread, but only the ones that are low in sugar and high in protein.”
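To make the three terms concrete, here is a minimal brute-force sketch in Python. The catalog, vectors, and thresholds are all invented for illustration; real systems use approximate index structures instead of a linear scan:

```python
import math

# Toy catalog: (name, embedding, sugars_100g, proteins_100g). Vectors are invented.
FOODS = [
    ("dark chocolate spread",   [0.9, 0.8, 0.1], 25, 5),
    ("hazelnut cocoa spread",   [0.8, 0.9, 0.2], 19, 7),
    ("chocolate peanut butter", [0.7, 0.6, 0.3], 19, 30),
    ("greek yogurt",            [0.1, 0.2, 0.9], 4, 10),
]

def l2(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, items):
    """NN: exact nearest neighbor via exhaustive scan."""
    return min(items, key=lambda f: l2(query, f[1]))

def filtered_nearest(query, items, max_sugar, min_protein):
    """FANN idea: nearest neighbor among only the rows passing the scalar filters."""
    eligible = [f for f in items if f[2] < max_sugar and f[3] > min_protein]
    return nearest(query, eligible)

query = [0.9, 0.8, 0.1]  # "something like dark chocolate spread"
print(nearest(query, FOODS)[0])                   # unfiltered best match
print(filtered_nearest(query, FOODS, 20, 10)[0])  # best match that also passes the filters
```

With the filters applied, the unfiltered winner (high-sugar chocolate spread) is excluded and a low-sugar, high-protein neighbor wins instead.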

ANN algorithms trade a bit of accuracy for much greater efficiency (speed and memory).
A composite index is an index built on multiple fields (columns) together, not just one. For example, it’s like sorting a spreadsheet first by Category, then by Sugar, then by Protein. This ordering method groups all chocolate spreads together first. Within that group, you can quickly find low-sugar, high-protein products without scanning everything.
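The spreadsheet analogy is just a multi-key sort. A tiny Python sketch (with invented rows) shows how sorting by (Category, Sugar, Protein) groups each category together first:

```python
# Invented rows: (category, sugars_100g, proteins_100g)
rows = [
    ("peanut_butter",    19, 30),
    ("chocolate_spread", 23, 6),
    ("chocolate_spread", 19, 7),
    ("almond_butter",    15, 20),
]

# Sort by (Category, Sugar, Protein), exactly like a composite index key.
rows.sort(key=lambda r: (r[0], r[1], r[2]))
for r in rows:
    print(r)  # categories cluster together; within a category, rows order by sugar
```

Within the `chocolate_spread` group, the low-sugar entries now sit first, so a scan can stop early once the sugar bound is exceeded.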
Why Traditional Indexes Fail
Assume you have a small subset of the World Food Facts dataset loaded into memory as:
```go
type Food struct {
    ID           string
    ProductName  string
    Category     string
    Description  string
    Sugars100g   float64
    Proteins100g float64
    Tags         []string
    Ingredients  []string
    // ...
    Country      string
}
```
To find foods like dark chocolate spreads that are low in sugar and high in protein, you can use a query like the one below:
```sql
SELECT product_name
FROM foods
WHERE category = "chocolate_spread"
  AND sugars_100g < 20
  AND proteins_100g > 10;
```
To speed up the query, you can use a composite secondary index like the one below:
```sql
CREATE INDEX idx_food ON foods(category, sugars_100g, proteins_100g, product_name)
```
Composite secondary indexes can be viewed as sorted lists of concatenated keys that enable faster lookups for specific values or iteration across a range of low to high values (i.e., range scan). These lookup values, as well as the high and low values, are constructed at query time using the query predicates.
```
...
("almond_butter",    15, 20, "Almond butter with chocolate chips")
("chocolate_spread", 19, 7,  "Chocolate spread with nuts")
("chocolate_spread", 20, 4,  "Creamy chocolate spread")
("chocolate_spread", 23, 6,  "Chocolate spread with honey")
("chocolate_spread", 25, 5,  "Coffee chocolate spread")
("milk_chocolate",   4,  6,  "Milk chocolate spread")
("peanut_butter",    19, 30, "Chocolate flavored peanut butter")
...
```
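A range scan over such a sorted key space can be sketched with Python's bisect module, using the example entries above (a real index uses a tree structure rather than an in-memory list):

```python
import bisect

# Sorted composite keys: (category, sugars_100g, proteins_100g, product_name)
index = [
    ("almond_butter",    15, 20, "Almond butter with chocolate chips"),
    ("chocolate_spread", 19, 7,  "Chocolate spread with nuts"),
    ("chocolate_spread", 20, 4,  "Creamy chocolate spread"),
    ("chocolate_spread", 23, 6,  "Chocolate spread with honey"),
    ("chocolate_spread", 25, 5,  "Coffee chocolate spread"),
    ("milk_chocolate",   4,  6,  "Milk chocolate spread"),
    ("peanut_butter",    19, 30, "Chocolate flavored peanut butter"),
]

# Predicates: category = "chocolate_spread" AND sugars_100g < 20 AND proteins_100g > 10
lo = bisect.bisect_left(index, ("chocolate_spread",))     # first key in the category
hi = bisect.bisect_left(index, ("chocolate_spread", 20))  # first key with sugars >= 20
matches = [k for k in index[lo:hi] if k[2] > 10]          # residual protein filter
print(matches)  # [] -- no chocolate_spread row satisfies all the predicates
```

Only the keys between `lo` and `hi` are touched, which is the whole point of a composite range scan. Note the result is empty: within the exact category, nothing is both low-sugar and high-protein, even though the peanut-butter row two entries away would satisfy the user.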
Composite indexes work great for structured lookups.
But a category filter can never find:
- chocolate-flavored nut butters
- chocolate-protein spreads
- hazelnut cocoa blends
- chocolate protein bars
…even though a human instantly knows they are relatives of chocolate spreads.
Traditional indexes only match structure, not meaning. This is why category-based range scans fail.
How Filtered ANN Works
First, you convert the query and data into vectors.
The user’s sentence is fed into an embedding model (e.g., OpenAI, Cohere, or a domain-specific model).
The result is a dense vector that captures concepts like:
- chocolate-like flavor
- spreadable texture
- dessert/snack category
This vector represents what the user wants as opposed to just the literal words.
Next, you can find nearest neighbors (semantic similarity).
Candidates might include:
- Hazelnut cocoa spread
- Chocolate almond butter
- Cocoa protein spread
- Chocolate tahini
But not all are healthy options, and the user specifically asked for low sugar and high protein.
You can apply strict filters, which is the “Filtered” part of Filtered ANN.
You can filter out items where:
- Sugar > threshold (e.g., >5g per serving)
- Protein < threshold (e.g., <8g per serving)
Your system may also combine metadata filters:
- Only vegan
- No palm oil
- No nuts
- Under $10
What remains is a set of items that match both meaning and constraints.
Why Solely Using Filters Does Not Work
Using only filters, you would get:
- Any high‑protein, low‑sugar product
- As well as items unrelated to chocolate (like tofu, Greek yogurt, chicken breast)
But the user wants something “similar to chocolate spread.”
Filtered ANN = Personalization + Constraints. It mimics how a human store associate would answer the request: “If you want something like chocolate spread but healthier, try this…”
Behind the scenes, however, your recommendation engine faces a subtle but serious problem. Modern vector databases say they can do “hybrid search,” but they usually keep scalar fields like sugar or protein off to the side, as plain metadata. The ANN index has no idea how to use them.
So what happens?
The system first pulls in a huge batch of vector-similar candidates… and only then starts checking nutrition rules like sugars_100g < 20 or proteins_100g > 10.
It’s like a store employee bringing out every chocolate-related product from the back room, placing them on the counter, and then saying:
“Oh wait, you wanted low-sugar? High-protein? Let me throw most of these away.”
Some vector systems try to filter earlier during graph traversal, but they still can’t do real range filtering or prefix pruning. They must fetch and decode every candidate before deciding whether to throw it out.
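That post-filtering overhead can be made concrete with a small Python sketch over invented data. Both strategies return the same top-10, but the post-filtering path pays for a distance computation on every candidate:

```python
import math
import random

random.seed(0)

# 1,000 invented products: (embedding, sugars_100g, proteins_100g)
items = [([random.random() for _ in range(4)],
          random.uniform(0, 40), random.uniform(0, 35)) for _ in range(1000)]
query = [0.5, 0.5, 0.5, 0.5]

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Post-filtering: rank ALL candidates by distance, then discard non-matching ones.
ranked = sorted(items, key=lambda it: l2(query, it[0]))  # distances for all 1,000
post = [it for it in ranked if it[1] < 20 and it[2] > 10][:10]

# In-index pruning: apply the scalar predicates FIRST, rank only the survivors.
eligible = [it for it in items if it[1] < 20 and it[2] > 10]
pre = sorted(eligible, key=lambda it: l2(query, it[0]))[:10]

print(f"distances computed: post-filter={len(items)}, pre-filter={len(eligible)}")
```

The results are identical, but pruning first computes distances only for the rows that could ever be returned, which is the behavior a composite vector index aims for.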
What does this mean for your app?
- More disk reads
- More distance calculations
- More latency
…And a lot of wasted work for results the user will never see.
This is exactly why a composite vector index that merges vector similarity and scalar pruning into the same index is a game-changer.
Composite Vector Indexes – Overview
Step 1: Embeddings Layer – Create Vector Embeddings
Each product’s text description (tags, product name, category, ingredients) is converted into a high-dimensional vector using a language model. Products with similar meanings will have similar vectors.
For example, embeddings for product names:
- “dark chocolate spread” → [0.23, -0.15, 0.87, …] (384 dimensions)
- “chocolate hazelnut butter” → [0.25, -0.12, 0.85, …] (similar vector)
- “chocolate protein bar” → [0.18, -0.08, 0.79, …] (somewhat similar)
Step 2: FANN Index – Build Composite Vector Index
Create a vector index (e.g., Couchbase Vector Index, FAISS) that can quickly find nearest neighbors in the embedding space.
How are vectors different from other datatypes in a composite vector index?
- Vectors have no natural total order, so a sort order for vector fields cannot be determined at index-build time.
- Vector fields do not support conventional comparison predicates (such as equality or range filters) in the WHERE clause.
- But vector fields are used in ORDER BY with vector distance functions, and may participate in query planning via those expressions.
- Ordering is done at scan time using similarity to a query vector. The similarity function is chosen by the user as needed for the data and application.
- APPROX_VECTOR_DISTANCE can be used in the ORDER BY clause and is efficiently supported when a compatible vector index exists; otherwise, it results in a full scan.
- Since an individual dimension of a vector has no standalone meaning, you can only ask questions like “how similar are two vectors?” So queries are limited to finding nearest neighbors or similar elements.
- Similarly, the similarity function and the query vector need to be provided as input at query time.
- Nearest neighbor search is computationally intensive, and only gets worse as vector dimensionality grows. So you need a time- and space-efficient solution that returns approximate results.
- Quantization methods are specified via the description parameter in the DDL.
- You will have to reduce the number of comparisons at query time for faster querying.
- The number of centroids and the nprobe value help reduce the search space.
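The effect of centroids and nprobe can be sketched in a few lines of Python. The centroids and vectors below are invented; a real IVF index learns its centroids with k-means:

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Invented centroids and data vectors.
centroids = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
vectors = [[0.1, 0.1], [0.9, 1.0], [0.05, 0.9], [1.1, 0.2], [0.2, 0.05], [0.8, 0.9]]

# Build the inverted lists: each vector is assigned to its nearest centroid.
lists = {i: [] for i in range(len(centroids))}
for v in vectors:
    nearest_c = min(range(len(centroids)), key=lambda i: l2(v, centroids[i]))
    lists[nearest_c].append(v)

def ivf_search(query, nprobe):
    """Scan only the `nprobe` inverted lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda i: l2(query, centroids[i]))[:nprobe]
    candidates = [v for i in probe for v in lists[i]]
    return min(candidates, key=lambda v: l2(query, v)), len(candidates)

best, scanned = ivf_search([0.15, 0.05], nprobe=1)
print(best, scanned)  # only one inverted list is scanned instead of all six vectors
```

More centroids means smaller lists (fewer distance computations per probe), while a larger nprobe widens the search for better recall at higher cost.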
A composite vector index is an index where at least one key carries a vector attribute, with qualifiers such as dimension, similarity, and description provided to describe that vector.
```sql
CREATE INDEX idx_vec ON foods(sugars_100g, proteins_100g, text_vector VECTOR, product_name)
WITH { "dimension": 384, "similarity": "L2", "description": "IVF,SQ8" }
```
In this definition, the VECTOR keyword explicitly marks text_vector as a vector attribute. This is necessary because, at the JSON level, a vector embedding is stored as a simple array of floating-point numbers. Without the vector annotation, GSI would treat the field as an ordinary array and apply standard indexing semantics.
By declaring a field as a vector, the user establishes an explicit contract with the GSI service that:
- The index will contain a single vector key, and that key represents the embedding used for vector similarity search in this index.
- The application is responsible for generating the vector embedding (for example, using an external embedding model) and persisting it in the specified document field.
- The GSI service must interpret the field semantically as a vector embedding and build vector-aware index structures optimized for Approximate Nearest Neighbor (ANN) search, rather than using conventional scalar or array indexing logic.
In vector index DDL, a user must specify a few extra parameters like:
- Dimension: length of the vector embeddings created
- Similarity: metric used for ANN search
- Description: a FAISS-style index description that specifies the accuracy vs. speed trade-off
In the above example:
- We created 384-dimensional embeddings for the tags, product name, category, and ingredients fields using the sentence-transformers/all-MiniLM-L6-v2 model and stored them in the text_vector field of each document.
- We used an IVF coarse quantizer with the default number of centroids and SQ8 quantization.
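As a rough illustration of what SQ8 does, the sketch below maps each float dimension to an 8-bit code over an assumed [-1, 1] range (real SQ8 trains per-dimension ranges from the data):

```python
def sq8_encode(vec, lo, hi):
    """Map each float into an 8-bit code in [0, 255] relative to [lo, hi]."""
    return [round((x - lo) / (hi - lo) * 255) for x in vec]

def sq8_decode(codes, lo, hi):
    """Reconstruct approximate floats from the 8-bit codes."""
    return [lo + c / 255 * (hi - lo) for c in codes]

vec = [0.23, -0.15, 0.87]
codes = sq8_encode(vec, lo=-1.0, hi=1.0)
approx = sq8_decode(codes, lo=-1.0, hi=1.0)
print(codes)   # one byte per dimension instead of a 4- or 8-byte float
print(approx)  # close to the original, within one quantization step
```

Storing one byte per dimension cuts index size roughly 4x versus float32 at the cost of a small, bounded reconstruction error, which is the accuracy vs. speed trade-off the description parameter controls.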
Step 3: Filtered ANN Query
Instead of filtering by exact category, we:
- Generate an embedding for the query “dark chocolate spread.”
- query_text = “dark chocolate spread”
- query_embedding = [0.23, -0.15, 0.87, 0.42, …, -0.31] # 384-dimensional vector
- Find the top-k (e.g., top 10) most similar products using ANN search, among those that meet our criteria (sugars_100g < 20 AND proteins_100g > 10).
- Return the top matches.
SQL++ Example (Couchbase):
```sql
SELECT product_name
FROM foods
WHERE sugars_100g < 20
  AND proteins_100g > 10
ORDER BY APPROX_VECTOR_DISTANCE(text_vector, [query_embedding], 'L2')
LIMIT 10;
```
Key Advantages
This approach finds products that:
- Are semantically similar to “dark chocolate spread” (using vector search).
- Meet the nutritional filters (low sugar, high protein).
- May come from different categories, such as “chocolate protein bars,” “nut butter spreads,” or “chocolate-flavored snacks,” which are similar in meaning but don’t match the category “chocolate spreads” filter.
Learn more about composite vector indexes in the next part of this series, where we will answer practical questions such as:
- How are vector embeddings stored and organized efficiently inside the index layer?
- Can a composite vector index answer scalar-only queries without reading the full document?
- Does the order of scalar fields and vector fields in the index definition matter?