Impact of Storing Multiple Embedding Model Vectors in a Single Document on Vector Search Performance

We are currently storing vectors from different embedding models in the same Couchbase document, and we’re building FTS vector indexes on each of the fields. Here’s an example of how we structure the data in our documents:

{
...
  "embedding_map": {
    "text-embedding-3-large": { "vector": [] },
    "voyage-3": { "vector": [] }
  }
...
}

Each vector belongs to a different model, and we perform searches using FTS on these fields. While this approach has been working, we’ve noticed that response times are sometimes extremely slow, and we’re not sure where the bottleneck is. It could be related to FTS vector search performance or potentially another part of our application.
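To help narrow this down on our side, here is a minimal timing sketch we've been considering. `run_search` is a hypothetical stand-in for whatever actually issues the FTS vector query; the point is just to collect latency percentiles so we can tell whether the slowness lives in the search call or elsewhere in the application:

```python
import statistics
import time

def time_calls(fn, n=20):
    """Time n calls to fn and return latency percentiles in milliseconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    latencies.sort()
    return {
        "p50": statistics.median(latencies),
        "p95": latencies[int(0.95 * (len(latencies) - 1))],
        "max": latencies[-1],
    }

# run_search is hypothetical; wrap the real FTS vector query here.
run_search = lambda: None
stats = time_calls(run_search)
# stats is e.g. {"p50": ..., "p95": ..., "max": ...} in milliseconds
```

Comparing these percentiles for the FTS call alone versus the full request path should show where the time goes.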

My question is: Could having vectors from different embedding models in the same document affect FTS vector search performance? Could this be due to how Couchbase handles multiple vector fields within the same document, or would it be better to separate the vectors into different documents?

Any insights on how this setup might impact performance or any best practices to follow when designing vector indexes for multiple models would be highly appreciated.

Thanks in advance!

Let me know if you need any additional details.

@abhinav (tagging you because I know you are doing search stuff)

@abhinav just following up in case this post got overlooked.

Thanks @PShri - it seems I did miss the notification from the original post.

Could having vectors from different embedding models in the same document affect FTS vector search performance?

It shouldn’t, but could you let me know what version of Couchbase Server you are using? I’d strongly recommend 7.6.2 or later for vector search.

Holding multiple vectors in different fields within the same document will mean separate vector indexes for each field within each index segment. One vector index should not affect the other as long as you have sufficient resources available to handle your use case.
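To illustrate: a kNN search names the specific vector field it targets, so it only consults that field's per-segment vector index. A request shaped like the following (shown here as a Python dict mirroring the 7.6 FTS search request format; the field path and dimensions are illustrative, based on your document structure) would touch only the text-embedding-3-large index, not the voyage-3 one:

```python
# Illustrative query vector; text-embedding-3-large produces 3072-dim vectors.
query_vector = [0.0] * 3072

request = {
    "query": {"match_none": {}},  # no keyword scoring, pure kNN search
    "knn": [
        {
            "field": "embedding_map.text-embedding-3-large.vector",
            "vector": query_vector,
            "k": 5,
        }
    ],
    "size": 5,
}
```

Each entry in the `knn` array targets exactly one indexed vector field, which is why the two models' indexes stay independent at query time.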

Could you share the logs (cbcollect_info) from one or more of the nodes hosting the Search Service in your cluster? Perhaps we can find some clues there on what’s happening underneath.

We are using 7.6.3.

The issue no longer occurs, and it is not reproducible. Responses got very slow a couple of times, and I suspected FTS vector search (as that was the new addition). Next time I come across it, I shall share the logs.

Sounds like there is no difference from a performance standpoint whether the vectors are in a single document or separated out.

Thanks for the information, though I am not sure what exactly an index “segment” is.

Sounds good.

I am not sure what exactly an index “segment” is

The search index follows a partitioned, segmented, LSM-like architecture. As each index partition ingests data, batched content is persisted into immutable, reference-counted files that we call segments; you can view these as mini indexes. A merger/compactor routine is responsible for eventually merging smaller segments into larger ones in a tiered fashion. Segments whose reference count falls to zero (stale data) are purged.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.