Storing a huge number of entries in a CB doc array

mr.sushobhit · May 9, 2025, 5:25am

We are processing traffic data that includes TAC (Tracking Area Code) values, which we need to persist in our database. These values must be stored uniquely, and since the TAC space allows for millions of possible values, we are looking for an efficient data modeling strategy.

Problem Context:

Option 1:
Store all TAC values in a single document with an array.
- Drawback: Not scalable due to document size limitations in Couchbase.
Option 2:
Store each TAC value in its own document, e.g., one document per TAC.
- Drawback: May lead to millions of documents, potentially impacting performance and manageability.

Current Data Model Example:

{
  "__t": "amms5g-tac-map",
  "tac": ["019AB", "019A1", "1234561", "1234567"],
  "__at": 20250407154813
}

Proposed Data Model Example:

{
  "__t": "amms5g-tac",
  "tac": "019AB",
  "__at": 20250407154813
}

Usage Context:

These TAC values are master data that will be queried by our GUI application to display all known TACs received from live traffic.

Challenge:

Both approaches have limitations:

- Large document sizes (Option 1)
- Large number of documents (Option 2)

Request:

We are looking for best practices or alternative strategies in Couchbase for efficiently managing high-cardinality master data like this. Specifically, something that balances query performance, storage efficiency, and operational manageability.

Any recommendations or patterns you can suggest for handling this type of use case would be greatly appreciated.

mreiche · May 9, 2025, 1:25pm

How is the data used? Does it need to be stored separately at all? Would it be sufficient to

SELECT DISTINCT tac FROM traffic_readings?

Based on the size limit, you can already rule out putting them in a single document.
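As a rough check of that size argument, the sketch below estimates how many TAC strings fit in one array. It assumes a 20 MiB per-document limit (the 20MB figure mentioned elsewhere in this thread), compact JSON, ASCII TACs of a fixed length (7 characters, like the longer samples), and it ignores the other fields in the document. The function names are hypothetical.

```python
import json

# Assumption: per-document size limit of 20 MiB.
DOC_LIMIT_BYTES = 20 * 1024 * 1024


def array_bytes(n, tac_len):
    """Bytes used by a compact JSON array of n ASCII strings of length tac_len (n >= 1)."""
    # 2 brackets + n quoted strings + (n - 1) commas
    return 2 + n * (tac_len + 2) + (n - 1)


def max_entries(limit_bytes, tac_len):
    """Largest n for which array_bytes(n, tac_len) <= limit_bytes."""
    return (limit_bytes - 1) // (tac_len + 3)


# Sanity check of the formula against a real serialization.
sample = ["019AB"] * 4
assert array_bytes(4, 5) == len(json.dumps(sample, separators=(",", ":")))

print(max_entries(DOC_LIMIT_BYTES, 7))
```

Under these assumptions a single array holds roughly two million 7-character TACs, so a TAC space of "millions" of values would not reliably fit in one document.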

mr.sushobhit · May 12, 2025, 4:42am

My requirement is to store unique data periodically and read all of it from the GUI application. Any design will work; we just need to achieve that functionality.

mreiche · May 12, 2025, 1:33pm

"to display all known TACs received from live traffic"

A SELECT DISTINCT on the collection containing the live traffic would give you that. The TACs would not have to be stored separately.

vsr1 · May 12, 2025, 3:18pm

Why not use a hybrid approach?
Take the first 2 characters of the TAC as a prefix, assuming all TACs sharing that prefix will fit in 20MB; if not, take 3 characters.
Create one document per prefix. When inserting a new value, find the document it belongs to and insert the value only if it does not already exist (see if you can use the KV subdoc API for this).
This way all TACs are unique.
Then query all the documents and display the result (i.e. read multiple documents and append their arrays, instead of reading one big array).
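The prefix-bucket idea can be sketched as follows. This is only an in-memory illustration, not Couchbase code: a plain dict stands in for the collection, the `BucketStore` class, the `bucket_key` helper, and the `amms5g-tac-map::<prefix>` key format are all hypothetical, and the "insert only if not present" step is done with a list membership check where a real implementation would presumably use the KV sub-document API suggested above.

```python
PREFIX_LEN = 2  # assumption: 2 characters; raise to 3 if a bucket grows too large


def bucket_key(tac, prefix_len=PREFIX_LEN):
    """Hypothetical document key for the bucket a TAC belongs to."""
    return f"amms5g-tac-map::{tac[:prefix_len]}"


class BucketStore:
    """In-memory stand-in for a collection of prefix-bucket documents."""

    def __init__(self):
        self.docs = {}  # document key -> document body

    def add_unique(self, tac):
        """Add tac to its bucket document; return False if it was already stored."""
        key = bucket_key(tac)
        doc = self.docs.setdefault(key, {"__t": "amms5g-tac-map", "tac": []})
        if tac in doc["tac"]:
            return False
        doc["tac"].append(tac)
        return True

    def all_tacs(self):
        """Read every bucket document and concatenate the arrays."""
        result = []
        for key in sorted(self.docs):
            result.extend(self.docs[key]["tac"])
        return result


store = BucketStore()
for tac in ["019AB", "019A1", "1234561", "1234567", "019AB"]:
    store.add_unique(tac)

print(len(store.docs), store.all_tacs())
```

With the sample values from the original post, the duplicate "019AB" is rejected and the four unique TACs end up in two bucket documents (prefixes "01" and "12").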

pccb · May 13, 2025, 10:29am

I think one of you mentioned using a collection. While that’s certainly more convenient—since the N1QL queries are shorter, neater, and more intuitive—I’m wondering whether it also offers a performance advantage.

For example, between the two approaches below, which one is likely to be more performant?

-- Using a collection

-- assumes the collection lives in the _default scope
CREATE PRIMARY INDEX idx1 ON bkt1._default.collection1;
SELECT * FROM bkt1._default.collection1;

-- Using a field to distinguish a type of doc from other docs within a bucket

CREATE INDEX idx1 ON bkt1(__t) WHERE __t = "amms5g-tac";
SELECT * FROM bkt1 WHERE __t = "amms5g-tac";

Does the use of a collection result in any inherent performance benefit (e.g., due to data isolation or index scoping), or is it mostly about manageability and cleaner query syntax?

thanks

mreiche · May 13, 2025, 1:42pm

Like I said earlier, you don’t even need to store them separately, as SELECT DISTINCT would work.

Anyway - try them? I would not expect to find a measurable difference.

But the first just works while the second requires an index and a predicate in the query.

Also, SELECT RAW tac FROM … will return just the string without the JSON object around it, reducing the network transfer.
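To illustrate the size difference, here is a small sketch comparing the two result shapes for the sample TACs from the original post. It assumes compact JSON and only measures the result rows themselves; a real query response carries additional metadata that is not modeled here, and the `payload_sizes` helper is hypothetical.

```python
import json


def payload_sizes(tacs):
    """Return (wrapped_bytes, raw_bytes) for the two result shapes."""
    # Shape of SELECT tac ...: one object per row, e.g. {"tac":"019AB"}
    wrapped = json.dumps([{"tac": t} for t in tacs], separators=(",", ":"))
    # Shape of SELECT RAW tac ...: bare values, e.g. "019AB"
    raw = json.dumps(tacs, separators=(",", ":"))
    return len(wrapped), len(raw)


tacs = ["019AB", "019A1", "1234561", "1234567"]
wrapped_bytes, raw_bytes = payload_sizes(tacs)
print(wrapped_bytes, raw_bytes, wrapped_bytes - raw_bytes)
```

Each row saves the 8 bytes of the `{"tac":` and `}` wrapper, which is small per row but adds up over millions of values.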

pccb · May 13, 2025, 4:47pm

Thanks @mreiche — appreciate your input.

However, I think there may be a slight misunderstanding. Regardless of whether the values are stored as individual documents, within an array, across multiple arrays, or in a collection, the underlying values themselves are stored just once. So in this context, using DISTINCT isn’t really necessary.

Noted on the SELECT RAW FROM ... usage, thanks!
