We are processing traffic data that includes TAC (Tracking Area Code) values, which we need to persist in our database. These values must be stored uniquely, and since the TAC space allows for millions of possible values, we are looking for an efficient data modeling strategy.
Problem Context:
Option 1:
Store all TAC values in a single document with an array.
Drawback: Not scalable due to document size limitations in Couchbase.
Option 2:
Store each TAC value in its own document, e.g., one document per TAC.
Drawback: May lead to millions of documents, potentially impacting performance and manageability.
These TAC values are master data that will be queried by our GUI application to display all known TACs received from live traffic.
Challenge:
Both approaches have limitations:
Large document sizes (Option 1)
Large number of documents (Option 2)
Request:
We are looking for best practices or alternative strategies in Couchbase for efficiently managing high-cardinality master data like this. Specifically, something that balances query performance, storage efficiency, and operational manageability.
Any recommendations or patterns you can suggest for handling this type of use case would be greatly appreciated.
My requirement is to store unique data periodically, and read all data from GUI application, any design will work just need to achieve the functionality
Why don’t use hybrid approach
Take first 2 character of tac assume all remaining will fit in 20MB, if not take 3 characters.
Create document based on that (inserting new value find document it belongs and make sure you insert value only if not exist, see if you can use KV subdoc API).
This way all tacs are unique
Now query all the documents and dispaly. (i.e indirectly read multiple documents append vs one big array)
I think one of you mentioned using a collection. While that’s certainly more convenient—since the N1QL queries are shorter, neater, and more intuitive—I’m wondering whether it also offers a performance advantage.
For example, between the two approaches below, which one is likely to be more performant?
– Using a collection
CREATE INDEX idx1 ON bkt1.collection1;
SELECT * FROM bkt1.collection1;
– Using a field to distinguish a type of doc from other docs within a bucket
CREATE INDEX idx1 ON bkt1 WHERE __t = "amms5g-tac";
SELECT * FROM bkt1 WHERE __t = "amms5g-tac";
Does the use of a collection result in any inherent performance benefit (e.g., due to data isolation or index scoping), or is it mostly about manageability and cleaner query syntax?
However, I think there may be a slight misunderstanding. Regardless of whether the values are stored as individual documents, within an array, across multiple arrays, or in a collection, the underlying values themselves are stored just once. So in this context, using DISTINCT isn’t really necessary.