We are running Couchbase Server 6.6.2 Enterprise Edition together with Sync Gateway 2.8 and Couchbase Lite (CBL). All our data resides in our main bucket. Sync Gateway is the only service that CREATES documents in the main bucket; our back-end services only modify existing documents there.
At the moment our main bucket contains 10 million documents, of which only 3 million were created by our CBL clients. We don’t know what the rest of the documents are.
How can we:
- Identify all types of documents we have in the main bucket accurately?
- Determine the number of documents for each type accurately?
The “Data Insights” right pane in the query editor shows incorrect values because it is based on a 1,000-document sample. How can we accurately determine the two numbers above?
Although INFER is what “Data Insights” uses under the covers, you may still want to run it yourself with a larger sample size. There is no need to go as high as 7 million, though: that shouldn’t be statistically necessary, and it would probably fail due to memory requirements.
But this isn’t 100% accurate - it is statistical sampling.
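As a rough sketch of what a larger sample could look like: the bucket name `main` and the sample size of 100000 below are placeholders, not values taken from your cluster, so pick a size your query node's memory can handle.

```sql
/* Assumption: the bucket is called `main`; 100000 is an arbitrary sample size. */
INFER `main` WITH {"sample_size": 100000};
```

The output has the same shape as what “Data Insights” shows (one entry per inferred document flavour), only built from more documents.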
The only way to qualify every single document is to examine every single document, typically via queries. How you do that comes down to what you know about the documents. For example, if you know of a field that only CBL clients set, you can filter on it to produce counts. If you have no knowledge of which fields are present, functions like OBJECT_NAMES would likely come in handy in initial queries, for example to group and count documents by the set of fields they contain.
(Perhaps something like:
SELECT flds.f, COUNT(1) AS cnt FROM (SELECT OBJECT_NAMES(t) AS f FROM `travel-sample` AS t LIMIT 1000) AS flds GROUP BY flds.f
would give initial flavours, with a count of how many of the sampled documents share each field list; raise the limit once the output proves useful. If you have a “type” field, of course, just use that to identify document types. Similarly, if the document keys carry type-identifying information, you can query, group and count on that embedded information too.)
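To illustrate the last two suggestions, here is a minimal sketch. It assumes the bucket is called `main`, that documents carry a field literally named `type`, and that keys follow a `prefix::rest` pattern; all three are guesses, so adjust them to whatever your data actually uses.

```sql
/* Assumption: documents have a "type" field. Documents without one
   should end up together in a group that has no doc_type in the output. */
SELECT t.type AS doc_type, COUNT(*) AS cnt
FROM `main` AS t
GROUP BY t.type
ORDER BY cnt DESC;

/* Assumption: document keys look like "<prefix>::<rest>". */
SELECT SPLIT(META(t).id, "::")[0] AS key_prefix, COUNT(*) AS cnt
FROM `main` AS t
GROUP BY SPLIT(META(t).id, "::")[0]
ORDER BY cnt DESC;
```

Both queries have to touch every document, so they need a suitable index (a primary index, for example) and may take a while on 10 million documents.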