CB compressed data: 1M documents in 1GB … for 48 keys/columns that's insane

Dear Couchbase team,

We have 150 million documents loaded into one bucket, which took ~150 GB of storage space.
Each document has 48 keys, or 48 columns.
Even after default auto-compaction we didn't get much reduction from a storage perspective.
We know Couchbase uses Snappy compression by default.
The same volume of data in Oracle takes < 50 MB.
Can you please help us understand why there is such a big difference, and how we can compress the data at scale?


Are you sure about that? 150M documents in 50MB would be 0.3 bytes per document. :face_with_monocle: Did you mean 50 GB?

JSON is inherently less space-efficient than statically-typed columns. The flexible schema implies that every document has to describe the names of its properties. Single-document compression doesn’t help that much because a property name usually only occurs once per document.
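This effect is easy to see with a short experiment. The sketch below is illustrative only: it uses Python's zlib as a stand-in for Snappy, and a made-up document with long field names like the ones in this thread. Compressing each document on its own cannot exploit the key names that repeat across millions of documents; compressing many documents together can.

```python
import json
import zlib

# Made-up document shaped like the ones in this thread: long key names,
# each appearing once per document.
doc = json.dumps({
    "CURRENCY_DATE": "2018-09-04 12:00:00",
    "EXCHANGE_RATE_TO_GBP": 0.775385317919386,
    "SOURCE_NAME": "My Text (incl Qualified)",
    "VALID_FROM_DTTM": "2018-08-12 11:25:20",
    "DESC_MD5": "A7AC0A753D3AB3A74E9D03A97049C1E0",
}).encode()

one_doc = len(zlib.compress(doc))        # each document compressed alone
batch = len(zlib.compress(doc * 100))    # 100 documents compressed together

# Per-document compression pays for every key name in every document;
# batch compression encodes the repeats almost for free.
print(f"100 docs compressed separately: {one_doc * 100} bytes")
print(f"100 docs compressed together:   {batch} bytes")
```

Couchbase (like most document stores) compresses per document, so its numbers land near the first figure, not the second.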

Is using 150GB of space causing problems for you? That’s only a few dollars of storage, at current prices…

I am sorry, I meant to say that 1M documents in Oracle take < 50 MB with HCC (Hybrid Columnar Compression).
I know storage is cheap, but we are talking about 150 billion documents, which would potentially take 250 TB of storage if I project it that way. That includes only data space, with no space for indexes and other overhead. We are also looking at Pure Storage for Kubernetes, so that much storage consumption definitely bothers us.
So my primary concern is why Couchbase would store the data at 10 or 20x the volume Oracle does. Isn't it all file based?
Is my document header taking up more storage?
I am just trying to understand whether this is expected and whether there is anything we are grossly missing, rather than simply going by what Storage Used shows in the web console.
Also, with 3 or 4 replicas per bucket we will need correspondingly more storage. So I need some insights, please.

What do your documents look like in JSON? Can you paste one here?

Sure. Here is a sample document's JSON:

"CURRENCY_DATE": "2018-09-04 12:00:00",
"ERROR_DTTM": null,
"DESC_BIT_FLAGS": "4328522752",
"ARRIVAL_DATE": "2018-09-04 12:00:00",
"SOURCE_NAME": "My Text (incl Qualified)",
"EXCHANGE_RATE_TO_GBP": 0.775385317919386,
"VALID_FROM_DTTM": "2018-08-12 11:25:20",
"IS_GOVT": "Y",
"STATUS": "S",
"HAS_WF": "N",
"K_RT": "1000002824460",
"HAS_PRK": "N",
"IS_AUTO": "N",
"DESC_MD5": "A7AC0A753D3AB3A74E9D03A97049C1E0",
"IS_CORP": "N",
"LS": 1,
"RT": 155,
"VALID_TO_DTTM": "3999-12-31 12:00:00",
"EXTRA_TEXT_MD5": "5B8832B50151CE105400EADDD5E5D48E",
"K_PROPERTY": 350001851,
"GUESTS": 1,
"LAST_SHOP_DTTM": "2018-09-04 11:10:48",
"EXCHANGE_RATE_TO_EUR": 0.861437349668806,
"ID": 4189,

So @jens, I am assuming we are doing something grossly wrong while loading this data. Is it expected that this would take that much storage?

Also, just to explain: the underlying bucket data is stored on Pure-File or Pure-Block in Kubernetes NFS storage. Do you think that is an issue?
Compression is a big concern here for billions of documents.
I need some insights, please.

@jens Any pointers? We need help here. Any idea who has more insight into this compression issue on the Couchbase side?
If this is a bottleneck we might want to scrap the Couchbase idea and move to another database like Mongo or Druid, or stay on Oracle. Are there any technology experts in this forum who can shed some light on what's going on?
Thanks, and sorry to keep pestering about this.

Hi @eldorado,

Short answer here is that Couchbase does not currently do “column” compression; we do “row” compression, i.e. each document is compressed separately from the others. In some cases this can have a significant impact, in others not. To be honest, our current row-based compression is never going to compare to columnar compression for this sort of data model, where so much data is repeated across the rows (documents).

Thanks for sharing that sample JSON. That document is 1580 bytes, so 150M of them would be roughly 220GiB of storage. I did a quick check to run that same document through snappy compression and it came out at 933 bytes, validating that there is the expected 1.7x compression. 150M of those would be about 130GiB so at least we’re in the ballpark of the 150GiB you reported. There’s probably a bit more variation in document size, some extra metadata structures, etc.

A common optimization approach is to rename your fields to use as few characters as possible. I’ve seen this done many times with pretty significant effect, and you should be able to toggle it between dev and production environments to trade off space for readability. I took a quick pass through the JSON you shared and was able to get the uncompressed size down to 1145 bytes and the compressed size down to 646 bytes, which means it would take up ~90GB for 150M items. It’s not perfect, but over billions of documents it will be quite meaningful.
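For anyone wanting to try the field-renaming idea, here is a rough sketch. The short aliases and the helper names (`shrink`/`expand`) are hypothetical, and zlib again stands in for Snappy; the point is only that the rename is lossless and the stored form gets smaller.

```python
import json
import zlib

# Sketch of the key-shortening idea. The short aliases and helper names
# (shrink/expand) are hypothetical; zlib stands in for Snappy here.
SHORT = {
    "CURRENCY_DATE": "cd",
    "EXCHANGE_RATE_TO_GBP": "xg",
    "SOURCE_NAME": "sn",
    "VALID_FROM_DTTM": "vf",
    "DESC_MD5": "dm",
}
LONG = {v: k for k, v in SHORT.items()}  # reverse map for reads

def shrink(doc: dict) -> dict:
    """Replace long field names with short aliases before storing."""
    return {SHORT.get(k, k): v for k, v in doc.items()}

def expand(doc: dict) -> dict:
    """Restore the original field names after reading."""
    return {LONG.get(k, k): v for k, v in doc.items()}

doc = {
    "CURRENCY_DATE": "2018-09-04 12:00:00",
    "EXCHANGE_RATE_TO_GBP": 0.775385317919386,
    "SOURCE_NAME": "My Text (incl Qualified)",
    "VALID_FROM_DTTM": "2018-08-12 11:25:20",
    "DESC_MD5": "A7AC0A753D3AB3A74E9D03A97049C1E0",
}

assert expand(shrink(doc)) == doc  # the rename is lossless

raw = len(zlib.compress(json.dumps(doc).encode()))
short = len(zlib.compress(json.dumps(shrink(doc)).encode()))
print(f"original keys: {raw} bytes compressed, short keys: {short} bytes")
```

In practice the mapping would live in application code (or a small shared module), so a dev environment can skip the renaming and keep readable keys.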

As far as I know, MongoDB does not have columnar compression either, but I can’t speak well enough for Druid.

Net-net, it’s going to come down to the tradeoffs you make in order to get a particular set of benefits. Couchbase may not be as space efficient as Oracle here…but then Oracle doesn’t give you the flexibility, scale out, cloud-native, etc that Couchbase does. It may also be a question of use case…do you really need high-performance and flexible operational access to 150 Billion records? Perhaps you only need that kind of access to a few billion and you can offload/archive the rest to something like HDFS which will be cheaper but slower.

I hope that’s helpful.



Hi Perry,
Appreciate the response. I understand there is no magic bullet for the compression ratio beyond what the compact option already provides with Snappy compression.

This is definitely helpful, as it tells us we are not missing anything grossly, and the math of the numbers proves that.
We will work on the use case and find the best way to address this.

Any idea why I am seeing RAM usage of almost 120 GB for the same 150M documents (which take 150 GB of storage)? And I am continuously getting metadata overhead warnings. If this is for quick access to documents, I wonder what will happen when I load 900M records.

thanks again

A Couchbase bucket will use as much RAM as you give it to aggressively cache the documents. As you put more data in than there is RAM available, Couchbase will automatically eject data that has already been written to disk. You can reduce the RAM usage by reducing the bucket quota.

I would also suggest switching the bucket to “Full Eviction” mode which will help avoid those metadata overhead warnings.

Generally we recommend keeping about 15-20% of your data cached in RAM, especially for events like node failure and rebalancing. We’re working on some longer-term improvements to remove that recommendation.
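Assuming the standard `couchbase-cli` tool is available, the two suggestions above (lower the bucket quota, switch to full eviction) would look roughly like this; the cluster address, credentials, bucket name, and quota size are all placeholders:

```shell
# Hypothetical example: shrink the bucket RAM quota to 20 GiB and
# switch to full eviction (host, credentials, and bucket are placeholders).
couchbase-cli bucket-edit \
  -c localhost:8091 \
  -u Administrator -p password \
  --bucket mybucket \
  --bucket-ramsize 20480 \
  --bucket-eviction-policy fullEviction
```

Note that changing the eviction policy restarts the bucket, so expect a brief warmup period.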


Thank you very much. The information and help are appreciated.