Question about the time series data compression ratio

Time Series Data | Couchbase Docs

I am trying to verify the compression ratio of time series data by following the page linked above.

I downloaded the regular-time-series.csv file and added data to it.

The data I inserted through cbimport is 1,382 rows (31 KB).

However, when I checked the data size,

Import 1: 1.38K items, memory used: 335 KiB, disk: 220 KiB
Import 2: 2.76K items, memory used: 699 KiB, disk: 365 KiB
Import 3: 4.14K items, memory used: 1.02 MiB, disk: 503 KiB

The disk size has grown far more than the 31 KB of data I inserted. Is the data not being compressed?

Is there another way to check the document size?

Current Version

Couchbase Server Enterprise Edition 7.2.0 build 5325

There will be a few things at play here. The source is comma-separated; it is converted to JSON data for storage. JSON by definition is going to be larger as each document contains every field name, plus necessary supplemental syntax (quotes, braces, etc.).

The first two lines of the CSV are:

Date,Low,High,Mean,Region
2013-1-2,-2,5,1.5,UK
$ head -2 regular-time-series.csv|awk '{print length($0)}'
26
21

Translated to JSON this is at minimum:

$ cbc cat 4c88b6ed-eb5e-48e8-9436-602223f66cad -u Administrator -P password -U couchbase://192.168.2.22/travel-sample --scope=time --collection=regular
4c88b6ed-eb5e-48e8-9436-602223f66cad CAS=0x17ab6b2042a20000, Flags=0x0, Size=62, Datatype=0x01(JSON)
{"Date":"2013-1-2","High":5,"Low":-2,"Mean":1.5,"Region":"UK"}

or

select encoded_size(r) from `travel-sample`.time.regular r use keys["4c88b6ed-eb5e-48e8-9436-602223f66cad"];
{
    "requestID": "c30511cc-55b8-4b05-a3c3-14ef106079c0",
    "signature": {
        "$1": "number"
    },
    "results": [
    {
        "$1": 62
    }
    ],

i.e. from 21 source bytes (data only, including the newline) to 62 bytes.
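As a sanity check, the expansion can be reproduced outside Couchbase. This is a minimal Python sketch, assuming the import's type inference turns numeric-looking fields into JSON numbers (which the cbc output above shows it does):

```python
import csv
import io
import json

# One CSV data row vs. the JSON document stored for it.
csv_text = "Date,Low,High,Mean,Region\n2013-1-2,-2,5,1.5,UK\n"
row = next(csv.DictReader(io.StringIO(csv_text)))

def coerce(value):
    """Mimic numeric type inference: try int, then float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

doc = {field: coerce(value) for field, value in row.items()}
encoded = json.dumps(doc, separators=(",", ":"), sort_keys=True)

source_bytes = len("2013-1-2,-2,5,1.5,UK\n")  # data row incl. newline
print(source_bytes)  # 21
print(len(encoded))  # 62, matching cbc's Size=62 and encoded_size()
print(encoded)
```

The field names ("Date", "High", …) plus quotes, colons, commas and braces account for the entire 21 → 62 byte growth; the data itself is unchanged.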

cbq> select sum(encoded_size(r)) total, count(1) cnt, sum(encoded_size(r))/count(1) avg from `travel-sample`.time.regular r;
{
    "requestID": "aa02b584-eea3-44e0-bbf4-1738fd3d31b1",
    "signature": {
        "total": "number",
        "cnt": "number",
        "avg": "number"
    },
    "results": [
    {
        "total": 22760,
        "cnt": 365,
        "avg": 62.35616438356164
    }

There is nothing special about time series data here; this is simply the basic difference between the CSV and JSON formats.

On disk, the data service will compress documents when it deems it beneficial. (This compressed size isn’t reflected in querying.)

I presume you’re looking at the collection statistics in the UI for the memory and disk sizes? These are not constant. For example, with the file loaded once (365 documents) I see only 54.7 KiB in memory and 196 KiB on disk (which includes other overhead, not just the raw data). Loading the file another 9 times (3,650 documents in total) the stats are 546 KiB in memory but only 608 KiB on disk. A further ten loads (7,300 documents) gives 1.06 MiB in memory and 1.04 MiB on disk. The point is that the growth isn’t linear, so extrapolating from a small sample is unlikely to be accurate.
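A quick way to see why extrapolation from a small sample fails is to fit a fixed-overhead model to the figures above. This is a back-of-the-envelope sketch, not how the storage engine actually accounts for space:

```python
# Observed disk sizes from the loads above: document count -> KiB on disk.
samples = {365: 196, 3650: 608, 7300: 1.04 * 1024}

# Fit disk = overhead + n * per_doc using the two larger samples.
n1, n2 = 3650, 7300
per_doc = (samples[n2] - samples[n1]) / (n2 - n1)  # KiB per document
overhead = samples[n1] - n1 * per_doc              # fixed KiB

predicted_365 = overhead + 365 * per_doc
print(f"~{per_doc * 1024:.0f} bytes/doc plus ~{overhead:.0f} KiB fixed overhead")
print(f"model predicts {predicted_365:.0f} KiB for 365 docs (observed: 196 KiB)")
```

Roughly 150 KiB of the small sample's footprint is fixed overhead, which is why the first load looks so disproportionately large: at 365 documents the overhead dominates, while at 7,300 documents the per-document cost does.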

Buckets, Memory, and Storage | Couchbase Docs provides more detail on storage, caching, compaction etc.

HTH.


There is a minimum compression ratio and a minimum document size required for a document to be compressed, so simply having compression enabled does not guarantee compression. See Compression | Couchbase Docs.
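To illustrate why such thresholds exist: tiny documents gain little or nothing from compression. Couchbase uses Snappy; zlib from Python's standard library is used below purely as a stand-in to show the same effect:

```python
import zlib

# A 62-byte document like the one above, and a deliberately repetitive
# large value. zlib stands in for Snappy here just to show the effect.
small = b'{"Date":"2013-1-2","High":5,"Low":-2,"Mean":1.5,"Region":"UK"}'
large = small * 200

ratios = {}
for name, doc in (("small", small), ("large", large)):
    compressed = zlib.compress(doc)
    ratios[name] = len(compressed) / len(doc)
    print(f"{name}: {len(doc)} -> {len(compressed)} bytes "
          f"(ratio {ratios[name]:.2f})")
```

The small document barely shrinks (the compression framing can even make it larger), while the repetitive large one collapses, which is why a minimum size and minimum ratio are checked before a document is stored compressed.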