Question about the time series data compression ratio

Time Series Data | Couchbase Docs

I am trying to verify the compression ratio of time series data by following the page linked above.

I downloaded the regular-time-series.csv file and added data to it.

The data I inserted through cbimport is 1,382 rows (31 KB).

However, when I checked the data size,

Import 1: 1.38K items, memory used: 335 KiB, disk: 220 KiB
Import 2: 2.76K items, memory used: 699 KiB, disk: 365 KiB
Import 3: 4.14K items, memory used: 1.02 MiB, disk: 503 KiB

The disk size has grown far more than the 31 KB of data I inserted. Is the data not being compressed?

Is there another way to check the document size?

Current Version

Couchbase Server Enterprise Edition 7.2.0 build 5325

There will be a few things at play here. The source is comma-separated; it is converted to JSON data for storage. JSON by definition is going to be larger as each document contains every field name, plus necessary supplemental syntax (quotes, braces, etc.).

The first two lines of the CSV are:

Date,Low,High,Mean,Region
2013-1-2,-2,5,1.5,UK
$ head -2 regular-time-series.csv|awk '{print length($0)}'
26
21

Translated to JSON this is at minimum:

$ cbc cat 4c88b6ed-eb5e-48e8-9436-602223f66cad -u Administrator -P password -U couchbase://192.168.2.22/travel-sample --scope=time --collection=regular
4c88b6ed-eb5e-48e8-9436-602223f66cad CAS=0x17ab6b2042a20000, Flags=0x0, Size=62, Datatype=0x01(JSON)
{"Date":"2013-1-2","High":5,"Low":-2,"Mean":1.5,"Region":"UK"}

or

select encoded_size(r) from `travel-sample`.time.regular r use keys["4c88b6ed-eb5e-48e8-9436-602223f66cad"];
{
    "requestID": "c30511cc-55b8-4b05-a3c3-14ef106079c0",
    "signature": {
        "$1": "number"
    },
    "results": [
    {
        "$1": 62
    }
    ],

i.e. from 21 source bytes (data only, including the newline) to 62 bytes.
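As a sanity check, the expansion can be reproduced outside Couchbase. This is a minimal Python sketch, assuming the import's type inference turns numeric-looking fields into JSON numbers (which the cbc output above shows it does):

```python
import csv
import io
import json

# One CSV data row vs. the JSON document stored for it.
csv_text = "Date,Low,High,Mean,Region\n2013-1-2,-2,5,1.5,UK\n"
row = next(csv.DictReader(io.StringIO(csv_text)))

def coerce(value):
    """Mimic numeric type inference: try int, then float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value

doc = {field: coerce(value) for field, value in row.items()}
encoded = json.dumps(doc, separators=(",", ":"), sort_keys=True)

source_bytes = len("2013-1-2,-2,5,1.5,UK\n")  # data row incl. newline
print(source_bytes)  # 21
print(len(encoded))  # 62, matching cbc's Size=62 and encoded_size()
print(encoded)
```

The field names ("Date", "High", …) plus quotes, colons, commas and braces account for the entire 21 → 62 byte growth; the data itself is unchanged.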

cbq> select sum(encoded_size(r)) total, count(1) cnt, sum(encoded_size(r))/count(1) avg from `travel-sample`.time.regular r;
{
    "requestID": "aa02b584-eea3-44e0-bbf4-1738fd3d31b1",
    "signature": {
        "total": "number",
        "cnt": "number",
        "avg": "number"
    },
    "results": [
    {
        "total": 22760,
        "cnt": 365,
        "avg": 62.35616438356164
    }

There is nothing special about time series data here; this is simply the basic difference between the CSV and JSON formats.

On disk, the data service will compress documents when it deems it beneficial. (This compressed size isn’t reflected in querying.)

I presume you’re looking at the collection statistics in the UI for the memory and disk sizes? These are not constant. For example, with the file loaded once (365 documents) I see only 54.7 KiB in memory and 196 KiB on disk (which includes other overhead, not just the raw data). Loading the file another 9 times (3,650 documents in total) the stats are 546 KiB in memory but only 608 KiB on disk. A further ten loads (7,300 documents) gives 1.06 MiB in memory and 1.04 MiB on disk. The point is that the growth isn’t linear, so extrapolating from a small sample is unlikely to be accurate.
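A quick way to see why extrapolation from a small sample fails is to fit a fixed-overhead model to the figures above. This is a back-of-the-envelope sketch, not how the storage engine actually accounts for space:

```python
# Observed disk sizes from the loads above: document count -> KiB on disk.
samples = {365: 196, 3650: 608, 7300: 1.04 * 1024}

# Fit disk = overhead + n * per_doc using the two larger samples.
n1, n2 = 3650, 7300
per_doc = (samples[n2] - samples[n1]) / (n2 - n1)  # KiB per document
overhead = samples[n1] - n1 * per_doc              # fixed KiB

predicted_365 = overhead + 365 * per_doc
print(f"~{per_doc * 1024:.0f} bytes/doc plus ~{overhead:.0f} KiB fixed overhead")
print(f"model predicts {predicted_365:.0f} KiB for 365 docs (observed: 196 KiB)")
```

Roughly 150 KiB of the small sample's footprint is fixed overhead, which is why the first load looks so disproportionately large: at 365 documents the overhead dominates, while at 7,300 documents the per-document cost does.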

Buckets, Memory, and Storage | Couchbase Docs provides more detail on storage, caching, compaction etc.

HTH.


There is a minimum compression ratio and a minimum document size required for a document to be compressed, so simply having compression enabled does not guarantee compression. See Compression | Couchbase Docs.
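To illustrate why such thresholds exist: tiny documents gain little or nothing from compression. Couchbase uses Snappy; zlib from Python's standard library is used below purely as a stand-in to show the same effect:

```python
import zlib

# A 62-byte document like the one above, and a deliberately repetitive
# large value. zlib stands in for Snappy here just to show the effect.
small = b'{"Date":"2013-1-2","High":5,"Low":-2,"Mean":1.5,"Region":"UK"}'
large = small * 200

ratios = {}
for name, doc in (("small", small), ("large", large)):
    compressed = zlib.compress(doc)
    ratios[name] = len(compressed) / len(doc)
    print(f"{name}: {len(doc)} -> {len(compressed)} bytes "
          f"(ratio {ratios[name]:.2f})")
```

The small document barely shrinks (the compression framing can even make it larger), while the repetitive large one collapses, which is why a minimum size and minimum ratio are checked before a document is stored compressed.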