How to model tags

konstantin_dev · September 8, 2022, 9:21am

Hi i’m quite new to couchbase, but read and benchmarked a lot of possible solutions.
Currently i’m quite frustrated, since i’m not quite happy with my outcome and timing.

Maybe you can help, how i should model my data correctly.

The Problem:

I have a lot of events (100.000.000 every month - stored for 3months), which contain a lot of annotated tags (1900 unique / avg. 30 per document ).

Now i simply want to index the tags, so i can search for a tag and retrieve/paginate all results.

# sample data
{"data":{},"tags":{"tag_1":true,"tag_2":true}}

Related Solutions/ what i tried so far:

Current Solution:

In the current solution i create an index of the dictionary (based on the documentation)

CREATE INDEX `idx_data_tags` ON `data`.`_default`.`data`((distinct (array [(`v`.`name`), (`v`.`val`)] for `v` in OBJECT_PAIRS(`tags`) end)))

and use this to search for matches:

SELECT * FROM `data`.`_default`.`data` WHERE (ANY v IN OBJECT_PAIRS(tags) SATISFIES [v.name,v.val] = [$tag ,TRUE] END)  LIMIT 1000 OFFSET 1000

Maybe someone can help a desperate soul. I tried arrarys too, but got better result with this method. If you require any more information, i’m happy to help.

Thanks

Result execution plan (click me)

{
    "plan": {
        "#operator": "Sequence",
        "~children": [
            {
                "#operator": "Sequence",
                "~children": [
                    {
                        "#operator": "DistinctScan",
                        "limit": "1000",
                        "offset": "1000",
                        "scan": {
                            "#operator": "IndexScan3",
                            "bucket": "data",
                            "index": "idx_data_tags",
                            "index_id": "1eeda9ec5baa7833",
                            "index_projection": {
                                "primary_key": true
                            },
                            "keyspace": "data",
                            "namespace": "default",
                            "scope": "_default",
                            "spans": [
                                {
                                    "exact": true,
                                    "range": [
                                        {
                                            "high": "[$tag, true]",
                                            "inclusion": 3,
                                            "index_key": "(distinct (array [(`v`.`name`), (`v`.`val`)] for `v` in object_pairs(`tags`) end))",
                                            "low": "[$tag, true]"
                                        }
                                    ]
                                }
                            ],
                            "using": "gsi"
                        }
                    },
                    {
                        "#operator": "Fetch",
                        "bucket": "data",
                        "keyspace": "data",
                        "namespace": "default",
                        "scope": "_default"
                    },
                    {
                        "#operator": "Parallel",
                        "~child": {
                            "#operator": "Sequence",
                            "~children": [
                                {
                                    "#operator": "Filter",
                                    "condition": "any `v` in object_pairs((`data`.`tags`)) satisfies ([(`v`.`name`), (`v`.`val`)] = [$tag, true]) end"
                                },
                                {
                                    "#operator": "InitialProject",
                                    "result_terms": [
                                        {
                                            "expr": "self",
                                            "star": true
                                        }
                                    ]
                                }
                            ]
                        }
                    }
                ]
            },
            {
                "#operator": "Limit",
                "expr": "1000"
            }
        ]
    }
}

vsr1 · September 8, 2022, 12:22pm

What you have right one.

With CB 7.1+ try this

CREATE INDEX ix1 ON `data`.`_default`.`data`( DISTINCT ARRAY FLATTEN_KEYS(n,v) FOR n:v IN tags END);

SELECT d.* 
FROM `data`.`_default`.`data` AS d
WHERE ANY n:v IN d.tags SATISFIES n = "tag1" AND v = true END;

konstantin_dev · September 8, 2022, 1:36pm

vsr1:

CREATE INDEX ix1 ON `data`.`_default`.`data`( DISTINCT ARRAY FLATTEN_KEYS(n,v) FOR n:v IN tags END);

SELECT d.* 
FROM `data`.`_default`.`data` AS d
WHERE ANY n:v IN d.tags SATISFIES n = "tag1" AND v = true END;

Thanks for the very quick response <3. Will try this out again.

I also found a great article: Arrays in JSON: Modeling, Querying + Indexing Performance , which describes exactly the problem i might have and FTS as a potential solution.

I’ll benchmark both solution and will try to post it here again.

konstantin_dev · September 13, 2022, 6:17pm

Ok wow. This changed everything - in every dimension.
The data must be stored in an array not map - but after this it can be used directly by bleeve

previous:

Disksize of the index: > 1 TB
Index speed > 92h
Query speed (depending on query ~1-60sec)
Query aggregates (e.g. count by tags) > 15min+

Specs from the small test dataset:

22million rows
1500 tags+
average tags ~ 20
data Size ~ 50GB

now:

Disksize of the complete index dropped to 1% !!! 15GB for indexing
Index Speed < 1.5h
Query speed 0.1 - 10s
Query aggregations: out of the box support via facets <10s

Wow. I’m impressed by the performance and feature set of bleve and couchbase.
Huge shoutout to you guys. I’ll write a blog post describing everything a little more clearer.

TL;DR

If you want to search for tags within million of documents i highly recommend storing them in a flat array and use Couchbase FTS (bleve) with the keyword analizer.

{"data":"","tags":["tag_1","tag_2"]}

-> Search -> Quick Index ->  "Index this field as an identifier"

SELECT * FROM data._default.data WHERE SEARCH(`app`,{
"query":{ "analyzer":"keyword","field":"tags","match":"<tag>"},
"explain":"false",
"score":"false",
"size":10,
"sort":"_id",
"fields":["*"]
});

Topic		Replies	Views
Array Indexing in Couchbase 4.5 SQL++	1	1322	September 9, 2016
Data Modeling for Couchbase Couchbase Server data_modelling	2	2610	August 14, 2015
Migration to Couchbase and design question Couchbase Server	2	1941	March 31, 2015
Offset+limit for nested documents in couchbase lite Couchbase Lite	1	919	October 26, 2018
Array indexing in Couchbase server 4.5 Couchbase Server	1	1417	July 22, 2016

How to model tags

TL;DR

Related topics