How to model tags

Hi i’m quite new to couchbase, but read and benchmarked a lot of possible solutions.
Currently i’m quite frustrated, since i’m not quite happy with my outcome and timing.

Maybe you can help, how i should model my data correctly.

The Problem:

I have a lot of events (100.000.000 every month - stored for 3months), which contain a lot of annotated tags (1900 unique / avg. 30 per document ).

Now i simply want to index the tags, so i can search for a tag and retrieve/paginate all results.

# sample data
{"data":{},"tags":{"tag_1":true,"tag_2":true}}

Related Solutions/ what i tried so far:

Current Solution:

  • In the current solution i create an index of the dictionary (based on the documentation)
CREATE INDEX `idx_data_tags` ON `data`.`_default`.`data`((distinct (array [(`v`.`name`), (`v`.`val`)] for `v` in OBJECT_PAIRS(`tags`) end)))

and use this to search for matches:

SELECT * FROM `data`.`_default`.`data` WHERE (ANY v IN OBJECT_PAIRS(tags) SATISFIES [v.name,v.val] = [$tag ,TRUE] END)  LIMIT 1000 OFFSET 1000

Maybe someone can help a desperate soul. I tried arrarys too, but got better result with this method. If you require any more information, i’m happy to help.

Thanks :slight_smile:

Result execution plan (click me)
{
    "plan": {
        "#operator": "Sequence",
        "~children": [
            {
                "#operator": "Sequence",
                "~children": [
                    {
                        "#operator": "DistinctScan",
                        "limit": "1000",
                        "offset": "1000",
                        "scan": {
                            "#operator": "IndexScan3",
                            "bucket": "data",
                            "index": "idx_data_tags",
                            "index_id": "1eeda9ec5baa7833",
                            "index_projection": {
                                "primary_key": true
                            },
                            "keyspace": "data",
                            "namespace": "default",
                            "scope": "_default",
                            "spans": [
                                {
                                    "exact": true,
                                    "range": [
                                        {
                                            "high": "[$tag, true]",
                                            "inclusion": 3,
                                            "index_key": "(distinct (array [(`v`.`name`), (`v`.`val`)] for `v` in object_pairs(`tags`) end))",
                                            "low": "[$tag, true]"
                                        }
                                    ]
                                }
                            ],
                            "using": "gsi"
                        }
                    },
                    {
                        "#operator": "Fetch",
                        "bucket": "data",
                        "keyspace": "data",
                        "namespace": "default",
                        "scope": "_default"
                    },
                    {
                        "#operator": "Parallel",
                        "~child": {
                            "#operator": "Sequence",
                            "~children": [
                                {
                                    "#operator": "Filter",
                                    "condition": "any `v` in object_pairs((`data`.`tags`)) satisfies ([(`v`.`name`), (`v`.`val`)] = [$tag, true]) end"
                                },
                                {
                                    "#operator": "InitialProject",
                                    "result_terms": [
                                        {
                                            "expr": "self",
                                            "star": true
                                        }
                                    ]
                                }
                            ]
                        }
                    }
                ]
            },
            {
                "#operator": "Limit",
                "expr": "1000"
            }
        ]
    }
}

What you have right one.

With CB 7.1+ try this

CREATE INDEX ix1 ON `data`.`_default`.`data`( DISTINCT ARRAY FLATTEN_KEYS(n,v) FOR n:v IN tags END);

SELECT d.* 
FROM `data`.`_default`.`data` AS d
WHERE ANY n:v IN d.tags SATISFIES n = "tag1" AND v = true END;
1 Like

Thanks for the very quick response <3. Will try this out again.

I also found a great article: Arrays in JSON: Modeling, Querying + Indexing Performance , which describes exactly the problem i might have and FTS as a potential solution.

I’ll benchmark both solution and will try to post it here again.

Ok wow. This changed everything - in every dimension.
The data must be stored in an array not map - but after this it can be used directly by bleeve

previous:

  • Disksize of the index: > 1 TB
  • Index speed > 92h
  • Query speed (depending on query ~1-60sec)
  • Query aggregates (e.g. count by tags) > 15min+

Specs from the small test dataset:

  • 22million rows
  • 1500 tags+
  • average tags ~ 20
  • data Size ~ 50GB

now:

  • Disksize of the complete index dropped to 1% !!! 15GB for indexing
  • Index Speed < 1.5h
  • Query speed 0.1 - 10s
  • Query aggregations: out of the box support via facets <10s

Wow. I’m impressed by the performance and feature set of bleve and couchbase.
Huge shoutout to you guys. I’ll write a blog post describing everything a little more clearer.

TL;DR

If you want to search for tags within million of documents i highly recommend storing them in a flat array and use Couchbase FTS (bleve) with the keyword analizer.

{"data":"","tags":["tag_1","tag_2"]}
-> Search -> Quick Index ->  "Index this field as an identifier"
SELECT * FROM data._default.data WHERE SEARCH(`app`,{
"query":{ "analyzer":"keyword","field":"tags","match":"<tag>"},
"explain":"false",
"score":"false",
"size":10,
"sort":"_id",
"fields":["*"]
});

5 Likes