Ok wow. This changed everything - in every dimension.
The data must be stored as an array, not a map, but after that it can be used directly by bleve.
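A minimal sketch of that reshaping, assuming the earlier documents kept tags as a map from tag name to some value. The original document shape is not shown in this post, so `tags_map_to_array` and the sample document are hypothetical:

```python
def tags_map_to_array(doc):
    """Return a copy of doc whose "tags" field is a flat, sorted list of tag names."""
    tags = doc.get("tags", {})
    if isinstance(tags, dict):
        # Assumption: only the keys matter; the map values carry no information.
        tags = sorted(tags)
    return {**doc, "tags": list(tags)}


# Hypothetical "before" document with tags stored as a map.
old_doc = {"data": "", "tags": {"tag_2": True, "tag_1": True}}
new_doc = tags_map_to_array(old_doc)
print(new_doc)  # {'data': '', 'tags': ['tag_1', 'tag_2']}
```

Documents that already hold a list of tags pass through unchanged, so the function can be run over a mixed dataset.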
Previous:
- Disk size of the index: > 1 TB
- Indexing time: > 92 h
- Query time: ~1-60 s, depending on the query
- Query aggregates (e.g. count by tags): > 15 min
Specs of the small test dataset:
- 22 million rows
- 1,500+ distinct tags
- ~20 tags per row on average
- Data size: ~50 GB
Now:
- Disk size of the complete index dropped to roughly 1% of what it was: 15 GB
- Indexing time: < 1.5 h
- Query time: 0.1-10 s
- Query aggregations: supported out of the box via facets, < 10 s
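For the "count by tags" aggregation, a facet request along these lines should work. This is only a sketch that builds the search request body. The field name `tags` comes from this post, while the facet name `tag_counts`, the `match_all` query and the helper `build_tag_facet_request` are my own assumptions for illustration:

```python
import json


def build_tag_facet_request(field="tags", top_n=10):
    """Build an FTS search body that asks only for the top_n term counts of a field."""
    return {
        # Assumption: count over all documents; swap in a narrower query as needed.
        "query": {"match_all": {}},
        # No hits are needed for a pure aggregation, only the facet result.
        "size": 0,
        "facets": {
            # "tag_counts" is an arbitrary label for this facet in the response.
            "tag_counts": {"field": field, "size": top_n},
        },
    }


print(json.dumps(build_tag_facet_request(), indent=2))
```

The resulting JSON would be sent to the search endpoint of the index. The block does not send anything itself.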
Wow. I’m impressed by the performance and feature set of bleve and Couchbase.
Huge shoutout to you guys. I’ll write a blog post describing everything in more detail.
TL;DR
If you want to search for tags across millions of documents, I highly recommend storing them in a flat array and using Couchbase FTS (bleve) with the keyword analyzer.
{"data":"","tags":["tag_1","tag_2"]}
-> Search -> Quick Index -> "Index this field as an identifier"
SELECT * FROM data._default.data WHERE SEARCH(`app`,{
"query":{ "analyzer":"keyword","field":"tags","match":"<tag>"},
"explain":"false",
"score":"false",
"size":10,
"sort":"_id",
"fields":["*"]
});