I’m working on optimizing some queries with new indexes and ran into a question regarding array searches.
In many parts of our system, when we need to check whether a value exists inside an array field (for example, searching for a string within an array), we typically use the simple syntax:
$value IN `fieldname`
This works correctly, but when I started creating indexes that include array fields using the syntax:
DISTINCT `fieldname`
I noticed that the Index Advisor does not show these indexes being used for queries that rely on the IN operator. This made me wonder: Is the IN syntax efficient when the element being searched is dynamic, or should we prefer using functions like CONTAINS() or the more explicit:
ANY v IN fieldName SATISFIES v = $value END
My current understanding is that the ANY … SATISFIES form is more index-friendly because it maps directly to array index entries and it should be recognized by the index, whereas IN may not leverage these indexes when the search value is dynamic.
For reference, here’s an example query I’m trying to optimize. Currently, it falls back to using a generic index:
SELECT DISTINCT N1, N2, ARRAY_AGG(N3)
FROM bucket
WHERE Field1 = 'fixed value'
AND Field2 = $value2
AND ($value3 IN ArrayField3
OR $value4 IN ArrayField4
OR $value5 IN $ArrayField5)
AND Field6 = false
GROUP BY N1, N2
If index key is filedName is scalar most likely you must search value using equal (OR multiple values)or range predicate
filedName IN [1,2,3] same as (fieldName = 1 OR fieldName = 2 OR filedName = 3)
The index will chosen
If filedName is ARRAY or ARRAY of objects indexing as scalar will not choose index. As Index has whole value and you are looking part of that. In that case ARRAY indexing comes
Index DISTINCT `fieldname`. This allows index create separate entry for each value of ARRAY and able to scan
Query:
ANY v IN fieldName SATISFIES v = $value END
ANY v IN fieldName SATISFIES v IN [$value1, $value2] END
CREATE INDEX ix3 ON bucket(Field1, Filed2, Filed6, DISTINCT ARRAY v3 FOR v3 IN ArrayField3 END);
CREATE INDEX ix4 ON bucket(Field1, Filed2, Filed6, DISTINCT ARRAY v4 FOR v4 IN ArrayField4 END);
CREATE INDEX ix5 ON bucket(Field1, Filed2, Filed6, DISTINCT ARRAY v5 FOR v5 IN ArrayField5 END);
SELECT N1, N2, ARRAY_AGG(N3)
FROM bucket
WHERE Field1 = 'fixed value'
AND Field2 = $value2
AND (ANY v3 IN ArrayField3 SATISFIES v3 = $value3
OR ANY v4 IN ArrayField4 SATISFIES v4 = $value4
OR ANY v5 IN ArrayField5 SATISFIES v5 = $value5)
AND Field6 = false
GROUP BY N1, N2
NOTE: Index can have 1 Array index Field, otherwise due to each array filed as separate can result in cartesian product and index can be huge.
As I suspected, in order to force the use of an array index, I need to use the ANY … SATISFIES syntax. If I use the CONTAINS function maybe it uses the array index too, but I need to test it because I don’t remember.
In our system, there are many fields that are arrays of strings whose values can change over time. Whenever we need to search for an element within one of these arrays, we currently use the syntax: WHERE value IN ArrayFieldName
If I understand correctly, the IN clause is internally translated into a series of OR conditions. Is that correct?
If we continue using the IN condition in these queries, is there a way to make them faster by creating some kind of index? Or is the only viable solution to update all queries to use the ANY … SATISFIES syntax and then create array indexes on those fields?
1)Sorry, what I mean was the ARRAY_CONTAINS function of the arrays, not the CONTAINS string function. I don’t know if using the ARRAY_CONTAINS is the same as IN or ANY SATISFIES.
2)So I can assume that the IN condition doesn’t use any type of index, not only the array index. The array condition (IN) is always checked after the fetch of the document and it is quiet fast for arrays of max 100-200 elements.
3)If I use something like WHERE ‘value’ IN [Field1, Field2, Field3] I can assume that it is the same as
(value = Field1 OR value = Field2 OR value = Field3) since the Field1/2/3 are string fields.
Thanks for the support. I swear that these are the last questions xD
ARRAY_CONATINS() functionally same as IN or ANY but one can’t index because ARRAY_CONTAINS() function has constant you looking, if index with constant there will be no use of index because it can work only one value. Any such thing post index scan/fetch applied.
IN can be used without any index, constant on left side it will be applied post index scan/fetch. It can use index if left side in scalar field and right side constant, If field is array it must create array index and use ANY clause in query
If you are looking one value in multiple fields with OR
one can use few techniques
CREATE INDEX ix1 ON default ( DISTINCT ARRAY v FOR v IN [Field1, Field2, Field3] END);
SELECT *
FROM default d
WHERE ANY v IN [Field1, Field2, Field3] SATISFIES v IN [val1, val2] END;
If Field1,Filed2,FIeld3 you can use
CREATE INDEX ix1 ON default ( DISTINCT ARRAY v FOR v IN ARRAY_CONCAT(Field1, Field2, Field3) END);
SELECT *
FROM default d
WHERE ANY v IN ARRAY_CONCAT(Field1, Field2, Field3) SATISFIES v IN [val1, val2] END;
Hello @vsr1 , sorry to bother you again, but I have a strange behaviour in couchbase and I don’t know how to solve it and why it gives me this problem. I have this query
SELECT *
FROM bucket
WHERE Type = 'test'
AND Project = 'test'
AND ( any v in ArrayField1 satisfies v = 'abb' end
OR any v in ArrayField2 satisfies v = 'abb' end)
and IsDeleted = false
I tried to create the indexes for various queries without problem, but with this one I get a weird error. I tried to use both my indexes or the indexes suggested by the index advisor, but the query keeps using the old index. Also if I try to use the use index() syntax I can’t force to use these new indexes. For reference these are my new indexes
CREATE INDEX adv_1 ON `bucket`(DISTINCT ARRAY `v` FOR `v` IN `ArrayField1` END,`Project`) WHERE `Type` = 'test' AND `IsDeleted` = false
CREATE INDEX adv_2 ON `bucket`(DISTINCT ARRAY `v` FOR `v` IN `ArrayField2` END,`Project`) WHERE `Type` = 'test' AND `IsDeleted` = false
But it still use this index
CREATE INDEX adv_3 ON `bucket`(`Type`,`Project`,`IsDeleted`,`ID`)
The fun side is that if I only use one of the condition in OR it uses the correct index. And I tried with other similar queries and I don’t have the same problem, but uses the correct indexes.
For example this query use the correct index adv_1
SELECT *
FROM bucket
WHERE Type = 'test'
AND Project = 'test'
AND any v in ArrayField1 satisfies v = 'abb' end
and IsDeleted = false
Move Project leading (as it is common) in adv_1, adv_2
Also try
FROM bucket USE INDEX(adv_1, adv_2)
If looking both arrays v = “abb”
CREATE INDEX ix1 ON `bucket`( Project, DISTINCT ARRAY `v` FOR `v` IN ARRAY_CONCAT(IFMISSINGORNULL(`ArrayField1`,[]), IFMISSINGORNULL(`ArrayField2`,[])) END,`Project`) WHERE `Type` = 'test' AND `IsDeleted` = false;
SELECT *
FROM bucket
WHERE Type = 'test'
AND Project = 'test'
AND any v in ARRAY_CONCAT(IFMISSINGORNULL(`ArrayField1`,[]), IFMISSINGORNULL(`ArrayField2`,[])) satisfies v = 'abb' end
and IsDeleted = false