Couchbase Query question

When performing a query in Couchbase,
if the WHERE clause is “WHERE name = ‘test’”,
does it fetch only the ‘test’ data from the block?
Or does it fetch the entire block containing ‘test’?

※ Assuming the index is created incorrectly, resulting in additional data retrieval from the Data Node

There is no concept of “block”; if we extend that to mean a collection then:

If it is using say a primary index or sequential scan (7.6+), then Query fetches all documents from the KV and applies the filter to them discarding those that don’t qualify. (Of course if your filter is suitable and on the document key, then the primary scan can perform filtering just like any index.)

If you have a secondary index supporting some or all of the filtering then just the document keys returned by index filtering are fetched.

[Edit: If there is an index on name but you’re filtering on both name and age, then all qualifying names are returned by the index scan, the documents are fetched and then the age filter is applied.]

(This is largely why correct index creation is integral to superior performance.)

If you have no suitable secondary index and no primary index and lack permission (you need a specific privilege to use sequential scans) the query will fail with an error indicating there was no index available to facilitate access.

Please see Primary and Secondary Index Reference | Couchbase Docs for more information.

HTH.

To my knowledge, the magma bucket has a concept called magmaSeqTreeDataBlockSize. Doesn’t this mean that data is stored in blocks? If there is a block size setting, I assume that data is also fetched in blocks or as data KV values.

I would like to know how it is stored in memory or on disk.

The Magma storage engine is proprietary so I’m unsure anyone will divulge details of the internal workings to you here, but I’ll leave that for others to respond to.

Needless to say as it is internal to the Data service and doesn’t affect how the Query service processes a query.

To answer your question applied to your assertions regarding the Data service’s storage engine:

  • if the WHERE clause is “WHERE name = ‘test’”, does it fetch only the ‘test’ data from the block
  1. The where clause is not directly relevant. It is a concept in the Query (& indirectly the Index) service only.
    1.1 “Not directly” as between Query and Indexing it is transformed into a list of candidate document keys. The filtering value is never sent to the Data service, only the list of keys.
  2. Regardless of the internal storage engine, only the specific document keys requested are returned by the Data service.
    2.1 Regardless of the storage engine, as illustrated here, the Data service will attempt to return individual documents from is memory cache.

If the index is created as CREATE INDEX country_idx ON demo.docker.coll(key, point)

Queries:

  1. SELECT * FROM coll WHERE key = 'emp_1' AND custid = 'emp_1' AND name = 'Jung success'
  2. SELECT * FROM coll WHERE key = 'emp_1' AND point = 3911

Both queries use the country_idx index. The country_idx index is created with key and point, but the first query includes additional conditions with custid and name. Doesn’t the first query need to fetch data from the Data service? While the second query can directly fetch data from the Index service,

In the case of the first query, if it needs to fetch additional data from the Data service, I am curious whether it fetches the block or just the K/V data containing the required data.

Or is my understanding completely incorrect?

Qyery Plan
1.

This is as I noted. Query via the Index service will obtain a list of all document keys in the index with the attribute key equal to 'emp_1'. [1] This entire list of document keys is sent to the Data service and all those - but only those - documents are returned. There are no spurious additional documents returned just because perhaps they were adjacent on disk.

Obviously this is likely more documents than will ultimately qualify when all filters are applied.

The FILTER then applies all the conditions to those returned documents and discards those that don’t match. (“All” because even the index keys are re-applied as the index is updated asynchronously so may be out of date relative to the document.)

You can see how much inefficiency there is by observing the difference between “#itemsIn” and “#itemsOut” in the FILTER operator.

The selecting of “*” in both your statements means fetching is required. If you were selecting say just point, then the second statement would be “covered” (means all data required for the query is available in the index) and the FETCH wouldn’t take place. But the first statement would still require the FETCH (to apply the filters) and would fetch the same set of documents.


[1] If you lack any even partially applicable secondary index but have a primary index (or may run a sequential scan) then all document keys notionally “qualify” for further filtering - all documents are fetched & filtered. This is why primary indices and sequential scans are generally (obviously exceptions exist) best suited to small collections and/or non-production environments.

To supplement my prior responses:

If you have the documents:

d1:{"a":1,"b":1,"c":1}
d2:{"a":1,"b":2,"c":1}
d3:{"a":1,"b":1,"c":2}
d4:{"a":2,"b":1,"c":1}

And an index on (a,b) which is conceptually organised as:

                                   Root
                                     |
             +-----------------------+-----------------------+-----------------------+
             |                                               |                       |
            a=1                                             a=2                     ...
             |                                               |
   +---------+---------+---------+                 +---------+---------+
   |                   |         |                 |                   |
  b=1                 b=2       ...               b=1                 ...
Keys:d1,d3           Keys:d2                      Keys:d4

If you run a query with WHERE a=1 AND c=1, it is established (in Query) that the index supports only the “a=” filter and the scan of the index stops at the “a=” level. It returns all keys that exist under the qualifying “a=” level branches. In this case that is the one branch (since the filter is an equality condition) and keys d1,d3 and d2 are all returned.

All 3 of the documents are fetched.

The filtering then applies “a=1” to all; all still match.
Then “c=1” is applied matching the documents with keys “d1” and “d2”, document “d3” is discarded.

So if by “block” you mean “all keys in the portion of the index below where filtering on the index stops”, then yes, this is what is returned by the scan and therefore what is fetched. The Data service storage has no bearing on this.

Only the data itself controls how many keys are in this portion of the index and thus how many are returned.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.