GSI/View first-time creation is read-intensive regardless of filter

Hi,
I am just trying to understand the index creation process.
Whenever we create a view and publish it, I understand that it does a full bucket scan, and this happens regardless of whether the emit is conditional (i.e. covers only a subset of the bucket's docs) or emits all docs in the bucket.

My question is about GSI: does creating any GSI index, even a conditional/restricted one, actually cause all records to be loaded into RAM? (If yes, we may have to create the indexes offline/during non-peak hours.)
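
For concreteness, by conditional/restricted I mean a partial index along these lines (index, bucket, and field names here are made up):

    /* hypothetical partial index: only documents matching the WHERE clause are indexed */
    CREATE INDEX idx_customers ON bucket1(a1)
    WHERE type = "customer"
    USING GSI;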

Also (if yes), for a pure “ad hoc/one-time” query: if the query itself causes a full bucket scan, and creating an index does the same, we may decide to run this one-time query without bothering to create an index to suit the WHERE clause.

We do require a full scan on index creation. One way you can optimize index building is to divide the data up into multiple buckets. Index scope is the bucket, and only the data of the bucket being indexed has to be loaded. You can also use the defer_build facility to build multiple indexes with a single scan: http://developer.couchbase.com/documentation/server/4.1/n1ql/n1ql-language-reference/createindex.html
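
For example (index and attribute names here are just placeholders), you can create several indexes as deferred and then build them together, so the bucket is scanned once:

    /* placeholder index/attribute names; syntax per the 4.1 CREATE INDEX docs linked above */
    CREATE INDEX idx_a1 ON bucket1(a1) WITH {"defer_build": true};
    CREATE INDEX idx_a2 ON bucket1(a2) WITH {"defer_build": true};
    BUILD INDEX ON bucket1(idx_a1, idx_a2) USING GSI;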

thanks
-cihan

Thanks Cihan,

I think it makes sense to defer_build and later build the group of pending indexes in one go. Just a couple of questions to understand the mechanics of indexes (and the effect of operations like compaction):

1: Given Couchbase's distributed, append-only architecture (and frequent compaction), I assume any GSI index will only hold the information of which vbucket the document resides on, i.e. it's not like indexes in Oracle, where the disk page reference is stored in the index (I may be wrong, please correct me).
In that case, if we have (say) 6 million records on each vbucket (probably when a low number of nodes is present), does using an (already created) index mean a full scan of at least the vbucket, if not the cluster?

2: Concerning views: even though I have read that views hold the output of both map and reduce, I strongly believe that reduce runs at the time one queries the view, i.e. at run time. Given the dynamic nature of scatter-gather and re-reduce (which can only happen on one node of the cluster), I am not sure Couchbase can pre-calculate reduce/re-reduce.
[Asking to understand the load on the cluster if we run views at peak times.]

Appreciate your help!

Regards
Rohan

More realistically, we will have a total of approximately 95 million documents in Couchbase.

95M/1024 = 93,000 (approx) documents per vbucket.

The estimated size of each document is 6 KB, which gives, for each vbucket file:

(93K * 6 KB)/1024 = 543 MB (approx) per vbucket file.
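
(As a sanity check, the same arithmetic can be run as a bare N1QL query, e.g. from cbq; the literals are just our estimates:)

    SELECT 95000000 / 1024 AS docs_per_vbucket,           /* ~92,773 */
           95000000 / 1024 * 6 / 1024 AS mb_per_vbucket;  /* ~543 MB */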

If an index points to, say, 2 or 3 vbuckets, we would be scanning through over a GB of data even when using an index!

Appreciate help and corrections with above!

Regards
Rohan

No, we don’t scan vbuckets. Indexes (GSI) don’t align with vbuckets either. Here is how it works: the index holds the attributes you used in the index definition as the index key (so “CREATE INDEX i1 ON bucket1(a1)” would make a1 the index key). The index key is associated with a document key, which gives N1QL the reference to the documents to fetch. So no, it does not mean we are scanning buckets.
Let’s say you run the following query: “SELECT a1 FROM bucket1 WHERE a1 BETWEEN 1 AND 100”. For this, we use the index tree/skiplist to find the a1’s between 1 and 100 and return all the a1 values with the associated document IDs that match the 1-to-100 filter. So in many ways this is very similar to RDBMS indexing, except that we are a distributed system and our global index can reside on a node the admin chooses, as opposed to being distributed and aligned with the data on each node. This is done for scan performance.
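
(A quick way to verify this on your own cluster is EXPLAIN, which prints the chosen access path; with the i1 index above, the plan should show an IndexScan over i1, followed by a Fetch of the matching documents, rather than a full PrimaryScan:)

    /* i1 and bucket1 come from the CREATE INDEX example above */
    EXPLAIN SELECT a1 FROM bucket1 WHERE a1 BETWEEN 1 AND 100;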

Yes, there are parts of the view query that will be calculated just in time with reduce. It is recommended that you leave some compute headroom for view processing. Without the actual code, it is hard to estimate how much just-in-time processing you need to do, so the best option is to try the queries with the index in place.
thanks
-cihan

Thanks Cihan. Even if we get the document ID from the index, the vbucket, which is the most granular (address) information the cluster map holds for a document, would still need to be scanned if the document is not present in RAM. It would actually have to read the whole vbucket from disk, or scan until the document ID is found, which could be first or last (or, in light of fragmentation, it would have to read the entire vbucket anyway, since it is append-only and the latest changes would be towards the end of the vbucket file). So maybe in that sense it scans the vbucket, right?
(But then this would be the case with any key-based call as well, if the document does not reside in RAM.)
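
(For reference, by a key-based call I mean a direct fetch by document ID, e.g. a get() in the SDK or, in N1QL, USE KEYS; the key below is made up:)

    /* hypothetical document key */
    SELECT * FROM bucket1 USE KEYS ["customer::123"];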

The reason for this question is to understand how big the bucket (and consequently each vbucket) can be. We have an estimate of 95 million documents, and as per the calculations above, that makes each vbucket roughly 500 MB. If there are scans such as these, we would split the data into multiple buckets.
Also, we may not have a well-defined working set, since any customer can log in to view his/her details, and we are thus anticipating cache misses and disk reads to be on the higher side.

Regards
Rohan

Just came out of a meeting with Couchbase, so answering my own question on this forum:

Each vbucket file (there are 1024 of them) has a header at the end of it (or the top of it, it does not matter) which is a B-tree index (of some sort) that holds the disk address of each document key in the vbucket file. This vbucket header index of course changes after every compaction, as the addresses change.

Do not confuse this index with a view or GSI index; it is internally created and managed by Couchbase.

Regards
Rohan