Creating the SGI faster

I have 150 million documents in one bucket. Any advice on how to make Standard Global Index creation faster? Currently it is taking more than 20 minutes, which seems excessive compared to the relational database we have.
My index is a composite index on 5 keys, and each document has about 40 fields/keys.



@eldorado : Can you please share which version and edition you are using?

I am using the EE edition with CB 6.0.
I just want to know the recommended way to create an index faster. I have 150 million records and can't keep waiting through index experiments where I may need to recreate the index multiple times after looking at the EXPLAIN plan. I can increase index threads, and our ultimate goal is an index over 15 billion records, so a memory-optimized index is not a great option because of its larger hardware footprint. Please advise. We have a 5-node CB cluster running on a K8s platform (4 hosts).

@eldorado, for Standard GSI, the recommended memory resident percentage is 20%. Based on the size of the indexed fields (and indexing overheads), that should give you an idea of how much memory would be required.
Maybe you can first index 10M items, check how many resources that needs, and extrapolate the numbers from there.

Regarding CPU, you can start with something like 20 cores, and if the indexer is saturating that, allocate more. As the workload becomes disk-bound, SSD bandwidth will determine the indexing throughput.

If resources on a single box are limited, the index can be partitioned. This can improve indexing throughput significantly by utilizing resources (memory/CPU/disk I/O) across nodes.
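For reference, a partitioned index is declared with a PARTITION BY HASH clause. A minimal sketch, shown as an N1QL string in Python; the bucket name, field names, and WITH options here are placeholders, not identifiers from this thread:

```python
# Hypothetical N1QL DDL for a hash-partitioned composite index. "mybucket"
# and f1..f5 are placeholders; "num_partition" controls how many partitions
# get spread across the index nodes.
partitioned_ddl = (
    "CREATE INDEX idx_composite ON mybucket(f1, f2, f3, f4, f5) "
    "PARTITION BY HASH(META().id) "
    'WITH {"num_partition": 8}'
)
print(partitioned_ddl)
```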

Thanks, but even a simple covering or partitioned index takes 4-6 hrs to create on this data volume. We have TBs of data, and I am simply asking for any method or guideline to increase index threads during creation to make it faster. It is hours and hours of waiting just to see what the right indexing approach is, and I am trying to see if that can be avoided. Our index is standard GSI, not memory-optimized. We are not concerned about hardware resources right now, but we need to know a FASTER approach to create the index.

@eldorado, what is the node spec in terms of CPU, memory, disk bandwidth? How much memory is being allocated to the index service? What is the CPU/disk IO usage observed when the index is being built? What is the memory resident percent of the index after it has been built fully?

PS: If you are already an enterprise customer, you can get in touch with the Couchbase Professional Services team to help you with sizing.

We have an on-premises Kubernetes platform with a 4-node Couchbase cluster and 80 GB RAM on each node. The index is standard GSI, NOT memory-optimized. Each node has 10 CPUs and 150 GB storage. The index service is not isolated via MDS; all services are available across all nodes.

# The amount of memory that should be allocated to the data service
dataServiceMemoryQuota: 40960
# The amount of memory that should be allocated to the index service
indexServiceMemoryQuota: 10240
# The amount of memory that should be allocated to the search service
searchServiceMemoryQuota: 2048
eventingServiceMemoryQuota: 10240
# The amount of memory that should be allocated to the analytics service
analyticsServiceMemoryQuota: 10240
indexStorageSetting: plasma

- metadata:
    name: couchbase
  spec:
    accessModes:
      - ReadWriteOnce
    storageClassName: pure-block
    resources:
      requests:
        storage: 150Gi

I will post the rest of the usage statistics once I can capture them.
I am a potential EE customer, currently doing POCs across different document stores such as MongoDB and Couchbase.
If we cannot get good community help to prove out the POC, we will need to look at other database options. Right now, indexing is the big challenge with Couchbase for us, because we need practical guidance; building indexes for hours on end is not helpful. Is there a more modular approach?
Do you have practical guidelines on the best way to leverage all threads to improve index creation performance? Also, is there any performance benefit in the 6.5 version? I know it is still in beta.

My best guess, without the statistics, is that the index service doesn't have enough CPU to go faster. As the data service and index service are colocated, there could be CPU contention while the index is building. Since MDS is the recommended setup for production, you could change the POC setup to follow that as well, so you get more realistic numbers.

You could probably try first allocating one node for only index/query, with a 50GB memory quota for the indexer. If there is a box with better CPU, that would help significantly. Just to give you some numbers from our internal benchmarks of Standard GSI index build times:

  1. 500M docs, 50% memory resident - build time = 18 mins (CPU usage = 35 cores)
  2. 500M docs, 20% memory resident - build time = 38 mins (CPU usage = 25 cores)

These benchmarks were run on boxes with SSDs (500MB/sec I/O throughput).

The basic guideline is to have enough memory to keep the index at least 20% memory resident, and to have 25-30 CPU cores with good SSDs.


Hi @deepkaran.salooja - I am certainly excited to try these options, but I am currently stuck getting an MDS configuration for my multi-node CB cluster installation on Kubernetes. Somehow the couchbase-cluster.yaml config is restricting me on MDS. Once I have that working, I will definitely try this.
If you don't mind, could you check whether the node allocations and pod quotas for CPU and memory are good or not? Here is the reference in another thread.

Also, we have pure-block and pure-file storage for data/index and everything else. Are you saying we need to put the index on SSD-based hardware? Do you think indexing will realistically perform well for queries over 150 billion documents?

Thanks, and I appreciate your help.

@eldorado, cpu: "30", memory: 50Gi seems like a reasonable starting point for 150M documents. Once the index gets built, you can look at the memory resident percent in the UI graphs (keeping it at 20% memory resident should be good).

pure-block should be fine. What matters is the I/O bandwidth you can get; you can observe it using iostat etc.

It depends on what kind of queries you want to run. If the queries need to scan a subset of the data, the index service provides working-set management by caching the most-used index data in memory. You can also benefit from creating "partial indexes" for specific query needs, to better manage the TCO.
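For example, a partial index only covers documents matching its WHERE clause, which shrinks the index footprint. A sketch with placeholder names (again shown as an N1QL string in Python):

```python
# Hypothetical partial index: only documents satisfying the WHERE predicate
# are indexed. Bucket, field, and value names are placeholders.
partial_ddl = (
    "CREATE INDEX idx_active ON mybucket(customer_id, order_date) "
    "WHERE type = 'order' AND status = 'active'"
)
print(partial_ddl)
```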

If the queries are more of “analytical” nature i.e. the requirement is to scan the whole 150B documents before answering a query, you may need to try it out with the “analytics” service if query/index service cannot meet the SLA.

@deepkaran.salooja - Is there any way to change the memory resident % of the index, and where should I look for it? In the statistics section I only see index stats for the PRIMARY KEY indexes, not the covering index created for my bucket.
Are you saying I should see that number in the "Max Index RAM Used %" section of the server stats?
Also, just for my own understanding: during index creation, does the Advanced Index threads configuration make any difference?

@eldorado, memory resident percent is available per index on the UI. Look for “cache resident percent” mini graph in the UI. More details here:

If the memory resident is too low, you can allocate more quota to index service under Settings or partition the index.

Also, just for my own understanding: during index creation, does the Advanced Index threads configuration make any difference?

By default, the index service sets index threads equal to the number of CPU cores on the system. Unless you have set it to a lower number, the default (value 0) should work fine.
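If you do want to inspect or change that setting outside the UI, it is exposed as indexerThreads on the cluster's /settings/indexes REST endpoint. A hedged sketch that only builds the request (the host is a placeholder, and 0 means "use all CPU cores"):

```python
# Sketch: changing the GSI "indexerThreads" setting via the cluster REST API
# (POST /settings/indexes). "cb-node" is a hypothetical address; this only
# constructs the endpoint and form payload, it does not send anything.
from urllib.parse import urlencode

host = "http://cb-node:8091"
endpoint = host + "/settings/indexes"
payload = urlencode({"indexerThreads": 0})
# Roughly: curl -u Administrator:password -X POST <endpoint> -d '<payload>'
print(endpoint, payload)
```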

@deepkaran.salooja, thanks, I got it. Of my 4 index nodes, one shows 8% cache resident and the rest show 0, since the index was created only on that node (I guess) and it is not a partitioned index either. You say we can allocate more quota, but that means editing the cluster YAML unless there is a faster way (when I try it in Settings, it always reverts to the previous number and doesn't preserve my changes).
Also, each node has a 10 GB index quota allocated in Settings, so technically the 4 nodes can hold 40 GB of index in total.
In my case, for an index on 180M indexed items, one node is at 8% cache resident with 7 GB memory used, a 9 GB data size, and 31% fragmentation. Do you see any health issue here?
Note that the target is index availability over 15B records; do you think it is reasonable to make the index partitioned?

It would be a good idea to create the index as a partitioned index; for 15B items, a single node may not have enough resources. Another option you could consider is a partial index with a WHERE clause to reduce the number of items in the index, but that depends on how you use the index and whether it makes sense for the query you intend to run.

What you see is expected for a non-partitioned index. It can only reside on a single node.

I think you may need to edit your cluster YAML (I am not very familiar with that area). If the cluster hasn't been allocated the resources, you may not be able to increase the allocation from UI -> Settings.

thanks @deepkaran.salooja
Maybe you can help me understand one more issue, regarding data load.
When loading data through Python scripts from Oracle to CB, we see very low ops/sec: around 1K ops/sec or less.
I am guessing this is the speed at which each document is ingested from Oracle into CB. I see ~20-30% CPU usage on all 3 data nodes.
How can we push the ingestion rate to be much faster (not sure if the network is the bottleneck here)?

We have sufficient CPU and memory across all pods, so the exact bottleneck is not known. We have Pure Storage, so an I/O bottleneck should not be the problem.
Any pointer will be helpful .


I am no expert in Oracle-to-Couchbase migration, but for the slow ops/sec I think we should first identify where the bottleneck is. What is the average size of your documents, and what is the available network bandwidth? With those, we can roughly estimate whether the network is the bottleneck.

Also, I see three stages in the pipeline: reading data from Oracle, processing it in the Python script (if any processing is done), and uploading data to Couchbase. Can you let us know the ops/sec at the end of each stage of the pipeline?
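One way to get those numbers is to time each stage in isolation. A rough harness sketch; the stage functions are stand-ins for "read from Oracle", "transform", and "upsert into Couchbase", not real connectors:

```python
# Sketch: measure each pipeline stage independently to locate the bottleneck.
import time

def docs_per_sec(stage_fn, batches):
    """Run one stage over all batches and return its throughput in docs/sec.
    stage_fn takes a batch and returns the number of documents it handled."""
    start = time.perf_counter()
    total = sum(stage_fn(batch) for batch in batches)
    return total / (time.perf_counter() - start)

# Runnable demo with a dummy stage (len just counts the docs in a batch):
batches = [list(range(1000)) for _ in range(10)]
rate = docs_per_sec(len, batches)
print(rate > 0)  # True
```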


Thanks @varun.velamuri - sure, we will get some data for you.
Coming back to the actual topic of slow index creation that I was discussing with @deepkaran.salooja:
On my 180M docs, creating the primary index took 7 minutes, and only 1 node was 100% busy since I hadn't partitioned it.

However, when I created a partitioned index, it took 12 minutes.
Looking at "Cache Resident Percent" in the index stats, it was at 95% while index creation was in progress. I have 4 nodes (all services), and all 4 nodes were busy with around 40% Max Index RAM Used.

My idea was to split the index-creation load across the 4 nodes in parallel so it would be faster, which is not what I am seeing here.
I am not sure why the partitioned index build is not faster.


Among the many benefits of partitioned indexes, one is horizontal scaling per index: if a single node does not have enough resources to fit the entire index, a partitioned index can help. There is additional work done to achieve this, e.g., for each document, HASH(META().id) has to be computed and the document routed to the corresponding indexer node for indexing.
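To illustrate only the routing idea (Couchbase's actual internal hash function differs), a toy sketch of hash-based partition routing, which is the per-document work added during a partitioned build:

```python
# Concept illustration: each document id hashes to a partition number, which
# determines the indexer node that stores its entry. crc32 is a stand-in for
# whatever hash Couchbase actually uses.
import zlib

def partition_for(doc_id: str, num_partitions: int = 8) -> int:
    return zlib.crc32(doc_id.encode()) % num_partitions

print(0 <= partition_for("order::1001") < 8)  # True
```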

In this case, whether the index is in DGM or not will influence the build time. For the use case you are trying, what was the resident percent during the build?

For comparing the build times, I would try the following: Increase the number of documents (or) reduce the quota so that index goes to DGM. With this new configuration, check the partitioned index build time.


@varun.velamuri - Understood.
I already have 200M documents on which I am trying to build the index.
As explained in the link below, I am getting rid of the primary indexes.

So I tried creating a covering index (non-partitioned), and I see the cache memory resident on one node at ~5-7% while the index build is in progress.
I understand the 20% memory-resident guideline for the index, but that is for DGM and Plasma to work best. I have a 10 GB RAM quota on each node, so a 40 GB total RAM quota for the index service, plus a separate 20 GB per-node data service RAM quota.
How do I know upfront how much memory the index will take, so that I can plan the indexServiceMemoryQuota?


I was checking the index build times on our perf cluster for both a partitioned and a non-partitioned index. The test builds a secondary index. When indexing 100M docs, the partitioned index build was roughly 3.5 times faster than the non-partitioned one (the non-partitioned index took 580 seconds, the partitioned index 168 seconds). This is possibly due to our cluster configuration. So I think we need to understand more about the test you are running and why you see this performance. Can you share the cbcollect logs so that we can study and understand the reason for such a build time?

For the indexServiceMemoryQuota, it depends on a lot of factors like doc size, number of indexes, type of index, etc. I can give you a rough idea, but this may not be completely accurate. If you are indexing 10M docs, and the avg. size of an indexable value is, say, 100 bytes, then for close to a 100% resident ratio the data size with overheads comes to around (10M * (100 + 56) * 2) ~= 3G. I would set the quota to 4G to accommodate the indexer RSS. This is just a rough calculation, as there are many other factors that should be considered.
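As a quick sanity check of that arithmetic (the 56-byte per-entry overhead and the 2x factor are the assumptions stated above, not exact Couchbase internals):

```python
# Rough GSI sizing: items * (value_size + overhead) * 2 for ~100% residency.
items = 10_000_000
value_size = 100      # avg indexable value, bytes
overhead = 56         # assumed per-entry overhead, bytes
data_size = items * (value_size + overhead) * 2
print(round(data_size / 1024**3, 1))  # ~2.9 GiB, so a 4G quota leaves headroom
```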

PS: If you are already an enterprise customer, you can get in touch with the Couchbase Professional Services team to help you with sizing.