Slow indexing speed (primary and GSI)

I have a 3-node cluster running on m5a.8xlarge instances on AWS (32 cores, 128 GB). There are about 1.35 billion documents loaded into Couchbase in a single bucket.

I ran into a few issues while loading the data (Couchbase kept using too much memory when given 90% of the memory). I’m currently running with an 82 GB memory quota for the bucket.

I’m using Couchbase community 7.0 beta to try things out.

I defined a couple of indexes on the data:

  • Primary index (CREATE PRIMARY INDEX ON BucketName)
  • Secondary index (CREATE INDEX users_by_email_and_name ON BucketName (email, username) as u WHERE u.$meta.$type = 'Users')

It has been taking forever for them to index.

It seems to be indexing at a rate of under 4K/sec at most, and usually at about half that.
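To put that rate in perspective, here is a rough back-of-the-envelope sketch of how long a full pass over the bucket would take. It assumes a steady rate and that every document has to be processed once, which holds for the primary index; the secondary index only covers 'Users' documents, whose count is not given in this thread. The function name is made up for illustration.

```python
def build_time_hours(doc_count: int, docs_per_second: float) -> float:
    """Hours needed to index doc_count documents at a steady rate."""
    return doc_count / docs_per_second / 3600


DOC_COUNT = 1_350_000_000  # "about 1.35 billion documents" from the post

# Observed peak rate and "usually at half that".
for rate in (4_000, 2_000):
    hours = build_time_hours(DOC_COUNT, rate)
    print(f"{rate} docs/sec: {hours:.1f} hours ({hours / 24:.1f} days)")
```

At the peak rate this works out to roughly 94 hours (almost 4 days) for the primary index alone, and about twice that at the typical rate.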

The nodes aren’t really busy, max CPU usage is around 12%.

I tried giving the indexing process more memory (gave it 20GB), but it didn’t seem to help any here.

I tried creating the indexes while I was inserting the data, but I had node failures because of the memory usage. After I sorted this out, the index was stuck in the warmup state for over 6 hours, so I dropped and re-created it.

Is there something that I’m missing with how to make it go faster?
Is this behavior expected? Am I doing something wrong?

@orendb, for the 2 indexes mentioned in the screenshot, the total data size is ~175GB, and the memory quota is 20GB per node. Once the data size exceeds the available memory, the workload becomes disk IO bound. Please check what kind of disk IO is available on the boxes. SSD is the recommended choice, as indexing throughput can be limited by slow HDD disk IO.

Also, for indexing billions of items, the Enterprise Edition (EE) is much better optimized to handle the workload. With EE and SSD disks, you should be able to see very good throughput numbers.
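As a rough illustration of the memory-versus-data gap described above, the sketch below computes what fraction of the index data could be memory resident. It assumes the ~175GB and 20GB figures quoted in this reply; how the two indexes are actually placed across the three nodes is not stated in the thread, so both extremes are shown. The function name is hypothetical.

```python
def resident_fraction(index_data_gb: float, memory_quota_gb: float) -> float:
    """Fraction of index data that fits in the given memory quota (capped at 1)."""
    return min(1.0, memory_quota_gb / index_data_gb)


INDEX_DATA_GB = 175      # total size of the two indexes, from the reply
QUOTA_PER_NODE_GB = 20   # index memory quota per node, from the reply
NODES = 3                # cluster size, from the original post

# Assumption: all index data ends up on a single node.
one_node = resident_fraction(INDEX_DATA_GB, QUOTA_PER_NODE_GB)
# Assumption: index data is spread evenly over all three nodes.
all_nodes = resident_fraction(INDEX_DATA_GB, QUOTA_PER_NODE_GB * NODES)

print(f"one node: {one_node:.0%} resident, three nodes: {all_nodes:.0%} resident")
```

Under either assumption, well under half of the index data fits in memory, which is consistent with the point that the build would lean on disk IO.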

I’m running this on a gp2 disk on AWS. Given its size, it is getting 16,000 IOPS, so I don’t think that disk IO is the bottleneck.
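For reference, the 16,000 IOPS figure matches the ceiling of the gp2 baseline formula that AWS publishes (3 IOPS per GiB, with a floor of 100 and a cap of 16,000). The sketch below applies that formula; the volume sizes are illustrative only, since the actual volume size is not given in this thread, and the function name is made up.

```python
def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS of a gp2 volume: 3 per GiB, floor 100, cap 16,000."""
    return max(100, min(16_000, 3 * size_gib))


# Illustrative volume sizes (not taken from the thread).
for size in (20, 1_000, 5_334, 8_000):
    print(f"{size} GiB -> {gp2_baseline_iops(size)} IOPS")
```

By this formula, a volume has to be about 5,334 GiB or larger to reach the 16,000 IOPS cap. Note that IOPS is only one limit; the volume's throughput ceiling can also constrain a bulk index build.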

I’m running a benchmark on the system, and the Enterprise edition license doesn’t allow that, so I’m limited to the community option.

@orendb You should be able to download and use Couchbase EE for all your ‘internal’ heavy workload testing and consumption purposes. Could you please try that and let us know if you still see the issue?
Cc @binh.le.

What is the expected sizing for this configuration with the community edition?

@orendb It’s best to try your workload on EE.