We have a Couchbase cluster that has brought us great pain, mostly during rebalancing, where nodes start failing one after another.
Main facts:
- Hosted on 6 AWS r3.xlarge instances
- 6 * 30.5 = 183 GB RAM
- 6 * 4 = 24 vCPUs
- Each node has two 500 GB io1 1000 IOPS SSDs combined with LVM, for a total of 6 * 2 * 500 GB = 6 TB
- One big bucket with:
  - 6 * 28 = 168 GB bucket quota total
  - 2.55 TB disk usage out of the 6 TB total
  - 20 million items
  - 2-day document expiration (TTL)
  - 2 replicas
  - 600 ops per second (300 gets, 300 sets)
The problem is that every once in a while (we are on the cloud, after all) we lose a node. When this happens we reboot the failing node (which changes the underlying host), add it back, and rebalance. This rebalance takes about 6 hours. That is a problem on its own, but most of the time the rebalance doesn’t even complete, because in the meantime we start losing more nodes and the rebalance stops. In the end we accept that we’ll lose all the data, drop the bucket, rebalance the remaining cluster in a minute, recreate the bucket and start clean.
In the monitoring tab there are a few metrics that seem “dangerous”:
- 1.15% active docs resident
- Disk write queue peaks at 10K+ items during high-traffic hours
But our cache miss ratio is 5%, which is decent enough for our use.
The healthcheck utility also complains about more things, for example:
- Average item loaded time ‘7.545 ms’ is slower than ‘500 us’
- Replica resident item ratio ‘0.99%’ is below ‘20.00%’
- Number of backlog items ‘5.89 quadrillion’ is above threshold ‘100 thousand’
- Number of backlog item to active item ratio ‘2186457269979.22%’ is above threshold ‘30.0%’
- Total memory fragmentation ‘5.732 GB’ is larger than ‘2 GB’
To tackle the problem we’re trying to follow the sizing guidelines mentioned in the documentation, but we’re at a loss even there. The first number we have to calculate is the value size. Since we cannot easily measure it from the client app, we estimate it as the total disk usage divided by the number of items (including replicas): roughly 2.55 TB / 60M, which is about 40 KB. The next controversial number is the working_set_percentage. Based on the client app’s usage we’re OK with 1%, since that is the part of the data that is accessed 90% of the time; the rest falls on the long tail and we don’t mind if it’s a bit slower.
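As a sanity check, here is that back-of-the-envelope estimate as a tiny Python snippet. It assumes the 2.55 TB on disk already contains all three copies of every item and ignores couchstore overhead and fragmentation, so take it as a rough bound rather than a measured value size:

```python
# Rough per-document value size, estimated from total disk usage.
# Assumption: the 2.55 TB on disk covers active data plus both replicas
# (3 copies of each item); couchstore overhead/fragmentation is ignored.
disk_usage_bytes = 2.55 * 1000**4   # 2.55 TB
items = 20_000_000                  # active items in the bucket
copies = 3                          # 1 active + 2 replicas

value_size = disk_usage_bytes / (items * copies)
print(f"~{value_size / 1024:.1f} KiB per value")  # ~41.5 KiB, i.e. "about 40 KB"
```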
Plugging this data in we get the following:
Inputs:
- documents_num: 19,108,793
- ID_size: 36 bytes
- value_size: 46,080 bytes
- number_of_replicas: 2
- working_set_percentage: 1%
- per_node_ram_quota: 30,064,771,072 bytes (28 GB)
- Metadata per document: 56 bytes
- SSD or Spinning: SSD
- headroom: 30%
- High Water Mark: 85%

Calculated:
- no_of_copies: 3
- total_metadata: 5,274,026,868 bytes (~4.9 GB)
- total_dataset: 2,641,599,544,320 bytes (~2.4 TB)
- working_set: 26,415,995,443.20 bytes (~24.6 GB)
- Cluster RAM quota required: 48,467,092,946.54 bytes (~45.1 GB)
- number of nodes: 1.61
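To make the arithmetic reproducible, here is the same calculation as a small Python sketch. The formulas are inferred from the numbers in the table (they reproduce every derived value), so treat it as a sketch of the sizing guideline rather than the official calculator:

```python
# Couchbase RAM sizing arithmetic, reconstructed from the table above.
# All sizes are in bytes unless noted.
documents_num      = 19_108_793
id_size            = 36            # average key length
value_size         = 46_080        # ~45 KiB average value
number_of_replicas = 2
working_set_pct    = 0.01          # 1% of the data kept resident
per_node_ram_quota = 28 * 1024**3  # 28 GB bucket quota per node
metadata_per_doc   = 56
headroom           = 0.30          # SSD headroom
high_water_mark    = 0.85

no_of_copies   = 1 + number_of_replicas                                       # 3
total_metadata = documents_num * (metadata_per_doc + id_size) * no_of_copies
total_dataset  = documents_num * value_size * no_of_copies
working_set    = total_dataset * working_set_pct

cluster_ram_quota = (total_metadata + working_set) * (1 + headroom) / high_water_mark
nodes_needed      = cluster_ram_quota / per_node_ram_quota

print(f"cluster RAM quota required: {cluster_ram_quota / 1024**3:.1f} GB")  # ~45.1 GB
print(f"number of nodes: {nodes_needed:.2f}")                               # ~1.61
```

Re-running it with working_set_pct = 0.20 (the resident ratio the healthcheck asks for) gives roughly 27 nodes, which is where the figure below comes from.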
So… that says we should only need 2 nodes??? On the other hand, maybe aiming for a 1% working set percentage is not a good idea. If, as the healthcheck suggests, we have to aim for 20%, that pushes the number of nodes to 27, which is outside our price range. So maybe Couchbase is not made for our use case? Or maybe memory is not the issue here after all, but CPU and disk I/O are?
What we plan to do now is to try to temporarily improve the situation by moving to double the number of nodes on half-size instances (r3.large), dropping to 1 replica, and then revisiting the situation given any feedback we get here.
Thanks and sorry for the long post!