We have a lot of performance issues with our Couchbase server: most docs fail with a timeout when we try to retrieve them from the Node.js client. I think they fail when we request a large number of inactive docs (on disk, not in memory). Our Couchbase console has always reported a <1% cache miss ratio, though, so I assumed memory usage was not the problem, since it seems like 99% of our gets are served from memory. I ran the cbhealthcheck tool today and it reports a lot of warnings about low memory and says we should increase our memory:
Active resident ratio - Not enough RAM in the cluster.
Active resident item ratio '11.64%' is below '30.00%'
Impact: Performing failover will slow down nodes severely because it will likely require information stored on disk
Average item loaded time - Poor ep-engine key performance indicators
Average item loaded time '69.241 ms' is slower than '500 us'
Impact: Server performance is below expectation
The Couchbase console says Active Resident Ratio is the % of active items cached in RAM, so my understanding is that only 12% of our active items are cached in RAM, yet we also have a <1% cache miss ratio. How can both stats be true? Am I misunderstanding what they are actually saying? We’re using Couchbase 3.1.0.
It can be that way if the way you access your data isn’t uniform across the dataset. I can have a 0% cache miss ratio and only 1% resident if in a given time period I only read/write the same document.
With most “real world” applications, the histogram of access will tend to look like a power law. How quickly the slope drops off is very application dependent. It can also be affected greatly by the “arrival rate” of your users. If for instance a given geography comes online and that makes you cycle a new set of data into your working set (ejecting data from geographies that go offline), you’ll need the system capacity to make the change. That sometimes means sizing for these kinds of patterns.
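To make the skew argument concrete, here is a toy model. It is not how Couchbase actually decides what to eject: it simply assumes document popularity follows a Zipf-style power law and that RAM holds exactly the most popular documents. The function name, the `skew` parameter, and the document counts are all made up for illustration.

```python
def cache_miss_ratio(total_docs, resident_docs, skew):
    """Fraction of reads that miss RAM in a toy model.

    Assumptions (illustrative only): the doc at popularity rank i is read
    with weight 1 / i**skew, and RAM holds the `resident_docs` most
    popular docs.
    """
    weights = [1.0 / (rank ** skew) for rank in range(1, total_docs + 1)]
    hits = sum(weights[:resident_docs])
    return 1.0 - hits / sum(weights)


total = 100_000    # hypothetical number of active docs
resident = 12_000  # 12% of them fit in RAM

uniform_miss = cache_miss_ratio(total, resident, skew=0.0)
skewed_miss = cache_miss_ratio(total, resident, skew=2.0)

print(f"resident ratio: {resident / total:.0%}")
print(f"miss ratio, uniform access: {uniform_miss:.2%}")
print(f"miss ratio, steep power law: {skewed_miss:.4%}")
```

With uniform access the miss ratio is just 1 - 0.12, so 88% of reads go to disk. With a steep enough skew it drops well below 1% even though the resident ratio has not changed, which is the combination described in the question.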
I heard back from Couchbase support about this and found out Active Resident Ratio is the percentage of ALL your docs that are in memory. So we had these values because our working set was less than about 12% of our total documents. My original understanding was that Active Resident Ratio meant the % of your active/working set in memory, which was incorrect and led to my confusion.
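Put as arithmetic, the two stats have different denominators, which is why they can disagree so much. The sketch below reflects my reading of the definitions in this thread, not Couchbase's actual code, and the function names and counts are hypothetical numbers chosen to reproduce the 11.64% from the report.

```python
def active_resident_percent(active_items_in_ram, active_items_total):
    """Percent of ALL active items whose values are held in RAM."""
    return 100.0 * active_items_in_ram / active_items_total


def cache_miss_percent(disk_fetches, gets):
    """Percent of get operations that had to be served from disk."""
    return 100.0 * disk_fetches / gets


# Hypothetical counts: the first ratio is per item stored,
# the second is per read request.
resident = active_resident_percent(active_items_in_ram=116_400,
                                   active_items_total=1_000_000)
miss = cache_miss_percent(disk_fetches=250, gets=50_000)

print(f"active resident ratio: {resident:.2f}%")
print(f"cache miss ratio: {miss:.2f}%")
```

The first number only depends on how much data is stored versus how much RAM there is; the second only depends on which docs the application happens to ask for.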
Hey, my stats are even worse: I have a cache miss ratio of 5% and an active docs resident % of 1.21! The cluster behaves very badly, often breaking on rebalances, and we’re in the process of trying to come up with the correct size to resize it to. The thing is, if active docs resident does not refer to the working set, is there any metric that does? That would help in getting the numbers needed for the sizing calculations.
If your cache miss ratio is only 5% then it seems like you have enough memory; that means 95% of the time you are retrieving your documents from memory. Our average cache miss ratio is less than 1%, but performance is so terrible when retrieving that 1% of docs from disk that the retrieval almost always fails.
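A rough way to see why a low miss ratio can still hurt: the average latency is a weighted mix of RAM reads and disk reads. This sketch treats the 69.241 ms "average item loaded time" from the cbhealthcheck report as the cost of one disk read, which is an assumption, and uses 0.5 ms as a placeholder for an in-memory read (borrowed from the 500 us threshold in the same report, not measured). The function name is made up.

```python
def average_get_latency_ms(miss_ratio, ram_ms, disk_ms):
    """Mean latency when a fraction `miss_ratio` of gets goes to disk."""
    return (1.0 - miss_ratio) * ram_ms + miss_ratio * disk_ms


DISK_MS = 69.241  # "Average item loaded time" from the cbhealthcheck report
RAM_MS = 0.5      # assumed in-memory latency; placeholder, not measured

for miss in (0.01, 0.05):
    avg = average_get_latency_ms(miss, RAM_MS, DISK_MS)
    print(f"miss ratio {miss:.0%}: average {avg:.2f} ms, "
          f"but each disk read still costs {DISK_MS} ms")
```

Under these assumptions the average stays in the low single-digit milliseconds at both 1% and 5%, so the headline stat looks healthy while every request that does hit disk pays the full disk cost. A batch of cold docs is made up almost entirely of those requests.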
For us, rebalance is a huge load on CPU, often using 100% on some of our nodes. If failing rebalances are your problem, maybe you need more CPU?