Seeking advice on tuning cluster/bucket performance under sustained get/put load
We're running a Membase cluster of 4x Dell R410s, each with 128GB RAM, 3x 512GB SSDs, and 1 HDD (OS, logs, etc.). We're in the process of taking that cluster from 4 nodes to 6 to address general scalability concerns. We're seeing odd performance on the cluster (wild swings from 0 to 10k ops/sec) and timeouts reported by our clients (operation timeouts are currently set at 15ms or 22ms, depending on which of a couple of apps is accessing the shared pool).
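For context, here's roughly how our apps bound each get with that operation timeout via spymemcached. This is a simplified sketch; the address, key, and the 22ms figure are placeholders (we actually go through a local moxi):

    import java.net.InetSocketAddress;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    import net.spy.memcached.MemcachedClient;
    import net.spy.memcached.internal.GetFuture;

    public class TimedGet {
        public static void main(String[] args) throws Exception {
            // Connect through the local moxi proxy (placeholder address/port).
            MemcachedClient client =
                new MemcachedClient(new InetSocketAddress("127.0.0.1", 11211));

            GetFuture<Object> f = client.asyncGet("some-key");
            try {
                // Bound the get with the per-app operation timeout (22ms here).
                Object value = f.get(22, TimeUnit.MILLISECONDS);
                System.out.println("value = " + value);
            } catch (TimeoutException e) {
                // This is the timeout our apps report against the large buckets.
                f.cancel(false); // don't leave the op queued behind slow ones
            } finally {
                client.shutdown();
            }
        }
    }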
We're hoping to get some advice on tuning and general performance principles.
We currently have about 18 buckets allocated, serving a mix of workloads, and we see very different performance per bucket. We have approximately 10 app servers running Java -> spymemcached -> Moxi -> Membase. Up to now we have tuned the size of the DRAM allocations and tried to ensure there's some headroom available at all times, but we haven't touched the low/high water marks or made other tweaks. Some other usage stats:
Small-bucket workload (2/3 of buckets are in this category; figures are per bucket):
- Approximately 2,500 gets/sec (4,000 peak)
- Approximately 300 puts/sec (500 peak)
Large-bucket workload (per bucket):
- Approximately 4,500 gets/sec (8,000-10,000 peak)
- Approximately 3,000 puts/sec (6,000 peak)
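Back-of-envelope, if those steady per-bucket rates hold concurrently across all 18 buckets (12 small, 6 large), the put volume alone is roughly 12 x 300 + 6 x 3,000 = 21,600 mutations/sec cluster-wide, or about 5,400/sec per node at 4 nodes, all of which the flusher has to persist to disk. That's part of why the disk write queue behaviour described below worries us.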
In both cases we see odd performance graphs (below) from Membase. Client apps report good response times (<10ms) from the small buckets, but terrible response times (~85% of requests hitting the 22ms timeout) from the large buckets. There are two types of app workload:
#1 (2/3 of traffic): HTTP request -> Membase get (check current value) -> update value -> Membase put (store updated value)
#2 (1/3 of traffic): HTTP request -> Membase get (check current value) -> app performs some action based on the value
There is no temporal relationship between a key being accessed by workload #1 versus workload #2; in fact, the two are very likely pulling from different key sets due to different business factors.
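For what it's worth, workload #1 is a plain get-then-put read-modify-write. Here's a minimal sketch of the same loop written against spymemcached's gets/cas, which is what we'd switch to if concurrent overwrites turn out to matter; the key name and the increment are placeholders for our real update logic:

    import net.spy.memcached.CASResponse;
    import net.spy.memcached.CASValue;
    import net.spy.memcached.MemcachedClient;

    public class ReadModifyWrite {
        // Workload #1: fetch the current value, update it, store it back.
        static void updateCounter(MemcachedClient client) throws Exception {
            while (true) {
                CASValue<Object> current = client.gets("counter-key");
                if (current == null) {
                    // No value yet; add() only succeeds for the first writer.
                    if (client.add("counter-key", 0, 1L).get()) {
                        return;
                    }
                    continue; // someone beat us to it; re-read and retry
                }
                long updated = (Long) current.getValue() + 1;
                CASResponse res =
                    client.cas("counter-key", current.getCas(), updated);
                if (res == CASResponse.OK) {
                    return; // stored with no conflicting write in between
                }
                // EXISTS: another client wrote first; retry with a fresh value.
            }
        }
    }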
Overall Membase stat snapshot:
You can see that disk ops per second, even at the cluster level, swings significantly up and down around an otherwise consistent operation-rate baseline.
Snapshot from our larger bucket:
It looks to me like there's a correlation between the points where operations block/dip and the "items persisted"/"disk write queue" graphs. I had understood from the docs that when memory usage crosses the high water mark, a background thread ejects items until usage drops back to the low water mark, but I thought that background work was supposed to be trumped by real-time accesses.
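In case it helps anyone replying, here's a rough sketch of pulling the relevant ep-engine counters via spymemcached's getStats(). It assumes a direct connection to the default bucket on a node's data port (11210; SASL-authenticated buckets need more setup), and the hostname is a placeholder:

    import java.net.InetSocketAddress;
    import java.net.SocketAddress;
    import java.util.Map;

    import net.spy.memcached.MemcachedClient;

    public class WatermarkStats {
        public static void main(String[] args) throws Exception {
            // Talk to one node's data port directly to see ep-engine stats.
            MemcachedClient client =
                new MemcachedClient(new InetSocketAddress("membase-node-1", 11210));

            Map<SocketAddress, Map<String, String>> stats = client.getStats();
            for (Map<String, String> node : stats.values()) {
                // Water marks the pager works between, plus the flusher backlog.
                System.out.println("ep_mem_low_wat  = " + node.get("ep_mem_low_wat"));
                System.out.println("ep_mem_high_wat = " + node.get("ep_mem_high_wat"));
                System.out.println("ep_queue_size   = " + node.get("ep_queue_size"));
                System.out.println("ep_flusher_todo = " + node.get("ep_flusher_todo"));
            }
            client.shutdown();
        }
    }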
A smaller bucket for comparison:
I'm hoping to get some ideas on how to handle this workload, plus any tuning tips or pointers to documentation we may have missed.