Node failure during testing
Today I was running a test that would span one hour. The plan was:
0 min → Begin test with 3 nodes and 250 million items
Begin insertion process at ~13.5K sets/sec (constant load running for the full hour)
Random modification of existing values at ~3K/sec (keys picked between 0 and the highest inserted value, also running for the full hour)
0+15 min → Bring node 3 down, rebalance
0+30 min → Bring node 2 down, rebalance
0+45 min → Bring node 3 up, rebalance
0+60 min → Bring node 2 up, rebalance
This test is meant to simulate node failures, rebalancing and data persistence under a worst-case simulation of production load.
We're using the same Java libraries (the latest beta/"non-stable") as before to inject the data and talk to Membase; a rough sketch of what the load generator does is below.
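This is a simplified sketch, not our actual test code: the class name, key format, and node addresses (node1..node3:11211) are placeholders, and it assumes the plain spymemcached MemcachedClient rather than whatever bucket wiring the beta client sets up. A real run would also need to bound outstanding async operations (e.g. by checking the returned futures) instead of just queueing sets.

    import java.net.InetSocketAddress;
    import java.util.List;
    import java.util.Random;
    import java.util.concurrent.TimeUnit;

    import net.spy.memcached.AddrUtil;
    import net.spy.memcached.MemcachedClient;

    // Sketch of the load generator: ~13.5K inserts/sec plus ~3K random
    // overwrites/sec of already-inserted keys, run for one hour.
    public class MembaseLoadTest {
        private static final int INSERTS_PER_SEC = 13_500;
        private static final int MODIFIES_PER_SEC = 3_000;
        private static final long DURATION_MS = TimeUnit.HOURS.toMillis(1);

        public static void main(String[] args) throws Exception {
            // Placeholder node addresses; moxi/memcached port 11211 assumed.
            List<InetSocketAddress> nodes =
                AddrUtil.getAddresses("node1:11211 node2:11211 node3:11211");
            MemcachedClient client = new MemcachedClient(nodes);

            Random rnd = new Random();
            long inserted = 250_000_000L;   // test starts at 250M unique items
            long end = System.currentTimeMillis() + DURATION_MS;

            while (System.currentTimeMillis() < end) {
                long tickStart = System.currentTimeMillis();

                // Constant insertion load: new unique keys each second.
                for (int i = 0; i < INSERTS_PER_SEC; i++) {
                    client.set("item:" + inserted++, 0, "value-" + inserted);
                }

                // Random modification load: overwrite a key between 0 and
                // the highest key inserted so far.
                for (int i = 0; i < MODIFIES_PER_SEC; i++) {
                    long key = (long) (rnd.nextDouble() * inserted);
                    client.set("item:" + key, 0, "modified-" + tickStart);
                }

                // Crude pacing: sleep out the remainder of the one-second tick.
                long elapsed = System.currentTimeMillis() - tickStart;
                if (elapsed < 1000) {
                    Thread.sleep(1000 - elapsed);
                }
            }
            client.shutdown();
        }
    }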
I have a few questions about what happened:
13:36: Test begins @ 250M initial unique items. Insertion methods started. Membase being hit at ~16.5K ops/sec.
13:51: Node 3 taken down, rebalance successful with no data loss.
Here’s where it gets weird...
13:56: Node 2 crashes; the database drops from 260M items to only 130M available. Some kind of restoration process started? (see later)
14:01: Node 2 comes back up, still at 130M items.
~14:07: Stopped hitting membase with the client.
Attempted to bring node 3 back; the rebalance failed with "wait for memcached failed".
14:09: Attempted another rebalance with node 3 up, same failure.
14:20: Data is restored, back at 260M items:
Bucket "default" loaded on node 'firstname.lastname@example.org' in 1167 seconds. ns_memcached001 14:20:02
14:26: Started a rebalance across all 3 nodes.
14:54: (now) Rebalance still running.
OK, so this part is fine: we got all our data back once the bucket finished loading after the crash, which took 1167 seconds. But why did node 2 crash in the first place? We weren't hitting Membase that hard, and I can't see any obvious cause such as running out of RAM or disk space.
Perhaps you can shed some light on this. I have sent you logs generated by the "collect_info" command, along with screenshots named by the time they were taken.