Couchbase 4.0 CE benchmark (continued)

I’ve continued experimenting with Couchbase 4.0 and noticed another issue.

Three server nodes:
3 × EC2 rx.xlarge, each with a mounted 5000 GB SSD volume (not EBS-optimised); 15 TB total
Data RAM Quota: 22407 MB per node (total RAM ~66 GB)
Index RAM Quota: 2048 MB

  • Generated two indexes, one on node2 and the other on node3:
    cbq> create index field1_idx on test (field1); // on node2
    cbq> create index field7_idx on test (field7); // on node3
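If the goal is to control which node hosts each index, Couchbase's GSI also accepts an explicit placement option in the `WITH` clause, rather than relying on where the query shell happens to be connected. A hedged sketch (the hostnames and ports are placeholders for this cluster, and the exact `nodes` syntax should be checked against the 4.0 docs):

```sql
-- Pin each index to a specific index-service node.
CREATE INDEX field1_idx ON test(field1) USING GSI WITH {"nodes": ["node2:8091"]};
CREATE INDEX field7_idx ON test(field7) USING GSI WITH {"nodes": ["node3:8091"]};
```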

As before, I run the YCSB client and load the servers with the following command:

ycsb load couchbase -s -P workloads/workloada -p recordcount=100000000 -p core_workload_insertion_retry_limit=3 -p couchbase.url=http://node1:8091/pools -p couchbase.bucket=test -threads 20

When trying to insert 100 million docs, the system crashes after about ~16 million. The exceptions I see on the client side are:

Error inserting, not retrying any more. Number of attempts: 4, Insertion Retry Limit: 3
2016-01-21 11:56:56,896 1336057 [Thread-2] ERROR - Could not insert value for key usertable:user8661638711032907145
java.lang.RuntimeException: Timed out waiting for operation
	at net.spy.memcached.internal.OperationFuture.get( ~[spymemcached-2.9.1.jar:2.9.1]
	at net.spy.memcached.internal.OperationFuture.getStatus( ~[spymemcached-2.9.1.jar:2.9.1]
	at ~[couchbase-binding-0.7.0-SNAPSHOT.jar:na]
	at ~[couchbase-binding-0.7.0-SNAPSHOT.jar:na]
	at [core-0.7.0-SNAPSHOT.jar:na]
	at [core-0.7.0-SNAPSHOT.jar:na]
	at [core-0.7.0-SNAPSHOT.jar:na]
Caused by: net.spy.memcached.internal.CheckedOperationTimeoutException: Timed out waiting for operation - failing node:
	at net.spy.memcached.internal.OperationFuture.get( ~[spymemcached-2.9.1.jar:2.9.1]
	at net.spy.memcached.internal.OperationFuture.get( ~[spymemcached-2.9.1.jar:2.9.1]
	... 6 common frames omitted

Logs from all three nodes are here.

Thanks in advance, Eli.

Hi @eli_golin, for YCSB, which Couchbase repo are you using?

Sorry, but I didn’t really get the question.

I am just pulling the latest from YCSB and working against it.
I think the version is now 0.7.0-SNAPSHOT.

Hi… we ran the benchmark again yesterday.
We have 10 fields in each doc, and I’ve created GSI indexes for 8 of the 10 fields.
I put four of them on node2 and the remaining four on node3.
As before, YCSB is working against node1.
We are running a single client instance with 20 threads.

We can see that the insertion speed during the first ~20M docs is around 11k ops/sec, but afterwards it slowly rises towards 17k ops/sec (I’m wondering what causes this behaviour).

Approximately after inserting 40M documents we start seeing “Temporary failure” responses from Couchbase.
This happens while Couchbase’s memory usage rises above the high water mark (55.9 GB) to approximately ~59 GB.
So this is the first question: why isn’t memory consumption being limited by the high water mark?

Afterwards we can see that the insertions continue, but occasionally with “Temporary failure” responses returned by the server. Memory usage jumps above the high-water-mark line a few more times during the run (you can see it in the screenshots).
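A “Temporary failure” response is the server asking the client to back off and retry, rather than a hard error. The YCSB binding’s fixed retry limit gives up quickly; a client that retried with exponential backoff would ride out short TMPFAIL bursts. A minimal sketch of that idea in plain Java (not the YCSB or spymemcached code, just a hypothetical helper):

```java
// Hypothetical retry helper: on a transient ("temporary failure" style)
// error, back off exponentially instead of failing the insert immediately.
public class BackoffRetry {
    public interface Op { boolean run(); } // returns true on success

    // Retries `op` up to maxAttempts times, sleeping baseMs, 2*baseMs,
    // 4*baseMs, ... between failed attempts. Returns true on first success.
    public static boolean withBackoff(Op op, int maxAttempts, long baseMs)
            throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (op.run()) return true;
            Thread.sleep(baseMs << attempt); // exponential backoff
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated op that fails twice (as a TMPFAIL burst would), then succeeds.
        int[] calls = {0};
        boolean ok = withBackoff(() -> ++calls[0] >= 3, 5, 10);
        System.out.println(ok + " after " + calls[0] + " attempts");
    }
}
```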

At around ~50M documents the client crashes with a memcached timeout exception (the same exception I posted before).

After the client crash I noticed that one of the nodes was in the “pending” state, and after a while the other two nodes became “pending” as well.

The total number of documents inserted is ~60M (Couchbase flushed the in-memory documents after the client died).

I’m attaching the client log and the server logs.
Here are some monitoring snapshots that I took.

Thanks, Eli.

Hi @eli_golin, it looks like your disk writes still can’t keep up with the insertion rate; the disk write queue is just building up in the graphs. Couchbase writes to memory first and can ingest fast, but if the disk isn’t keeping up, it will eventually run out of room in memory for new incoming updates.
You need better write throughput. To get that, you can either use a better I/O subsystem or add more nodes to the data service. I believe you are running this on AWS; which instance type (SKU) are the nodes using?
The other option is to slow the inserts down and let the disk writes catch up.
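One way to slow the inserts down with the stock YCSB client is its `-target` flag, which caps total client throughput in ops/sec. A sketch using the same parameters as the load command above (the 8000 ops/sec cap is an arbitrary example value to pick below the observed sustained disk drain rate):

```shell
# Throttle the load phase to ~8k ops/sec total across the 20 threads.
ycsb load couchbase -s -P workloads/workloada \
  -p recordcount=100000000 \
  -p core_workload_insertion_retry_limit=3 \
  -p couchbase.url=http://node1:8091/pools \
  -p couchbase.bucket=test \
  -threads 20 -target 8000
```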