Much higher Disk Usage than Data Usage in an insert-only environment

I was able to insert 100M keys (at 20K ops/sec) into Couchbase before my 100 GB partition filled up. However, based on the Couchbase sizing documentation and my data sizes, I should have been able to insert roughly twice as many.
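For anyone curious, the loader was basically a tight insert loop. Here is a sketch of the idea (assuming the Python SDK 2.x Bucket API; the host, bucket name, key pattern, and document shape are illustrative placeholders, not my exact test setup):

```python
# Sketch of an insert-only load: every key is new, so there are no updates at all.
# (Host, bucket name, key pattern, and document shape below are placeholders.)
from couchbase.bucket import Bucket

cb = Bucket('couchbase://localhost/default')

TOTAL_KEYS = 100_000_000  # roughly the volume from the test described above

for i in range(TOTAL_KEYS):
    # insert() fails if the key already exists, so this is a pure insert stream
    cb.insert('key::%d' % i, {'seq': i, 'payload': 'x' * 200})
```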

My Couchbase Console shows:

Disk Usage: 92.3 GB
Data Usage: 44.3 GB

I don’t expect compaction to help, since I did no updates during my load test. Could anyone explain why this is happening?
Is compaction necessary in insert-only mode?

Well, the disk usage also includes data from external files that are not under Couchbase’s control. If you look at “Disk Overview” on the “Cluster Overview” tab, what do the three markers say?

In use, other data, free.

InUse: 92.6 GB
OtherData: 5.8 GB
Free: 0 B

@behrad interesting, it might be worth having someone take a look at this on a ticket. Can you run cbcollect_info and file a ticket at http://www.couchbase.com/issues/browse/MB with a good description of the workload and what’s going on?

Thanks,
Michael

I’m running another test, this time compacting the data after every 20M inserts. After compaction the Disk and Data usage figures get close to each other, so compaction does solve the issue. However, I don’t see why compaction should be needed under an insert-only load. I hadn’t seen this behavior with CouchDB, so could it be Couchbase’s metadata?
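(For reference, this is roughly how I kick off compaction between batches; a sketch against the bucket compaction REST endpoint, with host, port, credentials, and bucket name as placeholders.)

```python
# Sketch: trigger bucket compaction via the Couchbase REST API between insert batches.
# Host, port, credentials, and bucket name below are placeholders.
import requests

def compact_bucket(host='localhost', bucket='default',
                   user='Administrator', password='password'):
    url = 'http://%s:8091/pools/default/buckets/%s/controller/compactBucket' % (host, bucket)
    resp = requests.post(url, auth=(user, password))
    resp.raise_for_status()

# e.g. call compact_bucket() after every 20M inserts in the load loop
```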

If you are doing a 100% insert stream of new items, you will in fact encounter significant compaction activity.

The compaction activity occurs on the BTree index for the vbucket file. The index structure is updated on each insert, and the rules for updating the index are the same as for updating an item: append the new data to the end of the file. So BTree index maintenance results in significant redundant data in the file, and it is that data which is being compacted.
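To see why the file grows so much faster than the live data, here is a toy model of the append-only behavior described above (made-up sizes and a simplified one-node-rewrite-per-level rule, not Couchstore’s actual on-disk format):

```python
# Toy illustration of append-only index maintenance (NOT Couchstore's real
# on-disk format): each insert appends the new item plus fresh copies of the
# BTree nodes on the path from leaf to root, so the file accumulates stale
# node versions that only compaction reclaims.
import math

ITEM_SIZE = 256    # assumed average appended item size, in bytes
NODE_SIZE = 4096   # assumed size of one appended index node, in bytes
FANOUT = 32        # assumed BTree fanout

def file_vs_live(num_inserts):
    live = num_inserts * ITEM_SIZE
    file_size = 0
    for n in range(1, num_inserts + 1):
        # Tree height grows with log_FANOUT(n); each insert appends one new
        # node per level, and all previous node versions stay in the file.
        height = max(1, math.ceil(math.log(n + 1, FANOUT)))
        file_size += ITEM_SIZE + height * NODE_SIZE
    return live, file_size

live, on_disk = file_vs_live(100_000)
print('live data: %.1f MB, file size before compaction: %.1f MB'
      % (live / 1e6, on_disk / 1e6))
```

Even with modest assumptions, the stale node copies dominate the file size, which is exactly the gap between Data Usage and Disk Usage that compaction closes.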


That’s exactly the issue, @morrie, which I discovered when I dug into the couchdb database files under the Couchbase data directory :slight_smile: Couchbase metadata really needs compaction under heavy inserts :wink: