Global Secondary Index Size

I created the following GSI index on my bucket:

CREATE PRIMARY INDEX `#primary` ON `parse_with_index` USING GSI WITH {"nodes":""};

Here is the size of the index folder on my node before the creation:

/dev/xvdh1      296G   65M  281G   1% /index

Here is the size after the index creation:

/dev/xvdh1      296G  162G  119G  58% /index

My bucket size says the following on the console:

63.6GB /

I am wondering why the index size would be greater than the actual data on disk.


Could you share your version, please? GSI behavior here has changed between versions.

There may be a few reasons:

  • Compaction may not have run yet: we use an append-only write mode that favors fast sequential writes over space efficiency. A process called compaction removes the orphaned pages at an interval. To see the fragmentation ratio, you can look at the stats for the index under "% fragmentation".
  • Maintaining a tree is more expensive than storing raw data: an index carries some overhead over the data it indexes because of the tree structure it maintains, in either its skiplist or B-tree flavor. You can also see the index disk size and the index data size (the size of the data being indexed) under the index stats (click the bucket name under the "Data Buckets" tab in the web console). Let me know if you are seeing a discrepancy between the stat and what you observe on the filesystem.
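As a rough illustration of the first point (this is not Couchbase code, just the arithmetic behind the "% fragmentation" stat): fragmentation is essentially the share of the on-disk index file occupied by orphaned pages rather than live data.

```python
def fragmentation_pct(disk_size_gb, data_size_gb):
    """Share of the on-disk index file that is orphaned (reclaimable)
    space rather than live index data, as a percentage."""
    return (disk_size_gb - data_size_gb) / disk_size_gb * 100

# Numbers from this thread: ~162 GB on disk before compaction,
# ~8.3 GB after compaction (roughly the live data size).
print(round(fragmentation_pct(162.0, 8.3), 1))  # 94.9
```

A ~95% fragmented file is consistent with an index file that shrinks from 162 GB to 8.3 GB once compaction runs.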

Hello Cihan,

I am testing on 4.1.1-5914 Enterprise Edition (build-5914). I had to remove my node because of a disk space issue. I should have another up and running in an hour or two. I will keep you posted on the index stats.

Thanks for the help.


Thanks. I’d also recommend looking into 4.5 with circular writes.
4.0, 4.1, and 4.5 all provide the append-only write mode. With 4.5 we have introduced another write mode that eliminates the need for frequent compaction: a write looks for orphaned pages and reuses their space instead of appending everything to the end of the file. It is now the default mode when the standard GSI storage mode is used.
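As a toy model of the difference (assumptions: one page per write, no metadata overhead, and no compaction; this is not the actual storage-engine implementation), here is why an append-only file grows with every update while circular reuse keeps it flat:

```python
def index_file_pages(live_pages, updates, circular):
    """Toy model of an index file under repeated page rewrites.

    Every rewrite orphans the old copy of a page. In append-only mode
    the new copy is appended, so the file grows until compaction runs.
    In circular mode the new copy overwrites an orphaned slot, so the
    file size stays put.
    """
    file_size = live_pages   # pages currently in the file
    free_slots = 0           # orphaned pages available for reuse
    for _ in range(updates):
        free_slots += 1      # old copy of the rewritten page is orphaned
        if circular:
            free_slots -= 1  # new copy reuses an orphaned slot
        else:
            file_size += 1   # new copy is appended at the end of the file
    return file_size

print(index_file_pages(100, 900, circular=False))  # 1000 pages, 90% garbage
print(index_file_pages(100, 900, circular=True))   # 100 pages, no growth
```

This is how a 63.6 GB bucket can produce a much larger index file under append-only writes between compaction runs.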
If you have enough memory to keep your indexes in memory, I’d recommend memory-optimized indexes as well. The I/O overhead and the I/O subsystem required are much less demanding under the memory-optimized storage mode. All index maintenance happens in memory, so both scans and index maintenance are much faster.
You can find the indexing options for N1QL here.


You should definitely use 4.5.
The 4.0/4.1 indexes have a pretty ugly implementation, or, to "rephrase politely", "there are a lot of really significant improvements in 4.5". With 4.1.x GSI, expect various sporadic bugs plus fragmentation of up to 98% :wink: There is a lot of fun there…

Here is my index stats info:


Disk space returned to normal after compaction of the bucket. 8.3 GB! Wow, this is great.