Decompressing inside a view

Hi,

I’m currently working on a project where we use Couchbase as a caching layer.
The thing is that we are storing big JSON documents (between 1 MB and 2 MB), and according to the caching layer section of the Couchbase documentation, documents that are frequently queried/accessed are kept in RAM to avoid swapping to disk.
I know I have several options, like stripping the JSON and removing data we may not need, but that involves tracking the properties we remove and making code changes.
I decided to investigate client-side compression with the Couchbase Java SDK (I created my own Transcoder) and performed zlib compression using DeflaterOutputStream. I tried it with a 100 KB document and it was reduced to about 3.5 KB (zlib compression + Base64 encoding).
The only problematic thing, as we know, is that Couchbase views expect doc to be JSON, so I decided to use a minified version of https://github.com/beatgammit/deflate-js/blob/master/lib/rawinflate.js inside a view to test the inflate process (first I had to call Couchbase’s built-in decodeBase64 function).
After all this research/testing work, I was able to generate a view index based on a compressed document.
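Roughly, the size test looked like the sketch below (a minimal stand-alone version, not the project code: the sample payload and class name are placeholders, and Java 8’s java.util.Base64 is used here only to measure the encoded size):

import java.io.ByteArrayOutputStream;
import java.util.Base64;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class CompressionSizeCheck {
  public static void main(String[] args) throws Exception {
    // placeholder payload; in the real test this was a ~100 KB JSON document
    String json = "{\"id\":\"123\",\"items\":[\"a\",\"b\",\"c\"]}";
    byte[] raw = json.getBytes("UTF-8");

    // headerless (nowrap) zlib deflate, the same mode the transcoder uses
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    DeflaterOutputStream def =
        new DeflaterOutputStream(bos, new Deflater(Deflater.DEFAULT_COMPRESSION, true));
    def.write(raw);
    def.close();
    byte[] compressed = bos.toByteArray();

    // Base64, to see the size before the view undoes it with decodeBase64
    String base64 = Base64.getEncoder().encodeToString(compressed);

    System.out.printf("original: %d bytes, deflated: %d bytes, base64: %d bytes%n",
        raw.length, compressed.length, base64.length());
  }
}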

My questions are:

  1. Is it worth doing all this compression/decompression work when documents are big, or can Couchbase handle it without problems? I know it depends on RAM sizing for the node and bucket and other config parameters.
  2. If I still want to go with the compression approach, is inflating in the view going to be a problem? I know index generation is incremental and will also depend on the inflate JavaScript function, but the first index build will inflate all documents for a given view.

Thanks!
Juanma

I am sorry I cannot help you with your question, but I have a question for you :slight_smile:

I am trying to use the same library, but it’s not inflating my data. I know that Couchbase encodes the data as Base64 too, so I tried to decode it and then pass it on to inflate, but it still doesn’t work. Can you tell me how you got inflate to work?

//Added the js file you mentioned here

var data = inflate(doc);
emit(meta.id, data);

Hi,

First of all, are you storing JSON or serialized Java objects? The thing is that the default transcoder (SerializingTranscoder) doesn’t compress JSON strings, so what I did was create my own transcoder to be able to use the deflate algorithm (the default transcoder uses GZIP) and also compress JSON strings. Then you need to override the default transcoder (CouchbaseConnectionFactoryBuilder).
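Overriding it looks roughly like this (a sketch against the 1.x Java client / spymemcached API; the node URI, bucket name, empty password and sample key are placeholders):

import java.net.URI;
import java.util.Arrays;
import java.util.List;

import com.couchbase.client.CouchbaseClient;
import com.couchbase.client.CouchbaseConnectionFactoryBuilder;

public class ClientSetup {
  public static void main(String[] args) throws Exception {
    List<URI> nodes = Arrays.asList(URI.create("http://127.0.0.1:8091/pools"));

    CouchbaseConnectionFactoryBuilder builder = new CouchbaseConnectionFactoryBuilder();
    // register the custom transcoder so every set/get goes through it
    builder.setTranscoder(new CouchbaseCompressionTranscoder());

    // build the connection factory for the target bucket and create the client
    CouchbaseClient client =
        new CouchbaseClient(builder.buildCouchbaseConnection(nodes, "myBucket", ""));

    client.set("some-key", 0, "{\"hello\":\"world\"}").get();
    client.shutdown();
  }
}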
Here is the transcoder code:

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Date;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

import net.spy.memcached.CachedData;
import net.spy.memcached.compat.CloseUtil;
import net.spy.memcached.transcoders.SerializingTranscoder;
import net.spy.memcached.transcoders.TranscoderUtils;
import net.spy.memcached.util.StringUtils;

/**
 * Transcoder that serializes and compresses objects.
 */
public class CouchbaseCompressionTranscoder extends SerializingTranscoder {

  // General flags
  static final int SERIALIZED = 1;
  static final int COMPRESSED = 2;

  // Special flags for specially handled types.
  static final int SPECIAL_BOOLEAN = (1 << 8);
  static final int SPECIAL_INT = (2 << 8);
  static final int SPECIAL_LONG = (3 << 8);
  static final int SPECIAL_DATE = (4 << 8);
  static final int SPECIAL_BYTE = (5 << 8);
  static final int SPECIAL_FLOAT = (6 << 8);
  static final int SPECIAL_DOUBLE = (7 << 8);
  static final int SPECIAL_BYTEARRAY = (8 << 8);

  private final TranscoderUtils tu = new TranscoderUtils(true);

  /*
   * (non-Javadoc)
   *
   * @see net.spy.memcached.Transcoder#encode(java.lang.Object)
   */
  public CachedData encode(Object o) {
    byte[] b = null;
    int flags = 0;
    if (o instanceof String) {
      b = encodeString((String) o);
      if (StringUtils.isJsonObject((String) o)) {
        if (b.length > compressionThreshold) {
          byte[] compressed = compress(b);
          if (compressed.length < b.length) {
            b = compressed;
            flags |= COMPRESSED;
          }
        }
        return new CachedData(flags, b, getMaxSize());
      }
    } else if (o instanceof Long) {
      b = tu.encodeLong((Long) o);
      flags |= SPECIAL_LONG;
    } else if (o instanceof Integer) {
      b = tu.encodeInt((Integer) o);
      flags |= SPECIAL_INT;
    } else if (o instanceof Boolean) {
      b = tu.encodeBoolean((Boolean) o);
      flags |= SPECIAL_BOOLEAN;
    } else if (o instanceof Date) {
      b = tu.encodeLong(((Date) o).getTime());
      flags |= SPECIAL_DATE;
    } else if (o instanceof Byte) {
      b = tu.encodeByte((Byte) o);
      flags |= SPECIAL_BYTE;
    } else if (o instanceof Float) {
      b = tu.encodeInt(Float.floatToRawIntBits((Float) o));
      flags |= SPECIAL_FLOAT;
    } else if (o instanceof Double) {
      b = tu.encodeLong(Double.doubleToRawLongBits((Double) o));
      flags |= SPECIAL_DOUBLE;
    } else if (o instanceof byte[]) {
      b = (byte[]) o;
      flags |= SPECIAL_BYTEARRAY;
    } else {
      b = serialize(o);
      flags |= SERIALIZED;
    }
    assert b != null;
    if (b.length > compressionThreshold) {
      byte[] compressed = compress(b);
      if (compressed.length < b.length) {
        getLogger().debug("Compressed %s from %d to %d",
            o.getClass().getName(), b.length, compressed.length);
        b = compressed;
        flags |= COMPRESSED;
      } else {
        getLogger().info("Compression increased the size of %s from %d to %d",
            o.getClass().getName(), b.length, compressed.length);
      }
    }
    return new CachedData(flags, b, getMaxSize());
  }

  @Override
  protected byte[] compress(byte[] in) {
    if (in == null) {
      throw new NullPointerException("Can't compress null");
    }
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    DeflaterOutputStream def = null;
    try {
      // create deflater without header (nowrap = true)
      def = new DeflaterOutputStream(bos, new Deflater(Deflater.DEFAULT_COMPRESSION, true));
      def.write(in);
    } catch (IOException e) {
      throw new RuntimeException("IO exception compressing data", e);
    } finally {
      CloseUtil.close(def);
      CloseUtil.close(bos);
    }
    byte[] rv = bos.toByteArray();
    getLogger().debug("Compressed %d bytes to %d", in.length, rv.length);
    return rv;
  }
}
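
One caveat (my assumption from reading SerializingTranscoder, so double-check it): the inherited decompress() expects GZIP data, so with the headerless deflate above you will probably also need a matching decompress() override in the same class, along these lines (it additionally needs the java.util.zip.Inflater and java.util.zip.InflaterOutputStream imports):

  @Override
  protected byte[] decompress(byte[] in) {
    if (in == null) {
      return null;
    }
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    InflaterOutputStream inf = null;
    try {
      // matches the headerless (nowrap) deflate used in compress()
      inf = new InflaterOutputStream(bos, new Inflater(true));
      inf.write(in);
    } catch (IOException e) {
      getLogger().warn("Failed to decompress data", e);
      bos = null;
    } finally {
      CloseUtil.close(inf);
    }
    return bos == null ? null : bos.toByteArray();
  }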

Then put this script in your view:

function (doc, meta) {
  // Minified rawinflate.js from
  // https://github.com/beatgammit/deflate-js/blob/master/lib/rawinflate.js
  // pasted inline here (the minified source is omitted); it builds an
  // object exposing inflate().
  var inflate = /* ... minified rawinflate.js ... */;

  // inflate the data (the view receives the binary doc as Base64)
  var data = inflate.inflate(decodeBase64(doc));

  // if it is JSON you can parse it to access its properties
  var json = JSON.parse(String.fromCharCode.apply(String, data));

  // emit data or json; emitting the whole document as the value is not good
  // practice, do it only for testing
  emit(meta.id, data);
  emit(meta.id, json);
}

Send me an email if you want the files.

Thanks so much for the code. Unfortunately it didn’t work for me.

To answer your question, I am not using JSON. It’s serialized PHP objects.

To give you some background, I am working on an existing PHP project that uses Couchbase to store session data. The stored data is not JSON, and it is compressed using zlib. I am trying to write a view in which I want to uncompress the data and parse it. When I tried your code, the output I got was just ‘[ ]’. I guess that’s probably because you are using GZIP while my compression method is zlib. Any thoughts?

Also, did you get any answers/benchmark results for the questions you mentioned here?

Thanks a lot for your help.

Hi

I’m currently using the Java Deflater class:

new Deflater(Deflater.DEFAULT_COMPRESSION, true)

and the constructor Javadoc says:

/**
* Creates a new compressor using the specified compression level.
* If 'nowrap' is true then the ZLIB header and checksum fields will
* not be used in order to support the compression format used in
* both GZIP and PKZIP.
* @param level the compression level (0-9)
* @param nowrap if true then use GZIP compatible compression
*/
public Deflater(int level, boolean nowrap) {

The implementation I’m using (DeflaterOutputStream) differs from GZIPOutputStream in that it does not write the gzip header and CRC.
I didn’t run any benchmarks; maybe you can calculate it by tracking a timestamp before the decodeBase64 and inflate calls and another after them, so you can emit the result into an index and see how long it takes for each document.
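
On the zlib point: if the PHP side uses gzcompress(), its output is zlib-wrapped (a 2-byte header plus an Adler-32 trailer), while the deflate above is headerless, so the same inflater will not read both. A small stand-alone sketch to illustrate the difference (the class name and sample string are placeholders):

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.Inflater;

public class WrapperCheck {
  public static void main(String[] args) throws Exception {
    byte[] data = "hello hello hello hello".getBytes("UTF-8");

    // headerless deflate (nowrap = true), like the transcoder above
    byte[] raw = deflate(data, true);
    // zlib-wrapped (2-byte header + Adler-32 trailer), what gzcompress() produces
    byte[] wrapped = deflate(data, false);

    System.out.println(tryInflate(raw, true));      // inflates fine
    System.out.println(tryInflate(wrapped, false)); // inflates fine
    System.out.println(tryInflate(wrapped, true));  // fails: a raw inflater can't parse the zlib header
  }

  static byte[] deflate(byte[] in, boolean nowrap) throws Exception {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    DeflaterOutputStream def =
        new DeflaterOutputStream(bos, new Deflater(Deflater.DEFAULT_COMPRESSION, nowrap));
    def.write(in);
    def.close();
    return bos.toByteArray();
  }

  static String tryInflate(byte[] in, boolean nowrap) {
    Inflater inflater = new Inflater(nowrap);
    inflater.setInput(in);
    try {
      byte[] buf = new byte[4096];
      int n = inflater.inflate(buf);
      return "inflated " + n + " bytes: " + new String(buf, 0, n, "UTF-8");
    } catch (Exception e) {
      return "failed: " + e;
    } finally {
      inflater.end();
    }
  }
}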

Ah! I will have to figure out a way to decompress. Thanks for your help.