Retrieving single-tenant data in a multi-tenant bucket

Hello,

We are considering using the Couchbase database. I have a question about how we could achieve the following use case with this storage engine:

  • we would like to store many (millions of) documents per company
  • each document has an external unique identifier
  • there is a dynamic list of companies - companies can be added but also removed
  • as the list of companies can change dynamically, we would prefer to have one bucket for all those documents and somehow distinguish which company a particular document belongs to
  • documents are retrieved using their unique identifier - the main use case
  • any company can ask to export all its documents - a very rare operation
  • in case of company removal we should delete all documents that belong to the given company - a very rare operation

From the above list of requirements we see that a single bucket, where each document is keyed by its unique identifier, would handle the main use case with plain Key-Value Couchbase operations.
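
For example (purely as an illustration - in practice we would use the SDK's key-value get, which bypasses the query service entirely; the key "doc-12345" and the bucket name bucketName are placeholders):

    SELECT b.* FROM bucketName b USE KEYS "doc-12345";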

However, for the two very rare operations mentioned - company documents export and company documents removal - we probably wouldn't be able to achieve this functionality without adding some indexes/views etc. Am I right?

It leads us to two questions:

  • which indexing mechanism to use: Global Secondary Index (GSI) or View?

    • Views can't be queried using N1QL, as far as I understand
    • a N1QL primary index would allow querying any field of the JSON - as we would like to query only one property, it looks too broad and too expensive
    • GSI - looks like the option that would support the mentioned requirements with lower overhead compared with the features above
    • we could even consider creating an ad-hoc partial GSI index dedicated to the tenant, just for the duration of the export/removal
    • Does GSI still not replicate itself among nodes? I’m referring to Differences between primary and secondary Indexes? - #17 by cihangirb: “we don’t support replication/replicas with GSI but you can manually do this. The way to do this would be to create the 2 identical indexes on 2 separate nodes (a.k.a mirror the index). If one node fails and takes out one copy of the index, all traffic will hit the remaining available index.”
  • where to store information about the company name?

    • We could store it as part of the document identifier:
    doc-${COMPANY_NAME}-${DOCUMENT_ID}: 
    {
      "someDocumentKey": "someDocValue"
    }
    

    it would allow us to query with a N1QL LIKE statement on META().id (Indexing Metadata Information | Couchbase Docs)

    -- get all company document IDs
    SELECT META().id FROM bucketName WHERE META().id LIKE "%COMPANY_NAME%"
    
    • alternatively, we could put the company name in a custom field:
    doc-${DOCUMENT_ID}: 
    {
      "companyName": "XYZ",
      "someDocumentKey": "someDocValue"
    }
    

    and then query against this field (see the sketch right after this list)
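
To make both alternatives concrete, this is roughly what we imagine - a rough sketch only, where the index names are made up, and for the key-based variant we assume a prefix-anchored LIKE pattern so that an index on META().id (as described in the Indexing Metadata Information docs) can be used:

    -- option 1: company name encoded in the document key
    CREATE INDEX company_in_key_idx ON bucketName(META().id);
    SELECT META().id FROM bucketName WHERE META().id LIKE "doc-COMPANY_NAME-%";

    -- option 2: company name stored in a document field
    CREATE INDEX company_field_idx ON bucketName(companyName);
    SELECT META().id FROM bucketName WHERE companyName = "XYZ";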

Is there any performance difference when querying against part of the document identifier instead of a custom field in the document?
Would both cases result in similar overhead in terms of additional storage, rebalance times and query speed?

Could you advise which indexing mechanism would be better and how we should approach storing the company name to support such a use case? Or maybe you see a different solution that we could try?

Thanks in advance for any insights and info! 🙂

Hi, @sebarys,

Thanks for the details in our DM. To recap:

I asked: “many (millions)” - is that 50, 100, 500, 900 million? Can you be a bit more accurate?

  • it depends on the company, but in general between 0.5 million and 50 million documents - this number is dynamic and correlates with the traffic that we’re receiving for a given company

You went through the documentation and planned the following for your rare archive/removal requests (fewer than 5 per year):

  • on such a rare request we will create a conditional (partial) GSI
    * CREATE INDEX company_xyz_documents_idx ON bucket_name(companyName) WHERE companyName='xyz';
  • data export - using the created GSI and keyset pagination (Using OFFSET and Keyset in N1QL | The Couchbase Blog) we will export all of the company’s document IDs and then fetch all documents through our API
  • data removal - using the created GSI, delete all company-related documents with a DELETE N1QL query
  • after the operation completes, the index will be removed (a sketch of these statements follows below)
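
As a sketch, that whole flow in N1QL might look like this (the bucket, index, and parameter names are just the placeholders from above; $last_seen_id is a hypothetical named parameter supplied by the export job, and including META().id as an index key would let the pagination be served directly from the index):

-- ad-hoc partial index (as above), created only for the export/removal window
CREATE INDEX company_xyz_documents_idx ON bucket_name(companyName) WHERE companyName='xyz';

-- export: page through the tenant's document IDs, passing the last ID from the previous page
SELECT META().id FROM bucket_name
WHERE companyName = 'xyz' AND META().id > $last_seen_id
ORDER BY META().id
LIMIT 1000;

-- removal: delete everything that belongs to the tenant
DELETE FROM bucket_name WHERE companyName = 'xyz';

-- cleanup: drop the ad-hoc index once the operation completes
DROP INDEX bucket_name.company_xyz_documents_idx;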

ARCHIVING A COMPANY’S DOC SET VIA THE EVENTING SERVICE

Now that I know the above scale and pattern, you could use the Eventing service to archive a company’s doc set to a different bucket (no INDEX needed). This is just 4 lines of JavaScript (sans comments).

// This function requires:
//
// 1. a source bucket named 'source'
// 2. an Eventing scratchpad bucket named 'meta' (100MB is fine)
// 3. a target bucket named 'archive' with a bucket binding 
//    aliased to 'dst_bkt' in mode read+write; this bucket must be 
//    big enough to hold the archived data
// 4. if more than 50M documents, double the Eventing Memory Quota to 512 MB
// 5. in settings we use 24 workers for more performance (you need 24 vCPUs 
//    to take advantage of this parallelism, use fewer if you have fewer vCPUs)
//
// Use as follows:
//
// A. create or flush the bucket 'archive' 
// B. edit this function, for example to:
//        if (! meta.id.startsWith("doc-somecompany-")) return;
// C. deploy this function, wait for completion, and then undeploy it
// D. use /opt/couchbase/bin/cbexport to export the 'archive' bucket
// E. drop or flush the bucket 'archive' 

function OnUpdate(doc, meta) {
    if (! meta.id.startsWith("doc-widgetcompany-")) return;
    dst_bkt[meta.id] = doc;
}

For a quick test I loaded 2 sets of company docs into a single bucket, 10M per company, so the bucket ‘source’ holds 20M docs total. I also created the bucket ‘archive’ and the 100MB Eventing scratchpad bucket ‘meta’.

doc-tirecompany-doc########
doc-widgetcompany-doc########

I deployed the above function (named “archive_all_keys_by_prefix”) on a 12-core non-MDS single-node test server and it archived one company at 85K docs per second. This took less than 2 minutes. It should still run at about 60K docs/sec even if you had 50 companies, because Eventing reads the mutations via DCP, which is faster than KV (DCP is a firehose), and we only need to copy or archive the data of the single relevant company.

Next I ran the cbexport utility as follows:

/opt/couchbase/bin/cbexport json -c couchbase://127.0.0.1 -u $CB_USERNAME -p $CB_PASSWORD -b archive -f lines -o /tmp/tmp.out.json

This also took a bit less than 2 minutes. At this point we can drop or flush the ‘archive’ bucket.

FYI we could adjust the function to also archive the company doc with a minor tweak.

PURGING A COMPANY’S DOC SET VIA THE EVENTING SERVICE

Now that I know the above scale and pattern, you could use the Eventing service to purge a company’s doc set directly from its source bucket (no INDEX needed). This is again just 4 lines of JavaScript (sans comments).

// This function requires:
//
// 1. a source bucket named 'source'
// 2. an Eventing scratchpad bucket named 'meta' (100MB is fine)
// 3. a bucket binding aliased to 'src_bkt' to the 'source' bucket,
//    in mode read+write
// 4. if more than 50M documents, double the Eventing Memory Quota to 512 MB
// 5. in settings we use 24 workers for more performance (you need 24 vCPUs 
//    to take advantage of this parallelism, use fewer if you have fewer vCPUs)
//
// Use as follows:
//
// A. make sure you already did an 'archive' if needed
// B. edit this function, for example to:
//        if (! meta.id.startsWith("doc-somecompany-")) return;
// C. deploy this function, wait for completion, and then undeploy it

function OnUpdate(doc, meta) {
    if (! meta.id.startsWith("doc-widgetcompany-")) return;
    delete src_bkt[meta.id];
}

For a quick test I loaded 2 sets of company docs into a single bucket, 10M per company, so the bucket ‘source’ holds 20M docs total. I also created the 100MB Eventing scratchpad bucket ‘meta’.

We are deleting all the docs related to

doc-widgetcompany-doc########

But leaving

doc-tirecompany-doc########

I deployed the above function (named “purge_all_keys_by_prefix”) on a 12-core non-MDS single-node test server and it purged one company’s docs at 90K docs per second. This deletion also took less than 2 minutes.

FYI we could adjust the function to also delete the company doc with a minor tweak.

FINAL COMMENTS

The two Eventing Functions (attached) could also be combined. Since we are using the Eventing service as a one-off utility, I believe it will fit your needs perfectly.

Best

Jon Strabala
Principal Product Manager - Server

Functions_for_sebarys.zip (2.1 KB)


Another option here, once Couchbase Server 7.0 is released, is to use scopes and collections. Each company could have its own collection.
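
For example, a scope with one collection per company could be created with the new N1QL DDL in 7.0 (the bucket, scope and collection names here are only illustrative and match the cbexport example below; collections can also be managed via the UI, couchbase-cli or the REST API):

CREATE SCOPE `bucket_name`.companies;
CREATE COLLECTION `bucket_name`.companies.couchbase;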

  • any company can ask to export all its documents - a very rare operation

For exporting, cbexport supports exporting all the docs from a single collection. For example, exporting all the documents from the couchbase collection in the companies scope:

/opt/couchbase/bin/cbexport json -c couchbase://127.0.0.1 -u $CB_USERNAME -p $CB_PASSWORD -b $BUCKET -f lines -o /tmp/tmp.out.json --include-data companies.couchbase
  • in case of company removal we should delete all documents that belong to the given company - a very rare operation

A collection can be dropped, deleting all the documents in it.
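
In 7.0 the drop can be a single N1QL statement (again using the illustrative names from the example above):

DROP COLLECTION `bucket_name`.companies.couchbase;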

Couchbase Server 7.0 is due for release later in the year.

@jon.strabala @pvarley thank you for the ideas and your responses! Much appreciated!

I think that for now we will stay with indexing, as it looks like the most natural way of handling such a use case and doesn’t require using very specific Couchbase features that would be hard to translate if we decide to migrate to another DB engine:

  • Eventing Service - this stream-like approach looks very nice but requires the Enterprise edition of Couchbase, whereas we’re currently using the Community edition
  • Collections - in the future, when migration to Couchbase 7.0 is possible, we will definitely look at it, but as we need something that is already available, it is a no-go for now

Thanks again for the replies and alternatives - they definitely gave us a broader overview of what is possible with Couchbase!