Analytics datasets in a multi-tenant environment

Our application may have hundreds or thousands of tenants, and the data for all of them is stored in the same bucket. We expect many ad hoc queries and cannot create indexes for them in advance, which is why we would like to use the Couchbase Analytics feature. We have two proposals for modeling our data in Analytics:
/////////////

  1. Create one single dataset based on the bucket

This means that the dataset will contain data for all the tenants.

  2. Create one dataset per tenant

If we have thousands of tenants, we will have thousands of datasets.
////////////////////

All our queries will always be based on a given tenant and document type. I have the following questions:

  1. Will #2 have better query performance than #1?
  2. Will there be any issues with too many datasets in the system? Is there any limit on how many datasets we are allowed to have for each cluster?
  3. Since our application can query the Analytics dataset, do you think we still need the regular Query service? What is the advantage of using the Query service instead of the Analytics service? If we only run queries against the Analytics service, are there any potential issues?

The maximum number of datasets you can create per cluster in Analytics is configurable; the default is 8, and you can change it at any time. While you can create hundreds or thousands of datasets, we do not recommend creating that many, as doing so may have negative consequences for how system resources are managed per dataset. For that reason, I recommend that you create a single dataset (the first proposal).

Since you already know that your queries will be based on a tenant and a document type, you can create a secondary index (not to be confused with the Index service) on one or both of these fields on the single dataset to speed up your Analytics queries. The performance in that case will be very close to having an individual dataset per tenant. Analytics secondary indexes do not require the Index service to be available in the cluster, and more information on how to use them can be found here.
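To make the single-dataset suggestion concrete, here is a minimal Python sketch that only assembles the Analytics statements as strings; it does not talk to a cluster. The bucket name `app_bucket`, the dataset name `all_tenants`, and the field names `tenantId` and `type` are assumptions for illustration, not names from this thread. The statement shapes follow the Analytics DDL as I understand it for 6.0, so please check them against the documentation for your release.

```python
import json


def single_dataset_statements(bucket, dataset,
                              tenant_field="tenantId", type_field="type"):
    """Build DDL for one dataset over the whole bucket plus one secondary index.

    All names passed in are hypothetical. The index is created before the
    link is connected, on the assumption that index DDL is issued while the
    dataset is still disconnected.
    """
    index_name = f"{dataset}_tenant_type_idx"
    return [
        f"CREATE DATASET `{dataset}` ON `{bucket}`;",
        f"CREATE INDEX `{index_name}` ON `{dataset}`"
        f"(`{tenant_field}`: string, `{type_field}`: string);",
        "CONNECT LINK Local;",
    ]


def tenant_query(dataset, tenant_id, doc_type,
                 tenant_field="tenantId", type_field="type"):
    """Build a query that filters on the two indexed fields.

    json.dumps is used only to produce double-quoted string literals.
    """
    return (
        f"SELECT COUNT(*) AS n FROM `{dataset}` d "
        f"WHERE d.`{tenant_field}` = {json.dumps(tenant_id)} "
        f"AND d.`{type_field}` = {json.dumps(doc_type)};"
    )


if __name__ == "__main__":
    for stmt in single_dataset_statements("app_bucket", "all_tenants"):
        print(stmt)
    print(tenant_query("all_tenants", "tenant-42", "order"))
```

Every query in this layout carries the tenant and document type as equality predicates, which is the access pattern the secondary index is meant to serve.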

As for your question about using the Analytics or the Query service, it really depends on your application and its operational requirements. Applications may use the Query service only, the Analytics service only, or a combination of both for different use cases within the same application. If the operational properties of the Analytics service satisfy your application’s requirements, you may use the Analytics service only. You can check the short write-up here on when to use each service.

Hope this helps.

You could also consider creating one dataset per document type. Each dataset will contain documents of the same type for all tenants. Then, for each dataset, create a secondary index in Analytics on the tenant id.
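A rough sketch of what that per-type layout could look like, again as a Python helper that only builds statement strings. The bucket name, the `ds_` dataset prefix, and the field names `type` and `tenantId` are hypothetical; the `WHERE` filter on `CREATE DATASET` is how I would expect each dataset to be restricted to one document type, but treat the exact syntax as something to verify for your release.

```python
import json


def per_type_statements(bucket, doc_types,
                        type_field="type", tenant_field="tenantId"):
    """Build DDL for one filtered dataset per document type.

    Each dataset gets a secondary index on the (assumed) tenant id field.
    The link is connected once, after all datasets and indexes are defined.
    """
    statements = []
    for doc_type in doc_types:
        dataset = f"ds_{doc_type}"  # hypothetical naming scheme
        statements.append(
            f"CREATE DATASET `{dataset}` ON `{bucket}` "
            f"WHERE `{type_field}` = {json.dumps(doc_type)};"
        )
        statements.append(
            f"CREATE INDEX `{dataset}_tenant_idx` ON `{dataset}`"
            f"(`{tenant_field}`: string);"
        )
    statements.append("CONNECT LINK Local;")
    return statements


if __name__ == "__main__":
    for stmt in per_type_statements("app_bucket", ["order", "invoice"]):
        print(stmt)
```

Note that every document type adds one dataset, so the total counts against the per-cluster dataset limit mentioned earlier in the thread.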

Thanks, @mhubail and @dmitry.lychagin.

@mhubail I will take a look at the links you provided.
@dmitry.lychagin, creating one dataset per document type might not work for our application, since we have many document types (over 30 out of the box) and our application allows users to upload documents with their own custom document types. We have no control over the number of document types in our system, since customers can define their own. Based on the limit on the number of datasets we can have per cluster (as mentioned by @mhubail), creating one dataset per document type will not work for our application. Please let me know if you have additional comments or suggestions. Thanks.

@mhubail I have read the links you provided. Assuming that we create the indexes both in the Index service and on the Analytics dataset, is query performance comparable between the Query service and the Analytics service?

@jessyang,
That answer really depends on your exact queries, data size, and cluster resources. If your queries are mainly lookups of a single document or a few documents and your cluster is sized right, the Query service should give you a better query response time. If your workload consists of analytical queries (e.g., ad hoc joins and aggregations) and your data size is big enough compared to your cluster resources, Analytics should give you a better query response time.

Hi, @mhubail, I have some questions regarding your last reply.

  1. You mentioned that
    //////////////////
    If your queries are mainly lookups of a single document or a few documents and your cluster is sized right, the Query service should give you a better query response time.
    //////////////

Can you please clarify what you mean by “your cluster is sized right”?

  2. You mentioned the following:
    ////////////
    If your queries involve analytical queries (e.g., ad-hoc joins and aggregations) and your data size is big enough compared to your cluster resources,
    ////////////////

Can you please clarify what you mean by “data size is big enough compared to your cluster resources”?

Thank you.

Hi, @mhubail, another question I have is about query consistency. Assuming we create a secondary search index on the regular Index service and also create an index on the Analytics dataset, will the Query service have better data consistency than Analytics? Thanks.

@jessyang,
Please see this video and hopefully that will make my points related to the use cases clear.
As for data consistency, in the current Couchbase Server release (6.0.x), Query service supports the consistency parameters listed in this page under Consistency parameters. As for Analytics, all queries are currently executed with not_bounded scan consistency. Support for request_plus scan consistency in the Analytics service is planned for the next Couchbase Server release.
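To illustrate the difference, here is a small Python sketch that builds (but does not send) the request bodies for the two services. The host `localhost` is an assumption, and 8093 and 8095 are, as far as I know, the default Query and Analytics ports; the `scan_consistency` request parameter belongs to the Query service REST API, and the helper only accepts the two values named in this thread, although the linked page lists others.

```python
QUERY_URL = "http://localhost:8093/query/service"          # assumed host
ANALYTICS_URL = "http://localhost:8095/analytics/service"  # assumed host

# Only the two levels discussed in this thread; the Query service has more.
THREAD_CONSISTENCY_LEVELS = {"not_bounded", "request_plus"}


def query_request(statement, scan_consistency="request_plus"):
    """Return (url, body) for a Query service request with explicit consistency."""
    if scan_consistency not in THREAD_CONSISTENCY_LEVELS:
        raise ValueError(f"unsupported scan_consistency: {scan_consistency}")
    return QUERY_URL, {
        "statement": statement,
        "scan_consistency": scan_consistency,
    }


def analytics_request(statement):
    """Return (url, body) for an Analytics request.

    No consistency parameter is set: per the reply above, Analytics in
    6.0.x runs every query with not_bounded semantics.
    """
    return ANALYTICS_URL, {"statement": statement}


if __name__ == "__main__":
    print(query_request("SELECT 1;"))
    print(analytics_request("SELECT 1;"))
```

The bodies can be posted with any HTTP client together with your cluster credentials; that part is left out so the sketch stays independent of a running cluster.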