We have a new mid-scale app that will be the company’s first app using Couchbase as the only database. We store data as documents in Couchbase at the rate of about 35 million per month. Analytics will be very important and we are looking at some BI suites and tools as well as dedicated databases. We’ve thought of a couple of options:
Store everything in Couchbase forever and build some views and queries to do the analytics, perhaps with something like the Spark connector for certain processing. I am worried that an ever-growing production database will hurt performance and drive up costs as we have to increase the number of nodes to support it.
Move data to a secondary database, transforming it into a snowflake schema for more traditional data analytics. How will we move this data reliably? Can we query for all new or changed documents?
Couchbase Analytics is the newest service in the Couchbase data platform. It offers parallel evaluation of analytical queries without impacting the operational performance of the Couchbase cluster. It is currently available in Developer Preview 4 here.
Couchbase Analytics uses Multidimensional Scaling to create a shadow dataset of the operational data in Couchbase and makes it available for analytical processing in real time. There is no need for any data warehouse schema design, no ETL workflow to develop, and no ETL batch job to modify: Analytics brings NoETL to NoSQL.
Couchbase Analytics has a full MPP (massively parallel processing) based query processor that splits the work of processing a single query across all of the Analytics nodes in a Couchbase data platform cluster. This enables analytical queries to run quickly and in a scalable manner.
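To make the idea of splitting one query across nodes concrete, here is a minimal sketch of the general scatter-gather pattern in plain Python. It is not the Analytics engine or its API: the `partitions` data, `partial_sum`, and `parallel_total` are all hypothetical names invented for illustration, and a thread pool only stands in for separate nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands in for the documents
# held by one Analytics node.
partitions = [
    [{"type": "order", "total": 10}, {"type": "refund", "total": 4}],
    [{"type": "order", "total": 7}],
    [{"type": "order", "total": 3}, {"type": "refund", "total": 1}],
]


def partial_sum(docs):
    """Local step: a node aggregates only the documents it holds."""
    sums = Counter()
    for doc in docs:
        sums[doc["type"]] += doc["total"]
    return sums


def parallel_total(parts):
    """Scatter the same aggregation to every partition, then merge."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        partials = list(pool.map(partial_sum, parts))
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds values per key
    return dict(merged)


result = parallel_total(partitions)
print(result)
```

Each partition does its share of the work independently, and only the small partial results are combined at the end, which is why adding nodes can shorten the time a single query takes.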
The combination of the immediate availability of data with the use of parallel processing allows us to reach the #1 objective for Analytics: reduced time-to-insight for the business.
You can download the Developer Preview 4 of Analytics here and get started using the beer sample tutorial here.
I would like to learn more about your use case and the outcomes you are solving for. Please let me know if you have additional questions or need to dive deeper into Analytics.
Thanks for the information. Couchbase Analytics looks very interesting, especially being able to avoid ETL and latency moving data into the warehouse. The documentation is a bit thin right now, so I still have a couple of questions.
Does Couchbase Analytics 2-way sync with the data on Couchbase Server, or will it store data after it’s deleted or expires in Couchbase Server? If we have to store several years’ worth of data in Couchbase Server we will run into SSD storage issues, so I want to keep the production database small.
Are there any BI tools that work with Couchbase Analytics, such as Pentaho, BIRT, Jaspersoft, or Metabase?
Couchbase Analytics is a shadow dataset - all changes to the data in Couchbase Server are carried over to the Analytics nodes. If the data is deleted from Couchbase Server, then it will be deleted from the Analytics node as well.
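As a rough illustration of what "shadow dataset" means here, the following Python sketch replays a stream of changes into a copy. It is only a toy model under stated assumptions: the `ShadowDataset` class, the `(op, key, body)` change tuples, and the document keys are hypothetical and do not reflect how Couchbase actually ships changes between services.

```python
class ShadowDataset:
    """Toy model of a copy that mirrors every change made to its source."""

    def __init__(self):
        self.docs = {}

    def apply(self, change):
        op, key, body = change
        if op == "upsert":
            # Creates and updates both overwrite the shadow copy.
            self.docs[key] = body
        elif op == "delete":
            # A delete on the source removes the shadow copy too.
            self.docs.pop(key, None)


# Hypothetical change stream coming from the operational side.
changes = [
    ("upsert", "order::1", {"total": 10}),
    ("upsert", "order::2", {"total": 7}),
    ("upsert", "order::1", {"total": 12}),
    ("delete", "order::2", None),
]

shadow = ShadowDataset()
for change in changes:
    shadow.apply(change)

print(shadow.docs)
```

After the replay only `order::1` remains, with its latest value. Under this model the shadow never holds more than the source does, so history that must outlive the operational data would need to be kept somewhere else.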
Currently the data flows from Couchbase Server to Analytics. If you’d like to publish the results of analytical processing back to Couchbase Server (2-way sync) and eventually to a business application, it can be done by persisting the results back to Couchbase Server. I am happy to discuss this in more detail if I have misunderstood what you meant by 2-way sync.
We are working on creating partnerships with BI tool vendors and would be happy to discuss how to integrate with the BI tool deployed in your organization. At this point we’ve integrated with Knowi for exploring data in Analytics and creating dashboards.
So it’s been 2 years now, and it seems Knowi, a commercial solution, remains the only BI solution with native Couchbase integration, according to my research…
None of the open-source solutions (Metabase, Superset, Redash) appear to have support for Couchbase, except through an ODBC/JDBC driver (which also only exists as a commercial option, and not a particularly cheap one).