It is crucial to plan for performance optimization and resource utilization now that there is support for external Couchbase Analytics collections (AWS S3 and Azure Blob Storage). External Couchbase Analytics collections read the data from external storage and do not store it locally, which can be an expensive operation if not optimized.

This article discusses the factors that significantly impact performance and resource utilization when querying data from external storage. We will also look at a detailed example to see how effective our optimizations can be.

We will consider the following items and discuss their impact on performance and utilization of resources:

  • Internet bandwidth and distance from external storage
  • Number of Couchbase Analytics nodes and partitions and number of files in external storage
  • Prefixes and include/exclude filters

Internet bandwidth and distance

Since external data is not stored in Couchbase Analytics’ native storage, each query request transfers the current version of the data over the network. This makes network bandwidth a significant factor in the performance of such queries. Network latency is another factor–the distance the data has to travel will impact latency. Therefore, it is recommended to have the targeted storage hosted in a nearby region.

Note: Some storage service providers do not charge any transmission cost if the application and host storage is in the same region, improving performance and reducing cost.

Partitions vs. number of files

When querying an external Couchbase Analytics collection, the data scanning workload will be evenly distributed across available Analytics nodes and partitions to read the data in parallel. (This is similar to how internal storage is read in parallel). To ensure that this parallel reading capability is fully utilized, the data in external storage must be stored in multiple files rather than in one or a few large files.

For example, let’s assume that we have three Analytics nodes with four partitions per node. This configuration results in a parallel degree of 12 (12 partitions in total). When an external Analytics collection is queried, it should be able to read 12 files in parallel from external storage.

If we have our data organized in multiple files, the above setup will read 12 files at a time, but if we have only a few files (e.g., five large files), then only five partitions will have work to do, leaving the rest of the available Analytics resources idle.

Analytics aims to distribute the scanning workload evenly among the total number of partitions, not across nodes. In the above example, each of the 12 partitions will have 1/12 of the total workload, and each node will have 1/3 of the load. If the nodes had different numbers of partitions for some reason, the per-partition workload would remain uniform, but the nodes themselves would be non-uniformly loaded. That’s why it is recommended to have the data distributed in multiple files so the workload can be distributed. 

However, it is just as crucial to avoid having too many small files like hundreds of thousands or millions of files, as this could result in poor performance due to file metadata handling and too many file open/close operations.

Prefixes and include/exclude filters

Data organization

Let’s start by looking at the structure of the data we will be working with here for illustration. We will work with three years worth of reviews data. We will organize our data per year, per quarter, and per month. The data is for the years 2018, 2019 and 2020.

Here are samples of the directory layout and of a typical document:

Document sample:

The above structure will enable us to have a per-month granularity over the data, which we will utilize in further examples.

Let us suppose that the total size of our data for the three years is 720GB. This gives us around 240GB of data per year (three years) and 20GB per month (36 months in three years).

How do External Collections work?

Since the data organization can significantly impact performance, let’s revisit how Couchbase Analytics reads data from external storage.

The external analytics collections data are stored in remote external storage such as AWS S3, Azure Blob/Data Lake, or Google Cloud Storage. Azure Blob Storage and Google Cloud Storage are currently in developer preview. 

When an external link and an external Analytics collection are created, they are definitions of entities configured to read the data from the external source, but the data is not stored in Analytics. This enables Analytics to always see the latest versions of external data – but it can also be slow. We can optimize the reading process using prefixes and other filters to save time and resources.

We will see in the following example how big of a difference a filter can make if they are judiciously used in conjunction with well-organized data.

Requirements

Let’s take a look at the following simple query requests to see the difference in performance with and without filtering:

  • Get the total number of reviews from customers for 2018, 2019, and 2020.
  • Get the total number of reviews from customers for the year 2019.
  • Get the total number of reviews from customers for January 2020.

Non-filtered solution

Let’s assume there is an external link already available called myLink (see External Datasets: Accessing AWS S3 in Couchbase Analytics for an example of how a link is created). We start by creating an external Analytics collection for all review data with no prefix and no include/exclude filters. The syntax is straightforward:

With this external Analytics collection created, let’s write the queries, which are simple and straightforward, as follows:

Given our Internet bandwidth and distance from the service provider, we might (and in fact did) get the following:

  • These queries took 435, 448 and 430 seconds to finish, which shows that they nearly performed the same.
  • Each query resulted in 720GB worth of data transfer.

What’s happening here is that each of these queries resulted in reading every data record (i.e., all of the data is transferred), then applying any filtration necessary to give the correct result. However, the data has already been transferred by the time it gets filtered. This is the primary determining factor of performance in the above example.

In the absence of further optimizations, all queries against our reviews will read the whole data set again and again, which can be very expensive. While this may be acceptable for infrequently accessed external data, we can do (much) better if desired.

Filtered solution

Now, let’s examine how we can utilize the prefix and include/exclude filters available for external Analytics collections and how we structure our data to improve the performance of our queries.

The first query requires reading all the data, so let’s leave that as it is. Let’s instead look at queries 2 and 3.

In query 2, the data required is only for the year 2019, so we can create a new external Analytics collection and use the year as a prefix when we do so:

The DDL results in an external Analytics collection configured to read the data from the specified prefix only, that is, “reviews/2019“. It is important to note that this collection is virtual; it is just another named entry point into the same externally stored data. (No data is being replicated, and no additional space is being used due to the creation of this additional external collection – it’s just metadata.)

To understand what’s happening, let’s look at what happens internally. When an external Analytics collection is queried, the query engine starts by retrieving the metadata of the files, which contains only information about the files. The contents of the files are not read yet. Part of the returned metadata information is the full name/path of the files and the sizes of their content.

Analytics utilizes this information to perform two steps:

  • It applies any prefix, include/exclude filtration on the file name, to decide whether to include each file or exclude it altogether, hence saving resources.
  • It calculates the overall size of the data, which leads to distributing the workload evenly among all the partitions of the Analytics nodes for parallel access and processing.

Applying the above, queries on the my2019Collection will only read the files prefixed with “reviews/2019“, which will only include 33% of the data. The relevant query can now be written as:

Running the above query, we observe the following:

  • The total execution time is now 160 seconds, down from our previous 448 seconds, only 35% of the original time. This is expected, as we read only 33% of the data.
  • The total amount of data transferred is 240GB, down from our previous 720GB.

This is a considerable improvement in both performance and resource utilization.

Following the same pattern, let’s use the include/exclude filter capability of external collections. Query 3 is interested in only the data for January 2020.

Let’s create an appropriate external Analytics collection as follows:

When data is read from the above collection, it will only include files prefixed with “reviews/2020“, effectively including only the 2020 data, and then (also before reading the data!) Analytics will also apply the include filter. As a result, the query will only retrieve files that match the pattern “*/Jan/*.json” and will thus include only JSON files for January. Having determined the target files for the query, the query engine will distribute the workload across the partitions and commence with reading the data.

Executing our 3rd query, written against this new collection:

We observe the following:

  • The total execution time is now just 12 seconds, as we are effectively reading just 2.78% of the data.
  • The total amount of data transferred is now only 20GB.

Note that include/exclude filters can also support wildcard expressions, and multiple filters can be specified at a time, for example:

This can be used to include the data for January and March only.

Next steps

Since accessing external data can be costly, applying data organization best practices and corresponding external collection definitions are recommended. This will go a long way towards getting the best possible performance by enabling Couchbase Analytics to utilize available resources in parallel as much as possible and keep the costs involved in executing queries to a minimum.

Follow up with the following links used in this article to learn more about Couchbase Analytics:

Author

Posted by Hussain Towaileb, Software Engineer

Hussain Towaileb is a Software Engineer working on Couchbase Analytics. He focuses on external links and external datasets.

Leave a reply