Very slow performance on query without index on simple documents

Hello,

We work in a large European bank and we are starting to make tests with Couchbase in order to eventually migrate some of our SQL databases. We have initially inserted 5 million documents that look like this:

{
  "fileName": "au_fut_floor.dico",
  "uniqueID": "000002",
  "mnemo": "AA",
  "instrumentType": "F",
  "strike": 0,
  "expiry": 1464732000000,
  "version": "|",
  "feedType": "10",
  "feedCode": "SFE-AAM6",
  "feedName": "SFE",
  "orderPassingCode": "SFE-AAM6",
  "quotity": 1,
  "tickSizeRule": "sfe-10-1Fut",
  "minQuantity": 1,
  "maxQuantity": 0,
  "additionalData": "|",
  "serieCentraleFlag": "-1",
  "lotSize": 1,
  "currency": "AUD",
  "isinCode": "|",
  "underlyingName": "AA",
  "segmentName": "|",
  "sectorName": "|",
  "updateDate": 1451499570578
}

Please note the “currency” field, which always contains a 3-character string.
Yesterday we executed what we thought was a simple N1QL query using cbq. Basically, we wanted to count the number of instruments that have currency = “EUR”:

SELECT COUNT(*) FROM dico WHERE currency="EUR"

Here are the query results along with performance numbers:

As you can see, it took 19 minutes to execute! Now, please do keep in mind there is no index on this field, so a full scan was performed. Furthermore, here are the server hardware details:

OS: Windows Server 2012 (64-bit)
Processor: Intel Xeon E5-2690 v3 @ 2.60 GHz (4 processors)
RAM: 32 GB
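
For what it’s worth, the access path can be verified with EXPLAIN; with no secondary index on currency, the plan should show a primary scan feeding the filter:

EXPLAIN SELECT COUNT(*) FROM dico WHERE currency = "EUR";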

So my question is the following: once we hit a large number of documents, say > 1 million, are we supposed to have an index on every field we are going to search? I understand that doing a full document scan from disk is expensive, but 19 minutes? For comparison purposes, we ran the same query (count the number of instruments with currency = “EUR”) on an MS SQL Server table that has very similar fields and 14 million records, with no indexes; it executed in 18 seconds. We later created the index on the currency field in Couchbase, and the query executed in 16 seconds.
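
For reference, a minimal index statement for this single field looks like the following (idx_currency is an arbitrary name):

CREATE INDEX idx_currency ON dico( currency ) USING GSI;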

Evidently, going from 19 minutes to 16 seconds is quite a gain, but again, compared to the performance of MS SQL without an index, it is far from impressive. I want to continue pushing a document technology here in our department, because it adds a lot of flexibility to our applications and avoids tables with 110 columns, for example. It may be that this type of query will always be faster in SQL: searching one column will always beat its NoSQL counterpart. But is that a fair tradeoff against having documents that combine data from 5 different tables, thereby avoiding expensive JOIN operations?

Please help me understand this performance, and if you have any comments on the costs and benefits of replacing SQL with a document database (Couchbase) pertinent to the example above, please don’t hesitate.


Thanks for the comments @alsa,
I have had the chance to work on developing both of the databases in question, so I am happy to answer in detail.

I can understand why you are surprised by the overhead of a simple query at a throughput of 1. However, I’d point out a few issues:

  • Couchbase is a distributed system, whereas SQL Server is a single-node monolithic process. Comparing a scale-up system to a distributed system on one node can be misleading when measuring performance.
  • Performance measured at a throughput of 1 query vs 1000 queries/sec can yield very different performance curves. Under non-concurrent execution of a single query, SQL Server has an advantage. In reality, however, systems deal with more than one query executing at a time, and as throughput increases, Couchbase Server gains the advantage. At higher throughput you will find a throughput ceiling with SQL Server that you cannot shatter. Couchbase is tuned to scale out: almost all production systems run Couchbase on multi-node deployments for availability and scalability, and under higher throughput, scaled out, Couchbase will perform better.
  • Indexes are more important to Couchbase: Couchbase is designed to distribute data elastically as nodes come and go or fail, and we redistribute data for better availability. Unlike SQL Server, we cannot hard-code a “local access path” to the data at all times, so we depend more on indexing to locate data. The good news is that we have built the system to sustain many more indexes than SQL Server, for better scalability and performance. And that is my next point…
  • Indexing has a lighter cost on Couchbase than on SQL Server: in relational systems, and even in many NoSQL systems (MongoDB, for example), index maintenance is done as part of data mutations (INSERT/UPDATE/DELETE). This makes indexing a big compromise between fast inserts and fast queries. Couchbase is architected differently, and that tradeoff between fast inserts and fast queries is not there. We have customers running with >100 indexes and sub-millisecond mutations, so you can “over-index” in Couchbase; see the sketch after this list.
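
To illustrate building out many indexes cheaply: GSI supports deferred builds, so you can define several indexes up front and then build them in a single pass. A sketch against the bucket and fields from your documents (index names are arbitrary):

CREATE INDEX idx_feedName ON dico( feedName ) USING GSI WITH {"defer_build": true};
CREATE INDEX idx_underlying ON dico( underlyingName ) USING GSI WITH {"defer_build": true};
BUILD INDEX ON dico( idx_feedName, idx_underlying ) USING GSI;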

Aside from perf/scale, you identified the difference between relational and NoSQL technology well. Flexibility is a big reason to come to native JSON databases. I’ll add a few more reasons to your list:
- Develop with schema agility and without ALTER TABLE or downtime: let the app drive the schema and continuously change the app.
- Eliminate ORMs that translate your data from objects to relational and back on each call.
- Most importantly, get a flexible data model. That means you can model your data to optimize your access paths, with more options in the data model. You can normalize all the way, much like you would in relational, and use JOINs with Couchbase; or use the power of JSON to embed natural relationships, like orders + order_line_items, in a single document; or anything in between (see the sketch below).
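
For instance, a hypothetical order document with its line items embedded, rather than split across two tables, might look like this:

{
  "orderId": "O-1001",
  "customerId": "C-42",
  "currency": "EUR",
  "order_line_items": [
    { "lineNo": 1, "feedCode": "SFE-AAM6", "quantity": 100 },
    { "lineNo": 2, "feedCode": "SFE-ABM6", "quantity": 250 }
  ]
}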

I hope this helps answer your question.

Lastly, a quick tip: this will not improve your 19-minute scan to seconds, but it will improve the performance of the scan somewhat. Given that you don’t execute many queries concurrently, you can use max_parallelism to utilize the full machine and parallelize your scan. We don’t do this auto-magically yet. Unfortunately, trying this requires a different tool: cbq is not designed for high-performance query execution, so it does not allow this. You need to go through curl or another REST-capable tool, such as the Chrome browser; I like to use a tool like Postman. Here is a sample REST API call:

http://10.0.0.45:8093/query/service?statement=select * from default where a2=4703626&max_parallelism=4
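
With curl, the equivalent request might look like this (same host and query as above; --data-urlencode takes care of encoding the statement):

curl http://10.0.0.45:8093/query/service --data-urlencode 'statement=select * from default where a2=4703626' --data-urlencode 'max_parallelism=4'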

Attached is a screenshot of my execution with max_parallelism on my small Couchbase 4.1 cluster. The first execution shows a 5K items/sec scan throughput with max_parallelism=1; the next execution, with a taller 10K/sec read throughput, runs with max_parallelism=2.

Hi @alsa.

In addition to @cihangirb’s points, I would point out that we do not see the same latencies you are seeing in our environments. Did you run multiple tests? If so, we would be happy to work with you to troubleshoot.

We also have customers in production with similar use cases as yours, and they do not see the same latencies.

Thanks,
Gerald

Thanks a lot @cihangirb for your very complete answer. I now have a better grasp of how to get the most out of a NoSQL solution, notably vis-à-vis a SQL one. I will act accordingly and may come back to you with further questions.

@geraldss I am going to run more tests using the same query and the same cluster, with and without the index, and come back to you if we keep seeing the same latency.

Thanks!

I have the same problem when executing COUNT or MAX, like this:
SELECT MAX(timestamp) FROM gts WHERE accountid = 'demo';

Even though I have created indexes for accountid and timestamp, the query takes about 6~10 seconds.
If I don't use MAX and limit the data to 500, it takes about 140 ms.
The total data is about 65,000 documents.

I really don't understand why I am getting this problem, because this is a really simple use case. Can someone help?

Hi @Hiro,

You can do the following; it is a workaround for MAX. Currently, we handle MIN better than MAX; we will fix this. The trick is to index the negated timestamp: since MIN(-x) = -MAX(x), the smallest negated value corresponds to the largest timestamp, and negating the MIN result recovers the MAX.

If you post additional queries, we can suggest the necessary indexes.

For the MAX workaround:

CREATE INDEX idx_max ON gts( -MILLIS( timestamp ) ) WHERE accountid = 'demo';

SELECT MIN( -MILLIS( timestamp ) ) FROM gts WHERE accountid = 'demo';

For COUNT:

CREATE INDEX idx_count ON gts( accountid );

SELECT COUNT(*) FROM gts WHERE accountid = 'demo';

Hi @geraldss

Thank you for your quick response!

CREATE INDEX idx_max ON gts( -MILLIS( timestamp ) ) WHERE accountid = 'demo';

This function (MILLIS) is actually STR_TO_MILLIS, and it does not really help, because timestamp is already a Unix timestamp (a plain number). I also created an index for it, like this:
CREATE INDEX idx_timestamp ON gts( timestamp ) USING GSI;

For COUNT, I also created an index, like this:
CREATE INDEX idx_accountid ON gts( accountid ) USING GSI;

It still takes over 6 seconds.

By the way, these results are on CentOS 6.7 (64-bit) with 4 GB RAM.
A colleague told me that he gets results under 1 second when the OS is Windows Server 2008.
Do you have any idea about this?

Thanks a lot!

If you already have a UNIX timestamp, you can do:

CREATE INDEX idx_max ON gts( -timestamp ) WHERE accountid = 'demo';

SELECT MIN( -timestamp ) FROM gts WHERE accountid = 'demo';
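
Note that this returns the negated value; negate it again to recover the actual maximum. A sketch (if wrapping the aggregate defeats the index optimization, just negate the returned value in your application instead):

SELECT -MIN( -timestamp ) AS max_timestamp FROM gts WHERE accountid = 'demo';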

For COUNT:

CREATE INDEX idx_count ON gts( accountid ) WHERE accountid = 'demo';

SELECT COUNT(*) FROM gts WHERE accountid = 'demo';

Hi @geraldss

Thank you for your help; I really appreciate it.
I have followed your instructions, but the query still takes nearly 6 seconds.
That is really too slow just to get the MAX timestamp.

SELECT MIN( -timestamp ) FROM gts WHERE accountid = 'demo';

I tried setting max_parallelism = 4, but it does not help either.
Is there any optimization documentation I should refer to?

Best regards,

Hi @Hiro,

What version of Couchbase are you using?

Thanks to @egrep for the suggestion/question.