Query cache and index selectivity

yebomagic · January 16, 2017, 7:05pm

Hi,

[1]
I have around 300 million documents in my test system, similar to below

{
“ownerId” : 1,
“playerId” : 500,
“gems” : 750
}

There are about 50 owners, with the top 4 owners having around 30 million documents each. The rest are around the 1-5 million document mark per ownerId.

For a specific ownerId I am looking for playerId that either has an exact amount of gems or a range of gems.

SELECT playerId FROM gemshop WHERE ownerId = 1 AND gems = 750
SELECT playerId FROM gemshop WHERE ownerId = 1 AND gems > 700

To speed this up I created covering indexes, a generic to cover all and another for the large owners

CREATE INDEX idx_owner ON gemshop (ownerId, gems, playerId)
CREATE INDEX idx_owner_1 ON gemshop (ownerId, gems, playerId) WHERE ownerId = 1

These are standard GSI index, not memory. The machines where the indexes ran have 128GB of memory and the indexes are around 10-15GB each (indexes are created on all index nodes).

When I run my queries, the results are a little inconsistent. When I run it the first time for a range value (>750) it takes a few seconds to a minute to complete. If I run the query again it completes in less than a second. If I change the value to search on (for example to < 900), it takes long on the first search and then fast on subsequent re-searches for that value.

At the time of the searching there are no changes being done to the documents.

Is this expected that the times are so volatile? Is there anything else I can do to make the search time a little more deterministic (without changing to MOI)? Would prepared statements help?

[2]
Any other ideas on what I could do improve the doc to aid the searching? I was thinking of adding a bucketing value to the document, that will save a “rolled down to the nearest 500” value.

{
“ownerId” : 1,
“playerId” : 500,
“gems” : 750,
“atLeast” : 500
}

That way my query might be more selective

CREATE INDEX idx_owner ON gemshop (ownerId, atLeast, gems, playerId)
SELECT playerId FROM gemshop WHERE ownerId = 1 AND gems = 750 AND atLeast = 500
SELECT playerId FROM gemshop WHERE ownerId = 1 AND gems > 700 AND atLeast = 500

Apologies if the JSON or queries have syntax errors, this was typed in the browser.

geraldss · January 17, 2017, 2:48am

Interesting problem. The important thing is to partition the indexes by ownerId and gems. I don’t know the range and distribution of gems, but let’s assume a uniform distribution from 0 to 1000. I would partition ownerId into 5: one for each of the large owners, and one for everyone else. That’s five partitions. If you partition the gems into 4 ranges, that’s a total of 20 partitions. Your queries would have to be aware of partition boundaries.

CREATE INDEX owner1_0 ON gemshop( ownerId, gems, playerId ) WHERE ownerId = 1 AND gems BETWEEN 0 and 249;

CREATE INDEX owner1_250 ON gemshop( ownerId, gems, playerId ) WHERE ownerId = 1 AND gems BETWEEN 250 and 499;

etc.

EXPLAIN
SELECT playerId
FROM gemshop
WHERE ownerId = 1 AND gems = 750;

CREATE INDEX ownerOther_0 ON gemshop( ownerId, gems, playerId ) WHERE ownerId NOT IN [ 1, 2, 3, 4 ] AND gems BETWEEN 0 AND 249;

To query across partition boundaries, use UNION ALL.

yebomagic · January 17, 2017, 9:07am

That works for the equality operators, when it comes to the range queries it becomes a little more tricky.

CREATE INDEX owner1_0 ON gemshop( ownerId, gems, playerId ) WHERE ownerId = 1 AND gems BETWEEN 0 and 249;
CREATE INDEX owner1_250 ON gemshop( ownerId, gems, playerId ) WHERE ownerId = 1 AND gems BETWEEN 250 and 499;

When I do a range it wants to use the primary index for the query

SELECT playerId FROM gemshop WHERE gems >= 100

Then I thought I would be clever and force usage across the indexes and union the results, but these just fall back to primary scans as they aren’t covered

SELECT playerId FROM gemshop USING INDEX (owner1_0) WHERE gems >= 100
UNION
SELECT playerId FROM gemshop USING INDEX (owner1_250) WHERE gems >= 100

To get this to work requires more knowledge about the indexes in the app logic that generates the queries.

SELECT playerId FROM gemshop USING INDEX (owner1_0) WHERE gems >= 100 AND gems <= 249
UNION
SELECT playerId FROM gemshop USING INDEX (owner1_250) WHERE gems >= 250 AND gems <= 499
etc.

But it doesn’t seem to perform that great.

geraldss · January 17, 2017, 9:15pm

Try UNION ALL instead of UNION.

Also, all your queries must have a condition on ownerId. And they must all be covered.

yebomagic · January 17, 2017, 9:40pm

Will try the union all and check. I accidentally left out the ownerId in my last example query, I made sure that the individual SELECT queries were all covered by their respective indexes before adding the UNION.

In reality I should also not expect sub-second performance on all queries , if the query was for >100, with a gem range of 0-1000 and even distribution over millions of documents, that is still almost 90% of millions of playerIds that could be returned after doing large index scans.

Topic		Replies	Views
Explanation on Primary Scan / Secondary Scan SQL++	13	5185	December 10, 2015
Covering index range spans SQL++	2	1719	March 8, 2017
Help for optimizing SQL++	46	1906	April 30, 2019
Inconsistent data returned with index Java SDK query , n1ql , index	24	1230	June 15, 2022
N1QL Query Performance SQL++	13	3189	August 1, 2016

Query cache and index selectivity

Related topics