Index on LEFT JOIN

yian.shih · August 15, 2023, 4:38am

Hi all,

I have a collection with up to one millions documents and I’d like to get a combined data from more than one documents.

Document has five fields, [id, key, category, label, value] where id always consists of key:category:label in lower casePreformatted text

For example,

DocA

{
  id: "a:my-category:my-label",
  key: "a",
  category: "my-category",
  label: "my-label",
  value:"This is DocA"
}

DocB

{
  id: "b:my-category:my-label",
  key: "b",
  category: "my-category",
  label: "my-label",
  value:"This is DocB"
}

DocA and DocB have same category and label

I have a requirement to fetch the combined data based on key

Desired Data format

{
  id: "a:my-category:my-label",
  key: "a",
  category: "my-category",
  label: "my-label",
  value:"This is DocA"
  extraValue:"This is DocB"
}

This is my n1ql query:

SELECT d1.*, d2.value AS extraValue
FROM `doc-collection` AS d1
LEFT JOIN `doc-collection` AS d2 ON KEYS REPLACE(d1.`id`,"a","b")
WHERE d1.`key` = "a" OFFSET 0 LIMIT 10

It works well when offset is low, but it takes forever when offset is huge.

I assuming this query needs to scan the entire index.

Here is my index

CREATE INDEX by_category_label ON doc-collection(category,label)

CREATE INDEX by_key_category ON doc-collection(key,category)

What would be the most efficient way to achieve this?

Much appreciate for any suggestions.

dh · August 15, 2023, 7:32am

Without ordering you can’t be assured of unique consecutive batches (or indeed repeated results for a single batch). Your offset isn’t pushed down because of the join. You could eliminate the ORDER BY from the examples below too but I’m assuming you want reliable pagination of the results.

I’m assuming that “id” mirrors meta().id (i.e. the document key) always.

You can try:

CREATE INDEX ixtest ON `doc-collection`(`key`,`category`,`label`,`value`,`id`) WHERE `key` = "a";

with:

SELECT d1.id, d1.`key`, d1.category, d1.label, d1.`value`,
       (SELECT d2.value
        FROM `doc-collection` AS d2
        USE KEYS "b:"||TO_LOWER(d1.category)||":"||TO_LOWER(d1.label)
       )[0].`value` AS extraValue
FROM `doc-collection` AS d1
WHERE d1.`key` = "a"
ORDER BY d1.`key`, d1.category, d1.label
OFFSET 0
LIMIT 10

to cover the “a” documents and eliminate one fetch.

(You could also have the sub-select covered but for a single result the overhead is likely more than the direct key fetch.)

Note the change to the USE KEYS: REPLACE as you stated it wouldn’t necessarily produce the correct key - say all 3 elements were the same text (e.g. “a:a:a”) you’d be wanting to match on “b:a:a” but the REPLACE would generate “b:b:b”. You could use REPLACE(id,“a:”,“b:”,1) to handle this case but it could still generate an unwanted key, e.g. transforming “c:a:e” to “c:b:e”. Hence the construction of the document key from the source fields.

“id” is a rather redundant field if my assumption of it mirroring meta().id is true, so if eliminated the index would be smaller and you could have the same result with:

CREATE INDEX ixtest ON `doc-collection`(`key`,`category`,`label`,`value`) WHERE `key` = "a";

and:

SELECT meta(d1).id, d1.`key`, d1.category, d1.label, d1.`value`,
       (SELECT d2.value
        FROM `doc-collection` AS d2
        USE KEYS "b:"||TO_LOWER(d1.category)||":"||TO_LOWER(d1.label)
       )[0].`value` AS extraValue
FROM `doc-collection` AS d1
WHERE d1.`key` = "a"
ORDER BY d1.`key`, d1.category, d1.label
OFFSET 0
LIMIT 10

HTH.

vsr1 · August 15, 2023, 1:27pm

Try keyset pagination Using OFFSET and Keyset in N1QL - The Couchbase Blog


CREATE INDEX ix1 ON `doc-collection` (META().id) WHERE `key` = "a";
$id = ""

SELECT d1.*, (SELECT RAW d2.`value`
              FROM `doc-collection` AS d2 USE KEYS "b:"||TO_LOWER(d1.category)||":"||TO_LOWER(d1.label))[0] AS extraValue
FROM `doc-collection` AS d1
WHERE d1.`key` = "a" AND META(d1).id > $id
LIMIT 10;

Repeat id = last id from previous results

Instead of META().id you can use id from document too.

yian.shih · August 16, 2023, 12:38am

Hi @dh and @vsr1 ,
Thanks for your answers. It works perfectly for the requirement I need.
However, we have other scenarios like we need to select the data conditionally.

Scenario 1
We need to select d1 where d2 does not exist

SELECT d1.*, d2.value AS extraValue
FROM `doc-collection` AS d1
LEFT JOIN `doc-collection` AS d2 ON KEYS REPLACE(d1.`id`,“a”,“b”)
WHERE d1.`key` = “a” AND d2 IS MISSING
OFFSET 0 LIMIT 10

Which performs like this
Screen Shot 2023-08-16 at 10.30.46 am

Scenario 2
I’d like to search the data which either matches d1 or d2

SELECT d1.value AS baseValue , d2.value AS extraValue
FROM `doc-collection` AS d1
LEFT JOIN `doc-collection` AS d2 ON KEYS REPLACE(d1.`id`,“a”,“b”)
WHERE d1.`key` = “a” AND search(d1, some query) AND search(d2, some query)

OFFSET 200000 LIMIT 10

Thanks again for your time.

vsr1 · August 16, 2023, 1:08am

Extend earlier suggestion, Start where you left off vs from start. As LIMIT may not able to push to indexer due to additional predicates ( there might be extra fetches those are not big concern). But still can start where left off and stop when LIMIT satisfied

LEFT side index scan is on document key, which is unique and sorted so you can start where you left off.

Scenario 1:

SELECT d1.*, d2.`value` AS extraValue
FROM `doc-collection` AS d1
LET d2 = (SELECT RAW d3
          FROM `doc-collection` AS d3 USE KEYS "b:"||TO_LOWER(d1.category)||":"||TO_LOWER(d1.label))[0]
WHERE d1.`key` = "a" AND META(d1).id > $id AND d2 IS MISSING
LIMIT 10;

Scenario 2:

SELECT d1.*, d2.`value` AS extraValue
FROM `doc-collection` AS d1
LET d2 = (SELECT RAW d3
          FROM `doc-collection` AS d3 USE KEYS "b:"||TO_LOWER(d1.category)||":"||TO_LOWER(d1.label))[0]
WHERE d1.`key` = "a" AND META(d1).id > $id AND search(d1, some query) AND search(d2, some query)
LIMIT 10;

yian.shih · August 16, 2023, 7:04am

Hi @vsr1 ,
Thanks so much for your reply.

I’m experiencing a case in Scenario 1 that
When dataset B has most of the data overlapped with dataset A, it takes minutes to find the data in your answer above.
I’m wondering if there is any way to handle this scenario?

Thanks again for your help

vsr1 · August 16, 2023, 12:29pm

May be first produce missing document keys using covering index and then retrieve actual values

CREATE INDEX ix2 ON `doc-collection` (META().id) WHERE  `key` = "b";

SELECT  RAW META(d).id
FROM `doc-collection` AS d
WHERE d.`key` = "a"

EXCEPT

SELECT  RAW REPLACE(META(d).id,"b:","a:")
FROM `doc-collection` AS d
WHERE d.`key` = "b"

yian.shih · August 21, 2023, 6:30am

@vsr1 Thanks for all the advices. They work well for my requirements.

system · November 19, 2023, 6:30am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Creating Index when 2 documents having array fields as predicates are joined SQL++ n1ql	6	1185	October 14, 2017
N1QL - JOIN Query SQL++ n1ql	1	586	May 19, 2021
About left join SQL++	1	1599	December 18, 2017
Multiple Index JOIN from a single document SQL++ n1ql	6	1196	November 30, 2018
Issues on querying data SQL++ query , java , n1ql	4	578	October 3, 2022

Index on LEFT JOIN

Related topics