Search in array and regroup based on other key

flaviu · November 6, 2020, 10:46am

I wonder if this is possible to be implemented in N1SQL

so, I have multiple documents of the format:

Ex. 1

{
  "country_code": "ae",
  "chart_timestamp": 1597533003,
  "chart_a": {
       "1000008721": 6,
        "403274099": 2,
        "432756187": 1,
        "454403780": 3,
        "463569251": 4,
        "493073807": 7,
        "493837285": 5,
        "501344243": 8
},
  "version": 1
}

Ex. 2

{
  "country_code": "ae",
  "chart_timestamp": 1597534004,
  "chart_a": {
       "1000008721": 5,
        "403274099": 2,
        "432756187": 1,
        "454403780": 3,
        "463569251": 4,
        "493073807": 7,
        "493837285": 6,
        "501344243": 8
},
  "version": 1
}

Ex. 3

{
  "country_code": "ae",
  "chart_timestamp": 1597535005,
  "chart_a": {
       "1000008721":3,
        "403274099": 2,
        "432756187": 1,
        "454403780": 5,
        "463569251": 4,
        "493073807": 7,
        "493837285": 6,
        "501344243": 8
},
  "version": 1
}

Now, what I would like to achieve is a document of the below format, when searching for a key in he chart_a array.

For example, if the key is 1000008721

"position_history":{
    "1597535005": 3,
    "1597534004": 5,
    "1597533003": 6
},
"smaller_position": 3
}

smaller_position is the MIN value of the resulting position_history object
the key of position_history is the chart_timestamp of each individual document
the where clause of the query should be based on the presence of the key (in our case: 1000008721) in each document chart_a ARRAY.
Also, I wonder how should the index look like to index these arrays in every object

vsr1 · November 6, 2020, 1:07pm

Set positional parameters "1000008721";

INSERT INTO default VALUES("f01",{ "country_code": "ae", "chart_timestamp": 1597533003, "chart_a": { "1000008721": 6, "403274099": 2, "432756187": 1, "454403780": 3, "463569251": 4, "493073807": 7, "493837285": 5, "501344243": 8 }, "version": 1 });
INSERT INTO default VALUES("f02",{ "country_code": "ae", "chart_timestamp": 1597534004, "chart_a": { "1000008721": 5, "403274099": 2, "432756187": 1, "454403780": 3, "463569251": 4, "493073807": 7, "493837285": 6, "501344243": 8 }, "version": 1 });
INSERT INTO default VALUES("f03",{ "country_code": "ae", "chart_timestamp": 1597535005, "chart_a": { "1000008721":3, "403274099": 2, "432756187": 1, "454403780": 5, "463569251": 4, "493073807": 7, "493837285": 6, "501344243": 8 }, "version": 1 });


SELECT OBJECT v1.f:v1.v FOR v1 IN av END AS position_history,
                   ARRAY_MIN(av[*].v) AS smaller_position
FROM default AS d
WHERE d.country_code = "ae" AND d.chart_a.[$1] IS NOT NULL
LETTING av = ARRAY_AGG({ "f":TO_STR(d.chart_timestamp), "v": d.chart_a.[$1] });

flaviu · November 6, 2020, 5:07pm

@vsr1, sorry for my late answer.

That’s incredible how powerful N1QL is . Also, it is incredible what YOU could do with it…

Thanks for updating the post, initially it didn;t work and I tried to make it work before I saw your answer.

The problem I have now, is that it is super slow on a large number of documents. Can you please recomand me how to index this better? Actually, right now, I don;t have any index on the chart_a array.

vsr1 · November 6, 2020, 5:42pm

You must add WHERE clause on index avoid big index

CREATE INDEX ix10 ON default(ALL ARRAY [v.name, v.val] FOR v IN OBJECT_PAIRS(chart_a) END, country_code, chart_timestamp) ;

SELECT OBJECT v1.f:v1.v FOR v1 IN av END AS position_history,
                   ARRAY_MIN(av[*].v) AS smaller_position
FROM default AS d
UNNEST OBJECT_PAIRS(d.chart_a) AS u
LET fltr = [u.name, u.val]
WHERE d.country_code = "ae" AND  fltr >= [$1] AND fltr < [SUCCESSOR($1)]
LETTING av =  ARRAY_AGG({ "f":TO_STR(d.chart_timestamp), "v": fltr[1] });

non covered query

CREATE INDEX ix11 ON default(DISTINCT OBJECT_NAMES(chart_a), country_code, chart_timestamp) ;

SELECT OBJECT v1.f:v1.v FOR v1 IN av END AS position_history,
                   ARRAY_MIN(av[*].v) AS smaller_position
FROM default AS d
UNNEST OBJECT_NAMES(d.chart_a) AS u
WHERE d.country_code = "ae" AND  u = $1 ;
LETTING av =  ARRAY_AGG({ "f":TO_STR(d.chart_timestamp), "v": d.chart_a.[u]});

flaviu · November 8, 2020, 2:54pm

Are you sure, here “v.name” should be v.name instead of just v ? from what i understand OBJECT_NAMES(chart_a) return just an array so there is no v.name

Also, for some reason I canont make this work fast enough. I am going to play with it and come back with more questions.

I have 145M documents and the chart_a has approx 400 keys, so the index is growing very fast. I am trying to do a partioning now, hope this will be faster.

If the WHERE clause reduce the number of documents close to 1000 do you think it make sense to index the chart_a data or the query engine can get these 1000 documents in parallel fast enough?

vsr1 · November 8, 2020, 3:09pm

You are right.
CREATE INDEX ix11 ON default(country_code, DISTINCT OBJECT_NAMES(chart_a)) ;
Do you any other fields that says chart_a present? you should add that to index where clause

Is this is your exact query

flaviu · November 8, 2020, 3:18pm

Thank you for answering me over the weekend. You are doing a very good job!!!

regarding my query, the query is a little bit more complicated:

SELECT OBJECT v1.f:v1.v FOR v1 IN av END AS position_history,
       ARRAY_MIN(av[*].v) AS smaller_position
FROM tracker as d
UNNEST OBJECT_NAMES(d.chart_a) AS u
WHERE d.type = "document_type"
 AND d.provider = "provider_name"
    AND d.chart_type = "chart_type"
    AND d.platform = "platform"
    AND d.genre = id
    AND d.country_code = "country_two_letter_code"
    AND d.chart_timestamp > unixtimestamp
    AND d.chart_timestamp < unixtimestamp
    AND u = "app_id" LETTING av = ARRAY_AGG({ "f":TO_STR(d.chart_timestamp), "v": tracker.chart_a.[u]});

not sure if it helps having the entire query

vsr1 · November 8, 2020, 3:31pm

CREATE INDEX ix10 ON default(provider, chart_type, platform, genre, country_code, DISTINCT OBJECT_NAMES(chart_a), chart_timestamp)
WHERE type = "document_type";

SELECT OBJECT v1.f:v1.v FOR v1 IN av END AS position_history,
       ARRAY_MIN(av[*].v) AS smaller_position
FROM default as d
WHERE d.type = "document_type"
    AND d.provider = "provider_name"
    AND d.chart_type = "chart_type"
    AND d.platform = "platform"
    AND d.genre = 1234
    AND d.country_code = "country_two_letter_code"
    AND d.chart_timestamp > unixtimestamp
    AND d.chart_timestamp < unixtimestamp
    AND ANY v IN OBJECT_NAMES(d.chart_a) SATISFIES v = "app_id" END
LETTING av = ARRAY_AGG({ "f":TO_STR(d.chart_timestamp), "v": d.chart_a.["app_id"]});

You can create partition index on any immutable fields

Also checkout https://index-advisor.couchbase.com/indexadvisor/#1

vsr1 · November 8, 2020, 4:17pm

The following is even better to avoid Aggregation

CREATE INDEX ix10 ON default(provider, chart_type, platform, genre, country_code, DISTINCT OBJECT_NAMES(chart_a), chart_timestamp)
WHERE type = "document_type";


SELECT OBJECT v1.f:v1.v FOR v1 IN av END AS position_history,
       ARRAY_MIN(av[*].v) AS smaller_position
LET av = (SELECT TO_STR(d.chart_timestamp) AS f,
               d.chart_a.["app_id"] AS v
        FROM default as d
        WHERE d.type = "document_type"
            AND d.provider = "provider_name"
            AND d.chart_type = "chart_type"
            AND d.platform = "platform"
            AND d.genre = 1234
            AND d.country_code = "country_two_letter_code"
            AND d.chart_timestamp > unixtimestamp
            AND d.chart_timestamp < unixtimestamp
            AND ANY v IN OBJECT_NAMES(d.chart_a) SATISFIES v = "app_id" END);

flaviu · November 8, 2020, 4:31pm

Why is the DISTINCT OBJECT_NAMES(chart_a) before the chart_timestamp ? From what I know the order in the index should be the same as the one in the query. Isn’t this true?

In our query the chart_timestamp is used before the (ANY v IN OBJECT_NAMES(d.chart_a) SATISFIES v = "app_id" END)

I am missing something?

vsr1 · November 8, 2020, 4:34pm

No. That is not right assumption.

Order in query predicates no affect. (https://blog.couchbase.com/create-right-index-get-right-performance/), Rule #11

Query ORDER BY (if present) and index key order must match to use index order.

flaviu · November 8, 2020, 4:35pm

oh, ok, thank you, didn’t know this. I had the impression that the order is important

Topic		Replies	Views
N1QL with array SQL++	4	629	November 5, 2020
Search in array SQL++	4	511	May 16, 2021
Join based on keys and arrays SQL++	9	875	February 28, 2021
Grouping Duplicate key values of an array and displaying the count in N1QL SQL++ query , n1ql	1	1186	September 18, 2020
N1QL Query assistance SQL++	3	702	February 20, 2019

Search in array and regroup based on other key

Related topics