Difference between collections

Hi,

I have 2 collections with same documents but there are differences between the documents . How to find the difference between the two?

Example:

collection1:

{
    "name" : "Adam",
    "Vehicles" :
      [
       "car","bike"
      ]
}
{
    "name" : "Wright",
    "Vehicles" :
      [
       "Aeroplane"
      ]
}

collection 2:

{
    "name" : "Adam",
    "Vehicles" :
      [
       "car,bike,jeep"
      ]
}
{
    "name" : "Wright",
    "Vehicles" :
      [
       "Aeroplane"
      ]
}

Output:

{
    "name" : "Adam",
    "Vehicles" :
      [
       "car,bike,jeep"
      ]
}

If “name” is actually a key you’d like to join on and you are only interested in records where the “Vehicles” array doesn’t match between the two collections:

SELECT *                                                                                                                            
FROM default.s1.c1, default.s1.c2
WHERE c1.name = c2.name
AND c1.Vehicles != c2.Vehicles;

(returns both collections’ documents where they don’t match; change the * to c1.* or c2.* should you want to report only the one)
This could be made more efficient with a suitable index.

If you want to include those that exist in each collection but are absent from the other:

SELECT *                                                                                                                            
FROM default.s1.c1 LEFT OUTER JOIN default.s1.c2 ON c1.name = c2.name
WHERE c2.name is MISSING OR c1.Vehicles != c2.Vehicles
UNION
SELECT *
FROM default.s1.c2 LEFT OUTER JOIN default.s1.c1 ON c2.name = c1.name
WHERE c1.name IS MISSING
;

If you want to match other fields as well as “Vehicles”, you should be able to include them as additional conditions.
HTH.

1 Like

thanks for the solution . when using spring boot how to rewrite this query using placeholders like #n1q1.bucket,#n1ql.collection,#n1ql.selectEntity etc so that the bucket , scope , collection names are not hardcoded. Since we are using two different documents doubt these placeholders can be used.

When constructing the @Query :
You can use #n1ql place holders for the bucket and scope and the first collection.
For the second collection, you can use the SPEL positional place-holders and pass the second collection name as an argument. The parameter indexes start at 0. i.e.: #{[0]}

There is mention of using the SPEL positional parameters for a different purpose in Fluent API needs a parametrized Query [DATACOUCH-551] · Issue #859 · spring-projects/spring-data-couchbase · GitHub

thank you . not sure if it is possible . From the query given we can find the difference between the collections . is there a way to also find what is the difference ? Like whether it is insert or update or delete

There is no way to tell [edit: with an SQL query] if the absence of data is because it was never added or if it was added and later deleted.

Similarly there is no way to tell if the difference in an individual field comes about because it was inserted with the difference or later updated with it.

You can use the audit log to track when changes have been made to documents. Auditing can be configured on what is logged or not, tracking every document mutation will use a lot of resources though. Auditing | Couchbase Docs

Hi ,

Can you please suggest index for this query? I created index on name but still it takes around 50 seconds

How many documents are there in each of the collections?

The index advisor suggests these two:

CREATE INDEX adv_name_Vehicles ON `default`:`default`.`s1`.`c1`(`name`,`Vehicles`);
CREATE INDEX adv_name_Vehicles ON `default`:`default`.`s1`.`c2`(`name`,`Vehicles`);

for the first (non-UNION) statement.

(Edit: Deleted advice for the second statement; on review the suggest indices won’t help.)

HTH.

If you change the second statement to:

SELECT c1.name                                                                                                                      
FROM default.s1.c1 LEFT OUTER JOIN default.s1.c2 ON c1.name = c2.name
WHERE c1.name IS VALUED AND (c2.name is MISSING OR c1.Vehicles != c2.Vehicles)
UNION
SELECT c2.name
FROM default.s1.c2 LEFT OUTER JOIN default.s1.c1 ON c2.name = c1.name
WHERE c2.name IS VALUED AND c1.name IS MISSING

the above two indices should be used and may help. The addition of the IS VALUED clauses allows the selection of indices on “name” to be considered. Of course this does subtly change the possible results but I believe this to still be in line with your intentions (i.e. don’t return documents that don’t include “name”).