Duplicate documents in Elasticsearch

Hello everyone!

I must be doing something wrong, because my CBES connector keeps creating duplicate documents in Elasticsearch on every document mutation in Couchbase.

It looks like the connector is not upserting the document, but writing it under a new Elasticsearch document ID each time.

This is seen in the logDocumentLifecycle logs:

13:27:25.142 [dcp-io-7-1] INFO  c.c.c.e.DocumentLifecycle - {"milestone":"RECEIVED_FROM_COUCHBASE","tracingToken":30750,"documentId":"_default._default.client-unit:33009682:646:15146049:BIOLITE","revision":2,"type":"mutation","partition":609,"sequenceNumber":29251,"assignedToWorker":1,"usSinceCouchbaseChange(might be inaccurate before Couchbase 7)":78833,"usSinceReceipt":102}
13:27:25.142 [es-worker-1] INFO  c.c.c.e.DocumentLifecycle - {"milestone":"MATCHED_TYPE_RULE","tracingToken":30750,"documentId":"_default._default.client-unit:33009682:646:15146049:BIOLITE","elasticsearchIndex":"shs-client-units","typeConfig":"TypeConfig{index=shs-client-units, pipeline=cbes-filter, ignore=false, ignoreDeletes=false, matchOnQualifiedKey=false, matcher=prefix='client-unit'; qualifiedKey=false}","usSinceReceipt":609}
13:27:25.142 [es-worker-1] INFO  c.c.c.e.DocumentLifecycle - {"milestone":"ELASTICSEARCH_WRITE_STARTED","tracingToken":30750,"documentId":"_default._default.client-unit:33009682:646:15146049:BIOLITE","attempt":1,"usSinceReceipt":976}
13:27:25.160 [es-worker-1] INFO  c.c.c.e.DocumentLifecycle - {"milestone":"ELASTICSEARCH_WRITE_SUCCEEDED","tracingToken":30750,"documentId":"_default._default.client-unit:33009682:646:15146049:BIOLITE","usSinceReceipt":18451}

It’s clearly able to tell that the update is a mutation from Couchbase, but in ES I can see that a brand new doc was created (with the same metadata.id but a new metadata.revSeqno coming from Couchbase), with a different Elasticsearch _id.

Naturally in Couchbase there’s only one such doc.
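
For anyone wanting to confirm the same symptom, here is a minimal sketch of how the duplicates could be spotted. It assumes the Couchbase ID lives in metadata.id (as in the config below) and that this field is aggregatable in your mapping (it may need to be metadata.id.keyword). The function names are made up for illustration, and nothing here talks to a cluster: the first function only builds a search body, the second groups hits you have already fetched.

```python
from collections import defaultdict


def duplicate_query(field="metadata.id", max_buckets=100):
    """Search body that buckets documents by `field` and keeps buckets with 2+ docs."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits themselves
        "aggs": {
            "duplicates": {
                "terms": {"field": field, "min_doc_count": 2, "size": max_buckets}
            }
        },
    }


def group_duplicates(hits, metadata_field="metadata"):
    """Map each Couchbase ID to its Elasticsearch _id values, keeping only duplicates."""
    es_ids_by_cb_id = defaultdict(set)
    for hit in hits:
        cb_id = hit["_source"][metadata_field]["id"]
        es_ids_by_cb_id[cb_id].add(hit["_id"])
    return {
        cb_id: sorted(es_ids)
        for cb_id, es_ids in es_ids_by_cb_id.items()
        if len(es_ids) > 1
    }
```

The body from duplicate_query() can be sent to the index's _search endpoint with whatever client you use. Any Couchbase ID that shows up in the result has more than one Elasticsearch document.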

The basic config I’m using is:

[elasticsearch.docStructure]
  metadataFieldName = 'metadata'
  documentContentAtTopLevel = true
  wrapCounters = false

[elasticsearch.typeDefaults]
  index = ''
  pipeline = 'cbes-filter'
  typeName = '_doc'
  ignore = true
  ignoreDeletes = false

[[elasticsearch.type]]
  prefix = 'client-unit'
  index = 'shs-client-units'
  ignore = false
  ignoreDeletes = false

I’m using the latest version, 4.4.2, from the official Docker image.

Is there anything I’m supposed to do to make the connector upsert into Elasticsearch and avoid creating duplicates? How can I make it use the same Couchbase document ID in Elasticsearch?

Many thanks!

Hi Yann!

Can you try removing this line, so we can tell if the pipeline is messing up the document ID?

pipeline = 'cbes-filter'

Thanks,
David

Thanks, I thought about this as well.

I don’t think it’s the case, since Elasticsearch ingest pipelines are actually not allowed to modify the document _id… That being said, I cannot test this because our original Couchbase documents do have a regular field called _id (set to the same as the real Couchbase doc id), which is not allowed in Elastic, so we have to either drop or rename it (we tried both).

This is what the ingest pipeline does:

{
  "description": "Pre-process Couchbase documents to ensure they are compatible with Elastic",
  "processors": [
    {
      "rename": {
        "field": "_id",
        "ignore_missing": true,
        "target_field": "couchbase_id",
        "description": "Rename _id field to couchbase_id"
      }
    }
  ]
}
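
One way to see what a pipeline like this does to the document ID, without involving the connector, is Elasticsearch's simulate pipeline API (POST /_ingest/pipeline/<pipeline>/_simulate). The following is only a rough sketch: the helper names and the sample document are hypothetical, the index name is taken from the config above, and the response shape is assumed to be the documented {"docs": [{"doc": {...}}]} layout. It builds the request body and checks a response, and leaves the HTTP call to whatever client you use.

```python
import json


def simulate_request(doc_id, source, index="shs-client-units"):
    """Body for the simulate pipeline API containing a single test document."""
    return {"docs": [{"_index": index, "_id": doc_id, "_source": source}]}


def id_preserved(response, expected_id):
    """True if the response has documents and each one still carries the ID that was sent."""
    docs = response.get("docs", [])
    return bool(docs) and all(
        item.get("doc", {}).get("_id") == expected_id for item in docs
    )


# Hypothetical sample document; replace with a real one from the bucket.
body = simulate_request("client-unit:example", {"name": "example"})
print(json.dumps(body, indent=2))
```

If id_preserved() returns False for the simulate response, the pipeline itself is changing or dropping the ID before the document is indexed.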

I will still try to remove this field from the documents, get rid of the ingest pipeline, and report back. It still looks like a promising lead, since I just noticed this (slightly confusing) note in the docs:

If you automatically generate document IDs, you cannot use {{{_id}}} in a processor. Elasticsearch assigns auto-generated _id values after ingest.

OK, so I can confirm that the ingest pipeline is the root cause that messes up the Elasticsearch _id. Regardless of what the pipeline does, the ID provided by the connector is trashed in favor of a random ID, which means every mutation leads to the creation of a new document.

I don’t believe there’s any way to fix this, other than not using ingest pipelines altogether (which in our case is difficult because we have fields that have an invalid name in Elasticsearch and will be rejected)…

Thanks for the hint @david.nault

Thanks for confirming, @Yann_Jouanique.

Setting documentContentAtTopLevel to false would avoid the name conflict, but would nest your document content in a “doc” element, which might not be desirable.
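
To make the difference concrete, here is a small illustration of the two document shapes. This is not the connector's actual code, just a sketch of the layout described above: the function name and sample values are made up, and the metadata field name follows the metadataFieldName = 'metadata' setting from the original post.

```python
def to_es_source(content, metadata, content_at_top_level, metadata_field="metadata"):
    """Illustrative layout of the Elasticsearch document for either setting."""
    if content_at_top_level:
        # Couchbase fields sit at the root, so a field named _id clashes
        # with Elasticsearch's own metadata field.
        source = dict(content)
        source[metadata_field] = metadata
        return source
    # Content is nested under "doc", so _id is no longer a root-level field.
    return {metadata_field: metadata, "doc": dict(content)}


# Hypothetical sample values.
content = {"_id": "client-unit:example", "name": "example"}
metadata = {"id": "client-unit:example", "revSeqno": 2}

flat = to_es_source(content, metadata, content_at_top_level=True)
nested = to_es_source(content, metadata, content_at_top_level=False)
```

With the nested form, queries and mappings need to address fields as doc.name instead of name, which is the trade-off mentioned above.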

Do you think the connector should have an option for automatically renaming top-level fields that conflict with Elasticsearch’s metadata fields like _id?

Indeed, it seems that using _id is possible when not at the root level, which I think is an acceptable solution for most cases.

I’m not sure a renaming option for these fields would be that useful, as it’s probably not such a common case and the workaround seems acceptable. That said, something more generic could be quite useful, such as a way to specify general field renaming patterns, e.g. to conform to dynamic mapping definitions on the ES side (for instance, generic index templates that cause all keyword_* fields to be indexed as keywords).

I am more worried about what this issue implies: it seems to be basically impossible to use an ingest pipeline with the connector. The documents get indexed, but mutations never update them and create duplicates instead. For immutable collections this is fine, but otherwise it is misleading. I don’t think there’s much the connector can do about this, since it seems to be how ingest pipelines work in general. It may be worth mentioning in the docs to avoid surprises.

Anyway, in our case we have removed the problematic field and dropped the pipeline, and all seems good now.
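
For reference, the kind of cleanup involved can be sketched as below, applied on the application side before a document is written to Couchbase. This is a hypothetical helper, not something the connector provides. The couchbase_id name is borrowed from the pipeline's target_field above, and only top-level fields are handled.

```python
def rename_reserved_fields(doc, renames=None, drop=False):
    """Return a copy of `doc` with conflicting top-level fields renamed or dropped."""
    renames = {"_id": "couchbase_id"} if renames is None else renames
    cleaned = {}
    for key, value in doc.items():
        if key in renames:
            if not drop:
                cleaned[renames[key]] = value
            continue  # the original field never reaches Elasticsearch
        cleaned[key] = value
    return cleaned


# Hypothetical sample document.
sample = {"_id": "client-unit:example", "name": "example"}
renamed = rename_reserved_fields(sample)
dropped = rename_reserved_fields(sample, drop=True)
```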

Thanks for the help on this!
