Feature Request: Filter document fields

Hi,

we have a use case that requires transferring only some of a document's fields to Elasticsearch, not the whole document. For example, a document has 30 fields, but we only want to transfer the 4 we actually need. As far as we know, there is currently no way to do that.

After examining the connector source, we implemented this feature ourselves using TransformerChain. We add the field names to the .toml file, and only those fields are transferred to Elasticsearch.

Example:


But this feature exists only in our own build of the current connector version. That means whenever you release a new connector version, we can't upgrade our project directly, because our change may break without us noticing.

We'd like to know what you think of this request, since we have heard that many teams need this feature.

Hi Emre,

Adding new config options is a slippery slope, since there are so many possible ways to transform a document. The alternative we recommend is to use Elasticsearch ingest pipelines.

Ingest pipelines let you perform common transformations on your data before indexing. For example, you can use pipelines to remove fields, extract values from text, and enrich your data.
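As a rough illustration of the field-removal case, here is a minimal sketch that builds the JSON body for an ingest pipeline using the `remove` processor. The pipeline id and the field names are hypothetical placeholders, and the script only prints the body; it does not talk to a cluster. Note that `remove` is a deny-list, so to keep 4 fields out of 30 you would list the other 26.

```python
import json

# Hypothetical names: the pipeline id and field names are placeholders,
# not taken from any real deployment.
PIPELINE_ID = "drop-unneeded-fields"
FIELDS_TO_REMOVE = ["internalNotes", "auditTrail"]


def build_remove_pipeline(fields):
    """Return an ingest pipeline body that removes the given fields."""
    return {
        "description": "Remove fields that should not be indexed",
        "processors": [
            # ignore_missing avoids a failure when a document lacks a field
            {"remove": {"field": list(fields), "ignore_missing": True}}
        ],
    }


pipeline_body = build_remove_pipeline(FIELDS_TO_REMOVE)

# This is the body you would send with: PUT _ingest/pipeline/<PIPELINE_ID>
print(json.dumps(pipeline_body, indent=2))
```

Depending on how the connector is configured, the Couchbase document's fields may sit under a parent field rather than at the top level of the Elasticsearch document, so the field paths may need a prefix.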

Arun Vijayraghavan wrote a blog article showing how it’s done: Using Elasticsearch Connector with Ingest Node Pipeline - The Couchbase Blog

Thanks,
David

Hi David,

thanks for your response. Your explanation sounds logical and doable to me. Using ingest pipelines can serve the same purpose.

I have just one question, though. If we use ingest pipelines, every document is sent over the network from Couchbase to Elasticsearch with all of its fields. That surely won't be a problem for a few thousand documents, but do you think it could degrade performance or add latency when working with millions of documents?

That’s a good insight. I would not expect performance or latency to degrade significantly. The network is typically not the bottleneck for writing to Elasticsearch; the work of indexing after the document is received is orders of magnitude slower. Of course, the only way to be sure is to measure.

Another alternative, in case your bandwidth is metered and you’re paying by the byte, would be to use a Couchbase Eventing Function to filter the documents before they even reach the connector. The function could write the filtered documents to a separate bucket.
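If you want a quick estimate of how many bytes such filtering would save before setting anything up, a small offline sketch like the following may help. It is not an Eventing Function (those are written in JavaScript); it only shows the projection logic and compares serialized sizes. The sample document and field names are made up, so substitute a representative document of your own.

```python
import json


def project(doc, keep):
    """Return a copy of doc containing only the top-level keys in keep."""
    return {k: v for k, v in doc.items() if k in keep}


def payload_bytes(doc):
    """Size of the document as compact UTF-8 encoded JSON."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


# Made-up sample: 30 fields, of which 4 are wanted.
sample = {f"field{i}": "x" * 50 for i in range(30)}
wanted = {"field0", "field1", "field2", "field3"}

filtered = project(sample, wanted)
full_size = payload_bytes(sample)
small_size = payload_bytes(filtered)

print(f"full: {full_size} bytes, filtered: {small_size} bytes")
```

This only estimates payload size; it says nothing about indexing time, which still has to be measured on a real cluster.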

Thanks,
David
