Elasticsearch connector's `BigInteger` support

zoltan.zvara · July 30, 2022, 9:00am

I’m not sure if this is a problem on the Elasticsearch side or the Connector side, but some documents cannot be indexed because of the following error. (I did not create the indexes upfront for the Couchbase collections.)

400 failed to parse field [doc.number] of type [long] in document with id '123456'. Preview of field's value: '15000000000000000282'
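For context, the value in the error is larger than the maximum of a signed 64-bit integer, which is the range of the Elasticsearch `long` type. A minimal sketch of that check in Java (the class and method names are made up for illustration and are not part of the connector):

```java
import java.math.BigInteger;

// Illustrative helper, not connector code: checks whether an integer literal
// fits in a signed 64-bit long, the range of Elasticsearch's `long` type.
class LongRangeCheck {
    private static final BigInteger MIN = BigInteger.valueOf(Long.MIN_VALUE);
    private static final BigInteger MAX = BigInteger.valueOf(Long.MAX_VALUE);

    static boolean fitsInLong(String literal) {
        BigInteger value = new BigInteger(literal);
        return value.compareTo(MIN) >= 0 && value.compareTo(MAX) <= 0;
    }

    public static void main(String[] args) {
        System.out.println(Long.MAX_VALUE);                     // 9223372036854775807
        System.out.println(fitsInLong("15000000000000000282")); // false
        System.out.println(fitsInLong("123456"));               // true
    }
}
```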

We store documents with these very large values in Couchbase without any trouble. The values travel between Couchbase and a JVM backend (serialized with BigDecimal support using JSON4S), and our JavaScript frontends can handle them as well.

How is the index created in Elasticsearch with the Connector? Is this something that the Connector would create in advance, before inserting the first document?

Based on my search of Elastic issues, it should have support for BigIntegers in some way, but there is no dedicated Elasticsearch Type for it. Some sources suggest indexing these as keywords.

I also attempted to drop these values in an Elasticsearch Pipeline; however, the JSON is possibly parsed before it is sent to the Pipeline.

Has anyone else stumbled upon this problem with the Couchbase Elasticsearch Connector?

david.nault · August 1, 2022, 3:44pm

Hi Zoltan,

I’m under the impression JavaScript cannot accurately represent integers larger than 2^53 - 1 unless you’re using the new BigInt type added in ECMAScript 2020. Is it possible the frontend is silently losing precision?
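To illustrate the concern: a JavaScript Number is an IEEE 754 double, the same format as a Java `double`, so the effect can be sketched in Java. The class name below is made up for illustration.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Illustrative only: shows what happens when a large integer is forced
// through a 64-bit IEEE 754 double (the format behind JavaScript's Number).
class DoublePrecisionDemo {
    // Converts the value to a double and back to an exact integer.
    static BigInteger throughDouble(BigInteger value) {
        return new BigDecimal(value.doubleValue()).toBigInteger();
    }

    public static void main(String[] args) {
        BigInteger original = new BigInteger("15000000000000000282");
        BigInteger maxSafe = BigInteger.valueOf(2).pow(53).subtract(BigInteger.ONE);

        System.out.println(maxSafe);                 // 9007199254740991
        System.out.println(throughDouble(original)); // 15000000000000000000
        System.out.println(throughDouble(maxSafe).equals(maxSafe)); // true
    }
}
```

The trailing `282` is lost because doubles near 1.5 × 10^19 are spaced 2048 apart.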

The connector does not create indexes or type mappings. It relies on Elasticsearch’s automatic index creation and dynamic type mapping features, or pre-existing indexes and mappings.

What happens if you manually create the index with a type mapping that says the field is a keyword?
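A rough sketch of such a request, built with Java's standard HTTP client API. The base URL and the index name `widgets` are placeholders, and the mapping body assumes a typeless mapping as used by Elasticsearch 7 and later; adjust both to your cluster.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Illustrative only: builds (but does not send) a create-index request whose
// explicit mapping declares doc.number as `keyword`, so dynamic mapping
// never gets the chance to pick `long` for that field.
class KeywordMappingRequest {
    static final String MAPPING_JSON =
            "{\"mappings\":{\"properties\":{\"doc\":{\"properties\":"
            + "{\"number\":{\"type\":\"keyword\"}}}}}}";

    static HttpRequest build(String baseUrl, String indexName) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + indexName))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(MAPPING_JSON))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder host and index name; send with java.net.http.HttpClient.
        HttpRequest request = build("http://localhost:9200", "widgets");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(MAPPING_JSON);
    }
}
```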

Thanks,
David

zoltan.zvara · August 1, 2022, 4:27pm

Hi David,

We do not lose precision. I immediately had that weird feeling as well and panicked, but we serialize BigDecimals in Scala, and it seems to me we are properly using BigInt on the JS side as well.

Manually creating the index with a keyword mapping is what we must do in this case. However, I see that the Connector attempts to insert these big numbers (such as 15000000000000000282) as-is, so I’m wondering whether we could do some automatic conversion to String, so that the field would be mapped as a keyword type in ES. This would save many hours of developing and maintaining index mappings. Would such a “transform” feature be useful for the Connector? If so, I’ll look into the complexity of developing it.

david.nault · August 1, 2022, 5:30pm

If it’s useful to you, others might find it useful as well.

Perhaps the type definitions in the connector config could include a list of JSON pointers to fields that should be coerced to strings? Something like:

[[elasticsearch.type]]
  matchOnQualifiedKey = true
  regex = '[^.]+.widgets.*' # all documents in the "widgets" collection in any scope
  coerceToString = ['/doc/number'] # new field, list of JSON pointers

This would let users continue to use automatic index creation and dynamic type mapping.
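As a rough illustration of what the coercion step could do once a document has been parsed into a map, here is a minimal sketch. The class and method are hypothetical, not connector code, and it only handles object-member pointers such as `/doc/number` (no array indexes or `~0`/`~1` escapes).

```java
import java.math.BigInteger;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: replaces the value at each JSON pointer with its
// string form, leaving the document untouched when the path is absent.
class CoerceToStringSketch {
    @SuppressWarnings("unchecked")
    static void coerce(Map<String, Object> document, List<String> pointers) {
        for (String pointer : pointers) {
            if (!pointer.startsWith("/")) {
                throw new IllegalArgumentException("Unsupported pointer: " + pointer);
            }
            String[] tokens = pointer.substring(1).split("/");
            Map<String, Object> parent = document;
            // Walk down to the object that holds the final token.
            for (int i = 0; i < tokens.length - 1 && parent != null; i++) {
                Object child = parent.get(tokens[i]);
                parent = (child instanceof Map) ? (Map<String, Object>) child : null;
            }
            String leaf = tokens[tokens.length - 1];
            if (parent != null && parent.get(leaf) != null) {
                parent.put(leaf, String.valueOf(parent.get(leaf)));
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> inner = new HashMap<>();
        inner.put("number", new BigInteger("15000000000000000282"));
        Map<String, Object> document = new HashMap<>();
        document.put("doc", inner);

        coerce(document, Arrays.asList("/doc/number"));
        System.out.println(inner.get("number").getClass().getSimpleName()); // String
    }
}
```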

Let me know how your investigation goes. I’m happy to answer any questions about the code.

Thanks,
David

zoltan.zvara · August 3, 2022, 5:09pm

Thanks for your suggestions, David. I’ll research how to implement this feature in the near future and get back to you with my findings.

zoltan.zvara · August 10, 2022, 3:32pm

I checked the code and quickly started out by adding this to TypeConfig:

@Nullable
ConfigArray coerceToString();

Then I started to dig deeper into the part that writes a mutation out to ES.

I transformed the part where it takes the mutation byte[] bytes and converts it into a Map<String, Object> to be able to create an ES document. I saw that it calls the com.fasterxml dependency shaded into the DCP library to do the byte[] to Map<String, Object> conversion. This looks like a good place to inject some configuration into the fasterxml ObjectMapper as follows:

As you can see in the above code snippets, the ObjectMapper could be configured with custom deserializers and possibly with SerializationFeatures. There are quite a few possibilities there, so I suggest adding a serialization block for each ES type in the configuration, so that more features can be added later. A configuration key could be added under the TypeConfig, for example a boolean serialization.writeBigDecimalAsString, and it could be true by default.

David, please suggest how the idea could be improved.

Thanks,
Zoltán

david.nault · August 12, 2022, 9:11pm

Hi Zoltan,

I’d prefer not to directly expose the Jackson features, just in case we need to migrate to a different JSON library in the future.

However, I agree it would be great to have a config option that enabled Jackson’s WRITE_NUMBERS_AS_STRINGS feature as part of a type definition (and the type defaults).
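A library-independent sketch of a narrower variant of that idea, stringifying only the big numbers rather than every number. It assumes the parsed document holds `BigInteger` or `BigDecimal` for values that do not fit the primitive types; the class name is hypothetical.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: returns a copy of a parsed JSON tree in which every
// BigInteger or BigDecimal has been replaced by its plain string form.
class StringifyBigNumbers {
    @SuppressWarnings("unchecked")
    static Object stringify(Object node) {
        if (node instanceof Map) {
            Map<String, Object> result = new LinkedHashMap<>();
            ((Map<String, Object>) node).forEach((key, value) -> result.put(key, stringify(value)));
            return result;
        }
        if (node instanceof List) {
            List<Object> result = new ArrayList<>();
            for (Object item : (List<Object>) node) {
                result.add(stringify(item));
            }
            return result;
        }
        if (node instanceof BigDecimal) {
            return ((BigDecimal) node).toPlainString(); // avoid scientific notation
        }
        if (node instanceof BigInteger) {
            return node.toString();
        }
        return node; // strings, booleans, small numbers, and nulls pass through
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("number", new BigInteger("15000000000000000282"));
        doc.put("count", 42L);
        System.out.println(stringify(doc)); // {number=15000000000000000282, count=42}
    }
}
```

In the printed map, `number` is now a `String` while `count` remains a `Long`.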

The coerceToString property would still be useful, in case users want only certain fields to be stringified.

Instead of:

@Nullable
ConfigArray coerceToString();

I would recommend:

List<JsonPointer> coerceToString();

and parse the JSON pointers as the config is read. This is more in line with how the other config fields are parsed, and would make it easy to fail fast if the user specifies an invalid pointer.
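A minimal sketch of the fail-fast idea using only the standard library. The class is hypothetical; the real change would presumably use the JSON library's own pointer type rather than plain strings. The check follows the JSON Pointer syntax of RFC 6901: a pointer is either empty or starts with `/`, and `~` must be followed by `0` or `1`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: validates JSON pointer strings while the config is
// being read, so a typo is reported at startup instead of at indexing time.
class PointerConfigParser {
    static List<String> parsePointers(List<String> raw) {
        List<String> parsed = new ArrayList<>();
        for (String pointer : raw) {
            if (!pointer.isEmpty() && !pointer.startsWith("/")) {
                throw new IllegalArgumentException(
                        "JSON pointer must be empty or start with '/': " + pointer);
            }
            for (int i = 0; i < pointer.length(); i++) {
                if (pointer.charAt(i) != '~') {
                    continue;
                }
                boolean validEscape = i + 1 < pointer.length()
                        && (pointer.charAt(i + 1) == '0' || pointer.charAt(i + 1) == '1');
                if (!validEscape) {
                    throw new IllegalArgumentException(
                            "Invalid '~' escape in JSON pointer: " + pointer);
                }
            }
            parsed.add(pointer);
        }
        return Collections.unmodifiableList(parsed);
    }

    public static void main(String[] args) {
        System.out.println(parsePointers(Arrays.asList("/doc/number"))); // [/doc/number]
        try {
            parsePointers(Arrays.asList("doc/number")); // missing leading '/'
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```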

Something else to be aware of: there’s an “optimized passthrough” fast path for when the Elasticsearch document is exactly the same as the Couchbase document. In that case, we’d need to check whether any coercion is required; if so, we’d have to parse, transform, and re-serialize the document.

Thanks,
David

zoltan.zvara · August 18, 2022, 2:18pm

David, thanks for your kind help. This is still on my roadmap; unfortunately I got caught up in priority work, but I will continue my implementation.
