FTS query returns Sync GW temporary (_sync:rev) documents

henk.kampman · May 24, 2017, 6:17am

After a database update, FTS returns documents whose object IDs are prefixed with “_sync:rev”.

When checked, some of the referenced documents are empty while others contain data.

I assume that this is a bug?

Question: how can I prevent FTS from indexing/returning GW temporary documents?

BTW, this is on Couchbase 4.6.1-3652.

daschl · May 24, 2017, 6:24am

Actually, I think you need to use a shadow bucket and build the FTS index from the shadow bucket so it doesn’t interfere with the SG metadata.

adamf · May 24, 2017, 4:30pm

I don’t think that FTS on its own would interfere with SG’s metadata (since I don’t think it’s writing anything to the documents) - it may be about being able to exclude some documents (and potentially some properties) from FTS based on key/property name.

For internal documents - Sync Gateway’s internal documents have key values prefixed with underscore - is there a way to exclude those from FTS?

For internal document metadata, it’s similar - SG’s metadata is stored as properties prefixed with underscore. Can those be excluded from FTS?

mschoch · May 24, 2017, 4:50pm

You can exclude whole documents by having them match a type, and configuring that type to not index anything. Type identification can be done in three ways: type field, ID prefix, or ID regex.

You can exclude document sub-sections by defining an explicit mapping for the parent key, setting “only index explicit fields” to true, and then defining zero explicit fields.

There is no way to exclude fields by prefix.
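The two exclusion rules above can be modelled in a few lines of Python. This is only a simplified sketch of the behaviour described in this post, not FTS code: the function `indexed_paths` and the mapping keys `only_explicit`, `fields` and `children` are hypothetical names invented for illustration, and the sample document is made up.

```python
def indexed_paths(value, mapping, prefix=""):
    """List the field paths a mapping would index, in this simplified model.

    Mapping keys (all hypothetical names for this sketch):
      only_explicit: if True, index nothing that is not listed explicitly
      fields:        names of explicitly listed fields
      children:      explicit mappings for nested sub-sections
    """
    only_explicit = mapping.get("only_explicit", False)
    paths = []
    for key, item in value.items():
        path = prefix + key
        if isinstance(item, dict):
            # A sub-section uses its own explicit mapping when one is defined;
            # otherwise it simply follows the parent's setting.
            child = mapping.get("children", {}).get(key)
            if child is None:
                child = {"only_explicit": only_explicit}
            paths += indexed_paths(item, child, path + ".")
        elif key in mapping.get("fields", ()) or not only_explicit:
            paths.append(path)
    return paths


# Made-up document carrying a metadata sub-section.
doc = {"title": "Dune", "_sync": {"rev": "1-abc", "sequence": 34}}

# Rule 1: a type with "only explicit fields" and zero fields indexes nothing.
excluded_type = {"only_explicit": True}

# Rule 2: an explicit, empty mapping for one parent key hides that sub-section.
default_mapping = {"children": {"_sync": {"only_explicit": True}}}

print(indexed_paths(doc, excluded_type))    # whole document skipped
print(indexed_paths(doc, default_mapping))  # only fields outside "_sync"
```

Note that both rules work on exact names: nothing in this model (or in the post above) selects fields by prefix.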

adamf · May 24, 2017, 4:59pm

Thanks @mschoch.

So using an id regex will let you exclude SG’s internal documents (define a regex that includes everything except ids with a leading underscore).

Excluding the _sync property from the document should be sufficient to hide Sync Gateway’s metadata - I don’t think it’s required to exclude any other fields.

mschoch · May 24, 2017, 5:08pm

Setting the type based on a regex of the ID doesn’t quite work the way you described. Instead it matches a portion of the id and uses that as the type. You might still be able to use it in a non-intuitive way. Like, if the id starts with _ it gets type “_” and if it doesn’t start with _, then it gets type “”. Extremely non-obvious, but I think it would work.
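To make the "matched portion becomes the type" behaviour concrete, here is a small sketch using Python's `re` module. It only imitates what is described above; FTS itself is written in Go and its regular expression dialect may differ in details, although a pattern as simple as `^_` should behave the same. The helper name `type_from_id` and the sample IDs are made up for illustration.

```python
import re

# Pattern from the post above: match a single leading underscore.
LEADING_UNDERSCORE = re.compile(r"^_")


def type_from_id(doc_id, pattern=LEADING_UNDERSCORE):
    """Return the matched portion of the ID as the type, or "" if nothing matches."""
    match = pattern.search(doc_id)
    return match.group(0) if match else ""


# Made-up IDs: one internal-looking document and one regular document.
for doc_id in ["_sync:rev:doc1:34:1-0e2e", "doc1"]:
    print(repr(doc_id), "->", repr(type_from_id(doc_id)))
```

Under this model, a type mapping named `_` that indexes nothing would catch every ID with a leading underscore, while all other documents fall through with an empty type.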

henk.kampman · May 24, 2017, 6:26pm

The regex method could work, but we currently use a ‘type field’; none of the IDs contain type information, and unfortunately the format is out of our control.

That means we have to filter the FTS responses on the client?

Surely that’s not the way it’s supposed to work!

Sigh… why isn’t there an FTS index option to ignore internal object IDs?
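For reference, the client-side fallback mentioned here could look roughly like the sketch below. It assumes the search result has already been parsed into a dictionary with a `hits` list whose entries carry an `id` key; the function name `drop_internal_hits` and the sample data are hypothetical.

```python
def drop_internal_hits(result, prefix="_sync:"):
    """Return a copy of a parsed search result without hits whose ID starts with prefix."""
    filtered = dict(result)
    filtered["hits"] = [
        hit for hit in result.get("hits", [])
        if not hit.get("id", "").startswith(prefix)
    ]
    return filtered


# Made-up result: one internal revision document and one regular document.
sample = {
    "total_hits": 2,
    "hits": [
        {"id": "_sync:rev:doc1:34:1-0e2e", "score": 0.7},
        {"id": "doc1", "score": 0.5},
    ],
}
print([hit["id"] for hit in drop_internal_hits(sample)["hits"]])
```

The drawback is visible in the sample: counts such as `total_hits` and any page size requested from the server still include the filtered documents, so excluding them in the index is the cleaner option when it is available.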

mschoch · May 24, 2017, 6:32pm

Because I don’t know what an internal object id is. Sounds like something made up by sync gateway?

henk.kampman · May 24, 2017, 6:42pm

Here is an example:

This is what FTS returns as a hit:

_sync:rev:{myobjectid}:34:1-0e2e03f32eda9ab3c32a085bd2be5918

Here {myobjectid} is the actual object ID.

adamf · May 24, 2017, 6:55pm

@henk.kampman I think the approach already described by @mschoch earlier should work for you.

You can use a regex to assign documents with an ID prefixed by ‘_sync’ to type ‘_sync’, and then set ‘only index explicit fields’ to true (meaning that nothing ends up getting indexed for that type).

The rest of your type definitions should work as usual. If that’s not working for you, can you provide some more specifics?

henk.kampman · May 24, 2017, 9:16pm

I was completely misled by the UI.
The UI suggests (to me at least) that the ‘type identifier’ setting is global for the entire index.
It never occurred to me that you can add multiple type mappings.

mschoch · May 24, 2017, 9:26pm

The type identifier is global for the entire index.

As I understand it, you have 2 problems.

1. Documents like _sync:* are getting indexed. In order to fix this, you should define a custom type mapping using a regex on the id field, which maps _sync documents to type _sync. You then configure all documents of this type to “only index specified fields”. This means effectively these documents will not be indexed.
2. Documents may contain a field of Sync Gateway metadata. I don’t even know if it’s called _meta or _sync. Whatever it’s called, presumably you want to ignore this content within documents. So, to address this, you go to the default mapping (which is still handling all non-sync documents). Add a field “_meta” or “_sync”, whatever that sub-section of content is called. Then, for this too, you specify “only index specified fields”.
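Written out as an index definition, the two steps might look like the sketch below, built as a Python dictionary and printed as JSON. The key names (`doc_config`, `docid_regexp`, `default_mapping`, `types`, `dynamic`) are recalled from index definitions in later Couchbase Server releases and are assumptions here; they may not match what 4.6.1 produces. In particular, `"dynamic": False` is assumed to be the JSON counterpart of the “only index specified fields” checkbox, and `_sync` is assumed to be the name of the metadata property.

```python
import json

index_params = {
    # Step 1: derive the type from the document ID with a regex (assumed keys).
    "doc_config": {
        "mode": "docid_regexp",
        "docid_regexp": "^_sync",
    },
    "mapping": {
        # Everything that is not a _sync document still uses the default mapping.
        "default_mapping": {
            "enabled": True,
            "dynamic": True,
            "properties": {
                # Step 2: explicit mapping for the metadata sub-section,
                # with no fields listed, so nothing below it is indexed.
                "_sync": {"enabled": True, "dynamic": False},
            },
        },
        "types": {
            # Step 1: the _sync type lists no fields, so nothing is indexed.
            "_sync": {"enabled": True, "dynamic": False},
        },
    },
}

print(json.dumps(index_params, indent=2))
```

Comparing the printed JSON with the definition shown by the index editor in the UI is a quick way to check which of these names your server version actually uses.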

henk.kampman · May 24, 2017, 10:04pm

Documents like _sync:* are getting indexed. In order to fix this, you should define custom type mapping using a regex on the id field, which maps _sync documents to type _sync. You then configure all documents of this type to “only index specified fields”. This means effectively these documents will not be indexed.

The bucket contains multiple types (identified by a type field)!

Example:

"_sync:1": { "type": "book" },
"2": { "type": "book" },
"3": { "type": "paper" },
"4": { "type": "paper" }

Type mapping configuration:

DocID with regex: “^_sync*”
Type mapping: “_sync”, with “enabled” and “only index specified fields” both checked.

The first document matches the type and will not be indexed.

So far so good.

The problem

Documents of type “paper” should not be indexed!
Question: How can I add the remaining documents of type “book” to the index?

mschoch · May 25, 2017, 12:43am

Ah OK, so you can only determine type from one source. You’ll have to choose between using the type field (which can discriminate between books and papers) or the regular expression (which can identify _sync documents).

If you changed your doc ID scheme to include the type like book_XYZ or paper_XYZ you could do it all through regular expression. Otherwise I don’t see a way to make it work.
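A rough model of that ID scheme, assuming IDs such as `book_XYZ` and `paper_XYZ` and a single regex that recognises all three prefixes. The pattern, the helper names and the sample IDs are illustrative only, and the sketch further assumes that documents with an unrecognised (empty) type are not indexed, which in a real index would mean disabling the default mapping.

```python
import re

# One regex identifies every type, including Sync Gateway's internal documents.
TYPE_FROM_ID = re.compile(r"^(_sync|book|paper)")

# Only these types get a mapping that indexes anything.
INDEXED_TYPES = {"book"}


def doc_type(doc_id):
    """Return the matched portion of the ID as the type, or "" if nothing matches."""
    match = TYPE_FROM_ID.search(doc_id)
    return match.group(0) if match else ""


def is_indexed(doc_id):
    """True when the document's type is one of the indexed types."""
    return doc_type(doc_id) in INDEXED_TYPES


# Made-up IDs following the suggested scheme.
for doc_id in ["book_1", "paper_3", "_sync:rev:book_1:34:1-abc"]:
    print(doc_id, "->", repr(doc_type(doc_id)), "indexed" if is_indexed(doc_id) else "skipped")
```

Because the pattern is anchored at the start of the ID, a revision document whose ID embeds `book_1` is still typed `_sync` and skipped.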

adamf · May 25, 2017, 4:30pm

This problem (specifically for the temporary revision docs) is addressed by changes made in Sync Gateway 1.5.0 - the temporary revision documents are being stored as binary documents, and so shouldn’t be indexed by FTS.

If you’re interested in giving that a try, the developer preview (beta) of SG 1.5.0 is available at https://www.couchbase.com/downloads#couchbase-mobile.

Arihant · March 22, 2018, 12:27pm

@adamf We have tried Sync Gateway 1.5.1, but it looks like _sync:rev documents still exist as normal documents.
Also, FTS is still trying to index the _sync:att binary documents and the _sync:rev documents.

We observed the following in the FTS logs:
2018-03-22T17:54:35.111+05:30 [INFO] bleve: json.Unmarshal, partition: 383, key: “_sync:att:sha1-l16+KUpoc5PMNOfOmXPYSXx0t1U=”, seq: 152, err: invalid character ‘b’ looking for beginning of value
2018-03-22T17:54:35.124+05:30 [INFO] bleve: json.Unmarshal, partition: 564, key: “_sync:att:sha1-Zpw/X9xA8sBLb3zk0vOlJarTlE0=”, seq: 334, err: invalid character ‘\x0e’ looking for beginning of value
2018-03-22T17:54:35.128+05:30 [INFO] bleve: json.Unmarshal, partition: 183, key: “_sync:att:sha1-yuawNboERpWJ5wiGOBI5Xl7jDEA=”, seq: 142, err: invalid character ‘I’ looking for beginning of value
2018-03-22T17:54:35.136+05:30 [INFO] bleve: json.Unmarshal, partition: 832, key: “_sync:att:sha1-m9s1E2Fal24jLdo85qNtEgBwZuA=”, seq: 575, err: invalid character ‘Î’ looking for beginning of value

2018-03-22T17:56:55.755+05:30 [INFO] bleve: json.Unmarshal, partition: 120, key: “_sync:rev:762-writer-user-25-meta2:34:5-3088c2ddcb96045a2f28b1010a866e7e”, seq: 1111, err: invalid character ‘\x01’ looking for beginning of value
2018-03-22T17:56:55.756+05:30 [INFO] bleve: json.Unmarshal, partition: 120, key: “_sync:rev:312-writer-user-25-meta2:34:4-8ab2423c4edd4f59a9f185e683df52e4”, seq: 1114, err: invalid character ‘\x01’ looking for beginning of value
2018-03-22T17:56:55.757+05:30 [INFO] bleve: json.Unmarshal, partition: 120, key: “_sync:rev:1507-writer-user-22-meta2:34:1-11c59eb3bd381499187cec4090c3ceba”, seq: 1115, err: invalid character ‘\x01’ looking for beginning of value
