FTS query returns Sync GW temporary (_sync:rev) documents

After a database update, FTS returns documents with object IDs prefixed with “_sync:rev”.

When checked, some of the referenced documents are empty while others contain data.

I assume that this is a bug?

Question: how can I prevent FTS from indexing/returning GW temporary documents?

By the way, this is on Couchbase 4.6.1-3652.

Actually, I think you need to use a shadow bucket and build the FTS index from the shadow bucket so it doesn’t interfere with the SG metadata.

I don’t think that FTS on its own would interfere with SG’s metadata (since I don’t think it’s writing anything to the documents) - it may be about being able to exclude some documents (and potentially some properties) from FTS based on key/property name.

For internal documents - Sync Gateway’s internal documents have key values prefixed with underscore - is there a way to exclude those from FTS?

For internal document metadata, it’s similar - SG’s metadata is stored as properties prefixed with underscore. Can those be excluded from FTS?

You can exclude whole documents by having them match a type, and configuring that type to not index anything. Type identification can be done 3 ways, type field, id prefix, or id regex.

You can exclude document sub-sections, by defining an explicit mapping for the parent key, and setting only index explicit fields to true (and then defining 0 explicit fields).

There is no way to exclude fields by prefix.

Thanks @mschoch.

So using an id regex will let you exclude SG’s internal documents (define a regex that includes everything except ids with a leading underscore).

Excluding the _sync property from the document should be sufficient to hide Sync Gateway’s metadata - I don’t think it’s required to exclude any other fields.

Setting the type based on a regex of the ID doesn’t quite work the way you described. Instead it matches a portion of the id and uses that as the type. You might still be able to use it in a non-intuitive way. Like, if the id starts with _ it gets type “_” and if it doesn’t start with _, then it gets type “”. Extremely non-obvious, but I think it would work.
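To make the trick above concrete, here is a minimal Python sketch of the behaviour as described in this thread: the first regex match against the document ID becomes the type, and no match gives an empty type. The function name and the pattern are hypothetical and only illustrate the idea; this is not the FTS implementation.

```python
import re

# Hypothetical pattern: match only a leading underscore in the document ID.
LEADING_UNDERSCORE = r"^_"

def type_from_doc_id(doc_id: str, pattern: str) -> str:
    """Return the first regex match in doc_id, or "" when nothing matches."""
    match = re.search(pattern, doc_id)
    return match.group(0) if match else ""

# An internal-looking ID gets type "_", an ordinary ID gets type "".
for doc_id in ["_sync:rev:abc:34:1-0e2e", "abc"]:
    print(repr(doc_id), "->", repr(type_from_doc_id(doc_id, LEADING_UNDERSCORE)))
```

With this pattern, anything that starts with an underscore ends up in a type named “_”, which could then be mapped to index nothing.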

The regex method could work, but we currently use a ‘type field’. None of the IDs contain type information, and unfortunately the format is out of our control.

That means we have to filter the FTS responses on the client?

Surely that’s not the way it’s supposed to work!

Sigh… why isn’t there an FTS index option to ignore internal object IDs?

Because I don’t know what an internal object id is. Sounds like something made up by sync gateway?

Here is an example:

This is what FTS returns as a hit:

_sync:rev:{myobjectid}:34:1-0e2e03f32eda9ab3c32a085bd2be5918

where {myobjectid} is the actual object ID.
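As a stopgap while the index still contains these documents, the hits can be filtered on the client by ID prefix. This is a minimal Python sketch; it assumes each hit has already been parsed into a dict with an "id" key, and the sample hits (including the ID "doc1") are made up for illustration.

```python
SYNC_PREFIX = "_sync:"

def drop_sync_gateway_hits(hits):
    """Keep only hits whose document ID does not start with "_sync:"."""
    return [hit for hit in hits if not hit["id"].startswith(SYNC_PREFIX)]

# Made-up sample data shaped like the hit shown above.
hits = [
    {"id": "_sync:rev:doc1:34:1-0e2e03f32eda9ab3c32a085bd2be5918", "score": 0.9},
    {"id": "doc1", "score": 0.7},
]

print(drop_sync_gateway_hits(hits))
```

Note that filtering after the fact makes result counts and paging unreliable, so excluding these documents at index time is still preferable.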

@henk.kampman I think the approach already described by @mschoch earlier should work for you.

You can use a regex to assign documents with an ID prefixed by ‘_sync’ to type ‘_sync’, and then set ‘only index explicit fields’ to true (meaning that nothing ends up getting indexed for that type).

The rest of your type definitions should work as usual. If that’s not working for you, can you provide some more specifics?

I was completely misled by the UI.
The UI suggests (to me at least) that the ‘type identifier’ setting is global for the entire index.
It never occurred to me that you can add multiple type mappings :)

The type identifier is global for the entire index.

As I understand it, you have 2 problems.

  1. Documents like _sync:* are getting indexed. In order to fix this, you should define custom type mapping using a regex on the id field, which maps _sync documents to type _sync. You then configure all documents of this type to “only index specified fields”. This means effectively these documents will not be indexed.

  2. Documents may contain a field of sync gateway metadata. I don’t even know if it’s called _meta or _sync. Whatever it’s called, presumably you want to ignore this content within documents. So, to address this, you go to the default mapping (which is still handling all non-sync documents). Add a field “_meta” or “_sync”, whatever that sub-section of content is called. Then, for this too, you specify “only index specified fields”.
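The two steps above might translate into index parameters roughly like the following, written here as a Python dict that is printed as JSON. This is a sketch under assumptions: the field names (doc_config, docid_regexp, types, default_mapping, dynamic) are what I believe the index definition JSON uses, with "dynamic": False standing in for “only index specified fields”, and the metadata property is assumed to be called _sync. Compare it with the definition your own server shows before relying on it.

```python
import json

# Assumed field names; verify against the index definition JSON
# shown by your server version.
index_params = {
    "doc_config": {
        "mode": "docid_regexp",
        # IDs starting with "_sync" get the type "_sync".
        "docid_regexp": "^_sync",
    },
    "mapping": {
        "types": {
            # Step 1: only explicit fields are indexed for this type,
            # and none are defined, so nothing is indexed.
            "_sync": {"enabled": True, "dynamic": False},
        },
        "default_mapping": {
            "enabled": True,
            "dynamic": True,
            "properties": {
                # Step 2: the metadata sub-section gets the same treatment.
                "_sync": {"enabled": True, "dynamic": False},
            },
        },
    },
}

print(json.dumps(index_params, indent=2))
```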

  1. Documents like _sync:* are getting indexed. In order to fix this, you should define custom type mapping using a regex on the id field, which maps _sync documents to type _sync. You then configure all documents of this type to “only index specified fields”. This means effectively these documents will not be indexed.

The bucket contains multiple types (identified by a type field)!

Example:

{
  "_sync:1": { "type": "book" },
  "2": { "type": "book" },
  "3": { "type": "paper" },
  "4": { "type": "paper" }
}

Type mapping configuration:

DocID with regex: “^_sync*”
Type mapping: “_sync”, with “enabled” and “only index specified fields” both checked.

The first document matches the type and will not be indexed.

So far so good.

The problem

Documents of type “paper” should not be indexed!
Question: How can I add the remaining documents of type “book” to the index?

Ah OK, so you can only determine type from one source. You’ll have to choose between using the type field (which can discriminate between books and papers) or the regular expression (which can identify _sync documents).

If you changed your doc ID scheme to include the type like book_XYZ or paper_XYZ you could do it all through regular expression. Otherwise I don’t see a way to make it work.
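The limitation can be seen by running both type sources over the four example documents from the previous post. This Python sketch only simulates the behaviour described in this thread; the function names are hypothetical and the regex is a cleaned-up form of the one quoted above.

```python
import re

# The example bucket from the previous post.
docs = {
    "_sync:1": {"type": "book"},
    "2": {"type": "book"},
    "3": {"type": "paper"},
    "4": {"type": "paper"},
}

def type_by_field(doc_id, body):
    """Type taken from the document's "type" field."""
    return body.get("type", "")

def type_by_id_regex(doc_id, body, pattern=r"^_sync"):
    """Type taken from the first regex match on the document ID."""
    match = re.search(pattern, doc_id)
    return match.group(0) if match else ""

wanted = {"2"}  # only the real "book" document should be indexed

# Mode A: index type "book"; the _sync document is a "book" too.
indexed_by_field = {d for d, b in docs.items() if type_by_field(d, b) == "book"}

# Mode B: skip type "_sync"; books and papers become indistinguishable.
indexed_by_regex = {d for d, b in docs.items() if type_by_id_regex(d, b) != "_sync"}

print(sorted(indexed_by_field), indexed_by_field == wanted)
print(sorted(indexed_by_regex), indexed_by_regex == wanted)
```

Neither mode alone yields the wanted set: the type field lets the _sync document through, and the ID regex lets the papers through.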

This problem (specifically for the temporary revision docs) is addressed by changes made in Sync Gateway 1.5.0 - the temporary revision documents are being stored as binary documents, and so shouldn’t be indexed by FTS.

If you’re interested in giving that a try, the developer preview (beta) of SG 1.5.0 is available at https://www.couchbase.com/downloads#couchbase-mobile.

@adamf we have tried Sync Gateway 1.5.1, but it looks like _sync:rev documents still exist as normal documents.
Also, FTS is still trying to index the _sync:att binary documents and the _sync:rev documents.

We observed the following in the FTS logs:
2018-03-22T17:54:35.111+05:30 [INFO] bleve: json.Unmarshal, partition: 383, key: “_sync:att:sha1-l16+KUpoc5PMNOfOmXPYSXx0t1U=”, seq: 152, err: invalid character ‘b’ looking for beginning of value
2018-03-22T17:54:35.124+05:30 [INFO] bleve: json.Unmarshal, partition: 564, key: “_sync:att:sha1-Zpw/X9xA8sBLb3zk0vOlJarTlE0=”, seq: 334, err: invalid character ‘\x0e’ looking for beginning of value
2018-03-22T17:54:35.128+05:30 [INFO] bleve: json.Unmarshal, partition: 183, key: “_sync:att:sha1-yuawNboERpWJ5wiGOBI5Xl7jDEA=”, seq: 142, err: invalid character ‘I’ looking for beginning of value
2018-03-22T17:54:35.136+05:30 [INFO] bleve: json.Unmarshal, partition: 832, key: “_sync:att:sha1-m9s1E2Fal24jLdo85qNtEgBwZuA=”, seq: 575, err: invalid character ‘Î’ looking for beginning of value

2018-03-22T17:56:55.755+05:30 [INFO] bleve: json.Unmarshal, partition: 120, key: “_sync:rev:762-writer-user-25-meta2:34:5-3088c2ddcb96045a2f28b1010a866e7e”, seq: 1111, err: invalid character ‘\x01’ looking for beginning of value
2018-03-22T17:56:55.756+05:30 [INFO] bleve: json.Unmarshal, partition: 120, key: “_sync:rev:312-writer-user-25-meta2:34:4-8ab2423c4edd4f59a9f185e683df52e4”, seq: 1114, err: invalid character ‘\x01’ looking for beginning of value
2018-03-22T17:56:55.757+05:30 [INFO] bleve: json.Unmarshal, partition: 120, key: “_sync:rev:1507-writer-user-22-meta2:34:1-11c59eb3bd381499187cec4090c3ceba”, seq: 1115, err: invalid character ‘\x01’ looking for beginning of value