Corruption from PHP client when compression is enabled

Hello,

I have a strange situation, from time to time JSON objects, are reported as being Binary documents, but they are definitely simple JSON. What could be the problem?

I see that it is saying that it is an invalid JSON, but not sure how it is possible as there is a simple function which is transformed an array into a JSON and then it is saving it in the DB

I have been able to get the document (and it contains garbage inside. So, definitely something is very wrong)

{"app_uuid":"41354852-c166-46ee-b510-3f53ac800a0a","should_refresh":true,"country_code":"lb","sub_type":"app","app_store":{"app_id":1180905152,"should_refresh":true},"app_play":{"app_id":null,"should_refresh":false},"key":"meta:41354852-c166-46ee-b510-3f53ac800a0a:lb","created_timestamp":1602149598,"type":"tracker-app","version":2},!E��meta:41354852-c166-46ee-b510-3f53ac800a0a:lb��Z{"app_uuid":"41354852-c166-46ee-b510-3f53ac800a0a","should_refresh":true,"country_code":"lb*ub_typapp",4_store":	�	�(1180905152,Rh},D(_playD
appEnullF?fals@(key":"meta:��:�|created_timestamp":1602149598,"t	�tracker-	�(version":2}��,,,�meta:1cbc8404-ca72-48f1-bf3a-77e1912eb45d:lb,,D��meta:1cbc8404-ca72-48f1-bf3a-77e1912eb45d:lb��Z{"app_uuid":"1cbc8404-ca72-48f1-bf3a-77e1912eb45d","should_refresh":true,"country_code":"lb*ub_typapp",4_store":	�	�(1437710400,Rh}	D(_playD
appEnullF?fals@(key":"meta:��:�|created_timestamp":1602149991,"t	�tracker-	�(version":2},,D��meta:1cbc8404-ca72-48f1-bf3a-77e1912eb45d:lb��Z{"app_uuid":"1cbc8404-ca72-48f1-bf3a-77e1912eb45d","should_refresh":true,"country_code":"lb*ub_typapp",4_store":	�	�(1437710400,Rh}	D(_playD
app

the JSON object should have been like this.

{
  "app_uuid": "41354852-c166-46ee-b510-3f53ac800a0a",
  "should_refresh": true,
  "country_code": "lb",
  "sub_type": "app",
  "app_store": {
  "app_id": 1180905152,
    "should_refresh": true,
    },
  "app_play": {
    "app_id": null,
    "should_refresh": false
  },
  "key": "meta:41354852-c166-46ee-b510-3f53ac800a0a:us",
  "created_timestamp": 1602149430,
  "type": "tracker-app",
  "version": 2,
  "updated_timestamp": 1602149596
}

Can anybody from Couchbase help me fix this bug?

Couchbase environment:

  Runtime: Version=3.0.5, Changeset=b14a8e4c392add9ba367a47c4936a793ccb1c5fe
  Headers: Version=3.0.5, Changeset=b14a8e4c392add9ba367a47c4936a793ccb1c5fe
  Build Timestamp: 2020-09-21 11:33:37
  Default plugin directory: /usr/lib64/libcouchbase
  IO: Default=libevent, Current=select, Accessible=select
  SSL Runtime: OpenSSL 1.0.2k-fips  26 Jan 2017
  SSL Headers: OpenSSL 1.0.2k-fips  26 Jan 2017
  Snappy: 1.1.1
  Tracing: SUPPORTED
  System: Linux-4.15.0-91-generic; x86_64
  CC: GNU 4.8.5; -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic  -fno-strict-aliasing -ggdb3 -pthread
  CXX: GNU 4.8.5; -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches   -m64 -mtune=generic  -fno-strict-aliasing -ggdb3 -pthread

Couchbase version 6.6 Enterprise

rpm -qa | grep couchbase
libcouchbase3-3.0.5-1.el7.x86_64
libcouchbase-devel-3.0.5-1.el7.x86_64
libcouchbase3-tools-3.0.5-1.el7.x86_64

I tried this on two different servers and it is happening the same. So I don’t suspect a hardware problem

I have updated the PHP from 7.4.9 to 7.4.11 and I have the same issues. Also, I have discovered some documents with keys and data from other documents, so this cannot happen from my current code.

It is either a problem on the Couchbase server or in the Coucbase lib couchbase.so

I have Couchbase in debug mode, so we can investigate the logs, not sure how I can correlate the PHP code and the Couchbase debug log.

Definitely, there is a problem outside of PHP, I am logging all the insert s and gets and logging all the documents which are not of type array/json. Always the error appears when I read it from the database, not before adding it in the database.

@ingenthr Can you please help?

Hi @flaviu, any chance you can provide a small snippet of PHP that reliably creates this?

One thing that may help us quickly see what is happening is a packet capture from an interaction that gets into this state. Something along the lines of:

# tcpdump -C 500 -W 10 -w /var/tmp/couchbase-test.pcap -s 0 port 11210 or port 8093

This would let us see the client to server interaction in the file.

The setup to replicate this is very complicated, and involves high traffic between the consumers and the Coucbase DB.

I can give you access on the servers, or just paste here logs. Let me try to do the TCP dump, and please let me know if want access on the servers.

In the meantime, I found these logs on one of the CB servers.

[ns_server:info,2020-10-09T13:36:01.025Z,babysitter_of_ns_1@cb.local:<0.115.0>:ns_port_server:log:224]memcached<0.115.0>: 2020-10-09T13:36:00.824454+00:00 WARNING 8115: Slow operation. {“cid”:“ff16091d32c4682d/08495835734103cb/c1020000”,“duration”:“24 s”,“trace”:“request=1007669824101573:24492058 store=0:0 sync_write.ack_local=1007694315568756:0 store=1007694315555140:14 sync_write.ack_remote=1007694316109025:0 sync_write.prepare=1007694315562303:558 store=0:0 store=1007694316155502:0”,“command”:“SET”,“peer”:":55386",“bucket”:“app-live”,“packet”:{“bodylen”:1434,“cas”:0,“datatype”:“raw”,“extlen”:8,“frameid”:{“durability”:{“level”:“Majority”,“timeout”:2250}},“key”:“meta:e703047d-fc26-409d-bf8f-fe7a735a3ffa:st”,“keylen”:44,“magic”:“AltClientRequest”,“opaque”:705,“opcode”:“SET”,“vbucket”:106}}
memcached<0.115.0>: 2020-10-09T13:36:00.824494+00:00 WARNING 8115: Invalid packet format detected (magic: 0x0), closing connection

[ns_server:info,2020-10-09T13:36:01.510Z,babysitter_of_ns_1@cb.local:<0.115.0>:ns_port_server:log:224]memcached<0.115.0>: 2020-10-09T13:36:01.309095+00:00 WARNING 14541: Slow operation. {“cid”:“5ead4ef8afbab2e2/7014f5f2dccc34d8/ec040000”,“duration”:“19 s”,“trace”:“request=1007675315608654:19485184 store=0:0 sync_write.ack_local=1007694800187839:0 store=1007694800176187:12 sync_write.ack_remote=1007694800708521:0 sync_write.prepare=1007694800183941:564 store=0:0 store=1007694800786047:0”,“command”:“SET”,“peer”:":33903",“bucket”:“app-live”,“packet”:{“bodylen”:1434,“cas”:0,“datatype”:“raw”,“extlen”:8,“frameid”:{“durability”:{“level”:“Majority”,“timeout”:2250}},“key”:“meta:70d24711-99f7-4abd-9c81-2208ed5c412f:tn”,“keylen”:44,“magic”:“AltClientRequest”,“opaque”:1260,“opcode”:“SET”,“vbucket”:231}}
memcached<0.115.0>: 2020-10-09T13:36:01.309120+00:00 WARNING 14541: Invalid packet format detected (magic: 0xca), closing connection
memcached<0.115.0>: 2020-10-09T13:36:01.362517+00:00 WARNING 8712: Slow operation. {“cid”:“3119c340b080ffb2/e64081d1ff36b330/6f010000”,“duration”:“113 s”,“trace”:“request=1007581334687490:113519537 store=0:0 sync_write.ack_local=1007694853657445:0 store=1007694853642396:15 sync_write.ack_remote=1007694854181734:0 sync_write.prepare=1007694853650351:542 store=0:0 store=1007694854219535:0”,“command”:“SET”,“peer”:":33815",“bucket”:“app-live”,“packet”:{“bodylen”:1434,“cas”:0,“datatype”:“raw”,“extlen”:8,“frameid”:{“durability”:{“level”:“Majority”,“timeout”:2250}},“key”:“meta:d81a7490-d10f-44d5-8854-1baa785feaa8:it”,“keylen”:44,“magic”:“AltClientRequest”,“opaque”:367,“opcode”:“SET”,“vbucket”:16}}
memcached<0.115.0>: 2020-10-09T13:36:01.362558+00:00 WARNING 8712: Invalid packet format detected (magic: 0x1), closing connection

@ingenthr and @graham.pople

Matt, I have sent you a direct message with the TCP dump.

In the meantime, I have found this in a message coming from the Cb DB,

image

Not sure if it is normal or not.

Thanks for the additional info. I’ll definitely have a look. Question: did you gather the pcap from the application side or from the server side?

It does look like the data is corrupted even at the basic framing level. This may be from a bug in the library or some kind of memory corruption. Is this the same app that was crashing from time to time?

Not as likely, but it can also come about from the network or a system that is not healthy. One similar bit of corruption some years back was traced to a bad network driver in a particular OS image on EC2.

I did the capturing on the application-side (the server with the IP ending in 114.115 is the application server). The server is a dedicated machine with 10 Gb card.

It does look like the data is corrupted even at the basic framing level. This may be from a bug in the library or some kind of memory corruption. Is this the same app that was crashing from time to time?

Yes

I am getting the same errors from two different application servers (identical servers and software)

Hi @flaviu. libcouchbase itself does not generate these GET_WITH_META or SET_WITH_META packets, so those packets might be unrelated.

Could you dump contents of the “broken” document with cbc cat ID and capture it with wireshark?

Also is UI the only place where the document called “invalid”? Or the application also sees something weird?

I’m asking because in some conditions it might be okay that on the wires data does not go in plain form (we might use compression)

grim_2020-10-09_184551

Note the “Snappy” bit on the image above

Hello Sergey (@avsej) ,

I have sent you a direct message with the TCP dump.

From what I see happening is that we save a document in the DB in JSON format and somehow it is saved corrupted.

From the UI interface, it is just saying it is a binary document, not that it is corrupted. We can get the document from the application but it is corrupted.

What I also noticed is that sometimes one of these documents contain parts of other documents… somehow mixed up.

I am not 100% sure that the PHP is not trying to save a corrupted document, but I tried to get the document type (array/JSON) before inserting it in the DB, and we could not identify any document to be saved in another format other than json.

I am 99% sure there is no problem at the PHP level, but taking in consideration the number of sets( 40-50k/second ) and gets( 150-200k / second) there may be something that I’ve missed.

The first TCP dump you analyzed (the one sent to Matt), I think it contains a complete set and get with corrupted documents ( in between other tens of thousands which are correct)

An additional infomration that may be helpful is that the php code is retying an operation in the case of : TempFailException, TimeoutException, NetworkException, DurabilityException

What is interesting is that I saw in one corrupted document two keys key from another documents repeated exactly the number of retries.

So what I suspect is that somehow the libcouchbase in case of an error reuse a connection or a buffer and somewho things get mixedup

@avsej

these are some keys which you could find in the big tcp dump file which ended up corrupted in DB

meta:5116e388-7add-4944-9cb3-8db36eedcdf2:sn
meta:717d32e1-b9df-4f8f-b03d-10744de2eeda:sn
meta:3fb495a1-8619-40ce-ae8d-d7b12b61c230:sn
meta:4aad5934-a264-4b1b-8e90-8dd293ee1119:sn
meta:07ab5dd0-c372-4b52-8bf0-5e589420a025:fj

maybe you could figgure out where the document is corrupted

@avsej Do you know if ONE packet/frame can contain multiple keys?

I have found a strange packet which is containing multiple objects with different keys

0000   7b 22 61 70 70 5f 75 75 69 64 22 3a 22 34 61 61   {"app_uuid":"4aa
0010   64 35 39 33 34 2d 61 32 36 34 2d 34 62 31 62 2d   d5934-a264-4b1b-
0020   38 65 39 30 2d 38 64 64 32 39 33 65 65 31 31 31   8e90-8dd293ee111
0030   39 22 2c 22 73 68 6f 75 6c 64 5f 72 65 66 72 65   9","should_refre
0040   73 68 22 3a 74 72 75 65 2c 22 63 6f 75 6e 74 72   sh":true,"countr
0050   79 5f 63 6f 64 65 22 3a 22 73 6e 22 2c 22 73 75   y_code":"sn","su
0060   62 5f 74 79 70 65 22 3a 22 61 70 70 22 2c 22 61   b_type":"app","a
0070   70 70 5f 61 70 70 6c 65 5f 73 74 6f 72 65 22 3a   pp_*****_store":
0080   7b 22 61 70 70 5f 61 70 70 6c 65 5f 69 64 22 3a   {"app_******_id":
0090   31 33 34 37 35 34 32 34 31 31 2c 22 73 68 6f 75   1347542411,"shou
00a0   6c 64 5f 72 65 66 72 65 73 68 22 3a 74 72 75 65   ld_refresh":true
00b0   7d 2c 22 61 70 70 5f 67 6f 6f 67 6c 65 5f 70 6c   },"app_*****_pl
00c0   61 79 22 3a 7b 22 67 6f 6f 67 6c 65 5f 61 70 70   ay":{"*****_app
00d0   5f 69 64 22 3a 6e 75 6c 6c 2c 22 73 68 6f 75 6c   _id":null,"shoul
00e0   64 5f 72 65 66 72 65 73 68 22 3a 66 61 6c 73 65   d_refresh":false
00f0   7d 2c 22 6b 65 79 22 3a 22 6d 65 74 61 3a 34 61   },"key":"meta:4a
0100   61 64 35 39 33 34 2d 61 32 36 34 2d 34 62 31 62   ad5934-a264-4b1b
0110   2d 38 65 39 30 2d 38 64 64 32 39 33 65 65 31 31   -8e90-8dd293ee11
0120   31 39 3a 73 6e 22 2c 22 63 72 65 61 74 65 64 5f   19:sn","created_
0130   74 69 6d 65 73 74 61 6d 70 22 3a 31 36 30 32 32   timestamp":16022
0140   35 30 30 39 37 2c 22 74 79 70 65 22 3a 22 74 72   50097,"type":"tr
0150   61 63 6b 65 72 2d 61 70 70 22 2c 22 76 65 72 73   acker-app","vers
0160   69 6f 6e 22 3a 32 7d 08 01 04 2c 08 02 02 7c 00   ion":2}...,...|.
0170   00 01 44 6c 1b 00 00 00 00 00 00 00 00 00 00 13   ..Dl............
0180   01 08 ca 02 00 00 06 00 00 00 00 6d 65 74 61 3a   ...........meta:
0190   34 61 61 64 35 39 33 34 2d 61 32 36 34 2d 34 62   4aad5934-a264-4b
01a0   31 62 2d 38 65 39 30 2d 38 64 64 32 39 33 65 65   1b-8e90-8dd293ee
01b0   31 31 31 39 3a 73 6e e7 02 f0 5a 7b 22 61 70 70   1119:sn...Z{"app
01c0   5f 75 75 69 64 22 3a 22 34 61 61 64 35 39 33 34   _uuid":"4aad5934
01d0   2d 61 32 36 34 2d 34 62 31 62 2d 38 65 39 30 2d   -a264-4b1b-8e90-
01e0   38 64 64 32 39 33 65 65 31 31 31 39 22 2c 22 73   8dd293ee1119","s
01f0   68 6f 75 6c 64 5f 72 65 66 72 65 73 68 22 3a 74   hould_refresh":t
0200   72 75 65 2c 22 63 6f 75 6e 74 72 79 5f 63 6f 64   rue,"country_cod
0210   65 22 3a 22 73 6e 01 2a 14 75 62 5f 74 79 70 01   e":"sn.*.ub_typ.
0220   10 10 61 70 70 22 2c 01 06 34 5f 61 70 70 6c 65   ..app",..4_*****
0230   5f 73 74 6f 72 65 22 3a 09 80 09 13 01 84 28 31   _store":......(1
0240   33 34 37 35 34 32 34 31 31 2c 52 68 00 00 7d 09   347542411,Rh..}.
0250   44 28 67 6f 6f 67 6c 65 5f 70 6c 61 79 01 44 0d   D(*****_play.D.
0260   0f 08 61 70 70 05 45 0c 6e 75 6c 6c 46 3f 00 0c   ..app.E.nullF?..
0270   66 61 6c 73 01 40 28 6b 65 79 22 3a 22 6d 65 74   fals.@(key":"met
0280   61 3a 8e f1 00 00 3a 05 ca 7c 63 72 65 61 74 65   a:....:..|create
0290   64 5f 74 69 6d 65 73 74 61 6d 70 22 3a 31 36 30   d_timestamp":160
02a0   32 32 35 30 30 39 37 2c 22 74 09 e5 1c 74 72 61   2250097,"t...tra
02b0   63 6b 65 72 2d 09 ed 28 76 65 72 73 69 6f 6e 22   cker-..(version"
02c0   3a 32 7d 80 a0 00 2c 00 00 00 0e 00 00 00 2c 78   :2}...,.......,x
02d0   1b 00 00 00 00 00 00 00 00 00 00 6d 65 74 61 3a   ...........meta:
02e0   37 31 37 64 33 32 65 31 2d 62 39 64 66 2d 34 66   717d32e1-b9df-4f
02f0   38 66 2d 62 30 33 64 2d 31 30 37 34 34 64 65 32   8f-b03d-10744de2
0300   65 65 64 61 3a 73 6e 08 01 04 2c 08 02 00 0e 00   eeda:sn...,.....
0310   00 01 44 7a 1b 00 00 00 00 00 00 00 00 00 00 13   ..Dz............
0320   01 08 ca 02 00 00 06 00 00 00 00 6d 65 74 61 3a   ...........meta:
0330   37 31 37 64 33 32 65 31 2d 62 39 64 66 2d 34 66   717d32e1-b9df-4f
0340   38 66 2d 62 30 33 64 2d 31 30 37 34 34 64 65 32   8f-b03d-10744de2
0350   65 65 64 61 3a 73 6e e7 02 f0 5a 7b 22 61 70 70   eeda:sn...Z{"app
0360   5f 75 75 69 64 22 3a 22 37 31 37 64 33 32 65 31   _uuid":"717d32e1
0370   2d 62 39 64 66 2d 34 66 38 66 2d 62 30 33 64 2d   -b9df-4f8f-b03d-
0380   31 30 37 34 34 64 65 32 65 65 64 61 22 2c 22 73   10744de2eeda","s
0390   68 6f 75 6c 64 5f 72 65 66 72 65 73 68 22 3a 74   hould_refresh":t
03a0   72 75 65 2c 22 63 6f 75 6e 74 72 79 5f 63 6f 64   rue,"country_cod
03b0   65 22 3a 22 73 6e 01 2a 14 75 62 5f 74 79 70 01   e":"sn.*.ub_typ.
03c0   10 10 61 70 70 22 2c 01 06 34 5f 61 70 70 6c 65   ..app",..4_*****
03d0   5f 73 74 6f 72 65 22 3a 09 80 09 13 01 84 28 31   _store":......(1
03e0   31 36 34 33 37 31 34 36 34 2c 52 68 00 00 7d 09   164371464,Rh..}.
03f0   44 28 67 6f 6f 67 6c 65 5f 70 6c 61 79 01 44 0d   D(*****_play.D.
0400   0f 08 61 70 70 05 45 0c 6e 75 6c 6c 46 3f 00 0c   ..app.E.nullF?..
0410   66 61 6c 73 01 40 28 6b 65 79 22 3a 22 6d 65 74   fals.@(key":"met
0420   61 3a 8e f1 00 00 3a 05 ca 7c 63 72 65 61 74 65   a:....:..|create
0430   64 5f 74 69 6d 65 73 74 61 6d 70 22 3a 31 36 30   d_timestamp":160
0440   32 32 35 30 31 38 32 2c 22 74 09 e5 1c 74 72 61   2250182,"t...tra
0450   63 6b 65 72 2d 09 ed 28 76 65 72 73 69 6f 6e 22   cker-..(version"
0460   3a 32 7d 80 b5 00 00 00 00 00 00 00 00 00 00 7b   :2}............{
0470   1b 00 00 00 00 00 00 00 00 00 00 08 01 04 2c 08   ..............,.
0480   02 00 0e 00 00 01 44 7d 1b 00 00 00 00 00 00 00   ......D}........
0490   00 00 00 13 01 08 ca 02 00 00 06 00 00 00 00 6d   ...............m
04a0   65 74 61 3a 37 31 37 64 33 32 65 31 2d 62 39 64   eta:717d32e1-b9d
04b0   66 2d 34 66 38 66 2d 62 30 33 64 2d 31 30 37 34   f-4f8f-b03d-1074
04c0   34 64 65 32 65 65 64 61 3a 73 6e e7 02 f0 5a 7b   4de2eeda:sn...Z{
04d0   22 61 70 70 5f 75 75 69 64 22 3a 22 37 31 37 64   "app_uuid":"717d
04e0   33 32 65 31 2d 62 39 64 66 2d 34 66 38 66 2d 62   32e1-b9df-4f8f-b
04f0   30 33 64 2d 31 30 37 34 34 64 65 32 65 65 64 61   03d-10744de2eeda
0500   22 2c 22 73 68 6f 75 6c 64 5f 72 65 66 72 65 73   ","should_refres
0510   68 22 3a 74 72 75 65 2c 22 63 6f 75 6e 74 72 79   h":true,"country
0520   5f 63 6f 64 65 22 3a 22 73 6e 01 2a 14 75 62 5f   _code":"sn.*.ub_
0530   74 79 70 01 10 10 61 70 70 22 2c 01 06 34 5f 61   typ...app",..4_a
0540   70 70 6c 65 5f 73 74 6f 72 65 22 3a 09 80 09 13   pple_store":....
0550   01 84 28 31 31 36 34 33 37 31 34 36 34 2c 52 68   ..(1164371464,Rh
0560   00 00 7d                                          ..}

look for

meta:4aad5934-a264-4b1b-8e90-8dd293ee1119:sn
meta:717d32e1-b9df-4f8f-b03d-10744de2eeda:sn

you can find both of them in the same packet above

It’s certainly possible that one packet contains multiple keys. This is a bit less true in PHP because you have one lcb instance per process, and @avsej would know better than I would, but the way this works internally in libcouchbase it’s definitely possible.

The data does seem consistent with that. I think this is one situation where I would try to run that test with valgrind or a simple memory debugging interposer to validate. I think if one of the lcb buffers is that corrupted, you would expect a lot more dropped connections and protocol violations.

@brett19 and @avsej know better than I do, but there are some lightweight tools that will write patterns into buffers when allocated and free’d to find issues more quickly and they can just be LD_PRELOADed. libumem and the tools on Mac OS are the ones I am most familiar with. Valgrind is a bigger toolset on linux so if you can use it easily, that’d be great. Do be aware that it’s active on memory operations, so it slows things down significantly but is great for testing.

By the way, we have a fairly comprehensive test suite for libcouchbase that duplicates a lot of these scenarios. There could definitely be a situation we haven’t found, but we force libcouchbase into a lot of bad network states and test to ensure recovery.

I don’t think this is related to the network, but somehow related to couchbase lib. Not sure these are related to this thread problem

Also, I see a lot of these type of logs:

[Tue Oct 06 12:04:27.212488 2020] [proxy_fcgi:error] [pid 7822] [client 172.69.134.155:36166] AH01071: Got error '
PHP message: [cb,WARN] (server L:382 I:404836903) <srv1-redacted:11210> (CTX=0x55f23f7d4130,memcached,SRV=0x55f23f7d5e80,IX=0) Received server error LOCKED (0x9) on packet: OP=0x94, RC=0x9, SEQ=14
PHP message: [cb,WARN] (server L:382 I:404836903) <srv1-redacted:11210> (CTX=0x55f23f7d4130,memcached,SRV=0x55f23f7d5e80,IX=0) Received server error LOCKED (0x9) on packet: OP=0x94, RC=0x9, SEQ=16
PHP message: [cb,WARN] (server L:382 I:404836903) <srv1-redacted:11210> (CTX=0x55f23f7d4130,memcached,SRV=0x55f23f7d5e80,IX=0) Received server error LOCKED (0x9) on packet: OP=0x94, RC=0x9, SEQ=17'
[Tue Oct 06 12:04:27.239932 2020] [proxy_fcgi:error] [pid 1492] [client 172.68.239.85:41094] AH01071: Got error '
PHP message: [cb,WARN] (server L:382 I:150776847) <srv2-redacted:11210> (CTX=0x55f23f7a7df0,memcached,SRV=0x55f23f7b10a0,IX=2) Received server error LOCKED (0x9) on packet: OP=0x94, RC=0x9, SEQ=8
PHP message: [cb,WARN] (server L:382 I:150776847) <srv2-redacted:11210> (CTX=0x55f23f7a7df0,memcached,SRV=0x55f23f7b10a0,IX=2) Received server error LOCKED (0x9) on packet: OP=0x94, RC=0x9, SEQ=11
PHP message: [cb,WARN] (server L:382 I:150776847) <srv2-redacted:11210> (CTX=0x55f23f7a7df0,memcached,SRV=0x55f23f7b10a0,IX=2) Received server error LOCKED (0x9) on packet: OP=0x94, RC=0x9, SEQ=12'
[Tue Oct 06 12:04:27.265845 2020] [proxy_fcgi:error] [pid 3461] [client 172.69.130.30:18368] AH01071: Got error '
PHP message: [cb,WARN] (server L:382 I:1675335737) <srv1-redacted:11210> (CTX=0x55f23f76c890,memcached,SRV=0x55f23f7880b0,IX=0) Received server error LOCKED (0x9) on packet: OP=0x94, RC=0x9, SEQ=0
PHP message: [cb,WARN] (server L:382 I:1675335737) <srv1-redacted:11210> (CTX=0x55f23f76c890,memcached,SRV=0x55f23f7880b0,IX=0) Received server error LOCKED (0x9) on packet: OP=0x94, RC=0x9, SEQ=1
PHP message: [cb,WARN] (server L:382 I:1675335737) <srv1-redacted:11210> (CTX=0x55f23f76c890,memcached,SRV=0x55f23f7880b0,IX=0) Received server error LOCKED (0x9) on packet: OP=0x94, RC=0x9, SEQ=2'
[Tue Oct 06 12:04:27.284334 2020] [proxy_fcgi:error] [pid 6508] [client 141.101.99.129:13298] AH01071: Got error '
PHP message: [cb,WARN] (server L:382 I:2352095019) <srv3-redacted:11210> (CTX=0x55f23f7ccdf0,memcached,SRV=0x55f23f79dbb0,IX=3) Received server error LOCKED (0x9) on packet: OP=0x94, RC=0x9, SEQ=7
PHP message: [cb,WARN] (server L:382 I:2352095019) <srv3-redacted:11210> (CTX=0x55f23f7ccdf0,memcached,SRV=0x55f23f79dbb0,IX=3) Received server error LOCKED (0x9) on packet: OP=0x94, RC=0x9, SEQ=9
PHP message: [cb,WARN] (server L:382 I:2352095019) <srv3-redacted:11210> (CTX=0x55f23f7ccdf0,memcached,SRV=0x55f23f79dbb0,IX=3) Received server error LOCKED (0x9) on packet: OP=0x94, RC=0x9, SEQ=10'

and these:

Oct  6 09:52:16 dv-wb01 AppPopulateQueue: [cb,WARN] (lcbio_mgr L:463 I:2192574396) <srv6-redacted:8093> (HE=0x55efb4f62320) Pooled socket is dead. Continuing to next one
Oct  6 09:53:43 dv-wb01 AppPopulateQueue: [cb,WARN] (lcbio_mgr L:463 I:2192574396) <srv6-redacted:8093> (HE=0x55efb4f62320) Pooled socket is dead. Continuing to next one
Oct  6 09:54:27 dv-wb01 AppPopulateQueue: [cb,WARN] (lcbio_mgr L:463 I:2192574396) <srv6-redacted:8093> (HE=0x55efb4f62320) Pooled socket is dead. Continuing to next one
Oct  6 09:56:00 dv-wb01 AppPopulateQueue: [cb,WARN] (lcbio_mgr L:463 I:2192574396) <srv6-redacted:8093> (HE=0x55efb4f62320) Pooled socket is dead. Continuing to next one
Oct  6 09:57:17 dv-wb01 AppPopulateQueue: [cb,WARN] (lcbio_mgr L:463 I:2192574396) <srv6-redacted:8093> (HE=0x55efb4f62320) Pooled socket is dead. Continuing to next one

Why do you think the Cochbase servers (babysiter.log) is generating these messages?

[ns_server:info,2020-10-12T16:44:05.123Z,babysitter_of_ns_1@cb.local:<0.115.0>:ns_port_server:log:224]memcached<0.115.0>: 2020-10-12T16:44:04.922293+00:00 WARNING 737: Invalid packet format detected (magic: 0x72), closing connection

[ns_server:info,2020-10-12T16:44:34.136Z,babysitter_of_ns_1@cb.local:<0.115.0>:ns_port_server:log:224]memcached<0.115.0>: 2020-10-12T16:44:33.935573+00:00 WARNING 256: Slow operation. {"cid":"a06755b21e165d4d/442ea74070e360c5/1f060000","duration":"28 s","trace":"request=1278178447275177:28980012 store=0:0 sync_write.ack_local=1278207426695186:0 store=1278207426659518:36 sync_write.ack_remote=1278207427177566:0 sync_write.prepare=1278207426676861:515 store=0:0 store=1278207427281933:0","command":"SET","peer":"ip-redacted:49292","bucket":"app-live","packet":{"bodylen":1434,"cas":0,"datatype":"raw","extlen":8,"frameid":{"durability":{"level":"Majority","timeout":2250}},"key":"<ud>meta:0b1a8c89-7868-45a4-90b3-a38767f8fc4f:si</ud>","keylen":44,"magic":"AltClientRequest","opaque":1567,"opcode":"SET","vbucket":154}}
memcached<0.115.0>: 2020-10-12T16:44:33.935595+00:00 WARNING 256: Invalid packet format detected (magic: 0x75), closing connection

[ns_server:info,2020-10-12T16:44:47.644Z,babysitter_of_ns_1@cb.local:<0.115.0>:ns_port_server:log:224]memcached<0.115.0>: 2020-10-12T16:44:47.443535+00:00 WARNING 256: Slow operation. {"cid":"a06755b21e165d4d/1bfca1d8b865e178/2a060000","duration":"13 s","trace":"request=1278207459739537:13475501 store=0:0 sync_write.ack_local=1278220934719070:0 store=1278220934677231:42 sync_write.ack_remote=1278220935160222:0 sync_write.prepare=1278220934697709:478 store=0:0 store=1278220935231827:0","command":"SET","peer":"ip-redacted:49396","bucket":"app-live","packet":{"bodylen":1435,"cas":0,"datatype":"raw","extlen":8,"frameid":{"durability":{"level":"Majority","timeout":2250}},"key":"<ud>meta:7ceb89b3-23e3-4a28-b19e-54c056efb8e3:si</ud>","keylen":44,"magic":"AltClientRequest","opaque":1578,"opcode":"SET","vbucket":255}}
memcached<0.115.0>: 2020-10-12T16:44:47.443564+00:00 WARNING 256: Invalid packet format detected (magic: 0xe5), closing connection

[ns_server:info,2020-10-12T16:45:10.154Z,babysitter_of_ns_1@cb.local:<0.115.0>:ns_port_server:log:224]memcached<0.115.0>: 2020-10-12T16:45:09.953623+00:00 WARNING 256: Slow operation. {"cid":"a06755b21e165d4d/3a61fe0831a1d166/30060000","duration":"22 s","trace":"request=1278220967430183:22477908 store=0:0 sync_write.ack_local=1278243444799744:0 store=1278243444765509:35 sync_write.ack_remote=1278243445277055:0 sync_write.prepare=1278243444782191:512 store=0:0 store=1278243445331758:0","command":"SET","peer":"ip-redacted:49452","bucket":"app-live","packet":{"bodylen":1435,"cas":0,"datatype":"raw","extlen":8,"frameid":{"durability":{"level":"Majority","timeout":2250}},"key":"<ud>meta:7ceb89b3-23e3-4a28-b19e-54c056efb8e3:si</ud>","keylen":44,"magic":"AltClientRequest","opaque":1584,"opcode":"SET","vbucket":255}}
memcached<0.115.0>: 2020-10-12T16:45:09.953653+00:00 WARNING 256: Invalid packet format detected (magic: 0x69), closing connection

[ns_server:info,2020-10-12T16:45:41.662Z,babysitter_of_ns_1@cb.local:<0.115.0>:ns_port_server:log:224]memcached<0.115.0>: 2020-10-12T16:45:41.461109+00:00 WARNING 256: Slow operation. {"cid":"a06755b21e165d4d/796b189f5ee745d3/37060000","duration":"31 s","trace":"request=1278243477977768:31474842 store=0:0 sync_write.ack_local=1278274952279679:0 store=1278274952246216:34 sync_write.ack_remote=1278274952698630:0 sync_write.prepare=1278274952262106:453 store=0:0 store=1278274952813737:0","command":"SET","peer":"ip-redacted:49498","bucket":"app-live","packet":{"bodylen":1435,"cas":0,"datatype":"raw","extlen":8,"frameid":{"durability":{"level":"Majority","timeout":2250}},"key":"<ud>meta:7ceb89b3-23e3-4a28-b19e-54c056efb8e3:si</ud>","keylen":44,"magic":"AltClientRequest","opaque":1591,"opcode":"SET","vbucket":255}}
memcached<0.115.0>: 2020-10-12T16:45:41.461140+00:00 WARNING 256: Invalid packet format detected (magic: 0xe5), closing connection

[ns_server:info,2020-10-12T16:46:02.672Z,babysitter_of_ns_1@cb.local:<0.115.0>:ns_port_server:log:224]memcached<0.115.0>: 2020-10-12T16:46:02.471803+00:00 WARNING 256: Slow operation. {"cid":"a06755b21e165d4d/519134deb113ce04/3e060000","duration":"20 s","trace":"request=1278274985698808:20977814 store=0:0 sync_write.ack_local=1278295962919206:0 store=1278295962878550:41 sync_write.ack_remote=1278295963430592:0 sync_write.prepare=1278295962896373:548 store=0:0 store=1278295963505192:0","command":"SET","peer":"ip-redacted:49536","bucket":"app-live","packet":{"bodylen":1435,"cas":0,"datatype":"raw","extlen":8,"frameid":{"durability":{"level":"Majority","timeout":2250}},"key":"<ud>meta:7ceb89b3-23e3-4a28-b19e-54c056efb8e3:si</ud>","keylen":44,"magic":"AltClientRequest","opaque":1598,"opcode":"SET","vbucket":255}}
memcached<0.115.0>: 2020-10-12T16:46:02.471836+00:00 WARNING 256: Invalid packet format detected (magic: 0x7d), closing connection

[ns_server:info,2020-10-12T16:46:19.179Z,babysitter_of_ns_1@cb.local:<0.115.0>:ns_port_server:log:224]memcached<0.115.0>: 2020-10-12T16:46:18.978422+00:00 WARNING 256: Slow operation. {"cid":"a06755b21e165d4d/75741331ef1ccb6b/48060000","duration":"16 s","trace":"request=1278295998514752:16471617 store=0:0 sync_write.ack_local=1278312469521207:0 store=1278312469488636:33 sync_write.ack_remote=1278312470012735:0 sync_write.prepare=1278312469505017:525 store=0:0 store=1278312470124586:0","command":"SET","peer":"ip-redacted:49568","bucket":"app-live","packet":{"bodylen":1435,"cas":0,"datatype":"raw","extlen":8,"frameid":{"durability":{"level":"Majority","timeout":2250}},"key":"<ud>meta:d11dd988-d1ed-4565-a1c0-62f5532cb44e:si</ud>","keylen":44,"magic":"AltClientRequest","opaque":1608,"opcode":"SET","vbucket":254}}
memcached<0.115.0>: 2020-10-12T16:46:18.978457+00:00 WARNING 256: Invalid packet format detected (magic: 0x5f), closing connection

[ns_server:info,2020-10-12T16:46:44.688Z,babysitter_of_ns_1@cb.local:<0.115.0>:ns_port_server:log:224]memcached<0.115.0>: 2020-10-12T16:46:44.487316+00:00 WARNING 689: Slow operation. {"cid":"a06755b21e165d4d/94df6c18def68c4b/51060000","duration":"25 s","trace":"request=1278312503395165:25475636 store=0:0 sync_write.ack_local=1278337978545001:0 store=1278337978509264:36 sync_write.ack_remote=1278337978972388:0 sync_write.prepare=1278337978527168:455 store=0:0 store=1278337979027622:0","command":"SET","peer":"ip-redacted5:49586","bucket":"app-live","packet":{"bodylen":1435,"cas":0,"datatype":"raw","extlen":8,"frameid":{"durability":{"level":"Majority","timeout":2250}},"key":"<ud>meta:d11dd988-d1ed-4565-a1c0-62f5532cb44e:si</ud>","keylen":44,"magic":"AltClientRequest","opaque":1617,"opcode":"SET","vbucket":254}}
memcached<0.115.0>: 2020-10-12T16:46:44.487339+00:00 WARNING 689: Invalid packet format detected (magic: 0x74), closing connection

Do you think these are errors, or normal messages?

Definitely errors, but they are consistent with some kind of corruption. These messages are coming from the memcached process, which is what is receiving data from the php/libcouchbase client. If the magic byte on the command boundary does not match, it is effectively detecting corruption and dropping the connection.

Let me circle up with @avsej to see what we should do next here.

Is there anything I could do to help? @ingenthr @avsej @graham.pople

It is more than a week and things seem to evolve very slowly. We are trying to develop our product on top of the latest version of Couchbase, using the latest libs, nothing fancy.

We are not asking for special functionalities, just the minimum functionality to set and get an object from the DB.

I wonder what would happen if we would have paid licenses, how would you handle this?

Our management already is pushing me to use different technology, This can be a good opportunity for you guys. I am talking here more than 15 - 30 servers with Couchbase.

if this is not going to be fixed this week, we are going to switch our development from Couchbase starting next week.

I am ready to give you access to all these servers to do your own tests, I am available at any hour to help you guys, just let me know what you need.

Hey @flaviu, just following up here. I heard from @avsej that you and he were able to sync up. My understanding is when you paired up, you isolated it down to something having to do with compression. If compression is disabled, at least so far, the corruption hasn’t been observed.

@avsej tells me that he’s going to pick up his investigation tomorrow.

I hope that’s helped!

Under Couchbase Enterprise Subscriptions you would have access to a larger team of folks to investigate and isolate anything like this, yes. The service level is different as a subscriber than you can receive from forums, yes.

That said, the word “corruption” always grabs our interest. We want to be sure anyone using Couchbase under a subscription or not doesn’t have lost or corrupted data. It may take us a bit of time since we have to balance a lot of things as a growing company, so I appreciate your patience as we got to this one.

Please ping me or DM me if you have further questions.