[MB-6538] In rare cases CRC codes don't match when reading data from couch file Created: 05/Sep/12  Updated: 24/Oct/12  Resolved: 24/Oct/12

Status: Resolved
Project: Couchbase Server
Component/s: storage-engine
Affects Version/s: None
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Blocker
Reporter: Aleksey Kondratenko Assignee: Aaron Miller (Inactive)
Resolution: Incomplete Votes: 0
Labels: 2.0-beta-release-notes
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File 158.couch.4.xz     File 252.couch.1.xz     File 253.couch.1.xz     PNG File corrupt2.png     PNG File corruption.png     File ns-diag-20120905213312.txt.xz     File ns-diag-20121023170207.txt.xz     PNG File Untitled 2 vs Untitled.png    

I experimented with building an index on 6 cluster_run nodes with 9E6 simple docs. Everything went fine and the results appeared right, but I'm seeing

[ns_server:debug,2012-09-05T21:31:14.218,n_5@]Finished compaction too soon. Next run will be in 30s
[couchdb:error,2012-09-05T21:31:14.296,n_2@<0.9681.0>:couch_log:error:42]Set view `default`, replica group `_design/dev_t`, doc loader error
error: {file_corruption,<<"file corruption">>}
stacktrace: [{couch_file,pread_iolist,2},

[couchdb:error,2012-09-05T21:31:14.297,n_2@<0.6715.0>:couch_log:error:42]Set view `default`, replica group `_design/dev_t`, received error from updater: {file_corruption,
                                                                                 <<"file corruption">>}
[couchdb:info,2012-09-05T21:31:17.856,n_2@<0.6715.0>:couch_log:info:39]Starting updater for set view `default`, replica group `_design/dev_t`
[couchdb:info,2012-09-05T21:31:17.856,n_2@<0.9753.0>:couch_log:info:39]Updater for set view `default`, replica group `_design/dev_t` started

in the logs. Will attach logs from this box.

Comment by Karan Kumar (Inactive) [ 05/Sep/12 ]
Which build?
Comment by Karan Kumar (Inactive) [ 05/Sep/12 ]
ohh. cluster_run
Comment by Filipe Manana [ 06/Sep/12 ]
This happens when reading from a database file, not from an index file.
Comment by Aleksey Kondratenko [ 06/Sep/12 ]
corrupted files attached
Comment by Aaron Miller (Inactive) [ 10/Sep/12 ]
In the corrupted doc in 252.couch, it looks like one byte of the file got stomped on. Both docs have the same CRC and should have the same data, but this byte got messed up somehow.
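
The comparison described above can be sketched roughly as follows. This is only an illustration in Python, not the couchstore or couch_file code: the helper names `diff_offsets` and `crc_matches`, the sample document body, and the use of `zlib.crc32` as the checksum are all assumptions made for the example.

```python
import zlib


def diff_offsets(a: bytes, b: bytes):
    """Return (offset, byte_in_a, byte_in_b) for each position where the copies differ."""
    if len(a) != len(b):
        raise ValueError("copies must have the same length")
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


def crc_matches(data: bytes, stored_crc: int) -> bool:
    """Recompute CRC-32 over the data and compare it with the stored value."""
    return (zlib.crc32(data) & 0xFFFFFFFF) == stored_crc


# Hypothetical document body; the real docs live in the attached .couch files.
good = b'{"name": "doc-1", "value": 42}'
stored_crc = zlib.crc32(good) & 0xFFFFFFFF

# Simulate a single stomped byte in the second copy.
damaged = bytearray(good)
damaged[10] ^= 0x04
bad = bytes(damaged)

for offset, expected, found in diff_offsets(good, bad):
    print(f"offset {offset}: expected 0x{expected:02x}, found 0x{found:02x}")
print("good copy passes CRC:", crc_matches(good, stored_crc))
print("bad copy passes CRC:", crc_matches(bad, stored_crc))
```

Two copies that carry the same stored CRC but differ in exactly one byte point at damage after the checksum was computed, rather than at a bad write of the whole document.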
Comment by Aaron Miller (Inactive) [ 10/Sep/12 ]
see attached screenshot
Comment by Aaron Miller (Inactive) [ 10/Sep/12 ]
other file (253.couch.1)
Comment by Aaron Miller (Inactive) [ 16/Sep/12 ]
I don't understand the name change here. The files in question were never compacted.
Comment by kzeller [ 17/Sep/12 ]
Added to beta release notes: In rare cases, the codes used to test for data corruption (CRC, checksum) do not match when reading data from a couch file.
Comment by Farshid Ghods (Inactive) [ 02/Oct/12 ]

Did you use a RAM disk for persistence when running this test?
Comment by Aleksey Kondratenko [ 03/Oct/12 ]
No. I don't understand why this would matter. _Any_ write to the filesystem (well, except for direct I/O) is a write to the kernel's page cache first.
Comment by damien [ 04/Oct/12 ]
We think this was a regression, possibly a dangling pointer, in ep-engine that has since been fixed. Please reopen if there is another recent instance of this.
Comment by Aleksey Kondratenko [ 23/Oct/12 ]
Got this again.
Comment by Aleksey Kondratenko [ 23/Oct/12 ]
The vbucket in question was in the bucket "other", which was populated by incoming XDCR.
Comment by Aleksey Kondratenko [ 23/Oct/12 ]
Attaching diags from the node that has that badness.
Comment by Aaron Miller (Inactive) [ 24/Oct/12 ]
Single byte error again.
Comment by Aleksey Kondratenko [ 24/Oct/12 ]
Sorry folks, found that my box actually has bad RAM.
Generated at Fri Sep 19 05:53:58 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.