Details
-
Type:
Bug
-
Status:
Closed
-
Priority:
Blocker
-
Resolution: Fixed
-
Affects Version/s: 2.0
-
Fix Version/s: 2.0
-
Component/s: couchbase-bucket, storage-engine
-
Security Level: Public
-
Labels:None
-
Environment:2.0.0-1314-rel
Description
in viewquerytests.ViewQueryTests.test_employee_dataset_startkey_endkey_queries_rebalance_in
Load 200k docs
Add+Rebalance 6 nodes to cluster while running queries
Looks like what happens is the keys being moved to new servers are not always being persisted to disk.
...from test log's we report when key a did not persist and cannot be indexed by view-engine.
["query doc_id: admin0150-2008_12_27 doesn\'t exist in bucket: default", "Error expected in results for key with invalid state {\'key_vb_state\': \'active\', \'key_last_modification_time\': \'1339537159\', \'key_data_age\': \'0\', \'key_cas\': \'7680066560307527\', \'key_exptime\': \'0\', \'key_is_dirty\': \'0\', \'key_flags\': \'0\', \'key_valid\': \'ram_but_not_disk\'}",
There should be plenty of time to allow this as we retry the query several times and no new docs are being loaded/updated during the rebalance.
diags attached.
Load 200k docs
Add+Rebalance 6 nodes to cluster while running queries
Looks like what happens is the keys being moved to new servers are not always being persisted to disk.
...from test log's we report when key a did not persist and cannot be indexed by view-engine.
["query doc_id: admin0150-2008_12_27 doesn\'t exist in bucket: default", "Error expected in results for key with invalid state {\'key_vb_state\': \'active\', \'key_last_modification_time\': \'1339537159\', \'key_data_age\': \'0\', \'key_cas\': \'7680066560307527\', \'key_exptime\': \'0\', \'key_is_dirty\': \'0\', \'key_flags\': \'0\', \'key_valid\': \'ram_but_not_disk\'}",
There should be plenty of time to allow this as we retry the query several times and no new docs are being loaded/updated during the rebalance.
diags attached.
-
- 10.2.2.67-8091-diag.txt.gz
- 12/Jun/12 6:01 PM
- 567 kB
- Tommie McAfee
-
- 10.2.2.109-8091-diag.txt.gz
- 12/Jun/12 6:01 PM
- 1.69 MB
- Tommie McAfee
-
- 10.2.2.65-8091-diag.txt.gz
- 12/Jun/12 6:01 PM
- 559 kB
- Tommie McAfee
-
- 10.2.2.64-8091-diag.txt.gz
- 12/Jun/12 6:01 PM
- 562 kB
- Tommie McAfee
-
- 10.2.2.63-8091-diag.txt.gz
- 12/Jun/12 6:01 PM
- 1.68 MB
- Tommie McAfee
-
- 10.2.2.108-8091-diag.txt.gz
- 12/Jun/12 6:01 PM
- 555 kB
- Tommie McAfee
-
- 10.2.2.60-8091-diag.txt.gz
- 12/Jun/12 6:01 PM
- 1.43 MB
- Tommie McAfee
-
- finddoc.pl
- 15/Jun/12 3:33 PM
- 0.9 kB
- Tommie McAfee
-
- get_key_meta.py
- 15/Jun/12 3:33 PM
- 0.8 kB
- Tommie McAfee
Activity
- All
- Comments
- Work Log
- History
- Activity
- Gerrit Reviews
Hide
Permalink
Tommie McAfee
added a comment -
Hi Jin, trying to repro for you. I did not see anything in the logs about error's while persisting keys.
Show
Tommie McAfee
added a comment - Hi Jin, trying to repro for you. I did not see anything in the logs about error's while persisting keys.
Hide
Tommie McAfee
added a comment -
Seeing now the state of the servers are in the same condition as MB-5511 with rebalance hanging and deadlock in indexer. Probably a side-effect of that bug.
Show
Tommie McAfee
added a comment - Seeing now the state of the servers are in the same condition as MB-5511 with rebalance hanging and deadlock in indexer. Probably a side-effect of that bug.
Hide
MB-5511 is partially fixed (index deadlock) but rebalance is still hanging and this bug is still reproducible in latest 2.0
Tommie McAfee
added a comment -
Show
Tommie McAfee
added a comment - MB-5511 is partially fixed (index deadlock) but rebalance is still hanging and this bug is still reproducible in latest 2.0
Hide
Tommie McAfee
added a comment -
Hey Jin, I currently have a cluster in this state were keys are not persisted and am using a script to help get the following info:
Here state of key(admin0110-2008_10_06) on host 10.2.2.64 says is 'ram but not disk':
{'key_vb_state': 'replica', 'key_last_modification_time': '1339724247', 'key_data_age': '0', 'key_cas': '7866802812969421', 'key_exptime': '0', 'key_is_dirty': '0', 'key_flags': '0', 'key_valid': 'item_deleted'}
If I query the database, notice this key is skipped in the results:
startkey = [2008,10,6] , startkey_docid = admin0109-2008_10_06
…..
{"id":"admin0109-2008_10_06","key":[2008,10,6],"value":["employee-109-0522e76","109@couchbase.com"]},
{"id":"admin0111-2008_10_06","key":[2008,10,6],"value":["employee-111-0522e76","111@couchbase.com"]},
{"id":"admin0112-2008_10_06","key":[2008,10,6],"value":["employee-112-0522e76","112@couchbase.com"]},
{"id":"admin0113-2008_10_06","key":[2008,10,6],"value":["employee-113-0522e76","113@couchbase.com"]},
This is post rebalance in.
master node is 10.2.2.60
Would be great if someone could take a look.
Here state of key(admin0110-2008_10_06) on host 10.2.2.64 says is 'ram but not disk':
{'key_vb_state': 'replica', 'key_last_modification_time': '1339724247', 'key_data_age': '0', 'key_cas': '7866802812969421', 'key_exptime': '0', 'key_is_dirty': '0', 'key_flags': '0', 'key_valid': 'item_deleted'}
If I query the database, notice this key is skipped in the results:
startkey = [2008,10,6] , startkey_docid = admin0109-2008_10_06
…..
{"id":"admin0109-2008_10_06","key":[2008,10,6],"value":["employee-109-0522e76","109@couchbase.com"]},
{"id":"admin0111-2008_10_06","key":[2008,10,6],"value":["employee-111-0522e76","111@couchbase.com"]},
{"id":"admin0112-2008_10_06","key":[2008,10,6],"value":["employee-112-0522e76","112@couchbase.com"]},
{"id":"admin0113-2008_10_06","key":[2008,10,6],"value":["employee-113-0522e76","113@couchbase.com"]},
This is post rebalance in.
master node is 10.2.2.60
Would be great if someone could take a look.
Show
Tommie McAfee
added a comment - Hey Jin, I currently have a cluster in this state were keys are not persisted and am using a script to help get the following info:
Here state of key(admin0110-2008_10_06) on host 10.2.2.64 says is 'ram but not disk':
{'key_vb_state': 'replica', 'key_last_modification_time': '1339724247', 'key_data_age': '0', 'key_cas': '7866802812969421', 'key_exptime': '0', 'key_is_dirty': '0', 'key_flags': '0', 'key_valid': 'item_deleted'}
If I query the database, notice this key is skipped in the results:
startkey = [2008,10,6] , startkey_docid = admin0109-2008_10_06
…..
{"id":"admin0109-2008_10_06","key":[2008,10,6],"value":["employee-109-0522e76"," 109@couchbase.com "]},
{"id":"admin0111-2008_10_06","key":[2008,10,6],"value":["employee-111-0522e76"," 111@couchbase.com "]},
{"id":"admin0112-2008_10_06","key":[2008,10,6],"value":["employee-112-0522e76"," 112@couchbase.com "]},
{"id":"admin0113-2008_10_06","key":[2008,10,6],"value":["employee-113-0522e76"," 113@couchbase.com "]},
This is post rebalance in.
master node is 10.2.2.60
Would be great if someone could take a look.
Show
Tommie McAfee
added a comment - build 1334
Hide
Filipe Manana
added a comment -
Unlikely to be because of MB-5511 or MB-5541.
We've seen this error "ram_but_not_disk" before the changes that added the occasional deadlock were merged. Also, if the deadlock was triggered, all subsequent operations and tests would just timeout or never finish, and this wasn't the case in the results ofMB-5331.
We've seen this error "ram_but_not_disk" before the changes that added the occasional deadlock were merged. Also, if the deadlock was triggered, all subsequent operations and tests would just timeout or never finish, and this wasn't the case in the results of
Show
Filipe Manana
added a comment - Unlikely to be because of MB-5511 or MB-5541 .
We've seen this error "ram_but_not_disk" before the changes that added the occasional deadlock were merged. Also, if the deadlock was triggered, all subsequent operations and tests would just timeout or never finish, and this wasn't the case in the results of MB-5331 .
Hide
Filipe Manana
added a comment -
With build 1338, which includes the view-engine dead lock fixes (unrelated to the problem btw, MB-5541 MB-5511), we're still seeing this issue:
http://qa.hq.northscale.net/job/centos-64-vew-merge-tests/202/
The test detects, via memcached, that some keys are in RAM but didn't get persisted to disk.
http://qa.hq.northscale.net/job/centos-64-vew-merge-tests/202/
The test detects, via memcached, that some keys are in RAM but didn't get persisted to disk.
Show
Filipe Manana
added a comment - With build 1338, which includes the view-engine dead lock fixes (unrelated to the problem btw, MB-5541 MB-5511 ), we're still seeing this issue:
http://qa.hq.northscale.net/job/centos-64-vew-merge-tests/202/
The test detects, via memcached, that some keys are in RAM but didn't get persisted to disk.
Hide
Filipe Manana
added a comment -
Also, looking at one of the tests output:
[2012-06-15 09:47:29,432] - [data_helper] [140513919293184] - INFO - key arch0008-2008_09_22 doesn't exist in memcached
[2012-06-15 09:47:29,511] - [data_helper] [140513919293184] - INFO - key arch0000-2008_10_09 doesn't exist in memcached
[2012-06-15 09:47:29,581] - [data_helper] [140513919293184] - INFO - key arch0191-2008_01_23 doesn't exist in memcached
[2012-06-15 09:47:29,659] - [data_helper] [140513919293184] - INFO - key arch0187-2008_05_14 doesn't exist in memcached
[2012-06-15 09:47:29,739] - [data_helper] [140513919293184] - INFO - key arch0070-2008_09_19 doesn't exist in memcached
[2012-06-15 09:47:29,815] - [data_helper] [140513919293184] - INFO - key arch0040-2008_07_27 doesn't exist in memcached
[2012-06-15 09:47:29,894] - [data_helper] [140513919293184] - INFO - key arch0095-2008_10_05 doesn't exist in memcached
[2012-06-15 09:47:29,979] - [data_helper] [140513919293184] - INFO - key arch0087-2008_03_13 doesn't exist in memcached
[2012-06-15 09:47:30,058] - [data_helper] [140513919293184] - INFO - key arch0034-2008_04_13 doesn't exist in memcached
[2012-06-15 09:47:30,139] - [data_helper] [140513919293184] - INFO - key arch0084-2008_04_08 doesn't exist in memcached
From:
http://qa.hq.northscale.net/job/centos-64-vew-merge-tests/202/artifact/logs/testrunner-12-Jun-15_05-28-10/test_employee_dataset_alldocs_queries_rebalance_in.log
[2012-06-15 09:47:29,432] - [data_helper] [140513919293184] - INFO - key arch0008-2008_09_22 doesn't exist in memcached
[2012-06-15 09:47:29,511] - [data_helper] [140513919293184] - INFO - key arch0000-2008_10_09 doesn't exist in memcached
[2012-06-15 09:47:29,581] - [data_helper] [140513919293184] - INFO - key arch0191-2008_01_23 doesn't exist in memcached
[2012-06-15 09:47:29,659] - [data_helper] [140513919293184] - INFO - key arch0187-2008_05_14 doesn't exist in memcached
[2012-06-15 09:47:29,739] - [data_helper] [140513919293184] - INFO - key arch0070-2008_09_19 doesn't exist in memcached
[2012-06-15 09:47:29,815] - [data_helper] [140513919293184] - INFO - key arch0040-2008_07_27 doesn't exist in memcached
[2012-06-15 09:47:29,894] - [data_helper] [140513919293184] - INFO - key arch0095-2008_10_05 doesn't exist in memcached
[2012-06-15 09:47:29,979] - [data_helper] [140513919293184] - INFO - key arch0087-2008_03_13 doesn't exist in memcached
[2012-06-15 09:47:30,058] - [data_helper] [140513919293184] - INFO - key arch0034-2008_04_13 doesn't exist in memcached
[2012-06-15 09:47:30,139] - [data_helper] [140513919293184] - INFO - key arch0084-2008_04_08 doesn't exist in memcached
From:
http://qa.hq.northscale.net/job/centos-64-vew-merge-tests/202/artifact/logs/testrunner-12-Jun-15_05-28-10/test_employee_dataset_alldocs_queries_rebalance_in.log
Show
Filipe Manana
added a comment - Also, looking at one of the tests output:
[2012-06-15 09:47:29,432] - [data_helper] [140513919293184] - INFO - key arch0008-2008_09_22 doesn't exist in memcached
[2012-06-15 09:47:29,511] - [data_helper] [140513919293184] - INFO - key arch0000-2008_10_09 doesn't exist in memcached
[2012-06-15 09:47:29,581] - [data_helper] [140513919293184] - INFO - key arch0191-2008_01_23 doesn't exist in memcached
[2012-06-15 09:47:29,659] - [data_helper] [140513919293184] - INFO - key arch0187-2008_05_14 doesn't exist in memcached
[2012-06-15 09:47:29,739] - [data_helper] [140513919293184] - INFO - key arch0070-2008_09_19 doesn't exist in memcached
[2012-06-15 09:47:29,815] - [data_helper] [140513919293184] - INFO - key arch0040-2008_07_27 doesn't exist in memcached
[2012-06-15 09:47:29,894] - [data_helper] [140513919293184] - INFO - key arch0095-2008_10_05 doesn't exist in memcached
[2012-06-15 09:47:29,979] - [data_helper] [140513919293184] - INFO - key arch0087-2008_03_13 doesn't exist in memcached
[2012-06-15 09:47:30,058] - [data_helper] [140513919293184] - INFO - key arch0034-2008_04_13 doesn't exist in memcached
[2012-06-15 09:47:30,139] - [data_helper] [140513919293184] - INFO - key arch0084-2008_04_08 doesn't exist in memcached
From:
http://qa.hq.northscale.net/job/centos-64-vew-merge-tests/202/artifact/logs/testrunner-12-Jun-15_05-28-10/test_employee_dataset_alldocs_queries_rebalance_in.log
Hide
Filipe Manana
added a comment -
Farshid, can you give this high priority, like trunk green blocker?
Show
Filipe Manana
added a comment - Farshid, can you give this high priority, like trunk green blocker?
Hide
Jin Lim
added a comment -
The db files being used for this failed test are gone by now. Otherwise, we could have first verified whether the missing document has never persisted or has actually persisted on disk but somehow being not known to upper layers. Anyway, while looking at remaining logs, I found a bunch of compactor crash one just like this below. And, I think this needs to be looked at by engineering team - as the compactor crash could have left a db file unknown state including missing docs.
[ns_server:error] [2012-06-15 10:02:37] [ns_1@10.2.2.60:compaction_daemon:compaction_daemon:handle_info:211] Compactor <0.10399.2> exited unexpectedly: {timeout,
{gen_server,call,
['capi_set_view_manager-default',
fetch_ddocs]}}. Moving to the next bucket.
[error_logger:error] [2012-06-15 10:02:37] [ns_1@10.2.2.60:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: compaction_daemon:-spawn_bucket_compactor/3-fun-2-/0
pid: <0.10399.2>
registered_name: []
exception exit: {timeout,
{gen_server,call,
['capi_set_view_manager-default',fetch_ddocs]}}
in function gen_server:call/2
in call from compaction_daemon:ddoc_names/1
in call from compaction_daemon:'-spawn_bucket_compactor/3-fun-2-'/4
ancestors: [compaction_daemon,<0.431.0>,ns_server_sup,
ns_server_cluster_sup,<0.59.0>]
messages: []
links: [<0.438.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 6765
stack_size: 24
reductions: 174901
neighbours:
[ns_server:error] [2012-06-15 10:02:37] [ns_1@10.2.2.60:compaction_daemon:compaction_daemon:handle_info:211] Compactor <0.10399.2> exited unexpectedly: {timeout,
{gen_server,call,
['capi_set_view_manager-default',
fetch_ddocs]}}. Moving to the next bucket.
[error_logger:error] [2012-06-15 10:02:37] [ns_1@10.2.2.60:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: compaction_daemon:-spawn_bucket_compactor/3-fun-2-/0
pid: <0.10399.2>
registered_name: []
exception exit: {timeout,
{gen_server,call,
['capi_set_view_manager-default',fetch_ddocs]}}
in function gen_server:call/2
in call from compaction_daemon:ddoc_names/1
in call from compaction_daemon:'-spawn_bucket_compactor/3-fun-2-'/4
ancestors: [compaction_daemon,<0.431.0>,ns_server_sup,
ns_server_cluster_sup,<0.59.0>]
messages: []
links: [<0.438.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 6765
stack_size: 24
reductions: 174901
neighbours:
Show
Jin Lim
added a comment - The db files being used for this failed test are gone by now. Otherwise, we could have first verified whether the missing document has never persisted or has actually persisted on disk but somehow being not known to upper layers. Anyway, while looking at remaining logs, I found a bunch of compactor crash one just like this below. And, I think this needs to be looked at by engineering team - as the compactor crash could have left a db file unknown state including missing docs.
[ns_server:error] [2012-06-15 10:02:37] [ ns_1@10.2.2.60 :compaction_daemon:compaction_daemon:handle_info:211] Compactor <0.10399.2> exited unexpectedly: {timeout,
{gen_server,call,
['capi_set_view_manager-default',
fetch_ddocs]}}. Moving to the next bucket.
[error_logger:error] [2012-06-15 10:02:37] [ ns_1@10.2.2.60 :error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
crasher:
initial call: compaction_daemon:-spawn_bucket_compactor/3-fun-2-/0
pid: <0.10399.2>
registered_name: []
exception exit: {timeout,
{gen_server,call,
['capi_set_view_manager-default',fetch_ddocs]}}
in function gen_server:call/2
in call from compaction_daemon:ddoc_names/1
in call from compaction_daemon:'-spawn_bucket_compactor/3-fun-2-'/4
ancestors: [compaction_daemon,<0.431.0>,ns_server_sup,
ns_server_cluster_sup,<0.59.0>]
messages: []
links: [<0.438.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 6765
stack_size: 24
reductions: 174901
neighbours:
Hide
Filipe Manana
added a comment -
Jin, that doesn't cause the data loss (either persistence or memory) issues.
I'll put my hands on fire if it does.
I'll put my hands on fire if it does.
Show
Filipe Manana
added a comment - Jin, that doesn't cause the data loss (either persistence or memory) issues.
I'll put my hands on fire if it does.
Hide
Filipe Manana
added a comment -
And btw, this is not the compactor, it's just the compaction scheduler. This failure happened before compaction being triggered - even if it failing during compaction, when compaction is restarted it never reuses an existing partially compacted file - it truncates the existing compact file (if any) to 0 bytes - this applies both to database and view compaction.
Show
Filipe Manana
added a comment - And btw, this is not the compactor, it's just the compaction scheduler. This failure happened before compaction being triggered - even if it failing during compaction, when compaction is restarted it never reuses an existing partially compacted file - it truncates the existing compact file (if any) to 0 bytes - this applies both to database and view compaction.
Hide
Tommie McAfee
added a comment -
Using a script from Jin that looks for the document in couchdb we found that the missing key exists in disk on a node different than the one reporting "ram_but_not_disk". Something may have happened after rebalance to cause this. Below you can see two nodes are returning meta_data for the key (probably the original and new node). However, the first node has state "item_deleted"
./get_key_meta.py
10.2.2.108
{'key_vb_state': 'replica', 'key_last_modification_time': '1339788618', 'key_data_age': '0', 'key_cas': '7930769912841273', 'key_exptime': '0', 'key_is_dirty': '0', 'key_flags': '0', 'key_valid': 'item_deleted'}
trying diff memcached....
trying diff memcached....
trying diff memcached....
trying diff memcached....
10.2.2.63
{'key_vb_state': 'active', 'key_last_modification_time': '1339788619', 'key_data_age': '0', 'key_cas': '7930769912841273', 'key_exptime': '0', 'key_is_dirty': '0', 'key_flags': '0', 'key_valid': 'ram_but_not_disk'}
--------COUCH SEARCH ON 10.2.2.108---------
[root@localhost default]# ./finddoc.pl arch0107-2008_11_14
Unquoted string "break" may clash with future reserved word at ./finddoc.pl line 28.
Useless use of a constant in void context at ./finddoc.pl line 28.
--------------------------------------------------
finding doc, id: arch0107-2008_11_14, in /opt/couchbase/var/lib/couchbase/data/default, may take a while...
--------------------------------------------------
skipping master doc, master.couch.1
found doc, id: arch0107-2008_11_14, in 56.couch.1
[root@localhost default]#
--------COUCH SEARCH ON 10.2.2.63---------
[root@localhost default]# ./finddoc.pl arch0107-2008_11_14
Unquoted string "break" may clash with future reserved word at ./finddoc.pl line 28.
Useless use of a constant in void context at ./finddoc.pl line 28.
--------------------------------------------------
finding doc, id: arch0107-2008_11_14, in /opt/couchbase/var/lib/couchbase/data/default, may take a while...
--------------------------------------------------
skipping master doc, master.couch.1
[root@localhost default]#
./get_key_meta.py
10.2.2.108
{'key_vb_state': 'replica', 'key_last_modification_time': '1339788618', 'key_data_age': '0', 'key_cas': '7930769912841273', 'key_exptime': '0', 'key_is_dirty': '0', 'key_flags': '0', 'key_valid': 'item_deleted'}
trying diff memcached....
trying diff memcached....
trying diff memcached....
trying diff memcached....
10.2.2.63
{'key_vb_state': 'active', 'key_last_modification_time': '1339788619', 'key_data_age': '0', 'key_cas': '7930769912841273', 'key_exptime': '0', 'key_is_dirty': '0', 'key_flags': '0', 'key_valid': 'ram_but_not_disk'}
--------COUCH SEARCH ON 10.2.2.108---------
[root@localhost default]# ./finddoc.pl arch0107-2008_11_14
Unquoted string "break" may clash with future reserved word at ./finddoc.pl line 28.
Useless use of a constant in void context at ./finddoc.pl line 28.
--------------------------------------------------
finding doc, id: arch0107-2008_11_14, in /opt/couchbase/var/lib/couchbase/data/default, may take a while...
--------------------------------------------------
skipping master doc, master.couch.1
found doc, id: arch0107-2008_11_14, in 56.couch.1
[root@localhost default]#
--------COUCH SEARCH ON 10.2.2.63---------
[root@localhost default]# ./finddoc.pl arch0107-2008_11_14
Unquoted string "break" may clash with future reserved word at ./finddoc.pl line 28.
Useless use of a constant in void context at ./finddoc.pl line 28.
--------------------------------------------------
finding doc, id: arch0107-2008_11_14, in /opt/couchbase/var/lib/couchbase/data/default, may take a while...
--------------------------------------------------
skipping master doc, master.couch.1
[root@localhost default]#
Show
Tommie McAfee
added a comment - Using a script from Jin that looks for the document in couchdb we found that the missing key exists in disk on a node different than the one reporting "ram_but_not_disk". Something may have happened after rebalance to cause this. Below you can see two nodes are returning meta_data for the key (probably the original and new node). However, the first node has state "item_deleted"
./get_key_meta.py
10.2.2.108
{'key_vb_state': 'replica', 'key_last_modification_time': '1339788618', 'key_data_age': '0', 'key_cas': '7930769912841273', 'key_exptime': '0', 'key_is_dirty': '0', 'key_flags': '0', 'key_valid': 'item_deleted'}
trying diff memcached....
trying diff memcached....
trying diff memcached....
trying diff memcached....
10.2.2.63
{'key_vb_state': 'active', 'key_last_modification_time': '1339788619', 'key_data_age': '0', 'key_cas': '7930769912841273', 'key_exptime': '0', 'key_is_dirty': '0', 'key_flags': '0', 'key_valid': 'ram_but_not_disk'}
--------COUCH SEARCH ON 10.2.2.108---------
[ root@localhost default]# ./finddoc.pl arch0107-2008_11_14
Unquoted string "break" may clash with future reserved word at ./finddoc.pl line 28.
Useless use of a constant in void context at ./finddoc.pl line 28.
--------------------------------------------------
finding doc, id: arch0107-2008_11_14, in /opt/couchbase/var/lib/couchbase/data/default, may take a while...
--------------------------------------------------
skipping master doc, master.couch.1
found doc, id: arch0107-2008_11_14, in 56.couch.1
[ root@localhost default]#
--------COUCH SEARCH ON 10.2.2.63---------
[ root@localhost default]# ./finddoc.pl arch0107-2008_11_14
Unquoted string "break" may clash with future reserved word at ./finddoc.pl line 28.
Useless use of a constant in void context at ./finddoc.pl line 28.
--------------------------------------------------
finding doc, id: arch0107-2008_11_14, in /opt/couchbase/var/lib/couchbase/data/default, may take a while...
--------------------------------------------------
skipping master doc, master.couch.1
[ root@localhost default]#
Hide
Filipe Manana
added a comment -
Tommie what's the vbucket ID?
As a side questions, where do those scripts live?
As a side questions, where do those scripts live?
Show
Filipe Manana
added a comment - Tommie what's the vbucket ID?
As a side questions, where do those scripts live?
Hide
Tommie McAfee
added a comment -
>Tommie what's the vbucket ID?
56
>where do those scripts live?
They were local. I will attach them here
56
>where do those scripts live?
They were local. I will attach them here
Show
Tommie McAfee
added a comment - >Tommie what's the vbucket ID?
56
>where do those scripts live?
They were local. I will attach them here
Hide
Filipe Manana
added a comment -
Thanks.
So it seems really like a persistence issue in the first place.
The key should be persisted in node 10.2.2.63, but it is not.
Hard to tell when exactly the failing queries happened, but I see that at the end of the logs for those 2 nodes the vbucket 56 is not marked as active in the indexes.
So it seems really like a persistence issue in the first place.
The key should be persisted in node 10.2.2.63, but it is not.
Hard to tell when exactly the failing queries happened, but I see that at the end of the logs for those 2 nodes the vbucket 56 is not marked as active in the indexes.
Show
Filipe Manana
added a comment - Thanks.
So it seems really like a persistence issue in the first place.
The key should be persisted in node 10.2.2.63, but it is not.
Hard to tell when exactly the failing queries happened, but I see that at the end of the logs for those 2 nodes the vbucket 56 is not marked as active in the indexes.
Show
Sharon Barr
added a comment - Looks like a blocker rather then a major issue
Hide
Tommie McAfee
added a comment -
More debugging tips from Filipe…. discovered update_seq = 0 for db(56) which corresponds to this key on active node. This db also reports "no documents" so no writes have happened here:
/opt/couchbase/bin/couch_dbinfo 56.couch.1
DB Info (56.couch.1)
file format version: 10
update_seq: 0
no documents
B-tree size: 0 bytes
total disk size: 4.0 kB
Information also reflected in index info:
curl -s 'http://10.2.2.63:8092/_set_view/default/_design/dev_test_view-6ffa498/_info' | json_xs
….
"active_partitions" : [
56,
57,
58,
59,
60,
61,
62,
63,
64,
65,
66,
67,
68,
69,
70,
71,
72,
73
],
"pending_transition" : null,
"update_seqs" : {
"67" : 1560,
"63" : 1571,
"71" : 1585,
"70" : 1581,
"68" : 1558,
"72" : 1585,
"65" : 1580,
"57" : 1573,
"64" : 1581,
"61" : 1584,
"58" : 1585,
"59" : 1590,
"69" : 1563,
"60" : 1125,
"56" : 0,
"73" : 1581,
"66" : 1562,
"62" : 1572
},
/opt/couchbase/bin/couch_dbinfo 56.couch.1
DB Info (56.couch.1)
file format version: 10
update_seq: 0
no documents
B-tree size: 0 bytes
total disk size: 4.0 kB
Information also reflected in index info:
curl -s 'http://10.2.2.63:8092/_set_view/default/_design/dev_test_view-6ffa498/_info' | json_xs
….
"active_partitions" : [
56,
57,
58,
59,
60,
61,
62,
63,
64,
65,
66,
67,
68,
69,
70,
71,
72,
73
],
"pending_transition" : null,
"update_seqs" : {
"67" : 1560,
"63" : 1571,
"71" : 1585,
"70" : 1581,
"68" : 1558,
"72" : 1585,
"65" : 1580,
"57" : 1573,
"64" : 1581,
"61" : 1584,
"58" : 1585,
"59" : 1590,
"69" : 1563,
"60" : 1125,
"56" : 0,
"73" : 1581,
"66" : 1562,
"62" : 1572
},
Show
Tommie McAfee
added a comment - More debugging tips from Filipe…. discovered update_seq = 0 for db(56) which corresponds to this key on active node. This db also reports "no documents" so no writes have happened here:
/opt/couchbase/bin/couch_dbinfo 56.couch.1
DB Info (56.couch.1)
file format version: 10
update_seq: 0
no documents
B-tree size: 0 bytes
total disk size: 4.0 kB
Information also reflected in index info:
curl -s ' http://10.2.2.63:8092/_set_view/default/_design/dev_test_view-6ffa498/_info' | json_xs
….
"active_partitions" : [
56,
57,
58,
59,
60,
61,
62,
63,
64,
65,
66,
67,
68,
69,
70,
71,
72,
73
],
"pending_transition" : null,
"update_seqs" : {
"67" : 1560,
"63" : 1571,
"71" : 1585,
"70" : 1581,
"68" : 1558,
"72" : 1585,
"65" : 1580,
"57" : 1573,
"64" : 1581,
"61" : 1584,
"58" : 1585,
"59" : 1590,
"69" : 1563,
"60" : 1125,
"56" : 0,
"73" : 1581,
"66" : 1562,
"62" : 1572
},
Hide
Tommie McAfee
added a comment -
Here's the test runner command to reproduce:
python testrunner -i resources/jenkins/centos-64-7node-viewquery.ini -t viewquerytests.ViewQueryTests.test_employee_dataset_startkey_endkey_queries_rebalance_in
I've disabled the job in jenkins so you can run this exact command and it will run the test against our servers if you cannot reproduce in your own vm's.
Some final observations in manually trying to reproduce this is that heavy queries causes disk write queue to drain slower, and when items are still in queue before rebalance(in) then we can get into this state.
Let me know if anything else is needed.
thanks,
python testrunner -i resources/jenkins/centos-64-7node-viewquery.ini -t viewquerytests.ViewQueryTests.test_employee_dataset_startkey_endkey_queries_rebalance_in
I've disabled the job in jenkins so you can run this exact command and it will run the test against our servers if you cannot reproduce in your own vm's.
Some final observations in manually trying to reproduce this is that heavy queries causes disk write queue to drain slower, and when items are still in queue before rebalance(in) then we can get into this state.
Let me know if anything else is needed.
thanks,
Show
Tommie McAfee
added a comment - Here's the test runner command to reproduce:
python testrunner -i resources/jenkins/centos-64-7node-viewquery.ini -t viewquerytests.ViewQueryTests.test_employee_dataset_startkey_endkey_queries_rebalance_in
I've disabled the job in jenkins so you can run this exact command and it will run the test against our servers if you cannot reproduce in your own vm's.
Some final observations in manually trying to reproduce this is that heavy queries causes disk write queue to drain slower, and when items are still in queue before rebalance(in) then we can get into this state.
Let me know if anything else is needed.
thanks,
Hide
The test now passes without any failure after locally build ep engine with the fix, http://review.couchbase.org/#change,17367.
Note: the number of items being persisted while rebalancing still fluctuates for vbuckets movements, which is perfectly normal under the current design. After the completion of rebalancing one should expect NO data loss at all.
Note: the number of items being persisted while rebalancing still fluctuates for vbuckets movements, which is perfectly normal under the current design. After the completion of rebalancing one should expect NO data loss at all.
Show
Jin Lim
added a comment - - edited The test now passes without any failure after locally build ep engine with the fix, http://review.couchbase.org/#change,17367 .
Note: the number of items being persisted while rebalancing still fluctuates for vbuckets movements, which is perfectly normal under the current design. After the completion of rebalancing one should expect NO data loss at all.
Show
Filipe Manana
added a comment - Super, thanks Jin! :)