Details
Type: Bug
Status: In Progress
Priority: Blocker
Resolution: Unresolved
Affects Version/s: 2.0
Fix Version/s: 2.1
Component/s: storage-engine
Security Level: Public
Environment: Build 1925. Windows, 2<->2 nodes, 2 buckets, 2 unidir stream
Description
While it works fine in the bidirectional case, there is an obvious issue with 2 buckets and double unidirectional replication: even after 5 hours of the access phase, the server didn't manage to compact the data.
I believe it may be related to general capacity-planning issues (as in MB-6172), but apparently we should not ignore it.
-
- 192.168.162.30-1112012-diag.zip
- 01/Nov/12 12:51 PM
- 6.40 MB
- Pavel Paulau
-
- cbcollect_info_ns_1@10.2.3.31_20121101-153030/couchbase.log 198 kB
- cbcollect_info_ns_1@10.2.3.31_20121101-153030/ns_server.xdcr.log 10.98 MB
- cbcollect_info_ns_1@10.2.3.31_20121101-153030/ns_server.couchdb.log 10.98 MB
- cbcollect_info_ns_1@10.2.3.31_20121101-153030/stats.log 2.51 MB
- cbcollect_info_ns_1@10.2.3.31_20121101-153030/ns_server.stats.log 10.98 MB
- cbcollect_info_ns_1@10.2.3.31_20121101-153030/ns_server.error.log 10.98 MB
- cbcollect_info_ns_1@10.2.3.31_20121101-153030/ns_server.views.log 10.98 MB
- cbcollect_info_ns_1@10.2.3.31_20121101-153030/ns_server.info.log 10.98 MB
- cbcollect_info_ns_1@10.2.3.31_20121101-153030/ns_server.xdcr_errors.log 10.98 MB
- cbcollect_info_ns_1@10.2.3.31_20121101-153030/ns_server.mapreduce_errors.log 10.98 MB
- cbcollect_info_ns_1@10.2.3.31_20121101-153030/ns_server.debug.log 10.98 MB
- cbcollect_info_ns_1@10.2.3.31_20121101-153030/memcached.log 1.43 MB
-
- 192.168.162.31-1112012-diag.zip
- 01/Nov/12 12:51 PM
- 6.53 MB
- Pavel Paulau
-
- cbcollect_info_ns_1@10.2.3.34_20121101-153502/couchbase.log 198 kB
- cbcollect_info_ns_1@10.2.3.34_20121101-153502/ns_server.xdcr.log 11.14 MB
- cbcollect_info_ns_1@10.2.3.34_20121101-153502/ns_server.couchdb.log 11.14 MB
- cbcollect_info_ns_1@10.2.3.34_20121101-153502/stats.log 2.53 MB
- cbcollect_info_ns_1@10.2.3.34_20121101-153502/ns_server.stats.log 11.14 MB
- cbcollect_info_ns_1@10.2.3.34_20121101-153502/ns_server.error.log 11.14 MB
- cbcollect_info_ns_1@10.2.3.34_20121101-153502/ns_server.views.log 11.14 MB
- cbcollect_info_ns_1@10.2.3.34_20121101-153502/ns_server.info.log 11.14 MB
- cbcollect_info_ns_1@10.2.3.34_20121101-153502/ns_server.xdcr_errors.log 11.14 MB
- cbcollect_info_ns_1@10.2.3.34_20121101-153502/ns_server.mapreduce_errors.log 11.14 MB
- cbcollect_info_ns_1@10.2.3.34_20121101-153502/ns_server.debug.log 11.14 MB
- cbcollect_info_ns_1@10.2.3.34_20121101-153502/memcached.log 1.53 MB
-
- 192.168.162.32-1112012-diag.zip
- 01/Nov/12 12:51 PM
- 4.57 MB
- Pavel Paulau
-
- cbcollect_info_ns_1@10.2.3.35_20121101-153246/couchbase.log 203 kB
- cbcollect_info_ns_1@10.2.3.35_20121101-153246/ns_server.xdcr.log 9.52 MB
- cbcollect_info_ns_1@10.2.3.35_20121101-153246/ns_server.couchdb.log 9.52 MB
- cbcollect_info_ns_1@10.2.3.35_20121101-153246/stats.log 2.53 MB
- cbcollect_info_ns_1@10.2.3.35_20121101-153246/ns_server.stats.log 9.52 MB
- cbcollect_info_ns_1@10.2.3.35_20121101-153246/ns_server.error.log 9.52 MB
- cbcollect_info_ns_1@10.2.3.35_20121101-153246/ns_server.views.log 9.52 MB
- cbcollect_info_ns_1@10.2.3.35_20121101-153246/ns_server.info.log 9.52 MB
- cbcollect_info_ns_1@10.2.3.35_20121101-153246/ns_server.xdcr_errors.log 9.52 MB
- cbcollect_info_ns_1@10.2.3.35_20121101-153246/ns_server.mapreduce_errors.log 9.52 MB
- cbcollect_info_ns_1@10.2.3.35_20121101-153246/ns_server.debug.log 9.52 MB
- cbcollect_info_ns_1@10.2.3.35_20121101-153246/memcached.log 1.21 MB
-
- 192.168.162.33-1112012-diag.zip
- 01/Nov/12 12:51 PM
- 5.99 MB
- Pavel Paulau
-
- cbcollect_info_ns_1@10.2.3.33_20121101-153716/couchbase.log 201 kB
- cbcollect_info_ns_1@10.2.3.33_20121101-153716/ns_server.xdcr.log 10.58 MB
- cbcollect_info_ns_1@10.2.3.33_20121101-153716/ns_server.couchdb.log 10.58 MB
- cbcollect_info_ns_1@10.2.3.33_20121101-153716/stats.log 2.51 MB
- cbcollect_info_ns_1@10.2.3.33_20121101-153716/ns_server.stats.log 10.58 MB
- cbcollect_info_ns_1@10.2.3.33_20121101-153716/ns_server.error.log 10.58 MB
- cbcollect_info_ns_1@10.2.3.33_20121101-153716/ns_server.views.log 10.58 MB
- cbcollect_info_ns_1@10.2.3.33_20121101-153716/ns_server.info.log 10.58 MB
- cbcollect_info_ns_1@10.2.3.33_20121101-153716/ns_server.xdcr_errors.log 10.58 MB
- cbcollect_info_ns_1@10.2.3.33_20121101-153716/ns_server.mapreduce_errors.log 10.58 MB
- cbcollect_info_ns_1@10.2.3.33_20121101-153716/ns_server.debug.log 10.58 MB
- cbcollect_info_ns_1@10.2.3.33_20121101-153716/memcached.log 1.29 MB
-
- xperf-mixed-1-uni-uni-2-nodes-ext.loop_2.0.0-1925-rel-enterprise_2.0.0-1925-rel-enterprise_DEST_Nov-01-2012_08-23-52.pdf
- 01/Nov/12 12:51 PM
- 3.08 MB
- Pavel Paulau
-
- xperf-mixed-1-uni-uni-2-nodes-ext.loop_2.0.0-1925-rel-enterprise_2.0.0-1925-rel-enterprise_SOURCE_Nov-01-2012_08-13-19.pdf
- 01/Nov/12 12:51 PM
- 2.98 MB
- Pavel Paulau
-
- xperf-mixed-uni-2-nodes-low-4.loop_2.0.1-170-rel-enterprise_2.0.1-170-rel-enterprise_DEST_Mar-12-2013_20-12-14.pdf
- 26/Mar/13 1:58 AM
- 6.25 MB
- Pavel Paulau
-
- xperf-mixed-uni-2-nodes-low-4.loop_2.0.1-170-rel-enterprise_2.0.1-170-rel-enterprise_SOURCE_Mar-12-2013_19-49-28.pdf
- 26/Mar/13 1:58 AM
- 6.09 MB
- Pavel Paulau
-
- 145_disk_size.png
- 62 kB
- 31/Jan/13 7:09 PM
-
- data_compaction_01.png
- 33 kB
- 01/Nov/12 12:51 PM
-
- data_compaction_02.png
- 33 kB
- 01/Nov/12 12:51 PM
-
- Screen Shot 2013-02-12 at 3.22.47 PM.png
- 64 kB
- 12/Feb/13 5:25 PM
Activity
Junyi Xie
added a comment -
Looks like an issue in the storage layer. Aaron, could you please look at why the compactor did not catch up?
Aaron Miller
added a comment -
Nothing in the logs seems to indicate compaction having issues. It was still running when the logs were captured. Compaction is pretty I/O heavy, so if there's much other I/O going on it can take a while. Chiyoung has done some experiments around compaction scheduling inside ep-engine that would probably help in this type of situation, but I don't think actually making that change is on our radar in the near term.
Dipti Borkar
added a comment -
Talked more with Pavel and Junyi about this. Please add guidance on where the bottleneck is if the system is not sized appropriately so that users can take the appropriate action.
Aaron Miller
added a comment - edited
The only resource compaction really competes for is I/O, and XDCR can be I/O heavy.
Unfortunately I think the only good way around these problems is for the things contending for I/O to be managed in such a way that they don't step on each other. That's why Chiyoung's change that I mentioned earlier helped in plain K/V situations, as it coordinated the compaction against regular K/V persistence. It ought to help in the XDCR case too, as most of the XDCR I/O is through that path.
The only guideline I can think of as-is: if compaction can't catch up with ongoing operations, do fewer of those operations.
Ketaki Gangal
added a comment - Since this is still an issue, moving this back to 2.0.1
Farshid Ghods
added a comment - Pavel,
are there any xperf results from 2.0.1?
Pavel Paulau
added a comment -
I have results for build 107 and the issue is still there:
http://qa.hq.northscale.net/job/eperf-graph-loop/1305/artifact/xperf-mixed-uni-2-nodes.loop_2.0.1-107-rel-enterprise_2.0.1-107-rel-enterprise_SOURCE_Dec-19-2012_17%3A54%3A34.pdf
http://qa.hq.northscale.net/job/eperf-graph-loop/1307/artifact/xperf-mixed-uni-2-nodes.loop_2.0.1-107-rel-enterprise_2.0.1-107-rel-enterprise_DEST_Dec-19-2012_18%3A11%3A38.pdf
But frankly speaking I don't remember that we tried to fix that.
Filipe Manana
added a comment -
As an aside, the retry phase of database compaction, still done in Erlang, can be made more efficient (and faster of course). Pretty much in the same way that view compaction retry phase was made much faster some time ago (it used to suffer the same issue several months ago).
Farshid Ghods
added a comment -
Per bug scrub:
Damien says there is a possibility that processes inside the Erlang VM are crashing.
Farshid Ghods
added a comment - edited
Per bug scrub:
Yaseen: rerun the test and ask whether this is a capacity issue. If so, we can defer this to the next release.
Pavel Paulau
added a comment -
As the issue description says, the same workload works pretty well in the case of a single bucket. That might be a helpful insight for the investigation.
I tried two extra workloads:
-- reduced ratio of write operations (50% -> 20%)
-- reduced total ops/sec (4K -> 2K)
It didn't help: disk size keeps growing while compaction doesn't catch up.
I'm trying a 1K ops/sec workload now.
Jin Lim
added a comment -
Aaron, it appears that "heavy I/O + limited resource capacity" might not be the culprit for the data compaction slowness. Please take a look at Pavel's findings so far and provide your insight. This is becoming a high priority issue. Please assign it back to Pavel after your input. Thanks!
Junyi Xie
added a comment -
Thanks Pavel. That is quite helpful. From a user's perspective, doubling the replication while tuning down the workload by >50% should not give very different performance results.
I am not sure where the resource limitation comes from. Here are two possible sources I have in mind:
1. Double the replicators on each node, due to the 2nd bucket, brings a lot more overhead in Erlang.
2. The overhead of scheduling compaction of two buckets is much more than twice that of compacting a single bucket.
Pavel, one way to isolate the problem is to reduce the # of replicators from 32 to 16, so that with two replications the # of replicators is still 32 per node, the same as with a single bucket. That will to some extent remove the impact of 1), and if compaction still cannot catch up, the culprit is probably 2).
Another way is to reduce the frequency of compactions, by doubling the compaction interval, to see if there is any difference.
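The replicator arithmetic behind this suggestion can be written out as a small sketch. It assumes the replicator count is a per-replication setting that applies on each node, as the comment implies; the function name is hypothetical and is not a Couchbase API.

```python
def replicators_per_node(replications, replicators_per_replication):
    """Total concurrent replicators on one node across all outgoing replications.

    Hypothetical helper for illustration only; not part of any Couchbase API.
    """
    return replications * replicators_per_replication

# Assumption: 32 replicators per replication, as used in the runs discussed here.
single_bucket = replicators_per_node(1, 32)        # baseline that compacts fine
two_buckets = replicators_per_node(2, 32)          # the failing setup
two_buckets_halved = replicators_per_node(2, 16)   # proposed experiment

print(single_bucket, two_buckets, two_buckets_halved)
```

If compaction still falls behind with the halved setting, the per-node replicator count alone is unlikely to explain the difference from the single-bucket case.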
Pavel Paulau
added a comment - edited
The most recent results with 1K ops/sec: disk data size grew from 11GB to 22GB in 6 hours, with a rather pessimistic outlook.
http://qa.hq.northscale.net/job/eperf-graph-loop/1396/artifact/xperf-mixed-uni-2-nodes-low-3.loop_2.0.1-145-rel-enterprise_2.0.1-145-rel-enterprise_SOURCE_Jan-30-2013_16%3A29%3A56.pdf
http://qa.hq.northscale.net/job/eperf-graph-loop/1397/artifact/xperf-mixed-uni-2-nodes-low-3.loop_2.0.1-145-rel-enterprise_2.0.1-145-rel-enterprise_DEST_Jan-30-2013_16%3A49%3A21.pdf
There are diags as well:
http://qa.hq.northscale.net/job/xperf-win/43/artifact/
Pavel Paulau
added a comment - and 30 GB after 12 hours.
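For reference, the figures reported for the 1K ops/sec run (11 GB to 22 GB in the first 6 hours, 30 GB after 12 hours) can be turned into a rough net growth rate. The sketch below is only arithmetic over those numbers; the function name is made up for illustration, and treating the 11 GB reading as hour 0 is an assumption.

```python
def net_growth_gb_per_hour(size_start_gb, size_end_gb, hours):
    """Average net on-disk growth: data written minus space reclaimed by compaction."""
    return (size_end_gb - size_start_gb) / hours

# (hour, disk data size in GB) as reported in this ticket.
# Assumption: the 11 GB reading corresponds to hour 0 of the run.
samples = [(0, 11.0), (6, 22.0), (12, 30.0)]

rates = [
    net_growth_gb_per_hour(s0, s1, t1 - t0)
    for (t0, s0), (t1, s1) in zip(samples, samples[1:])
]

for (t0, _), (t1, _), rate in zip(samples, samples[1:], rates):
    print(f"hours {t0}-{t1}: {rate:.2f} GB/hour net growth")
```

A positive rate in both intervals means compaction is reclaiming less space than the workload produces, which matches the observation that it never catches up.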
Aaron Miller
added a comment -
@Filipe I'm not familiar with the retry compaction thing. Does it look like this issue is more with retry compaction than initial compaction?
Aliaksey Artamonau
added a comment -
Just merged backport to 2.0.1 branch: http://review.couchbase.org/#/c/24391/
Pavel Paulau
added a comment - reproduced in 2.0.1-153.
Jin Lim
added a comment - edited
Pavel, Aaron is working on some optimizations for data compaction that target 2.0.2 and beyond. In the meantime, do you concur with trying Junyi's suggestion below?
At least we may want to try a smaller number of replicators (from 32 to 16). Please advise and assign it back to Jin or Aaron for tracking this with the coming optimizations.
=======================================================================================
Pavel, one way to isolate the problem is to reduce the # of replicators from 32 to 16, thus in the case of two replications,
# of replicators is still 32 per node which is the same as single bucket. That will to some extent remove the impact of 1),
and if the compaction still cannot catch up, probably the culprit is 2).
=======================================================================================
Pavel Paulau
added a comment -
Sorry, I missed that comment from Junyi. Ok, I will try 16 replicators.
I'd also recommend @ronnie to run KV test with 2 buckets on Windows w/o XDCR.
Ronnie Sun
added a comment - Compaction did kick in in the KV test case.
Pavel Paulau
added a comment -
@ronnie
Can you describe your workload and environment configuration?
Ronnie Sun
added a comment -
4 Windows VMs, 2 buckets. Active resident ratio: 70%.
Details:
https://github.com/couchbase/testrunner/blob/master/conf/perf/mixed-2suv-2buckets.conf
Pavel Paulau
added a comment -
Ok, Ronnie's workload is way more aggressive, so it's not just a "Windows 2 buckets" issue.
16 replicators didn't help. Now it makes sense to try a physical environment; that will be my next step.
Jin Lim
added a comment -
Thanks Ronnie and Pavel for your time and help on investigating this.
To summarize what we have found so far:
* Overhead from having multiple buckets doesn't seem to be the root cause
* The number of replicators doesn't seem to be the root cause either
* However, this seems to be a Windows + XDCR only issue - KV only doesn't manifest the same symptom
* Easy to reproduce, so users may run into this issue fairly easily on a Windows env
Based on the fact that it is a Windows + XDCR only issue (and per bug scrubs) we move it to the Windows to-do list for 2.0.2.
Dipti Borkar
added a comment -
We need to understand the behavior on physical hardware. Pavel to try to address this for this sprint.
Thuan Nguyen
added a comment -
Integrated in ui-testing #11 (See [http://qa.hq.northscale.net/job/ui-testing/11/])
MB-7074: add config with 16 replicators (Revision 530e1e63f1471bfe9f0994e20888e25aa5207802)
MB-7074: add terra-win-xdcr config (Revision 689d09c1ba25a648e1cda909a61a8f57fecb13ed)
Result = SUCCESS
pavelpaulau :
Files :
* conf/perf/xperf-mixed-uni-2-nodes-low-4-16.conf
pavelpaulau :
Files :
* resources/perf/terra-win-xdcr.ini
Jin Lim
added a comment -
From the 2.0.2 development kickoff meeting today:
* The NS SRV team has made some changes that should alleviate this issue in the latest 2.0.1.
* Before further investigation we should rerun the test with the latest 2.0.1 (or 2.0.2) and verify whether the symptom still persists.
* If so, please first assign this to the NS SRV team for their initial triage.
Pavel Paulau
added a comment -
1. The only physical machine with Windows that we have is occupied by the system tests team and Siri.
2. The problem still exists in the 2.0.1 release.
What kind of input does the ns_server team need for investigation?
Aleksey Kondratenko
added a comment -
Thanks for the update, Pavel, but in the fragmentation graph from Feb I'm seeing fragmentation staying below 30, versus staying above 50 in November.
Please state more specifically how you concluded that the problem still exists in 2.0.1. That data is all the ns_server team needs as of now.
Pavel Paulau
added a comment -
Sorry, attaching valid reports with reduced front-end workload.
Looks like fragment was attached by mistake.
Aleksey Kondratenko
added a comment -
Indeed. I looked at the graphs and it's clear that the issue was not fixed. My thinking was this: previously the autocompactor compacted _all_ vbuckets even if just a few of them actually needed attention, and XDCR was producing exactly that pattern (a few vbuckets fragmented, others largely untouched), so I was expecting our fix in the autocompactor to address it. Apparently this is not the case.
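To illustrate the idea described above (compacting only the vbuckets that actually need it rather than all of them), here is a minimal sketch. It is not ns_server's actual code: the function names, the sample sizes, and the 30% threshold are hypothetical, and fragmentation is assumed to be the share of the file not occupied by live data.

```python
def fragmentation_pct(file_size, data_size):
    """Percentage of the on-disk file that is not live data (assumed definition)."""
    if file_size <= 0:
        return 0.0
    return 100.0 * (file_size - data_size) / file_size

def vbuckets_needing_compaction(vbuckets, threshold_pct):
    """Return ids of vbuckets whose fragmentation meets the threshold.

    vbuckets maps a vbucket id to a (file_size, data_size) pair.
    """
    return [
        vb_id
        for vb_id, (file_size, data_size) in sorted(vbuckets.items())
        if fragmentation_pct(file_size, data_size) >= threshold_pct
    ]

# Hypothetical sizes: a few vbuckets heavily fragmented, others largely untouched.
vbuckets = {0: (100, 95), 1: (100, 40), 2: (100, 90), 3: (200, 80)}

selected = vbuckets_needing_compaction(vbuckets, threshold_pct=30.0)
print(selected)
```

With this kind of selection, the compactor skips the largely untouched vbuckets and spends its I/O only on the fragmented ones.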
Maria McDuff
added a comment -
Per bug scrub: Alk, are there any more known optimizations that can be made for 2.0.2? Please advise.
Aleksey Kondratenko
added a comment -
Damien and Aaron were working on some plausible-looking speedup of DB compaction. I'm not familiar with their results. From the ns_server side I have nothing in mind that would help this case.
Aleksey Kondratenko
added a comment -
This was originally assigned to Damien and I'm returning it to Damien.
Maria McDuff
added a comment -
Per bug scrub: not yet supported on Windows.
QE action item: verify this is not happening on Linux. If it is, update this bug.
Aleksey Kondratenko
added a comment -
Looks like somebody updated the wrong ticket. This bug has nothing to do with Windows or Linux.
Maria McDuff
added a comment -
Abhinav, are you seeing this compaction issue with XDCR in 2.0.2? You can close this bug if you are not seeing this issue. Thanks.