Details
- Type: Bug
- Status: Resolved
- Priority: Blocker
- Resolution: Fixed
- Affects Version/s: 2.0.1
- Fix Version/s: 2.0.1
- Component/s: performance
- Security Level: Public
- Labels: None
Description
According to data at: https://docs.google.com/spreadsheet/ccc?key=0AgLUessE73UXdDV1SXhUZjJ0b0RhU3gtdlUzZGloUFE#gid=0
we've regressed 2x in build 160. There were only very tiny changes since build 156, and 160 itself is no different at all compared to 159. http://builds.hq.northscale.net/latestbuilds/CHANGES_couchbase-server-2.0.1-160-rel.txt
I need somebody to answer a very simple question: have we regressed or not?
I'd like to CC folks: Filipe, Damien, Jin, Dipti, Farshid, Pavel, Yaseen, but JIRA doesn't allow me.
- reb-vperf-60M-in-1d.loop_2.0.1-160-rel-enterprise_2.0.1-160-rel-enterprise_APPOLO_Feb-18-2013_17-20-53.pdf (6.53 MB, 20/Feb/13 11:14 PM, Pavel Paulau)
- reb-vperf-60M-in-1d.loop_2.0.1-160-rel-enterprise_2.0.1-160-rel-enterprise_APPOLO_Feb-20-2013_14-00-36.pdf (4.29 MB, 20/Feb/13 11:14 PM, Pavel Paulau)
- 160_resident_ratio_comparison.png (164 kB, 20/Feb/13 11:14 PM)
- Screen Shot 2013-02-21 at 4.27.55 PM.png (93 kB, 21/Feb/13 6:32 PM)
Activity
Aleksey Kondratenko
added a comment -
Apparently everyone agrees that there is indeed a regression.
Jin Lim
added a comment -
Per the bug scrub, we will do the following to pinpoint which commit caused this regression:
1) Phil to provide manifest info for each build between 154 and 160 (especially ep-engine/ns_server/couchdb)
2) Jin to provide the list of all the commits that went in between 154 and 160 (see the comment below)
3-a) Ketaki to drive a rerun of the system test on build 160 to confirm there is a real regression, then do step (4) below
3-b) Ronnie to run the perf test on build 160 as well
5) Do bst: pick one of the builds between 154 and 160 (like build 157) and figure out whether the same regression exists in that build (repeat bst until zooming in on the culprit)
Jin Lim
added a comment -
NS SERV
MB-7743: always initialize ns_config_ets_dup data
MB-7731: Count moves from 'undefined' nodes in total_in_flight
MB-7706: Delete _replicator db when joining cluster
MB-7562: Send per node diag in binary format
Format master activity events separately
Process processes info one by one when formatting diag
Grab per node diags sequentially
EP ENGINE
MB-7597: adjust the changes for high CPU/correct the logic of updating bgfetcher global variable
MB-7543: Rev sequence number shouldn't be set to zero
MB-7532: Set the default value of a rev sequence num to 1
Couchdb
MB-7631: Interpret bp of 0 as empty body
MB-7702: Fix groups received from the compactor
Aleksey Kondratenko
added a comment -
The ep-engine entry in this manifest: http://builds.hq.northscale.net/latestbuilds/couchbase-server-enterprise_x86_64_2.0.1-154-rel.deb.manifest.xml points to a commit that my git checkout, at least, doesn't have at all!
Moreover, build 153 points to a commit that doesn't belong to any branch! http://i.imgur.com/yQwR41z.png
Did somebody do a non-fast-forward update of the 2.0.1 branch in ep-engine?
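Whether a given commit is still reachable from a branch can be checked mechanically. The following is a minimal sketch, assuming a local ep-engine clone; it relies on `git merge-base --is-ancestor`, which exits with status 0 when the first commit is an ancestor of the second and 1 when it is not. The function name `is_ancestor` and its `run` parameter are illustrative only and are not part of any Couchbase tooling.

```python
import subprocess


def is_ancestor(commit, branch, repo=".", run=subprocess.run):
    """Return True if `commit` is reachable from `branch`, False if not.

    Raises RuntimeError when git itself fails, for example when the
    commit is missing from the local checkout.
    `run` is injectable so the function can be exercised without git.
    """
    result = run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", commit, branch],
        capture_output=True,
    )
    if result.returncode == 0:
        return True   # commit is part of the branch history
    if result.returncode == 1:
        return False  # commit exists but is not on the branch
    raise RuntimeError(f"git failed with exit status {result.returncode}")
```

A call might look like `is_ancestor("35e248148abc810ab67715e86417191cd73e39dd", "couchbase/2.0.1", repo="path/to/ep-engine")`, where the remote branch name and the repository path are assumptions about the local setup. A commit that arrives through a merge of another branch still counts as an ancestor.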
Aleksey Kondratenko
added a comment -
And if my gitk doesn't lie to me, the following ex-2.0.1 changes are now lost in the 2.0.1 branch:
Author: Chiyoung Seo <chiyoung.seo@gmail.com> 2013-01-15 15:38:13
Committer: Chiyoung Seo <chiyoung.seo@gmail.com> 2013-01-15 17:00:29
Parent: d577556632ef440310290c944a1aac478e5aa4a8 (MB-7459 should count expired items during warmup)
Child: 7c145e7c01bbe284957abb58123f4a3a462b503a (MB-7567: Don't copy chk_end when merging checkpoints)
Child: 35e248148abc810ab67715e86417191cd73e39dd (MB-7543 Rev sequence number shouldn't be set to zero.)
Branches: remotes/couchbase/2.0.2, remotes/couchbase/master, remotes/for-review/2.0.2, remotes/for-review/master, remotes/m/master, remotes/membase/2.0.2, remotes/membase/master
Follows: 2.0.0, 2.0.0-couchbase
Precedes: BUILD-153
MB-7532 Set the default value of a rev sequence num to 1.
As the XDCR always expects a positive value for a revision
sequence number, its default value should be initialized to 1.
Change-Id: I6a281ad55d9c5d7e4ed93a6d4e33c609606f7f29
Reviewed-on: http://review.couchbase.org/23964
Reviewed-by: Michael Wiederhold <mike@couchbase.com>
Reviewed-by: Jin Lim <jin@couchbase.com>
Tested-by: Chiyoung Seo <chiyoung.seo@gmail.com>
and
Author: Chiyoung Seo <chiyoung.seo@gmail.com> 2013-01-16 16:14:46
Committer: Chiyoung Seo <chiyoung.seo@gmail.com> 2013-01-16 20:52:43
Tags: BUILD-153
Parent: 676eb2b9f49b60bdede73a6c1d5226cbe848b39b (MB-7532 Set the default value of a rev sequence num to 1.)
Child: d24ec8590a7713d225833603a631f6562460e397 (WIP on ep-work: 35e2481MB-7543 Rev sequence number shouldn't be set to zero.)
Child: 0c0e76253add3f2bbc35b48bc7b0ceec59d453c0 (index on ep-work: 35e2481MB-7543 Rev sequence number shouldn't be set to zero.)
Branch:
Follows: 2.0.0, 2.0.0-couchbase
Precedes:
MB-7543 Rev sequence number shouldn't be set to zero.
We should make sure that the revision sequence number shouldn't
be zero during warmup or when we receive tap_deletion,
tap_mutation, set_with_meta, and delete_with_meta.
Change-Id: I4316e0b688e2f29efc94533fab6fe61df0ea871a
Reviewed-on: http://review.couchbase.org/24004
Reviewed-by: Jin Lim <jin@couchbase.com>
Tested-by: Chiyoung Seo <chiyoung.seo@gmail.com>
Reviewed-on: http://review.couchbase.org/24007
Reviewed-by: Chiyoung Seo <chiyoung.seo@gmail.com>
Phil Labee
added a comment -
changes 160 - 159: NONE
changes 159 - 158:
couchdb
*******************
8e961880696e0398229e7a837344b73053beed84
MB-7631 Interpret bp of 0 as empty body
c6e715704e9fd7f2ae2e7899dc027727b50c5b8a
MB-7702 Fix groups received from the compactor
ns_server
*******************
828c9378fcdb2271ac1c3cefd54702411960dd0a
MB-7743: always initialize ns_config_ets_dup data
4da96735a947dcd22f72a1f700279621ca44de89
MB-7731 Count moves from 'undefined' nodes in total_in_flight.
changes 158 - 157:
ns_server
*******************
4da96735a947dcd22f72a1f700279621ca44de89
MB-7731 Count moves from 'undefined' nodes in total_in_flight.
d8aa12706c2edac309743a502606594a00dc9929
MB-7706 Delete _replicator db when joining cluster.
changes 157 - 156: NONE
changes 156 - 155:
couchdb
*******************
c6e715704e9fd7f2ae2e7899dc027727b50c5b8a
MB-7702 Fix groups received from the compactor
356ef20516464c5bd0c22dec643bd27fba618d18
MB-6895 Correct error processing for dev views
ep-engine
*******************
eee2e9564ef844cf8cc435911d9e80af0dece244
MB-7597: adjust the changes for high CPU usage.
f6b583f3760cc1e7df85b5bf3abbc2e016a270fc
MB-7597: correct the logic of updating bgfetcher global variable
change 155 - 154:
ep-engine -- ??? (commit in manifest has been removed from history)
changes 154 - 153:
ns_server
*******************
d8aa12706c2edac309743a502606594a00dc9929
MB-7706 Delete _replicator db when joining cluster.
33611bcf7540ac2299694414b48299a9157db042
MB-7562 Send per node diag in binary format.
5d1a097b4dab37c836103b4f296e3d31ee820979
MB-7562 Format master activity events separately.
4d9a35a55b0ef3fac6860835270e91c022904cb3
MB-7562 Process processes info one by one when formatting diag.
45000dfe7384bc21ebdfd9f4ccef1c3891f440bb
MB-7562 Grab per node diags sequentially.
b30d67bfb6c231d04b24fe5b2d4a8baa9fccb457
MB-7676: prioritize view compactions heavily when we wait for it
Ronnie Sun
added a comment -
From meeting:
This is a third copy of the reb-view issues: MB-6726, MB-7785.
Assign to Pavel and Ketaki for now.
Thanks,
Ronnie
Aleksey Kondratenko
added a comment - edited
OK. No ep-engine commit was lost. The two commits I referred to above are on 2.0.1 via the merge of branch 2.0.0.
And I've found the ep-engine commit of build 154: https://github.com/couchbase/ep-engine/commit/b1a88bc09eb0d40db7e7495eaeaa1538a1c24d6e
It was cherry-picked too.
Of course, this doesn't help in understanding what caused the 2x slowdown.
Pavel Paulau
added a comment -
I wouldn't say this is a "160" problem, but there is in fact a weird variation. At least take a look at the attached resident ratios.
We also used to notice spikes in the number of connections. According to Alk this is expected behavior, but in the slow run there is a period with no spikes for some reason.
Pavel Paulau
added a comment -
Just noticed that the issue is assigned to me. What kind of input do you expect?
Jin Lim
added a comment -
Pavel, please see the steps below. Since the 2x slowness was detected during the rebalance-with-views system test on build 160, can you please rerun any of your view tests on build 160 as well? We would like to make sure that you see the same slowness with your tests. If you have already done so (or verified the slowness), you don't need to do anything. Just assign back to me. Thanks!
Per the bug scrub, we will do the following to pinpoint which commit caused this regression:
1) Phil to provide manifest info for each build between 154 and 160 (especially ep-engine/ns_server/couchdb)
2) Jin to provide the list of all the commits that went in between 154 and 160 (see the comment below)
3-a) Ketaki to drive a rerun of the system test on build 160 to confirm there is a real regression, then do step (4) below
3-b) Pavel to run the perf test on build 160 as well
5) Do a binary search: pick one of the builds between 154 and 160 (like build 157), run the tests, and figure out whether the same regression exists in that build (repeat the binary search until finally zooming in on the culprit)
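The binary search over builds in step 5 could be sketched as follows. This is only an illustration under stated assumptions: the build list is ordered oldest to newest, the newest build is known to be slow, and once the regression appears it stays in every later build. The function `first_bad_build` and the `is_bad` callback are hypothetical names; in practice `is_bad` would install the build and run the performance test, probably several times if the timings are noisy.

```python
def first_bad_build(builds, is_bad):
    """Return the earliest build for which is_bad(build) is True.

    Assumes `builds` is sorted oldest to newest, the last build is bad,
    and every build after the first bad one is bad as well.
    """
    lo, hi = 0, len(builds) - 1  # invariant: builds[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(builds[mid]):
            hi = mid        # the culprit is builds[mid] or earlier
        else:
            lo = mid + 1    # the culprit is after builds[mid]
    return builds[lo]


# Made-up example: pretend the slowdown first shows up in build 158.
candidates = [156, 157, 158, 159, 160]
print(first_bad_build(candidates, lambda build: build >= 158))  # prints 158
```

With five candidate builds this needs at most three test runs instead of five, which matters when each run takes hours.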
Pavel Paulau
added a comment - edited
Jin,
the 2x slowness comes from my performance test. Rebalance performance is very unstable; I have observed 20%-100% slowness since build 156.
Builds 154-155 were broken because of an ep-engine issue. Build 153 demonstrates more or less stable results.
Jin Lim
added a comment -
Thanks Pavel!
Ketaki - as Pavel pointed out, we can skip builds 154 - 155 because of the ep-engine issue. Let's pick one build between 156 and 160 (like 158) and do step (5) above. I am not sure who can help out with this, but I believe that you and Farshid can figure it out :)
Phil Labee
added a comment -
changes to ep-engine since build 2.0.1-155 (occurred in -156, code same since then):
------------------------------------------------
commit eee2e9564ef844cf8cc435911d9e80af0dece244
Author: xiaoqin <maxiaoqin2005@gmail.com>
Date: Tue Feb 12 15:27:17 2013 -0800
MB-7597: adjust the changes for high CPU usage.
Change-Id: Ib5c55a13ad5c99d6c920c9617f67de1899a5474f
Reviewed-on: http://review.couchbase.org/24565
Tested-by: xiaoqin ma <maxiaoqin2005@gmail.com>
Reviewed-by: Jin Lim <jin@couchbase.com>
src/bgfetcher.cc
------------------------------------------------
------------------------------------------------
Phil Labee
added a comment -
changes to ns_server since build 155:
changes in ns_server build 159:
------------------------------------------------
commit 828c9378fcdb2271ac1c3cefd54702411960dd0a
Author: Aliaksey Kandratsenka <alk@tut.by>
Date: Thu Feb 14 11:48:46 2013 -0800
MB-7743: always initialize ns_config_ets_dup data
Previously it was working because of ns_config:reannounce as part of
ns_config_rep initialization. But there's clear race between
initialization of ns_config_ets_dup and continuation of startup which
does ns_config_rep initialization.
So we're asking for explicit config reannounce to populate ETS table
now.
Change-Id: I798a1bf0e818f876a7e1c9833e161724249db257
Reviewed-on: http://review.couchbase.org/24606
Tested-by: Aliaksey Kandratsenka <alkondratenko@gmail.com>
Reviewed-by: Aliaksey Artamonau <aliaksiej.artamonau@gmail.com>
src/ns_config_ets_dup.erl
------------------------------------------------
------------------------------------------------
changes in ns_server build 158:
------------------------------------------------
commit 4da96735a947dcd22f72a1f700279621ca44de89
Author: Aliaksey Artamonau <aliaksiej.artamonau@gmail.com>
Date: Tue Feb 12 15:13:41 2013 -0800
MB-7731 Count moves from 'undefined' nodes in total_in_flight.
If moves schedule contains only moves from 'undefined' nodes, we'll
immediately decide that rebalance is done. This is because 'undefined'
moves are not counted in total_in_flight. We don't expect that in
ns_vbucket_mover:on_backfill_done which causes badmatch error. Which
in turn causes rebalance to fail.
Change-Id: I5b1a31b5f51b99bf5ea2e0bf4a2a93ca0a421bc2
Reviewed-on: http://review.couchbase.org/24569
Reviewed-by: Aliaksey Kandratsenka <alkondratenko@gmail.com>
Tested-by: Aliaksey Artamonau <aliaksiej.artamonau@gmail.com>
Tested-by: Aliaksey Kandratsenka <alkondratenko@gmail.com>
Reviewed-by: Jin Lim <jin@couchbase.com>
src/vbucket_move_scheduler.erl
src/vbucket_move_scheduler_validation.erl
------------------------------------------------
------------------------------------------------
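The failure mode described in the MB-7731 commit message above can be illustrated with a small model. This is a deliberately simplified Python sketch, not the actual Erlang code in `vbucket_move_scheduler.erl`: moves are modelled as `(source, destination)` pairs, `None` stands in for the `'undefined'` atom, and both function names and the `count_undefined_sources` flag are hypothetical.

```python
def total_in_flight(moves, count_undefined_sources):
    """Count the moves that the scheduler treats as in flight.

    Each move is a (source_node, destination_node) pair; a source of
    None stands in for Erlang's 'undefined'.
    """
    return sum(
        1
        for source, _destination in moves
        if source is not None or count_undefined_sources
    )


def rebalance_looks_done(moves, count_undefined_sources):
    """In this model, rebalance is considered done when nothing is in flight."""
    return total_in_flight(moves, count_undefined_sources) == 0


# A schedule that contains only moves from 'undefined' nodes.
schedule = [(None, "node_a"), (None, "node_b")]
print(rebalance_looks_done(schedule, count_undefined_sources=False))  # True
print(rebalance_looks_done(schedule, count_undefined_sources=True))   # False
```

With the flag off, the model declares the rebalance finished while two moves are still pending, which mirrors the premature "done" decision the commit message describes; counting the `'undefined'` moves removes it.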
changes in ns_server build 159:
------------------------------------------------
commit 828c9378fcdb2271ac1c3cefd54702411960dd0a
Author: Aliaksey Kandratsenka <alk@tut.by>
Date: Thu Feb 14 11:48:46 2013 -0800
Previously it was working because of ns_config:reannounce as part of
ns_config_rep initialization. But there's clear race between
initialization of ns_config_ets_dup and continuation of startup which
does ns_config_rep initialization.
So we're asking for explicit config reannounce to populate ETS table
now.
Change-Id: I798a1bf0e818f876a7e1c9833e161724249db257
Reviewed-on: http://review.couchbase.org/24606
Tested-by: Aliaksey Kandratsenka <alkondratenko@gmail.com>
Reviewed-by: Aliaksey Artamonau <aliaksiej.artamonau@gmail.com>
src/ns_config_ets_dup.erl
------------------------------------------------
------------------------------------------------
changes in ns_server build 158:
------------------------------------------------
commit 4da96735a947dcd22f72a1f700279621ca44de89
Author: Aliaksey Artamonau <aliaksiej.artamonau@gmail.com>
Date: Tue Feb 12 15:13:41 2013 -0800
If moves schedule contains only moves from 'undefined' nodes, we'll
immediately decide that rebalance is done. This is because 'undefined'
moves are not counted in total_in_flight. We don't expect that in
ns_vbucket_mover:on_backfill_done which causes badmatch error. Which
in turn causes rebalance to fail.
Change-Id: I5b1a31b5f51b99bf5ea2e0bf4a2a93ca0a421bc2
Reviewed-on: http://review.couchbase.org/24569
Reviewed-by: Aliaksey Kandratsenka <alkondratenko@gmail.com>
Tested-by: Aliaksey Artamonau <aliaksiej.artamonau@gmail.com>
Tested-by: Aliaksey Kandratsenka <alkondratenko@gmail.com>
Reviewed-by: Jin Lim <jin@couchbase.com>
src/vbucket_move_scheduler.erl
src/vbucket_move_scheduler_validation.erl
------------------------------------------------
------------------------------------------------
Show
Phil Labee
added a comment - changes to ns_server since build 155:
changes in ns_server build 159:
------------------------------------------------
commit 828c9378fcdb2271ac1c3cefd54702411960dd0a
Author: Aliaksey Kandratsenka < alk@tut.by >
Date: Thu Feb 14 11:48:46 2013 -0800
MB-7743 : always initialize ns_config_ets_dup data
Previously it was working because of ns_config:reannounce as part of
ns_config_rep initialization. But there's clear race between
initialization of ns_config_ets_dup and continuation of startup which
does ns_config_rep initialization.
So we're asking for explicit config reannounce to populate ETS table
now.
Change-Id: I798a1bf0e818f876a7e1c9833e161724249db257
Reviewed-on: http://review.couchbase.org/24606
Tested-by: Aliaksey Kandratsenka < alkondratenko@gmail.com >
Reviewed-by: Aliaksey Artamonau < aliaksiej.artamonau@gmail.com >
src/ns_config_ets_dup.erl
------------------------------------------------
------------------------------------------------
changes in ns_server build 158:
------------------------------------------------
commit 4da96735a947dcd22f72a1f700279621ca44de89
Author: Aliaksey Artamonau < aliaksiej.artamonau@gmail.com >
Date: Tue Feb 12 15:13:41 2013 -0800
MB-7731 Count moves from 'undefined' nodes in total_in_flight.
If moves schedule contains only moves from 'undefined' nodes, we'll
immediately decide that rebalance is done. This is because 'undefined'
moves are not counted in total_in_flight. We don't expect that in
ns_vbucket_mover:on_backfill_done which causes badmatch error. Which
in turn causes rebalance to fail.
Change-Id: I5b1a31b5f51b99bf5ea2e0bf4a2a93ca0a421bc2
Reviewed-on: http://review.couchbase.org/24569
Reviewed-by: Aliaksey Kandratsenka < alkondratenko@gmail.com >
Tested-by: Aliaksey Artamonau < aliaksiej.artamonau@gmail.com >
Tested-by: Aliaksey Kandratsenka < alkondratenko@gmail.com >
Reviewed-by: Jin Lim < jin@couchbase.com >
src/vbucket_move_scheduler.erl
src/vbucket_move_scheduler_validation.erl
------------------------------------------------
------------------------------------------------
Phil Labee
added a comment -
changes in couchdb build 159:
------------------------------------------------
commit 8e961880696e0398229e7a837344b73053beed84
Author: Aaron Miller <apage43@ninjawhale.com>
Date: Wed Feb 13 15:03:45 2013 -0800
MB-7631 Interpret bp of 0 as empty body
By convention, in Couchstore/.couch files a body pointer of zero
indicates that the item has no body/has an empty body.
This commit makes CouchDB also have this behavior (otherwise, it will
attempt to read data from position 0 in the file when encountering a
deleted item)
Change-Id: If4229b68a2b6fc79484535619b7acfbc5055ca0e
Reviewed-on: http://review.couchbase.org/24586
Reviewed-by: Junyi Xie <junyi.couchbase@gmail.com>
Reviewed-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-by: Jin Lim <jin@couchbase.com>
Tested-by: Aaron Miller <apage43@ninjawhale.com>
Tested-by: Filipe David Borba Manana <fdmanana@gmail.com>
Reviewed-on: http://review.couchbase.org/24596
Reviewed-by: Aaron Miller <apage43@ninjawhale.com>
src/couchdb/couch_db.erl
------------------------------------------------
------------------------------------------------
changes in couchdb build 156:
------------------------------------------------
commit c6e715704e9fd7f2ae2e7899dc027727b50c5b8a
Author: Filipe David Borba Manana <fdmanana@apache.org>
Date: Tue Feb 12 06:12:39 2013 +0000
MB-7702 Fix groups received from the compactor
Due to the interaction with the updater, the compactor
might receive a group snapshot from the updater that
doesn't reflect the state of the snapshot in the parent
view group process. Therefore, in the view group process,
before accepting a group from the compactor, fix that
group so that it reflects the latest state, just like what
is done when the view group process receives a group
snapshot from the updater.
This could lead to a permanent bad index state (missing
passive partitions, or partitions marked as passive when
they should be active and vice-versa). This was a regression
introduced by MB-7522:
"Optimize very common index state transitions (rebalance)"
Change-Id: I2637d0fda838e803b5bcd474a86fb02c16cfe992
Reviewed-on: http://review.couchbase.org/24534
Reviewed-by: Filipe David Borba Manana <fdmanana@gmail.com>
Tested-by: Filipe David Borba Manana <fdmanana@gmail.com>
src/couch_set_view/src/couch_set_view_group.erl
src/couch_set_view/src/couch_set_view_updater.erl
------------------------------------------------
------------------------------------------------
Farshid Ghods
added a comment -
Alk, Filipe, Jin,
We looked at the changes that have gone in after 153, and there is one change that is related to rebalancing:
commit c6e715704e9fd7f2ae2e7899dc027727b50c5b8a
Author: Filipe David Borba Manana <fdmanana@apache.org>
Date: Tue Feb 12 06:12:39 2013 +0000
MB-7702 Fix groups received from the compactor
We will need to check with Pavel whether there are perf test results from a build prior to 155.
Farshid Ghods
added a comment -
Pavel,
Are there results from build 156?
Can you kick off a run against this build?
Ronnie Sun
added a comment -
Hi Jin, Pavel
Farshid and I are looking at the results. Looks like we have a regression since build 156. (see screenshot).
Please assign it to Filipe if it's confirmed.
Thanks,
Ronnie
Farshid Ghods
added a comment -
Ronnie/Pavel,
To confirm that the regression is from this change, you can take build 156, replace couch_set_view_group.beam and couch_set_view_updater.beam with the copies from build 153, and rerun the test.
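The swap described above could be scripted roughly as follows. This is only a sketch: the function name `swap_beams` and the two install roots are hypothetical, and the ticket does not say where the .beam files live, so the code searches each tree for them and refuses to act unless it finds exactly one match.

```python
import shutil
from pathlib import Path

# The two modules named in the comment above.
BEAMS = ("couch_set_view_group.beam", "couch_set_view_updater.beam")


def swap_beams(donor_root, target_root, names=BEAMS):
    """Copy each named .beam file from donor_root over its counterpart under target_root.

    donor_root / target_root are hypothetical install trees (e.g. unpacked
    build 153 and build 156). The file being replaced is first saved next to
    itself with an ".orig" suffix. Returns the list of replaced paths.
    """
    donor_root, target_root = Path(donor_root), Path(target_root)
    replaced = []
    for name in names:
        donors = sorted(donor_root.rglob(name))
        targets = sorted(target_root.rglob(name))
        if len(donors) != 1 or len(targets) != 1:
            raise FileNotFoundError(f"expected exactly one {name} in each tree")
        # Keep a backup so the original build can be restored after the run.
        shutil.copy2(targets[0], targets[0].with_name(name + ".orig"))
        shutil.copy2(donors[0], targets[0])
        replaced.append(targets[0])
    return replaced
```

Running it a second time would overwrite the ".orig" backups with the already swapped files, so restore the originals before repeating the experiment. The server would also need a restart to load the replaced modules.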
Pavel Paulau
added a comment -
I tried couch_set_view_group.beam from build 153 in build 156: the same 1.2x regression.
The problem is somewhere else.
Abhinav Dangeti
added a comment -
Agree with Pavel.
So here are the rebalance times for rebalancing in one node:
Case 1: 2.0.1-160-rel
Rebalance time: 7.5 hours
21:57:12 - Thu Feb 21, 2013: Starting rebalance, KeepNodes = ['ns_1@10.6.2.37','ns_1@10.6.2.38',
'ns_1@10.6.2.39','ns_1@10.6.2.40',
'ns_1@10.6.2.42','ns_1@10.6.2.43',
'ns_1@10.6.2.44','ns_1@10.6.2.45'], EjectNodes = []
05:26:40 - Fri Feb 22, 2013: Rebalance completed successfully.
Case 2: Copy couch_set_view_group.beam and couch_set_view_updater.beam from 2.0.1-153-rel.
Rebalance time: 10.5 hours
17:31:49 - Fri Feb 22, 2013: Starting rebalance, KeepNodes = ['ns_1@10.6.2.37','ns_1@10.6.2.38',
'ns_1@10.6.2.39','ns_1@10.6.2.40',
'ns_1@10.6.2.42','ns_1@10.6.2.43',
'ns_1@10.6.2.44','ns_1@10.6.2.45'], EjectNodes = []
03:57:27 - Sat Feb 23, 2013: Rebalance completed successfully.
Note: the second rebalance was with a slightly greater number of items.
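The rounded durations above can be checked against the log timestamps. A minimal sketch, assuming the timestamps are in a single time zone and that the default C/English locale is in effect for the day and month names; `rebalance_hours` is a helper name introduced here for illustration.

```python
from datetime import datetime

# Format of the "Starting rebalance" / "Rebalance completed" log lines above.
FMT = "%H:%M:%S - %a %b %d, %Y"


def rebalance_hours(start, end):
    """Return the elapsed hours between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 3600


case1 = rebalance_hours("21:57:12 - Thu Feb 21, 2013", "05:26:40 - Fri Feb 22, 2013")
case2 = rebalance_hours("17:31:49 - Fri Feb 22, 2013", "03:57:27 - Sat Feb 23, 2013")
print(f"case 1: {case1:.2f} h, case 2: {case2:.2f} h, ratio: {case2 / case1:.2f}x")
# prints "case 1: 7.49 h, case 2: 10.43 h, ratio: 1.39x"
```

The two runs did not use the same number of items, so the ratio is only indicative.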
Farshid Ghods
added a comment -
Jin,
Phil has posted the changes between 153 and 155/6, and we ruled out the couchdb change by running additional tests.
There is a set of ep-engine changes that were merged, but the commit messages look harmless. Worth a second look.
Pavel,
Did we run the same test multiple times against 153 to make sure the 1.2x performance regression we see is not just noise?
Pavel Paulau
added a comment -
Adding my comments regarding build 153 from the email thread:
"I have results for 4 runs, variation is +/- 3.5%. All results after build 153 are at least 13% slower. Please use litmus dashboard for reference."
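One way to read the figures quoted in the comment above (a variation of +/- 3.5% and a slowdown of at least 13%) is to ask how far apart two runs could be from noise alone. The sketch below assumes the new builds vary by the same +/- 3.5% as build 153, which the ticket does not state; both function names are introduced here for illustration.

```python
def worst_case_noise_gap(variation_pct):
    """Largest slowdown (in percent) that noise alone could produce.

    Worst case: the baseline run lands at -variation and the new run at
    +variation, assuming both builds share the same run-to-run variation.
    """
    v = variation_pct / 100.0
    return ((1 + v) / (1 - v) - 1) * 100.0


def exceeds_noise(slowdown_pct, variation_pct):
    """True when the observed slowdown is larger than the worst-case noise gap."""
    return slowdown_pct > worst_case_noise_gap(variation_pct)


# Figures quoted in the comment above: +/- 3.5% variation, at least 13% slower.
print(round(worst_case_noise_gap(3.5), 2), exceeds_noise(13.0, 3.5))
```

With only four baseline runs the +/- 3.5% figure is itself an estimate, so this is a sanity check rather than a statistical test.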
Pavel Paulau
added a comment -
Just for tracking:
Regression seems to be fixed in build 167. After a couple of verifying runs I will close the issue.
Filipe Manana
added a comment -
Just for reference, can someone explain exactly what was fixed in build 167 and what caused the regression?
I think every detail should be added here, rather than keeping important information only in internal e-mails (harder to track and not open to the public).
Aleksey Kondratenko
added a comment -
Your point is correct Filipe.
I don't know if they identified the root cause, but they've found the following ep-engine commits to be the cause: f6b583f3760cc1e7df85b5bf3abbc2e016a270fc and f6b583f3760cc1e7df85b5bf3abbc2e016a270fc.
On the other hand, 167 has this ep-engine commit, 4180bd2a09be9a2c62c22906afcfe119c64c5c93, which is just before the first suspect commit.
_But_ this also removes the following commits:
Author: Mike Wiederhold <mike@couchbase.com> 2013-02-18 16:39:49
Committer: Michael Wiederhold <mike@couchbase.com> 2013-02-18 22:54:26
Parent: eee2e9564ef844cf8cc435911d9e80af0dece244 (MB-7597: adjust the changes for high CPU usage.)
Child: b1308a500e9610bb15379de57247b5e53104e1ba (MB-7509 Removed item should be removed from keyIndex too.)
Branches: remotes/couchbase/2.0.1, remotes/for-review/2.0.1, remotes/membase/2.0.1
Follows: BUILD-160
Precedes:
MB-7509: Skip merging checkpoint end items in chk queues
Change-Id: Idd73a04fc0ff8cce774ee956c006b879bfb1fec5
Reviewed-on: http://review.couchbase.org/24675
Reviewed-by: Jin Lim <jin@couchbase.com>
Tested-by: Michael Wiederhold <mike@couchbase.com>
and
Author: xiaoqin <maxiaoqin2005@gmail.com> 2013-02-19 17:51:34
Committer: xiaoqin ma <maxiaoqin2005@gmail.com> 2013-02-20 11:50:37
Parent: b93e048a4321e9891481693ba8139826e9c97314 (MB-7509: Skip merging checkpoint end items in chk queues)
Child: 62717fadb0f59cf50396908845625a4142b4a44f (MB-7797: Lower bgfetch sleep time for 60s to 1s)
Branch: remotes/couchbase/2.0.1
Follows: BUILD-160
Precedes:
MB-7509 Removed item should be removed from keyIndex too.
Change-Id: I842be6f77606e99ca7b47a451fd5f03e4b95feb6
Reviewed-on: http://review.couchbase.org/24730
Reviewed-by: Jin Lim <jin@couchbase.com>
Reviewed-by: Michael Wiederhold <mike@couchbase.com>
Tested-by: xiaoqin ma <maxiaoqin2005@gmail.com>
Jin Lim
added a comment -
Thanks Filipe. We will make sure that we always leave essential information so everyone is in sync. Basically, Alk is right that in build 167 we reverted the ep-engine changes between 154 and 160. The EP Engine team is still looking into why one of these commits (most likely MB-7597: correct the logic of updating bgfetcher global variable) caused the regression.
Pavel Paulau
added a comment -
Just in case, 168 took 8.8h vs 5.5h in [148-153, 167]. Only one run so far.
Once reproduced/verified I will create a new bug.
Farshid Ghods
added a comment -
Phil,
Please post the manifest diff between builds 167 and 168, and then reassign the ticket back to Pavel.
Xiaoqin Ma
added a comment -
Are there any stats/logs available after the tests are done?
Without the MB-7597 fix, it is more likely to run into cases where the BGfetcher is always running without any sleep at all. In some ways this can boost the fetcher's performance, but at the cost of sometimes spending CPU cycles on doing no work. But I need to look at the stats to see what is really going on. Also, grep the CPU usage after the test is done, if possible.
Xiaoqin Ma
added a comment - edited
@Pavel and Ronnie
Can you give me a pointer to one of the tests where you observed the performance regression, with stats/logs?
Also, do you see other performance regressions besides the rebalance one, especially ones with more bg fetch jobs?
Or do you have any performance test just for the bg fetcher, or something close?
Thanks!
Phil Labee
added a comment -
Changes between builds 2.0.1-167 and 2.0.1-168
couchdb -- No changes: both builds use revision 8e961880696e0398229e7a837344b73053beed84
couchdbx-app --- No changes: both builds use revision 52d20848312254148c889de517b39199765d8e22
couchstore -- No changes: both builds use revision 146c965a5aa5da21901209122bde60a8f3a52171
ep-engine -- No changes: both builds use revision 4180bd2a09be9a2c62c22906afcfe119c64c5c93
geocouch -- No changes: both builds use revision 5f5706d8a0db214aaa942ca496661938eaf2385e
tlm changes -- No changes: both builds use revision 12abea946eafd7411273d18a10ae1f84390db3d4
ns_server
--------------------------------------------------
commit: 56087e5137cd15774211f8438486aae3172775ef
Author: Aliaksey Artamonau < aliaksiej.artamonau@gmail.com >
Date: Tue Feb 26 22:27:10 2013 +0000
MB-7799 Use precomputed value in local_bucket_disk_usage.
Pre-2.0 nodes do rpc:multicall to
ns_storage_conf:local_bucket_disk_usage whenever someone request
bucket info. So if this is done often our local file_server_2 might
get overloaded by the requests. To prevent this we just use the value
from couch_stats_reader's ets table which gets updated periodically.
Change-Id: I372dd41895d487da78ae5133b255162a5577fac0
Reviewed-on: http://review.couchbase.org/24882
Tested-by: Aliaksey Artamonau <aliaksiej.artamonau@gmail.com>
Reviewed-by: Aliaksey Kandratsenka <alkondratenko@gmail.com>
M src/ns_storage_conf.erl
--------------------------------------------------
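A per-component comparison like the one above can be produced mechanically from two build manifests. The sketch below is not the build team's tooling: `diff_manifests` is a name introduced here, the manifests are plain component-to-revision dicts, and since the ticket does not list the ns_server revision used by build 167, a placeholder string stands in for it. Treating the listed ns_server commit as build 168's revision is also an assumption.

```python
def diff_manifests(old, new):
    """Return {component: (old_rev, new_rev)} for components whose revision differs."""
    changed = {}
    for component in sorted(set(old) | set(new)):
        if old.get(component) != new.get(component):
            changed[component] = (old.get(component), new.get(component))
    return changed


# Revisions reported as identical in builds 167 and 168 (from the comment above).
unchanged = {
    "couchdb": "8e961880696e0398229e7a837344b73053beed84",
    "couchdbx-app": "52d20848312254148c889de517b39199765d8e22",
    "couchstore": "146c965a5aa5da21901209122bde60a8f3a52171",
    "ep-engine": "4180bd2a09be9a2c62c22906afcfe119c64c5c93",
    "geocouch": "5f5706d8a0db214aaa942ca496661938eaf2385e",
    "tlm": "12abea946eafd7411273d18a10ae1f84390db3d4",
}
# Placeholder: the ns_server revision of build 167 is not given in this ticket.
build_167 = dict(unchanged, ns_server="<not listed in this ticket>")
# Assumption: the ns_server commit listed above is the revision used by build 168.
build_168 = dict(unchanged, ns_server="56087e5137cd15774211f8438486aae3172775ef")

print(sorted(diff_manifests(build_167, build_168)))
```

Only ns_server shows up as changed, which matches the summary posted in the comment.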
Aliaksey Artamonau
added a comment -
I find it hard to believe that my change could cause the regression. This code is exercised only when some of the cluster nodes are running a pre-2.0 version.
Aleksey Kondratenko
added a comment -
And I've shared the same opinion with Jin, Farshid, Steve, Ronnie and everyone else.
So far we're waiting for the results of Pavel's second run, which will either confirm there's some issue or indicate that it's some "noise". Of course, 2x of noise is something to investigate.
So as soon as we have any new data, I'd like Pavel or somebody else from the perf team to give us diags and master events from good and bad runs so that we can see what may cause that and _where_ this difference occurs.
Pavel Paulau
added a comment - edited
The re-run took the normal 5.5h, but this is not the first time we have observed 2x noise:
http://dashboard.hq.couchbase.com/litmus/dashboard/?testcase=reb-vperf-60M-in-1d
MB-7838 is a separate bug for investigation, with all required information for Alk and the other guys.
Pavel Paulau
added a comment -
@Xiaoqin
This is the test spec:
https://raw.github.com/couchbase/testrunner/master/conf/perf/reb-vperf-60M-in-1d.conf
Thuan Nguyen
added a comment -
Integrated in github-ep-engine-2-0 #481 (See [http://qa.hq.northscale.net/job/github-ep-engine-2-0/481/])
MB-7797: Lower bgfetch sleep time for 60s to 1s (Revision 62717fadb0f59cf50396908845625a4142b4a44f)
Revert "MB-7797: Lower bgfetch sleep time for 60s to 1s" (Revision 1b81128a14d25bb62ed63ffc14618c3e3a9374c1)
Result = SUCCESS
Mike Wiederhold :
Files :
* src/bgfetcher.cc
Mike Wiederhold :
Files :
* src/bgfetcher.cc