Details
Type: Improvement
Status: Reopened
Priority: Critical
Resolution: Unresolved
Affects Version/s: 2.0
Fix Version/s: .major-release
Component/s: cross-datacenter-replication, UI
Security Level: Public
Labels: None
Description
After seeing XDCR in action, I would like to propose a few enhancements:
-Put certain statistics on the XDCR screen as well as on the graph page:
  -Percentage complete/caught up. While a replication is backfilling, this would show the number of items already sent to the remote side out of the total in the bucket. Once the replication is running, it would show whether there is a significant backlog in the queue.
  -Items per second, to see the speed of each stream and of all streams in total.
  -Bandwidth in use. According to a customer, the most important thing with XDCR is going to be the (possibly cross-country) internet bandwidth, and they will need to monitor it for each replication stream and in total.
-On the outgoing graph page, I would recommend removing "mutations checked", "mutations replicated", "data replication", "active vb reps", "waiting vb reps", "secs in replicating", "secs in checkpointing", "checkpoints issued" and "checkpoints failed". These stats really aren't useful from the perspective of someone trying to monitor or troubleshoot the current state of their cluster.
-On the outbound graph page, there is some confusion over the difference between "mutations to replicate", "mutations in queue" and "queue size". Unless they show significantly (and usefully) different metrics, I recommend removing all but one.
-On the incoming graph page, I recommend putting "total ops/sec" on the far left to line up with the "ops/sec" in the summary section.
-"XDCR dest ops per sec" is confusing because this cluster is the "destination", yet the stat name implies the other way around. I recommend "Incoming XDCR ops per sec".
-"XDCR docs to replicate" is a little confusing because it doesn't match the name of the same stat in the "outbound" section. I recommend changing "mutations to replicate" to "XDCR docs to replicate".
-It would also be good to see outbound ops/sec in the summary section alongside the number remaining to replicate.
Activity
Perry Krug
added a comment -
Thanks Junyi. I'd actually like to continue the discussion about removing those stats, because anything that a customer sees will generate a question as to its purpose, meaningful or not. We want the UI to be simple and direct for our users so they can understand what the cluster/node is doing...I don't think these 11 stats help accomplish that for our customers. Additionally, I think the ns_server team would agree that the fewer stats we have overall, the better for performance and maintenance.
To be clear, I'm not advocating for these stats to be removed from the system completely, just from the UI.
Junyi Xie
added a comment - edited
Dipti,
Perry suggested removing some XDCR stats from the UI and adding some new ones. This is a big change to the XDCR UI, so it would be better that you are aware of it. Before going ahead and implementing it, I would like your comments on the following:
1) Are these new stats necessary?
2) Are the old XDCR stats that Perry suggested removing still useful to some customers?
3) In which version do you want this change to happen: 2.0.1 (too late?), 2.1, 3.0, etc.?
Please add others who you think should be aware of this.
Thanks.
Ketaki Gangal
added a comment -
Adding some more here:
- Rate of replication [items sent / sec]
- Average replication rate
- Lag in replication (helpful to understand/observe if receiving too many back-offs/timeouts)
- Average replication lag
- Items replicated
- Items to replicate
- Percentage of conflicts in data
Other useful ones
-------------------------------------
- One checkpoint every minute
- Back-off handled by ns_server
- Number of retries
- Timeouts (failed to replicate)
- Average replication lag
- XDCR data size
Ketaki Gangal
added a comment -
Based on our discussion today, can we have the following changes/edits to the current XDCR stats?
1. On the XDCR tab, in addition to the existing information, add per replication setup:
a. Percentage complete/caught up. While a replication is backfilling, this would show the number of items already sent to the remote side out of the total in the bucket. Once the replication is running, it would show whether there is a significant backlog in the queue.
b. Replication rate: items per second, to see the speed of each stream and in total.
c. Bandwidth in use. According to a customer, the most important thing with XDCR is going to be the (possibly cross-country) internet bandwidth, and they will need to monitor it for each replication stream and in total.
2. In the main bucket section:
a. Rename "XDCR dest ops/sec" to "Incoming XDCR ops/sec"
b. Rename "XDCR docs to replicate" to "Outbound XDCR docs"
c. Add Percentage Complete
d. Add XDCR Replication Rate
3. In the outgoing XDCR section:
a. Remove "mutations checked" and "mutations replicated"; move them to the logging level.
b. Remove "active vb reps" and "waiting vb reps"; move them to the logging level.
c. Remove "mutations in queue"
d. Rename "mutations to replicate" and "XDCR docs to replicate" consistently as "Outbound XDCR docs"
e. Rename "queue size" to "XDCR queue size"
f. Change "num checkpoints issued" and "num checkpoints failed" to cover the last 10 checkpoints instead of the entire set of checkpoints.
@Perry - The stats "secs in replicating" and "secs in checkpointing" have been useful in triaging XDCR bugs in the past.
Currently most of the XDCR stats are aggregated at the ns_server/mnesia level. Individual (per-vbucket) information is kept only in the logs. Considering how critical these stats are, we've decided to continue maintaining this information for XDCR checkpointing.
Ketaki Gangal
added a comment -
Of these, the following stats are the most critical:
1. On the XDCR tab, in addition to the existing information, add per replication setup:
a. Percentage complete/caught up. While a replication is backfilling, this would show the number of items already sent to the remote side out of the total in the bucket. Once the replication is running, it would show whether there is a significant backlog in the queue.
b. Replication rate: items per second, to see the speed of each stream and in total.
c. Bandwidth in use. According to a customer, the most important thing with XDCR is going to be the (possibly cross-country) internet bandwidth, and they will need to monitor it for each replication stream and in total.
Dipti Borkar
added a comment -
Ketaki, sorry I couldn't attend the meeting today. I want some clarification on some of these before we implement. I'll sync up with you tomorrow.
Perry Krug
added a comment -
Thank you Ketaki.
A few more comments:
-I don't know that "percentage complete" and "XDCR replication rate" are necessarily needed in the "main bucket section"...those are really specific to each stream below and may not make sense to aggregate together.
-Are we planning on keeping "mutation to replicate" and "XDCR docs to replicate" as separate stats?
-Along with the above, what is the difference between "XDCR queue size" and "Outbound XDCR docs" (and do we need to keep both)?
-I still question the usefulness of "secs in replicating" and "secs in checkpointing"...won't these values be constantly incrementing for the life of the replication stream? When looking at a customer's environment after it has been running for days/weeks/months, what are these stats expected to show? Apologies if I'm not understanding them correctly...
Thanks
Ketaki Gangal
added a comment -
@Dipti - Sure, let's sync up today on this.
@Perry -
c. Add Percentage Complete - yes, this is more pertinent at the replication stream level.
d. Add XDCR Replication Rate - yes, this is more pertinent at the replication stream level.
"mutations to replicate" and "XDCR docs to replicate" are to be renamed consistently as "Outbound XDCR docs", so they should be the same stat.
@Junyi - Correct me if this is a wrong assumption.
XDCR queue size: the actual memory currently used to store the current queue (which is a much smaller subset of all items to be replicated). We figured this would be useful to know when sizing the bucket/memory with reference to XDCR.
Outbound XDCR docs: the total number of items that are to be replicated; not all of them are in memory at all times.
For "secs in checkpointing" and "secs in replicating", I agree that this is an ever-growing number, and with a much longer runtime, as in a typical customer scenario, it would be huge. However, we've detected issues with XDCR very easily in our previous testing by using these stats; for example, if "secs in checkpointing" is way off, it clearly shows some badness in XDCR.
Another way to do this would be to add logging or some information elsewhere, but the current stats at the ns_server/XDCR level show these values on a per-vbucket basis, which may or may not be very useful when triaging errors of this kind.
We can, however, have a call to discuss further if there is a better way to implement this.
Perry Krug
added a comment -
Thanks for continuing the conversation, Ketaki. A few more follow-ups from my side:
XDCR queue size: the actual memory currently used to store the current queue (which is a much smaller subset of all items to be replicated). We figured this would be useful to know when sizing the bucket/memory with reference to XDCR.
[pk] - Can you explain a bit more about memory being taken up for xdcr? Is this source or destination? What exactly is the RAM being used for? Is it in memcached or beam.smp?
For "secs in checkpointing" and "secs in replicating", I agree that this is an ever-growing number, and with a much longer runtime, as in a typical customer scenario, it would be huge. However, we've detected issues with XDCR very easily in our previous testing by using these stats; for example, if "secs in checkpointing" is way off, it clearly shows some badness in XDCR.
[pk] - When you say "way-off"...what do you mean? Between nodes within a cluster? Between clusters? What is the difference between the checkpointing measurement and the replicating measurement? What do you mean by "badness" specifically?
Thanks Ketaki. This is all good information for our documentation and internal information as well.
Perry
Ketaki Gangal
added a comment -
Hi Perry,
[pk] - Can you explain a bit more about memory being taken up for xdcr? Is this source or destination? What exactly is the RAM being used for? Is it in memcached or beam.smp?
XDCR queue size is the total memory used for the XDCR queue per node. We want to account for the memory overhead of XDCR (we only store the key and metadata).
This is memory on the source node. It is accounted for in the beam.smp memory.
For each vb replicator, the queue is created with the following limits:
- Maximum number of items in the queue: BatchSize * NumWorkers * 2. By default, BatchSize is 500 and NumWorkers is 4, so the queue can hold at most 4000 mutations.
- Maximum size of the queue: 100 * 1024 * NumWorkers bytes. By default, that is 400KB.
In short, the queue is bounded by 400KB or 4000 items, whichever is reached first.
On each node there are at most 32 active replicators, so the maximum memory overhead used by the queues is 32 * 400KB = 12800KB, or 12.5MB.
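The arithmetic above can be checked with a short sketch. The constant names are illustrative only and are not taken from the XDCR source; the values are the defaults quoted in this comment. Note that 12800KB works out to 12.5MB when 1MB is taken as 1024KB.

```python
# Defaults quoted in the comment above; the names are illustrative only.
BATCH_SIZE = 500               # mutations per batch
NUM_WORKERS = 4                # workers per vb replicator
BYTES_PER_WORKER = 100 * 1024  # queue bytes allowed per worker
MAX_ACTIVE_REPLICATORS = 32    # active vb replicators per node

# Per-replicator queue limits.
max_queue_items = BATCH_SIZE * NUM_WORKERS * 2
max_queue_bytes = BYTES_PER_WORKER * NUM_WORKERS

# Worst-case queue memory on one source node.
max_node_bytes = MAX_ACTIVE_REPLICATORS * max_queue_bytes

print(max_queue_items)             # 4000
print(max_queue_bytes // 1024)     # 400 (KB)
print(max_node_bytes / 1024 ** 2)  # 12.5 (MB, with 1 MB = 1024 KB)
```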
[pk] - When you say "way-off"...what do you mean? Between nodes within a cluster? Between clusters? What is the difference between the checkpointing measurement and the replicating measurement? What do you mean by "badness" specifically?
For "secs in replicating" vs. "secs in checkpointing", I am not sure of the exact difference between the two.
@Junyi - Could you explain more here?
I should have referred to "Docs to replicate" in place of "secs in checkpointing", which led to significant checkpoint changes in the past - my bad. "http://www.couchbase.com/issues/browse/MB-6939" was the one I had in mind when referring to badness.
thanks,
Ketaki
Junyi Xie
added a comment -
This bug will spawn a list of fixes. My tentative plan is to resolve it in several commits, based on the discussion above.
First of all, let me make clear that the "docs" (or "items") XDCR replicates are actually "mutations". For example, if we send 10 docs via XDCR to the remote cluster, all 10 may be mutations of a single document (item) rather than 10 different documents (items). So, in the stats section, we should use "mutations" instead of "docs" where applicable.
Here is my summary, please let me know if any question or I miss anything
Commit 1: Rename current stats, just renaming, no change to the underlying stats
In the MAIN bucket section:
a. Rename "XDC Dest ops/sec" to "Incoming XDCR ops/sec"
b. Rename "XDC docs to replicate" to "Outbound XDCR mutations"
In the Outbound XDCR stats section:
c. Rename "mutation to replicate", "XDCR docs to replicate" consistently as "Outbound XDCR mutations"
d. Rename "queue size" as "XDCR queue size"
Commit 2: Change current stats
In the Outbound XDCR stats section:
a. Change "num checkpoints issued" and "num checkpoints failed" to cover the last 10 checkpoints instead of the entire set of checkpoints, and rename them correspondingly
Commit 3: Add new stats
In the Outbound XDCR stats section:
a. Add new stat "Percentage of completeness", which is computed as
"number of mutations already sent to remote side" / ("number of mutations already sent to remote side" + "number of mutations waiting to be sent to remote side").
Here "number of mutations waiting to be sent to remote side" is the stat "Outbound XDCR mutations".
b. Add new stat "Replication rate", which is the number of mutations sent per second, to show the speed of each stream. Unit: mutations per second
c. Add new stat "Bandwidth in use", which is defined as the number of bytes per second XDCR is currently sending. Unit: bytes per second
Commit 4: Remove all uninteresting stats and route them to logs
In the Outbound XDCR stats section:
a. Remove "mutations checked" and "mutations replicated"; move them to a logging level.
b. Remove "active vb reps" and "waiting vb reps"; move them to a logging level.
c. Remove "mutations in queue"; move it to a logging level.
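The three stats proposed in Commit 3 could be computed roughly as follows. This is only a sketch under stated assumptions: the function names are hypothetical, the counters are assumed to be cumulative totals sampled at a fixed interval, and reporting 100% when nothing has been sent and nothing is waiting is a choice made here, not something specified in the plan.

```python
def percent_complete(sent, waiting):
    """sent / (sent + waiting), expressed as a percentage.

    Returning 100.0 for an idle stream (sent == waiting == 0) is an
    assumption of this sketch.
    """
    total = sent + waiting
    if total == 0:
        return 100.0
    return 100.0 * sent / total


def rate_per_second(prev_total, curr_total, interval_secs):
    """Per-second rate from two samples of a cumulative counter.

    Works for mutations (replication rate) and bytes (bandwidth in use).
    """
    if interval_secs <= 0:
        raise ValueError("interval_secs must be positive")
    return (curr_total - prev_total) / interval_secs


# 7500 mutations sent, 2500 still in "Outbound XDCR mutations".
print(percent_complete(7500, 2500))        # 75.0
# 3000 mutations sent during a 10 second window.
print(rate_per_second(1000, 4000, 10))     # 300.0
# 2 MiB sent during the same window, in bytes per second.
print(rate_per_second(0, 2 * 1024 * 1024, 10))
```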
Perry Krug
added a comment -
Thanks Junyi.
A couple quick questions/clarifications:
c. Rename "mutation to replicate", "XDCR docs to replicate" consistently as "Outbound XDCR mutations"
[pk] - Will this result in the current "mutations to replicate" and "XDCR docs to replicate" being merged into one stat called "Outbound XDCR mutations" in the UI?
d. Rename "queue size" as "XDCR queue size"
[pk] - Can it be made clear that this is measured in KB/MB/GB? As per Ketaki's note, this is a memory size, not a number of items nor mutations in queue. It would be good to explain even further in the "hover over" description of the statistic to say that it will be reflected in the beam.smp/erl.exe memory usage. Digging in further, it was my understanding that we need nearly 2GB of "extra" RAM to support XDCR...yet it appears from Ketaki's description that the maximum memory usage is 12.8MB, can you explain the rest?
Commit 3: Add new stats
[pk] - Just wanted to clarify that these are requested to be displayed per-replication stream in the XDCR configuration section...*not* the graphed stats.
Can you explain further the difference between the secs in checkpointing measurement and the secs in replicating measurement? Will those be renamed/removed?
Junyi Xie
added a comment -
Perry,
[pk] - Will this result in the current "mutations to replicate" and "XDCR docs to replicate" being merged into one stat called "Outbound XDCR mutations" in the UI?
[jx] - Yes, this will unify these two stats. Actually they are the same stat with different names: one is in the Main section and the other is in the Outbound XDCR stats. Per your comments, I will use the same name to remove the confusion.
[pk] - Can it be made clear that this is measured in KB/MB/GB? As per Ketaki's note, this is a memory size, not a number of items nor mutations in queue. It would be good to explain even further in the "hover over" description of the statistic to say that it will be reflected in the beam.smp/erl.exe memory usage.
[jx] - This is defined in bytes. If you move your mouse over the stat in the UI, you will see the text "Size of bytes of XDC replication queue". If the data is at KB, MB, or GB scale, you will see KB, MB, or GB in the UI. There should not be any confusion.
[pk] - Digging in further, it was my understanding that we need nearly 2GB of "extra" RAM to support XDCR...yet it appears from Ketaki's description that the maximum memory usage is 12.8MB, can you explain the rest?
[jx] - This 12.8MB is just the user data (docs, mutations) queued to be replicated. It is only the queue created by XDCR and does not include any other overhead. XDCR lives in the ns_server Erlang process. Per node it creates 32 replicators, and each replicator creates several worker processes and other Erlang processes at run time, for which there will be some memory overhead. That overhead could be big, but I do not have a number at this time.
Where do you get the 2GB of "extra memory" from? Is it per node or per cluster?
[pk] - Just wanted to clarify that these are requested to be displayed per-replication stream in the XDCR configuration section...*not* the graphed stats.
[jx] - Oh, I thought these new stats would be in the Outbound XDCR section, which is graphed and per replication. Why do we need a separate stat in a different place?
[pk] - Can you explain further the difference between the secs in checkpointing measurement and the secs in replicating measurement? Will those be renamed/removed?
[jx] - First, both are aggregated elapsed times from each vb replicator.
"secs in checkpointing" means how much time the XDCR vb replicators spend on checkpointing.
"secs in replicating" means how much time the XDCR vb replicators spend on replicating mutations.
By monitoring these two stats, we can get some idea of where XDCR spends its time and what it is busy working on.
I understand these two stats may create some confusion on the customer side. As Ketaki said, they are still useful for the QE and performance teams. If customers really dislike these stats, we can remove them. :) Personally I am OK with either.
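As a rough illustration of how the two aggregated timers can be read together, the sketch below sums hypothetical per-replicator timings and reports the fraction of time spent checkpointing. The data layout and function names are assumptions for illustration, not the way XDCR stores these values.

```python
def aggregate_seconds(per_replicator):
    """Sum (secs_replicating, secs_checkpointing) pairs across vb replicators."""
    replicating = sum(r for r, _ in per_replicator)
    checkpointing = sum(c for _, c in per_replicator)
    return replicating, checkpointing


def checkpoint_share(replicating, checkpointing):
    """Fraction of total elapsed time spent checkpointing (0.0 if no time recorded)."""
    total = replicating + checkpointing
    return checkpointing / total if total else 0.0


# Hypothetical timings for three vb replicators, in seconds.
timings = [(12.0, 3.0), (8.0, 1.0), (20.0, 6.0)]
replicating, checkpointing = aggregate_seconds(timings)
print(replicating, checkpointing)                    # 40.0 10.0
print(checkpoint_share(replicating, checkpointing))  # 0.2
```

Because both totals only ever grow, a ratio like this (or a difference over a recent window) tends to be easier to interpret than the raw values.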
Perry Krug
added a comment -
Thanks so much Junyi.
Perry,
[pk] - Will this result in the current "mutations to replicate" and "XDCR docs to replicate" being merged into one stat called "Outbound XDCR mutations" in the UI?
[jx] - Yes, this will unify these two stats. Actually they are the same stat with different names: one is in the Main section and the other is in the Outbound XDCR stats. Per your comments, I will use the same name to remove the confusion.
[pk] - Perfect, thank you.
[pk] - Can it be made clear that this is measured in KB/MB/GB? As per Ketaki's note, this is a memory size, not a number of items nor mutations in queue. It would be good to explain even further in the "hover over" description of the statistic to say that it will be reflected in the beam.smp/erl.exe memory usage.
[jx] - This is defined in bytes. If you move your mouse over the stat in the UI, you will see the text "Size of bytes of XDC replication queue". If the data is at KB, MB, or GB scale, you will see KB, MB, or GB in the UI. There should not be any confusion.
[pk] - Yes, that will be great, thanks.
[pk] - Digging in further, it was my understanding that we need nearly 2GB of "extra" RAM to support XDCR...yet it appears from Ketaki's description that the maximum memory usage is 12.8MB, can you explain the rest?
[jx] - This 12.8MB is just the user data (docs, mutations) queued to be replicated. It is only the queue created by XDCR and does not include any other overhead. XDCR lives in the ns_server Erlang process. Per node it creates 32 replicators, and each replicator creates several worker processes and other Erlang processes at run time, for which there will be some memory overhead. That overhead could be big, but I do not have a number at this time.
Where do you get the 2GB of "extra memory" from? Is it per node or per cluster?
[pk] - This was the recommendation from QE based upon some analysis we did at Concur. It would be *extremely* helpful to get accurate and specific sizing information, and what takes up that size in whatever form.
[pk] - Just wanted to clarify that these are requested to be displayed per-replication stream in the XDCR configuration section...*not* the graphed stats.
[jx] - Oh, I thought these new stats would be in the Outbound XDCR section, which is graphed and per replication. Why do we need a separate stat in a different place?
[pk] - This has to do with how and why these stats are being consumed. When a user is looking at their cluster to determine the replication status, it will be much easier to look at all the streams together...this is much harder to do when you have to click into each bucket and look at each individual stream. It's along the same lines as why we have item counts on the manage servers screen.
[pk] - Can you explain further the difference between the secs in checkpointing measurement and the secs in replicating measurement? Will those be renamed/removed?
[jx] - First, both are aggregated elapsed times from each vb replicator.
"secs in checkpointing" means how much time the XDCR vb replicators spend on checkpointing.
"secs in replicating" means how much time the XDCR vb replicators spend on replicating mutations.
By monitoring these two stats, we can get some idea of where XDCR spends its time and what it is busy working on.
I understand these two stats may create some confusion on the customer side. As Ketaki said, they are still useful for the QE and performance teams. If customers really dislike these stats, we can remove them. :) Personally I am OK with either.
[pk] - Thanks for the explanation. I would still advocate for removing them. The main reason is that they do not materially help identify any issue or behavior after the cluster has been running for an extended period of time. The up-to-the-second monitoring of these stats will show an extremely high number for both after just a few days or a week of a replication stream running...let alone multiple weeks or months. I can definitely see that they would be useful when debugging the initial stream or trying to identify an issue, but I would ask that they be moved to the log or another stat area outside of the UI.
Which leads me to another question :-) Do we already have documentation (or can you help with that) on where and how to get these "other" stats regarding XDCR? Is there only a REST API to query? Are they printed into some log periodically? Could we get that detailed and written up?
Thanks again!
Junyi Xie
added a comment -
Perry, you are highly welcome. Please see my response below.
[pk] - This has to do with how and why these stats are being consumed. When a user is looking at their cluster to determine the replication status, it will be much easier to look at all the streams together...this is much harder to do when you have to click into each bucket and look at each individual stream. It's in the same line as why we have item counts on the manage servers screen.
[jx] -- I see. Thanks for the explanation. I agree that from a user's perspective it is better to have a summary stat across ALL replications, not just per replication stream.
It seems that today we do not have anything like this (stats across all buckets), and there is no stat on the XDCR tab either, so I need to talk to the UI guys about how and where to add these stats. It involves some UI design change, which is more than adding another per-replication stat to the UI. Better to
[pk] -- Which leads me to another question :-) Do we have documented already (or can you help with that) where and how to get these "other" stats regarding XDCR? Is there only a REST API to query? Are they printed into some log periodically? Could we get that detailed and written up?
[jx] -- Other than the UI stats, XDCR also dumps a lot of stats and information to the log files, but I am afraid they are too detailed and hard to parse from a customer's perspective :) Today I put all XDCR stats on the UI. Tomorrow, after we remove some stats from the UI (like secs in checkpointing), I will put them into the log and document how to get them easily. For all stats on the UI, you can use the standard REST API to get them.
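For anyone who wants to pull the UI stats over REST, here is a minimal sketch of how it might look. It assumes the per-bucket stats endpoint `/pools/default/buckets/<bucket>/stats` on the admin port 8091 and a response shaped as `{"op": {"samples": {name: [values...]}}}`; the stat names in the sample payload are placeholders for illustration only, and the helpers `stats_url` and `latest_values` are hypothetical.

```python
import json


def stats_url(host, bucket):
    # Assumed endpoint layout; check it against your cluster's REST documentation.
    return f"http://{host}:8091/pools/default/buckets/{bucket}/stats"


def latest_values(payload, needle):
    """Return the most recent sample of every stat whose name contains `needle`."""
    samples = payload.get("op", {}).get("samples", {})
    return {
        name: values[-1]
        for name, values in samples.items()
        if needle in name and isinstance(values, list) and values
    }


# Hand-written payload standing in for a real response (stat names are placeholders).
sample = json.loads(
    '{"op": {"samples": {"ops": [10, 12], "xdc_ops": [3, 5],'
    ' "replication_changes_left": [40, 0]}}}'
)
xdcr_latest = latest_values(sample, "xdc")
```

In practice the payload would come from an authenticated GET against `stats_url(host, bucket)`; the parsing step stays the same.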
Perry Krug
added a comment -
Thanks Junyi. Do we have a bug open already for the UI enhancements around this?
Junyi Xie
added a comment -
I mean you can open another bug for the bandwidth usage, which is purely UI work and has nothing to do with the XDCR code.
For this particular bug, MB-7432, all work on the XDCR side is done except the stats removal (Dipti will make the decision on that; she will probably file another bug). So please close this bug if you do not need anything from me.
Perry Krug
added a comment -
So it sounds like this is not yet resolved if all the decisions haven't been made yet.
Assigning to Dipti to make the final decisions...I want to leave it open to make sure things get wrapped up.
Adding a UI component for the bandwidth request.
Maria McDuff
added a comment - deferred out of 2.0.2
I will certainly add the stats you suggested, and reorder some stats to make them more readable.
The current stats exist for a reason: most of them are there at the request of the QE and performance teams, although apparently they are not very interesting to users. If they do not have a big downside, I would like to keep them for now.