Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Won't Fix
Affects Version/s: 1.7.3
Fix Version/s: None
Component/s: couchbase-bucket
Security Level: Public
Labels:
Environment: All environments
Description
When a large number of mutations (sets and deletes) are carried out on a master that is connected to a slave via vbucketmigrator with a registered tap name, vbucketmigrator dies.
Vbucketmigrator was run with the following options:
/opt/membase/bin/vbucketmigrator -<master_host >:11211 -d <slave_host>:11211 -b 0 -v -A -N <registered_tap_name> -r
Memcached was run with the following parameters:
Master:
/opt/membase/bin/memcached -v -d -p 11211 -u nobody -c 65535 -P /var/run/memcached/memcached.pid -E /opt/membase/lib/memcached/ep.so -e dbname=/db/membase/ep.db;min_data_age=0;queue_age_cap=900;max_size=64424509440;initfile=/opt/membase/membase-init.sql;ht_size=12582917;tap_keepalive=600
Slave:
/opt/membase/bin/memcached -v -d -p 11211 -u nobody -c 65535 -P /var/run/memcached/memcached.pid -E /opt/membase/lib/memcached/ep.so -e dbname=/db/membase/ep.db;min_data_age=0;queue_age_cap=900;max_size=64424509440;initfile=/opt/membase/membase-init.sql;ht_size=12582917;tap_keepalive=600;inconsistent_slave_chk=true
- logs.rar, 19/Dec/11 4:47 AM, 21 kB, Pritish Pratap
Activity
Chiyoung Seo
added a comment -
Do you have any core dumps from vbucketmigrator?
Pritish Pratap
added a comment -
Please find below the logs for the vbucketmigrator exit that we encountered.
Dec 14 00:21:01 pritish-test-membase171_S2_3 [vbucketmigrator]: Message from downstream sent upstream: [ RES V: 0 TMUTATION (tap seqno: 1b1fad1)]
Dec 14 00:21:01 pritish-test-membase171_S2_3 [vbucketmigrator]: Received message from downstream server: [ RES V: 0 TMUTATION (tap seqno: 1b1fad2)]
Dec 14 00:21:01 pritish-test-membase171_S2_3 [vbucketmigrator]: Message from downstream sent upstream: [ RES V: 0 TMUTATION (tap seqno: 1b1fad2)]
Dec 14 00:21:01 pritish-test-membase171_S2_3 [vbucketmigrator]: Received message from downstream server: [ RES V: 0 TMUTATION (tap seqno: 1b1fad3)]
Dec 14 00:21:01 pritish-test-membase171_S2_3 [vbucketmigrator]: Received message from downstream server: [ RES V: 0 TMUTATION (tap seqno: 1b1fad4)]
Dec 14 00:21:01 pritish-test-membase171_S2_3 [vbucketmigrator]: Message from downstream sent upstream: [ RES V: 0 TMUTATION (tap seqno: 1b1fad3)]
Dec 14 00:21:01 pritish-test-membase171_S2_3 [vbucketmigrator]: Message from downstream sent upstream: [ RES V: 0 TMUTATION (tap seqno: 1b1fad4)]
Dec 14 00:21:01 pritish-test-membase171_S2_3 [vbucketmigrator]: Received message from downstream server: [ RES V: 0 TDELETE (tap seqno: 1b1fad5)]
Dec 14 00:21:01 pritish-test-membase171_S2_3 [vbucketmigrator]: Message from downstream sent upstream: [ RES V: 0 TDELETE (tap seqno: 1b1fad5)]
Dec 14 00:21:01 pritish-test-membase171_S2_3 [vbucketmigrator]: Received message from downstream server: [ RES V: 0 TMUTATION (tap seqno: 1b1fad6)]
Dec 14 00:21:01 pritish-test-membase171_S2_3 [vbucketmigrator]: Message from downstream sent upstream: [ RES V: 0 TMUTATION (tap seqno: 1b1fad6)]
Dec 14 00:21:01 pritish-test-membase171_S2_3 [vbucketmigrator]: Received message from downstream server: [ RES V: 0 TDELETE (tap seqno: 1b1fad7)]
Dec 14 00:21:01 pritish-test-membase171_S2_3 [vbucketmigrator]: Message from downstream sent upstream: [ RES V: 0 TDELETE (tap seqno: 1b1fad7)]
Dec 14 00:21:01 pritish-test-membase171_S2_3 [vbucketmigrator]: Received message from downstream server: [ RES V: 0 TMUTATION (tap seqno: 1b1fad8)]
Dec 14 00:21:01 pritish-test-membase171_S2_3 [vbucketmigrator]: Message from downstream sent upstream: [ RES V: 0 TMUTATION (tap seqno: 1b1fad8)]
Dec 14 00:21:01 pritish-test-membase171_S2_3 [vbucketmigrator]: Received message from downstream server: [ RES V: 0 TMUTATION (tap seqno: 1b1fad9)]
Dec 14 00:21:01 pritish-test-membase171_S2_3 [vbucketmigrator]: Message from downstream sent upstream: [ RES V: 0 TMUTATION (tap seqno: 1b1fad9)]
Dec 14 00:37:37 pritish-test-membase171_S2_3 [vbucketmigrator]: Failed to read from stream: Connection timed out
Dec 14 00:37:37 pritish-test-membase171_S2_3 [vbucketmigrator]: An error occured on the downstream connection..
Dec 14 00:37:37 pritish-test-membase171_S2_3 [vbucketmigrator]: Downstream connection closed.. shutdown upstream
Dec 14 00:37:37 pritish-test-membase171_S2_3 [vbucketmigrator]: vbucketmigrator exit
Chiyoung Seo
added a comment -
It looks like the downstream master (i.e., the slave) closed the connection to vbucketmigrator because of unexpected behavior (a crash, or the tap client receiving wrong messages) in the downstream memcached/ep-engine. Do you have a core dump from the downstream memcached? If not, can you increase the log level on the downstream memcached process?
Pritish Pratap
added a comment -
Please find attached the membase logs from both the master and the slave, along with the vbucketmigrator logs taken with the -vv option.
Pritish Pratap
added a comment - vbucketmigrator + membase_master + membase_slave logs
Dustin Sallings
added a comment - edited
Can we quantify "large number" here? I'm keeping a pretty steady 90-110K mutations/s on the frontend over 1M keys replicating to a slave. About how much time or how many reqs should I expect to pass before seeing an issue?
Dustin Sallings
added a comment -
I'm doing several million every few seconds right now. It's not complaining.
One thing I don't quite understand here -- is anything actually wrong? Does it resume properly if restarted? The software is designed to allow an occasional hiccup to fail and be resumable. I would assume vbm would be running under daemontools or similar so that a network failure would only be detected by the logs and stats it affects.
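The restart-on-exit arrangement mentioned above can be sketched roughly as follows. This is only an illustration of the idea, not how daemontools or any script used in this thread is implemented; the `supervise` function, its parameters, and the injectable `run` and `sleep` hooks are hypothetical names introduced here.

```python
import subprocess
import time
from typing import Callable, Sequence


def supervise(cmd: Sequence[str], max_restarts: int, delay_s: float = 1.0,
              run: Callable[[Sequence[str]], int] = subprocess.call,
              sleep: Callable[[float], None] = time.sleep) -> int:
    """Run cmd, restarting it after every exit, up to max_restarts restarts.

    Returns the number of times the command was started. `run` and `sleep`
    are injectable (hypothetical hooks) so the loop can be exercised without
    spawning a real process.
    """
    starts = 0
    while starts <= max_restarts:
        status = run(cmd)  # blocks until the child process exits
        starts += 1
        print(f"{cmd[0]} exited with status {status} (start #{starts})")
        if starts <= max_restarts:
            sleep(delay_s)  # brief pause before the next restart
    return starts
```

A real supervisor would normally restart indefinitely and rely on logs and stats to surface the failures; the restart cap here only keeps the sketch bounded.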
Dustin Sallings
added a comment -
Failures are a given, though. If a process restarts once a day for any reason whatsoever and has no effect on the quality of service, then I wouldn't consider it a bug worth spending time on. Switches reboot, networks and machines get overloaded, minor bugs manifest, etc...
I've run just under a billion ops through one of my membase processes replicating into another one today and haven't had any issues yet. Meanwhile, I've run a few thousand basic S3 fetches into an EC2 node and I've run into all kinds of random failures I can't explain. They worked the second or third time, though.
This is what I'm trying to understand. If a spurious occasional failure is resumable with no negative impact on the system, then I think it's working correctly as a distributed system.
Dustin Sallings
added a comment -
Is this the right commandline? It's reporting a timeout failure, but I'm having trouble reproducing that without adding a -T (timeout) value. I faulted the connection to the destination for around 20 minutes with no end-to-end issues *except* when -T is used (at which point it was quite happy to resume and continue working).
I'm approaching 2 billion ops now including multiple restarts, timeouts, downstream failures, etc...
Pritish Pratap
added a comment -
We are actually using a daemon script to restart vbucketmigrator every time it crashes.
The daemon dies only when it cannot find the tap name while trying to reconnect.
In the vbucketmigrator logs, the error is reported about 16 minutes after the last mutation was received. During this period the daemon keeps trying to restart vbucketmigrator, but once it finds that the tap is no longer present (Dec 14 00:37:37) it reports the error and dies. Restarting the daemon at that point results in a complete backfill, which we are trying to avoid.
Dec 14 00:21:01 pritish-test-membase171_S2_3 [vbucketmigrator]: Message from downstream sent upstream: [ RES V: 0 TMUTATION (tap seqno: 1b1fad8)]
Dec 14 00:21:01 pritish-test-membase171_S2_3 [vbucketmigrator]: Received message from downstream server: [ RES V: 0 TMUTATION (tap seqno: 1b1fad9)]
Dec 14 00:21:01 pritish-test-membase171_S2_3 [vbucketmigrator]: Message from downstream sent upstream: [ RES V: 0 TMUTATION (tap seqno: 1b1fad9)]
Dec 14 00:37:37 pritish-test-membase171_S2_3 [vbucketmigrator]: Failed to read from stream: Connection timed out
Dec 14 00:37:37 pritish-test-membase171_S2_3 [vbucketmigrator]: An error occured on the downstream connection..
Dec 14 00:37:37 pritish-test-membase171_S2_3 [vbucketmigrator]: Downstream connection closed.. shutdown upstream
Dec 14 00:37:37 pritish-test-membase171_S2_3 [vbucketmigrator]: vbucketmigrator exit
Hope this adds some more context to the kind of problem we’re facing.
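The "about 16 minutes" figure can be checked directly from the two timestamps in the excerpt above. The sketch below is only an illustration: the syslog lines carry no year, so 2011 (the year the attachment was uploaded) is assumed, and the helper names are made up for this example. The comparison against tap_keepalive=600 from the memcached command line is a plausible reading of why the tap was gone, not a confirmed diagnosis.

```python
from datetime import datetime

# Timestamps copied from the log excerpt; the year is an assumption.
LAST_MUTATION = "Dec 14 00:21:01"
TIMEOUT_ERROR = "Dec 14 00:37:37"
TAP_KEEPALIVE_S = 600  # tap_keepalive value from the memcached command line


def parse_syslog(stamp: str, year: int = 2011) -> datetime:
    """Parse a 'Mon DD HH:MM:SS' syslog timestamp, supplying the missing year."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def idle_seconds(first: str, second: str) -> int:
    """Seconds elapsed between two syslog timestamps from the same host."""
    return int((parse_syslog(second) - parse_syslog(first)).total_seconds())


gap = idle_seconds(LAST_MUTATION, TIMEOUT_ERROR)
print(gap, gap > TAP_KEEPALIVE_S)  # 996 True
```

The gap works out to 996 seconds (16 minutes 36 seconds), which is longer than the 600-second keepalive configured on both servers.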
Dustin Sallings
added a comment -
Do you have a timeout set? It seems like setting a timeout to something less than your named tap timeout would solve the problem. The client would fail when the downstream became unavailable for too long and restart. If the problem was transient (firewall ate the conn or similar), it'd happily resume.
I tested this on my environment several times by completely wedging the downstream process, repairing it, and having it resume.
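The relationship suggested above (a client timeout shorter than the named tap timeout) can be written as a simple check. This is a deliberately simplified model, not something taken from vbucketmigrator or ep-engine: the function name and the `restart_s` allowance for the supervisor's restart delay are hypothetical, and it assumes the server starts counting the keepalive at about the same moment the client starts counting idle time.

```python
def timeout_leaves_recovery_window(timeout_s: int, tap_keepalive_s: int,
                                   restart_s: int = 0) -> bool:
    """Return True if a client that gives up after timeout_s idle seconds,
    and then needs restart_s seconds to reconnect, is back before the
    server-side tap_keepalive_s window has elapsed."""
    return timeout_s + restart_s < tap_keepalive_s


# A 60 s client timeout against a 600 s keepalive leaves ample headroom.
print(timeout_leaves_recovery_window(60, 600, restart_s=30))
# An idle period of 996 s (as in the logs above) outlasts a 600 s keepalive.
print(timeout_leaves_recovery_window(996, 600))
```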
Sriharsha Krishnamurthy
added a comment -
We are not setting a timeout (-T). We only have tap_keepalive set to 600 seconds. Are you suggesting that we set -T to less than 600 seconds?
We are seeing disconnects when we attach 1.6 to 1.7 using the command line below:
/opt/membase/bin/vbucketmigrator -<master_host >:11211 -d <slave_host>:11211 -b 0 -v -A -N <registered_tap_name>
Dustin Sallings
added a comment -
If you expect a lot of traffic, then -T 60 or so should basically never fire unless something is wrong, and there should be plenty of time to recover.
Sriharsha Krishnamurthy
added a comment -
We tried running vbucketmigrator with -T 60:
/opt/membase/bin/vbucketmigrator -<master_host >:11211 -d <slave_host>:11211 -b 0 -v -A -N <registered_tap_name> -T 60
Replication is now very slow.
The master had 35M keys.
The slave attached without -T had 18M keys after 12 hours.
The slave attached with -T 60 had 1.7M keys after 12 hours.
Does the -T option really slow replication down this drastically?
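To put the two 12-hour figures on a common scale, the average replication rates can be worked out as below. This is plain arithmetic on the numbers quoted in this comment; the variable names are illustrative only, and the rates are averages that say nothing about how throughput varied over the run.

```python
HOURS = 12
SECONDS = HOURS * 3600  # 43,200 seconds in the observation window

keys_without_timeout = 18_000_000  # slave attached without -T
keys_with_timeout = 1_700_000      # slave attached with -T 60

rate_without = keys_without_timeout / SECONDS
rate_with = keys_with_timeout / SECONDS
ratio = rate_without / rate_with

print(f"{rate_without:.0f} keys/s without -T, {rate_with:.0f} keys/s with -T 60, "
      f"a factor of {ratio:.1f}")
```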
Dustin Sallings
added a comment -
It would be very unlikely that specifying a timeout would have a noticeable impact on performance. It could increase the syscalls a bit and perhaps make the event loop slightly less efficient, but it's a fairly normal way to use libevent, so I don't think it's terribly unusual here.
Do you have server tap stats from the endpoints? Is it perhaps hitting that 60s timeout and reconnecting a lot?
Sriharsha Krishnamurthy
added a comment -
Ignore the speed as that was due to misconfiguration.
This is what I got with -T set.
Dec 25 02:34:24 membase-cluster-005 [vbucketmigrator]: Connecting to {Sock 127.0.0.1:11211}
Dec 25 02:34:24 membase-cluster-005 [vbucketmigrator]: Connecting to {Sock membase-cluster-004:11211}
Dec 25 02:34:24 membase-cluster-005 [vbucketmigrator]: Message from downstream sent upstream: [ REQ V: 0 TCONNECT k: <membase-cluster-004>]
Dec 25 02:34:24 membase-cluster-005 [vbucketmigrator]: Received message from upstream server: [ REQ V: 0 TOPAQUE (tap seqno: 6 ACK request)]
Dec 25 02:34:24 membase-cluster-005 [vbucketmigrator]: Received message from downstream server: [ RES V: 0 TOPAQUE (tap seqno: 6)]
Dec 25 02:34:24 membase-cluster-005 [vbucketmigrator]: Message from downstream sent upstream: [ RES V: 0 TOPAQUE (tap seqno: 6)]
Dec 25 02:34:29 membase-cluster-005 [vbucketmigrator]: Received message from upstream server: [ REQ V: 0 NOOP]
Dec 25 02:35:29 membase-cluster-005 [vbucketmigrator]: Timed out on BinaryMessagePipe from membase-cluster-004:11211 on fd 9
Dec 25 02:35:29 membase-cluster-005 [vbucketmigrator]: vbucketmigrator exit
Dec 25 02:35:29 membase-cluster-005 [vbucketmigrator]: tap information not available for membase-cluster-004:11211 on this host. Exiting vbucketmigrator
Trond Norbye
added a comment -
Is this still a problem for the customer (and are they still running 1.7)? We no longer use vbucketmigrator; these days we use ebucketmigrator. In any case, I don't think the bug belongs in the bucket_engine category; couchbase-bucket is probably a better fit, since it is the producer and consumer of the data that vbucketmigrator times out on.
Mike Wiederhold
added a comment - Vbucketmigrator is no longer used.