[MB-4824] memcached silently closes connection in auth request [was: ns_server: throws exception in ebucketmigrator_srv during rebalance test] Created: 21/Feb/12  Updated: 18/Jun/13  Due: 21/Feb/12  Resolved: 03/Jun/12

Status: Closed
Project: Couchbase Server
Component/s: couchbase-bucket
Affects Version/s: 2.0-developer-preview-4
Fix Version/s: 2.0
Security Level: Public

Type: Bug Priority: Major
Reporter: Tommie McAfee Assignee: Chiyoung Seo
Resolution: Duplicate Votes: 0
Labels: 2.0-dev-preview-4-release-notes
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: cluster-run 6 nodes.

Attachments: Zip Archive 127.0.0.1-9000-diag.txt.zip     Zip Archive 127.0.0.1-9001-diag.txt.zip     Zip Archive 127.0.0.1-9002-diag.txt.zip     Zip Archive 127.0.0.1-9003-diag.txt.zip     Zip Archive 127.0.0.1-9004-diag.txt.zip     Zip Archive 127.0.0.1-9005-diag.txt.zip    

 Description   
Rebalance-in while inserting spatial docs causes rebalance to fail. During the time of failure, I see the following in the diags:

[error_logger:error] [2012-02-21 10:49:04] [n_0@10.17.4.51:error_logger:ale_error_logger_handler:log_report:72]
=========================CRASH REPORT=========================
  crasher:
    initial call: ebucketmigrator_srv:init/1
    pid: <0.25874.6>
    registered_name: []
    exception error: no match of right hand side value {error,closed}
      in function mc_client_binary:cmd_binary_vocal_recv/5 (src/mc_client_binary.erl, line 103)
      in call from mc_client_binary:auth/2 (src/mc_client_binary.erl, line 145)
      in call from ebucketmigrator_srv:connect/4 (src/ebucketmigrator_srv.erl, line 281)
      in call from ebucketmigrator_srv:init/1 (src/ebucketmigrator_srv.erl, line 133)
    ancestors: [<0.25873.6>]
    messages: []
    links: [#Port<0.454477>,<0.25873.6>,#Port<0.454476>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 987
    stack_size: 24
    reductions: 960
  neighbours:

SpatialRebalanceTests.test_insert_x_docs_during_rebalance





 Comments   
Comment by Aleksey Kondratenko [ 22/Feb/12 ]
this error means that source node closed connection when we tried to establish tap connection. I see no traces of ns_server-side issues that could lead to this. We honestly did what we're normally doing.

BTW, thumbs up for using R15 which provides line numbers in backtrace.

So it closes connection when SASL auth is performed. Just after establishing tcp connection and prior to selecting bucket. And much prior to sending TAP_CONNECT command.

This could be something OS-related (like exhausting file descriptors in memcached for example).
Comment by Aleksey Kondratenko [ 22/Feb/12 ]
Assigning to our memcached & bucket engine guy.

Trond, ns_server is using bucket name and password in AUTH request. But that's bucket default with forcibly empty password. So it should not be AUTH failure. Even if it is, it should not be intermittent and silent (i.e. no error is returned but socket just closed).
Comment by Trond Norbye [ 24/Feb/12 ]
I'll look into this today
Comment by Trond Norbye [ 24/Feb/12 ]
Are you absolutely sure it's not sending any data before the socket is closed? I just added plaintext sasl support for mcstat and ran memcached like:

export ISASL_PWFILE=/Users/trondnorbye/compile/MB-4824/memcached/isasl.pw
(isasl.pw contains:
default
)

When I connect ./mcstat -u default -h localhost:11211 -P ' '
It receives an error package before the socket is closed (as memcached prints out:)
./memcached -vvv -S
Loaded isasl db from /Users/trondnorbye/compile/MB-4824/memcached/isasl.pw
Initialized SASL.
isasl checking DB every 60s
Loaded engine: Default engine v0.1
Supplying the following features: LRU, compare and swap
<22 server listening (binary)
<23 server listening (binary)
<24 send buffer was 9216, now 3728270
<24 server listening (udp)
<25 send buffer was 9216, now 3728270
<24 server listening (udp)
<25 server listening (udp)
<25 server listening (udp)
<24 server listening (udp)
<24 server listening (udp)
<25 server listening (udp)
<25 server listening (udp)
23 - Running task: (conn_listening)
<26 new binary client connection.
26 - Running task: (conn_new_cmd)
26: going from conn_new_cmd to conn_waiting
26 - Running task: (conn_waiting)
26: going from conn_waiting to conn_read
26 - Running task: (conn_read)
26: going from conn_read to conn_parse_cmd
26 - Running task: (conn_parse_cmd)
>26 Read binary protocol data:
>26 0x80 0x21 0x00 0x05
>26 0x00 0x00 0x00 0x00
>26 0x00 0x00 0x00 0x10
>26 0x00 0x00 0x00 0x00
>26 0x00 0x00 0x00 0x00
>26 0x00 0x00 0x00 0x00

26: authenticated() in cmd 0x21 is true
26: going from conn_parse_cmd to conn_nread
26 - Running task: (conn_nread)
26 - Running task: (conn_nread)
26 - Running task: (conn_nread)
26 - Running task: (conn_nread)
26: mech: ``PLAIN'' with 11 bytes of data
26: sasl result code: -1
26: Unknown sasl response: -1
>26 Writing an error: Auth failure
<26 Writing bin response:
<26 0x81 0x21 0x00 0x00
<26 0x00 0x00 0x00 0x20
<26 0x00 0x00 0x00 0x0c
<26 0x00 0x00 0x00 0x00
<26 0x00 0x00 0x00 0x00
<26 0x00 0x00 0x00 0x00

26: going from conn_nread to conn_mwrite
26 - Running task: (conn_mwrite)
26 - Running task: (conn_mwrite)
26: going from conn_mwrite to conn_new_cmd
26 - Running task: (conn_new_cmd)
26: going from conn_new_cmd to conn_waiting
26 - Running task: (conn_waiting)
26: going from conn_waiting to conn_read
26 - Running task: (conn_read)
26: going from conn_read to conn_closing
26 - Running task: (conn_closing)
-1: going from conn_closing to conn_immediate_close
-1 - Running task: (conn_immediate_close)
Immediate close of 0x7ff80053d6c0

Could it be that the code detects that the socket is closed and doesn't read out the error message? If no, how can I reproduce this issue?
Comment by Trond Norbye [ 24/Feb/12 ]
Please let me know how to reproduce if you're absolutely sure that you're not getting an error message
Comment by Aleksey Kondratenko [ 24/Feb/12 ]
Yes I am absolutely sure it is not being sent auth reply. It could be unrelated to AUTH, because error is seemingly intermittent.

And apparently the way to reproduce it is by running testrunner test that was mentioned as part of bug report.
Comment by Trond Norbye [ 24/Feb/12 ]
If you spawn memcached with more logging you should see when it disconnects you and you can map that with the error you're seeing.. try add -vvv (but you probably need to extend the logger to not drop lines...)
Comment by Farshid Ghods (Inactive) [ 29/Feb/12 ]
results in a rebalance hung
Comment by Aleksey Kondratenko [ 29/Feb/12 ]
no it does not. If rebalance hungs anywhere it's a different bug.
Comment by Aleksey Kondratenko [ 25/Apr/12 ]
We've seen very similar issue in 1.8.1 recently. I've added better diagnostics and hopefully will soon find out whats wrong
Comment by Farshid Ghods (Inactive) [ 03/Jun/12 ]
there are couple of other 1.8.1 bugs related to this issue.
please close if this is a dupe
Comment by Chiyoung Seo [ 03/Jun/12 ]
These issues are basically caused by the same reason.

http://www.couchbase.com/issues/browse/MB-5343
Generated at Thu Aug 28 04:29:21 CDT 2014 using JIRA 5.2.4#845-sha1:c9f4cc41abe72fb236945343a1f485c2c844dac9.