Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.0beta2
Fix Version/s: 2.0.0beta3
Component/s: library
Security Level: Public
Labels: None
Environment: CentOS 6.2, 64-bit
Description
I am receiving "network error" from the error callback when calling lcb_get() in a loop inside a function, in libev async mode
(regardless of whether I make one call with a long command list or many calls of one command each), after returning control to the
event loop (i.e. returning from this function).
If I add an occasional lcb_wait() at some interval of calls before returning, this does not happen, but of course I am blocked
during that call, so it is not really a solution.
This appears to be related to event overflow, but I am not sure. We are not talking about millions of calls, maybe 100k-ish over the network,
not loopback.
Activity
Michael Leib
added a comment -
Your "value" is much smaller in length than mine and I think that might be part of the issue. It would not be easy for me to create a simple
test case because I am using a wrapper library, but I will take the time to do so if you cannot recreate the problem, as this is
a big issue for me: lcb_wait() fires off all pending events, not just those added by libcouchbase but application ones as well,
which should not fire until control is given back to the loop.
Can you try this again without using "foo" as the value? My values are in the neighborhood of a 4k blob per key.
If it still works for you, I will make test code. Please let me know.
Sergey Avseyev
added a comment -
OK, but you mentioned queueing GET operations, so I thought the value doesn't matter here. But I will try.
Mark Nunberg
added a comment -
So this actually made me wonder a bit.
You say you are using an entirely asynchronous event loop, in which case, why are you calling lcb_wait at all?
You may also need to adjust the actions of run_event_loop/stop_event_loop in your iops plugin as well.
lcb_wait is intended for synchronous operation only, and I wouldn't recommend calling it from an async program (unless, as I mentioned, you ensure run_event_loop/stop_event_loop are functioning as intended).
Michael Leib
added a comment -
Mark -
I am aware of this and I do not wish to use lcb_wait() (outside of initial connection checks).
However, if I do not, I get a network error and then I'm not able to continue.
This was the only way to get it to work. I currently have it run in intervals based on how many lcb_get() calls
have been fired off.
Michael
Mark Nunberg
added a comment -
A traceback from the callback, if possible, would be rather helpful. I cannot see anywhere in the code where a NETWORK_ERROR is returned as the value without there actually being a socket error.
I have just tried to reproduce your scenario by batching 500k commands (individual calls) with a 4k value blob (and calling one big lcb_wait at the end). The scenario was not asynchronous, but it was using libev, and I have not seen any errors.
If you are getting this error in the error callback, it means the problem is with the REST connection.
Anyway, a traceback would really help, if possible (basically break where your error callback is called).
Michael Leib
added a comment -
I am using stock beta2 - Please let me know if this helps at all....
Michael
(gdb) where
#0 pcollector_error (cback=0x66cca0, error=LCB_NETWORK_ERROR, error_string=0x45713a "Network error",
error_string1=0x45713a "Network error") at pcollector_main.c:135
#1 0x000000000040972a in LCB_Error (cbhandle=0x672370, error=LCB_NETWORK_ERROR, errinfo=0x45713a "Network error") at liblcb.c:75
#2 0x000000000040987e in LCB_Get (cbhandle=0x672370, cookie=0x6713b0, error=LCB_NETWORK_ERROR, resp=0x7fffffffd560) at liblcb.c:93
#3 0x000000000043d2d9 in lcb_purge_single_server (server=0x675ac0, error=LCB_NETWORK_ERROR) at src/server.c:138
#4 0x000000000043d976 in lcb_failout_server (server=0x675ac0, error=LCB_NETWORK_ERROR) at src/server.c:288
#5 0x0000000000443d6e in lcb_server_event_handler (sock=14, which=2, arg=0x675ac0) at src/event.c:301
#6 0x000000000040a5f3 in handler_thunk (loop=0x66c8c0, io=0x66e560, events=1)
at ../libcouchbase/plugins/io/libev/plugin-libev.c:208
#7 0x0000000000433fac in call_pending (loop=0x66c8c0, flags=0) at ev.c:1749
#8 ev_loop (loop=0x66c8c0, flags=0) at ev.c:2084
#9 0x000000000040bfd7 in main (argc=4, argv=0x7fffffffe9e8, envp=0x7fffffffea10) at pcollector_main.c:445
Mark Nunberg
added a comment -
Aha, so the error state is actually coming from the get callback and not the special error callback.
Looking at the code path, it seems there may be buffer issues which are causing some packets to be malformed (this would not be the first time I've seen issues with scheduling many commands).
If you were indeed having a network error, it would come from do_fill_input_buffer (in which failout_server() would be called).
This was supposedly resolved in a bug regarding ringbuffer error handling.
I think the problem is ultimately connected to the large buffer size (and lcb_wait naturally acts as some kind of 'barrier' to prevent the buffer from getting too large).
As a workaround, instead of partitioning the calls by lcb_wait, can you perhaps schedule them asynchronously using timers (i.e. invoke 500k commands, and then schedule another timer that will dispatch another callback and invoke 500k more, and so on)? If I'm right, this should help alleviate your issue without having to use lcb_wait to block your program.
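The timer-based batching suggested above could look roughly like the sketch below. It is only an illustration under stated assumptions: issue_get() is a hypothetical stand-in for the real lcb_get() call, struct batch_state and dispatch_batch() are invented names, and the timer itself (for example a libev ev_timer whose callback calls dispatch_batch()) is left out so the logic stays self-contained.

```c
#include <stddef.h>

/* Hypothetical stand-in for scheduling one get; a real program would
 * call lcb_get() here. It only counts calls so the sketch is testable. */
static size_t issued_total = 0;

static void issue_get(const char *key)
{
    (void)key;
    issued_total++;
}

/* State carried between timer ticks (all names are illustrative). */
struct batch_state {
    const char *const *keys; /* keys to request; must stay valid across ticks */
    size_t nkeys;            /* total number of keys */
    size_t next;             /* index of the next key to issue */
    size_t batch_size;       /* maximum number of commands issued per tick */
};

/* Body of the timer callback: issue at most one batch, then return
 * nonzero if the timer should be re-armed for another batch. */
static int dispatch_batch(struct batch_state *st)
{
    size_t end = st->next + st->batch_size;
    if (end > st->nkeys) {
        end = st->nkeys;
    }
    while (st->next < end) {
        issue_get(st->keys[st->next++]);
    }
    return st->next < st->nkeys;
}
```

Control returns to the event loop between batches, so responses can be drained before more commands are queued. The cost is that the key list has to outlive the function that produced it.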
Michael Leib
added a comment -
Not easily, and here is why....
The keys I need to get from CB are being delivered in an async result callback from a query to PostgreSQL via libpq.
So, if I wanted to "chunk" the lcb_get() calls, I would need to store these results myself, because as soon as I return from my
callback, the postgres results are going to disappear (since they were delivered) and I won't be able to get them again
to move on to the next batch.
Or, alternatively, do the same thing on the other side and make all the commands for lcb_get() in advance and then either
queue them or pull them off a list.
This is a huge memory requirement and/or a chunk of code that really shouldn't be required.
I can re-create this at will with my current code....what can I do to actually help fix this issue? I'm sure it's not going away
on its own (and if it did, then we should all start to worry!)
Mark Nunberg
added a comment -
Yes, I totally agree that the code should be fixed and that 'chunking' the commands is not the way to go about making a solution; however, I'm unable to reproduce the issue on my own.
Right now I'm trying to ascertain that this is indeed a buffer handling issue; hence the request to 'chunk this up'.
Let me post a summary of the 'root symptom' (based on the backtrace) here just to outline what's happening.
If you look at the file event.c (I know you are using 2.0.0-beta2 and the link is to the current master):
https://github.com/couchbase/libcouchbase/blob/9e90ce8ece72dd3eaad0ca0fcbe9054aecc411fd/src/event.c
- You'll see that line 301 (which is in your code path) is called.
- This is called only if do_read_data returns nonzero.
- do_read_data returns nonzero if either there is an issue in parse_single (this is the code which handles the base protocol decoding) or there was an issue in do_fill_input_buffer (actually filling up the input buffer).
- Assuming this is a problem in do_fill_input_buffer, the only possible error scenarios I see are the following:
* There is an actual network issue, in which case there's nothing we can do here.
* Your application is making use of signals and the read loop (which is probably very long as you're using a lot of data) is being interrupted by EINTR.
Try putting a 'return 0' in the case of EINTR and see if this goes away (right now it falls out of the switch statement and ends up returning 1, which is nonzero, which is an error on line 301).
Michael Leib
added a comment -
Ok, hopefully this helps...makes sense to me (and I hate socket programming)
In do_fill_input_buffer(), the switch (c->instance->io->v.v0.error) is falling into the default case because the value is
#define EAGAIN 11 /* Try again */
Looks like some retry code needs to be implemented as this is common with non-blocking sockets.
Please advise.
Michael
Michael Leib
added a comment -
Ok, my bad...didn't realize EAGAIN is the same as EWOULDBLOCK, which is already covered....
so, this is messing up someplace else but at least you know it's not EINTR
Michael
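For readers wondering why EAGAIN was already covered: on Linux, EAGAIN and EWOULDBLOCK share the value 11, and POSIX allows them to be equal. A small sketch of a check that is safe on platforms where they are equal and on those where they differ is shown below; is_would_block() is a hypothetical helper name, not a libcouchbase function.

```c
#include <errno.h>

/* Return 1 if err means "no data available right now" on a
 * non-blocking socket, 0 otherwise. */
static int is_would_block(int err)
{
    switch (err) {
    case EWOULDBLOCK:
#if EAGAIN != EWOULDBLOCK
    case EAGAIN: /* only a distinct value on some platforms */
#endif
        return 1;
    default:
        return 0;
    }
}
```

The preprocessor guard is needed because two case labels with the same value in one switch do not compile.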
Michael Leib
added a comment -
The error is coming from here:
void lcb_server_event_handler(lcb_socket_t sock, short which, void *arg)
{
    lcb_server_t *c = arg;
    (void)sock;
    lcb_update_server_timer(c);
    if (which & LCB_READ_EVENT) {
        if (do_read_data(c) != 0) {
            /* TODO stash error message somewhere
             * "Failed to read from connection to \"%s:%s\"", c->hostname, c->port */
            lcb_failout_server(c, LCB_NETWORK_ERROR); <<<<<=========================
            return;
        }
    }
What more can I tell you?
Mark Nunberg
added a comment -
Hrrm.. this is interesting. So you're saying do_fill_input_buffer is not returning an error status. This means the problem is inside the packet decoding.
I don't want to jump the gun and say this is a server issue, as I'm quite sure the server has traveled this path before.
So you're saying you get the error when io->error == EWOULDBLOCK?
I'll need to poke around parse_single and see what might be failing. There might also be reentrancy issues.
Basically, the callback is invoked from parse_single, but the buffer is only "cleaned up" *after* the callback returns.
This means we might be doing weird things with the buffers.
Just wondering (and I'm going to try to replicate this tomorrow, now that I might have a better idea of what's happening here):
Are you issuing any batched requests (i.e. gets where you are passing more than a single command to libcouchbase) before this?
Are any of your results not found?
Does this only occur with get commands, or does this happen with sets as well?
Basically there is special handling when we receive responses for batched gets.
It would be awesome if you can see where parse_single is returning -1 (which it now seems it is).
It also seems that in all cases where that function returns -1, the "global" error handler (i.e. set_error_callback) should tell you something as well.
Sergey Avseyev
added a comment -
What is the cluster config? Are all the nodes always healthy?
Sergey Avseyev
added a comment -
Is it possible you are reaching the maximum request number?
https://github.com/couchbase/libcouchbase/blob/9e90ce8ece72dd3eaad0ca0fcbe9054aecc411fd/src/event.c#L223
In this case, chances are that the parse_single() return value bubbles up. It returns a positive number on success, but the caller treats any nonzero value as an error.
Sergey Avseyev
added a comment -
If so, it should be fixed by having the function return zero instead of rv at the end:
https://github.com/couchbase/libcouchbase/blob/9e90ce8ece72dd3eaad0ca0fcbe9054aecc411fd/src/event.c#L247
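The return-value problem described in the last two comments can be reproduced in isolation. The sketch below is a simplified model and not the actual event.c code: parse_one() is a hypothetical stand-in for parse_single() (1 = packet parsed, 0 = more data needed, -1 = error), and max_ops models the per-call request limit. When the limit is hit, the buggy variant returns the parser's last positive result, which a caller that checks for != 0 reports as a network error.

```c
/* Hypothetical stand-in for the packet parser: returns 1 when a packet
 * was parsed, 0 when more data is needed, -1 on a protocol error.
 * A negative *packets_left simulates a malformed packet. */
static int parse_one(int *packets_left)
{
    if (*packets_left < 0) {
        return -1;
    }
    if (*packets_left == 0) {
        return 0;
    }
    (*packets_left)--;
    return 1;
}

/* Buggy variant: if the loop stops because max_ops was reached, rv
 * still holds the parser's positive "success" value. */
static int read_data_buggy(int *packets_left, int max_ops)
{
    int rv = 0;
    int ops = 0;
    while (ops < max_ops && (rv = parse_one(packets_left)) > 0) {
        ops++;
    }
    return rv;
}

/* Fixed variant: errors are returned immediately, and the function
 * returns zero at the end regardless of why the loop stopped. */
static int read_data_fixed(int *packets_left, int max_ops)
{
    int ops;
    for (ops = 0; ops < max_ops; ops++) {
        int rv = parse_one(packets_left);
        if (rv < 0) {
            return -1; /* genuine parse failure */
        }
        if (rv == 0) {
            break; /* need more data */
        }
    }
    return 0;
}
```

With 5 queued packets and a limit of 3, the buggy variant returns 1 while the fixed one returns 0. This matches the observation that the failure only appears once enough responses are pending.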
Sergey Avseyev
added a comment -
Fixed in http://review.couchbase.org/22310
Michael Leib
added a comment -
I have verified that this indeed solves the problem. YEAH!!!!
I am getting timeouts now, but that is unrelated and I will need to increase the wait time - I am beating the server pretty hard.
Thanks to all -
Michael
It is passing when I'm running it with:
LIBCOUCHBASE_EVENT_PLUGIN_NAME=libev make check