We are creating 6 continuous replicators in our C# application. They sync up our gateway with a CBLite 2.1.2 database.
They start up fine and actively sync the changes. After a few minutes, however, they all get "unclean socket disconnect" error messages associated with an errno of 108 (Unknown error). Then they successfully restart, work for a while, and fail again. We do seem to eventually replicate the proper data.
Does this error signal a problem with our system, or is it a side effect of continuous replication? Do we have too many replicators running at the same time? Why is this socket disconnect happening, and what can we do to stop it?
Below is the verbose logging of the cycle of one of our replicators.
This log message: “(WebSocketWrapper) [3] Failed to read from stream too many times, signaling closed” sounds like the error comes from the .NET-specific WebSocket client code, not from the core replicator itself. @Sandy_Chuang or @borrrden should be able to find the source of this error.
The exception POSIXDomain / 108 looks weird, as there doesn’t seem to be a well-known errno value for 108 … at least, there isn’t one in <sys/errno.h> on my Mac.
Does Sync Gateway log any errors at the time this occurs? The problem might be on the server side, or at least we might get some clues by seeing the messages there.
We have a proxy load balancing between 2 SG servers. The admin here says we are supposed to be sticky: a given client's requests will always be sent to one server or the other.
Looking at the logs for both SGs, I see my ID in both files, so perhaps that is an issue.
I also see the following message at about the same time the client reports its 108 error:
http: TLS handshake error from 10.3.1.160:42238: tls: client offered an unsupported, maximum protocol version of 300
I get multiple messages like this - I assume one per replicator. The port number changes for each one.
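For what it's worth, Sync Gateway is written in Go, and Go's TLS stack logs the offered protocol version as a hex value. Assuming that interpretation is right, the "300" in that log line is 0x0300, which is the wire code for SSL 3.0. A quick lookup table (a sketch, not anything from the SG codebase) makes the mapping concrete:

```python
# Map a TLS record-layer version code to a protocol name. The codes here
# are the standard on-wire values; the hex interpretation of the logged
# "300" is an assumption about how Go's crypto/tls formats the error.
TLS_VERSIONS = {
    0x0300: "SSL 3.0",
    0x0301: "TLS 1.0",
    0x0302: "TLS 1.1",
    0x0303: "TLS 1.2",
    0x0304: "TLS 1.3",
}

def decode_tls_version(hex_string: str) -> str:
    """Interpret a logged version like '300' as the hex code 0x0300."""
    return TLS_VERSIONS.get(int(hex_string, 16), "unknown")

print(decode_tls_version("300"))  # the value from the SG log line
```

If that reading is correct, something on the client side of that connection (possibly a proxy health check rather than the replicator itself) is offering at most SSL 3.0, which modern servers reject.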
POSIX codes are hideously nonstandard above a certain point. In fact, I went through the trouble of documenting them all in the .NET code base. As you can see, 1-10, 12-14, 16-25, and 27-34 are consistent between platforms, but the rest are not.
108 is a Windows-specific POSIX code that represents ECONNRESET (FYI, that's 54 on macOS and 104 on Linux). However, it is not being thrown by the system in this case. The log message you see means that there was a failure reading from the websocket stream, and .NET is interpreting it as a reset connection so that the replicator can try again.
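To make the divergence above concrete, here is a small illustration using exactly the values from this post (the platform names and helper function are just for illustration):

```python
# errno values for ECONNRESET per platform, as described above.
# 108 is the value the .NET replicator log reported on Windows.
ECONNRESET_BY_PLATFORM = {
    "windows_dotnet": 108,
    "macos": 54,
    "linux": 104,
}

def explain_posix_code(code: int) -> str:
    """Name the platform(s) on which this errno value means ECONNRESET."""
    matches = [p for p, v in ECONNRESET_BY_PLATFORM.items() if v == code]
    return ", ".join(matches) or "not ECONNRESET on any of these platforms"

print(explain_posix_code(108))  # -> windows_dotnet
```

This is also why 108 showed up as "Unknown error": it is outside the range that is consistent across platforms, so a generic errno-to-string lookup has nothing sensible to say about it.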
I am not a network guy, so first off: is this impacting the replicators? I.e., do they simply reset and are able to pull/push as needed until they go idle for a while and the connection is reset again?
Or do we need to get our network guy in on this because it's a failure on our network interface?
If you are seeing a lot of these then it is likely a problem with the network. Specifically a lot of network engineers like to close connections that are open for a long time and that’s not a good thing for web sockets since they are designed to stay open for the long term. The majority of the time people complain about this kind of stuff, it’s because of a proxy being overzealous or something.
Is the default heartbeat for continuous replicators still 30 seconds? Our client-side software is not setting any parameters, so I assume we are taking the default configuration. Again, Couchbase Lite v2.1.2.
If you have a server that shuts off incoming connections due to inactivity, then it is very possible that the replicator will be shut off prematurely. "Long poll" is not a relevant term anymore. Web sockets themselves are designed to get rid of HTTP's limitations around long-running connections, so basically a connection is considered open until one side receives a message saying it is closed, or there is some error in the underlying transport (TCP). For this reason, Sync Gateway will send a ping to the client every 300 seconds (or maybe the reverse, or both, but at least one side does it) to make sure the other end is still there. If it has silently disconnected, the ping will fail (or no pong will be received in a timely manner) and the connection will be torn down.
TL;DR: Yes, 300 seconds is still the case and you should not have any timeouts less than that. Web sockets are better at taking care of themselves than HTTP connections are.
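As a rough sanity check of the advice above (a sketch of the arithmetic, not of how SG actually schedules its pings): a quiet web socket only sees traffic at each ping, so any intermediary whose idle timeout is shorter than the ping interval can tear the connection down between pings:

```python
SG_PING_INTERVAL_SECONDS = 300  # the heartbeat interval mentioned above

def connection_survives_idle_timeout(idle_timeout_s: float,
                                     ping_interval_s: float = SG_PING_INTERVAL_SECONDS) -> bool:
    """An idle connection is only 'active' once per ping interval, so it
    survives only if the intermediary's idle timeout is longer than that."""
    return idle_timeout_s > ping_interval_s

print(connection_survives_idle_timeout(600))  # timeout longer than the ping interval: survives
print(connection_survives_idle_timeout(120))  # overzealous 2-minute proxy timeout: dropped
```

So if your proxy or load balancer has, say, a 60- or 120-second idle timeout, that alone would explain the periodic "unclean socket disconnect" / restart cycle.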
Wireshark is able to see it if it is set up properly (I'm not an expert in what that entails, other than it needs to be running from the beginning of the replication) and you are not using SSL (or you set up Wireshark to be able to decode it, which is complicated).
Thanks for your reply.
I'm not using SSL. I used Wireshark to capture the network packets. The official documents say that Lite or SG will send a heartbeat every 300s, but I haven't captured any heartbeat packets at a 300s interval. I used Wireshark's capture options to filter on my server address and captured for about ten minutes, yet I can't capture any packets being sent. Is there a problem with my procedure?