A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.

Leslie Lamport, 1987

Those prescient words were uttered in 1987, and they remain true in 2018. If anything, they understate the problem.

At Couchbase, we’ve been on a mission to deliver an Engagement Database Platform. If you are building an engaging system for this modern world, it must be responsive. If your app is slow to respond, an instant message or a notification about a new “insty” may distract the user working with your app. Your app’s lack of responsiveness means the loss of the user.

There is a challenge though. On a distributed system where there are many cogs in the works, if only one is slowing things down (and possibly only occasionally at that!), how do you identify it?

Gonna Need a Better Boat

As many of our users can attest, Couchbase is already quite good at finding the problem. We have long had a set of diagnostic tools, whether it’s the built-in metrics in the Java SDK, the advanced metrics and profile information in N1QL, or the thresholds and logging introduced in Couchbase Server 5.0.

Our industry is trending toward being more distributed, with more abstraction layers from more cloud and container providers. At the same time, modern systems tend to operate at extremely low latencies in steady state, but system tolerances on latencies are expected to be about the same when there is occasional congestion or an error. Going from tens of microseconds to timeouts in seconds (the TCP specification calls for waiting 1 second on a TCP retransmit!) is like suddenly running into a sheer cliff on a mountain.
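To put rough numbers on that cliff, here is a small back-of-the-envelope sketch. The figures are illustrative assumptions, not measurements: a 50 microsecond steady state, a 1 second stall, and one stalled operation in every thousand.

```python
# All numbers below are illustrative assumptions, not measurements.
FAST_S = 50e-6          # steady-state latency: 50 microseconds
SLOW_S = 1.0            # a stalled operation: 1 second
SLOW_FRACTION = 0.001   # 1 in 1,000 operations stalls


def mean_latency(fast_s, slow_s, slow_fraction):
    """Average latency when a small fraction of operations hits the slow path."""
    return (1 - slow_fraction) * fast_s + slow_fraction * slow_s


mean_s = mean_latency(FAST_S, SLOW_S, SLOW_FRACTION)
print(f"mean latency: {mean_s * 1e6:.0f} us "
      f"({mean_s / FAST_S:.0f}x the steady state)")
```

Under these assumptions the average lands at roughly one millisecond, about 21 times the steady state, even though 99.9% of operations were fast. An average alone would never tell you which operation stalled.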

In fact, I would argue you have probably experienced this yourself. Many people have made a Skype/Hangout/Conference call with high definition video and high quality, stereo audio. But, I would also wager that 100% of them have seen a few frames of stuck video, the occasional video noise, and dropped or garbled audio.

When you are lucky enough to have wrung out many of the easy problems and your tools are no longer sufficient to find the cause of the next hard problem, that’s when you innovate new tools.

Use the Force, Developer

Innovating a solution does not necessarily mean starting with a blank slate though.

We believe in the idea that innovation happens elsewhere, and we believe it is in our interest to find passionate, like-minded individuals to collaborate on a solution.

The team and I did some research, and a set of research notes in Communications of the ACM last year was inspiring. This led us to the OpenTracing project, which is part of the Cloud Native Computing Foundation (CNCF). Couchbase is a member of the CNCF.

OpenTracing is working toward being a standardized API for distributed tracing. While we at Couchbase are not in the business of building tracing tools, we do have our own modest needs and if we can both add to and leverage the innovation of a community, we should.

Building on an Open solution also makes it possible to extend our work into a wider set of integrated pieces built on the same interfaces.

All You Have to Decide is What to do with the Timings we’ve Given to You*

With Couchbase Server 5.5, we are introducing a new capability we call Response Time Observability. Out of the box, this gives system deployers a very simple way to observe response times relative to a (tunable) threshold. The team carefully considered how to make this efficient and safe enough to always be on, and how to account for possible deployment complexities.
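To make the idea of observing response times against a threshold concrete, here is a minimal Python sketch. It is not the Couchbase implementation: the `ThresholdLog` class, its method names, the 0.5 second threshold, and the operation names are all hypothetical, chosen only for illustration.

```python
import time
from contextlib import contextmanager


class ThresholdLog:
    """Hypothetical helper: remembers only operations slower than a threshold."""

    def __init__(self, threshold_s=0.5, max_entries=10, clock=time.perf_counter):
        self.threshold_s = threshold_s    # assumed value, tunable
        self.max_entries = max_entries
        self.clock = clock
        self.slow_ops = []                # (duration_s, name), slowest first

    def record(self, name, duration_s):
        # Fast operations are dropped right away, so the steady-state cost is small.
        if duration_s < self.threshold_s:
            return
        self.slow_ops.append((duration_s, name))
        # Keep only the slowest max_entries operations to bound memory use.
        self.slow_ops.sort(reverse=True)
        del self.slow_ops[self.max_entries:]

    @contextmanager
    def observe(self, name):
        # Time the wrapped block and record it, even if the block raises.
        start = self.clock()
        try:
            yield
        finally:
            self.record(name, self.clock() - start)

    def report(self):
        return [f"{name}: {duration_s * 1000:.1f} ms"
                for duration_s, name in self.slow_ops]


log = ThresholdLog(threshold_s=0.5)
log.record("get user::1", 0.00004)   # under the threshold, not kept
log.record("query orders", 0.75)     # over the threshold, kept
with log.observe("upsert cart"):     # timed with the real clock
    pass
print(log.report())
```

The design point the sketch tries to capture is that work on the fast path is a single comparison, which is what makes it plausible to leave this kind of observation switched on all the time.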

Mike Goldsmith (who led the development of the sdk-rfc) describes in his blog the ThresholdLoggingTracer, as it’s known at a lower level, and how it leverages OpenTracing, whose API is still evolving and to which we’re contributing.

Then, in his blog, Michael Nitschinger describes how the Java SDK implements Couchbase’s ThresholdLoggingTracer, and how the (currently volatile) OpenTracing interface can be used by other tracing systems or even extended by users themselves to better observe what is happening in their systems.

Couchbase Server 5.5 is now available and it’ll be great to get feedback on the forums or in our issue tracker.

 

* okay, that particular reference might be a bit obscure, but someone out there will appreciate it!

Author

Posted by Matt Ingenthron, Senior Director, SDK Engineering, Couchbase

Matt Ingenthron is the Senior Director in Engineering at Couchbase where he focuses on the developer interface across SDKs, connectors and other projects. He has been a contributor to the memcached project, one of the maintainers of the Java spymemcached client, and a core developer on Couchbase.
