How to deal with failures in Changes Worker Pattern

Hello,

From your Sync Gateway doc page:

Your workers should be idempotent or you should track the last_seq somewhere durable.

I think this is related to the fact that if the server where the worker runs goes offline, all changes that happen during the downtime will be lost.
So the two solutions suggested are:

  1. Store the last_seq somewhere durable
  2. Have an idempotent worker

Starting a project from scratch, I would like to implement both for my worker.

Store the last_seq somewhere durable
This is the way you handle the problem in the CouchChat-iOS sample (link found at the end of the doc page).
It is a very outdated example, but still useful to understand the logic.

Essentially you first connect directly to the bucket using the Couchbase Node SDK.
Then you retrieve the last processed sequence using the SDK (in the sample it is stored under the _push:seq key),
and you start following document changes from that seq.
When a document that needs to be handled arrives, you run some business logic (e.g. sending a push notification to mobile apps) and store the new sequence number in the bucket using the SDK.
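
To check my understanding, here is a minimal sketch of that flow (not the CouchChat code). For brevity it follows the Sync Gateway _changes feed over REST instead of using the Node SDK, and it persists the checkpoint to a local file instead of the bucket; SG_URL, DB and the file path are placeholders I made up, and Node 18+ (global fetch) is assumed:

```js
// Minimal sketch: read the last stored sequence, follow the _changes feed from
// there, and advance the stored sequence only after a change has been handled.
const fs = require('fs/promises');

const SG_URL = 'http://localhost:4985';   // Sync Gateway admin (or public) port
const DB = 'mydb';
const CHECKPOINT_FILE = './last_seq.json';

async function readLastSeq() {
  try {
    return JSON.parse(await fs.readFile(CHECKPOINT_FILE, 'utf8')).last_seq;
  } catch {
    return 0;                             // never run before: start from the beginning
  }
}

async function writeLastSeq(seq) {
  await fs.writeFile(CHECKPOINT_FILE, JSON.stringify({ last_seq: seq }));
}

async function run(handleChange) {
  let since = await readLastSeq();
  for (;;) {
    const res = await fetch(
      `${SG_URL}/${DB}/_changes?since=${since}&feed=longpoll&include_docs=true`);
    const { results, last_seq } = await res.json();
    for (const change of results) {
      await handleChange(change);         // business logic (push notification, email, ...)
      await writeLastSeq(change.seq);     // stored only after the handler finished
    }
    since = last_seq;
  }
}
```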

Now this seems to work, but we are in a Node environment, based on a non-blocking I/O model.
Since every operation uses a callback (writing/reading something from a bucket, sending an email, sending an API request, etc.), I suppose that a changes feed library like follow will not block the feed until the business logic ends.

So this example could happen:
Worker starts from seq=0
Doc 1 created [seq=1]
Doc 2 created [seq=2]
Storing [seq=2]
Storing [seq=1]
Worker stops
Worker starts from seq=1
Doc 2 received again [seq=2] ← This is a serious problem!
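
To make this interleaving concrete, here is a purely illustrative sketch (the handler and the delays are made up): when the feed fires the callback without awaiting the asynchronous business logic, the sequence of doc 1 can be stored after the sequence of doc 2:

```js
// Illustrative only: why checkpoints can be written out of order when the
// changes callback is fired without awaiting the business logic (the typical
// `follow`-style callback). storeSeq and the delays are made up.
const storeSeq = async (seq) => console.log(`Storing [seq=${seq}]`);

async function processChange(change) {
  // pretend doc 1's business logic (e.g. a slow e-mail API) takes longer than doc 2's
  const delay = change.seq === 1 ? 200 : 50;
  await new Promise((resolve) => setTimeout(resolve, delay));
}

function onChange(change) {
  // fire-and-forget: the feed does NOT wait for this promise to settle
  processChange(change).then(() => storeSeq(change.seq));
}

// simulate the feed delivering seq 1 then seq 2
onChange({ seq: 1 });
onChange({ seq: 2 });
// prints "Storing [seq=2]" then "Storing [seq=1]" — the durable value ends up
// at 1, so a restart replays doc 2, exactly as in the trace above.
```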

How can we be sure that the stored seq is the real last one?

Have an idempotent worker
From Wikipedia, a function is called idempotent if f(f(x)) = f(x), meaning that the result does not change whether the function is applied once or multiple times.
From a mathematical point of view I understand what it is; from a development point of view I have absolutely no idea how to implement it.
Are there currently any examples of idempotent workers? And if my worker is idempotent, could I scale it horizontally across multiple servers?

UPDATE
While searching about idempotence I found a wonderful article [link] that helped me understand what it means to create an idempotent worker.
In short:

  • Always keep in the document the state of every business operation you have performed (sending an email, a push notification, etc.)
  • Always check that the business operation has not already been done before you execute it; this implies that you always need to retrieve the latest revision of the document (see the sketch below)
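
To check my understanding, here is a rough sketch of a handler following those two rules. The notification_sent field, sendPushNotification() and the SG_URL/DB constants are placeholders I made up; Node 18+ fetch is assumed:

```js
// Rough sketch of an idempotent handler following the two rules above.
const SG_URL = 'http://localhost:4985';
const DB = 'mydb';
const sendPushNotification = async (doc) => { /* call your push provider here */ };

async function handleChange(change) {
  // Rule 2: always work from the latest revision of the document
  const doc = await (await fetch(`${SG_URL}/${DB}/${change.id}`)).json();

  // Skip if this business operation was already performed
  if (doc.notification_sent) return;

  await sendPushNotification(doc);          // the business operation itself

  // Rule 1: record in the document that the operation has been done
  doc.notification_sent = true;
  await fetch(`${SG_URL}/${DB}/${change.id}?rev=${doc._rev}`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(doc),
  });
  // Note: a crash between the operation and this PUT is still possible,
  // which is the small window mentioned below.
}
```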

So if my worker is idempotent, the first problem resolves itself: the business operation will not trigger anything because I will check whether it was already done. However, there is still a small possibility that the worker stops just after the operation has finished and before the state has been changed.

Anyway, are there any common problems if we try to scale an idempotent worker across multiple servers?
If yes, are there any solutions?

@neokree,

Here is the official replication method for CBL. So creating a worker similar to it would be best. You want to do “PULL”.

At a high level, this is what CBL and SG do when performing a PULL replication.

To create / update checkpoints you can use SG’s _local docs; this is what CBL does.
Docs: Couchbase Capella for Mobile Developers
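
For example, a rough sketch of storing a checkpoint in a _local doc over the SG REST API could look like this (SG_URL, DB and the checkpoint id are placeholders; whether the current _rev has to be supplied when updating a local doc can depend on the SG version; Node 18+ fetch assumed):

```js
// Sketch: keep the worker checkpoint in a Sync Gateway _local doc,
// roughly what CBL does for its own pull-replication checkpoints.
const SG_URL = 'http://localhost:4985';
const DB = 'mydb';
const CHECKPOINT = '_local/order-worker';

async function getCheckpoint() {
  const res = await fetch(`${SG_URL}/${DB}/${CHECKPOINT}`);
  return res.status === 404 ? null : res.json();
}

async function saveCheckpoint(lastSeq) {
  const existing = await getCheckpoint();
  const body = { last_seq: lastSeq };
  // depending on the SG version, updating a local doc may require the current rev
  if (existing && existing._rev) body._rev = existing._rev;
  await fetch(`${SG_URL}/${DB}/${CHECKPOINT}`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
}
```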

Thanks for your response, but I think that you have misunderstood my question…

I asked whether there exists any real-world example (possibly not outdated) of a Changes Worker associated with SG for handling business logic, and whether it is possible to scale it across multiple servers (to speed up business logic processing and avoid downtime). For example, I mean a project on GitHub or some similar platform.

An example could be an e-commerce site, which has a Cart document type (created by the user) that needs business logic to generate the Order document.
This obviously cannot be managed directly by the client (a hacker could set all product prices to 0 €), and it cannot be handled by SG validation either, since the SG sync function cannot load other documents (the products) while validating the cart doc. So the only way I see is to use a Changes Worker, treating the cart doc as a state machine to generate the order.
But what if the Changes Worker fails? In that case no user would be able to pay for their orders until the service comes back online (since the payment should be made after the order is created).

For this reason I thought it could be a good idea to scale the service workers horizontally across multiple servers, but I don’t know if this is the right solution (since all documents would be handled concurrently by different servers, wasting CPU and possibly creating conflicts).
Do you think there is a better solution?

@neokree,

You can filter by channel(s), just like CBL does, and have different scripts process different parts of the Sync Gateway feed.

Here is a simple example.
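
Roughly, each worker could follow only its own channel(s) with the built-in by-channel filter on the _changes feed; a sketch (SG_URL, DB and the channel names are placeholders, Node 18+ fetch assumed):

```js
// Sketch: one worker instance follows only the channels it is responsible for.
const SG_URL = 'http://localhost:4985';
const DB = 'mydb';

async function followChannels(channels, since, handleChange) {
  for (;;) {
    const url =
      `${SG_URL}/${DB}/_changes?feed=longpoll&include_docs=true` +
      `&filter=sync_gateway/bychannel&channels=${channels.join(',')}&since=${since}`;
    const { results, last_seq } = await (await fetch(url)).json();
    for (const change of results) await handleChange(change);
    since = last_seq;
  }
}

// e.g. one instance processes carts while another processes notifications:
// followChannels(['carts'], 0, processCart);
```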

But filtering by documents will not prevent my service from crashing; it will only split the problem across different services.
Sure, reducing the amount of work required of each service will reduce the workload, but it will not help in case of server maintenance.

You could do something like this, where you have one app in standby listening to the _changes feed and polling the global checkpoint. So if app 7.7.7.7 goes down then 8.8.8.8 takes over.
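
A rough sketch of that standby logic (readCheckpointDoc and startWorker are stand-ins for your own checkpoint read and worker start-up; the timings are arbitrary):

```js
// Sketch: the active worker heartbeats into a shared checkpoint doc; the
// standby polls it and promotes itself when the heartbeat goes stale.
const STALE_AFTER_MS = 30_000;

const readCheckpointDoc = async () =>        // e.g. GET the shared _local checkpoint doc
  ({ heartbeat: Date.now(), last_seq: 0 });  // placeholder implementation
const startWorker = (since) => console.log(`standby taking over from seq ${since}`);

async function standbyLoop() {
  for (;;) {
    const cp = await readCheckpointDoc();
    const stale = !cp || Date.now() - cp.heartbeat > STALE_AFTER_MS;
    if (stale) {
      // the active worker (e.g. 7.7.7.7) looks dead: take over from its last_seq
      startWorker(cp ? cp.last_seq : 0);
      return;
    }
    await new Promise((resolve) => setTimeout(resolve, 5_000)); // poll again in 5 s
  }
}
```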

Here is an example of using a changes worker pattern which stores its last sequence in a local database:

Note that if there is a catastrophic error while processing the change(s), it will not push the sequence forward, and so if it is restarted it will start where it left off.
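
In other words, a sketch of that behaviour (the helper names are placeholders) would be:

```js
// The sequence is only pushed forward after the handler succeeds, so an error
// leaves the checkpoint where it was and the same change is retried on restart.
async function processBatch(results, handleChange, writeLastSeq) {
  for (const change of results) {
    try {
      await handleChange(change);
      await writeLastSeq(change.seq);     // reached only on success
    } catch (err) {
      console.error(`failed at seq ${change.seq}; will resume here after restart`, err);
      throw err;                          // stop without advancing past the failure
    }
  }
}
```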

Anyway, are there any common problems if we try to scale an idempotent worker across multiple servers?

If the workers are truly idempotent, then you could run copies of the worker across multiple servers, and the only harm of multiple workers processing the same changes feed item would be extra unnecessary work. You’d probably want to come up with a mechanism to prevent that extra work without accidentally missing items.

A common pattern for this type of problem is to have workers “race” to process an item: they essentially check the item out with a temporary lease while they try to process it. Imagine two workers, each running on its own server, with the following logic (a sketch follows below):

  • Wait for change
  • Upon receiving doc, update the “processing_state” field in the doc to “pending”
  • If 409 conflict error, reload the latest version of the doc
  • If processing_state field is already pending, then ignore this change (another worker grabbed it)
  • Otherwise, go back to the “update the processing_state field to pending” step and repeat
  • Once finished processing, update the processing_state to “finished”

and then have a separate garbage collector process that looks for any docs that have been in the “pending” state for too long, and resets their state so that the changes workers will see them again and try to reprocess them.
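
A hedged sketch of that race, using the document revision as the optimistic lock (the processing_state/claimed_at field names and SG_URL/DB are assumptions; Node 18+ fetch assumed):

```js
// Sketch: workers race to claim a doc by flipping processing_state to "pending";
// a 409 conflict means another worker won, so reload and re-check.
const SG_URL = 'http://localhost:4985';
const DB = 'mydb';

async function claimAndProcess(docId, doBusinessLogic) {
  for (;;) {
    // load the latest revision
    const doc = await (await fetch(`${SG_URL}/${DB}/${docId}`)).json();

    // another worker already grabbed (or finished) it: ignore this change
    if (doc.processing_state === 'pending' || doc.processing_state === 'finished') return;

    // try to check it out with a temporary lease
    doc.processing_state = 'pending';
    doc.claimed_at = new Date().toISOString();
    const claim = await fetch(`${SG_URL}/${DB}/${docId}?rev=${doc._rev}`, {
      method: 'PUT',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(doc),
    });
    if (claim.status === 409) continue;       // lost the race: reload and re-check

    const { rev } = await claim.json();       // we own the lease now
    await doBusinessLogic(doc);

    // mark the work as done
    doc._rev = rev;
    doc.processing_state = 'finished';
    await fetch(`${SG_URL}/${DB}/${docId}?rev=${rev}`, {
      method: 'PUT',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(doc),
    });
    return;
  }
}
```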

Or rather than keeping the state variable(s) in the doc itself, a dedicated tracking doc could be used, and document Time-To-Live (TTL) semantics could be leveraged to garbage collect jobs that get stuck in the pending state due to worker failure.
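
A sketch of that variant, assuming your Sync Gateway version supports the _exp expiry property (the job:: id scheme, LEASE_SECONDS and SG_URL/DB are made-up placeholders; Node 18+ fetch assumed):

```js
// Sketch: a separate "job" doc marks a cart as claimed, and document expiry
// garbage-collects the claim automatically if the worker dies mid-processing.
const SG_URL = 'http://localhost:4985';
const DB = 'mydb';
const LEASE_SECONDS = 300;

async function tryClaim(docId) {
  const res = await fetch(`${SG_URL}/${DB}/job::${docId}`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      claimed_at: new Date().toISOString(),
      _exp: new Date(Date.now() + LEASE_SECONDS * 1000).toISOString(),
    }),
  });
  // 201: we own the lease. 409: another worker claimed it first — and if that
  // worker crashes, the claim simply expires and the job becomes claimable again.
  return res.status === 201;
}
```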