The recommendations here are under development and may change before implementation.

Overview

The REPLICA READ operation (known as CMD_GET_REPLICA in the server) allows a client to perform retrieval operations against one or more replicas only. Such a read is inconsistent: a replica may lag behind the active copy of the item. While it could be used for nearly any purpose, the only common use case is expected to be failure handling, when a known inconsistent read is acceptable.

Client API

Java

Note: Not correct, needs updating

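GetFuture<Object> resf = cbc.asyncGet("foo");
boolean isReplicaRead = false;
Object value = null;

try {
  // asyncGet only schedules the operation; a timeout surfaces when waiting on the future
  value = resf.get(5, TimeUnit.SECONDS);
} catch (TimeoutException ex) {
  // uh-oh, something went wrong, the server isn't there!
  // asyncReplicaGet is the proposed method name and may change
  resf = cbc.asyncReplicaGet("foo");
  isReplicaRead = true;
} finally {

  // do something useful
}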

.NET

PHP

Ruby

res = nil
is_replica_read = false

begin
  res = cbc.get("foo")
rescue Couchbase::Error::Timeout => ex
  res = cbc.get("foo", :replica => true)
  is_replica_read = true
ensure
  # do something useful
end

Recommended Implementation

These recommendations are preliminary.

When a client library is processing a request on behalf of the end user, it should generally walk the list of current replicas for the vbucket to which the given key maps.

The exchange then goes roughly as follows.

  • The application tries to retrieve an item; the client library cannot service the request and replies with an error or a timeout.
  • The application then requests a replica read through an API similar to those listed above.
  • The client library retrieves from the current configuration the list of all nodes (primary and replicas) for the requested item. For the duration of this function invocation, it continues to use this sequence of primary and replica locations even if the map changes in the meantime.
    • Optionally, the client may provide a function that returns an array, map, or other structure containing the given item and its CAS (and optionally its expiration) from every replica it can contact. This would allow applications to determine which version of an item is the most up to date, since replication is not guaranteed to happen in the order the replicas are listed in the configuration.
  • The client library then attempts to retrieve the item from each replica, in the order given in the configuration, until the item has been retrieved or the list of replicas has been exhausted. As soon as it receives a successful response from a replica, it returns that value to the calling application.
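
The walk described above can be sketched as follows. This is only an illustration under stated assumptions, not the client library's actual implementation: the `config` hash, its `:vbucket_map` layout (one array per vbucket holding node indexes as active first, then replicas, with -1 assumed to mark an unassigned slot), and the `fetch` callable are all hypothetical stand-ins.

```ruby
# Hypothetical sketch: `config` and `fetch` are stand-ins, not a real client API.
# config: { :vbucket_map => [[active, replica1, replica2, ...], ...] }
#         (-1 is assumed to mark an unassigned replica slot)
# fetch:  callable taking (node_index, vbucket, key); returns the value,
#         or nil when that replica cannot serve the item.
def replica_read(config, vbucket, key, fetch)
  # Snapshot the node list once; later config changes do not affect this call.
  nodes = config[:vbucket_map][vbucket].dup
  replicas = nodes.drop(1).reject { |idx| idx < 0 }

  replicas.each do |node|
    begin
      value = fetch.call(node, vbucket, key)
      # The first replica that answers wins.
      return value unless value.nil?
    rescue IOError, SystemCallError
      next # this replica is unreachable, try the next one in order
    end
  end
  nil # list of replicas exhausted
end
```

Returning the first successful response keeps memory use low, at the cost of possibly returning an older version than another replica holds.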

Implementation Constraints

REPLICA READ is available only in the binary protocol. It could be implemented in the ASCII protocol, but that would require changes in moxi and in additional clients.

Protocol level details

See the authoritative protocol documentation. This section serves only as a reference.

Request

  Byte/     0       |       1       |       2       |       3       |
     /              |               |               |               |
    |0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7|
    +---------------+---------------+---------------+---------------+
   0|  0x80         |  0x83         |  0x00         |  0x03         |
    +---------------+---------------+---------------+---------------+
   4|  0x00         |  0x00         |  0x00         |  0x00         |
    +---------------+---------------+---------------+---------------+
   8|  0x00         |  0x00         |  0x00         |  0x03         |
    +---------------+---------------+---------------+---------------+
  12|  0x00         |  0x00         |  0x00         |  0x00         |
    +---------------+---------------+---------------+---------------+
  16|  0x00         |  0x00         |  0x00         |  0x00         |
    +---------------+---------------+---------------+---------------+
  20|  0x00         |  0x00         |  0x00         |  0x00         |
    +---------------+---------------+---------------+---------------+
  24|  0x66 ('f')   |  0x6f ('o')   |  0x6f ('o')   |
    +---------------+---------------+---------------+

Field        (offset) (value)
Magic        (0)    : 0x80 (PROTOCOL_BINARY_REQ)
Opcode       (1)    : 0x83
Key length   (2,3)  : 0x0003 (3)
Extra length (4)    : 0x00
Data type    (5)    : 0x00
vbucket      (6,7)  : 0x0000 (0)
Total body   (8-11) : 0x00000003 (3)
Opaque       (12-15): 0x00000000
CAS          (16-23): 0x0000000000000000
Key          (24-26): The textual string "foo"
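
As an illustration of the field list above, the request can be assembled with Ruby's `Array#pack`. This is a minimal sketch, not code from any client library: the helper name `encode_get_replica` and the constant `CMD_GET_REPLICA` defined here are chosen for this example, and the layout simply follows the 24-byte header shown above followed by the key.

```ruby
# Opcode as listed in the field table above.
CMD_GET_REPLICA = 0x83

# Hypothetical helper: builds a REPLICA READ request for the given key.
def encode_get_replica(key, vbucket = 0, opaque = 0)
  key = key.b # treat the key as raw bytes
  header = [
    0x80,            # magic: PROTOCOL_BINARY_REQ
    CMD_GET_REPLICA, # opcode
    key.bytesize,    # key length (16 bits, big-endian)
    0,               # extras length: the request carries no extras
    0,               # data type
    vbucket,         # vbucket id (16 bits, big-endian)
    key.bytesize,    # total body = extras + key + value
    opaque,          # opaque (32 bits), echoed back by the server
    0                # CAS (64 bits), unused in the request
  ].pack("CCnCCnNNQ>")
  header + key
end
```

For the key "foo" this yields 27 bytes: the 24-byte header followed by the three key bytes.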

Response

   Byte/     0       |       1       |       2       |       3       |
      /              |               |               |               |
     |0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7|
     +---------------+---------------+---------------+---------------+
    0| 0x81          | 0x83          | 0x00          | 0x00          |
     +---------------+---------------+---------------+---------------+
    4| 0x04          | 0x00          | 0x00          | 0x00          |
     +---------------+---------------+---------------+---------------+
    8| 0x00          | 0x00          | 0x00          | 0x09          |
     +---------------+---------------+---------------+---------------+
   12| 0x00          | 0x00          | 0x00          | 0x00          |
     +---------------+---------------+---------------+---------------+
   16| 0x00          | 0x00          | 0x00          | 0x00          |
     +---------------+---------------+---------------+---------------+
   20| 0x00          | 0x00          | 0x00          | 0x01          |
     +---------------+---------------+---------------+---------------+
   24| 0xde          | 0xad          | 0xbe          | 0xef          |
     +---------------+---------------+---------------+---------------+
   28| 0x57 ('W')    | 0x6f ('o')    | 0x72 ('r')    | 0x6c ('l')    |
     +---------------+---------------+---------------+---------------+
   32| 0x64 ('d')    |
     +---------------+


Field         (offset) (value)
 Magic        (0)    : 0x81 (PROTOCOL_BINARY_RES)
 Opcode       (1)    : 0x83
 Key length   (2,3)  : 0x0000
 Extra length (4)    : 0x04
 Data type    (5)    : 0x00
 Status       (6,7)  : 0x0000
 Total body   (8-11) : 0x00000009
 Opaque       (12-15): 0x00000000
 CAS          (16-23): 0x0000000000000001
 Extras              :
   Flags      (24-27): 0xdeadbeef
 Key                 : None
 Value        (28-32): The textual string "World"
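
Going the other way, a response laid out as above can be taken apart with `String#unpack`. Again this is only a sketch under the assumption that the whole packet is already in memory; the helper name `decode_get_replica_response` and the shape of the returned hash are invented for this example.

```ruby
# Hypothetical helper: splits a complete response packet into its fields.
def decode_get_replica_response(packet)
  magic, opcode, key_len, extras_len, _data_type, status, body_len, opaque, cas =
    packet.unpack("CCnCCnNNQ>")
  raise ArgumentError, "not a response packet" unless magic == 0x81

  # The body follows the 24-byte header: extras, then key, then value.
  body = packet.byteslice(24, body_len)
  flags = extras_len >= 4 ? body.unpack("N").first : nil
  value = body.byteslice(extras_len + key_len, body_len - extras_len - key_len)

  { :opcode => opcode, :status => status, :opaque => opaque,
    :cas => cas, :flags => flags, :value => value }
end
```

The value length is not sent explicitly; it is whatever remains of the total body after the extras and the key.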

  1. May 14, 2012

    Sergey Avseyev says:

    Could you also post the packet format here? There is no such command definition in https://github.com/membase/memcached/blob/engine/include/memcached/protocol_binary.h

    When will it be accessible?

    1. May 15, 2012

      Matt Ingenthron says:

      It's in the command_ids.h for ep-engine in the master branch (for 2.0), but it's a valid question as to whether or not it should be in the protocol_binary.h. There are a few things we have in this engine that are extensions.

      Please check with Trond on how they handle this sort of thing-- I don't know without looking a bit deeper.

      1. May 15, 2012

        Sergey Avseyev says:

        Could you review the command dissection I've just added?

      2. May 15, 2012

        Sergey Avseyev says:

        Asked Trond, and I think it is OK, because command_ids.h is public now

  2. May 15, 2012

    Sergey Avseyev says:

    Is there a quiet variant of this command, to implement pipelined get?

    1. May 15, 2012

      Matt Ingenthron says:

      There is not a quiet variant, no. That's a good point though.

  3. May 15, 2012

    Sergey Avseyev says:

    Case 1. The client sends requests to all replicas simultaneously

    1. pick the vbucket array from the config
    2. iterate over the array and schedule CMD_GET_REPLICA for each vbucket with given key
    3. flush buffers / start network interaction
    4. collect all responses, skipping errors, and find the CAS winner (the response with the most popular CAS version)

    Pros:

    • atomic read of the vbucket array from the config
    • no need to track the vbuckets to which the requests were sent (see Case 2)
    • more consistent value because all replicas are queried
    • fewer network round trips in case of a pipelined implementation

    Cons:

    • requires memory to keep all replica responses (currently 3 is max)

    Case 2. The client is iterating over the replicas and stops after first successful response

    1. Pick first replica vbucket from config
    2. Increment variable storing the next position
    3. Schedule CMD_GET_REPLICA request with given key
    4. flush buffers / start network interaction
    5. in the response handler check the status code and return to the user, or continue otherwise
    6. Stop once reached max replica count and return NOTFOUND to user
    7. Pick next replica vbucket from config
    8. Go to step 2

    Pros:

    • Fewer network round trips if pipelining is not available
    • Less memory consumption, because only one response is stored

    Cons:

    • The config could be reloaded between tries
    • Isn't multiget friendly
    • less consistent value because it takes the first successful response

    1. May 15, 2012

      Sergey Avseyev says:

      The question is, which approach is better? Or maybe I missed something, or there are other options.

    2. May 15, 2012

      Matt Ingenthron says:

      First off, there's a problem with case 1. You cannot rely on CAS as a monotonic clock.

      Even if it were, I think I prefer case 2.

      Given that a vbucket could move (it is intended to be used with failures, and auto-failover would have the vbucket move within a few seconds), we should walk the array of nodes for that vbucket, but also keep track of the config revision such that if it's updated, we start walking the array from the start again. Otherwise, we're likely to just get not-my-vbucket replies.

      1. May 15, 2012

        Sergey Avseyev says:

        OK, in this case the client will check no more than num_replicas times, just iterating indexes from 1 to num_replicas-1.