Note: The recommendations here are under development and may change before implementation.
Overview
The REPLICA READ operation (known as CMD_GET_REPLICA in the server) lets a client perform retrieval operations against one or more replicas only. Such a read is inconsistent, because a replica can lag behind the active copy of the item. While it could be used for nearly any purpose, the only common use case is expected to be failure handling, when a knowingly inconsistent read is acceptable.
Client API
Java
Note: Not correct, needs updating
GetFuture resf;
boolean isReplicaRead;
try {
    resf = cbc.asyncGet("foo");
    isReplicaRead = false;
} catch (TimeoutException ex) {
    // uhoh, something went wrong, server isn't there!
    resf = cbc.asyncReplicaGet("foo");
    isReplicaRead = true;
} finally {
    // do something useful
}
.NET
PHP
Ruby
res = nil
is_replica_read = false
begin
  res = cbc.get("foo")
rescue Couchbase::Error::Timeout => ex
  res = cbc.get("foo", :replica => true)
  is_replica_read = true
ensure
  # do something useful
end
Recommended Implementation
Note: These recommendations are preliminary.
When a client library is processing a request on behalf of the end user, it should generally walk the list of current replicas for the vbucket to which the given key maps.
The conversation then goes something along these lines:
- The application tries to retrieve an item; the client library cannot service the request and replies with an error or a timeout.
- The application then requests a replica read through an API similar to those listed above.
- The client library retrieves from the current configuration the list of all nodes for the requested item. While handling this invocation by the application, it continues to use this sequence of primary and replica locations even if the map changes in the meantime.
- Optionally, the client may provide a function that retrieves an array, map, or other structure containing the given item and its CAS (and optionally, the expiration) from all replicas it can contact. This lets applications determine which version of an item is the most up to date, because replication is not guaranteed to happen in the order in which the replicas appear in the configuration.
- The client library then attempts to retrieve the item from each replica, in the order specified in the configuration, until the item has been retrieved or the list of replicas has been exhausted. Upon receiving a response from a given replica, it returns that value to the calling application.
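The sequence above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the API of any existing client: the names replica_get, replica_get_all, try_fetch, ReplicaResult and ReplicaReadError are hypothetical, the node list is assumed to hold the master at index 0 followed by the replicas, and the fetch callable stands in for whatever sends CMD_GET_REPLICA to a node. Raising an error when every replica has been tried is also an assumption; the text above does not say what should be returned in that case.

```ruby
# Hypothetical sketch; none of these names belong to a real client library.
ReplicaResult = Struct.new(:value, :cas, :node)

class ReplicaReadError < StandardError; end

# Ask one node for the key. `fetch` is assumed to return [value, cas],
# return nil when the node does not hold the item, or raise on failure.
def try_fetch(fetch, node, key)
  fetch.call(node, key)
rescue StandardError
  nil # treat an unreachable replica like a miss and move on
end

# Walk the replicas in configuration order and return the first hit.
def replica_get(key, nodes_for_key, fetch)
  # drop(1) skips the master and copies the list, so the walk keeps using
  # this snapshot even if the caller's configuration changes meanwhile.
  replicas = nodes_for_key.drop(1)
  replicas.each do |node|
    found = try_fetch(fetch, node, key)
    return ReplicaResult.new(found[0], found[1], node) if found
  end
  raise ReplicaReadError, "no replica returned #{key.inspect}"
end

# Optional variant: collect the item and its CAS from every reachable replica
# so the application can decide which copy to trust.
def replica_get_all(key, nodes_for_key, fetch)
  nodes_for_key.drop(1).map { |node|
    found = try_fetch(fetch, node, key)
    ReplicaResult.new(found[0], found[1], node) if found
  }.compact
end
```

Passing the transport in as a callable keeps the walk itself independent of how a given client schedules network I/O.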
Implementation Constraints
REPLICA READ is a binary-protocol-only operation. It could be implemented in the ASCII protocol, but that would require changes in moxi and in additional clients.
Protocol level details
See the authoritative protocol documentation. This section serves only as a reference.
Request
  Byte/     0       |       1       |       2       |       3       |
     /              |               |               |               |
    |0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7|
    +---------------+---------------+---------------+---------------+
   0| 0x80          | 0x83          | 0x00          | 0x03          |
    +---------------+---------------+---------------+---------------+
   4| 0x00          | 0x00          | 0x00          | 0x00          |
    +---------------+---------------+---------------+---------------+
   8| 0x00          | 0x00          | 0x00          | 0x03          |
    +---------------+---------------+---------------+---------------+
  12| 0x00          | 0x00          | 0x00          | 0x00          |
    +---------------+---------------+---------------+---------------+
  16| 0x00          | 0x00          | 0x00          | 0x00          |
    +---------------+---------------+---------------+---------------+
  20| 0x00          | 0x00          | 0x00          | 0x00          |
    +---------------+---------------+---------------+---------------+
  24| 0x66 ('f')    | 0x6f ('o')    | 0x6f ('o')    |
    +---------------+---------------+---------------+
Field (offset) (value)
Magic (0) : 0x80 (PROTOCOL_BINARY_REQ)
Opcode (1) : 0x83
Key length (2,3) : 0x0003 (3)
Extra length (4) : 0x00
Data type (5) : 0x00
vbucket (6,7) : 0x0000 (0)
Total body (8-11) : 0x00000003 (3)
Opaque (12-15): 0x00000000
CAS (16-23): 0x0000000000000000
Key (24-26): The textual string "foo"
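As a rough illustration of the field layout listed above, the request could be assembled like this. The function and constant names are hypothetical and not part of any client library; only the byte values come from the field list, and Array#pack is the standard Ruby packing call.

```ruby
# Hypothetical helper; the names are illustrative only.
REQUEST_MAGIC   = 0x80 # PROTOCOL_BINARY_REQ
CMD_GET_REPLICA = 0x83

# Build a REPLICA READ request: a 24-byte header followed by the key.
# The request carries no extras and no value, so the total body length
# equals the key length.
def build_get_replica_request(key, vbucket, opaque = 0)
  key_bytes = key.b # work on raw bytes, independent of string encoding
  header = [
    REQUEST_MAGIC,      # magic        (1 byte)
    CMD_GET_REPLICA,    # opcode       (1 byte)
    key_bytes.bytesize, # key length   (2 bytes, network order)
    0,                  # extra length (1 byte)
    0,                  # data type    (1 byte)
    vbucket,            # vbucket      (2 bytes, network order)
    key_bytes.bytesize, # total body   (4 bytes, network order)
    opaque,             # opaque       (4 bytes)
    0                   # CAS          (8 bytes)
  ].pack("CCnCCnNNQ>")
  header + key_bytes
end
```

For the key "foo" in vbucket 0 this produces the 27 bytes described by the field list above.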
Response
  Byte/     0       |       1       |       2       |       3       |
     /              |               |               |               |
    |0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7|0 1 2 3 4 5 6 7|
    +---------------+---------------+---------------+---------------+
   0| 0x81          | 0x83          | 0x00          | 0x00          |
    +---------------+---------------+---------------+---------------+
   4| 0x04          | 0x00          | 0x00          | 0x00          |
    +---------------+---------------+---------------+---------------+
   8| 0x00          | 0x00          | 0x00          | 0x09          |
    +---------------+---------------+---------------+---------------+
  12| 0x00          | 0x00          | 0x00          | 0x00          |
    +---------------+---------------+---------------+---------------+
  16| 0x00          | 0x00          | 0x00          | 0x00          |
    +---------------+---------------+---------------+---------------+
  20| 0x00          | 0x00          | 0x00          | 0x01          |
    +---------------+---------------+---------------+---------------+
  24| 0xde          | 0xad          | 0xbe          | 0xef          |
    +---------------+---------------+---------------+---------------+
  28| 0x57 ('W')    | 0x6f ('o')    | 0x72 ('r')    | 0x6c ('l')    |
    +---------------+---------------+---------------+---------------+
  32| 0x64 ('d')    |
    +---------------+
Field (offset) (value)
Magic (0) : 0x81 (PROTOCOL_BINARY_RES)
Opcode (1) : 0x83
Key length (2,3) : 0x0000
Extra length (4) : 0x04
Data type (5) : 0x00
Status (6,7) : 0x0000
Total body (8-11) : 0x00000009
Opaque (12-15): 0x00000000
CAS (16-23): 0x0000000000000001
Extras :
Flags (24-27): 0xdeadbeef
Key : None
Value (28-32): The textual string "World"
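A response laid out as above could be decoded along these lines. Again this is a sketch under assumptions: GetReplicaResponse and parse_get_replica_response are hypothetical names, and the handling of short or unexpected packets is one possible choice rather than documented client behavior.

```ruby
# Hypothetical helper; the names are illustrative only.
GetReplicaResponse = Struct.new(:status, :opaque, :cas, :flags, :value)

# Decode a REPLICA READ response: 24-byte header, then extras (flags),
# then the key (normally empty), then the value.
def parse_get_replica_response(packet)
  raise ArgumentError, "short header" if packet.bytesize < 24
  magic, opcode, key_len, extras_len, _data_type, status,
    body_len, opaque, cas = packet.unpack("CCnCCnNNQ>")
  raise ArgumentError, "unexpected magic" unless magic == 0x81
  raise ArgumentError, "unexpected opcode" unless opcode == 0x83

  body = packet.byteslice(24, body_len)
  raise ArgumentError, "truncated body" if body.nil? || body.bytesize < body_len

  # Flags occupy the first four bytes of the extras when they are present.
  flags = extras_len >= 4 ? body.unpack("N").first : nil
  value = body.byteslice(extras_len + key_len,
                         body_len - extras_len - key_len)
  GetReplicaResponse.new(status, opaque, cas, flags, value)
end
```

Fed the 33 bytes from the diagram above, this yields status 0x0000, CAS 1, flags 0xdeadbeef and the value "World".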
Comments (10)
May 14, 2012
Sergey Avseyev says:
Could you also post the packet format here? There is no such command definition in https://github.com/membase/memcached/blob/engine/include/memcached/protocol_binary.h
When will it be accessible?
May 15, 2012
Matt Ingenthron says:
It's in the command_ids.h for ep-engine in the master branch (for 2.0), but it's a valid question as to whether or not it should be in the protocol_binary.h. There are a few things we have in this engine that are extensions.
Please check with Trond on how they handle this sort of thing-- I don't know without looking a bit deeper.
May 15, 2012
Sergey Avseyev says:
Could you review the command dissection I've just added?
May 15, 2012
Sergey Avseyev says:
Asked Trond, and I think it is ok, because command_ids.h is public now
May 15, 2012
Sergey Avseyev says:
Is there a quiet variant of this command to implement pipelined get?
May 15, 2012
Matt Ingenthron says:
There is not a quiet variant, no. That's a good point though.
May 15, 2012
Sergey Avseyev says:
Case 1. The client sends requests to all replicas simultaneously
1. pick the vbucket array from the config
2. iterate over the array and schedule CMD_GET_REPLICA for each vbucket with given key
3. flush buffers / start network interaction
4. collect all requests, skipping errors and find the CAS winner (the request with the most popular CAS version)
Pros:
Cons:
Case 2. The client is iterating over the replicas and stops after first successful response
1. Pick first replica vbucket from config
2. Increment variable storing the next position
3. Schedule CMD_GET_REPLICA request with given key
4. flush buffers / start network interaction
5. in the response handler check the status code and return to the user, or continue otherwise
6. Stop once reached max replica count and return NOTFOUND to user
7. Pick next replica vbucket from config
8. Go to step 2
Pros:
Cons:
May 15, 2012
Sergey Avseyev says:
The question is: which approach is better? Or maybe I missed something, or there are other options.
May 15, 2012
Matt Ingenthron says:
First off, there's a problem with case 1. You cannot rely on CAS as a monotonic clock.
Even if it were, I think I prefer case 2.
Given that a vbucket could move (it is intended to be used with failures, and auto-failover would have the vbucket move within a few seconds), we should walk the array of nodes for that vbucket, but also keep track of the config revision such that if it's updated, we start walking the array from the start again. Otherwise, we're likely to just get not-my-vbucket replies.
May 15, 2012
Sergey Avseyev says:
Ok, in this case the client will check no more than num_replicas times, just iterating indexes from 1 to num_replicas-1.