This document is meant to give an overview of the etcd3 API's central design. It is by no means all encompassing, but intended to focus on the basic ideas needed to understand etcd without the distraction of less common API calls. All etcd3 API's are defined in gRPC services, which categorize remote procedure calls (RPCs) understood by the etcd server. A full listing of all etcd RPCs are documented in markdown in the gRPC API listing.
Every API request sent to an etcd server is a gRPC remote procedure call. RPCs in etcd3 are categorized based on functionality into services.
Services important for dealing with etcd's key space include:
Services which manage the cluster itself include:
All RPCs in etcd3 follow the same format. Each RPC has a function Name
which takes NameRequest
as an argument and returns NameResponse
as a response. For example, here is the Range
RPC description:
service KV {
Range(RangeRequest) returns (RangeResponse)
...
}
All Responses from etcd API have an attached response header which includes cluster metadata for the response:
message ResponseHeader {
uint64 cluster_id = 1;
uint64 member_id = 2;
int64 revision = 3;
uint64 raft_term = 4;
}
An application may read the Cluster_ID
or Member_ID
field to ensure it is communicating with the intended cluster (member).
Applications can use the Revision
field to know the latest revision of the key-value store. This is especially useful when applications specify a historical revision to make a time travel query
and wish to know the latest revision at the time of the request.
Applications can use Raft_Term
to detect when the cluster completes a new leader election.
The Key-Value API manipulates key-value pairs stored inside etcd. The majority of requests made to etcd are usually key-value requests.
A key-value pair is the smallest unit that the key-value API can manipulate. Each key-value pair has a number of fields, defined in protobuf format:
message KeyValue {
bytes key = 1;
int64 create_revision = 2;
int64 mod_revision = 3;
int64 version = 4;
bytes value = 5;
int64 lease = 6;
}
In addition to just the key and value, etcd attaches additional revision metadata as part of the key message. This revision information orders keys by time of creation and modification, which is useful for managing concurrency for distributed synchronization. The etcd client's distributed shared locks use the creation revision to wait for lock ownership. Similarly, the modification revision is used for detecting software transactional memory read set conflicts and waiting on leader election updates.
etcd maintains a 64-bit cluster-wide counter, the store revision, that is incremented each time the key space is modified. The revision serves as a global logical clock, sequentially ordering all updates to the store. The change represented by a new revision is incremental; the data associated with a revision is the data that changed the store. Internally, a new revision means writing the changes to the backend's B+tree, keyed by the incremented revision.
Revisions become more valuable when considering etcd3's multi-version concurrency control backend. The MVCC model means that the key-value store can be viewed from past revisions since historical key revisions are retained. The retention policy for this history can be configured by cluster administrators for fine-grained storage management; usually etcd3 discards old revisions of keys on a timer. A typical etcd3 cluster retains superseded key data for hours. This also provides reliable handling for long client disconnection, not just transient network disruptions: watchers simply resume from the last observed historical revision. Similarly, to read from the store at a particular point-in-time, read requests can be tagged with a revision to return keys from a view of the key space at the point-in-time that revision was committed.
The etcd3 data model indexes all keys over a flat binary key space. This differs from other key-value store systems that use a hierarchical system of organizing keys into directories. Instead of listing keys by directory, keys are listed by key intervals [a, b)
.
These intervals are often referred to as "ranges" in etcd3. Operations over ranges are more powerful than operations on directories. Like a hierarchical store, intervals support single key lookups via [a, a+1)
(e.g., ['a', 'a\x00') looks up 'a') and directory lookups by encoding keys by directory depth. In addition to those operations, intervals can also encode prefixes; for example the interval ['a', 'b')
looks up all keys prefixed by the string 'a'.
By convention, ranges for a request are denoted by the fields key
and range_end
. The key
field is the first key of the range and should be non-empty. The range_end
is the key following the last key of the range. If range_end
is not given or empty, the range is defined to contain only the key argument. If range_end
is key
plus one (e.g., "aa"+1 == "ab", "a\xff"+1 == "b"), then the range represents all keys prefixed with key. If both key
and range_end
are '\0', then range represents all keys. If range_end
is '\0', the range is all keys greater than or equal to the key argument.
Keys are fetched from the key-value store using the Range
API call, which takes a RangeRequest
:
message RangeRequest {
enum SortOrder {
NONE = 0; // default, no sorting
ASCEND = 1; // lowest target value first
DESCEND = 2; // highest target value first
}
enum SortTarget {
KEY = 0;
VERSION = 1;
CREATE = 2;
MOD = 3;
VALUE = 4;
}
bytes key = 1;
bytes range_end = 2;
int64 limit = 3;
int64 revision = 4;
SortOrder sort_order = 5;
SortTarget sort_target = 6;
bool serializable = 7;
bool keys_only = 8;
bool count_only = 9;
int64 min_mod_revision = 10;
int64 max_mod_revision = 11;
int64 min_create_revision = 12;
int64 max_create_revision = 13;
}
The client receives a RangeResponse
message from the Range
call:
message RangeResponse {
ResponseHeader header = 1;
repeated mvccpb.KeyValue kvs = 2;
bool more = 3;
int64 count = 4;
}
Count_Only
is set, Kvs
is empty.limit
is set.Keys are saved into the key-value store by issuing a Put
call, which takes a PutRequest
:
message PutRequest {
bytes key = 1;
bytes value = 2;
int64 lease = 3;
bool prev_kv = 4;
bool ignore_value = 5;
bool ignore_lease = 6;
}
Put
request.The client receives a PutResponse
message from the Put
call:
message PutResponse {
ResponseHeader header = 1;
mvccpb.KeyValue prev_kv = 2;
}
Put
, if Prev_Kv
was set in the PutRequest
.Ranges of keys are deleted using the DeleteRange
call, which takes a DeleteRangeRequest
:
message DeleteRangeRequest {
bytes key = 1;
bytes range_end = 2;
bool prev_kv = 3;
}
The client receives a DeleteRangeResponse
message from the DeleteRange
call:
message DeleteRangeResponse {
ResponseHeader header = 1;
int64 deleted = 2;
repeated mvccpb.KeyValue prev_kvs = 3;
}
DeleteRange
operation.A transaction is an atomic If/Then/Else construct over the key-value store. It provides a primitive for grouping requests together in atomic blocks (i.e., then/else) whose execution is guarded (i.e., if) based on the contents of the key-value store. Transactions can be used for protecting keys from unintended concurrent updates, building compare-and-swap operations, and developing higher-level concurrency control.
A transaction can atomically process multiple requests in a single request. For modifications to the key-value store, this means the store's revision is incremented only once for the transaction and all events generated by the transaction will have the same revision. However, modifications to the same key multiple times within a single transaction are forbidden.
All transactions are guarded by a conjunction of comparisons, similar to an If
statement. Each comparison checks a single key in the store. It may check for the absence or presence of a value, compare with a given value, or check a key's revision or version. Two different comparisons may apply to the same or different keys. All comparisons are applied atomically; if all comparisons are true, the transaction is said to succeed and etcd applies the transaction's then / success
request block, otherwise it is said to fail and applies the else / failure
request block.
Each comparison is encoded as a Compare
message:
message Compare {
enum CompareResult {
EQUAL = 0;
GREATER = 1;
LESS = 2;
NOT_EQUAL = 3;
}
enum CompareTarget {
VERSION = 0;
CREATE = 1;
MOD = 2;
VALUE= 3;
}
CompareResult result = 1;
// target is the key-value field to inspect for the comparison.
CompareTarget target = 2;
// key is the subject key for the comparison operation.
bytes key = 3;
oneof target_union {
int64 version = 4;
int64 create_revision = 5;
int64 mod_revision = 6;
bytes value = 7;
}
}
After processing the comparison block, the transaction applies a block of requests. A block is a list of RequestOp
messages:
message RequestOp {
// request is a union of request types accepted by a transaction.
oneof request {
RangeRequest request_range = 1;
PutRequest request_put = 2;
DeleteRangeRequest request_delete_range = 3;
}
}
RangeRequest
.PutRequest
. The keys must be unique. It may not share keys with any other Puts or Deletes.DeleteRangeRequest
. It may not share keys with any Puts or Deletes requests.All together, a transaction is issued with a Txn
API call, which takes a TxnRequest
:
message TxnRequest {
repeated Compare compare = 1;
repeated RequestOp success = 2;
repeated RequestOp failure = 3;
}
The client receives a TxnResponse
message from the Txn
call:
message TxnResponse {
ResponseHeader header = 1;
bool succeeded = 2;
repeated ResponseOp responses = 3;
}
Compare
evaluated to true or false.Success
block if succeeded is true or the Failure
if succeeded is false.The Responses
list corresponds to the results from the applied RequestOp
list, with each response encoded as a ResponseOp
:
message ResponseOp {
oneof response {
RangeResponse response_range = 1;
PutResponse response_put = 2;
DeleteRangeResponse response_delete_range = 3;
}
}
The Watch
API provides an event-based interface for asynchronously monitoring changes to keys. An etcd3 watch waits for changes to keys by continuously watching from a given revision, either current or historical, and streams key updates back to the client.
Every change to every key is represented with Event
messages. An Event
message provides both the update's data and the type of update:
message Event {
enum EventType {
PUT = 0;
DELETE = 1;
}
EventType type = 1;
KeyValue kv = 2;
KeyValue prev_kv = 3;
}
Watches are long-running requests and use gRPC streams to stream event data. A watch stream is bi-directional; the client writes to the stream to establish watches and reads to receive watch events. A single watch stream can multiplex many distinct watches by tagging events with per-watch identifiers. This multiplexing helps reducing the memory footprint and connection overhead on the core etcd cluster.
Watches make three guarantees about events:
A client creates a watch by sending a WatchCreateRequest
over a stream returned by Watch
:
message WatchCreateRequest {
bytes key = 1;
bytes range_end = 2;
int64 start_revision = 3;
bool progress_notify = 4;
enum FilterType {
NOPUT = 0;
NODELETE = 1;
}
repeated FilterType filters = 5;
bool prev_kv = 6;
}
In response to a WatchCreateRequest
or if there is a new event for some established watch, the client receives a WatchResponse
:
message WatchResponse {
ResponseHeader header = 1;
int64 watch_id = 2;
bool created = 3;
bool canceled = 4;
int64 compact_revision = 5;
repeated mvccpb.Event events = 11;
}
If the client wishes to stop receiving events for a watch, it issues a WatchCancelRequest
:
message WatchCancelRequest {
int64 watch_id = 1;
}
Leases are a mechanism for detecting client liveness. The cluster grants leases with a time-to-live. A lease expires if the etcd cluster does not receive a keepAlive within a given TTL period.
To tie leases into the key-value store, each key may be attached to at most one lease. When a lease expires or is revoked, all keys attached to that lease will be deleted. Each expired key generates a delete event in the event history.
Leases are obtained through the LeaseGrant
API call, which takes a LeaseGrantRequest
:
message LeaseGrantRequest {
int64 TTL = 1;
int64 ID = 2;
}
The client receives a LeaseGrantResponse
from the LeaseGrant
call:
message LeaseGrantResponse {
ResponseHeader header = 1;
int64 ID = 2;
int64 TTL = 3;
}
message LeaseRevokeRequest {
int64 ID = 1;
}
Leases are refreshed using a bi-directional stream created with the LeaseKeepAlive
API call. When the client wishes to refresh a lease, it sends a LeaseKeepAliveRequest
over the stream:
message LeaseKeepAliveRequest {
int64 ID = 1;
}
The keep alive stream responds with a LeaseKeepAliveResponse
:
message LeaseKeepAliveResponse {
ResponseHeader header = 1;
int64 ID = 2;
int64 TTL = 3;
}