
Merge pull request #8250 from heyitsanthony/reorg-op-guide

Documentation, op-guide: reorganize etcd operation section
Anthony Romano 8 years ago
commit 17be3b551a

+ 32 - 22
Documentation/docs.md

@@ -22,33 +22,46 @@ The easiest way to get started using etcd as a distributed key-value store is to
 
 ## Operating etcd clusters
 
-Administrators who need to create reliable and scalable key-value stores for the developers they support should begin with a [cluster on multiple machines][clustering].
+Administrators who need a fault-tolerant etcd cluster for either development or production should begin with a [cluster on multiple machines][clustering].
 
- - [Setting up etcd clusters][clustering]
- - [Setting up etcd gateways][gateway]
- - [Setting up etcd gRPC proxy][grpc_proxy]
+### Setting up etcd
+
+ - [Configuration flags][conf]
+ - [Multi-member cluster][clustering]
+ - [gRPC proxy][grpc_proxy]
+ - [L4 gateway][gateway]
+
+### System configuration
+
+ - [Supported systems][supported_platforms]
  - [Hardware recommendations][hardware]
- - [Configuration][conf]
- - [Security][security]
- - [Authentication][authentication]
- - [Monitoring][monitoring]
- - [Maintenance][maintenance]
- - [Understand failures][failures]
- - [Disaster recovery][recovery]
- - [Performance][performance]
- - [Versioning][versioning]
+ - [Performance benchmarking][performance]
+ - [Tuning][tuning]
 
 ### Platform guides
 
- - [Supported systems][supported_platforms]
- - [Docker container][container_docker]
- - [Container Linux, systemd][container_linux_platform]
- - [rkt container][container_rkt]
  - [Amazon Web Services][aws_platform]
+ - [Container Linux, systemd][container_linux_platform]
  - [FreeBSD][freebsd_platform]
+ - [Docker container][container_docker]
+ - [rkt container][container_rkt]
+
+### Security
+
+ - [TLS][security]
+ - [Role-based access control][authentication]
+
+### Maintenance and troubleshooting
+
+ - [Frequently asked questions][faq]
+ - [Monitoring][monitoring]
+ - [Maintenance][maintenance]
+ - [Failure modes][failures]
+ - [Disaster recovery][recovery]
 
 ### Upgrading and compatibility
 
+ - [Version numbers][versioning]
  - [Migrate applications from using API v2 to API v3][v2_migration]
  - [Upgrading a v2.3 cluster to v3.0][v3_upgrade]
  - [Upgrading a v3.0 cluster to v3.1][v31_upgrade]
@@ -65,17 +78,13 @@ To learn more about the concepts and internals behind etcd, read the following p
  - Internals
    - [Auth subsystem][auth_design]
 
-## Frequently Asked Questions (FAQ)
-
-Answers to [common questions] about etcd.
-
 [api_ref]: dev-guide/api_reference_v3.md
 [api_concurrency_ref]: dev-guide/api_concurrency_reference_v3.md
 [api_grpc_gateway]: dev-guide/api_grpc_gateway.md
 [clustering]: op-guide/clustering.md
 [conf]: op-guide/configuration.md
 [system-limit]: dev-guide/limit.md
-[common questions]: faq.md
+[faq]: faq.md
 [why]: learning/why.md
 [data_model]: learning/data_model.md
 [demo]: demo.md
@@ -111,3 +120,4 @@ Answers to [common questions] about etcd.
 [v32_upgrade]: upgrades/upgrade_3_2.md
 [authentication]: op-guide/authentication.md
 [auth_design]: learning/auth_design.md
+[tuning]: tuning.md

+ 24 - 24
Documentation/faq.md

@@ -1,40 +1,40 @@
-## Frequently Asked Questions (FAQ)
+# Frequently Asked Questions (FAQ)
 
-### etcd, general
+## etcd, general
 
-#### Do clients have to send requests to the etcd leader?
+### Do clients have to send requests to the etcd leader?
 
 [Raft][raft] is leader-based; the leader handles all client requests which need cluster consensus. However, the client does not need to know which node is the leader. Any request that requires consensus sent to a follower is automatically forwarded to the leader. Requests that do not require consensus (e.g., serialized reads) can be processed by any cluster member.
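 
 As a quick sketch (assuming the v3 etcdctl; the key name is illustrative), a serializable read can be served locally by any member, while the default linearizable read goes through consensus:
 
 ```sh
 # --consistency=s requests a serializable (possibly stale) read that any
 # member may answer from its local store; the default "l" (linearizable)
 # requires consensus through the leader.
 ETCDCTL_API=3 etcdctl get mykey --consistency=s
 ```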
 
-### Configuration
+## Configuration
 
-#### What is the difference between listen-<client,peer>-urls, advertise-client-urls or initial-advertise-peer-urls?
+### What is the difference between listen-<client,peer>-urls, advertise-client-urls and initial-advertise-peer-urls?
 
 `listen-client-urls` and `listen-peer-urls` specify the local addresses etcd server binds to for accepting incoming connections. To listen on a port for all interfaces, specify `0.0.0.0` as the listen IP address.
 
 `advertise-client-urls` and `initial-advertise-peer-urls` specify the addresses etcd clients or other etcd members should use to contact the etcd server. The advertise addresses must be reachable from the remote machines. Do not advertise addresses like `localhost` or `0.0.0.0` for a production setup since these addresses are unreachable from remote machines.
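 
 For example (the addresses are illustrative), a member on `10.0.1.10` might listen on all interfaces while advertising only its routable address:
 
 ```sh
 # Bind to every interface, but tell clients and peers to use the routable
 # address; never advertise 0.0.0.0 or localhost in production.
 etcd --name infra0 \
   --listen-client-urls http://0.0.0.0:2379 \
   --advertise-client-urls http://10.0.1.10:2379 \
   --listen-peer-urls http://0.0.0.0:2380 \
   --initial-advertise-peer-urls http://10.0.1.10:2380
 ```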
 
-#### Why doesn't changing `--listen-peer-urls` or `--initial-advertise-peer-urls` update the advertised peer URLs in `etcdctl member list`?
+### Why doesn't changing `--listen-peer-urls` or `--initial-advertise-peer-urls` update the advertised peer URLs in `etcdctl member list`?
 
 A member's advertised peer URLs come from `--initial-advertise-peer-urls` on initial cluster boot. Changing the listen peer URLs or the initial advertise peers after booting the member won't affect the exported advertise peer URLs since changes must go through quorum to avoid membership configuration split brain. Use `etcdctl member update` to update a member's peer URLs.
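 
 A minimal sketch (the member ID and address are illustrative):
 
 ```sh
 # Find the hexadecimal member ID, then update its advertised peer URLs.
 ETCDCTL_API=3 etcdctl member list
 ETCDCTL_API=3 etcdctl member update 8e9e05c52164694d --peer-urls=http://10.0.1.10:2380
 ```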
 
-### Deployment
+## Deployment
 
-#### System requirements
+### System requirements
 
 Since etcd writes data to disk, SSD is highly recommended. To prevent performance degradation or unintentionally overloading the key-value store, etcd enforces a 2GB default storage size quota, configurable up to 8GB. To avoid swapping or running out of memory, the machine should have at least as much RAM as the storage quota. At CoreOS, an etcd cluster is usually deployed on dedicated CoreOS Container Linux machines with dual-core processors, 2GB of RAM, and 80GB of SSD *at the very least*. **Note that performance is intrinsically workload dependent; please test before production deployment**. See [hardware][hardware-setup] for more recommendations.
 
 The most stable production environment is the Linux operating system on the amd64 architecture; see [supported platform][supported-platform] for more.
 
-#### Why an odd number of cluster members?
+### Why an odd number of cluster members?
 
 An etcd cluster needs a majority of nodes, a quorum, to agree on updates to the cluster state. For a cluster with n members, quorum is (n/2)+1. For any odd-sized cluster, adding one node will always increase the number of nodes necessary for quorum. Although adding a node to an odd-sized cluster appears better since there are more machines, the fault tolerance is worse: exactly the same number of nodes may fail without losing quorum, but there are more nodes that can fail. If the cluster is in a state where it can't tolerate any more failures, adding a node before removing nodes is dangerous because if the new node fails to register with the cluster (e.g., the address is misconfigured), quorum will be permanently lost.
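 
 For example, a 3-member cluster has a quorum of 2 and tolerates one failure; growing it to 4 members raises the quorum to 3 while still tolerating only one failure.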
 
-#### What is maximum cluster size?
+### What is the maximum cluster size?
 
 Theoretically, there is no hard limit. However, an etcd cluster probably should have no more than seven nodes. [Google Chubby lock service][chubby], similar to etcd and widely deployed within Google for many years, suggests running five nodes. A 5-member etcd cluster can tolerate two member failures, which is enough in most cases. Although larger clusters provide better fault tolerance, the write performance suffers because data must be replicated across more machines.
 
-#### What is failure tolerance?
+### What is failure tolerance?
 
 An etcd cluster operates so long as a member quorum can be established. If quorum is lost through transient network failures (e.g., partitions), etcd automatically and safely resumes once the network recovers and restores quorum; Raft enforces cluster consistency. For power loss, etcd persists the Raft log to disk; etcd replays the log to the point of failure and resumes cluster participation. For permanent hardware failure, the node may be removed from the cluster through [runtime reconfiguration][runtime reconfiguration].
 
@@ -54,19 +54,19 @@ It is recommended to have an odd number of members in a cluster. An odd-size clu
 
 Adding a member to bring the size of the cluster up to an even number doesn't buy additional fault tolerance. Likewise, during a network partition, an odd number of members guarantees that there will always be a majority partition that can continue to operate and be the source of truth when the partition ends.
 
-#### Does etcd work in cross-region or cross data center deployments?
+### Does etcd work in cross-region or cross data center deployments?
 
 Deploying etcd across regions improves etcd's fault tolerance since members are in separate failure domains. The cost is higher consensus request latency from crossing data center boundaries. Since etcd relies on a member quorum for consensus, the latency from crossing data centers will be somewhat pronounced because at least a majority of cluster members must respond to consensus requests. Additionally, cluster data must be replicated across all peers, so there will be bandwidth cost as well.
 
 With longer latencies, the default etcd configuration may cause frequent elections or heartbeat timeouts. See [tuning] for adjusting timeouts for high latency deployments.
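 
 As an illustrative starting point (the values depend on measured round-trip times), the time parameters can be raised on every member:
 
 ```sh
 # Defaults are 100ms/1000ms; for cross-region deployments, set the heartbeat
 # near the peer round-trip time and the election timeout to roughly 10x that.
 etcd --heartbeat-interval=300 --election-timeout=3000
 ```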
 
-### Operation
+## Operation
 
-#### How to backup a etcd cluster?
+### How to back up an etcd cluster?
 
 etcdctl provides a `snapshot` command to create backups. See [backup][backup] for more details.
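 
 For example (the endpoint variable and file name are illustrative):
 
 ```sh
 ETCDCTL_API=3 etcdctl --endpoints=$ENDPOINT snapshot save backup.db
 ```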
 
-#### Should I add a member before removing an unhealthy member?
+### Should I add a member before removing an unhealthy member?
 
 When replacing an etcd node, it's important to remove the member first and then add its replacement.
 
@@ -78,21 +78,21 @@ Additionally, that new member is risky because it may turn out to be misconfigur
 
 On the other hand, if the downed member is removed from cluster membership first, the number of members becomes 2 and the quorum remains at 2. Following that removal by adding a new member will also keep the quorum steady at 2. So, even if the new node can't be brought up, it's still possible to remove the new member through quorum on the remaining live members.
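 
 A sketch of the replacement sequence (the member ID, name, and address are illustrative):
 
 ```sh
 # Remove the failed member first, then add its replacement.
 ETCDCTL_API=3 etcdctl member remove 8e9e05c52164694d
 ETCDCTL_API=3 etcdctl member add infra3 --peer-urls=http://10.0.1.13:2380
 ```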
 
-#### Why won't etcd accept my membership changes?
+### Why won't etcd accept my membership changes?
 
 etcd sets `strict-reconfig-check` in order to reject reconfiguration requests that would cause quorum loss. Abandoning quorum is really risky (especially when the cluster is already unhealthy). Although it may be tempting to disable quorum checking if there's quorum loss to add a new member, this could lead to full-fledged cluster inconsistency. For many applications, this will make the problem even worse ("disk geometry corruption" being a candidate for most terrifying).
 
-#### Why does etcd lose its leader from disk latency spikes?
+### Why does etcd lose its leader from disk latency spikes?
 
 This is intentional; disk latency is part of leader liveness. Suppose the cluster leader takes a minute to fsync a raft log update to disk, but the etcd cluster has a one second election timeout. Even though the leader can process network messages within the election interval (e.g., send heartbeats), it's effectively unavailable because it can't commit any new proposals; it's waiting on the slow disk. If the cluster frequently loses its leader due to disk latencies, try [tuning][tuning] the disk settings or etcd time parameters.
 
-#### What does the etcd warning "request ignored (cluster ID mismatch)" mean?
+### What does the etcd warning "request ignored (cluster ID mismatch)" mean?
 
 Every new etcd cluster generates a new cluster ID based on the initial cluster configuration and a user-provided unique `initial-cluster-token` value. By having unique cluster IDs, etcd is protected from cross-cluster interaction which could corrupt the cluster.
 
 Usually this warning happens after tearing down an old cluster, then reusing some of the peer addresses for the new cluster. If any etcd process from the old cluster is still running it will try to contact the new cluster. The new cluster will recognize a cluster ID mismatch, then ignore the request and emit this warning. This warning is often cleared by ensuring peer addresses among distinct clusters are disjoint.
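 
 For instance (the token, name, and address are illustrative), giving each logical cluster its own token at bootstrap yields distinct cluster IDs:
 
 ```sh
 # A unique token per logical cluster makes stray members from an old
 # cluster fail the cluster ID check instead of corrupting data.
 etcd --name infra0 --initial-cluster-token etcd-cluster-2 \
   --initial-cluster infra0=http://10.0.1.10:2380
 ```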
 
-#### What does "mvcc: database space exceeded" mean and how do I fix it?
+### What does "mvcc: database space exceeded" mean and how do I fix it?
 
 The [multi-version concurrency control][api-mvcc] data model in etcd keeps an exact history of the keyspace. Without periodically compacting this history (e.g., by setting `--auto-compaction`), etcd will eventually exhaust its storage space. If etcd runs low on storage space, it raises a space quota alarm to protect the cluster from further writes. So long as the alarm is raised, etcd responds to write requests with the error `mvcc: database space exceeded`.
 
@@ -102,13 +102,13 @@ To recover from the low space quota alarm:
 2. [Defragment][maintenance-defragment] every etcd endpoint.
 3. [Disarm][maintenance-disarm] the alarm.
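 
 A command-level sketch of these steps (the revision number is illustrative):
 
 ```sh
 # 1. Compact the keyspace history up to a recent revision.
 ETCDCTL_API=3 etcdctl compact 1000
 # 2. Defragment every endpoint to release the freed space.
 ETCDCTL_API=3 etcdctl defrag
 # 3. Disarm the space quota alarm.
 ETCDCTL_API=3 etcdctl alarm disarm
 ```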
 
-### Performance
+## Performance
 
-#### How should I benchmark etcd?
+### How should I benchmark etcd?
 
 Try the [benchmark] tool. Current [benchmark results][benchmark-result] are available for comparison.
 
-#### What does the etcd warning "apply entries took too long" mean?
+### What does the etcd warning "apply entries took too long" mean?
 
 After a majority of etcd members agree to commit a request, each etcd server applies the request to its data store and persists the result to disk. Even with a slow mechanical disk or a virtualized network disk, such as Amazon’s EBS or Google’s PD, applying a request should normally take fewer than 50 milliseconds. If the average apply duration exceeds 100 milliseconds, etcd will warn that entries are taking too long to apply.
 
@@ -120,7 +120,7 @@ Expensive user requests which access too many keys (e.g., fetching the entire ke
 
 If none of the above suggestions clear the warnings, please [open an issue][new_issue] with detailed logging, monitoring, metrics and optionally workload information.
 
-#### What does the etcd warning "failed to send out heartbeat on time" mean?
+### What does the etcd warning "failed to send out heartbeat on time" mean?
 
 etcd uses a leader-based consensus protocol for consistent data replication and log execution. Cluster members elect a single leader; all other members become followers. The elected leader must periodically send heartbeats to its followers to maintain its leadership. Followers infer leader failure if no heartbeats are received within an election interval and trigger an election. If a leader doesn’t send its heartbeats in time but is still running, the election is spurious and likely caused by insufficient resources. To catch these soft failures, if the leader skips two heartbeat intervals, etcd will warn it failed to send a heartbeat on time.
 
@@ -132,7 +132,7 @@ A slow network can also cause this issue. If network metrics among the etcd mach
 
 If none of the above suggestions clear the warnings, please [open an issue][new_issue] with detailed logging, monitoring, metrics and optionally workload information.
 
-#### What does the etcd warning "snapshotting is taking more than x seconds to finish ..." mean?
+### What does the etcd warning "snapshotting is taking more than x seconds to finish ..." mean?
 
 etcd sends a snapshot of its complete key-value store to refresh slow followers and for [backups][backup]. Slow snapshot transfer times increase MTTR; if the cluster is ingesting data with high throughput, slow followers may livelock, needing a new snapshot before they finish receiving the current one. To catch slow snapshot performance, etcd warns when sending a snapshot takes more than thirty seconds and exceeds the expected transfer time for a 1Gbps connection.
 

+ 2 - 2
Documentation/op-guide/authentication.md

@@ -1,8 +1,8 @@
-# Authentication Guide
+# Role-based access control
 
 ## Overview
 
-Authentication was added in etcd 2.1. The etcd v3 API slightly modified the authentication feature's API and user interface to better fit the new data model. This guide is intended to help users set up basic authentication in etcd v3.
+Authentication was added in etcd 2.1. The etcd v3 API slightly modified the authentication feature's API and user interface to better fit the new data model. This guide is intended to help users set up basic authentication and role-based access control in etcd v3.
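 
 A minimal sketch of the first steps (assuming the v3 etcdctl; `user add` prompts for a password):
 
 ```sh
 # Create the root user, then turn authentication on for the cluster.
 ETCDCTL_API=3 etcdctl user add root
 ETCDCTL_API=3 etcdctl auth enable
 ```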
 
 ## Special users and roles
 

+ 1 - 1
Documentation/op-guide/failures.md

@@ -1,4 +1,4 @@
-# Understand failures
+# Failure modes
 
 Failures are common in a large deployment of machines. A machine fails when its hardware or software malfunctions. Multiple machines fail together when there are power failures or network issues. Multiple kinds of failures can also happen at once; it is almost impossible to enumerate all possible failure cases. 
 

+ 3 - 3
Documentation/op-guide/recovery.md

@@ -1,4 +1,4 @@
-## Disaster recovery
+# Disaster recovery
 
 etcd is designed to withstand machine failures. An etcd cluster automatically recovers from temporary failures (e.g., machine reboots) and tolerates up to *(N-1)/2* permanent failures for a cluster of N members. When a member permanently fails, whether due to hardware failure or disk corruption, it loses access to the cluster. If the cluster permanently loses more than *(N-1)/2* members then it disastrously fails, irrevocably losing quorum. Once quorum is lost, the cluster cannot reach consensus and therefore cannot continue accepting updates.
 
@@ -6,7 +6,7 @@ To recover from disastrous failure, etcd v3 provides snapshot and restore facili
 
 [v2_recover]: ../v2/admin_guide.md#disaster-recovery
 
-### Snapshotting the keyspace
+## Snapshotting the keyspace
 
 Recovering a cluster first needs a snapshot of the keyspace from an etcd member. A snapshot may either be taken from a live member with the `etcdctl snapshot save` command or by copying the `member/snap/db` file from an etcd data directory. For example, the following command snapshots the keyspace served by `$ENDPOINT` to the file `snapshot.db`:
 
@@ -14,7 +14,7 @@ Recovering a cluster first needs a snapshot of the keyspace from an etcd member.
 $ ETCDCTL_API=3 etcdctl --endpoints $ENDPOINT snapshot save snapshot.db
 ```
 
-### Restoring a cluster
+## Restoring a cluster
 
 To restore a cluster, all that is needed is a single snapshot "db" file. A cluster restore with `etcdctl snapshot restore` creates new etcd data directories; all members should restore using the same snapshot. Restoring overwrites some snapshot metadata (specifically, the member ID and cluster ID); the member loses its former identity. This metadata overwrite prevents the new member from inadvertently joining an existing cluster. Therefore in order to start a cluster from a snapshot, the restore must start a new logical cluster.
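 
 For example (the member name, hosts, and token are illustrative), each member restores from the same snapshot while declaring the new cluster's membership:
 
 ```sh
 $ ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
   --name m1 \
   --initial-cluster m1=http://host1:2380,m2=http://host2:2380,m3=http://host3:2380 \
   --initial-cluster-token etcd-cluster-1 \
   --initial-advertise-peer-urls http://host1:2380
 ```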
 

+ 1 - 1
Documentation/op-guide/security.md

@@ -1,4 +1,4 @@
-# Security model
+# Transport security model
 
 etcd supports automatic TLS as well as authentication through client certificates, both for client-to-server and for peer (server-to-server / cluster) communication.
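 
 As an illustrative sketch (the addresses are made up), self-signed certificates can be generated automatically with the auto-TLS flags:
 
 ```sh
 # --auto-tls and --peer-auto-tls generate self-signed certificates for
 # client and peer traffic; convenient for testing, but clients cannot
 # verify self-signed certs, so use a proper CA for production.
 etcd --auto-tls --peer-auto-tls \
   --listen-client-urls https://10.0.1.10:2379 \
   --advertise-client-urls https://10.0.1.10:2379
 ```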
 

+ 5 - 5
Documentation/op-guide/supported-platform.md

@@ -1,8 +1,8 @@
-## Supported platforms
+# Supported systems
 
-### Current support
+## Current support
 
-The following table lists etcd support status for common architectures and operating systems,
+The following table lists etcd support status for common architectures and operating systems:
 
 | Architecture | Operating System | Status       | Maintainers                 |
 | ------------ | ---------------- | ------------ | --------------------------- |
@@ -18,7 +18,7 @@ The following table lists etcd support status for common architectures and opera
 
 Experimental platforms appear to work in practice and have some platform specific code in etcd, but do not fully conform to the stable support policy. Unstable platforms have been lightly tested, though less thoroughly than experimental ones. Unlisted architecture and operating system pairs are currently unsupported; caveat emptor.
 
-### Supporting a new platform
+## Supporting a new system platform
 
 For etcd to officially support a new platform as stable, a few requirements are necessary to ensure acceptable quality:
 
@@ -28,7 +28,7 @@ For etcd to officially support a new platform as stable, a few requirements are
 4. Set up CI (TravisCI, SemaphoreCI or Jenkins) for running integration tests; etcd must pass intensive tests.
 5. (Optional) Set up a functional testing cluster; an etcd cluster should survive stress testing.
 
-### 32-bit and other unsupported systems
+## 32-bit and other unsupported systems
 
 etcd has known issues on 32-bit systems due to a bug in the Go runtime. See the [Go issue][go-issue] and [atomic package][go-atomic] for more information.
 

+ 3 - 3
Documentation/op-guide/versioning.md

@@ -1,6 +1,6 @@
-## Versioning
+# Versioning
 
-### Service versioning
+## Service versioning
 
 etcd uses [semantic versioning](http://semver.org).
 New minor versions may add additional features to the API.
@@ -11,7 +11,7 @@ Get the running etcd cluster version with `etcdctl`:
 ETCDCTL_API=3 etcdctl --endpoints=127.0.0.1:2379 endpoint status
 ```
 
-### API versioning
+## API versioning
 
 The `v3` API responses should not change after the 3.0.0 release but new features will be added over time.
 

+ 1 - 1
Documentation/platforms/aws.md

@@ -1,4 +1,4 @@
-## Introduction
+# Amazon Web Services
 
 This guide assumes operational knowledge of Amazon Web Services (AWS), specifically Amazon Elastic Compute Cloud (EC2). It introduces design considerations for etcd deployments on AWS EC2 and how AWS-specific features may be utilized in that context.
 

+ 1 - 1
Documentation/platforms/container-linux-systemd.md

@@ -1,4 +1,4 @@
-# Run etcd on Container Linux with systemd
+# Container Linux with systemd
 
 The following guide shows how to run etcd with [systemd][systemd-docs] under [Container Linux][container-linux-docs].