@@ -18,8 +18,6 @@ Using an out-of-date data directory can lead to inconsistency as the member had
For maximum safety, if an etcd member suffers any sort of data corruption or loss, it must be removed from the cluster.
Once removed the member can be re-added with an empty data directory.

-[remove-a-member]: runtime-configuration.md#remove-a-member
-
### Contents

The data directory has two sub-directories in it:
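To make the remove-then-re-add cycle in the hunk above concrete, a minimal sketch with `etcdctl` might look like this (the member ID, name, URLs and data directory path are hypothetical):

```sh
# Drop the corrupted member (find its ID via `etcdctl member list`).
etcdctl member remove 6e3bd23ae5f1eae0

# On the affected machine, clear the stale data directory.
rm -rf /var/lib/etcd/infra2.etcd

# Re-add the member so it rejoins with an empty data directory.
etcdctl member add infra2 http://10.0.1.12:2380
```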
@@ -29,9 +27,6 @@ The data directory has two sub-directories in it:

If `--wal-dir` flag is set, etcd will write the write ahead log files to the specified directory instead of data directory.

-[wal-pkg]: http://godoc.org/github.com/coreos/etcd/wal
-[snap-pkg]: http://godoc.org/github.com/coreos/etcd/snap
-
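As an illustration of the `--wal-dir` flag referenced above (the name and paths are hypothetical), the write ahead log can be placed on a separate, faster disk:

```sh
# Hypothetical invocation: keep the WAL on a dedicated SSD.
etcd -name infra0 -data-dir /var/lib/etcd -wal-dir /mnt/ssd/etcd-wal
```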
## Cluster Management

### Lifecycle
@@ -65,7 +60,7 @@ cluster is healthy

#### Runtime Metrics

-etcd uses [Prometheus](http://prometheus.io/) for metrics reporting in the server. You can read more through the runtime metrics [doc](metrics.md).
+etcd uses [Prometheus][prometheus] for metrics reporting in the server. You can read more through the runtime metrics [doc][metrics].

### Debugging

@@ -94,7 +89,7 @@ Debug variables are exposed for real-time debugging purposes. Developers who are

`file_descriptor_limit` is the max number of file descriptors etcd can utilize.

-`memstats` is well explained [here](http://golang.org/pkg/runtime/#MemStats).
+`memstats` is explained in detail in the [Go runtime documentation][golang-memstats].

`raft.status` is useful when you want to debug low level raft issues if you are familiar with raft internals. In most cases, you do not need to check `raft.status`.
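For orientation, both the Prometheus metrics and the debug variables discussed in these hunks are served over plain HTTP; assuming the default client address, they can be inspected with curl:

```sh
# Prometheus metrics endpoint (assumes the default client URL).
curl -L http://127.0.0.1:2379/metrics

# Debug variables, including file_descriptor_limit, memstats and raft.status.
curl -L http://127.0.0.1:2379/debug/vars
```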
@@ -130,7 +125,7 @@ As you can see, adding another member to bring the size of cluster up to an odd

#### Changing Cluster Size

-After your cluster is up and running, adding or removing members is done via [runtime reconfiguration](runtime-configuration.md#cluster-reconfiguration-operations), which allows the cluster to be modified without downtime. The `etcdctl` tool has a `member list`, `member add` and `member remove` commands to complete this process.
+After your cluster is up and running, adding or removing members is done via [runtime reconfiguration][runtime-reconfig], which allows the cluster to be modified without downtime. The `etcdctl` tool has `member list`, `member add` and `member remove` commands to complete this process.
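In practice, the three `etcdctl` commands named in the rewritten paragraph are invoked like this (the member name, ID and peer URL below are hypothetical):

```sh
etcdctl member list                                # inspect current membership
etcdctl member add infra3 http://10.0.1.13:2380    # grow the cluster
etcdctl member remove 8211f1d0f64f3269             # shrink it again
```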
### Member Migration

@@ -138,10 +133,10 @@ When there is a scheduled machine maintenance or retirement, you might want to m

The data directory contains all the data to recover a member to its point-in-time state. To migrate a member:

-* Stop the member process
-* Copy the data directory of the now-idle member to the new machine
-* Update the peer URLs for that member to reflect the new machine according to the [runtime configuration] [change peer url]
-* Start etcd on the new machine, using the same configuration and the copy of the data directory
+* Stop the member process.
+* Copy the data directory of the now-idle member to the new machine.
+* Update the peer URLs for the replaced member to reflect the new machine according to the [runtime reconfiguration instructions][update-a-member].
+* Start etcd on the new machine, using the same configuration and the copy of the data directory.

This example will walk you through the process of migrating the infra1 member to a new machine:

@@ -212,8 +207,6 @@ etcd -name infra1 \
 -advertise-client-urls http://10.0.1.13:2379,http://127.0.0.1:2379
```

-[change peer url]: runtime-configuration.md#update-a-member
-
### Disaster Recovery

etcd is designed to be resilient to machine failures. An etcd cluster can automatically recover from any number of temporary failures (for example, machine reboots), and a cluster of N members can tolerate up to _(N-1)/2_ permanent failures (where a member can no longer access the cluster, due to hardware failure or disk corruption). However, in extreme circumstances, a cluster might permanently lose enough members such that quorum is irrevocably lost. For example, if a three-node cluster suffered two simultaneous and unrecoverable machine failures, it would be normally impossible for the cluster to restore quorum and continue functioning.
@@ -260,9 +253,9 @@ Once you have verified that etcd has started successfully, shut it down and move

#### Restoring the cluster

-Now that if the node is running successfully, you should [change its advertised peer URLs](runtime-configuration.md#update-a-member), as the `--force-new-cluster` has set the peer URL to the default (listening on localhost).
+Now that the node is running successfully, [change its advertised peer URLs][update-a-member], as the `--force-new-cluster` option has set the peer URL to the default (listening on localhost).

-You can then add more nodes to the cluster and restore resiliency. See the [add a new member](runtime-configuration.md#add-a-new-member) guide for more details. **NB:** If you are trying to restore your cluster using old failed etcd nodes, please make sure you have stopped old etcd instances and removed their old data directories specified by the data-dir configuration parameter.
+You can then add more nodes to the cluster and restore resiliency. See the [add a new member][add-a-member] guide for more details. **NB:** If you are trying to restore your cluster using old failed etcd nodes, please make sure you have stopped old etcd instances and removed their old data directories specified by the data-dir configuration parameter.
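As a sketch of the recovery steps referenced above (the member IDs and URLs are hypothetical), both the peer-URL fix and the subsequent re-expansion are `etcdctl` operations:

```sh
# Repoint the recovered member's peer URL away from the localhost default.
etcdctl member update 6e3bd23ae5f1eae0 http://10.0.1.10:2380

# Then add members one at a time to restore the desired cluster size.
etcdctl member add infra1 http://10.0.1.11:2380
```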
### Client Request Timeout

@@ -293,6 +286,18 @@ If timeout happens several times continuously, administrators should check statu

#### Maximum OS threads

-By default, etcd uses the default configuration of the Go 1.4 runtime, which means that at most one operating system thread will be used to execute code simultaneously. (Note that this default behavior [has changed in Go 1.5](https://golang.org/doc/go1.5#runtime)).
+By default, etcd uses the default configuration of the Go 1.4 runtime, which means that at most one operating system thread will be used to execute code simultaneously. (Note that this default behavior [has changed in Go 1.5][golang1.5-runtime]).
+
+When using etcd in heavy-load scenarios on machines with multiple cores it will usually be desirable to increase the number of threads that etcd can utilize. To do this, simply set the environment variable `GOMAXPROCS` to the desired number when starting etcd. For more information on this variable, see the [Go runtime documentation][golang-runtime].

-When using etcd in heavy-load scenarios on machines with multiple cores it will usually be desirable to increase the number of threads that etcd can utilize. To do this, simply set the environment variable `GOMAXPROCS` to the desired number when starting etcd. For more information on this variable, see the Go [runtime](https://golang.org/pkg/runtime) documentation.
+[add-a-member]: runtime-configuration.md#add-a-new-member
+[golang1.5-runtime]: https://golang.org/doc/go1.5#runtime
+[golang-memstats]: https://golang.org/pkg/runtime/#MemStats
+[golang-runtime]: https://golang.org/pkg/runtime
+[metrics]: metrics.md
+[prometheus]: http://prometheus.io/
+[remove-a-member]: runtime-configuration.md#remove-a-member
+[runtime-reconfig]: runtime-configuration.md#cluster-reconfiguration-operations
+[snap-pkg]: http://godoc.org/github.com/coreos/etcd/snap
+[update-a-member]: runtime-configuration.md#update-a-member
+[wal-pkg]: http://godoc.org/github.com/coreos/etcd/wal
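Finally, the `GOMAXPROCS` advice added in the last hunk amounts to an invocation like the following (the thread count and flag values are hypothetical):

```sh
# Allow the Go runtime to run etcd code on up to four OS threads.
GOMAXPROCS=4 etcd -name infra0 -data-dir /var/lib/etcd
```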