doc: add monitoring section to admin doc

Xiang Li, 10 years ago
commit 13b3c64c10
1 changed file with 68 additions and 0 deletions: Documentation/admin_guide.md

@@ -39,6 +39,74 @@ If `--wal-dir` flag is set, etcd will write the write ahead log files to the spe
 If you are spinning up multiple clusters for testing it is recommended that you specify a unique initial-cluster-token for the different clusters.
 This can protect you from cluster corruption in case of mis-configuration because two members started with different cluster tokens will refuse members from each other.

+#### Monitoring
+
+It is important to monitor your production etcd cluster both for overall health and for runtime metrics.
+
+##### Health Monitoring
+
+At the lowest level, etcd exposes health information via HTTP at `/health` in JSON format. If it returns `{"health": "true"}`, then the cluster is healthy. Note that the `/health` endpoint is still experimental as of etcd 2.2.
+
+```
+$ curl -L http://127.0.0.1:2379/health
+
+{"health": "true"}
+```
+
+You can also use etcdctl to check cluster-wide health. It will contact all members of the cluster and collect the health information for you.
+
+```
+$ ./etcdctl cluster-health
+member 8211f1d0f64f3269 is healthy: got healthy result from http://127.0.0.1:12379
+member 91bc3c398fb3c146 is healthy: got healthy result from http://127.0.0.1:22379
+member fd422379fda50e48 is healthy: got healthy result from http://127.0.0.1:32379
+cluster is healthy
+```
+
+##### Runtime Metrics
+
+etcd uses [Prometheus](http://prometheus.io/) for metrics reporting in the server. You can read more in the runtime metrics [doc](metrics.md).
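+
+If you want to sanity-check the metrics endpoint from the command line, you can fetch it with curl, just like the health endpoint. This is only a quick sketch; it assumes metrics are served at `/metrics` on the client URL, and the metrics doc remains the authoritative reference for the exposed endpoint and metric names.
+
+```
+$ curl -L http://127.0.0.1:2379/metrics
+```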
+
+#### Debugging
+
+Debugging a distributed system can be difficult. etcd provides several ways to make debugging
+easier.
+
+##### Enabling Debug Logging
+
+When you want to debug etcd without stopping it, you can enable debug logging at runtime.
+etcd exposes logging configuration at `/config/local/log`.
+
+```
+$ curl http://127.0.0.1:2379/config/local/log -XPUT -d '{"Level":"DEBUG"}'
+$ # debug logging enabled
+$
+$ curl http://127.0.0.1:2379/config/local/log -XPUT -d '{"Level":"INFO"}'
+$ # debug logging disabled
+```
+
+##### Debugging Variables
+
+Debug variables are exposed for real-time debugging purposes. Developers who are familiar with etcd can use these variables to debug unexpected behavior. etcd exposes debug variables via HTTP at `/debug/vars` in JSON format. The debug variables include
+`cmdline`, `file_descriptor_limit`, `memstats` and `raft.status`.
+
+`cmdline` is the command line arguments passed into etcd.
+
+`file_descriptor_limit` is the maximum number of file descriptors etcd can use.
+
+`memstats` is well explained [here](http://golang.org/pkg/runtime/#MemStats).
+
+`raft.status` is useful for debugging low-level raft issues if you are familiar with raft internals. In most cases, you do not need to check `raft.status`.
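+
+For a quick look, the variables can be fetched with a plain HTTP GET against a running member; the trimmed JSON response below shows the shape of the output.
+
+```
+$ curl -L http://127.0.0.1:2379/debug/vars
+```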
+
+```json
+{
+"cmdline": ["./etcd"],
+"file_descriptor_limit": 0,
+"memstats": {"Alloc":4105744,"TotalAlloc":42337320,"Sys":12560632,"...":"..."},
+"raft.status": {"id":"ce2a822cea30bfca","term":5,"vote":"ce2a822cea30bfca","commit":23509,"lead":"ce2a822cea30bfca","raftState":"StateLeader","progress":{"ce2a822cea30bfca":{"match":23509,"next":23510,"state":"ProgressStateProbe"}}}
+}
+```
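+
+If you only care about one of the variables, a JSON processor such as `jq` can extract it (an optional convenience, assuming `jq` is installed; note the bracket syntax because the key contains a dot):
+
+```
+$ curl -sL http://127.0.0.1:2379/debug/vars | jq '.["raft.status"]'
+```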
+
 #### Optimal Cluster Size

 The recommended etcd cluster size is 3, 5 or 7, which is determined by the fault tolerance requirement. A 7-member cluster can provide enough fault tolerance in most cases. While a larger cluster provides better fault tolerance, write performance suffers because data needs to be replicated to more machines.