@@ -30,53 +30,46 @@ Pending proposal (`pending_proposal_total`) gives you an idea about how many pro
Failed proposals (`proposal_failed_total`) are normally related to two issues: temporary failures related to a leader election or longer duration downtime caused by a loss of quorum in the cluster.
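+
+A quick way to watch for this is to graph or alert on the rate of failed proposals. A sketch (assuming a `job="etcd"` scrape label, as in the query examples below, and that the metric is exported under the `proposal_failed_total` name used above; any exported prefix is an assumption, not documented here):
+
+ * `sum(rate(proposal_failed_total{job="etcd"}[1m]))`
+
+    Shows the cluster-wide rate of failed proposals across a time window of `1m`; a sustained non-zero rate usually points at repeated leader elections or a lost quorum.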
+### wal
-### store
+| Name | Description | Type |
+|------------------------------------|--------------------------------------------------|---------|
+| fsync_durations_microseconds | The latency distributions of fsync called by wal | Summary |
+| last_index_saved | The index of the last entry saved by wal | Gauge |
-These metrics describe the accesses into the data store of etcd members that exist in the cluster. They
-are useful to count what kind of actions are taken by users. It is also useful to see and whether all etcd members
-"see" the same set of data mutations, and whether reads and watches (which are local) are equally distributed.
+Abnormally high fsync duration (`fsync_durations_microseconds`) indicates disk issues and might cause the cluster to be unstable.
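+
+Since `fsync_durations_microseconds` is exported as a Summary, one way to keep an eye on it is to derive the average fsync latency from its `_sum` and `_count` series. A sketch (assuming the standard Summary sub-series and that the exported name carries an `etcd_wal_` prefix, following the prefix convention used by the other sections):
+
+ * `sum(rate(etcd_wal_fsync_durations_microseconds_sum{job="etcd"}[5m])) / sum(rate(etcd_wal_fsync_durations_microseconds_count{job="etcd"}[5m]))`
+
+    Shows the average fsync latency (in microseconds) across all members, across a time window of `5m`; the `etcd_wal_` prefix here is assumed, not taken from the table above.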
-All these metrics are prefixed with `etcd_store_`.
-| Name | Description | Type |
-|---------------------------|------------------------------------------------------------------------------------------|--------------------|
-| reads_total | Total number of reads from store, should differ among etcd members (local reads). | Counter(action) |
-| writes_total | Total number of writes to store, should be same among all etcd members. | Counter(action) |
-| reads_failed_total | Number of failed reads from store (e.g. key missing) on local reads. | Counter(action) |
-| writes_failed_total | Number of failed writes to store (e.g. failed compare and swap). | Counter(action) |
-| expires_total | Total number of expired keys (due to TTL). | Counter |
-| watch_requests_totals | Total number of incoming watch requests to this etcd member (local watches). | Counter |
-| watchers | Current count of active watchers on this etcd member. | Gauge |
+### http requests
-Both `reads_total` and `writes_total` count both successful and failed requests. `reads_failed_total` and
-`writes_failed_total` count failed requests. A lot of failed writes indicate possible contentions on keys (e.g. when
-doing `compareAndSet`), and read failures indicate that some clients try to access keys that don't exist.
+These metrics describe the serving of requests (non-watch events) by etcd members in non-proxy mode: total
+incoming requests, request failures, and processing latency (including raft rounds for storage). They are useful
+for tracking user-generated traffic hitting the etcd cluster.
+
+All these metrics are prefixed with `etcd_http_`.
+
+| Name | Description | Type |
+|--------------------------------|-----------------------------------------------------------------------------------------|--------------------|
+| received_total | Total number of events after parsing and auth. | Counter(method) |
+| failed_total | Total number of failed events. | Counter(method,error) |
+| successful_duration_second | Bucketed handling times of the requests, including raft rounds for writes. | Histogram(method) |
-Example Prometheus queries that may be useful from these metrics (across all etcd members):
- * `sum(rate(etcd_store_reads_total{job="etcd"}[1m])) by (action)`
-   `max(rate(etcd_store_writes_total{job="etcd"}[1m])) by (action)`
+Example Prometheus queries that may be useful from these metrics (across all etcd members):
+
+ * `sum(rate(etcd_http_failed_total{job="etcd"}[1m])) by (method) / sum(rate(etcd_http_received_total{job="etcd"}[1m])) by (method)`
+
+    Shows the fraction of events that failed by HTTP method across all members, across a time window of `1m`.
+
+ * `sum(rate(etcd_http_received_total{job="etcd",method="GET"}[1m])) by (method)`
+   `sum(rate(etcd_http_received_total{job="etcd",method!="GET"}[1m])) by (method)`
-    Rate of reads and writes by action, across all servers across a time window of `1m`. The reason why `max` is used
-    for writes as opposed to `sum` for reads is because all of etcd nodes in the cluster apply all writes to their stores.
+    Shows the rate of successful readonly/write queries across all servers, across a time window of `1m`.
- * `sum(rate(etcd_store_watch_requests_total{job="etcd"}[1m]))`
-    Shows rate of new watch requests per second. Likely driven by how often watched keys change.
- * `sum(etcd_store_watchers{job="etcd"})`
+ * `histogram_quantile(0.9, sum(increase(etcd_http_successful_duration_second_bucket{job="etcd",method="GET"}[5m])) by (le))`
+   `histogram_quantile(0.9, sum(increase(etcd_http_successful_duration_second_bucket{job="etcd",method!="GET"}[5m])) by (le))`
-    Number of active watchers across all etcd servers.
-
-
-### wal
-
-| Name | Description | Type |
-|------------------------------------|--------------------------------------------------|---------|
-| fsync_durations_microseconds | The latency distributions of fsync called by wal | Summary |
-| last_index_saved | The index of the last entry saved by wal | Gauge |
-
-Abnormally high fsync duration (`fsync_durations_microseconds`) indicates disk issues and might cause the cluster to be unstable.
+    Shows the 90th percentile latency (in seconds) of read/write (respectively) event handling across all members, across a time window of `5m`.
### snapshot