
// Copyright 2015 CoreOS, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

/*
Package raft provides an implementation of the raft consensus algorithm.

Raft is a protocol by which a cluster of nodes can maintain a replicated state machine.
The state machine is kept in sync through the use of a replicated log.
For more details on Raft, you can read In Search of an Understandable Consensus Algorithm
(https://ramcloud.stanford.edu/raft.pdf) by Diego Ongaro and John Ousterhout.

A simple example application called raftexample is also available to help
illustrate how to use this package in practice:
https://github.com/coreos/etcd/tree/master/contrib/raftexample

Usage

The primary object in raft is a Node. You either start a Node from scratch
using raft.StartNode or start a Node from some initial state using raft.RestartNode.

To start a node from scratch:

  storage := raft.NewMemoryStorage()
  c := &Config{
    ID:              0x01,
    ElectionTick:    10,
    HeartbeatTick:   1,
    Storage:         storage,
    MaxSizePerMsg:   4096,
    MaxInflightMsgs: 256,
  }
  n := raft.StartNode(c, []raft.Peer{{ID: 0x02}, {ID: 0x03}})

To restart a node from previous state:

  storage := raft.NewMemoryStorage()

  // Recover the in-memory storage from persistent
  // snapshot, state and entries.
  storage.ApplySnapshot(snapshot)
  storage.SetHardState(state)
  storage.Append(entries)

  c := &Config{
    ID:              0x01,
    ElectionTick:    10,
    HeartbeatTick:   1,
    Storage:         storage,
    MaxSizePerMsg:   4096,
    MaxInflightMsgs: 256,
  }

  // Restart raft without peer information.
  // Peer information is already included in the storage.
  n := raft.RestartNode(c)

Now that you are holding onto a Node you have a few responsibilities:

First, you must read from the Node.Ready() channel and process the updates
it contains. These steps may be performed in parallel, except as noted in step
2.

1. Write HardState, Entries, and Snapshot to persistent storage if they are
not empty. Note that when writing an Entry with Index i, any
previously-persisted entries with Index >= i must be discarded.

2. Send all Messages to the nodes named in the To field. It is important that
no messages be sent until after the latest HardState has been persisted to disk,
and all Entries written by any previous Ready batch (Messages may be sent while
entries from the same batch are being persisted). If any Message has type MsgSnap,
call Node.ReportSnapshot() after it has been sent (these messages may be large).

3. Apply Snapshot (if any) and CommittedEntries to the state machine.
If any committed Entry has Type EntryConfChange, call Node.ApplyConfChange()
to apply it to the node. The configuration change may be cancelled at this point
by setting the NodeID field to zero before calling ApplyConfChange
(but ApplyConfChange must be called one way or the other, and the decision to cancel
must be based solely on the state machine and not external information such as
the observed health of the node).

4. Call Node.Advance() to signal readiness for the next batch of updates.
This may be done at any time after step 1, although all updates must be processed
in the order they were returned by Ready.

Second, all persisted log entries must be made available via an
implementation of the Storage interface. The provided MemoryStorage
type can be used for this (if you repopulate its state upon a
restart), or you can supply your own disk-backed implementation.

Third, when you receive a message from another node, pass it to Node.Step:

  func recvRaftRPC(ctx context.Context, m raftpb.Message) {
    n.Step(ctx, m)
  }

Finally, you need to call Node.Tick() at regular intervals (probably
via a time.Ticker). Raft has two important timeouts: heartbeat and the
election timeout. However, internally to the raft package time is
represented by an abstract "tick".

The total state machine handling loop will look something like this:

  for {
    select {
    case <-s.Ticker:
      n.Tick()
    case rd := <-s.Node.Ready():
      saveToStorage(rd.State, rd.Entries, rd.Snapshot)
      send(rd.Messages)
      if !raft.IsEmptySnap(rd.Snapshot) {
        processSnapshot(rd.Snapshot)
      }
      for _, entry := range rd.CommittedEntries {
        process(entry)
        if entry.Type == raftpb.EntryConfChange {
          var cc raftpb.ConfChange
          cc.Unmarshal(entry.Data)
          s.Node.ApplyConfChange(cc)
        }
      }
      s.Node.Advance()
    case <-s.done:
      return
    }
  }

To propose changes to the state machine from your node take your application
data, serialize it into a byte slice and call:

  n.Propose(ctx, data)

If the proposal is committed, data will appear in committed entries with type
raftpb.EntryNormal. There is no guarantee that a proposed command will be
committed; you may have to re-propose after a timeout.
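Serializing application data for a proposal can be sketched as below. The kvOp command type and the encode/decode helpers are hypothetical application-side code, not part of this package: raft treats proposed data as an opaque byte slice, so any stable encoding works.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// kvOp is a hypothetical application command; raft itself never
// inspects the bytes it carries.
type kvOp struct {
	Key string `json:"key"`
	Val string `json:"val"`
}

// encodeOp serializes a command into the byte slice that would be
// passed to n.Propose(ctx, data).
func encodeOp(op kvOp) ([]byte, error) {
	return json.Marshal(op)
}

// decodeOp reverses encodeOp when the same bytes reappear in a
// committed entry of type raftpb.EntryNormal.
func decodeOp(data []byte) (kvOp, error) {
	var op kvOp
	err := json.Unmarshal(data, &op)
	return op, err
}

func main() {
	data, err := encodeOp(kvOp{Key: "foo", Val: "bar"})
	if err != nil {
		panic(err)
	}
	op, err := decodeOp(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(op.Key, op.Val) // foo bar
}
```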

To add or remove a node in a cluster, build a ConfChange struct 'cc' and call:

  n.ProposeConfChange(ctx, cc)

After the configuration change is committed, some committed entry with type
raftpb.EntryConfChange will be returned. You must apply it to the node through:

  var cc raftpb.ConfChange
  cc.Unmarshal(data)
  n.ApplyConfChange(cc)

Note: An ID represents a unique node in a cluster for all time. A
given ID MUST be used only once even if the old node has been removed.
This means that for example IP addresses make poor node IDs since they
may be reused. Node IDs must be non-zero.
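One way to obtain such an ID, sketched below under the assumption that the application generates it once and persists it alongside the raft state, is to draw a random non-zero uint64. The newNodeID helper is illustrative, not part of this package.

```go
package main

import (
	"crypto/rand"
	"encoding/binary"
	"fmt"
)

// newNodeID draws a random non-zero uint64. An application would
// call this once when a node first joins, persist the result, and
// reuse it on every restart; the loop retries on the
// (astronomically unlikely) all-zero draw so the ID is never zero.
func newNodeID() (uint64, error) {
	var buf [8]byte
	for {
		if _, err := rand.Read(buf[:]); err != nil {
			return 0, err
		}
		if id := binary.BigEndian.Uint64(buf[:]); id != 0 {
			return id, nil
		}
	}
}

func main() {
	id, err := newNodeID()
	if err != nil {
		panic(err)
	}
	fmt.Printf("node ID: %#x\n", id)
}
```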

Implementation notes

This implementation is up to date with the final Raft thesis
(https://ramcloud.stanford.edu/~ongaro/thesis.pdf), although our
implementation of the membership change protocol differs somewhat from
that described in chapter 4. The key invariant that membership changes
happen one node at a time is preserved, but in our implementation the
membership change takes effect when its entry is applied, not when it
is added to the log (so the entry is committed under the old
membership instead of the new). This is equivalent in terms of safety,
since the old and new configurations are guaranteed to overlap.

To ensure that we do not attempt to commit two membership changes at
once by matching log positions (which would be unsafe since they
should have different quorum requirements), we simply disallow any
proposed membership change while any uncommitted change appears in
the leader's log.

This approach introduces a problem when you try to remove a member
from a two-member cluster: If one of the members dies before the
other one receives the commit of the confchange entry, then the member
cannot be removed any more since the cluster cannot make progress.
For this reason it is highly recommended to use three or more nodes in
every cluster.
*/
package raft