doc.go 13 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289
  1. // Copyright 2015 CoreOS, Inc.
  2. //
  3. // Licensed under the Apache License, Version 2.0 (the "License");
  4. // you may not use this file except in compliance with the License.
  5. // You may obtain a copy of the License at
  6. //
  7. // http://www.apache.org/licenses/LICENSE-2.0
  8. //
  9. // Unless required by applicable law or agreed to in writing, software
  10. // distributed under the License is distributed on an "AS IS" BASIS,
  11. // WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  12. // See the License for the specific language governing permissions and
  13. // limitations under the License.
  14. /*
  15. Package raft sends and receives messages in the Protocol Buffer format
  16. defined in the raftpb package.
  17. Raft is a protocol with which a cluster of nodes can maintain a replicated state machine.
  18. The state machine is kept in sync through the use of a replicated log.
  19. For more details on Raft, see "In Search of an Understandable Consensus Algorithm"
  20. (https://ramcloud.stanford.edu/raft.pdf) by Diego Ongaro and John Ousterhout.
  21. A simple example application, _raftexample_, is also available to help illustrate
  22. how to use this package in practice:
  23. https://github.com/coreos/etcd/tree/master/contrib/raftexample
  24. Usage
  25. The primary object in raft is a Node. You either start a Node from scratch
  26. using raft.StartNode or start a Node from some initial state using raft.RestartNode.
  27. To start a node from scratch:
  28. storage := raft.NewMemoryStorage()
  29. c := &Config{
  30. ID: 0x01,
  31. ElectionTick: 10,
  32. HeartbeatTick: 1,
  33. Storage: storage,
  34. MaxSizePerMsg: 4096,
  35. MaxInflightMsgs: 256,
  36. }
  37. n := raft.StartNode(c, []raft.Peer{{ID: 0x02}, {ID: 0x03}})
  38. To restart a node from previous state:
  39. storage := raft.NewMemoryStorage()
  40. // recover the in-memory storage from persistent
  41. // snapshot, state and entries.
  42. storage.ApplySnapshot(snapshot)
  43. storage.SetHardState(state)
  44. storage.Append(entries)
  45. c := &Config{
  46. ID: 0x01,
  47. ElectionTick: 10,
  48. HeartbeatTick: 1,
  49. Storage: storage,
  50. MaxSizePerMsg: 4096,
  51. MaxInflightMsgs: 256,
  52. }
  53. // restart raft without peer information.
  54. // peer information is already included in the storage.
  55. n := raft.RestartNode(c)
  56. Now that you are holding onto a Node you have a few responsibilities:
  57. First, you must read from the Node.Ready() channel and process the updates
  58. it contains. These steps may be performed in parallel, except as noted in step
  59. 2.
  60. 1. Write HardState, Entries, and Snapshot to persistent storage if they are
  61. not empty. Note that when writing an Entry with Index i, any
  62. previously-persisted entries with Index >= i must be discarded.
  63. 2. Send all Messages to the nodes named in the To field. It is important that
  64. no messages be sent until after the latest HardState has been persisted to disk,
  65. and all Entries written by any previous Ready batch (Messages may be sent while
  66. entries from the same batch are being persisted). If any Message has type MsgSnap,
  67. call Node.ReportSnapshot() after it has been sent (these messages may be large).
  68. Note: Marshalling messages is not thread-safe; it is important that you
  69. make sure that no new entries are persisted while marshalling.
  70. The easiest way to achieve this is to serialise the messages directly inside
  71. your main raft loop.
  72. 3. Apply Snapshot (if any) and CommittedEntries to the state machine.
  73. If any committed Entry has Type EntryConfChange, call Node.ApplyConfChange()
  74. to apply it to the node. The configuration change may be cancelled at this point
  75. by setting the NodeID field to zero before calling ApplyConfChange
  76. (but ApplyConfChange must be called one way or the other, and the decision to cancel
  77. must be based solely on the state machine and not external information such as
  78. the observed health of the node).
  79. 4. Call Node.Advance() to signal readiness for the next batch of updates.
  80. This may be done at any time after step 1, although all updates must be processed
  81. in the order they were returned by Ready.
  82. Second, all persisted log entries must be made available via an
  83. implementation of the Storage interface. The provided MemoryStorage
  84. type can be used for this (if you repopulate its state upon a
  85. restart), or you can supply your own disk-backed implementation.
  86. Third, when you receive a message from another node, pass it to Node.Step:
  87. func recvRaftRPC(ctx context.Context, m raftpb.Message) {
  88. n.Step(ctx, m)
  89. }
  90. Finally, you need to call Node.Tick() at regular intervals (probably
  91. via a time.Ticker). Raft has two important timeouts: heartbeat and the
  92. election timeout. However, internally to the raft package time is
  93. represented by an abstract "tick".
  94. The total state machine handling loop will look something like this:
  95. for {
  96. select {
  97. case <-s.Ticker:
  98. n.Tick()
  99. case rd := <-s.Node.Ready():
  100. saveToStorage(rd.State, rd.Entries, rd.Snapshot)
  101. send(rd.Messages)
  102. if !raft.IsEmptySnap(rd.Snapshot) {
  103. processSnapshot(rd.Snapshot)
  104. }
  105. for _, entry := range rd.CommittedEntries {
  106. process(entry)
  107. if entry.Type == raftpb.EntryConfChange {
  108. var cc raftpb.ConfChange
  109. cc.Unmarshal(entry.Data)
  110. s.Node.ApplyConfChange(cc)
  111. }
  112. s.Node.Advance()
  113. case <-s.done:
  114. return
  115. }
  116. }
  117. To propose changes to the state machine from your node take your application
  118. data, serialize it into a byte slice and call:
  119. n.Propose(ctx, data)
  120. If the proposal is committed, data will appear in committed entries with type
  121. raftpb.EntryNormal. There is no guarantee that a proposed command will be
  122. committed; you may have to re-propose after a timeout.
  123. To add or remove node in a cluster, build ConfChange struct 'cc' and call:
  124. n.ProposeConfChange(ctx, cc)
  125. After config change is committed, some committed entry with type
  126. raftpb.EntryConfChange will be returned. You must apply it to node through:
  127. var cc raftpb.ConfChange
  128. cc.Unmarshal(data)
  129. n.ApplyConfChange(cc)
  130. Note: An ID represents a unique node in a cluster for all time. A
  131. given ID MUST be used only once even if the old node has been removed.
  132. This means that for example IP addresses make poor node IDs since they
  133. may be reused. Node IDs must be non-zero.
  134. Implementation notes
  135. This implementation is up to date with the final Raft thesis
  136. (https://ramcloud.stanford.edu/~ongaro/thesis.pdf), although our
  137. implementation of the membership change protocol differs somewhat from
  138. that described in chapter 4. The key invariant that membership changes
  139. happen one node at a time is preserved, but in our implementation the
  140. membership change takes effect when its entry is applied, not when it
  141. is added to the log (so the entry is committed under the old
  142. membership instead of the new). This is equivalent in terms of safety,
  143. since the old and new configurations are guaranteed to overlap.
  144. To ensure that we do not attempt to commit two membership changes at
  145. once by matching log positions (which would be unsafe since they
  146. should have different quorum requirements), we simply disallow any
  147. proposed membership change while any uncommitted change appears in
  148. the leader's log.
  149. This approach introduces a problem when you try to remove a member
  150. from a two-member cluster: If one of the members dies before the
  151. other one receives the commit of the confchange entry, then the member
  152. cannot be removed any more since the cluster cannot make progress.
  153. For this reason it is highly recommended to use three or more nodes in
  154. every cluster.
  155. MessageType
  156. Package raft sends and receives message in Protocol Buffer format (defined
  157. in raftpb package). Each state (follower, candidate, leader) implements its
  158. own 'step' method ('stepFollower', 'stepCandidate', 'stepLeader') when
  159. advancing with the given raftpb.Message. Each step is determined by its
  160. raftpb.MessageType. Note that every step is checked by one common method
  161. 'Step' that safety-checks the terms of node and incoming message to prevent
  162. stale log entries:
  163. 'MsgHup' is used for election. If a node is a follower or candidate, the
  164. 'tick' function in 'raft' struct is set as 'tickElection'. If a follower or
  165. candidate has not received any heartbeat before the election timeout, it
  166. passes 'MsgHup' to its Step method and becomes (or remains) a candidate to
  167. start a new election.
  168. 'MsgBeat' is an internal type that signals leaders to send a heartbeat of
  169. the 'MsgHeartbeat' type. If a node is a leader, the 'tick' function in
  170. the 'raft' struct is set as 'tickHeartbeat', and sends periodic heartbeat
  171. messages of the 'MsgBeat' type to its followers.
  172. 'MsgProp' proposes to append data to its log entries. This is a special
  173. type to redirect proposals to leader. Therefore, send method overwrites
  174. raftpb.Message's term with its HardState's term to avoid attaching its
  175. local term to 'MsgProp'. When 'MsgProp' is passed to the leader's 'Step'
  176. method, the leader first calls the 'appendEntry' method to append entries
  177. to its log, and then calls 'bcastAppend' method to send those entries to
  178. its peers. When passed to candidate, 'MsgProp' is dropped. When passed to
  179. follower, 'MsgProp' is stored in follower's mailbox(msgs) by the send
  180. method. It is stored with sender's ID and later forwarded to leader by
  181. rafthttp package.
  182. 'MsgApp' contains log entries to replicate. A leader calls bcastAppend,
  183. which calls sendAppend, which sends soon-to-be-replicated logs in 'MsgApp'
  184. type. When 'MsgApp' is passed to candidate's Step method, candidate reverts
  185. back to follower, because it indicates that there is a valid leader sending
  186. 'MsgApp' messages. Candidate and follower respond to this message in
  187. 'MsgAppResp' type.
  188. 'MsgAppResp' is response to log replication request('MsgApp'). When
  189. 'MsgApp' is passed to candidate or follower's Step method, it responds by
  190. calling 'handleAppendEntries' method, which sends 'MsgAppResp' to raft
  191. mailbox.
  192. 'MsgVote' requests votes for election. When a node is a follower or
  193. candidate and 'MsgHup' is passed to its Step method, then the node calls
  194. 'campaign' method to campaign itself to become a leader. Once 'campaign'
  195. method is called, the node becomes candidate and sends 'MsgVote' to peers
  196. in cluster to request votes. When passed to leader or candidate's Step
  197. method and the message's Term is lower than leader's or candidate's,
  198. 'MsgVote' will be rejected ('MsgVoteResp' is returned with Reject true).
  199. If leader or candidate receives 'MsgVote' with higher term, it will revert
  200. back to follower. When 'MsgVote' is passed to follower, it votes for the
  201. sender only when sender's last term is greater than MsgVote's term or
  202. sender's last term is equal to MsgVote's term but sender's last committed
  203. index is greater than or equal to follower's.
  204. 'MsgVoteResp' contains responses from voting request. When 'MsgVoteResp' is
  205. passed to candidate, the candidate calculates how many votes it has won. If
  206. it's more than majority (quorum), it becomes leader and calls 'bcastAppend'.
  207. If candidate receives majority of votes of denials, it reverts back to
  208. follower.
  209. 'MsgSnap' requests to install a snapshot message. When a node has just
  210. become a leader or the leader receives 'MsgProp' message, it calls
  211. 'bcastAppend' method, which then calls 'sendAppend' method to each
  212. follower. In 'sendAppend', if a leader fails to get term or entries,
  213. the leader requests snapshot by sending 'MsgSnap' type message.
  214. 'MsgSnapStatus' tells the result of snapshot install message. When a
  215. follower rejected 'MsgSnap', it indicates the snapshot request with
  216. 'MsgSnap' had failed from network issues which causes the network layer
  217. to fail to send out snapshots to its followers. Then leader considers
  218. follower's progress as probe. When 'MsgSnap' were not rejected, it
  219. indicates that the snapshot succeeded and the leader sets follower's
  220. progress to probe and resumes its log replication.
  221. 'MsgHeartbeat' sends heartbeat from leader. When 'MsgHeartbeat' is passed
  222. to candidate and message's term is higher than candidate's, the candidate
  223. reverts back to follower and updates its committed index from the one in
  224. this heartbeat. And it sends the message to its mailbox. When
  225. 'MsgHeartbeat' is passed to follower's Step method and message's term is
  226. higher than follower's, the follower updates its leaderID with the ID
  227. from the message.
  228. 'MsgHeartbeatResp' is a response to 'MsgHeartbeat'. When 'MsgHeartbeatResp'
  229. is passed to leader's Step method, the leader knows which follower
  230. responded. And only when the leader's last committed index is greater than
  231. follower's Match index, the leader runs 'sendAppend` method.
  232. 'MsgUnreachable' tells that request(message) wasn't delivered. When
  233. 'MsgUnreachable' is passed to leader's Step method, the leader discovers
  234. that the follower that sent this 'MsgUnreachable' is not reachable, often
  235. indicating 'MsgApp' is lost. When follower's progress state is replicate,
  236. the leader sets it back to probe.
  237. */
  238. package raft