WIP: endpoints: include learner node endpoints #1389
The main goal of this change is to give kube-apiserver instances a complete list of etcd endpoints from revision 1, which speeds up the bootstrap process and makes it more reliable. Without it, we have to wait for around 4 revisions, and we've had issues where some apiservers had no live etcd endpoints accessible because their etcd-servers list contained only the bootstrap node and localhost.
Ultimately, including the learners should help reduce the load on other etcd members, but this is not the goal here.
The reasoning behind this change is that learner nodes are able to serve serializable read and status requests: https://github.com/etcd-io/etcd/blob/main/server/etcdserver/api/v3rpc/util.go#L141-L151
For any other kind of request, the server returns an RPC error: https://github.com/etcd-io/etcd/blob/main/server/etcdserver/api/v3rpc/interceptor.go#L52-L54
The client always retries on this error until it finds a voter node that can serve the request: https://github.com/etcd-io/etcd/blob/main/client/v3/retry_interceptor.go#L326-L334
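To make the flow above concrete, here is a minimal sketch of the behavior as I understand it. It is not the etcd client code: `member`, `request`, `serve`, `doWithRetry`, and `errLearnerUnsupported` are hypothetical names for illustration only, and the real client retries through its gRPC interceptor and balancer rather than a plain loop.

```go
package main

import (
	"errors"
	"fmt"
)

// errLearnerUnsupported is a stand-in (assumption) for the RPC error a
// learner returns when it receives a request it cannot serve.
var errLearnerUnsupported = errors.New("rpc not supported for learner")

// member is a hypothetical, simplified view of an etcd member.
type member struct {
	endpoint  string
	isLearner bool
}

// request is a hypothetical request descriptor. Status requests, which
// learners can also serve, are left out to keep the sketch short.
type request struct {
	serializableRead bool
}

// serve mimics the server-side check: a learner only accepts
// serializable reads and rejects everything else.
func (m member) serve(r request) error {
	if m.isLearner && !r.serializableRead {
		return errLearnerUnsupported
	}
	return nil
}

// doWithRetry tries the members in order and moves on to the next one
// whenever a learner rejects the request. It returns the endpoint that
// served the request and the number of attempts made.
func doWithRetry(members []member, r request) (string, int, error) {
	attempts := 0
	for _, m := range members {
		attempts++
		err := m.serve(r)
		if err == nil {
			return m.endpoint, attempts, nil
		}
		if !errors.Is(err, errLearnerUnsupported) {
			return "", attempts, err
		}
	}
	return "", attempts, errors.New("no member could serve the request")
}

func main() {
	members := []member{
		{endpoint: "https://10.0.0.2:2379", isLearner: true},
		{endpoint: "https://10.0.0.3:2379", isLearner: false},
	}
	// A non-serializable request is rejected by the learner and then
	// served by the voter on the second attempt.
	ep, attempts, err := doWithRetry(members, request{serializableRead: false})
	fmt.Println(ep, attempts, err)
}
```

In this sketch the cost of having a learner in the list is one extra attempt for requests it cannot serve, while serializable reads are served by the learner directly.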
To keep these RPC errors out of the logs, cluster-etcd-operator (CEO) has a condition that leaves learner nodes out of the list of endpoints: https://github.com/openshift/cluster-etcd-operator/blob/fa34bdfb8ae17f5698ae0f3086[…]pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go
In my opinion, though, CEO should include learner nodes, since the retry logic will let the voters take over. We will see a few RPC errors coming back, but bootstrap should get faster and more reliable because we will be able to count on more nodes than we currently do. The RPC errors will also go away if this TODO is ever addressed and the client becomes aware of learners/voters/leaders: https://github.com/etcd-io/etcd/blob/main/client/v3/retry_interceptor.go#L331
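As a rough illustration of the proposed behavior, the sketch below builds the endpoint list with and without learners. It does not mirror the actual controller code: `etcdMember`, `endpointList`, and the `includeLearners` flag are hypothetical names, and the addresses are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// etcdMember is a hypothetical, simplified member record.
type etcdMember struct {
	name       string
	clientURLs []string
	isLearner  bool
}

// endpointList collects client URLs from the given members. When
// includeLearners is false, learners are skipped, which corresponds to
// the existing condition; when true, every member contributes its URLs.
func endpointList(members []etcdMember, includeLearners bool) []string {
	var endpoints []string
	for _, m := range members {
		if m.isLearner && !includeLearners {
			continue
		}
		endpoints = append(endpoints, m.clientURLs...)
	}
	// Sort so the resulting list is stable across reconciliations.
	sort.Strings(endpoints)
	return endpoints
}

func main() {
	members := []etcdMember{
		{name: "bootstrap", clientURLs: []string{"https://10.0.0.1:2379"}},
		{name: "master-0", clientURLs: []string{"https://10.0.0.2:2379"}, isLearner: true},
		{name: "master-1", clientURLs: []string{"https://10.0.0.3:2379"}, isLearner: true},
	}
	fmt.Println("voters only:  ", endpointList(members, false))
	fmt.Println("with learners:", endpointList(members, true))
}
```

With learners excluded, an early revision in this example would only see the bootstrap node; with learners included, all three endpoints are available from the start.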