WIP: endpoints: include learner node endpoints #1389

Open: wants to merge 1 commit into base: master

dgrisonnet
Member

@dgrisonnet dgrisonnet commented Jan 23, 2025

The main goal of this change is to give kube-apiserver instances a complete list of etcd endpoints from revision 1, which should speed up the bootstrap process and make it more reliable. Without this change, we have to wait around 4 revisions, and we've had issues where some apiservers had no live etcd endpoint accessible because their etcd-servers list contained only the bootstrap node and localhost.

Ultimately, including the learners should help reduce the load on other etcd members, but this is not the goal here.

The reasoning behind this change is that learner nodes are able to serve serializable read and status requests: https://github.com/etcd-io/etcd/blob/main/server/etcdserver/api/v3rpc/util.go#L141-L151

For any other kind of request, the server returns an RPC error: https://github.com/etcd-io/etcd/blob/main/server/etcdserver/api/v3rpc/interceptor.go#L52-L54

This error will always be retried by the client until it finds a voter node that can serve it: https://github.com/etcd-io/etcd/blob/main/client/v3/retry_interceptor.go#L326-L334
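
To make the reasoning above concrete, here is a minimal, self-contained sketch of the behaviour being relied on. It is a simplified model, not etcd's code: the `endpoint` type, `errLearnerRejected`, and `doWithRetry` are hypothetical names, and the real client picks endpoints through its balancer rather than walking a slice in order.

```go
package main

import (
	"errors"
	"fmt"
)

// errLearnerRejected is a hypothetical stand-in for the RPC error a learner
// returns when it receives a request it cannot serve.
var errLearnerRejected = errors.New("rpc not supported for learner")

// endpoint is a hypothetical model of one etcd member as seen by a client.
type endpoint struct {
	name      string
	isLearner bool
}

// serve models the rule described in the PR: a learner only serves
// serializable reads and status requests, while a voter serves everything.
func (e endpoint) serve(serializableOrStatus bool) error {
	if e.isLearner && !serializableOrStatus {
		return errLearnerRejected
	}
	return nil
}

// doWithRetry tries each endpoint in order and moves on to the next one when
// a learner rejects the request. It returns the name of the endpoint that
// served the request and how many attempts were rejected along the way.
func doWithRetry(eps []endpoint, serializableOrStatus bool) (string, int, error) {
	rejected := 0
	for _, ep := range eps {
		err := ep.serve(serializableOrStatus)
		if err == nil {
			return ep.name, rejected, nil
		}
		if !errors.Is(err, errLearnerRejected) {
			return "", rejected, err
		}
		rejected++
	}
	return "", rejected, errors.New("no endpoint could serve the request")
}

func main() {
	eps := []endpoint{
		{name: "learner-0", isLearner: true},
		{name: "voter-0", isLearner: false},
	}

	// A linearizable request is rejected by the learner, then served by the voter.
	name, rejected, err := doWithRetry(eps, false)
	fmt.Println(name, rejected, err)

	// A serializable read or status request is served by the learner directly.
	name, rejected, err = doWithRetry(eps, true)
	fmt.Println(name, rejected, err)
}
```

In this model, each rejected attempt corresponds to one of the RPC errors that would show up in the logs, so the cost of listing a learner is a bounded number of extra attempts rather than a failed request.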

To avoid seeing the RPC errors in the logs, ceo has a condition that excludes learner nodes from the list of endpoints: https://github.com/openshift/cluster-etcd-operator/blob/fa34bdfb8ae17f5698ae0f3086[…]pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go
In my opinion, however, ceo should include learner nodes, since the retry logic will let the voters take over. We will see a few RPC errors coming back, but this should improve both bootstrap time and reliability, as we will be able to count on more nodes than we currently can. The RPC errors will also go away if this TODO is ever addressed and the client becomes aware of learners/voters/leaders: https://github.com/etcd-io/etcd/blob/main/client/v3/retry_interceptor.go#L331
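
As an illustration of the proposal, the sketch below shows the difference between filtering learners out of the endpoint list and keeping them. This is not the operator's actual code: the `member` type and the `endpointList` function are hypothetical, with fields chosen to resemble what an etcd member listing exposes (client URLs and a learner flag), and the URLs are made-up examples.

```go
package main

import (
	"fmt"
	"sort"
)

// member is a hypothetical, simplified view of an etcd cluster member.
type member struct {
	Name       string
	ClientURLs []string
	IsLearner  bool
}

// endpointList collects the client URLs to hand to kube-apiserver.
// With includeLearners set to false it mimics the current filtering;
// with true it mimics the behaviour proposed in this PR.
func endpointList(members []member, includeLearners bool) []string {
	var urls []string
	for _, m := range members {
		if m.IsLearner && !includeLearners {
			continue // current behaviour: skip members that are still learners
		}
		urls = append(urls, m.ClientURLs...)
	}
	sort.Strings(urls) // stable ordering so the list does not churn between syncs
	return urls
}

func main() {
	// Example cluster early in bootstrap: one voter and two learners.
	members := []member{
		{Name: "bootstrap", ClientURLs: []string{"https://10.0.0.1:2379"}},
		{Name: "master-0", ClientURLs: []string{"https://10.0.0.2:2379"}, IsLearner: true},
		{Name: "master-1", ClientURLs: []string{"https://10.0.0.3:2379"}, IsLearner: true},
	}

	fmt.Println("without learners:", endpointList(members, false))
	fmt.Println("with learners:   ", endpointList(members, true))
}
```

With learners filtered out, the example cluster exposes a single endpoint; with learners included, all three are available from the start, which is the effect the PR is after.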

Signed-off-by: Damien Grisonnet <dgrisonn@redhat.com>
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 23, 2025
@openshift-ci openshift-ci bot requested review from hasbro17 and tjungblu January 23, 2025 18:13
Contributor

openshift-ci bot commented Jan 23, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: dgrisonnet
Once this PR has been reviewed and has the lgtm label, please assign hasbro17 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

openshift-ci bot commented Jan 23, 2025

@dgrisonnet: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-etcd-recovery | 1b3e4fb | link | false | /test e2e-aws-etcd-recovery |
| ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown | 1b3e4fb | link | false | /test e2e-metal-ovn-ha-cert-rotation-shutdown |
| ci/prow/okd-scos-e2e-aws-ovn | 1b3e4fb | link | false | /test okd-scos-e2e-aws-ovn |
| ci/prow/e2e-aws-ovn-etcd-scaling | 1b3e4fb | link | true | /test e2e-aws-ovn-etcd-scaling |
| ci/prow/e2e-aws-ovn-single-node | 1b3e4fb | link | true | /test e2e-aws-ovn-single-node |
| ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown | 1b3e4fb | link | false | /test e2e-metal-ovn-sno-cert-rotation-shutdown |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
