Improve storage model of Caddy's pods in K8s #737
Comments
Aye aye, I agree that there is an issue and it's important that we fix it. Let me pull Florian @fghaas into the conversation, as he proposed one of the original fixes and he also has extensive expertise about k8s.
I don't remember this conversation, but I assume that my argument was that we needed to preserve SSL certificates or we would be rate-limited by Let's Encrypt's servers. I am also facing this issue when I redeploy the demo Open edX platform 30 times/month. To be honest, I don't feel very competent to propose a smart solution -- although I do understand the problem. What would you suggest?
That recreation can take up to 15 minutes in my experience, and that's not counting ACME service disruptions. I would argue that 15 minutes (or longer if the ACME service happens to be unavailable) is not an acceptable service interruption, so I think just relying on automatic cert regeneration in case of a pod being rescheduled to another node is not an option. In other words: we do need the certificate data in a PV.
More broadly, use a volume with an access mode other than the If
then you'll need an So, the only thing I can think of here is to make the Caddy PVC's access mode configurable somehow:
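To make the idea of a configurable access mode concrete, here is a minimal sketch in Python. It is only an illustration: the function name `caddy_pvc_manifest`, its parameters, the claim name `caddy` and the `1Gi` default are assumptions made for this example, not Tutor's actual settings or templates. Only the four access mode names come from Kubernetes itself.

```python
# Access modes that Kubernetes defines for persistent volumes.
VALID_ACCESS_MODES = {
    "ReadWriteOnce",
    "ReadOnlyMany",
    "ReadWriteMany",
    "ReadWriteOncePod",
}


def caddy_pvc_manifest(access_mode="ReadWriteOnce", storage="1Gi"):
    """Build a PersistentVolumeClaim manifest as a plain dict.

    Hypothetical helper: the claim name and default size are assumptions
    for illustration, not values taken from the project.
    """
    if access_mode not in VALID_ACCESS_MODES:
        raise ValueError(f"unsupported access mode: {access_mode!r}")
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": "caddy"},
        "spec": {
            # The only field that the hypothetical setting would change.
            "accessModes": [access_mode],
            "resources": {"requests": {"storage": storage}},
        },
    }


manifest = caddy_pvc_manifest(access_mode="ReadWriteMany")
```

Note that choosing a mode such as `ReadWriteMany` only helps if the cluster's storage backend actually supports it, which is part of why exposing this knob is debatable.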
Is exposing that implementation detail actually useful and beneficial to users?
I have let this issue linger for too long, sorry about that... @wahajArbisoft what's your take on this issue?
Kubernetes has much more stable and scalable ways to obtain certificates (cert-manager, for example) and to manage ingress traffic (cluster ingress controllers like Traefik). For larger deployments, these should be the recommended approach. If the PVC is only used for certificate storage, then setting
I would like to start a discussion about the persistence model of Caddy.
Currently, Caddy's deployment definition has a generic persistent volume claim, which causes the cluster to create a generic persistent volume and mount it at the /data directory.
In our first failover tests, we've found that, in multi-AZ environments, Caddy fails to reschedule to a different AZ, as a pod cannot bind to a PV in a different AZ (I have a post in discuss about this).
Now a commit to fix rolling updates in Caddy makes this even harder, as all Caddy pods must run on the same node as the original ReplicaSet. Recently we had a site outage when Caddy crashed: it failed to reschedule due to a lack of resources on the original node, and was prevented from rescheduling to another node. We had to delete the volume and the deployment manually, after which Caddy was rescheduled to another node.
Additionally, there is an excellent backup plugin, which makes a backup of Caddy's data volume. To do this, it sets a node affinity with Caddy so that it can access the volume. This is also an issue, because if the node does not have enough resources to allocate the pods for the backup or restore jobs, these tasks will fail.
The idea behind K8s is to have nodes tightly dimensioned to support their current workloads, and to let the scheduler assign pods to nodes dynamically wherever there is room. So it is common for nodes to run out of resources, and crashing pods then need to be rescheduled to another node. Too many node affinity constraints and taints limit the scheduler's ability to do this and may lead to pods failing to start.
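The effect of these constraints can be pictured with a toy model in Python. This is a deliberately simplified sketch, not how the real Kubernetes scheduler works: the `Node` class, the `schedulable_nodes` function, and the node names, zones and CPU figures are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    zone: str
    free_cpu_millis: int


def schedulable_nodes(nodes, cpu_request_millis, required_zone=None, required_node=None):
    """Return the names of nodes that could host a pod in this toy model.

    A node qualifies if it has enough free CPU and satisfies the optional
    zone constraint (e.g. a volume bound to one AZ) and the optional node
    constraint (e.g. an affinity to the node holding the volume).
    """
    return [
        n.name
        for n in nodes
        if n.free_cpu_millis >= cpu_request_millis
        and (required_zone is None or n.zone == required_zone)
        and (required_node is None or n.name == required_node)
    ]


# Hypothetical cluster: node-a is nearly full, the other two have room.
cluster = [
    Node("node-a", "zone-1", 100),
    Node("node-b", "zone-1", 2000),
    Node("node-c", "zone-2", 2000),
]

unconstrained = schedulable_nodes(cluster, 500)
zone_bound = schedulable_nodes(cluster, 500, required_zone="zone-1")
node_bound = schedulable_nodes(cluster, 500, required_node="node-a")
```

With no constraints the pod fits on two nodes, with the zone constraint on one, and with the node constraint on none, which mirrors the outage described above: the cluster had room, but not where the pod was allowed to go.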
AFAIK, Caddy uses this volume to store only the SSL certificates, which are generated dynamically and can be recreated if lost. The other core pods of Open edX do not require any PV (apart from MySQL, MongoDB, Elasticsearch, Redis and MinIO, which can be consumed as services outside the K8s cluster).
I would like to start a discussion to change the way Caddy stores this data, to improve its scalability and resiliency.
As a starting point, we can review some of these options: