GKE Deployment with gvisor fails with Cloud DNS but not with kube-dns #3418
Comments
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗
"Dataplane V2 enabled": this could be what it's about, I think. This is GKE's name for using Cilium as the CNI in the k8s cluster, and when Cilium enforces the chart's NetworkPolicy rules, there can be issues because of a documented limitation in Cilium. Can you try disabling creation of NetworkPolicy resources, or alternatively delete those created by the chart, and see if that helps to start? There is an issue or two about this in the repo already; also search for cilium / dataplane v2 and see if you find it. I'm on mobile and can't quickly provide a link =\
For a resolution, it's good to know how Cloud DNS is accessed from within the cluster: is it a pod in the cluster, a private IP (10.x.x.x, 172.x.x.x, 192.168.x.x), or some unusual port?
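One way to answer that question from inside a pod is to look at the nameservers the pod was given. The following is only a minimal sketch: the function names (`parse_nameservers`, `classify`, `describe_resolv_conf`) are made up for illustration, and it assumes the pod has Python available and a standard `/etc/resolv.conf`. It does not say anything about which address GKE actually uses.

```python
import ipaddress

# The three RFC 1918 private ranges (10.x.x.x, 172.16-31.x.x, 192.168.x.x).
RFC1918 = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def parse_nameservers(resolv_conf_text):
    """Return the addresses on `nameserver` lines of resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # Drop trailing comments, then split into whitespace-separated fields.
        parts = line.split("#", 1)[0].split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def classify(address):
    """Roughly classify a DNS server address (hypothetical helper)."""
    ip = ipaddress.ip_address(address)
    if any(ip in net for net in RFC1918):
        return "private (RFC 1918)"
    if ip.is_link_local:
        return "link-local"
    if ip.is_loopback:
        return "loopback"
    return "other"


def describe_resolv_conf(path="/etc/resolv.conf"):
    """Return (address, classification) pairs for the pod's configured resolvers."""
    with open(path) as handle:
        return [(s, classify(s)) for s in parse_nameservers(handle.read())]
```

Running `print(describe_resolv_conf())` in the hub pod and in a user pod, on both the kube-dns and the Cloud DNS cluster, would show whether the resolver address differs between the working and failing setups.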
I think you're talking about #3167, right? I read through that conversation and it seems like you applied all of the changes needed to get it working when closing out that issue (thanks!), so I expect this to be a different issue. Forgot to mention, but while debugging this issue I did turn off all of the network policies for hub, singleuser, and proxy in config.yaml for one test, and there was no change. The only change that made a difference was switching Cloud DNS to kube-dns and vice versa. I'll see if I have some more time to dig into this, but it'll be in a few days, since I now have a Standard and an Autopilot cluster (both with gvisor) working. I'm not too familiar with Kubernetes networking, so I'd have to spend some time figuring out how to debug and get those logs working. It looks like Cloud DNS does create a pod in the cluster for DNS. It might be ok to just say that a Standard cluster with Dataplane V2, Cloud DNS, and a user pool with gvisor enabled (but not applied at a pod level) doesn't work right now, since it's a super niche use case. Hopefully this doesn't cause any other issues down the road.
I think the reason it does not work with gvisor and Cloud DNS is this limitation listed on the GCP help page:
Bug description
I'm deploying JupyterHub on GKE. I'd like there to be a separation between the default pool and a user pool, so I can use gvisor for container isolation. Currently, I have a Standard cluster running fine with `kube-dns`, Dataplane V2 turned off, a default pool, and a user pool with `gvisor` container isolation turned on, with `scheduling.userPods.nodeAffinity.matchNodePurpose: require` to route user servers to the user pool. This cluster works completely fine and runs well.
I've heard that GKE is moving towards Autopilot clusters (which have Dataplane V2 and Cloud DNS turned on), so I decided to create a new Standard cluster with Dataplane V2 and Cloud DNS turned on. This cluster was configured the same way: a default pool and a user pool created with `gvisor` container isolation turned on. However, in this setup, the hub pod is unable to communicate with the user server pod (see errors below).
In order to isolate the issue, I created a few more clusters with the exact same config (a default pool and a user pool) with Dataplane V2 on and:
- `gvisor` isolation, Cloud DNS, `singleuser.extraPodConfig.runtimeClassName: gvisor`: This is the failing cluster setup described above.
- `gvisor` isolation, Cloud DNS, `singleuser.extraPodConfig.runtimeClassName` not set: This is also the failing cluster setup described above. Since the `gvisor` runtime class name is what actually turns on gvisor, the issue happens even if gvisor is not turned on.
- `gvisor` isolation, kube-dns instead of Cloud DNS: Works properly; no issues
- […] `gvisor` isolation, Cloud DNS: Works properly; no issues
As a result, it looks like there's an issue with JupyterHub deployed on GKE with the following configuration:
- […] `kube-dns`
- […] `hub.jupyter.org/node-purpose: core`. This is to route different pods to different nodes.
- […] `gvisor` container isolation and `hub.jupyter.org/node-purpose: user` label.
- `gvisor` is added to the runtime class name. However, the pod's `runtimeClassName` does not need to be set to `gvisor` for this issue to occur.
I've also looked at #3167, which previously addressed issues with Dataplane V2 and Cloud DNS, but it seems like those changes have already made it into the latest release.
Note that I'm aware that `gvisor` is not officially supported by Z2JH, and that it's totally possible that this is an undocumented limitation between `gvisor` container isolation and Cloud DNS! However, since this issue occurs even when gvisor is turned off, and there doesn't seem to be a larger issue using `gvisor` with Cloud DNS, there might be something JupyterHub-specific related to DNS here. Given that GKE is pushing people towards Autopilot/Cloud DNS clusters, and that Autopilot clusters by default have that option pre-selected (since the only change needed to use gvisor in an Autopilot cluster is to add the `gvisor` runtime class name), I wonder if this issue also prevents people from deploying JupyterHub on an Autopilot cluster today.
(edit: looks like deploying an Autopilot cluster does not work, whether or not I use `runtimeClassName: gvisor`; the hub and server both time out while waiting for the server to start. That is, unless there's something else that needs to be done beyond what's already specified in the Discourse server.)
(edit2: I'm mistaken, I'm able to start an Autopilot cluster; launching the user server just takes a really long time since user placeholders don't take on user servers' taints/selectors, so we need to wait for a new node to be created and an image to be pulled)
How to reproduce
- […] `kube-dns`).
- […] `hub.jupyter.org/node-purpose: core` to nodes in this cluster. This is to make sure we schedule pods on separate nodes depending on whether they're core or user pods.
- […] `gvisor` sandbox isolation turned on. Also add `hub.jupyter.org/node-purpose: user` as a Kubernetes label to nodes in this cluster.
Expected behaviour
The user server should initialize and register itself with the hub.
Working Logs
These are the logs I expect to see if communication between the hub and user server is working:
Actual behaviour
The user server pod is unable to communicate with the hub pod. After a while, the hub times out and the user server shuts down.
Erroring Logs
You can see here that the hub requests a spawn of the user server, but the API request fails. The JupyterHubSingleUser server, once started, fails to connect to the hub as well.
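To narrow down whether a failure like this is a DNS lookup problem or a blocked connection, a small check run from inside the user server pod can separate the two steps. This is a rough sketch, not part of the original report: `check_endpoint` is a hypothetical helper, and the fallback URL `http://hub:8081/hub/api` in the usage note is an assumption about the chart's default hub service address; `JUPYTERHUB_API_URL` is the environment variable JupyterHub sets for single-user servers.

```python
import socket
from urllib.parse import urlparse


def check_endpoint(url, timeout=3.0):
    """Resolve the host in `url`, then attempt a plain TCP connection to it.

    Returns a dict with keys: host, port, addresses, connected, error.
    """
    parsed = urlparse(url)
    host = parsed.hostname
    # Fall back to the scheme's default port when the URL has none.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    result = {"host": host, "port": port, "addresses": [],
              "connected": False, "error": None}

    # Step 1: DNS resolution only.
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result["error"] = f"DNS lookup failed: {exc}"
        return result
    result["addresses"] = sorted({info[4][0] for info in infos})

    # Step 2: TCP connection to the resolved name.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            result["connected"] = True
    except OSError as exc:
        result["error"] = f"TCP connect failed: {exc}"
    return result
```

For example, `check_endpoint(os.environ.get("JUPYTERHUB_API_URL", "http://hub:8081/hub/api"))` (after `import os`) in a user pod would report a "DNS lookup failed" error if the name cannot be resolved, and a "TCP connect failed" error if the name resolves but the connection is refused or times out.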