ML-Pipelines API Server and Metadata Writer in CrashLoopBackoff #6121
Comments
Hi @ReggieCarey! |
For metadata-writer, what is the status of metadata-grpc-deployment? |
Thanks Yuan Gong. As per your request (Thu Aug 05 15:43:51):
$ kubectl logs ml-pipeline-9b68d49cb-6z2tq ml-pipeline-api-server --previous
I0806 00:55:18.818889 7 client_manager.go:154] Initializing client manager
I0806 00:55:18.818963 7 config.go:57] Config DBConfig.ExtraParams not specified, skipping
Sadly, that's it. Reggie |
For the metadata-writer issue: the status of metadata-grpc-deployment is stable(ish), though it has restarted 533 times. Here is the output from "describe"; the logs for the container are empty
|
Both servers are part of the Istio service mesh. In the past, the mysql process was implicated. That process remains up and stable with 0 restarts. Both pods go to a state of Running, then fail, but they are not synchronized in this failure. I should also add that cache-server-* has restarted some 132 times in 15 days.
Maybe this is an issue with Istio and service mesh configuration on bare metal. In past incarnations of this bug, the suggested repair was to establish a PeerAuthentication policy - that resource does not exist in my cluster. The old suggestion was to apply:
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: "default"
spec:
  mtls:
    mode: STRICT
I do not know if this is still applicable in the Istio 1.9.0 world. |
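For anyone debugging the same thing, a quick way to check whether any PeerAuthentication policies already exist, and to apply the policy above scoped to the kubeflow namespace, might look like this (a sketch; it assumes the Istio 1.9 CRDs are installed and that kubeflow is the right namespace):
kubectl get peerauthentication --all-namespaces
# Apply the STRICT mTLS policy from the comment above to the kubeflow namespace
kubectl apply -n kubeflow -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
spec:
  mtls:
    mode: STRICT
EOF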
Any progress or ideas or things to try? KFP probably represents one of the most beneficial parts of Kubeflow 1.3.0 to us. As of now all I have is Jupyter Notebooks with a Kubeflow Dashboard wrapper. I can't use KFP at all. |
It's been 17 days and I have not heard any movement on this bug. I really want to get this resolved. I have switched my Istio service mesh to require mTLS (STRICT) everywhere. I have verified that the processes listed here, as well as the mysql process, are all within the service mesh - as evidenced by istio-proxy and istio-init being injected into the three pods. This really does appear to be a problem with access to the mysql store. My next experiment will be to stand up an Ubuntu container with a mysql client in the kubeflow namespace; from there I hope to be able to validate connectivity (see the sketch after this comment). I can see two outcomes:
In both cases the next step is still: What do I do given this additional information? As an FYI, the mysql process' istio-proxy shows the following in the logs:
|
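A minimal sketch of that connectivity test, assuming KFP's bundled MySQL is exposed as the mysql Service in the kubeflow namespace with the default root user and an empty password (all assumptions - adjust for your cluster):
# One-off client pod; if the kubeflow namespace has Istio injection enabled, a sidecar will be added
kubectl run mysql-client -n kubeflow --rm -it --image=mysql:8 --restart=Never -- \
  mysql -h mysql.kubeflow.svc.cluster.local -u root
# If the connection succeeds from inside the mesh but fails from a pod without a sidecar,
# the mTLS policy (not MySQL itself) is the likely culprit.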
UPDATE: I was able to get to partial success. What I did was to edit KubeFlow/v1.3.0/manifests/apps/pipeline/upstream/third-party/mysql/base/mysql-deployment.yaml
(NOTE: I also changed to use image: mysql:8 - but I don't think this is the issue) And then I undeployed and redeployed KFP - I know I could have just applied the changes.
The downside is that metadata-grpc-deployment and metadata-writer now fail. For metadata-grpc-deployment, the log reads:
For metadata-writer, the logs read:
Next I tried to use "platform-agnostic-multi-user-legacy"...
And all processes are now running - except this now shows up: Again, any suggestions and assistance are highly appreciated. |
I can confirm that this issue is seen with KF 1.3.1 too. The bug is very annoying as KFP remains inaccessible. |
Ohh, sorry this fell through the cracks. Let me take a look tmr. |
/assign @zijianjoy |
When you disable sidecar injection, also find all the destination rules and delete the destination rule for mysql. Otherwise, all other clients will fail to access MySQL, assuming mTLS is turned on. Edit: this is a workaround that pulls MySQL out of the mesh. |
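A sketch of that workaround, assuming the stock KFP manifests where the MySQL Deployment and its DestinationRule both live in the kubeflow namespace (resource names below are illustrative - check your cluster):
# Disable sidecar injection for the mysql pod (triggers a rollout of the Deployment)
kubectl -n kubeflow patch deployment mysql --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"sidecar.istio.io/inject":"false"}}}}}'
# List the destination rules and delete the one targeting mysql
kubectl -n kubeflow get destinationrules
kubectl -n kubeflow delete destinationrule <name-of-the-mysql-rule>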
If you want MySQL in the mesh, you need to check the Istio documentation for troubleshooting instructions. I agree Istio is very hard to troubleshoot; I had the same frustration when configuring all of this. |
I had a similar problem with metadata-writer after the pod of the deployment, so please check any component in |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
I also encountered this issue. Install command:
metadata-grpc-deployment-6f6f7776c5-pq6hq log:
mysql-f7b9b7dd4-5m4tb logs:
metadata-writer-d7ff8d4bc-8thjn logs:
And I tried k8s 1.21.2; it still has the same problem. |
NOTE: Everything I wrote below is totally not useful to this thread about cloud deployments. I just picked this issue out of the five I was looking at. I seriously thought it was contextually appropriate, but it is not. The information below may be useful to you, but it was essentially put here by accident. I also think it's useful info and I don't want to just delete it; I will move what I've said below into its own issue tomorrow. I have this problem on some systems and believe it is because of the max_user_{instances,watches} settings, which can be fixed by raising them as shown below.
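For reference, the limits in question are the fs.inotify sysctls; a sketch of raising them (the specific values are illustrative, not a recommendation):
# Raise the inotify limits for the running system
sudo sysctl fs.inotify.max_user_instances=1280
sudo sysctl fs.inotify.max_user_watches=655360
# Persist the settings across reboots
echo 'fs.inotify.max_user_instances=1280' | sudo tee -a /etc/sysctl.conf
echo 'fs.inotify.max_user_watches=655360' | sudo tee -a /etc/sysctl.conf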
I have found that the limits should be set before deploying the cluster and Kubeflow. If a cluster is already running you can kill the pods that get hung, but in my experience that has been a waste of time compared to just starting over from a clean slate. I also had this issue on a machine that was running off of an SSD booted over USB-C, and setting the max_user_* values didn't fix it. I still don't really "know" why I couldn't get that system to work, but basically I think this problem amounts to any situation where the disk throughput gets maxed out. Kubeflow is a massive beast that does a LOT of stuff, and on many machines you will hit limits where it can't function, and it usually looks like what you've posted here. I know that's not the most technical solution to an open issue, but the problem is kind of squishy and I think it can occur for multiple reasons. |
I believe this problem is consistently reproduced by following the Charmed Kubeflow documentation on a fresh Ubuntu 22.04 or 20.04 system. I would be surprised if anyone can follow the current installation instructions with any success, as the range of systems I'm testing on right now includes some pretty high-end server hardware and a couple of consumer-grade desktop machines. Not that this likely matters, but I am doing GPU stuff, so I've enabled GPU with microk8s; I can easily pass validation and push up a container to do GPU work. Since I do have a running microk8s 1.7 kubeflow setup, I think I should be able to solve this and commit back, but I'm pretty certain the current Charmed Kubeflow setup instructions are very broken. |
The numbers I used above were not sufficient to solve the problem. I have no idea how to calculate what to raise the numbers to, or the consequences of raising the limit too high; I just upped the largest digit by one. |
And I totally forgot about this other issue I've had: GVFS backends! There is a client that runs on a default Ubuntu Desktop install for each user that's logged in. It monitors the disk, and when you fire up Kubeflow it flips out and takes up 120-200% of the CPU for each user. I do not need this tool for my deployments. I'm unsure if this is a problem with a barebones server install, but this is critical to solving the problem of launching Kubeflow on a freshly installed Ubuntu Desktop 20.04/22.04 system.
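If it helps, the package in question on Ubuntu Desktop is gvfs-backends; removing it (assuming you do not need GVFS mounts on that machine) is a one-liner:
# Remove the GVFS backends that monitor disks/mounts per logged-in user
sudo apt-get remove -y gvfs-backends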
Hooray. I think this solves a reproducibility problem I've had for over a month. I haven't quite figured out if there are licensing issues with distributing this, but I've built an ubuntu LiveCD with CUDA, docker, and kubeflow (microk8s). I'll ping folks on slack and see if there's any interest in it. I've got a solid pipeline for doing the CI/CD for releases and man, the little things to get going are really a big barrier. It is very possible that the max_user_* doesn't need to be raised so high if the |
Also https://discourse.charmhub.io/t/charmed-kubeflow-upgrade-error/7007/7
Solves that. I've tested this on multiple machines now, and raising the max_user_* limits, uninstalling gvfs-backends, and fixing the ingress with the above command solves all of the problems consistently. I'm working on a Zero to Kubeflow tutorial, but I'll submit a PR for the Charmed Kubeflow instructions that covers these things if someone can point me at where to submit it. I am realizing after some review that in this situation, and in the other issues I've read relating to a similar failure, most people are running in the cloud and not on bare metal. I do think the gist of what I've pointed out is still valid, though on this thread what I've posted is just not directly useful. It's squishy. These problems are usually related to disk throughput, but that gets weird to sort out in the cloud. Anyway... all of what I said above has nothing to do with this issue, I am realizing. Sorry for posting all this in the wrong place. I don't know where else to put all this info, so I'm going to leave it here for now. I'll collect it and put it into its own issue tomorrow and remove it from this thread. Sorry if this confused anyone. It's been a long day. |
If you are using an external MySQL database (especially if it's MySQL 8.0), you are likely experiencing this issue around support for
FYI, Kubeflow Pipelines itself fixed MySQL 8.0 and |
I had the same problem. For me it was a cilium networking provider compatibility issue. I had to move to kubenet and it worked. |
Can you elaborate on the compatibility issue, please? I'm using Cilium as well. |
Maybe you have enabled Cilium's kubeProxyReplacement feature; you can disable this feature or set --config bpf-lb-sock-hostns-only=true. |
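A sketch of flipping that flag on an existing install by patching Cilium's ConfigMap (the cilium-config name, kube-system namespace, and DaemonSet restart are assumptions about a typical Cilium install; the key itself comes from the comment above):
kubectl -n kube-system patch configmap cilium-config --type merge \
  -p '{"data":{"bpf-lb-sock-hostns-only":"true"}}'
# Restart the Cilium agents so they pick up the new setting
kubectl -n kube-system rollout restart daemonset cilium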
This issue looks like it covers more than one problem and resolution, so it's better to close it; if there are still pending issues, let's open a new one to have a cleaner discussion thread. /close |
@rimolive: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
What steps did you take
I deployed Kubeflow 1.3 by using the manifests approach.
I then repaired an issue with dex running on K8s v1.21
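For reference, the kubeflow/manifests 1.3 README installs everything with a single retry loop roughly like the following (reproduced from memory, so treat it as a sketch):
while ! kustomize build example | kubectl apply -f -; do
  echo "Retrying to apply resources"
  sleep 10
done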
What happened:
The installation succeeded. All processes started up except two: metadata-writer and ml-pipeline crash constantly and are restarted. ml-pipeline always reports 1 of 2 running.
Metadata-writer sometimes appears to be fully running then fails.
No other kubeflow pods are having problems like this - even the mysql pod seems stable.
I can only assume the failure of the metadata writer is due to a continued failure in ml-pipeline api-server.
The pod keeps getting terminated by something with a reason code of 137.
See last image provided for details on the cycle time.
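Exit code 137 is 128 + 9 (SIGKILL), which in Kubernetes usually means the container was OOM-killed or killed after a failing liveness probe. One way to check which (the pod name is taken from the logs section below):
kubectl -n kubeflow describe pod ml-pipeline-9b68d49cb-x67mp
# Look for "Last State: Terminated" with "Reason: OOMKilled", or liveness probe failure events
kubectl -n kubeflow get events --field-selector involvedObject.name=ml-pipeline-9b68d49cb-x67mp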
What did you expect to happen:
I expected the pipeline tools to install and operate normally. This has been a consistent problem going back to KF 1.1 with no adequate resolution.
Environment:
I use the kubeflow 1.3 manifests deployment approach
This install is via the Kubeflow 1.3 manifests.
NOT APPLICABLE
Anything else you would like to add:
kubectl logs ml-pipeline-9b68d49cb-x67mp ml-pipeline-api-server
kubectl describe pod ml-pipeline-9b68d49cb-x67mp
Metadata-Writer Logs:
Labels
/area backend
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.