
Failed to get routing manager deployment status #58

Open
praveensrao opened this issue Jul 18, 2023 · 15 comments
Labels: bug (Something isn't working), duplicate (This issue or pull request already exists)

Comments

praveensrao commented Jul 18, 2023

Describe the bug
Failed to establish a connection. Error: Failed to get routing manager deployment status - ran out of time : 'Failed to get routing manager deployment status: 'Oops... An unexpected error has occurred.'Please include the following Correlation ID when contacting support: '51998f9b-5d22-43fe-97d5-44a591e71d1f1689648728010:4ce86b08fabd'. '
This issue is intermittent: it happens roughly 70% of the time and is causing a lot of developer frustration.

Mention the platform you are using
I am using VS Code Bridge to Kubernetes extension v2.0.120230716

To Reproduce
This issue occurs only when using the isolation mode.
Steps to reproduce the behavior:

  1. Open the Command Palette
  2. Bridge To Kubernetes > Configure
  3. Choose the service
  4. Enter local port
  5. Select launch configuration
  6. Isolate your code from local developers? Select "Yes"
  7. Run and Debug
  8. Provide permissions to update hosts file
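
For context, these steps end up generating a VS Code task roughly of the following shape (an illustrative sketch with placeholder values; a concrete tasks.json appears in a later comment below):

{
	"label": "bridge-to-kubernetes",
	"type": "bridge-to-kubernetes.resource",
	"resource": "<service>",
	"resourceType": "service",
	"ports": [ <local-port> ],
	"targetCluster": "<cluster>",
	"targetNamespace": "<namespace>",
	"isolateAs": "<alias>"
}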

Expected behavior
The Bridge to Kubernetes connection to the cluster must complete and the application must launch

Screenshots
[attached screenshot]

After some time:
[attached screenshot]

Logs:
bridge-to-k8s-logs.zip
routingmanager-deployment-bf984c7cc-7ktv7-logs.txt.zip

Desktop (please complete the following information):
I am on macOS Ventura 13.3.1 (a)
Using VS Code 1.80.0

Note: I've also tried this many times after cleaning up the deployments/services and pods in the cluster, but the issue persists.

praveensrao added the 'bug' label Jul 18, 2023
@garyburgmann

I am also seeing this error on macOS Monterey 12.6.5.

Interestingly, when using Remote - SSH extension against a Debian box, then invoking the Bridge to Kubernetes extension remotely, it works fine.

@praveensrao I would suggest this issue actually belongs against https://github.com/Azure/vscode-bridge-to-kubernetes, as I can use the CLI successfully on macOS.


garyburgmann commented Jul 18, 2023

@praveensrao I created an issue against the extension repo ^^^

Give the CLI a look if it's a blocker for you; it works fine for me.

@praveensrao (Author)

@garyburgmann - thanks for your help.
The CLI is working fine for me. I will start using it instead of the VS Code extension.
Should I move this ticket to the other project?

hsubramanianaks (Collaborator) commented Jul 18, 2023

@praveensrao @garyburgmann Thanks for the details and the chat. Is this happening with the latest release 2.0.120230716, or did it happen with previous releases as well? Can you please share the VS Code logs? They would be in $HOME/.vscode-server/data/logs/<id>/exthost1/mindaro.mindaro/*.

Also, can you confirm that the dotnet 7 libraries are downloaded in the following location? If you still see dotnet 6, that is an issue.
$HOME/.vscode-server/data/User/globalStorage/mindaro.mindaro/file-downloader-downloads/dotnet/host/fxr
Please let me know.
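
For reference, both checks can be done quickly from a terminal (an illustrative sketch; the <id> log directory name varies per VS Code session):

# List the Bridge to Kubernetes extension logs (pick the most recent <id> directory)
$ ls $HOME/.vscode-server/data/logs/*/exthost1/mindaro.mindaro/

# Check which .NET host versions were downloaded; a 7.x directory should be present under host/fxr
$ ls $HOME/.vscode-server/data/User/globalStorage/mindaro.mindaro/file-downloader-downloads/dotnet/host/fxr/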

hsubramanianaks transferred this issue from Azure/Bridge-To-Kubernetes Jul 18, 2023
hsubramanianaks added the 'duplicate' label Jul 18, 2023
@hsubramanianaks (Collaborator)

@praveensrao @garyburgmann Thanks for the logs and inputs. Here are my observations from the logs: the kubectl port forwarding is failing, the exception doesn't give much detail, and it is only logged as a warning. The routing manager status call can only succeed while this port forward works, but when I looked at the routing manager logs, the pod was deleted around the same time. So it could be a race condition: the port forward fails because the pod was deleted, and the bridge connection fails as a result. Any idea why the pod was deleted in the first place?

2023-07-18T02:54:41.9279330Z | Library | WARNG | Error during port forward to routing manager : I0717 22:54:41.927651   26501 log.go:184] (0xc0006e6000) (0xc00089c140) Create stream
2023-07-18T02:54:31.9237245Z | RoutingManager | TRACE | Pod 'praveensrao-rabbitmq-consumer-5964b75877-9jrm2.rabbitmq' was 'Added'
2023-07-18T02:54:32.1911247Z | RoutingManager | TRACE | Pod 'praveensrao-rabbitmq-consumer-5964b75877-9jrm2.rabbitmq' was 'Modified'
2023-07-18T02:54:32.1976624Z | RoutingManager | TRACE | Pod 'praveensrao-rabbitmq-consumer-5964b75877-9jrm2.rabbitmq' was 'Deleted'

@garyburgmann do you happen to see similar types of errors in your logs? Please confirm. Thank you.
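
For anyone trying to confirm the suspected race, one way is to watch the pods and recent events in the target namespace while Bridge is connecting (a sketch; substitute your own namespace):

# Watch pod lifecycle (Added/Modified/Deleted) while the connection is established
$ kubectl get pods -n <namespace> -w

# Then inspect recent events for clues about why the pod was deleted
$ kubectl get events -n <namespace> --sort-by=.lastTimestamp | tail -n 20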

@garyburgmann

$ kc -n bridge logs -f garyburgmanngo1com-iam-64b56ff9f8-jktwm 
2023-08-03T06:08:32.5767620Z | RemoteAgent | TRACE | Shutting down gracefully\nOperation context: <json>{"clientRequestId":"0c293ee7-9883-4531-8e7a-e2271c22efd2","correlationId":"ab88dc60-b6ed-47d0-b38a-785440ea7a111691041696493:4f8cfff13346:3c9a2eb7ee94","requestId":null,"userSubscriptionId":null,"startTime":"2023-08-03T06:06:28.158888+00:00","userAgent":"RemoteAgent/1.0.20230524.1","requestHttpMethod":null,"requestUri":null,"version":"1.0.20230524.1","requestHeaders":{},"loggingProperties":{"ApplicationName":"RemoteAgent","DeviceOperatingSystem":"Linux 5.15.0-1040-azure #47-Ubuntu SMP Thu Jun 1 19:38:24 UTC 2023","Framework":".NET 6.0.16","ProcessId":1,"TargetEnvironment":"Production"}}</json>
2023-08-03T06:08:32.5961584Z | RemoteAgent | TRACE | AssemblyLoadContext unloading
2023-08-03T06:08:32.6043901Z | RemoteAgent | TRACE | Process exiting...
2023-08-03T06:08:32.6047679Z | RemoteAgent | TRACE | Process exited

Those were the only logs on the pod before it died.

$ kc -n bridge logs -f garyburgmanngo1com-iam-64b56ff9f8-jktwm-restore-d54b0-7txhs
2023-08-03T06:06:33.0405947Z | RestorationJob | TRACE | Waiting to restore previous state on Pod bridge/garyburgmanngo1com-iam-64b56ff9f8-jktwm...\nOperation context: <json>{"clientRequestId":"512a39bb-2c19-498b-bcf9-aacbea2478d1","correlationId":"ab88dc60-b6ed-47d0-b38a-785440ea7a111691041696493:4f8cfff13346:3c9a2eb7ee94","requestId":null,"userSubscriptionId":null,"startTime":"2023-08-03T06:06:32.8105723+00:00","userAgent":"RestorationJob/1.0.20230705.1","requestHttpMethod":null,"requestUri":null,"version":"1.0.20230705.1","requestHeaders":{},"loggingProperties":{"ApplicationName":"RestorationJob","DeviceOperatingSystem":"Linux 5.15.0-1040-azure #47-Ubuntu SMP Thu Jun 1 19:38:24 UTC 2023","Framework":".NET 7.0.8","ProcessId":1,"TargetEnvironment":"Production"}}</json>
2023-08-03T06:06:43.3806986Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:06:43.4223754Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":355}}</json>
2023-08-03T06:06:48.4334184Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:06:48.4381234Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":13}}</json>
2023-08-03T06:06:53.4510364Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:06:53.4535317Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":10}}</json>
2023-08-03T06:06:58.4600346Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:06:58.4627438Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":11}}</json>
2023-08-03T06:07:03.4773051Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:07:03.4793142Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":16}}</json>
2023-08-03T06:07:08.4868602Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:07:08.4889216Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":9}}</json>
2023-08-03T06:07:13.4955392Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:07:13.4975159Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":10}}</json>
2023-08-03T06:07:18.5046539Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:07:18.5066087Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":10}}</json>
2023-08-03T06:07:23.5126653Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:07:23.5145872Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":9}}</json>
2023-08-03T06:07:28.5356300Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:07:28.5375957Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":22}}</json>
2023-08-03T06:07:33.5446697Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:07:33.5465370Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":10}}</json>
2023-08-03T06:07:38.5541115Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:07:38.5561159Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":9}}</json>
2023-08-03T06:07:43.5644099Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:07:43.5665424Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":11}}</json>
2023-08-03T06:07:48.5758663Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:07:48.5778206Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":9}}</json>
2023-08-03T06:07:53.5908531Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:07:53.5929789Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":10}}</json>
2023-08-03T06:07:58.5991923Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:07:58.6012322Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":9}}</json>
2023-08-03T06:08:03.6067454Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:08:03.6088604Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":10}}</json>
2023-08-03T06:08:08.6151974Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:08:08.6169995Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":8}}</json>
2023-08-03T06:08:13.6217815Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:08:13.6235685Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":9}}</json>
2023-08-03T06:08:18.6323500Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:08:18.6340839Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":9}}</json>
2023-08-03T06:08:23.6449728Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:08:23.6473115Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":12}}</json>
2023-08-03T06:08:28.6555996Z | RestorationJob | TRACE | Dependency: Kubernetes <json>{"name":"Kubernetes","target":"GetV1PodAsync","success":true,"duration":null,"properties":{}}</json>
2023-08-03T06:08:28.6574931Z | RestorationJob | TRACE | Event: RestorationJob-AgentPing <json>{"eventName":"RestorationJob-AgentPing","properties":{"RestorePerformed":"false","NumFailedPings":"0","HasConnectedClients":"true","Result":"Succeeded"},"metrics":{"DurationInMs":9}}</json>
2023-08-03T06:08:33.0810392Z | RestorationJob | TRACE | AssemblyLoadContext unloading
2023-08-03T06:08:33.1427124Z | RestorationJob | TRACE | Process exiting...
2023-08-03T06:08:33.1430480Z | RestorationJob | TRACE | Restoration job cancelled
2023-08-03T06:08:33.1882152Z | RestorationJob | TRACE | Process exited

@garyburgmann

We have 167 pods/services (all deployments 1/1) in the namespace; maybe it's just taking too long to bootstrap?

$ kc -n bridge get pod | wc -l
     168
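
If slow bootstrap is the suspect, it may be worth checking whether the routing manager deployment itself becomes ready in time (a sketch; the deployment name is inferred from the routingmanager-deployment-* log file attached earlier):

$ kubectl -n bridge rollout status deployment/routingmanager-deployment --timeout=120s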

@hsubramanianaks (Collaborator)

@garyburgmann @praveensrao Thanks for the details. Are you both using M1/M2 Macs (ARM-based ones)? I was able to replicate the same problem on my personal Mac and am just trying to confirm this. Please let me know.

@garyburgmann

Intel -> Monterey

@garyburgmann

I was able to run this on an M1 Pro MBP ...


garyburgmann commented Oct 9, 2023

This is still happening.

I can run dsc perfectly fine:

$ dsc connect \
    --local-port 3000 \
    --routing foo \
    --service bar \
    --namespace bridge \
    --use-kubernetes-service-environment-variables

but the equivalent tasks.json fails:

{
	"version": "2.0.0",
	"tasks": [
		{
			"label": "bridge-to-kubernetes",
			"type": "bridge-to-kubernetes.resource",
			"resource": "bar",
			"resourceType": "service",
			"ports": [
				3000
			],
			"targetCluster": "aks",
			"targetNamespace": "bridge",
			"useKubernetesServiceEnvironmentVariables": true,
			"isolateAs": "foo"
		}
	]
}
Error: Failed to establish a connection. Error: Failed to get routing manager deployment status - ran out of time : 'Failed to get routing manager deployment status: 'Oops... An unexpected error has occurred.'Please include the following Correlation ID when contacting support: '1a1b8c6b-d43c-43d8-aa9b-559ada2c7fd01696814560762:d21acbc92888'. '

After removing "isolateAs": "foo", the VS Code extension works.
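
As a possible stopgap until this is fixed, the dsc invocation that works could be wrapped in a plain shell task instead of the bridge-to-kubernetes.resource task type (an unverified sketch, assuming dsc is on PATH; this bypasses the extension's own task provider):

{
	"version": "2.0.0",
	"tasks": [
		{
			"label": "bridge-via-dsc",
			"type": "shell",
			"isBackground": true,
			"command": "dsc connect --local-port 3000 --routing foo --service bar --namespace bridge --use-kubernetes-service-environment-variables"
		}
	]
}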

@garyburgmann

I believe it all goes wrong here:

if (!string.IsNullOrEmpty(_routingHeaderValue))
{
    // If routing is enabled, wait until the routing manager is connected to say the connection has been established successfully.
    var routingStatus = await routingManagementClient.GetStatusAsync(remoteAgentInfo.PodName, cancellationToken);
    if (routingStatus.Value.IsConnected == null)
    {
        // IsConnected == null: the status could not be determined at all -> "unexpected error" message.
        _log.Error(routingStatus.Value.ErrorMessage);
        throw new UserVisibleException(_operationContext, Resources.FailedToGetRoutingManagerDeploymentStatusFormat, string.Format(CommonResources.CorrelationIdErrorMessageFormat, _operationContext.CorrelationId));
    }
    else if (routingStatus.Value.IsConnected == false)
    {
        // IsConnected == false: the routing manager reported it never connected in time -> "ran out of time" message.
        throw new UserVisibleException(_operationContext, Resources.FailedToGetRoutingManagerDeploymentStatusRanOutOfTimeFormat, routingStatus.Value.ErrorMessage + string.Format(CommonResources.CorrelationIdErrorMessageFormat, _operationContext.CorrelationId));
    }
}

The call

var routingStatus = await routingManagementClient.GetStatusAsync(remoteAgentInfo.PodName, cancellationToken);

results in routingStatus.Value.IsConnected == false, but only when using the VS Code extension, not when invoking dsc directly.

@hsubramanianaks (Collaborator)

@garyburgmann I will check on this and let you know. It happened to me on my Mac as well, but I couldn't reproduce it often enough to get the logs and see why it is happening. Basically, the routing manager pod reaches out to the dev host agent pod (this is the envoy pod with your alias or username in its name), and if that pod dies too quickly, this error can happen. Thanks for your patience.
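
To check whether the dev host agent pod is dying early, one can watch for the pods that carry the user alias in their name (a sketch; in this thread they were named like garyburgmanngo1com-iam-... in the bridge namespace):

# Watch the agent pods created for your session; note whether they disappear before the connection completes
$ kubectl -n bridge get pods -w | grep <your-alias>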

@hsubramanianaks (Collaborator)

@garyburgmann I tried to replicate this, but I was not able to reproduce it with the latest release on Mac. Are you available for a call tomorrow via Slack? I am reachable in the CNCF Slack or kubernetes.slack.com. Please reach out to me and we can troubleshoot over a call. Thanks.

@hsubramanianaks (Collaborator)

@praveensrao @garyburgmann Can you please test with the latest version and see if this issue is resolved? If not, we can look into why it is still happening. Also note that we are a very small team maintaining B2K; contributions are welcome.
