
Linkerd 2.16 HTTP/2 keep-alive is causing issues with gRPC connections. #13266

Open
edgarkz opened this issue Nov 4, 2024 · 2 comments

edgarkz commented Nov 4, 2024

What is the issue?

After upgrading from Linkerd stable 2.13 to 2.16 BEL, we are seeing an increase in gRPC errors caused by dropped connections, which is leading to data loss. Our current setup implements gRPC retries on the client side (a C# gRPC client generated by https://github.com/protobuf-net/protobuf-net.Grpc), but only for specific errors.
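
To illustrate why retries limited to specific errors may not cover this failure, here is a minimal Python sketch (not the actual C# client code). `RpcError`, `RETRYABLE`, `call_with_retries`, and `make_flaky` are hypothetical names, and the set of retryable codes is an assumption, since the issue does not list them. A call that fails with `Internal` bypasses the retry loop and surfaces to the caller.

```python
import time


class RpcError(Exception):
    """Hypothetical stand-in for a gRPC client exception carrying a status code."""

    def __init__(self, code, detail=""):
        super().__init__(f"{code}: {detail}")
        self.code = code
        self.detail = detail


# Assumption: the issue does not say which codes are retried; this set is illustrative.
RETRYABLE = {"Unavailable", "DeadlineExceeded"}


def call_with_retries(rpc, max_attempts=3, backoff_s=0.0):
    """Invoke rpc(); retry only when the failure code is in RETRYABLE."""
    for attempt in range(1, max_attempts + 1):
        try:
            return rpc()
        except RpcError as err:
            if err.code not in RETRYABLE or attempt == max_attempts:
                raise  # non-retryable codes (e.g. "Internal") surface immediately
            time.sleep(backoff_s * attempt)  # simple linear backoff


def make_flaky(codes):
    """Build a fake RPC that fails once per code in `codes`, then succeeds."""
    remaining = list(codes)
    calls = {"n": 0}

    def rpc():
        calls["n"] += 1
        if remaining:
            raise RpcError(remaining.pop(0), "simulated failure")
        return "ok"

    return rpc, calls
```

With this sketch, a fake RPC that fails once with `Unavailable` succeeds on the second attempt, while one that fails with `Internal` raises after a single attempt. Whether `Internal` is safe to retry depends on whether the request is idempotent.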

How can it be reproduced?

Install the latest BEL edition using the Helm chart with high availability (HA) values.
Internal client - protobuf-net.Grpc v1.1.1 (Grpc.Net.Client v2.62.0)

Logs, error output, etc

gRPC-related errors in internal service:
Grpc.Core.RpcException: Status(StatusCode="Internal", Detail="Incomplete message."

Linkerd proxy errors:
`[ 94676.618554s] INFO ThreadId(01) inbound: linkerd_app_core::serve: Connection closed error=http2 error: keep-alive timed out error.sources=[keep-alive timed out, operation timed out] client.addr=10.0.139.170:47318 server.addr=10.0.115.25:8081
Nov 4, 2024 @ 17:45:34.188 | [ 9258.779776s] INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=http2 error: keep-alive timed out error.sources=[keep-alive timed out, operation timed out] client.addr=10.0.130.128:49554 server.addr=172.20.45.82:8081
Nov 4, 2024 @ 17:39:06.554 | [ 94278.350425s] INFO ThreadId(01) inbound: linkerd_app_core::serve: Connection closed error=http2 error: keep-alive timed out error.sources=[keep-alive timed out, operation timed out] client.addr=10.0.106.32:36818 server.addr=10.0.134.88:8081
Nov 4, 2024 @ 17:38:37.660 | [ 94249.456302s] INFO ThreadId(01) inbound: linkerd_app_core::serve: Connection closed error=http2 error: keep-alive timed out error.sources=[keep-alive timed out, operation timed out] client.addr=10.0.129.202:49608 server.addr=10.0.134.88:8081
Nov 4, 2024 @ 17:38:16.548 | [ 94228.345056s] INFO ThreadId(01) inbound: linkerd_app_core::serve: Connection closed error=http2 error: keep-alive timed out error.sources=[keep-alive timed out, operation timed out] client.addr=10.0.139.170:45848 server.addr=10.0.134.88:8081
`

output of linkerd check -o short


Environment

1.28 AWS EKS - Server Version: v1.28.13-eks-a737599
Linkerd BEL 2.16

Possible solution

none.

Additional context

We've adjusted the default timeouts, but timeout logs are still appearing.

Related PR: #12498

Would you like to work on fixing this bug?

None

@edgarkz edgarkz added the bug label Nov 4, 2024
olix0r (Member) commented Nov 15, 2024

In your environment, is the client running meshed with a sidecar proxy? Or is only the server process meshed?

> We've adjusted the default timeouts, but timeout logs are still appearing.

This timeout was introduced to help defend against situations where the underlying networking stack doesn't properly detect a closed/lost connection. Proxy servers send PING frames to clients and expect them (as required by the HTTP/2 spec) to respond. This timeout indicates that we haven't received a PING acknowledgement from the client.
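
For readers unfamiliar with the exchange described above, the following Python sketch shows what a PING and its acknowledgement look like on the wire, following the frame layout in RFC 9113, section 6.7. It is an illustration only, not Linkerd's implementation; `encode_ping` and `ack_for` are hypothetical helper names.

```python
import struct

PING_TYPE = 0x6  # HTTP/2 PING frame type (RFC 9113, section 6.7)
ACK_FLAG = 0x1   # flag set on the response to a PING


def encode_ping(opaque: bytes, ack: bool = False) -> bytes:
    """Encode a PING frame: 9-byte frame header plus 8 bytes of opaque data."""
    if len(opaque) != 8:
        raise ValueError("PING payload must be exactly 8 bytes")
    length = struct.pack(">I", 8)[1:]  # 24-bit big-endian payload length
    flags = ACK_FLAG if ack else 0
    # PING is a connection-level frame, so the stream identifier is 0.
    header = length + struct.pack(">BBI", PING_TYPE, flags, 0)
    return header + opaque


def ack_for(frame: bytes) -> bytes:
    """Build the acknowledgement a peer is required to send for a received PING."""
    length = int.from_bytes(frame[0:3], "big")
    frame_type, flags, stream_id = struct.unpack(">BBI", frame[3:9])
    stream_id &= 0x7FFFFFFF  # ignore the reserved high bit
    if frame_type != PING_TYPE or length != 8 or stream_id != 0:
        raise ValueError("not a valid PING frame")
    if flags & ACK_FLAG:
        raise ValueError("frame is already an acknowledgement")
    # The acknowledgement echoes the same 8 opaque bytes with ACK set.
    return encode_ping(frame[9:17], ack=True)
```

A peer that never sends the frame produced by `ack_for` (for example because the connection is silently dead, or the client is stalled) is what the keep-alive timeout is meant to detect.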

If you've increased the timeout and continue to see these errors, this may be surfacing a problem that was being hidden before.

It would be helpful to try to trace this behavior to the clients to figure out why PINGs aren't being acknowledged in a timely fashion.
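
As a starting point for that tracing, the sketch below groups keep-alive timeout lines by direction, client IP, and server address, so the affected clients stand out. It assumes proxy logs in exactly the format quoted in this issue; `LINE_RE`, `summarize`, and `SAMPLE` are hypothetical names, and other log formats are not handled.

```python
import re
from collections import Counter

# Matches the keep-alive lines quoted in this issue; other formats are skipped.
LINE_RE = re.compile(
    r"(?P<direction>inbound|outbound): linkerd_app_core::serve: "
    r"Connection closed error=http2 error: keep-alive timed out"
    r".*?client\.addr=(?P<client>\S+) server\.addr=(?P<server>\S+)"
)


def summarize(lines):
    """Count keep-alive timeouts per (direction, client IP, server address)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue
        client_ip = m["client"].rsplit(":", 1)[0]  # drop the ephemeral port
        counts[(m["direction"], client_ip, m["server"])] += 1
    return counts


# Two of the log lines from the original report.
SAMPLE = [
    "[ 94249.456302s] INFO ThreadId(01) inbound: linkerd_app_core::serve: "
    "Connection closed error=http2 error: keep-alive timed out "
    "error.sources=[keep-alive timed out, operation timed out] "
    "client.addr=10.0.129.202:49608 server.addr=10.0.134.88:8081",
    "[ 94228.345056s] INFO ThreadId(01) inbound: linkerd_app_core::serve: "
    "Connection closed error=http2 error: keep-alive timed out "
    "error.sources=[keep-alive timed out, operation timed out] "
    "client.addr=10.0.139.170:45848 server.addr=10.0.134.88:8081",
]

if __name__ == "__main__":
    for key, n in summarize(SAMPLE).most_common():
        print(key, n)
```

Mapping the resulting client IPs back to pods (for example with `kubectl get pods -o wide`) shows which workloads are failing to acknowledge PINGs.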


erewok commented Jan 25, 2025

Hello, I have a related issue and was hoping to get some clarity, but from the Linkerd docs I can't work out what the expected behavior should be. In my case, a Kubernetes pod sends tracing spans to an OpenTelemetry Collector over gRPC at the OTLP endpoint (port 4317).

We see outbound-only errors like this:

INFO ThreadId(01) outbound: linkerd_app_core::serve: Connection closed error=http2 error: keep-alive timed out error.sources=[keep-alive timed out, operation timed out] client.addr=172.16.0.192:33164 server.addr=172.16.0.104:4317

Some details:

  • the client is meshed
  • the server is unmeshed
  • my client does not appear to set keepalive options on the gRPC connection

In this scenario, would the keepalive timeout log message mean that my client is not closing sockets after sending a request, or does it mean that linkerd-proxy is sending keepalives on behalf of my client to the server (or even something else entirely)?

I apologize for jumping into this issue, but after reading various issues and pull requests, I couldn't quite see what the expected behavior for linkerd-proxy would be. I see this PR, but couldn't tell in which Linkerd version gRPC keepalives may have been added or removed. I would appreciate any info you can share here about expected behavior.
