-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GracefulStop() doesn’t always drain the channel #4282
Comments
From my understanding, this doesn't seem like the root cause. The pings are not meant to be "read". I think there is a knowledge gap. And it's not clear to me what
|
Thanks for reaching out. There is indeed a knowledge gap on both sides and of course, the fact that the PING was the cause is just theorical (really hard to imagine a test case for this). With systemd socket activation the following happens (with
We tried to monitor what’s on the socket, it’s clear that systemd sees "Incoming traffic" (see logs) which restarts erronously the service once it has stopped. The most interesting part is that if we define
So, there really seems to be some leftover on buffer on the UDS once the unique client requesting GracefulStop is leaving, which is what then makes the service restarting. As it’s hard to remove systemd out of the loop, we took back the example as of bug #4272, which is https://github.com/ubuntu/grpcstop at revision c62a9fce7ee31ccd420ecf85e9000bf9751f7daa. Here are the logs of both service and client: Service:
Client logs:
Hope this helps, if you need any additional logs, we are happy to provide them. |
What I think happenedWhen the client receives a But this connection may arrive at the socket after the old server exits, triggering Something to tryThe key is to stop the client from starting new connections after var stopped uint32
grpc.Dial(...,
grpc.WithContextDialer(func(ctx context.Context, addr string) (conn net.Conn, err error) {
if atomic.LoadInt32(&stopped) != 0 {
// Client was stopped
return nil, errors.New("stopped")
}
// normal net.DialUnix()
}),
)
switch os.Args[1] {
...
case "stop":
atomic.StoreInt32(&stopped, 1) // Set 1 to stop the dialer.
// normal c.Stop(context.Background(), &Empty{})
...
} Please try this and see if it works. |
Spot on! There are indeed 2 connections happening, the second being just after Stop() is called. So, that would explain why there are sometimes some unwanted connections on the socket triggering this activation. However, as you said, this is racy:
|
When stopping the daemon, the GOAWAY message is sent to connected clients, which is not necessarily consumed by them, leaving a new connection on the socket, which retriggers a restart from systemd. Upstream discussion is at grpc/grpc-go#4282. Co-authored-by: Jean-Baptiste Lallement <jean-baptiste@ubuntu.com>
Sorry for the late reply. Before that fix is ready, another option to consider is to use an RPC to control the client's I understand this is less then ideal, and will make the application unnecessarily complex. |
Indeed, that looks like related to the other bug. Thanks for the investigation @menghanl. The workaround is quite hackish though (and suppose all clients implements this "stop don’t redial" hack, which is fine for us, but less than ideal in a situation with an opened API). Right now, we are going with the Looking foward to see bug #4298 fixed, and report back if things are indeed fixed this way. Thanks again! |
(as requested on bug #4272, here is the same bug with a simpler reproducer without systemd)
What version of gRPC are you using?
What version of Go are you using (
go version
)?go version go1.16 linux/amd64
What operating system (Linux, Windows, …) and version?
Linux (Ubuntu) hirsute (devel series, incoming 21.04)
What did you do?
Using systemd with an unix socket, after requesting a GracefulStop on our grpc server, we see that the service is sometimes
restarted through socket activation (like 50% of the time).
The cause seems to be that even if all clients are disconnected, the grpc server can write a ping on the socket
before disconnecting, without draining it. So, it leaves a message on the socket and triggers a restart of the daemon by systemd.
We have a reproducer available at https://github.com/ubuntu/grpcstop, updated to remove systemd from the equation as requested on bug #4272.
To run this test program execute the command:
GODEBUG=http2debug=2 go run . -logtostderr=true
The trace of a restarted behaviour:
We can see here that once the client stopped, there is a PING write that is drained by a read PING. However, there is
a second ping write (
2021/03/19 10:14:34 http2: Framer 0xc0001ee000: wrote PING len=8 ping="\x02\x04\x10\x10\t\x0e\a\a"
)that is never read.
However, other times, the socket is fully drained as expected:
What did you expect to see?
What did you see instead?
The text was updated successfully, but these errors were encountered: