Socket activated grpc service restarts after GracefulStop() #4272

Closed
didrocks opened this issue Mar 16, 2021 · 5 comments

Comments

@didrocks

What version of gRPC are you using?

github.com/golang/protobuf v1.4.2 // indirect
google.golang.org/grpc v1.36.0 // indirect
google.golang.org/grpc/cmd/protoc-gen-go-grpc v1.1.0 // indirect
google.golang.org/protobuf v1.25.0 // indirect

What version of Go are you using (go version)?

go version go1.16 linux/amd64

What operating system (Linux, Windows, …) and version?

Linux (Ubuntu) hirsute (devel series, upcoming 21.04)

What did you do?

See our minimal reproducer test case at https://github.com/ubuntu/grpcstop:

  • create a simple gRPC server exposing a service with two RPCs:
    • Stop() -> calls server.GracefulStop() in a goroutine, to request that the server shut down once all in-flight requests are done
    • Wait() -> sleeps for 30s before returning
  • ensure both the client and the server communicate over a Unix socket (hardcoded here as /tmp/grpcstop.socket)
  • have systemd start the service through socket activation
    Then:
    Scenario 1. a call to grpcstop stop activates the service and then stops it as expected (without any error), but the service then restarts immediately (with a different pid for the server process)
    Scenario 2. a call to grpcstop wait activates the service; a subsequent call to grpcstop stop exits immediately while the service keeps waiting. Once the timeout is over, the server quits as expected, but it also restarts immediately.
    Note: this isn't reproducible 100% of the time, but a good 70% of test runs show the restarting behaviour. The slower the machine (a VM, for instance), the more often the issue appears.

It sounds like some messages are left on the Unix socket that are not drained by GracefulStop(), and systemd therefore restarts the service.

The code available at https://github.com/ubuntu/grpcstop/blob/main/main.go#L43 contains both the minimal server and the client; a sketch of the server's general shape is shown below.
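For reference, a minimal sketch of the general shape of such a socket-activated server (this is not the exact code from the repository; the fd-3 handling shown here and the pb names in the comments are assumptions):

```go
package main

import (
	"log"
	"net"
	"os"

	"google.golang.org/grpc"
)

func main() {
	// Under socket activation, systemd passes the already-listening socket as
	// fd 3 (SD_LISTEN_FDS_START), so the server wraps it instead of listening itself.
	f := os.NewFile(3, "grpcstop.socket")
	lis, err := net.FileListener(f)
	if err != nil {
		log.Fatalf("could not use the systemd-provided socket: %v", err)
	}

	srv := grpc.NewServer()
	// pb.RegisterGrpcStopServer(srv, &service{srv: srv}) // hypothetical generated registration

	log.Println("Starting")
	// Serve returns once GracefulStop() has drained all in-flight RPCs.
	if err := srv.Serve(lis); err != nil {
		log.Fatalf("serve error: %v", err)
	}
}

// The two RPCs described above would look roughly like:
//
//	func (s *service) Stop(ctx context.Context, _ *pb.Empty) (*pb.Empty, error) {
//		go s.srv.GracefulStop() // in a goroutine so the Stop call itself can complete
//		return &pb.Empty{}, nil
//	}
//
//	func (s *service) Wait(ctx context.Context, _ *pb.Empty) (*pb.Empty, error) {
//		time.Sleep(30 * time.Second)
//		return &pb.Empty{}, nil
//	}
```

GracefulStop() has to run in a goroutine because it blocks until the in-flight Stop RPC itself has returned; the question here is whether it also drains whatever is still pending on the Unix socket before the process exits.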

systemd output:

# We call grpcstop stop here (or wait + stop, and wait exits), which activates, stops and then reactivates the service
systemd[1]: grpcstop.socket: Incoming traffic
systemd[1]: grpcstop.service: Trying to enqueue job grpcstop.service/start/replace
systemd[1]: grpcstop.service: Installed new job grpcstop.service/start as 41336
systemd[1]: grpcstop.service: Enqueued job grpcstop.service/start as 41336
systemd[1]: grpcstop.socket: Changed listening -> running
systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2esocket interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=14140 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2esocket interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=14141 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=14142 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=14143 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=JobNew cookie=14144 reply_cookie=0 signature=uos error-name=n/a error-message=n/a
systemd[1]: Failed to read pids.max attribute of cgroup root, ignoring: No data available
systemd[1]: grpcstop.service: Passing 1 fds to service
systemd[1]: grpcstop.service: About to execute /tmp/grpcstop server
[…]
grpcstop[41341]: Starting
systemd[1]: Child 41336 (grpcstop) died (code=exited, status=0/SUCCESS)
systemd[1]: grpcstop.service: Child 41336 belongs to grpcstop.service.
systemd[1]: grpcstop.service: Main process exited, code=exited, status=0/SUCCESS
systemd[1]: grpcstop.service: Succeeded.
systemd[1]: grpcstop.service: Service will not restart (restart setting)
systemd[1]: grpcstop.service: Changed start -> dead
systemd[1]: grpcstop.service: Job 41336 grpcstop.service/start finished, result=done
[…]

# Daemon ends on GracefulStop()
systemd[1]: Finished GRPCStop daemon service.
systemd[1]: grpcstop.service: Main process exited, code=exited, status=0/SUCCESS
systemd[1]: grpcstop.service: Succeeded.
systemd[1]: grpcstop.service: Service will not restart (restart setting)
systemd[1]: grpcstop.service: Changed start -> dead
systemd[1]: grpcstop.service: Job 41336 grpcstop.service/start finished, result=done
systemd[1]: Finished GRPCStop daemon service.
systemd[1]: grpcstop.socket: Changed running -> listening


# And the daemon restarts immediately, due to the socket being activated again
systemd[1]: grpcstop.socket: Incoming traffic
systemd[1]: grpcstop.service: Trying to enqueue job grpcstop.service/start/replace
systemd[1]: grpcstop.service: Installed new job grpcstop.service/start as 41420
systemd[1]: grpcstop.service: Enqueued job grpcstop.service/start as 41420
systemd[1]: grpcstop.socket: Changed listening -> running
systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2esocket interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=14140 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2esocket interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=14141 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=14142 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/grpcstop_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=14143 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
systemd[1]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=JobNew cookie=14144 reply_cookie=0 signature=uos error-name=n/a error-message=n/a
systemd[1]: Failed to read pids.max attribute of cgroup root, ignoring: No data available
systemd[1]: grpcstop.service: Passing 1 fds to service
systemd[1]: grpcstop.service: About to execute /tmp/grpcstop server
[…]
grpcstop[41420]: Starting
systemd[1]: Child 41420 (grpcstop) died (code=exited, status=0/SUCCESS)

What did you expect to see?

  • in both scenarios, we expected the server to stop as soon as any remaining clients had quit, and then not restart

What did you see instead?

  • as soon as the last client disconnects and the server quits, it is restarted by systemd, seemingly because some leftover messages remain on the socket.

Minimal test case to reproduce

To reproduce with our minimal test case at https://github.com/ubuntu/grpcstop:

  • Protobuf and gRPC files are already generated (you can still run go generate), then go build.
  • Copy the resulting binary to /tmp/grpcstop
  • Copy the systemd service and socket files from the systemd directory to /lib/systemd/system/ (a sketch of their shape follows this list)
  • Make systemd aware of the new services
    $ systemctl daemon-reload
  • Enable and start the socket:
    $ systemctl enable grpcstop.socket
    $ systemctl start grpcstop.socket
  • Optionally increase systemd log level to monitor the socket activation of the service by adding systemd.log_level=debug to the boot command line of the test machine and restarting the machine.
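The unit files are roughly of this shape (a sketch only, for orientation; the exact directives, including any Type=, dependency or restart settings, are the ones in the repository's systemd directory):

```ini
# grpcstop.socket (sketch)
[Socket]
ListenStream=/tmp/grpcstop.socket

[Install]
WantedBy=sockets.target
```

```ini
# grpcstop.service (sketch)
[Service]
ExecStart=/tmp/grpcstop server
```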

Scenario 1:

  • Call grpcstop stop
    • Check systemd logs to see that the grpcstop service is activated, then stopped, then restarted

Scenario 2:

  • Call grpcstop wait in one terminal.
    -> The grpcstop service is activated.
  • Call grpcstop stop in another terminal while the wait command is still running.
    -> The stop command exits immediately.
    -> The grpcstop service is stopped once the wait command exits, but it is then restarted (with a different pid). A sketch of the client side used by these commands is shown below.
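For completeness, the client side of these commands presumably looks something like the sketch below (again, not the repository's code; the pb call is a hypothetical comment). Dialing the Unix socket path is what triggers the socket activation visible in the systemd logs:

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// gRPC-Go understands the "unix://" target scheme; connecting to the path
	// owned by grpcstop.socket is what makes systemd spawn the service.
	conn, err := grpc.DialContext(ctx, "unix:///tmp/grpcstop.socket",
		grpc.WithInsecure(), grpc.WithBlock())
	if err != nil {
		log.Fatalf("could not connect: %v", err)
	}
	defer conn.Close()

	// pb.NewGrpcStopClient(conn).Stop(ctx, &pb.Empty{}) // hypothetical generated call
}
```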
@easwars
Contributor

easwars commented Mar 18, 2021

Your setup seems to be quite involved. We are not best placed to comment on interactions with systemd. If you could simplify your setup to just a gRPC client + server (without systemd) and point us to the problem you see, we would be happy to take a look.

@easwars easwars closed this as completed Mar 18, 2021
@didrocks
Author

@easwars: The issue is that systemd is what restarts the service through socket activation, so we can't remove it from the tests, or you won't see the restarting behaviour at all.
We minimized a 20k program down to this simple setup so that you can reproduce it easily, in a Linux VM for instance. Do you mind re-evaluating it?

@menghanl
Contributor

One thing to try is to turn on logs: https://github.com/grpc/grpc-go#how-to-turn-on-logging
Aligning the grpc logs with the service behavior could give us some useful information. Did you try that?
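(For reference, the logging described at that link is controlled through environment variables; for a socket-activated service they would presumably have to be set on the service unit, e.g. via Environment= lines, which is an assumption about this setup rather than something stated in the issue:)

```
GRPC_GO_LOG_SEVERITY_LEVEL=info
GRPC_GO_LOG_VERBOSITY_LEVEL=99
```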

@jibel

jibel commented Mar 19, 2021

@menghanl as we couldn't reopen this issue, we filed a new one in #4282 with a simplified test case and logging enabled.

@menghanl
Contributor

OK. I will close this one then. Let's track it in the new issue.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 16, 2021