Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to safely make sure that main() exit stops the loop? #91

Open
sersorrel opened this issue Oct 15, 2024 · 3 comments
Open

How to safely make sure that main() exit stops the loop? #91

sersorrel opened this issue Oct 15, 2024 · 3 comments
Labels

Comments

@sersorrel
Copy link
Contributor

Some context: I use aiorun.run(main(), stop_on_unhandled_errors=True) to make sure that main crashing will exit the program, because any exception in main probably indicates a programmer error (and a service manager can restart the process once it exits).

Unfortunately, main being cancelled (e.g. because something it awaits was cancelled) is not treated as an "unhandled error", even though from my perspective it's unexpected behaviour and I want the process to exit and be restarted, rather than sitting there doing nothing. So, my main looks like this:

async def main():
	try:
		await business_logic()
	finally:
		asyncio.get_running_loop().stop()

in an effort to ensure that, no matter what, if business_logic doesn't work properly, at least the program will exit.

However, this causes a new problem – when I ctrl-C the app, it prints an ugly traceback:

Traceback (most recent call last):
  File "/home/ubuntu/app/.venv/bin/app", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/app/src/app/server/__init__.py", line 535, in main
    aiorun.run(async_main(), stop_on_unhandled_errors=True)
  File "/home/ubuntu/app/.venv/lib/python3.10/site-packages/aiorun.py", line 352, in run
    loop.run_until_complete(
  File "/usr/lib/python3.10/asyncio/base_events.py", line 647, in run_until_complete
    raise RuntimeError('Event loop stopped before Future completed.')
RuntimeError: Event loop stopped before Future completed.

I think the flow of events here is something like:

  • I hit ctrl-C
  • aiorun's _shutdown_handler runs (with a short diversion via loop.call_soon_threadsafe) and calls loop.stop()
  • pending tasks are polled by asyncio to see if they can make progress (that is, this iteration of the event loop runs to completion)
  • the loop actually stops, and aiorun's "shutdown phase" begins
  • pending tasks are cancelled by aiorun, to signal that they should tidy up and exit
  • aiorun starts up the loop again, and waits for all the pending (cancelled) tasks
  • one of those tasks is main, which hits its finally clause and calls loop.stop()!
  • this iteration of the event loop runs to completion
  • the loop actually stops, again – but there's a task that didn't complete (perhaps it had more cleanup work to do than main, and would've completed given a few more event loop iterations)
  • ugly traceback

Is there a way I can solve this? How can main tell whether it's been cancelled because it was waiting on something that got cancelled (i.e. it must stop the loop) vs because aiorun detected a ctrl-C and is running cancelled tasks to completion (in which case aiorun will stop the loop)?

@cjrh
Copy link
Owner

cjrh commented Oct 17, 2024

Thanks for the detailed report. I will have a look in the next few days when I get a chance.

@cjrh
Copy link
Owner

cjrh commented Oct 17, 2024

Ok, I had a quick look at this. There are two parts to discuss:

  1. Interpretation of CancelledError as unexpected behaviour.

  2. Calling loop.stop inside main

Let's deal with them in turn.

Interpretation of CancelledError as unexpected behaviour

So, unfortunately, the design of asyncio in Python regards CancelledError as a kind of signal, not an error. This is similar to how StopIteration is a signal, not an error. This library, and much of the ecosystem treats CancelledError this way and it is difficult to change.

Therefore, when CancelledError is found we can only ever understand it to be the result of an intentional act, not an error. The fundamental parts of aiorun assume this, and it is now difficult to change. There are hacks I could incorporate to, e.g., set a flag somewhere if shutdown has been initiated by a signal versus some other way, and somehow expose this, but I think we are all better off treating cancellation as intentional. So, if something in your system is causing spurious cancellations, I think you should make it stop doing that rather than try to catch it.

For your purposes, you can at least catch the CancelledError in your main function, and if you like, reraise it as a different exception:

def test_stop_also_on_cancellation(caplog):

    async def i_get_cancelled():
        # Force the loop to actually spin, prevent testing against any special
        # optimizations
        await asyncio.sleep(5.0)
        raise asyncio.CancelledError()

    async def main():
        try:
            await i_get_cancelled()
        except asyncio.CancelledError as e:
            raise Exception("Unexpected cancellation received") from e

    run(main(), stop_on_unhandled_errors=True)

This is a test case I wrote to investigate this issue. You can catch the CancelledError in main and raise a different exception, or do something else entirely. For your scenario, this will definitely stop main, and shutdown, if stop_on_unhandled_errors is active. Hopefully this is sufficient?

However, there remains no way to tell the difference between whether the caught CancelledError is due to the shutdown process itself (like CTRL-C say), or due to some other unidentified cancellation happening somewhere else in the system.

I could do something like the following:

  • Set a an attribute on the loop instance inside the signal handler, like loop._aiorun_signal,
  • and then you could check for this inside the exception handler.

Example:

    async def main():
        try:
            await i_get_cancelled()
        except asyncio.CancelledError as e:
            loop = asyncio.get_running_loop()
            if not getattr(loop, '_aiorun_signal', False):
                raise Exception("Unexpected cancellation received") from e

This is very messy and I don't want to include this in aiorun.

However, you can do what you want in your own code, and if this works for you, you could provide your own shutdown_handler callback that can add this annotation to the loop instance, and then you can check for it yourself in your main.

Your shutdown_handler might look something like this, which is the same one inside aiorun, but with the extra attribute set:

def _shutdown_handler(loop):
    logger.debug("Entering shutdown handler")
    loop = loop or get_event_loop()

    logger.warning("Stopping the loop")
    loop._aiorun_signal = True
    loop.stop()

In any event, if you discover some part of your system that is causing spurious cancellations that are bubbling up within your tasks, you should try to fix that behaviour regardless.

Calling loop.stop inside main

Your issue has highlighted to me that loop.stop() must not be called inside main once the shutdown handling in aiorun has begun. Thank you for highlighting this :)

As you correctly identified, once the loop has already been stopped, the remainder of the cancellation+cleanup work takes place inside a run_until_complete call, which will raise if another loop.stop is encountered. In the README, I mention that a user can call loop.stop inside main. However, I should probably add a note about this specific problematic scenario in the documentation. It happens, again as you described, if the loop.stop is called inside the finally section.

async def main():
  try:
    await something()
  finally:
    asyncio.get_running_loop().stop()

When CTRL-C is pressed, CancelledError is raised at the current await point inside every Task nested hierarchy of coroutines, and in the snippet above, this happens at await something. Then, the loop is started up again with run_until_complete, which raises when the stop is called above. This causes the messy traceback.

At the moment, I am satisfied with simply adding documentation for this I think.

@sersorrel
Copy link
Contributor Author

So, if something in your system is causing spurious cancellations, I think you should make it stop doing that rather than try to catch it.

Agreed! (I need to catch it once, in order to find where the cancellation actually came from, but then I absolutely intend to make it Not Do That.)

you could provide your own shutdown_handler callback

Ahh, yes, that makes total sense and gives me exactly the data I need – I can keep track of "we are shutting down now" on the loop or elsewhere. I agree that this makes more sense than asking aiorun to expose that as loop._aiorun_signal or something.

At the moment, I am satisfied with simply adding documentation for this I think.

Sounds fine to me :) thank you very much for taking the time to look at this!

@cjrh cjrh added the docs label Jan 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants