[proposal] Structured Async for Mojo #3945

owenhilyard · 2025-01-13T19:46:18Z

Proposes to add structured async to Mojo, following in the the Rust tradition of async since Mojo has the ability to fix many of the issues with Rust's async, some of which are ecosystem inflicted and some of which are inflicted by old decisions made about the language (such as that any value may be leaked). I think that this interface offers a lot more flexibility than the current one does for high performance code, while providing a better path for gradual evolution. It does have some language-level dependences, namely unions, and requires the introduction of Send and Sync traits, which are used to control data movement between threads.

Proposes to add structured async to Mojo, following in the the Rust tradition of async since Mojo has the ability to fix many of the issues with Rust's async, some of which are ecosystem inflicted and some of which are inflicted by old decisions made about the language (such as that any value may be leaked). I think that this interface offers a lot more flexibility than the current one does for high performance code, while providing a better path for gradual evolution. It does have some language-level dependences, namely unions, and requires the introduction of `Send` and `Sync` traits, which are used to control data movement between threads. Signed-off-by: Owen Hilyard <hilyard.owen@gmail.com>

szbergeron · 2025-01-14T17:46:46Z

Carrying some external discussion in for context on Waker--

The core motivation is that when building something of this sort, you want to spend as little time processing useless data as possible compared to processing useful data. This is simple in theory, but in practice the actual "knowledge" of what can vs can't be effectively progressed is sparse, disaggregated, and rarely has a clear way to even collect that state outside of just "poking" everything that "wants" to make progress.

Wakers partially solve this on their own. For many "boundaries" (things like channels, queues, timers), wakers can be more or less tossed over the barrier to be collected by the other side. When the "other side" is itself administrated by a coroutine registered with an active executor (especially the same executor) the system nicely passes control away from something that can't progress to something else that can, and doesn't waste more time on something that can't progress until It can.

This works very nicely for "closed" systems that can be reduced almost entirely to a single computation, and where there aren't other computational priorities that exist outside of the executor. Unfortunately, the outside world also exists. The most performant forms of I/O nowadays are not interrupt driven, and do not have a way to directly signal in and poke a waker. Also, many systems are going to have multiple priorities competing for attention, and they may not (for various architectural reasons) exist within the same executor even though they reside within the same process.

This motivates another cut along which to aggregate/concentrate useful "threads to pull": subsystems. You may have many places that you need to wait for an operation to complete within io_uring, or for some condition to occur in some region of shared memory. This requires busy polling, and has no straightforward (naive) way for a waker to be used to drive a computation. If this is reduced to every future that is waiting on an operation busy polling, we end up with an excess of duplicated computation being done with no progress (bad!).

What's the alternative? Busy polling must occur, but can be moderated and deduplicated. It may or may not even need to happen within the same executor, but it must happen somewhere. If we create a way for subsystems to be statically registered, or even dynamically registered with some prioritization flag for executors to poll them with priority more closely matching their utility (how many coroutines depend on them progressing), we can then treat them almost the same way as any other async "barrier" (such as channels, async mutexes, timers). This way, a coroutine corresponding to each subsystem can itself collect wakers from other coroutines that would otherwise busy poll on their own, and itself act as a form of scheduler, as alluded to in the proposal.

The implementation specifics have a bunch of intertwined tradeoffs, but as far as we can tell these broad strokes are the limit of how minimal the overall type structure can be made without necessarily sacrificing significant performance to duplicated computation (no-progress polling).

owenhilyard · 2025-01-14T18:08:04Z

Continuing on from what @szbergeron mentioned, there are a lot of IO mechanisms which are completion based. These busy-polled "subsystem" futures, which ideally can be spawned in a way that the executor is made aware of them as special, can help to de-duplicate a lot of that polling since most of these mechanisms deliver the results through some kind of queue. For epoll-like things, you still want a more central system place to handle polling the eventfd and waking things up. The current API doesn't really have a good way to support this kind of flexibility, so it all but guarantees we have the same executor lock-in issues that Go has and Rust has with Tokio, and doesn't leave room for libraries to experiment with different designs that the stdlib executor might benefit from or for high performance applications to have an executor which meets their own needs.

lattner · 2025-01-20T01:43:31Z

This is a super interesting proposal, thank you for writing this up. It is a bit over my head, but all of the framing sounds right to me and I think you're tackling the right problems. I'm really happy you're seeing how the individual ingredients (move ctors, linear types, safe references) come together to build something special! I think the major conceptual concern I have (mentioned below a couple times) is whether "pluggable executors" are actually a desirable thing?

Doing a quick review of the document:

We don't need that, so we can toss it out unless we want to discuss whether it's a good idea to delay fixups until the last possible second to defer work.

Let's toss it out - simpler is better, and we've planned to not need this.

These are concerns I don't think we want in the language, so my solution to this problem is to define a portable, extensible interface, and then provide a default "good enough" implementation hidden behind that interface which can either be swapped (at compile time or runtime) for a newer, better executor (ex: doing IO with epoll -> io_uring -> DPDK + io_uring or io_uring + SPDK -> DPDK + SPDK) or upgraded over time.
...

allowing for ecosystem interoperability and preventing "tokio lock-in" where substantial portions of the ecosystem are tied to a single executor.

I'm not an expert on many things in this space, but your argument to push this into the library aligns with my general instincts. That said, I'm not sure how important it is in this case. Rust took a pretty unopinionated view on concurrency runtimes, enabling tokio and others to experiment and build different worlds. This seemed like a good idea - don't have to make a binding decision early, but then fragmented the software ecosystem, because it turns out we all really just want one /good/ concurrency runtime to build on top of.

Swift learned from this and did its own concurrency engine, adding a bunch of bells and whistles that UI programmers need (e.g. QoS features), and a nice cancelation approach. Have you looked at its runtime and how its approach to structured concurrency?

Other notes: +1 for building on top of io_uring as default on linux though, it seems like clearly the right low-level thing to build on. As you say, other platforms will have other things and we should have some amount of implementation flexibility to provide a portable layer over them. I find "entirely by accident" to be funny :), +1 for connecting to linear types!

There are a few downsides to this. First, we would need trait objects or hand-written vtables before this could be used for a general purpose async.

I don't think that's a downside, just a dependence. I think we should take the long view and build the right thing even if it takes time. If this is the right design, then let's get existentials / trait objects done and then build on top of it.

I personally think that a fairly large amount of compile-time overhead is acceptable for this feature since it should be in the "you only pay for it if you use it" category,

Wouldn't this affect /every/ async function? On the one hand, yes you're only paying for it there and async makes you opt into it, but for certain code this will be pretty common.

There is also the "trait explosion" issue, but I don't think that's avoidable with a capability-based API.

This is the thing I'm most concerned about with this proposal: it isn't really the number of traits that I'm worried about, but the amount of synthesized code. What is the impact on code size, compile time, and other dimensions as a consequence for this flexibility?

bit of a bump in user-facing complexity over Go's "make everything async" approach.

We've already accepted the "what is your functions color" approach of Python here. Let's not worry about go specifically.

I agree with your framing of accidental vs inherent complexity!

-Chris

Some revisions based on small typos I found while re-reading the proposal.

owenhilyard · 2025-01-20T03:24:25Z

This is a super interesting proposal, thank you for writing this up. It is a bit over my head, but all of the framing sounds right to me and I think you're tackling the right problems. I'm really happy you're seeing how the individual ingredients (move ctors, linear types, safe references) come together to build something special! I think the major conceptual concern I have (mentioned below a couple times) is whether "pluggable executors" are actually a desirable thing?

These are concerns I don't think we want in the language, so my solution to this problem is to define a portable, extensible interface, and then provide a default "good enough" implementation hidden behind that interface which can either be swapped (at compile time or runtime) for a newer, better executor (ex: doing IO with epoll -> io_uring -> DPDK + io_uring or io_uring + SPDK -> DPDK + SPDK) or upgraded over time.
...

allowing for ecosystem interoperability and preventing "tokio lock-in" where substantial portions of the ecosystem are tied to a single executor.

I'm not an expert on many things in this space, but your argument to push this into the library aligns with my general instincts. That said, I'm not sure how important it is in this case. Rust took a pretty unopinionated view on concurrency runtimes, enabling tokio and others to experiment and build different worlds. This seemed like a good idea - don't have to make a binding decision early, but then fragmented the software ecosystem, because it turns out we all really just want one /good/ concurrency runtime to build on top of.

The danger of only having one is that it needs to serve every need in the ecosystem. Everything from running on a tiny RISC-V core with 64K of dedicated memory which acts as a management processor for an accelerator to manage the running kernels and check the command queue occasionally, all the way up to 32 socket servers where it is expected to handle tens of millions of network requests per second. It also means factoring in things like Intel's DLB fixed-function accelerator, which offloads much of the queue management of a simple actor framework to hardware and can be used to aid in scheduling.

I think it should be possible to define a robust enough interface that almost all use-cases fit behind it, meaning only highly specialized libraries which need to use more peculiar executor features not contained in the standard library interface would be executor specific.

Swift learned from this and did its own concurrency engine, adding a bunch of bells and whistles that UI programmers need (e.g. QoS features), and a nice cancelation approach. Have you looked at its runtime and how its approach to structured concurrency?

I haven't had chance to look at Swift. Cancellation is something I was putting off until later, but I'm interested to see solutions in that area and I'll take a look.

Other notes: +1 for building on top of io_uring as default on linux though, it seems like clearly the right low-level thing to build on. As you say, other platforms will have other things and we should have some amount of implementation flexibility to provide a portable layer over them.

I think that io_uring is a sensible default, but depending on how much chasing "state of the art" is desirable, XDP sockets and/or DPDK may be helpful as enhanced capabilities for networking, since io_uring has continued to give performance reports in terms of "% of DPDK performance". If the goal is for Mojo/MAX to be able to be in the same order of magnitude of performance as RDMA when it doesn't have RDMA-capable hardware, DPDK is probably the best path forwards for that. XDP sockets are a step back from DPDK which are more portable than DPDK but tend to be a bit slower since you can't use the NIC's hardware offloads as effectively. io_uring is close enough to SPDK, DPDK's sister library for storage that I don't think it's worthwhile to have wide support for SPDK.

I find "entirely by accident" to be funny :)

Oops, I doubled that up during an editing pass at some point and missed it.

I personally think that a fairly large amount of compile-time overhead is acceptable for this feature since it should be in the "you only pay for it if you use it" category,

Wouldn't this affect /every/ async function? On the one hand, yes you're only paying for it there and async makes you opt into it, but for certain code this will be pretty common.

It does affect every async function, and may force a topological order (following the call graph) onto determining the layout of coroutine frames in order to place them inside of each other. Rust does do a form of this, and I'm not aware of performance complaints with that part of the compiler, so there may be prior art to draw from.

There is also the "trait explosion" issue, but I don't think that's avoidable with a capability-based API.

This is the thing I'm most concerned about with this proposal: it isn't really the number of traits that I'm worried about, but the amount of synthesized code. What is the impact on code size, compile time, and other dimensions as a consequence for this flexibility?

My understanding, which may be incorrect, is that the amount of generated code is very similar to C++'s approach, which is most similar to what Mojo has now. What may occur is a few functions get inlined, resulting in a few large jump tables instead of a lot of smaller ones.

I think that the current approach in Mojo may be slightly better for compile time, because it will enable more parallelism, but I haven't spent enough time inside of a compiler to quantify it and I think you (@lattner) are in a far better position to evaluate that. I know there will be a non-zero compile-time cost, but there's also a chance that the ability to do some of the things this enables may have knock-on effects which help in other areas, such as reducing the input size to some items further along in the compilation process via reducing the number of coroutines they need to consider. I've done my best to stick reasonably close to proven approaches as far as things which I think might cause compile-time issues, but I haven't spent enough time writing compilers to know that I caught everything.

This flexibility also has some issues for recursion, since doing recursion in this model, aside from tail recursion, requires the compiler to automatically add indirection to prevent infinite-size structs. It can also cause issues since it may lead to users accidentally moving or copying multi-kilobyte structs.

lattner · 2025-01-20T06:23:57Z

I'd suggest separating two different things:

The ability for an app to provide its own executor engine.
The ability for a platform to define its own executor.

The first is the problem that rust fell into. It has a lot of challenges when two libraries assume different executors. The later is something that is obviously general goodness (io_uring doesn't exist on windows or your_kernel_driver or a gpu).

lattner · 2025-01-20T06:25:23Z

I think that the current approach in Mojo may be slightly better for compile time, because it will enable more parallelism, but I haven't spent enough time inside of a compiler to quantify it and I think you (@lattner) are in a far better position to evaluate that.

Actually, I'm not really the best to eval this. I do have some context in the current impl, but a lot is changing and I'm not on top of everythign. We're focused on getting other fundamentals nailed down (eg reference system, closures, etc) which this builds on top of though, so when this comes into focus we can better evaluate it.

owenhilyard · 2025-01-20T16:08:10Z

I'd suggest separating two different things:
1. The ability for an app to provide its own executor engine.

2. The ability for a platform to define its own executor.
The first is the problem that rust fell into. It has a lot of challenges when two libraries assume different executors. The later is something that is obviously general goodness (io_uring doesn't exist on windows or your_kernel_driver or a gpu).

So, you would make platforms something like x86_64-unknown-linux-gnu-dpdk-io_uring (meaning x86 linux with DPDK for networking and io_uring for storage)? One of the issues with this is that DPDK does actually work with GPUs, but it requires host-side setup since the NIC needs to be told to enable P2P as well. In theory, given a network card which supports peer to peer and a GPU which also does, this is a portable feature, although it's not widely supported right now.

I personally think that some of the issues stem from the lack of a base interface that all executors implement. If there's a basic interface for sleeping, spawning a new task, UDP sockets, TCP sockets, synchronization primitives and common file operations, I think most libraries will be portable. I think Rust started on this far too late, and that we can side-step the problem of "what if this isn't the right interface for this operation?" by heavily decomposing the traits (ex: 1-2 functions in most of the traits in the interface), and encouraging libraries to only depend on the interface they need, instead of a whole executor. This should minimize the amount of incompatibility, and let the stdlib executor focus on the common 90% of cases while still leaving the door open for "I need hard real time with no heap allocations" executors for people with more specialized needs. My hope is that if the stdlib sets an example, other executors will be more willing to split out functionality in a similar way, especially if the "So you want to write an executor?" docs include the reasons why it's desirable to decompose like this.

The other reason I want to push for custom executors is that I see devices getting smarter, and I can easily see a future in which the primary purpose of a CPU is to schedule an array of accelerators (already somewhat the case in ML clusters). If we go with a single stdlib-owned executor, instead of the "executor subsystem" approach, where a few basic ones like io_uring and kqueue are provided by the stdlib, there will need to be handling for every single device MAX supports which has some kind of "async kernel" support. I think that may threaten the ability for a device vendor to do independent device bring-up for MAX and provide the thing to manage async ops on their hardware as a library, which could lead to forks of MAX for specific hardware, similar to vendor forks of pytorch. I don't think this is desirable, so I want to provide a way to avoid that.

owenhilyard · 2025-01-20T16:18:39Z

I think that the current approach in Mojo may be slightly better for compile time, because it will enable more parallelism, but I haven't spent enough time inside of a compiler to quantify it and I think you (@lattner) are in a far better position to evaluate that.

Actually, I'm not really the best to eval this. I do have some context in the current impl, but a lot is changing and I'm not on top of everythign. We're focused on getting other fundamentals nailed down (eg reference system, closures, etc) which this builds on top of though, so when this comes into focus we can better evaluate it.

Ok, let's circle back around to cost evaluation once the ground work is in place.

[proposal] Mojo structured async - Typos

95fe75b

Some revisions based on small typos I found while re-reading the proposal.

JoeLoser added the needs-discussion Need discussion in order to move forward label Jan 22, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[proposal] Structured Async for Mojo #3945

[proposal] Structured Async for Mojo #3945

owenhilyard commented Jan 13, 2025

szbergeron commented Jan 14, 2025

owenhilyard commented Jan 14, 2025

lattner commented Jan 20, 2025

owenhilyard commented Jan 20, 2025

lattner commented Jan 20, 2025

lattner commented Jan 20, 2025

owenhilyard commented Jan 20, 2025

owenhilyard commented Jan 20, 2025

[proposal] Structured Async for Mojo #3945

Are you sure you want to change the base?

[proposal] Structured Async for Mojo #3945

Conversation

owenhilyard commented Jan 13, 2025

szbergeron commented Jan 14, 2025

owenhilyard commented Jan 14, 2025

lattner commented Jan 20, 2025

owenhilyard commented Jan 20, 2025

lattner commented Jan 20, 2025

lattner commented Jan 20, 2025

owenhilyard commented Jan 20, 2025

owenhilyard commented Jan 20, 2025