
ipc_session, ipc_core: sync_io-pattern APIs: Be more forgiving of user reporting active event when FD not really active. #117

Open
ygoldfeld opened this issue Dec 5, 2024 · 1 comment

@ygoldfeld
Contributor

Background (the Guided Manual has comprehensive docs; this is a quick refresher):

For every class X that must be able to wait on an asynchronous (background) event in order to operate, there is a sync_io::X counterpart. sync_io::X provides the sync_io pattern for detecting such events (and possibly reporting its own async events to the user), while X provides the async-I/O pattern (wherein it takes care of its own background-waiting and reports async events, if any, to the user by invoking a user callback from an unspecified background thread). The user may use either in a given situation. (They can use any mix of the two; it's fully a case-by-case decision.)

Basically, X (async) will use background worker threads that it itself starts, while sync_io::X (sync) will avoid this as much as possible and will instead rely on the user's own event loop to perform async-waits on its behalf. sync_io::X is therefore spiritually similar to non-blocking APIs such as OpenSSL's WANT_READ/WANT_WRITE-based pattern.

This ticket is regarding the sync_io pattern. Let x be a sync_io::X. In this pattern, the core sequence is:

  • x has ascertained, within reason, that some FD S is not readable (or, instead, writable; for brevity let's say readable here) and wants to be informed when S becomes readable in the future.
  • As part of some other non-blocking, synchronous API call into x.some_api() (or indeed an x-emitted on_active_ev_func(), which is essentially no different from an x.some_api()), x thus informs the user that it wants them to perform, conceptually, a single S.async_wait(readable, on_active_ev_func). I.e., the user shall use a technique of their choice to async-wait for S to be readable, and promises to call on_active_ev_func() (which is conceptually, again, itself an x.some_api()) when it is readable (or in an error state).
  • The user does what is asked; 2 ways are supported (see the sketch after this list):
    • boost.asio async_wait() (recommended - we like boost.asio, and our async-wait requests map very easily to it); and
    • whatever the user wants to use, usually epoll or poll. (This is trickier, but details are omitted here. It is fully supported and is an important use case.)
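To make the boost.asio route concrete, here is a minimal sketch of the user side, assuming a boost.asio event loop; raw_fd stands for S, and on_active_ev_func for the callback handed out by sync_io::X. This is illustrative only, not the actual Flow-IPC API surface.

```cpp
// Hypothetical user-side sketch: async-wait for readability of the FD that
// sync_io::X handed us, then invoke its on_active_ev_func().
#include <boost/asio.hpp>
#include <functional>
#include <memory>

namespace asio = boost::asio;

void arm_async_wait(asio::io_context& loop,
                    int raw_fd,                               // S, as given by sync_io::X.
                    std::function<void()> on_active_ev_func)  // As given by sync_io::X.
{
  // Wrap the raw FD for the duration of the wait only; sync_io::X retains ownership.
  auto desc = std::make_shared<asio::posix::stream_descriptor>(loop, raw_fd);
  desc->async_wait(asio::posix::stream_descriptor::wait_read,
                   [desc, cb = std::move(on_active_ev_func)]
                   (const boost::system::error_code& ec)
  {
    desc->release(); // Hand the FD back without closing it.
    if (ec != asio::error::operation_aborted)
    {
      cb(); // Report: "S is (reportedly) readable or in error state."
    }
  });
}
```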

In many cases, S is an FD directly corresponding to some kernel-handled IPC mechanism that we are directly interested in internally: Unix domain socket, POSIX pipe.

However, sometimes conceptually we (internally) need an async-wait on an event that does not have an FD interface. In that case, we implement it internally as follows:

  • Set up an anon pipe, with reader FD R and writer FD W.
  • Execute an async-wait (just like for direct async-waits, by asking user to do it) for readability of R.
  • Use a background thread to blockingly await the event of interest. Once it occurs, write a byte to W.
  • When R is thus reported as readable, read a byte from R (just to flush it, bringing it back to equilibrium).
  • Now we know the background thread has reported our not-directly-FD-waitable event as being active. So respond as desired.

This is an added step, but it is very fast. Plus, in some cases there's really no choice.
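A bare-bones sketch of that mechanism follows; wait_for_event_of_interest() is a hypothetical placeholder for whatever blocking wait is being wrapped, and the structure is simplified relative to the actual internals.

```cpp
// Simplified sketch of the pipe-signaling technique described above.
#include <thread>
#include <unistd.h>

void wait_for_event_of_interest(); // Hypothetical blocking wait being wrapped.

int pipe_fds[2]; // pipe_fds[0] = reader FD R, pipe_fds[1] = writer FD W.

void start_signaling()
{
  ::pipe(pipe_fds); // Set up the anon pipe.

  // (Here we'd ask the user to async-wait for readability of R, exactly as for
  // a directly-waitable FD; omitted.)

  std::thread([w = pipe_fds[1]]()
  {
    wait_for_event_of_interest(); // Blockingly await the event in a background thread.
    const char byte = 0;
    ::write(w, &byte, 1);         // Ping: makes R readable.
  }).detach();
}

void on_r_readable() // Invoked once the user reports R readable.
{
  char byte;
  ::read(pipe_fds[0], &byte, 1);  // Flush the byte, bringing the pipe back to equilibrium.
  // Now react to the event of interest as desired.
}
```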

In actual fact, here are the situations in which we do this:

  • Rarely (period = seconds) firing timed events, such as idle timeout and auto-ping timer. (We could've used a built-in timerfd in Linux but we chose to not do so. Code comments explain why. Basically it's cleaner this way.)
  • boost.interprocess (bipc) MQs (message queues). boost::interprocess::message_queue does not provide an FD-waitable API, so we use a background thread in which we blockingly await stuff... etc.
  • When wrapping an async-I/O-pattern core to implement a sync_io-pattern wrapper, instead of the other way around. This is done when performance is of no import; namely for Native_socket_stream_acceptor, Client_session, Server_session, and Session_server. In this case, when an async core X fires a user-reportable event (e.g., successful async_accept()) to the user in thread W, the sync_io::X wrapper (..._adapter) around the X writes to a pipe; the user reports readability of this pipe (they don't know what the FD represents, but internally it's a pipe-reader); we flush the byte and then react to the event.

This ticket is about all situations where we do this, namely each bullet point just listed. These are in ipc_core and ipc_session. Hopefully I didn't forget anything; but basically, search for pipe_produce or pipe_consume to get an accurate list.

So what's the problem?

Consider what happens if user reports that an FD is readable (or has error)... but actually it is not the case.

In the non-pipe situations (which, as noted, are not the problem for this ticket), our internal code will attempt the appropriate non-blocking read. (Aside/reminder: actually we might have asked for FD writability and tried to write; we're using readable+read as shorthand for brevity.) If it yields would-block... no problem. We'll just async-wait again. Stuff happens; that code was written defensively with this in mind.
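For instance, the defensive handling for a directly-waitable FD looks roughly like this (a sketch assuming boost.asio and a non-blocking Unix-domain socket; sock and the buffer are placeholders, not the actual internal code):

```cpp
#include <boost/asio.hpp>
#include <array>

// Called when the user reports the socket's FD as readable.
void on_fd_reported_readable(boost::asio::local::stream_protocol::socket& sock)
{
  std::array<char, 1024> buf;
  boost::system::error_code ec;
  const std::size_t n = sock.read_some(boost::asio::buffer(buf), ec);

  if (ec == boost::asio::error::would_block)
  {
    // Spurious wakeup: the FD wasn't actually readable.  No harm done; simply
    // re-issue the async-wait (omitted) and return.
    return;
  }
  // Otherwise: process the n received bytes, or handle a real error in ec.
  (void)n;
}
```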

The problem: In the pipe situation, we call the internal utility util::pipe_consume(). This attempts a boost.asio read_some() of 1 byte. If there are no bytes actually available (and there is a very clear comment saying just this), it will BLOCK! The pipe object is not even set to non-blocking mode; the situation isn't detected or asserted-upon. The code is not written defensively; it simply assumes this will not happen and treats it as invalid/undefined behavior, a misuse of our API.

Blocking in the middle of formally non-blocking APIs, needless to say, will throw any user program into disarray at this point.

The ask is, naturally: internally set the reader pipe to non-blocking mode; detect read_some() yielding specifically would-block; have pipe_consume() report when this happens; and have all callers (which are our internal code in ipc_core and ipc_session) behave as if the proper reaction to this event were a no-op, and therefore simply re-async-wait.

This would be quite simple coding, albeit in a few places (hopefully enumerated above).
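A minimal sketch of the requested behavior, assuming the reader end is held as a boost::asio::posix::stream_descriptor; the name pipe_consume_nb() and its exact signature are illustrative, not the actual util::pipe_consume() API:

```cpp
#include <boost/asio.hpp>

// Returns true if a byte was consumed; false if the pipe was not actually
// readable (would-block), in which case the caller should no-op and simply
// re-issue the async-wait.
bool pipe_consume_nb(boost::asio::posix::stream_descriptor& pipe_reader)
{
  pipe_reader.non_blocking(true); // Ideally set once, at pipe-creation time.

  char byte;
  boost::system::error_code ec;
  pipe_reader.read_some(boost::asio::buffer(&byte, 1), ec);

  if (ec == boost::asio::error::would_block)
  {
    return false; // Spurious activity report; nothing to flush.
  }
  // Any other error would be handled as before (e.g., assert/throw).
  return true;
}
```

Each caller in ipc_core and ipc_session would then treat a false return as: do nothing except re-arm the same async-wait.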

Is it a bug or an enhancement? How important is it? / Priority

That's a pretty good question, actually. For this reason I've labeled this issue as both "bug" and "enhancement" (odd as that is). Philosophically/formally, one could reasonably say that for as simple and deterministic a construct as the anon pipe, the user is expected to be able to accurately detect events; and if they are properly using either a boost.asio async-wait or a native epoll/poll/whatever, there is no reason the kernel should spuriously report inactivity as activity on it.

Practically though... people make mistakes; epoll in particular is a tricky API; and just in terms of reasonable expectations, why can't Flow-IPC just handle a bit of sloppiness like this? In classic networking code one doesn't freak out in this situation, and our own internal code for non-pipe-FD activity handles it fine too. So why the sudden rigidity with the pipes?

That said, I'm making this ticket simply because I thought of it. No one has reported this problem to me in practice yet.

Priority-wise... sooner is probably better than later, as it's low-hanging fruit. However, if no one's complained, technically there's no reason to rush it in either. It just "feels" like someone will complain soon enough. Plus, a random block in the middle of a program would be so unpleasant that it could potentially put someone off the whole project. We don't want that.

@ygoldfeld ygoldfeld added bug Something isn't working enhancement New feature or request labels Dec 5, 2024
@ygoldfeld ygoldfeld self-assigned this Dec 5, 2024
@ygoldfeld
Contributor Author

CC @wkorbe
