
🐛 Bug Report — Runtime APIs: large wasm memory allocations bring a DurableObject into broken state #3264

Open
yarolegovich opened this issue Dec 23, 2024 · 10 comments


@yarolegovich

Reproduction code with steps in README.

We're using the automerge library. When we make a large number of allocations, we get:

Range Error: Invalid typed array length: undefined

The error is thrown in this code produced by wasm-bindgen:

let cachedUint8ArrayMemory0 = null;

function getUint8ArrayMemory0() {
    if (cachedUint8ArrayMemory0 === null || cachedUint8ArrayMemory0.byteLength === 0) {
        cachedUint8ArrayMemory0 = new Uint8Array(wasm.memory.buffer);
    }
    return cachedUint8ArrayMemory0;
}

wasm.memory.buffer is an ArrayBuffer with non-zero byteLength.

The worst part is that after this happens for the first time, the object starts failing on any interaction with the library (because of memory access attempts), and it is hard to get the object out of this state. Resetting the DurableObject via an unhandled exception or this.ctx.abort() doesn't help. Other DurableObjects of the same type keep working fine (until overloaded).

@joshthoward
Collaborator

Hey @yarolegovich. We believe that the root cause here is that the compile target wasm32-unknown-unknown is going to abort in the event of a panic, and the runtime does not know when a WASM instance has panicked. This issue is more easily observed with DOs because requests to a single DO will be routed back to the same WASM instance.

@LuisDuarte1 has developed a pretty complicated workaround to handle panics at the application layer. LMK if you are interested.

@DaniFoldi
Contributor

Hey @joshthoward,

I'm definitely interested 👀. We have a few different use cases for larger wasm binaries in Durable Objects, including CRDTs. Some of them are our own Rust code, others are libraries that use wasm under the hood.

@yarolegovich
Author

yarolegovich commented Jan 20, 2025

Hi @joshthoward

@LuisDuarte1 has developed a pretty complicated workaround to handle panics at the application layer. LMK if you are interested.

Sure, knowing what exactly messes up the state will definitely be very helpful.
The reproduction code I shared just triggers a lot of wasm object allocations. I saw a number of discussions on the web about a 128 MB worker memory limit, but I don't remember seeing it in the official docs. Is this a hard limit? In that case I suppose we're just getting an OOM.
Also, do multiple DOs potentially share the same wasm instance? If not, is there a reason ctx.abort() doesn't reset the broken wasm?

@LuisDuarte1
Contributor

LuisDuarte1 commented Jan 20, 2025

Sure, knowing what exactly messes up the state will definitely be very helpful. The reproduction code I shared just triggers a lot of wasm object allocations. I saw a number of discussions on the web about a 128 MB worker memory limit, but I don't remember seeing it in the official docs. Is this a hard limit?

The 128 MiB limit is documented here (not sure if you can ask to increase it via the form): https://developers.cloudflare.com/workers/platform/limits/#worker-limits

In this case I suppose we're just getting an oom. Also, do multiple DOs potentially share the same wasm instance? If not, is there a reason ctx.abort() doesn't reset the broken wasm?

Not sure about the implementation details in automerge, but if they are using wasm-bindgen, the generated WASM instance is usually global, meaning that the WASM instance might be shared between DO instances in the same V8 isolate.

If resetting the WASM instance is fine in your use case (this is not a general fix), I would personally recommend re-creating the WASM instance when this error happens (this will probably need a patch in automerge).
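
For what it's worth, here is a minimal sketch of what "re-creating the WASM instance" could look like using the raw WebAssembly API. The names (wasmModule, glueImports, setWasmExports) are hypothetical; with automerge, the wasm-bindgen glue would need a patch exposing a hook like this:

// Hypothetical sketch, not automerge's actual API: keep the compiled module
// around so it can be re-instantiated after an abort.
let wasmExports = null;

function resetWasmInstance(wasmModule, glueImports) {
    // A fresh Instance gets a brand-new WebAssembly.Memory, discarding whatever
    // state the aborted instance left behind.
    const instance = new WebAssembly.Instance(wasmModule, glueImports);
    wasmExports = instance.exports;
    setWasmExports(wasmExports); // hypothetical hook to re-point the glue code
    return wasmExports;
}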

@yarolegovich
Author

yarolegovich commented Jan 21, 2025

The 128MiB limit is documented here (not sure if you can ask to increase via the form): https://developers.cloudflare.com/workers/platform/limits/#worker-limits
if they are using wasm-bindgen - the generated WASM instance from it is usually global, meaning that the WASM instance might be shared between DO instances in the same V8 isolate.

Right, so worker limits apply to Durable Objects (I was looking at this page).

Just to confirm my understanding: requests are load-balanced across servers running worker runtime processes, which host isolates of different tenants. Inside a runtime process, each tenant only has a single isolate, with tasks in the event loop potentially corresponding to different requests and Durable Object invocations, and the 128 MB limit applies to the isolate, wasm included?

If resetting the WASM instance is fine in your use-case (this is not a general fix) - I would personally recommend re-creating the WASM instance (probably will need a patch in automerge) when this error happens.

Yeah, so far that was the only possible "workaround" we could think of.
But I'm not yet sure it's an OOM; there are no "Memory limit exceeded" errors in our worker's dashboard. Let's try the panic-handling workaround to see what's really happening and whether it's really just a memory allocation failure.

@dmaretskyi

@joshthoward Hi, thanks for taking a look at this. Would a panic inside WASM put the WebAssembly.Memory instance into an invalid state? Note that the error doesn't originate inside the WASM bytecode, but when we try to create a Uint8Array view of the WASM memory in JS.

@LuisDuarte1
Contributor

Would a panic inside WASM put the WebAssembly.Memory instance into an invalid state?

Panics in the wasm32-unknown-unknown target are panic=abort by default, meaning that the memory doesn't get cleaned up after panicking and the instance just hits unreachable, AFAIK. So yes, every call to the WASM instance after a panic is considered UB (undefined behavior) because the memory hasn't been cleaned up.
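
For illustration, this is roughly what the caller sees on the JS side when such a panic happens (the export name is hypothetical, not automerge's actual API):

// Hypothetical illustration: a Rust panic compiled for wasm32-unknown-unknown
// surfaces in JS as a trap rather than a normal library error.
try {
    wasm.function_that_panics(); // hypothetical export
} catch (e) {
    // e is a WebAssembly.RuntimeError with a message like "unreachable".
    // Nothing inside the instance was unwound or cleaned up, so any later call
    // into `wasm` runs against whatever half-updated state the panic left behind.
    console.error(e instanceof WebAssembly.RuntimeError, e.message);
}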

@dmaretskyi

dmaretskyi commented Jan 21, 2025

So yes, every call to the WASM instance after a panic is considered UB (undefined behavior) because the memory hasn't been cleaned up.

In the above error we never get to the WASM instance. The error gets thrown in JS on this line right here:

cachedUint8ArrayMemory0 = new Uint8Array(wasm.memory.buffer); // Range Error: Invalid typed array length: undefined

@LuisDuarte1
Contributor

In the above error we never get to the WASM instance. The error gets thrown in JS on this line right here:

I think you got the right idea, in the sense of asking why wasm.memory.buffer somehow gets into a bad state when a WASM instance aborts because it cannot allocate more memory (remember the worker limit of 128 MiB).

AFAICT getUint8ArrayMemory0 gets called through the WASM instance (most likely while converting a JS string to a Rust one, or vice versa), which has most likely already aborted because it failed to allocate memory.

So my point remains: further calls to an aborted WASM instance (and by extension its generated JS glue code) are considered UB.

(Not to say that the bad state of wasm.memory.buffer doesn't exist, but by handling aborted WASM instances, I think you can circumvent these kinds of problems altogether.)
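
To make that concrete, here is a rough sketch of an application-layer guard. The names callIntoWasm and reinitWasm are hypothetical; the real version would live wherever your code calls into the wasm-bindgen glue:

// Hypothetical sketch of an application-layer guard around wasm calls.
let wasmPoisoned = false;

function callIntoWasm(fn, ...args) {
    if (wasmPoisoned) {
        reinitWasm(); // hypothetical helper that re-instantiates the module with fresh memory
        wasmPoisoned = false;
    }
    try {
        return fn(...args);
    } catch (e) {
        // A trap (WebAssembly.RuntimeError) or the RangeError thrown by the glue
        // code both mean the current instance can no longer be trusted.
        if (e instanceof WebAssembly.RuntimeError || e instanceof RangeError) {
            wasmPoisoned = true;
        }
        throw e;
    }
}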

@yarolegovich
Author

by handling aborted WASM instances, I think you can circumvent these kinds of problems altogether

So, how do we handle these? 🙂
