
🐛 Bug Report — Runtime APIs: large wasm memory allocations bring a DurableObject into broken state #3264

Open
yarolegovich opened this issue Dec 23, 2024 · 10 comments


@yarolegovich

Reproduction code with steps in README.

We're using the automerge library. When we make a large number of allocations, we get:

Range Error: Invalid typed array length: undefined

The error is thrown in this code produced by wasm-bindgen:

let cachedUint8ArrayMemory0 = null;

function getUint8ArrayMemory0() {
    if (cachedUint8ArrayMemory0 === null || cachedUint8ArrayMemory0.byteLength === 0) {
        cachedUint8ArrayMemory0 = new Uint8Array(wasm.memory.buffer);
    }
    return cachedUint8ArrayMemory0;
}

wasm.memory.buffer is an ArrayBuffer with non-zero byteLength.

The worst part is that after this happens for the first time, the object starts failing on any interaction with the library (because of memory access attempts), and it is hard to get the object out of this state. Resetting the DurableObject via an unhandled exception or this.ctx.abort() doesn't help. Other DurableObjects of the same type keep working fine (until overloaded).

@joshthoward
Collaborator

Hey @yarolegovich. We believe that the root cause here is that the compile target wasm32-unknown-unknown is going to abort in the event of a panic, and the runtime does not know when a WASM instance has panicked. This issue is more easily observed with DOs because requests to a single DO will be routed back to the same WASM instance.

@LuisDuarte1 has developed a pretty complicated workaround to handle panics at the application layer. LMK if you are interested.

@DaniFoldi
Contributor

Hey @joshthoward,

I'm definitely interested 👀. We have a few different use cases for larger wasm binaries in Durable Objects, including CRDTs. Some of them are our own Rust code, others are libraries that use wasm under the hood.

@yarolegovich
Author

yarolegovich commented Jan 20, 2025

Hi @joshthoward

@LuisDuarte1 has developed a pretty complicated workaround to handle panics at the application layer. LMK if you are interested.

Sure, knowing what exactly messes up the state will definitely be very helpful.
The reproduction code I shared just triggers a lot of wasm object allocations. I saw a number of discussions on the web about a 128 MB worker memory limit, but I don't remember seeing it in the official docs. Is this a hard limit? In that case I suppose we're just getting an OOM.
Also, do multiple DOs potentially share the same wasm instance? If not, is there a reason ctx.abort() doesn't reset the broken wasm?

@LuisDuarte1
Contributor

LuisDuarte1 commented Jan 20, 2025

Sure, knowing what exactly messes up the state will definitely be very helpful. The reproduction code I shared just triggers a lot of wasm object allocations. I saw a number of discussions on the web about a 128 MB worker memory limit, but I don't remember seeing it in the official docs. Is this a hard limit?

The 128 MiB limit is documented here (not sure if you can ask to increase it via the form): https://developers.cloudflare.com/workers/platform/limits/#worker-limits

In this case I suppose we're just getting an oom. Also, do multiple DOs potentially share the same wasm instance? If not, is there a reason ctx.abort() doesn't reset the broken wasm?

Not sure about the implementation details in automerge, but if they are using wasm-bindgen, the generated WASM instance is usually global, meaning that the WASM instance might be shared between DO instances in the same V8 isolate.

If resetting the WASM instance is fine in your use case (this is not a general fix), I would personally recommend re-creating the WASM instance when this error happens (this will probably need a patch in automerge).
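
For what it's worth, here is a minimal sketch of what "re-creating the WASM instance" could look like using the raw WebAssembly API. The names (wasmModule, glueImports, setWasmExports) are hypothetical; with automerge, the wasm-bindgen glue would need a patch exposing a hook like this:

// Hypothetical sketch, not automerge's actual API: keep the compiled module
// around so it can be re-instantiated after an abort.
let wasmExports = null;

function resetWasmInstance(wasmModule, glueImports) {
    // A fresh Instance gets a brand-new WebAssembly.Memory, discarding whatever
    // state the aborted instance left behind.
    const instance = new WebAssembly.Instance(wasmModule, glueImports);
    wasmExports = instance.exports;
    setWasmExports(wasmExports); // hypothetical hook to re-point the glue code
    return wasmExports;
}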

@yarolegovich
Author

yarolegovich commented Jan 21, 2025

The 128MiB limit is documented here (not sure if you can ask to increase via the form): https://developers.cloudflare.com/workers/platform/limits/#worker-limits
if they are using wasm-bindgen - the generated WASM instance from it is usually global, meaning that the WASM instance might be shared between DO instances in the same V8 isolate.

Right, so worker limits apply to Durable Objects (I was looking at this page).

Just to confirm my understanding: requests are load-balanced across servers running worker runtime processes, which host isolates of different tenants. Inside a runtime process, each tenant only has a single isolate, with tasks in the event loop potentially corresponding to different requests and Durable Object invocations, and the 128 MB limit applies to the isolate, wasm included?

If resetting the WASM instance is fine in your use-case (this is not a general fix) - I would personally recommend re-creating the WASM instance (probably will need a patch in automerge) when this error happens.

Yeah, so far that was the only possible "workaround" we could think of.
But I'm not yet sure it's an OOM; there are no "Memory limit exceeded" errors in our worker's dashboard. Let's try the panic-handling workaround to see what's really happening and whether it's really just a memory allocation failure.

@dmaretskyi

@joshthoward Hi, thanks for taking a look at this. Would a panic inside WASM put the WebAssembly.Memory instance into an invalid state? Note that the error doesn't originate inside the WASM bytecode, but when we try to create a Uint8Array view of the WASM memory in JS.

@LuisDuarte1
Contributor

Would a panic inside WASM put the WebAssembly.Memory instance into an invalid state?

Panics in the wasm32-unknown-unknown target are panic=abort by default, meaning that the memory doesn't get cleaned up after panicking and the instance just hits unreachable, AFAIK. So yes, every call to the WASM instance after a panic is considered UB (undefined behavior) because the memory hasn't been cleaned up.
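
For illustration, this is roughly what the caller sees on the JS side when such a panic happens (the export name is hypothetical, not automerge's actual API):

// Hypothetical illustration: a Rust panic compiled for wasm32-unknown-unknown
// surfaces in JS as a trap rather than a normal library error.
try {
    wasm.function_that_panics(); // hypothetical export
} catch (e) {
    // e is a WebAssembly.RuntimeError with a message like "unreachable".
    // Nothing inside the instance was unwound or cleaned up, so any later call
    // into `wasm` runs against whatever half-updated state the panic left behind.
    console.error(e instanceof WebAssembly.RuntimeError, e.message);
}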

@dmaretskyi

dmaretskyi commented Jan 21, 2025

So yes, every call to the WASM instance after a panic is considered UB (undefined behavior) because the memory hasn't been cleaned up.

In the above error we never get to the WASM instance. The error gets thrown in JS on this line right here:

cachedUint8ArrayMemory0 = new Uint8Array(wasm.memory.buffer); // Range Error: Invalid typed array length: undefined

@LuisDuarte1
Contributor

In the above error we never get to the WASM instance. The error gets thrown in JS on this line right here:

I think you got the right idea, in the sense of asking why wasm.memory.buffer somehow gets into a bad state when a WASM instance aborts because it cannot allocate more memory (remember the worker limit of 128 MiB).

AFAICT getUint8ArrayMemory0 gets called through the WASM instance (most likely while converting a JS string to a Rust one, or vice versa), which has most likely already aborted because it failed to allocate memory.

So my point remains: further calls to an aborted WASM instance (and by extension its generated JS glue code) are considered UB.

(Not to say that the bad state of wasm.memory.buffer doesn't exist, but by handling aborted WASM instances, I think you can circumvent these kinds of problems altogether.)
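
To make that concrete, here is a rough sketch of an application-layer guard. The names callIntoWasm and reinitWasm are hypothetical; the real version would live wherever your code calls into the wasm-bindgen glue:

// Hypothetical sketch of an application-layer guard around wasm calls.
let wasmPoisoned = false;

function callIntoWasm(fn, ...args) {
    if (wasmPoisoned) {
        reinitWasm(); // hypothetical helper that re-instantiates the module with fresh memory
        wasmPoisoned = false;
    }
    try {
        return fn(...args);
    } catch (e) {
        // A trap (WebAssembly.RuntimeError) or the RangeError thrown by the glue
        // code both mean the current instance can no longer be trusted.
        if (e instanceof WebAssembly.RuntimeError || e instanceof RangeError) {
            wasmPoisoned = true;
        }
        throw e;
    }
}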

@yarolegovich
Author

by handling aborted WASM instances, I think you can circumvent these kinds of problems altogether

So, how do we handle these? 🙂
