-
Notifications
You must be signed in to change notification settings - Fork 534
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
We need to be more explicit about the Cats Effect memory model #3899
Comments
I think it's hard to do this right, but probably worth it. Even a somewhat informal description is probably better than nothing (which is the current situation). One of the things which makes this hard is that (as far as I know), the Scala memory model is also "implicitly defined via implementation details and the Java Memory Model". Another thing we should think about is whether to define these rules for the typeclasses in I know it's just the example which triggered this issue, but I'd be interested in what guarantee an (All these concerns doesn't mean we shouldn't do this. I think we should, I just started to think about how to do it...) |
For comparison, here's what the JDK Semaphore promises:
The OP mentioned that it works even with 0 permits. |
I don't think this part of the JDK apidoc (which is usually pretty good) is worded as carefully as it should be. By that, I mean that it is not true as written. Let's say we have If we look at the case of monitors in JLS 17.4.5, this is what it says:
The word "subsequent" seems important here (and I suspect that's what's missing from the semaphore apidoc). It is defined formally in 17.4.4. As far as I understand, a consequence of that definition is essentially the same as intuitively thinking in rel-acq pairs (or rel-acqs corresponding to each other). So there is happens-before, if the release and the acquire correspond to each other. (At least that's how I understand it.) And I think this is where there is a problem with |
That's a really great point. Particularly But that's good though, I think it gives a concrete place to start from. The other key semantic that jumps to mind is about memory visibility when you have sequential I'd also like to figure out where and how we are currently relying on these implicit semantics ourselves. |
Yeah. The rule could be the CE equivalent of this from the JLS: "A write to a volatile field happens-before every subsequent read of that field." (And of course this holds for the default
And this could be the CE equivalent of this. "If x and y are actions of the same thread and x comes before y in program order, then hb(x, y)." Obviously with fiber instead of thread, and "something" instead of program order. (As far as I can tell, this also holds currently due to Executor semantics. Of course we should double check that the WSTP follows correct Executor semantics, but it should.) Similarly: we also have an equivalent of "A call to start() on a thread happens-before any actions in the started thread", when forking a fiber. And of "All actions in a thread happen-before any other thread successfully returns from a join() on that thread", when joining a fiber. We should have something for |
Relates to #2928 |
A little while back Daniel S and I chatted about this on Discord (sorry public internet) and had a follow-up conversation. In those discussions, I propose that most Cats Effect applications and libraries actually do not rely on the JMM's final field semantics.1 In idiomatic Cats Effect code, data is explicitly shared via Continuing from there, I also want to hone down exactly what memory is published when synchronizing via It turns out this is an existing concept in C++, known as
https://devblogs.microsoft.com/oldnewthing/20230427-00/?p=108107 The JMM also uses some notion of reachability to define its final field semantics.
https://docs.oracle.com/javase/specs/jls/se21/html/jls-17.html#jls-17.5 Note, I'm not claiming that we can implement strictly these semantics. For example, final field semantics are not going away on the JVM and LLVM doesn't currently expose Anyway, the reason for bringing this up again is because the possibility of relaxing the memory model on Scala Native has finally cropped up :) Footnotes
|
Just a note,
From P0371R1: Temporarily discourage memory_order_consume That's from 2016, but newer notes are not much more encouraging:
From P0735R1: Interaction of Also interesting reading: https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0750r1.html Why would final field semantics matter when we specify and/or implement (On the other hand, final field semantics absolutely matter to scala-native.) |
Another way we could define the CE memory model is that we could say that |
Huh, yeah, that makes sense. Basically, even if we'd make |
Cross-linking my comment on the Scala Native thread. |
So I've skimmed the "All Fields are Final" article again, and the cost doesn't seem so dramatic to me. The conclusion seems to be that on x86 it is practically free, and on ARM it is also not too bad. It probably helps, that hotspot can somehow avoid the acq fence on reads. (Do we know why scala-native can't do that?) (Btw, I think what hotspot does with eliding the acquire is essentially what But of course, if we can make it faster on ARM, that's always good. (I'm just afraid of immutable objects changing... that's not fun...) |
Hmm, what are you referring to exactly? This bit?
If I understand correctly, I wouldn't describe this as avoiding the acquire fence. Rather, there is semantically an acquire there, that just happens to require emitting no instructions on most hardware. But you did exactly nail the question I've been pondering today. Which is whether the implementation of More broadly, my understanding was that for
I found a nice illustration of this in these slides. In the context of So this is where I'm getting a bit puzzled. It's quite possible that the JVM, C++, and LLVM are just not corresponding to each other in such a direct or obvious way. It's also possible that on most platforms there is no practical distinction between a But the question remains what is the correct way to implement Regarding performance on JVM vs Scala Native ... the "All Fields are Final" article discusses barrier coalescing on HotSpot. It's possible that Native is unable to do this kind of optimization. If it did, it would have to be something in LLVM's optimizer (didn't see any hits in a quick search). So this could explain why the performance impact is much worse on Native. |
You're right. Or rather, semantically there is a consume there; because I think they can avoid a CPU-fence instruction exactly because it's a consume-like situation (i.e., they don't need a general-purpose acquire). Regarding how finals are implemented in scala-native: is your concern that (1) the release is not paired correctly with the acquire, because they're for different memory addresses; or that (2) the object reference may be written with a plain store instead of a relaxed one? (Or both?) (Now we're talking about something which I only have a vague understanding of, so don't take me too seriously.) About (1): I think this is a valid concern. It seems that the acquire fence will be in the wrong place. If I understand it correctly, this is what happens now: // constructor:
this.finalField = something
releaseFence()
// using the object:
val obj = ...
val x = obj.finalField
acquireFence() While the constructor seems correct, at the use site what should happen is more like this: // using the object:
val obj = ...
acquireFence()
val x = obj.finalField About (2): I think C++ relaxed approximately corresponds to opaque, which is stronger than plain. But, there is this (here, emphasis mine):
Are object references primitively bitwise atomic in scala-native? They are on the JVM. (But when Valhalla value types arrive, those might not be!) If they are also in scala-native, (2) seems fine to me. (But honestly, I don't really understand the difference between LLVM monotonic and unordered...) Regarding what's implementable and what is not (when we're implementing, e.g., |
Actually it is even simpler than a Attempting to illustrate more concretely, consider: val obj1 = getObjUnordered()
val obj2 = getObjUnordered()
val x = obj2.finalField It would be okay for that code to be re-written to: val obj1 = getObjUnordered()
val speculation = obj1.finalField
val obj2 = getObjUnordered()
val x = if (obj2 eq obj1) speculation else obj2.finalField Contrast with: val obj1 = getObjUnordered()
val obj2 = getObjConsume()
val x = obj2.finalField In this case, that rewrite is not allowed. It seems like
Ugh, you are right, it is in the wrong place. And actually I had completely missed this.
Not sure. I hope so 😅 object references are modeled as LLVM pointers and I can't find any LLVM-related hits for those keywords. They probably call it something else ...
I agree, it seems to be (approximately) equivalent to the JVM's In any case, thanks, that section you quoted is helpful. Still don't feel like I fully grok anything 🤔 It definitely does seem like the
On Scala Native we can do whatever NIR (Scala Native bytecode) allows. In practice, this corresponds closely to LLVM, including for atomics. |
I don't think so. This whole game with fences is exactly because it is possible to read 2 different values from a final field. Namely, the
Apparently Alpha is not like "most" hardware:
From https://www.kernel.org/doc/Documentation/RCU/rcu_dereference.txt (which is about Linux RCU, which was apparently the motivation for consume). But also, the CPU is not the only one who can reorder things: the compiler(s) can do that too.
From https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/p0098r0.pdf. So I think final fields work without (read) barriers in Hotspot, because (1) there is no Hotspot for Alpha (probably?), and (2) Hotspot knows, that neither Hotspot, nor the Java compiler makes any optimization which would break them. (I think (2) is equivalent to saying that there is a compiler read barrier for final fields.) Btw., interesting (although long, and at points very confusing) discussion about consume: https://news.ycombinator.com/item?id=36040457 |
Thanks, I found jcranmer's comments in this thread of the HN post very helpful.
Okay, now I see what you are saying. I agree. For the record, I have a hard time seeing what kind of optimizations a compiler could make at the use-site of a Unfortunately this is bad news for Scala Native where we don't have fine control over the optimizations that LLVM makes. Or rather, emitting these barriers is how we control the optimizations. I'm worried that to correctly implement
Since so much code is generic, eliding the barriers for stores/loads is difficult unless you know at linking time that no subclasses require |
Doesn't scala-native need this anyway? For every access to shared memory, since data races are undefined behavior (on LLVM). |
Scala Native uses
|
Ah, okay. And that's not enough for the situation with |
Well, maybe not, based on what you shared in #3899 (comment).
If we can confirm primitive bitwise atomicity of pointers, then I suppose an |
What bothers me is that even though that JDK9 Memory Order Modes guide suggests this would work, nothing in C/C++ world says that. If anything, they seem to say the opposite i.e. that it won't work 😕
|
We're getting very much offtopic here, but...
Yeah. Except that's C++; what "we" (i.e., scala-native, really :-) care about is LLVM. I think unordered is fine for LLVM:
From https://llvm.org/docs/LangRef.html#id219 (emphasis mine). So, like C++, LLVM requires atomic operations (i.e., an atomic operation with any ordering constraint). Reading https://llvm.org/docs/LangRef.html#atomic-memory-ordering-constraints, it seems that the weakest one is unordered. So that should be fine. Why the difference? I think it's simply because the weakest possible atomic operation in C++ is a relaxed one (which corresponds to LLVM monotonic). But LLVM introduces a weaker (but still atomic!) operation: unordered. |
Thanks. Yes, I've been pondering the same. If true (and seems like it should be) that's very good ... however I no longer have a good intuition for what atomic actually means 😅 LLVM explains but I haven't really grokked it.
I just don't have a good enough understanding of how and why compilers apply those sorts of optimizations to unordered loads and stores and furthermore, why that would cause fences to have undefined behavior 🤔
Ha, maybe 😅 I'll invoke All Fields Are Final in an attempt to defend this tangent:
i.e. understanding what we can implement and how is a prerequisite for defining the Cats Effect memory model. In particular, if it is impossible to efficiently implement In any case, thanks for your patient and thoughtful engagement on this subject, much appreciated :) |
At this point, I think I'll just say that atomic is whatever the spec says is atomic; and if another part of the spec (e.g., the But, speculating a little bit, a compiler might want to sometimes generate a 64-bit write as two 32-bit writes (e.g., on a 32-bit platform, or something). This would not be atomic (the "transforms a store into multiple stores" part); and obviously could cause problems with reading values which never were written. And if I'm dereferencing a pointer which never existed... that's obviously very bad. (And I think it's undefined behavior exactly because they want compilers to be able to freely do these kinds of transformations.) |
Returning to CE a little bit...
About designing (i.e., what we spec): it seems to me, that the only way we'd really rely on The next stronger possible way to spec them seems to be About implementing: CE is full of intentionally racy internal code where it relies on |
It might be helpful to talk about some specific examples. Here's one that I know of: cats-effect/core/shared/src/main/scala/cats/effect/IOFiber.scala Lines 164 to 165 in 97b0174
Currently this works via data race because the If
Meanwhile, we removed the release/acquire barriers from all other uses of
Care to share? :) |
Yeah, you're right:
Yeah, ok, I admit this is not that much... I remembered (imagined?) more 🤷♂️
Not intentional races in CE, but possible races in user code (I guess we could say that these are forbidden...):
Not final fields, but data race (probably unintentional?):
I will, just need time to think it through (and write it down) properly... |
So, let's imagine, that we say that idiomatic CE code should be fine with using Ref/Deferred and use only "structurally reachable" things. If I understand it correctly, this is approximately saying, that I believe this would make totally reasonable code like this incorrect. In fact, it would make incorrect the whole idiom of using a
While the other fiber:
So after waking up, it could read an "old" state (i.e., it doesn't necessarily sees |
Yes. Well, let's be more precise about what exactly we are discussing ... On the JVM, On Scala Native, Suffice to say, it seems less likely that changing our implementation on the JVM will bring any optimizations. But we have a lot more potential to do so on Scala Native.
I think this one is different from
Yes, definitely count this one :) I admit this one definitely needs some thinking about. The And in fact there is another memory publication issue there that I've been worried about. The cats-effect/core/shared/src/main/scala/cats/effect/IOFiber.scala Lines 628 to 630 in 85d6f3c
Instead of encoding that callback as an anonymous closure, we could implement it as a top-level (static) class that explicitly takes all its state in a constructor. Then in the implementation of Putting aside those hacks, I think we could definitely make it work like this:
Yes, I realize this is just replicating
Tough questions. I don't know. I see what you are saying, esp. about the definition of thread-safety. "Thread-safe but only if published with appropriate fencing" doesn't give the usual cozy thread-safe feeling. In Scala Native, my argument has been that the number of developers designing algorithms based on data races is small and that the onus should fall to them to know what they are doing. For Cats Effect users to encounter these sorts of things requires creeping out of the "idiomatic" patterns into a wildly unsafe world. May I cop-out and declare these things as implementation-specific? 😅
Mmm. Seems like a legit bug, independent of this discussion.
This is a fantastic observation. Thank you so much for thinking about this :) I am going to think more about it, but even if hypothetically there is a "better" or "more idiomatic" pattern that could be used instead, I absolutely agree with you that it is very very reasonable code. Paired by the fact that we cannot implement |
We can absolutely do this on the JVM too (versions > 8).
👍
(Besides, saying that there are "no visibility guarantees anyway" when reading through a race, is a little strange to me. There are guarantees for #3781: seeing "at least" the initial The closure is a very good point. (I was thinking about closures, but didn't put together the two things.) I expect this is a thing that will be easy to forget. (Btw., how do you use the new scala-native annotation on the fields of a closure?)
💯 If we try to avoid synchronization for performance in specific cases, like in #3781; and also avoid synchronization in general for About thread-safety of WSTP and others:
Sure. But maybe we're underestimating the number of unintentional data races? I suspect the reason the JVM has more or less "reasonable" semantics even for data races is exactly because they expected them to be common, and wanted to avoid the situation in C/C++, where it's undefined behavior. Whether they're really that common (or if they're common specifically in Scala) is a very good question...
That's basically saying "no" 🙂 (i.e., generic code will need to publish them with synchronization.)
Sure thing, it's interesting stuff. (Although now I know more about the C++ memory model than I've ever wanted to...) |
Related issue: #2484. |
(Sorry for slow cadence, got a bit behind on stuff :)
Sorry, let me try to explain again. Your original comment was:
If we are reading a reference to a fiber via a race, then we might not even see that fiber. As you say, not the end of the world, but also not nice. Suppose we did read a non- I think it would be especially not nice if we were guaranteed to see the reference to the fiber, but not guaranteed to see its contents. But when we are reading references to fibers themselves via a race, this is not the case: we are neither guaranteed to see the reference nor its contents. This is what I meant by "no visibility guarantees anyway".
I suppose so, but this is less clear to me on Scala Native. What confuses me is that the same address spaces may be used over and over again for different objects as stuff gets GCed and allocated. So in the case of data race I don't understand what prevents you from reading (stale) values of an old, GCed object and how you can be confident you are seeing at least the "cleared" values i.e.
Mmm, right, you are referring to 17.7. Non-Atomic Treatment of double and long?
Indeed.
By writing it out as a top-level On Scala JVM and JS using a closure vs a class generates meaningfully different code (e.g.
Tough to say.
Haha, fair enough. I guess I was casting doubt on whether such code could be generic anyway i.e. if someone has dropped low-level enough that they have to worry about safety of WSTP when published through data race, chances are they are writing code that already cannot be generic across Scala Native and Scala JVM and needs to be split into platform-specific implementations. But as you say, maybe this "low level" is not quite as low and uncommon as I think it is. |
About "no visibility guarantees anyway": thanks, I understand now. I agree.
I know (almost) nothing about scala-native, but this indeed seems an important thing to think about; fwiw, I'd consider it a (serious) bug in scala-native, if it's possible to see values from a GC'd object.
👍
I'd say the fact that we'd have to do this shows that this scala-native annotation is not that well thought through. The annotation itself could be made cross-compilation friendly (by making a stub for JVM/JS), but doing this requires split sources. Which are a pain...
Ehh... I see your point. But I still see value in having certain "inherently" multi-threading-adjacent objects being "unconditionally" thread-safe. E.g., |
Further reading about final fields (and other related things, like NonAtomic vs. Unordered) in kotlin native: https://github.com/JetBrains-Research/KotlinMemoryModelResearch/blob/master/research/final-fields.md (and also other documents in that repo). Even more reading: https://weakmemory.github.io/project-theses/studying-compilation-schemes-KN.pdf. |
Currently the Cats Effect memory model is implicitly defined via implementation details and the Java Memory Model (JMM). In some cases this works okay but in other cases this means that our memory model is effectively undefined.
As a concrete example, earlier today on Discord there was a report that
Semaphore
offers no memory ordering guarantees when acquiring/releasing 0 permits. This boils down to a specific implementation detail that doesn't matter if you are using CE's own concurrency primitives and data structures, but matters a lot if you are reading/writing unsynchronized memory unsafely withdelay
. We can't say if the currentSemaphore
implementation is bugged or not without explicitly defining the expected semantics.With Scala Native multi-threading on the horizon in #3662, nailing down the specifics of the CE memory model becomes even more important since we have a lot more flexibility in implementation details. Longer-term, Scala Native could potentially even support weaker memory models than the JMM.1
Footnotes
Tangential, but I have a hypothesis that code written using immutable data structures and CE concurrency primitives actually does not need many guarantees of the JMM (e.g. publication of
final
fields on construction) and that a weaker CE memory model would be sufficient and have better performance esp. on ARM. ↩The text was updated successfully, but these errors were encountered: