Range extension thunks #425
This seems out of scope to me, as it is a solution in search of a problem. Also, we need to support functions that are >1 MiB in size--something GCC does poorly currently but will need to get better at. That would require JAL thunks to be interspersed within a function, which sounds undesirably messy. So, I'd focus on the large code model motivation. (The dynamically cold call site optimization is also interesting, and though it also has problems with >1 MiB functions, that's OK, because it's purely opportunistic.)
PLT stubs clobber t1 and t3 unconditionally (I don't know why that was picked in the first place, but hey, that's what we've got). Without STO_RISCV_VARIANT_CC and -z now / LD_BIND_NOW the header also clobbers t0 and t2, but of course the resolver then clobbers some arbitrary subset of the caller-saved register set.
Because that eats precious relocation encoding space for something that does nothing new. We've already got the wastage from CALL+CALL_PLT, we don't need to add yet another relocation that implementations will treat identically to the other two. RV64 has enough space that one should never run out, but RV32 is very constrained.
This change would significantly diverge the RISC-V psABI from the psABIs of other RISC processors. Other psABIs do not distinguish range-extendable CALL relocation and the usual one. The linker automatically takes care of everything and creates range extension thunks as needed. (2) and (3) can be implemented without making a modification to the psABI, too. So I don't see a reason to add new relocations.
Are you saying that code size / speed tradeoffs in general are out of scope (what about Zcmt?), or are you making a narrower point that I'm failing to see? My current thought for the prototype is
The compiler knows how large the function/section is and can switch to AUIPC+JALR if external references would struggle to reach outside it. It needs to do this anyway for internal references, and I've done it before (there was a function larger than 1 MiB in the Go test suite). I don't intend for the linker to ever insert bytes in the middle of an input section.
Unfortunately, the large code model isn't close to done, so I can't do much with it without taking over a much larger project. Likewise I am not confident in my ability to get PGO working on a nontrivial project. So short calls on the basis of a command line option / attribute are the most likely thing I'll be able to have an end to end demonstration of in a reasonable length of time.
Good catch. Unfortunately, we need
I'll drop CALL_THUNK then. It potentially turns link failures into runtime undefined behavior if anyone is using CALL_PLT for DSO-local functions with a custom calling convention that uses t2, which won't affect
To be clear, you want to drop both CALL_THUNK and JAL_THUNK, and add range extension thunk semantics for CALL, CALL_PLT, and JAL? I think at a minimum we'd need a clear statement that R_RISCV_JAL within a single section over less than a 1 MiB range will never generate a thunk.
Not at all! My point was that motivation (2) is unconvincing unless you assume a fictitious universe in which linker relaxation does not exist. But, as I wrote, (1) is a sufficient motivation for this idea, and (3) is an interesting direction. I'm not pooh-poohing the idea; I'm just trying to help hone it.
Only CALL_PLT (and CALL, as it's a synonym for CALL_PLT) would require a range extension thunk, as that's the only relocation in this ABI to call a function that may be at an arbitrary memory address. Essentially, I don't think you need to invent something new for RISC-V for range extension thunks. You can instead just do what other RISC psABIs do. As a reference, this is what the AArch64 psABI does for range extension thunks: https://github.com/ARM-software/abi-aa/blob/2a70c42d62e9c3eb5887fa50b71257f20daca6f9/aaelf64/aaelf64.rst#L1281
We are considering three use cases here.

1. A true large code model needs to support more than 2 GiB of text; data accesses are out of scope for this change but jumps and calls across a range of more than 2 GiB are needed. Most users of a large model will have more than 2 GiB of data but small text, or text with a highly local call pattern, so we want most calls to be able to use the auipc+jalr sequence. This would normally call for relaxation, but relaxation requires object files to contain the longest possible sequence, of which several are possible. Instead, keep the sequences the same and allow thunk insertion.

2. There have been requests from developers of linkers other than ld.bfd to avoid the use of length-changing relaxation, since it is a unique linking step for RISC-V among common large systems architectures, can be computationally expensive, and requires a substantial number of additional relocations in object files. In an environment which makes limited use of global data, most of the code size benefits of relaxation come from call->jal relaxation. If the compiler is modified to generate jal instructions instead of call instructions, the code size benefits can be achieved without relaxation at all, but this requires JAL_THUNK to avoid relocation errors at a 1 MiB limit.

3. If a function has many static call sites in a large binary but is known to be dynamically cold, due to a function attribute or PGO, the call sites can be replaced with jal instructions, sharing a single thunk between all call sites within a 2 MiB text region. This saves code size at small runtime cost.

Restricting the register usage of the thunks is an intentional feature copied from the Go 1.15 toolchain, where every non-leaf function requires a conditional call to runtime.morestack in the prologue; since ra cannot be saved before the stack frame is allocated, the call is performed using t0 as the return register.

Range extension thunks use t1 and t2 as temporary registers, while traditional practice in PLT entries is to use t1 and t3. This is a necessary change to support the use of software-guarded branches when Zicfilp support is added.

Thunk insertion is forbidden inside a single section, since relaxation assumes that sections cannot grow and to ensure that intra-function branches do not impact register allocation. Thunks may be inserted on any cross-section jump; while linkers should endeavor to minimize the number of thunks used, this is a complex optimization problem with a quality/time tradeoff and mandating any algorithm would be inappropriate.

Since this change redefines the existing relocation types, it is a **backwards incompatible ABI change** if object files expect t1 and t2 to retain their values across jumps that cross section boundaries. In practice, cross-section JAL relocations are not generated by current compilers and rarely appear in handwritten assembly, and linkers are not expected to generate thunks for CALL and CALL_PLT until the output approaches the 2 GiB limit of auipc+jalr sequences.

Signed-off-by: Stefan O'Rear <sorear@fastmail.com>
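For readers following along, the section rule in the commit message can be sketched as a small decision function. This is only an illustration under stated assumptions: `resolve` and the constant names are hypothetical, the reach limits are rounded to ±1 MiB and ±2 GiB, and the policy of inserting a thunk only when the direct form cannot reach is one possible choice, since the text deliberately does not mandate an algorithm.

```python
# Hypothetical sketch of a linker's per-relocation decision.
# Names and policy are assumptions for illustration, not psABI text.

# Approximate direct reach of each relocation's instruction sequence.
DIRECT_LIMIT = {
    "R_RISCV_JAL": 1 << 20,       # jal: about +/-1 MiB
    "R_RISCV_CALL": 1 << 31,      # auipc+jalr: about +/-2 GiB
    "R_RISCV_CALL_PLT": 1 << 31,  # auipc+jalr: about +/-2 GiB
}

# Registers a thunk may clobber under this revision of the proposal.
THUNK_CLOBBERS = {"t1", "t2"}


def resolve(reloc_type, src_section, dst_section, offset):
    """Return 'direct', 'thunk', or 'error' for one relocation."""
    if reloc_type not in DIRECT_LIMIT:
        raise ValueError("relocation is not thunk-eligible: " + reloc_type)
    limit = DIRECT_LIMIT[reloc_type]
    if -limit <= offset < limit:
        return "direct"
    # Thunks are only permitted on cross-section jumps.
    if src_section != dst_section:
        return "thunk"
    return "error"
```

Under this sketch an out-of-range jump inside one section remains a link error, which matches the intent that intra-section branches never change register-allocation assumptions.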
96bdae0 to 9ee8805
Take 2, now with a better description of use case 2 and with the requested feature of "surprise ABI breaks for currently working code". There are at least three ways to get current gcc or clang to expect t2 to be valid across a
What is the path forward for these? Do we change them to not use |
It looks much better, but I think we need to first answer that question: do we need range extension thunks for RISC-V? Other RISC psABIs required range extension thunks because, without them, they couldn't support a medium code model of 2 GiB binary size. They use only a single instruction for function calls, and therefore the "reach" of function call instructions is limited (typically equal to or less than ±128 MiB). On the other hand, they use longer code sequences for data access, so range extenders are not needed for data loads and stores. In other words, other RISC psABIs required range extension thunks to bridge the discrepancy between the reach of code references and the reach of data references. To build binaries larger than 2 GiB, we usually need to build them with the large code model in the first place so that data references can refer to a location beyond ±2 GiB. RISC-V's medium code model doesn't have the above-mentioned issue because both code and data references can address ±2 GiB. That means the situation in which range extension thunks are useful is very limited; they're useful only when we have code scattered across more than a 2 GiB address range while all data references are within 2 GiB of the code location. There might be a program that fits into that use case, but I honestly think it would be very rare.
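To make the "reach" figures concrete, here is a minimal sketch of the range checks for `jal` and for the `auipc`+`jalr` pair on RV64. The function names are hypothetical; the arithmetic follows the usual hi20/lo12 split, where the low 12 bits are sign-extended and the high part is rounded by adding 0x800.

```python
# Hypothetical helper names; a sketch of the reach arithmetic only.

def fits_jal(offset):
    """jal encodes an even signed 21-bit offset: -1 MiB to +1 MiB - 2."""
    return offset % 2 == 0 and -(1 << 20) <= offset < (1 << 20)


def split_hi20_lo12(offset):
    """Split a PC-relative offset into auipc (hi20) and jalr (lo12) parts.

    jalr sign-extends its 12-bit immediate, so hi20 is rounded by +0x800.
    """
    hi20 = (offset + 0x800) >> 12
    lo12 = offset - (hi20 << 12)
    return hi20, lo12


def fits_auipc_jalr(offset):
    """auipc+jalr reaches the offset if hi20 fits a signed 20-bit field."""
    hi20, lo12 = split_hi20_lo12(offset)
    return -(1 << 19) <= hi20 < (1 << 19) and -2048 <= lo12 < 2048
```

Because of the rounding, the pair's reach is slightly asymmetric: it is shifted 2 KiB toward negative offsets rather than being exactly ±2 GiB.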
No. It's an optimization, albeit an important one that will be painful to retrofit if it isn't considered early.
That's the situation in which range extension thunks are present. They are useful precisely when they are absent: less than 2 GiB of relatively contiguous code, and data scattered across the address space. I expect this to represent the majority of uses of large models. If we want to close the call performance gap between the medium and large models, we need to either define relaxations to turn a 64-bit-range call into a 32-bit-range call and then use them on every call in every input object, or define range extension thunks and not use them. I think that the latter option is better for several reasons:
All of the bullet points appear to be hypothetical and not validated by any actual experience or implementation. Utilizing a long code sequence with a full 64-bit offset for a function call and allowing the linker to relax it would be a logical expansion for the RISC-V large code model, as that's what we are doing for the medium code model. Moreover, the linker relaxation and range extension thunks are independent. The linker is permitted to insert range extension thunks into a program, even with the current psABI, as long as it doesn't violate the ABI's assumptions (i.e., as long as range extension thunks preserve registers just like PLT entries do). Therefore, I believe defining range extension thunks at this moment is not absolutely necessary. To me, it appears that this proposal is too early to be ratified.
I would also favor relaxation from the hypothetical large code model to the current regime as the default approach, given that we already have the relaxation arrow in our quiver. It’s clear to me that consideration (3) from Stef’s original proposal could bear some fruit, though it’s an optimization beyond the broader topic of large code model support.
https://review.gerrithub.io/c/riscv/riscv-go/+/352852/5
Do you consider it a bug that LLVM treats t3/X28 as preserved by call instructions for the purposes of ipra and fastcc, despite the fact that PLT entries clobber t3? When Zicfilp happens, "PLT entries" for address not taken functions without a landing pad will need to be changed to use t2/X7 for software-guarded branch reasons. Does this mean that t2 will be removed from the list of registers preserved across a CALL_PLT? Does this mean that the register used for LLVM trampoline intrinsics and GCC nested functions will have to change to something else? (What external constraint is responsible for Zicfilp using X7 instead of X6 in spite of the ABI problems this causes?)
I guess it depends. ipra and fastcc don't follow the calling convention defined by this psABI, so I'd think they are implementation-defined optimizations.
That needs to be addressed in the Zicfilp spec, no?
ipra and fastcc opt out of the calling convention, but they don't opt out of the relocation scheme, and LLVM generates normal relocatable files with nothing to prevent them from being consumed by a third-party linker that implements the psABI exactly. So I don't think that "it's implementation-defined" is a blanket license to ignore relocation rules.
Are you advocating changing the draft Zicfilp spec from using |
FYI, I think we should reconsider whether Zicfilp should use t1 rather than t2 for the landing pad label register: riscv/riscv-cfi#208
I support range extension thunks, without introducing new relocations or new tags.
- Adding a new tag isn't really helpful, especially for asm code: it's not easy to maintain that new tag in asm code, and it's also not very useful if we add it implicitly via some configuration option or flag.
other input sections may be resolved by the linker to point to a range
extension thunk instead of the target symbol. Range extension thunks will
eventually transfer control to the target symbol, and preserve the contents of
memory and all registers except for `t1` and `t2`.
Use `t1` and `t3`; avoid using `t2` because of the landing pad (Zicfilp).
@@ -735,6 +735,56 @@ that can represent an even signed 21-bit offset (-1MiB to +1MiB-2).
Branch (SB-Type) instructions have a `R_RISCV_BRANCH` relocation that
can represent an even signed 13-bit offset (-4096 to +4094).

==== Range Extension Thunks

`R_RISCV_JAL`, `R_RISCV_CALL`, and `R_RISCV_CALL_PLT` relocations to targets in
I would like to exclude `R_RISCV_JAL`. That relocation is used with `j` and `jal`; `j` is used for jumps within a function, and `jal` is typically used only when the target is known to be very close. Otherwise one should use `call` or `tail` (and hence `R_RISCV_CALL_PLT`).
We are considering three use cases here.
A true large code model needs to support more than 2 GiB of text; data accesses are out of scope for this change but jumps and calls across a range of more than 2 GiB are needed. Most users of a large model will have more than 2 GiB of data but small text, or text with a highly local call pattern, so we want most calls to be able to use the auipc+jalr sequence. This would normally call for relaxation, but relaxation requires object files to contain the longest possible sequence, of which several are possible. Instead, keep the sequences the same and allow thunk insertion.
For executables and shared objects in a Unix environment, most of the code size benefits of relaxation come from call->jal relaxation, not data or TLS relaxation. If the compiler is modified to generate jal instructions instead of call instructions, the code size benefits can be achieved without relaxation at all, but this requires JAL_THUNK to avoid relocation errors at a 1 MiB limit.
If a function has many static call sites in a large binary but is known to be dynamically cold, due to a function attribute or PGO, the call sites can be replaced with jal instructions, sharing a single thunk between all call sites within a 2 MiB text region. This saves code size at small runtime cost.
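As a rough illustration of the code size argument in the third use case, the arithmetic can be sketched as below. The instruction sizes are the standard uncompressed encodings; the 8-byte thunk size is an assumption (a plain auipc+jalr thunk), and the comparison is against call sites that would otherwise stay as unrelaxed auipc+jalr pairs. The function name is hypothetical.

```python
# Back-of-the-envelope sketch; THUNK_BYTES is an assumption, not a spec value.

AUIPC_JALR_BYTES = 8  # two 4-byte instructions per call site
JAL_BYTES = 4         # one 4-byte instruction per call site
THUNK_BYTES = 8       # ASSUMPTION: thunk is a single auipc+jalr pair


def bytes_saved(call_sites, thunk_bytes=THUNK_BYTES):
    """Bytes saved by using jal at every call site plus one shared thunk."""
    long_form = call_sites * AUIPC_JALR_BYTES
    short_form = call_sites * JAL_BYTES + thunk_bytes
    return long_form - short_form
```

With these assumed sizes the scheme breaks even at two call sites per thunk and saves four bytes for every additional site sharing it.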
Restricting the register usage of the thunks is an intentional feature copied from the Go 1.15 toolchain, where every non-leaf function requires a conditional call to runtime.morestack in the prologue; since ra cannot be saved before the stack frame is allocated, the call is performed using t0 as the return register.
There's an argument to be made that we can use thunks for JAL, CALL, and CALL_PLT because linking would fail otherwise. I'd rather not accept the risk of breakage with that. There's also an argument that CALL_PLT is always allowed to use a PLT, and PLT stubs clobber t1 and t2 even with STO_RISCV_VARIANT_CC, so substituting CALL_THUNK for CALL_PLT is always fine. I'm more sympathetic to that, but we need JAL_PLT for 2/3 use cases so why not add both.
I don't have a working PoC or a schedule for preparing a working PoC. As this is my first time on this side of the current process, what level of functional completeness are we looking for? Do I need to cover all three use cases or would it be enough to have a working `clang -mshort-calls` and `lld` combination and results showing a binary size decrease?
Recommend using CALL_THUNK for #388, FDPIC, and any other new code models or sub-ABIs.