ACP: Add floating point representation conversions #501

Open
tczajka opened this issue Dec 8, 2024 · 21 comments
Labels
api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api

Comments

@tczajka

tczajka commented Dec 8, 2024

Proposal

Problem statement

There is currently no easy way to:

  • get an exponent and mantissa from a floating point number
  • construct a floating point number from an exponent and mantissa

Even creating a power of 2 in floating point is not easy, although it is a reasonably useful operation that can be computed exactly. powi can't be used because it has unspecified precision.

The only way is to directly deal with the internal bit representation and use to_bits and from_bits to convert.
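To illustrate what "dealing with the bit representation" means today, here is a minimal sketch that builds an exact power of 2 as an f32 using only from_bits. The function name pow2_f32 is hypothetical and not part of the proposal; the constants assume the IEEE 754 binary32 layout (8 exponent bits with bias 127, 23 fraction bits).

```rust
/// Returns 2^exp as an `f32` when that value is exactly representable,
/// or `None` otherwise. Hypothetical helper, for illustration only.
pub fn pow2_f32(exp: i32) -> Option<f32> {
    match exp {
        // Normal range: biased exponent exp + 127 is in 1..=254, fraction is 0.
        -126..=127 => Some(f32::from_bits(((exp + 127) as u32) << 23)),
        // Subnormal range: biased exponent is 0 and a single fraction bit is set.
        // Fraction bit 0 is worth 2^-149, bit 22 is worth 2^-127.
        -149..=-127 => Some(f32::from_bits(1u32 << (exp + 149))),
        // Too large or too small to be represented exactly.
        _ => None,
    }
}
```

Callers have to know the bias, the field widths, and the subnormal encoding to get this right, which is the kind of detail the proposed API would hide.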

Motivating examples or use cases

This would essentially be a replacement for f32::classify that keeps all the information about the number. f32::classify can be implemented in terms of to_repr.

Converting floating point to and from bignums (e.g. UBig::to_f32) could use this.

Implementing f32::div_euclid in std using integral division could use this.

Users could use this to inspect their floating point numbers, understand their behavior, debug their code, etc.

Solution sketch

pub enum FpRepresentation<Mantissa> {
    // number = 2^exponent * mantissa
    Finite {
        exponent: i32,
        mantissa: Mantissa,
        // This is redundant with the sign of `mantissa` except for negative zero
        sign: Sign,
    },
    Infinity { sign: Sign },
    Nan { nan_type: NanType, payload: Mantissa, sign: Sign },
}

pub enum Sign {
    Positive,
    Negative,
}

pub enum NanType {
    Signaling,
    Quiet,
}

impl f32 {
    pub fn to_repr(self) -> FpRepresentation<i32> { ... }

    // Rounds the representation to the nearest representable number.
    // This will ignore the `sign` field for `Finite` numbers except when `mantissa` is 0.
    pub fn from_repr(repr: FpRepresentation<i32>) -> Self { ... }

    // Returns `None` if the number cannot be represented exactly
    pub fn from_repr_exact(repr: FpRepresentation<i32>) -> Option<Self> { ... }
}

Alternatives

The existing API (to_bits, from_bits) is enough to implement this as an external crate. Still, it seems like it belongs in core because it deals with the essence of how floating point numbers are represented.
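As a rough idea of what such an external implementation could look like, the sketch below decodes an f32 into the proposed enum using only to_bits, and then recovers the classify result from it. The free functions f32_to_repr and classify_repr are hypothetical names; the choice of exponent -149 for zeros, and treating the top fraction bit as the quiet bit that is excluded from the payload, are assumptions of this sketch rather than decisions made in the proposal.

```rust
use std::num::FpCategory;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Sign { Positive, Negative }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NanType { Signaling, Quiet }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FpRepresentation<Mantissa> {
    // number = 2^exponent * mantissa
    Finite { exponent: i32, mantissa: Mantissa, sign: Sign },
    Infinity { sign: Sign },
    Nan { nan_type: NanType, payload: Mantissa, sign: Sign },
}

/// Hypothetical stand-in for the proposed `f32::to_repr`.
pub fn f32_to_repr(x: f32) -> FpRepresentation<i32> {
    let bits = x.to_bits();
    let sign = if bits >> 31 == 0 { Sign::Positive } else { Sign::Negative };
    let biased = ((bits >> 23) & 0xff) as i32; // 8-bit biased exponent
    let frac = (bits & 0x7f_ffff) as i32; // 23-bit fraction field
    match (biased, frac) {
        (0xff, 0) => FpRepresentation::Infinity { sign },
        (0xff, _) => {
            // Assumption: the top fraction bit is the quiet bit.
            let nan_type = if frac & 0x40_0000 != 0 { NanType::Quiet } else { NanType::Signaling };
            FpRepresentation::Nan { nan_type, payload: frac & 0x3f_ffff, sign }
        }
        _ => {
            // Subnormals (and zero) have no implicit leading one bit.
            let (magnitude, exponent) =
                if biased == 0 { (frac, -149) } else { (frac | 0x80_0000, biased - 150) };
            let mantissa = if sign == Sign::Negative { -magnitude } else { magnitude };
            FpRepresentation::Finite { exponent, mantissa, sign }
        }
    }
}

/// Recovers the `f32::classify` category from the representation alone.
pub fn classify_repr(repr: FpRepresentation<i32>) -> FpCategory {
    match repr {
        FpRepresentation::Nan { .. } => FpCategory::Nan,
        FpRepresentation::Infinity { .. } => FpCategory::Infinite,
        FpRepresentation::Finite { mantissa: 0, .. } => FpCategory::Zero,
        // A mantissa without the leading one bit marks a subnormal number.
        FpRepresentation::Finite { mantissa, .. }
            if mantissa.abs() < (1 << (f32::MANTISSA_DIGITS - 1)) => FpCategory::Subnormal,
        FpRepresentation::Finite { .. } => FpCategory::Normal,
    }
}
```

With this decoding, 1.0 comes out as mantissa 0x800000 with exponent -23, and negative zero keeps its sign only through the sign field.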

impl From<f32> for FpRepresentation could be implemented instead of to_repr.

impl TryFrom<FpRepresentation<i32>> for f32 could be implemented instead of from_repr_exact.

An equivalent of C functions ldexp and frexp could be implemented instead. The advantages of the proposed solution over these:

  • Enums are clearer than what these functions do for special values (NaN, infinities)
  • frexp returns mantissa as a floating point number but you typically want to deal with it as an integer (otherwise why convert?), which requires another conversion step

For ldexp the extra conversion step (integer -> fp) is less of a problem because it might be necessary anyway when the mantissa has more bits than can be represented. For example if you want to convert a 64-bit mantissa + exponent into f32 you would have to first convert the 64-bit number into f32 either way.

There could be a separate enum variant for subnormal numbers. This is an unnecessary complication. Subnormal numbers can be distinguished with the proposed API by the fact that they have a small mantissa (mantissa.abs() < 1 << (MANTISSA_DIGITS - 1)) and often don't need separate logic.

Links and related work

internals.rust-lang.org thread

@tgross35

tgross35 commented Dec 9, 2024

Having some sort of way to deconstruct and reconstruct floats would be a nice convenience for soft float work. I'm not positive that mixing classification and de/reconstruction is better than .from_repr(self) -> (Sign, i32, Self::Int) or possibly some specific constructor methods, but it does seem nice to be able to match on the float classification.

Making mantissa an unsigned integer would also eliminate the redundancy with Sign. IME this is more useful anyway since it's common to figure out signs separately from the rest of the operation (e.g. figure out the sign of the result first and then do an unsigned multiplication algorithm).

If there is a need for nan_type, it should just be a method on FpRepresentation rather than a field redundant with payload. NonZero<Self::Int> for the payload was brought up on IRLO and might be reasonable (doesn't help with the overflow case).

For naming I'd prefer FloatRepr since "float" is more clear than "fp", and we're already using the repr abbreviation in the methods.

    // Rounds the representation to the nearest representable number.
    // This will ignore the `sign` field for `Finite` numbers except when `mantissa` is 0.
    pub fn from_repr(repr: FpRepresentation<i32>) -> Self { ... }

What exactly would this expand to? If it casts the mantissa to the float and then applies the exponent, it seems better to just provide a set_exponent(self, i32) -> Self method and let the user do that themselves. I'd rather the default method only mask and shift rather than potentially involving another softfloat algorithm.

@tczajka
Author

tczajka commented Dec 9, 2024

I'm not positive that mixing classification and de/reconstruction is better than .from_repr(self) -> (Sign, i32, Self::Int) or possibly some specific constructor methods, but it does seem nice to be able to match on the float classification.

Classification into categories is part of deconstruction though. Would this method work on infinity, giving the max exponent and zero mantissa? The user would have to know to use another method to check for infinity first, or that max exponent and zero mantissa encodes infinity, either way in effect doing part of the deconstruction manually.

Making mantissa an unsigned integer would also eliminate the redundancy with Sign.

Yes, I think an unsigned mantissa is reasonable; it would clean up the API.

If there is a need for nan_type, it should just be a method on FpRepresentation rather than a field redundant with payload.

My thinking was that the "signaling" bit would not be included in payload.

What exactly would this expand to? If it casts the mantissa to the float and then applies the exponent, it seems better to just provide a set_exponent(self, i32) -> Self method and let the user do that themselves.

Yes that makes sense. More useful would be to add to the exponent (rather than setting it). The function can then be called mul_pow2. This is what the C function ldexp does.
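For reference, a mul_pow2 along these lines can be written today on top of from_bits. The sketch below follows the stepwise scaling approach used by some C library scalbnf implementations: large shifts are applied in a few chunks, so intermediate values stay representable, and the chunks for negative shifts are chosen with the intent of avoiding double rounding in the subnormal range. The function body is an assumption about one possible implementation, not the proposed one.

```rust
/// Multiplies `x` by 2^n, in the spirit of C's `ldexp`/`scalbn`.
/// Hypothetical free-function sketch of the `mul_pow2` idea.
pub fn mul_pow2(x: f32, mut n: i32) -> f32 {
    let two_p127 = f32::from_bits(254 << 23); // 2^127
    let two_m102 = f32::from_bits(25 << 23); // 2^-102 = 2^-126 * 2^24
    let mut y = x;
    if n > 127 {
        // Scale up in chunks of 2^127; anything left beyond that overflows anyway.
        y *= two_p127;
        n -= 127;
        if n > 127 {
            y *= two_p127;
            n -= 127;
            if n > 127 {
                n = 127;
            }
        }
    } else if n < -126 {
        // Scale down in chunks of 2^-102, keeping 24 bits of headroom so the
        // intermediate product stays a normal number.
        y *= two_m102;
        n += 102;
        if n < -126 {
            y *= two_m102;
            n += 102;
            if n < -126 {
                n = -126;
            }
        }
    }
    // n is now in -126..=127, so 2^n is a normal f32 built directly from bits.
    let scale = f32::from_bits(((127 + n) as u32) << 23);
    y * scale
}
```

Note that this only multiplies and builds constants from bits; it never has to split `x` into fields, which is why it can be offered independently of any representation type.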

There can still be a from_repr(FloatRepr<i32>) -> Option<Self> that doesn't round, so that you can round-trip to and from representation.

@pitaj

pitaj commented Dec 9, 2024

This seems like a great application for arbitrary-bitwidth integers. Then the extracted mantissa and exponent can be exactly the precision encoded in the float.

@BartMassey

I am currently working on this as I re-implement floating div_euclid() and rem_euclid(). My sketch is pretty similar to the proposed one. Give me a day or two and I'll post it here.

@joshtriplett
Member

I'm not sure we should have this level of complexity in the standard library, combining the classification and the to/from parts operations into one with an enum like this.

For the purposes many people might want this for, I'd be inclined to have a simpler to_parts and from_parts version for each float type that breaks it out into a tuple. That operation is a simple bit-shift, no conditionals or classification or enum construction.

As noted earlier in this thread, arbitrary-width integers would make this much nicer. But in the absence of that, we can pick an appropriate wider type, and truncate input values (or assert in debug mode).
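A minimal sketch of what such a tuple-based pair could look like for f32 follows. The names f32_to_parts and f32_from_parts, the field order, and the choice of bool/u8/u32 as field types are all assumptions for illustration; the fields are the raw encoded values (biased exponent, fraction without the implicit bit), so no classification is involved.

```rust
/// Splits an `f32` into (sign bit, biased exponent, fraction field).
/// Hypothetical sketch: pure shifts and masks, no branches.
pub fn f32_to_parts(x: f32) -> (bool, u8, u32) {
    let bits = x.to_bits();
    // `as u8` keeps the low 8 bits of `bits >> 23`, dropping the sign bit.
    (bits >> 31 != 0, (bits >> 23) as u8, bits & 0x7f_ffff)
}

/// Reassembles an `f32` from raw fields. Fraction bits above bit 22 are
/// truncated in release builds and rejected by an assertion in debug builds.
pub fn f32_from_parts(negative: bool, exponent: u8, fraction: u32) -> f32 {
    debug_assert!(fraction <= 0x7f_ffff, "fraction does not fit in 23 bits");
    f32::from_bits(((negative as u32) << 31) | ((exponent as u32) << 23) | (fraction & 0x7f_ffff))
}
```

Interpreting the result (subnormals, infinities, NaNs, the implicit one bit) is left entirely to the caller, which is the trade-off being debated in this thread.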

cc @BartMassey

@BartMassey

BartMassey commented Dec 10, 2024

Do keep in mind that something usually has to renormalize denormalized numbers and adjust the exponent accordingly, and also add the implicit one bit to normalized numbers. Most things working with these numbers will want to do that. So some amount of classification will be done internally anyhow, even if it is then thrown away.

The num crate has Float::integer_decode() that we could steal if desired. It provides an interface that is more like what @joshtriplett asked for. Here's the f32 implementation:

fn integer_decode_f32(f: f32) -> (u64, i16, i8) {
    let bits: u32 = f.to_bits();
    let sign: i8 = if bits >> 31 == 0 { 1 } else { -1 };
    let mut exponent: i16 = ((bits >> 23) & 0xff) as i16;
    let mantissa = if exponent == 0 {
        (bits & 0x7fffff) << 1
    } else {
        (bits & 0x7fffff) | 0x800000
    };
    // Exponent bias + mantissa shift
    exponent -= 127 + 23;
    (mantissa as u64, exponent, sign)
}

I'd probably go a little farther and actually normalize denorms there. I could live with this interface, though I would mildly prefer something a little more Rustic.
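One way to "normalize denorms" on top of the same bit layout is sketched below: for subnormal inputs the fraction is shifted left until its top bit sits where the implicit one bit would be, and the exponent is lowered by the same amount. The function name, the (u32, i32, bool) return shape, and returning None for infinities and NaNs are assumptions of this sketch, not the interface of the num crate or of the float-parts crate.

```rust
/// Decodes a finite `f32` as (mantissa, exponent, is_negative) such that
/// |value| = mantissa * 2^exponent, with bit 23 of the mantissa always set
/// for nonzero values. Returns `None` for infinities and NaNs.
/// Hypothetical sketch for illustration.
pub fn decode_normalized_f32(f: f32) -> Option<(u32, i32, bool)> {
    let bits = f.to_bits();
    let negative = bits >> 31 != 0;
    let biased = ((bits >> 23) & 0xff) as i32;
    let frac = bits & 0x7f_ffff;
    if biased == 0xff {
        return None; // infinity or NaN: no finite mantissa/exponent pair
    }
    if biased == 0 {
        if frac == 0 {
            return Some((0, 0, negative)); // positive or negative zero
        }
        // Subnormal: value = frac * 2^-149. Shift the highest set bit of
        // `frac` up to bit 23 and compensate in the exponent.
        let shift = frac.leading_zeros() - 8;
        return Some((frac << shift, -149 - shift as i32, negative));
    }
    // Normal: restore the implicit one bit; remove bias (127) and mantissa shift (23).
    Some((frac | 0x80_0000, biased - 150, negative))
}
```

After this step every nonzero result has a 24-bit mantissa, so downstream code does not need a separate path for subnormal inputs.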

Thoughts?

@BartMassey

I'd also probably left-justify the significand. It's normally what you want, I think? Also, recognizing NaN and ±Inf is not really a thing here because you have to know the exponent adjustment: that's going to have to be done outside, I think.

@BartMassey

Also, this design pretty much locks you into f64 as the maximum size, because of the hard-coded types. Idk if f96 or f128 or something is ever going to be a thing…

@tgross35

Idk if f96 or f128 or something is ever going to be a thing…

Yes :) num's API is a bit stuck because of the concrete u64 type in the trait, but for us we should use the float-sized integer for anything representing the mantissa.

@BartMassey

BartMassey commented Dec 10, 2024

Also that sign calculation could be done without the test.

let sign = 1 - ((bits >> 30) & 2) as i8;

Might be slower, but would conserve a branch predictor.
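For comparison, here are the two sign computations side by side as standalone functions operating on the raw f32 bits. The function names are made up for this sketch; whether the branchless form is actually faster depends on the target and on what the compiler does with the branching form.

```rust
/// Sign as +1 / -1 using a comparison, as in the `integer_decode_f32` code above.
pub fn sign_branching(bits: u32) -> i8 {
    if bits >> 31 == 0 { 1 } else { -1 }
}

/// Same result without a comparison: `(bits >> 30) & 2` is 2 when the sign
/// bit is set and 0 otherwise, so the result is 1 - 2 = -1 or 1 - 0 = 1.
pub fn sign_branchless(bits: u32) -> i8 {
    1 - ((bits >> 30) & 2) as i8
}
```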

@BartMassey

Alright. Give me a couple of hours, and I'll try to propose something everyone can live with and we can go from there.

@BartMassey

Here's what I've got so far. Still needs some work.

https://github.com/BartMassey/float-parts

My plan is to generalize the f32 implementation to provide a default implementation of to_float_parts() for any IEEE floating-point type.

I think we should also provide an unsafe method to go the other way; it's practically free, and would be useful. Maybe a safe checked method also.

@BartMassey

Ok, I've checked in a version that does both f32 and f64 and is pretty close to done, I think.

https://github.com/BartMassey/float-parts

I gave up on the default function because the lack of concrete types in the trait definition was too painful: used a macro instead.

I'll wait on review, bikeshedding, and response to my suggestions about adding the inverse function.

@BartMassey

Of course, this could just be concrete functions or methods on each of the std floating types. I wrote it this way mostly for use as an external library, since I'm also working on div_euclid()/rem_euclid() and need this anyhow.

@tczajka
Author

tczajka commented Dec 11, 2024

just provide a set_exponent(self, i32) -> Self

This could be the << operator on float and i32:

/// Works for positive and negative integers.
impl Shl<i32> for f32

@Amanieu
Member

Amanieu commented Jan 14, 2025

@BartMassey The implementation looks good, however in the libs-api meeting we thought that the API would be better as separate methods to extract the mantissa and exponent rather than a single method that returns all 3 parts (we already have ways to extract the sign bit). This would make it easier to document the behavior of each of these methods.

@BartMassey

@Amanieu Sounds good! Thanks much for the review. (Out of curiosity, what's the official way to collect the sign bit currently?)

@tgross35

.is_sign_negative() or .is_sign_positive() do the mask and compare

@BartMassey

To be honest, I think the convenience of something like .sign_bit() with the appropriate return type might be worth it? Right now I'd see myself writing something like .is_sign_positive() as u32, which doesn't seem ideal but would work I guess.

@tgross35

What would the use case for that be? I think most times you just want to check whether you should branch based on the sign, or you want a signed -1 / +1 to multiply something by later. But a sign bit in the repr's LSB doesn't seem all that useful.
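Both forms of the sign discussed here can already be obtained from stable methods, as the sketch below shows. The helper names sign_bit and sign_unit are hypothetical; note that the raw sign bit corresponds to is_sign_negative, not is_sign_positive.

```rust
/// The raw sign bit as 0 or 1, equivalent to `x.to_bits() >> 31`.
/// Hypothetical helper built on the existing `is_sign_negative`.
pub fn sign_bit(x: f32) -> u32 {
    x.is_sign_negative() as u32
}

/// A signed unit (+1.0 or -1.0) carrying the sign of `x`, suitable for
/// multiplying into a result later. Unlike a `x < 0.0` test, this also
/// reports -1.0 for negative zero.
pub fn sign_unit(x: f32) -> f32 {
    1.0f32.copysign(x)
}
```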

@BartMassey

If you're going on to reconstruct the float somehow it can be helpful. But yeah, we'll leave it for now.

6 participants