Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement floats #65

Draft
wants to merge 29 commits into
base: master
Choose a base branch
from
Draft

Implement floats #65

wants to merge 29 commits into from

Conversation

ruuda
Copy link
Owner

@ruuda ruuda commented Aug 10, 2024

This is a work in progress, I’ll update the description later.

  • TODO: Write something about the implementation (decimal, rational, ...)
  • TODO: Write something about the types (Float, Num)
  • Find "json superset" caveats and update them — it’s finally practically true.
  • Ensure the fuzzer discovers new interesting code paths and that dictionaries are up to date.
  • Ensure test coverage.
  • Ensure grammars are up to date, for types as well as syntax.
  • Ensure documentation is up to date.

Open questions

How should (in)equality between numbers interact with their identity? If we allow 10 < 11.0 to typecheck and produce the result you’d expect, then we should also allow 10 == 10.0, and {10, 10.0} But what should they produce? Numerically they are equal, but as runtime::Value, they are different values. I see two possible ways forward:

  • Erode the difference between Int and Float further. Possibly only keep a Num type after all and everything is a float. 10 and 10.0 are equal, {10, 10.0} == {10}. This probably means that we print all numbers that are ints without suffix, which means that processing json is no longer “lossless” in the sense that we'd convert [10.0, 10] into [10, 10]. That sounds like something that will cause problems, hmmm ...
  • Do not allow Int and Float to mix so implicitly. 10 < 11.0 is a type error. 10 == 10.0 is false for the same reason that 10 == "10" is false. (Which makes sense and is consistent in one way, but may be surprising in another.) {10, 10.0} is a set with two elements.

I’m leaning towards the second one, but on the other hand, am I making my own life difficult? Why have both Int and Float? I think primarily to add additional type-safety in schemas (for things like ports, or counts, non-integers make no sense). But if that’s the argument, then should there also be UInt? Uint16? Fin n? ...

What does Python do here? It has different runtime representations, but considers the values equal, and for sets the first value is preserved:

>>> {10, 10.0}
{10}
>>> {10.0, 10}
{10.0}

Cue treats 10.0 as not an int, even though it is equal to the integer 10:

$ cue eval -
x: int
x: 10.0
x: conflicting values int and 10.0 (mismatched types int and float):
    -:1:4
    -:2:4

$ cue eval -
10.0 == 10
true

Thinking out loud, a possible solution may be:

  • There are only two numeric types: Int and Float (or maybe Float should be called Num), and Int is a subtype of the more general one.
  • There is only one runtime representation, all numbers are implemented as Decimal.
  • Decimal should preserve some presentation information that does not affect its identity, in particular the number of decimals. This way 10.0 and 10 can both be represented, and preserved exactly, for lossless json processing.
  • For runtime checks, if we have a Num and expect an Int, we need to do more than just a type check; we should also change the presentation attribute to render it with zero decimals. This feels ad-hoc though. An alternative is to do what Cue does and verify the presentation attribute as part of the type check.

ruuda added 22 commits August 24, 2024 12:44
The one thing that prevents that right now is floats, and the fuzzer
discovered it within a few seconds:

  ╭──────╴ Opcode (hex)
  │  ╭───╴ Argument (hex)
  │  │  ╭╴ Operation, argument (decimal)
  26 03 ExprPushInput, 3
        take_str, 3   → "4e2"
  e6 01 ModeJsonSuperset, 1
  EvalJsonSuperset -->
  4e2
It doesn't add functionality to deal with exponents or decimal points,
it only moves stuff into a function. The test currently has the wrong
expectation with an error in there.
Let's finally add the float type. After some deliberation, I think I
want to represent them as decimals in scientific notation internally,
unless you do a division at which point they will turn into a rational.
I can turn the rational back into a decimal (which may be lossy) in
order to format it as a string; the rational itself is an implementation
detail, but it does allow us to get certain computations exactly right
where a float would not.
It contains the Decimal type after all, and right now I don't have
anything named "float" either way.
Why the naming discrepancy? My thinking is that I want both Decimal and
Rational as internal representation, so Value can be decimal or rational.
But what is a good name for "decimal or rational"? I can think of
"number", but I want to reserve that for the supertype of int and this
one. "Rational" would be technically correct because decimals are also
rationals, just with a power of 10 denominator. But then what if I put
a sqrt function in there, or an approximation of pi? Sure technically
they are approximations, but I think this would _also_ be weird. So
let's go with float ... Also, this is what Cue calls them so there is
precedent.
RCL rejects 9223372036854775807.576460752303423487 with an overflow
error, but I think that is fine, I don't want to lose precision on
inputs.
This adds the type and the relations, but it doesn't add all the tests,
fuzzer dictionary, documentation, etc. I'll do all of that later as part
of this same feature branch.
This should get us one step closer to json compatibility, maybe even the
final step.
The choice I went with is to have a 16-bit exponent, which gives RCL's
float/decimal type more range than a regular f64. Now the fuzzer can
generate an input with a large exponent, and RCL will happily echo it,
and it's technically syntactically valid json, but Serde rejects it with
"number out of range" (in the same way that RCL rejects some numbers as
overflow). So add an exception for this mismatch.
Now that Rustfmt formatted them tall anyway, it's probably best to keep
them sorted.
This removes one case of incompatibility with Serde. If you write a
float literal that is too precise to be represented exactly, then we now
silently round it rather than treating it as an overflow error. I think
this is acceptable because if you are in the case where you care about
numbers to 19 significant digits then probably RCL is not the best tool
for what you are doing, but the case where we encounter some arbitrary
json that we want to query with "rcl jq" and it happens to have some
humongous float in it, that is probably more likely. Python handles
float literals in this way too so I think it's okay.
This is tricky stuff. Better write a lot of fuzz tests for this later.
Ugh this is such a rabbit hole, and it looks like I am putting it full
of special cases, I have a feeling it could be way more elegant.
This adds back the exception that was removed by allowing float parsing
imprecision, though in a more limited form initially because it only
affected exponents.

But after running the fuzzer for a bit longer, it also affects large
integers, so we are back to the start, overflow is just an intentional
incompatibility.
This was a to do, the fuzzer found it pretty quickly. The exact case it
found that triggered it was this:

    {5.001,5.001e97,}
RCL can handle larger exponents on floats, we have to admit that then.
Surrogate pairs are not supported by RCL on purpose, so when that can be
parsed by Serde but is rejected by RCL, we shouldn't fail the fuzzer on
it.
ruuda added 6 commits August 31, 2024 13:12
This overflow was discovered by the fuzz_source fuzzer. The test
currently fails.
See also the parent commit that adds the test. The test now passes.
The golden test doesn't hurt, but this is one of the few places where we
can do a reasonable unit test as well.
This is only the start, but let's verify Decimal::cmp against f64::cmp.
It instantly finds an input where they disagree:

	Compare {
	    a: NormalF64(
	        -0.16406250000007813,
	    ),
	    b: NormalF64(
	        0.0,
	    ),
	}
It turns out that the case that the fuzzer found was already one that my
past self marked as to do.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant