Skip to content

Egg #2: Erasure annotations

Ahmad Salim Al-Sibahi edited this page Feb 7, 2014 · 10 revisions

Outline

Motivation

We need erasure

Consider the following Nat-indexed type family representing binary numbers:

data Bin : Nat -> Type where
  N : Bin 0
  O : {n : Nat} -> Bin n -> Bin (0 + 2*n)
  I : {n : Nat} -> Bin n -> Bin (1 + 2*n)

These are supposed to be (at least asymptotically) fast and memory-efficient because their size is logarithmic compared to the numbers they represent.

Unfortunately this is not the case. The problem is that these binary numbers still carry the unary indices with them, performing arithmetic on the indices whenever arithmetic is done on the binary numbers themselves. Hence the real representation of the number 15 looks like this:

I -> I -> I -> I -> N
S    S    S    Z
S    S    Z    
S    S    
S    Z
S    
S
S
Z

The used memory is actually linear, not logarithmic and therefore we cannot get below O(n) with time complexities.

One could argue that Idris in fact compiles Nats via GMP but that's a moot point for two reasons:

  • First, whenever we try to index our datastructures with anything else than Nats, the compiler is not going to come to the rescue.
  • Second, even with Nats, the GMP integers are still there and they slow the runtime down.

This ought not to be the case since the Nats are never used at runtime and they are only there for typechecking purposes. Hence we should get rid of them and get runtime code similar to what a Haskell programmer would write.

Erasure in dependently typed languages

There are several approaches to erasure: automatic erasure (Idris), the Prop universe (Coq), irrelevant types (Agda).

Erasure cannot be fully automatic (Idris)

I'm not going to discuss what Idris can erase (I can elaborate for documentational purposes if required) but let me instead use an example to show that fully automatic erasure is troublesome.

Consider the type of U-lists:

data U : List a -> Type where
  nil : U []
  one : (z : a) -> U [z]
  two : {xs : List a} -> (x : a) -> (u : U xs) -> (y : a) -> U (x :: xs ++ [y])

For better intuition, the shape of the U-list representing [x₀,x₁,x₂,z,y₂,y₁,y₀] looks like this:

  x₀   y₀    (two)
  x₁   y₁    (two)
  x₂   y₂    (two)
     z       (one)

Let's look at the constructor two. The main statement is that while x can be erased because it's forced and xs can be erased because it can be reconstructed from u, we cannot erase both. The reason is that the function that reconstructs xs from u necessarily needs all the x-es all the way down to do its job. However, these x-es cannot be forced because it's the index that we're trying to calculate now.

Hence, the compiler would have to choose between erasing x or xs and while some criteria could be invented (like expected size of the term), I argue that it is not a good idea, especially when the choice

  • has a significant performance impact;
  • would be made silently behind the scenes;
  • on very subtle grounds (do you see readily what's going on from the code?).

Prop is cumbersome (Coq)

The core idea of having Prop is having a universe of types of values that are designed to be unable to affect what happens at runtime. The typechecker just checks that values from Type don't depend on values from Prop and code extraction simply erases everything that belongs to Prop. This sounds exactly like what we need to do.

However, the realisation of this idea as a separate universe brings various issues.

  • The duality pervades the language; there are separate induction principles for Prop and Type, there are separate sigma-types for Prop and Type etc., which is — I believe — what Edwin wants to avoid in a practical programming language.
  • Having a Prop still does not resolve all problems because vectors are still indexed with Nat from Type[1] and thus these numbers are still present at runtime. I'm unsure whether a Prop version of Nat could be defined, but in Coq, it apparrently isn't. Anyway, having two Nat types is something we probably want to avoid.

Note. Vanilla Coq does not have proof irrelevance (i.e. equality of any two proofs of the same proposition in Prop) but this can be postulated as an axiom.

Irrelevant types are too much (Agda)

Type irrelevance is cousin to type squashing, i.e. asserting that any two inhabitants of a type are equal. Because of that, these values cannot affect the runtime and they can be erased.

(I'm not completely sure how exactly different irrelevant types are from type squashing/truncation as we know it from HoTT, but I believe that's a good approximation, at least for our purposes. The Agda manual does not discuss these issues.[2])

One difference to Prop is that a data type is not intrinsically relevant or irrelevant; any type can be made irrelevant whenever convenient, which lifts the burden of the Prop/Type duality. In the following example, dots mark irrelevant arguments:

data Bin : .(Nat) -> Set where
  N : Bin 0
  O : .{n : Nat} -> Bin n -> Bin (0 + 2*n)
  I : .{n : Nat} -> Bin n -> Bin (1 + 2*n)

In the above example, the Nat-typed arguments can be erased.

Also, values of irrelevant types are automatically considered equal, which is useful when packing proofs in records. In Coq, one must either postulate proof irrelevance or not include proofs in records being compared for equality.

However, irrelevant types are not suitable for our purposes, either. We don't really want irrelevance, we want erasure. Irrelevance is too strong and it reduces the informational content of the type to mere existence already before the typechecking stage. For example, to get the indices erased from Bin, one would try to index them with the irrelevant .Nat. However, this also destroys the ability to reason with the exact values of these indices and expressing statements about them, as shown in the following snippet, that typechecks in Agda:

data ireq {A : Set} .(x : A) : .A -> Set where
  irefl : ireq x x

-- every two irrelevant x, y are equal
squash : {A : Set} .(x y : A) -> ireq x y
squash x y = irefl

Other considerations

We would like to avoid changing the typechecker or the core type theory of Idris.

The proposal: erasure annotations

What they look like

For the reasons summed up in the headings of the previous sections, I propose erasure annotations. These are annotations that can be put on fields of data declarations:

  • constructor arguments;
  • record fields;
  • typeclass methods;
  • and (optionally but usefully, IMHO) function arguments.

Note the dot marking xs as erased (the syntax is irrelevant at this point):

-- Erasure annotation at a constructor argument:
data U : List a -> Type where
  -- (...)
  two : .{xs : List a} -> (x : a) -> U xs -> (y : a) -> U (x :: xs ++ [y])

-- Erasure annotation at a function argument:
left : .{xs : List a} -> U xs -> List a
left  nil        = []
left (one z)     = []
left (two x u y) = x :: left u

I'd also find useful making unbound implicits erased by default, too.

What they mean

Let us introduce a new adjective referring to variables/values/expressions: "erased". The ultimate aim is to do away with these entities at runtime at all — erased arguments of constructors will not exist at runtime, erased arguments of functions will not exist at runtime as well.

Erased values are not accessible at runtime, they will not be forced or reconstructed. They simply may not be used in code that is intended to exist at runtime.

  1. A value marked as erased in source code is erased.
  2. A function matching on an erased argument is erased.
  3. A function returning an erased value in at least one clause is erased.
  4. An application of an erased function, or a function to an erased argument is erased.
  5. All other values are not erased.
  6. Correctness is enforced through the type of main, which is not erased.

An interesting consequence is that erased functions can be omitted from the runtime code completely because they can only get called in an erased context.

How it works

  1. The erasure-annotated code is typechecked completely normally, ignoring the annotations.
  2. The typechecked code is erasure-checked according to the above rules. This ensures that erased data is never used at runtime.
  3. Once erasure is confirmed, erased entities are removed from the runtime code.
  4. Other optimisations (collapsibility, forcing, further automatic erasure) are performed on the resulting code, yielding even more optimisation.

Discussion

One of the most important consequences is that erasure is just an optimisation, it does not change the type theory.

  • Adding erasure annotations to the compiler will not affect existing code in any way. Unannotated code will typecheck if and only if it typechecked before changing the compiler.
  • In erasure-annotated .idr code, it is always possible to remove or ignore the annotations from code and this code will typecheck if and only if the annotated code typechecks.
  • Hence one may use something like --ignore-erasure-annotations to get the same code that the compiler produces today.

We get the best of all related areas:

  • we don't get any duality between Prop and Type, usage is as simple as with irrelevant types;
  • however, erased values are not irrelevant and the typechecker sees them as absolutely normal values;
  • erased values are guaranteed to be omitted at runtime;
  • the core type theory and the typechecker is not touched at all.

Erasure annotations cannot (always) be inferred

I know that Edwin strives for a compiler that infers as much as possible. In the ideal world, the compiler would infer erasure annotations, too, by examining the call/usage graph. This may be used to infer erasure annotations for function declarations.

However, it is impossible to infer erasure annotations for data types. Simply putting a data type in a library would prevent anything from being erased because one can never know what parts of the data type a user of the library might access. (Note that this is not the case with annotations on library functions.)

Hence we need some annotations on the data types and, perhaps, allowing explicit annotations on functions might help users understand and debug their code better.

After all, I find the trade-off good. All the programmer needs to do is to put a few dots in the datatype declarations (especially if unbound implicits are made erased) — and the return in runtime efficiency is great.

Implementation considerations

Implementation would consist of:

  • changing the parser;
    • so that it recognises erasure annotations
  • adapting the internal data structures for erasure annotations;
    • data constructors and functions would probably just get some additional IntSet saying which arguments are erased;
    • especially, we don't want to have Erase : Type -> Type;
  • adding the erasure checker;
    • probably a bit of work
  • adding a flag to turn the erasure checker off.

Optional extensions

  1. Every value of the type Type n for all n (= all types) could be made erased.
  2. Unbound implicits could be made erased by default.
  3. As an alternative to erasing unbound implicits, we could abbreviate .{x : A} -> B x as B .x.
  4. Erasure annotations on arguments of functions could be left out.
Clone this wiki locally