Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fast and correct #22

Open
wants to merge 81 commits into
base: master
Choose a base branch
from
Open

Fast and correct #22

wants to merge 81 commits into from

Conversation

rinde
Copy link
Contributor

@rinde rinde commented Apr 18, 2024

First of all, thank you for reading this. I know this PR is very big and contains a lot of breaking changes. I will try my best to explain the major changes in this description. The changes in this PR are the result of an iterative development process over the past 2 months. At my company we have been linking to this PR (and it's successor) in an actively developed codebase for the past month and that usage has informed a lot of the changes. If something is still unclear, please reach out.

Fast and correct

As we discussed before, the current API has issues with speed and soundness. The design put forward in this PR fixes both.

In the new design, the non-empty collection essentially encapsulates the corresponding possibly empty collection, for example:

pub struct NEVec<T> {
    inner: Vec<T>,
}

The implementation of NEVec forwards all calls to the underlying Vec while guaranteeing non-emptiness. This simplified inner representation also greatly simplifies a lot of complexity that previously existed. Take for example Vec::insert():

pub fn insert(&mut self, index: usize, element: T) {
        let len = self.len().get();
        assert!(index <= len);

        if index == 0 {
            let head = std::mem::replace(&mut self.head, element);
            self.tail.insert(0, head);
        } else {
            self.tail.insert(index - 1, element);
        }
 }

This is now:

pub fn insert(&mut self, index: usize, element: T) {
        self.inner.insert(index, element);
}

Advantages of this approach are:

  • a lower chance of introducing bugs
  • a lower chance of introducing inefficiencies
  • lower maintenance burden (less code)

Non-emptiness is enforced by ensuring that an NEVec can never be constructed without any elements (quite similar to before), and by ensuring that the inner data structure can not be accessed (i.e. it is properly encapsulated).

For NEVec I added a benchmark in benches/vec.rs that tests that performance of NEVec is equal to Vec. Because of this new simplified design I belief that the other datastructures are also on par with their possibly empty counterpart. Additionally, the codebase at work gets regularly benchmarked and there we found no performance impact with this new version (we mostly use NEVec and NEIndexMap), this is contrary to the current version which slowed our code down noticeably.

A sound NonEmptyIterator

As discussed in #16 there are some problems with the current implementation, most notably:

  • Calling the methods next() and first() in the incorrect order could lead to data loss
  • Calling any of NonEmptyIterators method after calling next() at least once, leads to unexpected results because the first value is always included even if the iterator has already passed it.

Additionally I noticed that to implement the NonEmptyIterator a lot of repetitive boiler plate code is required.

In this PR, the NonEmptyIterator looks as follows:

pub trait NonEmptyIterator: IntoIterator {
    fn next(self) -> (Self::Item, Self::IntoIter)
    where
        Self: Sized,
    {
        let mut iter = self.into_iter();
        (iter.next().unwrap(), iter)
    }

    // all other methods
}

Note that NonEmptyIterator is bound by IntoIterator and that all its methods are already implemented, this means that it is a marker trait and therefore trivial to implement. This allows for very simple implementations of non-empty iterators, take for example Iter of Vec:

pub struct Iter<'a, T: 'a> {
    iter: std::slice::Iter<'a, T>,
}

impl<T> NonEmptyIterator for Iter<'_, T> {}

impl<'a, T> IntoIterator for Iter<'a, T> {
    type Item = &'a T;

    type IntoIter = std::slice::Iter<'a, T>;

    fn into_iter(self) -> Self::IntoIter {
        self.iter
    }
}

This is the entire implementation.

All NonEmptyIterators are safe to use when they are created by a non-empty collection because in that case it is guaranteed that the underlying iterator (std::slice::Iter<'a, T> in the above example) is non-empty. In case an implementor would make a mistake and create a NonEmptyIterator that is in fact empty, a panic would occur when calling next(). This means that such mistakes are easy to detect and test for.

Other changes

New features

In our codebase we converted some datastructures to their non-empty equivalent. During this refactoring we found missing functionality in this library. To fill the gaps, I created two new features:

  • itertools: adds the NonEmptyItertools trait similar to the Itertools trait for Iterator. So far it only contains a few methods that we needed. It is easy to extend this implementation to cover more of itertools in the future.
  • either: this feature adds the NEEither enum, which is a non-empty variant of the Either enum.

Linting, formatting, and justfile

Cargo.toml now contains configuration for both Rust lints and Clippy lints. I've updated the code where needed to satisfy the lints. In order to ensure that lints don't get violated accidentally, they are automatically checked as part of CI.

I've pinned the Rust toolchain to the most recent version (at the time of writing). This is to ensure that the build is stable. If we don't do this, it could be that a new version of Rust is released with new lints and that will break the build which will prohibit any updates/releases until it is fixed. Pinning the toolchain allows for choosing the time when to update to the new Rust version, which avoids breaking the build.

I added a justfile which can be run by the just runner. I've defined several useful commands for the project such as test, fmt, and lint. I've also updated CI to use these definitions.

The fmt command formats all Rust code. There is a rustfmt.toml file that configures the formatter. To run the code formatter a version of Rust nightly is required, but this introduces no dependency on nightly for users of the library. I've added just install-nightly for easy installing the correct version. Part of the lint command is also a formatting check, this ensures that all code in the repository follows consistent formatting. You will see some minor changes in the diff of this PR that fixes some formatting inconsistencies.

Breaking changes

I updated the changelog with every noteworthy change.

Follow-up

I've continued my work in a follow-up PR.

@rinde rinde marked this pull request as ready for review May 21, 2024 14:40
@fosskers
Copy link
Owner

Hi, thank you for this. I'm back from my trip. I will be addressing this after finishing up some Aura work and releasing Beta 3 of that. Thanks again for your patience.

@rinde
Copy link
Contributor Author

rinde commented Jun 10, 2024

Thanks for the transparency!

@fosskers
Copy link
Owner

fosskers commented Aug 9, 2024

Hi, I've completed the major project I had been working on, and it's time to return to this. In the interim you would have noticed that I've patched the library a few times for work purposes, and there are some conflicts now. We'll need to resolve those.

Conflicts aside, I will begin to review this in earnest. To confirm: this is the PR I should be looking at? What's the status of the other two PRs?

@rinde
Copy link
Contributor Author

rinde commented Sep 6, 2024

Hi @fosskers , yes this is the PR to review first. There is already a follow-up PR that we use internally but I think it makes sense to address that one separately once we reached consensus on this one.

I'll have a look at the changes that happened on master and try to port them over to this PR.

@rinde
Copy link
Contributor Author

rinde commented Sep 6, 2024

I just updated this PR to be in sync with master again. I removed the group_by function temporarily because I ran out of time to fix it. I think what needs to happen is to re-add it but in the new itertools file now. I think that it should be possible to wrap the itertools implementation to avoid having to essentially re-implement the group_by function from itertools. I hope this gets clear once you have reviewed the rest of this PR. I may have some time to work on the group_by sometime in the coming weeks, but it might also be instructive for you to re-implement it in the new structure that I'm proposing in this PR.

I'm back from parental leave now, and I hope we can use the coming weeks to push this over the finish line to avoid doing double work again. I'm happy to answer any questions you may have.

@fosskers
Copy link
Owner

fosskers commented Sep 8, 2024

Thanks a lot, I will be going over this fully soon.

@rinde
Copy link
Contributor Author

rinde commented Jan 22, 2025

Hi @fosskers , are you planning to follow-up on this? It's totally fine if you don't have the interest or time, I just would like to know because in that case I will probably publish this (and several follow-ups) in a separate repository.

@fosskers
Copy link
Owner

I do, I'm sorry to not have followed up yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants