
Store trait #21

Merged
merged 45 commits into from
Aug 30, 2024
Conversation

sgwilym
Contributor

@sgwilym sgwilym commented Jun 24, 2024

Work in progress, currently being discussed and iterated upon.

@sgwilym sgwilym added the enhancement New feature or request label Jun 24, 2024
@AljoschaMeyer
Contributor

Design Notes on a Willow Store Trait: Mutation

For our rust implementation of Willow, we are designing a trait to abstract over data stores. Among the features that a store must provide are ingesting new entries, querying areas, resolving payloads, and subscribing to change notifications. Turns out this trait becomes rather involved. In this writeup, I want to focus on a small subset of the trait: everything that allows user code to mutate a data store.

On the surface, Willow stores support only a single mutating operation: ingesting new entries. Does the following trait (heavily simplified on the type-level) do the trick?

trait Store {
    async fn ingest(&mut self, entry: Entry) -> Result<(), StoreError>;
}

Nope, there is actually a whole lot more to consider. We'll start with something simple: while the data model only considers adding new entries to a store, there is a second operation that all implementations should support: locally deleting entries. We want to support this both for explicitly specified entries and for whole areas:

trait Store {
    async fn ingest(&mut self, entry: Entry) -> Result<(), StoreError>;
    async fn forget(&mut self, entry: Entry) -> Result<(), StoreError>;
    async fn forget_area(&mut self, area: Area3d) -> Result<(), StoreError>;
    // We also need ingestion and forgetting of payloads,
    // but we'll leave that for later.
}

Now we have a sketch of a somewhat usable API, but it does not admit particularly efficient implementations. We want stores to (potentially) be backed by persistent storage. But persisting data to disk is expensive. Typically, writes are performed to an in-memory buffer, and then periodically (or explicitly on demand) flushed to disk (compare the fsync syscall in posix). To support this pattern, we should change the semantics of our methods to "kindly asking the store to eventually do something", and add a flush method to force an expensive commitment of all prior mutations (typically by fsyncing a write-ahead log).

trait Store {
    async fn ingest(&mut self, entry: Entry) -> Result<(), StoreError>;
    async fn forget(&mut self, entry: Entry) -> Result<(), StoreError>;
    async fn forget_area(&mut self, area: Area3d) -> Result<(), StoreError>;
    async fn flush(&mut self) -> Result<(), StoreError>;
}

Another typical optimisation is that of efficiently operating on slices of bulk data instead of individual pieces of data. Forgetting should hopefully be rare enough, but we should definitely have a method for bulk ingestion.

trait Store {
    async fn ingest(&mut self, entry: Entry) -> Result<(), StoreError>;
    async fn ingest_bulk(&mut self, entries: &[Entry]) -> Result<(), StoreError>;
    async fn forget(&mut self, entry: Entry) -> Result<(), StoreError>;
    async fn forget_area(&mut self, area: Area3d) -> Result<(), StoreError>;
    async fn flush(&mut self) -> Result<(), StoreError>;
}

The next issue is one of interior versus exterior mutability. Those async methods desugar to returning Futures. And since the methods take a mutable reference to the store as an argument, no other store methods can be called while such a Future is in scope. Hence, this trait forces sequential issuing of commands, with no concurrency. While it might be borderline acceptable to force linearization of all mutating store accesses (especially since most method calls would simply place an instruction in a buffer rather than actually performing and committing side-effects), the inability to mutate the store while also querying it (say, by subscribing to changes) is a dealbreaker. Thus, we should change those methods to take an immutable &self reference, forcing store implementations to employ interior mutability. To the experienced rust programmer, this shouldn't be too surprising: the whole point of a store is to provide mutable, aliased state, after all.

trait Store {
    async fn ingest(&self, entry: Entry) -> Result<(), StoreError>;
    async fn ingest_bulk(&self, entries: &[Entry]) -> Result<(), StoreError>;
    async fn forget(&self, entry: Entry) -> Result<(), StoreError>;
    async fn forget_area(&self, area: Area3d) -> Result<(), StoreError>;
    async fn flush(&self) -> Result<(), StoreError>;
}

Finally, there is a line of thought that I'm less confident about yet. Starting with simple ingest operations and then adding support for buffering (and flushing of the buffer) and bulk operations duplicates a lot of the design of the ufotofu BulkConsumer. So perhaps it would make sense to simply expose a BulkConsumer, whose items are an enum for the different operations (ingest, forget, forget_area). Replacing method calls with explicitly moving around enum values might sound inefficient, but there's an argument to be made that any efficient buffering operation would implement the method calls by storing a reification of the operation inside some buffer anyways. The main upside would be that of using a principled abstraction instead of providing a zoo of methods that end up implementing the same semantics anyways.
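To make the reification idea concrete, here is a toy sketch (not the actual willow-rs or ufotofu API; `Entry`, `Area3d`, and `BufferingStore` are stand-ins invented for illustration) of an enum whose variants are the mutating operations, consumed into a buffer that is only committed on flush:

```rust
// Hypothetical stand-ins for the real Willow types.
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    path: String,
}

#[derive(Debug, Clone, PartialEq)]
struct Area3d {
    prefix: String,
}

// One variant per mutating operation; a consumer of `StoreOp`s
// could replace the zoo of individual methods.
#[derive(Debug, Clone, PartialEq)]
#[allow(dead_code)]
enum StoreOp {
    Ingest(Entry),
    Forget(Entry),
    ForgetArea(Area3d),
}

// A toy buffering consumer: operations accumulate in memory and are
// only "committed" on flush, mirroring the write-ahead-log pattern.
#[derive(Default)]
struct BufferingStore {
    buffer: Vec<StoreOp>,
    committed: Vec<StoreOp>,
}

impl BufferingStore {
    // Consuming an op just records its reification in the buffer.
    fn consume(&mut self, op: StoreOp) {
        self.buffer.push(op);
    }

    // Flushing moves all buffered ops into the committed log.
    fn flush(&mut self) {
        self.committed.append(&mut self.buffer);
    }
}
```

This is exactly the "any efficient buffering implementation stores a reification of the operation anyways" argument made explicit: the enum value is that reification.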

Aside from the drawback of forcing the explicit reification of operations, another drawback of the consumer approach would be the question of how to obtain the consumer. If the store itself implemented BulkConsumer, then the issue of exterior mutability would make it unusable. If the store had a function that took a &self and returned an owned handle that implements BulkConsumer, then arbitrarily many handles could be created, i.e., the consumer would be forced to support many writers. I don't really see a way around that. So the two options seem to be either a multi-writer consumer, or a collection of (interiorly mutable) methods on the store trait without an explicit consumer.

This is the main issue I wanted to convey in this writeup. I have glossed over various details (generics and associated types, precise errors, information to return on successful operations, an ingest-unless-it-would-overwrite method, payload handling, etc.), because those are comparatively superficial. But I'd love to hear some feedback on the issues of interior vs exterior mutability and explicit vs implicit (buffered, bulk) consumer semantics.

@AljoschaMeyer
Contributor

AljoschaMeyer commented Jul 1, 2024

Storage requirements

  • user-facing
    • mutation
      • ingest entry
        • optionally do nothing if it would overwrite a (strict) extension of the path
      • append to payload prefix
      • forget entry by pair of path and subspace id
      • forget by area (of interest)
      • forget all but area (of interest) (within a containing area (of interest)) (?)
      • all forget functionality but for payloads instead of entries
      • all forget functionality comes with a traceless flag
        • if traceless, the storage keeps no record of ever having had the data
        • if not traceless, the storage is allowed to persist the forget query for an arbitrary amount of time, in order to accurately inform persistent subscribers about the forgetting in the future
          • non-traceless forgetting does not imply rejecting future data that matches the forget operation, such a service must live at a different level (and must be implemented by not ingesting the data again in the first place)
      • flush (force persistence of all prior mutations)
        • one-shot queries may or may not observe mutations before they were flushed
        • subscriptions are only notified of mutations that have been flushed successfully
    • query
      • query entry by pair of path and subspace id
        • yields Entry, AuthorisationToken, available payload prefix (length)
        • option to filter out results with incomplete payloads
      • query entries by area (of interest)
        • all of the functionality of singleton queries, also
        • ordering results (arbitrary, PTS (optionally reversed), TSP (optionally reversed), SPT (optionally reversed)) (?)
      • subscriptions: all queries should also work as long-lived subscriptions
        • this includes subscription to payload prefix appending
        • this also includes notifications about overwriting and forgetting things
      • persistent subscriptions
        • all subscription notifications come with a u64 progress id, client code can stop consuming a subscription at any time and can later (even between program shutdown and restart) resume it at the same point by supplying its progress id
        • this is a best-effort service, the storage may reject a resumption because it is too outdated
          • a simplistic implementation can effectively opt out of providing persistent subscriptions by simply considering all ids as outdated
        • forgetting entries/payloads can cause progress ids to become outdated, unless the storage tracks the things it forgot
          • this is why all forget functionality comes with a flag whether it should be completely traceless or whether the storage is allowed to track the act of forgetting
  • internal (for replication)
    • all of the public-facing functionality may also be used internally
    • query entries by 3dRange
      • yields Entry, AuthorisationToken, available payload prefix (length)
      • option to filter out results with incomplete payloads (?)
      • optionally order according to the ReconciliationAnnounceEntries:will_sort field
      • optionally as a subscription (?)
    • summarise 3dRange (count, fingerprint)
    • convert area of interest to 3dRange
    • partition a 3dRange into k roughly evenly sized 3dRanges
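The persistent-subscription requirement above (u64 progress ids, best-effort resumption, ids invalidated by pruning) might be sketched like this; all names here are hypothetical illustrations, not the actual willow-rs API:

```rust
// Hypothetical sketch of best-effort persistent subscriptions.
// Every committed mutation gets a monotonically increasing progress id;
// resuming from an id the store no longer remembers fails with
// `Outdated`, which a simplistic implementation may always return.
#[derive(Debug, Clone, PartialEq)]
#[allow(dead_code)]
enum Event {
    Ingested(String),  // stand-in for an entry notification
    Forgotten(String), // only recorded for non-traceless forgets
}

#[derive(Debug, PartialEq)]
enum ResumeError {
    Outdated,
}

struct SubscriptionLog {
    first_retained_id: u64, // ids below this have been pruned
    events: Vec<Event>,     // events[i] has id first_retained_id + i
}

impl SubscriptionLog {
    // Resume a subscription: return all events from `progress_id`
    // onwards, or report that the id is too outdated to honour.
    fn resume(&self, progress_id: u64) -> Result<&[Event], ResumeError> {
        if progress_id < self.first_retained_id {
            return Err(ResumeError::Outdated);
        }
        let offset = (progress_id - self.first_retained_id) as usize;
        Ok(&self.events[offset.min(self.events.len())..])
    }
}
```

Traceless forgetting would prune `events` (raising `first_retained_id`), which is precisely how forgetting can invalidate progress ids.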

@sgwilym
Contributor Author

sgwilym commented Aug 20, 2024

I'm starting to work in earnest on this again.

One thing I have thought about is how a store in our implementations is not the same as a store in the spec:

A store is a set of AuthorisedEntries ...

Whereas in our implementations, a store is a set of authorised entries AND a partial set of corresponding payloads.

In willow-js, this led to this situation where you need to specify how a store's payloads should be stored and retrieved via PayloadDrivers, ramping up the complexity of the store.

I suggested the opportunity to separate the two concepts to @AljoschaMeyer, so that we'd have Stores concerned only with entries and, err, PayloadStores concerned only with Payloads. Rather than have Store.ingest_payload, we could have Store.update_available_payload_length to update the fingerprints.

But Aljoscha raised a good point:

I feel like internally the implementations need to be coupled though, say, for locking.

Which is persuasive, but maybe there's some way around it? That's why I'm sharing this here.
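The proposed split might be sketched as two traits, roughly like the following. All names here (`EntryStore`, `PayloadStore`, `update_available_payload_length`, the digest type) are illustrative assumptions based on the comment above, not the actual willow-rs API; the `RefCell` in the toy implementation stands in for the interior mutability discussed earlier:

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Hypothetical stand-in for a payload digest.
type PayloadDigest = [u8; 32];

#[allow(dead_code)]
trait EntryStore {
    // Entry-only concerns: ingestion, forgetting, queries, ...
    fn ingest_entry(&self, path: &str);

    // Instead of `ingest_payload`, the entry store is merely told how
    // much of a payload is now locally available, so its summaries
    // (fingerprints) stay accurate.
    fn update_available_payload_length(&self, digest: PayloadDigest, length: u64);
}

trait PayloadStore {
    // Payload-only concerns: append bytes, report available length.
    fn append(&self, digest: PayloadDigest, bytes: &[u8]);
    fn available_length(&self, digest: PayloadDigest) -> u64;
}

// Toy in-memory payload store, using interior mutability so the
// trait methods can take &self.
#[derive(Default)]
struct MemPayloadStore {
    payloads: RefCell<HashMap<PayloadDigest, Vec<u8>>>,
}

impl PayloadStore for MemPayloadStore {
    fn append(&self, digest: PayloadDigest, bytes: &[u8]) {
        self.payloads
            .borrow_mut()
            .entry(digest)
            .or_default()
            .extend_from_slice(bytes);
    }

    fn available_length(&self, digest: PayloadDigest) -> u64 {
        self.payloads
            .borrow()
            .get(&digest)
            .map_or(0, |p| p.len() as u64)
    }
}
```

The coupling concern Aljoscha raises would live in whatever glue calls `update_available_payload_length` after each `append`; that glue is exactly the part that needs locking.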

@sgwilym
Contributor Author

sgwilym commented Aug 21, 2024

@AljoschaMeyer I have now added all mutation methods (unless we want a bulk payload ingestion method?)

@sgwilym sgwilym mentioned this pull request Aug 22, 2024
/// Attempt to ingest many [`AuthorisedEntry`] in the [`Store`].
///
/// The result being `Ok` does **not** indicate that all entry ingestions were successful, only that each entry had an ingestion attempt, some of which *may* have returned [`EntryIngestionError`]. The `Err` type of this result is only returned if there was some internal error.
fn bulk_ingest_entry(
Contributor

Returning a Vec of errors causes an allocation that might be avoidable (or at least configurable) if the method took a consumer of BulkIngestionResults as an argument instead? Not entirely sure whether that would really be an improvement. CC @Frando

Contributor Author

@sgwilym sgwilym Aug 27, 2024

I mean, I think you would always want to know about errors, right? The consumer pattern is cool, but is it maybe a little fancy?
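For comparison, a toy sketch of the two shapes under discussion; the entry type, error type, and function names are invented for illustration, not the real willow-rs signatures:

```rust
// Hypothetical stand-ins, not the real willow-rs types.
#[derive(Debug, Clone)]
struct Entry {
    path: String,
}

#[derive(Debug, PartialEq)]
enum IngestError {
    Rejected,
}

// Pretend per-entry ingestion logic: entries under "forbidden/" fail.
fn try_ingest(entry: &Entry) -> Result<(), IngestError> {
    if entry.path.starts_with("forbidden/") {
        Err(IngestError::Rejected)
    } else {
        Ok(())
    }
}

// Shape 1: allocate and return all per-entry results in a Vec.
fn bulk_ingest_vec(entries: &[Entry]) -> Vec<Result<(), IngestError>> {
    entries.iter().map(try_ingest).collect()
}

// Shape 2: hand each result to a caller-supplied sink, so the caller
// decides whether (and how) to allocate.
fn bulk_ingest_sink(entries: &[Entry], mut sink: impl FnMut(Result<(), IngestError>)) {
    for entry in entries {
        sink(try_ingest(entry));
    }
}
```

The sink shape still reports every error; it just leaves the choice of collecting them (or merely counting them) to the caller.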

@sgwilym sgwilym requested a review from AljoschaMeyer August 27, 2024 14:04
@sgwilym sgwilym requested a review from AljoschaMeyer August 29, 2024 11:18
@Frando
Copy link
Contributor

Frando commented Aug 29, 2024

Note: Many of the structs still need the standard derives: Debug, Clone, and Eq/PartialEq where applicable.

@sgwilym sgwilym merged commit 442854e into main Aug 30, 2024
14 checks passed
@sgwilym sgwilym deleted the store branch August 30, 2024 10:33