Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: upload/delete race #10283

Open
erikgrinaker opened this issue Jan 5, 2025 · 3 comments
Open

pageserver: upload/delete race #10283

erikgrinaker opened this issue Jan 5, 2025 · 3 comments
Assignees
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug

Comments

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Jan 5, 2025

Uploads and deletes are racy in the upload queue, which can lead to delayed deletes removing newly uploaded files. This will be true even after upload reordering in #10218 ensures conflicting operations are executed in order, because deletions are async -- they're only submitted to the deletion queue:

self.deletion_queue_client
.push_layers(
self.tenant_shard_id,
self.timeline_id,
self.generation,
delete.layers.clone(),
)
.await
.map_err(|e| anyhow::anyhow!(e))

A lot of complexity has been added to deal with this, see e.g. #9844.

We should instead make deletes synchronous, such that the upload queue can properly order these and avoid races. We can either continue using the deletion queue, but wait for the deletion to execute before completing the task, or simply run the deletion operation synchronously.

When we do this, we should also document the invariants around this, in a doc comment and/or an RFC, with diagrams.

@erikgrinaker erikgrinaker added c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug labels Jan 5, 2025
@erikgrinaker erikgrinaker self-assigned this Jan 5, 2025
@jcsp
Copy link
Collaborator

jcsp commented Jan 6, 2025

So with reordering, I guess we now don't really care if the delete operation itself is slower (because it waits for flush), because other operations will usually just skip past it, right?

We would need to care about runtime of deletions at some points that we do a wait_completion though, like in some shutdown paths, so maybe those paths will need to kick the deletion queue if the timeline client has any deletions pending

@erikgrinaker
Copy link
Contributor Author

So with reordering, I guess we now don't really care if the delete operation itself is slower (because it waits for flush), because other operations will usually just skip past it, right?

Yes, except for any conflicting operations (same file).

I'm not sure to what extent we need to use the deletion queue at all, or if we can just submit delete requests directly. But I'm assuming we use it for good reason.

We would need to care about runtime of deletions at some points that we do a wait_completion though, like in some shutdown paths, so maybe those paths will need to kick the deletion queue if the timeline client has any deletions pending

How long does it take for the deletion queue to process deletions? Can we bound it to something reasonable like 5 seconds?

@jcsp
Copy link
Collaborator

jcsp commented Jan 6, 2025

I'm not sure to what extent we need to use the deletion queue at all, or if we can just submit delete requests directly. But I'm assuming we use it for good reason.

The deletion queue has two purposes (via 025-generation-numbers.md)

  • Batching generation validation requests to the controller (pageservers need to effectively ask permission before doing a deletion)
  • Batching deletion requests to S3.

In practices, the batches are often pretty small, and this optimization has limited absolute benefit, but the queue makes it easier to reason about scale and things being "cheap": we don't have to worry about how storage controller (or DB) load would spike if a pageserver decided to execute lots of deletions all of a sudden, because they're nicely batched.

One could imagine a world where we collapse the generation validation logic into RemoteTimelineClient, so that it can essentially do its own batching, but that's a little awkward because generations are per tenant and clients are per-timeline. There's a tradeoff between the separation of concerns (status quo, deletion queue is its own little subsystem), and net LoC (it would be terser to build it all into one queue).

How long does it take for the deletion queue to process deletions? Can we bound it to something reasonable like 5 seconds?

We could, but 5s would still be too long for something like a timeline shutdown during live migration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug
Projects
None yet
Development

No branches or pull requests

2 participants