Fix data columns not persisting for PeerDAS due to a getBlobs race condition
#6756
25e137b to 44f8add
@@ -602,7 +631,7 @@ impl<T: BeaconChainTypes> DataAvailabilityCheckerInner<T> {

         // Check if we have all components and entire set is consistent.
         if pending_components.is_available(self.sampling_column_count, log) {
-            write_lock.put(block_root, pending_components.clone());
+            write_lock.put(block_root, pending_components.clone_without_column_recv());
These clones where we delete the receiver are OK because we should be using the receiver the first time it is returned inside an available block, right?
Yes exactly - I added some docs to the function, but I still don't think it's immediately obvious:
lighthouse/beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs
Lines 42 to 44 in aabb5c1
/// Clones the `PendingComponent` without cloning `data_column_recv`, as `Receiver` is not cloneable.
/// This should only be used when the receiver is no longer needed.
pub fn clone_without_column_recv(&self) -> Self {
Added some more comments above this line in 4173135:
lighthouse/beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs
Lines 552 to 554 in 4173135
// We keep the pending components in the availability cache during block import (#5845).
// `data_column_recv` is returned as part of the available block and is no longer needed here.
write_lock.put(block_root, pending_components.clone_without_column_recv());
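To make the clone-without-receiver idea concrete, here is a minimal sketch. It is not the Lighthouse code: the struct fields, the `u64` block root, and the `Vec<u8>` payload are hypothetical stand-ins, and `std::sync::mpsc` is used only so the example has no dependencies. It shows why a struct holding a `Receiver` cannot derive `Clone`, and how a manual copy can leave the receiver behind.

```rust
use std::sync::mpsc::{channel, Receiver};

/// Hypothetical stand-in for `PendingComponents`; the fields are illustrative only.
struct PendingComponents {
    block_root: u64,
    verified_columns: Vec<u8>,
    /// `Receiver` does not implement `Clone`, so this struct cannot derive `Clone`.
    data_column_recv: Option<Receiver<Vec<u8>>>,
}

impl PendingComponents {
    /// Copies every cloneable field and leaves the receiver empty in the copy.
    fn clone_without_column_recv(&self) -> Self {
        Self {
            block_root: self.block_root,
            verified_columns: self.verified_columns.clone(),
            data_column_recv: None,
        }
    }
}

/// Returns (original still has receiver, copy has receiver, value read via the original).
fn demo() -> (bool, bool, Vec<u8>) {
    let (tx, rx) = channel();
    let original = PendingComponents {
        block_root: 1,
        verified_columns: vec![0, 1],
        data_column_recv: Some(rx),
    };
    // The copy is what would go back into the cache; it carries no receiver.
    let cached = original.clone_without_column_recv();
    tx.send(vec![2, 3]).unwrap();
    // Only the original can still read the computed columns.
    let received = original.data_column_recv.as_ref().unwrap().recv().unwrap();
    (
        original.data_column_recv.is_some(),
        cached.data_column_recv.is_some(),
        received,
    )
}
```

In this sketch the value that keeps the receiver plays the role of the available block returned to the importer, while the receiver-less copy plays the role of the cache entry.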
… Add more docs on `data_column_recv`.
// If `data_column_recv` is `Some`, it means we have all the blobs from engine, and have
// started computing data columns. We store the receiver in `PendingComponents` for
// later use when importing the block.
pending_components.data_column_recv = data_column_recv;
If `data_column_recv` is already `Some`, it means we have double reconstruction happening. Should we handle that case or do something?
Yes, I chatted with Michael about this - so currently it's possible for any of these three to happen at the same time:
- gossip block triggering EL `getBlobs` and column computation
- rpc block triggering EL `getBlobs` and column computation
- gossip/rpc column triggering column reconstruction (only one computation at a time)
I initially wanted to address these all together, but it got a bit messy, so I decided to leave this issue to a separate PR and keep this PR small, as I wanted to get this PR merged first and start interop testing.
Created issue #6764
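As an illustration of the concern raised in this thread, the sketch below shows what happens when a second receiver overwrites a stored one, plus one possible guard. This is purely hypothetical: the PR leaves the problem to #6764, `set_recv_if_empty` is an invented helper name, and `std::sync::mpsc` with a `u32` payload stands in for whatever channel and column type the real code uses.

```rust
use std::sync::mpsc::{channel, Receiver};

/// Hypothetical stand-in holding only the field under discussion.
struct PendingComponents {
    data_column_recv: Option<Receiver<u32>>,
}

impl PendingComponents {
    /// Hypothetical guard: keep the first receiver, report whether the new one was stored.
    fn set_recv_if_empty(&mut self, recv: Receiver<u32>) -> bool {
        if self.data_column_recv.is_some() {
            return false;
        }
        self.data_column_recv = Some(recv);
        true
    }
}

/// Plain assignment drops the first receiver, so the first computation's send fails.
fn overwrite_drops_first() -> bool {
    let (tx_a, rx_a) = channel();
    let (_tx_b, rx_b) = channel::<u32>();
    let mut pc = PendingComponents { data_column_recv: Some(rx_a) };
    pc.data_column_recv = Some(rx_b); // rx_a is dropped here
    tx_a.send(1).is_err()
}

/// With the guard, the first receiver survives and still delivers its result.
fn guard_keeps_first() -> (bool, bool, u32) {
    let (tx_a, rx_a) = channel();
    let (_tx_b, rx_b) = channel();
    let mut pc = PendingComponents { data_column_recv: None };
    let first = pc.set_recv_if_empty(rx_a);
    let second = pc.set_recv_if_empty(rx_b);
    tx_a.send(7).unwrap();
    (first, second, pc.data_column_recv.unwrap().recv().unwrap())
}
```

Whether keeping the first receiver, keeping the latest, or preventing the second computation from starting is the right policy is exactly the question deferred to the follow-up issue.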
Looks good to me, just a minor comment. Thanks for the docs throughout
# Conflicts:
#	beacon_node/beacon_chain/src/block_verification_types.rs
#	beacon_node/beacon_chain/src/data_availability_checker/overflow_lru_cache.rs
ff7088e to adf1c86
LGTM
@mergify queue
✅ The pull request has been merged automatically at dd7591f
Issue Addressed
This PR fixes a race condition with `getBlobs` in PeerDAS that causes data columns not to be persisted on block import.
This bug was discovered on a local devnet. The sequence of events is below:
`process_availability`.
lighthouse/beacon_node/beacon_chain/src/beacon_chain.rs
Lines 3929 to 3931 in fa6c4c0
`getBlobsSidecars` endpoint shows that both supernode and fullnode are missing columns
Proposed Changes
After some discussion with @michaelsproul, we think the optimisation in #6600 is useful to keep, as it enables data columns to be computed and propagated to the network sooner. The proposed solution is to store the data column `Receiver` in the `DataAvailabilityChecker` (new field in `BlockImportData`), so it's always available on block import.
TODO
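A rough sketch of the proposed flow, under stated assumptions: `DataColumns`, `start_column_computation`, and `import_block` are hypothetical names, the "computation" is a placeholder, and a standard-library channel and thread stand in for whatever async machinery the real code uses. The point is only that the receiver travels with the import data, so the importer can always wait for the columns before persisting.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

/// Hypothetical placeholder for the computed data columns.
type DataColumns = Vec<u8>;

/// Hypothetical stand-in for the import data carrying the new receiver field.
struct BlockImportData {
    data_column_recv: Option<Receiver<DataColumns>>,
}

/// Starts column computation in the background and returns the receiver immediately,
/// so the caller can continue without waiting for the result.
fn start_column_computation(blobs: Vec<u8>) -> Receiver<DataColumns> {
    let (tx, rx) = channel();
    thread::spawn(move || {
        // Placeholder work; the real computation derives columns from blobs.
        let columns: DataColumns = blobs.iter().map(|b| b.wrapping_mul(2)).collect();
        // Ignore the error if the receiver was dropped before we finished.
        let _ = tx.send(columns);
    });
    rx
}

/// On import, take the receiver (if any) and wait for the columns to persist.
fn import_block(mut import_data: BlockImportData) -> Option<DataColumns> {
    let recv = import_data.data_column_recv.take()?;
    recv.recv().ok()
}
```

A caller would build `BlockImportData { data_column_recv: Some(start_column_computation(blobs)) }` when blobs arrive from the engine and pass it to `import_block` later; with no receiver stored, `import_block` returns `None` and there is nothing extra to persist.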