
Compute columns in post-PeerDAS checkpoint sync #6760

Open · wants to merge 1 commit into base: unstable

Conversation

dapplion
Collaborator

@dapplion dapplion commented Jan 7, 2025

Issue Addressed

Addresses #6026.

Post-PeerDAS, the DB expects to have data columns for the finalized block.

Proposed Changes

Instead of forcing the user to submit the columns, this PR computes the columns from the blobs, which we can already fetch from the checkpointz server or with the existing CLI options.
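
As a rough sketch of the idea (this is not the actual Lighthouse code; all of the types and the compute_cells_and_kzg_proofs helper below are hypothetical stand-ins): each blob is extended into cells plus KZG proofs, and column i is assembled from the i-th cell of every extended blob.

const NUMBER_OF_COLUMNS: usize = 128;

#[derive(Clone)]
struct Cell(Vec<u8>); // extended field elements for one (blob, column) pair
#[derive(Clone)]
struct KzgProof([u8; 48]);
struct Blob(Vec<u8>);

struct DataColumnSidecar {
    index: u64,
    column: Vec<Cell>,          // one cell per blob in the block
    kzg_proofs: Vec<KzgProof>,  // one proof per blob in the block
}

// Hypothetical wrapper around the KZG library call that extends one blob into
// NUMBER_OF_COLUMNS cells and matching proofs.
fn compute_cells_and_kzg_proofs(_blob: &Blob) -> Result<(Vec<Cell>, Vec<KzgProof>), String> {
    unimplemented!("call into the KZG library here")
}

fn blobs_to_data_columns(blobs: &[Blob]) -> Result<Vec<DataColumnSidecar>, String> {
    // One extended row (cells + proofs) per blob in the block.
    let rows: Vec<(Vec<Cell>, Vec<KzgProof>)> = blobs
        .iter()
        .map(compute_cells_and_kzg_proofs)
        .collect::<Result<_, _>>()?;

    // Transpose: column `i` collects the i-th cell and proof of every blob.
    Ok((0..NUMBER_OF_COLUMNS)
        .map(|i| DataColumnSidecar {
            index: i as u64,
            column: rows.iter().map(|(cells, _)| cells[i].clone()).collect(),
            kzg_proofs: rows.iter().map(|(_, proofs)| proofs[i].clone()).collect(),
        })
        .collect())
}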

Note 1: (EDIT) Pruning concern addressed

Note 2: I have not tested this feature

Note 3: @michaelsproul an alternative I recall is to not require the blobs / columns at this point and expect backfill to populate the finalized block

@dapplion dapplion added the das Data Availability Sampling label Jan 7, 2025
@jimmygchen jimmygchen self-assigned this Jan 7, 2025
@jimmygchen jimmygchen added the ready-for-review The code is ready for review label Jan 7, 2025
Member

@jimmygchen jimmygchen left a comment

Nice and simple! Should have done this ages ago 😅

Just a TODO comment to address, but we can probably start testing with this!

@jimmygchen jimmygchen added ready-for-merge (This PR is ready to merge.) and removed ready-for-review (The code is ready for review), ready-for-merge (This PR is ready to merge.) labels Jan 9, 2025
@jimmygchen
Member

@michaelsproul I think this approach is simple and maintains the same behaviour as Deneb, but keen to hear if you have any preference for, or see any benefit in, the alternatives discussed earlier (breaking the existing db invariant and populating the columns via backfill / p2p etc.)?

Note 1: We should only store the columns that are meant to be in custody. Otherwise, later pruning won't delete the extras. However, accessing the NodeID and computing the custody indices is quite cumbersome as the logic is inside the network init. Ideas?

Yeah, it is a bit of a pain: this code is executed before the network is initialised, so we don't have the node ID yet, and I'm not sure it's worth doing a big refactor for this. Why wouldn't blob pruning delete the extras? It looks like it iterates through all columns for the root, unless I'm looking at the wrong thing?

let indices = self.get_data_column_keys(block_root)?;
if !indices.is_empty() {
    trace!(
        self.log,
        "Pruning data columns of block";
        "slot" => slot,
        "block_root" => ?block_root,
    );
    last_pruned_block_root = Some(block_root);
    ops.push(StoreOp::DeleteDataColumns(block_root, indices));
}

@dapplion I think this should work, but would you mind giving this a quick test locally before we merge?

@jimmygchen jimmygchen added the under-review A reviewer has only partially completed a review. label Jan 9, 2025
@jimmygchen
Member

@michaelsproul I think this approach is simple and maintains the same behaviour as Deneb, but keen to hear if you have any preference for, or see any benefit in, the alternatives discussed earlier (breaking the existing db invariant and populating the columns via backfill / p2p etc.)?

Oh, I think the downside with this approach is that the checkpoint server would have to serve blobs, i.e. be a supernode. The latter approach would allow syncing columns from peers instead, so the checkpointz server wouldn't have to store all columns (if we have 16 MB of blobs per block, that'd be ~2 TB of blob data the checkpointz server would have to store).
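
As a rough back-of-the-envelope check of that figure, assuming a 4096-epoch blob retention window (~18 days; the epoch count is an assumption, the thread only mentions "18 days") and ~16 MB of blobs per block:

// 4096 epochs * 32 slots/epoch = 131,072 slots.
// 131,072 slots * 16 MiB of blobs per slot = 2,097,152 MiB ≈ 2 TiB.
const RETENTION_EPOCHS: u64 = 4096;
const SLOTS_PER_EPOCH: u64 = 32;
const BLOB_BYTES_PER_SLOT: u64 = 16 * 1024 * 1024;
const TOTAL_BLOB_BYTES: u64 = RETENTION_EPOCHS * SLOTS_PER_EPOCH * BLOB_BYTES_PER_SLOT; // ~2.2e12 bytes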

Or, if we're OK with the checkpointz server serving blobs, we could potentially introduce a --checkpoint-server flag for servers to store checkpoint blobs only (so they can serve them) instead of having to store all columns for 18 days?

@jimmygchen
Member

This PR addresses #6026

@mrabino1

@jimmygchen forgive my ignorance on the specs here (and indeed the existing implementation), quick question: instead of storing the blobs on a checkpoint server, wouldn't it be easier to store only the hash, so that clients on launch would p2p query / gossip the newly requested blobs? From my understanding, the objective of the checkpoint server is to allow a newly launched node to get going with minimal effort, but just like the state, LH still has to backfill; wouldn't similar logic apply here? Thanks.

@jimmygchen
Member

Hi @mrabino1

Yes it is possible to download the blobs from peers, with the alternative @dapplion mentioned:

Note 3: @michaelsproul an alternative I recall is to not require the blobs / columns at this point and expect backfill to populate the finalized block

The proposed solution in this PR behaves similarly to Mainnet today: we download and store the checkpoint state, block and blobs. This approach requires minimal changes in Lighthouse and will likely work without additional effort; it may not be the final solution, but it could be useful for testing now.

We could consider the alternative of downloading from peers; it means we'd have to break the current invariant of storing the complete block and blobs in the database, and we'd need some extra logic to populate the finalized block when backfill completes.

@realbigsean realbigsean requested review from jimmygchen and realbigsean and removed request for jimmygchen and realbigsean January 12, 2025 22:38
@michaelsproul
Member

I think breaking the invariant that we have all blobs/columns for blocks probably makes sense, seeing as there are multiple cases now where this is useful. The other main use case is importing partial blob history, e.g. all txns related to some L2. This is being worked on in:

I think it makes sense to restructure the DB storage as well so that we don't index by block_root -> list of blobs/data columns. For blobs it's too late (we need a schema upgrade), but we should change the data column schema before we ship PeerDAS on a permanent testnet (we can break schema versions between devnets without issue IMO).

@michaelsproul
Member

I think this PR is fine for now until we get our ducks in a row

@realbigsean realbigsean requested review from realbigsean and removed request for realbigsean January 13, 2025 00:16
@dapplion
Collaborator Author

@michaelsproul

I think it makes sense to restructure the DB storage as well so that we don't index by block_root -> list of blobs/data columns

The current DB indexes columns by [root:column_index]; what do you propose to index by instead?
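
For illustration, roughly what that layout looks like as a composite key (hypothetical names, not the actual Lighthouse schema):

// Each data column is stored under block_root || column_index, so all columns of a
// block share a 32-byte prefix and can be enumerated with a prefix scan.
fn data_column_key(block_root: &[u8; 32], column_index: u64) -> Vec<u8> {
    let mut key = Vec::with_capacity(32 + 8);
    key.extend_from_slice(block_root);
    key.extend_from_slice(&column_index.to_be_bytes());
    key
}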

@jimmygchen jimmygchen requested review from jimmygchen and removed request for jimmygchen January 14, 2025 07:31
@jimmygchen jimmygchen added waiting-on-author The reviewer has suggested changes and awaits their implementation. and removed under-review A reviewer has only partially completed a review. labels Jan 15, 2025
@jimmygchen
Member

I think this PR is fine for now until we get our ducks in a row

+1, it would be useful to be able to checkpoint sync on devnets, and we could also test backfill.

@dapplion do you want to get this tested and merged first? or consider the alternatives?

jimmygchen added a commit that referenced this pull request Jan 21, 2025
Squashed commit of the following:

commit 8a6e3bf
Author: dapplion <35266934+dapplion@users.noreply.github.com>
Date:   Tue Jan 7 16:30:57 2025 +0800

    Compute columns in post-PeerDAS checkpoint sync
Member

@jimmygchen jimmygchen left a comment

Approved earlier - just pending testing and we can merge!

@dapplion
Collaborator Author

dapplion commented Jan 22, 2025

Ran a PeerDAS devnet with Kurtosis, then checkpoint synced the same image against the first node. The node initialized fine.

root@lh-neth-0 ➜  ~ ./run_lighthouse_checkpoint_columns_test.sh
Jan 22 08:46:03.678 INFO Logging to file                         path: "/root/.lighthouse/custom/beacon/logs/beacon.log"
Jan 22 08:46:03.679 INFO Lighthouse started                      version: Lighthouse/v6.0.1-8a6e3bf
Jan 22 08:46:03.679 INFO Configured for network                  name: custom (/testnet_dir)
Jan 22 08:46:03.687 INFO Data directory initialised              datadir: /root/.lighthouse/custom
Jan 22 08:46:03.692 INFO Deposit contract                        address: 0x4242424242424242424242424242424242424242, deploy_block: 0
Jan 22 08:46:05.418 INFO Blob DB initialized                     oldest_data_column_slot: None, oldest_blob_slot: Some(Slot(0)), path: "/root/.lighthouse/custom/beacon/blobs_db", service: freezer_db
Jan 22 08:46:05.419 WARN No JWT found on disk. Generating        path: /root/jwt.hex, service: exec
Jan 22 08:46:05.419 DEBG Loaded execution endpoint               jwt_path: "/root/jwt.hex", endpoint: https://blockprint-worker-2.sigp-dev.net/, service: exec
Jan 22 08:46:08.745 INFO Starting checkpoint sync                remote_url: http://172.16.8.17:4000/, service: beacon
Jan 22 08:46:08.751 DEBG Downloading deposit snapshot            service: beacon
Jan 22 08:46:08.752 WARN Remote BN does not support EIP-4881 fast deposit sync, error: Error fetching deposit snapshot from remote: HttpClient(, kind: decode, detail: invalid type: null, expected struct DepositTreeSnapshot at line 1 column 12), service: beacon
Jan 22 08:46:08.752 DEBG Downloading finalized state             service: beacon
Jan 22 08:46:08.789 DEBG Downloaded finalized state              slot: Slot(59744), service: beacon
Jan 22 08:46:08.789 DEBG Downloading finalized block             block_slot: Slot(59744), service: beacon
Jan 22 08:46:08.790 DEBG Downloaded finalized block              service: beacon
Jan 22 08:46:08.805 INFO Loaded checkpoint block and state       block_root: 0x0696661d2fcc1dfb8caf04ba4647ba24ef702811bf8f6b732c99e823c60386a1, state_slot: 59744, block_slot: 59744, service: beacon
Jan 22 08:46:08.817 DEBG Storing cold state                      slot: 0, strategy: snapshot, service: freezer_db
Jan 22 08:46:08.855 INFO Block production enabled                method: json rpc via http, endpoint: Auth { endpoint: "https://blockprint-worker-2.sigp-dev.net/", jwt_path: "/root/jwt.hex", jwt_id: Some("lion"), jwt_version: None }
Jan 22 08:46:08.885 DEBG Could not get proposers from cache      decision_root: 0x47972e20042d4b30c7323944e8ea9a97389ff657b5f4905372532225955c67f2, epoch: Epoch(1866), service: val_mon, service: beacon
Jan 22 08:46:08.885 INFO Beacon chain initialized                head_slot: 59744, head_block: 0x0696661d2fcc1dfb8caf04ba4647ba24ef702811bf8f6b732c99e823c60386a1, head_state: 0x46916f67ad50a67c8d260b861997853c18a21da37d5912c7f94575d67f6603e7, service: beacon

EDIT: I checked the DB and it's empty; the checkpoint block has no data. Will repeat the test.

@jimmygchen jimmygchen added ready-for-merge This PR is ready to merge. and removed waiting-on-author The reviewer has suggested changes and awaits their implementation. labels Jan 22, 2025
    &data_column.as_ssz_bytes(),
)?;
self.block_cache
    .lock()
Member

Just a note that we're acquiring a mutex in a loop here, although this is probably fine as a one-time operation during startup.
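
For illustration, the pattern being flagged versus the alternative, with hypothetical names (not the actual Lighthouse code):

use std::sync::Mutex;

// The pattern the comment points out: the mutex is re-acquired on every iteration.
fn insert_per_iteration(cache: &Mutex<Vec<u64>>, items: &[u64]) {
    for item in items {
        cache.lock().unwrap().push(*item);
    }
}

// Alternative: lock once and hold the guard for the whole batch.
fn insert_with_one_lock(cache: &Mutex<Vec<u64>>, items: &[u64]) {
    cache.lock().unwrap().extend_from_slice(items);
}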

@jimmygchen
Member

jimmygchen commented Jan 22, 2025

I've raised an issue for the longer term solution to download checkpoint columns from p2p peers:
#6837

This PR works now and is useful for testing for the time being.

@jimmygchen
Member

@mergify queue


mergify bot commented Jan 22, 2025

queue

🛑 The pull request has been removed from the queue default

Pull request #6760 has been dequeued by a dequeue command.

You can take a look at Queue: Embarked in merge queue check runs for more details.

In case of a failure due to a flaky test, you should first retrigger the CI.
Then, re-embark the pull request into the merge queue by posting the comment
@mergifyio refresh on the pull request.

@jimmygchen jimmygchen added under-review A reviewer has only partially completed a review. and removed ready-for-merge This PR is ready to merge. labels Jan 22, 2025
mergify bot added a commit that referenced this pull request Jan 22, 2025
@jimmygchen
Member

@mergify dequeue


mergify bot commented Jan 22, 2025

This pull request has been removed from the queue for the following reason: pull request dequeued.

Pull request #6760 has been dequeued by a dequeue command

You should look at the reason for the failure and decide if the pull request needs to be fixed or if you want to requeue it.

If you want to requeue this pull request, you need to post a comment with the text: @mergifyio requeue


mergify bot commented Jan 22, 2025

dequeue

✅ The pull request has been removed from the queue default

@jimmygchen
Member

Hold off merging for now as this needs a retest.

@jimmygchen jimmygchen added do-not-merge and removed under-review A reviewer has only partially completed a review. labels Jan 22, 2025
jimmygchen added a commit to jimmygchen/lighthouse that referenced this pull request Jan 23, 2025
Squashed commit of the following:

commit 8a6e3bf
Author: dapplion <35266934+dapplion@users.noreply.github.com>
Date:   Tue Jan 7 16:30:57 2025 +0800

    Compute columns in post-PeerDAS checkpoint sync
Labels: das Data Availability Sampling, do-not-merge, peerdas-devnet-4