Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

elastic scaling: rework core selector handling #6939

Open
wants to merge 46 commits into
base: master
Choose a base branch
from

Conversation

alindima
Copy link
Contributor

Prior to the PR, the parachain runtimes were prevented from sending the core selector ump signal by the experimental-ump-signals compile-time feature. This was needed because the validators were not yet upgraded to be able to decode the new signals.

This PR removes the experimental-ump-signals feature and makes the runtime send the UMP signal if the SelectCore impl (part of the runtime config trait) returns Some(...). This is a breaking change on the SelectCore trait so cannot be backported.
It also adds an impl for (), which returns None.

Instructions on how to configure the parachain runtime for elastic scaling will be included in #6739

@alindima alindima added T8-polkadot This PR/Issue is related to/affects the Polkadot network. T9-cumulus This PR/Issue is related to cumulus. labels Dec 18, 2024
@alindima alindima requested a review from sandreim December 18, 2024 09:27
@paritytech-workflow-stopper
Copy link

All GitHub workflows were cancelled due to failure one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/12390057192
Failed job name: test-linux-stable

@alindima alindima mentioned this pull request Dec 18, 2024
2 tasks
@alindima alindima requested a review from skunert December 18, 2024 15:37
cumulus/client/collator/src/service.rs Outdated Show resolved Hide resolved
cumulus/client/collator/src/service.rs Outdated Show resolved Hide resolved
cumulus/client/collator/src/service.rs Outdated Show resolved Hide resolved
@alindima
Copy link
Contributor Author

alindima commented Jan 9, 2025

I'm having second thoughts about whether or not removing the experimental-ump-signals compile-time feature is indeed a good idea.
If we do (as this PR does), then we need to find an efficient way of communicating that:

If you have a parachain runtime and update to this version, make sure that you use a SelectCore implementation that returns None, unless:

  • all of your collators run with the latest node software
  • all of your collators are using the slot-based collator (we only added logic to use the core selector for the slot-based collator).

If they don't, they will run into core index mismatches if they have multiple cores assigned. (leading to skipped slots and potentially parachain stall).

If we keep the functionality of sending ump signals guarded under experimental-ump-signals, it becomes more of a conscious decision to enable this. However, it will be a very long time until we will be able to remove this feature.

@sandreim @skunert what are your thoughts?

@sandreim
Copy link
Contributor

sandreim commented Jan 9, 2025

If you have a parachain runtime and update to this version, make sure that you use a SelectCore implementation that returns None, unless:

  • all of your collators run with the latest node software
  • all of your collators are using the slot-based collator (we only added logic to use the core selector for the slot-based collator).

If they don't, they will run into core index mismatches if they have multiple cores assigned. (leading to skipped slots and potentially parachain stall).

Yes, this is sub-optimal, but it only applies to parachains that have multiple cores assigned. I think the default implementation () is fine and should be what the docs of SelectCore implementation recommend unless you want to enable elastic scaling on the parachain.

/// allowed.
#[cfg_attr(
feature = "std",
error("Candidate contains core selector but descriptor version is v1")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
error("Candidate contains core selector but descriptor version is v1")
error("Version 1 receipt does not support core selectors")

@@ -391,9 +402,7 @@ pub mod pallet {
UpwardMessages::<T>::put(&up[..num as usize]);
*up = up.split_off(num as usize);

// Send the core selector UMP signal. This is experimental until relay chain
// validators are upgraded to handle ump signals.
#[cfg(feature = "experimental-ump-signals")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't we also used this to guard until enabling RFC103 on the relay chain?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's important that they know at least to ignore the signals if the feature is disabled (which I'm pretty sure was included in a node release already, will check)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I think you're right. the feature needs to be enabled on the relay chain, this was another reason I chose a compile time feature back then..

So I'll leave the core selector config for later and cherry-pick the changes to only refuse candidates that output UMP signals but use v1 descriptor into its own PR that can be merged and also backported

@sandreim
Copy link
Contributor

sandreim commented Jan 9, 2025

Also, to make it more clear we can rename the implementations of SelectCore:

  • deafault () to DisableSelectCore which means the parachain doesn't use RFC103
  • DefaultCoreSelector to SingleCoreSelector which means parachain enabled RFC103 but just a single core can be used.

@skunert
Copy link
Contributor

skunert commented Jan 10, 2025

I'm having second thoughts about whether or not removing the experimental-ump-signals compile-time feature is indeed a good idea. If we do (as this PR does), then we need to find an efficient way of communicating that:

If you have a parachain runtime and update to this version, make sure that you use a SelectCore implementation that returns None, unless:

* all of your collators run with the latest node software

* all of your collators are using the slot-based collator (we only added logic to use the core selector for the slot-based collator).

If they don't, they will run into core index mismatches if they have multiple cores assigned. (leading to skipped slots and potentially parachain stall).

If we keep the functionality of sending ump signals guarded under experimental-ump-signals, it becomes more of a conscious decision to enable this. However, it will be a very long time until we will be able to remove this feature.

@sandreim @skunert what are your thoughts?

So if parachains upgrade their chain, they should use () by default if they don't want any changes, this can be communicated clearly. Also in most cases I would expect teams to upgrade the SDK version on their project at the same time for node and runtime.

However, does anything speak against adjusting the lookahead too to prevent this mismatch? If we already see this footgun lets remove it.

@alindima
Copy link
Contributor Author

However, does anything speak against adjusting the lookahead too to prevent this mismatch? If we already see this footgun lets remove it.

This would basically mean making lookahead work with elastic scaling. Do we want to support that?
The argument for not doing this from the get-go I think were that we don't want to keep supporting it.

@alindima alindima requested a review from a team as a code owner January 13, 2025 10:33
@alindima
Copy link
Contributor Author

I set the DefaultCoreSelector to be () , which returns None.
I also added warning comments to impls that return Some(): RoundRobinCoreSelector and LookaheadRoundRobinCoreSelector

@alindima
Copy link
Contributor Author

alindima commented Jan 13, 2025

Extracted #7127 from this PR, as discussed here. This PR is blocked until v2 descriptors node feature is enabled on all networks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T8-polkadot This PR/Issue is related to/affects the Polkadot network. T9-cumulus This PR/Issue is related to cumulus.
Projects
Status: Audited
Status: Backlog
Development

Successfully merging this pull request may close these issues.

4 participants