-
Notifications
You must be signed in to change notification settings - Fork 310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support tx poh recording in unified scheduler #4150
Conversation
2add55e
to
427736f
Compare
@@ -140,7 +140,7 @@ pub struct RecordTransactionsSummary { | |||
pub starting_transaction_index: Option<usize>, | |||
} | |||
|
|||
#[derive(Clone)] | |||
#[derive(Clone, Debug)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need Debug
on all of this? I can't imagine anyone has ever used it for the massive scheduling and bank structures. I'd sooner remove it from all of that than let it spread further, wdyt?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, i think Debug
is handy for quick debugging and actually required by assert_matches!()
...
sdk/transaction-error/src/lib.rs
Outdated
/// Commit failed internally. | ||
CommitFailed, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we now able to fail on commit while it was not previously possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like we are failing the pre-commit check (recording), not commit itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's the case, why can we not eat the error in the same way we do in normal block-production, without adding a new variant here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we now able to fail on commit while it was not previously possible?
...
It seems like we are failing the pre-commit check (recording), not commit itself.
hope this rename helps to reduce this confusion: 23159ff
why can we not eat the error in the same way we do in normal block-production, without adding a new variant here?
normal block-production code path can handle the error condition while not confined to TransactionError
. it is directly deciding what to commit by looking RecordTransactionsSummary at core/src/banking_stage/consumer.rs.
However, I opted not to use the block production code path for SchedulingMode::BlockProduction
because it has some unwanted functionalities for unified scheduler. Also, I didn't want to introduce yet another code-path with copied code for this transaction execution because these code-pathes are quite important to be maintained well.
While it's not ideal to add a variant here, there's some precedent for internal-use variants: ResanitizationNeeded
and ProgramCacheHitMaxLimit
. And they are added for ease of introduction rather than correctly adjusting all the types around code base.
unified-scheduler-pool/src/lib.rs
Outdated
BlockProduction => { | ||
let mut vec = vec![]; | ||
if handler_context.transaction_status_sender.is_some() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this part, could you explain what the thought process here is?
What does block-production have to do with transaction_status_sender
; in nearly all cases it should be None
for block-producers since it is only used for RPCs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given below code, I'm thinking it may have been a typo and expected to be transaction_recorder
?
Even so, it seems better to guarantee the recorder is present if we have block-production mode selected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay; I see now it's RPC-only stuff for the status batches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for pointing out unclear code...: a59e39a
ledger/src/blockstore_processor.rs
Outdated
) -> Result<()> { | ||
let TransactionBatchWithIndexes { | ||
batch, | ||
transaction_indexes, | ||
} = batch; | ||
let record_token_balances = transaction_status_sender.is_some(); | ||
let mut transaction_indexes = transaction_indexes.to_vec(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: adding a clone here here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wondering if we can remove this clone and simplify the callback interface.
What if the callback itself did an allocation (only if necessary) returning the transaction index in that case.
I also really don't like the Option<Option<usize>>
that is there now because it hides the meaning. It's not clear from just this code that the outer option means that recording/pre-commit failed.
If I were to do this, I'd take one of these two approaches:
- Make the outer option of return value a Result<...,()>
- Simple enum type to more clearly represent meanings
Lean towards 1 since it's simpler, and not sure an additional enum would benefit too much here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: adding a clone here here
Wondering if we can remove this clone and simplify the callback interface.
What if the callback itself did an allocation (only if necessary) returning the transaction index in that case.
nice catch.. done: 08536f0
I think Cow is enough. I'd like to remain the closure agnostic from allocation at all for separation of concern.
I also really don't like the
Option<Option<usize>>
that is there now because it hides the meaning. It's not clear from just this code that the outer option means that recording/pre-commit failed.If I were to do this, I'd take one of these two approaches:
1. Make the outer option of return value a Result<...,()> 2. Simple enum type to more clearly represent meanings
Lean towards 1 since it's simpler, and not sure an additional enum would benefit too much here.
this is done: 3b852f6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cow::to_mut
will clone if not already owned: https://doc.rust-lang.org/std/borrow/enum.Cow.html#method.to_mut
If we want to only clone in case of transaction_status_sender.is_some()
, it's probably the easiest/most direct to just let TransactionBatchWithIndexes
be a mutable reference and just modify it? Clone for the actual status send.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cow::to_mut
will clone if not already owned: https://doc.rust-lang.org/std/borrow/enum.Cow.html#method.to_mut
good catch. removed needless cloning for the case of transaction_status_sender.is_none()
by making .to_mut()
conditional: f8eeef5 (after rebase: 272a959)
fyi, cloning vec of zero length doesn't incur any memory allocation. Still, fewer visible clones, better. :)
If we want to only clone in case of
transaction_status_sender.is_some()
, it's probably the easiest/most direct to just letTransactionBatchWithIndexes
be a mutable reference and just modify it? Clone for the actual status send.
well, I'm reluctant to cloud the api semantics (using &mut
or even taking ownership of TransactionBatchWithIndexes
) just due to impl details. For one thing, bench_execute_batch()
repeatedly calls execute_batch()
with immutable references to the same instances of TransactionBatchWithIndexes
. (i don't say this tiny piece of bench code usage is a deal breaker and this is unfixable, though..).
Actually, I think use of Cow
here is appropriate: the later .into_owned()
is no-op , once after .to_vec()
-ed conditionally. And .into_owned()
will nicely .clone()
for the case of block verification code path. So, this isn't changed from the previous code (.to_vec()
).
And, only the cow construction (= Cow::from(..)
) (effectively zero-cost) will be incurred if transaction_status_sender.is_none()
for the both cases of block verification and production.
runtime/src/bank.rs
Outdated
recording_config: ExecutionRecordingConfig, | ||
timings: &mut ExecuteTimings, | ||
log_messages_bytes_limit: Option<usize>, | ||
pre_commit_callback: Option<impl FnOnce() -> bool>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it simpler to just not make this optional?
just have impl FnOnce() -> bool
. Likely compiler will optimize the simple case of || true
out completely, and let's us simplify the code by not having conditional calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hehe, i made them Option
-ed by purpose. wrote about it a bit: 36f8537
indeed, || true
will completely optimized out. but I prefer to retain Option
here, considering this is very security sensitive code for extra safety to ensure block-verification code-path isn't affected.
runtime/src/bank.rs
Outdated
timings: &mut ExecuteTimings, | ||
log_messages_bytes_limit: Option<usize>, | ||
pre_commit_callback: Option<impl FnOnce() -> bool>, | ||
) -> Option<(Vec<TransactionCommitResult>, TransactionBalancesSet)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also seems like it should be a Result
instead of an Option
, because this should succeed and fails due to something going wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is done: 3b852f6
bb442fa
to
77136d3
Compare
f290620
to
447bdb6
Compare
ledger/src/blockstore_processor.rs
Outdated
@@ -110,25 +113,30 @@ fn first_err(results: &[Result<()>]) -> Result<()> { | |||
fn get_first_error( | |||
batch: &TransactionBatch<impl SVMTransaction>, | |||
commit_results: &[TransactionCommitResult], | |||
is_block_producing_unified_scheduler: bool, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems odd to me that we would call this function at all for a block-producing scheduler, it seems to make more sense if we had the if switch around the call than inside to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also maybe a dumb question - IF we detect an error here we discard the block as dead.
Why commit the results of such transactions at all?
Why not make this function (or slight modification of it) a pre-commit check for the block-verification variant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems odd to me that we would call this function at all for a block-producing scheduler, it seems to make more sense if we had the if switch around the call than inside to me.
done: 8112db8 (after rebase: 94a1ba0)
IF we detect an error here we discard the block as dead. Why commit the results of such transactions at all?
I did look up the git history a bit. After that it seems that this is just missed optimization. so, i made the block verification code-path bail out earlier: 8112db8 (after rebase: 94a1ba0)
Why not make this function (or slight modification of it) a pre-commit check for the block-verification variant?
I moved this function into pre-commit check: 8112db8 (after rebase: 94a1ba0)
As an extra clean-up bonus, i also moved the apply_cost_tracker_during_replay
thing as well.
50dda93
to
01871e3
Compare
305f989
to
f8eeef5
Compare
f8eeef5
to
272a959
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very much outside the scope of this PR: if the execute_batch
function is shared between block-verification and block-production, blockstore_processor
does not seem like the right place for that function to live. In block-production, we're not processing the blockstore!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, it should be moved somewhere else eventually.
// ensure good_tx will succeed and was just rolled back above due to other failing tx | ||
let entry_3 = next_entry(&entry_2_to_3_mint_to_1.hash, 1, vec![good_tx]); | ||
assert_matches!( | ||
process_entries_for_tests_without_scheduler(&bank, vec![entry_3]), | ||
Ok(()) | ||
); | ||
// First transaction in third entry succeeded, so keypair1 lost 1 lamport | ||
assert_eq!(bank.get_balance(&keypair1.pubkey()), 3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here
do_create_test_recorder(bank, blockstore, poh_config, leader_schedule_cache, false) | ||
} | ||
|
||
pub fn create_test_recorder_with_index_tracking( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of scope - this should not exist. the indexes are only used for RPC. block-production in general should not support RPC-only stuff.
@@ -249,7 +249,7 @@ struct RentMetrics { | |||
pub type BankStatusCache = StatusCache<Result<()>>; | |||
#[cfg_attr( | |||
feature = "frozen-abi", | |||
frozen_abi(digest = "BHg4qpwegtaJypLUqAdjQYzYeLfEGf6tA4U5cREbHMHi") | |||
frozen_abi(digest = "4e7a7AAsQrM5Lp5bhREdVZ5QGZfyETbBthhWjYMYb6zS") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
noting for myself: this changed because there's a new TransactionError
variant.
ExecutionRecordingConfig::new_single_setting(transaction_status_sender.is_some()), | ||
timings, | ||
log_messages_bytes_limit, | ||
pre_commit_callback, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is always present now.
Where else do we call load_execute_and_commit_transactions
without a callback?
I thought both branches of replay go through this execute_batch
function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT it done only in tests/benches, we could make the variant without callback dcou, and/or clean it up eventually so that those calls default to using the same replay variant of the callback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
clean it up eventually so that those calls default to using the same replay variant of the callback.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thank you for addressing the feedback and answering my questions! Sorry again for slow review!
@@ -4545,6 +4597,9 @@ impl Bank { | |||
}, | |||
); | |||
|
|||
if let Some(pre_commit_callback) = pre_commit_callback { | |||
pre_commit_callback(timings, &processing_results)?; | |||
} | |||
let commit_results = self.commit_transactions( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ryoqun. I think we may need to grab the freeze_lock
here, similar to how we do in block-production.
commit_transactions
requires that the bank is not frozen yet (obviously), but there's a small gap between recording and this line where the bank could get frozen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ugh, quite good find... fixed: fb71c0d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^comment above, clicked wrong button. May need to grab the freeze_lock as I indicated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thanks for fixing the last minute freeze-lock. Your commit for that was even a bit simpler than I thought it might be!
Problem
Currently, unified scheduler cant record transactions into poh, because its subsystem called
TaskHandler
and its underlying code-path (solana-ledger
=>solana-runtime
) doesn't support such a committing operation.Summary of Changes
Support it with a relatively less intrusive way by introducing a callback mechanism along the code-path.
Also, this pr slightly changes the error code path on transaction errors and
apply_cost_tracker_during_replay
.extracted from #3946, see the pr for the general overview.