[repo depot 3/n] nexus background task to replicate TUF artifacts across sleds #7129

Open
wants to merge 2 commits into main
69 changes: 69 additions & 0 deletions dev-tools/omdb/src/bin/omdb/nexus.rs
@@ -51,6 +51,9 @@ use nexus_types::internal_api::background::RegionSnapshotReplacementFinishStatus
use nexus_types::internal_api::background::RegionSnapshotReplacementGarbageCollectStatus;
use nexus_types::internal_api::background::RegionSnapshotReplacementStartStatus;
use nexus_types::internal_api::background::RegionSnapshotReplacementStepStatus;
use nexus_types::internal_api::background::TufArtifactReplicationCounters;
use nexus_types::internal_api::background::TufArtifactReplicationRequest;
use nexus_types::internal_api::background::TufArtifactReplicationStatus;
use nexus_types::inventory::BaseboardId;
use omicron_uuid_kinds::BlueprintUuid;
use omicron_uuid_kinds::CollectionUuid;
@@ -943,6 +946,9 @@ fn print_task_details(bgtask: &BackgroundTask, details: &serde_json::Value) {
"service_firewall_rule_propagation" => {
print_task_service_firewall_rule_propagation(details);
}
"tuf_artifact_replication" => {
print_task_tuf_artifact_replication(details);
}
_ => {
println!(
"warning: unknown background task: {:?} \
@@ -2024,6 +2030,69 @@ fn print_task_service_firewall_rule_propagation(details: &serde_json::Value) {
};
}

fn print_task_tuf_artifact_replication(details: &serde_json::Value) {
fn print_counters(counters: TufArtifactReplicationCounters) {
const ROWS: &[&str] = &[
"list ok:",
"list err:",
"put ok:",
"put err:",
"copy ok:",
"copy err:",
"delete ok:",
"delete err:",
];
const WIDTH: usize = const_max_len(ROWS);

for (label, value) in ROWS.iter().zip([
counters.list_ok,
counters.list_err,
counters.put_ok,
counters.put_err,
counters.copy_ok,
counters.copy_err,
counters.delete_ok,
counters.delete_err,
]) {
println!(" {label:<WIDTH$} {value:>3}");
}
}

match serde_json::from_value::<TufArtifactReplicationStatus>(
details.clone(),
) {
Err(error) => eprintln!(
"warning: failed to interpret task details: {:?}: {:?}",
error, details
),
Ok(status) => {
println!(" request ringbuf:");
for TufArtifactReplicationRequest {
time,
target_sled,
operation,
error,
} in status.request_debug_ringbuf.iter()
{
println!(" - target sled: {target_sled}");
println!(" operation: {operation:?}");
println!(
" at: {}",
time.to_rfc3339_opts(SecondsFormat::Secs, true)
);
if let Some(error) = error {
println!(" error: {error}")
}
}
println!(" last run:");
print_counters(status.last_run_counters);
println!(" lifetime:");
print_counters(status.lifetime_counters);
println!(" local repos: {}", status.local_repos);
}
}
}

/// Summarizes an `ActivationReason`
fn reason_str(reason: &ActivationReason) -> &'static str {
match reason {
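The field width in `print_counters` above comes from `const_max_len`, which this diff references but does not add (it presumably already exists elsewhere in omdb's nexus.rs). A minimal sketch of what such a helper could look like, assuming it simply computes the longest label length at compile time (the actual implementation may differ):

// Hypothetical sketch; not taken from this PR.
const fn const_max_len(strs: &[&str]) -> usize {
    let mut max = 0;
    let mut i = 0;
    while i < strs.len() {
        if strs[i].len() > max {
            max = strs[i].len();
        }
        i += 1;
    }
    max
}

With ROWS as defined above, `const_max_len(ROWS)` would evaluate to the length of "delete err:", keeping the counter values aligned in the printed output.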
12 changes: 12 additions & 0 deletions dev-tools/omdb/tests/env.out
@@ -175,6 +175,10 @@ task: "switch_port_config_manager"
manages switch port settings for rack switches


task: "tuf_artifact_replication"
replicate update repo artifacts across sleds


task: "v2p_manager"
manages opte v2p mappings for vpc networking

@@ -355,6 +359,10 @@ task: "switch_port_config_manager"
manages switch port settings for rack switches


task: "tuf_artifact_replication"
replicate update repo artifacts across sleds


task: "v2p_manager"
manages opte v2p mappings for vpc networking

@@ -522,6 +530,10 @@ task: "switch_port_config_manager"
manages switch port settings for rack switches


task: "tuf_artifact_replication"
replicate update repo artifacts across sleds


task: "v2p_manager"
manages opte v2p mappings for vpc networking

62 changes: 62 additions & 0 deletions dev-tools/omdb/tests/successes.out
@@ -394,6 +394,10 @@ task: "switch_port_config_manager"
manages switch port settings for rack switches


task: "tuf_artifact_replication"
replicate update repo artifacts across sleds


task: "v2p_manager"
manages opte v2p mappings for vpc networking

@@ -724,6 +728,35 @@ task: "switch_port_config_manager"
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
warning: unknown background task: "switch_port_config_manager" (don't know how to interpret details: Object {})

task: "tuf_artifact_replication"
configured period: every <REDACTED_DURATION>h
currently executing: no
last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
request ringbuf:
- target sled: ..........<REDACTED_UUID>...........
operation: List
at: <REDACTED_TIMESTAMP>
last run:
list ok: 1
list err: 0
put ok: 0
put err: 0
copy ok: 0
copy err: 0
delete ok: 0
delete err: 0
lifetime:
list ok: 1
list err: 0
put ok: 0
put err: 0
copy ok: 0
copy err: 0
delete ok: 0
delete err: 0
local repos: 0

task: "v2p_manager"
configured period: every <REDACTED_DURATION>s
currently executing: no
@@ -1183,6 +1216,35 @@ task: "switch_port_config_manager"
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
warning: unknown background task: "switch_port_config_manager" (don't know how to interpret details: Object {})

task: "tuf_artifact_replication"
configured period: every <REDACTED_DURATION>h
currently executing: no
last completed activation: <REDACTED ITERATIONS>, triggered by a periodic timer firing
started at <REDACTED_TIMESTAMP> (<REDACTED DURATION>s ago) and ran for <REDACTED DURATION>ms
request ringbuf:
- target sled: ..........<REDACTED_UUID>...........
operation: List
at: <REDACTED_TIMESTAMP>
last run:
list ok: 1
list err: 0
put ok: 0
put err: 0
copy ok: 0
copy err: 0
delete ok: 0
delete err: 0
lifetime:
list ok: 1
list err: 0
put ok: 0
put err: 0
copy ok: 0
copy err: 0
delete ok: 0
delete err: 0
local repos: 0

task: "v2p_manager"
configured period: every <REDACTED_DURATION>s
currently executing: no
22 changes: 12 additions & 10 deletions dev-tools/omdb/tests/usage_errors.out
@@ -315,17 +315,19 @@ Options:
Show sleds that match the given filter

Possible values:
- all: All sleds in the system, regardless of policy or state
- commissioned: All sleds that are currently part of the control plane cluster
- decommissioned: All sleds that were previously part of the control plane cluster
but have been decommissioned
- discretionary: Sleds that are eligible for discretionary services
- in-service: Sleds that are in service (even if they might not be eligible
- all: All sleds in the system, regardless of policy or state
- commissioned: All sleds that are currently part of the control plane cluster
- decommissioned: All sleds that were previously part of the control plane
cluster but have been decommissioned
- discretionary: Sleds that are eligible for discretionary services
- in-service: Sleds that are in service (even if they might not be eligible
for discretionary services)
- query-during-inventory: Sleds whose sled agents should be queried for inventory
- reservation-create: Sleds on which reservations can be created
- vpc-routing: Sleds which should be sent OPTE V2P mappings and Routing rules
- vpc-firewall: Sleds which should be sent VPC firewall rules
- query-during-inventory: Sleds whose sled agents should be queried for inventory
- reservation-create: Sleds on which reservations can be created
- vpc-routing: Sleds which should be sent OPTE V2P mappings and Routing rules
- vpc-firewall: Sleds which should be sent VPC firewall rules
- tuf-artifact-replication: Sleds which should have TUF repo artifacts replicated onto
them

--log-level <LOG_LEVEL>
log level filter
16 changes: 16 additions & 0 deletions nexus-config/src/nexus_config.rs
@@ -417,6 +417,8 @@ pub struct BackgroundTaskConfig {
/// configuration for region snapshot replacement finisher task
pub region_snapshot_replacement_finish:
RegionSnapshotReplacementFinishConfig,
/// configuration for TUF artifact replication task
pub tuf_artifact_replication: TufArtifactReplicationConfig,
}

#[serde_as]
Expand Down Expand Up @@ -722,6 +724,14 @@ pub struct RegionSnapshotReplacementFinishConfig {
pub period_secs: Duration,
}

#[serde_as]
#[derive(Clone, Debug, Deserialize, Eq, PartialEq, Serialize)]
pub struct TufArtifactReplicationConfig {
/// period (in seconds) for periodic activations of this background task
#[serde_as(as = "DurationSeconds<u64>")]
pub period_secs: Duration,
}

/// Configuration for a nexus server
#[derive(Clone, Debug, Deserialize, PartialEq, Serialize)]
pub struct PackageConfig {
@@ -978,6 +988,7 @@ mod test {
region_snapshot_replacement_garbage_collection.period_secs = 30
region_snapshot_replacement_step.period_secs = 30
region_snapshot_replacement_finish.period_secs = 30
tuf_artifact_replication.period_secs = 300
[default_region_allocation_strategy]
type = "random"
seed = 0
@@ -1174,6 +1185,10 @@
RegionSnapshotReplacementFinishConfig {
period_secs: Duration::from_secs(30),
},
tuf_artifact_replication:
TufArtifactReplicationConfig {
period_secs: Duration::from_secs(300)
},
},
default_region_allocation_strategy:
crate::nexus_config::RegionAllocationStrategy::Random {
@@ -1257,6 +1272,7 @@
region_snapshot_replacement_garbage_collection.period_secs = 30
region_snapshot_replacement_step.period_secs = 30
region_snapshot_replacement_finish.period_secs = 30
tuf_artifact_replication.period_secs = 300
[default_region_allocation_strategy]
type = "random"
"##,
4 changes: 4 additions & 0 deletions nexus/Cargo.toml
@@ -7,6 +7,10 @@ license = "MPL-2.0"
[lints]
workspace = true

[features]
# Set by omicron-package based on the target configuration.
rack-topology-single-sled = []
Comment on lines +10 to +12 (Collaborator):
Could we document who uses this and what it's intended for? My guess was that it controls a bit of policy about whether we expect/require that we have multiple sleds for availability, but then I'd have expected that to be a runtime thing and also we already have stuff in this bucket (like Crucible region allocation?) so does that use some other mechanism?

(edit: more in another comment where we use it)
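
For context on the mechanism being asked about (not part of this PR): a Cargo feature like this is resolved at compile time, so any code consulting it would use `cfg!(...)` or `#[cfg(...)]` rather than reading runtime configuration. A hedged sketch, with an illustrative function name and policy values that are not taken from this change:

// Hypothetical example of gating behavior on the compile-time feature.
fn required_artifact_replica_count() -> usize {
    if cfg!(feature = "rack-topology-single-sled") {
        // A single-sled rack has nowhere else to replicate to.
        1
    } else {
        // Placeholder value; the real redundancy policy may differ.
        3
    }
}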


[build-dependencies]
omicron-rpaths.workspace = true

1 change: 1 addition & 0 deletions nexus/db-model/src/schema.rs
@@ -907,6 +907,7 @@
sled_policy -> crate::sled_policy::SledPolicyEnum,
sled_state -> crate::SledStateEnum,
sled_agent_gen -> Int8,
repo_depot_port -> Int4,
}
}

3 changes: 2 additions & 1 deletion nexus/db-model/src/schema_versions.rs
@@ -17,7 +17,7 @@ use std::collections::BTreeMap;
///
/// This must be updated when you change the database schema. Refer to
/// schema/crdb/README.adoc in the root of this repository for details.
pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(120, 0, 0);
pub const SCHEMA_VERSION: SemverVersion = SemverVersion::new(121, 0, 0);

/// List of all past database schema versions, in *reverse* order
///
@@ -29,6 +29,7 @@ static KNOWN_VERSIONS: Lazy<Vec<KnownVersion>> = Lazy::new(|| {
// | leaving the first copy as an example for the next person.
// v
// KnownVersion::new(next_int, "unique-dirname-with-the-sql-files"),
KnownVersion::new(121, "tuf-artifact-replication"),
KnownVersion::new(120, "rendezvous-debug-dataset"),
KnownVersion::new(119, "tuf-artifact-key-uuid"),
KnownVersion::new(118, "support-bundles"),
10 changes: 10 additions & 0 deletions nexus/db-model/src/sled.rs
@@ -81,6 +81,9 @@ pub struct Sled {
/// This is specifically distinct from `rcgen`, which is incremented by
/// child resources as part of `DatastoreCollectionConfig`.
pub sled_agent_gen: Generation,

// ServiceAddress (Repo Depot API). Uses `ip`.
pub repo_depot_port: SqlU16,
}

impl Sled {
@@ -169,6 +172,7 @@ impl From<Sled> for params::SledAgentInfo {
};
Self {
sa_address: sled.address(),
repo_depot_port: sled.repo_depot_port.into(),
role,
baseboard: Baseboard {
serial: sled.serial_number.clone(),
@@ -220,6 +224,9 @@ pub struct SledUpdate {
pub ip: ipv6::Ipv6Addr,
pub port: SqlU16,

// ServiceAddress (Repo Depot API). Uses `ip`.
pub repo_depot_port: SqlU16,

// Generation number - owned and incremented by sled-agent.
pub sled_agent_gen: Generation,
}
@@ -228,6 +235,7 @@ impl SledUpdate {
pub fn new(
id: Uuid,
addr: SocketAddrV6,
repo_depot_port: u16,
baseboard: SledBaseboard,
hardware: SledSystemHardware,
rack_id: Uuid,
@@ -247,6 +255,7 @@
reservoir_size: hardware.reservoir_size,
ip: addr.ip().into(),
port: addr.port().into(),
repo_depot_port: repo_depot_port.into(),
sled_agent_gen,
}
}
@@ -282,6 +291,7 @@ impl SledUpdate {
reservoir_size: self.reservoir_size,
ip: self.ip,
port: self.port,
repo_depot_port: self.repo_depot_port,
last_used_address,
sled_agent_gen: self.sled_agent_gen,
}
1 change: 1 addition & 0 deletions nexus/db-queries/src/db/datastore/dataset.rs
@@ -376,6 +376,7 @@ mod test {
let sled = SledUpdate::new(
*sled_id.as_untyped_uuid(),
"[::1]:0".parse().unwrap(),
0,
Collaborator:
There are quite a lot of places where we use 0 for the repo depot port. I assume this is a sentinel value? It might be nice to use Option instead here. (see also the discussion about whether the field should be NULLable but I think this is true regardless).

Contributor Author:
In general in these test functions, I attempted to follow the same usage of how the sled agent port was specified. In this case you can see the sled agent SocketAddr is localhost port 0.

Collaborator:
Yeah, that makes sense.

The end result is that there are many callers where we're repeating the same values that, if I'm understanding right, can't actually be right -- they're just unused. This makes me wonder if both of those ought to be optional. Maybe this should be a SledUpdateBuilder? But anyway it's fine to say that's out of scope here.

SledBaseboard {
serial_number: "test-sn".to_string(),
part_number: "test-pn".to_string(),
Expand Down