[nexus] Support Bundle background task #7063
Conversation
entries.sort_by(|a, b| a.file_name().cmp(&b.file_name()));

for entry in &entries {
    // Remove the "/tmp/..." prefix from the path when we're storing it in the
I guess this answers my question about where the tmpdir is being created. What I still don't understand is whether that's the best location for it.
Just dug this up from my system which I based on some discussion in chat:
2024-11-04.21:41:38 zfs create -o atime=off -o sync=disabled rpool/faketmpfs
2024-11-04.21:41:56 zfs set quota=20G rpool/faketmpfs
2024-11-04.21:42:39 zfs set mountpoint=/faketmpfs rpool/faketmpfs
2024-11-04.21:43:24 zfs snapshot rpool/faketmpfs@empty
Wondering if we should create a dataset similar to this on the active M.2 device if it's missing at sled-agent startup, or, if it's found, rewind the snapshot to the empty state.
Maybe @jclulow or @davepacheco could weigh in on this (which is where I think I got the above from) vs. storing the temporary bundle collection in /tmp, where we may have limited memory due to the VMM reservoir and control plane processes. Or perhaps I am being overly cautious...
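For concreteness, a minimal sketch of the startup behavior suggested above, assuming a dataset named like the one in the shell history (none of this exists in sled-agent today; the dataset name, mountpoint, and quota are placeholders):

```rust
use std::process::Command;

// Hypothetical startup check: create a quota-limited scratch dataset on the
// boot M.2 if it's missing, or roll it back to a known-empty snapshot if it
// already exists. All names and property values are illustrative.
fn ensure_scratch_dataset() -> anyhow::Result<()> {
    let dataset = "rpool/faketmpfs";
    let empty_snapshot = format!("{dataset}@empty");

    let exists = Command::new("zfs")
        .args(["list", dataset])
        .output()?
        .status
        .success();

    if exists {
        // Discard any leftover scratch data from a previous boot.
        Command::new("zfs")
            .args(["rollback", "-r", empty_snapshot.as_str()])
            .status()?;
    } else {
        // First boot (or dataset destroyed): create it with the properties
        // from the shell history above, then snapshot the empty state.
        Command::new("zfs")
            .args([
                "create", "-o", "atime=off", "-o", "sync=disabled",
                "-o", "quota=20G", "-o", "mountpoint=/faketmpfs", dataset,
            ])
            .status()?;
        Command::new("zfs")
            .args(["snapshot", empty_snapshot.as_str()])
            .status()?;
    }
    Ok(())
}
```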
Sorry, I'm not up to speed on any of this so I'm not sure what to suggest, but I'm happy to talk through it if it's helpful. Generally I avoid using /tmp / tmpfs for just about anything. When figuring out where to store stuff I think about what data's being stored, how big it might be, and what its lifecycle is (i.e., how do we make sure we don't leak it, does it need to survive reboot, can it become incorrect or inconsistent with some other copies, etc.) to figure out where it should go. I'm not sure how helpful that is, but I'm happy to talk through it if that's useful!
This was covered a bit in https://rfd.shared.oxide.computer/rfd/0496#_storage
> this RFD proposes storing support bundles within the Nexus zone filesystem during collection, but transferring them into a dataset within a dedicated U.2 while the bundle is active, and accessible for download by the Nexus API.
That was my intent, at least - that this data would go through the following steps:
- Collected into the Nexus transient zone filesystem (limited to one bundle at a time, thanks to the structure of this background task).
- Aggregated into a zipfile by the Nexus collection task.
- Transferred to a U.2, which may be running on a separate sled from Nexus entirely. It's only at this point that the bundle is made "active" and is durable.
I do think it's a mistake for this to be using /tmp -- this code is running in a Nexus zone, so I think I need to be using /var/tmp to actually be using that transient zone storage, which was my goal, rather than using a RAM-backed filesystem.
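For reference, a minimal sketch of what pointing the scratch space at /var/tmp could look like, assuming the `tempfile` crate (the prefix and function name are illustrative, not what the background task actually does):

```rust
use std::io;
use tempfile::{Builder, TempDir};

// Sketch only: create the per-bundle scratch directory under /var/tmp, which
// is backed by the zone's transient filesystem rather than a RAM-backed tmpfs.
fn bundle_scratch_dir() -> io::Result<TempDir> {
    Builder::new()
        .prefix("support-bundle-")
        .tempdir_in("/var/tmp")
    // The directory (and everything collected into it) is removed when the
    // returned TempDir is dropped.
}
```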
That all makes sense I think. The only catch with the root filesystem is that it will disappear if the system reboots and then there might be some database-level thing to clean up (e.g., mark the bundle failed? I don't know).
> That all makes sense I think. The only catch with the root filesystem is that it will disappear if the system reboots and then there might be some database-level thing to clean up (e.g., mark the bundle failed? I don't know).
The current implementation of support bundles assigns them to a single Nexus, and it's the responsibility of that individual Nexus to perform collection.
- If the Nexus hasn't been expunged: when it reboots, it'll see the bundle in state `SupportBundleState::Collecting`, and will restart collection.
omicron/nexus/src/app/background/tasks/support_bundle_collector.rs
Lines 336 to 344 in d23826e
let result = self
    .datastore
    .support_bundle_list_assigned_to_nexus(
        opctx,
        &pagparams,
        self.nexus_id,
        vec![SupportBundleState::Collecting],
    )
    .await;
- If either the Nexus was expunged or the support bundle storage was expunged, then the reconfigurator has an execution step which invokes `support_bundle_fail_expunged`:
omicron/nexus/db-queries/src/db/datastore/support_bundle.rs
Lines 219 to 221 in 0603f0b
/// Marks support bundles as failed if their assigned Nexus or backing
/// dataset has been destroyed.
pub async fn support_bundle_fail_expunged(
This will update database records, and possibly re-assign the Nexus which "owns" the bundle so that storage on the final U.2 can be freed:
omicron/nexus/src/app/background/tasks/support_bundle_collector.rs
Lines 229 to 247 in d23826e
// Monitors all bundles that are "destroying" or "failing" and assigned to
// this Nexus, and attempts to clear their storage from Sled Agents.
async fn cleanup_destroyed_bundles(
    &self,
    opctx: &OpContext,
) -> anyhow::Result<CleanupReport> {
    let pagparams = DataPageParams::max_page();
    let result = self
        .datastore
        .support_bundle_list_assigned_to_nexus(
            opctx,
            &pagparams,
            self.nexus_id,
            vec![
                SupportBundleState::Destroying,
                SupportBundleState::Failing,
            ],
        )
        .await;
(This shares some logic between the "bundle was deleted manually by an operator" and "bundle was failed because something was expunged" pathways.)
Sorry for the drive-by comment - I don't have much context here, but I skimmed over `support_bundle_fail_expunged` and had a question. It looks like that's checking the current blueprint for expunged zones/datasets specifically. How will that interact with dropping expunged entities from the blueprint altogether?
> Sorry for the drive-by comment - I don't have much context here, but I skimmed over `support_bundle_fail_expunged` and had a question. It looks like that's checking the current blueprint for expunged zones/datasets specifically. How will that interact with dropping expunged entities from the blueprint altogether?
Filed #7319 - I think this could result in us "missing" marking a bundle as failed if we quickly transition a zone from "expunged" to "pruned" without execution ever completing successfully in-between.
For the tempdir side of this: fixed in 53dcbe6.
One optional suggestion, but otherwise nothing further from me. I defer to those who know more about Nexus on the tmp questions.
let mut reader = BufReader::new(std::fs::File::open(&src)?);

loop {
    let buf = reader.fill_buf()?;
    let len = buf.len();
    if len == 0 {
        break;
    }
    zip.write_all(&buf)?;
    reader.consume(len);
}
I think my first suggestion led us astray. We can simplify this further and still avoid reading the whole file into memory with an `io::copy` call. Sorry about that.
let mut file = std::fs::File::open(&src)?;
std::io::copy(&mut file, zip)?;
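For context, a sketch of how that could look inside the per-file loop, assuming `zip` here is a `ZipWriter` from a recent version of the `zip` crate (the function and parameter names are made up for illustration):

```rust
use std::io::{Seek, Write};
use std::path::Path;
use zip::{write::SimpleFileOptions, ZipWriter};

// Illustrative helper: stream one collected file into the bundle zip without
// buffering the whole file in memory.
fn add_file_to_zip<W: Write + Seek>(
    zip: &mut ZipWriter<W>,
    entry_name: &str,
    src: &Path,
) -> anyhow::Result<()> {
    zip.start_file(entry_name, SimpleFileOptions::default())?;
    let mut file = std::fs::File::open(src)?;
    // io::copy streams through an internal buffer rather than reading the
    // whole file into memory at once.
    std::io::copy(&mut file, zip)?;
    Ok(())
}
```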
Sounds good. Looks like this is using buffered writers under the hood too: https://doc.rust-lang.org/beta/src/std/io/copy.rs.html#70-90
Patched up in a54097c
PR 3 / ??? This PR aims to re-use the support bundle management logic in sled-agent/src/support_bundle/storage.rs for both the real and simulated sled agent. It accomplishes this goal with the following:
1. It creates a trait, `LocalStorage`, that abstracts access to storage. The "real" sled agent accesses real storage; the simulated sled agent can access the simulated storage APIs.
2. It reduces the usage of unnecessary async mutexes to make lifetimes slightly more manageable. This happens to align with our guidance in RFD 400 (https://rfd.shared.oxide.computer/rfd/400#no_mutex), but has a fall-out impact in replacing `.await` calls throughout Omicron.

As an end result of this PR, tests in subsequent PRs (e.g. #7063) can rely on the simulated sled agent to respond realistically to support bundle requests, rather than using a stub implementation.
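To make the abstraction in point 1 concrete, here is a rough sketch of what such a trait could look like; the method name, parameters, and return type are invented for illustration and are not the actual contents of storage.rs:

```rust
use camino::Utf8PathBuf;

// Hypothetical shape of the storage abstraction: support bundle code asks
// where a dataset lives through the trait, the real sled-agent answers with
// real ZFS datasets, and the simulated sled-agent answers from its in-memory
// model of storage.
#[async_trait::async_trait]
trait LocalStorage: Send + Sync {
    /// Resolve the on-disk root of the dataset a bundle is stored in.
    async fn dataset_mountpoint(
        &self,
        dataset_id: uuid::Uuid,
    ) -> anyhow::Result<Utf8PathBuf>;
}
```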
PR 5 / ??? Implements support bundle APIs for accessing storage bundles. Range request support is only partially implemented as-is -- follow-up support is described in #7356
Builds atop the API skeleton in:
- #7008
Uses the support bundle datastore interfaces in:
- #7021
Relies on the background task in:
- #7063
PR 4 / ???
Adds a background task to manage support bundle lifecycle and perform collection.
In this PR: