Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[reconfigurator-cli] Fix blueprint-edit failing if there are multiple expunged zones with the same external IP #7307

Merged
merged 8 commits into from
Jan 9, 2025

Conversation

jgallagher
Copy link
Contributor

@jgallagher jgallagher commented Jan 6, 2025

I tried to reproduce #7305 using just reconfigurator-cli's simulated example system. This required one change (adding external DNS zones, db90b4a), and while I was here I smoothed over a couple of minor ergonomic annoyances (275e24e is uncontroversial, I hope, and a147954 which is a little less clearly good - happy to back this one out if folks don't like it).

With these changes, I can get to a reproduction:

generated RNG seed: discernibly-adaptive-grison
〉load-example
loaded example system with:
- collection: ed006741-bed9-4fd1-87e7-26c95a6f4dd9
- blueprint: 72183372-6429-4ab7-af83-5fea1a40bbcf
〉blueprint-show 72183372-6429-4ab7-af83-5fea1a40bbcf
... snip, finding an external DNS zone to expunge ...
〉blueprint-edit 72183372-6429-4ab7-af83-5fea1a40bbcf expunge-zone 34feb176-1129-4ca8-8df8-6e6cb1be2173
blueprint 9a5f95b4-7eb7-4b8d-b953-5fc9f403f330 created from blueprint 72183372-6429-4ab7-af83-5fea1a40bbcf: expunged zone 34feb176-1129-4ca8-8df8-6e6cb1be2173 from sled e78b0535-959d-4ba9-8d8c-ac3d591fc490
〉blueprint-plan 9a5f95b4-7eb7-4b8d-b953-5fc9f403f330
... snip, finding the ID of the newly-generated blueprint ...
〉blueprint-diff 9a5f95b4-7eb7-4b8d-b953-5fc9f403f330 0a054c71-82cb-4589-afa3-09ced1e425b9
... snip, finding the ID of the newly-added external DNS zone ...
〉blueprint-edit 0a054c71-82cb-4589-afa3-09ced1e425b9 expunge-zone 900df298-9c72-4a64-a4a7-7a92b722f68c
error: adding omicron zone external IP: associating Omicron zone 900df298-9c72-4a64-a4a7-7a92b722f68c with Floating(OmicronZoneExternalFloatingIp { id: 1c962069-7ced-4162-a11f-acb40fd954f4 (external_ip), ip: 198.51.100.3 }) failed due to duplicates: duplicate entry: OmicronZoneExternalIpEntry { zone_id: 900df298-9c72-4a64-a4a7-7a92b722f68c (service), ip: Floating(OmicronZoneExternalFloatingIp { id: 1c962069-7ced-4162-a11f-acb40fd954f4 (external_ip), ip: 198.51.100.3 }) } conflicts with existing: [OmicronZoneExternalIpEntry { zone_id: 34feb176-1129-4ca8-8df8-6e6cb1be2173 (service), ip: Floating(OmicronZoneExternalFloatingIp { id: 1b31bc81-c2ef-4915-be62-c97bd7881b07 (external_ip), ip: 198.51.100.3 }) }]

Fixes #7305

@jgallagher jgallagher changed the title [reconfigurator-cli] Minor ergonomic improvements; load-example includes external DNS [reconfigurator-cli] Fix blueprint-edit failing if there are multiple expunged zones with the same external IP Jan 6, 2025
@jgallagher
Copy link
Contributor Author

The last commit has the fix for #7305, and it's small enough that I appended it to this PR instead of opening a followup. The crux of it is this change:

-        for (_, zone) in
-            parent_blueprint.all_omicron_zones(BlueprintZoneFilter::All)
+        for (_, zone) in parent_blueprint
+            .all_omicron_zones(BlueprintZoneFilter::ShouldBeRunning)

when reconfigurator-cli is finding external networking resources to construct the PlanningInput. It was looking at all zones, so once we got to two expunged zones with the same external IP, we tripped over the error in #7305. Production Nexus, in contrast, checks CRDB for in-service IPs only:

paginated(dsl::external_ip, dsl::id, pagparams)
.filter(dsl::is_service)
.filter(dsl::time_deleted.is_null())
.select(ExternalIp::as_select())

so I believe this is a reconfigurator-cli-only bug.

Copy link
Collaborator

@davepacheco davepacheco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I just had a minor question about some changed test output.

Comment on lines 94 to 103
crucible 4f2eb088-7d28-4c4e-a27c-746400ec65ba in service fd00:1122:3344:102::2f
crucible 53dd7fa4-899e-49ed-9fc2-48222db3e20d in service fd00:1122:3344:102::2a
crucible 7db307d4-a6ed-4c47-bddf-6759161bf64a in service fd00:1122:3344:102::2c
crucible 95ad9a1d-4063-4874-974c-2fc92830be27 in service fd00:1122:3344:102::29
crucible bc095417-e2f0-4e95-b390-9cc3fc6e3c6d in service fd00:1122:3344:102::28
crucible d90401f1-fbc2-42cb-bf17-309ee0f922fe in service fd00:1122:3344:102::2b
crucible e8f994c0-0a1b-40e6-8db1-40a8ca89e503 in service fd00:1122:3344:102::2d
crucible e9bf481e-323e-466e-842f-8107078c7137 in service fd00:1122:3344:102::2e
crucible f97aa057-6485-45d0-9cb4-4af5b0831d48 in service fd00:1122:3344:102::25
crucible_pantry eaec16c0-0d44-4847-b2d6-31a5151bae52 in service fd00:1122:3344:102::24
crucible_pantry f97aa057-6485-45d0-9cb4-4af5b0831d48 in service fd00:1122:3344:102::25
external_dns eaec16c0-0d44-4847-b2d6-31a5151bae52 in service fd00:1122:3344:102::24
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like these changes were just due to adding external DNS shifting the generated UUIDs around?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes (and this is strongly related to your other comment, will follow up there)

Comment on lines 132 to 146
crucible 09937ebb-bb6a-495b-bc97-b58076b70a78 in service fd00:1122:3344:103::2b
crucible a999e5fa-3edc-4dac-919a-d7b554cdae58 in service fd00:1122:3344:103::25
crucible b416f299-c23c-46c8-9820-be2b66ffea0a in service fd00:1122:3344:103::26
crucible b43ce109-90d6-46f9-9df0-8c68bfe6d4a0 in service fd00:1122:3344:103::2e
crucible b5d5491d-b3aa-4727-8b55-f66e0581ea4f in service fd00:1122:3344:103::2a
crucible cbe91cdc-cbb6-4760-aece-6ce08b67e85a in service fd00:1122:3344:103::2d
crucible cd3bb540-e605-465f-8c62-177ac482d850 in service fd00:1122:3344:103::28
crucible f3628f0a-2301-4fc8-bcbf-961199771731 in service fd00:1122:3344:103::2c
crucible f52aa245-7e1b-46c0-8a31-e09725f02caf in service fd00:1122:3344:103::29
crucible fae49024-6cec-444d-a6c4-83658ab015a4 in service fd00:1122:3344:103::27
crucible_pantry e8971ab3-fb7d-4ad8-aae3-7f2fe87c51f3 in service fd00:1122:3344:103::24
external_dns cc1dc86d-bd6f-4929-aa4a-9619012e9393 in service fd00:1122:3344:103::23
internal_dns 728db429-8621-4e1e-9915-282aadfa27d1 in service fd00:1122:3344:2::1
internal_ntp c8aa84a5-a802-46c9-adcd-d61e9c8393c9 in service fd00:1122:3344:103::21
nexus e7dd3e98-7fe7-4827-be7f-395ff9a5f542 in service fd00:1122:3344:103::22
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not really clear why this block and the next one changed. It looks like the IPs shifted around? Is that just because of the new external DNS zone too?

Copy link
Contributor Author

@jgallagher jgallagher Jan 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's more that the UUIDs continued shifting around. The zone UUIDs are all coming out of a deterministic RNG, but that's for the entire blueprint. So the initial sled got a new zone (external DNS), which means it now grabbed one more zone UUID than it did previously. Going up, the "new" UUID it got went to one of the crucible zones (4f2eb088-7d28-4c4e-a27c-746400ec65ba) - that UUID used to belong to this sled's internal NTP zone (the first one ExampleSystemBuilder allocates for a sled). Following that, the UUIDs for every zone on this sled (and all subsequent sleds) shifted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, part of the goal of RNG trees was to minimize this shifting. Can it be restructured to achieve that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the output, I think each sled should definitely get a separate RNG.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great call - after pulling in #7321, all three blocks have just a few lines shifted around: https://github.com/oxidecomputer/omicron/pull/7307/files#diff-18cea7c50c6bb09d57e37c8bf51e112a953dc2827d2fe4f300f9b6aac2f5b0fb

Comment on lines +834 to +836
if builder
.current_sled_zones(sled_id, BlueprintZoneFilter::All)
.any(|z| z.id == zone_id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this match all zones or some subset?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think All is right - in this case the user passed in an explicit zone ID, and we're just trying to find it. This would allow someone to try to use reconfigurator-cli to expunge an already-expunged zones, but that seems fine, I think?

Comment on lines 132 to 146
crucible 09937ebb-bb6a-495b-bc97-b58076b70a78 in service fd00:1122:3344:103::2b
crucible a999e5fa-3edc-4dac-919a-d7b554cdae58 in service fd00:1122:3344:103::25
crucible b416f299-c23c-46c8-9820-be2b66ffea0a in service fd00:1122:3344:103::26
crucible b43ce109-90d6-46f9-9df0-8c68bfe6d4a0 in service fd00:1122:3344:103::2e
crucible b5d5491d-b3aa-4727-8b55-f66e0581ea4f in service fd00:1122:3344:103::2a
crucible cbe91cdc-cbb6-4760-aece-6ce08b67e85a in service fd00:1122:3344:103::2d
crucible cd3bb540-e605-465f-8c62-177ac482d850 in service fd00:1122:3344:103::28
crucible f3628f0a-2301-4fc8-bcbf-961199771731 in service fd00:1122:3344:103::2c
crucible f52aa245-7e1b-46c0-8a31-e09725f02caf in service fd00:1122:3344:103::29
crucible fae49024-6cec-444d-a6c4-83658ab015a4 in service fd00:1122:3344:103::27
crucible_pantry e8971ab3-fb7d-4ad8-aae3-7f2fe87c51f3 in service fd00:1122:3344:103::24
external_dns cc1dc86d-bd6f-4929-aa4a-9619012e9393 in service fd00:1122:3344:103::23
internal_dns 728db429-8621-4e1e-9915-282aadfa27d1 in service fd00:1122:3344:2::1
internal_ntp c8aa84a5-a802-46c9-adcd-d61e9c8393c9 in service fd00:1122:3344:103::21
nexus e7dd3e98-7fe7-4827-be7f-395ff9a5f542 in service fd00:1122:3344:103::22
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the output, I think each sled should definitely get a separate RNG.

jgallagher added a commit that referenced this pull request Jan 8, 2025
Suggested by @sunshowers
(#7307 (comment)).
This should prevent changes to one sled (e.g., adding a new zone) from
perturbing IDs on all subsequent sleds in otherwise-stable tests.
jgallagher added a commit that referenced this pull request Jan 8, 2025
* Dump them to stdout instead of stderr
* Omit timestamps (which should fix the test failure on #7307, where our
  expectorate tests don't know how to redact slog timestamps)
jgallagher added a commit that referenced this pull request Jan 8, 2025
My main goal with this PR was to fix the failing test on #7307 by
emitting slog logs without timestamps (to avoid having to redact them in
expectorate tests, and because they're not actually useful in a CLI). As
a nice side effect, the output in a terminal is much nicer.

On main, I see this when planning a blueprint:


![reconfig-main](https://github.com/user-attachments/assets/8a9ed6f0-418c-40ed-8488-45e44afc0067)

I assume this is some bad interaction between slog-async / logs going to
stderr / reedline.

On this branch, I now get:


![reconfig-with-changes](https://github.com/user-attachments/assets/2495880a-4715-4d39-8a08-59f5edb7fb58)
jgallagher added a commit that referenced this pull request Jan 8, 2025
Suggested by @sunshowers

(#7307 (comment)).
This should prevent changes to one sled (e.g., adding a new zone) from
perturbing IDs on all subsequent sleds in otherwise-stable tests. (I
took a stab at combining this with #7307 and confirmed the diff output
is much more reasonable; will update #7307 after this lands.)

The actual code changes here are small and mostly-mechanical; the
changes in `rng.rs` are the driver (moving the sled-specific RNGs into a
per-sled RNG container and dealing with handing those out keyed by sled
ID), and everything else fell out from that. Unfortunately this changes
_all_ zone/dataset/NIC/IP UUIDs in expectorate tests, so the diff volume
is... quite large.
@jgallagher jgallagher merged commit 667832c into main Jan 9, 2025
17 checks passed
@jgallagher jgallagher deleted the john/reconfigurator-cli-ergo branch January 9, 2025 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

failure trying to expunge two external IP zones
3 participants