sled-agent races when configuring switch zone underlay address #7337

Closed
jgallagher opened this issue Jan 13, 2025 · 20 comments · Fixed by #7345

Comments

@jgallagher
Contributor

jgallagher commented Jan 13, 2025

Trying to test something on a4x2, I ran into a situation where MGS had gone into maintenance. Its logs contained the error:

mgs: config refresh failed: initializing http server listening at [fd00:1122:3344:101::2]:12225: failed to bind to [fd00:1122:3344:101::2]:12225

I assume the source of the bind failure was that the underlay address didn't exist (although I'm not sure why we didn't get the rest of the error chain). fd00:1122:3344:101::2 is the correct underlay address of that switch zone, and it did exist at the point I was investigating. Clearing and restarting MGS with no other changes to the system worked fine.

Looking at recent changes in this area, I'm wondering if #7260 introduced a race that could explain this. Prior to that change, sled-agent was zlogin'ing into the switch zone to create the underlay address. After that change, instead it reconfigures zone-network-setup, sends an SMF refresh, and expects zone-network-setup to create the address. But I think at that point, both of these are happening concurrently:

  • sled-agent in the gz is reconfiguring other switch zone services to tell them to start listening on the underlay IP
  • zone-network-service in the switch zone is handling the SMF refresh and actually creating the underlay IP

which means services could fail to bind if they lose that race.

Does this sound reasonable @karencfv / @bnaecker? Any ideas on how we could fix this without falling back to zlogin?

@bnaecker
Collaborator

That seems like it could be happening.

I'm not sure the best way to solve it. Maybe we could make the refresh method of the zone-network-setup service an actual script that runs, which would make it synchronous from the point of view of the sled-agent. I've not done that before, but I believe it's similar to making a script for the start method.

Another option would be to have the sled-agent do the refresh, and then wait for the setup service to complete. IIUC, the service is considered online when it exits successfully. I don't know if there's a "wait for service" method, but that would do it. Normally those are expressed as dependencies, but that won't work in this case AFAIU, because the switch services all need to come up before the zone-network-service runs.
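
For illustration, a minimal sketch of that second option, polling the service's state with svcs (the FMRI here is an assumption, and note that refreshing doesn't transition an already-online service out of online, so state polling helps more for initial startup than for a refresh):

# Sketch only: wait for the setup service to report online.
# A real implementation would also bail out on "maintenance" or a timeout.
FMRI=svc:/oxide/zone-network-setup:default
while [ "$(svcs -Ho state "$FMRI")" != "online" ]; do
    sleep 1
done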

@karencfv
Contributor

Ah bummer. Was the zone-network-setup service in maintenance as well by any chance? Would you happen to have the logs for that service? I'm asking because if that service was successful then it created all addresses. Since restarting MGS seems to have fixed the problem, I have a feeling the problem may be elsewhere.

The mgs service depends on zone-network-service. So that "should" take care of any races.

Do you still have the sled agent logs? Do they contain "refreshed zone-network-setup service with new configuration" and "refreshed MGS service with new configuration"?

@karencfv
Contributor

Maybe we could make the refresh method of the zone-network-setup service an actual script that runs, which would make it synchronous from the point of view of the sled-agent. I've not done that before, but I believe it's similar to making a script for the start method.

Sadly, this is kinda doing the same thing as the network setup service. The service is a thin wrapper around ipadm and other commands.

@bnaecker
Collaborator

Sadly, this is kinda doing the same thing as the network setup service

I don't think that's quite right. The difference is what "refreshing the service" means. By default, it's delivering a signal to the process. When that's delivered, refreshing is "complete", even though the application still has a bunch of other work to do (all the ipadm commands you mention). If we make the refresh method do that other work itself, then the refresh is complete when the work is done.

@karencfv
Contributor

Thinking about this a bit more: perhaps even though mgs is dependent on the zone-network-service, some sort of race can still happen when both services are refreshed, depending on how SMF interprets dependencies? I'm not quite sure about this or about the inner workings of SMF 🤔

I wonder if it makes sense to do a synchronous service refresh (svcadm refresh -s svc:/oxide/my-service:default) after svccfg -s svc:/oxide/my-service:default refresh, to make sure everything is up and running before refreshing the next service? Does this idea make sense @citrus-it or @jclulow ?

@jclulow
Collaborator

jclulow commented Jan 14, 2025

Refreshing is always an asynchronous thing, as far as I know. Most things just accept a signal, and then do some amount of work on their own time. There isn't really a mechanism for the service to communicate that the refresh operation is completed.

Something we could do is have the service expose information about its current configuration as a non-persistent property, and have consumers inspect that property and wait for it to have the expected value as needed.
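
As a rough sketch of the service side (the property group and property names here are invented for illustration), the refresh method could record the address it just configured in a non-persistent property group:

# Hypothetical: run by the zone-network-setup refresh method once the address
# is plumbed. The trailing "P" flag makes the property group non-persistent;
# a real implementation would tolerate the group already existing.
svccfg -s svc:/oxide/zone-network-setup:default addpg state application P
svccfg -s svc:/oxide/zone-network-setup:default setprop state/underlay_addr = astring: '"fd00:1122:3344:101::2"'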

@bnaecker
Collaborator

Actually, it looks like the SMF service's refresh method is already doing the work itself, rather than delivering a signal. So I'm not sure more could be done there to make it synchronous.

<exec_method type='method' name='refresh'
exec='/opt/oxide/zone-setup-cli/bin/zone-setup common-networking -d %{config/datalink} -s %{config/static_addr} -g %{config/gateway}'
timeout_seconds='0' />

My guess is having the sled-agent wait for the new address to show up in the SMF properties of the switch zone services, before refreshing them, might work.

@jclulow
Collaborator

jclulow commented Jan 14, 2025

Another option is that we could actually have more than one service instance here, the default one for the base setup, say, and then an underlay instance (or whatever) that would be responsible for setting up the address we currently add in there via the refresh today. Then, other services could depend on the specific instance that needs to be fully online prior to themselves starting.

The second instance could start disabled by default and be enabled where we would otherwise today be doing the refresh, say, or the base instance could itself enable the second instance at the appropriate moment, etc. Having an instance per separate resource would make them essentially legible to SMF as something that can be a dependency, I think.
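
To make the shape of that concrete (instance, service, and property-group names below are made up for illustration), enabling the second instance synchronously where we do the refresh today, and giving the other switch services a dependency on that specific instance, might look roughly like:

# Hypothetical: bring up the underlay instance and wait for it to come online.
svcadm enable -s svc:/oxide/zone-network-setup:underlay

# Each switch zone service would then declare a dependency on that instance,
# e.g. for MGS (FMRI and property group name are illustrative only):
svccfg -s svc:/oxide/mgs:default addpg underlay_setup dependency
svccfg -s svc:/oxide/mgs:default setprop underlay_setup/grouping = astring: require_all
svccfg -s svc:/oxide/mgs:default setprop underlay_setup/restart_on = astring: none
svccfg -s svc:/oxide/mgs:default setprop underlay_setup/type = astring: service
svccfg -s svc:/oxide/mgs:default setprop underlay_setup/entities = fmri: svc:/oxide/zone-network-setup:underlay
svcadm refresh svc:/oxide/mgs:default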

@karencfv
Contributor

My guess is having the sled-agent wait for the new address to show up in the SMF properties of the switch zone services, before refreshing them, might work.

I'm unsure if that would work. This whole flow gets kicked off when there are new addresses in the request https://github.com/oxidecomputer/omicron/blob/main/sled-agent/src/services.rs#L4290-L4300. From my understanding, we know what the address is, but we need to wait for the zone-network-setup service to actually create it. So the address would show up in the SMF properties, but the service may still not work because zone-network-setup hasn't finished. Unless I'm misunderstanding the proposal?

Another option is that we could actually have more than one service instance here, the default one for the base setup, say, and then an underlay instance (or whatever) that would be responsible for setting up the address we currently add in there via the refresh today. Then, other services could depend on the specific instance that needs to be fully online prior to themselves starting.

The other services have also come up already and are being refreshed. If we made them dependent on a second underlay instance, would this mean they wouldn't come up the first time around, when the underlay instance isn't up yet?

@jclulow
Collaborator

jclulow commented Jan 14, 2025

Yes, that's true. I think ultimately, if you want to refresh stuff without restarting it, it probably just needs to be willing to wait for the configured address to show up, rather than trying once and giving up.
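
In shell terms, the kind of wait being suggested is roughly the following (a sketch only; each service would implement the equivalent internally rather than shelling out, and the address is the one from this report):

# Wait for the configured underlay address to actually exist before binding.
until ipadm show-addr | grep -q 'fd00:1122:3344:101::2/'; do
    sleep 1
done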

@karencfv
Contributor

karencfv commented Jan 14, 2025

Right, so I guess we need to decide which is the better path forward: modify the switch services to wait for the underlay IP, or go back to zlogin in the switch zone.

My first instinct would be to modify the switch services like @jclulow suggests, but I don't have enough knowledge about them and the whole switch zone startup flow to say with confidence that's the best path forward.

@jclulow
Collaborator

jclulow commented Jan 14, 2025

I definitely wouldn't go back to zlogin, FWIW. If you want to create a sort of barrier or sequencing point so that you can wait for the refresh of the first service to complete, I'd definitely look at having the zone network setup service set some non-persistent property that signals that it's done with setting the thing up. Sled agent can poll for that property to know it's complete without having to get into the zone with zlogin and poke around. Would that work?
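
A minimal sketch of that polling, using the hypothetical property from the earlier sketch and shown with plain svcprop as if run in the switch zone for simplicity:

# Hypothetical: wait until the setup service has published the property.
FMRI=svc:/oxide/zone-network-setup:default
until svcprop -p state/underlay_addr "$FMRI" >/dev/null 2>&1; do
    sleep 1
done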

@jgallagher
Contributor Author

Ah bummer. Was the zone-network-setup service in maintenance as well by any chance? Would you happen to have the logs for that service? I'm asking because if that service was successful then it created all addresses. Since restarting MGS seems to have fixed the problem, I have a feeling the problem may be elsewhere.

I do - this a4x2 is still up for now, although I was planning on destroying it shortly. Only mgs was in maintenance. I think these logs are consistent with a race, though. Here are the sled-agent logs noting refreshes:

21:46:04.072Z INFO SledAgent (ServiceManager): refreshed zone-network-setup service with new configuration
21:46:04.327Z INFO SledAgent (ServiceManager): refreshed dendrite service with new configuration
21:46:04.402Z INFO SledAgent (ServiceManager): refreshed lldpd service with new configuration
21:46:04.482Z INFO SledAgent (ServiceManager): refreshed MGS service with new configuration
21:46:04.515Z INFO SledAgent (ServiceManager): refreshed wicketd service with new configuration
21:46:04.652Z INFO SledAgent (ServiceManager): refreshed mgd service with new configuration
21:46:04.822Z INFO SledAgent (ServiceManager): refreshed mg-ddm service with new configuration

Logs from zone-network-service; in particular, the timestamp for the log after creating the IP (at which point we know the IP should have existed) is later than all the refresh timestamps in sled-agent above:

[ Jan 13 21:46:04 Executing refresh method ("/opt/oxide/zone-setup-cli/bin/zone-setup common-networking -d oxControlService18 -s fd00:1122:3344:101::2 -g fd00:1122:3344:101::1"). ]
note: configured to log to "/dev/stderr"
21:46:04.140Z INFO zone-setup: Ensuring IP interface exists on datalink
    datalink = oxControlService18
21:46:04.145Z INFO zone-setup: Setting MTU to 9000 for IPv6 and IPv4
    datalink = oxControlService18
21:46:04.159Z INFO zone-setup: Setting TCP recv_buf size to 1 MB
21:46:04.165Z INFO zone-setup: Setting TCP congestion control algorithm to cubic
21:46:04.171Z INFO zone-setup: Ensuring static and auto-configured addresses are set on the IP interface
    data link = "oxControlService18"
    static address = fd00:1122:3344:101::2
21:46:05.307Z INFO zone-setup: Ensuring there is a default route
    gateway = Ipv6(fd00:1122:3344:101::1)
21:46:05.321Z INFO zone-setup: Populating hosts file for zone
    zonename = "oxz_switch"
[ Jan 13 21:46:05 Method "refresh" exited with status 0. ]

Looking at the MGS logs specifically, we see the first failure on the refresh:

[ Jan 13 21:46:04 Executing refresh method (:kill). ]
Jan 13 21:46:04.502 INFO Set rack_id, rack_id: a9c8e5a3-3f3e-48e0-8d62-a2bbad8d2606, name: c7f27ef1-a2e3-46fa-a92e-26c67776c420, file: gateway/src/lib.rs:309
Jan 13 21:46:04.502 INFO adjusting dropshot addresses; keeping existing addr, addr: [::1]:12225, name: c7f27ef1-a2e3-46fa-a92e-26c67776c420, file: gateway/src/lib.rs:255
Jan 13 21:46:04.502 INFO adjusting dropshot addresses; starting new server, addr: [fd00:1122:3344:101::2]:12225, name: c7f27ef1-a2e3-46fa-a92e-26c67776c420, file: gateway/src/lib.rs:262
Jan 13 21:46:04.503 INFO starting to poll SP sensor data every 1s, component: sensor-poller, name: c7f27ef1-a2e3-46fa-a92e-26c67776c420, file: gateway/src/metrics.rs:430
Jan 13 21:46:04.503 INFO found a new little friend!, chassis_type: Sled, sp_slot: 3, component: sensor-poller, name: c7f27ef1-a2e3-46fa-a92e-26c67776c420, file: gateway/src/metrics.rs:436
Jan 13 21:46:04.503 INFO found a new little friend!, chassis_type: Switch, sp_slot: 0, component: sensor-poller, name: c7f27ef1-a2e3-46fa-a92e-26c67776c420, file: gateway/src/metrics.rs:436
Jan 13 21:46:04.504 INFO found a new little friend!, chassis_type: Sled, sp_slot: 0, component: sensor-poller, name: c7f27ef1-a2e3-46fa-a92e-26c67776c420, file: mgs: config refresh failed: initializing http server listening at [fd00:1122:3344:101::2]:12225: failed to bind to [fd00:1122:3344:101::2]:12225
gateway/src/metrics.rs:436
Jan 13 21:46:04.505 INFO found a new little friend!, chassis_type: Sled[ Jan 13 21:46:04 Stopping because all processes in service exited. ]
[ Jan 13 21:46:04 Executing stop method (:kill). ]
[ Jan 13 21:46:04 Executing start method ("/opt/oxide/mgs/bin/mgs run --id-and-address-from-smf /var/svc/manifest/site/mgs/config.toml &"). ]

MGS then starts and dies several times in quick succession, with the last one still being before the timestamp when zone-network-service was definitely done creating the IP:

Jan 13 21:46:04.586 DEBG registered endpoint, versions: all, path: /ignition, method: GET, local_addr: [::1]:12225, component: dropshot, name: c7f27ef1-a2e3-46fa-a92e-26c67776c420
Jan 13 21:46:04.586 DEBG registered endpoint, versions: all, path: /ignition/{type}/{slot}, method: GET, local_addr: [::1]:12225, component: dropshot, name: c7f27ef1-a2e3-46fa-a92e-26c67776c420
Jan 13 21:46:04.587 DEBG registered endpoint, versions: all, path: /ignition/{type}/{slot}/{command}, method: POST, local_addr: [::1]:12225, component: dropshot, name: c7f27ef1-a2e3-46fa-a92e-26c67776c420
Jan 13 21:46:04.587 DEBG registered endpoint, versions: all, path: /local/all-sp-ids, method: GET, local_addr: [::1]:12225mgs: initializing http server listening at [fd00:1122:3344:101::2]:12225: failed to bind to [fd00:1122:3344:101::2]:12225
, component: dropshot, name: c7f27ef1-a2e3-46fa-a92e-26c67776c420
Jan 13 21:46:04.588 DEBG registered endpoint, versions: all, path[ Jan 13 21:46:04 Stopping because all processes in service exited. ]
[ Jan 13 21:46:04 Executing stop method (:kill). ]
[ Jan 13 21:46:04 Executing start method ("/opt/oxide/mgs/bin/mgs run --id-and-address-from-smf /var/svc/manifest/site/mgs/config.toml &"). ]

@jgallagher
Contributor Author

Hmm - I was curious about the other switch zone services that didn't go into maintenance. I think they just failed to bind and maybe gave up? Here's dendrite (snipping out some irrelevant logs):

[ Jan 13 21:46:04 Rereading configuration. ]
[ Jan 13 21:46:04 Executing refresh method (:kill). ]
21:46:04.347Z INFO dpd: handling SMF refresh
    unit = signal_handler
21:46:04.353Z INFO dpd: refreshed config: Config {
        log_file: None,
        log_format: Json,
        port_config: Some(
            "/opt/oxide/dendrite/misc/softnpu_single_sled_config.toml",
        ),
        xcvr_defaults: Some(
            "/opt/oxide/dendrite/misc/xcvr_defaults.csv",
        ),
        mac_base: None,
        listen_addresses: [
            [fd00:1122:3344:101::2]:12224,
            [::1]:12224,
        ],
        dns_servers: [
            [fd00:1122:3344:1::1]:53,
            [fd00:1122:3344:2::1]:53,
            [fd00:1122:3344:3::1]:53,
        ],
        rack_id: Some(
            a9c8e5a3-3f3e-48e0-8d62-a2bbad8d2606,
        ),
        sled_id: Some(
            9c9d5316-8c92-4870-9520-432f69d443bb,
        ),
        sled_model: Some(
            "i86pc",
        ),
        sled_revision: Some(
            0,
        ),
        sled_serial: Some(
            "g0",
        ),
        asic_config: AsicConfig {
            softnpu_management: UART,
            uds_path: None,
            front_ports: 2,
            rear_ports: 4,
        },
        enable_rpw: true,
        nexus_address: None,
    }
21:46:04.353Z INFO dpd: found suitable IPv6 address for oximeter producer server
    address = fd00:1122:3344:101::2
    unit = oximeter
21:46:04.353Z INFO dpd: starting api server 2 on [fd00:1122:3344:101::2]:12224
    server_id = 2
    unit = api-server
21:46:04.393Z ERRO dpd: failed to create producer server
    error = failed to bind to [fd00:1122:3344:101::2]:0
    unit = oximeter
21:46:04.393Z WARN dpd: failed to create oximeter producer server
    error = Server("failed to bind to [fd00:1122:3344:101::2]:0")
    retry_after = 89.297451ms
    unit = oximeter
21:46:04.393Z ERRO dpd: failed to launch api server 2 on [fd00:1122:3344:101::2]:12224: Cannot assign requested address (os error 126)
    unit = api-server-manager

From those logs, it sounds like dendrite also failed to bind both its oximeter producer (which notes it will retry!) and its API server (which has no note about retrying). To confirm, I see several more failed producer start attempts, and then an eventual success:

21:46:04.483Z ERRO dpd: failed to create producer server
    error = failed to bind to [fd00:1122:3344:101::2]:0
    unit = oximeter
21:46:04.483Z WARN dpd: failed to create oximeter producer server
    error = Server("failed to bind to [fd00:1122:3344:101::2]:0")
    retry_after = 160.437601ms
    unit = oximeter
21:46:04.645Z ERRO dpd: failed to create producer server
    error = failed to bind to [fd00:1122:3344:101::2]:0
    unit = oximeter
21:46:04.645Z WARN dpd: failed to create oximeter producer server
    error = Server("failed to bind to [fd00:1122:3344:101::2]:0")
    retry_after = 205.007618ms
    unit = oximeter
21:46:04.850Z ERRO dpd: failed to create producer server
    error = failed to bind to [fd00:1122:3344:101::2]:0
    unit = oximeter
21:46:04.851Z WARN dpd: failed to create oximeter producer server
    error = Server("failed to bind to [fd00:1122:3344:101::2]:0")
    retry_after = 254.80451ms
    unit = oximeter
21:46:05.112Z DEBG dpd: starting producer registration task
    unit = oximeter
21:46:05.114Z INFO dpd: starting oximeter metric producer server
    address = [fd00:1122:3344:101::2]:56347
    file = /home/build/.cargo/git/checkouts/omicron-d039c41f152bda83/df6d11b/oximeter/producer/src/lib.rs:283
    interval = 10s
    producer_id = e23794b3-17eb-40cd-bb25-d4598cee8456
    unit = oximeter

but no attempted restart of the API server, and indeed the dendrite API service is not listening on the underlay address. Is that worth filing a separate bug in dendrite? It seems like it should either retry forever or go into maintenance rather than just not ever starting the API server, right?

@jgallagher
Contributor Author

lldpd:

[ Jan 13 21:46:04 Rereading configuration. ]
[ Jan 13 21:46:04 Executing refresh method (:kill). ]
21:46:04.424Z INFO lldpd: starting api server 3 on [fd00:1122:3344:101::2]:12230
    server_id = 3
    unit = api-server
21:46:04.426Z ERRO lldpd: failed to launch api server 3 on [fd00:1122:3344:101::2]:12230: error creating server listener: Cannot assign requested address (os error 126)
    unit = api-server-manager

Nice of it to log the actual OS error! 126 is EADDRNOTAVAIL, which is consistent with the bind failures in the other logs, I think.

However, this is the last log from lldpd, and it is online (i.e., not in maintenance) but not listening on the underlay.

@jgallagher
Contributor Author

wicketd refresh does not involve an underlay address (it's just picking up other RSS-related things like the rack ID).

@jgallagher
Contributor Author

mgd:

[ Jan 13 21:46:04 Rereading configuration. ]
[ Jan 13 21:46:04 Executing refresh method ("/opt/oxide/mgd/pkg/refresh_ctrun_child.sh -c 151"). ]
+ CONTRACT=''
+ getopts c: o
+ CONTRACT=151
+ getopts c: o
+ [[ -n 151 ]]
+ pgrep -c 151
+ PARENT_PID=2505
+ [[ -n 2505 ]]
+ pgrep -P 2505
+ CHILD_PID=2506
+ [[ -n 2506 ]]
+ kill -USR1 2506
21:46:04.708Z INFO slog-rs: handling smf refresh
[ Jan 13 21:46:04 Method "refresh" exited with status 0. ]
21:46:04.713Z INFO slog-rs: starting stats server on smf refresh
21:46:04.717Z ERRO slog-rs: failed to start stats server on refresh: underlay address not found

@jgallagher
Contributor Author

mg-ddm:

[ Jan 13 21:46:04 Rereading configuration. ]
[ Jan 13 21:46:04 Executing refresh method ("/opt/oxide/mg-ddm/pkg/refresh_ctrun_child.sh -c 153"). ]
+ CONTRACT=''
+ getopts c: o
+ CONTRACT=153
+ getopts c: o
+ [[ -n 153 ]]
+ pgrep -c 153
+ PARENT_PID=2512
+ [[ -n 2512 ]]
+ pgrep -P 2512
+ CHILD_PID=2513
+ [[ -n 2513 ]]
+ kill -USR1 2513
21:46:04.858Z INFO slog-rs: handling smf refresh
[ Jan 13 21:46:04 Method "refresh" exited with status 0. ]
21:46:04.862Z INFO slog-rs: starting stats server on smf refresh
21:46:04.863Z ERRO slog-rs: failed to start stats server on refresh: underlay address not found

@jgallagher
Contributor Author

I'm going to tear down the a4x2; I hope the above captures all the things we might want. If we want to go the "services should retry binding" route, it looks like we'd need to update all the switch zone services. I don't love that because it feels like the retry ought to have some kind of timeout; otherwise an error like "we told MGS the wrong underlay IP" turns into "MGS just sits there forever and never binds at all". We'd also have to remember to do that for any future services.

FWIW. If you want to create a sort of barrier or sequencing point so that you can wait for the refresh of the first service to complete, I'd definitely look at having the zone network setup service set some non-persistent property that signals that it's done with setting the thing up. Sled agent can poll for that property to know it's complete without having to get into the zone with zlogin and poke around. Would that work?

This sounds the most promising to me, although I'm not sure what the mechanics of that look like to handle stuff like "zone-network-service itself went into maintenance while trying to refresh" or other weird edge cases.

Should we revert #7260 for the time being? I assume this race is much more likely to show up on a slow system like a4x2, but presumably it's possible on real hardware too.

@karencfv
Contributor

Thanks for capturing all those logs here @jgallagher! Ugh, yeah I see how this confirms there is definitely a race here. How sad.

FWIW. If you want to create a sort of barrier or sequencing point so that you can wait for the refresh of the first service to complete, I'd definitely look at having the zone network setup service set some non-persistent property that signals that it's done with setting the thing up. Sled agent can poll for that property to know it's complete without having to get into the zone with zlogin and poke around. Would that work?

This sounds the most promising to me, although I'm not sure what the mechanics of that look like to handle stuff like "zone-network-service itself went into maintenance while trying to refresh" or other weird edge cases.

I like this solution as well, but yeah seems like this needs a little more thinking through.

For the time being, I'll revert #7260.

karencfv added a commit that referenced this issue Jan 15, 2025
This commit reverts the
[changes](https://github.com/oxidecomputer/omicron/pull/7260/files#diff-b8a6f13742cae29f44d095f6b9e8c2febc712e0ff86f01f3c8ec9d4e5d2db396)
made to the switch startup flow in #7260.

This had caused a race condition, as can be seen in #7337. Moving
forward, we'd like to return to using the `zone-network-setup` service,
but the implementation details need more consideration. In the
meantime, let's get rid of this bug.

Fixes #7337