sled-agent races when configuring switch zone underlay address #7337
Comments
That seems like it could be happening. I'm not sure of the best way to solve it. Maybe we could make the refresh method of the Another option would be to have the sled-agent do the refresh, and then wait for the setup service to complete. IIUC, the service is considered online when it exits successfully. I don't know if there's a "wait for service" method, but that would do it. Normally those are expressed as dependencies, but that won't work in this case AFAIU, because the switch services all need to come up before the zone-network-service runs.
Ah bummer. Was the The Do you still have the sled agent logs? Do they contain "refreshed zone-network-setup service with new configuration" and "refreshed MGS service with new configuration"? |
Sadly, this is kinda doing the same thing as the network setup service. The service is a thin wrapper around ipadm and other commands |
I don't think that's quite right. The difference is what "refreshing the service" means. By default, it's delivering a signal to the process. When that's delivered, refreshing is "complete", even though the application now has a bunch of other work to do (all the |
Thinking about this a bit more, perhaps even though I wonder if it makes sense to do a synchronous service refresh |
Refreshing is always an asynchronous thing, as far as I know. Most things just accept a signal, and then do some amount of work on their own time. There isn't really a mechanism for the service to communicate that the refresh operation is completed. Something we could do is have the service expose information about its current configuration as a non-persistent property, and have consumers inspect that property and wait for it to have the expected value as needed. |
Actually, it already looks like the SMF service's refresh method is really doing the work, rather than delivering a signal. So I'm not sure if more could be done there, to make it synchronous. omicron/smf/zone-network-setup/manifest.xml Lines 35 to 37 in 2fe668d
My guess is having the sled-agent wait for the new address to show up in the SMF properties of the switch zone services, before refreshing them, might work. |
Another option is that we could actually have more than one service instance here, the The second instance could start disabled by default and be enabled where we would otherwise today be doing the refresh, say, or the base instance could itself enable the second instance at the appropriate moment, etc. Having an instance per separate resource would make them essentially legible to SMF as something that can be a dependency, I think. |
I'm unsure if that would work. This whole flow gets kicked off when there are new addresses in the request https://github.com/oxidecomputer/omicron/blob/main/sled-agent/src/services.rs#L4290-L4300. From my understanding, we know what the address is, but we need to wait for the zone-network-setup service to actually create it. So the address would show up in the SMF properties, but the service may still not work because zone-network-setup hasn't finished. Unless I'm misunderstanding the proposal?
The other services have also come up already and are being refreshed. If we made them dependent to a second |
Yes that's true. I think ultimately if you want to refresh stuff without restarting it, it probably just needs to be willing to wait for the configured address to show up, rather than try once and give up. |
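A minimal sketch of what "wait for the configured address to show up" could look like inside a service, assuming a plain TCP listener built on the Rust standard library. The function name `bind_with_retry` and its parameters are hypothetical and are not taken from MGS, dendrite, or any other switch zone service. It retries only while the OS reports that the address is not available, and gives up at a deadline so that a wrongly configured address still surfaces as an error.

```rust
use std::io;
use std::net::{SocketAddr, TcpListener};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical helper: keep trying to bind `addr` while the address does
/// not exist yet, instead of trying once and giving up.
pub fn bind_with_retry(
    addr: SocketAddr,
    interval: Duration,
    timeout: Duration,
) -> io::Result<TcpListener> {
    let deadline = Instant::now() + timeout;
    loop {
        match TcpListener::bind(addr) {
            Ok(listener) => return Ok(listener),
            // The address has not been created yet: wait and retry, but
            // only until the deadline passes.
            Err(e)
                if e.kind() == io::ErrorKind::AddrNotAvailable
                    && Instant::now() < deadline =>
            {
                sleep(interval);
            }
            // Any other error, or running out of time, is returned to the
            // caller, which can then exit and let SMF handle the failure.
            Err(e) => return Err(e),
        }
    }
}
```

A service would call this once per listener after a refresh, with a timeout chosen to cover the expected address setup time.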
Right, so I guess we need to decide which is the better path forward: modify the switch services to wait for the underlay IP, or go back to zlogin in the switch zone. My first instinct would be to modify the switch services like @jclulow suggests, but I don't have enough knowledge about them and the whole switch zone start up flow to have sufficient confidence to say that's the best path forward.
I definitely wouldn't go back to zlogin, FWIW. If you want to create a sort of barrier or sequencing point so that you can wait for the refresh of the first service to complete, I'd definitely look at having the zone network setup service set some non-persistent property that signals that it's done with setting the thing up. Sled agent can poll for that property to know it's complete without having to get into the zone with zlogin and poke around. Would that work? |
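The barrier described here could be sketched roughly as follows. This is an illustration under assumptions, not sled-agent code: `wait_for_property`, `WaitError`, and the `read_property` closure are hypothetical names, and the closure stands in for however the caller would actually read the non-persistent SMF property from the zone network setup service.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical error type for the wait below.
#[derive(Debug, PartialEq)]
pub enum WaitError {
    /// The property never reached the expected value before the deadline.
    TimedOut { last_seen: Option<String> },
}

/// Polls `read_property` until it returns `expected` or `timeout` elapses.
/// `read_property` is a stand-in for reading the completion property.
pub fn wait_for_property<F>(
    mut read_property: F,
    expected: &str,
    interval: Duration,
    timeout: Duration,
) -> Result<(), WaitError>
where
    F: FnMut() -> Option<String>,
{
    let deadline = Instant::now() + timeout;
    let mut last_seen = None;
    loop {
        if let Some(value) = read_property() {
            if value == expected {
                return Ok(());
            }
            // Remember a stale value so the timeout error can report it.
            last_seen = Some(value);
        }
        if Instant::now() >= deadline {
            return Err(WaitError::TimedOut { last_seen });
        }
        sleep(interval);
    }
}
```

With something like this, the refresh of the other switch zone services would only be issued after the wait returns `Ok(())`, and a timeout would be reported as an error rather than leaving services to fail their bind.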
I do - this a4x2 is still up for now, although I was planning on destroying it shortly. Only
Logs from
Looking at the MGS logs specifically, we see the first failure on the refresh:
MGS then starts and dies several times in quick succession, with the last one still being before the timestamp when
Hmm - I was curious about the other switch zone services that didn't go into maintenance. I think they just failed to bind and maybe gave up? Here's dendrite (snipping out some irrelevant logs):
From those logs, it sounds like dendrite also failed to bind both its oximeter producer (which notes it will retry!) and its API server (which has no note about retrying). To confirm, I see several more failed producer start attempts, and then an eventual success:
but no attempted restart of the API server, and indeed the dendrite API service is not listening on the underlay address. Is that worth filing a separate bug in dendrite? It seems like it should either retry forever or go into maintenance rather than just not ever starting the API server, right? |
lldpd:
Nice of it to log the actual OS error! 126 is However, this is the last log from lldpd, and it is |
wicketd refresh does not involve an underlay address (it's just picking up other RSS-related things like the rack ID). |
mgd:
mg-ddm:
I'm going to tear down the a4x2; I hope the above captures all the things we might want. If we want to go the "services should retry binding" route it looks like we'd need to update all the switch zone services. I don't love that because it feels like that ought to have some kind of timeout, otherwise an error like "we told MGS the wrong underlay IP" turns into "MGS just sits there forever and never binds at all". We'd also have to remember to do that for any future services.
This sounds the most promising to me, although I'm not sure what the mechanics of that look like to handle stuff like " Should we revert #7260 for the time being? I assume this race is much more likely to show up on a slow system like a4x2, but presumably it's possible on real hardware too. |
Thanks for capturing all those logs here @jgallagher! Ugh, yeah I see how this confirms there is definitely a race here. How sad.
I like this solution as well, but yeah seems like this needs a little more thinking through. For the time being I'll revert #7260 |
This commit reverts the [changes](https://github.com/oxidecomputer/omicron/pull/7260/files#diff-b8a6f13742cae29f44d095f6b9e8c2febc712e0ff86f01f3c8ec9d4e5d2db396) made to the switch start up flow in #7260. Those changes caused a race condition, as can be seen in #7337. Moving forward, we'd like to return to using the `zone-network-setup` service, but more consideration is needed to get the implementation details right. In the meantime, let's get rid of this bug. Fixes #7337
Trying to test something on a4x2, I ran into a situation where MGS had gone into maintenance. Its logs contained the error:
I assume the source of the bind failure was that the underlay address didn't exist (although I'm not sure why we didn't get the rest of the error chain).
`fd00:1122:3344:101::2` is the correct underlay address of that switch zone, and it did exist at the point I was investigating. Clearing and restarting MGS with no other changes to the system worked fine.

Looking at recent changes in this area, I'm wondering if #7260 introduced a race that could explain this. Prior to that change, `sled-agent` was `zlogin`'ing into the switch zone to create the underlay address. After that change, instead it reconfigures `zone-network-setup`, sends an SMF refresh, and expects `zone-network-setup` to create the address. But I think at that point, both of these are happening concurrently:

- `sled-agent` in the gz is reconfiguring other switch zone services to tell them to start listening on the underlay IP
- `zone-network-service` in the switch zone is handling the SMF refresh and actually creating the underlay IP

which means services could fail to bind if they lose that race.

Does this sound reasonable @karencfv / @bnaecker? Any ideas on how we could fix this without falling back to `zlogin`?