Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix registration logic in lifecycle manager #52

Merged
merged 6 commits into from
Jan 8, 2025

Conversation

luca-della-vedova
Copy link
Member

@luca-della-vedova luca-della-vedova commented Jan 7, 2025

This PR works toward #48 by fixing corner cases with workcells bringup / bringdown.
The logic in the lifecycle manager was flawed, specifically in the branch where nodes already exist and a new one is added.

Two main issues are fixed by this PR.

Re-registering the same workcell.

If Nexus has a single workcell, that workcell is brought down without proper cleanup (i.e. it suddenly crashes) then it's readded, we will go into the branch I linked above. However, when we try to get the target state

if (!this->GetState(this->_node_clients.begin()->first, targetState))
{
message("The state of current nodes were not recheable, "
"Keep the new node is unconfigure state");
return false;
}

We will actually query the workcell we are trying to create right now, hence try to transition it to its current state ending up in a NOOP. The new workcell will be successfully added but remain unconfigured.

workcell_1 is created, workcell_1 is killed, workcell_2 is created

Another corner case of the check above, now we will actually try to get the target state from a workcell that died, was not cleaned up properly and will thus be unreachable, adding new nodes will fail.

The solution

The solution I came up with is actually a large simplification of the logic. I just created a new class variable _target_state that contains the state that we want the nodes to be in.

  • If the lifecycle manager is set to autostart the target state will be ACTIVE, otherwise UNCONFIGURED.
  • _autostart itself can be removed, now it just sets _target_state to ACTIVE instead.
  • system_active can be removed, we just check that the _target_state is ACTIVE.
  • Whenever we do a changeStateForAllNodes we will also update this internal _target_state.
  • Whenever a new node is added, we transition it to _target_state.

The drawback

The only "corner" case I can think of is in cases in which we want to manage a set of nodes at different states, in which the concept of a single target state that new nodes should be transitioned to will not be applicable. Two main points about this:

  • Most importantly I would argue that the previous version was also broken. We would arbitrarily take the state the first item in an unordered_map and transition new nodes to that state. unordered_map is... unordered... so the first item could be literally any node depending on what the hashing does.
  • I also feel the lifecycle manager's main scope is to make sure that all the managed nodes are at the same target state. If that is the case what I would actually recommend is only exposing APIs that transition all the nodes to a state and make APIs that transition a single node private, so it won't be possible for external users to create invalid internal state.

Signed-off-by: Luca Della Vedova <lucadv@intrinsic.ai>
Signed-off-by: Luca Della Vedova <lucadv@intrinsic.ai>
Signed-off-by: Luca Della Vedova <lucadv@intrinsic.ai>
Signed-off-by: Luca Della Vedova <lucadv@intrinsic.ai>
Copy link
Member

@Yadunund Yadunund left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! The approach is a lot better. Left a minor comment to consider before we can merge this.

ok = ok && (
transitionGraph.atGoalState(state.value(), transition) ||
client->change_state(transition));
ok &= transitionGraph.atGoalState(state.value(), transition) ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: By switching to bitwise AND we lose the optimization of short circuiting if ok is false so we can we keep the original implementation? Let me know if I missed something else.

Also not sure whats the point of returning a boolean in this function since we might have successfully transitioned some nodes and failed for others and the user would not know which ones succeeded. But that is beyond the scope of this PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting one, the idea behind this change is that previously as soon as ok became false the short circuiting would make all the following clients skip transitioning as well.
With the new logic without short circuiting instead, all the following client will still transition.
This is also why I added the continue above, before the short circuiting would make sure we never call .value() even as new nodes do indeed have a state but since short circuiting is removed we need an early guard clause

Signed-off-by: Luca Della Vedova <lucadv@intrinsic.ai>
…cle_manager.hpp

Co-authored-by: yadunund <yadunund@gmail.com>
@Yadunund Yadunund merged commit 5ffa61a into main Jan 8, 2025
3 checks passed
@Yadunund Yadunund deleted the luca/fix_reregistration branch January 8, 2025 19:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants