Fix registration logic in lifecycle manager #52

luca-della-vedova · 2025-01-07T04:30:59Z

This PR works toward #48 by fixing corner cases with workcells bringup / bringdown.
The logic in the lifecycle manager was flawed, specifically in the branch where nodes already exist and a new one is added.

Two main issues are fixed by this PR.

Re-registering the same workcell.

If Nexus has a single workcell, that workcell is brought down without proper cleanup (i.e. it suddenly crashes) then it's readded, we will go into the branch I linked above. However, when we try to get the target state

nexus/nexus_lifecycle_manager/include/nexus_lifecycle_manager/lifecycle_manager.hpp

Lines 376 to 381 in a0eff3c

    
           if (!this->GetState(this->_node_clients.begin()->first, targetState)) 
        
           { 
        
             message("The state of current nodes were not recheable, " 
        
               "Keep the new node is unconfigure state"); 
        
             return false; 
        
           }

We will actually query the workcell we are trying to create right now, hence try to transition it to its current state ending up in a NOOP. The new workcell will be successfully added but remain unconfigured.

`workcell_1` is created, `workcell_1` is killed, `workcell_2` is created

Another corner case of the check above, now we will actually try to get the target state from a workcell that died, was not cleaned up properly and will thus be unreachable, adding new nodes will fail.

The solution

The solution I came up with is actually a large simplification of the logic. I just created a new class variable _target_state that contains the state that we want the nodes to be in.

If the lifecycle manager is set to autostart the target state will be ACTIVE, otherwise UNCONFIGURED.
_autostart itself can be removed, now it just sets _target_state to ACTIVE instead.
system_active can be removed, we just check that the _target_state is ACTIVE.
Whenever we do a changeStateForAllNodes we will also update this internal _target_state.
Whenever a new node is added, we transition it to _target_state.

The drawback

The only "corner" case I can think of is in cases in which we want to manage a set of nodes at different states, in which the concept of a single target state that new nodes should be transitioned to will not be applicable. Two main points about this:

Most importantly I would argue that the previous version was also broken. We would arbitrarily take the state the first item in an unordered_map and transition new nodes to that state. unordered_map is... unordered... so the first item could be literally any node depending on what the hashing does.
I also feel the lifecycle manager's main scope is to make sure that all the managed nodes are at the same target state. If that is the case what I would actually recommend is only exposing APIs that transition all the nodes to a state and make APIs that transition a single node private, so it won't be possible for external users to create invalid internal state.

Signed-off-by: Luca Della Vedova <lucadv@intrinsic.ai>

Yadunund

Thanks for working on this! The approach is a lot better. Left a minor comment to consider before we can merge this.

Yadunund · 2025-01-07T18:57:20Z

nexus_lifecycle_manager/include/nexus_lifecycle_manager/lifecycle_manager.hpp

-      ok = ok && (
-        transitionGraph.atGoalState(state.value(), transition) ||
-        client->change_state(transition));
+      ok &= transitionGraph.atGoalState(state.value(), transition) ||


Nit: By switching to bitwise AND we lose the optimization of short circuiting if ok is false so we can we keep the original implementation? Let me know if I missed something else.

Also not sure whats the point of returning a boolean in this function since we might have successfully transitioned some nodes and failed for others and the user would not know which ones succeeded. But that is beyond the scope of this PR.

This is an interesting one, the idea behind this change is that previously as soon as ok became false the short circuiting would make all the following clients skip transitioning as well.
With the new logic without short circuiting instead, all the following client will still transition.
This is also why I added the continue above, before the short circuiting would make sure we never call .value() even as new nodes do indeed have a state but since short circuiting is removed we need an early guard clause

nexus_lifecycle_manager/include/nexus_lifecycle_manager/lifecycle_manager.hpp

Signed-off-by: Luca Della Vedova <lucadv@intrinsic.ai>

nexus_lifecycle_manager/include/nexus_lifecycle_manager/lifecycle_manager.hpp

…cle_manager.hpp Co-authored-by: yadunund <yadunund@gmail.com>

luca-della-vedova added 3 commits January 7, 2025 12:12

Fix registration logic in lifecycle manager

9e4d2f4

Signed-off-by: Luca Della Vedova <lucadv@intrinsic.ai>

Make sure to use state, not transition

da8e425

Signed-off-by: Luca Della Vedova <lucadv@intrinsic.ai>

Remove _system_active in favor of lifecycle state

410b119

Signed-off-by: Luca Della Vedova <lucadv@intrinsic.ai>

luca-della-vedova requested a review from Yadunund January 7, 2025 04:30

Convert target state to optional

c3200d5

Signed-off-by: Luca Della Vedova <lucadv@intrinsic.ai>

Yadunund requested changes Jan 7, 2025

View reviewed changes

Add warning

967fade

Signed-off-by: Luca Della Vedova <lucadv@intrinsic.ai>

Yadunund requested changes Jan 8, 2025

View reviewed changes

nexus_lifecycle_manager/include/nexus_lifecycle_manager/lifecycle_manager.hpp Outdated Show resolved Hide resolved

Update nexus_lifecycle_manager/include/nexus_lifecycle_manager/lifecy…

40e1473

…cle_manager.hpp Co-authored-by: yadunund <yadunund@gmail.com>

Yadunund approved these changes Jan 8, 2025

View reviewed changes

Yadunund merged commit 5ffa61a into main Jan 8, 2025
3 checks passed

Yadunund deleted the luca/fix_reregistration branch January 8, 2025 19:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix registration logic in lifecycle manager #52

Fix registration logic in lifecycle manager #52

luca-della-vedova commented Jan 7, 2025 •

edited

Loading

Yadunund left a comment

Yadunund Jan 7, 2025

luca-della-vedova Jan 8, 2025

	if (!this->GetState(this->_node_clients.begin()->first, targetState))
	{
	message("The state of current nodes were not recheable, "
	"Keep the new node is unconfigure state");
	return false;
	}

Fix registration logic in lifecycle manager #52

Fix registration logic in lifecycle manager #52

Conversation

luca-della-vedova commented Jan 7, 2025 • edited Loading

Re-registering the same workcell.

workcell_1 is created, workcell_1 is killed, workcell_2 is created

The solution

The drawback

Yadunund left a comment

Choose a reason for hiding this comment

Yadunund Jan 7, 2025

Choose a reason for hiding this comment

luca-della-vedova Jan 8, 2025

Choose a reason for hiding this comment

luca-della-vedova commented Jan 7, 2025 •

edited

Loading

`workcell_1` is created, `workcell_1` is killed, `workcell_2` is created