Hammer/fix dangling delivery alerts #905
Conversation
…tead of repo Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@             Coverage Diff              @@
##           deploy/hammer     #905      +/-   ##
=================================================
- Coverage          49.35%    47.01%    -2.35%
=================================================
  Files                285       204       -81
  Lines               7564      6239     -1325
  Branches            1050       824      -226
=================================================
- Hits                3733      2933      -800
+ Misses              3682      3300      -382
+ Partials             149         6      -143

Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
for robot in f.robots.values():
    if robot.task_id is not None and len(robot.task_id) != 0:
        current_task_ids.append(robot.task_id)
except Exception as e:  # pylint: disable=broad-except
Do not disable the linter without a good reason.
For this particular scenario, where there's a lot of action happening under-the-hood, especially with casting involved, I would prefer a broad except, rather than extensively writing out every single possible exception.
You should still write out the exceptions which you can handle, and let uncaught exceptions propagate up to FastAPI, which will return an HTTP error code.
Although, even if we were to manage the states ourselves (which we should not), this should be done when we ingest the task states; querying should not cause this kind of side effect.
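A minimal sketch of what that could look like, with hypothetical route and helper names (`get_fleet_states` stands in for however the fleet states are actually fetched): only the errors expected from a malformed fleet state are caught, and anything else propagates so FastAPI answers with a 500.

```python
from typing import Dict, List

from fastapi import FastAPI

app = FastAPI()


async def get_fleet_states() -> List[dict]:
    """Hypothetical stand-in for however the fleet states are fetched."""
    return []


@app.get("/delivery_alerts/ongoing_task_ids")
async def ongoing_task_ids() -> List[str]:
    task_ids: List[str] = []
    for fleet in await get_fleet_states():
        try:
            robots: Dict[str, dict] = fleet["robots"]
            for robot in robots.values():
                task_id = robot.get("task_id")
                if task_id:
                    task_ids.append(task_id)
        except (KeyError, AttributeError, TypeError):
            # Only the errors we expect from a malformed fleet state are
            # handled here; anything unexpected propagates and FastAPI
            # returns an HTTP 500 for the request.
            continue
    return task_ids
```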
):
    logger.info(f"Found a dangling alert: {d}")
    try:
        dangling_alert = await ttm.DeliveryAlert.get_or_none(id=d.id)
We already retrieved the alerts earlier; this is an unnecessary database read.
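For illustration, a sketch of this suggestion, assuming the rows from the initial query are kept around (`alerts`, `current_task_ids`, and `logger` are placeholders assumed to be in scope; `update_from_dict` is the same Tortoise ORM helper already used in the diff):

```python
# Assumes `alerts` holds the DeliveryAlert rows fetched by the earlier query.
for alert in alerts:
    if alert.task_id not in current_task_ids:
        logger.info(f"Found a dangling alert: {alert}")
        # Reuse the row we already have instead of a second get_or_none(id=...).
        alert.update_from_dict({"action": "resume"})
        await alert.save()
```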
        if dangling_alert is None:
            logger.error(f"Failed to retrieve dangling alert {d.id}")
            continue
        dangling_alert.update_from_dict({"action": "resume"})
I don't think we should be updating the alerts like this; the fleet adapter should notify us that the alert is done. This creates multiple sources of truth for the actual state of the delivery alert.
This fix is for the odd case where either:
1. the alert update from the fleet adapter was dropped and missing,
2. the alert update from the UI was dropped (network connectivity, wifi, or any number of other reasons the UI failed),
3. the api-server restarts and misses either one of the above, or
4. there was a major issue and the fleet adapter had to be restarted.
This basically causes a "dangling" alert: every time we refresh the UI, this expired alert will still be present, waiting for the user to respond (or, in the obstruction case, waiting for the fleet adapter to respond). But for cases (1), (3), or (4), it will just never be resolved.
Implementation-wise, there will never be a delivery alert that is not tied to an ongoing task; I'll supplement this block with some notes about these phenomena. With all this in mind, I don't believe we have multiple sources of truth: any new delivery alert for the same task will overwrite what's in the DB (even after resume), and can require user intervention again.
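For illustration only, the overwrite behaviour described above could look roughly like the following, assuming alerts are keyed on `task_id`; the field names in `defaults` are placeholders, not necessarily the actual model fields.

```python
# Assumes ttm.DeliveryAlert is the existing Tortoise ORM model and that
# task_id uniquely identifies the delivery alert for a task.
await ttm.DeliveryAlert.update_or_create(
    task_id=task_id,
    defaults={
        "action": "waiting",  # a new alert re-opens the entry, even after "resume"
        "message": message,
    },
)
```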
An example of conflicting sources of truth would be if another node subscribes to the delivery alerts: the fleet adapter did not publish the new status, but the api server assumes on its own that it is resumed, while the other node still thinks the alert is pending.
The proper fix would be to decouple UI alerts from delivery alerts. If the user knows the data is not properly synced, or chooses to dismiss the alert for whatever reason, we will no longer show the alert regardless of the RMF status (until the fleet adapter publishes a new delivery alert again).
> An example of conflicting sources of truth would be if another node subscribes to the delivery alerts: the fleet adapter did not publish the new status, but the api server assumes on its own that it is resumed, while the other node still thinks the alert is pending.
But the delivery alerts are only tied to task IDs, so if not a single robot is performing this task, we can safely assume that it is considered dangling.
If any other node is subscribed to the delivery alerts, it is not within the scope of what the delivery alerts should do, and we can ignore those behaviors.
> The proper fix would be to decouple UI alerts from delivery alerts. If the user knows the data is not properly synced, or chooses to dismiss the alert for whatever reason, we will no longer show the alert regardless of the RMF status (until the fleet adapter publishes a new delivery alert again).
So what you're suggesting would perhaps be not using the DB at all for the delivery alerts.
- Fleet adapter publishes, api-server subscribes, pushes an alert onto the UI
- if the user refreshes the UI, this alert disappears until the fleet adapter publishes it again (say every 30 seconds; edit: this will need to be implemented), at which point it will show up again
- user interactions on the UI alert will post to the server, which will publish the alert response over ROS 2 to the fleet adapter (sketched below)
For us, we will likely just be logging the events and user interactions, as opposed to keeping track of the alerts in the DB. Is that the same as what you have in mind?
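A minimal sketch of that bridging flow, assuming hypothetical endpoint, payload, and publisher names; the actual ROS 2 delivery alert response message type is not shown and would come from the fleet adapter's interface.

```python
import logging

from fastapi import FastAPI
from pydantic import BaseModel

logger = logging.getLogger(__name__)
app = FastAPI()


class DeliveryAlertResponse(BaseModel):
    alert_id: str
    task_id: str
    action: str  # e.g. "resume" or "cancel", chosen by the user on the UI


class RosBridge:
    """Hypothetical wrapper around the rclpy node/publisher (omitted here)."""

    def publish_delivery_alert_response(self, response: DeliveryAlertResponse) -> None:
        ...


ros_bridge = RosBridge()


@app.post("/delivery_alerts/{alert_id}/response")
async def respond_to_delivery_alert(alert_id: str, response: DeliveryAlertResponse):
    # No DB write: log the user interaction for the record, then forward the
    # response to the fleet adapter over ROS 2.
    logger.info("delivery alert %s: user chose %s", alert_id, response.action)
    ros_bridge.publish_delivery_alert_response(response)
    return {"status": "forwarded"}
```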
> But the delivery alerts are only tied to task IDs, so if not a single robot is performing this task, we can safely assume that it is considered dangling.
Yes, but there is no defined behavior for a dangling alert; we cannot assume that it means the delivery has resumed. Other nodes which may use delivery alerts (even though there aren't any at the moment) do not hold the same assumption, leading to a conflict of truths.
We will still use the DB for alerts, but we need to differentiate between a delivery alert and an alert shown on the UI. A user never dismisses a delivery alert directly; rather, they dismiss the UI alert, which triggers actions that send the appropriate messages to RMF (done by the api server).
So a dangling delivery alert should still be dangling, but it will not cause any alerts to be displayed on the UI after the user has dismissed it.
Actually, the fact that there are dangling alerts already means that there are multiple sources of truth. AFAIK, the owner of this data is the fleet adapter, so the api server should not be making its own assumptions.
I can see two ways around this: one is the proposal I mentioned earlier, i.e. do not use delivery alerts as UI alerts; the second is to not rely on persisting delivery alerts at all, i.e. the fleet adapter should publish delivery alerts like it does fleet states, constantly publishing the list of current alerts so the api server does not need to build its own assumption of the state.
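For illustration, the second option could look roughly like this on the fleet adapter side, assuming a Python (rclpy) adapter; `std_msgs/String` carrying JSON and the topic name are placeholders for whatever delivery alert message and topic are actually used.

```python
import json

import rclpy
from rclpy.node import Node
from std_msgs.msg import String  # placeholder for the real delivery alert message


class DeliveryAlertRepublisher(Node):
    """Owns the delivery alert state and republishes it periodically."""

    def __init__(self):
        super().__init__("delivery_alert_republisher")
        self._active_alerts = {}  # alert_id -> alert dict, owned by the adapter
        self._pub = self.create_publisher(String, "delivery_alerts", 10)
        # Republish the full list every few seconds, like fleet states, so
        # subscribers (e.g. the api server) never need to persist or infer it.
        self.create_timer(5.0, self._publish_alerts)

    def _publish_alerts(self):
        msg = String()
        msg.data = json.dumps(list(self._active_alerts.values()))
        self._pub.publish(msg)


def main():
    rclpy.init()
    rclpy.spin(DeliveryAlertRepublisher())


if __name__ == "__main__":
    main()
```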
Per DM, as long as we ensure that the fleet adapters keep the state of the delivery alerts and will prompt again periodically if no user intervention has started, we can skip the use of the DB entirely and just bridge the UI user interactions to ROS 2 delivery alert responses.
Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
The improved typings are very helpful!
Just tested with an old DB, and no migration was required either 🥳
What's new
When delivery alerts are queried, we check for ongoing tasks, and if there exists a dangling delivery alert, we resume it.
Self-checks