Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hammer/fix dangling delivery alerts #905

Merged
merged 8 commits into from
Feb 26, 2024

Conversation

aaronchongth
Copy link
Member

What's new

When delivery alerts are queried, we check for ongoing tasks, and if there exists a dangling delivery alert, we resume it

Self-checks

  • I have prototyped this new feature (if necessary) on Figma
  • I'm familiar with and follow this Typescript guideline
  • I added unit-tests for new components
  • I tried testing edge cases
  • I tested the behavior of the components that interact with the backend, with an e2e test

…tead of repo

Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
Copy link

codecov bot commented Feb 20, 2024

Codecov Report

Attention: Patch coverage is 25.77320% with 72 lines in your changes are missing coverage. Please review.

Project coverage is 47.01%. Comparing base (fe0e808) to head (e94309f).
Report is 66 commits behind head on deploy/hammer.

Files Patch % Lines
.../dashboard/src/components/delivery-alert-store.tsx 0.00% 47 Missing ⚠️
...es/api-server/api_server/models/delivery_alerts.py 48.88% 23 Missing ⚠️
...es/api-server/api_server/routes/delivery_alerts.py 60.00% 2 Missing ⚠️
Additional details and impacted files
@@                Coverage Diff                @@
##           deploy/hammer     #905      +/-   ##
=================================================
- Coverage          49.35%   47.01%   -2.35%     
=================================================
  Files                285      204      -81     
  Lines               7564     6239    -1325     
  Branches            1050      824     -226     
=================================================
- Hits                3733     2933     -800     
+ Misses              3682     3300     -382     
+ Partials             149        6     -143     
Flag Coverage Δ
api-server 77.31% <50.00%> (-3.49%) ⬇️
dashboard 13.88% <0.00%> (-1.18%) ⬇️
react-components ?
rmf-auth ?

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
for robot in f.robots.values():
if robot.task_id is not None and len(robot.task_id) != 0:
current_task_ids.append(robot.task_id)
except Exception as e: # pylint: disable=broad-except
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not disable the linter without a good reason.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this particular scenario, where there's a lot of action happening under-the-hood, especially with casting involved, I would prefer a broad except, rather than extensively writing out every single possible exception.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should still write out the exceptions which you can handle, let uncaught exceptions propagate up to fastapi and it should return with an http error code.

Although even if we were to manage the states ourselves (which we should not), this should be done when we ingest the task states, querying should not cause this kind of side effects.

):
logger.info(f"Found a dangling alert: {d}")
try:
dangling_alert = await ttm.DeliveryAlert.get_or_none(id=d.id)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already retrieved the alerts earlier, this is unnecessary database read.

if dangling_alert is None:
logger.error(f"Failed to retrieve dangling alert {d.id}")
continue
dangling_alert.update_from_dict({"action": "resume"})
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should be updating the alerts like this, the fleet adapter should notify us that the alert is done. This creates multiple sources of truth for the actual state of the delivery alert.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix is for the odd case where either

  1. alert update from fleet adapter was dropped and missing
  2. alert update from UI was dropped (network connectivity, or wifi, or any other number of reasons the UI failed)
  3. api-server restarts, and misses either one of the above
  4. there was a major issue and the fleet adapter had to be restarted

This will basically cause a "dangling" alert, where every time we refresh the UI, this expired alert will still be present, waiting for the user to respond to (or in the obstruction case, waiting for the fleet adapter to respond). But for cases (1) or (3) or (4), it will just never be resolved.

Implementation detail-wise, there will never be a delivery alert that is not tied to an ongoing task. I'll supplement this block with some notes about these phenomena. But with all these in mind, I don't believe we have multiple sources of truth. Any new delivery alerts for the same task, will overwrite what's in the DB (even after resume), and can require user intervention again

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An example of conflicting sources of truth would be if another node subscribe to the delivery alerts, the fleet adapter did not publish the new status but the api server assumes it is resumed by itself, while the other node still thinks the alert is pending.

The proper fix would be to decouple ui alerts with delivery alerts. If the user knows the data is not properly synced or chooses to dismiss the alert for whatever reason, we will no longer show the alert regardless of the rmf status (until the fleet adapter publish a new delivery alert again).

Copy link
Member Author

@aaronchongth aaronchongth Feb 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An example of conflicting sources of truth would be if another node subscribe to the delivery alerts, the fleet adapter did not publish the new status but the api server assumes it is resumed by itself, while the other node still thinks the alert is pending.

But the delivery alerts are only tied to task IDs, so if not a single robot is performing this task, we can safely assume that it is considered dangling.

If any other node is subscribed to the delivery alerts, it is not within the scope of what the delivery alerts should do, and we can ignore those behaviors.

The proper fix would be to decouple ui alerts with delivery alerts. If the user knows the data is not properly synced or chooses to dismiss the alert for whatever reason, we will no longer show the alert regardless of the rmf status (until the fleet adapter publish a new delivery alert again).

So what you're suggesting would perhaps be not using the DB at all for the delivery alerts.

  • Fleet adapter publishes, api-server subscribes, pushes an alert onto the UI
  • if the user refreshes the UI, this alert disappears, until the fleet adapter publishes it again (let's say every 30 seconds, edit: this will need to be implemented), it will show up
  • user interactions on the UI alert, will post to the server, which will publish the alert response over ROS 2 to the fleet adapter

For us, we will likely just be logging the events and user interactions, as opposed to keeping track of the alerts in the DB. Is that the same as what you have in mind?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the delivery alerts are only tied to task IDs, so if not a single robot is performing this task, we can safely assume that it is considered dangling.

Yes, but there is no defined behavior for a dangling alert, we cannot assume that it means the delivery has resumed, other nodes which may use delivery alerts (even though there isn't any atm) does not hold the same assumption, leading to a conflict of truths.

We will still use the DB for alerts, but we need to differentiate between delivery alert and alert shown to the ui. A user never dismiss a delivery alert directly, rather it dismiss the ui alert, which will trigger some actions which sends appropriate messages to the rmf (done by the api server).

So a dangling delivery alert should still be dangling, but it will not cause any alerts to be displayed on the ui after the user has dismissed it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, the fact that there are dangling alerts already means that there are multiple sources of truths. afaik, the owner of this data is the fleet adapter, so the api server should not be making it's own assumption.

I can see 2 ways around this, 1 is the proposal I mentioned earlier, do not use delivery alerts as ui alerts, a 2nd way is to not rely on persisting delivery alerts, i.e. the fleet adapter should be publishing delivery alerts like the fleet state, it should constantly publish a list of current alerts so the api server does not need to build it's own assumption of the state.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per DM, as long as we ensure that the fleet adapters are keeping states of the delivery alerts, and will prompt again periodically if no user intervention was started, we can skip the use of the DB entirely, and just bridge the UI user interactions towards ROS 2 delivery alert responses.

Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
Copy link
Collaborator

@koonpeng koonpeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The improved typings are very helpful!

@aaronchongth aaronchongth merged commit e0e76bc into deploy/hammer Feb 26, 2024
4 checks passed
@aaronchongth aaronchongth deleted the hammer/fix-dangling-delivery-alerts branch February 26, 2024 04:56
@aaronchongth
Copy link
Member Author

Just tested with an old DB, and no migration was required too 🥳

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants