Hammer/fix dangling delivery alerts #905
Conversation
…tead of repo Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@             Coverage Diff              @@
##           deploy/hammer     #905      +/-   ##
=================================================
- Coverage          49.35%    47.01%    -2.35%
=================================================
  Files                285       204       -81
  Lines               7564      6239     -1325
  Branches            1050       824      -226
=================================================
- Hits                3733      2933      -800
+ Misses              3682      3300      -382
+ Partials             149         6      -143

Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
for robot in f.robots.values():
    if robot.task_id is not None and len(robot.task_id) != 0:
        current_task_ids.append(robot.task_id)
except Exception as e:  # pylint: disable=broad-except
Do not disable the linter without a good reason.
For this particular scenario, where there's a lot of action happening under-the-hood, especially with casting involved, I would prefer a broad except, rather than extensively writing out every single possible exception.
You should still write out the exceptions which you can handle, and let uncaught exceptions propagate up to FastAPI, which will return an HTTP error code.
Although, even if we were to manage the states ourselves (which we should not), this should be done when we ingest the task states; querying should not cause this kind of side effect.
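A minimal sketch of what that could look like, with hypothetical route and helper names (`get_fleet_states` stands in for however the fleet states are actually fetched): only the errors expected from a malformed fleet state are caught, and anything else propagates so FastAPI answers with a 500.

```python
from typing import Dict, List

from fastapi import FastAPI

app = FastAPI()


async def get_fleet_states() -> List[dict]:
    """Hypothetical stand-in for however the fleet states are fetched."""
    return []


@app.get("/delivery_alerts/ongoing_task_ids")
async def ongoing_task_ids() -> List[str]:
    task_ids: List[str] = []
    for fleet in await get_fleet_states():
        try:
            robots: Dict[str, dict] = fleet["robots"]
            for robot in robots.values():
                task_id = robot.get("task_id")
                if task_id:
                    task_ids.append(task_id)
        except (KeyError, AttributeError, TypeError):
            # Only the errors we expect from a malformed fleet state are
            # handled here; anything unexpected propagates and FastAPI
            # returns an HTTP 500 for the request.
            continue
    return task_ids
```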
):
    logger.info(f"Found a dangling alert: {d}")
    try:
        dangling_alert = await ttm.DeliveryAlert.get_or_none(id=d.id)
We already retrieved the alerts earlier; this is an unnecessary database read.
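For illustration, a sketch of this suggestion, assuming the rows from the initial query are kept around (`alerts`, `current_task_ids`, and `logger` are placeholders assumed to be in scope; `update_from_dict` is the same Tortoise ORM helper already used in the diff):

```python
# Assumes `alerts` holds the DeliveryAlert rows fetched by the earlier query.
for alert in alerts:
    if alert.task_id not in current_task_ids:
        logger.info(f"Found a dangling alert: {alert}")
        # Reuse the row we already have instead of a second get_or_none(id=...).
        alert.update_from_dict({"action": "resume"})
        await alert.save()
```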
        if dangling_alert is None:
            logger.error(f"Failed to retrieve dangling alert {d.id}")
            continue
        dangling_alert.update_from_dict({"action": "resume"})
I don't think we should be updating the alerts like this; the fleet adapter should notify us that the alert is done. This creates multiple sources of truth for the actual state of the delivery alert.
This fix is for the odd case where either:
1. the alert update from the fleet adapter was dropped and missing,
2. the alert update from the UI was dropped (network connectivity, wifi, or any number of other reasons the UI failed),
3. the api-server restarts and misses either one of the above, or
4. there was a major issue and the fleet adapter had to be restarted.
This basically causes a "dangling" alert: every time we refresh the UI, this expired alert will still be present, waiting for the user to respond (or, in the obstruction case, waiting for the fleet adapter to respond). But for cases (1), (3), or (4), it will just never be resolved.
Implementation-wise, there will never be a delivery alert that is not tied to an ongoing task; I'll supplement this block with some notes about these phenomena. With all this in mind, I don't believe we have multiple sources of truth: any new delivery alert for the same task will overwrite what's in the DB (even after resume), and can require user intervention again.
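For illustration only, the overwrite behaviour described above could look roughly like the following, assuming alerts are keyed on `task_id`; the field names in `defaults` are placeholders, not necessarily the actual model fields.

```python
# Assumes ttm.DeliveryAlert is the existing Tortoise ORM model and that
# task_id uniquely identifies the delivery alert for a task.
await ttm.DeliveryAlert.update_or_create(
    task_id=task_id,
    defaults={
        "action": "waiting",  # a new alert re-opens the entry, even after "resume"
        "message": message,
    },
)
```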
An example of conflicting sources of truth would be if another node subscribes to the delivery alerts: the fleet adapter did not publish the new status, but the api server assumes on its own that it is resumed, while the other node still thinks the alert is pending.
The proper fix would be to decouple UI alerts from delivery alerts. If the user knows the data is not properly synced, or chooses to dismiss the alert for whatever reason, we will no longer show the alert regardless of the RMF status (until the fleet adapter publishes a new delivery alert again).
> An example of conflicting sources of truth would be if another node subscribes to the delivery alerts: the fleet adapter did not publish the new status, but the api server assumes on its own that it is resumed, while the other node still thinks the alert is pending.
But the delivery alerts are only tied to task IDs, so if not a single robot is performing this task, we can safely assume that it is considered dangling.
If any other node is subscribed to the delivery alerts, it is not within the scope of what the delivery alerts should do, and we can ignore those behaviors.
> The proper fix would be to decouple UI alerts from delivery alerts. If the user knows the data is not properly synced, or chooses to dismiss the alert for whatever reason, we will no longer show the alert regardless of the RMF status (until the fleet adapter publishes a new delivery alert again).
So what you're suggesting would perhaps be not using the DB at all for the delivery alerts.
- Fleet adapter publishes, api-server subscribes, pushes an alert onto the UI
- if the user refreshes the UI, this alert disappears until the fleet adapter publishes it again (say every 30 seconds; edit: this will need to be implemented), at which point it will show up again
- user interactions on the UI alert will post to the server, which will publish the alert response over ROS 2 to the fleet adapter (sketched below)
For us, we will likely just be logging the events and user interactions, as opposed to keeping track of the alerts in the DB. Is that the same as what you have in mind?
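A minimal sketch of that bridging flow, assuming hypothetical endpoint, payload, and publisher names; the actual ROS 2 delivery alert response message type is not shown and would come from the fleet adapter's interface.

```python
import logging

from fastapi import FastAPI
from pydantic import BaseModel

logger = logging.getLogger(__name__)
app = FastAPI()


class DeliveryAlertResponse(BaseModel):
    alert_id: str
    task_id: str
    action: str  # e.g. "resume" or "cancel", chosen by the user on the UI


class RosBridge:
    """Hypothetical wrapper around the rclpy node/publisher (omitted here)."""

    def publish_delivery_alert_response(self, response: DeliveryAlertResponse) -> None:
        ...


ros_bridge = RosBridge()


@app.post("/delivery_alerts/{alert_id}/response")
async def respond_to_delivery_alert(alert_id: str, response: DeliveryAlertResponse):
    # No DB write: log the user interaction for the record, then forward the
    # response to the fleet adapter over ROS 2.
    logger.info("delivery alert %s: user chose %s", alert_id, response.action)
    ros_bridge.publish_delivery_alert_response(response)
    return {"status": "forwarded"}
```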
> But the delivery alerts are only tied to task IDs, so if not a single robot is performing this task, we can safely assume that it is considered dangling.
Yes, but there is no defined behavior for a dangling alert; we cannot assume that it means the delivery has resumed. Other nodes which may use delivery alerts (even though there aren't any at the moment) do not hold the same assumption, leading to a conflict of truths.
We will still use the DB for alerts, but we need to differentiate between a delivery alert and an alert shown on the UI. A user never dismisses a delivery alert directly; rather, they dismiss the UI alert, which triggers actions that send the appropriate messages to RMF (done by the api server).
So a dangling delivery alert should still be dangling, but it will not cause any alerts to be displayed on the UI after the user has dismissed it.
Actually, the fact that there are dangling alerts already means that there are multiple sources of truth. AFAIK, the owner of this data is the fleet adapter, so the api server should not be making its own assumptions.
I can see two ways around this: one is the proposal I mentioned earlier, i.e. do not use delivery alerts as UI alerts; the second is to not rely on persisting delivery alerts at all, i.e. the fleet adapter should publish delivery alerts like it does fleet states, constantly publishing the list of current alerts so the api server does not need to build its own assumption of the state.
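For illustration, the second option could look roughly like this on the fleet adapter side, assuming a Python (rclpy) adapter; `std_msgs/String` carrying JSON and the topic name are placeholders for whatever delivery alert message and topic are actually used.

```python
import json

import rclpy
from rclpy.node import Node
from std_msgs.msg import String  # placeholder for the real delivery alert message


class DeliveryAlertRepublisher(Node):
    """Owns the delivery alert state and republishes it periodically."""

    def __init__(self):
        super().__init__("delivery_alert_republisher")
        self._active_alerts = {}  # alert_id -> alert dict, owned by the adapter
        self._pub = self.create_publisher(String, "delivery_alerts", 10)
        # Republish the full list every few seconds, like fleet states, so
        # subscribers (e.g. the api server) never need to persist or infer it.
        self.create_timer(5.0, self._publish_alerts)

    def _publish_alerts(self):
        msg = String()
        msg.data = json.dumps(list(self._active_alerts.values()))
        self._pub.publish(msg)


def main():
    rclpy.init()
    rclpy.spin(DeliveryAlertRepublisher())


if __name__ == "__main__":
    main()
```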
Per DM, as long as we ensure that the fleet adapters keep the state of the delivery alerts and will prompt again periodically if no user intervention has started, we can skip the use of the DB entirely and just bridge the UI user interactions to ROS 2 delivery alert responses.
Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
Signed-off-by: Aaron Chong <aaronchongth@gmail.com>
The improved typings are very helpful!
Just tested with an old DB, and no migration was required either 🥳
What's new
When delivery alerts are queried, we check for ongoing tasks, and if there exists a dangling delivery alert, we resume it.
Self-checks