Hammer/fix dangling delivery alerts #905
Changes from 4 commits
@@ -9,6 +9,8 @@
from api_server.fast_io import FastIORouter, SubscriptionRequest
from api_server.gateway import rmf_gateway
from api_server.logger import logger
from api_server.models import FleetState
from api_server.models import tortoise_models as ttm
from api_server.models.delivery_alerts import (
    DeliveryAlert,
@@ -58,16 +60,59 @@ async def query_delivery_alerts(
    try:
        delivery_alerts = await ttm.DeliveryAlert.filter(**filters)
-        return [
+        delivery_alerts_pydantic = [
            await ttm.DeliveryAlertPydantic.from_tortoise_orm(a)
            for a in delivery_alerts
        ]
    except FieldError as e:
        raise HTTPException(422, str(e)) from e
    ongoing_delivery_alerts = []
    # Check for dangling delivery alerts, where none of the fleets are still
    # performing the task
    current_task_ids = []
    try:
        db_states = await ttm.FleetState.all().values_list("data", flat=True)
        fleets = [FleetState(**s) for s in db_states]
        for f in fleets:
            if f.robots is None:
                continue
            for robot in f.robots.values():
                if robot.task_id is not None and len(robot.task_id) != 0:
                    current_task_ids.append(robot.task_id)
    except Exception as e:  # pylint: disable=broad-except
        logger.error(f"Failed to retrieve current running task IDs: {e}")
        return []
    logger.info(f"Current running task IDs: {current_task_ids}")
    for d in delivery_alerts_pydantic:
        if (
            d.task_id is not None
            and len(d.task_id) != 0
            and d.task_id not in current_task_ids
        ):
            logger.info(f"Found a dangling alert: {d}")
            try:
                dangling_alert = await ttm.DeliveryAlert.get_or_none(id=d.id)
We already retrieved the alerts earlier; this is an unnecessary database read.
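A possible shape for this suggestion, sketched under the assumption that it runs inside the existing handler where `delivery_alerts`, `delivery_alerts_pydantic`, and `current_task_ids` are already in scope; the `alerts_by_id` helper is illustrative and not part of this PR:

```python
# Sketch only: index the ORM objects already returned by
# ttm.DeliveryAlert.filter(**filters) so no second query is needed per alert.
alerts_by_id = {a.id: a for a in delivery_alerts}

for d in delivery_alerts_pydantic:
    if d.task_id and d.task_id not in current_task_ids:
        dangling_alert = alerts_by_id.get(d.id)
        if dangling_alert is None:
            continue
        # Reuse the object fetched earlier instead of calling
        # ttm.DeliveryAlert.get_or_none again.
        dangling_alert.update_from_dict({"action": "resume"})
        await dangling_alert.save()
```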
                if dangling_alert is None:
                    logger.error(f"Failed to retrieve dangling alert {d.id}")
                    continue
                dangling_alert.update_from_dict({"action": "resume"})
I don't think we should be updating the alerts like this; the fleet adapter should notify us that the alert is done. This creates multiple sources of truth for the actual state of the delivery alert.

This fix is for the odd case where either …

This will basically cause a "dangling" alert, where every time we refresh the UI, this expired alert will still be present, waiting for the user to respond (or, in the obstruction case, waiting for the fleet adapter to respond). But for cases (1), (3), or (4), it will just never be resolved. Implementation-wise, there will never be a delivery alert that is not tied to an ongoing task. I'll supplement this block with some notes about these phenomena. With all of this in mind, I don't believe we have multiple sources of truth. Any new delivery alert for the same task will overwrite what's in the DB (even after …).

An example of conflicting sources of truth would be if another node subscribes to the delivery alerts, the fleet adapter did not publish the new status, but the API server assumes the alert is resumed by itself, while the other node still thinks the alert is pending. The proper fix would be to decouple UI alerts from delivery alerts. If the user knows the data is not properly synced, or chooses to dismiss the alert for whatever reason, we will no longer show the alert regardless of the RMF status (until the fleet adapter publishes a new delivery alert again).

But the delivery alerts are only tied to task IDs, so if not a single robot is performing this task, we can safely assume that it is dangling. If any other node is subscribed to the delivery alerts, that is not within the scope of what the delivery alerts should do, and we can ignore those behaviors. So what you're suggesting would perhaps be not using the DB at all for the delivery alerts. For us, we will likely just be logging the events and user interactions, as opposed to keeping track of the alerts in the DB. Is that the same as what you have in mind?

Yes, but there is no defined behavior for a dangling alert; we cannot assume that it means the delivery has resumed. Other nodes which may use delivery alerts (even though there aren't any at the moment) do not hold the same assumption, leading to a conflict of truths. We will still use the DB for alerts, but we need to differentiate between a delivery alert and an alert shown in the UI. A user never dismisses a delivery alert directly; rather, they dismiss the UI alert, which triggers actions that send the appropriate messages to RMF (done by the API server). So a dangling delivery alert should still be dangling, but it will not cause any alerts to be displayed in the UI after the user has dismissed it.

Actually, the fact that there are dangling alerts already means that there are multiple sources of truth. AFAIK the owner of this data is the fleet adapter, so the API server should not be making its own assumptions. I can see two ways around this: one is the proposal I mentioned earlier, do not use delivery alerts as UI alerts; a second is to not rely on persisting delivery alerts, i.e. the fleet adapter should be publishing delivery alerts like the fleet state, constantly publishing a list of current alerts so the API server does not need to build its own assumption of the state.

Per DM, as long as we ensure that the fleet adapters keep the states of the delivery alerts, and prompt again periodically if no user intervention has started, we can skip the use of the DB entirely and just bridge the UI user interactions to ROS 2 delivery alert responses.
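For illustration only, a minimal sketch of the decoupling discussed above; `UiAlert`, `dismissed`, and `dismiss_ui_alert` are hypothetical names that do not exist in `api_server` or in this PR:

```python
# Hypothetical sketch: the delivery alert remains whatever the fleet adapter
# last reported, while dismissal only touches a separate UI-facing record, so
# a dangling delivery alert simply stops being shown instead of being
# rewritten to "resume" by the api server.
from tortoise import fields, models


class UiAlert(models.Model):
    id = fields.CharField(max_length=255, pk=True)
    delivery_alert_id = fields.CharField(max_length=255, index=True)
    dismissed = fields.BooleanField(default=False)


async def dismiss_ui_alert(ui_alert_id: str) -> None:
    # Hide the alert in the dashboard without mutating the delivery alert
    # state owned by the fleet adapter.
    ui_alert = await UiAlert.get(id=ui_alert_id)
    ui_alert.dismissed = True
    await ui_alert.save()
```

This keeps the fleet adapter as the single owner of delivery alert state, while the dashboard only tracks what the operator has acknowledged.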
                await dangling_alert.save()
            except Exception as e:  # pylint: disable=broad-except
                logger.error(f"Failed to resolve dangling alert ID {d.id}: {e}")
                continue
            logger.info(
                f"Resolved dangling alert ID {d.id} by updating action to resume"
            )
        else:
            ongoing_delivery_alerts.append(d)
    return ongoing_delivery_alerts

@router.get("/{delivery_alert_id}", response_model=ttm.DeliveryAlertPydantic)
async def get_delivery_alert(delivery_alert_id: str):
    delivery_alert = await ttm.DeliveryAlert.get_or_none(id=delivery_alert_id)
    if delivery_alert is None:
        raise HTTPException(
Do not disable the linter without a good reason.

For this particular scenario, where there's a lot of action happening under the hood, especially with casting involved, I would prefer a broad except rather than extensively writing out every single possible exception.

You should still write out the exceptions you can handle, and let uncaught exceptions propagate up to FastAPI, which will return an HTTP error code. Besides, even if we were to manage the states ourselves (which we should not), this should be done when we ingest the task states; querying should not cause this kind of side effect.
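A possible shape for the narrower exception handling suggested here, assuming `FleetState` is a pydantic model (its keyword-expansion construction in the diff suggests so) and that this fragment lives inside the existing async handler; anything unexpected then propagates and FastAPI answers with a 500:

```python
# Sketch only: catch the one failure we can meaningfully handle (a malformed
# fleet state document) and let everything else propagate to FastAPI.
from pydantic import ValidationError

current_task_ids: list[str] = []
db_states = await ttm.FleetState.all().values_list("data", flat=True)
for s in db_states:
    try:
        fleet = FleetState(**s)
    except ValidationError as e:
        logger.error(f"Skipping malformed fleet state: {e}")
        continue
    if fleet.robots is None:
        continue
    for robot in fleet.robots.values():
        if robot.task_id:
            current_task_ids.append(robot.task_id)
```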