Hammer/fix dangling delivery alerts #905

Merged · 8 commits · Feb 26, 2024
packages/api-server/api_server/gateway.py (7 changes: 6 additions & 1 deletion)
@@ -215,7 +215,12 @@ def convert_delivery_alert(delivery_alert: RmfDeliveryAlert):
            RmfDeliveryAlert,
            "delivery_alert_request",
            lambda msg: rmf_events.delivery_alerts.on_next(convert_delivery_alert(msg)),
            10,
            rclpy.qos.QoSProfile(
                history=rclpy.qos.HistoryPolicy.KEEP_LAST,
                depth=10,
                reliability=rclpy.qos.ReliabilityPolicy.RELIABLE,
                durability=rclpy.qos.DurabilityPolicy.TRANSIENT_LOCAL,
            ),
        )
        self._subscriptions.append(delivery_alert_request_sub)
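
A note on the new QoS profile: TRANSIENT_LOCAL durability only helps a late-joining subscriber (for example, an api-server that restarts after a request was already sent) if the publishing side also offers TRANSIENT_LOCAL durability, since the two profiles must be compatible for old messages to be re-delivered. Below is a minimal rclpy sketch of a matching publisher; it is illustrative only, with std_msgs/String standing in for the actual delivery alert message type and a made-up node name.

# Illustrative sketch of a publisher whose QoS is compatible with the
# subscription above; std_msgs/String stands in for the real message type.
import rclpy
from rclpy.node import Node
from rclpy.qos import (
    DurabilityPolicy,
    HistoryPolicy,
    QoSProfile,
    ReliabilityPolicy,
)
from std_msgs.msg import String

latched_qos = QoSProfile(
    history=HistoryPolicy.KEEP_LAST,
    depth=10,
    reliability=ReliabilityPolicy.RELIABLE,
    durability=DurabilityPolicy.TRANSIENT_LOCAL,
)

rclpy.init()
node = Node("delivery_alert_qos_demo")
# With TRANSIENT_LOCAL on both ends, the last messages (up to the history
# depth) are re-delivered to subscribers that join after they were published.
publisher = node.create_publisher(String, "delivery_alert_request", latched_qos)
publisher.publish(String(data="example delivery alert request"))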

packages/api-server/api_server/routes/delivery_alerts.py (47 changes: 46 additions & 1 deletion)
@@ -9,6 +9,8 @@

from api_server.fast_io import FastIORouter, SubscriptionRequest
from api_server.gateway import rmf_gateway
from api_server.logger import logger
from api_server.models import FleetState
from api_server.models import tortoise_models as ttm
from api_server.models.delivery_alerts import (
    DeliveryAlert,
@@ -58,16 +60,59 @@ async def query_delivery_alerts(

    try:
        delivery_alerts = await ttm.DeliveryAlert.filter(**filters)
        return [
        delivery_alerts_pydantic = [
            await ttm.DeliveryAlertPydantic.from_tortoise_orm(a)
            for a in delivery_alerts
        ]
    except FieldError as e:
        raise HTTPException(422, str(e)) from e

    ongoing_delivery_alerts = []
    # Check for dangling delivery alerts, where none of the fleets are still
    # performing the task
    current_task_ids = []
    try:
        db_states = await ttm.FleetState.all().values_list("data", flat=True)
        fleets = [FleetState(**s) for s in db_states]
        for f in fleets:
            if f.robots is None:
                continue
            for robot in f.robots.values():
                if robot.task_id is not None and len(robot.task_id) != 0:
                    current_task_ids.append(robot.task_id)
    except Exception as e: # pylint: disable=broad-except
Collaborator:

Do not disable the linter without a good reason.

Member Author:

For this particular scenario, where there is a lot of action happening under the hood, especially with casting involved, I would prefer a broad except rather than extensively writing out every single possible exception.

Collaborator:

You should still write out the exceptions which you can handle, and let uncaught exceptions propagate up to FastAPI so it returns an HTTP error code.

Although even if we were to manage the states ourselves (which we should not), this should be done when we ingest the task states; querying should not cause this kind of side effect.
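
For illustration, a narrower handler along these lines might look like the following sketch. It assumes pydantic's ValidationError is the casting failure of interest when rebuilding FleetState objects from the stored data, relies on the module's existing imports (ttm, FleetState, logger), and lets any other exception propagate up to FastAPI:

    try:
        db_states = await ttm.FleetState.all().values_list("data", flat=True)
        fleets = [FleetState(**s) for s in db_states]
    except ValidationError as e:  # from pydantic import ValidationError
        # Only the expected parsing failure is handled here; anything else
        # propagates so FastAPI can turn it into an HTTP error response.
        logger.error(f"Failed to parse stored fleet states: {e}")
        return []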

        logger.error(f"Failed to retrieve current running task IDs: {e}")
        return []

    logger.info(f"Current running task IDs: {current_task_ids}")
    for d in delivery_alerts_pydantic:
        if (
            d.task_id is not None
            and len(d.task_id) != 0
            and d.task_id not in current_task_ids
        ):
            logger.info(f"Found a dangling alert: {d}")
            try:
                dangling_alert = await ttm.DeliveryAlert.get_or_none(id=d.id)
Collaborator:

We already retrieved the alerts earlier; this is an unnecessary database read.
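
One way to avoid the extra read, assuming the delivery_alerts list fetched by the earlier ttm.DeliveryAlert.filter(**filters) call is still in scope, would be to index those ORM objects once and look each alert up by ID, roughly:

    # Sketch: build the index once, before the loop over delivery_alerts_pydantic...
    alerts_by_id = {alert.id: alert for alert in delivery_alerts}

    # ...then, inside the loop, reuse the already-fetched ORM object instead of
    # calling ttm.DeliveryAlert.get_or_none(id=d.id) again:
    dangling_alert = alerts_by_id.get(d.id)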

                if dangling_alert is None:
                    logger.error(f"Failed to retrieve dangling alert {d.id}")
                    continue
                dangling_alert.update_from_dict({"action": "resume"})
Collaborator:

I don't think we should be updating the alerts like this; the fleet adapter should notify us that the alert is done. This creates multiple sources of truth for the actual state of the delivery alert.

Member Author:

This fix is for the odd case where either

  1. the alert update from the fleet adapter was dropped and missing
  2. the alert update from the UI was dropped (network connectivity, wifi, or any number of other reasons the UI failed)
  3. the api-server restarts and misses either one of the above
  4. there was a major issue and the fleet adapter had to be restarted

Any of these will basically cause a "dangling" alert, where every time we refresh the UI, this expired alert will still be present, waiting for the user to respond (or, in the obstruction case, waiting for the fleet adapter to respond). But for cases (1), (3), or (4), it will just never be resolved.

Implementation-wise, there will never be a delivery alert that is not tied to an ongoing task. I'll supplement this block with some notes about these phenomena. But with all this in mind, I don't believe we have multiple sources of truth. Any new delivery alert for the same task will overwrite what's in the DB (even after resume), and can require user intervention again.

Collaborator:

An example of conflicting sources of truth would be another node subscribing to the delivery alerts: the fleet adapter did not publish a new status, but the api server assumes by itself that the alert has resumed, while the other node still thinks the alert is pending.

The proper fix would be to decouple UI alerts from delivery alerts. If the user knows the data is not properly synced, or chooses to dismiss the alert for whatever reason, we will no longer show the alert regardless of the RMF status (until the fleet adapter publishes a new delivery alert again).

Member Author (@aaronchongth), Feb 22, 2024:

> An example of conflicting sources of truth would be another node subscribing to the delivery alerts: the fleet adapter did not publish a new status, but the api server assumes by itself that the alert has resumed, while the other node still thinks the alert is pending.

But the delivery alerts are only tied to task IDs, so if not a single robot is performing this task, we can safely assume that it is dangling.

If any other node is subscribed to the delivery alerts, it is not within the scope of what the delivery alerts should do, and we can ignore those behaviors.

> The proper fix would be to decouple UI alerts from delivery alerts. If the user knows the data is not properly synced, or chooses to dismiss the alert for whatever reason, we will no longer show the alert regardless of the RMF status (until the fleet adapter publishes a new delivery alert again).

So what you're suggesting would perhaps be not using the DB at all for the delivery alerts:

  • The fleet adapter publishes, the api-server subscribes and pushes an alert onto the UI.
  • If the user refreshes the UI, this alert disappears until the fleet adapter publishes it again (let's say every 30 seconds; edit: this will need to be implemented), at which point it shows up again.
  • User interactions on the UI alert will post to the server, which will publish the alert response over ROS 2 to the fleet adapter.

For us, we will likely just be logging the events and user interactions, as opposed to keeping track of the alerts in the DB. Is that the same as what you have in mind?

Collaborator:

> But the delivery alerts are only tied to task IDs, so if not a single robot is performing this task, we can safely assume that it is dangling.

Yes, but there is no defined behavior for a dangling alert; we cannot assume that it means the delivery has resumed. Other nodes which may use delivery alerts (even though there aren't any at the moment) do not hold the same assumption, leading to a conflict of truths.

We will still use the DB for alerts, but we need to differentiate between a delivery alert and an alert shown on the UI. A user never dismisses a delivery alert directly; rather, they dismiss the UI alert, which triggers actions that send the appropriate messages to RMF (done by the api server).

So a dangling delivery alert should still be dangling, but it will not cause any alerts to be displayed on the UI after the user has dismissed it.

Collaborator:

Actually, the fact that there are dangling alerts already means that there are multiple sources of truth. AFAIK, the owner of this data is the fleet adapter, so the api server should not be making its own assumptions.

I can see two ways around this: one is the proposal I mentioned earlier, i.e. do not use delivery alerts as UI alerts; the second is to not rely on persisting delivery alerts at all, i.e. the fleet adapter should publish delivery alerts like it publishes fleet states, constantly publishing the list of current alerts so the api server does not need to build its own assumption of the state.

Member Author:

Per DM: as long as we ensure that the fleet adapters keep the state of the delivery alerts and prompt again periodically if no user intervention has started, we can skip the use of the DB entirely and just bridge UI user interactions to ROS 2 delivery alert responses.
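
As a rough illustration of that direction (not part of this PR), a response route could bridge the UI interaction straight to ROS 2 without touching the database. This assumes rmf_gateway() returns the gateway singleton imported at the top of this file, and the respond_to_delivery_alert helper below is hypothetical, standing in for whatever method ends up publishing the delivery alert response message:

@router.post("/{delivery_alert_id}/response")
async def respond_to_delivery_alert_request(delivery_alert_id: str, action: str):
    # Forward the user's decision to the fleet adapter over ROS 2; nothing is
    # persisted, so the fleet adapter remains the single source of truth.
    rmf_gateway().respond_to_delivery_alert(
        alert_id=delivery_alert_id,
        action=action,
    )
    logger.info(f"Forwarded action '{action}' for delivery alert {delivery_alert_id}")
    return {"id": delivery_alert_id, "action": action}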

                await dangling_alert.save()
            except Exception as e: # pylint: disable=broad-except
                logger.error(f"Failed to resolve dangling alert ID {d.id}: {e}")
                continue
            logger.info(
                f"Resolved dangling alert ID {d.id} by updating action to resume"
            )
        else:
            ongoing_delivery_alerts.append(d)
    return ongoing_delivery_alerts


@router.get("/{delivery_alert_id}", response_model=ttm.DeliveryAlertPydantic)
async def get_delivery_alert(delivery_alert_id: str):

    delivery_alert = await ttm.DeliveryAlert.get_or_none(id=delivery_alert_id)
    if delivery_alert is None:
        raise HTTPException(