Wait Set Index Too Big #1381

Open
gkuppa24 opened this issue Nov 26, 2024 · 9 comments
Labels
more-information-needed Further information is required

Comments

@gkuppa24

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04
  • Installation type:
    • binaries
  • Version or commit hash: Humble
  • DDS implementation: cyclone_dds
  • Client library (if applicable): rclpy

Steps to reproduce issue

Traceback (most recent call last):
  File "/home/gauravkuppa24/.cache/bazel/_bazel_gauravkuppa24/0f5b194a5c00fda748ae81b2326da879/execroot/baserepo4/bazel-out/k8-fastbuild/bin/robot/task/task_manager/launch_high_level_planner/launch_high_level_planner.runfiles/baserepo4/robot/hardware/easo/lib/single_arm_node.py", line 659, in request_collision_cuboid_update
    self.executor.spin_until_future_complete(self.future,
  File "/home/gauravkuppa24/.cache/bazel/_bazel_gauravkuppa24/0f5b194a5c00fda748ae81b2326da879/execroot/baserepo4/bazel-out/k8-fastbuild/bin/robot/task/task_manager/launch_high_level_planner/launch_high_level_planner.runfiles/baserepo4/external/ros2_rclpy/rclpy/rclpy/executors.py", line 310, in spin_until_future_complete
    self.spin_once_until_future_complete(future, timeout_left)
  File "/home/gauravkuppa24/.cache/bazel/_bazel_gauravkuppa24/0f5b194a5c00fda748ae81b2326da879/execroot/baserepo4/bazel-out/k8-fastbuild/bin/robot/task/task_manager/launch_high_level_planner/launch_high_level_planner.runfiles/baserepo4/external/ros2_rclpy/rclpy/rclpy/executors.py", line 801, in spin_once_until_future_complete
    self._spin_once_impl(timeout_sec, future.done)
  File "/home/gauravkuppa24/.cache/bazel/_bazel_gauravkuppa24/0f5b194a5c00fda748ae81b2326da879/execroot/baserepo4/bazel-out/k8-fastbuild/bin/robot/task/task_manager/launch_high_level_planner/launch_high_level_planner.runfiles/baserepo4/external/ros2_rclpy/rclpy/rclpy/executors.py", line 775, in _spin_once_impl
    handler, entity, node = self.wait_for_ready_callbacks(
  File "/home/gauravkuppa24/.cache/bazel/_bazel_gauravkuppa24/0f5b194a5c00fda748ae81b2326da879/execroot/baserepo4/bazel-out/k8-fastbuild/bin/robot/task/task_manager/launch_high_level_planner/launch_high_level_planner.runfiles/baserepo4/external/ros2_rclpy/rclpy/rclpy/executors.py", line 711, in wait_for_ready_callbacks
    return next(self._cb_iter)
  File "/home/gauravkuppa24/.cache/bazel/_bazel_gauravkuppa24/0f5b194a5c00fda748ae81b2326da879/execroot/baserepo4/bazel-out/k8-fastbuild/bin/robot/task/task_manager/launch_high_level_planner/launch_high_level_planner.runfiles/baserepo4/external/ros2_rclpy/rclpy/rclpy/executors.py", line 630, in _wait_for_ready_callbacks
    if wt in waitables and wt.is_ready(wait_set):
  File "/home/gauravkuppa24/.cache/bazel/_bazel_gauravkuppa24/0f5b194a5c00fda748ae81b2326da879/execroot/baserepo4/bazel-out/k8-fastbuild/bin/robot/task/task_manager/launch_high_level_planner/launch_high_level_planner.runfiles/ros2_rclpy/rclpy/rclpy/qos_event.py", line 90, in is_ready
    if wait_set.is_ready('event', self._event_index):
IndexError: wait set index too big

Expected behavior

Actual behavior

Additional information

I cannot provide the code I am using to let you reproduce this bug, but it is an rclpy process with a single MultiThreadedExecutor. The executor spins three nodes and always fails at self.executor.spin_until_future_complete(self.future) with this wait-set index error.

Any advice on how to approach this problem would be much appreciated.
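
To give a rough idea of the pattern, here is a minimal stand-in sketch (node, service, and type names are hypothetical, not the actual code): one MultiThreadedExecutor spinning three nodes, with one of them waiting on a service future via spin_until_future_complete().

import rclpy
from rclpy.executors import MultiThreadedExecutor
from rclpy.node import Node
from std_srvs.srv import Trigger  # stand-in for the real collision-cuboid service


class ArmNode(Node):
    def __init__(self):
        super().__init__('single_arm_node')
        self.client = self.create_client(Trigger, 'update_collision_cuboid')

    def request_update(self, executor):
        future = self.client.call_async(Trigger.Request())
        # The IndexError in the traceback above is raised from inside this call.
        executor.spin_until_future_complete(future, timeout_sec=5.0)
        return future.result()


def main():
    rclpy.init()
    arm = ArmNode()
    aux1, aux2 = Node('aux_node_1'), Node('aux_node_2')
    executor = MultiThreadedExecutor()
    for n in (arm, aux1, aux2):
        executor.add_node(n)
    try:
        arm.request_update(executor)
    finally:
        executor.shutdown()
        rclpy.shutdown()


if __name__ == '__main__':
    main()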



@Barry-Xu-2018
Contributor

Could you try using Fast DDS (instead of cyclone_dds) to check whether the problem still occurs in your environment?

@gkuppa24
Author

gkuppa24 commented Nov 26, 2024

Sure, I can give this a shot. Would a service call cause a QoS event to be added to the wait set? That seems to be the source of the bug.

@gkuppa24
Author

Neither cyclone_dds nor fastrtps_cpp works.

@Barry-Xu-2018
Contributor

Barry-Xu-2018 commented Nov 27, 2024

Sure, I can give this a shot. Would a service call cause a QoS event to be added to the wait set? That seems to be the source of the bug.

While creating a publisher or subscription, you can set event_callbacks.
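
A minimal sketch of what that looks like in rclpy (the topic name and callbacks here are illustrative, not taken from your code):

import rclpy
from rclpy.node import Node
from rclpy.qos import QoSProfile
from rclpy.qos_event import PublisherEventCallbacks, SubscriptionEventCallbacks
from std_msgs.msg import String


class EventAwareNode(Node):
    def __init__(self):
        super().__init__('event_aware_node')
        # QoS event callbacks are attached at creation time via event_callbacks.
        pub_events = PublisherEventCallbacks(
            incompatible_qos=lambda e: self.get_logger().warning(f'pub incompatible QoS: {e}'))
        sub_events = SubscriptionEventCallbacks(
            incompatible_qos=lambda e: self.get_logger().warning(f'sub incompatible QoS: {e}'))
        self.pub = self.create_publisher(
            String, 'chatter', QoSProfile(depth=10), event_callbacks=pub_events)
        self.sub = self.create_subscription(
            String, 'chatter', lambda msg: None, QoSProfile(depth=10),
            event_callbacks=sub_events)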

BTW, how did you determine it's related to the QoS Event?
Did you confirm the content of entity_type is "event" when "index >= num_entities" occurs?

bool
WaitSet::is_ready(const std::string & entity_type, size_t index)
{
  const void ** entities = NULL;
  size_t num_entities = 0;
  if ("subscription" == entity_type) {
    entities = reinterpret_cast<const void **>(rcl_wait_set_->subscriptions);
    num_entities = rcl_wait_set_->size_of_subscriptions;
  } else if ("client" == entity_type) {
    entities = reinterpret_cast<const void **>(rcl_wait_set_->clients);
    num_entities = rcl_wait_set_->size_of_clients;
  } else if ("service" == entity_type) {
    entities = reinterpret_cast<const void **>(rcl_wait_set_->services);
    num_entities = rcl_wait_set_->size_of_services;
  } else if ("timer" == entity_type) {
    entities = reinterpret_cast<const void **>(rcl_wait_set_->timers);
    num_entities = rcl_wait_set_->size_of_timers;
  } else if ("guard_condition" == entity_type) {
    entities = reinterpret_cast<const void **>(rcl_wait_set_->guard_conditions);
    num_entities = rcl_wait_set_->size_of_guard_conditions;
  } else if ("event" == entity_type) {
    entities = reinterpret_cast<const void **>(rcl_wait_set_->events);
    num_entities = rcl_wait_set_->size_of_events;
  } else {
    std::string error_text{"'"};
    error_text += entity_type;
    error_text += "' is not a known entity";
    throw std::runtime_error(error_text);
  }
  if (!entities) {
    std::string error_text{"wait set '"};
    error_text += entity_type;
    error_text += "' isn't allocated";
    throw std::runtime_error(error_text);
  }
  if (index >= num_entities) {
    throw std::out_of_range("wait set index too big");
  }
  return nullptr != entities[index];
}
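
One way to confirm this from the Python side, without rebuilding anything, might be to temporarily wrap the QoS event handler's is_ready() and log the index when the exception fires. A hedged, debug-only sketch, assuming the Humble class name QoSEventHandler in rclpy.qos_event:

# Debug-only monkeypatch: log the offending event index before re-raising.
import rclpy.qos_event as qos_event

_original_is_ready = qos_event.QoSEventHandler.is_ready


def _logged_is_ready(self, wait_set):
    try:
        return _original_is_ready(self, wait_set)
    except IndexError:
        print(f'QoS event wait-set index out of range: event_index={self._event_index}')
        raise


qos_event.QoSEventHandler.is_ready = _logged_is_ready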

@MichaelOrlov added the more-information-needed (Further information is required) label on Dec 5, 2024
@JasmineRosselliSUPSI

JasmineRosselliSUPSI commented Jan 3, 2025

I have a similar error and cannot understand its origin.

I use Galactic and Ubuntu 20.04.
I also use a thread that launches multiple nodes, including nav2. If I add one node too many, the error message appears (in my case from nav2), even though the robot still manages to reach the position I wanted.

I am not sure whether the error is related to the QoS event in my case, but it seems tied to the number of active nodes, the number of active subscribers/publishers, or the thread itself causing internal issues.

Has anyone managed to solve the issue?

Edit:
I also get this other error sometimes: Error: Failed to get number of ready entities for action client: wait set index for result client is out of bounds, at /tmp/binarydeb/ros-galactic-rcl-action-3.1.3/src/rcl_action/action_client.c:635

@Barry-Xu-2018
Contributor

@JasmineRosselliSUPSI

I have analyzed the issue, but there is currently not enough information to pinpoint the cause.
Since Galactic is EOL, could you test the problem on Humble to see if it still occurs? Then I may be able to get more information from you to continue the investigation.

@JasmineRosselliSUPSI

Unfortunately, for various reasons, we cannot change ROS 2 versions and have to stay on Galactic.
For the moment I have worked around the problem by launching the ‘problematic node’ from another Raspberry Pi, which is connected to the same network and visible from the main one.
With this setup, the error no longer appears.

Additional information on the nature of the error and the working context:

  • I work with a TurtleBot 4 equipped with a main Raspberry Pi, to which a second Raspberry Pi has been added. The latter, among other things, provides the Wi-Fi network that the system connects to.

  • The TurtleBot 4's Raspberry Pi, the computer of its iRobot Create 3 base, and the additional Raspberry Pi are on the same network and can access each other's topics, services, and actions.

  • The error appears on the TurtleBot 4's main Raspberry Pi.

  • I constantly monitored the CPU with the mpstat command (refreshing every 2 s) and did not detect activity intense enough to suggest a computation problem, even when the error appeared.

  • Several nodes are launched on the TurtleBot 4's Raspberry Pi, including a node that activates nav2. The nodes are not all launched at the same time, but they all activate correctly.
    The script that launches the nodes contains a thread that handles the interfacing between a non-ROS GUI and the TurtleBot 4; it acts as a kind of ‘Controller’.
    A new process is opened to activate the nav2 navigation node.

  • The problematic node collects all the important data during the robot's movements (subscribers on different topics), including its position. This node is launched independently and uses rclpy.spin_once(); we call it the ‘Collector’.
    A new process is also opened to activate the ‘Collector’ node.
    An additional script interfaces with it via a service to access the data in question. This class is instantiated inside the ‘Controller’ and uses rclpy.spin_until_future_complete(); we call it the ‘CollectorManager’ (a sketch of this pattern follows this list).

  • The error appears when the robot is moving (using nav2's goToPose()) and the “CollectorManager” asks the “Collector” for the robot's position. It is the TurtleBot 4's navigation node that reports the error. The robot nevertheless finishes its movement and arrives at its destination.
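
A sketch of the Collector / CollectorManager pattern described above (all names and message/service types are hypothetical stand-ins, not the actual code):

import rclpy
from rclpy.node import Node
from geometry_msgs.msg import PoseStamped
from std_srvs.srv import Trigger  # stand-in for the real "get position" service


class Collector(Node):
    """Launched in its own process and spun with rclpy.spin_once() in a loop."""

    def __init__(self):
        super().__init__('collector')
        self.last_pose = None
        self.create_subscription(PoseStamped, 'robot_pose', self._on_pose, 10)
        self.create_service(Trigger, 'get_position', self._on_request)

    def _on_pose(self, msg):
        self.last_pose = msg

    def _on_request(self, request, response):
        response.success = self.last_pose is not None
        response.message = str(self.last_pose)
        return response


class CollectorManager:
    """Instantiated inside the 'Controller'; queries the Collector via the service."""

    def __init__(self, node: Node):
        self.node = node
        self.client = node.create_client(Trigger, 'get_position')

    def get_position(self):
        future = self.client.call_async(Trigger.Request())
        rclpy.spin_until_future_complete(self.node, future, timeout_sec=5.0)
        return future.result()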

Summary:

  • There is one main thread
  • At different times, this thread activates two separate processes, one for navigation, the other for collecting data in real time
  • When the robot moves and a position request occurs, the error appears
  • The CPU never reaches critical levels
  • Without this additional node the system works perfectly
  • Moving the ‘Collector’ process to the secondary Raspberry Pi makes the error disappear

Hypothesis:

Is it possible that the problem stems from the use of threads and that the scheduler cannot handle all the tasks correctly?

@Barry-Xu-2018
Contributor

@JasmineRosselliSUPSI
Sorry for the late reply.
I understand your environment now.

Is it possible that the problem stems from the use of threads and that the scheduler cannot handle all the tasks correctly?

Which type of executor is used for the "Collector"? The SingleThreadedExecutor or the MultiThreadedExecutor?

I think this information still doesn't help me pinpoint the specific problematic code.
Can you compile the code in your environment and add some debug logs?

@JasmineRosselliSUPSI

@Barry-Xu-2018
No problem, I can keep working with the workaround for the moment. This problem is not exactly a priority right now.

I have to admit I'm not sure which type of executor I'm using. The "Collector" is a class of type threading.Thread, from Python's threading library. The additional processes, from which the ROS 2 nodes are launched, are generated using Python's subprocess library.

I'm using Spyder and VSCode. I compile the ROS 2 packages using colcon build with no problem.


Additional info:

My colleagues and I have noticed that launching ROS 2 nodes using a subprocess can cause some problems.

For info: the "Collector" is now launched on the secondary Raspberry Pi. A thread manages the main functions and launches the "Collector" node in a subprocess. The data is provided to other nodes via a service.

When the "Collector" has to provide a heavy message (in this case a SLAM-generated map, about 42x42 pixels), the node freezes at the second call: the service makes the request but receives no answer.

With small messages (e.g., asking for the robot's position), this problem does not arise.

This problem does NOT appear if the node is launched separately from another terminal (same Raspberry Pi).

No error message is displayed, but the "Collector" no longer responds.


At this point, could it be a problem related to subprocesses?
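
For comparison, here is the kind of alternative we could try (a minimal sketch with hypothetical names, not the code from this thread): spinning the Collector on its own SingleThreadedExecutor inside a background thread of the same process instead of a separate subprocess.

import threading

import rclpy
from rclpy.executors import SingleThreadedExecutor
from rclpy.node import Node


def run_collector():
    collector = Node('collector')  # stands in for the real data-collecting node
    executor = SingleThreadedExecutor()
    executor.add_node(collector)
    try:
        executor.spin()
    finally:
        executor.shutdown()
        collector.destroy_node()


rclpy.init()
collector_thread = threading.Thread(target=run_collector, daemon=True)
collector_thread.start()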
