Wait Set Index Too Big #1381

Open
gkuppa24 opened this issue Nov 26, 2024 · 9 comments
Labels
more-information-needed Further information is required

Comments

@gkuppa24

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04
  • Installation type:
    • binaries
  • Version or commit hash: Humble
  • DDS implementation: cyclone_dds
  • Client library (if applicable): rclpy

Steps to reproduce issue

Traceback (most recent call last):
  File "/home/gauravkuppa24/.cache/bazel/_bazel_gauravkuppa24/0f5b194a5c00fda748ae81b2326da879/execroot/baserepo4/bazel-out/k8-fastbuild/bin/robot/task/task_manager/launch_high_level_planner/launch_high_level_planner.runfiles/baserepo4/robot/hardware/easo/lib/single_arm_node.py", line 659, in request_collision_cuboid_update
    self.executor.spin_until_future_complete(self.future,
  File "/home/gauravkuppa24/.cache/bazel/_bazel_gauravkuppa24/0f5b194a5c00fda748ae81b2326da879/execroot/baserepo4/bazel-out/k8-fastbuild/bin/robot/task/task_manager/launch_high_level_planner/launch_high_level_planner.runfiles/baserepo4/external/ros2_rclpy/rclpy/rclpy/executors.py", line 310, in spin_until_future_complete
    self.spin_once_until_future_complete(future, timeout_left)
  File "/home/gauravkuppa24/.cache/bazel/_bazel_gauravkuppa24/0f5b194a5c00fda748ae81b2326da879/execroot/baserepo4/bazel-out/k8-fastbuild/bin/robot/task/task_manager/launch_high_level_planner/launch_high_level_planner.runfiles/baserepo4/external/ros2_rclpy/rclpy/rclpy/executors.py", line 801, in spin_once_until_future_complete
    self._spin_once_impl(timeout_sec, future.done)
  File "/home/gauravkuppa24/.cache/bazel/_bazel_gauravkuppa24/0f5b194a5c00fda748ae81b2326da879/execroot/baserepo4/bazel-out/k8-fastbuild/bin/robot/task/task_manager/launch_high_level_planner/launch_high_level_planner.runfiles/baserepo4/external/ros2_rclpy/rclpy/rclpy/executors.py", line 775, in _spin_once_impl
    handler, entity, node = self.wait_for_ready_callbacks(
  File "/home/gauravkuppa24/.cache/bazel/_bazel_gauravkuppa24/0f5b194a5c00fda748ae81b2326da879/execroot/baserepo4/bazel-out/k8-fastbuild/bin/robot/task/task_manager/launch_high_level_planner/launch_high_level_planner.runfiles/baserepo4/external/ros2_rclpy/rclpy/rclpy/executors.py", line 711, in wait_for_ready_callbacks
    return next(self._cb_iter)
  File "/home/gauravkuppa24/.cache/bazel/_bazel_gauravkuppa24/0f5b194a5c00fda748ae81b2326da879/execroot/baserepo4/bazel-out/k8-fastbuild/bin/robot/task/task_manager/launch_high_level_planner/launch_high_level_planner.runfiles/baserepo4/external/ros2_rclpy/rclpy/rclpy/executors.py", line 630, in _wait_for_ready_callbacks
    if wt in waitables and wt.is_ready(wait_set):
  File "/home/gauravkuppa24/.cache/bazel/_bazel_gauravkuppa24/0f5b194a5c00fda748ae81b2326da879/execroot/baserepo4/bazel-out/k8-fastbuild/bin/robot/task/task_manager/launch_high_level_planner/launch_high_level_planner.runfiles/ros2_rclpy/rclpy/rclpy/qos_event.py", line 90, in is_ready
    if wait_set.is_ready('event', self._event_index):
IndexError: wait set index too big

Expected behavior

Actual behavior

Additional information

I cannot provide the code I am using to let you reproduce this bug, but it is an rclpy process with a single MultiThreadedExecutor. The executor spins three nodes and always fails at self.executor.spin_until_future_complete(self.future) with this wait-set index error.

Any advice on how to approach this problem would be much appreciated.
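
To give a rough idea of the pattern, here is a minimal stand-in sketch (node, service, and type names are hypothetical, not the actual code): one MultiThreadedExecutor spinning three nodes, with one of them waiting on a service future via spin_until_future_complete().

import rclpy
from rclpy.executors import MultiThreadedExecutor
from rclpy.node import Node
from std_srvs.srv import Trigger  # stand-in for the real collision-cuboid service


class ArmNode(Node):
    def __init__(self):
        super().__init__('single_arm_node')
        self.client = self.create_client(Trigger, 'update_collision_cuboid')

    def request_update(self, executor):
        future = self.client.call_async(Trigger.Request())
        # The IndexError in the traceback above is raised from inside this call.
        executor.spin_until_future_complete(future, timeout_sec=5.0)
        return future.result()


def main():
    rclpy.init()
    arm = ArmNode()
    aux1, aux2 = Node('aux_node_1'), Node('aux_node_2')
    executor = MultiThreadedExecutor()
    for n in (arm, aux1, aux2):
        executor.add_node(n)
    try:
        arm.request_update(executor)
    finally:
        executor.shutdown()
        rclpy.shutdown()


if __name__ == '__main__':
    main()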



@Barry-Xu-2018
Contributor

Could you try using Fast DDS (instead of cyclone_dds) to check whether the problem still occurs in your environment?

@gkuppa24
Author

gkuppa24 commented Nov 26, 2024

Sure, I can give this a shot. Would a service call cause a QoS event to be added to the wait set? That seems to be the source of the bug.

@gkuppa24
Author

Neither cyclone_dds nor fastrtps_cpp works.

@Barry-Xu-2018
Contributor

Barry-Xu-2018 commented Nov 27, 2024

Sure, I can give this a shot. Would a service call cause a QoS event to be added to the wait set? That seems to be the source of the bug.

While creating a publisher or subscription, you can set event_callbacks.
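
A minimal sketch of what that looks like in rclpy (the topic name and callbacks here are illustrative, not taken from your code):

import rclpy
from rclpy.node import Node
from rclpy.qos import QoSProfile
from rclpy.qos_event import PublisherEventCallbacks, SubscriptionEventCallbacks
from std_msgs.msg import String


class EventAwareNode(Node):
    def __init__(self):
        super().__init__('event_aware_node')
        # QoS event callbacks are attached at creation time via event_callbacks.
        pub_events = PublisherEventCallbacks(
            incompatible_qos=lambda e: self.get_logger().warning(f'pub incompatible QoS: {e}'))
        sub_events = SubscriptionEventCallbacks(
            incompatible_qos=lambda e: self.get_logger().warning(f'sub incompatible QoS: {e}'))
        self.pub = self.create_publisher(
            String, 'chatter', QoSProfile(depth=10), event_callbacks=pub_events)
        self.sub = self.create_subscription(
            String, 'chatter', lambda msg: None, QoSProfile(depth=10),
            event_callbacks=sub_events)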

BTW, how did you determine it's related to the QoS Event?
Did you confirm the content of entity_type is "event" when "index >= num_entities" occurs?

bool
WaitSet::is_ready(const std::string & entity_type, size_t index)
{
  const void ** entities = NULL;
  size_t num_entities = 0;
  if ("subscription" == entity_type) {
    entities = reinterpret_cast<const void **>(rcl_wait_set_->subscriptions);
    num_entities = rcl_wait_set_->size_of_subscriptions;
  } else if ("client" == entity_type) {
    entities = reinterpret_cast<const void **>(rcl_wait_set_->clients);
    num_entities = rcl_wait_set_->size_of_clients;
  } else if ("service" == entity_type) {
    entities = reinterpret_cast<const void **>(rcl_wait_set_->services);
    num_entities = rcl_wait_set_->size_of_services;
  } else if ("timer" == entity_type) {
    entities = reinterpret_cast<const void **>(rcl_wait_set_->timers);
    num_entities = rcl_wait_set_->size_of_timers;
  } else if ("guard_condition" == entity_type) {
    entities = reinterpret_cast<const void **>(rcl_wait_set_->guard_conditions);
    num_entities = rcl_wait_set_->size_of_guard_conditions;
  } else if ("event" == entity_type) {
    entities = reinterpret_cast<const void **>(rcl_wait_set_->events);
    num_entities = rcl_wait_set_->size_of_events;
  } else {
    std::string error_text{"'"};
    error_text += entity_type;
    error_text += "' is not a known entity";
    throw std::runtime_error(error_text);
  }
  if (!entities) {
    std::string error_text{"wait set '"};
    error_text += entity_type;
    error_text += "' isn't allocated";
    throw std::runtime_error(error_text);
  }
  if (index >= num_entities) {
    throw std::out_of_range("wait set index too big");
  }
  return nullptr != entities[index];
}
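
One way to confirm this from the Python side, without rebuilding anything, might be to temporarily wrap the QoS event handler's is_ready() and log the index when the exception fires. A hedged, debug-only sketch, assuming the Humble class name QoSEventHandler in rclpy.qos_event:

# Debug-only monkeypatch: log the offending event index before re-raising.
import rclpy.qos_event as qos_event

_original_is_ready = qos_event.QoSEventHandler.is_ready


def _logged_is_ready(self, wait_set):
    try:
        return _original_is_ready(self, wait_set)
    except IndexError:
        print(f'QoS event wait-set index out of range: event_index={self._event_index}')
        raise


qos_event.QoSEventHandler.is_ready = _logged_is_ready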

@MichaelOrlov added the more-information-needed (Further information is required) label on Dec 5, 2024
@JasmineRosselliSUPSI

JasmineRosselliSUPSI commented Jan 3, 2025

I have a similar error and cannot understand its origin.

I use Galactic and Ubuntu 20.04.
I also use a thread that launches multiple nodes, including nav2. If I add one node too many, the error message appears (in my case from nav2), even though the robot still manages to reach the position I wanted.

I am not sure whether the error is related to the QoS event in my case, but it seems tied to the number of active nodes, the number of active subscribers/publishers, or the thread itself causing internal issues.

Has anyone managed to solve the issue?

Edit:
I also get this other error sometimes: Error: Failed to get number of ready entities for action client: wait set index for result client is out of bounds, at /tmp/binarydeb/ros-galactic-rcl-action-3.1.3/src/rcl_action/action_client.c:635

@Barry-Xu-2018
Contributor

@JasmineRosselliSUPSI

I have analyzed the issue, but there is currently not enough information to pinpoint the cause.
Since Galactic is EOL, could you test the problem on Humble to see if it still occurs? Then I may be able to get more information from you to continue the investigation.

@JasmineRosselliSUPSI

Unfortunately, for various reasons, we cannot change ROS 2 versions and have to stay on Galactic.
For the moment I have worked around the problem by launching the ‘problematic node’ from another Raspberry Pi, which is connected to the same network and visible from the main one.
With this setup, the error no longer appears.

Additional information on the nature of the error and the working context:

  • I work with a TurtleBot 4 equipped with a main Raspberry Pi, to which a second Raspberry Pi has been added. The latter, among other things, provides the Wi-Fi network that the system connects to.

  • The TurtleBot 4's Raspberry Pi, the computer of its iRobot Create 3 base, and the additional Raspberry Pi are on the same network and can access each other's topics, services, and actions.

  • The error appears on the TurtleBot 4's main Raspberry Pi.

  • I constantly monitored the CPU with the mpstat command (refreshing every 2 s) and did not detect activity intense enough to suggest a computation problem, even when the error appeared.

  • Several nodes are launched on the TurtleBot 4's Raspberry Pi, including a node that activates nav2. The nodes are not all launched at the same time, but they all activate correctly.
    The script that launches the nodes contains a thread that handles the interfacing between a non-ROS GUI and the TurtleBot 4; it acts as a kind of ‘Controller’.
    A new process is opened to activate the nav2 navigation node.

  • The problematic node collects all the important data during the robot's movements (subscribers on different topics), including its position. This node is launched independently and uses rclpy.spin_once(); we call it the ‘Collector’.
    A new process is also opened to activate the ‘Collector’ node.
    An additional script interfaces with it via a service to access the data in question. This class is instantiated inside the ‘Controller’ and uses rclpy.spin_until_future_complete(); we call it the ‘CollectorManager’ (a sketch of this pattern follows this list).

  • The error appears when the robot is moving (using nav2's goToPose()) and the “CollectorManager” asks the “Collector” for the robot's position. It is the TurtleBot 4's navigation node that reports the error. The robot nevertheless finishes its movement and arrives at its destination.
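
A sketch of the Collector / CollectorManager pattern described above (all names and message/service types are hypothetical stand-ins, not the actual code):

import rclpy
from rclpy.node import Node
from geometry_msgs.msg import PoseStamped
from std_srvs.srv import Trigger  # stand-in for the real "get position" service


class Collector(Node):
    """Launched in its own process and spun with rclpy.spin_once() in a loop."""

    def __init__(self):
        super().__init__('collector')
        self.last_pose = None
        self.create_subscription(PoseStamped, 'robot_pose', self._on_pose, 10)
        self.create_service(Trigger, 'get_position', self._on_request)

    def _on_pose(self, msg):
        self.last_pose = msg

    def _on_request(self, request, response):
        response.success = self.last_pose is not None
        response.message = str(self.last_pose)
        return response


class CollectorManager:
    """Instantiated inside the 'Controller'; queries the Collector via the service."""

    def __init__(self, node: Node):
        self.node = node
        self.client = node.create_client(Trigger, 'get_position')

    def get_position(self):
        future = self.client.call_async(Trigger.Request())
        rclpy.spin_until_future_complete(self.node, future, timeout_sec=5.0)
        return future.result()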

Summary:

  • There is one main thread
  • At different times, this thread activates two separate processes, one for navigation, the other for collecting data in real time
  • When the robot moves and a position request occurs, the error appears
  • The CPU never reaches critical levels
  • Without this additional node the system works perfectly
  • Moving the ‘Collector’ process to the secondary Raspberry Pi makes the error disappear

Hypothesis:

Is it possible that the problem stems from the use of threads and that the scheduler cannot handle all the tasks correctly?

@Barry-Xu-2018
Contributor

@JasmineRosselliSUPSI
Sorry for the late reply.
I understand your environment now.

Is it possible that the problem stems from the use of threads and that the scheduler cannot handle all the tasks correctly?

Which type of executor is used for the "Collector"? The SingleThreadedExecutor or the MultiThreadedExecutor?

I think this information still doesn't help me pinpoint the specific problematic code.
Can you compile the code in your environment and add some debug logs?

@JasmineRosselliSUPSI

@Barry-Xu-2018
No problem, I can keep working with the workaround for the moment. This problem is not exactly a priority right now.

I have to admit I'm not sure which type of executor I'm using. The "Collector" is a class of type threading.Thread, from Python's threading library. The additional processes, from which the ROS 2 nodes are launched, are generated using Python's subprocess library.

I'm using Spyder and VSCode. I compile the ROS 2 packages using colcon build with no problem.


Additional info:

My colleagues and I have noticed that launching ROS 2 nodes using a subprocess can cause some problems.

For info: the "Collector" is now launched on the secondary Raspberry Pi. A thread manages the main functions and launches the "Collector" node in a subprocess. The data is provided to other nodes via a service.

When the "Collector" has to provide a heavy message (in this case a SLAM-generated map, about 42x42 pixels), the node freezes at the second call: the service makes the request but receives no answer.

With small messages (e.g., asking for the robot's position), this problem does not arise.

This problem does NOT appear if the node is launched separately from another terminal (same Raspberry Pi).

No error message is displayed, but the "Collector" no longer responds.


At this point, could it be a problem related to subprocesses?
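
For comparison, here is the kind of alternative we could try (a minimal sketch with hypothetical names, not the code from this thread): spinning the Collector on its own SingleThreadedExecutor inside a background thread of the same process instead of a separate subprocess.

import threading

import rclpy
from rclpy.executors import SingleThreadedExecutor
from rclpy.node import Node


def run_collector():
    collector = Node('collector')  # stands in for the real data-collecting node
    executor = SingleThreadedExecutor()
    executor.add_node(collector)
    try:
        executor.spin()
    finally:
        executor.shutdown()
        collector.destroy_node()


rclpy.init()
collector_thread = threading.Thread(target=run_collector, daemon=True)
collector_thread.start()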
