High latency / Lost messages: Pub/Sub 10B at high pub frequency. #198
Comments
So just to be clear:
|
How are you measuring the latency? If you are taking one-way measurements, then for small data packets the behaviour you are seeing is probably due to batching (at the Zenoh and TCP/IP levels). If you want to skip batching, please use the ultra-low latency transport (see https://zenoh.io/blog/2023-10-03-zenoh-dragonite/#support-for-ultra-low-latency). Don't hesitate to reach out to us on Zenoh's Discord server if you have any questions on configuration. We'll try to reproduce on our side to see if we get similar behaviour. On your side, it would be good to see if you get similar behaviour running the latency test shipped with Zenoh (see https://github.com/eclipse-zenoh/zenoh/tree/main/examples/examples) -- kydos |
Hi, thanks for the suggestions.
The iRobot ros2-performance benchmark shows in console and generates files with info about resources, latency, publish duration, CPU, memory, etc. We write the timestamp to the message, so when the subscription receives the message, it can obtain the latency. |
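The timestamp-in-message approach described above can be sketched as follows. This is only a minimal illustration in Python, not the ros2-performance implementation (which is C++ and whose message layout is not shown in this thread); the names `stamp`, `latency_us` and the 8-byte header layout are hypothetical. It assumes the publisher and the subscription read the same wall clock, which holds for two processes on one machine.

```python
import struct
import time
from typing import Optional

# Hypothetical layout: an 8-byte little-endian nanosecond timestamp, then padding.
HEADER = struct.Struct("<q")


def stamp(payload_size: int, now_ns: Optional[int] = None) -> bytes:
    """Build a message of at least payload_size bytes that starts with the send time."""
    sent = time.time_ns() if now_ns is None else now_ns
    padding = max(0, payload_size - HEADER.size)
    return HEADER.pack(sent) + bytes(padding)


def latency_us(message: bytes, now_ns: Optional[int] = None) -> float:
    """One-way latency in microseconds: receive time minus the embedded send time."""
    received = time.time_ns() if now_ns is None else now_ns
    (sent,) = HEADER.unpack_from(message)
    return (received - sent) / 1_000


if __name__ == "__main__":
    msg = stamp(10)          # 10B payload, as in the scenario reported here
    print(len(msg), latency_us(msg))
```

Because this is a one-way measurement, any batching delay on the transport shows up directly in the reported latency.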
You can find information in our README here on how to configure the session/router with non-default config files. |
Hello @mauropasse, the |
Hello @mauropasse and @mjcarroll, |
I have tried using the configuration file to enable low latency mode, and it seems to have fixed the issue for the 10B message. I have used the DEFAULT_RMW_ZENOH_SESSION_CONFIG.json5 but setting But for some reason, it doesn't work for bigger messages (100KB, 1MB, etc.): the subscription doesn't get any messages, as if the nodes were not being discovered. Besides this problem, the publisher process seems to hang indefinitely. I still see the router messages from both processes. I've experienced this behavior both on x86 and on the RPi4. @imstevenpmwork the test topology is just a subscription to a 100KB message in one process, and a publisher at 250Hz in the other process. Let me know if you need more details. |
Hello @mauropasse , |
Hi @imstevenpmwork, yes, I was just testing and found that 64K is the limit at which things stop working.
|
Hi @mauropasse, I believe the configuration name is a bit confusing and may need to be changed in the future. Essentially, it offers an option to slightly optimize for latency, provided you're dealing with small messages only. If this option is disabled, as it should be when handling large messages, the optimization won't occur. However, this doesn't mean that Zenoh (and Zenoh RMW) in its default configuration is slow for small messages. Since Zenoh RMW is still under development, we're constantly identifying and fixing issues. That's why reports like yours are incredibly valuable to us 😄 I'll test the topology configuration on our end and let you know if we get similar results. If we do, we'll investigate further to identify the cause and develop a fix to ensure Zenoh RMW is as performant as Zenoh itself regardless of the message size used. |
/assign @clalancette |
Hi! I've been running some ROS2 benchmarks on rmw_zenoh (and other rmw's) using the iRobot ros2-performance framework, this time on a Raspberry Pi 4B and ROS2 Iron.
The system tested consists of 2 processes, one for the publisher, one for the subscription.
I noticed a spike in latency for the pub/sub - 10B - 2KHz scenario.
I've also been able to reproduce these results using lower pub frequencies, and the issue is also noticeable on x86.
From x86 benchmark output logs I got:
On RPi4B, inspecting the events.txt, I see every message being late, as if it has lost sync and a message is only read when a new one arrives:
So it looks like something starts failing when the pub frequency gets high enough.
Something strange is that I also ran a single-process test, with rclcpp intra-process disabled (pub/sub in same process), and the issue goes away, so it seems the issue is with multi-process applications.