CUDA error causes the micp_localization node to die #2

Open
Mh-Magdy opened this issue Oct 8, 2024 · 7 comments

@Mh-Magdy

Mh-Magdy commented Oct 8, 2024

First of all, thank you for this great package and the amazing work.
I'm running the micp node with the combining unit set to cpu and the backend set to optix, and everything works. After I changed the combining unit to gpu, I got the following error:

[image: screenshot of the error]

I changed it back to cpu and got the same error. I also changed the backend to embree and still got the same error :(

Could you please help me resolve this error?

@Mh-Magdy
Author

Mh-Magdy commented Oct 8, 2024

The complete configuration file for reference:

base_frame: base_link
map_frame: map
odom_frame: odom
tf_rate: 50
micp:
  combining_unit: cpu
  corr_rate_max: 500
  adaptive_max_dist: True
  viz_corr: True
  print_corr_rate: False
  disable_corr: False
  trans: [0.0, 0.0, 0.0]
  rot: [0.0, 0.0, 0.0] # euler angles (3) or quaternion (4)

sensors:
  velodyne:
    topic: mid/points
    type: spherical
    model:
      range_min: 0.5
      range_max: 90.0
      phi_min: -0.261799067259
      phi_inc: 0.03490658503988659
      phi_N: 16
      theta_min: -3.14159011841
      theta_inc: 0.01431249500496489
      theta_N: 440
    micp:
      max_dist: 2.0
      adaptive_max_dist_min: 0.15
      backend: optix

@Mh-Magdy
Author

I changed the CUDA toolkit version from 12.6 to 11.8 and the issue disappeared.

@amock
Member

amock commented Oct 12, 2024

Hi @Mh-Magdy,

Thanks for testing. That's odd, though: it should normally run with any CUDA version, so I would say it is still an issue. I will reopen it as a reminder for me to check this.

Could you give me some more information about the setup you used?

  • operating system
  • cuda version (nvcc --version)
  • gpu driver version (nvidia-smi)
  • OptiX version
  • ROS version
  • rmcl version or branch
  • rmagine version or branch
  • did it also fail when running the https://github.com/amock/rmcl_example with your GPU config file?

With this I think I could reproduce the error and hopefully fix it soon. (Or someone else)
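The command-line items in the list above can be gathered in one go. The following is only a minimal sketch: `nvcc --version` and `nvidia-smi` come from the list, while `lsb_release -ds` and `rosversion -d` are assumed to be available on an Ubuntu/ROS 1 system. The helper names `run` and `collect` are made up for illustration. The OptiX version and the rmcl/rmagine branches still have to be added by hand.

```python
import shutil
import subprocess

# Label -> command. Only nvcc and nvidia-smi are taken from the list above;
# lsb_release and rosversion are assumptions about the target system.
COMMANDS = {
    "operating system": ["lsb_release", "-ds"],
    "cuda version": ["nvcc", "--version"],
    "gpu driver version": ["nvidia-smi"],
    "ROS version": ["rosversion", "-d"],
}


def run(cmd):
    """Run one command and return its output, or a placeholder on failure."""
    if shutil.which(cmd[0]) is None:
        return "<{} not found>".format(cmd[0])
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError) as exc:
        return "<{} failed: {}>".format(cmd[0], exc)
    return (result.stdout or result.stderr).strip()


def collect(commands=COMMANDS):
    """Return a dict mapping each label to the output of its command."""
    return {label: run(cmd) for label, cmd in commands.items()}


if __name__ == "__main__":
    for label, output in collect().items():
        print("## {}\n{}\n".format(label, output))
```

Pasting the printed report into the issue covers the first, second, third and fifth bullet points.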

@amock amock reopened this Oct 12, 2024
@Mh-Magdy
Author

Hiii @amock 👋

  • OS: Ubuntu 20.04
  • CUDA: 12.6
  • GPU driver: nvidia-driver-560 - third-party non-free recommended
  • Optix: 7.7
  • ROS: ROS1 noetic
  • RMCL: noetic branch
  • rmagine: latest main branch on GitHub
  • Actually, it worked just fine with the example configuration for the CPU version and the embree backend, but with the GPU it gave me the error I mentioned above. When I changed the CUDA version and the GPU driver version, it worked (with the example config).

There are some other minor issues that I have run into since the update. I will report them to you in more detail, but here is a hint about them for now:

When I edit the sensor parameters, for example changing the number of horizontal samples to match my real sensor (theta_inc, theta_N), the package fails to run if the backend is optix, but works fine if I change it to embree. I will capture any issues like these and give you details on the issue and my environment/setup, as well as the config, to help you reproduce the errors.
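When changing theta_N, theta_inc has to be changed with it. The values in the config posted above are consistent with inc = (max - min) / (N - 1), i.e. both end points are sampled: 0.01431249500496489 × 439 ≈ 2π and 0.03490658503988659 × 15 = 30°. That relation is inferred from those numbers only, not from the rmagine documentation, and the function name below is made up for illustration. It also does not explain the OptiX failure; it only rules out an inconsistent sensor model as the cause.

```python
import math


def angular_increment(angle_min, angle_max, n_samples):
    """Step between neighbouring samples when both end points are included."""
    if n_samples < 2:
        raise ValueError("need at least two samples")
    return (angle_max - angle_min) / (n_samples - 1)


# Horizontal: one full turn with 440 samples, as in the posted config.
theta_inc = angular_increment(-math.pi, math.pi, 440)

# Vertical: -15 deg to +15 deg with 16 rings, as in the posted config.
phi_inc = angular_increment(math.radians(-15.0), math.radians(15.0), 16)

print("theta_inc:", theta_inc)
print("phi_inc:", phi_inc)
```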

Thank you Alexander

@amock
Member

amock commented Nov 5, 2024

Hi @Mh-Magdy,

I have finally found some time to look into your issue. First I tried to replicate your setup:

  • Ubuntu 20.04.6 LTS
  • ROS1 noetic
  • GPU: RTX 2060 Super, Driver 560.35.03, CUDA V12.6.77
  • OptiX 7.7
  • rmagine: (main branch) 2.2.7
  • rmcl: noetic branch
  • rmcl_msgs: noetic branch
  • rmcl_example: noetic branch

In the first terminal I started the example simulation by executing:

roslaunch rmcl_example start_robot.launch

Then I changed rmcl_example/launch/rmcl_micp.launch to load the parameters from rmcl_example/config/micp_gpu.yaml.
This config uses the GPU for both computing the correspondences and combining the covariances.
So I assumed this setup would cause the error you described.
Then I executed:

roslaunch rmcl_example rmcl_micp.launch

In the RViz window I set an initial pose guess and everything went fine. So unfortunately, I could not reproduce the error you described. Could you maybe try the exact same procedure on your system? Otherwise I am not sure what is wrong on your system :/ Perhaps you could also check if rmagine alone is working. There are some benchmark executables in it. Or perhaps you could check if CUDA is working for other projects.
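One quick way to check whether CUDA works at all, independent of rmagine and ROS, is to call the CUDA driver API directly. The sketch below loads the driver library through `ctypes` and calls `cuInit` and `cuDeviceGetCount`. The library name `libcuda.so.1` is the usual one on Linux but is an assumption about the system, and the function name is made up for illustration.

```python
import ctypes


def cuda_device_count(lib_name="libcuda.so.1"):
    """Return the number of CUDA devices, or None if the driver is unusable."""
    try:
        cuda = ctypes.CDLL(lib_name)
    except OSError:
        return None  # driver library not found or not loadable

    # Both calls return 0 (CUDA_SUCCESS) when they succeed.
    if cuda.cuInit(0) != 0:
        return None
    count = ctypes.c_int(0)
    if cuda.cuDeviceGetCount(ctypes.byref(count)) != 0:
        return None
    return count.value


if __name__ == "__main__":
    print("CUDA devices:", cuda_device_count())
```

A result of None means the driver library could not be loaded or initialized, which points to a driver or container problem rather than to rmcl.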

Best
Alex

@Mh-Magdy
Author

Hi Alexander @amock ,

I hope you are doing well. I’d like to share some observations from a recent experiment related to this issue.

In the initial experiments that led to opening this issue, I ran the entire software suite (rmcl, mesh_nav, and Gazebo) inside a Docker container that used the host machine’s GPU driver. However, due to my limited experience with Docker, I occasionally misconfigured environment variables and mismanaged Docker layers. As a result, the container did not properly pick up the NVIDIA GPU driver libraries, in particular libnvoptix.so, which rmagine requires in addition to the downloaded OptiX headers to generate the rmagine-optix executable.

Although the rmagine-optix executable appeared to build successfully without errors, it actually caused multiple runtime issues, including the errors discussed above.
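Since the build succeeds even when the driver libraries are missing, it can help to test inside the container whether they can be loaded before starting the node. This is a minimal sketch: the file names `libcuda.so.1` and `libnvoptix.so.1` are the usual ones for a Linux driver installation but are an assumption about the container, and both function names are made up for illustration.

```python
import ctypes


def can_load(lib_name):
    """Return True if the dynamic loader finds and loads the library."""
    try:
        ctypes.CDLL(lib_name)
    except OSError:
        return False
    return True


def check_driver_libs(names=("libcuda.so.1", "libnvoptix.so.1")):
    """Map each library name to whether it can be loaded in this environment."""
    return {name: can_load(name) for name in names}


if __name__ == "__main__":
    for name, ok in check_driver_libs().items():
        print("{}: {}".format(name, "ok" if ok else "MISSING"))
```

If libnvoptix.so.1 is reported as missing inside the container but is present on the host, the container is not being given the driver libraries.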

I hope these observations help clarify the challenges and contribute to finding a solution. Thank you for your collaboration and support.

Best regards,
Muhammad

@amock
Member

amock commented Jan 24, 2025

Hi Muhammad @Mh-Magdy,

Over the last few months, we have been working on putting everything into Docker images as well, partly because our situation forced us to do so. As a nice side effect, we are planning to upload preconfigured Dockerfiles and provide instructions on how to use them. They will come with the next update, which brings features like object pose tracking and convenience tools for scan filtering. Preview: https://www.youtube.com/watch?v=9i3B1ayvMn4 .

Thanks for sharing your insights; they might help us with our Docker setup! And yes, this libnvoptix thing is quite important. (If anyone from NVIDIA is reading this, please consider integrating this library into JetPack.)

Best
Alex
