The cudaHostRegister call fails on page-backed memory mapped to userspace with remap_pfn_range since version 550.54.14. #732

Open
1 of 2 tasks
lizhengui007 opened this issue Nov 11, 2024 · 0 comments
Labels
bug Something isn't working

lizhengui007 commented Nov 11, 2024

NVIDIA Open GPU Kernel Modules Version

550.54.14

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Rocky Linux release 8.10 (Green Obsidian)

Kernel Release

Linux virtaitech-hz-ws2 4.18.0-553.16.1.el8_10.x86_64 #1 SMP Thu Aug 8 17:47:08 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

GPU 0: NVIDIA GeForce GTX 1050 Ti (UUID: GPU-fd200267-5107-2f45-63f6-24848e1ac6c6)

Describe the bug

Starting with driver version 550.54.14, cudaHostRegister fails on page-backed memory that is mapped into userspace with remap_pfn_range.
Earlier versions, such as 550.40.07, register the same memory successfully. The error returned by cudaHostRegister is “invalid argument”. I don't know why this behavior changed in 550.54.14. Does the new code no longer support registering pages mapped with remap_pfn_range?

To Reproduce

Sample code to reproduce the problem is attached. Please extract the archive cudaHostRegister_demo.tar.gz.

The sample code is split into a kernel module (kernel directory) and a user program (user directory).
1. Compile and load the kernel module.
In the kernel directory, run make -C /lib/modules/$(uname -r)/build M=$(pwd) modules, then insmod my_misc_device.ko.
The kernel module provides an mmap interface that maps a kernel page into user space using remap_pfn_range.

2. Compile and run the user program.
In the user directory, run make to build cudaHostRegister_demo, then execute it. Running cudaHostRegister_demo returns an error.

The output of my test run is shown below:
[root@virtaitech-hz-ws2 user]# ./cudaHostRegister_demo
normal map memory mapped at address 0x7f023282e000
normal_map cudaHostRegister success
remap_pfn_range memory mapped at address 0x7f0232649000
cudaHostRegister remap_pfn_range memory error=1 name=invalid argument at ln: 57

cudaHostRegister_demo.tar.gz
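
The archive is not reproduced inline, so here is a minimal sketch of what the failing user-space step might look like. It is not the code from the archive. The device path /dev/my_misc_device is a guess based on the module name, the single-page mapping length is an assumption, and so is the cudaHostRegisterDefault flag, since the report does not say which flags the demo passes.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#include <cuda_runtime_api.h>

/* Hypothetical device node: the real name depends on how the
 * module registers its misc device. */
#define DEMO_DEVICE "/dev/my_misc_device"

int main(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE); /* assumption: map one page */

    int fd = open(DEMO_DEVICE, O_RDWR);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* The module's mmap handler is expected to back this range
     * with remap_pfn_range. */
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }
    printf("remap_pfn_range memory mapped at address %p\n", addr);

    /* Assumption: default flags; the original demo may use others. */
    cudaError_t err = cudaHostRegister(addr, len, cudaHostRegisterDefault);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaHostRegister error=%d name=%s\n",
                (int)err, cudaGetErrorString(err));
    } else {
        puts("cudaHostRegister success");
        cudaHostUnregister(addr);
    }

    munmap(addr, len);
    close(fd);
    return err == cudaSuccess ? 0 : 1;
}
```

Building it needs the CUDA toolkit headers and libcudart, for example gcc repro.c -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart, where the install prefix and the file name repro.c are also assumptions.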

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

I found a commit that might be causing this problem.
476bd34#diff-fcd34cfcd88326e819dedab0110ce72dea28e8575a87a3eb4106c915fbcb569e

In that commit, os_lookup_user_io_memory uses get_io_ptes instead of get_io_pages.

Comparing the code in that commit between the affected newer version of the open-source driver and the unaffected older version, the newer version appears only to add a physical page frame number (PFN) contiguity check. However, in my tests the registration fails even when I map a single PFN page. The reason is the code below:

New version code:
pApi->data.AllocOsDesc.descriptor = (NvP64)(NvUPtr)pPteArray;
pApi->data.AllocOsDesc.descriptorType = NVOS32_DESCRIPTOR_TYPE_OS_IO_MEMORY;

Old version code:
pApi->data.AllocOsDesc.descriptor = (NvP64)(NvUPtr)pPageArray;
pApi->data.AllocOsDesc.descriptorType = NVOS32_DESCRIPTOR_TYPE_OS_PAGE_ARRAY;

The descriptorType changed from NVOS32_DESCRIPTOR_TYPE_OS_PAGE_ARRAY to NVOS32_DESCRIPTOR_TYPE_OS_IO_MEMORY.
With the new descriptor type, the call to _nvos32FunctionAllocOsDesc fails.
The call trace is: RmCreateOsDescriptor -> Nv04VidHeapControlWithSecInfo -> RmDeprecatedVidHeapControl -> _nvos32FunctionAllocOsDesc

I have two questions:

  1. Why does the current version add a physical page frame number contiguity check for remap_pfn_range pages, when the previous version did not require one?
  2. Why does registering a single remap_pfn_range page also fail? Is registration of remap_pfn_range pages now limited to device memory, so that it no longer works for system RAM?
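
To make question 1 concrete, the sketch below shows what a PFN continuity (contiguity) check amounts to in general. It is an illustration only, not the driver's code: the function name pfns_are_contiguous and the plain uint64_t array are hypothetical stand-ins for whatever the driver uses internally.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true when every PFN is exactly one
 * greater than its predecessor. Arrays with zero or one entry are
 * contiguous by definition, because the loop body never runs. */
static bool pfns_are_contiguous(const uint64_t *pfns, size_t count)
{
    for (size_t i = 1; i < count; i++) {
        if (pfns[i] != pfns[i - 1] + 1)
            return false;
    }
    return true;
}
```

For example, {0x100, 0x101, 0x102} passes and {0x100, 0x101, 0x105} does not. A one-element array passes a check of this shape trivially, which is why the single-page failure in question 2 looks like it needs a different explanation.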