NVMLError_Unknown: Unknown Error #761

Closed
mcleantom opened this issue Oct 26, 2021 · 11 comments

mcleantom commented Oct 26, 2021

Issue description

I installed RAPIDS on WSL2 using the installation guide here. cuDF works fine, but when I try to create a client using dask-cuda, I get the error NVMLError_Unknown: Unknown Error.

Steps to reproduce the issue

  1. Installed NVIDIA CUDA-WSL driver
  2. Installed WSL2 with Ubuntu 18.04
  3. Installed Miniconda on Ubuntu
  4. Installed RAPIDS and cudatoolkit 11.2 with the command
conda create -n rapids-21.10 -c rapidsai -c nvidia -c conda-forge \
    rapids-blazing=21.10 python=3.8 cudatoolkit=11.2
  5. Opened a Jupyter notebook and ran the following code:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)

What's the expected result?

Start a CUDA cluster client

What's the actual result?

Error: NVMLError_Unknown: Unknown Error

Additional details / screenshot

My GPU is an NVIDIA GeForce RTX 2060.

Full traceback:

Unable to start CUDA Context
Traceback (most recent call last):
  File "/home/mclea/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/dask_cuda/initialize.py", line 42, in _create_cuda_context
    ctx = has_cuda_context()
  File "/home/mclea/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/diagnostics/nvml.py", line 76, in has_cuda_context
    handle = pynvml.nvmlDeviceGetHandleByIndex(index)
  File "/home/mclea/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/pynvml/nvml.py", line 1576, in nvmlDeviceGetHandleByIndex
    _nvmlCheckReturn(ret)
  File "/home/mclea/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/pynvml/nvml.py", line 743, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.nvml.NVMLError_Unknown: Unknown Error
/home/mclea/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/node.py:160: UserWarning: Port 8787 is already in use.
Perhaps you already have a cluster running?
Hosting the HTTP server on port 42977 instead
  warnings.warn(
---------------------------------------------------------------------------
NVMLError_Unknown                         Traceback (most recent call last)
/tmp/ipykernel_17853/3542590344.py in <module>
      2 from dask.distributed import Client
      3 
----> 4 cluster = LocalCUDACluster()
      5 client = Client(cluster)

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/dask_cuda/local_cuda_cluster.py in __init__(self, CUDA_VISIBLE_DEVICES, n_workers, threads_per_worker, memory_limit, device_memory_limit, data, local_directory, shared_filesystem, protocol, enable_tcp_over_ucx, enable_infiniband, enable_nvlink, enable_rdmacm, ucx_net_devices, rmm_pool_size, rmm_managed_memory, rmm_async, rmm_log_directory, jit_unspill, log_spilling, worker_class, **kwargs)
    344             )
    345 
--> 346         super().__init__(
    347             n_workers=0,
    348             threads_per_worker=threads_per_worker,

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/deploy/local.py in __init__(self, name, n_workers, threads_per_worker, processes, loop, start, host, ip, scheduler_port, silence_logs, dashboard_address, worker_dashboard_address, diagnostics_port, services, worker_services, service_kwargs, asynchronous, security, protocol, blocked_handlers, interface, worker_class, scheduler_kwargs, scheduler_sync_interval, **worker_kwargs)
    234         workers = {i: worker for i in range(n_workers)}
    235 
--> 236         super().__init__(
    237             name=name,
    238             scheduler=scheduler,

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/deploy/spec.py in __init__(self, workers, scheduler, worker, asynchronous, loop, security, silence_logs, name, shutdown_on_close, scheduler_sync_interval)
    281         if not self.asynchronous:
    282             self._loop_runner.start()
--> 283             self.sync(self._start)
    284             self.sync(self._correct_state)
    285 

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/deploy/cluster.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    212             return future
    213         else:
--> 214             return sync(self.loop, func, *args, **kwargs)
    215 
    216     def _log(self, log):

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    324     if error[0]:
    325         typ, exc, tb = error[0]
--> 326         raise exc.with_traceback(tb)
    327     else:
    328         return result[0]

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/utils.py in f()
    307             if callback_timeout is not None:
    308                 future = asyncio.wait_for(future, callback_timeout)
--> 309             result[0] = yield future
    310         except Exception:
    311             error[0] = sys.exc_info()

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/tornado/gen.py in run(self)
    760 
    761                     try:
--> 762                         value = future.result()
    763                     except Exception:
    764                         exc_info = sys.exc_info()

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/deploy/spec.py in _start(self)
    309             if isinstance(cls, str):
    310                 cls = import_term(cls)
--> 311             self.scheduler = cls(**self.scheduler_spec.get("options", {}))
    312             self.scheduler = await self.scheduler
    313         self.scheduler_comm = rpc(

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/scheduler.py in __init__(self, loop, delete_interval, synchronize_worker_interval, services, service_kwargs, allowed_failures, extensions, validate, scheduler_file, security, worker_ttl, idle_timeout, interface, host, port, protocol, dashboard_address, dashboard, http_prefix, preload, preload_argv, plugins, **kwargs)
   3797         connection_limit = get_fileno_limit() / 2
   3798 
-> 3799         super().__init__(
   3800             aliases=aliases,
   3801             handlers=self.handlers,

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/scheduler.py in __init__(self, aliases, clients, workers, host_info, resources, tasks, unrunnable, validate, **kwargs)
   1978         self._transition_counter = 0
   1979 
-> 1980         super().__init__(**kwargs)
   1981 
   1982     @property

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/core.py in __init__(self, handlers, blocked_handlers, stream_handlers, connection_limit, deserialize, serializers, deserializers, connection_args, timeout, io_loop)
    158         self._comms = {}
    159         self.deserialize = deserialize
--> 160         self.monitor = SystemMonitor()
    161         self.counters = None
    162         self.digests = None

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/system_monitor.py in __init__(self, n)
     57 
     58         if nvml.device_get_count() > 0:
---> 59             gpu_extra = nvml.one_time()
     60             self.gpu_name = gpu_extra["name"]
     61             self.gpu_memory_total = gpu_extra["memory-total"]

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/diagnostics/nvml.py in one_time()
     91 
     92 def one_time():
---> 93     h = _pynvml_handles()
     94     return {
     95         "memory-total": pynvml.nvmlDeviceGetMemoryInfo(h).total,

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/distributed/diagnostics/nvml.py in _pynvml_handles()
     61         cuda_visible_devices = list(range(count))
     62     gpu_idx = cuda_visible_devices[0]
---> 63     return pynvml.nvmlDeviceGetHandleByIndex(gpu_idx)
     64 
     65 

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/pynvml/nvml.py in nvmlDeviceGetHandleByIndex(index)
   1574     fn = _nvmlGetFunctionPointer("nvmlDeviceGetHandleByIndex_v2")
   1575     ret = fn(c_index, byref(device))
-> 1576     _nvmlCheckReturn(ret)
   1577     return device
   1578 

~/anaconda3/envs/rapids-21.10/lib/python3.8/site-packages/pynvml/nvml.py in _nvmlCheckReturn(ret)
    741 def _nvmlCheckReturn(ret):
    742     if (ret != NVML_SUCCESS):
--> 743         raise NVMLError(ret)
    744     return ret
    745 

NVMLError_Unknown: Unknown Error

One thing that I noticed was that, despite installing CUDA 11.2 from here, nvidia-smi reports CUDA 11.6:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.00       Driver Version: 510.06       CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   47C    P5    20W /  N/A |    164MiB /  6144MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
mcleantom (Author) commented:

I was doing some tests with pynvml and I came across this:

Running this code:

from pynvml import *
nvmlDeviceGetHandleByIndex(0)

raises the same error:

pynvml.nvml.NVMLError_Unknown: Unknown Error

Whereas if I call nvmlInit() beforehand, I do not get the error:

from pynvml import *
nvmlInit()
nvmlDeviceGetHandleByIndex(0)

returns

<pynvml.nvml.LP_struct_c_nvmlDevice_t object at 0x7fbb94640240>

Could this have something to do with this error?

quasiben (Member) commented:

I think this was resolved in dask/distributed#5343; would it be possible to update dask/distributed to the latest version?
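
For reference, one way to do that with pip (a sketch; the exact command and channels depend on your environment):

python -m pip install --upgrade dask distributed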


mcleantom commented Oct 26, 2021

I think this was resolved in dask/distributed#5343; would it be possible to update dask/distributed to the latest version?

I tried updating dask and distributed to the most recent version (2020.10.0) using pip and ran LocalCUDACluster() again, but got the same error.

quasiben (Member) commented:

Hmm, I wonder if this is because the driver reports CUDA Version: 11.6 while, as you noted, you installed 11.2. You could try downgrading the CUDA driver on the machine. Also note that RAPIDS supports CUDA 11.4 as well.

pentschev (Member) commented:

Just to clarify, the CUDA version listed in nvidia-smi output is the maximum version supported by the driver, but you should be able to use an older CUDA toolkit, so CUDA toolkit 11.2 should be compatible with a driver that supports up to CUDA 11.6.

As per #761 (comment), nvmlInit() is called in init_once(), and that's called by device_get_count() which will return the number of devices available or 0 if an error occurred when importing pynvml or calling pynvml.nvmlInit().
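
For illustration, a rough sketch of the fallback described above (the actual implementation lives in distributed/diagnostics/nvml.py; this is not the real code):

def device_get_count():
    # Return the number of GPUs, or 0 if pynvml can't be imported or initialized.
    try:
        import pynvml
        pynvml.nvmlInit()
        return pynvml.nvmlDeviceGetCount()
    except Exception:
        return 0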

With all that being said, it isn't really clear to me why that error is happening, and we'll need more information. First, could you tell us the version of the WSL kernel? You should be able to find that by running uname -r.

pentschev (Member) commented:

It's also important to note that LocalCUDACluster is targeted at creating multi-GPU clusters, so it will likely not be of much use if you have a single GPU, unless your goal is just to test/prototype with Dask-CUDA before moving on to a multi-GPU setup.
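
As a rough sketch of that future multi-GPU case (the CUDA_VISIBLE_DEVICES keyword appears in the LocalCUDACluster signature in the traceback above; the device indices here are only an example):

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# Restrict the cluster to the first two GPUs; Dask-CUDA starts one worker per visible device.
cluster = LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1")
client = Client(cluster)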


mcleantom commented Oct 27, 2021

I was just going to use a single GPU for now, but I was hoping to try multi-GPU clusters in the future. I'm pretty new to Linux and GPU programming, so it's possible I've made a mistake somewhere, but I can't find the issue. This is what happens when I run uname -r:

(base) mclea@DESKTOP-FGFC2TM:/usr/local$ uname -r
5.10.60.1-microsoft-standard-WSL2

One thing I noticed is that if I run the command nvcc in Ubuntu, it says:

Command 'nvcc' not found, but can be installed with:

sudo apt install nvidia-cuda-toolkit

Should I be able to run this command from within WSL/Ubuntu?

If I run it from within PowerShell/cmd, it works fine:

PS C:\WINDOWS\system32> nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Nov_30_19:15:10_Pacific_Standard_Time_2020
Cuda compilation tools, release 11.2, V11.2.67
Build cuda_11.2.r11.2/compiler.29373293_0

I had a look at this Ask Ubuntu question about a similar error - https://askubuntu.com/questions/885610/nvcc-version-command-says-nvcc-is-not-installed - but couldn't find a cuda folder in /usr/local/.

However, as this Stack Overflow question says - https://stackoverflow.com/questions/61122950/where-does-anaconda-install-cudatoolkit-and-cudnn?noredirect=1&lq=1 - it looks like installing cudatoolkit from Anaconda doesn't give you the nvcc command.

I also tried running the Docker version, and it came up with the same error.

pentschev (Member) commented:

Thanks for confirming the kernel version, this should be fine according to CUDA's WSL2 system requirements.

NVCC is the CUDA compiler; you only need it if you're building RAPIDS from source, which isn't your case. When installing from conda you get pre-compiled binaries, for which NVCC isn't required.

Note that WSL2 is mostly experimental and not tested/officially supported by Dask-CUDA as of now, and it does seem we still have some work to do to get NVML diagnostics working properly. For now, you can disable it completely; everything from Dask should work, except for the GPU information in Distributed's diagnostics. To disable it, you can run your Python process with:

DASK_DISTRIBUTED__DIAGNOSTICS__NVML=False python

Alternatively, you can change the corresponding "distributed.diagnostics.nvml" configuration option; for that, refer to the Dask Configuration docs.
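
For example, a minimal sketch of doing the same thing in code, assuming the option is set before the cluster is created:

# Sketch: disable NVML diagnostics via the Dask configuration, then start the cluster as before.
import dask
dask.config.set({"distributed.diagnostics.nvml": False})

from dask_cuda import LocalCUDACluster
from dask.distributed import Client

cluster = LocalCUDACluster()
client = Client(cluster)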


mcleantom commented Oct 27, 2021

Thanks for the reply, that seems to have fixed my issue!

quasiben (Member) commented:

Thanks @pentschev for helping to resolve this. While we don't know the root cause, I'm going to close this for now while we continue working on better WSL2 support.

lmeyerov commented:

@quasiben @pentschev I drilled down a bit and traced it to this: rapidsai/cudf#9955
