[Bug] Training fails with indexSelectLargeIndex: block: [604,0,0], thread: [47,0,0] Assertion srcIndex < srcSelectDimSize failed.
#208
Labels
bug
Something isn't working
Describe the bug
2024-04-19 06:06:37,071 INFO writer.py:60 in init_tb_writer -- Login tensorboard logs to: RUN/7b_internlm2_train/04-19-06.06.02/tensorboards
2024-04-19 06:06:37,761 ERROR train.py:307 in <module> -- Raise exception from c394df8c9997 with rank id: 0
Traceback (most recent call last):
  File "/data/InternEvo/train.py", line 305, in <module>
    main(args)
  File "/data/InternEvo/train.py", line 215, in main
    _, _, loss = trainer.execute_schedule(
  File "/data/InternEvo/internlm/core/trainer.py", line 213, in execute_schedule
    return self._schedule.forward_backward_step(self._engine, /data_iter, **kwargs)
  File "/data/InternEvo/internlm/utils/timeout.py", line 102, in wrapper
    result = func(*args, **kwargs)
  File "/data/InternEvo/internlm/core/scheduler/no_pipeline_scheduler.py", line 220, in forward_backward_step
    _output, _loss, _moe_loss = self._train_one_batch(
  File "/data/InternEvo/internlm/core/scheduler/no_pipeline_scheduler.py", line 125, in _train_one_batch
    output = self._call_engine(engine, /data)
  File "/data/InternEvo/internlm/core/scheduler/base_scheduler.py", line 86, in _call_engine
    return engine(**inputs)
  File "/data/InternEvo/internlm/core/engine.py", line 164, in __call__
    return self.model(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/InternEvo/internlm/core/naive_amp.py", line 155, in forward
    out = self.model(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/InternEvo/internlm/model/modeling_internlm2.py", line 934, in forward
    hidden_states = self.tok_embeddings(input_ids)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/InternEvo/internlm/model/modules/embedding.py", line 66, in forward
    output = F.embedding(input_, self.weight, self.padding_idx, *self.embed_args, **self.embed_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
terminate called after throwing an instance of 'std::runtime_error'
  what(): [Rank 2] NCCL watchdog thread terminated with exception: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f97371d5617 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f973719098d in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f9737286518 in /opt/conda/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7f973868a150 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7f973868df78 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x24b (0x7f97386a47bb in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7f97386a4ac8 in /opt/conda/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: + 0xd6bf0 (0x7f97a7af3bf0 in /usr/local/gcc-10.2.0/lib64/libstdc++.so.6)
frame #8: + 0x8609 (0x7f97d2bff609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7f97d29ca133 in /usr/lib/x86_64-linux-gnu/libc.so.6)
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [36,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [37,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [38,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [39,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [40,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [41,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [42,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [43,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [44,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [45,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [46,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [47,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [48,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [49,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [50,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1292: indexSelectLargeIndex: block: [604,0,0], thread: [51,0,0] Assertion `srcIndex < srcSelect
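The Python traceback ends in `F.embedding`, and the `srcIndex < srcSelectDimSize` assertion comes from the index-select kernel. One plausible cause (not confirmed by this log) is that a token id in `input_ids` falls outside the embedding table, for example when the tokenizer and the configured vocabulary size do not match. Below is a minimal sketch of a CPU-side check, assuming the token ids are available as a flat list of ints (for a tensor, something like `input_ids.flatten().tolist()`). The helper name `find_out_of_range_ids` and the sample values are hypothetical and are not part of InternEvo.

```python
from typing import Iterable, List, Tuple


def find_out_of_range_ids(token_ids: Iterable[int], vocab_size: int) -> List[Tuple[int, int]]:
    """Return (position, token_id) pairs that a table with vocab_size rows cannot index.

    Valid embedding indices lie in the half-open range [0, vocab_size).
    """
    return [
        (pos, tok)
        for pos, tok in enumerate(token_ids)
        if tok < 0 or tok >= vocab_size
    ]


# Hypothetical values for illustration only; they are not taken from the log.
sample_ids = [0, 3, 7, 8, -1]
bad = find_out_of_range_ids(sample_ids, vocab_size=8)
print(bad)  # [(3, 8), (4, -1)]
```

Running a check like this on a batch before it reaches the model would show whether the data, rather than the model code, triggers the assertion.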
Environment information
Software environment: official Ubuntu image
Hardware environment: A800
Other information
No response