Hardware environment:
[root@iZ6we55nj5ujtoxm12k2wwZ ~]# nvidia-smi
Fri Sep 27 15:14:37 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06 Driver Version: 555.42.06 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 On | 00000000:00:07.0 Off | 0 |
| N/A 34C P8 10W / 70W | 1MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla T4 On | 00000000:00:08.0 Off | 0 |
| N/A 31C P8 9W / 70W | 1MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
Following the README, I used this image:
registry.cn-hangzhou.aliyuncs.com/havenask/rtp_llm:0.1.13_cuda12
Running the 0.5B Qwen model hits an OOM:
root@iZ6we55nj5ujtoxm12k2wwZ:/home/root/rtp-llm# export TOKENIZER_PATH=/home/root/rtp-llm/Qwen2-0.5B-Instruct/
root@iZ6we55nj5ujtoxm12k2wwZ:/home/root/rtp-llm# python3 example/test.py
Fetching 10 files: 100%|███████████████████████████████████████████████| 10/10 [00:00<00:00, 98227.26it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
load final_layernorm.gamma to torch.Size([896])
load final_layernorm.beta to torch.Size([896])
+------------------------------------------+
| MODEL CONFIG |
+-----------------------+------------------+
| Options | Values |
+-----------------------+------------------+
| model_type | QWenV2 |
| act_type | WEIGHT_TYPE.FP16 |
| weight_type | WEIGHT_TYPE.FP16 |
| max_seq_len | 8192 |
| use_sparse_head | False |
| use_multi_task_prompt | None |
| use_medusa | False |
| lora_infos | {} |
+-----------------------+------------------+
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][WARNING][RANK 0][139894265784128][24-09-27 07:13:33] TRT FMHA is disabled for sm 75
[FT][WARNING][RANK 0][139894265784128][24-09-27 07:13:33] opensource FMHA is disabled for sm 75
[FT][WARNING][RANK 0][139894265784128][24-09-27 07:13:33] use fmha: 0
Traceback (most recent call last):
File "/home/root/rtp-llm/example/test.py", line 43, in
asyncio.run(main())
File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/home/root/rtp-llm/example/test.py", line 19, in main
model = ModelFactory.from_huggingface(model_config.ckpt_path, model_config=model_config)
File "/usr/local/lib/python3.10/dist-packages/maga_transformer/model_factory.py", line 88, in from_huggingface
return ModelFactory.from_model_config(new_model_config)
File "/usr/local/lib/python3.10/dist-packages/maga_transformer/model_factory.py", line 79, in from_model_config
model = AsyncModel(model, sp_model)
File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/async_model.py", line 32, in init
self.decoder_engine_ = create_engine(model, self.config)
File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/engine_creator.py", line 48, in create_engine
return _create_normal_engine(model, config)
File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/engine_creator.py", line 64, in _create_normal_engine
cache_manager = CacheManager(cache_config, nccl_op)
File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/cache_manager.py", line 43, in init
self.__init_kv_cache(config)
File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/cache_manager.py", line 69, in __init_kv_cache
self.v_blocks = torch.zeros((config.layer_num, self.block_nums, config.local_head_num_kv, config.seq_size_per_block, config.size_per_head), dtype=config.dtype, device='cuda:0')
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.73 GiB. GPU 0 has a total capacty of 14.57 GiB of which 3.73 GiB is free. Process 20262 has 10.83 GiB memory in use. Of the allocated memory 10.64 GiB is allocated by PyTorch, and 54.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
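For reference, the allocation that fails is the value-cache tensor from the traceback: torch.zeros((layer_num, block_nums, local_head_num_kv, seq_size_per_block, size_per_head), dtype=fp16). The back-of-the-envelope sketch below uses Qwen2-0.5B's published dimensions (24 layers, 2 KV heads, head size 64) plus an assumed seq_size_per_block of 8; the values rtp-llm actually computed on this machine may differ, so treat it as an estimate, not the engine's real configuration.

# Rough size estimate for the v_blocks tensor from the traceback.
# The tensor shape is taken from cache_manager.py; the concrete numbers
# below are assumptions (Qwen2-0.5B config, guessed seq_size_per_block),
# not values read out of rtp-llm.
layer_num = 24          # Qwen2-0.5B decoder layers
local_head_num_kv = 2   # KV heads (GQA)
size_per_head = 64      # head dimension
seq_size_per_block = 8  # assumed tokens per KV-cache block
dtype_bytes = 2         # fp16

bytes_per_block = (layer_num * local_head_num_kv *
                   seq_size_per_block * size_per_head * dtype_bytes)
target_bytes = 3.73 * 1024**3  # what the traceback tried to allocate
block_nums = target_bytes / bytes_per_block
print(f"{bytes_per_block} B per block -> ~{block_nums:,.0f} blocks "
      f"(~{block_nums * seq_size_per_block:,.0f} cached tokens) for 3.73 GiB")

That works out to roughly 650k tokens of V-cache (plus a matching K-cache), far more than a 0.5B model needs for its weights, which is consistent with block_nums being sized from free GPU memory rather than from the model itself, so the engine tries to claim most of the 14.57 GiB T4 regardless of model size. The error message's own suggestion (export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128) only mitigates fragmentation; shrinking the KV-cache reservation, for example by lowering max_seq_len from 8192 if the example's model config exposes it, is more likely to avoid the OOM.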