examples/test.py fails to run even on two 16 GB T4 cards #120

Open
zhangtaibo opened this issue Sep 27, 2024 · 1 comment
@zhangtaibo

Hardware environment:
[root@iZ6we55nj5ujtoxm12k2wwZ ~]# nvidia-smi
Fri Sep 27 15:14:37 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06 Driver Version: 555.42.06 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 On | 00000000:00:07.0 Off | 0 |
| N/A 34C P8 10W / 70W | 1MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla T4 On | 00000000:00:08.0 Off | 0 |
| N/A 31C P8 9W / 70W | 1MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

Following the README, I used the image below:
registry.cn-hangzhou.aliyuncs.com/havenask/rtp_llm:0.1.13_cuda12

Even the 0.5B Qwen model OOMs:

root@iZ6we55nj5ujtoxm12k2wwZ:/home/root/rtp-llm# export TOKENIZER_PATH=/home/root/rtp-llm/Qwen2-0.5B-Instruct/
root@iZ6we55nj5ujtoxm12k2wwZ:/home/root/rtp-llm# python3 example/test.py
Fetching 10 files: 100%|███████████████████████████████████████████████| 10/10 [00:00<00:00, 98227.26it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
load final_layernorm.gamma to torch.Size([896])
load final_layernorm.beta to torch.Size([896])
+------------------------------------------+
| MODEL CONFIG |
+-----------------------+------------------+
| Options | Values |
+-----------------------+------------------+
| model_type | QWenV2 |
| act_type | WEIGHT_TYPE.FP16 |
| weight_type | WEIGHT_TYPE.FP16 |
| max_seq_len | 8192 |
| use_sparse_head | False |
| use_multi_task_prompt | None |
| use_medusa | False |
| lora_infos | {} |
+-----------------------+------------------+
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][WARNING][RANK 0][139894265784128][24-09-27 07:13:33] TRT FMHA is disabled for sm 75
[FT][WARNING][RANK 0][139894265784128][24-09-27 07:13:33] opensource FMHA is disabled for sm 75
[FT][WARNING][RANK 0][139894265784128][24-09-27 07:13:33] use fmha: 0
Traceback (most recent call last):
File "/home/root/rtp-llm/example/test.py", line 43, in
asyncio.run(main())
File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/home/root/rtp-llm/example/test.py", line 19, in main
model = ModelFactory.from_huggingface(model_config.ckpt_path, model_config=model_config)
File "/usr/local/lib/python3.10/dist-packages/maga_transformer/model_factory.py", line 88, in from_huggingface
return ModelFactory.from_model_config(new_model_config)
File "/usr/local/lib/python3.10/dist-packages/maga_transformer/model_factory.py", line 79, in from_model_config
model = AsyncModel(model, sp_model)
File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/async_model.py", line 32, in init
self.decoder_engine_ = create_engine(model, self.config)
File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/engine_creator.py", line 48, in create_engine
return _create_normal_engine(model, config)
File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/engine_creator.py", line 64, in _create_normal_engine
cache_manager = CacheManager(cache_config, nccl_op)
File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/cache_manager.py", line 43, in init
self.__init_kv_cache(config)
File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/cache_manager.py", line 69, in __init_kv_cache
self.v_blocks = torch.zeros((config.layer_num, self.block_nums, config.local_head_num_kv, config.seq_size_per_block, config.size_per_head), dtype=config.dtype, device='cuda:0')
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.73 GiB. GPU 0 has a total capacty of 14.57 GiB of which 3.73 GiB is free. Process 20262 has 10.83 GiB memory in use. Of the allocated memory 10.64 GiB is allocated by PyTorch, and 54.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
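
For context, the failing allocation is one of the two KV-cache tensors, shaped (layer_num, block_nums, local_head_num_kv, seq_size_per_block, size_per_head) as shown in the traceback. Below is a minimal back-of-the-envelope sketch of that size, assuming Qwen2-0.5B-Instruct dimensions (24 layers, 2 KV heads, head dim 64) and a hypothetical block geometry; block_nums is actually derived by rtp-llm's CacheManager from free GPU memory, so the concrete numbers are illustrative only.

# Rough size estimate for one of the k_blocks / v_blocks tensors allocated in
# cache_manager.py above. The shape comes from the traceback; the concrete
# values (24 layers, 2 KV heads, head dim 64 for Qwen2-0.5B-Instruct, plus the
# seq_size_per_block / block_nums choice) are assumptions for illustration.

def kv_cache_tensor_gib(layer_num, block_nums, local_head_num_kv,
                        seq_size_per_block, size_per_head, dtype_bytes=2):
    """GiB of one (layer_num, block_nums, head_kv, block_len, head_dim) tensor in fp16."""
    elems = (layer_num * block_nums * local_head_num_kv *
             seq_size_per_block * size_per_head)
    return elems * dtype_bytes / 2**30

# Hypothetical example: fp16, 8 tokens per block, 80k blocks.
one_tensor = kv_cache_tensor_gib(layer_num=24, block_nums=80_000,
                                 local_head_num_kv=2, seq_size_per_block=8,
                                 size_per_head=64)
print(f"k_blocks or v_blocks: ~{one_tensor:.2f} GiB each")  # ~3.66 GiB here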

@dongjiyingdjy
Collaborator

The OOM happens because the previous FMHA did not support T4, and with max_seq_len=8K the non-fused path runs out of memory. The current code now supports non-paged FMHA on T4, so please update to the latest code and try again. Also, multi-GPU requires enabling the TP environment variables; see https://github.com/alibaba/rtp-llm/blob/main/docs/MultiGPU.md
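
A quick estimate of why the non-fused path at max_seq_len=8192 is so memory hungry: without FMHA, plain attention materializes a (head_num, seq_len, seq_len) score matrix per sequence. A minimal sketch, assuming Qwen2-0.5B-Instruct's 14 attention heads (the 8192 comes from the MODEL CONFIG table in the log above):

def attention_score_gib(head_num, seq_len, dtype_bytes=2, batch=1):
    """GiB of a naive (batch, head_num, seq_len, seq_len) fp16 attention-score buffer."""
    return batch * head_num * seq_len * seq_len * dtype_bytes / 2**30

# head_num=14 is assumed from the Qwen2-0.5B-Instruct config; seq_len=8192 is
# the max_seq_len reported above. Roughly 1.75 GiB per sequence before softmax.
print(f"{attention_score_gib(head_num=14, seq_len=8192):.2f} GiB")

Lowering max_seq_len or updating to the FMHA-enabled code avoids buffers at this scale; the exact environment variables for tensor parallelism across the two T4s are documented in the linked MultiGPU.md rather than guessed here.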
