examples/test.py fails to run even on two 16 GB T4 cards #120

Open
zhangtaibo opened this issue Sep 27, 2024 · 1 comment
@zhangtaibo

Hardware environment:
[root@iZ6we55nj5ujtoxm12k2wwZ ~]# nvidia-smi
Fri Sep 27 15:14:37 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.06 Driver Version: 555.42.06 CUDA Version: 12.5 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla T4 On | 00000000:00:07.0 Off | 0 |
| N/A 34C P8 10W / 70W | 1MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla T4 On | 00000000:00:08.0 Off | 0 |
| N/A 31C P8 9W / 70W | 1MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

Following the README, I used the image below:
registry.cn-hangzhou.aliyuncs.com/havenask/rtp_llm:0.1.13_cuda12

Even the 0.5B Qwen model OOMs:

root@iZ6we55nj5ujtoxm12k2wwZ:/home/root/rtp-llm# export TOKENIZER_PATH=/home/root/rtp-llm/Qwen2-0.5B-Instruct/
root@iZ6we55nj5ujtoxm12k2wwZ:/home/root/rtp-llm# python3 example/test.py
Fetching 10 files: 100%|███████████████████████████████████████████████| 10/10 [00:00<00:00, 98227.26it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
load final_layernorm.gamma to torch.Size([896])
load final_layernorm.beta to torch.Size([896])
+------------------------------------------+
| MODEL CONFIG |
+-----------------------+------------------+
| Options | Values |
+-----------------------+------------------+
| model_type | QWenV2 |
| act_type | WEIGHT_TYPE.FP16 |
| weight_type | WEIGHT_TYPE.FP16 |
| max_seq_len | 8192 |
| use_sparse_head | False |
| use_multi_task_prompt | None |
| use_medusa | False |
| lora_infos | {} |
+-----------------------+------------------+
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][WARNING][RANK 0][139894265784128][24-09-27 07:13:33] TRT FMHA is disabled for sm 75
[FT][WARNING][RANK 0][139894265784128][24-09-27 07:13:33] opensource FMHA is disabled for sm 75
[FT][WARNING][RANK 0][139894265784128][24-09-27 07:13:33] use fmha: 0
Traceback (most recent call last):
File "/home/root/rtp-llm/example/test.py", line 43, in
asyncio.run(main())
File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/home/root/rtp-llm/example/test.py", line 19, in main
model = ModelFactory.from_huggingface(model_config.ckpt_path, model_config=model_config)
File "/usr/local/lib/python3.10/dist-packages/maga_transformer/model_factory.py", line 88, in from_huggingface
return ModelFactory.from_model_config(new_model_config)
File "/usr/local/lib/python3.10/dist-packages/maga_transformer/model_factory.py", line 79, in from_model_config
model = AsyncModel(model, sp_model)
File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/async_model.py", line 32, in init
self.decoder_engine_ = create_engine(model, self.config)
File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/engine_creator.py", line 48, in create_engine
return _create_normal_engine(model, config)
File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/engine_creator.py", line 64, in _create_normal_engine
cache_manager = CacheManager(cache_config, nccl_op)
File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/cache_manager.py", line 43, in init
self.__init_kv_cache(config)
File "/usr/local/lib/python3.10/dist-packages/maga_transformer/async_decoder_engine/cache_manager.py", line 69, in __init_kv_cache
self.v_blocks = torch.zeros((config.layer_num, self.block_nums, config.local_head_num_kv, config.seq_size_per_block, config.size_per_head), dtype=config.dtype, device='cuda:0')
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.73 GiB. GPU 0 has a total capacty of 14.57 GiB of which 3.73 GiB is free. Process 20262 has 10.83 GiB memory in use. Of the allocated memory 10.64 GiB is allocated by PyTorch, and 54.48 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
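
For context, the failing allocation is one of the two KV-cache tensors, shaped (layer_num, block_nums, local_head_num_kv, seq_size_per_block, size_per_head) as shown in the traceback. Below is a minimal back-of-the-envelope sketch of that size, assuming Qwen2-0.5B-Instruct dimensions (24 layers, 2 KV heads, head dim 64) and a hypothetical block geometry; block_nums is actually derived by rtp-llm's CacheManager from free GPU memory, so the concrete numbers are illustrative only.

# Rough size estimate for one of the k_blocks / v_blocks tensors allocated in
# cache_manager.py above. The shape comes from the traceback; the concrete
# values (24 layers, 2 KV heads, head dim 64 for Qwen2-0.5B-Instruct, plus the
# seq_size_per_block / block_nums choice) are assumptions for illustration.

def kv_cache_tensor_gib(layer_num, block_nums, local_head_num_kv,
                        seq_size_per_block, size_per_head, dtype_bytes=2):
    """GiB of one (layer_num, block_nums, head_kv, block_len, head_dim) tensor in fp16."""
    elems = (layer_num * block_nums * local_head_num_kv *
             seq_size_per_block * size_per_head)
    return elems * dtype_bytes / 2**30

# Hypothetical example: fp16, 8 tokens per block, 80k blocks.
one_tensor = kv_cache_tensor_gib(layer_num=24, block_nums=80_000,
                                 local_head_num_kv=2, seq_size_per_block=8,
                                 size_per_head=64)
print(f"k_blocks or v_blocks: ~{one_tensor:.2f} GiB each")  # ~3.66 GiB here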

@dongjiyingdjy
Collaborator

The OOM happens because the previous FMHA did not support T4, and with max_seq_len=8K the non-fused path runs out of memory. The current code now supports non-paged FMHA on T4, so please update to the latest code and try again. Also, multi-GPU requires enabling the TP environment variables; see https://github.com/alibaba/rtp-llm/blob/main/docs/MultiGPU.md
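
A quick estimate of why the non-fused path at max_seq_len=8192 is so memory hungry: without FMHA, plain attention materializes a (head_num, seq_len, seq_len) score matrix per sequence. A minimal sketch, assuming Qwen2-0.5B-Instruct's 14 attention heads (the 8192 comes from the MODEL CONFIG table in the log above):

def attention_score_gib(head_num, seq_len, dtype_bytes=2, batch=1):
    """GiB of a naive (batch, head_num, seq_len, seq_len) fp16 attention-score buffer."""
    return batch * head_num * seq_len * seq_len * dtype_bytes / 2**30

# head_num=14 is assumed from the Qwen2-0.5B-Instruct config; seq_len=8192 is
# the max_seq_len reported above. Roughly 1.75 GiB per sequence before softmax.
print(f"{attention_score_gib(head_num=14, seq_len=8192):.2f} GiB")

Lowering max_seq_len or updating to the FMHA-enabled code avoids buffers at this scale; the exact environment variables for tensor parallelism across the two T4s are documented in the linked MultiGPU.md rather than guessed here.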
