FileNotFoundError: can't find *_optim_states.pt files when running finetune with HumanVLM #5

Open
Ody-trek opened this issue Jan 12, 2025 · 0 comments


Problem description

While fine-tuning the HumanVLM model, I hit the following error:
FileNotFoundError: can't find *_optim_states.pt files in directory '/home/chou/.cache/huggingface/hub/models--OpenFace-CQUPT--Human_LLaVA'

From the error message it seems that optimizer-state files are expected but missing: the downloaded model directory does not contain any such files (e.g. *_optim_states.pt), so fine-tuning cannot start.

Below are the steps I followed and a detailed description of the problem.
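
For context on why pointing at a plain model directory triggers this: judging from the traceback further down, guess_load_checkpoint in xtuner/model/utils.py loads a regular .pth file directly, but treats a directory as a DeepSpeed ZeRO checkpoint and therefore expects *_optim_states.pt shards inside it. The snippet below is only my reading of that behaviour from the traceback, not the actual xtuner code:

import glob
import os.path as osp

import torch


def guess_load_checkpoint_sketch(pretrained_pth):
    """Rough paraphrase of the load-path decision implied by the traceback."""
    if osp.isfile(pretrained_pth):
        # A single training checkpoint such as
        # work_dirs/.../iter_54000.pth is loaded directly.
        return torch.load(pretrained_pth, map_location='cpu')

    # A directory is assumed to be a DeepSpeed ZeRO checkpoint and must
    # contain optimizer-state shards named *_optim_states.pt.
    optim_files = glob.glob(osp.join(pretrained_pth, '*_optim_states.pt'))
    if not optim_files:
        raise FileNotFoundError(
            f"can't find *_optim_states.pt files in directory '{pretrained_pth}'")
    # ... the shards would then be merged into a full state dict
    # (see xtuner/utils/zero_to_any_dtype.py in the traceback) ...

Since my pretrained_pth points at the HuggingFace cache directory rather than at a .pth file, loading falls into the directory branch and fails as shown in the log.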

Steps to reproduce
1. Clone the HumanVLM project and install the dependencies.
2. Prepare the pretrained model and data:

  • Model download path: OpenFace-CQUPT/Human_LLaVA

3. Run the fine-tuning command:
xtuner train HumanVLM/human_llama3_8b_instruct_siglip_so400m_large_p14_384_lora_e1_gpu8_finetune.py
4. The error above is raised.

Additional information:

  • Here are the changes I made in HumanVLM/HumanVLM/human_llama3_8b_instruct_siglip_so400m_large_p14_384_lora_e1_gpu8_finetune.py (a small path-check sketch follows after the log below):

#######################################################################
#                          PART 1  Settings                           #
#######################################################################
# Model
llm_name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
visual_encoder_name_or_path = 'google/siglip-so400m-patch14-384'
# Specify the pretrained pth
#pretrained_pth = './work_dirs/human_llama3_8b_instruct_siglip_so400m_large_p14_384_e1_gpu8_pretrain/iter_54000.pth'  # noqa: E501
pretrained_pth = '/home/chou/.cache/huggingface/hub/models--OpenFace-CQUPT--Human_LLaVA'

# Data
#data_root = '/home/ubuntu/public-Datasets/HumanSFT/'
data_root = '/home/chou/deep/'
data_path = data_root + 'processed_from_converted_data_for_finetuning'
#data_path = data_root + 'ft_hfformat_base_attr_keypoint_0616_clean'
# data_path = data_root + 'ft_json_base_attr_keypoint_0616'
#image_folder = data_root + 'data'
image_folder = data_root + 'pt_images/train2014'
prompt_template = PROMPT_TEMPLATE.llama3_chat
max_length = int(4096 - 728)
  • The full log is as follows:
01/12 22:37:29 - mmengine - WARNING - Failed to search registry with scope "mmengine" in the "builder" registry tree. As a workaround, the current "builder" registry in "xtuner" is used to build instance. This may cause unexpected failure when running the built modules. Please check whether "mmengine" is a correct scope, or whether the registry is initialized.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:07<00:00,  1.85s/it]
You are using an old version of the checkpointing format that is deprecated (We will also silently ignore `gradient_checkpointing_kwargs` in case you passed it).Please update to the new format on your modeling file. To use the new format, you need to completely remove the definition of the method `_set_gradient_checkpointing` in your model.
Processing zero checkpoint '/home/chou/.cache/huggingface/hub/models--OpenFace-CQUPT--Human_LLaVA'
Traceback (most recent call last):
  File "/home/chou/deep/HumanVLM/xtuner/xtuner/tools/train.py", line 364, in <module>
    main()
  File "/home/chou/deep/HumanVLM/xtuner/xtuner/tools/train.py", line 353, in main
    runner = Runner.from_cfg(cfg)
  File "/home/chou/miniconda3/envs/humancaption/lib/python3.8/site-packages/mmengine/runner/runner.py", line 462, in from_cfg
    runner = cls(
  File "/home/chou/miniconda3/envs/humancaption/lib/python3.8/site-packages/mmengine/runner/runner.py", line 429, in __init__
    self.model = self.build_model(model)
  File "/home/chou/miniconda3/envs/humancaption/lib/python3.8/site-packages/mmengine/runner/runner.py", line 836, in build_model
    model = MODELS.build(model)
  File "/home/chou/miniconda3/envs/humancaption/lib/python3.8/site-packages/mmengine/registry/registry.py", line 570, in build
    return self.build_func(cfg, *args, **kwargs, registry=self)
  File "/home/chou/miniconda3/envs/humancaption/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 232, in build_model_from_cfg
    return build_from_cfg(cfg, registry, default_args)
  File "/home/chou/miniconda3/envs/humancaption/lib/python3.8/site-packages/mmengine/registry/build_functions.py", line 121, in build_from_cfg
    obj = obj_cls(**args)  # type: ignore
  File "/home/chou/deep/HumanVLM/xtuner/xtuner/model/llava.py", line 109, in __init__
    pretrained_state_dict = guess_load_checkpoint(pretrained_pth)
  File "/home/chou/deep/HumanVLM/xtuner/xtuner/model/utils.py", line 313, in guess_load_checkpoint
    state_dict = get_state_dict_from_zero_checkpoint(
  File "/home/chou/deep/HumanVLM/xtuner/xtuner/utils/zero_to_any_dtype.py", line 617, in get_state_dict_from_zero_checkpoint
    return _get_state_dict_from_zero_checkpoint(ds_checkpoint_dir,
  File "/home/chou/deep/HumanVLM/xtuner/xtuner/utils/zero_to_any_dtype.py", line 228, in _get_state_dict_from_zero_checkpoint
    optim_files = get_optim_files(ds_checkpoint_dir)
  File "/home/chou/deep/HumanVLM/xtuner/xtuner/utils/zero_to_any_dtype.py", line 103, in get_optim_files
    return get_checkpoint_files(checkpoint_dir, '*_optim_states.pt')
  File "/home/chou/deep/HumanVLM/xtuner/xtuner/utils/zero_to_any_dtype.py", line 96, in get_checkpoint_files
    raise FileNotFoundError(
FileNotFoundError: can't find *_optim_states.pt files in directory '/home/chou/.cache/huggingface/hub/models--OpenFace-CQUPT--Human_LLaVA'
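
For reference, this is the kind of pre-flight check I now run on any candidate pretrained_pth before launching xtuner train (check_pretrained_pth is just a hypothetical helper written for this report, not part of xtuner):

import glob
import os.path as osp


def check_pretrained_pth(path):
    # Hypothetical helper: report whether a path looks like a plain .pth
    # checkpoint or a DeepSpeed ZeRO checkpoint directory.
    if osp.isfile(path) and path.endswith('.pth'):
        print(f"{path}: single .pth checkpoint")
    elif osp.isdir(path):
        shards = glob.glob(osp.join(path, '**', '*_optim_states.pt'),
                           recursive=True)
        if shards:
            print(f"{path}: DeepSpeed ZeRO checkpoint dir with "
                  f"{len(shards)} optimizer shard(s)")
        else:
            print(f"{path}: directory without *_optim_states.pt files; "
                  "xtuner raises the FileNotFoundError shown above")
    else:
        print(f"{path}: not found or not a recognized checkpoint")


check_pretrained_pth(
    '/home/chou/.cache/huggingface/hub/models--OpenFace-CQUPT--Human_LLaVA')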