Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using sdpa and flash_attention_2 error #168

Open
aixingxy opened this issue Nov 22, 2024 · 5 comments
Open

Using sdpa and flash_attention_2 error #168

aixingxy opened this issue Nov 22, 2024 · 5 comments

Comments

@aixingxy
Copy link

Hello, thanks for this great job! I followed the instructions INFERENCE , but encountered some difficulties.

from parler_tts import ParlerTTSForConditionalGeneration
import torch
from transformers import AutoTokenizer
import soundfile as sf


torch_device = "cuda:0" # use "mps" for Mac
torch_dtype = torch.float32
model_name = "parler-tts/parler-tts-mini-v1"

attn_implementation = "sdpa" # "sdpa" or "flash_attention_2"

model = ParlerTTSForConditionalGeneration.from_pretrained(model_name, torch_dtype=torch_dtype, attn_implementation=attn_implementation).to(torch_device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Hey, how are you doing today?"
description = "A female speaker delivers a slightly expressive and animated speech with a moderate speed and pitch. The recording is of very high quality, with the speaker's voice sounding clear and very close up."

input_ids = tokenizer(description, return_tensors="pt").input_ids.to(torch_device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(torch_device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
audio_arr = generation.cpu().to(torch.float32).numpy().squeeze()

sf.write("parler_tts_out.wav", audio_arr, model.config.sampling_rate)

when I set attn_implementation="sdpa",get an error

ValueError: T5EncoderModel does not support an attention implementation through torch.nn.functional.scaled_dot_product_attention yet. Please request the support for this architecture: https://github.com/huggingface/transformers/issues/28005. If you believe this error is a bug, please open an issue in Transformers GitHub repository and load your model with the argument `attn_implementation="eager"` meanwhile. Example: `model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="eager")`

and set attn_implementation="flash_attention_2",get an error

ValueError: T5EncoderModel does not support Flash Attention 2.0 yet. Please request to add support where the model is hosted, on its model hub page: https://huggingface.co/google/flan-t5-large/discussions/new or in the Transformers GitHub repo: https://github.com/huggingface/transformers/issues/new

I use A100 GPU, my environment is:

transformers                4.46.1
torch                       2.3.0
flash-attn                  2.5.8

Am I missing some important configuration information?

@remichu-ai
Copy link

i am encountering this error too. Appreciate if there can be any help

@aixingxy
Copy link
Author

aixingxy commented Nov 29, 2024 via email

@jack-richards
Copy link

I am getting the same issue, I also followed the tutorial exactly.

@lukaLLM
Copy link

lukaLLM commented Dec 11, 2024

@aixingxy I am getting the same issue It was working before on an old installation that I have on Conda it seems there was some update that made it happen as I installed a new one and got it this week. I advise you to use the default one as when I tested all of them on 3090 and L40s I didnt see much difference in speed.

@bzikst
Copy link

bzikst commented Jan 12, 2025

Bumping transformers version to 4.48.0 solved the problem

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants