Move Translation model from transformers library for inference #173

Open
dasgoutam opened this issue Jan 21, 2025 · 6 comments
Labels
component:chat (Chat Back End), enhancement (New feature or request)

Comments

@dasgoutam
Collaborator

Currently, the translation model is loaded into memory using the transformers library's pipeline function.

Would it be possible to use vllm to serve the model?
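
For context, roughly what the current approach looks like; the checkpoint and language codes below are placeholders, not necessarily what the deployment uses:

```python
from transformers import pipeline

# Loads the full model into memory in-process via the transformers pipeline.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",  # assumed checkpoint
    src_lang="eng_Latn",
    tgt_lang="deu_Latn",
)
print(translator("How are you today?")[0]["translation_text"])
```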

@dasgoutam added the component:chat (Chat Back End) label on Jan 21, 2025
@svenseeberg
Member

I had a look already but did not find a way to do this. AFAICT the APIs are different. This would probably be even more relevant with #174 as suggested in #170 (comment)

@svenseeberg
Member

vllm-project/vllm#1565

@dasgoutam
Collaborator Author

dasgoutam commented Jan 22, 2025

vllm-project/vllm#1565

Okay. So even though the project has started adding support for encoder-decoder models (which NLLB is), there seems to be no implementation yet for the architecture that NLLB-200 uses, i.e. M2M100ForConditionalGeneration.
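
A quick way to check which architecture class a checkpoint declares (the checkpoint name here is an assumption):

```python
from transformers import AutoConfig

# Prints the architecture class that vLLM would need to support,
# e.g. ['M2M100ForConditionalGeneration'] for NLLB-200 checkpoints.
config = AutoConfig.from_pretrained("facebook/nllb-200-distilled-600M")
print(config.architectures)
```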

I had a look already but did not find a way to do this. AFAICT the APIs are different. This would probably be even more relevant with #174 as suggested in #170 (comment)

After a bit of research, the only working options I could find for inference and/or serving of NLLB models were CTranslate2 (https://opennmt.net/CTranslate2/guides/transformers.html#nllb) and the Triton Inference Server (https://github.com/triton-inference-server/server).

There is a (now old) article about it here: https://blog.speechmatics.com/huggingface-translation-triton, which specifically addresses serving the NLLB model.
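
A sketch of roughly what the linked CTranslate2 guide describes; the checkpoint name and language pair are placeholders:

```python
# The checkpoint must first be converted, e.g.:
#   ct2-transformers-converter --model facebook/nllb-200-distilled-600M \
#       --output_dir nllb-200-distilled-600M-ct2
import ctranslate2
import transformers

src_lang, tgt_lang = "eng_Latn", "deu_Latn"  # assumed language pair

translator = ctranslate2.Translator("nllb-200-distilled-600M-ct2")
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang=src_lang
)

source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world!"))
results = translator.translate_batch([source], target_prefix=[[tgt_lang]])
target = results[0].hypotheses[0][1:]  # drop the target language token

print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
```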

@svenseeberg added this to the Backlog milestone on Jan 22, 2025
@svenseeberg added the enhancement (New feature or request) label on Jan 22, 2025
@svenseeberg
Member

svenseeberg commented Jan 22, 2025

We also need a GPU server powerful enough to load the model. For the full 53B version, we would need roughly 100 GB of GPU memory. Alternatively, we could investigate if there is a quantized version. With INT4 quantization, the model would shrink to a quarter of its size. Maybe this would work on CPU with sufficiently large RAM (32 GB) and file caching. We do have servers with powerful enough CPUs available (dual-socket AMD EPYC 9684X 96-core processors).
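
Rough arithmetic behind those numbers (weights only, ignoring activations and runtime overhead):

```python
# Back-of-the-envelope estimate of weight memory for the sizes discussed above.
params = 53e9                  # "full 53B version"
fp16_gb = params * 2 / 1e9     # 2 bytes per weight  -> ~106 GB
int4_gb = params * 0.5 / 1e9   # 0.5 bytes per weight -> ~26.5 GB, a quarter of fp16
print(f"fp16: ~{fp16_gb:.0f} GB, int4: ~{int4_gb:.0f} GB")
```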

@freinold

Alternatively, we could investigate if there is a quantized version. With INT4 quantization, the model would shrink to a quarter of its size.

https://huggingface.co/KnutJaegersberg/nllb-moe-54b-4bit

@freinold

I also found an alternative framework called LibreTranslate:
#50 (comment)
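
For reference, a minimal sketch of querying a self-hosted LibreTranslate instance over its HTTP API; the host/port and language codes are assumptions and this is untested here:

```python
import requests

# Assumed local LibreTranslate deployment on the default port.
resp = requests.post(
    "http://localhost:5000/translate",
    json={"q": "Hello world!", "source": "en", "target": "de", "format": "text"},
    timeout=30,
)
print(resp.json()["translatedText"])
```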
