Move Translation model from transformers library for inference #173
I had a look already but did not find a way to do this. AFAICT the APIs are different. This would probably be even more relevant with #174, as suggested in #170 (comment).
Okay. So even though the project has started supporting encoder-decoder models (which NLLB is a type of), there seems to be no implementation yet for the architecture that NLLB-200 uses, i.e. M2M100ForConditionalGeneration.
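For reference, loading an NLLB-200 checkpoint directly through transformers looks roughly like this (a minimal sketch using the distilled 600M checkpoint from the NLLB model card as a stand-in; the same code applies to the larger variants):

```python
# Minimal sketch: NLLB-200 inference with plain transformers.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# AutoModelForSeq2SeqLM resolves to M2M100ForConditionalGeneration for NLLB.
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn"
)
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

inputs = tokenizer("Hello, world!", return_tensors="pt")
# NLLB expects the target language code to be forced as the first generated token.
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("fra_Latn"),
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

So any serving framework would need to support this exact encoder-decoder generate path, not just decoder-only generation.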
After a bit of research, the only working ways I could find for inference and/or serving of NLLB models were CTranslate2 (https://opennmt.net/CTranslate2/guides/transformers.html#nllb) and the Triton Inference Server (https://github.com/triton-inference-server/server). There is a (now old) article about it here - https://blog.speechmatics.com/huggingface-translation-triton , specifically addressing serving of the NLLB model.
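For what it's worth, the CTranslate2 route from the guide linked above looks roughly like this (the output directory name is a placeholder):

```python
# Sketch adapted from the CTranslate2 transformers guide linked above.
import ctranslate2
import transformers

# One-time conversion (shell):
#   ct2-transformers-converter --model facebook/nllb-200-distilled-600M \
#       --output_dir nllb-ct2
translator = ctranslate2.Translator("nllb-ct2")
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn"
)

source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello, world!"))
results = translator.translate_batch([source], target_prefix=[["fra_Latn"]])
# The first hypothesis token is the target language code; drop it before decoding.
target = results[0].hypotheses[0][1:]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
```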
We also need a GPU server powerful enough to load the model. For the full 53B version, we would need roughly 100 GB of GPU memory (53B parameters at fp16 is about 106 GB). Alternatively, we could investigate whether there is a quantized version. With Int4 quantization, the model would shrink to roughly a quarter of that size. Maybe this would work on CPU with sufficiently large RAM (32 GB) and file caching. We do have servers with powerful enough CPUs available (dual-socket AMD EPYC 9684X 96-Core Processor).
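One way to try the quantized route is 4-bit loading through transformers + bitsandbytes. A hedged sketch (untested for the full NLLB-200 checkpoint; also note that bitsandbytes targets CUDA GPUs, so the CPU idea would need a different backend, e.g. CTranslate2's int8 mode):

```python
# Sketch: 4-bit quantized loading via transformers + bitsandbytes.
# Untested assumption for the full NLLB-200 checkpoint; the distilled
# 600M variant is shown here as a stand-in.
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/nllb-200-distilled-600M",
    quantization_config=quant_config,
    device_map="auto",  # requires accelerate; spreads layers across devices
)
```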
I also found an alternative framework called LibreTranslate:
Currently, the translation model is loaded into memory using the transformers library with the pipeline function.
Would it be possible to use vLLM to serve the model?
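For context, the current pipeline-based loading presumably looks something like this (the checkpoint name and language codes are illustrative, not taken from this repo):

```python
# Sketch of pipeline-based loading with transformers.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="fra_Latn",
)
print(translator("Hello, world!"))
```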