Move Translation model from transformers library for inference #173

Open
dasgoutam opened this issue Jan 21, 2025 · 6 comments
Labels
component:chat (Chat Back End), enhancement (New feature or request)

Comments

@dasgoutam
Collaborator

Currently, the translation model is loaded into memory using the transformers library's pipeline function.

Would it be possible to use vllm to serve the model?
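
For context, roughly what the current approach looks like; the checkpoint and language codes below are placeholders, not necessarily what the deployment uses:

```python
from transformers import pipeline

# Loads the full model into memory in-process via the transformers pipeline.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",  # assumed checkpoint
    src_lang="eng_Latn",
    tgt_lang="deu_Latn",
)
print(translator("How are you today?")[0]["translation_text"])
```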

@dasgoutam added the component:chat (Chat Back End) label on Jan 21, 2025
@svenseeberg
Member

I had a look already but did not find a way to do this. AFAICT the APIs are different. This would probably be even more relevant with #174 as suggested in #170 (comment)

@svenseeberg
Member

vllm-project/vllm#1565

@dasgoutam
Collaborator Author

dasgoutam commented Jan 22, 2025

vllm-project/vllm#1565

Okay. So even though the project has started adding support for encoder-decoder models (which NLLB is), there seems to be no implementation yet for the architecture that NLLB-200 uses, i.e. M2M100ForConditionalGeneration.
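
A quick way to check which architecture class a checkpoint declares (the checkpoint name here is an assumption):

```python
from transformers import AutoConfig

# Prints the architecture class that vLLM would need to support,
# e.g. ['M2M100ForConditionalGeneration'] for NLLB-200 checkpoints.
config = AutoConfig.from_pretrained("facebook/nllb-200-distilled-600M")
print(config.architectures)
```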

I had a look already but did not find a way to do this. AFAICT the APIs are different. This would probably be even more relevant with #174 as suggested in #170 (comment)

After a bit of research, the only working options I could find for inference and/or serving of NLLB models were CTranslate2 (https://opennmt.net/CTranslate2/guides/transformers.html#nllb) and the Triton Inference Server (https://github.com/triton-inference-server/server).

There is a (now old) article about it here: https://blog.speechmatics.com/huggingface-translation-triton, which specifically addresses serving the NLLB model.
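
A sketch of roughly what the linked CTranslate2 guide describes; the checkpoint name and language pair are placeholders:

```python
# The checkpoint must first be converted, e.g.:
#   ct2-transformers-converter --model facebook/nllb-200-distilled-600M \
#       --output_dir nllb-200-distilled-600M-ct2
import ctranslate2
import transformers

src_lang, tgt_lang = "eng_Latn", "deu_Latn"  # assumed language pair

translator = ctranslate2.Translator("nllb-200-distilled-600M-ct2")
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang=src_lang
)

source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Hello world!"))
results = translator.translate_batch([source], target_prefix=[[tgt_lang]])
target = results[0].hypotheses[0][1:]  # drop the target language token

print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target)))
```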

@svenseeberg added this to the Backlog milestone on Jan 22, 2025
@svenseeberg added the enhancement (New feature or request) label on Jan 22, 2025
@svenseeberg
Member

svenseeberg commented Jan 22, 2025

We also need a GPU server powerful enough to load the model. For the full 53B version, we would need roughly 100 GB of GPU memory. Alternatively, we could investigate if there is a quantized version. With INT4 quantization, the model would shrink to a quarter of its size. Maybe this would work on CPU with sufficiently large RAM (32 GB) and file caching. We do have servers with powerful enough CPUs available (dual-socket AMD EPYC 9684X 96-core processors).
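
Rough arithmetic behind those numbers (weights only, ignoring activations and runtime overhead):

```python
# Back-of-the-envelope estimate of weight memory for the sizes discussed above.
params = 53e9                  # "full 53B version"
fp16_gb = params * 2 / 1e9     # 2 bytes per weight  -> ~106 GB
int4_gb = params * 0.5 / 1e9   # 0.5 bytes per weight -> ~26.5 GB, a quarter of fp16
print(f"fp16: ~{fp16_gb:.0f} GB, int4: ~{int4_gb:.0f} GB")
```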

@freinold

Alternatively, we could investigate if there is a quantized version. With INT4 quantization, the model would shrink to a quarter of its size.

https://huggingface.co/KnutJaegersberg/nllb-moe-54b-4bit

@freinold

I also found an alternative framework called LibreTranslate:
#50 (comment)
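
For reference, a minimal sketch of querying a self-hosted LibreTranslate instance over its HTTP API; the host/port and language codes are assumptions and this is untested here:

```python
import requests

# Assumed local LibreTranslate deployment on the default port.
resp = requests.post(
    "http://localhost:5000/translate",
    json={"q": "Hello world!", "source": "en", "target": "de", "format": "text"},
    timeout=30,
)
print(resp.json()["translatedText"])
```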
