
Investigate using an LLM as a teacher #994

Open
eu9ene opened this issue Jan 15, 2025 · 1 comment
Assignees: eu9ene
Labels: LLM (Investigations into using LLMs in the pipeline)

Comments

@eu9ene (Collaborator) commented Jan 15, 2025

We should do a proof of concept for knowledge distillation from an LLM to our standard student model. The main benefit, if it works, is that we won't need to deal with parallel data cleaning or with training teacher models, both of which can be quite challenging, especially for lower-resource languages.

This would require:

  • Estimate costs for different LLMs and APIs we can use
  • Run quality evaluation for those to see which model would provide the best cost/quality tradeoff
  • Choose a mix of monolingual data to translate
  • Run translation with an LLM (a rough sketch follows this list)
  • Train a regular student model on this data
  • Try different LLMs and corpora of different sizes (for example, 10M and 50M sentences)
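
A rough sketch of the translation step, assuming an OpenAI-style chat completions API; the model name, prompt, language pair, and file names are placeholders rather than decisions made in this issue:

```python
"""Sketch: translate monolingual data with an LLM to build distillation data.

Assumes one sentence per line in the monolingual input; the model name,
prompt, and language pair are placeholders, since choosing them is part of
this investigation.
"""
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Translate the following {src} sentence into {trg}. "
    "Return only the translation.\n\n{sentence}"
)

def translate(sentence: str, src: str = "English", trg: str = "German") -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the cost/quality comparison decides this
        messages=[{"role": "user",
                   "content": PROMPT.format(src=src, trg=trg, sentence=sentence)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Write a tab-separated source/target corpus that a student model can be trained on.
with open("mono.en") as src_file, open("llm_distilled.en-de.tsv", "w") as out_file:
    for line in src_file:
        sentence = line.strip()
        if sentence:
            out_file.write(f"{sentence}\t{translate(sentence)}\n")
```

At the 10M–50M sentence scale mentioned above, batching requests and caching responses would matter for the cost estimate.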

Folks from WMT also suggested that we could pre-train the student on the parallel OPUS corpora as-is and then fine-tune it on a smaller but high-quality LLM-produced corpus to make the approach more cost-efficient.
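
A minimal sketch of that two-stage schedule, using Hugging Face transformers as a stand-in for the pipeline's Marian training; the reused tokenizer, model size, hyperparameters, and file names are assumptions for illustration only:

```python
"""Sketch: pre-train the student on OPUS bitext as-is, then fine-tune on the
smaller LLM-produced corpus. Not the pipeline's actual Marian setup."""
from datasets import load_dataset
from transformers import (
    DataCollatorForSeq2Seq,
    MarianConfig,
    MarianMTModel,
    MarianTokenizer,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Borrow an existing Marian tokenizer for illustration; the pipeline trains its own vocab.
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")

# Student initialized from scratch (MarianConfig default sizes), with token ids
# aligned to the tokenizer.
config = MarianConfig(
    vocab_size=tokenizer.vocab_size,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    decoder_start_token_id=tokenizer.pad_token_id,
)
model = MarianMTModel(config)

def preprocess(batch):
    return tokenizer(batch["source"], text_target=batch["target"],
                     truncation=True, max_length=128)

def run_stage(tsv_path, output_dir, epochs):
    data = load_dataset("csv", data_files=tsv_path, delimiter="\t",
                        column_names=["source", "target"])["train"]
    data = data.map(preprocess, batched=True, remove_columns=["source", "target"])
    args = Seq2SeqTrainingArguments(output_dir=output_dir,
                                    num_train_epochs=epochs,
                                    per_device_train_batch_size=64)
    Seq2SeqTrainer(model=model, args=args, train_dataset=data,
                   data_collator=DataCollatorForSeq2Seq(tokenizer, model=model)).train()

# Stage 1: pre-train on the OPUS parallel data as-is.
run_stage("opus.en-de.tsv", "student-pretrain", epochs=1)
# Stage 2: continue training the same weights on the smaller LLM-translated corpus.
run_stage("llm_distilled.en-de.tsv", "student-finetune", epochs=3)
```

The point is only the ordering: the same student weights continue from the noisy OPUS stage into the smaller, higher-quality LLM stage.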

eu9ene added the LLM label on Jan 15, 2025
eu9ene self-assigned this on Jan 15, 2025
@eu9ene (Collaborator, Author) commented Jan 15, 2025

Related to #767
