Low Accuracy on 80 Tasks After Fine-Tuning Meta-Llama-3-8B-Instruct (19/400 = 4.75%) #3
Comments
Which checkpoints do you use in the Meta-Llama-3-8B-Instruct folder?
The expected value should be around 36 tasks for this model. I attached the logs of the inference run and the resulting predictions for what, as I understand it, you are trying to get. I also attached the TTT loss log for one of the tasks. The checkpoint used in this run should be the same as: https://huggingface.co/ekinakyurek/marc-8B-finetuned-llama3/tree/main (I can additionally verify if necessary).
0a1d4ef5_tt.txt
==========
We also have some verification notebooks on Kaggle now. These use
For BARC checkpoints, make sure the torchtune tokenizer is in BARC mode; this requires editing the tokenizer file under your torchtune installation: https://github.com/ekinakyurek/torchtune/blob/efd85e000e83dcf6803c623cf83943e4a817377a/torchtune/models/llama3/_tokenizer.py#L51-L55
Here are the notebooks:
Score: 251.5/400 = 62.875%
I am using the downloaded safetensors from Meta-Llama-3-8B-Instruct.
For task 0a1d4ef5, my log is in 0a1d4ef5_log_1732848905.txt.txt
Step 1: loss: 1.4690, lr: 7.14e-06, tokens_per_second_per_gpu: 6844.40
Step 1: loss: 0.3976, lr: 7.14e-06, tokens_per_second_per_gpu: 4387.33
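To compare a log like this against the reference TTT loss log step by step, the step lines can be parsed into tuples. This is only a minimal sketch, assuming every step line follows the format quoted above; `parse_steps` is a hypothetical helper, not something from the repository.

```python
import re

# Assumed line format, taken from the two log lines quoted above.
STEP_RE = re.compile(
    r"Step (\d+): loss: ([\d.]+), lr: ([\d.eE+-]+), "
    r"tokens_per_second_per_gpu: ([\d.]+)"
)


def parse_steps(text):
    """Return a list of (step, loss, lr, tokens_per_second_per_gpu) tuples."""
    return [
        (int(step), float(loss), float(lr), float(tps))
        for step, loss, lr, tps in STEP_RE.findall(text)
    ]


log = (
    "Step 1: loss: 1.4690, lr: 7.14e-06, tokens_per_second_per_gpu: 6844.40\n"
    "Step 1: loss: 0.3976, lr: 7.14e-06, tokens_per_second_per_gpu: 4387.33\n"
)
rows = parse_steps(log)
for step, loss, lr, tps in rows:
    print(step, loss, lr, tps)
```

Parsing both logs the same way makes it easier to line up the loss values per task instead of reading them by eye.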
Okay, that’s the problem, I guess. Can you use our fine-tuned checkpoints, as in the paper? https://huggingface.co/ekinakyurek/marc-8B-finetuned-llama3/tree/main
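One possible way to fetch that repository locally is `snapshot_download` from the `huggingface_hub` package. This is a sketch under assumptions: the target directory name is hypothetical, and the folder layout inside the repository (which subfolder holds the 8B checkpoint) should be checked after the download rather than taken from this example.

```python
# Repository ID taken from the URL above.
REPO_ID = "ekinakyurek/marc-8B-finetuned-llama3"


def download_finetuned_checkpoint(local_dir):
    """Download the whole checkpoint repository and return the local path."""
    # Imported inside the function so this sketch loads even when
    # huggingface_hub is not installed (pip install huggingface_hub).
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)


# Example call (hypothetical directory name; downloads several GB):
# path = download_finetuned_checkpoint("checkpoints/marc-8B-finetuned-llama3")
```

The downloaded path would then replace the base Meta-Llama-3-8B-Instruct folder in the test-time training and prediction scripts.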
I used 80 tasks from the file task_info_selected.csv in this repository and fine-tuned Meta-Llama-3-8B-Instruct using the train.sh script (train.sh.txt below) generated from this repository. I then generated the predict.sh script (predict.sh.txt below) for inference, following the instructions in the same repository. However, the final result is a Competition Accuracy of 19 / 400 = 0.0475, meaning only 19 tasks were solved correctly. Can you help me identify what might be wrong?
In /workspace/wubing/marc/test_time_train.py, I updated the code as shown below to use the selected 80 tasks.
    if args.num_tasks is not None:
        if args.num_tasks_selected:
            import pandas as pd
            df = pd.read_csv('/workspace/wubing/marc/task_info_selected.csv')
            selected_tasks = df['task_id'].to_list()
            arc_test_tasks = [task for task in arc_test_tasks if task.name.replace("-0", "") in selected_tasks]
            print("Use selected tasks as ttt paper")
        else:
            arc_test_tasks = arc_test_tasks[: args.num_tasks]
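A filter like this silently keeps fewer tasks if some IDs in the CSV do not match any task name. Below is a minimal sketch of the same selection with a check for unmatched IDs, assuming the CSV has a `task_id` column and task names carry a `-0` suffix as in the snippet above. `select_tasks` and the `Task` stand-in are hypothetical and not part of the repository.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical stand-in for the repository's task objects; only `name` is used.
    name: str


def select_tasks(tasks, csv_file):
    """Keep tasks listed in the CSV's task_id column; also report unmatched IDs."""
    selected = {row["task_id"] for row in csv.DictReader(csv_file)}
    # Same normalisation as the snippet above: drop "-0" from the task name.
    kept = [t for t in tasks if t.name.replace("-0", "") in selected]
    found = {t.name.replace("-0", "") for t in kept}
    return kept, sorted(selected - found)


tasks = [Task("0a1d4ef5-0"), Task("ffffffff-0")]
csv_text = "task_id\n0a1d4ef5\n12345678\n"
kept, missing = select_tasks(tasks, io.StringIO(csv_text))
print(len(kept), missing)  # prints: 1 ['12345678']
```

With the real file, an empty `missing` list and `len(kept) == 80` would confirm that all selected tasks were actually picked up.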
predict.log
train.log
predict.sh.txt
train.sh.txt