Inquiry About K-Means Initialization for Gates Without Fine-Tuning #4
@DaizeDong maybe you can help
Thank you for your attention to our project! That is a very good question. Unfortunately, in our observation, the converted model without further fine-tuning is essentially "broken", i.e., its PPL is very high. I conjecture this is due to the poor sparsity of modern LLMs, which are usually over-trained on a huge number of tokens. So directly using the model obtained from K-Means may not be very effective. However, for models like BERT (which is smaller and uses ReLU as the activation), it may be worth a try.
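For readers less familiar with the metric: PPL (perplexity) is the exponential of the mean per-token negative log-likelihood, so a "broken" model shows up as a very large value. A minimal sketch follows; the function name and the toy log-probabilities are hypothetical and not taken from the project's evaluation code.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Hypothetical example: a model that assigns probability 0.25 to every token.
uniform_over_4 = [math.log(0.25)] * 8
print(perplexity(uniform_over_4))  # approximately 4
```

A model that is no better than guessing uniformly among 4 options has a perplexity of about 4; the same logic scales to the full vocabulary.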
@DaizeDong Thanks for your swift reply. I wonder whether this K-Means method can make the gate converge faster than randomly initialized weights? Thanks!
@pprp Sorry, we didn't run experiments ablating the initialization method of the gate weights. However, this method can lead to better load balance at initialization, and I believe it can also accelerate convergence. If you are interested, I can experiment with it, and maybe we can collaborate on a deeper investigation of gate initialization strategies.
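To make the balance claim concrete, here is a toy sketch under stated assumptions: the gate is a bias-free linear layer with top-1 routing, and "K-Means initialization" means copying cluster centroids of the hidden states into the gate's weight rows. This is one plausible reading, not necessarily the project's actual procedure. The data, the helper names (`kmeans`, `expert_load`), and the farthest-point seeding are all illustrative choices.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain Lloyd's K-Means with deterministic farthest-point seeding."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Pick the point whose nearest chosen centroid is farthest away.
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            groups[j].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

def expert_load(gate_weight, points):
    """Top-1 routing with logits = W x; returns the token count per expert."""
    load = [0] * len(gate_weight)
    for p in points:
        logits = [sum(w * x for w, x in zip(row, p)) for row in gate_weight]
        load[logits.index(max(logits))] += 1
    return load

# Toy "hidden states": 4 well-separated clusters of 50 tokens each (assumed data).
rng = random.Random(0)
num_experts, dim, per_cluster = 4, 4, 50
hidden = [
    [(1.0 if d == c else 0.0) + rng.uniform(-0.1, 0.1) for d in range(dim)]
    for c in range(num_experts) for _ in range(per_cluster)
]

kmeans_gate = kmeans(hidden, num_experts)  # centroids used as gate rows
random_gate = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_experts)]

kmeans_load = expert_load(kmeans_gate, hidden)
random_load = expert_load(random_gate, hidden)
print("k-means init load:", kmeans_load)
print("random init load: ", random_load)
```

On this deliberately clean data the centroid-initialized gate routes exactly 50 tokens to each expert by construction, while the random gate has no such guarantee. Real hidden states are far less separable, and dot-product routing only approximates nearest-centroid assignment, so this illustrates the intuition rather than predicting behaviour on an actual LLM.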
@DaizeDong I ran an experiment on this. With the K-Means initialization, we got the following results: (image omitted) And here are the results with random initialization: (image omitted) The only difference is whether we use
@pprp Your images show that both models suffer a large performance loss right after initialization, which aligns with our observation. I think you need to train the models on more tokens to compare the convergence rates of the two methods.
Thanks for your excellent work! I came across your paper and noticed that the gates are initialized using K-Means, which seems quite innovative. However, the paper does not mention the performance when using this method directly.
I am curious to know if, when using the parameters obtained directly from K-Means initialization and testing the model without any fine-tuning, the PPL (perplexity) would be affected. Could you please provide some insights on this?
Thanks again for your time.