Inquiry About K-Means Initialization for Gates Without Fine-Tuning #4

Open
pprp opened this issue Dec 16, 2024 · 6 comments

Comments

@pprp

pprp commented Dec 16, 2024

Thanks for your excellent work! I came across your paper and noticed that the gates are initialized using K-Means, which seems quite innovative. However, the paper does not report the performance of using this initialization directly.

I am curious whether the PPL (perplexity) would be affected if the model were tested with the parameters obtained directly from K-Means initialization, without any fine-tuning. Could you please provide some insights on this?
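For clarity, here is a rough sketch of the kind of direct use I have in mind (this is only my own reading of the idea; the function, shapes, and the normalization step are assumptions, not necessarily your exact procedure):

```python
# Sketch: run K-Means over hidden states collected from a calibration set and
# use the cluster centroids as the rows of the router/gate weight matrix.
import torch
from sklearn.cluster import KMeans


def kmeans_gate_init(hidden_states: torch.Tensor, num_experts: int) -> torch.Tensor:
    """hidden_states: (num_tokens, hidden_dim) activations entering the FFN.
    Returns a (num_experts, hidden_dim) gate weight matrix."""
    feats = hidden_states.float().cpu().numpy()
    km = KMeans(n_clusters=num_experts, n_init=10, random_state=0).fit(feats)
    centroids = torch.from_numpy(km.cluster_centers_).float()
    # Each expert's gate vector is the (normalized) centroid of one cluster.
    return torch.nn.functional.normalize(centroids, dim=-1)


# Hypothetical usage: copy the result into the MoE layer's gate and evaluate
# the model as-is, with no further fine-tuning.
# moe_layer.gate.weight.data.copy_(kmeans_gate_init(calib_hidden_states, num_experts=8))
```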

Thanks again for your time.

@XiaoYee
Contributor

XiaoYee commented Dec 16, 2024

@DaizeDong maybe you can help

@DaizeDong
Contributor

Thank you for your attention to our project! That is a very good question!

Unfortunately, according to our observations, the converted model without further fine-tuning is essentially "broken", i.e., its PPL is very high. I conjecture this is due to the poor sparsity of modern LLMs, which are usually over-trained on a huge number of tokens. So directly using the model obtained from K-Means initialization may not be very effective. However, for models like BERT (which are smaller and use ReLU activations), it may be worth a try.
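In case anyone wants to reproduce this observation, a minimal perplexity check on the converted, un-finetuned model could look roughly like the sketch below (the model path is a placeholder, and the eval set and context length are arbitrary choices, not the paper's setup):

```python
# Minimal perplexity check for a converted (un-finetuned) model using
# HuggingFace transformers; adjust the model path and eval data to your setup.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/converted-moe-model"  # placeholder
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to(device)
model.eval()

# Concatenate a small eval split and score it in fixed-length windows.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids.to(device)

window = 2048
nlls, n_tokens = [], 0
with torch.no_grad():
    for start in range(0, ids.size(1), window):
        chunk = ids[:, start : start + window]
        if chunk.size(1) < 2:
            break
        out = model(chunk, labels=chunk)  # labels are shifted internally
        n = chunk.size(1) - 1
        nlls.append(out.loss * n)
        n_tokens += n

ppl = torch.exp(torch.stack(nlls).sum() / n_tokens)
print(f"Perplexity: {ppl.item():.2f}")
```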

@pprp
Author

pprp commented Dec 17, 2024

@DaizeDong Thanks for your swift reply.

I wonder whether this K-Means initialization can make the gate converge faster than random initialization?

Thanks!

@DaizeDong
Contributor

@pprp Sorry, we didn't conduct experiments ablating the initialization method of the gate weights. However, this method does lead to better load balance at initialization, and I believe it can also accelerate convergence. If you are interested, I can run experiments on it, and maybe we can collaborate on deeper investigations into gate initialization strategies.
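As a rough illustration of what I mean by load balance at initialization, one could route a batch of calibration hidden states through the freshly initialized gate and compare the per-expert load (and its entropy) between the K-Means and random variants; the shapes and names below are assumptions:

```python
# Sketch: measure how evenly tokens are routed right after gate initialization.
import torch


def routing_balance(gate_weight: torch.Tensor, hidden_states: torch.Tensor, top_k: int = 2):
    """gate_weight: (num_experts, hidden_dim); hidden_states: (num_tokens, hidden_dim)."""
    logits = hidden_states @ gate_weight.t()                 # (num_tokens, num_experts)
    top_experts = logits.topk(top_k, dim=-1).indices         # (num_tokens, top_k)
    counts = torch.bincount(top_experts.flatten(), minlength=gate_weight.size(0)).float()
    load = counts / counts.sum()                             # fraction of routed tokens per expert
    entropy = -(load * (load + 1e-12).log()).sum()           # uniform routing maximizes this
    return load, entropy


# Comparing the entropy for K-Means-initialized vs. randomly initialized gate
# weights (on the same calibration batch) would quantify the difference.
```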

@pprp
Author

pprp commented Dec 18, 2024

@DaizeDong I conducted an experiment on this, and here are the results:

When employing the K-Means initialization, we got the following results:

[screenshot: evaluation results with K-Means initialization]

And here are the random initialization's results:

[screenshot: evaluation results with random initialization]

The only difference between the two runs is whether we use --gate_weights_file.

@DaizeDong
Contributor

@pprp Your screenshots show that both models suffer a large performance loss right after initialization, which aligns with our observations. I think you would need to train the models on more tokens to compare the convergence rates of the two initialization methods.
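One simple way to run that comparison would be to log the eval loss of both runs at regular intervals and overlay the two curves; a sketch under the assumption that each run writes a CSV with `step` and `eval_loss` columns (file names are placeholders):

```python
# Sketch: overlay the eval-loss curves of the two gate-initialization runs.
import csv
import matplotlib.pyplot as plt


def load_curve(path):
    with open(path) as f:
        rows = [(int(r["step"]), float(r["eval_loss"])) for r in csv.DictReader(f)]
    return zip(*rows)  # -> (steps, losses)


for path, label in [("kmeans_init_log.csv", "K-Means init"),
                    ("random_init_log.csv", "random init")]:
    steps, losses = load_curve(path)
    plt.plot(steps, losses, label=label)

plt.xlabel("training steps")
plt.ylabel("eval loss")
plt.legend()
plt.savefig("gate_init_convergence.png")
```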
