Inquiry About K-Means Initialization for Gates Without Fine-Tuning #4

Open
pprp opened this issue Dec 16, 2024 · 6 comments

Comments

@pprp

pprp commented Dec 16, 2024

Thanks for your excellent work! I came across your paper and noticed that the gates are initialized using K-Means, which seems quite innovative. However, the paper does not report the performance of using this initialization directly.

I am curious whether the PPL (perplexity) would be affected if the model were tested with the parameters obtained directly from K-Means initialization, without any fine-tuning. Could you please provide some insights on this?
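For clarity, here is a rough sketch of the kind of direct use I have in mind (this is only my own reading of the idea; the function, shapes, and the normalization step are assumptions, not necessarily your exact procedure):

```python
# Sketch: run K-Means over hidden states collected from a calibration set and
# use the cluster centroids as the rows of the router/gate weight matrix.
import torch
from sklearn.cluster import KMeans


def kmeans_gate_init(hidden_states: torch.Tensor, num_experts: int) -> torch.Tensor:
    """hidden_states: (num_tokens, hidden_dim) activations entering the FFN.
    Returns a (num_experts, hidden_dim) gate weight matrix."""
    feats = hidden_states.float().cpu().numpy()
    km = KMeans(n_clusters=num_experts, n_init=10, random_state=0).fit(feats)
    centroids = torch.from_numpy(km.cluster_centers_).float()
    # Each expert's gate vector is the (normalized) centroid of one cluster.
    return torch.nn.functional.normalize(centroids, dim=-1)


# Hypothetical usage: copy the result into the MoE layer's gate and evaluate
# the model as-is, with no further fine-tuning.
# moe_layer.gate.weight.data.copy_(kmeans_gate_init(calib_hidden_states, num_experts=8))
```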

Thanks again for your time.

@XiaoYee
Contributor

XiaoYee commented Dec 16, 2024

@DaizeDong maybe you can help

@DaizeDong
Contributor

Thank you for your attention to our project! That is a very good question!

Unfortunately, according to our observations, the converted model without further fine-tuning is essentially "broken", i.e., its PPL is very high. I conjecture this is due to the poor sparsity of modern LLMs, which are usually over-trained on a huge number of tokens. So directly using the model obtained from K-Means initialization may not be very effective. However, for models like BERT (which are smaller and use ReLU activations), it may be worth a try.
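In case anyone wants to reproduce this observation, a minimal perplexity check on the converted, un-finetuned model could look roughly like the sketch below (the model path is a placeholder, and the eval set and context length are arbitrary choices, not the paper's setup):

```python
# Minimal perplexity check for a converted (un-finetuned) model using
# HuggingFace transformers; adjust the model path and eval data to your setup.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/converted-moe-model"  # placeholder
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to(device)
model.eval()

# Concatenate a small eval split and score it in fixed-length windows.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids.to(device)

window = 2048
nlls, n_tokens = [], 0
with torch.no_grad():
    for start in range(0, ids.size(1), window):
        chunk = ids[:, start : start + window]
        if chunk.size(1) < 2:
            break
        out = model(chunk, labels=chunk)  # labels are shifted internally
        n = chunk.size(1) - 1
        nlls.append(out.loss * n)
        n_tokens += n

ppl = torch.exp(torch.stack(nlls).sum() / n_tokens)
print(f"Perplexity: {ppl.item():.2f}")
```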

@pprp
Author

pprp commented Dec 17, 2024

@DaizeDong Thanks for your swift reply.

I wonder whether this K-Means initialization can make the gate converge faster than random initialization?

Thanks!

@DaizeDong
Contributor

@pprp Sorry, we didn't conduct experiments ablating the initialization method of the gate weights. However, this method does lead to better load balance at initialization, and I believe it can also accelerate convergence. If you are interested, I can run experiments on it, and maybe we can collaborate on deeper investigations into gate initialization strategies.
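As a rough illustration of what I mean by load balance at initialization, one could route a batch of calibration hidden states through the freshly initialized gate and compare the per-expert load (and its entropy) between the K-Means and random variants; the shapes and names below are assumptions:

```python
# Sketch: measure how evenly tokens are routed right after gate initialization.
import torch


def routing_balance(gate_weight: torch.Tensor, hidden_states: torch.Tensor, top_k: int = 2):
    """gate_weight: (num_experts, hidden_dim); hidden_states: (num_tokens, hidden_dim)."""
    logits = hidden_states @ gate_weight.t()                 # (num_tokens, num_experts)
    top_experts = logits.topk(top_k, dim=-1).indices         # (num_tokens, top_k)
    counts = torch.bincount(top_experts.flatten(), minlength=gate_weight.size(0)).float()
    load = counts / counts.sum()                             # fraction of routed tokens per expert
    entropy = -(load * (load + 1e-12).log()).sum()           # uniform routing maximizes this
    return load, entropy


# Comparing the entropy for K-Means-initialized vs. randomly initialized gate
# weights (on the same calibration batch) would quantify the difference.
```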

@pprp
Author

pprp commented Dec 18, 2024

@DaizeDong I conducted an experiment on this, and here are the results:

When employing the K-Means initialization, we got the following results:

[screenshot: evaluation results with K-Means initialization]

And here are the random initialization's results:

[screenshot: evaluation results with random initialization]

The only difference between the two runs is whether we use --gate_weights_file.

@DaizeDong
Contributor

@pprp Your screenshots show that both models suffer a large performance loss right after initialization, which aligns with our observations. I think you would need to train the models on more tokens to compare the convergence rates of the two initialization methods.
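One simple way to run that comparison would be to log the eval loss of both runs at regular intervals and overlay the two curves; a sketch under the assumption that each run writes a CSV with `step` and `eval_loss` columns (file names are placeholders):

```python
# Sketch: overlay the eval-loss curves of the two gate-initialization runs.
import csv
import matplotlib.pyplot as plt


def load_curve(path):
    with open(path) as f:
        rows = [(int(r["step"]), float(r["eval_loss"])) for r in csv.DictReader(f)]
    return zip(*rows)  # -> (steps, losses)


for path, label in [("kmeans_init_log.csv", "K-Means init"),
                    ("random_init_log.csv", "random init")]:
    steps, losses = load_curve(path)
    plt.plot(steps, losses, label=label)

plt.xlabel("training steps")
plt.ylabel("eval loss")
plt.legend()
plt.savefig("gate_init_convergence.png")
```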
