
Support for Sparse/Pruned Models #493

Open
Rikorose opened this issue May 11, 2021 · 4 comments

@Rikorose (Contributor)

Related to the Lottery Ticket Hypothesis, pruning makes it possible to discard a large fraction of the parameters in a network [1, 2, 3, 4] without sacrificing model accuracy (depending on the amount, of course). However, memory locality usually suffers due to the sparsity. Therefore, [3] and [4] select 16x1 and 16x4 blocks for pruning during training, so that inference can still be vectorized. The choice of block size probably depends largely on the inference backend.

I think this is currently not supported via ONNX, but it might make sense to start a discussion about it already, since this can speed up inference by a fairly large amount. A rough sketch of what block pruning looks like follows the references below.

Other work on the inference side:

[1] Gordon et al.: "Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning", https://arxiv.org/abs/2002.08307
[2] Zhu et al.: "To prune, or not to prune: exploring the efficacy of pruning for model compression", https://arxiv.org/abs/1710.01878v1
[3] Valin et al.: "LPCNet: Improving Neural Speech Synthesis Through Linear Prediction", https://jmvalin.ca/papers/lpcnet_icassp2019.pdf
[4] Valin et al.: "Low-complexity, real-time joint neural echo control and speech enhancement based on PercepNet", https://jmvalin.ca/papers/percepnet_res.pdf
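
To make the block idea concrete, here is a rough sketch of magnitude-based 16x1 block pruning as done during training in [3]/[4]; the function name and the plain row-major `Vec<f32>` layout are only illustrative:

```rust
// Illustrative sketch: prune a dense, row-major weight matrix in 16x1 blocks,
// zeroing the blocks with the smallest (squared) L2 norm until a target
// sparsity is reached.
fn prune_blocks(weights: &mut [f32], rows: usize, cols: usize, sparsity: f32) {
    const BLOCK: usize = 16; // 16x1 blocks as in LPCNet/PercepNet
    assert!(rows % BLOCK == 0);

    // Squared L2 norm of every 16x1 block (16 consecutive rows, one column).
    let mut norms: Vec<(f32, usize, usize)> = Vec::new();
    for br in (0..rows).step_by(BLOCK) {
        for c in 0..cols {
            let norm: f32 = (0..BLOCK)
                .map(|i| weights[(br + i) * cols + c].powi(2))
                .sum();
            norms.push((norm, br, c));
        }
    }

    // Zero out the weakest blocks until the requested fraction is pruned.
    norms.sort_by(|a, b| a.0.partial_cmp(&b.0).unwrap());
    let n_prune = (norms.len() as f32 * sparsity) as usize;
    for &(_, br, c) in &norms[..n_prune] {
        for i in 0..BLOCK {
            weights[(br + i) * cols + c] = 0.0;
        }
    }
}
```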

@kali (Collaborator) commented May 11, 2021

Hey, thanks for your interest in tract!

These are just a few thoughts; I have not studied the topic in depth.

I think supporting the kind of models described in these papers could be done relatively easily. The key point is that only the weight matrices are sparse. We can probably handle that by just introducing a few new MatMul operators, without changing the definition of Tensor, and letting all the variables stay dense. For instance, we could encode the weight matrix as its diagonal values plus a block mask and block values, either as three inputs or as three attributes.
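
Very roughly, the stored form of such an operator could look like this; all names here are made up, nothing of it exists in tract:

```rust
// Made-up sketch of the three-part encoding: diagonal values, a block mask,
// and the values of the retained blocks. These could arrive as three inputs
// or as three attributes of a new MatMul variant; the activations stay dense.
struct BlockSparseMatMul {
    rows: usize,
    cols: usize,
    block: usize,           // e.g. 16 for 16x1 blocks
    diagonal: Vec<f32>,     // diagonal of the weight matrix, kept apart
    block_mask: Vec<bool>,  // one flag per block position: retained or pruned?
    block_values: Vec<f32>, // `block` values per retained block, in mask order
}
```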

I think it would be relatively easy to draft a simple implementation in Rust. We can think about the optimisations later.

In terms of format, ONNX will not help us much... We can extend NNEF with custom operators (we already do that to encode tract-core operators that are not in NNEF).

@Rikorose (Contributor, Author)

From my understanding, masks have the same size as the original dense tensor, no?

Typically, a sparse tensor is defined by a list of values and a list of indices (see e.g. ONNX). To support blocks, one could define the values as a list of block values; the indices would then point only to the first element of each block. I am not sure the diagonal values strictly need special treatment, since they could also be included within the blocks.
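
Roughly like this, close to a block-CSR layout (just an illustration, not an existing type):

```rust
// Illustrative only: a block-sparse tensor as a flat list of block values
// plus one index per block pointing at its first element.
struct BlockSparseTensor {
    shape: [usize; 2],
    block: [usize; 2],        // block size, e.g. [16, 1]
    values: Vec<f32>,         // block[0] * block[1] contiguous values per block
    indices: Vec<[usize; 2]>, // (row, col) of the first element of each block
}
```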

If NNEF with custom operators is used, how is the export to NNEF typically handled?

@kali (Collaborator) commented May 12, 2021

Yeah, the most generic masks have the same size as the dense tensor. But judging from [4] above, it looks like lots of people are working with blocks instead, and I must say that from an implementation perspective it makes a lot of sense, as it allows the use of vectorized instructions. In these papers, the diagonal elements are handled separately, as they are likely to be non-zero and you don't want to "waste" an entire block for a single value.
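
To illustrate why blocks are attractive, here is a minimal sketch of a matrix-vector product with 16x1 blocks and a separate diagonal; the layout and names are assumptions of the sketch, not actual tract code:

```rust
// Sketch only: y = W * x where W is square, with its diagonal stored apart
// and the remaining non-zeros stored as 16x1 blocks. The fixed-size loop
// over each contiguous block is what SIMD/vectorization can exploit.
fn block_sparse_matvec(
    diag: &[f32],                   // diagonal of W
    block_values: &[f32],           // 16 contiguous values per retained block
    block_index: &[(usize, usize)], // (row, col) of each block's first element
    x: &[f32],
    y: &mut [f32],
) {
    const BLOCK: usize = 16;
    // The diagonal is handled separately, so a nearly-always-non-zero value
    // does not force us to keep a whole block.
    for i in 0..diag.len() {
        y[i] = diag[i] * x[i];
    }
    // Each 16x1 block multiplies one input element into 16 contiguous outputs.
    for (b, &(row, col)) in block_index.iter().enumerate() {
        let vals = &block_values[b * BLOCK..(b + 1) * BLOCK];
        for i in 0..BLOCK {
            y[row + i] += vals[i] * x[col];
        }
    }
}
```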

Thanks for the link to ONNX; I did not know they had done anything about it. They're going the general way here, no blocks. I don't know if they have done anything more than defining a protobuf format: I haven't seen any tests about sparse inputs in the test suite, and I have never encountered them in the field yet.

Typically, NNEF/OPL is generated from ONNX or TF using the tract command line... That said, some of our teams are considering generating NNEF/OPL directly from the training scripts to bypass ONNX or TF limitations and constraints (the format is not very difficult, and there is some tooling already).

@Rikorose (Contributor, Author)

FYI: PyTorch implemented a few aarch64 and arm kernels with block sizes 8x1 and 4x8: pytorch/pytorch#50585
