Sparsity Runtime Integration with TF/TFLite for Latency Improvements #173
Comments
When do you think this will be included in a TensorFlow/TFLite release? Is there a targeted timeline? Depending on the answer, we are planning to do internal development if this is not expected within this year (2020). |
Hi. Also, we're hoping the current working-from-home situation won't affect things further. Thanks |
Thank you |
Why is this closed? Will this be integrated in the next version? |
Reopened. It will not necessarily be integrated in the next release. |
Will sparse models ever result in smaller/compressed *.tflite models? This would be a huge plus for low power use cases as it would reduce I/O. Currently I'm working with a quantized 14MB model but if pruned & compressed it could go down to 2MB and be able to fit in the SRAM of some MCUs. |
Same here! Currently my *.tflite model and its sparse counterpart have the same storage requirements. If TFLite could detect the zeros and change their type to uint8, this would make a huge difference in model size (MBs). |
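To illustrate the point made in the two comments above: zeroed weights still occupy a full byte each in a flat buffer, so the raw size does not change unless a compressor or a sparse encoding is applied. The following is a minimal sketch using made-up random int8 bytes rather than a real *.tflite file; `make_int8_weights` and `compressed_size` are hypothetical helper names, and zlib stands in for whatever compression a deployment would actually use.

```python
import random
import zlib


def make_int8_weights(n: int, sparsity: float, seed: int = 0) -> bytes:
    """Build n pseudo-random one-byte weights; roughly `sparsity` of them are zero."""
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        if rng.random() < sparsity:
            values.append(0)                    # pruned weight
        else:
            values.append(rng.randint(1, 255))  # surviving non-zero weight
    return bytes(values)


def compressed_size(weights: bytes) -> int:
    """Size in bytes of the buffer after generic zlib compression."""
    return len(zlib.compress(weights, 9))


dense = make_int8_weights(100_000, sparsity=0.0)
sparse = make_int8_weights(100_000, sparsity=0.9)

# Raw buffers are the same length: pruning alone saves nothing on disk.
print("raw bytes:       ", len(dense), len(sparse))
# Only after compression does the sparse buffer become smaller.
print("compressed bytes:", compressed_size(dense), compressed_size(sparse))
```

The compressed buffer still has to be decompressed before a standard kernel can use it, which is why this helps download and flash size but not, by itself, SRAM footprint or latency.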
@paulaksm @shariq-audiofocus |
@gordinmitya Thanks, I hadn't heard of structural pruning, seems like that could lead to smaller tflite binaries if it eliminates entire filters. Is structural pruning on the model-optimization roadmap? Re: storage - I'm not worried about offline storage. I'm worried about latency & power usage during inference on tiny edge devices (probably MCUs). ARM is developing processors [1] that can do online decompression of weights on-the-fly during inference. This is interesting because now you can fit larger models in memory by utilizing their compression technique. If the model fits in memory (SRAM) you get lower latency & power usage. I'm wondering if the model-optimization & TFLite team are thinking about this or if it's outside their scope. [1] https://www.theregister.com/2020/02/10/arm_cortex_m_ai_accelerator/ - "To fit this all into a small memory and silicon footprint, the microNPU can decompress trained INT8 models on the fly for inference." |
Structural pruning is really important to my team, too. The current zero-weight pruning for compression is nice, but we're far more interested in reduced file sizes to be able to fit models into SRAM instead of DRAM. I'm hopeful that this library will eventually support structural pruning, but so far I haven't seen any mention of it. |
Any updates on this? Can we expect latency improvements for our pruned models? |
Can you estimate its release date for inference time optimization? |
Sorry for keeping you waiting. We're actively working on making the initial release of sparse inference support in TFLite. It's hard to give an exact date but hopefully before Q3 ends. Thanks for your patience! |
Please note that we're still finalizing the API. The workflow in the released version may look different. |
@liyunlu0618 I'm looking at your approach right now and trying to implement it. Does this latency-improved inference also work for Conv filters and not only Dense ones (and how would one do it for Conv filters)? Also, why is the block exactly [4,1]? How does that ensure inference time improvements? Thanks! |
For the Conv op we only support these hosted models at the moment: We need the block config to use SIMD instructions on the Arm NEON architecture. Feel free to check out the kernel here: |
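As a rough illustration of what a [4,1] block config means for the weights: pruning decisions are made per group of four adjacent weights rather than per single weight, so the zeros arrive in runs of four, which is presumably what lets a SIMD kernel skip or load them a vector at a time. The sketch below is plain Python, not the TFLite kernel or the model-optimization API; `prune_blocks`, its parameters, and the orientation of the block relative to TFLite's weight layout are all assumptions made for the example.

```python
def prune_blocks(matrix, block_rows, block_cols, sparsity):
    """Zero the lowest-magnitude blocks so that `sparsity` of all blocks are zero.

    Hypothetical helper: each block is scored by the sum of absolute values.
    """
    rows, cols = len(matrix), len(matrix[0])
    assert rows % block_rows == 0 and cols % block_cols == 0

    # Score every block and remember its top-left corner.
    blocks = []
    for r in range(0, rows, block_rows):
        for c in range(0, cols, block_cols):
            score = sum(abs(matrix[r + i][c + j])
                        for i in range(block_rows)
                        for j in range(block_cols))
            blocks.append((score, r, c))

    # Prune the weakest blocks as whole units.
    blocks.sort(key=lambda b: b[0])
    n_prune = int(len(blocks) * sparsity)
    pruned = [row[:] for row in matrix]
    for _, r, c in blocks[:n_prune]:
        for i in range(block_rows):
            for j in range(block_cols):
                pruned[r + i][c + j] = 0.0
    return pruned


weights = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.1],
    [0.6, 0.3],
    [0.1, 0.5],
    [0.2, 0.6],
    [0.1, 0.7],
    [0.3, 0.8],
]
pruned = prune_blocks(weights, block_rows=4, block_cols=1, sparsity=0.5)
for row in pruned:
    print(row)
```

With per-weight pruning the same 50% sparsity could leave zeros scattered anywhere; with blocks, every zero sits in a run of four along one axis.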
Hi, are there updates on this? |
@alanchiao any update on progress? |
This is currently available as an experimental feature in TFLite. For sparse CNNs, it needs to run with the XNNPack delegate. Please refer to this. For sparse RNNs and transformers, TFLite has built-in support. This has a few examples. We'll have formal blogposts/docs soon. In the meanwhile if you could provide more details on your use case, I can suggest how to apply this optimization accordingly. Key points that are helpful:
|
@liyunlu0618 thanks for the information, I will play around with it a bit :D. Do you know when the documentation will be finished? |
mark |
Hello, I was wondering whether there is any intention of adding structural pruning support for conv layers (in addition to dense layers). Is this possible, or does some fundamental issue prohibit it? Thanks |
@liyunlu0618 - My use case:
|
Any chance we will get support for pruned CNNs on other TFLite delegates? We rely on the NNAPI and CoreML delegates for quick and efficient inference on Android and iOS, respectively, but so far it looks like XNNPack is the only supported delegate. |
I have the same issue here. After pruning, I got a model of the same size with the same inference time. Even after converting to TFLite it runs on CPU, so the inference time is still not good, and XNNPack doesn't support my network. Could you tell me what I can do next to improve the inference time of my pruned model? Thank you so much! |
Is there any update on this topic? What's the correct way to improve the inference time of a model with pruning? |
It seems there's still no proper solution for improving the inference time of a pruned model |
@sampathrajapaksha - We've found the best approach is to do knowledge distillation (KD) to shrink your model and therefore improve inference time. This paper has some good ideas: https://arxiv.org/pdf/1910.01108.pdf and shows you can do it with minimal performance degradation. We're still experimenting, but this seems to be a better path forward than relying on pruning optimizations |
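For readers unfamiliar with KD, here is a minimal sketch of the generic soft-target loss that distillation setups commonly build on: the student is trained to match the teacher's temperature-softened output distribution. It is not claimed to be the full objective of the linked paper, which combines its distillation term with other losses; the function names and the temperature value are assumptions chosen for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]

print(distillation_loss(close_student, teacher))
print(distillation_loss(far_student, teacher))
```

In practice this term is usually mixed with the ordinary hard-label loss, and the smaller student architecture, not the loss itself, is what delivers the latency gain.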
@shariq-audiofocus Thank you very much for sharing this with me. My use case is quite similar to yours. I'll read this and see how I can apply this to reduce inference time |
Hi @alanchiao, I am thinking about pruning the MobileNetV3 small model to fit some smaller MCUs. I am new to the pruning world; what would you suggest as a good starting point? Thanks! |
As suggested here, model pruning currently only provides benefits in model compression/size reduction. Further framework support is necessary to provide latency improvements in TF/TFLite.