DenseNet-121 is faster than CondenseNet-74 (C=G=4) on GTX 1080 Ti #3
I compared the forward-pass speed of the larger ImageNet CondenseNet model with DenseNet-121, and the latter actually runs faster. After benchmarking, my guess is that the CondenseConv layer causes the slowdown, due to the memory transfers in the ShuffleLayer and in torch.index_select (sketched below).
@ShichenLiu, can you comment on this? Did you get better performance than DenseNet-121 in your experiments?
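For context, the suspected cost is pure data movement rather than arithmetic. Here is a rough sketch, not the repository's actual implementation, of the indexing step in a CondenseConv-style layer; the class name and the random index buffer are illustrative only (in CondenseNet the index is learned):

```python
import torch
import torch.nn as nn

class CondenseConvSketch(nn.Module):
    """Illustrative only: gather a permutation of input channels
    before a grouped 1x1 conv. CondenseNet learns the index during
    training; a random buffer stands in for it here."""

    def __init__(self, in_channels: int, out_channels: int, groups: int):
        super().__init__()
        self.register_buffer("index", torch.randperm(in_channels))
        self.conv = nn.Conv2d(in_channels, out_channels,
                              kernel_size=1, groups=groups, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The gather below performs zero FLOPs but copies the entire
        # activation tensor; on a GPU this memory traffic can dominate
        # the cheap grouped 1x1 convolution that follows.
        x = torch.index_select(x, dim=1, index=self.index)
        return self.conv(x)

layer = CondenseConvSketch(128, 128, groups=4)
y = layer(torch.randn(8, 128, 28, 28))
```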
Our model is mainly designed for mobile devices, on which the actual inference time correlates strongly with the theoretical complexity. However, the group convolution and the index/shuffle operations are not efficiently implemented on GPUs.
GPUs tend to be memory-bound rather than compute-bound, particularly for small models such as ShuffleNets and CondenseNets that require additional memory transfers. On mobile devices, embedded systems, etc., the ratio between compute (in FLOPS) and memory bandwidth is very different: convnets tend to be compute-bound on such platforms. If you ran the same comparison on such a platform, you would find that CondenseNet is much faster than DenseNet (see Table 5 of the paper for actual timing results on an ARM processor).
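To make the memory-bound point concrete: a channel shuffle performs no floating-point work at all; its entire cost is the activation copy forced by `.contiguous()`. A minimal sketch of the standard shuffle (the repository's ShuffleLayer may differ in details):

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups: zero FLOPs, pure memory traffic."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2)
    # transpose() only changes strides; .contiguous() materializes the
    # shuffled layout with a full copy of the tensor -- that copy is the
    # memory transfer discussed above.
    return x.contiguous().view(n, c, h, w)

x = torch.randn(8, 64, 32, 32)
y = channel_shuffle(x, groups=4)
assert y.shape == x.shape
```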
Thanks for the clarification. I already suspected that was the reason after I measured the time spent in the 1x1 bottleneck layer and the grouped 3x3 layer: the forward pass spends twice as much time in the 1x1 layer as in the 3x3 layer.
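A minimal sketch of how such a per-layer measurement can be made. CUDA kernels launch asynchronously, so CUDA events (or explicit synchronization) are needed to get meaningful numbers; the layer shapes below are illustrative, not taken from the model:

```python
import torch
import torch.nn as nn

def mean_forward_ms(layer: nn.Module, x: torch.Tensor, iters: int = 100) -> float:
    """Mean forward time in milliseconds, measured with CUDA events."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    with torch.no_grad():
        for _ in range(10):          # warm-up, excluded from timing
            layer(x)
        torch.cuda.synchronize()
        start.record()
        for _ in range(iters):
            layer(x)
        end.record()
        torch.cuda.synchronize()     # wait for all launched kernels
    return start.elapsed_time(end) / iters

# Illustrative shapes only (not the paper's configuration):
x = torch.randn(32, 128, 28, 28, device="cuda")
conv1x1 = nn.Conv2d(128, 128, kernel_size=1, groups=4, bias=False).cuda()
conv3x3 = nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=4, bias=False).cuda()

print(f"grouped 1x1: {mean_forward_ms(conv1x1, x):.3f} ms")
print(f"grouped 3x3: {mean_forward_ms(conv3x3, x):.3f} ms")
```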