-
Notifications
You must be signed in to change notification settings - Fork 577
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Nervana's Neon and Winograd #93
Comments
Sorry, the full blog post isn't quite done yet, so we put up a quick performance overview post instead. But do feel free to download the new neon and try out the kernels (and even browse the source if you're curious). I'm almost done with a new Winograd kernel that should speed things up quite a bit more for smaller 3x3 layers (like in googlenet). |
@soumith Why have the cuDNN R4 Googlenet-1 numbers changed? |
@andravin my copy-paste screwup. fixed. |
OK, thanks, I wasn't expecting the Neon Googlenet-1 speedup to decrease. ;-) So these numbers make more sense. |
I should point out that Andrew here not only worked through all the math for Winograd but our long discussions were pretty integral to the successful design of these kernels. Oh and while we're giving credit we were just discussing that these two guys probably deserve as much as Shmuel Winograd for working out the original math: https://en.wikipedia.org/wiki/Andrei_Toom I'll go into a bit more detail on this in the full blog. |
Thanks Scott and Andrew. We also see good gains for 3x3 with Winograd. |
That's great, Julien. Can't wait to see Winograd/Cook/Toom in cuDNN. :-) It has been almost a year since I discovered this new approach to convnet acceleration, and it is great to see these ideas having a real impact on performance now. Everybody should check out Scott's F(4x4,3x3) implementation, it is extremely clever. |
It is indeed amazing that you were able to get F(4x4, 3x3) to work. I'm really impressed because I know for a fact that F(2x2, 3x3) is already super hard :). I am really looking forward to making it work in cuDNN. |
@scott-gray Awesome work! The way you specialize warps in the F(4x4,3x3) kernel is just brilliant! I'm super excited and it's going to be fun to implement such a scheme for cuDNN :) and bring that speedup to the different frameworks. |
Now if you guys could put your heads together and figure out a way to end the NCHW vs CHWN vs NHWC wars. There must be some way to equip these kernels with pluggable front-ends and back-ends that understand the tensor order for load / store, and leaves the computation pipeline unchanged. |
I already have a fully fused version of that kernel that I should finish debugging this weekend. I'm hoping it will bring fprop/bprop performance for fp32 much closer to the fp16 level, as well as perform much more consistently with the size of the C/K dimensions. On the weight update side, fusion probably isn't possible due to the extremely strided memory access pattern required and no shared memory left for mitigating that. But 2 out of 3 fast operations isn't bad. In NHWC it would be the update operation that is fast and the other two slower. I guess there's a chance in NCHW that the overlaps in the super-tiling might make full fusion possible in update, but on the downside you're slower on fprop/bprop for smallish HW because your effective tile size needs to be much bigger and you end up with a lot of zero overlap. At very small N NCHW and CHWN are pretty equivalent. But, I'm confident that CHWN is fastest overall for Winograd. For direct conv, I'm starting to think that NHWC might be best for good performance across all minibatch sizes. There's plenty of shared memory around to efficiently transpose in place at no cost and having C as the inner dimension means that you minimize the slicing logic for all values of N and not just larger ones. But CHWN is just as good for N bigger than about 8 or so. Also, having HWN contiguous means that you can do 1x1 conv super efficiently in a basic gemm kernel. If I had to pick one, I'd stick with what I have: CHWN. But longer term it probably makes sense to have them all implemented to best suit the needs of the task. Speaking of longer term, it would be nice if the community migrated to a fully open sourced implementation for all of this. This stuff is just too important to the progress of the field for it to be locked away in proprietary implementations. The more people working together on this the better for everyone. There's plenty of room to compete on the hardware implementation side. |
The same goes for NHWC though, right? If I'm not mistaken, the order of these dimensions doesn't matter as long as they are contiguous. This is the TensorFlow default, I don't know if any other frameworks use it though. I think CHWN might not get adopted very easily, because everyone is used to having the leading dimension be the batch dimension nowadays (the only established framework I know of that deviates from this is cuda-convnet, which isn't used much anymore). |
@benanne: You're right, NHWC works fine with 1x1 (as does CHWN). The issue we're having with cuDNN with 1x1 is that NCHW has the C "in the middle". Today, our direct convolution is similar to 3x3 or 5x5 for 1x1 and we are having a complex logic that we could simplify for 1x1. Scott's CHWN is "easier" to deal with in many cases. We also suffer from the fact that our filters are KCRS when CRSK (used by Scott) would be better. On paper, NHWC has advantages over NCHW thanks to the fact that data is partly contiguous in memory. I'm only worried about the fact that NHWC could have a bad impact on the behavior of the TEX cache as fetching 8xFP16 (8 is the unrolling factor of the main loop - except for Scott's new F(4x4,3x3)) is only 16B and it's not so great with respect to cache line size. @andravin, @scott-gray: Indeed, I think we should sit together and find a way to get the awesome performance of Scott's implementations for CHWN available to popular frameworks. We'll all be at GTC, for example. Making it open-source is a long discussion ;) |
Right, I just meant grouped. So that's a plus for both NHWC and CHWN. You're right in that there is a lot of cuda code written for the cuDNN layout, and migrating away from that will likely be painful. But for some writing fresh code to a different layout might be a good option if they know they'll get a bit more speed. As I said, ideally you have the option for any layout. Anyway, if you guys are happy sticking with NCHW, then we'll be happy to continue topping you on the benchmarks :) The neon framework isn't burdened by any legacy code and everything is being built from the ground up for speed. And with the new graph backend we're working on hopefully we can substantially improve on the ease of use as well (not that it's too bad right now). |
@jdemouth You're not thinking creatively enough with leveraging shared memory to read deeper than 8 lines. You can cast any gemm or conv operation as a batched gemm and sum the results prior to writing out. The batch dimension in this case is just alternating groups of 8 rows. Happy to chat at GTC. I was looking forward to attending your talk. And I guess I should advertise my own talk here. I was scheduled for an hour but that was mysteriously shortened to just 25 minutes. So I wont be able to go into as much depth as I'd like. But on the other hand it makes preparing for it a lot easier, which means more time to be writing kernels. |
@scott-gray: Funny that you just mentioned that because I was thinking along those lines when you posted your comment :). I already have batched GEMM code for some scenarios. Not to mention my Winograd implementation for NCHW. The DL track is pretty packed, that's the reason why your slot was shortened from 50 to 25 minutes. Like all the other talks. My talk was even cancelled. |
Yah, I thought of the technique while developing the first winograd kernel. It's what I meant above in being able to leverage shared memory for in place transpose. I actually already have some fp16 32x32 gemm tiles that use this that get over 5Tflops with a minibatch of 32. The TN tile (col major) can even out perform the 128x128 tile because it halves the overall number of strided accesses to ddr at any one time. I haven't had a chance to release these yet since I need to still need to finish the complete set. Hopefully I'll get to that in the next week or so. I have new direct conv kernels I want to build first. The cuDNN advantage on small minibatch (for non 3x3s1) will soon be going away :) The goal is end to end training of convnets at very small minibatches at full utilization. |
Right -- what I was saying is that, if NHWC is almost as good as CHWN, the former might be adopted much more quickly. Because TensorFlow already uses it, and because many people would find it "more natural" to have the batch size as the leading dimension. |
I actually really like NHWC a lot, but it means I can't use my fancy new fully fused fprop/bprop F(4x4,3x3) with it. And I have a feeling the performance with it will be too good to throw away. The current partially fused kernels are trivial to convert to any layout. Just modify the external transform cuda-c code and then tweak a few lines in the batched gemm assembly for setting up the output pointers. |
@benanne: I agree with you... We want a layout which can be adopted by the community and which is good for performance. NCHW is widely adopted (and we are making it faster at each new release of cuDNN). CHWN is easier to deal with in many cases and Scott is pushing its performance to awesome levels but it is a somewhat weird layout. Maybe NHWC brings the best of both worlds together :). |
@scott-gray: What prevents you from using your new fused kernel with NHWC (except for the time to write it)? Is it a fetch issue due to your need for LDG.128? Btw, were the numbers quoted in the benchmark all obtained using F(4x4,3x3) or do you use F(2x2,3x3) for some of the layers? |
Here's another approach for tackling this "optimal layout" issue in frameworks using the computational graph paradigm (such as Theano and TensorFlow): stick with the canonical NCHW or NHWC layout on the surface, but have optimizations that insert alternative implementations using more efficient layouts, as well as the necessary reshape operations at the input and output side. Since many convolution and pooling operations usually follow each other (and elementwise nonlinearities are not affected by the layout), spurious reshapes can then be eliminated quite easily. |
An API is also an important requirement for adoption. That is the real reason cuDNN has been so successful, it defined a low level C API for deep learning primitives, and nobody else did. cuDNN is both the standard API and its only implementation. If Neon kernels were wrapped in the cuDNN API then it would be trivial to support them in your favorite framework (provided they are sane about allowing different tensor formats). Maybe the cuDNN API is not ideal, maybe we could do better. But coding to a standard API is key to providing fast kernels that framework maintainers can actually use. |
@jdemouth: I pick the best of both. Most of the time it's the 4x4 numbers. But for small HWN the 2x2 can be faster. Or I guess for small C/K too when the external transform isn't well amortized. But the fully fused kernel will solve that. The fused kernel has no available shared memory to do the in place transpose required of NHWC in fprop/bprop. For update, the data is laid out fine, and that kernel could be fused instead. But that's just 1 of 3 ops instead of 2/3. Plus you want fprop to be the fastest for use in inference. Also fusing update is much more problematic because there's a lot of predicates that need to be recomputed any time you change x or y. @benanne That's basically what I recommended to the TF guys. But you definitely want to avoid dimshuffles between every op. But I guess your point is that the graph optimizer should be smart enough to eliminate dimshuffles that cancel each other out. @andravin: I've wanted to put together an API but all of my time is devoted to writing kernels and just when I start to think things have stabilized enough to do this, someone comes along and asks you to implement some new fancy algorithm :) |
To elaborate on NHWC in fprop/bprop, the 2 image load warps are making 32*36 loads to distinct addresses, only 2 channels deep. That's way more than can fit in L1 so you end up fetching the same transaction 4 times and saturating both L2 and DDR traffic. |
A standard API for deep learning primitives would also mean that frameworks would be able to support any GPU or hardware platform that implements the API. The fact that none of us are even thinking about that is another symptom of our dangerous monoculture. |
An API has definitely been on my mind.. I just wanted to finish a complete set of kernels first. The only problem is that I keep changing the definition of complete. Anyway, I need to get some sleep. It's been nice chatting with you guys. |
Indeed, it was great chatting with all of you. Thanks. |
Oh, and another interesting constraint is batch norm. Reducing HWN is rather straight forward and fast with CHWN. It's just a reshape(C,-1).sum(axis=1). NCHW isn't too bad (but probably annoying). NHWC is a bit trickier to optimize as axis=0 reductions lead to expensive strided memory access patterns if done naively. Another interesting point on the Nervana kernels is that they all have the "mean" component of batchnorm optionally compounded directly inside of the conv kernel at not cost. Currently this is done with atomics but I have a deterministic way I want to change it to that should be just as fast. Many other common operations can be compounded inside of gemm/conv kernels. Incidentally, all the kernels can now be run in full deterministic mode with virtually no change in performance. Anyway, I'm looking forward to Soumith's new set of benchmarks. There's a ton of optimizations that we've made in neon that the current set just don't expose. |
Oh, and another thought. I recently wrote a very fast generalized dimshuffle (src) routine for neon that implements the full numpy.transpose spec. So if there is some custom kernel you want to write that is more natural in one format over another, then it's now easy to get that. And so long as you're not doing it on every layer there would be negligible impact to speed. For example, ROI pooling for RCNN networks is far easier to implement with NHWC. But even if you are using it a lot, it's about as fast as an fprop_relu op. |
@jdemouth Yes I'm only guessing if exist and what is the vendor neutral "lowest common denominator". This could still give a margin for hardware vendors to compete but also give some chance for an unified API for developer interested in a vendor neutral solution. |
I get your point. So far, I do not see the solution but having a higher level solution - if interesting for vendor neutrality - would surely help with development time and innovation. |
OpenVX was a good example for interface collaboration that involved many stakeholders (Nvidia included). But it has totally lost the occasions to cover deep learning needs in the actual release. In the meantime Google is trying to push an llvm subgroup for stream executor with "deep learning" canned operations. See last messages in tensorflow/tensorflow#22 and @henline bootstrap doc at https://github.com/henline/streamexecutordoc |
It is simply not possible to develop efficient dense linear algebra kernels with the current intermediate representations available (like ptx). That's not to say that an IR couldn't be developed that would make it possible. Pascal will be largely binary compatible with the Maxwell ISA, but when Volta rolls around I may adapt my assembler to let you target both architectures with one language. Though maybe it's not worth the effort since new hardware always frees up more resources making kernel design decisions very different. I guess the real key would be to have an IR that can target both Nvidia and AMD. I've read through the GCN spec and there's a lot of overlap, but again the differences are big enough to make a large impact in how you would design kernels. But still, having a common language would make development for both targets much easier and perhaps allow some code sharing. |
The common IR that target both NVIDIA and AMD is SPIR-V (and was co-designed). But seems that it is not enough for achieve this level of optimization. So if the biggest common multi stakeholders effort on a common IR is not enough I think that is better to extend standard API at higher level like pushing, in the next version of OpenVX, support for unfied tensor operations API that fits deep learning needs. |
Yes, the high level API approach looks more promising. |
I think there's still room for better abstractions at the lowest level. Deep learning is currently somewhere between 60-90% dense linear algebra of some kind. And the parts that aren't will soon largely be merged into the dense kernels for better efficiency. That doesn't leave a lot left for higher level gpu languages. Aside form simple operation compounding there may be other innovations we'll want to make with these dense kernels. We know the brain's connections are far from the simple feed forward nets currently in use. |
@jdemouth If you like the idea you can try to talk about this with Thierry Lepley. He is the Nvidia representative in the Khronos OpenVX standardization group. |
Sure. Thierry and I are both in France :) |
And I think that we could start to consider also the interesting results from binarizzation |
I plan on implementing a set of fast binary gemm and conv kernels as soon as I'm done with the small minibatch floating point kernels I'm working on. |
This start to become interesting also on CPU. /cc @xianyi |
@scott-gray |
@xianyi On AVX-512 I think that there is a XNOR instruction __mmask16 _mm512_kxnor? Do you think that this kind of operators could be included in openblas? |
@hengck23 The XNOR/POPC variety. Should be pretty easy to implement with the only real challenge being implementing padding in convolution. It will all just be cuda-c. I don't think there's a need for assembly optimization until we have a full throughput integrated xnor/popc/accumulate instruction |
A kernel under BSD it is at https://github.com/MatthieuCourbariaux/BinaryNet/blob/master/Run-time/binary_kernels.cu |
@bhack yup, I'm aware of that and am actually working with Matthieu on a much faster and more flexible version of that code. I just need to push through this other work first. Which, btw, is working out pretty well. I'm getting really good L1 cache utilization in CHWN for small N. This keeps the power levels lower and let's the clock run faster. Starting to think NWHC is really only good for N<4. Below 4 your slicing logic starts to eat into compute without NHWC. |
bhack wrote:
Interesting. Seems I should take a look at this... andrew wrote:
Agree with these points. Note: if you can allow me to pass a Scott wrote:
That sounds pretty fast. So on the whole the layout doesnt matter too much, can just shuffle before calling the convolutional implementation, if necessary? Scott wrote:
Ok, you mean, for SPIR-V, SPIR-V plausibly targets the PTX level, rather than the SASS level, and therefore cant get results much better than current CUDA and OpenCL high-level languages? |
Kronos maintain a bidirectional LLVM to SPIR-V translator. There are also some efforts inside Kronos on the OpenVX roadmap to support dnn. |
bhack wrote:
If I understand correctly, that will translate between OpenCL and SPIR-V? What advantage(s) do you see to doing this? At the moment, the source code I want to use is all written in sass I was thinking of porting it to OpenCL (admittedly saying 'only 1300 lines of sass' is a bit like saying nips only needs 8 pages ...), but maybe I should port it to SPIR-V instead??? Thoughts on how useable/mature/portable SPIR-V is right now? I dont know much beyond the name, and a one-line description... |
It will translate also compatibile LLVM bytecode in SPIR-V. |
@hughperkins If you want to generally take an overview of these differents IR you can see https://www.linkedin.com/pulse/era-intermediate-languages-vincent-hindriksen. I think that sass it is more at gcn level in the AMD world. But the majority of opinions here is that we cannot go so fast with an IR cause GPU assembly ninjas are still better than compilers to optimize 😃 |
Good information. Thanks! |
Probably I should target AMD IL or HSAIL. However, I lack an AMD GPU. So ... |
After serious perf improvements by NVIDIA's CUDNN R4 across board, I suppose Nervana weren't too happy to be left behind.
They've just released (as part of Neon) their Winograd-based kernels which have a non-trivial improvement in performance.
Their blog post can be found here, where Scott will be going into full detail about the technical implementation, challenges as well as data points showing no side-effects of using these kernels in terms of convergence. The implementation seems very sophisticated, and quite a challenge.
I've benchmarked them independently, and here are the numbers:
FP-16
FP-32
It's really cool that they're still squeezing out performance of this generation of hardware.
They seem to have real wins when the network uses 3x3 convolutions.
At this point, I expect that this is the last round of software optimizations for small convolutions, considering that they're hitting peak limits of the GPU, but happy to be surprised :)
Full logs are checked in to the nervana/ folder.
The text was updated successfully, but these errors were encountered: