-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Another question about wide_vadd sample #16
Comments
I'm not sure from your comment since you mention you converted from int to float, and then later refer to ints, but do keep in mind that prior to the Versal architecture FPGA fabric does not natively support floating point types. As a result we need to add logic around the operations and that incurs a latency penalty. But, looking at the numbers that still doesn't make sense. 810241024 samples divided by 24ms equates to around 7 clock cycles per add, which seems too high, and that also assumes that nothing is being done in parallel by the kernels. Can you check a few things?
|
Actually, sorry, I just realized you probably meant 8 * 1024 * 1024 or ~8M samples. That number works out approximately correctly to 1 sample per clock if you aren't taking advantage of parallelism in the kernel. So maybe you missed a pragma somewhere? |
Thanks very much, Rob.Yes, the sample amounts to essentially transferring two HD RGBA video frames, adding each pixel, My code is almost identical to your Here is the code: https://github.com/GrokImageCompression/latke/tree/master/tests/wide_vadd and there is a README file describing how I set up and build the kernel. I am attaching the vitis analyzer files, if you have time to take a look. Thanks again! |
Looking at the vitis guidance, it looks like the kernel write transfer size is only 64 bits, so it is not using the full 512 bit bus. |
It looks like the cause of the issue is that you're assigning the output value directly from the vadd_as_float() function, which implies that the kernel has to receive the AXI write response before it can process the next value. So it looks like it's working as coded/intended but you're spending most of your time in pipeline stalls. You can analyze that through the HLS analysis view, but the quick fix (note that I haven't had a chance to try it myself yet) would be to break this into three main blocks: read, compute, and write. The write block should be pulling from the previously-declared result_tmp array. As it stands you're passing in out[i+j] by reference and that really slows things down. Even doing something like:
improves the scheduling by a lot, at least in code analysis. I'm not 100% sure how that would translate to runtime efficiency without trying it but the way I outlined above should work. |
Thanks for taking a look! I have made these changes, so the kernel does burst read to local buffer, then compute, then burst write back to global memory. The vitis guidance on this kernel looks better. However, timing is unchanged - 53 ms for a single kernel, and 23 ms for 4 concurrent kernels. So, I must be still missing something...... I'm attaching the new vitis_analyzer outputs in case you have time to take a peek. Thanks again! |
There are two big things I see here in hardware emualation; one is that your FIFOs are drastically overprovisioned (occupancy doesn't exceed one in cosim, I set to 4 just to be safe) and two is that the processes are blocking about 66% of the time. That's mostly due to the way that you have a dataflow region inside nested for loops. It's a bit of an older practice (and I realize the example does this). It doesn't impact the Get Moving with Alveo code, but what happens in your case since there's a longer latency for the int to float conversion is that it basically reads 64 chunks, waits until all of the processing is done, reads another 64 chunks, waits, etc. You really want it to just read as fast as it can. Try simplifying it to something like this:
|
With that code above resource utilization is significantly lower (down to 90 BRAM, 32 DSP, ~17k FF) and in cosimulation I see we hit approximately line rate. HW should be roughly the same, maybe slightly lower with the external DDR latency. So you should be able to process all 8M floats in ~2 ms (1.7 by my back-of-the-envelope calculation). I'll also update the tutorial code to use this coding style since the new Vitis HLS front-end does better with it. |
Thanks again, Rob! really appreciate your help. I would love to get the time down to ~2 ms. I built the kernel with your changes, and the timing did go down, but it is still high. Is there anything else we can try for this kernel ? I've checked my changes in to my Latke project, btw. |
Here are my vitis analyzer files. By the way, I built this kernel with the |
This is bandwidth-bound; there are three 512-bit channels contending for the same DDR controller. That's ~19 GB/s/channel, which saturates the available bandwidth (you have almost 60 GB of traffic going to a DDR with a theoretical max bandwidth of ~27). So unfortunately it's fundamentally not going to get much faster and that was the purpose of the exercise, to point out that even though the fabric can keep up with this at the line rate the memory controller can't, so we are going to hit a speed limit. |
Thanks! In my case I am passing two 8 mega-float buffers into the kernel, and transferring one 8 mega-float buffer back, that should be a total of 64 MB in and 32 MB out. |
It doesn't really matter though; you're still transferring both input buffers at the same time over PCIe, the kernel is going to try to read them both at the same time, etc. The instantaneous memory pressure is still the same, it's just over a very short duration. |
Thanks, that makes sense. So, what is the best practice for doing high memory bandwidth compute on the 250 ? |
The best thing would be to split into multiple channels or between memory banks, but that can sometimes be difficult if you have to spread things out too much on the die. This is where HBM devices like the U50 excel since you're working with individual chiplets instead of a single combined channel. The same core is over 2x faster on HBM without any further optimization. In general, if you have DDR DIMMs they don't perform as well for applications that are I/O bound, it makes more sense to look for compute-bound things or algorithms that can leverage streaming to stay in the FPGA fabric like some of the vision examples. You can stack up operations "indefinitely" and they'll all complete in approximately the same block of time. |
Thanks, I do have access to a U50, so I will give it a try. For example: https://www.intopix.com/JPEG2000-uhd-4k-8k-encoder-decoder Is the U250 suitable for encoding at that data rate ? |
Absolutely; keep in mind that this is (on purpose) a bad example of how to use this, because the application is I/O bound. Video decompression (or compression) does not have this problem. Also at 60 FPS you have ~16 ms per frame for instance, which is generally about an 8MB data transfer. For reference the example that I linked to you that runs in ~23 ms on the U250 is processing 256 MB of data per channel, so it's reading 512 MB and writing 256 back to DDR in that amount of time. If it were not I/O bound, the operation would take ~7ms for the compute at 16 words per clock to handle this amount of data. Example 5 is better, because it shows overlapping of compute and transfer instead of doing everything linearly, but you still do generally want to focus on either pipelining operations (stacking things up without going to external memory) or on algorithms which are compute-bound instead of I/O bound. |
Thanks, that is good to know! I will take a look at Example 5, thanks. |
@rwarmstr got another question if you have time: I profiled the
wide_vadd
tutorial with some smallmodification to add float instead of integers. I also used 4 concurrent kernels.
Environment: U250, 48 core CPU, Ubuntu 18, latest XRT.
Timing for a single run of adding two buffers of size 810241024 ints was
24 ms
.This seems like a long time to simply add two buffers together - is there anything else I can
do to speed things up. For example, is there a way of harnessing the PLRAM on the board
to make it faster ?
Any insights here would be really appreciated.
The text was updated successfully, but these errors were encountered: