[Support]: Interrupts seem to be delivered only to the AF_XDP core rather than the core specified by IRQ affinity #334
Thank you for raising this issue; we will look into it and provide feedback soon.
Thank you, David! Another note: on the c5n.18xlarge instance, iperf with 32 connections (i.e., -P 32 --dualtest) is able to achieve around 190 Gbps of bandwidth with a 9k MTU; with a 3.5k MTU (i.e., the maximum MTU for AF_XDP), iperf is able to achieve around 183 Gbps. In comparison, AF_XDP with a 3.5k MTU can only achieve 150 Gbps. I suspect there might be some driver issues in the AF_XDP support. Btw, I disabled interrupt coalescing.
Best,
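(For reference, a minimal sketch of how such a comparison run could look; the interface name ens6, the server address, and the coalescing settings are placeholders, not necessarily what was used here:)

```
# Set the MTU under test (e.g. 9001 for the jumbo-frame run, ~3498 for the AF_XDP-compatible run).
sudo ip link set dev ens6 mtu 9001

# One common way to disable interrupt coalescing via ethtool (supported settings vary by driver).
sudo ethtool -C ens6 adaptive-rx off rx-usecs 0

# iperf2 bidirectional test with 32 parallel connections, as described above.
iperf -c 10.0.0.2 -P 32 --dualtest -t 30
```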
I am digging a bit into this, and I realize that the ENA driver might run the softirq for TX and RX inside the send/recv syscall, while the mlx5 driver is optimized to run the softirq inside the NIC hardware interrupt processing. If so, is there any way to optimize or configure the ENA driver to use the mlx5 manner?
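(A quick, driver-agnostic way to check this is to watch which CPUs the NET_RX/NET_TX softirq counters grow on; a sketch, with the core list as a placeholder:)

```
# Per-CPU softirq counters; the columns that keep increasing show where NAPI actually runs.
watch -d -n1 "grep -E 'NET_RX|NET_TX' /proc/softirqs"

# Optionally, profile only the application cores to see net_rx_action called from the syscall path.
sudo perf record -g -C 12-23 -- sleep 10
sudo perf report
```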
Hi, I'll start with what I can answer right now.
That depends on the HW generation. On newer multi-NUMA instances you can use the approach above (reading the numa_node sysfs file) to determine the NUMA node of the device.
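(A sketch of that approach; ens6 is a placeholder interface name:)

```
# NUMA node of the NIC as reported by the PCI device (-1 means the platform doesn't expose it).
cat /sys/class/net/ens6/device/numa_node

# Which CPUs belong to each NUMA node, for picking IRQ/application cores on the same node.
lscpu | grep -i numa
```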
Yes, it is also observed in my tests. The difference from mlx5 stems from the implementation of the wakeup command sent to the driver: mlx5 asks the device to raise an interrupt, so the napi handler runs on the core chosen by the IRQ affinity. ENA, on the other hand, doesn't currently have the ability to invoke an interrupt, and so it schedules the napi handler directly. At least as I see it, this does have the benefit of better performance, as waiting for an interrupt from the device just adds an additional step before napi is invoked. The benefit of the mlx5 approach, of course, is respecting the IRQ affinity. The issue can be relieved in one of the following ways:
1. Busy polling:
echo 50 | sudo tee /proc/sys/net/core/busy_poll
echo 50 | sudo tee /proc/sys/net/core/busy_read
2. IRQ deferral:
echo 2 | sudo tee /sys/class/net/ens6/napi_defer_hard_irqs
echo 200000 | sudo tee /sys/class/net/ens6/gro_flush_timeout

The second approach was tested by me and made the application thread run almost exclusively in userspace.
These configurations reduce the number of interrupts while retaining the same BW. In your use case, I don't think it matters much, as the IRQ cores are pretty free already.
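(One way to confirm the effect, a sketch with ens6 as a placeholder: after applying the deferral settings, the per-queue interrupt counters should grow much more slowly while throughput stays the same.)

```
# Per-queue interrupt counts for the device; compare before and after changing the settings.
watch -d -n1 "grep ens6 /proc/interrupts"
```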
I still owe you an answer for that, but it'd require me to write a new test application (unless you have one I can use (: )
@ShayAgros Thank you for the thorough response, that indeed helps a lot! Now I just map the IRQs to the app cores, and it does not impact performance. I also tried the "busy poll mechanism + IRQ deferring", but it gives very poor performance (~7 Gbps per core). I will polish and open-source my code soon, and will get back to you once I have a version that is easy to test.
Preliminary Actions
Driver Type
Linux kernel driver for Elastic Network Adapter (ENA)
Driver Tag/Commit
ena_linux_2.13.0
Custom Code
No
OS Platform and Distribution
Ubuntu 22.04.5 LTS (GNU/Linux 6.8.0-1015-aws x86_64)
Support request
Hi AWS driver maintainer,
I am using the AF_XDP zero-copy support in the AWS ENA driver to send and receive 100 Gbps traffic. I configure multiple NIC queues to receive interrupts, and bind each queue's IRQ affinity to a different core (through /proc/irq/$IRQ/smp_affinity_list). I also use another set of cores to run the user-space applications (which receive and send packets using the AF_XDP APIs).
However, from htop, I find that the IRQ cores (i.e., the cores specified by the IRQ affinity) consume almost no CPU time, while the application cores spend around half of their CPU time in the kernel (i.e., the red bars in htop). From perf, the application cores spend significant time in syscalls like __lib_sendto and also in net_rx_action. So it seems that the NIC interrupts are handled by the application cores.
I am worried that this frequent context switching between user and kernel space causes poor networking performance. For example, when using a Mellanox ConnectX-5 100G NIC with the mlx5 driver, my AF_XDP application only requires 4 application cores and 4 IRQ cores to saturate bidirectional 200G traffic; on these 4 application cores, nearly no CPU time is spent in the kernel, while the 4 IRQ cores spend significant time in the kernel handling interrupts. When using a c5n.18xlarge with the AWS 100G NIC and the ENA driver, I need to use 12 application cores and 12 IRQ cores (i.e., 12 NIC queues) to saturate only 150G of traffic.
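(For context, a minimal sketch of the per-queue IRQ pinning described above; the interface name ens6 and the core list are placeholders:)

```
# Pin each of the NIC's queue IRQs to its own dedicated core.
IRQ_CORES=(0 1 2 3 4 5 6 7 8 9 10 11)
i=0
for irq in $(grep ens6 /proc/interrupts | awk -F: '{print $1}' | tr -d ' '); do
    echo "${IRQ_CORES[$i]}" | sudo tee /proc/irq/$irq/smp_affinity_list
    i=$((i+1))
done
```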
Another short question is: how do I determine the NUMA affinity of the ENA NIC? The c5n.18xlarge has two NUMA nodes, but I get -1 from both /sys/bus/pci/devices/<PCI_device_ID>/numa_node and /sys/class/net/<nic_dev>/device/numa_node.
Best,
Yang
Contact Details
No response