
[Support]: Interrupts seem to be delivered only to the AF_XDP core rather than the core specified by IRQ affinity #334

Open
YangZhou1997 opened this issue Jan 5, 2025 · 5 comments
Labels
Linux ENA driver · support (Ask a question or request support) · triage (Determine the priority and severity)

Comments

@YangZhou1997

YangZhou1997 commented Jan 5, 2025

Preliminary Actions

Driver Type

Linux kernel driver for Elastic Network Adapter (ENA)

Driver Tag/Commit

ena_linux_2.13.0

Custom Code

No

OS Platform and Distribution

Ubuntu 22.04.5 LTS (GNU/Linux 6.8.0-1015-aws x86_64)

Support request

Hi AWS driver maintainer,

I am using the AF_XDP zero-copy support in the AWS ENA driver to send and receive 100 Gbps traffic. I configure multiple NIC queues to receive interrupts and bind each queue's IRQ affinity to a different core (through /proc/irq/$IRQ/smp_affinity_list). I also use a separate set of cores to run the user-space applications (which receive and send packets using the AF_XDP APIs).
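For reference, this is roughly how I pin the per-queue IRQs; it is only a sketch, where the interface name ens6, the core list, and the grep on /proc/interrupts are placeholders for my actual setup:

NIC=ens6
IRQ_CORES=(0 1 2 3)   # cores reserved for interrupt handling (example values)
i=0
# assume each per-queue IRQ line in /proc/interrupts contains the interface name
for IRQ in $(grep "${NIC}" /proc/interrupts | awk -F: '{print $1}' | tr -d ' '); do
    CORE=${IRQ_CORES[$((i % ${#IRQ_CORES[@]}))]}
    echo ${CORE} | sudo tee /proc/irq/${IRQ}/smp_affinity_list
    i=$((i + 1))
done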

However, from htop, I find that the IRQ cores (i.e., the cores specified by IRQ affinity) consume almost no CPU time, while the application cores spend around half of their CPU time in the kernel (i.e., red bars in htop). From perf, the application cores spend significant time in syscalls like __libc_sendto and also in net_rx_action. So it seems that the NIC interrupts are being handled by the application cores.

I am worried that this frequent context switching between user and kernel space causes poor networking performance. For example, with a Mellanox ConnectX-5 100G NIC and the mlx5 driver, my AF_XDP applications need only 4 application cores and 4 IRQ cores to saturate bidirectional 200G traffic; the 4 application cores spend nearly no CPU time in the kernel, while the 4 IRQ cores spend significant time in the kernel handling interrupts. On a c5n.18xlarge with the AWS 100G NIC and the ENA driver, I need 12 application cores and 12 IRQ cores (i.e., 12 NIC queues) to saturate only 150G traffic.

Another short question: how can I determine the NUMA affinity of the ENA NIC? c5n.18xlarge has two NUMA nodes, but I get -1 from both /sys/bus/pci/devices/<PCI_device_ID>/numa_node and /sys/class/net/<nic_dev>/device/numa_node.
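For completeness, these are roughly the commands I used to check; the PCI address 0000:00:06.0 and interface name ens6 are placeholders:

cat /sys/class/net/ens6/device/numa_node          # prints -1 on my instance
cat /sys/bus/pci/devices/0000:00:06.0/numa_node   # same result
lscpu | grep NUMA                                 # shows the two NUMA nodes and their CPU ranges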

Best,
Yang

Contact Details

No response

@YangZhou1997 added the support (Ask a question or request support) and triage (Determine the priority and severity) labels on Jan 5, 2025
@davidarinzon
Contributor

Hi @YangZhou1997

Thank you for raising this issue; we will look into it and provide feedback soon.

@YangZhou1997
Author

Thank you, David!

Another note: on the c5n.18xlarge instance, iperf with 32 connections (i.e., -P 32 --dualtest) achieves around 190 Gbps with a 9k MTU; with a 3.5k MTU (i.e., the maximum MTU for AF_XDP), iperf achieves around 183 Gbps. In comparison, AF_XDP with a 3.5k MTU can only achieve 150 Gbps. I suspect there might be some driver issue with the AF_XDP support.

Btw, I disable interrupt coalescing with sudo ethtool -C ${NIC} adaptive-rx off rx-usecs 0 tx-usecs 0, as I found it typically does not help much.
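For reference, I check that the setting took effect with ethtool's lowercase -c option (NIC is my interface name):

sudo ethtool -c ${NIC}   # show the current interrupt coalescing settings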

Best,
Yang

@YangZhou1997
Author

I have been digging into this a bit, and it looks like the ENA driver might run the softirq for TX and RX inside the send/recv syscall, while the mlx5 driver arranges for the softirq to run as part of the NIC hardware interrupt processing. If so, is there any way to optimize or configure the ENA driver to behave like mlx5?

@ShayAgros
Contributor

Hi,
This is probably just my first comment, as I need more time to dig deeper into this (as well as to prepare test programs that simulate your setup).

I'll start with what I can answer right now.

Another short question: how can I determine the NUMA affinity of the ENA NIC? c5n.18xlarge has two NUMA nodes, but I get -1 from both /sys/bus/pci/devices/<PCI_device_ID>/numa_node and /sys/class/net/<nic_dev>/device/numa_node.

That depends on the HW generation. On newer multi-NUMA instances you can use the approach above to determine the NUMA node of the device.
On c5n.18xlarge, however, this is not supported.
I can tell you that on the current c5n.18xlarge HW provided by AWS, the NUMA node of the device is 0 (this might change in the future, though; hopefully by then you'll be able to query the correct NUMA node).

However, from htop, I find that the IRQ cores (i.e., the cores specified by IRQ affinity) consume almost no CPU time, while the application cores spend around half of their CPU time in the kernel (i.e., red bars in htop).

Yes, I see this in my tests as well. The difference from mlx5 stems from how the wakeup request sent to the driver is implemented.
When asked to poll for new packets (e.g. via the sendto syscall), the mlx5 driver sends a command to the underlying device to raise an interrupt. This allows it to retain the existing IRQ affinity.

ENA, on the other hand, doesn't currently have the ability to raise an interrupt on demand, so it schedules the NAPI handler directly. As I see it, this has the benefit of better performance, since waiting for an interrupt from the device just adds an extra step before invoking NAPI. The benefit of the mlx5 approach, of course, is that it respects the IRQ affinity.

The issue can be mitigated in one of the following ways:

  • Pin an application thread to the same CPU as the IRQ; that thread is then the one responsible for waking the driver to start polling.
  • Use the busy-poll mechanism together with IRQ deferral instead of manually asking for a poll. For example, using these unoptimized settings (I didn't fine-tune the values; a quick way to verify what was applied is sketched after this list):
echo 50 | sudo tee /proc/sys/net/core/busy_poll
echo 50 | sudo tee /proc/sys/net/core/busy_read
echo 2 | sudo tee /sys/class/net/ens6/napi_defer_hard_irqs
echo 200000 | sudo tee /sys/class/net/ens6/gro_flush_timeout
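The applied values can be read back as follows (ens6 is just the interface name on my test instance; substitute your ENA interface):

cat /proc/sys/net/core/busy_poll /proc/sys/net/core/busy_read
cat /sys/class/net/ens6/napi_defer_hard_irqs
cat /sys/class/net/ens6/gro_flush_timeout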

I tested the second approach, and it made the application thread run almost exclusively in userspace.
A solution similar to mlx5's might be adopted in the future, but it is not currently planned.

Btw, I disable interrupt coalescing with sudo ethtool -C ${NIC} adaptive-rx off rx-usecs 0 tx-usecs 0, as I found it typically does not help much.

These settings reduce the number of interrupts while retaining the same BW. In your use case, I don't think it matters much, as the IRQ cores are fairly idle already.

Another note: on the c5n.18xlarge instance, iperf with 32 connections (i.e., -P 32 --dualtest) achieves around 190 Gbps with a 9k MTU; with a 3.5k MTU (i.e., the maximum MTU for AF_XDP), iperf achieves around 183 Gbps. In comparison, AF_XDP with a 3.5k MTU can only achieve 150 Gbps. I suspect there might be some driver issue with the AF_XDP support.

I still owe you an answer on this, but it'd require me to write a new test application (unless you have one I can use (: )

@YangZhou1997
Author

YangZhou1997 commented Jan 7, 2025

@ShayAgros Thank you for the thorough response, that indeed helps a lot! Now I just map the IRQs to the app cores, and it does not impact performance at all. I also tried the busy-poll mechanism with IRQ deferral, but it gives very poor performance (~7 Gbps per core).

I will polish and open-source my code soon, and will get back to you once I have a version that is easy to test with.
