XDP benchmark DROP

Establish which XDP_DROP test tool we use

Tool: xdp1 is way too simple (and annoying)

The xdp1 tool is the lowest-overhead tool available, but it is annoying to use: both the output format and the parameters are awkward. The parameter it takes is the ifindex (and not the interface name).

A cmdline hack can be used to look up the ifindex by name:

sudo ./xdp1 $(</sys/class/net/mlx5p2/ifindex)
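
For completeness, the same lookup can also be done programmatically with if_nametoindex() from <net/if.h>; a minimal sketch (not part of the xdp1 tool itself):

#include <stdio.h>
#include <net/if.h>	/* if_nametoindex() */

int main(int argc, char **argv)
{
	unsigned int ifindex;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <ifname>\n", argv[0]);
		return 1;
	}
	/* Same value as /sys/class/net/<ifname>/ifindex */
	ifindex = if_nametoindex(argv[1]);
	if (ifindex == 0) {
		perror("if_nametoindex");
		return 1;
	}
	printf("%u\n", ifindex);
	return 0;
}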

Tool: xdp_rxq_info with --action XDP_DROP

The xdp_rxq_info tool is available in the kernel tree (samples/bpf).

It has the advantage of showing stats per CPU and per RX-queue, which is very practical for the multi-flow multi-CPU tests, as it allows us to quickly spot when the RSS flow-hash distribution is off/wrong.

The name ‘xdp_rxq_info’ is misleading for a drop test, but it has a parameter --action that allows us to turn it into an XDP_DROP test.
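
Conceptually, the per-CPU/per-RXQ accounting combined with the drop works roughly as in the sketch below. This is not the actual xdp_rxq_info source, just the idea: the XDP program reads ctx->rx_queue_index, bumps a per-CPU counter keyed by queue number, and returns the configured action (hard-coded to XDP_DROP here).

/* Sketch of the idea behind xdp_rxq_info --action XDP_DROP
 * (not the actual sample source): count packets per RX-queue in a
 * per-CPU array and drop every packet. */
#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") rxq_cnt = {
	.type        = BPF_MAP_TYPE_PERCPU_ARRAY,
	.key_size    = sizeof(__u32),
	.value_size  = sizeof(__u64),
	.max_entries = 64,	/* assumed upper bound on RX-queues */
};

SEC("xdp_drop_per_rxq")
int xdp_drop_per_rxq_prog(struct xdp_md *ctx)
{
	__u32 key = ctx->rx_queue_index; /* HW RX-queue this packet arrived on */
	__u64 *cnt = bpf_map_lookup_elem(&rxq_cnt, &key);

	if (cnt)
		(*cnt)++;	/* per-CPU value, no atomics needed */

	return XDP_DROP;
}

char _license[] SEC("license") = "GPL";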

Tool: xdp_bench01_mem_access_cost

The xdp_bench01_mem_access_cost tool is part of my GitHub repo prototype-kernel.

It is specifically written for benchmarking the different XDP actions via the --action parameter. Plus, it has the ability to make sure packet data is actually “read”.

https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/samples/bpf

Its implementation of the XDP_TX action is also more correct than xdp_rxq_info's, as it can do a --swapmac.
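
The “read” part matters because a plain XDP_DROP never touches the packet payload, so the cache-miss cost of accessing packet data is not part of the measurement. A minimal sketch of how such a data touch can be forced (my illustration, not the actual tool source):

/* Sketch (not the actual xdp_bench01_mem_access_cost source): read
 * one byte of packet data, with the bounds check the verifier
 * requires, so the benchmark pays the cost of touching the payload. */
#include <linux/bpf.h>
#include "bpf_helpers.h"

SEC("xdp_drop_read")
int xdp_drop_read_prog(struct xdp_md *ctx)
{
	__u8 *data     = (void *)(long)ctx->data;
	__u8 *data_end = (void *)(long)ctx->data_end;

	if (data + 1 > data_end)	/* bounds check required by the verifier */
		return XDP_ABORTED;

	/* The comparison keeps the load from being optimized away; for
	 * unicast test traffic the first byte is never 0xFF (broadcast). */
	if (data[0] == 0xFF)
		return XDP_ABORTED;

	return XDP_DROP;
}

char _license[] SEC("license") = "GPL";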

Jesper01: XDP benchmark drop

Choosing a kernel version or git tree: kernel v4.18 has not been released yet. Today <2018-06-07 Thu> we are at the beginning of the merge window for v4.18; the git tree net-next has just been merged by Linus Torvalds.

Thus, Linus's tree is in merge flux at the moment, but net-next and bpf-next are “closed”, and thus in a more stable state (which is unusual).

The base kernel is compiled from bpf-next at commit 75d4e704fa8d (“netdev-FAQ: clarify DaveM’s position for stable backports”).

Kernel version: 4.17.0-rc7-bpf-next-xdp-paper01-02308-g75d4e704fa8d

Issue: the mlx5 redirect patches do not apply to bpf-next. Update <2018-06-07 Thu>: fixed up the mlx5 redirect patches.

Created a kernel.org git branch named: xdp_paper01 https://git.kernel.org/pub/scm/linux/kernel/git/hawk/net-next-xdp.git/?h=xdp_paper01

NIC: mlx5 - ConnectX-5

mlx5: single core

Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP
XDP stats       CPU     pps         issue-pps  
XDP-RX CPU      0       25,583,300  0          
XDP-RX CPU      total   25,583,300 

RXQ stats       RXQ:CPU pps         issue-pps  
rx_queue_index    4:0   25,583,303  0          
rx_queue_index    4:sum 25,583,303 

mlx5: multi core

setup notes

$ uname -a
Linux broadwell 4.17.0-rc7-bpf-next-xdp-paper01-02308-g75d4e704fa8d #26 SMP PREEMPT

$ ethtool --show-priv-flags mlx5p1
Private flags for mlx5p1:
rx_cqe_moder   : on
tx_cqe_moder   : off
rx_cqe_compress: off
rx_striding_rq : off

t-rex setup

t-rex script:

~/git/xdp-paper/benchmarks/udp_for_benchmarks.py

t-rex cmdline:

start -f /home/jbrouer/git/xdp-paper/benchmarks/udp_for_benchmarks.py -t packet_len=64,stream_count=XX --port 0 -m 100mpps

bench

stream_count=12 XDP_DROP total: 86,338,684

Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP
XDP stats       CPU     pps         issue-pps  
XDP-RX CPU      0       14,390,053  0          
XDP-RX CPU      1       14,390,081  0          
XDP-RX CPU      2       14,389,045  0          
XDP-RX CPU      3       14,390,277  0          
XDP-RX CPU      4       14,390,219  0          
XDP-RX CPU      5       14,389,006  0          
XDP-RX CPU      total   86,338,684 

RXQ stats       RXQ:CPU pps         issue-pps  
rx_queue_index    0:0   14,390,090  0          
rx_queue_index    0:sum 14,390,090 
rx_queue_index    1:1   14,390,095  0          
rx_queue_index    1:sum 14,390,095 
rx_queue_index    2:2   14,389,054  0          
rx_queue_index    2:sum 14,389,054 
rx_queue_index    3:3   14,390,270  0          
rx_queue_index    3:sum 14,390,270 
rx_queue_index    4:4   14,390,166  0          
rx_queue_index    4:sum 14,390,166 
rx_queue_index    5:5   14,389,022  0          
rx_queue_index    5:sum 14,389,022 

Changing the number of cores receiving traffic by adjusting stream_count.

stream_count=1 XDP_DROP total: 25,572,977

Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP
XDP stats       CPU     pps         issue-pps  
XDP-RX CPU      0       25,572,977  0          
XDP-RX CPU      total   25,572,977 

RXQ stats       RXQ:CPU pps         issue-pps  
rx_queue_index    0:0   25,572,974  0          
rx_queue_index    0:sum 25,572,974 

stream_count=2 XDP_DROP total: 51,907,348

Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP
XDP stats       CPU     pps         issue-pps  
XDP-RX CPU      0       25,359,525  0          
XDP-RX CPU      1       26,547,822  0          
XDP-RX CPU      total   51,907,348 

RXQ stats       RXQ:CPU pps         issue-pps  
rx_queue_index    0:0   25,359,524  0          
rx_queue_index    0:sum 25,359,524 
rx_queue_index    1:1   26,547,829  0          
rx_queue_index    1:sum 26,547,829 

stream_count=3 XDP_DROP total: 75,530,250

Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP
XDP stats       CPU     pps         issue-pps  
XDP-RX CPU      0       25,041,439  0          
XDP-RX CPU      1       25,243,786  0          
XDP-RX CPU      2       25,245,025  0          
XDP-RX CPU      total   75,530,250 

RXQ stats       RXQ:CPU pps         issue-pps  
rx_queue_index    0:0   25,041,446  0          
rx_queue_index    0:sum 25,041,446 
rx_queue_index    1:1   25,243,788  0          
rx_queue_index    1:sum 25,243,788 
rx_queue_index    2:2   25,245,037  0          
rx_queue_index    2:sum 25,245,037 

stream_count=4 XDP_DROP total: 86,521,177

Notice that at stream_count=4, the CPUs start to have idle cycles (the generator tops out around 86 Mpps in total).

Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP
XDP stats       CPU     pps         issue-pps  
XDP-RX CPU      0       21,627,817  0          
XDP-RX CPU      1       21,630,688  0          
XDP-RX CPU      2       21,631,349  0          
XDP-RX CPU      3       21,631,321  0          
XDP-RX CPU      total   86,521,177 

RXQ stats       RXQ:CPU pps         issue-pps  
rx_queue_index    0:0   21,627,817  0          
rx_queue_index    0:sum 21,627,817 
rx_queue_index    1:1   21,630,690  0          
rx_queue_index    1:sum 21,630,690 
rx_queue_index    2:2   21,631,359  0          
rx_queue_index    2:sum 21,631,359 
rx_queue_index    3:3   21,631,227  0          
rx_queue_index    3:sum 21,631,227 

stream_count=5 XDP_DROP total: 86,837,876

With more idle cycles.

Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP
XDP stats       CPU     pps         issue-pps  
XDP-RX CPU      0       17,364,174  0          
XDP-RX CPU      1       17,368,545  0          
XDP-RX CPU      2       17,368,884  0          
XDP-RX CPU      3       17,368,908  0          
XDP-RX CPU      4       17,367,363  0          
XDP-RX CPU      total   86,837,876 

RXQ stats       RXQ:CPU pps         issue-pps  
rx_queue_index    0:0   17,364,143  0          
rx_queue_index    0:sum 17,364,143 
rx_queue_index    1:1   17,368,530  0          
rx_queue_index    1:sum 17,368,530 
rx_queue_index    2:2   17,368,816  0          
rx_queue_index    2:sum 17,368,816 
rx_queue_index    3:3   17,368,884  0          
rx_queue_index    3:sum 17,368,884 
rx_queue_index    4:4   17,367,366  0          
rx_queue_index    4:sum 17,367,366 

stream_count=6 XDP_DROP total: 86,809,556

With more idle cycles.

Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP
XDP stats       CPU     pps         issue-pps  
XDP-RX CPU      0       14,468,490  0          
XDP-RX CPU      1       14,468,507  0          
XDP-RX CPU      2       14,468,888  0          
XDP-RX CPU      3       14,468,750  0          
XDP-RX CPU      4       14,467,744  0          
XDP-RX CPU      5       14,467,175  0          
XDP-RX CPU      total   86,809,556 

RXQ stats       RXQ:CPU pps         issue-pps  
rx_queue_index    0:0   14,468,463  0          
rx_queue_index    0:sum 14,468,463 
rx_queue_index    1:1   14,468,470  0          
rx_queue_index    1:sum 14,468,470 
rx_queue_index    2:2   14,468,916  0          
rx_queue_index    2:sum 14,468,916 
rx_queue_index    3:3   14,468,746  0          
rx_queue_index    3:sum 14,468,746 
rx_queue_index    4:4   14,467,752  0          
rx_queue_index    4:sum 14,467,752 
rx_queue_index    5:5   14,467,191  0          
rx_queue_index    5:sum 14,467,191 

stream_count=7 XDP_DROP total: 85,095,736

Now we are running out of CPUs (we only have 6, as HT is disabled). With 7 flows hashed over 6 RX queues, at least two flows must land on the same queue; in this example CPU-2 gets the extra flow, no longer has any idle cycles, and handles/drops 24,313,750 pps.
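
Conceptually, RSS picks the RX-queue by hashing the flow tuple and using the hash to index an indirection table, so this kind of collision is expected once the flow count exceeds the queue count. A sketch of the concept (not mlx5 driver code; the names are made up):

/* Conceptual RSS queue selection (illustration only, not driver
 * code): each flow hash indexes an indirection table whose entries
 * name an RX-queue.  With 7 distinct flows and 6 queues, at least
 * two flows must end up on the same queue. */
static unsigned int rss_select_rxq(unsigned int flow_hash,
				   const unsigned char *indir_table,
				   unsigned int indir_size)
{
	return indir_table[flow_hash % indir_size];
}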

Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP
XDP stats       CPU     pps         issue-pps  
XDP-RX CPU      0       12,156,595  0          
XDP-RX CPU      1       12,154,906  0          
XDP-RX CPU      2       24,313,750  0          
XDP-RX CPU      3       12,155,349  0          
XDP-RX CPU      4       12,158,029  0          
XDP-RX CPU      5       12,157,106  0          
XDP-RX CPU      total   85,095,736 

RXQ stats       RXQ:CPU pps         issue-pps  
rx_queue_index    0:0   12,156,625  0          
rx_queue_index    0:sum 12,156,625 
rx_queue_index    1:1   12,154,888  0          
rx_queue_index    1:sum 12,154,888 
rx_queue_index    2:2   24,313,738  0          
rx_queue_index    2:sum 24,313,738 
rx_queue_index    3:3   12,155,287  0          
rx_queue_index    3:sum 12,155,287 
rx_queue_index    4:4   12,158,076  0          
rx_queue_index    4:sum 12,158,076 
rx_queue_index    5:5   12,157,174  0          
rx_queue_index    5:sum 12,157,174 

stream_count=8 XDP_DROP total: 86,484,755

All CPUs have idle cycles, but some less than others, e.g. CPU-2 has 6.8% idle and CPU-3 has 10.2% idle.

Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP
XDP stats       CPU     pps         issue-pps  
XDP-RX CPU      0       10,811,300  0          
XDP-RX CPU      1       10,811,547  0          
XDP-RX CPU      2       21,623,304  0          
XDP-RX CPU      3       21,622,057  0          
XDP-RX CPU      4       10,805,394  0          
XDP-RX CPU      5       10,811,152  0          
XDP-RX CPU      total   86,484,755 

RXQ stats       RXQ:CPU pps         issue-pps  
rx_queue_index    0:0   10,811,291  0          
rx_queue_index    0:sum 10,811,291 
rx_queue_index    1:1   10,811,570  0          
rx_queue_index    1:sum 10,811,570 
rx_queue_index    2:2   21,623,306  0          
rx_queue_index    2:sum 21,623,306 
rx_queue_index    3:3   21,622,064  0          
rx_queue_index    3:sum 21,622,064 
rx_queue_index    4:4   10,805,406  0          
rx_queue_index    4:sum 10,805,406 
rx_queue_index    5:5   10,811,027  0          
rx_queue_index    5:sum 10,811,027 

stream_count=X XDP_DROP total:


Possible PCIe limit?

Tariq expects to see rx_discards_phy when PCIe causes backpressure.

Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP
XDP stats       CPU     pps         issue-pps  
XDP-RX CPU      0       14,417,582  0          
XDP-RX CPU      1       14,431,145  0          
XDP-RX CPU      2       14,434,436  0          
XDP-RX CPU      3       14,417,894  0          
XDP-RX CPU      4       14,417,705  0          
XDP-RX CPU      5       14,419,834  0          
XDP-RX CPU      total   86,538,599 

RXQ stats       RXQ:CPU pps         issue-pps  
rx_queue_index    0:0   14,417,582  0          
rx_queue_index    0:sum 14,417,582 
rx_queue_index    1:1   14,431,116  0          
rx_queue_index    1:sum 14,431,116 
rx_queue_index    2:2   14,434,409  0          
rx_queue_index    2:sum 14,434,409 
rx_queue_index    3:3   14,417,894  0          
rx_queue_index    3:sum 14,417,894 
rx_queue_index    4:4   14,417,702  0          
rx_queue_index    4:sum 14,417,702 
rx_queue_index    5:5   14,419,839  0          
rx_queue_index    5:sum 14,419,839 

Notice the ethtool stats:

 Ethtool(mlx5p1 ) stat: 98233252 ( 98,233,252) <= rx_packets_phy /sec
 Ethtool(mlx5p1 ) stat: 12048409 ( 12,048,409) <= rx_discards_phy /sec
 Ethtool(mlx5p1 ) stat: 86188489 ( 86,188,489) <= rx_prio0_packets /sec

 98233252 - 12048409 = 86184843 ( 86,184,843)
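
Sanity check on the link side (derived from the numbers above): at 64B frames (84 bytes on the wire) the received rate corresponds to

  • 98233252 * 84 * 8 = 66.0 Gbit/s

so the 100G link itself is far from its limit; the ~12 Mpps in rx_discards_phy is traffic the MAC counted but discarded before delivery, consistent with PCIe/host backpressure being the bottleneck.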

Show adapter(s) (mlx5p1) statistics (ONLY that changed!)
Ethtool(mlx5p1  ) stat:        34585 (         34,585) <= ch0_arm /sec
Ethtool(mlx5p1  ) stat:        34585 (         34,585) <= ch0_events /sec
Ethtool(mlx5p1  ) stat:       238494 (        238,494) <= ch0_poll /sec
Ethtool(mlx5p1  ) stat:        33807 (         33,807) <= ch1_arm /sec
Ethtool(mlx5p1  ) stat:        33807 (         33,807) <= ch1_events /sec
Ethtool(mlx5p1  ) stat:       237339 (        237,339) <= ch1_poll /sec
Ethtool(mlx5p1  ) stat:        34512 (         34,512) <= ch2_arm /sec
Ethtool(mlx5p1  ) stat:        34512 (         34,512) <= ch2_events /sec
Ethtool(mlx5p1  ) stat:       238103 (        238,103) <= ch2_poll /sec
Ethtool(mlx5p1  ) stat:        34668 (         34,668) <= ch3_arm /sec
Ethtool(mlx5p1  ) stat:        34668 (         34,668) <= ch3_events /sec
Ethtool(mlx5p1  ) stat:       238448 (        238,448) <= ch3_poll /sec
Ethtool(mlx5p1  ) stat:        34585 (         34,585) <= ch4_arm /sec
Ethtool(mlx5p1  ) stat:        34585 (         34,585) <= ch4_events /sec
Ethtool(mlx5p1  ) stat:       238680 (        238,680) <= ch4_poll /sec
Ethtool(mlx5p1  ) stat:        33943 (         33,943) <= ch5_arm /sec
Ethtool(mlx5p1  ) stat:        33944 (         33,944) <= ch5_events /sec
Ethtool(mlx5p1  ) stat:       237098 (        237,098) <= ch5_poll /sec
Ethtool(mlx5p1  ) stat:       206103 (        206,103) <= ch_arm /sec
Ethtool(mlx5p1  ) stat:       206103 (        206,103) <= ch_events /sec
Ethtool(mlx5p1  ) stat:      1428189 (      1,428,189) <= ch_poll /sec
Ethtool(mlx5p1  ) stat:            1 (              1) <= outbound_pci_stalled_wr_events /sec
Ethtool(mlx5p1  ) stat:     14334364 (     14,334,364) <= rx0_cache_reuse /sec
Ethtool(mlx5p1  ) stat:     14334335 (     14,334,335) <= rx0_xdp_drop /sec
Ethtool(mlx5p1  ) stat:     14362141 (     14,362,141) <= rx1_cache_reuse /sec
Ethtool(mlx5p1  ) stat:     14362141 (     14,362,141) <= rx1_xdp_drop /sec
Ethtool(mlx5p1  ) stat:     14362118 (     14,362,118) <= rx2_cache_reuse /sec
Ethtool(mlx5p1  ) stat:     14362118 (     14,362,118) <= rx2_xdp_drop /sec
Ethtool(mlx5p1  ) stat:     14338843 (     14,338,843) <= rx3_cache_reuse /sec
Ethtool(mlx5p1  ) stat:     14338841 (     14,338,841) <= rx3_xdp_drop /sec
Ethtool(mlx5p1  ) stat:     14356201 (     14,356,201) <= rx4_cache_reuse /sec
Ethtool(mlx5p1  ) stat:     14356183 (     14,356,183) <= rx4_xdp_drop /sec
Ethtool(mlx5p1  ) stat:     14359375 (     14,359,375) <= rx5_cache_reuse /sec
Ethtool(mlx5p1  ) stat:     14359405 (     14,359,405) <= rx5_xdp_drop /sec
Ethtool(mlx5p1  ) stat:     98233801 (     98,233,801) <= rx_64_bytes_phy /sec
Ethtool(mlx5p1  ) stat:   6286927774 (  6,286,927,774) <= rx_bytes_phy /sec
Ethtool(mlx5p1  ) stat:     86114457 (     86,114,457) <= rx_cache_reuse /sec
Ethtool(mlx5p1  ) stat:     12048409 (     12,048,409) <= rx_discards_phy /sec
Ethtool(mlx5p1  ) stat:        69718 (         69,718) <= rx_out_of_buffer /sec
Ethtool(mlx5p1  ) stat:     98233252 (     98,233,252) <= rx_packets_phy /sec
Ethtool(mlx5p1  ) stat:   6287027905 (  6,287,027,905) <= rx_prio0_bytes /sec
Ethtool(mlx5p1  ) stat:     86188489 (     86,188,489) <= rx_prio0_packets /sec
Ethtool(mlx5p1  ) stat:   5171087787 (  5,171,087,787) <= rx_vport_unicast_bytes /sec
Ethtool(mlx5p1  ) stat:     86184793 (     86,184,793) <= rx_vport_unicast_packets /sec
Ethtool(mlx5p1  ) stat:     86114444 (     86,114,444) <= rx_xdp_drop /sec

Perf record notes

Micro-optimization possibilities in mlx5

Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP options:read
XDP stats       CPU     pps         issue-pps  
XDP-RX CPU      0       18879925    0          
XDP-RX CPU      2       18848656    0          
XDP-RX CPU      4       14047461    0          
XDP-RX CPU      5       14048655    0          
XDP-RX CPU      total   65824699   

RXQ stats       RXQ:CPU pps         issue-pps  
rx_queue_index    0:0   18879897    0          
rx_queue_index    0:sum 18879897   
rx_queue_index    2:2   18848642    0          
rx_queue_index    2:sum 18848642   
rx_queue_index    4:4   14047746    0          
rx_queue_index    4:sum 14047746   
rx_queue_index    5:5   14048712    0          
rx_queue_index    5:sum 14048712   

Looking closer at CPU-0 clearly shows that the mlx5 driver has many function calls. Each only contributes approx 1.3 ns, but at these speeds they add up!

The function call overhead is measured with: https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c

time_bench: Type:funcion_call_cost Per elem: 3 cycles(tsc) 1.009 ns (step:0)
 - (measurement period time:0.100951200 sec time_interval:100951200)
  - (invoke count:100000000 tsc_interval:363427758)

time_bench: Type:func_ptr_call_cost Per elem: 4 cycles(tsc) 1.251 ns (step:0)
 - (measurement period time:0.125172385 sec time_interval:125172385)
 - (invoke count:100000000 tsc_interval:450625065)
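
For reference, the idea behind that measurement, sketched as plain userspace C (the real time_bench code runs as a kernel module): call an empty, non-inlined function in a tight loop and divide the elapsed TSC cycles by the number of calls.

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>		/* __rdtsc() */

#define LOOPS 100000000ULL

/* Empty function; noinline plus the empty asm keep the call from
 * being optimized away, so each iteration pays real call/return
 * overhead. */
__attribute__((noinline)) static uint64_t nop_func(uint64_t x)
{
	__asm__ __volatile__("");
	return x;
}

int main(void)
{
	uint64_t sum = 0, start, stop;

	start = __rdtsc();
	for (uint64_t i = 0; i < LOOPS; i++)
		sum += nop_func(i);
	stop = __rdtsc();

	/* Result includes a few cycles of loop overhead on top of the
	 * pure call cost. */
	printf("cycles per call: %.2f (sum=%lu)\n",
	       (double)(stop - start) / LOOPS, (unsigned long)sum);
	return 0;
}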

The perf report shows the individual function calls, but it does not show the functions that got inlined. For the mlx5 driver, most of the functions that did not get inlined are invoked via indirect function-pointer calls.

$ perf report --sort cpu,symbol --kallsyms=/proc/kallsyms --no-children -C0

Samples: 20K of event 'cycles:ppp', Event count (approx.): 18342640436
  Overhead  CPU  Symbol
+   16.29%  000  [k] bpf_prog_1a32f9805bcb7bb7_xdp_prognum0
+   15.27%  000  [k] mlx5e_post_rx_wqes
+   14.12%  000  [k] mlx5e_poll_rx_cq
+   14.11%  000  [k] mlx5e_skb_from_cqe_linear
+   11.77%  000  [k] mlx5e_handle_rx_cqe
+    8.48%  000  [k] mlx5e_xdp_handle
+    7.54%  000  [k] mlx5e_page_release
+    2.37%  000  [k] swiotlb_sync_single
+    1.87%  000  [k] percpu_array_map_lookup_elem
+    1.02%  000  [k] net_rx_action
+    0.92%  000  [k] swiotlb_sync_single_for_device
+    0.87%  000  [k] swiotlb_sync_single_for_cpu
+    0.79%  000  [k] intel_idle
     0.41%  000  [k] mlx5e_poll_xdpsq_cq
     0.40%  000  [k] mlx5e_napi_poll
     0.31%  000  [k] __softirqentry_text_start
     0.20%  000  [k] smpboot_thread_fn
     0.19%  000  [k] mlx5e_poll_tx_cq
     0.16%  000  [k] __sched_text_start
     0.15%  000  [k] cpuidle_enter_state

| function                       | called per packet? | called by                  |
|--------------------------------+--------------------+----------------------------|
| mlx5e_post_rx_wqes             | bulk               |                            |
| mlx5e_poll_rx_cq               | bulk (NAPI budget) |                            |
| mlx5e_handle_rx_cqe            | per packet         | mlx5e_poll_rx_cq           |
| mlx5e_skb_from_cqe_linear      | per packet         | mlx5e_handle_rx_cqe        |
| mlx5e_xdp_handle               | per packet         | mlx5e_skb_from_cqe_linear  |
| mlx5e_page_release             | per packet         | mlx5e_handle_rx_cqe        |
| swiotlb_sync_single            | 2x per packet      | mlx5e_skb_from_cqe_linear  |
| swiotlb_sync_single            | (above)            | mlx5e_post_rx_wqes         |
| swiotlb_sync_single_for_cpu    | per packet         | mlx5e_skb_from_cqe_linear  |
| swiotlb_sync_single_for_device | per packet         | mlx5e_post_rx_wqes         |
| percpu_array_map_lookup_elem   | per packet         | bpf_prog_xx_xdp_prognum0   |
| bpf_prog_xx_xdp_prognum0       | per packet         | (cannot be avoided)        |

Thus, there are 10 function calls with per-packet invocation.

The DMA sync operations account for 4 of those calls, as they are not inlined in kernel/dma/swiotlb.c. Presumably the compiler chose not to inline swiotlb_sync_single().

Quantify effect of possible improvements?

Trying to figure out a way to measure the effect of avoiding some of the DMA calls.

In principle, the DMA sync operation is a no-op on this Intel CPU, as we are not using an IOMMU and not using the swiotlb bounce-buffer feature. Thus, I can in principle just compile it out…
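
What “compile it out” could look like, as an illustrative fragment only; the helper name and the BENCH_SKIP_DMA_SYNC knob are made up for this sketch, while the actual experiment simply removed the sync calls in the mlx5 RX path:

/* Illustrative fragment, not the actual mlx5 patch: wrap the
 * per-packet RX DMA sync so it can be compiled out on platforms
 * where it is a no-op (cache-coherent DMA, no IOMMU, no swiotlb
 * bounce buffers). */
#include <linux/dma-mapping.h>

static inline void bench_rx_dma_sync_for_cpu(struct device *dev,
					      dma_addr_t addr,
					      unsigned long offset,
					      size_t len)
{
#ifndef BENCH_SKIP_DMA_SYNC
	dma_sync_single_range_for_cpu(dev, addr, offset, len,
				      DMA_FROM_DEVICE);
#endif
}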

For some reason, rx_striding_rq needed to be disabled for good performance:

  • ethtool --set-priv-flags mlx5p1 rx_striding_rq off

Baseline XDP_DROP mlx5 performance on 4.19.0-rc5-bpf-next:

Running XDP on dev:mlx5p1 (ifindex:18) action:XDP_DROP options:no_touch
XDP stats       CPU     pps         issue-pps  
XDP-RX CPU      3       24,971,062  0          
XDP-RX CPU      total   24,971,062 

RXQ stats       RXQ:CPU pps         issue-pps  
rx_queue_index    3:3   24,971,055  0          
rx_queue_index    3:sum 24,971,055 

(1/24971055)*10^9 = 40.04636568218000000000 nanosec

Removed all DMA sync calls from mlx5 driver code:

Running XDP on dev:mlx5p1 (ifindex:16) action:XDP_DROP options:no_touch
XDP stats       CPU     pps         issue-pps  
XDP-RX CPU      3       29,028,209  0          
XDP-RX CPU      total   29,028,209 

RXQ stats       RXQ:CPU pps         issue-pps  
rx_queue_index    3:3   29,028,207  0          
rx_queue_index    3:sum 29,028,207 

(1/29028209)*10^9 = 34.44924900464000000000 ns

Difference, improvement in pps:

  • 29028209-24971055 = 4057154 -> 4,057,154 pps

Difference in nanosec: ((1/24971055)-(1/29028209))*10^9 = 5.59711667754000000000 ns

This removed 4 function calls: 5.5971/4 = 1.399 ns per call. This measured improvement correlates well with the expected overhead per function call of approx 1.3 ns.

Extrapolating, assume we could inline the remaining 6 calls:

29Mpps = 34.4 ns

34.4 - (6 * 1.3) = 26.6 ns

26.6 ns converted to pps:

1/(34.4 - (6 * 1.3)) * 1000 = 37.59 Mpps

Looking at the driver, we (or the compiler) should be able to inline 3 calls into mlx5e_poll_rx_cq:

| function                  | called per packet? | called by                 |
|---------------------------+--------------------+---------------------------|
| mlx5e_handle_rx_cqe       | per packet         | mlx5e_poll_rx_cq          |
| mlx5e_skb_from_cqe_linear | per packet         | mlx5e_handle_rx_cqe       |
| mlx5e_xdp_handle          | per packet         | mlx5e_skb_from_cqe_linear |

Thus, a realistic driver optimization can remove/inline 3 calls:

1/(34.4 - (3 * 1.3)) * 1000 = 32.78 Mpps

DPDK uses SSE instructions

The DPDK implementation uses SSE instructions; how much does this matter?

Compiled the DPDK mlx5 PMD (Poll Mode Driver) with this patch:

diff --git a/drivers/net/mlx5/mlx5_ethdev.c b/drivers/net/mlx5/mlx5_ethdev.c
index 90488af33b81..53c2ff086799 100644
--- a/drivers/net/mlx5/mlx5_ethdev.c
+++ b/drivers/net/mlx5/mlx5_ethdev.c
@@ -1165,7 +1165,8 @@ mlx5_select_rx_function(struct rte_eth_dev *dev)
 
        assert(dev != NULL);
        if (mlx5_check_vec_rx_support(dev) > 0) {
-               rx_pkt_burst = mlx5_rx_burst_vec;
+               rx_pkt_burst = mlx5_rx_burst;                ;
                DRV_LOG(DEBUG, "port %u selected Rx vectorized function",
                        dev->data->port_id);
        } else if (mlx5_mprq_enabled(dev)) {
| RXQs/cores | DPDK SSE (pps) | DPDK no-sse (pps) |
|------------+----------------+-------------------|
|          1 |     42,314,910 |        43,301,465 |
|          2 |     74,510,689 |        75,804,436 |
|          3 |     97,570,907 |        86,492,148 |
|          4 |    109,448,595 |       105,838,967 |
|          5 |    115,487,272 |       109,945,465 |

Issue: the above results are NOT stable for the DPDK no-sse case; DPDK is not processing packets on all cores. This can be seen when Ctrl-C'ing the testpmd program, which then shows counters for each queue; take special note if a queue number is not present.

I believe there is a setup race condition in the DPDK no-sse case, because when I stop all traffic before starting testpmd, the results are stable.

Also got this assert failure:

testpmd: /home/jbrouer/git/dpdk/dpdk-hacks/drivers/net/mlx5/mlx5_rxtx.c:1980: mlx5_rx_burst: Assertion `len >= (rxq->crc_present << 2)' failed.

Test 100G bandwidth

Make sure our machine has the memory bandwidth to handle 100 Gbit/s = 12.5 GBytes/sec.

Below shows that a single CPU can handle this. I needed two generator machines, each sending a single flow, but I changed the number of RX queues to 1 in order to use only 1 CPU.
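
Quick sanity check on the packet rate (assuming roughly MTU-sized frames, i.e. about 1538 bytes on the wire including Ethernet overhead):

  • 100000/(1538*8) = 8.13 Mpps expected at 100G wirespeed
  • observed below: ~8.21 Mpps, and rx_bytes_phy ≈ 12.3 GBytes/sec ≈ 98.7 Gbit/s

So the 100G link, not the single CPU, is the limiting factor in this test.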

Running XDP on dev:mlx5p1 (ifindex:8) action:XDP_DROP options:read
XDP stats       CPU     pps         issue-pps  
XDP-RX CPU      0       8210806     0          
XDP-RX CPU      total   8210806    

RXQ stats       RXQ:CPU pps         issue-pps  
rx_queue_index    0:0   8210762     0          
rx_queue_index    0:sum 8210762    
Show adapter(s) (mlx5p1 mlx5p2) statistics (ONLY that changed!)
Ethtool(mlx5p1  ) stat:        59146 (         59,146) <= ch0_arm /sec
Ethtool(mlx5p1  ) stat:        75717 (         75,717) <= ch0_events /sec
Ethtool(mlx5p1  ) stat:       164685 (        164,685) <= ch0_poll /sec
Ethtool(mlx5p1  ) stat:        59144 (         59,144) <= ch_arm /sec
Ethtool(mlx5p1  ) stat:        75716 (         75,716) <= ch_events /sec
Ethtool(mlx5p1  ) stat:       164684 (        164,684) <= ch_poll /sec
Ethtool(mlx5p1  ) stat:         7676 (          7,676) <= rx0_cache_empty /sec
Ethtool(mlx5p1  ) stat:         7676 (          7,676) <= rx0_cache_full /sec
Ethtool(mlx5p1  ) stat:      8203123 (      8,203,123) <= rx0_cache_reuse /sec
Ethtool(mlx5p1  ) stat:          959 (            959) <= rx0_congst_umr /sec
Ethtool(mlx5p1  ) stat:      8210784 (      8,210,784) <= rx0_xdp_drop /sec
Ethtool(mlx5p1  ) stat:      8210812 (      8,210,812) <= rx_1024_to_1518_bytes_phy /sec
Ethtool(mlx5p1  ) stat:  12336063433 ( 12,336,063,433) <= rx_bytes_phy /sec
Ethtool(mlx5p1  ) stat:         7676 (          7,676) <= rx_cache_empty /sec
Ethtool(mlx5p1  ) stat:         7676 (          7,676) <= rx_cache_full /sec
Ethtool(mlx5p1  ) stat:      8203039 (      8,203,039) <= rx_cache_reuse /sec
Ethtool(mlx5p1  ) stat:          959 (            959) <= rx_congst_umr /sec
Ethtool(mlx5p1  ) stat:          120 (            120) <= rx_out_of_buffer /sec
Ethtool(mlx5p1  ) stat:      8210812 (      8,210,812) <= rx_packets_phy /sec
Ethtool(mlx5p1  ) stat:  12336123874 ( 12,336,123,874) <= rx_prio0_bytes /sec
Ethtool(mlx5p1  ) stat:      8210852 (      8,210,852) <= rx_prio0_packets /sec
Ethtool(mlx5p1  ) stat:  12303230973 ( 12,303,230,973) <= rx_vport_unicast_bytes /sec
Ethtool(mlx5p1  ) stat:      8210819 (      8,210,819) <= rx_vport_unicast_packets /sec
Ethtool(mlx5p1  ) stat:      8210705 (      8,210,705) <= rx_xdp_drop /sec
mpstat -P ALL -u -I SCPU -I SUM 2

04:33:34 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft   %idle
04:33:36 PM  all    0.08    0.00    1.01    0.00    0.93    7.66   90.32
04:33:36 PM    0    0.00    0.00    3.70    0.00    2.65   47.62   46.03
04:33:36 PM    1    0.00    0.00    0.50    0.00    0.00    0.00   99.50
04:33:36 PM    2    0.00    0.00    0.50    0.00    0.00    0.00   99.50
04:33:36 PM    3    0.00    0.00    0.50    0.00    2.51    0.00   96.98
04:33:36 PM    4    0.00    0.00    1.01    0.00    0.00    0.00   98.99
04:33:36 PM    5    0.00    0.00    0.50    0.00    0.00    0.50   99.00

04:33:34 PM  CPU    intr/s
04:33:36 PM  all  99682.00
04:33:36 PM    0 241079.00
04:33:36 PM    1    438.00
04:33:36 PM    2    203.00
04:33:36 PM    3    758.50
04:33:36 PM    4    534.00
04:33:36 PM    5    133.00

04:33:34 PM  CPU       HI/s    TIMER/s   NET_TX/s   NET_RX/s    BLOCK/s IRQ_POLL/s  TASKLET/s    SCHED/s  HRTIMER/s      RCU/s
04:33:36 PM    0       0.00    1000.00       0.50  164745.00       0.00       0.00   75221.50     102.50       0.00       9.50
04:33:36 PM    1       0.00     306.50       0.50       2.00      39.00       0.00       0.00      76.50       0.00      13.50
04:33:36 PM    2       0.00     175.00       0.00       2.50       0.00       0.00       0.00      15.00       0.00      10.50
04:33:36 PM    3       0.00     734.00       0.00       3.50       0.00       0.00       0.00      11.50       0.00       9.50
04:33:36 PM    4       0.00     451.00       0.50      23.00       0.00       0.00      13.50      40.50       0.00       5.50
04:33:36 PM    5       0.00     116.50       0.00       8.00       0.00       0.00       0.00       0.00       0.00       8.50

Intel 40G driver i40e

Why the i40e driver / XL710 NIC was not chosen for benchmarking.

It would basically have been boring to show that XDP can handle, on a single CPU, whatever the HW can offer. Comparing against DPDK, the results would have shown the same, as the HW is the real limit. Thus, it would be sort of unfair to DPDK.

The i40e HW datasheet states that it can “only” handle 40G wirespeed for packets of 128 bytes and larger.

Intel Ethernet Controller X710/XXV710/XL710: Datasheet: https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xl710-10-40-controller-datasheet.pdf

Quote from Datasheet: Table 1-4. Performance on the network vs. packet size: “Maximize link capacity when operating at 40 Gb/s link or 4x10 Gb/s links with packets larger than 128 bytes at full duplex traffic”

40G wirespeed is:

  • 40000/(84*8) = 59.52 Mpps at 64B due to overhead 84 Bytes
  • 40000/(128*8) = 39.06 Mpps at 128B without inter-frame gap
  • 40000/((128+12)*8) = 35.71 Mpps at 128B with 12B inter-frame gap

XDP can XDP_DROP 35.3 Mpps on a single RX-queue (100% CPU usage). Scaling this up to more CPUs, the sum still remains around 36 Mpps, just with a lot of idle CPU cycles.

Single queue XDP_DROP:

Running XDP on dev:i40e1 (ifindex:3) action:XDP_DROP options:no_touch
XDP stats       CPU     pps         issue-pps  
XDP-RX CPU      2       35,382,617  0          
XDP-RX CPU      total   35,382,617 

RXQ stats       RXQ:CPU pps         issue-pps  
rx_queue_index    2:2   35,382,616  0          
rx_queue_index    2:sum 35,382,616 

Multi-queue XDP_DROP with xdp_rxq_info:

Running XDP on dev:i40e1 (ifindex:3) action:XDP_DROP options:no_touch
XDP stats       CPU     pps         issue-pps  
XDP-RX CPU      0       6,023,281   0          
XDP-RX CPU      1       3,011,011   0          
XDP-RX CPU      3       16,555,566  0          
XDP-RX CPU      4       4,501,346   0          
XDP-RX CPU      5       6,005,919   0          
XDP-RX CPU      total   36,097,126 

RXQ stats       RXQ:CPU pps         issue-pps  
rx_queue_index    0:0   6,023,283   0          
rx_queue_index    0:sum 6,023,283  
rx_queue_index    1:1   3,011,011   0          
rx_queue_index    1:sum 3,011,011  
rx_queue_index    2:3   3,012,275   0          
rx_queue_index    2:sum 3,012,275  
rx_queue_index    3:3   13,543,264  0          
rx_queue_index    3:sum 13,543,264 
rx_queue_index    4:4   4,501,393   0          
rx_queue_index    4:sum 4,501,393  
rx_queue_index    5:5   6,005,977   0          
rx_queue_index    5:sum 6,005,977