10G AXI Ethernet Checksum Offload Example Design
This page provides the details of the 2022.1 Zynq UltraScale+ MPSoC 10G AXI Ethernet Checksum Offload example design. The example design supports the Checksum Offload and Receive Side Interrupt Scaling features. The performance improvements achieved in CPU utilization and throughput for TCP and UDP use cases are shared on this page. By following the steps provided here, the user can run the example design on a ZCU102 board with a Solarflare NIC as a link partner. Build steps are also provided to build the Vivado and PetaLinux projects.
Table of Contents
- 1 Overview
- 2 System Requirements and Development Tools
- 3 Package Directory Structure and Contents
- 4 Test Setup
- 5 Run Flow
- 5.1 UDP RX
- 5.2 UDP TX
- 5.3 TCP RX
- 5.4 TCP TX
- 5.5 TCP RX with RSS
- 5.6 UDP RX with RSS
- 6 Build Flow
- 7 Performance Numbers for Non-CSO, CSO and CSO+RSS
- 8 Hardware CSO Engine
- 9 Receive Side Interrupt Scaling
- 10 Software Architecture of CSO Platform
- 11 Other Information
- 11.1 Known Issues
- 11.2 Future scope
- 11.3 Revision History
Overview
The primary goal of this example design is to showcase the advantages of the Checksum Offload (CSO) and Receive Side Interrupt Scaling (RSS) features for improving the CPU utilization and throughput of the 10G AXI Ethernet MCDMA subsystem. The checksum offload feature accelerates packet processing in the Ethernet stack by offloading the checksum computation and validation tasks to the Programmable Logic (PL). The user can enable or disable* the CSO feature based on the application requirement. The example design uses the Vivado IP Integrator (IPI) flow to build the hardware design and the AMD Xilinx PetaLinux flow for the software design. It uses Xilinx IPs and software drivers to demonstrate the capabilities of the CSO and Receive Side Interrupt Scaling features.
The major components of the example design are the Zynq UltraScale+ MPSoC, MCDMA, XXV Ethernet soft MAC IP, a custom Checksum Offload Engine IP, and an RSS IP.
Note: the custom RSS IP is implemented with port-number mapping to demonstrate the RSS feature; it is not based on the standard 4/5-tuple hash function.
*Steps to disable the CSO feature are provided in the Build flow.
Features Supported
PL based Checksum computation for TCP/UDP packets in the TX direction.
PL based Checksum Validation for TCP/UDP packets in the RX direction.
Receive side load balancing through interrupt Scaling for TCP/UDP packets.
Checksum Offload design Block Diagram
System Requirements and Development Tools
Hardware Required
ZCU102 evaluation board with power cable
SFP+ optical cable
10G capable link partner (Solarflare NIC – X2522)
X86 Host Machine to accommodate the NIC (Dell System Precision Tower 7910)
Micro-USB cable for the terminal emulation
SD Card
Software Components Required
Operating system
APU: SMP Linux - ZCU102
Host OS: Ubuntu 20.04.3 LTS; Linux 5.4.0-109-generic #123-Ubuntu SMP
Linux kernel including drivers and TCP/IP Stack
iperf3 Application
mpstat Application
Serial terminal emulator (Tera Term)
SD Card Formatter Tool
Optional applications for debug or troubleshooting - ethtool, tcpdump
Development Tools
PetaLinux Tool version 2022.1 (See UG1144 for installation instructions)
Vivado Design suite version 2022.1 (See UG973 for installation instructions)
Package Directory Structure and Contents
Two packages are released:
zcu102_10G_CSO_Example_Design_2022.1.zip has the Vivado project creation scripts, PetaLinux BSP, and SD card images and binaries that enable the user to run the example design.
cso_example_sources_and_licenses.tar.gz has the sources and licensing information for all PetaLinux recipes used to generate images.
Package Download Links
Download the AXI-Ethernet 10G CSO example design 2022.1 package from here zcu102_10G_CSO_Example_Design_2022.1.zip
Download the PetaLinux sources and licensing information from here cso_example_sources_and_licenses.tar.gz
Package Directory Contents
The package is released with the Vivado project creation scripts and PetaLinux scripts to create software images.
It has prebuilt SD card images that enable the user to run the example design on the ZCU102 board.
The package contains source files to build two different platforms:
- Design-1 supports the Checksum Offload use case (zcu102_10g_ethernet_CSO)
- Design-2 supports the Checksum Offload with RSS use case (zcu102_10g_ethernet_CSO_RSS)
Package Directory Structure
The below figure depicts the directory structure and the hierarchy of the zcu102_10G_CSO_Example_Design_2022.1 package
zcu102_10G_xxv_cso
|
+---petalinux
| +---zcu102_10g_ethernet_CSO
| | |---config.project
| | \---project-spec
| \---zcu102_10g_ethernet_CSO_RSS
| |---config.project
| \---project-spec
+---prebuild
| +---zcu102_10g_ethernet_CSO
| | |---BOOT.BIN
| | |---boot.scr
| | |---image.ub
| | \---xsa
| | \---system.xsa
| +---zcu102_10g_ethernet_CSO_RSS
| | |---BOOT.BIN
| | |---boot.scr
| | |---image.ub
| | \---xsa
| | \---system.xsa
| \---zcu102_10g_ethernet_CSO_disable
| |---BOOT.BIN
| |---boot.scr
| \---image.ub
+---vivado
| +---designs
| | |---Makefile
| | |---runs.tcl
| | +---zcu102_10g_ethernet_CSO
| | | | config_bd.tcl
| | | | main.tcl
| | +---zcu102_10g_ethernet_CSO_RSS
| | | config_bd.tcl
| | | main.tcl
+---iprepo
| +---cntrl_strm_rd
| +---csum_rx
| +---csum_rx_rss
| +---csum_tx
| +---pkt_overflow_logic
| +---tdest_align
| \---tdest_mapper
|---xdc
| |---async.xdc
| \---top.xdc
|----IMPORTANT_NOTICE_CONCERNING_THIRD_PARTY_CONTENT
\----Readme
The top-level directory structure is described below:
petalinux: This directory contains PetaLinux recipes and metadata to build the images for the two use cases.
- zcu102_10g_ethernet_CSO: contains the PetaLinux recipes and metadata of the checksum offload design.
- zcu102_10g_ethernet_CSO_RSS: contains the PetaLinux recipes and metadata of the checksum offload design with RSS.
prebuild: This directory contains prebuilt images for the use cases.
- zcu102_10g_ethernet_CSO: contains the SD card files (image.ub, BOOT.BIN and boot.scr) to boot the checksum offload design.
- zcu102_10g_ethernet_CSO_RSS: contains the SD card files (image.ub, BOOT.BIN and boot.scr) to boot the checksum offload design with RSS.
- zcu102_10g_ethernet_CSO_disable: contains the SD card files (image.ub, BOOT.BIN and boot.scr) to boot the design with the checksum offload feature disabled.
vivado: This directory consists of the project creation scripts, design constraints and the custom IP repository required to create the hardware designs for the two use cases (checksum offload and checksum offload with RSS).
- zcu102_10g_ethernet_CSO: contains the project creation scripts of the checksum offload design.
- zcu102_10g_ethernet_CSO_RSS: contains the project creation scripts of the checksum offload design with RSS.
IMPORTANT_NOTICE_CONCERNING_THIRD_PARTY_CONTENT: This file contains information about Xilinx and other third-party licenses.
Readme: This file contains information about the PetaLinux sources and licensing.
Test Setup
This section provides the test setup information between the ZCU102 board and the Host machine.
Connect 12V Power to the ZCU102 board 6-Pin Molex connector (J52).
Connect an SFP+ cable between the ZCU102 board SFP cage assembly (top-right position, SFP0; see UG1182 Table 3-30) and the NIC on the x86 host machine.
Prepare the SD card. There are several Windows tools that can format the SD card.
Note: Always format with the FAT32 option. Use the SD Card Formatter tool to format the SD card.
Set the SW6 switches as shown in the table below. This configures the boot settings to boot from SD.
Boot Mode | Mode Pins [3:0] | Mode SW6[4:1] |
SD | 1110 | off, off, off, on |
Connect the Micro USB cable into the ZCU102 Board Micro USB port (J96) and the other end into an open USB port on the host PC. This cable is used for UART over USB.
Power on the board and make sure that the operational status LEDs such as power supply status, INIT, DONE and all power rail LEDs are lit green.
Run the serial terminal emulator (Tera Term) and make sure the serial communication configuration is set as shown below:
- Baud Rate: 115200
- Data: 8 bit
- Parity: None
- Stop: 1 bit
- Flow Control: None
Run Flow
This section describes the run flow and the commands required to test the checksum offload and checksum offload with RSS features on the ZCU102 board using the prebuilt images.
Prior to running the steps mentioned below, download the example design package and extract its contents.
1. Set up the board as explained in the “Test Setup” section.
2. Copy the ready-to-test images (image.ub, BOOT.BIN and boot.scr) from the “../prebuild/zcu102_10g_ethernet_CSO” folder to the FAT32-formatted SD card.
3. Insert the SD card in the SD card slot and make sure that the board is in SD boot mode.
4. Power on the board. After a successful boot, a login prompt appears as shown below.
ZCU102-CSUM-2022 login:
5. Log in with the username ‘petalinux’ and create a new password when prompted.
6. To add or modify the Ethernet interface settings, log in as superuser with the password created in step 5, as shown below.
ZCU102-CSUM-2022:~$ sudo su
Note: To boot the CSO design with RSS, copy the images from the “../prebuild/zcu102_10g_ethernet_CSO_RSS” folder to the SD card and follow steps 3 and 4.
To check the performance with CSO disabled, copy the images from the “../prebuild/zcu102_10g_ethernet_CSO_disable” folder to the SD card and follow steps 3 and 4.
Use ethtool -k <xxv-eth-interface-name> to verify whether the “rx-checksumming” and “tx-checksumming” features are ON or OFF before running iperf3 traffic.
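For example, to filter the ethtool output down to just the two offload features (the interface name eth1 is an assumption; substitute your interface):
ZCU102-CSUM-2022:~# ethtool -k eth1 | grep checksumming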
Run the iperf3 application
Once the host and ZCU102 are booted, set up an IP address for each Ethernet port and make sure that the Ethernet link is established using ping. If the link is not detected, bring the interface down and up using the commands given below. Do not proceed until you are able to ping each interface.
ZCU102-CSUM-2022:~# ifconfig <xxv-eth-interface-name> down
ZCU102-CSUM-2022:~# ifconfig <xxv-eth-interface-name> up
Note: After applying the interface up command, make sure a valid IP address is set for the interface.
To set the IP Address:
ifconfig <xxv-eth-interface-name> <ip-address>
(for example ifconfig eth1 195.168.1.100)
Pre-requisites:
1. Looking at /proc/interrupts
ZCU102-CSUM-2022:~# cat /proc/interrupts | grep <interface-name>
Note: the above command lists the transmit-side interrupt number (tx-irq-no) followed by the receive-side interrupt number (rx-irq-no) and the cores assigned to process each interrupt. The CSO design has two interrupts, one for TX and one for RX, whereas the CSO RSS design has five interrupts, one for TX and four for RX.
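As a convenience, the IRQ numbers alone can be extracted with a one-liner such as the one below, a sketch that assumes the usual /proc/interrupts layout where the IRQ number is the first colon-terminated field:
ZCU102-CSUM-2022:~# grep <interface-name> /proc/interrupts | awk -F: '{gsub(/ /,"",$1); print $1}'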
2. CPU Utilization Reporting
ZCU102-CSUM-2022:~# mpstat -P ALL 1 50
The average CPU idle percentage of all cores over a period of 50 seconds is reported. The CPU utilization percentage is obtained by subtracting the average CPU idle percentage from 100. (For example, when the average CPU idle percentage is 15%, the CPU utilization percentage is 85%.)
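This subtraction can also be scripted. The following one-liner is a sketch that assumes the sysstat mpstat output format, where the Average rows end with the %idle column:
ZCU102-CSUM-2022:~# mpstat -P ALL 1 50 | awk '/^Average/ && $2 != "CPU" { printf "CPU %s utilization: %.2f%%\n", $2, 100 - $NF }'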
UDP RX
Set RX interrupt affinity
Set Ethernet MCDMA RX interrupt affinity to core-1
ZCU102-CSUM-2022:~# echo 2 > /proc/irq/<rx-irq-no>/smp_affinity
Note: echo 2 corresponds to core 1.
Enable Flow Steering
Enable Receive Flow Steering
ZCU102-CSUM-2022:~# echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
ZCU102-CSUM-2022:~# echo 2048 > /sys/class/net/<interface-name>/queues/rx-0/rps_flow_cnt
ZCU102-CSUM-2022:~# echo 2048 > /sys/class/net/<interface-name>/queues/rx-1/rps_flow_cnt
Note: RFS is disabled by default. To enable RFS, we must edit rps_sock_flow_entries and rps_flow_cnt. For details refer to https://www.kernel.org/doc/Documentation/networking/scaling.txt
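The settings can be verified by reading the same entries back:
ZCU102-CSUM-2022:~# cat /proc/sys/net/core/rps_sock_flow_entries
ZCU102-CSUM-2022:~# cat /sys/class/net/<interface-name>/queues/rx-0/rps_flow_cnt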
Run iperf servers on ZYNQ MP
Run two threads of iperf servers on core-2 and core-3 of ZYNQ MP.
ZCU102-CSUM-2022:~# taskset -c 2 iperf3 -s -p 5301 -i 60 &
ZCU102-CSUM-2022:~# taskset -c 3 iperf3 -s -p 5302 -i 60 &
Run iperf clients on the Host Machine:
Run two threads of iperf clients on the host.
host:~# iperf3 -c <Board_IP> -u -P 2 -T s1 -p 5301 -t 60 -i 60 -b 2500M -l 1472 &
host:~# iperf3 -c <Board_IP> -u -P 2 -T s2 -p 5302 -t 60 -i 60 -b 2500M -l 1472 &
Table 1: Performance Comparison of CSO enabled and disabled use cases for UDP RX
The above table shows that enabling the CSO feature has improved the throughput and CPU utilization.
UDP TX
Set TX interrupt affinity
Set Ethernet MCDMA TX interrupt affinity to core-1.
ZCU102-CSUM-2022:~# echo 2 > /proc/irq/<tx-irq_no>/smp_affinity
Start Iperf servers on host machine
Run four threads of Iperf servers on Host machine.
host:~# iperf3 -s -p 5301 & iperf3 -s -p 5302 & iperf3 -s -p 5303 & iperf3 -s -p 5304 &
Run Iperf clients on ZYNQMP
Run four threads of iperf clients on core-0, core-1, core-2 and core-3 of ZynqMP.
ZCU102-CSUM-2022:~# taskset -c 0 iperf3 -u -P 2 -c <host_IP> -T s1 -p 5301 -t 60 -i 60 -b 450M &
ZCU102-CSUM-2022:~# taskset -c 1 iperf3 -u -P 2 -c <host_IP> -T s2 -p 5302 -t 60 -i 60 -b 450M &
ZCU102-CSUM-2022:~# taskset -c 2 iperf3 -u -P 2 -c <host_IP> -T s3 -p 5303 -t 60 -i 60 -b 450M &
ZCU102-CSUM-2022:~# taskset -c 3 iperf3 -u -P 2 -c <host_IP> -T s4 -p 5304 -t 60 -i 60 -b 450M &
Note: Make sure to run all threads at the same instant.
Table 2: Performance Comparison of CSO enabled and disabled use cases for UDP TX
The above table shows that enabling the CSO feature has improved the CPU utilization.
TCP RX
Set RX interrupt affinity
Set Ethernet MCDMA RX interrupt affinity to core-1
ZCU102-CSUM-2022:~# echo 2 > /proc/irq/<rx-irq-no>/smp_affinity
Enable Flow Steering
Enable Receive Flow Steering and Receive Packet Steering
ZCU102-CSUM-2022:~# echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
ZCU102-CSUM-2022:~# echo 2048 > /sys/class/net/<interface-name>/queues/rx-0/rps_flow_cnt
ZCU102-CSUM-2022:~# echo 2048 > /sys/class/net/<interface-name>/queues/rx-1/rps_flow_cnt
Change BD count
Increase the RX BD count to 1024 from the default 128.
ZCU102-CSUM-2022:~# ifconfig <interface-name> down
ZCU102-CSUM-2022:~# ethtool -G <interface-name> rx 1024
ZCU102-CSUM-2022:~# ifconfig <interface-name> up
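To confirm the new ring size took effect, query the current and maximum ring parameters:
ZCU102-CSUM-2022:~# ethtool -g <interface-name>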
Run iperf servers on ZYNQ MP
Run two threads of iperf servers on core-2 and core-3 of ZYNQ MP.
ZCU102-CSUM-2022:~# taskset -c 2 iperf3 -s -p 5301 -i 60 &
ZCU102-CSUM-2022:~# taskset -c 3 iperf3 -s -p 5302 -i 60 &
Run iperf clients on Host Machine
Run two threads of iperf clients on the host.
host:~# iperf3 -c <Board_IP> -P 2 -T s1 -p 5301 -t 60 -i 60 -b 500M &
host:~# iperf3 -c <Board_IP> -P 2 -T s2 -p 5302 -t 60 -i 60 -b 500M &
Table 3: Performance Comparison of CSO enabled and disabled use cases for TCP RX
The above table shows that enabling the CSO feature has improved the throughput and CPU utilization.
TCP TX
Set TX interrupt affinity
Set Ethernet MCDMA TX interrupt affinity to core-1.
ZCU102-CSUM-2022:~# echo 2 > /proc/irq/<tx-irq-no>/smp_affinity
Start Iperf servers on host machine
Run two threads of Iperf servers on the Host machine.
host:~# iperf3 -s -p 5301 & iperf3 -s -p 5302 &
Run Iperf clients on ZYNQMP
Run two threads of iperf clients on core-2 and core-3 of the ZynqMP.
ZCU102-CSUM-2022:~# taskset -c 2 iperf3 -c <host_IP> -T s1 -p 5301 -t 60 -i 60 -b 1850M &
ZCU102-CSUM-2022:~# taskset -c 3 iperf3 -c <host_IP> -T s2 -p 5302 -t 60 -i 60 -b 1850M &
Table 4: Performance Comparison of CSO enabled and disabled use cases for TCP Tx use case
The above table shows that enabling the CSO feature has improved CPU utilization.
TCP RX with RSS
Set RX interrupt affinity
Set the Ethernet MCDMA RX interrupt affinities to core-0, core-1, core-2 and core-3.
ZCU102-CSUM-2022:~# echo 1 > /proc/irq/<rx-irq-no-1>/smp_affinity
ZCU102-CSUM-2022:~# echo 2 > /proc/irq/<rx-irq-no-2>/smp_affinity
ZCU102-CSUM-2022:~# echo 4 > /proc/irq/<rx-irq-no-3>/smp_affinity
ZCU102-CSUM-2022:~# echo 8 > /proc/irq/<rx-irq-no-4>/smp_affinity
Enable Flow Steering
Enable Receive Flow Steering and Receive Packet Steering
ZCU102-CSUM-2022:~# echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
ZCU102-CSUM-2022:~# echo 2048 > /sys/class/net/<interface-name>/queues/rx-0/rps_flow_cnt
ZCU102-CSUM-2022:~# echo 2048 > /sys/class/net/<interface-name>/queues/rx-1/rps_flow_cnt
Change BD count
Increase the RX BD count to 1024 from the default 128
ZCU102-CSUM-2022:~# ifconfig <interface-name> down
ZCU102-CSUM-2022:~# ethtool -G <interface-name> rx 1024
ZCU102-CSUM-2022:~# ifconfig <interface-name> up
Run iperf servers on ZYNQ MP
Run four threads of iperf servers on core-0, core-1, core-2 and core-3 of ZYNQ MP with port numbers 5301, 5302, 5303 and 5304.
ZCU102-CSUM-2022:~# taskset -c 0 iperf3 -s -p 5301 -i 60 &
ZCU102-CSUM-2022:~# taskset -c 1 iperf3 -s -p 5302 -i 60 &
ZCU102-CSUM-2022:~# taskset -c 2 iperf3 -s -p 5303 -i 60 &
ZCU102-CSUM-2022:~# taskset -c 3 iperf3 -s -p 5304 -i 60 &
Note: Make sure to use the same port numbers given in the above commands, as the RSS IP is implemented with these destination ports.
Run iperf clients on Host Machine
Run four threads of iperf clients on the host machine with port numbers 5301, 5302, 5303 and 5304.
host:~# iperf3 -c <Board_IP> -P 2 -T s1 -p 5301 -t 60 -i 60 -b 500M &
host:~# iperf3 -c <Board_IP> -P 2 -T s2 -p 5302 -t 60 -i 60 -b 500M &
host:~# iperf3 -c <Board_IP> -P 2 -T s3 -p 5303 -t 60 -i 60 -b 500M &
host:~# iperf3 -c <Board_IP> -P 2 -T s4 -p 5304 -t 60 -i 60 -b 500M &
Note: Make sure to run all threads at the same instant.
Table 5: Performance improvement achieved with RSS implemented for TCP RX
UDP RX with RSS
Set RX interrupt affinity
Set Ethernet MCDMA RX interrupt affinity to core-0 and core-1.
ZCU102-CSUM-2022:~# echo 1 > /proc/irq/<rx-irq-no-1>/smp_affinity
ZCU102-CSUM-2022:~# echo 2 > /proc/irq/<rx-irq-no-2>/smp_affinity
Run iperf servers on ZYNQ MP
Run two threads of iperf servers on core-2 and core-3.
ZCU102-CSUM-2022:~# taskset -c 2 iperf3 -s -p 5301 -i 60 &
ZCU102-CSUM-2022:~# taskset -c 3 iperf3 -s -p 5302 -i 60 &
Run iperf clients on the Host Machine:
Run two threads of iperf clients on the host.
host:~# iperf3 -c <Board_IP> -u -P 2 -T s1 -p 5301 -t 60 -i 60 -b 2500M -l 1472 &
host:~# iperf3 -c <Board_IP> -u -P 2 -T s2 -p 5302 -t 60 -i 60 -b 2500M -l 1472 &
Table 6: Performance improvement achieved with RSS implemented for UDP RX
Build Flow
Steps to build the Vivado Hardware Design and generate the XSA
Refer to the Vivado Design Suite User Guide: Using the Vivado IDE, UG893, for information on setting up the Vivado environment.
Note: Prior to running the steps mentioned below, download the CSO example design package and extract its contents
Steps:
1. Open a Linux terminal.
2. Change the working directory to the CSO example design folder.
3. Source the Vivado 2022.1 tool settings: <path/to/vivado-installer>/settings64.csh
4. Navigate to the ../vivado/designs/zcu102_10g_ethernet_CSO folder to build the checksum offload design.
5. Run the following command in the terminal to create the Vivado project, invoke the GUI, populate the IPI block design and generate the XSA. The XSA generation may take an hour.
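A minimal sketch of the invocation is shown below; it assumes the project creation script is main.tcl, as listed in the package directory structure (the exact command shipped with the package's Makefile/Readme takes precedence):
vivado -source main.tcl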
6. The generated XSA will be located at $working_dir/vivado/designs/zcu102_10g_ethernet_CSO/project/xxv_subsys_wrapper.xsa
Note: To build the RSS design, navigate to the “../vivado/designs/zcu102_10g_ethernet_CSO_RSS” folder in step 4.
Steps to build a Linux image with the PetaLinux Tool
This tutorial shows how to build the Linux image using the PetaLinux tools.
Note: Prior to running the steps mentioned below, make sure that the XSA has been generated successfully.
Steps:
1. Open a Linux terminal.
2. Change directory to the CSO example design folder.
3. Source the PetaLinux tool settings: <path/to/petalinux-installer>/tool/petalinux-v2022.1-final/settings.sh
4. Navigate to the ../petalinux/zcu102_10g_ethernet_CSO folder to build the checksum offload design.
5. Run the following command in the terminal to configure the PetaLinux project.
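A typical configuration command is shown below; the XSA path is an assumption and should point to the XSA generated in the Vivado build flow (or the prebuilt one under prebuild/zcu102_10g_ethernet_CSO/xsa):
petalinux-config --get-hw-description=<path/to/xsa> --silentconfig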
6. Build the PetaLinux project
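The project is built with the standard PetaLinux build command:
petalinux-build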
The generated images are located in $working_dir/petalinux/zcu102_10g_ethernet_CSO/images/linux/ folder.
7. Create a boot image (BOOT.BIN) including FSBL, ATF, bitstream, and u-boot.
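A typical packaging command is sketched below; the FSBL and bitstream file names are the PetaLinux defaults and are assumptions here, and ATF is included automatically when --u-boot is specified:
petalinux-package --boot --fsbl images/linux/zynqmp_fsbl.elf --fpga images/linux/system.bit --u-boot --force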
8. Copy the images (image.ub, BOOT.BIN and boot.scr) to the FAT32-formatted SD card and insert the card in the SD card slot to run the design.
Note: To build the RSS design, navigate to the “../petalinux/zcu102_10g_ethernet_CSO_RSS” folder in step 4.
Steps to build CSO-disabled images
After step 5, navigate to the $working_dir/project-spec/meta-user/recipes-bsp/device-tree/files folder.
Open the system-user.dtsi file in an editor.
Delete the properties below and save the file:
xlnx,txcsum = <0x1>;
xlnx,rxcsum = <0x2>;
Run steps 6, 7 and 8.
Performance Numbers for Non-CSO, CSO and CSO+RSS
Setup Details
Host setup: Dell System Precision Tower 7910
Iperf: iperf 3-CURRENT (cJSON 1.5.2)
OS: Ubuntu 20.04.3 LTS; Linux 5.4.0-109-generic #123-Ubuntu SMP, or Ubuntu LTS: Linux 3.13.0-147-generic #196-Ubuntu SMP
NIC (10G Solarflare X2522 Dual-Port 10GbE SFP+ Adapter): default configuration
Note: This benchmarking was done with the default system network parameters set by the operating system. You can tune the network parameters with the Linux sysctl interface to improve IPv4 and IPv6 traffic performance. Changing the network parameters might yield different results on different systems.
Table 7: Performance Comparison of CSO disable, CSO and CSO with RSS for TCP/UDP Tx and Rx use cases
Note: The UDP loss percentage for the above is < 0.05% and the TCP retry count is < 1000 over a span of 60 seconds.
Hardware CSO Engine
The CSO engine consists of a Tx_csum IP on the TX side and an Rx_csum IP on the RX side. On the receive side, the Rx_csum IP validates the checksum of each incoming packet and updates a qualifier field with the packet status. The Tx_csum IP computes the checksum and inserts it into the TCP/UDP checksum field of the packet. The building blocks of the receive-side and transmit-side checksum IPs are detailed in this section.
Checksum Validation IP in Receive Side
The Rx_csum IP fits in between the AXIS_RX interface of the XXV Ethernet MAC IP and the S2MM interface of the AXI-MCDMA IP. The RX pipeline and the RXCSUM IP block diagram are shown below.
This IP parses the incoming packet, computes the IP header checksum, the TCP/UDP checksum and the raw checksum, and sends the corresponding status to the S2MM AXI-MCDMA control/status stream. The status encoding is shown below.
Receive CSUM Status:
- 000 = Neither the IP header nor the TCP/UDP checksums were checked.
- 001 = The IP header checksum was checked and was correct. The TCP/UDP checksum was not checked.
- 010 = Both the IP header checksum and the TCP checksum were checked and were correct.
- 011 = Both the IP header checksum and the UDP checksum were checked and were correct.
- 100 = Reserved
- 101 = The IP header checksum was checked and was incorrect. The TCP/UDP checksum was not checked.
- 110 = The IP header checksum was checked and was correct, but the TCP checksum was checked and was incorrect.
- 111 = The IP header checksum was checked and was correct, but the UDP checksum was checked and was incorrect.
The Receive CSUM Status is embedded in the AXI MCDMA Status stream. The App field mapping of S2MM status stream is given below.
App Fields | Name | Description |
App3[15:0] | RX_CSRAW | Receive Raw Checksum |
App4[2:0] | RX_CS_STS | Receive Checksum Status |
Checksum Calculation IP during Transmit - Partial Checksum Offload
The Tx_csum IP computes the checksum and inserts it into the TCP/UDP checksum field if the offload enable field (TxCsumCntrl) in the app fields is set by the software; otherwise the packet is sent to the MAC as-is. The TX checksum block calculates the checksum of the packet from the byte index provided in the control stream to the end of the packet. The software provides an IP pseudo-header checksum as an app field, which is used to initialize the checksum value.
The app fields from the control stream provide the byte index of the checksum calculation starting point (TxCsBegin), the checksum insertion point (TxCsInsert), the checksum calculation initial value (TxCsInit) and the offload enable field (TxCsumCntrl). The app field mapping of the MM2S control stream is given below.
App Fields | Name | Description |
App1[11:10] | TxCsumCntrl | Transmit Checksum Enable |
App2[15:0] | TxCsInit | Transmit Checksum Calculation Initial Value |
App3[15:0] | TxCsBegin | Transmit Checksum Calculation Starting Point |
App3[31:16] | TxCsInsert | Transmit Checksum Insertion Point |
Receive Side Interrupt Scaling
Receive Side Interrupt Scaling provides the benefits of parallel receive processing in multiprocessing environments. It can improve system performance by distributing receive processing across multiple CPUs. This helps to ensure that no CPU is heavily loaded while another CPU is idle.
When a receive queue (or interrupt) is tied to a specific core, packets from the same flow are steered to that core. Each receive queue has a separate IRQ associated with it, and is triggered to notify a CPU when new packets arrive on the given queue.
Note: in this example design, the RSS implementation is not based on the industry-standard 4/5-tuple hash function; it is based on port-number mapping to demonstrate the RSS feature for performance improvement.
The RSS block in the RX CSO engine parses the received packet header and generates the TDEST for the MCDMA based on the destination port. Based on the TDEST, the MCDMA raises an interrupt for the corresponding S2MM channel, which in turn is associated with a CPU core through the interrupt affinity configuration. IRQ affinity distributes interrupts across multiple CPU cores to spread the CPU workload and speed up data processing.
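With traffic running, this distribution can be observed by watching the per-core interrupt counters grow for each RX queue (standard procfs; no design-specific assumptions):
ZCU102-CSUM-2022:~# watch -n 1 "grep <interface-name> /proc/interrupts"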
Software Architecture of CSO Platform
The Linux Ethernet driver derives its checksum offload capability information from the design via device-tree parameters.
Please refer to the description of the xlnx,txcsum and xlnx,rxcsum device-tree properties.
When hardware offload capability is present:
The driver sets necessary CSUM metadata in the Buffer Descriptor fields for every Transmit packet.
The driver informs the Linux framework of the HW CSUM offload capability, thereby preventing SW checksum operations from running in the framework.
The driver verifies CSUM Offload status in the descriptor fields of every Receive packet and informs the Linux Ethernet framework when there is no need to perform SW CSUM.
Other Information
The following are the conclusions, limitations and future scope identified from the profiling and performance improvement experiments:
The TCP TX checksum offload path utilizes two buffer descriptors instead of one per data transfer, resulting in marginally higher processing in the driver and utilization of a higher number of descriptors. As a result, it is recommended to double the total buffer descriptors (to 256) to ensure that there is enough for TX data processing.
There can be variations in the performance numbers based on the Linux PC and OS version on the link partner. In addition, load on the link partner will also result in differences in performance numbers during multiple runs. For consistent results, please ensure that the load on the link partner remains the same throughout and that the Ethernet (iperf3) process gets priority.
Function profiling of the Linux Ethernet path showed that the average CPU utilization of SW checksum offload functionality is 3-5%. This is the bandwidth that the HW Offload functionality frees up.
Other major consumers of CPU bandwidth include the user-space-to-kernel (and vice versa) memory copy functions and the AXI Ethernet transfer functions (the latter has been profiled and optimized in its current state).
Note that SW GRO is always enabled on the Linux kernel used on the target by default.
Note that the Receive Side Scaling example is a proof-of-concept demonstration; for practical deployment, a standard RSS technique based on the 5-tuple hash (or similar) can be implemented.
Known Issues
In the TCP TX use case, throughput and CPU utilization match Table 4 for most iterations; however, the throughput can occasionally drop, with CPU utilization proportionally lower.
The rootfs for the CSO RSS design is configured with the sysvinit init manager.
Dynamic ON/OFF of “rx-checksumming” and “tx-checksumming” using ethtool utility is not enabled in the driver.
Future scope
Hardware Segmentation Offloads and Software DPDK based improvements can be explored.
Revision History
2022.1_web - 08/31/2022
© Copyright 2019 - 2022 Xilinx Inc.