10G AXI Ethernet Checksum Offload Example Design

This page describes the 2022.1 Zynq UltraScale+ MPSoC 10G AXI Ethernet Checksum Offload example design. The design supports the Checksum Offload and Receive Side Interrupt Scaling features, and this page reports the resulting improvement in CPU utilization and throughput for TCP and UDP use cases. By following the steps on this page, the user can run the example design on a ZCU102 board with a Solarflare NIC as the link partner. Build steps for the Vivado and PetaLinux projects are also provided.


Overview

The primary goal of this example design is to showcase how the Checksum Offload (CSO) and Receive Side Interrupt Scaling (RSS) features improve the CPU utilization and throughput of the 10G AXI Ethernet MCDMA subsystem. The checksum offload feature accelerates packet processing in the Ethernet stack by offloading the checksum computation and validation tasks to the Programmable Logic (PL). The user can enable or disable* the CSO feature based on the application requirement. The example design uses the Vivado IP Integrator (IPI) flow to build the hardware design and the AMD Xilinx PetaLinux flow for the software design. It uses Xilinx IPs and software drivers to demonstrate the capabilities of the CSO and RSS features.

The major components of the example design are the Zynq UltraScale+ MPSoC, the MCDMA, the XXV Ethernet soft IP MAC, the custom Checksum Offload Engine IP, and the RSS IP.

Note: The custom RSS IP maps flows by destination port number to demonstrate the RSS feature; it is not based on the standard 4/5-tuple hash function.

*Steps to disable the CSO feature are provided in the Build flow.

Features Supported

  • PL based Checksum computation for TCP/UDP packets in the TX direction.

  • PL based Checksum Validation for TCP/UDP packets in the RX direction.

  • Receive side load balancing through interrupt Scaling for TCP/UDP packets.

Checksum Offload design Block Diagram

System Requirements and Development Tools

Hardware Required

  • ZCU102 evaluation board with power cable

  • SFP+ optical cable

  • 10G capable link partner (Solarflare X2522 NIC)

  • x86 host machine to accommodate the NIC (Dell System Precision Tower 7910)

  • Micro-USB cable for the terminal emulation

  • SD card

Software components Required

  • Operating system

    • APU: SMP Linux (ZCU102)

    • Host OS: Ubuntu 20.04.3 LTS; Linux 5.4.0-109-generic #123-Ubuntu SMP

  • Linux kernel including drivers and TCP/IP Stack

  • iperf3 Application

  • mpstat Application

  • Serial terminal emulator (Tera Term)

  • SD Card Formatter Tool

  • Optional applications for debug or troubleshooting - ethtool, tcpdump

Development Tools

  • PetaLinux Tool version 2022.1 (See UG1144 for installation instructions)

  • Vivado Design suite version 2022.1 (See UG973 for installation instructions)

Package Directory Structure and Contents

Two packages are released:

  • zcu102_10G_CSO_Example_Design_2022.1.zip contains the Vivado project creation scripts, the PetaLinux BSP, and the SD card images and binaries that enable the user to run the example design.

  • cso_example_sources_and_licenses.tar.gz contains the sources and licensing information for all PetaLinux recipes used to generate the images.

Download the AXI-Ethernet 10G CSO example design 2022.1 package: zcu102_10G_CSO_Example_Design_2022.1.zip

Download the PetaLinux sources and licensing information: cso_example_sources_and_licenses.tar.gz

Package Directory Contents

The package is released with the Vivado project creation scripts and the PetaLinux scripts to create the software images.

It also includes prebuilt SD card images that enable the user to run the example design on the ZCU102 board.

The package contains source files to build two different platforms.

  • Design-1 supports the Checksum Offload use case (zcu102_10g_ethernet_CSO)

  • Design-2 supports the Checksum Offload with RSS use case (zcu102_10g_ethernet_CSO_RSS)

Package Directory Structure 

The figure below shows the directory structure and hierarchy of the zcu102_10G_CSO_Example_Design_2022.1 package:

zcu102_10G_xxv_cso
|
+---petalinux
|   +---zcu102_10g_ethernet_CSO
|   |   |---config.project
|   |   \---project-spec
|   \---zcu102_10g_ethernet_CSO_RSS
|       |---config.project
|       \---project-spec
+---prebuild
|   +---zcu102_10g_ethernet_CSO
|   |   |---BOOT.BIN
|   |   |---boot.scr
|   |   |---image.ub
|   |   \---xsa
|   |       \---system.xsa
|   +---zcu102_10g_ethernet_CSO_RSS
|   |   |---BOOT.BIN
|   |   |---boot.scr
|   |   |---image.ub
|   |   \---xsa
|   |       \---system.xsa
|   \---zcu102_10g_ethernet_CSO_disable
|       |---BOOT.BIN
|       |---boot.scr
|       \---image.ub
+---vivado
|   +---designs
|   |   |---Makefile
|   |   |---runs.tcl
|   |   +---zcu102_10g_ethernet_CSO
|   |   |   |---config_bd.tcl
|   |   |   \---main.tcl
|   |   \---zcu102_10g_ethernet_CSO_RSS
|   |       |---config_bd.tcl
|   |       \---main.tcl
|   +---iprepo
|   |   +---cntrl_strm_rd
|   |   +---csum_rx
|   |   +---csum_rx_rss
|   |   +---csum_tx
|   |   +---pkt_overflow_logic
|   |   +---tdest_align
|   |   \---tdest_mapper
|   \---xdc
|       |---async.xdc
|       \---top.xdc
|---IMPORTANT_NOTICE_CONCERNING_THIRD_PARTY_CONTENT
\---Readme

The top-level directory structure is described below:

  • petalinux: This directory contains the PetaLinux recipes and metadata to build the images for the two use cases.

    • zcu102_10g_ethernet_CSO: This directory contains the PetaLinux recipes and metadata of the checksum offload design.

    • zcu102_10g_ethernet_CSO_RSS: This directory contains the PetaLinux recipes and metadata of the checksum offload design with RSS.

  • prebuild: This directory contains prebuilt images for the two use cases, plus images with CSO disabled.

    • zcu102_10g_ethernet_CSO: This directory contains the SD card files ( image.ub, BOOT.BIN and boot.scr ) to boot the checksum offload design.

    • zcu102_10g_ethernet_CSO_RSS: This directory contains the SD card files ( image.ub, BOOT.BIN and boot.scr ) to boot the checksum offload design with RSS.

    • zcu102_10g_ethernet_CSO_disable: This directory contains the SD card files ( image.ub, BOOT.BIN and boot.scr ) to boot the design with checksum offload disabled.

  • vivado: This directory contains the project creation scripts, the design constraints, and the custom IP repository required to create the hardware designs for the two use cases (checksum offload and checksum offload with RSS).

    • zcu102_10g_ethernet_CSO: This directory contains the Project creation scripts of the checksum offload design.

    • zcu102_10g_ethernet_CSO_RSS: This directory contains the Project creation scripts of the checksum offload design with RSS.

  • IMPORTANT_NOTICE_CONCERNING_THIRD_PARTY_CONTENT: This file contains information about Xilinx and other third party licenses.

  • Readme: This file contains the information about PetaLinux sources and licensing information.

Test Setup

This section provides the test setup information between the ZCU102 board and the Host machine.

  • Connect 12V power to the 6-pin Molex connector (J52) on the ZCU102 board.

  • Connect an SFP+ cable between the ZCU102 board SFP cage assembly (top-right cage, SFP0; see UG1182 Table 3-30) and the NIC on the x86 host machine.

  • Prepare the SD card. Windows offers several formatting options; use the SD Card Formatter tool to format the SD card.

Note: Always format the SD card with the FAT32 option.

  • Set the SW6 switches as shown in the table below. This configures the board to boot from the SD card.

Boot Mode    Mode Pins [3:0]    Mode SW6[4:1]
SD           1110               off, off, off, on

 

  • Connect the Micro USB cable into the ZCU102 Board Micro USB port (J96) and the other end into an open USB port on the host PC. This cable is used for UART over USB.

  • Power on the board and make sure that the operational status LEDs such as power supply status, INIT, DONE and all power rail LEDs are lit green.

  • Run the serial terminal emulator (Tera Term) and make sure the serial communication settings are as follows:

    • Baud Rate: 115200

    • Data: 8 bit

    • Parity: None

    • Stop: 1 bit

    • Flow Control: None

Run Flow

This section describes the run flow and commands required to test the checksum offload and checksum offload with RSS features on the ZCU102 Board using the prebuild images.

Prior to running the steps mentioned below, download the Example design package and extract its contents.

  1. Set up the board as explained in the “Test Setup” section.

  2. Copy the ready-to-test images ( image.ub, BOOT.BIN and boot.scr ) from the “../prebuild/zcu102_10g_ethernet_CSO” folder to the FAT32-formatted SD card.

  3. Insert the SD card in the SD card slot and make sure that the board is in SD boot mode.

  4. Power on the board. After a successful boot, a login prompt appears as shown below.

ZCU102-CSUM-2022 login:

  5. Log in with the username ‘petalinux’ and create a new password when prompted.

  6. To add or modify the Ethernet interface settings, switch to the superuser as shown below, using the password created in step 5.

ZCU102-CSUM-2022:~$ sudo su

Note: To boot the CSO design with RSS, copy the images from the “../prebuild/zcu102_10g_ethernet_CSO_RSS” folder to the SD card and follow steps 3 and 4.

To check the performance with CSO disabled, copy the images from the “../prebuild/zcu102_10g_ethernet_CSO_disable” folder to the SD card and follow steps 3 and 4.

Use ethtool -k <xxv-eth-interface-name> to verify whether the “rx-checksumming” and “tx-checksumming” features are on or off before running iperf3 traffic.

Run Iperf3 application

Once the host and ZCU102 are booted, set up an IP address for each Ethernet port and make sure that the Ethernet link is established using ping. If the link is not detected, make the interface go down and up using the command given below. Do not proceed until you are able to ping each interface.

ZCU102-CSUM-2022:~# ifconfig <xxv-eth-interface-name> down

ZCU102-CSUM-2022:~# ifconfig <xxv-eth-interface-name> up

Note: After applying the interface up command, make sure a valid IP address is set for the interface.

To set the IP Address:

ifconfig <xxv-eth-interface-name> <ip-address>

(for example ifconfig eth1 195.168.1.100)

Pre-requisites:

  1. Inspect /proc/interrupts

ZCU102-CSUM-2022:~# cat /proc/interrupts | grep <interface-name>

Note: the above command lists the Transmit side interrupt number (tx-irq-no) followed by the Receive side interrupt number (rx-irq-no) and associated cores to process the interrupt. The CSO design has two interrupts, one for TX and one for RX whereas the CSO RSS design has five interrupts, one for TX and four for RX.

2. CPU Utilization Reporting

ZCU102-CSUM-2022:~# mpstat -P ALL 1 50

The command reports the average CPU idle percentage of all cores over a period of 50 seconds. The CPU utilization percentage is obtained by subtracting the average CPU idle percentage from 100. (For example, when the average CPU idle percentage is 15%, the CPU utilization percentage is 85%.)
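The subtraction described above can be sketched in Python. This is a minimal illustration and not part of the example design; the function names and the sample idle values are assumptions chosen for the example.

```python
def average_idle(per_core_idle):
    """Average the %idle values reported for the individual cores."""
    if not per_core_idle:
        raise ValueError("at least one core value is required")
    return sum(per_core_idle) / len(per_core_idle)


def cpu_utilization(avg_idle_percent):
    """CPU utilization (%) = 100 - average idle (%)."""
    if not 0.0 <= avg_idle_percent <= 100.0:
        raise ValueError("idle percentage must be between 0 and 100")
    return 100.0 - avg_idle_percent


# Hypothetical %idle values for the four cores of the APU.
idle = average_idle([10.0, 20.0, 10.0, 20.0])
print(f"average idle: {idle}%, utilization: {cpu_utilization(idle)}%")
```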

UDP RX
  1. Set RX interrupt affinity

    • Set Ethernet MCDMA RX interrupt affinity to core-1

      ZCU102-CSUM-2022:~# echo 2 > /proc/irq/<rx-irq-no>/smp_affinity

      • Note: The value 2 is a CPU bitmask with bit 1 set, which selects core 1.

  2. Enable flow steering

    • Enable Receive Flow Steering

      ZCU102-CSUM-2022:~# echo 32768 > /proc/sys/net/core/rps_sock_flow_entries

      ZCU102-CSUM-2022:~# echo 2048 > /sys/class/net/<interface-name>/queues/rx-0/rps_flow_cnt

      ZCU102-CSUM-2022:~# echo 2048 >  /sys/class/net/<interface-name>/queues/rx-1/rps_flow_cnt

  3. Run iperf servers on the Zynq UltraScale+ MPSoC

    • Run two threads of iperf servers on core 2 and core 3 of the Zynq UltraScale+ MPSoC.

      ZCU102-CSUM-2022:~# taskset -c 2 iperf3 -s -p 5301 -i 60 &

      ZCU102-CSUM-2022:~# taskset -c 3 iperf3 -s -p 5302 -i 60 &

  4. Run iperf clients on the Host Machine:

    • Run two threads of iperf clients on the host.

      host:~# iperf3 -c <Board_IP> -u -P 2 -T s1 -p 5301 -t 60 -i 60 -b 2500M -l 1472 &

      host:~# iperf3 -c <Board_IP>  -u -P 2 -T s2 -p 5302 -t 60 -i 60 -b 2500M -l 1472 &
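The values written to smp_affinity in the steps above are hexadecimal CPU bitmasks in which bit N selects core N. The following Python sketch shows the conversion from core numbers to the mask string; the helper name affinity_mask is hypothetical and used only for illustration.

```python
def affinity_mask(*cores):
    """Return the hexadecimal smp_affinity mask that selects the given cores."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core numbers must be non-negative")
        mask |= 1 << core  # bit N selects core N
    return format(mask, "x")


for core in range(4):
    print(f"core {core} -> echo {affinity_mask(core)} > /proc/irq/<irq-no>/smp_affinity")
```

For example, core 1 maps to the mask 2 used above, and cores 2 and 3 together would map to the mask c.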

Table 1: Performance Comparison of CSO enabled and disabled use cases for UDP RX

The above table shows that enabling the CSO feature has improved the throughput and CPU utilization.


UDP TX
  1. Set TX interrupt affinity

    • Set Ethernet MCDMA TX interrupt affinity to core-1.

      ZCU102-CSUM-2022:~# echo 2 > /proc/irq/<tx-irq-no>/smp_affinity

  2. Start iperf servers on the host machine

    • Run four threads of iperf servers on the host machine.

      host:~# iperf3 -s -p 5301 & iperf3 -s -p 5302 & iperf3 -s -p 5303 & iperf3 -s -p 5304 &

  3. Run iperf clients on the Zynq UltraScale+ MPSoC

    • Run four threads of iperf clients on core 0, core 1, core 2 and core 3 of the Zynq UltraScale+ MPSoC.

      ZCU102-CSUM-2022:~# taskset -c 0 iperf3 -u -P 2 -c <host_IP> -T s1 -p 5301 -t 60 -i 60 -b 450M &

      ZCU102-CSUM-2022:~# taskset -c 1 iperf3 -u -P 2 -c <host_IP> -T s2 -p 5302 -t 60 -i 60 -b 450M &

      ZCU102-CSUM-2022:~# taskset -c 2 iperf3 -u -P 2 -c <host_IP> -T s3 -p 5303 -t 60 -i 60 -b 450M &

      ZCU102-CSUM-2022:~# taskset -c 3 iperf3 -u -P 2 -c <host_IP> -T s4 -p 5304 -t 60 -i 60 -b 450M &

Note: Make sure to start all threads at the same instant.
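One way to start the clients at nearly the same instant is to issue all of them on a single shell line. The Python sketch below only builds that command line from the parameters used in the steps above; the function name and the sample host address 192.168.1.10 are assumptions made for illustration.

```python
def udp_tx_client_commands(host_ip, cores=(0, 1, 2, 3), base_port=5301, bitrate="450M"):
    """Build one backgrounded iperf3 UDP client command per core."""
    commands = []
    for index, core in enumerate(cores):
        commands.append(
            f"taskset -c {core} iperf3 -u -P 2 -c {host_ip} "
            f"-T s{index + 1} -p {base_port + index} -t 60 -i 60 -b {bitrate} &"
        )
    return commands


# Joining the commands with spaces gives a single shell line ("a & b & c & d &"),
# so the shell launches every client back to back.
one_line = " ".join(udp_tx_client_commands("192.168.1.10"))
print(one_line)
```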

Table 2: Performance Comparison of CSO enabled and disabled use cases for UDP TX

The above table shows that enabling the CSO feature has improved the CPU utilization.


TCP RX
  1. Set RX interrupt affinity

    • Set Ethernet MCDMA RX interrupt affinity to core-1

      ZCU102-CSUM-2022:~# echo 2 > /proc/irq/<rx-irq-no>/smp_affinity

  2. Enable flow steering

    • Enable Receive Flow Steering and Receive Packet Steering

      ZCU102-CSUM-2022:~# echo 32768 > /proc/sys/net/core/rps_sock_flow_entries

      ZCU102-CSUM-2022:~# echo 2048 > /sys/class/net/<interface-name>/queues/rx-0/rps_flow_cnt

      ZCU102-CSUM-2022:~# echo 2048 >  /sys/class/net/<interface-name>/queues/rx-1/rps_flow_cnt

  3. Change BD count

    • Increase the RX BD count to 1024 from the default 128.

      ZCU102-CSUM-2022:~# ifconfig <interface-name> down

      ZCU102-CSUM-2022:~# ethtool -G <interface-name> rx 1024

      ZCU102-CSUM-2022:~# ifconfig <interface-name> up

  4. Run iperf servers on the Zynq UltraScale+ MPSoC

    • Run two threads of iperf servers on core 2 and core 3 of the Zynq UltraScale+ MPSoC.

      ZCU102-CSUM-2022:~# taskset -c 2 iperf3 -s -p 5301 -i 60 &

      ZCU102-CSUM-2022:~# taskset -c 3 iperf3 -s -p 5302 -i 60 &

  5. Run iperf clients on Host Machine

    • Run two threads of iperf clients on host.

      host:~# iperf3 -c <Board_IP> -P 2 -T s1 -p 5301 -t 60 -i 60 -b 500M &

      host:~# iperf3 -c <Board_IP> -P 2 -T s2 -p 5302 -t 60 -i 60 -b 500M &

Table 3: Performance Comparison of CSO enabled and disabled use cases for TCP RX

The above table shows that enabling the CSO feature has improved the throughput and CPU utilization.


TCP TX
  1. Set TX interrupt affinity

    • Set Ethernet MCDMA TX interrupt affinity to core-1.

      ZCU102-CSUM-2022:~# echo 2 > /proc/irq/<tx-irq-no>/smp_affinity

  2. Start iperf servers on the host machine

    • Run two threads of iperf servers on the host machine.

      host:~# iperf3 -s -p 5301 & iperf3 -s -p 5302 &

  3. Run iperf clients on the Zynq UltraScale+ MPSoC

    • Run two threads of iperf clients on core 2 and core 3 of the Zynq UltraScale+ MPSoC.

      ZCU102-CSUM-2022:~# taskset -c 2 iperf3 -c <host_IP> -T s1 -p 5301 -t 60 -i 60 -b 1850M &

      ZCU102-CSUM-2022:~# taskset -c 3 iperf3 -c <host_IP> -T s2 -p 5302 -t 60 -i 60 -b 1850M &

Table 4: Performance Comparison of CSO enabled and disabled use cases for TCP Tx use case

The above table shows that enabling the CSO feature has improved CPU utilization.


TCP RX with RSS
  1. Set RX interrupt affinity

    • Set the Ethernet MCDMA RX interrupt affinities to core 0, core 1, core 2 and core 3.

      ZCU102-CSUM-2022:~# echo 1 > /proc/irq/<rx-irq-no-1>/smp_affinity

      ZCU102-CSUM-2022:~# echo 2 > /proc/irq/<rx-irq-no-2>/smp_affinity

      ZCU102-CSUM-2022:~# echo 4 > /proc/irq/<rx-irq-no-3>/smp_affinity

      ZCU102-CSUM-2022:~# echo 8 > /proc/irq/<rx-irq-no-4>/smp_affinity

  2. Enable flow steering

    • Enable Receive Flow Steering and Receive Packet Steering

      ZCU102-CSUM-2022:~# echo 32768 > /proc/sys/net/core/rps_sock_flow_entries

      ZCU102-CSUM-2022:~# echo 2048 > /sys/class/net/<interface-name>/queues/rx-0/rps_flow_cnt

      ZCU102-CSUM-2022:~# echo 2048 >  /sys/class/net/<interface-name>/queues/rx-1/rps_flow_cnt

  3. Change BD count

    • Increase the RX BD count to 1024 from the default 128

      ZCU102-CSUM-2022:~# ifconfig <interface-name> down

      ZCU102-CSUM-2022:~# ethtool -G <interface-name> rx 1024

      ZCU102-CSUM-2022:~# ifconfig <interface-name> up

  4. Run iperf servers on the Zynq UltraScale+ MPSoC

    • Run four threads of iperf servers on core 0, core 1, core 2 and core 3 of the Zynq UltraScale+ MPSoC with port numbers 5301, 5302, 5303 and 5304.

      ZCU102-CSUM-2022:~# taskset -c 0 iperf3 -s -p 5301 -i 60 &

      ZCU102-CSUM-2022:~# taskset -c 1 iperf3 -s -p 5302 -i 60 &

      ZCU102-CSUM-2022:~# taskset -c 2 iperf3 -s -p 5303 -i 60 &

      ZCU102-CSUM-2022:~# taskset -c 3 iperf3 -s -p 5304 -i 60 &

      • Note: Use exactly the port numbers given in the commands above, because the RSS IP is implemented with these destination ports.

  5. Run iperf clients on Host Machine

    • Run four threads of iperf clients on the host machine with port numbers 5301, 5302, 5303 and 5304.

      host:~# iperf3 -c <Board_IP> -P 2 -T s1 -p 5301 -t 60 -i 60 -b 500M &

      host:~# iperf3 -c <Board_IP> -P 2 -T s2 -p 5302 -t 60 -i 60 -b 500M &

      host:~# iperf3 -c <Board_IP> -P 2 -T s3 -p 5303 -t 60 -i 60 -b 500M &

      host:~# iperf3 -c <Board_IP> -P 2 -T s4 -p 5304 -t 60 -i 60 -b 500M &

Note: Make sure to start all threads at the same instant.

Table 5: Performance improvement achieved with RSS implemented for TCP RX


UDP RX with RSS
  1. Set RX interrupt affinity

    • Set the Ethernet MCDMA RX interrupt affinities to core 0 and core 1.

      ZCU102-CSUM-2022:~# echo 1 > /proc/irq/<rx-irq-no-1>/smp_affinity

      ZCU102-CSUM-2022:~# echo 2 > /proc/irq/<rx-irq-no-2>/smp_affinity

  2. Run iperf servers on the Zynq UltraScale+ MPSoC

    • Run two threads of iperf servers on core 2 and core 3.

      ZCU102-CSUM-2022:~# taskset -c 2 iperf3 -s -p 5301 -i 60 &

      ZCU102-CSUM-2022:~# taskset -c 3 iperf3 -s -p 5302 -i 60 &

  3. Run iperf clients on the Host Machine:

    • Run two threads of iperf clients on the host.

      host:~# iperf3 -c <Board_IP> -u -P 2 -T s1 -p 5301 -t 60 -i 60 -b 2500M -l 1472 &

      host:~# iperf3 -c <Board_IP>  -u -P 2 -T s2 -p 5302 -t 60 -i 60 -b 2500M -l 1472 &

Table 6: Performance improvement achieved with RSS implemented for UDP RX


Build Flow

Steps to build the Vivado Hardware Design and generate the XSA

Refer to the Vivado Design Suite User Guide: Using the Vivado IDE, UG893, for information on setting up the Vivado environment.

Note: Prior to running the steps mentioned below, download the CSO example design package and extract its contents

Steps:

  1. Open a Linux terminal.

  2. Change the working directory to the CSO example design folder.

  3. Source the Vivado 2022.1 settings script: <path/to/vivado-installer>/settings64.csh

  4. Navigate to the ../vivado/designs/zcu102_10g_ethernet_CSO folder to build checksum offload design.

5. Run the following command in the terminal to create the Vivado project, invoke the GUI, populate the IPI block design and generate the XSA. The XSA generation may take an hour.

6. The generated XSA will be located at $working_dir/vivado/designs/zcu102_10g_ethernet_CSO/project/xxv_subsys_wrapper.xsa

Note: To build the RSS design, navigate to the “../vivado/designs/zcu102_10g_ethernet_CSO_RSS” folder in step 4.

Steps to build a Linux image with the PetaLinux Tool

This section shows how to build the Linux image using the PetaLinux tools.

Note: Prior to running the steps mentioned below, make sure that the XSA has been generated successfully.

Steps:

  1. Open a Linux terminal.

  2. Change directory to the CSO example design folder.

  3. Source the PetaLinux settings script: <path/to/petalinux-installer>/tool/petalinux-v2022.1-final/settings.sh

  4. Navigate to ../petalinux/zcu102_10g_ethernet_CSO folder to build the checksum offload design.

  5. Run the following command in the terminal to configure the PetaLinux project

6. Build the PetaLinux project

The generated images are located in $working_dir/petalinux/zcu102_10g_ethernet_CSO/images/linux/ folder.

7. Create a boot image (BOOT.BIN) including FSBL, ATF, bitstream, and u-boot.

8. Copy the image (image.ub , BOOT.BIN and boot.scr ) to the FAT32 formatted SD card and insert the card in SD card slot to run the design.

Note: To build the RSS design, navigate to the “../petalinux/zcu102_10g_ethernet_CSO_RSS” folder in step 4.

Steps to build CSO Disable images

  1. After step 5 of the PetaLinux build, navigate to the $working_dir/project-spec/meta-user/recipes-bsp/device-tree/files folder.

  2. Open the system-user.dtsi file in an editor.

  3. Delete the properties below and save the file.
    xlnx,txcsum = <0x1>
    xlnx,rxcsum = <0x2>

  4. Run steps 6, 7 and 8 of the PetaLinux build.
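The manual edit in step 3 could also be scripted. The Python sketch below removes the two properties from device-tree text; it is only an illustration, and the node label &axi_eth and the surrounding content in SAMPLE_DTSI are assumptions, not the contents of the shipped system-user.dtsi.

```python
import re

CSO_PROPERTIES = ("xlnx,txcsum", "xlnx,rxcsum")


def strip_cso_properties(dtsi_text):
    """Remove the CSO property lines (with any value) from device-tree text."""
    names = "|".join(re.escape(name) for name in CSO_PROPERTIES)
    # Match a whole line of the form "<indent>name = <value>;".
    pattern = re.compile(
        rf"^[ \t]*(?:{names})[ \t]*=[ \t]*<[^>]*>[ \t]*;[ \t]*\n?",
        re.MULTILINE,
    )
    return pattern.sub("", dtsi_text)


# Hypothetical fragment; the real node name and other properties will differ.
SAMPLE_DTSI = (
    "&axi_eth {\n"
    "\txlnx,txcsum = <0x1>;\n"
    "\txlnx,rxcsum = <0x2>;\n"
    "\tstatus = \"okay\";\n"
    "};\n"
)

print(strip_cso_properties(SAMPLE_DTSI))
```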

Performance Numbers for Non-CSO, CSO and CSO+RSS

Setup Details
Host setup: Dell System Precision Tower 7910
Iperf: iperf 3-CURRENT (cJSON 1.5.2)
OS : Ubuntu 20.04.3 LTS; Linux 5.4.0-109-generic #123-Ubuntu SMP (or)Ubuntu LTS version : Linux 3.13.0-147-generic #196-Ubuntu SMP
NIC (10G Solarflare X2522 Dual-Port 10GbE SFP+ Adapter) : Default

Note: This benchmarking was done with the default system network parameters set by the operating system. The network parameters can be tuned with the Linux sysctl command to improve IPv4 and IPv6 traffic performance. Changing the network parameters might yield different results on different systems.

Table 7: Performance Comparison of CSO disable, CSO and CSO with RSS for TCP/UDP Tx and Rx use cases

Note: UDP loss % for above is < 0.05% and TCP retry count is < 1000 for a span of 60 seconds.

Hardware CSO Engine

The CSO Engine consists of a Tx_csum IP on the transmit side and an Rx_csum IP on the receive side. The Rx_csum IP validates the checksum of each incoming packet and updates a qualifier field with the status of the packet. The Tx_csum IP computes the checksum and inserts it into the TCP/UDP checksum field of the packet. The building blocks of the receive side and transmit side checksum IPs are detailed in this section.

Checksum Validation IP in Receive Side

The Rx_csum IP sits between the AXIS_RX interface of the XXV Ethernet MAC IP and the S2MM interface of the AXI-MCDMA IP. The RX pipeline and the Rx_csum IP block diagram are shown below.

This IP parses the incoming packet and computes the IP header checksum, TCP/UDP header checksum and RAW checksum and accordingly sends the status to the S2MM AXI-MCDMA control/status stream. The status matrix is shown below.

Receive CSUM Status:
000 = Neither the IP header nor the TCP/UDP checksums were checked.
001 = The IP header checksum was checked and was correct. The TCP/UDP checksum was not checked.
010 = Both the IP header checksum and the TCP checksum were checked and were correct.
011 = Both the IP header checksum and the UDP checksum were checked and were correct.
100 = Reserved
101 = The IP header checksum was checked and was incorrect. The TCP/UDP checksum was not checked.
110 = The IP header checksum was checked and is correct but the TCP checksum was checked and was incorrect.
111 = The IP header checksum was checked and is correct but the UDP checksum was checked and was incorrect.
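The status matrix above can be expressed as a small lookup table. The following Python sketch is a software model for illustration only; the type and function names are assumptions and do not correspond to driver or IP source code.

```python
from typing import NamedTuple, Optional


class RxCsumStatus(NamedTuple):
    ip_checked: bool           # IP header checksum was checked
    ip_ok: Optional[bool]      # None when not checked
    l4_proto: Optional[str]    # "TCP", "UDP", or None when not checked
    l4_ok: Optional[bool]      # None when not checked


_STATUS_TABLE = {
    0b000: RxCsumStatus(False, None, None, None),
    0b001: RxCsumStatus(True, True, None, None),
    0b010: RxCsumStatus(True, True, "TCP", True),
    0b011: RxCsumStatus(True, True, "UDP", True),
    # 0b100 is reserved
    0b101: RxCsumStatus(True, False, None, None),
    0b110: RxCsumStatus(True, True, "TCP", False),
    0b111: RxCsumStatus(True, True, "UDP", False),
}


def decode_rx_csum_status(code):
    """Decode the 3-bit Receive CSUM Status value."""
    if code not in _STATUS_TABLE:
        raise ValueError(f"reserved or invalid status code: {code:#05b}")
    return _STATUS_TABLE[code]


def fully_validated(code):
    """True only when both the IP header and TCP/UDP checksums were checked and correct."""
    status = decode_rx_csum_status(code)
    return status.ip_ok is True and status.l4_ok is True


print(decode_rx_csum_status(0b010), fully_validated(0b010))
```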

The Receive CSUM Status is embedded in the AXI MCDMA Status stream. The App field mapping of S2MM status stream is given below.

 

App Fields    Name         Description
App3[15:0]    RX_CSRAW     Receive Raw Checksum
App4[2:0]     RX_CS_STS    Receive Checksum Status
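The field positions in the table above translate to simple mask operations. In this Python sketch, app3 and app4 stand for the 32-bit App3 and App4 words of the S2MM status stream; the function name and the sample word values are assumptions for illustration.

```python
def parse_s2mm_app_fields(app3, app4):
    """Extract RX_CSRAW (App3[15:0]) and RX_CS_STS (App4[2:0])."""
    rx_csraw = app3 & 0xFFFF   # lower 16 bits of App3
    rx_cs_sts = app4 & 0x7     # lower 3 bits of App4
    return rx_csraw, rx_cs_sts


# Hypothetical status words.
raw, status = parse_s2mm_app_fields(0xABCD1234, 0x00000002)
print(f"RX_CSRAW={raw:#06x} RX_CS_STS={status:03b}")
```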

 

Checksum Calculation IP during Transmit - Partial Checksum Offload

The Tx_csum IP computes the checksum and inserts it into the TCP/UDP checksum field if the offload enable field (TxCsumCntrl) in the App field is set by the software; otherwise, the packet is passed to the MAC unmodified. The TX checksum block calculates the checksum of the packet starting from the byte index provided in the control stream until the end of the packet. The IP pseudo-header contribution is provided by the software as an App field and is used to initialize the checksum value.

The App fields from the control stream provide the byte index of the checksum calculation starting point (TxCsBegin), the checksum insertion point (TxCsInsert), the checksum calculation initial value (TxCsInit) and the offload enable field (TxCsumCntrl). The App field mapping of the MM2S control stream is given below.

App Fields     Name           Description
App1[11:10]    TxCsumCntrl    Transmit Checksum Enable
App2[15:0]     TxCsInit       Transmit Checksum Calculation Initial Value
App3[15:0]     TxCsBegin      Transmit Checksum Calculation Starting Point
App3[31:16]    TxCsInsert     Transmit Checksum Insertion Point
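The partial checksum offload described above can be modeled in software. The Python sketch below is not the IP implementation; it assumes that the checksum field in the frame is zero, that TxCsInit holds the folded 16-bit one's-complement sum of the pseudo-header, and that the result is complemented before insertion. The addresses, ports and payload are made-up sample values.

```python
import struct


def ones_complement_sum(data, initial=0):
    """16-bit one's-complement sum of data, seeded with an initial value."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data with a zero byte
    total = initial
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


def tx_checksum_offload(frame, tx_cs_begin, tx_cs_insert, tx_cs_init):
    """Sum from tx_cs_begin to the end of the frame and insert the result."""
    csum = ~ones_complement_sum(frame[tx_cs_begin:], tx_cs_init) & 0xFFFF
    out = bytearray(frame)
    out[tx_cs_insert:tx_cs_insert + 2] = struct.pack("!H", csum)
    return bytes(out)


# Sample UDP/IPv4 frame: 14-byte Ethernet and 20-byte IP headers left as zeros.
src, dst = bytes([192, 168, 1, 100]), bytes([192, 168, 1, 10])
payload = b"hello"
udp_len = 8 + len(payload)
udp = struct.pack("!HHHH", 40000, 5301, udp_len, 0) + payload
frame = bytes(14) + bytes(20) + udp

# Pseudo-header: source, destination, zero, protocol (17 = UDP), UDP length.
pseudo = src + dst + struct.pack("!BBH", 0, 17, udp_len)
init = ones_complement_sum(pseudo)

# The UDP header starts at byte 34; its checksum field is 6 bytes further.
out = tx_checksum_offload(frame, tx_cs_begin=34, tx_cs_insert=40, tx_cs_init=init)
print(f"inserted checksum: {out[40:42].hex()}")
```

A receiver that sums the pseudo-header and the UDP segment including the inserted checksum obtains 0xFFFF, which is the usual validity check. For TCP, the checksum field is 16 bytes into the TCP header, so the insertion point in the same layout would be byte 50.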

 

Receive Side Interrupt Scaling

Receive Side Interrupt Scaling provides the benefits of parallel receive processing in multiprocessing environments. It can improve system performance by distributing receive processing across multiple CPUs. This helps to ensure that no CPU is heavily loaded while another CPU is idle.

When a receive queue (or interrupt) is tied to a specific core, packets from the same flow are steered to that core. Each receive queue has a separate IRQ associated with it, and is triggered to notify a CPU when new packets arrive on the given queue.

Note: In this example design, RSS is not based on the industry-standard 4/5-tuple hash function. It is based on port number mapping to demonstrate the performance benefit of RSS.

The RSS block in the Rx-CSO Engine parses the header of each received packet and generates the Tdest for the MCDMA based on the destination port. Based on the Tdest, the MCDMA generates an interrupt for the corresponding S2MM channel, which in turn is associated with a CPU core through the interrupt affinity configuration. IRQ affinity assigns the interrupts to different CPU cores to distribute the CPU workload and speed up data processing.
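The port-based steering described above can be modeled as a table lookup on the destination port. This Python sketch is a software illustration, not the RSS IP logic. The port numbers 5301 to 5304 come from the run flow on this page, while the port-to-Tdest order, the default Tdest of 0 and the assumption of untagged IPv4 frames are choices made for the example.

```python
import struct
from typing import Optional

# Assumed mapping of destination port to Tdest (S2MM channel).
RSS_PORT_TO_TDEST = {5301: 0, 5302: 1, 5303: 2, 5304: 3}
DEFAULT_TDEST = 0  # assumption: unmatched traffic uses channel 0


def destination_port(frame: bytes) -> Optional[int]:
    """Return the TCP/UDP destination port of an untagged IPv4 frame."""
    if len(frame) < 14 + 20 or frame[12:14] != b"\x08\x00":
        return None  # too short or not IPv4
    ip_header_len = (frame[14] & 0x0F) * 4
    if frame[23] not in (6, 17):  # IP protocol: 6 = TCP, 17 = UDP
        return None
    l4_offset = 14 + ip_header_len
    if len(frame) < l4_offset + 4:
        return None
    return struct.unpack_from("!H", frame, l4_offset + 2)[0]


def select_tdest(frame: bytes) -> int:
    """Pick the Tdest for a received frame from its destination port."""
    return RSS_PORT_TO_TDEST.get(destination_port(frame), DEFAULT_TDEST)


def make_sample_frame(protocol: int, dst_port: int) -> bytes:
    """Build a minimal frame for the example (headers mostly zeroed)."""
    eth = bytes(12) + b"\x08\x00"
    ip = bytearray(20)
    ip[0] = 0x45       # IPv4, 20-byte header
    ip[9] = protocol
    return eth + bytes(ip) + struct.pack("!HH", 40000, dst_port)


print(select_tdest(make_sample_frame(6, 5303)))
```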

Software Architecture of CSO Platform

The Linux Ethernet driver derives Checksum Offload capability information from the design via Device tree parameters.

Please refer to the description of xlnx,txcsum and xlnx,rxcsum here:

https://github.com/Xilinx/linux-xlnx/blob/master/Documentation/devicetree/bindings/net/xilinx_axienet.txt#L62

When Hardware offload capability is present,

  • The driver sets necessary CSUM metadata in the Buffer Descriptor fields for every Transmit packet.

    • The driver informs the Linux framework of the HW CSUM offload capability, thereby preventing SW checksum operations from running in the framework.

  • The driver verifies CSUM Offload status in the descriptor fields of every Receive packet and informs the Linux Ethernet framework when there is no need to perform SW CSUM.

Other Information

The following are the conclusions, limitations and future scope of actions from profiling and performance improvement experiments:

  • The TCP TX checksum offload path utilizes two buffer descriptors instead of one per data transfer, resulting in marginally higher processing in the driver and utilization of a higher number of descriptors. As a result, it is recommended to double the total buffer descriptors (to 256) to ensure that there is enough for TX data processing.

  • There can be variations in the performance numbers based on the Linux PC and OS version on the link partner. In addition, load on the link partner will also result in differences in performance numbers during multiple runs. For consistent results, please ensure that the load on the link partner remains the same throughout and that the Ethernet (iperf3) process gets priority.

  • Function profiling of the Linux Ethernet path showed that the SW checksum functionality accounts for an average CPU utilization of 3-5%. This is the bandwidth that the HW offload functionality frees up.

    • The other major consumers of CPU bandwidth are the memory copy functions between user space and kernel space (in both directions) and the AXI Ethernet transfer functions (the latter have been profiled and optimized in their current state).

  • Note that SW GRO is always enabled on the Linux kernel used on the target by default.

  • Note that the Receive Side Scaling example is a proof-of-concept demonstration. For practical deployment, a standard RSS technique, for example one based on a 5-tuple hash, can be implemented.

Known Issues

  • In the TCP TX use case, throughput and CPU utilization match Table 4 for most iterations. Occasionally, however, the throughput drops and the CPU utilization drops proportionally.

  • Rootfs for CSO RSS design is configured with Init-manager-sysvinit.

  • Dynamic ON/OFF of “rx-checksumming” and “tx-checksumming” using ethtool utility is not enabled in the driver.

Future scope

  • Hardware Segmentation Offloads and Software DPDK based improvements can be explored.

Revision History

2022.1_web - 08/31/2022





© Copyright 2019 - 2022 Xilinx Inc.