Traffic Shaping of HP Ports on Zynq UltraScale+
This article is a guide to tuning the default configuration that shapes HP port traffic, to help you meet your system throughput requirements.
Introduction
Performance through the HP ports depends heavily on the traffic patterns generated by the PL masters and on the non-deterministic traffic driven by software running on the processors. Both sets of masters compete for DDR access. Because software-driven traffic is non-deterministic, it is difficult to model accurately for optimal DDR efficiency. However, you may be able to iteratively tune the default configuration to help meet your system performance requirements. The controls described here are not an exhaustive list, but they are the most effective knobs for shaping HP traffic. The data path from the HP ports in the PL to the DDR memory controller is shown below.
This article does not address:
HPC, HPM, ACE, ACP or LPD ports
CCI/QVN enablement
SMMU enablement
PS-PL Interface
The PS-PL interface consists of one AXI FIFO interface (AFI) per port, which bridges the PS and PL domains. The main controls here are the QoS and the issuing capability per port. The QoS specifies the priority of the transaction and is also used to map the channel into traffic classes in the DDR memory controller. The QoS may be static or dynamic depending on your system requirements. The issuing capability defines how many outstanding HP transactions may be in flight at a given time.
To maximize performance, the PL devices should be configured with an AXI bus width of 128 bits and an AXI burst length (BL) of 16.
The PS-PL interface will slice AXI bursts longer than 16 beats, but this conversion may reduce performance when the interface is heavily loaded.
Registers
Offset | Bits | Description |
---|---|---|
0x0 | [2] | FABRIC_QOS_EN: 0-Enable static QoS, 1-Enable dynamic QoS (PL AXI sideband pass-through) |
0x4 | [3:0] | CAPABILITY: Read/write command issuing capability (0-15); Number of commands minus one |
0x8 | [3:0] | VALUE: Static QoS value (0-15) |
0x14 | [2] | FABRIC_QOS_EN: 0-Enable static QoS, 1-Enable dynamic QoS (PL AXI sideband pass-through) |
0x18 | [3:0] | CAPABILITY: Read/write command issuing capability (0-15); Number of commands minus one |
0x1C | [3:0] | VALUE: Static QoS value (0-15) |
These registers are dynamic and may be changed at run-time.
Example #1: AFIFM QoS: HP0-3 R/W set to Best Effort (0x0) (default)
Example #2: AFIFM QoS: HP0 R/W set to Video (0x7), HP1-3 R/W set to Best Effort (0x0)
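As a rough illustration of Examples #1 and #2, the Python sketch below builds the read-modify-write operations for one AFIFM port. The helper name `afifm_static_qos_writes` and the `(offset, mask, value)` tuple format are hypothetical and exist only for this illustration; the offsets and bit fields are taken from the register table above. The per-port AFI base addresses are not listed in this article, so the sketch deals in offsets only.

```python
# Offsets and bit fields from the AFIFM register table above.
# The table lists two identical register groups per port.
CTRL_OFFSETS = (0x00, 0x14)   # FABRIC_QOS_EN is bit [2]
ISSUE_OFFSETS = (0x04, 0x18)  # CAPABILITY is bits [3:0]
QOS_OFFSETS = (0x08, 0x1C)    # VALUE is bits [3:0]

FABRIC_QOS_EN = 1 << 2
FIELD_4BIT = 0xF


def afifm_static_qos_writes(qos, outstanding=None):
    """Return (offset, mask, value) read-modify-write tuples for one port.

    qos: static QoS value, 0-15.
    outstanding: optional number of commands in flight, 1-16.
    """
    if not 0 <= qos <= 15:
        raise ValueError("QoS must be in 0..15")
    writes = []
    for off in CTRL_OFFSETS:
        writes.append((off, FABRIC_QOS_EN, 0))  # 0 selects static QoS
    for off in QOS_OFFSETS:
        writes.append((off, FIELD_4BIT, qos))
    if outstanding is not None:
        if not 1 <= outstanding <= 16:
            raise ValueError("outstanding commands must be in 1..16")
        for off in ISSUE_OFFSETS:
            # The field holds the number of commands minus one.
            writes.append((off, FIELD_4BIT, outstanding - 1))
    return writes


# Example #2: HP0 set to Video (0x7), HP1-3 left at Best Effort (0x0).
hp0_writes = afifm_static_qos_writes(0x7)
hp1_writes = afifm_static_qos_writes(0x0)
for offset, mask, value in hp0_writes:
    print(f"offset 0x{offset:02X}: mask 0x{mask:X}, value 0x{value:X}")
```

Each tuple would be applied as `new = (old & ~mask) | value` at the port's base address plus the offset.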
FPD Interconnect Switches
The interconnect switches are built from the ARM NIC-400 with QoS-400 IP, which provides two traffic regulation mechanisms:
Transaction rate regulation
Outstanding transaction regulation
Registers
The FPD General Programmer's View (GPV) Module contains the QoS-400 registers for regulating the AXI traffic through the NIC-400 switches. The table below maps the “top” NIC-400 switches. The “bottom” NIC-400 switches also have their own register set for regulating traffic as a secondary level of control. Regulating the “bottom” switches is not investigated here.
FPD GPV module base address: 0xFD700000
Transaction Rate Regulation
When regulating the traffic with the QoS-400, you must specify one of these parameter sets:
Peak rate, burstiness and average rate
Peak rate only
Burstiness and average rate
HP0 Registers | HP1 Registers | HP2 Registers | HP3 Registers | Description |
---|---|---|---|---|
0x47118 | afifm3M_intfpd_aw_p (0x4A118) | afifm4M_intfpd_aw_p (0x4B118) | afifm5M_intfpd_aw_p (0x4C118) | AW channel peak rate |
0x4711C | afifm3M_intfpd_aw_b (0x4A11C) | afifm4M_intfpd_aw_b (0x4B11C) | afifm5M_intfpd_aw_b (0x4C11C) | AW channel burstiness allowance |
0x47120 | afifm3M_intfpd_aw_r (0x4A120) | afifm4M_intfpd_aw_r (0x4B120) | afifm5M_intfpd_aw_r (0x4C120) | AW channel average rate |
0x47124 | afifm3M_intfpd_ar_p (0x4A124) | afifm4M_intfpd_ar_p (0x4B124) | afifm5M_intfpd_ar_p (0x4C124) | AR channel peak rate |
0x47128 | afifm3M_intfpd_ar_b (0x4A128) | afifm4M_intfpd_ar_b (0x4B128) | afifm5M_intfpd_ar_b (0x4C128) | AR channel burstiness allowance |
0x4712C | afifm3M_intfpd_ar_r (0x4A12C) | afifm4M_intfpd_ar_r (0x4B12C) | afifm5M_intfpd_ar_r (0x4C12C) | AR channel average rate |
0x4710C | afifm3M_intfpd_qos_cntl (0x4A10C) | afifm4M_intfpd_qos_cntl (0x4B10C) | afifm5M_intfpd_qos_cntl (0x4C10C) | Enable rate regulation |
Register | Field | Bits | Description |
---|---|---|---|
*_intfpd_aw_p | aw_p | [31:24] | Channel peak rate (8b fraction of transfers per cycle) |
*_intfpd_aw_b | aw_b | [15:0] | Channel burstiness (integer transfers) |
*_intfpd_aw_r | aw_r | [31:20] | Channel average rate (12b fraction of transfers per cycle) |
*_intfpd_ar_p | ar_p | [31:24] | Channel peak rate (8b fraction of transfers per cycle) |
*_intfpd_ar_b | ar_b | [15:0] | Channel burstiness (integer transfers) |
*_intfpd_ar_r | ar_r | [31:20] | Channel average rate (12b fraction of transfers per cycle) |
*_intfpd_qos_cntl | en_awar_rate | [2] | Enable combined rate regulation |
*_intfpd_qos_cntl | en_ar_rate | [1] | Enable AR rate regulation |
*_intfpd_qos_cntl | en_aw_rate | [0] | Enable AW rate regulation |
Example #3: Transaction rate regulation
Regulate HP0 port to an average rate of 10% of the interconnect max data rate, but allow up to 4 catch-up transactions capped at 15% of the max data rate.
BL = 16
MAX = 533 MTps (8528 MBps)
Tps_avg = 0.1 * 533M = 53.3 MTps (852.8 MBps)
Tps_max = 0.15 * 533M = 79.95 MTps (1279.2 MBps)
Rate_avg = floor (4096 / (100 * BL / %BW_avg)) = floor (4096 / (100 * 16 / 10)) = floor (25.6) = 25 (0x19)
Rate_peak = floor (256 / (100 * BL / %BW_peak)) = floor (256 / (100 * 16 / 15)) = floor (2.4) = 2 (0x2)
Burstiness = 4 (0x4)
Due to quantization effects, the actual rate will be lower than the target rate.
Rate (avg) = 25 * 100 * 16 / 4096 = 9.766% of MAX => 8528 * 0.09766 = 832.8 MBps
Rate (peak) = 2 * 100 * 16 / 256 = 12.5% of MAX => 8528 * 0.125 = 1066 MBps
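The arithmetic in Example #3 can be reproduced with a short Python sketch. The function names `rate_field` and `achieved_percent` are hypothetical helpers introduced here for illustration; the scale factors and bit positions come from the field table above (8-bit peak fraction at [31:24], 12-bit average fraction at [31:20], burstiness at [15:0]).

```python
import math

PEAK_SCALE = 256    # peak rate is an 8-bit fraction, bits [31:24]
AVG_SCALE = 4096    # average rate is a 12-bit fraction, bits [31:20]


def rate_field(percent_bw, burst_len, scale):
    """Quantize a bandwidth percentage to a fractional rate field.

    Algebraically the same as floor(scale / (100 * BL / %BW)).
    """
    return math.floor(scale * percent_bw / (100 * burst_len))


def achieved_percent(field, burst_len, scale):
    """Bandwidth percentage that a quantized field value actually yields."""
    return field * 100 * burst_len / scale


BL = 16
rate_avg = rate_field(10, BL, AVG_SCALE)     # 25
rate_peak = rate_field(15, BL, PEAK_SCALE)   # 2
burstiness = 4

# Place each field at the bit position given in the register table.
aw_r = rate_avg << 20
aw_p = rate_peak << 24
aw_b = burstiness & 0xFFFF

print(f"aw_p=0x{aw_p:08X} aw_b=0x{aw_b:08X} aw_r=0x{aw_r:08X}")
print(achieved_percent(rate_avg, BL, AVG_SCALE))    # 9.765625
print(achieved_percent(rate_peak, BL, PEAK_SCALE))  # 12.5
```

The same values would apply to the AR channel registers. Regulation only takes effect once the corresponding enable bit in the qos_cntl register is set.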
Outstanding transaction regulation
In the QoS-400 you may specify the maximum number of outstanding transactions allowed with fractional transaction resolution. The actual number of transactions will modulate between the lower and upper bound.
HP0 | HP1 | HP2 | HP3 | Description |
---|---|---|---|---|
0x47110 | afifm3M_intfpd_max_ot (0x4A110) | afifm4M_intfpd_max_ot (0x4B110) | afifm5M_intfpd_max_ot (0x4C110) | Max number of outstanding transactions |
0x47114 | afifm3M_intfpd_max_comb_ot (0x4A114) | afifm4M_intfpd_max_comb_ot (0x4B114) | afifm5M_intfpd_max_comb_ot (0x4C114) | Max number of combined outstanding transactions |
0x4710C | afifm3M_intfpd_qos_cntl (0x4A10C) | afifm4M_intfpd_qos_cntl (0x4B10C) | afifm5M_intfpd_qos_cntl (0x4C10C) | Enable outstanding transaction regulation |
Register | Field | Bits | Description |
---|---|---|---|
*_intfpd_max_ot | ar_max_oti | [29:24] | Integer part of max outstanding AR addresses (6b) |
*_intfpd_max_ot | ar_max_otf | [23:16] | Fraction part of max outstanding AR addresses (8b) |
*_intfpd_max_ot | aw_max_oti | [13:8] | Integer part of max outstanding AW addresses (6b) |
*_intfpd_max_ot | aw_max_otf | [7:0] | Fraction part of max outstanding AW addresses (8b) |
*_intfpd_max_comb_ot | awar_max_oti | [14:8] | Integer part of max combined outstanding AW/AR addresses (7b) |
*_intfpd_max_comb_ot | awar_max_otf | [7:0] | Fraction part of max combined outstanding AW/AR addresses (8b) |
*_intfpd_qos_cntl | en_awar_ot | [7] | Enable combined regulation of outstanding transactions |
*_intfpd_qos_cntl | en_ar_ot | [6] | Enable regulation of outstanding read transactions |
*_intfpd_qos_cntl | en_aw_ot | [5] | Enable regulation of outstanding write transactions |
Example
Regulate HP0 port to 2.5 outstanding transactions.
oti = 2 (0x2)
otf = 256 * 0.5 = 128 (0x80)
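The split into integer and fraction parts, and the packing into the max_ot register layout from the table above, can be sketched in Python as follows. The helper names `split_outstanding` and `max_ot_value` are hypothetical and used only for this illustration.

```python
import math


def split_outstanding(limit):
    """Split a fractional limit into (integer, 8-bit fraction) parts."""
    fixed = math.floor(limit * 256)   # fixed point with 8 fraction bits
    oti, otf = divmod(fixed, 256)
    if not 0 <= oti < 64:             # integer part is a 6-bit field
        raise ValueError("integer part must fit in 6 bits")
    return oti, otf


def max_ot_value(ar_limit, aw_limit):
    """Pack AR and AW limits using the max_ot field positions."""
    ar_oti, ar_otf = split_outstanding(ar_limit)
    aw_oti, aw_otf = split_outstanding(aw_limit)
    return (ar_oti << 24) | (ar_otf << 16) | (aw_oti << 8) | aw_otf


oti, otf = split_outstanding(2.5)
print(oti, hex(otf))                   # 2 0x80
print(hex(max_ot_value(2.5, 2.5)))     # 0x2800280
```

This applies the same 2.5 limit to both the AR and AW channels, which is one possible reading of the example; the channels can be given different limits.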
DDR QoS Controller
The DDR QoS controller can throttle low latency and best effort traffic based on the Content Addressable Memory (CAM) levels to prioritize video traffic. It also provides software the ability to trigger urgent AXI transactions on a per port basis to dynamically elevate lower priority traffic.
DDR QoS Control Module base address: 0xFD090000
QoS Throttle Control
The DDR QoS controller is designed to ensure that there is always space available in the CAMs for video traffic. By monitoring the individual CAM levels, the DDR QoS controller can throttle low latency and best effort traffic in favor of video traffic. The DBGCAM register can be used to monitor CAM levels to see if any are saturating.
Offset | Field | Bits | Description |
---|---|---|---|
0x0 | PORT5_TYPE | [15:14] | Port 5 Type: 0x0-BE, 0x1-LL, 0x2-Video |
0x0 | PORT4_TYPE | [13:12] | Port 4 Type: 0x0-BE, 0x1-LL, 0x2-Video |
0x0 | PORT3_TYPE | [11:10] | Port 3 Type: 0x0-BE, 0x1-LL, 0x2-Video |
0x4 | PORT5_WR_CTRL | [21] | QoS throttle control on write channel |
0x4 | PORT5_HPR_CTRL | [20] | QoS throttle control on read HPR channel |
0x4 | PORT5_LPR_CTRL | [19] | QoS throttle control on read LPR channel |
0x4 | PORT4_WR_CTRL | [18] | QoS throttle control on write channel |
0x4 | PORT4_HPR_CTRL | [17] | QoS throttle control on read HPR channel |
0x4 | PORT4_LPR_CTRL | [16] | QoS throttle control on read LPR channel |
0x4 | PORT3_WR_CTRL | [15] | QoS throttle control on write channel |
0x4 | PORT3_HPR_CTRL | [14] | QoS throttle control on read HPR channel |
0x4 | PORT3_LPR_CTRL | [13] | QoS throttle control on read LPR channel |
0x8 | VALUE | [6:0] | Read HPR CAM Threshold Level |
0xC | VALUE | [6:0] | Read LPR CAM Threshold Level |
0x10 | VALUE | [6:0] | Write CAM Threshold Level |
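As an illustration of the port type fields at offset 0x0, the Python sketch below builds a mask and value for a read-modify-write of that register. The function name `port_type_update` is hypothetical; the field positions and type encodings are taken from the table above.

```python
# Type encodings and field positions from the register table above.
PORT_TYPE = {"BE": 0x0, "LL": 0x1, "Video": 0x2}
PORT_TYPE_SHIFT = {3: 10, 4: 12, 5: 14}  # PORTn_TYPE is a 2-bit field


def port_type_update(types):
    """Return (mask, value) for the port type register at offset 0x0.

    types: dict mapping a DDRC port number (3, 4 or 5) to "BE", "LL" or "Video".
    Ports that are not listed are left untouched by the mask.
    """
    mask = 0
    value = 0
    for port, name in types.items():
        shift = PORT_TYPE_SHIFT[port]
        mask |= 0x3 << shift
        value |= PORT_TYPE[name] << shift
    return mask, value


# Mark port 3 as Video and port 5 as Low Latency, leave port 4 unchanged.
mask, value = port_type_update({3: "Video", 5: "LL"})
print(f"mask=0x{mask:04X} value=0x{value:04X}")
```

The result would be applied as `new = (old & ~mask) | value` at the DDR QoS Control Module base address plus 0x0.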
Urgent Transactions
Urgent transactions are enabled by default in the FSBL, which sets the [rd|wr]_port_urgent_en bits in the PCFG[R|W]_n registers. Urgent transactions may be issued either by expiring aging counters in the DDRC or by software writing to the DDRC_URGENT register in the DDR QoS Controller.
Offset | Field | Bits | Description |
---|---|---|---|
0x510 | ARURGENT_5 | [13] | Sideband signal to indicate a DDRC Port 5 read queue urgent transaction |
0x510 | AWURGENT_5 | [12] | Sideband signal to indicate a DDRC Port 5 write queue urgent transaction |
0x510 | ARURGENT_4 | [11] | Sideband signal to indicate a DDRC Port 4 read queue urgent transaction |
0x510 | AWURGENT_4 | [10] | Sideband signal to indicate a DDRC Port 4 write queue urgent transaction |
0x510 | ARURGENT_3 | [9] | Sideband signal to indicate a DDRC Port 3 read queue urgent transaction |
0x510 | AWURGENT_3 | [8] | Sideband signal to indicate a DDRC Port 3 write queue urgent transaction |
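The bit positions in the table follow a regular pattern, which the Python sketch below uses to build a bit mask for a given port. The function name `urgent_mask` is a hypothetical helper; the base address, offset and bit positions are those given in this article.

```python
DDR_QOS_CTRL_BASE = 0xFD090000  # DDR QoS Control Module base address
DDRC_URGENT_OFFSET = 0x510      # offset from the table above


def urgent_mask(port, read=True, write=True):
    """Return the urgent bit mask for DDRC port 3, 4 or 5."""
    if port not in (3, 4, 5):
        raise ValueError("only ports 3, 4 and 5 appear in the table")
    aw_bit = 8 + 2 * (port - 3)   # AWURGENT_n; ARURGENT_n is the next bit up
    mask = 0
    if write:
        mask |= 1 << aw_bit
    if read:
        mask |= 1 << (aw_bit + 1)
    return mask


print(hex(DDR_QOS_CTRL_BASE + DDRC_URGENT_OFFSET))  # 0xfd090510
print(hex(urgent_mask(3)))                          # 0x300
print(hex(urgent_mask(5, write=False)))             # 0x2000
```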
DDR Memory Controller
The DDR memory controller is based on the Synopsys uMCTL2 DDR Memory Controller IP. The four HP ports from the PL funnel down to three AXI Port Interfaces (XPI) through the interconnect switch network (NIC/QoS). Since this is a multi-port memory controller, arbitration occurs based on the priority and direction of each request in an effort to optimize the DDR bandwidth. The commands in the XPI are serviced by the Port Arbiter (PA) based on a set of arbitration rules. The selected command is then forwarded to the DDR Controller (DDRC) for scheduling. Inside the DDRC the commands from the PA are queued in either the read or write Content Addressable Memory (CAM). Once a command is selected from either CAM, it is forwarded to the PHY and issued to the DDR memory.
DDRC Module base address: 0xFD070000
Port Control
The Port Control registers allow software to enable and disable the DDRC ports. This can be helpful for debugging or dynamically managing ports with software.
Traffic Classes
The QoS values on each port are mapped into traffic classes. Variable priority reads and writes (VPR/VPW) have a timer associated with each command. When the timer reaches zero, the command is considered expired and gets elevated to the highest priority whether the command is in an XPI or a CAM. Video traffic class is mapped to VPR/VPW.
Traffic Class | Read QoS Value (default) | Read Priority Mapping | Write QoS Value (default) | Write Priority Mapping |
---|---|---|---|---|
Best Effort (BE) | 0-3 | LPR | 0-7 | NPW |
Video (V) | 4-11 | VPR | 8-15 | VPW |
Low Latency (LL) | 12-15 | HPR | N/A | N/A |
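The default mapping in the table can be expressed as a small lookup, sketched below in Python. The names `READ_CLASSES`, `WRITE_CLASSES` and `classify` are hypothetical; the ranges are the default values from the table and would change if the mapping registers are reprogrammed.

```python
# Default QoS ranges from the traffic class table above.
READ_CLASSES = [
    (range(0, 4), "Best Effort", "LPR"),
    (range(4, 12), "Video", "VPR"),
    (range(12, 16), "Low Latency", "HPR"),
]
WRITE_CLASSES = [
    (range(0, 8), "Best Effort", "NPW"),
    (range(8, 16), "Video", "VPW"),
]


def classify(qos, table):
    """Return (traffic class, priority mapping) for a 4-bit QoS value."""
    for values, traffic_class, priority in table:
        if qos in values:
            return traffic_class, priority
    raise ValueError("QoS must be in 0..15")


# With the defaults, the same QoS value can land in different classes
# for reads and writes.
print(classify(0x7, READ_CLASSES))   # ('Video', 'VPR')
print(classify(0x7, WRITE_CLASSES))  # ('Best Effort', 'NPW')
```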
Offset | Description |
---|---|
0x6A4 | Map read traffic classes to regions and define separation levels for port 3 |
0x6AC | Map write traffic classes to regions and define separation levels for port 3 |
0x754 | Map read traffic classes to regions and define separation levels for port 4 |
0x75C | Map write traffic classes to regions and define separation levels for port 4 |
0x804 | Map read traffic classes to regions and define separation levels for port 5 |
0x80C | Map write traffic classes to regions and define separation levels for port 5 |
Variable Priority Timeouts
Variable priority timeouts can be modified to trade off latency against throughput.
Offset | Field | Bits | Description |
---|---|---|---|
0x6A8 | rqos_map_timeout | [10:0] | Timeout value for read transactions on port 3 |
0x6B0 | wqos_map_timeout | [10:0] | Timeout value for write transactions on port 3 |
0x758 |