This page describes cache coherency for Zynq UltraScale+ MPSoC. The ARM Cortex-A Series Programmers Guide refers to coherency as software managed or hardware managed. Zynq UltraScale+ MPSoC uses software managed coherency by default at this time (2017.4), and this page describes the details of turning on hardware managed coherency.
This page is not intended to be a tutorial about cache coherency in a multi-core system. For a more detailed understanding of MPSoC, refer to other documents such as the MPSoC Technical Reference Manual and Software Developers Guide; for a more complete understanding of cache coherency in a multi-core system, refer to ARM documents such as the ARM Cortex-A Series Programmers Guide. The primary focus of this page at this time is I/O coherency between the PL and the A53 CPUs, with a practical, make-it-work approach. This page is based on AR69446 and adds more details.
Prototyping with AXI DMA in the PL has shown minor performance increases with a hardware managed coherent system but each system implementation may vary such that users should verify performance. CPU utilization may have minor improvements when not required to do cache maintenance in a hardware managed coherent system. Some users may desire hardware coherence to simplify the system software design such as for user space DMA implementations.
2 System Terminology
2.1 Cache Coherent Interconnect (CCI)
The CCI of the MPSoC, together with the AXI interconnect, allows hardware coherency to be achieved.
2.2 PL I/O Coherency
The I/O (aka one-way) coherent masters in the PL can snoop APU caches via the CCI ACE-Lite slave ports HPC0/1. The HPC0/1 ports are called AFI0/1 in some places in the tools and documentation. Hardware-managed I/O coherency can simplify software, improve system performance, and reduce power by sharing on-chip data from APU caches.
2.3 Exception Levels
Xilinx Bare Metal applications run at EL3 natively while the Linux kernel is running at EL1. The exception level is tied to the security state of the system with EL3 being a secure state and all others being non-secure. AXI transactions from masters must match the security state of the system.
A domain refers to a set of bus masters in the system. Domains determine which of the masters are snooped for coherent transactions. The APU, including the four A53s and the L2 cache, of the MPSoC is in the inner shareable domain while the PL is in the outer shareable domain.
3 AXI Signals
The following AXI Signals are driven by AXI Masters during AXI transactions. Some IP may have an option to specify how these signals are driven by the IP while others may not and the user will need to tie the signals to the desired state. Users should refer to the AXI Protocol Specification (ARM document IHI 0022E) for more details.
3.1 ARCACHE[3:0] and AWCACHE[3:0]
These signals describe the memory attributes for the read or write transaction. The upper two bits control the caching aspects of the transaction. Non-zero values for the upper two bits are required for cache coherency. Xilinx IP typically sets AxCACHE[3:0] to 4'b0000, so user intervention is required.
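The requirement on the upper two bits can be expressed as a small check. The following is a minimal sketch in C; the helper name and macro names are hypothetical and chosen only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* AxCACHE is a 4-bit field; the "upper two bits" are bits [3:2]. */
#define AXCACHE_MASK        0xFu
#define AXCACHE_UPPER_MASK  0xCu

/* Hypothetical helper: returns true when bits [3:2] of a 4-bit AxCACHE
 * value are non-zero, which is the condition described above for a
 * transaction to be eligible for cache coherency. */
static bool axcache_allows_coherency(uint8_t axcache)
{
    return ((axcache & AXCACHE_MASK) & AXCACHE_UPPER_MASK) != 0u;
}
```

With this check, the typical IP default of 4'b0000 is rejected, while a value such as 4'b1111 is accepted.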
3.2 ARPROT[2:0] and AWPROT[2:0]
These signals describe the access permissions for the read or write transaction. In this specific application the secure / non-secure nature of the transactions is the concern. Transactions should be appropriate for the exception level of the system. AxPROT should be 0, typically the default for the IP, for secure access for bare metal applications running at EL3. AxPROT[1] should be 1 (AxPROT = 3'b010, a value of 2) for non-secure access for Linux.
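As a minimal sketch of the encoding, the following C helper returns the AxPROT value to tie off for a secure or non-secure system. The bit names follow the AXI protocol definition of AxPROT (bit 0 privileged, bit 1 non-secure, bit 2 instruction); the helper name itself is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* AxPROT[2:0] bit assignments from the AXI protocol. */
#define AXPROT_PRIVILEGED   (1u << 0)
#define AXPROT_NONSECURE    (1u << 1)
#define AXPROT_INSTRUCTION  (1u << 2)

/* Hypothetical helper: AxPROT tie-off value for a PL master.
 * Secure (bare metal at EL3) leaves all bits clear; non-secure (Linux)
 * sets only the non-secure bit. */
static uint8_t axprot_for(bool non_secure)
{
    return non_secure ? (uint8_t)AXPROT_NONSECURE : (uint8_t)0u;
}
```

Here `axprot_for(false)` gives 0 for bare metal and `axprot_for(true)` gives 2 for Linux.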
4 MPSoC Slave Ports
MPSoC provides the AXI slave High Performance Coherent ports (HPC0/1) to support I/O coherent transactions. At the PS-PL interface, these ports use the AXI4 protocol. These ports are routed through the Cache Coherent Interconnect (CCI) of MPSoC. The HP0/1/2/3 ports do not support coherent transactions. A PL master write transaction on the HPC0/1 ports goes to memory without going through the APU caches and the status of the cache lines for the memory is updated.
5 Inner/Outer Shareable
Software must define which address regions are to be used by which masters in the system. Cached memory regions are marked as non-shareable, inner shareable or outer shareable in the MMU. Shareable memory is required to support hardware coherency.
Within the APU the SCU handles coherency among the A53 cores in the inner domain. Outside the APU, in the outer domain, the CCI handles coherency across the system. APU cache/memory transactions are not broadcast outside the inner domain by default such that there is no I/O coherency with the PL. Memory transactions from the inner domain must be visible to the CCI to allow hardware coherency.
Linux sets up the MMU for cacheable memory to be inner shareable as that supports SMP operation.
There are multiple methods to achieve hardware coherency depending on the software runtime environment.
5.1 Outer Shareable
Memory can be marked as outer shareable in the MMU of the A53 such that the PL can snoop the memory transactions. This method works best for Bare Metal applications as it is difficult to do with Linux.
Memory can be marked as outer shareable by altering the source file with the static MMU table entries. All of memory is altered to be outer shareable in this case.
The source file should be copied from the BSP to the application to prevent a loss of changes if the BSP is regenerated in the Xilinx SDK. A source file can be in the BSP and the application and the application version will override the BSP version since the BSP is linked into the application as a library. The following code snippet illustrates the change required to the translation_table.S source file.
The memory could also be altered to be outer shareable at run-time using the driver API rather than at compile time. This approach would not require all of memory to be changed to outer shareable.
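Whichever method is used, the change amounts to rewriting the shareability field of the translation table entries. The sketch below shows that bit manipulation in C, using the ARMv8-A stage 1 descriptor encoding (SH[1:0] in bits [9:8]: 0b00 non-shareable, 0b10 outer shareable, 0b11 inner shareable). The helper names are hypothetical, and the attribute value in the usage note is only an illustrative example, not a value taken from this page.

```c
#include <stdint.h>

/* SH[1:0] field of an ARMv8-A stage 1 block/page descriptor. */
#define DESC_SH_SHIFT  8u
#define DESC_SH_MASK   (UINT64_C(3) << DESC_SH_SHIFT)

enum shareability {
    SH_NON_SHAREABLE   = 0x0,
    SH_OUTER_SHAREABLE = 0x2,
    SH_INNER_SHAREABLE = 0x3
};

/* Hypothetical helper: replace the shareability field of a descriptor
 * (or of an attribute word with the same layout), leaving all other
 * bits untouched. */
static uint64_t desc_set_shareability(uint64_t desc, enum shareability sh)
{
    return (desc & ~DESC_SH_MASK) | ((uint64_t)sh << DESC_SH_SHIFT);
}

/* Hypothetical helper: read back the shareability field. */
static enum shareability desc_get_shareability(uint64_t desc)
{
    return (enum shareability)((desc & DESC_SH_MASK) >> DESC_SH_SHIFT);
}
```

For example, an attribute word of 0x705 (inner shareable) becomes 0x605 when passed through `desc_set_shareability(0x705, SH_OUTER_SHAREABLE)`.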
5.2 Broadcasting Inner Shareable
This method alters a register of MPSoC to enable inner shareable transactions to be broadcast. The brdc_inner bit of the lpd_apu register in the LPD_SLCR module must be written while the APU is in reset. The requirement to alter the register while the APU is in reset can be accomplished in multiple manners.
5.2.1 Vivado CCI Enablement
Vivado allows the coherency to be enabled in the CCI Enablement in the Advanced Configuration for the MPSoC. In 2017.2, the AFI0/1 correlate to the HPC0/1 Ports. This method causes PMU Firmware to set the bit in the register, but it creates some challenges such that it is not recommended at this time. This is because there is potential for a race condition where the APU is taken out of reset before the bit is written, even if the CSU loads the PMUFW. If the FSBL loads the PMU Firmware from the A53, then this method will not work.
5.2.2 Register Write At Early Boot
This is the recommended method for Linux boot as it guarantees that the register is written prior to the APU coming out of reset.
The Boot ROM can be used to write the register by using an init value in the boot image. Bootgen allows the init value to be added to the boot image. The following bif file snippet for bootgen illustrates the addition of the file containing an init value.
The following line illustrates the init value that would be in the regs.init file to cause inner shareable transactions to be broadcast to the CCI.
For more info regarding the "regs.init" file content please see Xilinx UG1283 ("Initialization Pairs and INT File Attribute" section).
The following line illustrates how to instruct PetaLinux to use the "regs.init" file:
For more info, please have a look in Xilinx UG1157 ("petalinux-package --boot Command Options" section).
5.2.3 Debug Support
The Xilinx SDK provides a TCL file named psu_init.tcl which initializes the system before loading an application into memory. However, in a standard SDK initialization flow, psu_init.tcl runs after the A53 is out of reset. A typical debug configuration is shown below where the APU is brought out of reset in step 2 but psu_init.tcl runs in step 4.
To work around this, the user must modify the debug launch TCL script. This file is located at <project_path>/<project_name>.sdk/.sdk/launch_scripts/xilinx_c-c++_application_(system_debugger). Rename the file and add the lines as shown below.
The modified launch script can then be called from the XSCT Console window in the SDK debugger.
Note that this bit in the register appears to be a write once register such that a Power On Reset (POR) is required to alter it. The SDK debugger does not do a POR such that a power cycle of the test platform may be required.
5.2.4 Register Write From R5
An R5 CPU can be used to write the value into the register. The R5 must be booted before the A53 for this method to be effective.
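A minimal sketch of such a write from R5 code is shown below. The register address and the bit position of brdc_inner are assumptions that do not come from this page and should be checked against the device register reference; the function name is hypothetical. The function takes the register pointer as a parameter so the logic can also be exercised off-target.

```c
#include <stdint.h>

/* ASSUMPTION: address of the lpd_apu register in LPD_SLCR. */
#define LPD_SLCR_LPD_APU_ADDR  0xFF41A040u
/* ASSUMPTION: brdc_inner is bit 0 of lpd_apu. */
#define LPD_APU_BRDC_INNER     (1u << 0)

/* Hypothetical helper: set brdc_inner with a read-modify-write and
 * return the value read back from the register. */
static uint32_t lpd_apu_enable_inner_broadcast(volatile uint32_t *lpd_apu)
{
    *lpd_apu |= LPD_APU_BRDC_INNER;
    return *lpd_apu;
}
```

On the R5 this would be called as `lpd_apu_enable_inner_broadcast((volatile uint32_t *)LPD_SLCR_LPD_APU_ADDR)` before the APU is released from reset.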
Port 3 of the CCI is connected to the APU cluster. The CCI does not perform snooping of this port by default. The Enable_snoops bit of the Snoop_Control_Register_S3 register in the CCI must be set to enable the snooping.
6.1 Linux and ARM Trusted Firmware (ATF)
ATF is a component of an MPSoC software system and it enables snooping by default such that this step is not required for Linux cache coherence.
6.2 CCI-400 Performance Tuning
The CCI-400 has several knobs that can be adjusted if the default settings do not give adequate performance. Any gains will be application specific.
A read over a slave interface (e.g. a PL master reading DDR) may involve speculative fetching. Speculative fetching causes the master interface to issue a downstream fetch in parallel with the issuance of a snoop to the A53 caches. It reduces latency when the probability of a cache miss is high, but introduces extra accesses to DDR if the snoop hits in the cache. The speculative reads have to complete before the slave interface gets its data. Disabling speculative fetching may increase your performance if the pre-fetched data does not match your access pattern. This is done in two places:
0xFD6E0000 - Control Override Register - Bit 2: Disable speculative fetches - Powers up enabled; set to '1' to disable all speculative fetches
0xFD6E0004 - Speculation Control Register - Defaults to '0' (enabled) for all interfaces; set appropriate bits to '1' (e.g. 0x001f0007 to disable for all slave/master interfaces) if you have not set bit 2 in the Control Override Register
This change can be done in the ATF driver for the CCI-400: drivers/arm/cci400/cci400., in the function cci_enable_cluster_coherency()
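The register values described above can be composed as in the following sketch. The field layout of the Speculation Control Register (master interfaces in bits [2:0], slave interfaces in bits [20:16]) is an assumption inferred from the 0x001f0007 example and the CCI-400 having three master and five slave interfaces; the function and macro names are hypothetical.

```c
#include <stdint.h>

/* Control Override Register: bit 2 disables all speculative fetches. */
#define CCI_CTRL_OVR_DISABLE_SPEC_FETCH  (1u << 2)

/* ASSUMED layout of the Speculation Control Register. */
#define CCI_NUM_MASTER_IF     3u   /* bits [2:0]   */
#define CCI_NUM_SLAVE_IF      5u   /* bits [20:16] */
#define CCI_SPEC_SLAVE_SHIFT  16u

/* Hypothetical helper: build a Speculation Control Register value.
 * Bit n of master_mask disables speculation for master interface n,
 * bit n of slave_mask disables it for slave interface n.
 * Out-of-range bits are ignored. */
static uint32_t cci_spec_ctrl_value(uint32_t master_mask, uint32_t slave_mask)
{
    uint32_t m = master_mask & ((1u << CCI_NUM_MASTER_IF) - 1u);
    uint32_t s = slave_mask & ((1u << CCI_NUM_SLAVE_IF) - 1u);

    return (s << CCI_SPEC_SLAVE_SHIFT) | m;
}
```

Passing all-ones masks, `cci_spec_ctrl_value(0x7, 0x1f)`, reproduces the 0x001f0007 value quoted above; narrower masks allow speculation to be disabled for individual interfaces only.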
The CCI-400 uses the QoS value when it chooses between transaction requests at arbitration points. A higher QoS indicates higher priority. Each interface has its own QoS register.
The CCI-400 uses a Least Recently Granted scheme when two or more transactions share the highest QoS.
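The arbitration rule described above can be modeled in software as follows. This is only an illustrative model of the stated policy (highest QoS wins, ties go to the least recently granted requester), not a description of the CCI-400 hardware implementation; all names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one requester at an arbitration point. */
struct cci_request {
    uint8_t  qos;         /* higher value = higher priority */
    uint32_t last_grant;  /* time of last grant; smaller = granted longer ago */
    bool     pending;     /* true when a transaction is waiting */
};

/* Returns the index of the winning requester, or -1 if none is pending.
 * The highest QoS wins; among equal QoS values, the requester that was
 * granted least recently wins. */
static int cci_arbitrate(const struct cci_request *req, size_t n)
{
    int winner = -1;

    for (size_t i = 0; i < n; i++) {
        if (!req[i].pending) {
            continue;
        }
        if (winner < 0 ||
            req[i].qos > req[winner].qos ||
            (req[i].qos == req[winner].qos &&
             req[i].last_grant < req[winner].last_grant)) {
            winner = (int)i;
        }
    }
    return winner;
}
```

In this model a requester with a lower QoS value is only granted when no higher-QoS requester is pending, which is why QoS assignments should be verified against the actual traffic of the system.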
The QoS for PL masters is controlled in the AFIFM module (base address 0xFD380000) and there are separate QoS assignments for read and write per interface. You can assign the QoS values in the FSBL.
The CCI-400 has further QoS overrides that can be used if needed; please refer to the manual above for details.
Bare metal software must set the Enable_snoops bit of the Snoop_Control_Register_S3 register to support cache coherence. The software should follow the write with a memory barrier (dmb) to ensure that the bit is set before the application continues, otherwise coherence can change during the application execution. The following code illustrates enabling snooping.
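A minimal sketch of this step is given here. The register address and the bit position of Enable_snoops are assumptions not stated on this page, and the function name is hypothetical. A C11 fence is used as a portable stand-in for the barrier; on the A53 the dmb instruction (or the BSP's barrier macro) would be used at that point.

```c
#include <stdatomic.h>
#include <stdint.h>

/* ASSUMPTION: address of Snoop_Control_Register_S3 in the CCI. */
#define CCI_SNOOP_CTRL_S3_ADDR  0xFD6E4000u
/* ASSUMPTION: Enable_snoops is bit 0 of the snoop control register. */
#define CCI_ENABLE_SNOOPS       (1u << 0)

/* Hypothetical helper: set Enable_snoops, issue a barrier so the write
 * is ordered before later memory accesses, and return the value read
 * back so the caller can confirm the bit is set. */
static uint32_t cci_enable_snoops(volatile uint32_t *snoop_ctrl)
{
    *snoop_ctrl |= CCI_ENABLE_SNOOPS;

    /* Portable stand-in for a dmb on the target. */
    atomic_thread_fence(memory_order_seq_cst);

    return *snoop_ctrl;
}
```

A bare metal application would call `cci_enable_snoops((volatile uint32_t *)CCI_SNOOP_CTRL_S3_ADDR)` early in main(), before any coherent DMA transfer is started.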
7 Linux Device Tree
Existing or new kernel drivers for devices in the PL can be coherent by specifying a property in the device tree. A device driver which uses the Linux API to control caching and works in a non-coherent system should not need to be altered to work in a coherent system. The DMA APIs are aware of coherency such that the functions will omit the cache operations required for a non-coherent system. To omit cache operations in the DMA APIs, add the property "dma-coherent" to the device tree for the device as illustrated in the following device tree snippet.
Some PL masters, such as AXI DMA, have multiple AXI interfaces particularly when using scatter gather. For a coherent system with Linux (dma-coherent specified in the device tree), it is important that all AXI interfaces of the master use HPC0/1 ports to ensure that all transactions from the master are coherent.
The term "coherent" in Linux is also referred to as "consistent" which can be clearer. For a non-coherent system, non-cached memory is used. Cached memory is used for a coherent system. The Linux framework for memory allocation, such as the dma_alloc_coherent function, changes behavior based on the dma-coherent property in the device tree. A coherent hardware system can run as a non-coherent software system with Linux by not using the dma-coherent property in the device tree.
7.1 Kernel Page Tables
Some users may want to verify the memory allocated by a device driver is non-cached or cached in Linux. This can be done for ARM64 by enabling the page tables of the kernel to be dumped. The following kernel configuration allows the page tables to be dumped from the command line.
Assuming the debug filesystem is mounted at /sys/kernel/debug, the page tables are located in /sys/kernel/debug/kernel_page_tables. The device driver may require debug to be added to output the physical and virtual addresses of the memory in question as the page tables only include the virtual addresses of memory.
The following line from the kernel page tables illustrates non-cached normal memory.
The following line from the kernel page tables illustrates cached normal memory.
8 System Checklists
There are a lot of details on this page to make coherency work, so a concise checklist for bare metal and for Linux is useful.
8.1 Bare Metal Checklist
- Generate coherent transactions by tying AxCACHE correctly (upper bits non-zero)
- Generate secure transactions by tying AxPROT correctly
- Make memory outer shareable or enable broadcast inner shareable
- Enable snooping
8.2 Linux Checklist
- Generate coherent transactions by tying AxCACHE correctly (upper bits non-zero)
- Generate non-secure transactions by tying AxPROT correctly
- Enable broadcast inner shareable
- Alter the device tree for coherent devices to have "dma-coherent" property
9 Prototyping / Testing Coherency In The PL
An easy method to test hardware coherency with the PL is to use an AXI CDMA IP core. This core allows memory to memory transfers in simple mode (not scatter gather) with minimal effort. The bare metal driver in the SDK also includes a simple polled mode example. The example should be altered to enable snooping and make the memory outer shareable. The cache operations can be commented out to verify the coherency is working.
Note that the same method of broadcasting inner shareable as described above (rather than making memory outer shareable) can also be used for bare metal to ensure the system is ready for Linux coherency.
9.1 An Example Vivado System
The following system illustrates a system for bare metal with secure transactions. For Linux the axprot constant should be set to 2 to generate non-secure transactions.