Versal Cache Coherency
This example design demonstrates how to perform cache coherent transactions from different masters connected to the CCI-500 (cache coherent interconnect) on a Versal device.
Table of Contents
- 1 Overview
- 2 Hardware Block Design
- 2.1 Network on Chip (NoC)
- 2.2 PL DMA LPD/FPD
- 2.3 Interrupts
- 3 Software
- 3.1 Baremetal
- 3.1.1 APU
- 3.1.2 PL DMA LPD/FPD
- 3.1.3 PL DMA CCI0/1
- 3.1.4 LPD DMA
- 3.1.5 RPU
- 3.2 Linux
- 3.2.1 Cache management
- 3.2.2 AXI signaling
- 3.2.3 Register initialization
- 4 Example Source Files
- 5 Example Result
Overview
The primary focus of this example design is to provide a practical case of I/O coherency between the A72 CPUs and the different masters in the system such as DMAs within the processing subsystem or in the programmable logic.
For more detailed information about the cache coherency interconnect, refer to the appropriate section within the Versal Adaptive SoC Technical Reference Manual (AM011). Additionally, use the coherency section of the ARM Cortex-A programmers guide for ARMv8-A as a support document for a more complete understanding of cache coherency in a heterogeneous device.
The example design covers both bare-metal and Linux use cases.
Hardware Block Design
The example design implements four AXI CDMA IPs to perform data transfers through S_AXI_LPD, S_AXI_FPD (S_AXI_GP2), NOC_FPD_CCI_0 and NOC_FPD_CCI_1 respectively.
Additionally, two AXI GPIO IPs are added to control the awcache[3:0]/arcache[3:0] and awprot[2:0]/arprot[2:0] bits of the four interfaces at runtime.
Network on Chip (NoC)
The connectivity of the NoC is configured such that CDMA 2 with GPIO1 channel 1 is connected to the NOC_FPD_CCI_0 and CDMA 3 with GPIO1 channel 2 is connected to the NOC_FPD_CCI_1.
Additionally, the two NMUs or master ports should be routed through the CCI instead of going straight to the memory controller. Selecting the “PS Cache Coherent Virtual” option in the Outputs tab guarantees this behavior. This option will route all of the traffic from the NMU to the cache coherent interface NSU using Fixed DestID addressing.
PL DMA LPD/FPD
The PL LPD port (S_AXI_LPD) is connected to CDMA1 and GPIO0 channel 2, while the PL FPD port (S_AXI_GP2) is connected to GPIO0 channel 1.
Interrupts
The Xilinx Linux DMA IP driver requires the DMA's interrupt signal to be connected to the processor. Therefore, PL to PS interrupts have to be enabled in the CIPS configuration wizard and the interrupt signals of the CDMA IPs routed to them.
Software
Baremetal
The baremetal use case software is split across two processors, the APU and the RPU. The APU application controls the example design: it configures and triggers the DMA engines and commands the RPU application to perform memory transfers.
APU
The test application is executed in the APU, which is in charge of initializing the source and destination buffers and triggering the DMA transactions for each interface.
After initializing the buffers and flushing the data to physical memory, the processor also writes the destination buffer in the cache to check whether coherency is maintained after the DMA transfer (that is, whether the cache reflects a destination buffer that contains the source buffer pattern).
/* Initialize buffers */
xil_printf("\r\nPL FPD DMA Coherency Test\r\n");
xil_printf("Source buffer pattern: 0x36\r\n");
initBuffers(0x36);
/* Write Cache */
for (int Index = 0; Index < BUFFER_BYTESIZE; Index++) {
    DstBuffer[Index] = 0xFF;
}
/* Perform DMA Transfer */
<Specific code for each DMA unit>
xil_printf("Destination buffer readback: 0x%0X\r\n", DstBuffer[0]);
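The pass/fail decision behind this test can be sketched on a host machine as follows. This is only an illustration under stated assumptions: `count_mismatches` and `fake_coherent_dma` are hypothetical names that do not appear in the example sources, and a plain `memcpy` stands in for the DMA engine, so the sketch shows the comparison logic rather than any real cache behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: count the bytes in dst that differ from the
 * pattern that the source buffer was filled with. */
size_t count_mismatches(const uint8_t *dst, size_t len, uint8_t pattern)
{
    assert(dst != NULL);
    size_t mismatches = 0;
    for (size_t i = 0; i < len; i++) {
        if (dst[i] != pattern) {
            mismatches++;
        }
    }
    return mismatches;
}

/* Hypothetical stand-in for a coherent DMA engine. On the host this is
 * just a memcpy, so the CPU always observes the new data. */
void fake_coherent_dma(uint8_t *dst, const uint8_t *src, size_t len)
{
    memcpy(dst, src, len);
}
```

On the target, a non-coherent transfer would leave the stale 0xFF values visible to the CPU through its cache, so a non-zero mismatch count after the transfer would point to a coherency problem.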
The test application is built as a standalone/baremetal application running on EL3 in the Cortex-A72 processor.
There are two important configuration steps that the APU application needs to perform:
- Enable snooping in the S4 port of the CCI-500 (APU)
- Set the memory as outer shareable in the MMU table entries
/* Enable snooping of APU caches from CCI */
Xil_Out32(SNOOP_CONTROL_SI4, ENABLE_SNOOP);
/* Set memory as outer shareable */
Xil_SetTlbAttributes((UINTPTR)DstBuffer, NORM_WB_OUT_CACHE);
dsb();
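For reference, the shareability of a region is held in the SH[1:0] field (bits [9:8]) of an ARMv8-A stage 1 block or page descriptor, where 0b10 means outer shareable and 0b11 means inner shareable. The sketch below shows that bit manipulation on a plain 64-bit value. The helper names are hypothetical, and the text does not show whether the NORM_WB_OUT_CACHE attribute used above is built exactly this way, so treat this as an illustration of the field and not as the BSP implementation.

```c
#include <assert.h>
#include <stdint.h>

/* SH[1:0] field of an ARMv8-A stage 1 block/page descriptor, bits [9:8]. */
#define DESC_SH_SHIFT 8u
#define DESC_SH_MASK  (UINT64_C(0x3) << DESC_SH_SHIFT)

#define SH_NON_SHAREABLE   0x0u
#define SH_OUTER_SHAREABLE 0x2u
#define SH_INNER_SHAREABLE 0x3u

/* Hypothetical helper: return desc with its shareability field replaced. */
uint64_t desc_set_shareability(uint64_t desc, unsigned sh)
{
    /* Encoding 0b01 is reserved, so only the three valid values are accepted. */
    assert(sh == SH_NON_SHAREABLE || sh == SH_OUTER_SHAREABLE ||
           sh == SH_INNER_SHAREABLE);
    return (desc & ~DESC_SH_MASK) | ((uint64_t)sh << DESC_SH_SHIFT);
}

/* Hypothetical helper: extract the shareability field from a descriptor. */
unsigned desc_get_shareability(uint64_t desc)
{
    return (unsigned)((desc & DESC_SH_MASK) >> DESC_SH_SHIFT);
}
```

For example, a descriptor whose low bits are 0x705 (inner shareable) becomes 0x605 once the field is switched to outer shareable.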
PL DMA LPD/FPD
The PL masters have multiple paths to connect to the CCI-500 interconnect. The S_AXI_FPD (S_ACE_Lite_FPD) port is the most commonly used one (full ACE and ACP ports are also available). Additionally, the S_AXI_LPD port can be used as a coherent port, in which case DDR accesses from the LPD domain masters go through the CCI S3 port.
For an AXI transaction to be coherent with the APU (snooped), the CCI uses the shareability information of the transaction, which is carried by the AxDOMAIN signals. The PS-PL interface does not expose those signals; instead they are derived from the AxCACHE values: AxCACHE[3:2] != 2'b00 sets AxDOMAIN to 2'b10 (outer shareable), and AxCACHE[3:2] == 2'b00 sets AxDOMAIN to 2'b11 (system shareable). Therefore, as documented in the APU coherency section of the Zynq MPSoC TRM (which also applies to Versal), use any non-zero value on AxCACHE[3:2] for a coherent transfer and AxCACHE[3:2] == 2'b00 for a non-coherent transfer.
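The mapping from AxCACHE to AxDOMAIN described in this paragraph can be written down as a small function. This is a minimal sketch of the rule as stated here, not a model of the actual PS hardware; the function and macro names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* AxDOMAIN encodings used in the text. */
#define AXDOMAIN_OUTER_SHAREABLE 0x2u  /* 2'b10 */
#define AXDOMAIN_SYSTEM          0x3u  /* 2'b11 */

/* Hypothetical helper: derive AxDOMAIN from the 4-bit AxCACHE value.
 * Any non-zero AxCACHE[3:2] yields outer shareable, otherwise system. */
uint8_t axdomain_from_axcache(uint8_t axcache)
{
    assert(axcache <= 0xFu);  /* AxCACHE is a 4-bit signal */
    uint8_t alloc_bits = (uint8_t)((axcache >> 2) & 0x3u);
    return (alloc_bits != 0u) ? AXDOMAIN_OUTER_SHAREABLE : AXDOMAIN_SYSTEM;
}

/* A transfer is treated as coherent when it is marked outer shareable. */
bool axcache_is_coherent(uint8_t axcache)
{
    return axdomain_from_axcache(axcache) == AXDOMAIN_OUTER_SHAREABLE;
}
```

Under this rule, an AxCACHE value of 4'b0011 (bufferable and modifiable only) is still non-coherent, because bits [3:2] are zero.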
The CDMA IP does not provide any control over those signals and ties them to zero, so AXI GPIO IPs have been added to drive them on the AXI bus. As documented in the AXI protocol specification, the AxPROT signals define the access permission attributes and the AxCACHE signals define the memory attributes. Additionally, on Versal devices a specific control register needs to be written to enable TZ bit usage of the AXI interface instead of the default non-secure value.
To generate transactions that are coherent with the APU test application executing at EL3 (secure), the secure access attribute needs to be used in AxPROT and the Allocate or Other Allocate attribute in AxCACHE. For this example design, the transaction is defined as "Write-through No-allocate", following the memory type encoding in the AXI specification.
/* Set the AxPROT and AxCACHE signals */
XGpio_Initialize(&Gpio, XPAR_GPIO_0_DEVICE_ID);
XGpio_DiscreteWrite(&Gpio, FPD_CHANNEL, AXI_ATTR(PROT_UP_S_D, CACHE_OA_M));
/* Set PL_ACELITE_FPD_TZ defined by PL */
Xil_Out32(PL_ACELITE_FPD_TZ, 0x0);
/* Set the AxPROT and AxCACHE signals */
XGpio_Initialize(&Gpio, XPAR_GPIO_0_DEVICE_ID);
XGpio_DiscreteWrite(&Gpio, LPD_CHANNEL, AXI_ATTR(PROT_UP_S_D, CACHE_OA_M));
/* Set PL_AXI_LPD_TZ defined by PL */
Xil_Out32(PL_AXI_LPD_TZ, 0x0);
/* Route PL_AXI_LPD through CCI */
Xil_Out32(PL_AXI_LPD_Route, 0x1);
The DMA engine is managed through the APU using the CDMA driver API and two simple functions.
/* Perform DMA Transfer */
XAxiCdma_SimpleTransfer(&FpdCDma, (UINTPTR)SrcBuffer, (UINTPTR)DstBuffer, BUFFER_BYTESIZE, NULL, NULL);
while (XAxiCdma_IsBusy(&FpdCDma));
/* Perform DMA Transfer */
XAxiCdma_SimpleTransfer(&LpdCDma, (UINTPTR)SrcBuffer, (UINTPTR)DstBuffer, BUFFER_BYTESIZE, NULL, NULL);
while (XAxiCdma_IsBusy(&LpdCDma));
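The busy-wait loops above spin forever if a transfer never completes, for example when a route or TZ register is misconfigured. A bounded variant can be sketched as below. The `wait_until_idle` helper, the `busy_fn` callback type and the `fake_busy` test double are all hypothetical; on the target, the callback would be a thin wrapper around `XAxiCdma_IsBusy`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: returns true while the engine is busy. */
typedef bool (*busy_fn)(void *ctx);

/* Poll is_busy at most max_polls times.
 * Returns true if the engine went idle, false if the poll budget ran out. */
bool wait_until_idle(busy_fn is_busy, void *ctx, uint32_t max_polls)
{
    assert(is_busy != NULL);
    for (uint32_t i = 0; i < max_polls; i++) {
        if (!is_busy(ctx)) {
            return true;
        }
    }
    return false;
}

/* Test double: reports busy for *remaining more polls, then idle. */
bool fake_busy(void *ctx)
{
    uint32_t *remaining = ctx;
    if (*remaining == 0u) {
        return false;
    }
    (*remaining)--;
    return true;
}
```

A poll count is used here instead of a wall-clock timeout only to keep the sketch free of platform timer calls.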
PL DMA CCI0/1
As with the PL DMA engines that use the S_AXI_LPD and S_AXI_FPD interfaces, the PL engines that use the CCI interfaces need their AxPROT and AxCACHE signals driven through a dedicated AXI GPIO IP.
/* Set the AxPROT and AxCACHE signals */
XGpio_Initialize(&Gpio, XPAR_GPIO_1_DEVICE_ID);
XGpio_DiscreteWrite(&Gpio, CCI0_CHANNEL, AXI_ATTR(PROT_UP_S_D, CACHE_OA_M));
/* Set the AxPROT and AxCACHE signals */
XGpio_Initialize(&Gpio, XPAR_GPIO_1_DEVICE_ID);
XGpio_DiscreteWrite(&Gpio, CCI1_CHANNEL, AXI_ATTR(PROT_UP_S_D, CACHE_OA_M));
The DMA engine is managed through the APU using the CDMA driver API and two simple functions.
/* Perform DMA Transfer */
XAxiCdma_SimpleTransfer(&Cci0CDma, (UINTPTR)SrcBuffer, (UINTPTR)DstBuffer, BUFFER_BYTESIZE, NULL, NULL);
while (XAxiCdma_IsBusy(&Cci0CDma));
/* Perform DMA Transfer */
XAxiCdma_SimpleTransfer(&Cci1CDma, (UINTPTR)SrcBuffer, (UINTPTR)DstBuffer, BUFFER_BYTESIZE, NULL, NULL);
while (XAxiCdma_IsBusy(&Cci1CDma));
LPD DMA
The LPD DMA engine is connected to the CCI-500 through the S3 port, as with any other LPD domain peripheral. The LPD DMA allows the AxCACHE bits of the transaction to be configured through the CH_DATA_ATTR register in the address map, and the DMA driver exposes this through the XZDma_SetChDataConfig function. The security access permission attribute, on the other hand, is configured through the DMA_Ch0_TZ register, and there is no driver in the BSP to control it. Additionally, the DMA_Route register needs to be configured to route the traffic through the CCI-500.
/* Configuration settings */
Configure.SrcBurstType = XZDMA_INCR_BURST;
Configure.SrcBurstLen = 0xF;
Configure.DstBurstType = XZDMA_INCR_BURST;
Configure.DstBurstLen = 0xF;
Configure.SrcCache = CACHE_OA_M;
Configure.DstCache = CACHE_OA_M;
XZDma_SetChDataConfig(&ZDma, &Configure);
/* Change TZ bit to be secure master */
Xil_Out32(DMA_Ch0_TZ, 0x0);
/* Route LPD DMA through CCI */
Xil_Out32(DMA_Route, 0x1);
The DMA transaction control is performed using the XZDma API.
/* Transfer data */
XZDma_Transfer Data;
Data.SrcAddr = (UINTPTR)SrcBuffer;
Data.DstAddr = (UINTPTR)DstBuffer;
Data.SrcCoherent = 1;
Data.DstCoherent = 1;
Data.Size = BUFFER_BYTESIZE;
XZDma_Start(&ZDma, &Data, 1);
while(XZDma_ChannelState(&ZDma) == XZDMA_BUSY);
RPU
The RPU cluster can access the DDR memory either directly or through the coherent path, as with any other LPD master. By default, coherency is disabled in the RPU0_Route register, but transactions can be directed through the CCI-500 by setting the Coherent bit in that register. The RPU is also configured as a secure master by default in the RPU0_TZ register, so there is no need to modify it.
/* Route RPU0 through CCI */
Xil_Out32(RPU0_Route, 0x1);
The data transfer generated by the RPU is managed by the APU, providing both the source buffer address and the destination buffer address using the IPI communication channel. The RPU application is a simple IPI channel monitor that configures an interrupt handler that reads the incoming message to get the buffer addresses and then performs the copy through read/write operations.
The APU generates a message with 3 elements, source buffer address, destination buffer address, and buffer size:
/* Send message to RPU0 */
u32 Msg[3] = {(u32)(UINTPTR)SrcBuffer, (u32)(UINTPTR)DstBuffer, BUFFER_BYTESIZE};
XIpiPsu_WriteMessage(&IpiInst, DestCpuMask, Msg, 3, XIPIPSU_BUF_TYPE_MSG);
XIpiPsu_TriggerIpi(&IpiInst, DestCpuMask);
XIpiPsu_PollForAck(&IpiInst, DestCpuMask, 100000);
The RPU reads the incoming message and uses the received data to copy the source buffer into the destination buffer:
/* Read Incoming Message Buffer Corresponding to Source CPU */
XIpiPsu_ReadMessage(InstancePtr, InstancePtr->Config.TargetList[SrcIndex].Mask, TmpBufPtr, 3, XIPIPSU_BUF_TYPE_MSG);
u8* SrcBuffer = (u8*)TmpBufPtr[0];
u8* DstBuffer = (u8*)TmpBufPtr[1];
for (u32 Index = 0; Index < TmpBufPtr[2]; Index++) {
    DstBuffer[Index] = SrcBuffer[Index];
}
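Because the message carries addresses as 32-bit words and the Cortex-R5F is a 32-bit core, the sender may want to confirm that both buffers fit entirely below 4 GB before building the message. The following is a minimal sketch of such a guard. `copy_msg_pack` is a hypothetical helper that is not part of the example sources, and it only mirrors the three-word layout described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define COPY_MSG_WORDS 3u

/* Hypothetical helper: build the {source, destination, size} message.
 * Returns false when either buffer does not fit in the 32-bit address
 * space, instead of silently truncating the address. */
bool copy_msg_pack(uint32_t msg[COPY_MSG_WORDS],
                   uint64_t src, uint64_t dst, uint32_t size)
{
    assert(msg != NULL);
    if (src > UINT32_MAX || dst > UINT32_MAX) {
        return false;
    }
    /* Reject buffers whose last byte would cross the 4 GB boundary. */
    if (size > 0u &&
        (src + size - 1u > UINT32_MAX || dst + size - 1u > UINT32_MAX)) {
        return false;
    }
    msg[0] = (uint32_t)src;
    msg[1] = (uint32_t)dst;
    msg[2] = size;
    return true;
}
```

With the buffer addresses shown in the example result (0x40000 and 0x60000) the check passes, so the guard only matters if the buffers are moved to high memory.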
Linux
The Linux test is run from the command prompt using the Linux DMA test driver (dmatest), in a similar way to what is documented in the Xilinx Linux Soft DMA Driver wiki page, which covers the AXI CDMA IP driver for Linux. To test hardware cache coherency, three main things need to be done in addition to running dmatest.
Cache management
The Linux kernel DMA framework maintains coherency in software on architectures where hardware-based coherency is not provided. As this example is intended to demonstrate the cache coherency features provided by the CCI-500, the "dma-coherent" property is added to each of the DMA IPs used in this example. This property states that the hardware is cache coherent, so software does not need to maintain coherency for these devices.
&axi_cdma_0 {
    dma-coherent;
};
&axi_cdma_1 {
    dma-coherent;
};
&axi_cdma_2 {
    dma-coherent;
};
&axi_cdma_3 {
    dma-coherent;
};
AXI signaling
As explained in the baremetal use case, for the PL DMA transactions to be cache coherent, the AxPROT and AxCACHE signals need to be driven by the AXI GPIO IPs included in the design. In this case, as the Linux kernel and userspace execute at EL1/EL0 (non-secure), the non-secure attribute needs to be used in the AxPROT signals.
devmem 0xa4020000 32 0x2b
devmem 0xa4020008 32 0x2b
devmem 0xa4050000 32 0x2b
devmem 0xa4050008 32 0x2b
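The text does not show how the GPIO word maps onto the AXI signals. One plausible reading of 0x2b, assumed here and not confirmed by the source, is AxCACHE in bits [3:0] and AxPROT in bits [6:4]. Under that assumption the value decodes to AxPROT = 3'b010 (non-secure) and AxCACHE = 4'b1011 (non-zero AxCACHE[3:2], hence coherent). The sketch below packs and decodes a word with that assumed layout; all names are hypothetical, and the real bit order should be checked against the block design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ASSUMED layout (not confirmed by the source):
 * bits [3:0] = AxCACHE, bits [6:4] = AxPROT. */
struct axi_attr {
    uint8_t prot;   /* AxPROT[2:0]  */
    uint8_t cache;  /* AxCACHE[3:0] */
};

uint32_t axi_attr_pack(uint8_t prot, uint8_t cache)
{
    assert(prot <= 0x7u && cache <= 0xFu);
    return ((uint32_t)prot << 4) | cache;
}

struct axi_attr axi_attr_decode(uint32_t gpio_value)
{
    struct axi_attr attr;
    attr.cache = (uint8_t)(gpio_value & 0xFu);
    attr.prot = (uint8_t)((gpio_value >> 4) & 0x7u);
    return attr;
}

/* AxPROT[1] set means a non-secure access (AXI specification). */
bool axi_attr_is_nonsecure(struct axi_attr attr)
{
    return (attr.prot & 0x2u) != 0u;
}

/* Non-zero AxCACHE[3:2] requests a coherent transfer, per the text. */
bool axi_attr_is_coherent(struct axi_attr attr)
{
    return ((attr.cache >> 2) & 0x3u) != 0u;
}
```

If the design instead places AxPROT in the low bits, the same value would decode differently, which is why the layout is worth confirming before reusing the constant.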
Register initialization
To enable coherency for PL-based masters, a few control registers need to be configured. As these registers are not accessible from Linux userspace for security reasons, they need to be initialized during the boot process. There are multiple ways to accomplish this; for this example, a custom CDO file has been created that configures the appropriate registers.
Inner Cache Broadcasting
Linux sets up the MMU so that cacheable memory is inner shareable, as that supports SMP operation. As modifying the MMU tables from kernel or userspace is not straightforward, the inner cache broadcasting feature can be used to allow inner cacheable transactions to be broadcast. Outside the APU, in the outer domain, the CCI handles coherency across the system. The brdc_inner bit of the lpd_apu register must be written while the APU is in reset, which can be accomplished in several ways.
TZ bit usage
In Versal devices, the AxPROT bits driven by the PL are gated on the PS side for security reasons, and by default they are set to zero (non-secure). To enable driving these signals from the PL, the PL_AXI_LPD_TZ and PL_AXI_FPD_TZ registers have to be configured.
LPD Routing
By default, the LPD AXI interface in the PS is not routed through the CCI block, so the PL_AXI_LPD_Route register needs to be configured to route the traffic through the CCI.
The following is a sample of the coherency.cdo file used.
write 0x00FF41A040 0x3
write 0x00FD690118 0x0
write 0x00FF510050 0x0
write 0x00FE600018 0x1
The CDO is then added in a custom BIF file that is used to generate the boot image.
the_ROM_image:
{
    image {
        { type=bootimage, file=<petalinux-project>/project-spec/hw-description/vck190_axi_cci_wrapper.pdi }
        { type=bootloader, file=<petalinux-project>/images/linux/plm.elf }
        { core=psm, file=<petalinux-project>/images/linux/psmfw.elf }
    }
    image {
        id = 0x1c000000, name=apu_subsystem
        { type=cdo, file=<petalinux-project>/images/linux/coherency.cdo }
        { type=raw, load=0x00001000, file=<petalinux-project>/images/linux/system-default.dtb }
        { core=a72-0, exception_level=el-3, trustzone, file=<petalinux-project>/images/linux/bl31.elf }
        { core=a72-0, exception_level=el-2, file=<petalinux-project>/images/linux/u-boot.elf }
    }
}
Example Source Files
This example design has been tested using a VCK190 board and the Vivado/Vitis/Petalinux 2024.2 release.
Example Result
Baremetal
************************************************************
Versal CCI-500 Coherency example
Source buffer 0x40000, Destination buffer 0x60000
************************************************************
PL FPD DMA Coherency Test
Source buffer pattern: 0x36
Destination buffer pattern: 0x36
Coherent DMA transfer, OK
PL LPD DMA Coherency Test
Source buffer pattern: 0x37
Destination buffer pattern: 0x37
Coherent DMA transfer, OK
CCI 0 DMA Coherency Test
Source buffer pattern: 0x38
Destination buffer pattern: 0x38
Coherent DMA transfer, OK
CCI 1 DMA Coherency Test
Source buffer pattern: 0x39
Destination buffer pattern: 0x39
Coherent DMA transfer, OK
LPD DMA Coherency Test
Source buffer pattern: 0xA5
Destination buffer pattern: 0xA5
Coherent DMA transfer, OK
RPU Coherency Test
Source buffer pattern: 0x5A
Destination buffer pattern: 0x5A
Coherent DMA transfer, OK
Linux
/ # devmem 0xa4020000 32 0x2b
/ # devmem 0xa4020008 32 0x2b
/ # devmem 0xa4050000 32 0x2b
/ # devmem 0xa4050008 32 0x2b
/ # echo 2000 > /sys/module/dmatest/parameters/timeout; echo 1 > /sys/module/dmatest/parameters/iterations;
/ # echo dma0chan0 > /sys/module/dmatest/parameters/channel;
[ 107.139354] dmatest: Added 1 threads using dma0chan0
/ # echo 1 > /sys/module/dmatest/parameters/run
[ 110.100973] dmatest: Started 1 threads using dma0chan0
[ 110.101260] dmatest: dma0chan0-copy0: summary 1 tests, 0 failures 20408.16 iops 20408 KB/s (0)
/ # echo dma1chan0 > /sys/module/dmatest/parameters/channel;
[ 121.306917] dmatest: Added 1 threads using dma1chan0
/ # echo 1 > /sys/module/dmatest/parameters/run
[ 124.158828] dmatest: Started 1 threads using dma1chan0
[ 124.159120] dmatest: dma1chan0-copy0: summary 1 tests, 0 failures 22727.26 iops 340909 KB/s (0)
/ # echo dma2chan0 > /sys/module/dmatest/parameters/channel;
[ 133.014252] dmatest: Added 1 threads using dma2chan0
/ # echo 1 > /sys/module/dmatest/parameters/run
[ 135.262810] dmatest: Started 1 threads using dma2chan0
[ 135.263100] dmatest: dma2chan0-copy0: summary 1 tests, 0 failures 22727.26 iops 318181 KB/s (0)
/ # echo dma3chan0 > /sys/module/dmatest/parameters/channel;
[ 153.391206] dmatest: Added 1 threads using dma3chan0
/ # echo 1 > /sys/module/dmatest/parameters/run
[ 155.956435] dmatest: Started 1 threads using dma3chan0
[ 155.956694] dmatest: dma3chan0-copy0: summary 1 tests, 0 failures 31250.00 iops 31250 KB/s (0)
© Copyright 2019 - 2022 Xilinx Inc.