This example design demonstrates how to perform cache-coherent transactions from different masters connected to the CCI-400 (Cache Coherent Interconnect) on a Zynq UltraScale+ MPSoC device.

Overview

The primary focus of this example design is to provide a practical case of I/O coherency between the A53 CPUs and the different masters in the system such as DMAs within the processing subsystem or in the programmable logic.

For more detailed information about the cache coherent interconnect, refer to the appropriate section of the Zynq UltraScale+ Technical Reference Manual (UG1085). Additionally, the coherency section of the ARM Cortex-A Series Programmer's Guide for ARMv8-A is a useful supporting document for a more complete understanding of cache coherency in a heterogeneous device.

The example design covers both bare-metal and Linux use cases and is based on the Answer Record 69446.

Hardware Block Design

The example design implements two AXI CDMA IPs that perform data transfers through S_AXI_HPC0_FPD and S_AXI_LPD respectively. Additionally, an AXI GPIO IP is added to control the awcache[3:0]/arcache[3:0] and awprot[2:0]/arprot[2:0] bits of both interfaces at runtime.

PL DMA LPD/FPD

The PL LPD port (S_AXI_LPD) is connected to CDMA1 and controlled by GPIO channel 2, while the PL FPD port (S_AXI_HPC0_FPD) is connected to CDMA0 and controlled by GPIO channel 1.

Interrupts

The Xilinx Linux DMA IP driver requires the DMA interrupt signals to be connected to the processor. Therefore, PL-to-PS interrupts have to be enabled in the Zynq UltraScale+ MPSoC PS configuration wizard and the interrupt signals of the CDMA IPs routed to them.

Software

Baremetal

The baremetal software is split across two processors, the APU and the RPU. The APU application controls the example design: it configures and triggers the DMA engines and commands the RPU application to perform memory transfers.

APU

The test application is executed in the APU, which is in charge of initializing the source and destination buffers and triggering the DMA transactions for each interface.

After initializing the buffers and flushing the data to physical memory, the processor also writes the destination buffer in the cache, so that the test can check whether coherency is maintained after the DMA transfer (that is, whether the cache reflects the destination buffer holding the source buffer pattern).

/* Initialize buffers */
xil_printf("\r\nPL FPD(HPC) DMA Coherency Test\r\n");
xil_printf("Source buffer pattern: 0x36\r\n");
initBuffers(0x36);

/* Write Cache */
for (int Index = 0; Index < BUFFER_BYTESIZE; Index++) {
	DestBuffer[Index] = 0xFF;
}

/* Perform DMA Transfer */
XAxiCdma_SimpleTransfer(&FpdCDma, (UINTPTR)SrcBuffer, (UINTPTR)DestBuffer, BUFFER_BYTESIZE, NULL, NULL);
while (XAxiCdma_IsBusy(&FpdCDma));

xil_printf("Destination buffer readback: 0x%0X\r\n", DestBuffer[0]);
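
The readback above prints only the first byte of the destination buffer. A fuller check could compare every byte against the expected pattern. The following is a minimal sketch under that assumption; `count_mismatches` is a hypothetical helper and is not part of the example sources or the Xilinx BSP.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count the bytes in buf that differ from pattern.
 * A return value of 0 means the CPU observes the expected data everywhere. */
static size_t count_mismatches(const volatile uint8_t *buf, size_t len,
                               uint8_t pattern)
{
    size_t mismatches = 0;

    for (size_t i = 0; i < len; i++) {
        if (buf[i] != pattern) {
            mismatches++;
        }
    }
    return mismatches;
}
```

In the APU test this could be called as `count_mismatches(DestBuffer, BUFFER_BYTESIZE, 0x36)` once the transfer has completed.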

The test application is built as a standalone (baremetal) application running at EL3 on the Cortex-A53 processor.

There are two important configuration steps that the APU application needs to perform:

/* Enable snooping of APU caches from CCI */
Xil_Out32(SNOOP_CONTROL_S3, ENABLE_SNOOP);

/* Set memory as outer cacheable */
Xil_SetTlbAttributes((UINTPTR)DestBuffer, NORM_WB_OUT_CACHE);
dsb();
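
The definitions of `SNOOP_CONTROL_S3` and `ENABLE_SNOOP` are not listed here. As a rough sketch, the CCI-400 provides one Snoop Control Register per slave interface at a fixed stride, so the address can be computed from the port number. The base address `CCI_GPV_BASE` below is an assumption and should be checked against the Zynq UltraScale+ register reference; all names are hypothetical.

```c
#include <stdint.h>

/* Assumed base of the CCI-400 programmers view on Zynq UltraScale+;
 * verify against the register reference before use. */
#define CCI_GPV_BASE          0xFD6E0000u
/* CCI-400 layout: registers of slave interface n start at 0x1000 * (n + 1). */
#define CCI_SLAVE_IF_STRIDE   0x1000u
/* Bit 0 of a Snoop Control Register enables snoop requests. */
#define CCI_SNOOP_ENABLE_BIT  0x1u

/* Address of the Snoop Control Register for the given slave port (0..4). */
static uint32_t cci_snoop_ctrl_addr(unsigned int slave_port)
{
    return CCI_GPV_BASE + CCI_SLAVE_IF_STRIDE * (slave_port + 1u);
}
```

Under these assumptions, the register for slave port S3 (the APU) resolves to `0xFD6E4000`.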

PL DMA LPD/FPD

The PL masters have multiple paths to the CCI-400 interconnect; the HPC ports (0/1) are the most commonly used. The S_AXI_LPD port can also be used as a coherent port, because accesses to the DDR from LPD masters go through the CCI S2 port.

For an AXI transaction to be coherent with the APU (that is, snooped), the CCI-400 uses the shareability information of the transaction, which is defined by the AxDOMAIN signals. The PS-PL interface does not expose those signals; instead, they are derived from the AxCACHE value: AxCACHE[3:2] != 2'b00 sets AxDOMAIN to 2'b10 (outer shareable), and AxCACHE[3:2] == 2'b00 sets AxDOMAIN to 2'b11 (system shareable). Therefore, as documented in the APU coherency section of the Zynq UltraScale+ MPSoC TRM, any non-zero value on AxCACHE[3:2] should be used for a coherent transfer and AxCACHE[3:2] == 2'b00 for a non-coherent transfer.
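
The mapping described above can be modeled in a few lines of C. This is only an illustrative sketch of the rule as stated in the text, not code from the example design; the function and macro names are hypothetical.

```c
#include <stdint.h>

#define AXDOMAIN_OUTER_SHAREABLE 0x2u  /* 2'b10 */
#define AXDOMAIN_SYSTEM          0x3u  /* 2'b11 */

/* Model of the mapping: AxCACHE[3:2] != 0 selects the outer shareable
 * domain, AxCACHE[3:2] == 0 selects the system domain. */
static uint8_t axdomain_from_axcache(uint8_t axcache)
{
    uint8_t upper_bits = (uint8_t)((axcache >> 2) & 0x3u);

    return (uint8_t)(upper_bits != 0u ? AXDOMAIN_OUTER_SHAREABLE
                                      : AXDOMAIN_SYSTEM);
}

/* Per the text, only the outer shareable case yields a coherent transfer. */
static int axcache_is_coherent(uint8_t axcache)
{
    return axdomain_from_axcache(axcache) == AXDOMAIN_OUTER_SHAREABLE;
}
```

For instance, an AxCACHE value of 4'b1011 maps to the outer shareable domain, while 4'b0011 maps to the system domain and is therefore not snooped.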

The CDMA IP does not provide any control over those signals and ties them to zero, so an AXI GPIO IP has been added to drive these control signals on the AXI bus. As documented in the AXI protocol specification, the AxPROT signals define the access permission attributes and the AxCACHE signals define the memory attributes.
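
As a quick reference for the bit meanings, the sketch below names the individual AxPROT bits and the two lower AxCACHE bits as defined by the AXI specification. The macro and function names are hypothetical and are not taken from the example sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* AxPROT[2:0] bit meanings from the AXI specification. */
#define AXPROT_PRIVILEGED   (1u << 0)  /* 0 = unprivileged, 1 = privileged */
#define AXPROT_NONSECURE    (1u << 1)  /* 0 = secure, 1 = non-secure */
#define AXPROT_INSTRUCTION  (1u << 2)  /* 0 = data, 1 = instruction */

/* AxCACHE[1:0] bit meanings; bits [3:2] carry the allocate hints. */
#define AXCACHE_BUFFERABLE  (1u << 0)
#define AXCACHE_MODIFIABLE  (1u << 1)

/* True when the transaction is marked as a secure access. */
static bool axprot_is_secure(uint8_t axprot)
{
    return (axprot & AXPROT_NONSECURE) == 0u;
}

/* True when the transaction is marked as modifiable. */
static bool axcache_is_modifiable(uint8_t axcache)
{
    return (axcache & AXCACHE_MODIFIABLE) != 0u;
}
```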

To generate coherent transactions while the APU test application executes at EL3 (secure), the secure access attribute needs to be used in AxPROT and the Allocate or Other Allocate attribute in AxCACHE. For this example design, the transaction is defined as "Write-through No-allocate", following the memory type encoding in the AXI specification.

/* Set the AxPROT and AxCACHE signals for the S_AXI_HPC0_FPD port */
XGpio_Initialize(&Gpio, XPAR_GPIO_0_DEVICE_ID);
XGpio_DiscreteWrite(&Gpio, HPC_CHANNEL, AXI_ATTR(AXI_PROT, CACHE_OA_M));

/* Set the AxPROT and AxCACHE signals for the S_AXI_LPD port */
XGpio_Initialize(&Gpio, XPAR_GPIO_0_DEVICE_ID);
XGpio_DiscreteWrite(&Gpio, LPD_CHANNEL, AXI_ATTR(AXI_PROT, CACHE_OA_M));
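
The definition of `AXI_ATTR` is not listed here. A minimal sketch of how such a packing could look follows, assuming the GPIO word carries AxCACHE in bits [3:0] and AxPROT in bits [6:4]. That layout is an assumption inferred from the value 0x2b written in the Linux use case; the names below are hypothetical.

```c
#include <stdint.h>

/* Assumed GPIO bit layout: [3:0] = AxCACHE, [6:4] = AxPROT. */
#define ATTR_CACHE_MASK  0xFu
#define ATTR_PROT_MASK   0x7u
#define ATTR_PROT_SHIFT  4u

/* Pack AxPROT and AxCACHE into the word written to one GPIO channel. */
static uint32_t axi_attr_pack(uint8_t axprot, uint8_t axcache)
{
    return (((uint32_t)axprot & ATTR_PROT_MASK) << ATTR_PROT_SHIFT) |
           ((uint32_t)axcache & ATTR_CACHE_MASK);
}
```

With this layout, a non-secure AxPROT of 3'b010 combined with an AxCACHE of 4'b1011 packs to 0x2B.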

LPD DMA

The LPD DMA engine is connected to the CCI-400 through the S2 port, like any other LPD peripheral. The LPD DMA allows the AxCACHE bits of the transaction to be configured through the ZDMA_CH_DATA_ATTR register, and the DMA driver exposes this option through the XZDma_SetChDataConfig function. The security access permission attribute, on the other hand, is configured through the slcr_adma register, and there is no driver in the BSP to control it.

/* Configuration settings */
Configure.SrcBurstType = XZDMA_INCR_BURST;
Configure.SrcBurstLen = 0xF;
Configure.DstBurstType = XZDMA_INCR_BURST;
Configure.DstBurstLen = 0xF;
Configure.SrcCache = CACHE_OA_M;
Configure.DstCache = CACHE_OA_M;
XZDma_SetChDataConfig(&ZDma, &Configure);

/* Change TZ bit to be secure master */
Xil_Out32(SLCR_ADMA, 0x0);

RPU

The RPU cluster can access the DDR memory either directly (DDR S0 port) or through the coherent path, like any other LPD master. By default, coherency is disabled in the RPU[X]_CFG register, but transactions can be routed through the CCI-400 by setting the Coherent bit in that register. The RPU is also configured as a secure master by default in the LPD_SLCR_SECURE.slcr_rpu register, so there is no need to modify it.

/* Configure RPU as coherent */
Xil_Out32(RPU_0_CFG, Xil_In32(RPU_0_CFG) | RPU_COHERENT);
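
The read-modify-write above preserves the other fields of the register. A small sketch of the same pattern, with a readback check, is shown below against a plain variable standing in for the register. The bit position used for the coherent flag is a placeholder, not the documented value, and the helper names are hypothetical.

```c
#include <stdint.h>

/* Placeholder bit position; take the real value from the register reference. */
#define RPU_COHERENT_PLACEHOLDER  (1u << 1)

/* Set the bits in mask while preserving every other bit of the register. */
static void reg_set_bits(volatile uint32_t *reg, uint32_t mask)
{
    *reg = *reg | mask;
}

/* Read the register back and confirm that all bits in mask are set. */
static int reg_bits_are_set(const volatile uint32_t *reg, uint32_t mask)
{
    return (*reg & mask) == mask;
}
```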

The data transfer generated by the RPU is managed by the APU, providing both the source buffer address and the destination buffer address using the IPI communication channel. The RPU application is a simple IPI channel monitor that configures an interrupt handler that reads the incoming message to get the buffer addresses and then performs the copy through read/write operations.

The APU generates a message with 3 elements, source buffer address, destination buffer address, and buffer size:

/* Send message to RPU0 */
/* IPI message words are 32 bits wide, so the buffer addresses must fit in 32 bits */
u32 Msg[3] = {(u32)(UINTPTR)SrcBuffer, (u32)(UINTPTR)DestBuffer, BUFFER_BYTESIZE};
XIpiPsu_WriteMessage(&IpiInst, DestCpuMask, Msg, 3, XIPIPSU_BUF_TYPE_MSG);
XIpiPsu_TriggerIpi(&IpiInst, DestCpuMask);
XIpiPsu_PollForAck(&IpiInst, DestCpuMask, 100000);
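
Because the message words are 32 bits wide while pointers on the Cortex-A53 are 64 bits wide, a buffer placed above the 4 GB boundary would be silently truncated. The following sketch shows one way to guard against that when building the message; `copy_msg_pack` is a hypothetical helper and is not part of the example sources.

```c
#include <stdint.h>

#define COPY_MSG_WORDS 3

/* Fill msg with {source, destination, length}. Returns 0 on success, or -1
 * (leaving msg untouched) if an address does not fit in a 32-bit word. */
static int copy_msg_pack(uint32_t msg[COPY_MSG_WORDS], uint64_t src_addr,
                         uint64_t dst_addr, uint32_t length)
{
    if (src_addr > UINT32_MAX || dst_addr > UINT32_MAX) {
        return -1;
    }
    msg[0] = (uint32_t)src_addr;
    msg[1] = (uint32_t)dst_addr;
    msg[2] = length;
    return 0;
}
```

The APU application could call this helper before `XIpiPsu_WriteMessage` and skip the RPU test when it returns -1.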

The RPU reads the incoming message and uses the received data to copy the source buffer into the destination buffer:


/* Read Incoming Message Buffer Corresponding to Source CPU */
XIpiPsu_ReadMessage(InstancePtr, InstancePtr->Config.TargetList[SrcIndex].Mask, TmpBufPtr, 3, XIPIPSU_BUF_TYPE_MSG);

u8 *SrcBuffer = (u8 *)(UINTPTR)TmpBufPtr[0];
u8 *DstBuffer = (u8 *)(UINTPTR)TmpBufPtr[1];
for (u32 Index = 0; Index < TmpBufPtr[2]; Index++) {
	DstBuffer[Index] = SrcBuffer[Index];
}

Linux

The Linux test is performed from the command prompt using the Linux DMA test driver (dmatest), in a similar way to what is documented in the Xilinx Linux Soft DMA Driver wiki page, which covers the AXI CDMA IP driver for Linux. To test the hardware cache coherency, three main things need to be done in addition to running dmatest.

Cache management

The Linux kernel DMA framework maintains coherency in software for architectures where hardware-based coherency is not provided. As this example is intended to demonstrate the cache coherency features provided by the CCI-400, the "dma-coherent" property is added to each of the DMA IP nodes used in this example. This property states that the hardware is cache coherent, so software does not need to perform cache maintenance for those devices.

&axi_cdma_0 {
    dma-coherent;
};

&axi_cdma_1 {
    dma-coherent;
};

AXI signaling

As already explained for the baremetal use case, for the PL DMA transactions to be cache coherent, the AxPROT and AxCACHE signals need to be driven by the AXI GPIO IP included in the design. In this case, as the Linux kernel and userspace execute at EL1/EL0 (non-secure), the non-secure attribute needs to be used in the AxPROT signal.

devmem 0xa0010000 32 0x2b
devmem 0xa0010008 32 0x2b
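
The two writes above target the data registers of GPIO channel 1 (offset 0x0) and channel 2 (offset 0x8) of the AXI GPIO IP. The sketch below decodes the written value, assuming the GPIO word carries AxCACHE in bits [3:0] and AxPROT in bits [6:4]; this layout is an assumption, and the helper names are hypothetical.

```c
#include <stdint.h>

/* AXI GPIO register map: channel 1 data at offset 0x0, channel 2 at 0x8. */
static uint32_t gpio_data_addr(uint32_t base, unsigned int channel)
{
    return base + (channel == 2u ? 0x8u : 0x0u);
}

/* Assumed layout of the GPIO word: [3:0] = AxCACHE, [6:4] = AxPROT. */
static uint8_t attr_axcache(uint32_t word)
{
    return (uint8_t)(word & 0xFu);
}

static uint8_t attr_axprot(uint32_t word)
{
    return (uint8_t)((word >> 4) & 0x7u);
}

/* AxPROT[1] = 1 marks the transaction as non-secure. */
static int attr_is_nonsecure(uint32_t word)
{
    return (attr_axprot(word) & 0x2u) != 0u;
}
```

Under that assumption, 0x2b decodes to AxCACHE = 4'b1011 and AxPROT = 3'b010, which is a non-secure access with non-zero AxCACHE[3:2], consistent with the requirements described above.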

Register initialization

To enable coherency for PL-based masters, one control register needs to be configured.

Inner Cache Broadcasting

Linux sets up the MMU so that cacheable memory is inner shareable, as that supports SMP operation. As modifying the MMU tables from kernel or userspace is not a straightforward task, the inner cache broadcasting feature can be used to allow inner cacheable transactions to be broadcast. Outside the APU, in the outer domain, the CCI handles coherency across the system. The brdc_inner bit of the lpd_apu register within the LPD_SLCR module must be written while the APU is in reset.

The requirement to alter the register while the APU is in reset can be accomplished using the register initialization feature in the boot image.

.set. 0xFF41A040 = 0x3;
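
If the value needs to be built from named bits rather than written as a literal, a small host-side helper can generate the line. The sketch below reproduces the format shown above; the bit positions assigned to `LPD_APU_BRDC_OUTER` and `LPD_APU_BRDC_INNER` are assumptions to be confirmed against the register reference, and the helper name is hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed lpd_apu bit positions; confirm against the register reference. */
#define LPD_APU_BRDC_OUTER  (1u << 0)
#define LPD_APU_BRDC_INNER  (1u << 1)

/* Format one register initialization line in the ".set. addr = value;" form.
 * Returns the number of characters written, or -1 if out is too small. */
static int format_init_line(char *out, size_t out_len, uint32_t addr,
                            uint32_t value)
{
    int n = snprintf(out, out_len, ".set. 0x%08X = 0x%X;",
                     (unsigned int)addr, (unsigned int)value);

    return (n < 0 || (size_t)n >= out_len) ? -1 : n;
}
```

Calling `format_init_line` with address `0xFF41A040` and both bits set produces the same line as the one used in this example.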

The initialization file can be easily incorporated into the boot image generated in a PetaLinux project using the --bif-attribute options of the petalinux-package command.

petalinux-package --boot --u-boot --bif-attribute init --bif-attribute-value regs.init --force

Example Source Files

This example design has been tested using a ZCU102 board and the Vivado/Vitis/PetaLinux 2022.2 release.

It can be easily reproduced using the following files in the GitHub repository:

Example Results

Baremetal

************************************************************
ZynqMP CCI-400 Coherency example
************************************************************

PL FPD(HPC) DMA Coherency Test
Source buffer pattern: 0x36
Destination buffer readback: 0x36

PL LPD DMA Coherency Test
Source buffer pattern: 0x37
Destination buffer readback: 0x37

LPD DMA Coherency Test
Source buffer pattern: 0xA5
Destination buffer readback: 0xA5

RPU Coherency Test
Source buffer pattern: 0x5A
Destination buffer readback: 0x5A

Linux

/ # echo dma0chan0 > /sys/module/dmatest/parameters/channel;
[   27.369107] dmatest: Added 1 threads using dma0chan0
/ # echo 1 > /sys/module/dmatest/parameters/iterations;
/ # echo 1 > /sys/module/dmatest/parameters/run
[   39.770234] dmatest: Started 1 threads using dma0chan0
[   39.770672] dmatest: dma0chan0-copy0: dstbuf[0x1498] not copied! Expected c3, got 00
/ # [   39.783192] dmatest: dma0chan0-copy0: dstbuf[0x1499] not copied! Expected c2, got 00
[   39.791274] dmatest: dma0chan0-copy0: dstbuf[0x149a] not copied! Expected c1, got 00
[   39.799006] dmatest: dma0chan0-copy0: dstbuf[0x149b] not copied! Expected c0, got 00
 
/ # devmem 0xa0010000 32 0x2b
/ # echo 1 > /sys/module/dmatest/parameters/run
[   97.235908] dmatest: No channels configured, continue with any
[   97.241888] dmatest: Added 1 threads using dma0chan0
[   97.246867] dmatest: Started 1 threads using dma0chan0
[   97.247376] dmatest: dma0chan0-copy0: summary 1 tests, 0 failures 14492.75 iops 72463 KB/s (0)
 
/ # devmem 0xa0010008 32 0x2b
/ # echo dma1chan0 > /sys/module/dmatest/parameters/channel;
[   59.097875] dmatest: Added 1 threads using dma1chan0
/ # echo 1 > /sys/module/dmatest/parameters/run
[   63.278909] dmatest: Started 1 threads using dma1chan0
[   63.279427] dmatest: dma1chan0-copy0: summary 1 tests, 0 failures 13333.33 iops 120000 KB/s (0)