Zynq UltraScale+ MPSoC SMMU
The System Memory Management Unit (SMMU) allows AXI masters to have a virtualized view of the system memory, similar to the CPUs. The SMMU is analogous to the MMU of the A53 processor.
Table of Contents
- 1. Introduction
- 2. SMMU Use Cases
- 3. User Space DMA
- 4. Enabling SMMU
- 5. User space DMA example with simple AXI DMA Hardware System
- 6. Debug Methods
- 7. Performance Ramifications
- 8. References
  - Answer Records
  - App Notes
1. Introduction
The SMMU acts just like an MMU for processors; however, it translates addresses coming from system I/O devices such as DMAs. The SMMU requires masters such as DMAs to use virtual addresses rather than physical addresses. The SMMU is transparent to device drivers in Linux, as the DMA framework knows how to handle it. The SMMU is equivalent to an IOMMU used in other system architectures.
This page is not intended to be a tutorial about the SMMU. Readers should refer to other documents (such as the MPSoC Technical Reference Manual and Software Developers Guide) for a more detailed understanding of MPSoC, together with ARM documents such as the ARM System Memory Management Unit Architecture Specification and the ARM Cortex-A Series Programmers Guide for a more complete understanding of the SMMU operation.
The primary focus of this page is to provide the following information about using the SMMU on Zynq UltraScale+ platforms:
- How to enable the SMMU on Zynq UltraScale+ MPSoC devices, from both a hardware and a software perspective.
- Usage of the SMMU with devices in the programmable logic (PL).
Note that the use of the SMMU with the PL is an advanced topic with some limitations that should be clearly understood. This page attempts to clarify the limitations to allow users to do prototyping with the SMMU.
All PetaLinux BSPs that target Zynq UltraScale+ MPSoC devices have the SMMU disabled by default. For users who are aware of the SMMU and want to take advantage of it, this page provides several data points to consider before going through the effort of enabling it. This page walks through the process of building a complete prototype system to support user space DMA using the SMMU.
2. SMMU Use Cases
2.1 Memory protection
The SMMU works with the Linux IOMMU framework. Through the IOMMU framework, the SMMU ensures that each bus master in the system is assigned a separate context and translation table. Once this is done, any rogue DMA access fails to get through the IOMMU protection: because the rogue DMA is not assigned a context, the access results in an IOMMU fault and fails.
2.2 VFIO
All modern hardware architectures provide DMA and interrupt remapping facilities to help ensure that I/O devices behave within the boundaries to which they have been allocated. The VFIO driver framework available in Linux uses the IOMMU framework and provides direct device access from Linux user space in a secure, IOMMU-protected environment. In summary, VFIO provides safe, non-privileged user space drivers.
Some applications, particularly in the high performance computing field (for example, network adapters or compute accelerators), benefit from low-overhead direct device access from user space. This makes VFIO support important. The only way VFIO can be enabled on AMD adaptive devices is by enabling the SMMU through the IOMMU framework.
The AXI DMA example described below uses VFIO for the user space DMA solution.
2.3 Virtualization (Hypervisors)
In virtualization environments, virtual machines (VMs) make use of SMMU to provide isolation between VMs.
2.4 Xen
Xen requires the SMMU to be enabled in hardware in order to work.
3. User Space DMA
A primary motivation for the use of the SMMU is to allow a user space DMA implementation. In the context discussed here, user space DMA is defined as allocation of memory and control of a DMA device from user space in Linux. In the past, user space DMA had several challenges that kept it from being an easy solution, so most users tend to write kernel space drivers rather than user space solutions.
3.1 User Space Cache Control
The first challenge is cache control from user space. For systems that are only software coherent rather than hardware coherent, the software must perform cache maintenance. This is challenging from user space, as a kernel space driver is typically required.
3.2 User Space Memory Allocation
The second challenge is memory allocation from user space. Systems without an SMMU use physical addresses with the DMA device, so either contiguous memory or scatter gather with scattered memory pages is typically used. A kernel space driver is typically required to allocate contiguous memory.
3.3 Zynq UltraScale+ MPSoC Solution
MPSoC provides solutions to both challenges described above. The SMMU allows user space memory allocation to be used for DMA. The hardware coherency of MPSoC allows cached memory to be used for DMA from the user space and removes the need for cache control. The SMMU also provides an additional level of protection in that DMAs cannot access memory other than the memory that has been set up in the SMMU.
4. Enabling SMMU
This section provides information on changes needed to enable SMMU both at the hardware level and at the software level.
Hardware Requirements
For FPD and PL masters, no changes are required in the hardware design. Only the LPD masters in the Zynq UltraScale+ MPSoC need the changes outlined below at the hardware design (Vivado) stage.
Routing an LPD I/O master's traffic through the SMMU (changes in the hardware design)
By default, an LPD master's traffic takes the LPD path, which bypasses the SMMU. In order to manage a specific LPD I/O master device through the SMMU, its traffic must be routed through the SMMU. The Vivado GUI has an option to do so; the steps to enable it are shown below:
1. Open the hardware project in Vivado and open the customization window for the Zynq UltraScale+ MPSoC block by double-clicking on it.
2. Switch to Advanced mode.
3. Select the "Advanced Configuration" tab.
4. Select the "Route LPD Traffic through FPD" option. It shows the LPD I/O masters enabled in the hardware design.
5. Select the checkbox for each I/O master for which you would like to enable the SMMU.
5. User space DMA example with simple AXI DMA Hardware System
This section describes an end-to-end user space DMA solution using a simple AXI DMA hardware system. It provides a step-by-step procedure for the use case, covering the changes needed on both the hardware and software sides.
The simplest prototype system for this purpose is designed with the following characteristics:
- Does not include any AXI interconnect for the DMA data interface
- Connects the DMA data interface to the HPC0 port of the MPSoC
- Uses a single AXI data interface on the DMA
- Is hardware coherent, such that all AXI transactions from the DMA are coherent as described in MPSoC Cache Coherency
- Uses only simple DMA rather than scatter gather
- Enables high addresses in Vivado as described in PL Masters
- Loops the DMA transmit stream back to the DMA receive stream for easy testing
5.1 Hardware Details
The following diagrams illustrate such a system, with the AXI DMA connected to the HPC0 port of the MPSoC.
5.2 DMA Configuration
5.3 AXI Interconnect Limitations
Vivado adds AXI interconnects (or SmartConnect) by default when connecting a DMA IP to the MPSoC. The interconnect contains crossbar switches which are configured to allow only the physical addresses of slaves to be passed through the interconnect. This will not allow virtual addresses to be used for DMA, as transactions get blocked at the crossbar switch.
A user can manually insert an AXI crossbar switch and then configure the address ranges that pass through it, opening it up for virtual addresses. The following illustration shows an additional address range added to a crossbar switch.
5.4 Device Tree
5.4.1 SMMU Enable
The SMMU node in the device tree is disabled by default in PetaLinux. As a result, it must be enabled. The following device tree snippet illustrates enabling the SMMU:
&smmu {
	status = "okay";
};
5.4.2 AXI DMA Stream IDs
The device tree node for the AXI DMA must be set up with the "iommus" property to cause the SMMU to function with the device. A stream ID is required in the iommus property for each master interface of the DMA.
The number of stream IDs required for the device can be challenging to understand as they do not always directly correlate to physical AXI master interfaces. The AXI DMA configured with the most master interfaces (scatter gather with separate data interfaces) uses three stream IDs, so this is the easiest starting point. Stream ID details are explained on this page: Xen and PL Masters.
The following device tree snippet illustrates adding the iommus property to the AXI DMA in the device tree, with three stream IDs for the HPC0 port of the MPSoC:
&axi_dma_0 {
	iommus = <&smmu 0x200>, <&smmu 0x201>, <&smmu 0x202>;
};
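In a PetaLinux project, both of the snippets above are typically added to the user device tree file at project-spec/meta-user/recipes-bsp/device-tree/files/system-user.dtsi so that they are picked up on the next build.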
5.5 VFIO
The VFIO framework in Linux is designed to use the SMMU to allow DMA from the user space. VFIO is similar to the UIO framework in that it provides a method to map a device into user space memory, allowing register access of the device. VFIO also controls the SMMU such that DMA has a virtualized view of memory similar to the CPUs. Unlike UIO, VFIO has very few examples and minimal documentation, and most of the examples are PCI related.
5.5.1 VFIO Device Driver
There are multiple VFIO drivers in the kernel. The VFIO platform driver is best suited for this solution with a device-tree based Linux architecture.
5.5.1.1 Kernel Configuration
The kernel must be configured to build the VFIO platform driver as a kernel module.
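As a rough reference (a sketch only; option names can vary between kernel versions), the relevant options in a kernel configuration fragment look like the following:
CONFIG_VFIO=m
CONFIG_VFIO_PLATFORM=m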
5.5.1.2 Using the VFIO Platform Driver
The AXI DMA driver is built into the kernel statically by default. The driver must be unbound from the device to allow the VFIO platform driver to be used for the device. By default, the VFIO platform driver requires a reset function which is not provided in the system at this time. A module parameter is used when inserting the VFIO platform driver to indicate that the reset function is not required.
The following commands illustrate the steps required to load the VFIO platform driver, unbind the existing DMA driver, and bind the VFIO platform driver to the AXI DMA hardware connected to the HPC0 port of the MPSoC:
modprobe vfio_platform reset_required=0
echo a0000000.dma > /sys/bus/platform/drivers/xilinx-vdma/unbind
echo vfio-platform > /sys/bus/platform/devices/a0000000.dma/driver_override
echo a0000000.dma > /sys/bus/platform/drivers_probe
The driver does not probe successfully without the reset_required=0 module parameter. The resulting error is illustrated below.
vfio: no reset function found for device a0000000.dma
vfio-platform: probe of a0000000.dma failed with error -2
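A user space application also needs the device's IOMMU group number in order to open the corresponding /dev/vfio/<group> node. Assuming the device name used above, the group number can be read from sysfs, for example:
readlink /sys/bus/platform/devices/a0000000.dma/iommu_group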
5.5.2 VFIO User Space Application
The attached Linux application source code is a working example of VFIO with the previously described AXI DMA system. The application assumes that the system is coherent, such that user space allocated memory is cached but does not require any software cache maintenance. The application is executed on the target as the last step.
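For readers who do not have the attachment at hand, the following is a minimal sketch of how a VFIO platform user space application typically sets up the container, IOMMU group, device, and DMA mapping. It is not the attached application itself: the IOMMU group number (/dev/vfio/0), the device name, the IOVA, and the buffer size are illustrative assumptions, and most error handling is omitted.
/*
 * Minimal VFIO user space DMA sketch (illustrative only).
 * Assumes the AXI DMA at a0000000.dma is bound to vfio-platform and belongs
 * to IOMMU group 0; the group number, IOVA and buffer size are examples.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/vfio.h>

int main(void)
{
	/* Open the VFIO container and the IOMMU group of the device. */
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open("/dev/vfio/0", O_RDWR); /* group number is an assumption */

	struct vfio_group_status status = { .argsz = sizeof(status) };
	ioctl(group, VFIO_GROUP_GET_STATUS, &status);
	if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE)) {
		fprintf(stderr, "IOMMU group is not viable\n");
		return 1;
	}

	/* Attach the group to the container and select the Type1 IOMMU model. */
	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

	/* Get a device file descriptor and map the DMA register region. */
	int device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "a0000000.dma");
	struct vfio_region_info reg = { .argsz = sizeof(reg), .index = 0 };
	ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);
	volatile uint32_t *regs = mmap(NULL, reg.size, PROT_READ | PROT_WRITE,
				       MAP_SHARED, device, reg.offset);

	/* Allocate an ordinary (cached) user space buffer for the DMA data. */
	size_t size = 64 * 1024;
	void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	/* Map the buffer through the SMMU so the DMA can access it at the chosen IOVA. */
	struct vfio_iommu_type1_dma_map dma_map = {
		.argsz = sizeof(dma_map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (uintptr_t)buf,
		.iova  = 0x10000000, /* example IOVA */
		.size  = size,
	};
	ioctl(container, VFIO_IOMMU_MAP_DMA, &dma_map);

	/* The application would now program the AXI DMA registers through 'regs',
	 * using dma_map.iova as the buffer address for the transfer. */
	(void)regs;

	munmap(buf, size);
	close(device);
	close(group);
	close(container);
	return 0;
}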
6. Debug Methods
6.1 Challenges
The devmem utility is typically used for debugging Linux issues. It might appear to no longer be useful once virtual memory addresses are being used by the DMA. However, the tracing functions described below allow the user to see the physical addresses, so utilities such as devmem remain useful.
6.2 Tracing SMMU Events in Linux
The kernel must be configured for kernel tracing. Enable tracing for the IOMMU events as shown below.
cd /sys/kernel/debug/tracing
echo 1 > events/iommu/map/enable
echo 1 > events/iommu/unmap/enable
echo 1 > events/iommu/io_page_fault/enable
After SMMU operations have occurred, dump the trace buffer as illustrated below.
cat trace
...
modprobe-2274 [001] .... 153.598323: map: IOMMU: iova=0x0000ffffffff8000 paddr=0x00000008797a2000 size=4096
modprobe-2274 [001] .... 153.598329: map: IOMMU: iova=0x0000ffffffff9000 paddr=0x00000008794c8000 size=4096
dma16chan0-dma1-2276 [003] .n.. 153.611087: unmap: IOMMU: iova=0x0000ffffffff0000 size=16384 unmapped_size=16384
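With a physical address taken from the trace, devmem can then be used to inspect the memory that the DMA actually targeted. For example, using the first paddr reported above (the 32-bit access width is just an example):
devmem 0x8797a2000 32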
6.3 DMA Hangs, Kernel Crashes, Kernel Hangs
If the SMMU is not set up with the correct stream IDs in the device tree, you might see the application hang during the DMA transfer, a kernel crash, or a kernel hang. When the stream IDs are not set up correctly, a virtual address used by the DMA is not translated to the correct physical address.
Use of a System ILA in the hardware system might be required to watch the AXI transactions and verify the addresses being used. There is limited visibility once the AXI transaction enters the MPSoC, where the SMMU performs the address translation from virtual to physical.
7. Performance Ramifications
Despite all of the advantages explained earlier, enabling the SMMU/IOMMU can lead to performance degradation in some use cases. In Linux, every DMA framework allocation and mapping API call is attached to the IOMMU framework, so each DMA buffer that is allocated or freed also results in the creation and deletion of IOMMU page table entries (with a typical kernel page size of 4 KB).
For use cases where DMA-able buffers are constantly being allocated and freed, the performance overhead can be significant.
Even simple use cases of DMA transfers with very large buffers can be affected by this performance degradation.
The table below shows numbers from a typical performance benchmark of the zDMA on a Zynq UltraScale+ MPSoC device. Each table entry corresponds to a memory-to-memory transfer of a certain buffer size.
| Buffer size (KB) | Performance with SMMU (MB/sec) | Performance without SMMU (MB/sec) |
| --- | --- | --- |
| 1 | 24.672 | 38.896 |
| 5 | 104.390 | 191.343 |
| 16 | 250.517 | 472.483 |
| 32 | 339.838 | 728.507 |
| 64 | 409.010 | 1003.011 |
Note: As the buffer size increases, the performance difference can become significant. For example, a 64 KB buffer with the IOMMU enabled is divided into sixteen 4 KB entries in the SMMU/IOMMU translation table. Creating these entries and subsequently deleting them happens in the direct data path for the above use case, and a greater number of entries results in more performance degradation.
8. References
- VFIO
- Virtual Open Systems VFIO Prototyping
- https://xilinx-wiki.atlassian.net/wiki/spaces/XWUC/pages/18841981
Answer Records
App Notes