USB 3.0 provides a high-speed interface that is useful for acquiring data at a high data rate. The MPSoC includes two USB controllers capable of USB 3.0 along with programmable logic (PL). An A53 or R5 CPU of the MPSoC is used to manage the USB controller. The following prototype system illustrates the principles required to build a data acquisition system, including the MPSoC hardware design along with software for the USB device (MPSoC) and a USB host running Linux. The system was tuned to provide good performance running on the R5 CPU, but it can also run on the A53.
The system was created and tested with the Xilinx 2020.1 tools (Vivado and Vitis) using the ZCU102 board with a USB 3.0 (SuperSpeed) cable to the Linux-based x86 host.
A key principle of this system is that the USB controller of the MPSoC incorporates DMA that can read from or write to a device in the PL, so no buffering of the data in DDR is required. The following design incorporates an AXI Stream FIFO, which allows a memory mapped interface to connect to a stream interface. Stream interfaces are very common in the PL, so this design should support many typical applications. Since the USB controller driver and hardware do not support a keyhole address mode to access the AXI Stream FIFO, a hard-coded AXI address for the FIFO data register is generated as shown below. On the software side, a 64K DMA transfer is initiated at the C_AXI4_BASEADDR value for the AXI Stream FIFO, even though the address is actually the TX Data FIFO address. A 64K aligned address, rather than a 4K aligned address, provides the best performance for the larger 64K transfers.
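The alignment and window-size constraints described above can be sketched in a few lines of C. This is only an illustration: the base address and window size are hypothetical placeholders (the real values come from the Vivado address editor and xparameters.h), and the helper names are not part of any Xilinx driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical values for illustration only; take the real ones from
 * the Vivado address map of the AXI Stream FIFO. */
#define FIFO_AXI4_BASEADDR 0x80010000ULL /* assumed C_AXI4_BASEADDR */
#define FIFO_AXI4_WINDOW   0x10000ULL    /* assumed 64K address range */

/* True when addr lies on a boundary of the given power-of-two size. */
bool is_aligned(uint64_t addr, uint64_t size)
{
    return (addr & (size - 1)) == 0;
}

/* True when a DMA transfer of len bytes that starts at the window base
 * stays inside the address range mapped to the FIFO. */
bool transfer_fits_window(uint64_t window, uint64_t len)
{
    return len > 0 && len <= window;
}
```

With these assumed values, a 64K transfer fits the window while a 128K transfer does not, which is why a larger transfer size would need a larger address range for the FIFO.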
A data generator for writing data into the stream FIFO is created along with a GPIO to allow software to control the data generation. The data generator is a counter that counts up and inserts each count into the FIFO. Flow control is incorporated into the data generator so that a continuous count is seen in the FIFO when reading it from the AXI slave interface. TLast generation causes packets of 1024 bytes (256 x 32 bits) to be inserted into the FIFO, which corresponds to the 1K byte packets used for USB 3.0 bulk transfers. The AXI GPIO that enables the data generator is probably not required since throttling is incorporated.
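The behavior of the data generator can be modelled in software as a rough sketch. The real generator is PL logic; the structure and function names below are hypothetical, and the model assumes the count starts at zero and that TLast is asserted on every 256th word.

```c
#include <stdbool.h>
#include <stdint.h>

#define WORDS_PER_PACKET 256u /* 256 x 32 bits = 1024 bytes */

/* One beat on the modelled AXI Stream interface. */
struct stream_beat {
    uint32_t tdata; /* incrementing count */
    bool     tlast; /* set on the final word of each packet */
};

/* Software model of the generator state (hypothetical). */
struct data_gen {
    uint32_t count;
};

/* Produce the next beat only when the FIFO can accept it (tready).
 * When the FIFO is not ready the count is left unchanged, so no
 * values are skipped and the reader sees a continuous count. */
bool data_gen_step(struct data_gen *gen, bool tready, struct stream_beat *out)
{
    if (!tready)
        return false;
    out->tdata = gen->count;
    out->tlast = (gen->count % WORDS_PER_PACKET) == WORDS_PER_PACKET - 1u;
    gen->count++;
    return true;
}
```

Stalling the model (tready false) between beats does not create gaps in the count, which mirrors the flow control described above.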
Tuning For Performance
The Vivado system was tuned for performance by increasing the PL frequency to 300 MHz, the R5 frequency to 534 MHz, and setting the width of the AXI HPM0 LPD interface to 128 bits.
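A back-of-the-envelope calculation helps put these settings in context. The sketch below computes theoretical peak rates only, assuming one beat per clock and that the 128-bit interface is also clocked at 300 MHz (an assumption, not stated above). The function names are hypothetical.

```c
#include <stdint.h>

/* Peak rate in MB/s (10^6 bytes per second) of a bus that moves
 * width_bits every clock cycle at freq_mhz. Theoretical maximum only. */
uint32_t peak_mbytes_per_sec(uint32_t width_bits, uint32_t freq_mhz)
{
    return (width_bits / 8u) * freq_mhz;
}

/* USB 3.0 SuperSpeed signals at 5 Gbit/s with 8b/10b encoding, so at
 * most 8 of every 10 bits on the wire are payload. */
uint32_t usb3_peak_mbytes_per_sec(void)
{
    return (5000u * 8u / 10u) / 8u;
}
```

Under these assumptions the 32-bit stream at 300 MHz peaks at 1200 MB/s and the 128-bit interface at 4800 MB/s, both above the 500 MB/s ceiling of the USB 3.0 link before protocol overhead.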
The software running on the MPSoC as a USB device is based on the mass storage example provided in the Vitis embedded software, which is also available in the embedded software repository on GitHub. For this application, a mass storage protocol has a lot of overhead and complexity that is not required, so a vendor specific device class is more appropriate. A vendor specific device class for USB 3.0 allows the USB host to perform large bulk transfers with the USB device after enumeration.
A USB communications device class (CDC) is another possible device class that was considered for this application. As with the mass storage device class, it appears to have some complexity and protocol that is unneeded. The mass storage and CDC device classes are also very common, so many USB hosts have drivers installed that will bind to the device, which can require manual intervention on the test system.
Mass Storage Example Details
The mass storage example, along with all the other examples for the driver, is a bit complex for a newcomer to USB due to minimal documentation. Performing a baseline of the example can be worthwhile to ensure the hardware is working well. The ZCU102 acts as the USB device emulating a mass storage device. From a USB host, the device shows up as a raw storage device with no formatted partitions. Formatting the device and copying a file to it ensures the example is working.
XUsbPsu Driver and Example Details
The xusbpsu driver is only a driver for the USB controller, so the driver examples are required to build a functioning USB device. For those without in-depth USB experience, it can seem a bit overwhelming to understand all the details of USB. There are a number of good USB presentations that can be helpful for getting more familiar with the USB protocol. Using a free software based USB analyzer can help to understand more of the protocol of a functioning example on the hardware.
The examples all include source files for chapter nine of the USB specification which is the enumeration of the device including configuration. After converting the mass storage example to be a vendor specific example the source code can be cleaned up to make it much simpler as shown in the following points.
The class storage source files, xusb_class_storage.*, can be deleted completely since they support the mass storage protocol, which is unneeded.
The examples support both USB 2.0 and 3.0, so if you plan to use only 3.0 for high speed transfers then removing the 2.0 code can also greatly simplify the example.
Renaming the xusb_ch9_storage.* source files to xusb_ch9_cfg.* helps with clarity as they are not specific to storage.
Moving the start of any device class specific protocol into the main source file (xusb_ch9_storage.c to xusb_intr_example.c) after configuration is complete cleans up the design.
The main source file, xusb_intr_example, is cleaned up to remove the storage support including the virtual flash storage and the associated command wrapper data (CBW, CSW).
Enumeration of the USB device is a protocol that is common across all devices and separate from the USB device class protocol the device implements. These protocols are separate in the example source files but are all named storage which can be confusing.
The EpBufferRecv function performs a bulk out transfer (host to device) while the EpBufferSend function performs a bulk in transfer (device to host). The EpBufferSend function can be used without the EpBufferRecv function when only sending data to the host. The bulk data handlers get called after the bulk transfer completes so that the next transfer can be set up by the software.
Bulk transfers for USB 3.0 use 1024 byte packets on the USB bus. Larger transfers are supported at the API level and are required for high performance systems.
The USB driver uses the DMA of the USB Controller to implement the USB protocol. The USB driver performs cache maintenance for the driver data structures and the data buffers that are used by DMA assuming the data buffers were produced by the CPU.
The mass storage example provides debug by defining CLASS_STORAGE_DEBUG when building the example. This debug is an excellent way to learn the details of the mass storage USB protocol.
Adding debug to the enumeration process such as in the Usb_SetConfigurationApp function is helpful to understand when the host configures the device.
Enumeration failures can happen when making extensive changes to the device example and these can be harder to debug because software based USB analyzers will not have anything to capture. Software based USB analyzers depend on the host to enumerate the device successfully and generally provide a higher level capture. Hardware based USB analyzers are recommended for those doing extensive USB work. Smaller iterative changes from a baseline can help avoid a failing enumeration that is difficult to debug.
The interrupt from the FIFO to the CPU is connected in hardware, but is not used in the modified software example.
Create a project in Vitis for the ZCU102 board based on the XSA file exported from Vivado. From the CPU BSP, import the xusb_intr_example to get started.
USB Device Software Changes
Edit the source file xusb_ch9_storage.c to make the following changes.
Alter the USB Vendor to be Xilinx making it easier to see the device on the host. This change is not mandatory but makes testing easier.
Alter the USB 3 configuration to change it from USB_CLASS_STORAGE to USB_CLASS_VENDOR. This changes the device so that it no longer appears as a mass storage device to the host and prevents the host from binding a mass storage driver to the device.
In the Usb_SetConfigurationApp function, remove the call to the EpBufferRecv function which is specific to the mass storage example.
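The class change described above comes down to one field of the standard USB interface descriptor. The following is a minimal, self-contained sketch of that descriptor and not the layout used in the example source. The structure name and the make_vendor_interface helper are hypothetical, the class code values are the USB-IF assigned codes (0x08 for mass storage, 0xFF for vendor specific) defined locally here, and the subclass and protocol values are placeholders.

```c
#include <stdint.h>

/* USB-IF assigned class codes, defined locally for this sketch. */
#define USB_CLASS_STORAGE 0x08u
#define USB_CLASS_VENDOR  0xFFu

/* Standard USB interface descriptor: nine one-byte fields. */
struct usb_interface_desc {
    uint8_t bLength;
    uint8_t bDescriptorType;
    uint8_t bInterfaceNumber;
    uint8_t bAlternateSetting;
    uint8_t bNumEndpoints;
    uint8_t bInterfaceClass;
    uint8_t bInterfaceSubClass;
    uint8_t bInterfaceProtocol;
    uint8_t iInterface;
};

/* Build an interface descriptor for a vendor specific device with one
 * bulk in and one bulk out endpoint (hypothetical helper). */
struct usb_interface_desc make_vendor_interface(void)
{
    struct usb_interface_desc d = {
        .bLength            = (uint8_t)sizeof(struct usb_interface_desc),
        .bDescriptorType    = 0x04u, /* INTERFACE descriptor type */
        .bInterfaceNumber   = 0u,
        .bAlternateSetting  = 0u,
        .bNumEndpoints      = 2u,
        .bInterfaceClass    = USB_CLASS_VENDOR, /* was USB_CLASS_STORAGE */
        .bInterfaceSubClass = 0x00u, /* placeholder, vendor defined */
        .bInterfaceProtocol = 0x00u, /* placeholder, vendor defined */
        .iInterface         = 0u,
    };
    return d;
}
```

Because the host selects class drivers from bInterfaceClass, a value of 0xFF leaves the interface unclaimed by the mass storage driver and available to a user space application.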
Edit the xusb_intr_example.c source file to make the following changes.
Add 2 new include files for the GPIO and FIFO drivers.
Edit the storage_data structure to change the initialization of the Usb_ClassReq member from the ClassReq function to NULL. This removes the requirement for any device class handling as the new vendor specific class does not require it.
Add the following highlighted code to the end of the main function. The comments in the code describe the functionality.
Edit the BulkInHandler function, removing the existing mass storage processing and replacing it with the highlighted code. This code checks whether there is 64K of data in the FIFO and, if so, starts sending it to the host using a bulk in transfer. In theory, smaller transfers might not require the keyhole address for the read data of the FIFO, but this was not tested.
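The decision made in the bulk in handler can be sketched independently of the drivers. The code below is not the highlighted code from the example. It assumes the FIFO occupancy is reported in 32-bit words, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define TRANSFER_BYTES   (64u * 1024u) /* size of one bulk in transfer */
#define BYTES_PER_WORD   4u            /* FIFO holds 32-bit words */
#define USB3_BULK_PACKET 1024u         /* USB 3.0 bulk packet size */

/* True when the FIFO holds at least one full 64K transfer.
 * occupancy_words is assumed to be the FIFO occupancy in 32-bit words. */
bool fifo_has_full_transfer(uint32_t occupancy_words)
{
    return occupancy_words >= TRANSFER_BYTES / BYTES_PER_WORD;
}

/* Number of bus packets needed to carry transfer_bytes, rounding up. */
uint32_t packets_per_transfer(uint32_t transfer_bytes)
{
    return (transfer_bytes + USB3_BULK_PACKET - 1u) / USB3_BULK_PACKET;
}
```

A 64K transfer corresponds to 16384 words in the FIFO and to 64 packets of 1024 bytes on the bus, which matches the 256 word packets produced by the data generator.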
R5 Performance Tuning
The following driver change is applicable for this system but may not be applicable to every system. In this system the CPU does not create the data that is sent over USB, so none of that data will be in the CPU cache. Removing the cache operation for the data being sent improves performance for the R5 by about 20%. By default, without this driver change, the A53 delivers about 20% higher performance than the R5 in this application.
Edit the driver file xusbpsu_ephandler.c and in the function XUsbPsu_EpBufferSend remove the flush of the data cache only for the buffer. The TRB must still be flushed as it is created by the CPU.
Bulk Out Details: there are also cache maintenance operations (invalidate) for the data buffer in the path for bulk out transfers in the functions XUsbPsu_EpBufferRecv() and XUsbPsu_EpXferComplete() that must be eliminated to get maximum performance when the R5 is only being used to control the USB and is not consuming the data.
The burst size, XUSBPSU_DEPCFG_BURST_SIZE(0x3), can also be tuned in XUsbPsu_SetEpConfig() to allow better performance.
The lsusb utility on a Linux host is the primary tool for checking the device from the host. With the USB device software running on the MPSoC, the host should see the device on USB as illustrated below.
The lsusb utility with the -s and verbose options (lsusb -s 04:64 -v) is very useful with a lot of output about the device endpoints. The shorter -t option is illustrated below.
A Linux host is used for the USB testing. libusb provides a nice framework that makes USB communication simpler. A simple libusb application would be easy to write but would not give the highest performance. Since this application requires high performance, the asynchronous transfer feature of libusb is used. A key to higher performance is the size of the transfers. Testing showed that transfers of 64K or larger perform well. 128K transfers yield slightly higher performance, but the USB device FIFO data register address space is only 64K by default, so the address space size must be altered to support 128K.
Data verification is not in the application but was performed to verify the data was correct based on the data generation in the PL (an incrementing 32 bit value).
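A check of this kind could look like the following minimal sketch. It is not the code that was used for the verification; the function name is hypothetical, and it assumes the received bytes are interpreted as 32-bit words in the host's native byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Check that buf holds n consecutive incrementing 32-bit values
 * starting at *expected. On success returns n and leaves *expected at
 * the next value, so the check can continue across transfers. On a
 * mismatch returns the index of the first bad word. */
size_t verify_count(const uint32_t *buf, size_t n, uint32_t *expected)
{
    for (size_t i = 0; i < n; i++) {
        if (buf[i] != *expected)
            return i;
        (*expected)++; /* wraps naturally at 2^32 */
    }
    return n;
}
```

Seeding the expected value from the first word received and carrying it from one transfer to the next detects dropped or repeated data at transfer boundaries as well as inside a transfer.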
The following Makefile is used to build the test application, test.c, creating an executable named test. The Makefile sets up the compiler and linker flags to use the libusb library.
Host USB Capture
The Virtual USB Analyzer application along with the usbmon driver is used on Linux to capture USB transactions.
The following output of the host test program illustrates the system performance running on the R5 executing from DDR.
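A throughput figure of this kind is derived from the number of bytes received and the elapsed time. The helper below is a hypothetical sketch of that arithmetic, not taken from the test program, and it assumes MB means 10^6 bytes.

```c
#include <stdint.h>

/* Throughput in MB/s (10^6 bytes per second) for bytes moved in the
 * given number of seconds. Returns 0 for a non-positive duration. */
double throughput_mb_per_sec(uint64_t bytes, double seconds)
{
    if (seconds <= 0.0)
        return 0.0;
    return (double)bytes / seconds / 1.0e6;
}
```

For example, 6104 transfers of 64K completed in one second come to roughly 400 MB/s.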
Prototyping with the A53 shows similar performance, with close to 400 MB/sec appearing to be the maximum system performance with the USB controller and driver. The USB host performance must also be considered.
The PS DDR is the easiest method for performance testing of bulk in and bulk out transfers as no PL design is needed at all.
A PL BRAM is an easy alternative for performance testing of bulk in and bulk out transfers while also testing the path into the PL.
Another design was tested which used 2 endpoints in the USB device. This design yielded good performance with the R5, about 20% higher than a single endpoint, but complicated the PL design since there is only one stream of data. The driver optimization described above for the R5 brings a single endpoint to about the same performance without the added complexity of multiple endpoints and their data ordering and combining requirements.
Prototyping was also done using TCM and OCM for the USB transfers. TCM gets a bit complicated because the R5 sees the TCM at a local address while the USB DMA must use the global address of the TCM. Using TCM or OCM for the USB data did not help the performance. Running the R5 application from TCM or OCM rather than DDR did not seem to help the performance either, which is likely due to the CPU caches.
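The local to global TCM translation mentioned above can be sketched as follows. The addresses are assumptions for R5 core 0 in split mode (local ATCM at 0x00000000, local BTCM at 0x00020000, 64K per bank, global base 0xFFE00000) and should be verified against the device's address map before use. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for R5 core 0 in split mode; verify them against the
 * device documentation before relying on them. */
#define R5_0_TCM_GLOBAL_BASE 0xFFE00000u
#define TCM_A_LOCAL_BASE     0x00000000u
#define TCM_B_LOCAL_BASE     0x00020000u
#define TCM_BANK_SIZE        0x00010000u /* 64K per bank */

/* Translate an R5 local TCM address into the global address that a
 * DMA master such as the USB controller must use. Returns false when
 * the local address is outside both TCM banks. */
bool r5_0_tcm_local_to_global(uint32_t local, uint32_t *global)
{
    bool in_a = local < TCM_A_LOCAL_BASE + TCM_BANK_SIZE;
    bool in_b = local >= TCM_B_LOCAL_BASE &&
                local < TCM_B_LOCAL_BASE + TCM_BANK_SIZE;

    if (!in_a && !in_b)
        return false;
    *global = R5_0_TCM_GLOBAL_BASE + local;
    return true;
}
```

A buffer address taken from a pointer on the R5 would pass through a translation like this before being handed to the USB driver as a DMA address.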
USB 3.0 has been shown to be effective for reading data from the PL without requiring the data to be buffered in DDR.