Zynq UltraScale+ MPSoC Graphics - GPU Profiling Using ARM Streamline Performance Analyzer

Document History

Date | Version | Author | Description of Revisions
11/11/2016 | 1.0 | Rajesh Gugulothu | Initial release
21/02/2017 | 2.0 | Rajesh Gugulothu | Updated with 2016.4 tools release and added design files to support ZCU102 Rev-B/Rev-D and Rev-1.0 boards
10/07/2017 | 3.0 | Rajesh Gugulothu | Updated with 2017.1 tools release and added design files to support ZCU102 Rev-D2 with production silicon, Rev-1.0 boards with production silicon, and Rev 1.0 board
14/06/2018 | 4.0 | Surender Polsani | Updated with 2018.1 tools release and added design files to support ZCU102 Rev-D2 with production silicon, Rev-1.0 boards with production silicon, and Rev 1.0 board


Implementation Details
Design Type | PS Only
SW Type | Linux
CPUs | Quad-core ARM® Cortex™-A53 Application Processing Unit, ARM® Mali™-400 MP2 Graphics Processing Unit
PS Features | DDR controller, UART, SD/eMMC interface, USB 3.0, DisplayPort
Boards/Tools | ZCU102 Rev-B/Rev-D, Rev 1.0 with production silicon, Rev-1.0
Xilinx Tools Version | SDK 2018.1
Other Details | ARM® DS-5 Development Studio Streamline performance analyzer
Host Type | Windows 64-bit
Files Provided
ARM_Streamline_Performance_Analyzer.zip | Archive file containing the Design_files directory


  • Zynq® UltraScale+™ MPSoC delivers heterogeneous multi-processing by combining seven user-programmable processors: a quad-core ARM® Cortex™-A53 Application Processing Unit (APU), a dual-core 32-bit ARM® Cortex™-R5 Real-Time Processing Unit (RPU), and an ARM® Mali™-400 MP2 Graphics Processing Unit (GPU). It is the industry's first multi-processor SoC, delivering 5x system-level performance per watt and any-to-any connectivity.
  • This tech tip provides a step-by-step procedure for analyzing the performance of a graphics application running on the Zynq UltraScale+ MPSoC Mali GPU using the ARM® DS-5 Streamline performance analyzer.


  • The ARM® DS-5 Development Studio Streamline performance analyzer helps you get the best out of your system's resources and create high-performance, energy-efficient products. Its user interface brings together system performance metrics, software tracing, statistical profiling, and power measurement in a system dashboard where you can quickly identify code hotspots, system bottlenecks, and other unintended effects of your code or the system architecture.
  • This tech tip shows how to analyze the performance of a graphics application using the ARM® DS-5 Development Studio Streamline performance analyzer, which lets you trace system bottlenecks and helps you optimize your application. It also provides the steps to compile the complete Linux system for Zynq UltraScale+ MPSoC and demonstrates how to analyze a simple graphics application.
  • Highlights of this tech tip (main topics covered):
    • Executing the tech tip with prebuilt images.
    • Compiling and building Linux for different silicon versions.
    • Compiling and building the gator module and gator daemon.
    • Installing the ARM Streamline performance analyzer tool on a Windows machine.
    • Profiling an example graphics application.


  • Linux or Windows host machine.
  • Xilinx ZCU102 (Rev D and above) evaluation kit with power supply.
  • Class 10 SD card (4 GB or more).
  • Micro USB to standard USB cable.
  • 1080p (Full HD) monitor and DisplayPort cable.
  • USB 3.0 connector or USB 2.0 micro cable to standard USB female adapter, plus a USB hub to connect a USB mouse and USB keyboard (or a USB keyboard with an integrated mouse).
  • A serial terminal emulator such as Tera Term or PuTTY for the serial console output. Tera Term can be downloaded from http://download.cnet.com/Tera-Term/3000-20432_4-75766675.html; follow the instructions there to install it.

Block Diagram:

  • The block diagram below depicts the hardware stack and the software flow for profiling a graphics application running on the Zynq® UltraScale+™ MPSoC GPU using the ARM Streamline performance analyzer.

Executing the tech tip with prebuilt images:

  • This section explains how to execute this tech tip using the prebuilt images supplied with it. If you want to build all the images from scratch, skip this section.
  • First, install the performance analyzer tool by following the section "Installing ARM Streamline performance analyzer tool on windows machine" below.
  • Download the ARM_Streamline_Performance_Analyzer.zip file using the link provided at the top of this tech tip and extract it to a local directory. It extracts into the following directories:
    • Rev_1_0_Design_Files - Contains images for the ZCU102 Rev-1.0 board.
    • Rev_D_Prod_Design_Files - Contains images for ZCU102 Rev-B/Rev-D and Rev 1.0 boards with production silicon.
  • Depending on your board, copy the contents of the corresponding Prebuilt_SD_Images directory onto an SD card (4 GB or more).
  • Set up the board by following the "ZCU102 Board Setup" section below.
  • Once the board setup is done, follow the section "Profiling an example graphics application" to execute the demo.
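On a Linux host, the copy step above can be scripted. The following is a minimal sketch rather than part of the supplied design files: the function name copy_prebuilt_images and its arguments are hypothetical, and it assumes the archive has already been extracted and the SD card is already mounted.

```shell
# Copy the prebuilt images for one board variant onto a mounted SD card.
# Usage: copy_prebuilt_images <extracted_dir> <design_dir_name> <sd_mount_point>
#   <design_dir_name> is Rev_1_0_Design_Files or Rev_D_Prod_Design_Files.
# The body runs in a subshell, so "exit" only ends this function call.
copy_prebuilt_images() (
    src="$1/$2/Prebuilt_SD_Images"
    dest="$3"
    if [ ! -d "$src" ]; then
        echo "error: image directory $src not found" >&2
        exit 1
    fi
    if [ ! -d "$dest" ]; then
        echo "error: SD card mount point $dest not found" >&2
        exit 1
    fi
    # Copy every file in the image directory, then flush writes to the card.
    cp "$src"/* "$dest"/ && sync
)
```

For example, `copy_prebuilt_images ~/Downloads/extracted Rev_D_Prod_Design_Files /media/sdcard` would copy the production-silicon images; both paths here are placeholders for your own locations.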

Compiling and building Linux images:

  • This section goes through the procedure to compile Linux for the different versions of the ZCU102 board. It covers the following topics:
  • Preparing the Linux kernel using PetaLinux 2018.1
    • BSP Creation
      • Kernel Configuration
      • Device Tree Configuration
    • Building the kernel and device tree blob
    • Preparing the FSBL
    • Preparing the BOOT.bin file
  • Download the PetaLinux installer petalinux-v2018.1-installer.run and the ZCU102 BSP for the corresponding board.
  • Note: Download the BSP that matches your board: the ZCU102 BSP (prod-silicon) for Rev-B/Rev-C/Rev-D and Rev 1.0 boards with production silicon, or the ZCU102 ES2.0 Rev1.0 BSP for the Rev-1.0 board.
  • Install PetaLinux by running the downloaded installer.
    • $ ./petalinux-v2018.1-installer.run
    Note: Refer to the PetaLinux user guide for details on installing PetaLinux on a Linux host machine.
  • After the installation is done, set up the PetaLinux environment by running the command below in a bash shell.
    • $ source <Petalinux_installation_path>/settings.sh
    Note: For C shell, use source <Petalinux_installation_path>/settings.csh instead.
  • Verify that the PETALINUX environment variable is set to the installation path.
    • $ echo $PETALINUX
  • Create the PetaLinux project with the command below.
    • $ petalinux-create -t project -s <path to the downloaded zcu102 bsp>/Xilinx-ZCU102-v2018.1-final.bsp
    Note: In the above command, give the BSP path according to the board revision.
  • Change to the directory of the created PetaLinux project.
  • Configure the kernel with the option below and save the configuration.
    • $ petalinux-config -c kernel
    • Kernel hacking > Tracers > Kernel Function Tracer
  • Configure the rootfs with the packages listed below.
    • $ petalinux-config -c rootfs
    • Filesystem Packages > libs > libmali-xlnx > libmali-xlnx
    • Filesystem Packages > libs > libmali-xlnx > libmali-xlnx-dev
    • Petalinux Package Groups > packagegroup-petalinux-x11 > packagegroup-petalinux-x11
    • Petalinux Package Groups > packagegroup-petalinux-x11 > packagegroup-petalinux-x11-dev
  • After enabling all the packages, save the configuration and exit the rootfs configuration menu.
  • Once the above changes are done, build the PetaLinux project.
    • $ petalinux-build
  • After the PetaLinux build is done, the output binaries are created under Xilinx-ZCU102-2018.1/images/linux.
  • Create BOOT.BIN from the binaries created above. BOOT.BIN is created under the Xilinx-ZCU102-2018.1/images/linux directory.
    • $ petalinux-package --boot --fsbl Xilinx-ZCU102-2018.1/images/linux/zynqmp_fsbl.elf --u-boot Xilinx-ZCU102-2018.1/images/linux/u-boot.elf --fpga Xilinx-ZCU102-2018.1/images/linux/design_1_wrapper.bit

Copy the BOOT.BIN and image.ub files onto the SD card.

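The build sequence in this section can be collected into one script. The sketch below is illustrative only: the helper names run and build_images and the DRY_RUN and PROJECT_DIR variables are hypothetical, the project directory name is the one used in this tech tip, and absolute image paths are assumed so that the packaging command does not depend on the current directory. By default it only prints the commands; the two petalinux-config steps open interactive menus when actually executed.

```shell
# Print each command (default) or execute it when DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: build_images <path to the downloaded ZCU102 BSP>
build_images() {
    bsp="$1"
    # Project directory created from the BSP (name taken from this tech tip).
    proj="${PROJECT_DIR:-$PWD/Xilinx-ZCU102-2018.1}"
    img="$proj/images/linux"

    run petalinux-create -t project -s "$bsp" &&
    run cd "$proj" &&
    run petalinux-config -c kernel &&
    run petalinux-config -c rootfs &&
    run petalinux-build &&
    run petalinux-package --boot \
        --fsbl "$img/zynqmp_fsbl.elf" \
        --u-boot "$img/u-boot.elf" \
        --fpga "$img/design_1_wrapper.bit"
}
```

Running `build_images <bsp_path>` in a shell where settings.sh has been sourced prints the six steps in order, which is a quick way to review them before setting DRY_RUN=0.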
Compiling & building gator module and gator daemon:

  • This section explains how to compile and build the gator kernel module and the gator daemon.
  • Clone the gator-6.1 source from the git path provided below
  • Change to the daemon directory under the gator source and run the commands below to build the gator daemon.
    • $ bash
    • $ export CROSS_COMPILE=<petalinux_installed_directory>/tools/linux-i386/aarch64-linux-gnu/bin/aarch64-linux-gnu-
    • $ cd <Gator_source>/daemon
    • $ make -f Makefile_aarch64
    Note: Make sure that the cross-compilation path is set in the current working environment.
  • After the build completes, copy the gatord binary created under the <Gator_source>/daemon directory to the SD card.
  • Extract the Mali driver source code with the commands below.
    • $ cd <petalinux_project_directory>/build/downloads
    • $ tar -xvf DX910-SW-99002-r5p1-01rel0.tgz
    Note: The archive name and the directory it extracts to depend on the Mali driver release present in build/downloads. The CONFIG_GATOR_MALI_4XXMP_PATH value in the build command below must point to the directory that was actually extracted.
  • Build the gator.ko module with the commands below.
    • $ cd <Gator_source>/driver
    • $ bash
    • $ export CROSS_COMPILE=<petalinux_installed_directory>/tools/linux-i386/aarch64-linux-gnu/bin/aarch64-linux-gnu-
    • $ GATOR_WITH_MALI_SUPPORT=MALI_4xx CONFIG_GATOR_MALI_4XXMP_PATH=<petalinux_project_directory>/build/downloads/DX910-SW-99002-r7p0-00rel0/driver/src/devicedrv/mali/ make -C <petalinux_project_directory>/build/tmp/work-shared/zcu102-zynqmp/kernel-build-artifacts/ M=`pwd` ARCH=arm64 modules

  • Note: In the gator.ko build command, it is mandatory to use the PetaLinux project directory (Xilinx-ZCU102-2018.1) created in the previous section and to give absolute paths.
    After the command completes, the gator.ko module is created under the <Gator_source>/driver directory.
  • Copy the gatord and gator.ko files built above onto the SD card (the same card used earlier for the BOOT.BIN and image.ub images).
  • Make sure that the files listed below are present on the SD card, and use this card for executing the demo.
    • BOOT.BIN
    • image.ub
    • gatord
    • gator.ko
    • tri_cube (copy this application from the Prebuilt_SD_Images directory).
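The file list above can be checked with a short script before moving the card to the board. This is a minimal sketch; the function name check_sd_files is hypothetical, and it assumes the SD card is mounted on the host at the path passed as the argument.

```shell
# Report which of the files required for the demo are missing from the
# SD card. Usage: check_sd_files <sd_mount_point>
# Exit status is 0 when all files are present, 1 otherwise.
# The body runs in a subshell, so "exit" only ends this function call.
check_sd_files() (
    dir="$1"
    missing=0
    for f in BOOT.BIN image.ub gatord gator.ko tri_cube; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    exit "$missing"
)
```

A call such as `check_sd_files /media/sdcard && echo "SD card ready"` prints the ready message only when all five files are found; the mount point is a placeholder.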

Installing ARM Streamline performance analyzer tool on Windows machine:

  • This section goes through the procedure for installing the Streamline performance analyzer tool on a Windows machine.
  • Note: This tech tip assumes a 64-bit Windows host machine.

ZCU102 Board Setup:

  • Before executing the demo, make sure the host system and the board are connected properly. The steps are as follows.
  • Connect the Micro USB cable to the ZCU102 board Micro USB port J83, and the other end to an open USB port on the host PC. This cable is used for UART over USB communication.
  • Connect one end of an Ethernet cable to the ZCU102 connector J73, and the other end to the Ethernet socket of the host machine.
  • Because the target board and the host machine are connected directly in the step above, set a static IP address on the host machine by following the steps below.
    • On the Windows machine, click the Start button and select Control Panel --> Network and Internet --> Network and Sharing Center.
    • In the left pane, select Change adapter settings.
    • Right-click Local Area Connection and select Properties.
    • A User Account Control dialog box appears; select Yes.
    • In the Local Area Connection Properties dialog, select Internet Protocol Version 4 (TCP/IPv4) and click Properties, as marked in the figure below.
    • In the properties dialog that opens, select the Use the following IP address option and enter the IP address and subnet mask for the host.
  • Insert the SD card into the SD card slot J100.
  • Set the SW6 switches as shown. This configures the boot settings to boot from SD.
  • Connect 12V power to the ZCU102 6-pin Molex connector.
  • Connect the monitor to U50 using the DisplayPort cable.
  • The following figure shows the ZCU102 board with connections
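As a cross-check of the network settings chosen during board setup, the host and target addresses can be tested for membership in the same subnet before powering on. The sketch below is illustrative: the function names ip_to_int and same_subnet are hypothetical, and the addresses in the usage note are example values only, not the ones used by this tech tip.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
# The body runs in a subshell so the IFS change stays local.
ip_to_int() (
    IFS=.
    set -- $1    # split the address into its four octets
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# Succeed when both addresses fall in the same subnet for the given mask.
# Usage: same_subnet <host_ip> <target_ip> <subnet_mask>
same_subnet() (
    host=$(ip_to_int "$1")
    target=$(ip_to_int "$2")
    mask=$(ip_to_int "$3")
    [ $(( host & mask )) -eq $(( target & mask )) ]
)
```

For instance, `same_subnet 192.168.1.10 192.168.1.20 255.255.255.0` succeeds, while replacing the second address with 192.168.2.20 fails under the same mask.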

Profiling an example graphics application:

  • This section walks through how to run the graphics application and how to capture and analyze its performance data using the ARM Streamline performance analyzer.
  • Power on the board.
  • Set up the Tera Term application with the following configuration.
    • Baud rate: 115200
    • Stop bits: 1
    • Data bits: 8
    • Parity: none
    Note: In the serial port selection, select Interface 0, as marked in yellow in the figure below.

  • Once Tera Term is connected to the target, the Linux boot log appears on the terminal. After a few seconds, it prompts for the Linux username and password. Enter "root" for both.
  • Set the IP address on the target with the following command.
    • $ ifconfig eth0 <target_ip_address>
    Note: The target IP address is the user's choice, but make sure that the host and target IP addresses are in the same network domain.
  • Verify that the host and target are connected properly by pinging from either side.
  • To communicate with the target device, the Streamline performance analyzer tool requires the gator daemon (gatord) and the gator kernel module (gator.ko) to be running on the device. Type the commands below at the Linux command prompt to insert the module and run the gatord daemon.
    • $ mount /dev/<device_name> /media
    • $ cd /media
    • $ insmod gator.ko
    • $ ./gatord &
    Note: While the gatord daemon is running, a few error messages may appear. They do not affect functionality.
  • Run the example tri_cube application with the commands below.
    • $ ln -s /usr/lib/libMali.so.8 /usr/lib/libMali.so.1
    • $ export DISPLAY=:0.0
    • $ ./tri_cube &
    Observe the application running on the display connected to the board, as shown.
  • Open the ARM DS-5 Streamline performance analyzer tool and enter the target IP address in the box marked in the figure below.
Fig: Setting the target IP address in the Streamline tool

  • Click the capture and analysis option, as shown in the figure below.

  • The capture and analysis wizard opens, as shown in the figure below. Configure the required options and click Save.
Fig: Capture and analysis wizard

  • Click the counter configuration option, marked in red in the figure below.
Fig: Selecting the counter configuration option

  • Review the options available in the counter configuration wizard and save.
Fig: Counter configuration wizard
  • Click the Start capture option, as shown in the figure below.
Fig: Selecting the Start capture option

  • After you click Start capture, the tool asks for a location to save the captured data. Browse to the working directory and save the file for future analysis. The GPU performance metrics then start appearing in the Streamline performance analyzer, as shown in the figure below.
Fig: GPU performance metrics information
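Because the tool needs the gator kernel module on the target, a small pre-flight check run on the board before starting a capture can confirm that insmod succeeded. This is a sketch under stated assumptions: the function names module_loaded and ensure_gator_loaded are hypothetical, and the optional second argument (a file in /proc/modules format) exists only so the check can be tried off-target.

```shell
# Succeed if the named module appears in a /proc/modules-style listing,
# where the first space-separated field of each line is the module name.
# Usage: module_loaded <name> [modules_file]
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}

# Report whether gator.ko is loaded; return 1 with a hint when it is not.
# Usage: ensure_gator_loaded [modules_file]
ensure_gator_loaded() {
    if module_loaded gator "${1:-/proc/modules}"; then
        echo "gator module is loaded"
    else
        echo "gator module is not loaded; run: insmod gator.ko"
        return 1
    fi
}
```

On the target, `ensure_gator_loaded && ./gatord &` would start the daemon only after the module check passes.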

Appendix A: File Description in Design Directory
  • ARM_Streamline_Performance_Analyzer.zip is extracted as
    • Rev_1_0_Design_Files
      • Prebuilt_SD_Images
        • BOOT.BIN
        • image.ub
        • gatord
        • gator.ko
        • tri_cube
    • Rev_D_Prod_Design_Files
      • Prebuilt_SD_Images
        • BOOT.BIN
        • image.ub
        • gatord
        • gator.ko
        • tri_cube