Minimizing AArch64/EL3 Bare Metal Application Size
Minimizing a bare metal application’s size can speed up its boot time, free up memory for other uses, or reduce costs due to the smaller memory sizes required. This article lists some options to achieve this. Results are showcased using the default hello_world application on the VCK190 using the default GCC toolchain in Vitis 2021.2.
Background
AMD/Xilinx devices with processors have customizable hardware. As a result, the set of software accessible peripherals and their configurations can be different for two hardware designs even if they use the same part number. Arm v8 processors also have the ability to run hypervisors; bare metal applications can run at EL3 or EL1.
Vitis is run on a supported host to build user applications for a given hardware platform, either as a GUI or in a scripted manner. It generates a Board Support Package (BSP) based on the hardware design and populates it with appropriate drivers and default options. These drivers will be present regardless of whether you intend to use them in a specific application. In addition, you can incorporate higher level libraries for use by your bare metal application. Vitis also automatically generates a default linker script.
You can specify compiler and linker options for the BSP as well as your application. Your application files are then compiled and linked against the BSP and any optional libraries. The application is generated as an ELF format file, with a ‘.elf’ extension.
An ELF file is composed of sections. These sections contain meta-information as well as code/data for the application. Meta information consists of run-time attributes of the code/data, such as load addresses and the code entry point, as well as debug/build information. Some sections do not contain data, but instead instruct the loader to set up memory areas for the application, such as heap, stack and bss. As a result, an application’s run time memory footprint is typically larger than the loaded code/data. Meta information is conveyed to the boot loader (“BOOTROM”) in the BIN/PDI file by the Bootgen tool. Debug information is not loaded at run time. This means that the size of the ELF file on the host is only loosely related to the load-time and run-time code/data size.
Commonly used approaches
Optimization levels
Your application can be configured for ‘Debug’ or ‘Release’. These provide different default compiler options and can be further customized by the user. The default optimization level for Release is ‘-O2’. You can change the optimization level to ‘-O3’ or ‘-Os’ to reduce your application size. Some applications benefit from ‘-O3’ more than ‘-Os’. In addition, the BSP and libraries have their own flags that need to be set; changing an application's default configuration does not change the BSP’s options, and vice versa.
Application specific optimizations
Many libraries have ‘knobs’ for features that should be set appropriately for a particular application. Similarly, BSP drivers that are not used by a particular application can be disabled. However, this requires the user to be aware of all the hardware in the design and needs to be re-visited every time the BSP is regenerated due to hardware changes. There might be compile time defines available for specific libraries and drivers to remove unwanted code sections. The stack and heap size in the linker script can be reduced, but this must be done knowing the full use case of the application and so is not recommended by default.
Discarding unused code
Linker Scripts: Background
The available memories are indicated using the MEMORY command in a linker script. These correspond to physical hardware addresses; however, the user can edit them, for example to reserve a part of memory for another program.
The generated object files place code and data in ‘sections’. The linker script defines the available physical memories in the system and directs the linker to place different ‘input sections’ from the provided object files at different offsets in the available physical memories using ‘output sections’. The user application can control the input section for every variable/function in the application, with the compiler supplying default behavior. Output sections are consumed by the run-time environment’s loader and so must be compatible with it.
Code in C files is, by default, put into a ‘.text’ section. Global initialized data is placed in a ‘.data’ section, while uninitialized data is put in a ‘.bss’ section which is not loaded from the boot device. Assembly files typically declare additional sections such as ‘.vectors’, which have to be placed at specific memory locations in order to allow a processor to boot.
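As a rough illustration of these defaults, the following C fragment notes where GCC would typically place each object. The names boot_count, scratch and bump_boot_count are hypothetical and are not part of the BSP; exact placement depends on the compiler version and flags.

```c
#include <stdint.h>

/* Initialized global: typically placed in .data, so its initial
 * value is part of the image loaded from the boot device. */
uint32_t boot_count = 1;

/* Zero-initialized static: typically placed in .bss. It occupies
 * run-time memory but no space in the loaded image. */
static uint8_t scratch[4096];

/* Function body: typically placed in .text. */
uint32_t bump_boot_count(void)
{
    scratch[0] = (uint8_t)boot_count;  /* remember the previous count */
    boot_count++;
    return boot_count;
}

/* Accessor so that other files can inspect the .bss buffer. */
uint8_t last_saved_count(void)
{
    return scratch[0];
}
```

Here the 4096-byte buffer adds to the run-time footprint only, while the 4-byte counter adds to both the load size and the run-time footprint.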
Example Linker Script
Examine the linker script at https://github.com/Xilinx/embeddedsw/blob/xilinx_v2021.2/lib/sw_apps/img_rcvry/src/lscript.ld. This script defines constants such as “_STACK_SIZE”, defines named memory sections for high and low DDR, OCM and QSPI, the entry point of the code, and the order and location of input sections in the various output sections. Note that an entry such as “*(.boot)” pulls all input sections named “.boot” from all input files into the same output section. Note the KEEP for certain input sections, such as “KEEP (*(.vectors))”. This directs the linker to preserve the “.vectors” input section, even if its code/data appear to be unused by the application. This is because an external agent (the processor) requires the vectors during the boot/exception handling process. The entry point of the ELF file is noted as “_vector_table”, which is how the processor boots. The code at this offset eventually leads to the C entry point, ‘main()’, after setting up the processor and software environment for main(), and is also responsible for handling the return from “main()” by calling “exit()”.
Garbage Collection: Discarding unused input sections
The linker can be directed to discard unused input sections. When directed to do so, it creates a dependency graph between input sections starting at the entry point’s section; if code that has been marked for preservation leads to another section, that section is marked for preservation as well. Any leftover sections after this process completes are discarded, reducing code size. This behavior of the linker is triggered by the “--gc-sections” option. However, the default allocation of code into large sections such as ‘.text’ and ‘.data’ precludes discarding most unused code. Marking code and data at a finer level of granularity enables the linker to discard larger amounts of unused code and data.
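The marking process described above can be sketched as a small reachability pass. This is a toy model for illustration only, not how GNU ld is implemented; the section_graph type and the gc_sections function are hypothetical names, and real linkers track references per relocation rather than in a matrix.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_SECTIONS 8

/* Toy model of the input sections seen by the linker. */
struct section_graph {
    size_t count;                           /* sections in use (<= MAX_SECTIONS) */
    bool refs[MAX_SECTIONS][MAX_SECTIONS];  /* refs[i][j]: section i references section j */
    bool root[MAX_SECTIONS];                /* entry-point section or KEEP()-marked section */
};

/* Mark section i and everything reachable from it as preserved. */
static void mark(const struct section_graph *g, size_t i, bool *kept)
{
    if (kept[i])
        return;
    kept[i] = true;
    for (size_t j = 0; j < g->count; j++) {
        if (g->refs[i][j])
            mark(g, j, kept);
    }
}

/* Fills kept[] (at least g->count entries) and returns how many
 * sections were never reached, i.e. would be discarded. */
size_t gc_sections(const struct section_graph *g, bool *kept)
{
    size_t discarded = 0;

    for (size_t i = 0; i < g->count; i++)
        kept[i] = false;
    for (size_t i = 0; i < g->count; i++) {
        if (g->root[i])
            mark(g, i, kept);
    }
    for (size_t i = 0; i < g->count; i++) {
        if (!kept[i])
            discarded++;
    }
    return discarded;
}
```

With one large ‘.text’ section the graph has very few nodes, so almost nothing is unreachable; finer-grained sections give the pass more nodes to discard.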
Manual section names
The C compiler allows for manual marking of code/data into specific sections. However, this is a tedious process with very little return beyond the linker’s garbage collection. Manual marking is typically done for the opposite reason: code/data that must always be present for a particular processor and application, for example, boot vectors, interrupt routines, etc., in order to NOT discard those sections. The KEEP directive in the linker script is used for this purpose. For example, if a build ID must be present in the run-time code, but is not referenced by the code itself, it must be put in its own section with a KEEP directive.
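A minimal sketch of the build ID case, assuming GCC's section and used attributes. The section name .build_id, the variable name and the string contents are all hypothetical, and the linker script line in the comment is an assumption that must be adapted to your own lscript.ld.

```c
/*
 * Place the build identifier in its own input section. The 'used'
 * attribute asks the compiler to emit the object even if nothing in
 * this translation unit refers to it.
 *
 * The linker still discards unreferenced sections under --gc-sections
 * unless the linker script keeps this one, for example (sketch; the
 * output section name and memory region are placeholders):
 *
 *   .build_id : { KEEP (*(.build_id)) } > <your memory region>
 */
__attribute__((section(".build_id"), used))
const char build_id[] = "hello_a72 build 0001";
```

The attribute alone only controls placement; it is the KEEP entry that protects the section from garbage collection.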
Automatic section naming
The compiler can automatically place every function and global variable in its own uniquely named section. Such names will have a common prefix (such as ‘.text.foo’ for a function ‘foo’), allowing the placement of all such sections in one place in the linker script using “*(.text.*)”. This allows for efficient garbage collection with very little effort from the user. The compiler options ‘-ffunction-sections’ and ‘-fdata-sections’ are used for this purpose. These options should be used for ALL code making up the application: the BSP, the libraries, and the user’s application files, so they need to be specified in multiple menus. There is a slight run-time overhead to using these options because code ends up being aligned to particular boundaries, leaving holes in the memory space; but if the amount of code being discarded is large enough, it more than makes up for this overhead.
The combination of the compiler options and linker option allows for efficient removal of entire swathes of unused code, such as a driver that is not used in an application. However, note that the driver will still be built before being discarded. This will consume build time, require the driver to build without errors, and make the linker work harder.
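To make the effect concrete, here is a small sketch with one function that an application would call and two objects that nothing references. All names are hypothetical. The comments show the section names GCC would be expected to use when the file is compiled with ‘-ffunction-sections -fdata-sections’.

```c
#include <stdint.h>

/* Expected section: .text.checksum
 * Referenced by the application, so it survives --gc-sections. */
uint32_t checksum(const uint8_t *buf, uint32_t len)
{
    uint32_t sum = 0;
    for (uint32_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

/* Expected section: .data.unused_table
 * Only referenced from unused_selftest below. */
uint32_t unused_table[4] = { 1, 2, 3, 4 };

/* Expected section: .text.unused_selftest
 * If nothing calls this, the linker can discard it, and then
 * .data.unused_table becomes unreachable and is discarded too. */
uint32_t unused_selftest(void)
{
    return unused_table[0] + unused_table[3];
}
```

Without the two compiler options, all three objects would share ‘.text’ and ‘.data’ with everything else in the file, and the linker could not drop them individually. Listing the section headers of the object file with objdump -h is one way to check the names your toolchain actually produces.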
Summary
This technique can be used for all bare metal applications, regardless of whether they are executing under the control of a hypervisor. It is also generally applicable across all architectures, so long as the compiler supports it. Another advantage is that this can be used with the Debug configuration as well. Debug mode typically takes up more code space, so garbage collection allows the code to fit in tight memory situations.
Discarding MMU page tables
Modern processors typically have an MMU for use by high level operating systems such as Linux, allowing for dynamic virtualization of addresses. However, they can operate with the MMU disabled, for example when the boot code runs while setting up the initial multi-level page tables for the MMU. These tables define the properties of different pages, such as whether the page represents normal memory or memory-mapped devices, which allows the processor to correctly and efficiently access that page. However, the page tables themselves consume a large amount of memory.
Bare metal BSPs typically enable the MMU and set up an identity mapping for it using static page tables. These tables will not be discarded by the linker, because they are referenced by the “.boot” section via the entry point. Leaving the MMU disabled is an opportunity to reduce the application’s footprint and speed up boot time. The disadvantage is that, without page tables, the processor has to be conservative about the kind of memory being referenced, and so it sacrifices efficiency at run time. This might be acceptable for some applications, and the compiler has to be informed of this situation so that it does not generate code that assumes normal memory operation.
The hypervisor in use can determine the acceptability of disabling the MMU for an EL1 application. EL3 applications can disable the MMU at their discretion.
Generating aligned memory accesses
The compiler must be instructed to generate code that accesses data in an aligned manner, due to the lack of an MMU. This is done with the ‘-mstrict-align’ option. GCC version 10 is required to handle -mstrict-align correctly for certain optimization levels and is provided with Vitis 2021.2.
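The compiler option covers the code the compiler generates, but application code can also avoid creating misaligned pointers in the first place. The sketch below reads a 32-bit value from an arbitrary byte offset using memcpy instead of a pointer cast; the function name is hypothetical. With ‘-mstrict-align’ the compiler is expected to lower the copy to accesses that are safe for the destination alignment.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value in native byte order from a pointer that may
 * not be 4-byte aligned.
 *
 * Avoid the tempting alternative
 *     return *(const uint32_t *)p;
 * which is undefined behavior for a misaligned p and can fault when
 * memory is treated as device memory. */
uint32_t read_u32_unaligned(const uint8_t *p)
{
    uint32_t value;
    memcpy(&value, p, sizeof value);
    return value;
}
```

A typical use would be parsing a packed header out of a receive buffer, where fields start at odd offsets.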
Discarding unused stacks
The Armv8 architecture allows each exception level to have its own stack. The typical boot process allocates space for each level as it has to service different scenarios involving multiple exception levels. If your application is restricted to EL3, you can safely remove the stacks for the other exception levels. This does not reduce the load size; it reduces the run-time memory footprint.
Code walkthrough
We use https://github.com/Xilinx/embeddedsw/blob/xilinx_v2021.2/lib/bsp/standalone/src/arm/ARMv8/64bit/gcc/asm_vectors.S, https://github.com/Xilinx/embeddedsw/blob/xilinx_v2021.2/lib/bsp/standalone/src/arm/ARMv8/64bit/gcc/boot.S, and https://github.com/Xilinx/embeddedsw/blob/xilinx_v2021.2/lib/bsp/standalone/src/arm/ARMv8/64bit/gcc/xil-crt0.S to understand the flow for EL3.
The ‘.vectors’ and ‘.boot’ sections together set the stage for running the application via _startup in the .text section. Execution at boot starts from the linker script defined entry point _vector_table (line 164 of asm_vectors.S), which branches to _boot on line 199. Notice that boot.S defines the symbols and aliases for the different levels of MMU tables (lines 74-76 and 94-96). Lines 245-258 set up the base pointer to the level 0 table, and the set of all memory attributes available for page tables. In addition, line 290 enables the MMU and line 295 jumps to _startup in “xil-crt0.S”. This file sets up the C environment prior to jumping to main (as a function call), and then arranges to call “exit” after it returns. In particular, lines 70-90 zero out the BSS.
In addition, boot.S references stacks for all exception levels (lines 80-89). The stack space is allocated in the linker script (lines 16-18, 308-319) for EL2-EL0.
Summary
Actions for AArch64 + gcc builds
Line numbers in this section refer to the example files in the links above; adjust as needed for your specific case.
General instructions
Add a copy of boot.S from your specific BSP to the application. In this file, the following lines should be commented out: 74-76, 81-83, 87-89, 94-96, 245-258, 290. This file will be compiled with the application, and its code will be linked prior to the default boot.S in the BSP, so the default code will not be used.
Edit the linker script and comment out lines 16-18 and 308-319 to eliminate the space used by the EL0-EL2 stacks. The MMU page tables will be discarded. Comment out the MMU tables (lines 187-203) as well, to prevent unnecessary page-aligned holes.
Configure the application for a Release build. Change the build flags for the application, libraries and BSP together. Set the optimization to ‘-O3’ or ‘-Os’ as appropriate and use ‘-ffunction-sections -fdata-sections -mstrict-align’. The linker flags should also include ‘--gc-sections’.
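Because these flags have to be set in several places, it is easy to miss one. The following sketch shows one possible compile-time guard that could be added to an application source file. It relies on two assumptions that are worth confirming with your toolchain: that GCC defines __OPTIMIZE__ at any optimization level above -O0, and that the ACLE macro __ARM_FEATURE_UNALIGNED is left undefined when ‘-mstrict-align’ is in effect. MMU_DISABLED_BUILD is a hypothetical macro that you would pass yourself with -D; it is not defined by Vitis or the BSP.

```c
/* Opt-in guard: only active when the (hypothetical) macro
 * MMU_DISABLED_BUILD is defined on the compiler command line. */
#if defined(MMU_DISABLED_BUILD) && defined(__aarch64__) && \
    defined(__ARM_FEATURE_UNALIGNED)
#error "MMU-off build: add -mstrict-align to the compiler flags"
#endif

/* Returns 1 when this file was compiled with optimization enabled
 * (assumption: GCC defines __OPTIMIZE__ for -O1 and above, including
 * -Os and -O3), otherwise 0. */
int built_with_optimization(void)
{
#ifdef __OPTIMIZE__
    return 1;
#else
    return 0;
#endif
}
```

This only checks the file it is compiled in, so it does not prove that the BSP and libraries were built with the same flags.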
Walkthrough: ‘Hello World’ for VCK190
Note that files in the Explorer view of Vitis are shown using their relative paths in Unix-style notation. For example, hello_a72_system/hello_a72/src/helloworld.c refers to the source file created for a default hello world application.
The hammer icon in the Vitis GUI is located below the Xilinx menu item at the top.
The Save All button in the Vitis GUI is located below the Search menu item at the top.
Build with Default Optimization
Open the Vitis 2021.2 GUI, and perform the following actions:
Select a workspace.
Create a platform based on the VCK190, called vck190_a72, with default settings, and build it by clicking the Hammer icon below the Xilinx menu.
Create a hello_a72 example application from the Hello world template for the domain a72_0 for this platform
In the Explorer view, right click hello_a72_system/hello_a72 [standalone on versal_cips_0_pspmc_0_psv_cortexa72_0] and select C/C++ Build Settings. In the Properties dialog box:
Make the box wide
Ensure Settings is selected on the left
Click Manage Configurations…. In the dialog box that pops up:
Select Release
Click Set Active
The Status column for Release should read Active
Click OK
Ensure Tool Settings is selected below the Manage Configurations… button
Select Optimization under ARM v8 gcc compiler
Change Optimization Level to Optimize most (-O3)
Click Apply and Close
Open the platform:
In the Explorer view, double click on vck190_a72/platform.spr
Click on Board Support Package
Wait for the UI to initialize
Click Modify BSP Settings…. In the Board Support Package Settings dialog:
Make the dialog box as wide as possible
Select versal_cips_0_pspmc_0_psv_cortexa72_0 on the left
Select the extra_compiler_flags row
Make the Value column as wide as possible. It should read:
-g -Wall -Wextra -Dversal -DARMA72_EL3 -fno-tree-loop-distribute-patterns
Click the Value column for the extra_compiler_flags row. You can now edit it in place. Do the following:
Change the ‘-g’ to ‘-g0 -O3’.
Ensure there is a space before the rest of the line, beginning with -Wall
Click OK
A Generating BSP Sources progress box should automatically pop up; wait for it to finish
Select hello_a72_system [ vck190_a72] in the Explorer view
Click the hammer icon below the Xilinx menu item to build everything.
Watch the console
There should be no errors
Some warnings are normal.
The build is done when you see a line that starts with Generating BOOT.BIN in system project is not supported
In the Explorer view, double click hello_a72_system/hello_a72/Release/hello_a72.elf.size to see the various section sizes
Build with Maximal Optimization
Open the Vitis 2021.2 GUI, and perform the following actions:
Select a workspace. Note the workspace path.
Create a platform based on the VCK190, called vck190_min_a72, with default settings, and build it by clicking the hammer icon below the Xilinx menu.
Create a hello_min_a72 example application from the Hello World template for the domain a72_0 for this platform
In the Explorer view, right click on the src folder of the hello_min_a72 app, and select Import Sources…
In the Import Sources dialog box that pops up:
Click Browse… on the line that reads From directory:
Navigate to the directory shown below in your workspace, and click Open
vck190_min_a72/psv_cortexa72_0/standalone_domain/bsp/psv_cortexa72_0/libsrc/standalone_v7_6/src/arm/ARMv8/64bit/gcc
Check the box for boot.S from the list of files shown
Ensure that Into folder: shows hello_min_a72/src
Click Finish
Double click on the hello_min_a72 application’s boot.S in the Explorer view to open it
Comment out lines 74-76, 81-83, 87-89, 94-96, 245-258, 290, using ‘//’ C++ style comments at the beginning of each line
Double click on the hello_min_a72 application’s lscript.ld in the Explorer view to open it. Click the Source tab at the bottom of the lscript.ld tab view.
Comment out lines 16-18 by adding /* on the empty line 15 and */ on the empty line 19
Add a C comment start ‘/*’ to the beginning of line 311, where it reads ‘_el2_stack_end = .;’
Add a C comment end ‘*/’ to the end of line 322, after the semicolon. The line reads ‘__el0_stack = .;’
Add a C comment start ‘/*’ on the empty line 186
Add a C comment end ‘*/’ on empty line 204
In the Explorer view, right click the hello_min_a72 application and select C/C++ Build Settings. In the Properties dialog box:
Make the box wide
Ensure Settings is selected on the left
Click Manage Configurations…. In the dialog box that pops up:
Select Release
Click Set Active
The Status column for Release should read Active
Click OK
Ensure Tool Settings is selected below the Manage Configurations… button
Select Optimization under ARM v8 gcc compiler
Change Optimization Level to Optimize most (-O3)
In the Other optimization flags field, add: -ffunction-sections -fdata-sections -mstrict-align
Select Miscellaneous under ARM v8 gcc linker
Click '+' in Other options (-XLinker [option])
Add --gc-sections in the empty field of the Enter Value dialog box
Click OK
Click Apply and Close
Open the platform:
In the Explorer view, expand the platform for vck190_min_a72 and double click on platform.spr
Click on Board Support Package
Wait for the UI to initialize
Click Modify BSP Settings…. In the Board Support Package Settings dialog:
Make the dialog box as wide as possible
Select versal_cips_0_pspmc_0_psv_cortexa72_0 on the left
Select the extra_compiler_flags row
Make the Value column as wide as possible. It should read:
-g -Wall -Wextra -Dversal -DARMA72_EL3 -fno-tree-loop-distribute-patterns
Click the Value column for the extra_compiler_flags row. You can now edit it in place. Do the following:
Change the leading -g to -g0 -O3 -ffunction-sections -fdata-sections -mstrict-align.
Ensure there is a space before the rest of the line, beginning with -Wall
Click OK
A Generating BSP Sources progress box should automatically pop up; wait for it to finish
Click the Save All button at the top of the GUI, just below the Search menu
Select hello_min_a72_system [ vck190_min_a72] in the Explorer view
Click the hammer icon below the Xilinx menu item to build everything.
Watch the console
There should be no errors
Some warnings are normal
The build is done when you see a line that starts with Generating BOOT.BIN in system project is not supported
In the Explorer view, double click hello_min_a72_system/hello_min_a72/Release/hello_min_a72.elf.size to see the various section sizes
Results
KiB estimates are rounded up to the next integer value.
Note that the Total column includes the BSS. The BSS is memory that is zeroed at run time, but is not loaded from the boot device. The load size is therefore given by the Text+Data column, and the run-time memory use by the Total column. With the changes above, the load size of hello_world is about 4% of its original size, and the run-time memory use is about 13% of its original size. Fine tuning the heap and stack size, if possible, can reduce the BSS size.
Configuration | Text (bytes) | Data (bytes) | BSS (bytes) | Text+Data (bytes ~= KiB) | Total (bytes ~= KiB) |
---|---|---|---|---|---|
Default hello world for A72#0, Release, -O3 for BSP and application | 157364 | 2048 | 20676 | 159412 ~= 156 KiB | 180088 ~= 176 KiB |
Above changes | 4560 | 2040 | 16588 | 6600 ~= 7 KiB | 23188 ~= 23 KiB |
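The derived columns and percentages can be reproduced from the three section sizes reported in the .elf.size file. The sketch below shows the arithmetic; the struct and function names are hypothetical helpers, not part of any Xilinx tool.

```c
#include <stdint.h>

/* Section sizes in bytes, as reported for an ELF file. */
struct section_sizes {
    uint32_t text;
    uint32_t data;
    uint32_t bss;
};

/* Bytes loaded from the boot device: Text + Data. */
uint32_t load_size(const struct section_sizes *s)
{
    return s->text + s->data;
}

/* Run-time footprint: Text + Data + BSS. */
uint32_t runtime_size(const struct section_sizes *s)
{
    return s->text + s->data + s->bss;
}

/* Convert bytes to KiB, rounding up to the next integer. */
uint32_t bytes_to_kib_ceil(uint32_t bytes)
{
    return (uint32_t)(((uint64_t)bytes + 1023u) / 1024u);
}

/* part as a percentage of whole, rounded to the nearest integer.
 * The caller must pass a non-zero whole. */
uint32_t percent_of(uint32_t part, uint32_t whole)
{
    return (uint32_t)(((uint64_t)part * 100u + whole / 2u) / whole);
}
```

Applied to the two rows of the table, these helpers give 156 KiB and 176 KiB for the default build, 7 KiB and 23 KiB for the minimized build, and ratios of 4% (load size) and 13% (run-time size).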
© Copyright 2019 - 2022 Xilinx Inc.