AI Engine SSW Driver

Table of Contents

AI Engine Driver Introduction

  • Quick introduction to AI Engine SSW Driver

What exactly is the AI Engine System Software Driver, and where does it fit in the overall AIE programming pyramid?

XRT APIs are designed to facilitate the development of applications that run on AI Engines. They provide a high-level programming interface and tools for building, optimizing, and deploying algorithms and designs on the AI Engine. The AI Engine System Software Driver is a software layer that sits between the application code (using XRT APIs) and the underlying hardware. It configures, initializes, and manages the AI Engines, handles communication between the APU and the AI Engines, and provides the software interfaces the application needs to interact with the AI Engines efficiently.

 

 

image-20240511-151107.png

 

image-20240511-151303.png

 

In summary, while XRT provides programming interfaces and runtime environments for AIE applications, the AI Engine System Software Driver manages and controls the AI Engines at a lower level, ensuring they work effectively within the system. The driver is a critical component for enabling the overall acceleration.

AI Engine - Error Injection

Error injection is available for both program and data memory.

 

Program Memory (PM) Error Injection in AMD’s Versal AI Engine

The AI Engine on Versal devices features an error injection capability for testing the robustness of your applications. This section focuses on the Program Memory (PM) error injection feature.

Program Memory ECC Protection

All Program Memory (PM) in the core tiles is protected by Error Correction Code (ECC). This means that the system can detect and correct single-bit errors in the PM data.
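To make the correct-one, detect-two behavior concrete, here is a minimal sketch of a toy (8,4) SECDED code (Hamming(7,4) plus one overall parity bit). It only illustrates the principle: the ECC scheme actually used by the AI Engine hardware is not described on this page, and all names in the block (secded_encode, secded_decode, secded_demo) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Toy (8,4) SECDED code: Hamming(7,4) plus one overall parity bit.
 * Illustration only; this is NOT the AI Engine's hardware ECC scheme. */

enum ecc_result { ECC_OK, ECC_CORRECTED, ECC_UNCORRECTABLE };

/* Return 1 if an odd number of bits is set in v. */
unsigned secded_parity8(uint8_t v)
{
    unsigned p = 0;
    while (v) { p ^= v & 1u; v >>= 1; }
    return p;
}

/* Encode 4 data bits. Bit i of the codeword is Hamming position i
 * (i = 1..7); bit 0 holds the overall parity. */
uint8_t secded_encode(uint8_t data)
{
    uint8_t d0 = data & 1u, d1 = (data >> 1) & 1u;
    uint8_t d2 = (data >> 2) & 1u, d3 = (data >> 3) & 1u;
    uint8_t p1 = d0 ^ d1 ^ d3;   /* covers positions 1, 3, 5, 7 */
    uint8_t p2 = d0 ^ d2 ^ d3;   /* covers positions 2, 3, 6, 7 */
    uint8_t p4 = d1 ^ d2 ^ d3;   /* covers positions 4, 5, 6, 7 */
    uint8_t cw = (uint8_t)(p1 << 1 | p2 << 2 | d0 << 3 | p4 << 4 |
                           d1 << 5 | d2 << 6 | d3 << 7);
    return (uint8_t)(cw | secded_parity8(cw));  /* make total parity even */
}

/* Decode a codeword, correcting one flipped bit or flagging two. */
enum ecc_result secded_decode(uint8_t cw, uint8_t *data)
{
    unsigned syndrome = 0;
    for (unsigned pos = 1; pos < 8; pos++)
        if (cw & (1u << pos))
            syndrome ^= pos;     /* XOR of set positions; 0 when clean */

    enum ecc_result res = ECC_OK;
    if (secded_parity8(cw)) {
        /* Odd number of flips: treat as one and repair it.
         * A syndrome of 0 means the overall parity bit itself flipped. */
        cw ^= (uint8_t)(1u << syndrome);
        res = ECC_CORRECTED;
    } else if (syndrome != 0) {
        /* Even number of flips with a nonzero syndrome: double error. */
        return ECC_UNCORRECTABLE;
    }
    *data = (uint8_t)(((cw >> 3) & 1u) | ((cw >> 5) & 1u) << 1 |
                      ((cw >> 6) & 1u) << 2 | ((cw >> 7) & 1u) << 3);
    return res;
}

/* Small self-check mirroring the single-bit and double-bit cases. */
void secded_demo(void)
{
    uint8_t out = 0;
    uint8_t cw = secded_encode(0xB);
    assert(secded_decode(cw, &out) == ECC_OK && out == 0xB);
    assert(secded_decode(cw ^ 0x20, &out) == ECC_CORRECTED && out == 0xB);
    assert(secded_decode(cw ^ 0x48, &out) == ECC_UNCORRECTABLE);
}
```

Calling secded_demo() from a test harness exercises the three cases: a clean read, a single flipped bit that is corrected, and two flipped bits that are reported as uncorrectable.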

Error Injection Mechanism

The PM is mapped twice in the address space. The second mapping is specifically designed for error injection. ECC checking is disabled in this second range.

This allows errors to be introduced in a controlled way to test the ECC functionality. For details, see the Register Reference Manual: https://docs.amd.com/r/en-US/am015-versal-aie-register-reference/Checkbit_Error_Generation-AIE_MEMORY_MODULE-Register

Steps for Error Injection:

1. Write data to Program Memory: Write the desired data to a specific offset in the normal PM address range.

2. Flip bits in the data: Modify the previously written data by flipping one or more bits.

3. Write modified data to Error Injection range: Write this modified data to the same offset in the error injection address range. This effectively introduces errors into the PM.

4. Read from Program Memory: Read the data from the original offset in the normal PM address range.

5. Observe ECC behavior:

  • If only one bit was flipped, the ECC mechanism should correct the error and the read value will match the original data.

  • If two or more bits were flipped, the ECC mechanism will not be able to correct the error and an error will be generated.

6. Confirm error status: Read the Event Status Register to confirm the error status. The XAIE_EVENT_PM_ECC_ERROR_1BIT_CORE bit in this register indicates whether a single-bit error was detected and corrected.

Example Code:

The following C code demonstrates injecting a single-bit error into the PM and verifying the ECC correction.

Refer to the AIE SSW Driver for details on the API functions used in this example.

This example is a unit test case for error injection verification within AIE SSW Driver.

 

/************************** Include Files ************************************/
#include <stdint.h>
#include <stdio.h>
#include <xaiengine.h>

/************************** Constant Definitions *****************************/
#define PM_ADDR     0x20000  /* Tile-relative offset of the normal PM range */
#define PM_ERR_ADDR 0x24000  /* Tile-relative offset of the PM error injection range */

/************************** Function Definitions *****************************/
int test_aie_pm_err_injection(XAie_DevInst *DevInst)
{
    XAie_LocType Loc = XAie_TileLoc(0, XAIE_AIE_TILE_ROW_START);
    uint64_t TileAddr;
    uint32_t TestData;
    uint32_t Data = 0xDEADBEEF;
    uint8_t Status = 0;
    AieRC RC = XAIE_OK;

    /* _XAie_GetTileAddr is a driver-internal helper (note the leading underscore). */
    TileAddr = _XAie_GetTileAddr(DevInst, Loc.Row, Loc.Col);

    /* Step 1: write known data through the normal PM range and read it back. */
    RC = XAie_Write32(DevInst, TileAddr + PM_ADDR, Data);
    if (RC != XAIE_OK) {
        printf("Failed to write to program memory\n");
        return -1;
    }

    RC = XAie_Read32(DevInst, TileAddr + PM_ADDR, &TestData);
    if (RC != XAIE_OK || TestData != Data) {
        printf("Failed to read from program memory\n");
        return -1;
    }

    /* Steps 2-3: flip the last bit and write it through the error injection range. */
    TestData ^= 0x1;
    RC = XAie_Write32(DevInst, TileAddr + PM_ERR_ADDR, TestData);
    if (RC != XAIE_OK) {
        printf("Failed to write to program memory error injection range\n");
        return -1;
    }

    /* Steps 4-5: read through the normal range; ECC should return the original data. */
    RC = XAie_Read32(DevInst, TileAddr + PM_ADDR, &TestData);
    if (RC != XAIE_OK) {
        printf("Failed to read from program memory\n");
        return -1;
    } else if (Data != TestData) {
        printf("Failed to correct 1 bit error\n");
        return -1;
    }

    /* Step 6: confirm that the single-bit ECC event was raised. */
    RC = XAie_EventReadStatus(DevInst, Loc, XAIE_CORE_MOD,
                              XAIE_EVENT_PM_ECC_ERROR_1BIT_CORE, &Status);
    if (RC != XAIE_OK) {
        printf("Failed to read event status\n");
        return -1;
    } else if (!Status) {
        printf("ECC event not generated\n");
        return -1;
    }

    printf("AIE PM error injection test success\n");
    return 0;
}

 

Address Mapping:

In the example above, PM_ADDR (0x20000) is the tile-relative offset of the normal PM range and PM_ERR_ADDR (0x24000) is the tile-relative offset of the PM error injection range. Note that these addresses are specific to the device and configuration used in the example.
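The relationship between the two ranges can be sketched as a small helper that maps an address in the normal PM range to the same offset in the error injection range. This is a sketch under stated assumptions: the two offsets are taken from the example, the window size is assumed to be the distance between them, and the names pm_err_alias and PM_WINDOW are hypothetical and not part of the driver.

```c
#include <assert.h>
#include <stdint.h>

#define PM_ADDR     0x20000u  /* normal PM range, from the example */
#define PM_ERR_ADDR 0x24000u  /* error injection range, from the example */

/* Assumption: the PM window spans the distance between the two ranges. */
#define PM_WINDOW   (PM_ERR_ADDR - PM_ADDR)

/* Map a tile-relative address in the normal PM range to the same offset
 * in the error injection range.
 * Returns 0 on success, -1 if pm_addr is outside the normal PM range. */
int pm_err_alias(uint32_t pm_addr, uint32_t *err_addr)
{
    if (pm_addr < PM_ADDR || pm_addr >= PM_ADDR + PM_WINDOW)
        return -1;
    *err_addr = pm_addr - PM_ADDR + PM_ERR_ADDR;
    return 0;
}
```

For instance, a word at offset 0x20010 in the normal range would be targeted at 0x24010 in the error injection range, while an address that already lies in the injection range is rejected.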

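Important Note: Use caution when performing error injection. It deliberately corrupts the contents of program memory, so the affected locations should not hold code that is still needed.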

 

Data Memory (DM) Error Injection in AMD’s Versal AI Engine

The AI Engine on Versal devices provides an error injection feature for testing Data Memory (DM) integrity and error handling capabilities.

DM Error Protection

Unlike Program Memory, Data Memory utilizes a combination of ECC and parity protection:

ECC: Only two banks of the DM are protected by ECC.

Parity: The remaining DM banks are protected by parity bits.

This hybrid approach balances error detection capabilities with resource utilization.

Error Injection Mechanism

Error injection in DM is controlled through the "Checkbit Error Generation" register (at address 0x00012000). This register lets you selectively disable check bit generation (parity or ECC) on a per-lane basis, where each lane corresponds to 32 bits.

 

 

Steps for Error Injection:

1. Write data to Data Memory: Write the desired data to the target DM bank and offset.

2. Disable check bit generation: Use the "Checkbit Error Generation" register to disable check bit updates for the specific 32-bit lane(s) where you want to inject errors.

3. Inject errors: Modify the data in the selected lane(s) by flipping one or more bits and writing the modified data back to the same memory location. Because check bit generation is disabled, the stored check bits still correspond to the original data.

4. Read from Data Memory: Read the data from the original offset.

5. Observe error handling:

  • ECC banks: A single-bit error will be corrected by the ECC mechanism, and the read value will match the original data. A two-bit error will trigger an ECC error.

  • Parity banks: Any error will trigger a parity error. Parity detects the error but does not correct it.

6. Confirm error status: Read the Event Status Register to confirm the type and location of the error. Specific event flags are dedicated to:

  • DM ECC single-bit errors

  • DM ECC double-bit errors

  • DM Parity errors for each bank (Bank 2 through Bank 7)

Event Flag                Description
DM ECC Error 1bit         Single-bit ECC error detected and corrected
DM ECC Error 2bit         Double-bit ECC error detected
DM Parity Error Bank 2    Parity error in DM Bank 2
DM Parity Error Bank 3    Parity error in DM Bank 3
...                       ...
DM Parity Error Bank 7    Parity error in DM Bank 7
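The DM injection steps above can be sketched with a toy software model of a parity-protected bank. The model is purely illustrative: it assumes one parity bit per 32-bit lane and four lanes, neither of which is confirmed by this page, and every name in it (dm_word_model, dm_model_write, dm_model_read, dm_model_demo) is hypothetical. It does not touch hardware or use the driver API.

```c
#include <assert.h>
#include <stdint.h>

#define DM_LANES 4  /* hypothetical: four 32-bit lanes */

/* Toy model of one parity-protected DM location (not real hardware). */
struct dm_word_model {
    uint32_t data[DM_LANES];          /* stored data per lane */
    uint8_t  parity[DM_LANES];        /* stored parity bit per lane (assumed) */
    uint8_t  gen_disabled[DM_LANES];  /* 1 = check bit not updated on write */
};

/* Return 1 if an odd number of bits is set in v. */
unsigned dm_parity32(uint32_t v)
{
    v ^= v >> 16; v ^= v >> 8; v ^= v >> 4; v ^= v >> 2; v ^= v >> 1;
    return v & 1u;
}

/* Write a lane; the parity bit is refreshed only if generation is enabled. */
void dm_model_write(struct dm_word_model *m, unsigned lane, uint32_t value)
{
    m->data[lane] = value;
    if (!m->gen_disabled[lane])
        m->parity[lane] = (uint8_t)dm_parity32(value);
}

/* Read a lane. Returns 1 if the stored parity does not match the data. */
int dm_model_read(const struct dm_word_model *m, unsigned lane, uint32_t *value)
{
    *value = m->data[lane];
    return dm_parity32(m->data[lane]) != m->parity[lane];
}

/* Walk through the injection sequence on lane 0. */
void dm_model_demo(void)
{
    struct dm_word_model m = {0};
    uint32_t v = 0;

    dm_model_write(&m, 0, 0xDEADBEEFu);        /* step 1: write data */
    assert(dm_model_read(&m, 0, &v) == 0);     /* clean read, no error */

    m.gen_disabled[0] = 1;                     /* step 2: disable check bits */
    dm_model_write(&m, 0, 0xDEADBEEFu ^ 1u);   /* step 3: flip one bit */

    assert(dm_model_read(&m, 0, &v) == 1);     /* steps 4-5: parity error */
    assert(v == (0xDEADBEEFu ^ 1u));           /* parity cannot correct */
}
```

In this model the stale parity bit is what makes the error visible on the next read. Note that a single parity bit per lane, as assumed here, only catches an odd number of flipped bits within that lane.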

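Important Note: Exercise caution when injecting errors into Data Memory. The injected errors corrupt the stored data, and check bit generation stays disabled for the selected lanes until the "Checkbit Error Generation" register is restored.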

 

Alternatively, users can use XSDB to read and write these registers for error injection.

Monitoring AI Engine Status with XSDB

Users can also examine the AI Engine's status on both Linux and bare-metal operating systems using the xsdb utility. This tool is particularly helpful for debugging applications and diagnosing issues like deadlocks or system hangs, even without XRT. Unlike the xbutil command, which relies on XRT, xsdb operates independently. It provides valuable information about the AI Engine's state through the aiestatus examine command, outputting the data in a convenient JSON file format. This command can be executed before, during, or after running an application.

Further details about the aiestatus examine command, including its options and usage, can be found in the user guide.

Command

aiestatus examine

Supported Options

aiestatus examine [-graphs] <graph-list> [-work-dir] <dir-path> [-aie-version] <version> [-file] <file-name> [-run-summary] [-target-name] <target-name> [-tiles] <tile-list>

Using XSDB For Printing and Modifying Variables

XSDB command

Details

print

Prints the expression expr. expr can be a single variable or multiple variables combined with operators in a way that's syntactically valid.

If -add is specified, expr is added to the auto expression list.  Expressions in the auto expression list are printed every time print is called.

If -defs is specified, the definition (type, size, address, RW flags) of expr is returned.

If -dict is specified, the result of the expression is returned in Tcl dict format, with variable names as dict keys and values as dict values.

If -remove is specified, an expression that was previously added to the auto expression list via -add is removed.

If -set is specified, var is set to the value of expr.

mrd

Prints 1 word from address addr.

If num is specified, num values are printed.

If -force is specified, access protection is overridden, allowing access to reserved and invalid address ranges.

If -size is specified, the amount of data read is determined by access-size, where access-size is one of the following:

  • b - Read a byte

  • h - Read a half word

  • w - Read a word (default)

  • d - Read a double word

If -value is specified, a Tcl list of values is returned.

If -bin is specified, the data is written in binary format to the file name on the host machine.

If -address-space is specified, the address space name is accessed instead of the default address space.  For ARM DAP targets, the address spaces are as follows:

  • DPR - DP registers

  • APR - AP registers

  • AP<n> - MEM-AP<n> registers

If -unaligned-access is specified, memory access is not aligned to access size.

mwr

Writes the list of values in values to address addr sequentially.

If num is specified, num values are written.

If -force is specified, access protection is overridden, allowing access to reserved and invalid address ranges.

If -size is specified, the amount of data written is determined by access-size, where access-size is one of the following:

  • b - Write a byte

  • h - Write a half word

  • w - Write a word (default)

  • d - Write a double word

If -bin is specified, the data is read from file name and written to addr in binary format.

If -address-space is specified, the address space name is accessed instead of the default address space.  For ARM DAP targets, the address spaces are as follows:

  • DPR - DP registers

  • APR - AP registers

  • AP<n> - MEM-AP<n> registers

If -unaligned-access is specified, memory access is not aligned to access size.

Example

root@xilinx-versal-system-controller-2024:~#

xsdb% mwr -force 0x80082004 0x05F5E100
xsdb% mrd -force 0x80082004
80082004: 05F5E100

xsdb% mrd -force 0x80082000
80082000: 00000000

xsdb% mwr -force 0x80082000 0x00000066
xsdb% mrd -force 0x80082000
80082000: 00000066

xsdb% mwr -force 0x80082000 0x000000C6
xsdb% mrd -force 0x80082000
80082000: 000001C6

 

Developers can refer to the AIE register details for PM, DM, and error injection in the register reference (https://docs.amd.com/r/en-US/am015-versal-aie-register-reference/Checkbit_Error_Generation-AIE_MEMORY_MODULE-Register) and use xsdb to carry out the same error injection sequence that is orchestrated above using the AIE driver APIs.

Mainline status

  • The driver is available in Mainline

Related Links

https://github.com/Xilinx/linux-xlnx/tree/master/drivers/misc/xilinx-ai-engine

 

Published by - Alok Gupta and Gregory Williams.

© Copyright 2019 - 2022 Xilinx Inc.