AI Engine SSW Driver
Table of Contents
- 1 AI Engine Driver Introduction
- 2 AI Engine - Error Injection
- 2.1 Program Memory (PM) Error Injection in AMD’s Versal AI Engine
- 2.1.1 Program Memory ECC Protection
- 2.1.2 Error Injection Mechanism
- 2.1.3 Steps for Error Injection:
- 2.1.4 Example Code:
- 2.1.5 Address Mapping:
- 2.2 Data Memory (DM) Error Injection in AMD’s Versal AI Engine
- 2.3 Alternatively, users can use XSDB to read and write these registers for error injection.
- 2.4 Monitoring AI Engine Status with XSDB
- 2.5 Command
- 2.6 Supported Options
- 2.7 Using XSDB For Printing and Modifying Variables
- 2.8 Example
- 3 Mainline status
- 4 Related Links
AI Engine Driver Introduction
Quick introduction to the AI Engine SSW Driver
What exactly is the AI Engine System Software Driver, and where does it fit in the overall AIE programming pyramid?
XRT APIs facilitate the development of AIE applications that run on the AI Engines. They provide a high-level programming interface and tools for building, optimizing, and deploying algorithms and designs onto the AI Engine. The AI Engine System Software Driver is a software layer that sits between the application code (using XRT APIs) and the underlying hardware. It configures, initializes, and manages the AI Engines, handles communication between the APU and the AI Engines, and provides the software interfaces the application needs to interact with the AI Engines efficiently.
In summary, while XRT provides programming interfaces and a runtime environment for AIE applications, the AI Engine System Software Driver manages and controls the AI Engines at a lower level, ensuring they work effectively within the system. The driver is a critical component for enabling the overall acceleration.
AI Engine - Error Injection
Error injection is available for both program and data memory.
Program Memory (PM) Error Injection in AMD’s Versal AI Engine
The AI Engine on Versal devices features an error injection capability for testing the robustness of your applications. This section focuses on the Program Memory (PM) error injection feature.
Program Memory ECC Protection
All Program Memory (PM) in the core tiles is protected by Error Correction Code (ECC). This means that the system can detect and correct single-bit errors in the PM data.
Error Injection Mechanism
The PM is mapped twice in the address space. The second mapping is specifically designed for error injection. ECC checking is disabled in this second range.
This allows errors to be introduced in a controlled way to test the ECC functionality. For details, see the Register Reference Manual on the AMD Technical Information Portal.
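As a minimal sketch of the double mapping, the helpers below compute the normal and error injection addresses for the same PM offset. The offsets 0x20000 and 0x24000 are taken from the example on this page; the helper names and the window size (inferred from the spacing between the two offsets) are assumptions for illustration, not part of the driver API.

```c
#include <assert.h>
#include <stdint.h>

#define PM_ADDR        0x20000u /* normal PM range (value from the example on this page) */
#define PM_ERR_ADDR    0x24000u /* PM error injection range (value from the example) */
/* ASSUMPTION: the window size is inferred from the gap between the two ranges. */
#define PM_WINDOW_SIZE (PM_ERR_ADDR - PM_ADDR)

/* Hypothetical helper: address of a PM offset in the normal (ECC-checked) range. */
static uint64_t pm_normal_addr(uint64_t tile_base, uint32_t pm_offset)
{
    assert(pm_offset < PM_WINDOW_SIZE);
    return tile_base + PM_ADDR + pm_offset;
}

/* Hypothetical helper: address of the same PM offset in the error injection range. */
static uint64_t pm_err_inject_addr(uint64_t tile_base, uint32_t pm_offset)
{
    assert(pm_offset < PM_WINDOW_SIZE);
    return tile_base + PM_ERR_ADDR + pm_offset;
}
```

Both helpers take the same offset, which reflects the requirement that the corrupted word is written to the same offset in the second mapping as the original word in the first.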
Steps for Error Injection:
1. Write data to Program Memory: Write the desired data to a specific offset in the normal PM address range.
2. Flip bits in the data: Modify the previously written data by flipping one or more bits.
3. Write modified data to Error Injection range: Write this modified data to the same offset in the error injection address range. This effectively introduces errors into the PM.
4. Read from Program Memory: Read the data from the original offset in the normal PM address range.
5. Observe ECC behavior:
   - If only one bit was flipped, the ECC mechanism should correct the error and the read value will match the original data.
   - If two or more bits were flipped, the ECC mechanism will not be able to correct the error and an error will be generated.
6. Confirm error status: Read the Event Status Register to confirm the error status. The XAIE_EVENT_PM_ECC_ERROR_1BIT_CORE bit in this register indicates whether a single-bit error was detected and corrected.
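To make the expected outcomes above concrete, here is a toy software model of a single-error-correcting, double-error-detecting (SECDED) code. It is an extended Hamming(8,4) code protecting a 4-bit value, chosen only because it is small; it is an assumption for illustration and not the code width or scheme used by the AI Engine hardware. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum ecc_result { ECC_OK, ECC_CORRECTED, ECC_UNCORRECTABLE };

/* Codeword layout: bit 0 = overall parity, bits 1, 2, 4 = Hamming parity,
 * bits 3, 5, 6, 7 = data. */
static const unsigned data_pos[4] = {3, 5, 6, 7};

/* XOR of the positions (1..7) of all set codeword bits. */
static unsigned syndrome(uint8_t cw)
{
    unsigned s = 0;
    for (unsigned pos = 1; pos < 8; pos++)
        if (cw & (1u << pos))
            s ^= pos;
    return s;
}

/* Returns 1 if the codeword has an odd number of set bits, else 0. */
static unsigned odd_parity(uint8_t cw)
{
    unsigned p = 0;
    for (unsigned i = 0; i < 8; i++)
        p ^= (cw >> i) & 1u;
    return p;
}

static uint8_t secded_encode(uint8_t nibble)
{
    uint8_t cw = 0;
    assert(nibble < 16);
    for (unsigned i = 0; i < 4; i++)
        if (nibble & (1u << i))
            cw |= (uint8_t)(1u << data_pos[i]);
    /* Set the Hamming parity bits so that the syndrome becomes zero. */
    unsigned s = syndrome(cw);
    for (unsigned k = 0; k < 3; k++)
        if (s & (1u << k))
            cw |= (uint8_t)(1u << (1u << k));
    /* Bit 0 makes the total number of set bits even. */
    cw |= (uint8_t)odd_parity(cw);
    return cw;
}

static enum ecc_result secded_decode(uint8_t cw, uint8_t *nibble)
{
    unsigned s = syndrome(cw);
    enum ecc_result res = ECC_OK;

    if (odd_parity(cw)) {
        /* Odd overall parity: single-bit error at position s (0 = parity bit). */
        cw ^= (uint8_t)(1u << s);
        res = ECC_CORRECTED;
    } else if (s != 0) {
        /* Even overall parity but non-zero syndrome: double-bit error. */
        return ECC_UNCORRECTABLE;
    }
    *nibble = 0;
    for (unsigned i = 0; i < 4; i++)
        if (cw & (1u << data_pos[i]))
            *nibble |= (uint8_t)(1u << i);
    return res;
}
```

Flipping one bit of an encoded word and decoding it returns ECC_CORRECTED with the original value, which mirrors the single-bit case above. Flipping two bits returns ECC_UNCORRECTABLE, which mirrors the case where an error is generated instead.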
Example Code:
The following C code demonstrates the process of injecting a single-bit error into the PM and verifying the ECC correction.
Refer to the AIE SSW Driver for details on the API functions used in this example.
This example is a unit test case for error injection verification within the AIE SSW Driver.
/***************************** Include Files *********************************/
#include <stdint.h>
#include <stdio.h>
#include <xaiengine.h>

/************************** Constant Definitions *****************************/
#define PM_ADDR     0x20000 /* Tile-relative offset of the normal PM range */
#define PM_ERR_ADDR 0x24000 /* Tile-relative offset of the PM error injection range */

/************************** Function Definitions *****************************/
int test_aie_pm_err_injection(XAie_DevInst *DevInst)
{
    /* XAIE_AIE_TILE_ROW_START is assumed to come from the test's hardware configuration. */
    XAie_LocType Loc = XAie_TileLoc(0, XAIE_AIE_TILE_ROW_START);
    uint64_t TileAddr;
    uint32_t TestData;
    uint32_t Data = 0xDEADBEEF;
    uint8_t Status = 0;
    AieRC RC = XAIE_OK;

    /* Driver-internal helper (leading underscore): base address of the tile. */
    TileAddr = _XAie_GetTileAddr(DevInst, Loc.Row, Loc.Col);

    /* Write the reference data to the normal PM range and read it back. */
    RC = XAie_Write32(DevInst, TileAddr + PM_ADDR, Data);
    if (RC != XAIE_OK) {
        printf("Failed to write to program memory\n");
        return -1;
    }

    RC = XAie_Read32(DevInst, TileAddr + PM_ADDR, &TestData);
    if (RC != XAIE_OK || TestData != Data) {
        printf("Failed to read back from program memory\n");
        return -1;
    }

    /* Flip the least significant bit and write it through the error injection range. */
    TestData ^= 0x1U;
    RC = XAie_Write32(DevInst, TileAddr + PM_ERR_ADDR, TestData);
    if (RC != XAIE_OK) {
        printf("Failed to write to program memory error injection range\n");
        return -1;
    }

    /* Read through the normal range: ECC should return the original data. */
    RC = XAie_Read32(DevInst, TileAddr + PM_ADDR, &TestData);
    if (RC != XAIE_OK) {
        printf("Failed to read from program memory\n");
        return -1;
    } else if (Data != TestData) {
        printf("Failed to correct 1 bit error\n");
        return -1;
    }

    /* Confirm that the single-bit ECC event was raised in the core module. */
    RC = XAie_EventReadStatus(DevInst, Loc, XAIE_CORE_MOD,
                              XAIE_EVENT_PM_ECC_ERROR_1BIT_CORE, &Status);
    if (RC != XAIE_OK) {
        printf("Failed to read event status\n");
        return -1;
    } else if (!Status) {
        printf("ECC event not generated\n");
        return -1;
    }

    printf("AIE PM error injection test success\n");
    return 0;
}
Address Mapping:
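The example above uses the following offsets, relative to the tile base address, for the normal PM range and the PM error injection range. Note that these addresses are specific to the device and configuration used in the example.
Range | Offset |
Normal PM (PM_ADDR) | 0x20000 |
PM error injection (PM_ERR_ADDR) | 0x24000 |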
Important Note: Use caution when performing error injection.
Data Memory (DM) Error Injection in AMD’s Versal AI Engine
The AI Engine on Versal devices provides an error injection feature for testing Data Memory (DM) integrity and error handling capabilities.
For details, see the Register Reference Manual on the AMD Technical Information Portal.
DM Error Protection
Unlike Program Memory, Data Memory uses a combination of ECC and parity protection:
- ECC: Only two banks of the DM are protected by ECC.
- Parity: The remaining DM banks are protected by parity bits.
This hybrid approach balances error detection capability against resource utilization.
Error Injection Mechanism
Error injection is controlled through the "Checkbit Error Generation" register (at address 0x00012000). This register allows you to selectively disable check bit generation (parity or ECC) on a per-lane basis (each lane corresponds to 32 bits).
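As a rough sketch of per-lane selection, the helper below derives a lane mask from a DM byte offset. Only the register address and the 32-bit lane width come from the text above. The number of lanes and the mapping of register bit N to lane N are assumptions made for illustration and must be checked against the Register Reference Manual before use; the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DM_CHECKBIT_ERR_GEN_ADDR 0x00012000u /* register address from the text */
#define DM_LANE_BYTES            4u          /* each lane is 32 bits (from the text) */
#define DM_NUM_LANES             4u          /* ASSUMPTION: four lanes per access */

/*
 * Hypothetical helper: mask selecting the lane that holds the 32-bit word at
 * dm_byte_offset. ASSUMPTION: bit N of the register corresponds to lane N.
 */
static uint32_t dm_checkbit_lane_mask(uint32_t dm_byte_offset)
{
    assert(dm_byte_offset % DM_LANE_BYTES == 0); /* expect word-aligned offsets */
    uint32_t lane = (dm_byte_offset / DM_LANE_BYTES) % DM_NUM_LANES;
    return 1u << lane;
}
```

Under these assumptions, the returned mask would be the value written to the register to stop check bit updates for the lane containing the target word.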
Steps for Error Injection:
1. Write data to Data Memory: Write the desired data to the target DM bank and offset.
2. Disable check bit generation: Use the "Checkbit Error Generation" register to disable check bit updates for the specific 32-bit lane(s) where you want to inject errors.
3. Inject errors: Modify the data in the selected lane(s) by flipping one or more bits. This can be done by writing back to the same memory location with modified data.
4. Read from Data Memory: Read the data from the original offset.
5. Observe error handling:
   - ECC Banks:
     - A single-bit error will be corrected by the ECC mechanism, and the read value will match the original data.
     - A two-bit error will trigger an ECC error.
   - Parity Banks:
     - Any error will trigger a parity error. Parity can only detect errors, not correct them.
6. Confirm error status: Read the Event Status Register to confirm the type and location of the error. Specific event flags are dedicated to:
   - DM ECC single-bit errors
   - DM ECC double-bit errors
   - DM Parity errors for each bank (Bank 2 through Bank 7)
Event Flag | Description |
DM ECC Error 1bit | Single-bit ECC error detected and corrected |
DM ECC Error 2bit | Double-bit ECC error detected |
DM Parity Error Bank 2 | Parity error in DM Bank 2 |
DM Parity Error Bank 3 | Parity error in DM Bank 3 |
... | ... |
DM Parity Error Bank 7 | Parity error in DM Bank 7 |
Important Note: Exercise caution when injecting errors into Data Memory.
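The reason disabling check bit generation produces a detectable error can be shown with a small software model of one parity-protected lane. This is a sketch under stated assumptions: a single even-parity bit per 32-bit lane is assumed for simplicity, and the struct and function names are hypothetical. It does not describe the actual hardware parity granularity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of one 32-bit lane with one stored parity bit (ASSUMPTION). */
struct dm_lane {
    uint32_t data;
    uint8_t parity;
};

/* Returns 1 if v has an odd number of set bits, else 0. */
static uint8_t parity32(uint32_t v)
{
    v ^= v >> 16;
    v ^= v >> 8;
    v ^= v >> 4;
    v ^= v >> 2;
    v ^= v >> 1;
    return (uint8_t)(v & 1u);
}

/* Write the lane; the stored parity is refreshed only if generation is enabled. */
static void lane_write(struct dm_lane *lane, uint32_t value, bool checkbit_gen_enabled)
{
    lane->data = value;
    if (checkbit_gen_enabled)
        lane->parity = parity32(value);
}

/* Read the lane; returns true if the stored parity no longer matches the data. */
static bool lane_read(const struct dm_lane *lane, uint32_t *value)
{
    *value = lane->data;
    return parity32(lane->data) != lane->parity;
}
```

Writing a word with generation enabled and reading it back reports no error. Writing the same word with one bit flipped while generation is disabled leaves the old parity bit in place, so the next read reports a parity error and returns the modified data.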
Alternatively, users can use XSDB to read and write these registers for error injection.
Monitoring AI Engine Status with XSDB
Users can also examine the AI Engine's status on both Linux and bare-metal systems using the xsdb utility. This tool is particularly helpful for debugging applications and diagnosing issues such as deadlocks or system hangs, even without XRT. Unlike the xbutil command, which relies on XRT, xsdb operates independently. The aiestatus examine command reports the AI Engine's state and outputs the data in JSON format. This command can be executed before, during, or after running an application.
Further details about the aiestatus examine command, including its options and usage, can be found in the user guide.