Zynq UltraScale+ MPSoC VCU TRD 2021.1 - Xilinx Low Latency PL DDR XV20 HDMI Video Capture and Display
This page provides all the information related to Design Module 9: the VCU TRD Xilinx low-latency (LLP2) PL DDR XV20 HDMI design.
1 Overview
This module enables capture of video from an HDMI-Rx subsystem implemented in the PL. The video can be displayed through the HDMI-Tx subsystem implemented in the PL. The module can stream out and stream in live captured video frames through an Ethernet interface at ultra-low latencies using the Sync IP. It supports four video streams in the XV20 pixel format, using an AXI broadcaster on the capture side and a mixer on the display side. In this design, PL_DDR is used for decoding and PS_DDR for encoding, so that enough DDR bandwidth is available for high-bandwidth VCU applications that require simultaneous encoder and decoder operation and transcoding at 4kp60.
The VCU encoder and decoder operate in slice mode. An input frame is divided horizontally into multiple slices (8 or 16). The encoder generates a slice_done interrupt at the end of every slice. The generated NAL unit data can be passed to a downstream element immediately, without waiting for the frame_done interrupt. The VCU decoder also starts processing as soon as one slice of data is ready in its circular buffer, instead of waiting for the complete frame. The Sync IP tracks progress at the AXI transaction level, so that the producer and consumer are synchronized at the granularity of AXI transactions rather than whole video buffers. The Sync IP is responsible for synchronizing buffers between the capture DMA and the VCU encoder, as both work on the same buffer.
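As a rough illustration of why slice mode shortens the wait before downstream processing can start, the sketch below computes how long it takes for one slice's worth of video lines to arrive. It assumes equal-height slices and a constant line rate with blanking ignored; the function name and parameters are hypothetical and are not part of the TRD software.

```python
def slice_period_ms(fps: float, slices_per_frame: int) -> float:
    """Time for one slice's worth of lines to arrive, in milliseconds.

    Assumes equal-height slices and a constant line rate (blanking ignored).
    """
    if fps <= 0 or slices_per_frame <= 0:
        raise ValueError("fps and slices_per_frame must be positive")
    frame_period_ms = 1000.0 / fps
    return frame_period_ms / slices_per_frame


if __name__ == "__main__":
    # The two slice counts mentioned in the text, at 60 frames per second.
    for n in (8, 16):
        print(f"{n} slices at 60 fps: {slice_period_ms(60, n):.2f} ms per slice")
```

Under these assumptions, a slice becomes available roughly every 2 ms with 8 slices at 60 FPS, compared with about 16.7 ms for a full frame.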
The capture element (FB write DMA) writes video buffers in raster-scan order. The Sync IP monitors the buffer level while the capture element is writing into DRAM. It allows the encoder to read input buffer data if the requested data has already been written by the DMA; otherwise it blocks the encoder until the DMA completes its writes. On the decoder side, the VCU decoder writes decoded video buffer data into DRAM in block-raster scan order, and the display reads the data in raster-scan order. To avoid display under-run problems, software ensures a phase difference of "~frame_period/2", so that the decoder stays ahead of the display.
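The gating rule described above can be sketched as a toy software model. This is only an illustration under simplifying assumptions: the real Sync IP tracks AXI transactions in hardware, whereas this model counts whole video lines, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LineTracker:
    """Toy model of producer/consumer tracking on one shared frame buffer."""

    total_lines: int
    lines_written: int = 0  # capture progress, from the top of the frame down

    def capture_wrote(self, lines: int) -> None:
        # The capture DMA fills the buffer sequentially in raster-scan order.
        self.lines_written = min(self.total_lines, self.lines_written + lines)

    def encoder_may_read(self, first_line: int, num_lines: int) -> bool:
        # Allow the read only if every requested line is already in memory.
        # False stands for "hold the encoder until capture catches up".
        return first_line + num_lines <= self.lines_written


def display_phase_offset_ms(fps: float) -> float:
    """Approximate lead of the decoder over the display: half a frame period."""
    return (1000.0 / fps) / 2.0


if __name__ == "__main__":
    tracker = LineTracker(total_lines=2160)
    tracker.capture_wrote(270)                 # one of 8 slices of a 2160-line frame
    print(tracker.encoder_may_read(0, 270))    # first slice is fully written
    print(tracker.encoder_may_read(270, 270))  # second slice is not yet written
    print(f"{display_phase_offset_ms(60):.2f} ms")
```

At 60 FPS the half-frame phase difference works out to about 8.33 ms.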
This design supports the following video interfaces:
Sources:
HDMI-Rx capture pipeline implemented in the PL.
Stream-In from network or internet.
Sinks:
HDMI-Tx display pipeline implemented in the PL.
VCU Codec:
Video Encode/Decode capability using VCU hard block in PL.
AVC/HEVC encoding
Encoder/decoder parameter configuration.
Video format:
XV20
Supported Resolution:
The table below lists the resolutions supported in this design, which is driven from the command-line application only.
Resolution | Command Line: Single Stream | Command Line: Multi-stream |
---|---|---|
4kp60 | √ | NA |
4kp30 | √ | √ (Max 2) |
1080p60 | √ | √ (Max 4 for encoder) (Max 2 for decoder) |
√ - Supported
NA – Not applicable
x – Not supported
When using low-latency mode (LLP1/LLP2), the encoder and decoder are limited by the number of internal cores. The encoder supports a maximum of four streams and the decoder a maximum of two streams.
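The stream limits above can be captured in a small lookup, for example to sanity-check a planned use case before preparing a configuration. The sketch below is illustrative only: the dictionary and function names are hypothetical, and the limits are transcribed from the supported-resolution table and the low-latency note.

```python
# Maximum simultaneous streams per resolution in low-latency mode,
# transcribed from the supported-resolution table. Names are illustrative.
MAX_STREAMS = {
    "4kp60": {"encoder": 1, "decoder": 1},
    "4kp30": {"encoder": 2, "decoder": 2},
    "1080p60": {"encoder": 4, "decoder": 2},
}


def streams_supported(resolution: str, encoders: int, decoders: int = 0) -> bool:
    """Return True if the requested stream counts fit within the limits."""
    limits = MAX_STREAMS.get(resolution)
    if limits is None:
        return False  # resolution not listed in the table
    if encoders < 0 or decoders < 0 or encoders + decoders == 0:
        return False  # nothing requested, or nonsensical counts
    return encoders <= limits["encoder"] and decoders <= limits["decoder"]


if __name__ == "__main__":
    print(streams_supported("1080p60", encoders=4))              # stream-out only
    print(streams_supported("1080p60", encoders=4, decoders=4))  # too many decoders
    print(streams_supported("4kp60", encoders=2))                # 4kp60 is single stream
```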
The table below lists the features supported in this design.
Pipeline | Input source | Format | Output Type | Resolution | VCU codec |
---|---|---|---|---|---|
Serial pipeline | HDMI-Rx | XV20 | HDMI-Tx | 4kp60 / 4kp30 / 1080p60 | HEVC/AVC |
Stream-Out pipeline | HDMI-Rx | XV20 | Stream-Out | 4kp60 / 4kp30 / 1080p60 | HEVC/AVC |
Stream-in pipeline | Stream-In | XV20 | HDMI-Tx | 4kp60 / 4kp30 / 1080p60 | HEVC/AVC |
The below figure shows the Xilinx Low Latency PL DDR XV20 HDMI design hardware block diagram.
The below figure shows the Xilinx Low Latency PL DDR XV20 HDMI design software block diagram.
1.1 Board Setup
Refer to the link below for board setup.
1.2 Run Flow
The TRD package is released with the source code, Vivado project, PetaLinux BSP, and SD card image that enable the user to run the demonstration. It also includes the binaries necessary to configure and boot the ZCU106 board. Before running the steps described on this wiki page, download the TRD package and extract its contents to a directory referred to as TRD_HOME, which is the home directory.
Refer to Section 4.1: Download the TRD on the Zynq UltraScale+ MPSoC VCU TRD 2021.1 wiki page to download all TRD contents.
TRD package contents are placed in the following directory structure. The user needs to copy all the files from $TRD_HOME/images/vcu_llp2_hdmi_xv20/ to a FAT32-formatted SD card.
rdf0428-zcu106-vcu-trd-2021-1/
├── apu
│ └── vcu_petalinux_bsp
├── images
│ ├── vcu_10g
│ ├── vcu_audio
│ ├── vcu_llp2_hdmi_nv12
│ ├── vcu_llp2_hdmi_nv16
│ ├── vcu_llp2_hdmi_xv20
│ ├── vcu_llp2_sdi_xv20
│ ├── vcu_multistream_nv12
│ ├── vcu_pcie
│ ├── vcu_plddrv1_hdr10_hdmi
│ ├── vcu_plddrv2_hdr10_hdmi
│ └── vcu_sdi_xv20
├── pcie_host_package
│ ├── COPYING
│ ├── include
│ ├── LICENSE
│ ├── readme.txt
│ ├── RELEASE
│ ├── tests
│ ├── tools
│ └── xdma
├── pl
│ ├── constrs
│ ├── designs
│ ├── prebuild
│ ├── README.md
│ └── srcs
├── README.txt
└── zcu106_vcu_trd_sources_and_licenses.tar.gz
TRD package contents specific to VCU Xilinx Low Latency PL DDR XV20 HDMI design are placed in the following directory structure.
rdf0428-zcu106-vcu-trd-2021-1/
├── apu
│ └── vcu_petalinux_bsp
│ └── xilinx-vcu-zcu106-v2021.1-final.bsp
├── images
│ ├── vcu_llp2_hdmi_xv20
│ │ ├── autostart.sh
│ │ ├── BOOT.BIN
│ │ ├── boot.scr
│ │ ├── config
│ │ ├── Image
│ │ ├── rootfs.cpio.gz.u-boot
│ │ ├── system.dtb
│ │ └── vcu
├── pcie_host_package
├── pl
│ ├── constrs
│ ├── designs
│ │ ├── zcu106_llp2_xv20
│ ├── prebuild
│ │ ├── zcu106_llp2_xv20
│ ├── README.md
│ └── srcs
│ ├── hdl
│ └── ip
├── README.txt
└── zcu106_vcu_trd_sources_and_licenses.tar.gz
Configuration files (input.cfg) for various resolutions are placed in the following directory structure in /media/card.
config
├── 1-4kp60
│ ├── Display
│ └── Stream-out
├── 2-1080p60
│ ├── Display
│ └── Stream-out
├── 2-4kp30
│ ├── Display
│ └── Stream-out
├── 4-1080p60
│ ├── Display
│ ├── Stream-in
│ └── Stream-out
└── input.cfg
1.2.1 GStreamer Application (vcu_gst_app)
The vcu_gst_app is a multi-threaded command-line Linux application. It requires an input configuration file (input.cfg) provided in plain text.
Run below modetest command to set CRTC configurations for 4kp60:
Run below modetest command to set CRTC configurations for 4kp30:
Execution of the application is shown below:
Example:
Make sure HDMI-Rx is configured for 4kp60 mode while running the example pipelines below.
Low-latency (LLP1/LLP2) stream-in pipelines are not supported in vcu_gst_app.
4kp60 XV20 HEVC_25Mbps ultra-low-latency (LLP2) display pipeline execution.
4kp60 XV20 HEVC_25Mbps ultra-low-latency (LLP2) stream-out pipeline execution.
4kp60 XV20 HEVC ultra-low-latency (LLP2) stream-in pipeline execution.
For LLP1/LLP2 multi-stream HEVC serial and stream-out use-cases (2-4kp30, 2-1080p60, 4-1080p60), set the ENC_EXTRA_OP_BUFFERS=10 variable before the vcu_gst_app command. Below is the sample pipeline:
The above variable is recommended for LLP1/LLP2 multi-stream HEVC use-cases only.
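When the application is launched from a script rather than typed at the shell, the variable can be set in the child process environment. The sketch below only builds the argument list and environment. It assumes that vcu_gst_app takes the configuration file path as its single argument; the helper name and the example path are illustrative and not taken from the TRD.

```python
import os
from typing import Dict, List, Optional, Tuple


def build_launch(
    cfg_path: str,
    multistream_hevc_low_latency: bool,
    base_env: Optional[Dict[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Return (argv, env) for starting vcu_gst_app with a given input.cfg."""
    env = dict(os.environ if base_env is None else base_env)
    if multistream_hevc_low_latency:
        # Recommended for LLP1/LLP2 multi-stream HEVC use-cases only.
        env["ENC_EXTRA_OP_BUFFERS"] = "10"
    # Assumption: the configuration file is the only command-line argument.
    argv = ["vcu_gst_app", cfg_path]
    return argv, env


if __name__ == "__main__":
    argv, env = build_launch("/media/card/config/input.cfg", True)
    print(argv, env["ENC_EXTRA_OP_BUFFERS"])
    # On the target board the pair could then be passed to
    # subprocess.run(argv, env=env, check=True).
```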
To measure the latency of the pipeline, run the below command. The latency data is huge, so dump it to a file.
Refer to the link below for detailed run flow steps.
1.3 Build Flow
Refer to the link below for detailed build flow steps.
2 Other Information
2.1 Known Issues
For PetaLinux-related known issues, please refer to PetaLinux 2021.1 - Product Update Release Notes and Known Issues.
For VCU-related known issues, please refer to AR# 76600: LogiCORE H.264/H.265 Video Codec Unit (VCU) - Release Notes and Known Issues and Xilinx Zynq UltraScale+ MPSoC Video Codec Unit.
To reduce performance issues with LLP2 4x serial pipelines, please refer to Chapter 40 of Section VI: Appendices in PG252 for the IRQ balancing scheme.
2.2 Limitations
For PetaLinux-related limitations, please refer to PetaLinux 2021.1 - Product Update Release Notes and Known Issues.
For VCU-related limitations, please refer to AR# 76600: LogiCORE H.264/H.265 Video Codec Unit (VCU) - Release Notes and Known Issues, Xilinx Zynq UltraScale+ MPSoC Video Codec Unit, and PG252.
2.3 Optimum VCU Encoder parameters for use-cases
Video streaming:
A video streaming use-case requires a very stable bitrate graph across all pictures.
Avoid periodic large Intra pictures during the encoding session.
Low-latency rate control (hardware RC) is the preferred control-rate for video streaming; it tries to keep frame sizes equal for all pictures.
Instead of periodic Intra frames, use low-delay-p (IPPPPP…).
VBR is not a preferred mode for streaming.
Performance: AVC Encoder settings:
It is preferred to use 8 slices only for better AVC encoder performance.
AVC standard does not support Tile mode processing which results in the processing of MB rows sequentially for entropy coding.
Quality: Low bitrate AVC encoding:
Enable profile=high and use qp-mode=auto for low-bitrate encoding use-cases.
The high profile enables 8x8 transform which results in better video quality at low bitrates.
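The recommendations in this section can be turned into a simple pre-flight check. The sketch below is a hypothetical helper, not part of vcu_gst_app: the dictionary keys are illustrative and do not reproduce the input.cfg syntax, and it covers only the streaming and AVC performance recommendations.

```python
from typing import Dict, List


def streaming_warnings(settings: Dict[str, object]) -> List[str]:
    """Return warnings for settings that deviate from the recommendations."""
    warnings: List[str] = []
    if str(settings.get("rate_control", "")).lower() != "low_latency":
        warnings.append("rate_control: low_latency is preferred for streaming")
    if str(settings.get("gop_mode", "")).lower() != "low_delay_p":
        warnings.append("gop_mode: low_delay_p avoids periodic Intra frames")
    is_avc = str(settings.get("codec", "")).upper() == "AVC"
    if is_avc and settings.get("slices") != 8:
        warnings.append("slices: 8 is preferred for AVC encoder performance")
    return warnings


if __name__ == "__main__":
    good = {"codec": "AVC", "rate_control": "low_latency",
            "gop_mode": "low_delay_p", "slices": 8}
    risky = {"codec": "AVC", "rate_control": "VBR",
             "gop_mode": "basic", "slices": 16}
    print(streaming_warnings(good))   # no warnings expected
    for line in streaming_warnings(risky):
        print(line)
```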
2.4 Max Bit-rate Benchmarking
The following tables summarize the maximum bit rate achievable for 3840x2160p60 resolution and the XV20 pixel format at the GStreamer level. The maximum supported target bit rate varies with the elements and the type of input used in the pipeline.
Maximum Bit Rate support for LLP1/LLP2 Streaming Use case with 4kp60 resolution.
The table below provides Encoder/Decoder Maximum Bit Rate Tests with XV20 format (For Streaming).
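Video Streaming (Server: Live video capture → VCU encoder → Parser → rtppay → Stream-out) (Client: Stream-in → rtpdepay → Decoder → Display)
Format | Codec | Rate Control Mode | Latency Mode | B-Frames = 0 | DDR Mode | Max Target Bitrate |
---|---|---|---|---|---|---|
4:2:2, 10 bit | H.264 (AVC) | LOW_LATENCY | LLP1 | IPPP | Encoder (PS_DDR), Decoder (PL_DDR) | 25 Mb/s |
4:2:2, 10 bit | H.264 (AVC) | LOW_LATENCY | LLP2 | IPPP | Encoder (PS_DDR), Decoder (PL_DDR) | 25 Mb/s |
4:2:2, 10 bit | H.265 (HEVC) | LOW_LATENCY | LLP1 | IPPP | Encoder (PS_DDR), Decoder (PL_DDR) | 25 Mb/s |
4:2:2, 10 bit | H.265 (HEVC) | LOW_LATENCY | LLP2 | IPPP | Encoder (PS_DDR), Decoder (PL_DDR) | 25 Mb/s |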
Maximum Bit Rate support for LLP1/LLP2 Serial Use case with 4kp60 resolution.
The table below provides Encoder/Decoder Maximum Bit Rate Tests with XV20 format.
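Serial (Live video capture → VCU encoder → VCU decoder → Display)
Format | Codec | Rate Control Mode | Latency Mode | B-Frames = 0 | DDR Mode | Max Target Bitrate |
---|---|---|---|---|---|---|
4:2:2, 10 bit | H.264 (AVC) | LOW_LATENCY | LLP1 | IPPP | Encoder (PS_DDR), Decoder (PL_DDR) | 25 Mb/s |
4:2:2, 10 bit | H.264 (AVC) | LOW_LATENCY | LLP2 | IPPP | Encoder (PS_DDR), Decoder (PL_DDR) | 25 Mb/s |
4:2:2, 10 bit | H.265 (HEVC) | LOW_LATENCY | LLP1 | IPPP | Encoder (PS_DDR), Decoder (PL_DDR) | 25 Mb/s |
4:2:2, 10 bit | H.265 (HEVC) | LOW_LATENCY | LLP2 | IPPP | Encoder (PS_DDR), Decoder (PL_DDR) | 25 Mb/s |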
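To put the 25 Mb/s ceiling in perspective, the sketch below converts a target bitrate into an average compressed frame size and an average slice size. It assumes that 1 Mb/s means 10^6 bits per second and that bits are spread evenly over frames and slices, which is only roughly what low-latency rate control aims for. The function name is hypothetical.

```python
from typing import Tuple


def average_sizes(bitrate_mbps: float, fps: float,
                  slices_per_frame: int) -> Tuple[float, float]:
    """Return (bytes per frame, bytes per slice) for an evenly spread bitrate.

    Assumes 1 Mb/s = 1,000,000 bit/s and equal-size frames and slices.
    """
    if bitrate_mbps <= 0 or fps <= 0 or slices_per_frame <= 0:
        raise ValueError("all arguments must be positive")
    bits_per_frame = bitrate_mbps * 1_000_000 / fps
    bytes_per_frame = bits_per_frame / 8
    return bytes_per_frame, bytes_per_frame / slices_per_frame


if __name__ == "__main__":
    frame_bytes, slice_bytes = average_sizes(25, 60, 8)
    print(f"average frame: {frame_bytes:.0f} bytes")
    print(f"average slice: {slice_bytes:.0f} bytes")
```

Under these assumptions, 25 Mb/s at 60 FPS with 8 slices gives an average of about 52 KB per frame and about 6.5 KB per slice.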
3 Appendix A - Input Configuration File (input.cfg)
The example configuration files are stored in the /media/card/config/ folder.
Configuration Type | Configuration Name | Description | Available Options | Note |
---|---|---|---|---|
Common | Common configuration | It is the starting point of common configuration | | |
Common | Number of Input | | 1, 2, 3, 4 | |
Common | Output | Select the video interface. | HDMI | |
Common | Out Type | | display and stream | |
Common | Display Rate | Pipeline frame rate | 30 or 60 FPS | |
Common | Exit | It indicates to the application that the configuration is over | | |
Input | Input Configuration | It is the starting point of the input configuration | | |
Input | Input Numbers | Starting Nth input configuration | 1, 2, 3, 4 | |
Input | Input Type | Input Type | HDMI | |
Input | Raw | To tell the pipeline is processed or pass-through | FALSE | Raw use-case is not supported for both LLP2 and non-LLP2 use-case as mixer is not connected to PS DDR |
Input | Width | The width of the live source | 3840, 1920 | |
Input | Height | The height of the live source | 2160, 1080 | |
Input | Format | The format of input data | XV20 | |
Input | Enable LLP2 | To enable LLP2 configuration | TRUE, FALSE | Set Enable LLP2 equals to False for non-LLP2 use-case. |
Input | Exit | It indicates to the application that the configuration is over | | |
Encoder | Encoder Configuration | It is the starting point of encoder configuration | | |
Encoder | Encoder Number | Starting Nth encoder configuration | 1, 2, 3, 4 | |
Encoder | Encoder Name | Name of the encoder | AVC/HEVC | |
Encoder | Profile | Name of the profile | high for AVC, main for HEVC | |
Encoder | Rate Control | Rate control options | Low_Latency | |
Encoder | Filler Data | Filler Data NAL units for CBR rate control | False | |
Encoder | QP | QP control mode used by the VCU encoder | Uniform, Auto | |
Encoder | L2 Cache | Enable or Disable L2Cache buffer in encoding process | True, False | |
Encoder | Latency Mode | Encoder latency mode | sub_frame | |
Encoder | Low Bandwidth | If enabled, decrease the vertical search range used for P-frame motion estimation to reduce the bandwidth. | True, False | |
Encoder | Gop Mode | Group of Pictures mode | Basic, low_delay_p, low_delay_b | |
Encoder | Bitrate | Target bitrate in Kbps | 1-25000 | |
Encoder | B Frames | Number of B-frames between two consecutive P-frames | 0 | |
Encoder | Slice | The number of slices produced for each frame. Each slice contains one or more complete macroblock/CTU row(s). Slices are distributed over the frame as regularly as possible. If slice-size is defined as well more slices may be produced to fit the slice-size requirement. | | The recommended slice for LLP2 use-case is 8. |
Encoder | GoP Length | The distance between two consecutive I frames | 1-1000 | |
Encoder | GDR Mode | It specifies which Gradual Decoder Refresh(GDR) scheme should be used when gop-mode = low_delay_p | Horizontal/Vertical/Disabled | GDR mode is currently supported with LLP1/LLP2 low-delay-p use-cases only |
Encoder | Entropy Mode | It specifies the entropy mode for H.264 (AVC) encoding process | CAVLC/CABAC/Default | |
Encoder | Max Picture Size | It is used to curtail instantaneous peak in the bit-stream using this parameter. It works in CBR/VBR rate-control only. When it is enabled, max-picture-size value is calculated and set with 10% of AllowedPeakMargin. i.e. | TRUE/FALSE | |
Encoder | Preset | | Custom | |
Encoder | Exit | It indicates to the application that the configuration is over | | |
Streaming | Streaming Configuration | It is the starting point of streaming configuration | | |
Streaming | Streaming Number | Starting Nth Streaming configuration | 1, 2, 3, 4 | |
Streaming | Host IP | The host to send the packets to | 192.168.25.89 or Windows PC IP | |
Streaming | Port | The port to send the packets to | 5004, 5008, 5012 and 5016 | |
Streaming | Exit | It indicates to the application that the configuration is over | | |
Trace | Trace Configuration | It is the starting point of trace configuration | | |
|