Zynq UltraScale+ MPSoC VCU ROI 2019.2

1 Overview

The primary goal of this VCU ROI design is to demonstrate the use of the DPU (Deep Learning Processor Unit) block to extract an ROI (Region of Interest) from input video frames, and to use this information to perform ROI-based encoding with the VCU (Video Codec Unit) encoder hard block present in Zynq UltraScale+ EV devices.

The design serves as a platform for accelerating deep neural network inference algorithms on the DPU and demonstrates the ROI feature of the VCU encoder. It uses a deep convolutional neural network (CNN) named Densebox, running on the DPU, to extract the ROI information (a face, in this case).

The design uses the Vivado IPI flow for building the hardware design, the Xilinx Yocto/PetaLinux flow for the software design, and the DNNDK toolchain for compiling the Densebox network from Caffe, a high-level ML framework. It uses Xilinx IP and software drivers to demonstrate the capabilities of the different components.

The following figure shows one of the use cases (streaming pipeline): face detection with enhanced encoding of the ROI on the ZCU106.

Enhanced encoding for ROI using ZCU106 Boards

1.1 System Architecture

The following figure shows the system-level diagram, which includes the components of the evaluation board.

1.2 Hardware Architecture

This section gives a detailed description of the blocks used in the hardware design. The functional block diagram of the design is shown in the figure below.

There are seven primary blocks in the design.

  • HDMI Capture Pipeline:

    • Captures video frame buffers from the capture source in 4k resolution, NV12 format.

    • Writes the buffers into DDR memory using the Frame Buffer Write IP.

  • Multi-scaler Block:

    • Reads the video buffers from DDR memory.

    • Scales the buffer down to VGA (640x480) size (suitable for the DPU).

    • Converts the format from NV12 to BGR.

    • Writes the down-scaled buffer to DDR memory.

  • DPU Block:

    • Reads the down-scaled buffers from DDR memory.

    • Runs the Densebox algorithm to generate ROI information for each frame buffer.

    • Passes the ROI information to the VCU encoder (an illustrative sketch of this metadata follows the list).

  • VCU Encoder:

    • Reads the 4k NV12 buffer from DDR memory.

    • Receives the ROI metadata from the DPU IP.

    • Encodes the video buffers based on the ROI information.

    • Writes the encoded stream to DDR memory.

  • PS GEM:

    • Reads the encoded stream from DDR memory.

    • Streams out the encoded stream over Ethernet.

  • VCU Decoder:

    • Decodes the received encoded frames and writes them to memory.

  • HDMI Tx:

    • Displays the decoded frames on an HDMI display.
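Purely as an illustration of what this ROI hand-off amounts to, the per-frame metadata is essentially a list of rectangles plus a quality hint. The struct and field names below are hypothetical and are not the encoder's actual interface, which is defined by the VCU encoder control software and OMX layer:

    #include <cstdint>
    #include <vector>

    // Hypothetical representation of the per-frame ROI metadata the DPU block
    // produces for the VCU encoder; the real interface is exposed through the
    // VCU encoder control software / OMX layer.
    struct RoiRect {
        uint32_t x;        // top-left corner in the full-resolution frame
        uint32_t y;
        uint32_t width;    // rectangle size in luma samples
        uint32_t height;
        int32_t  quality;  // relative quality hint: ROI regions get higher quality
    };

    struct RoiMetadata {
        uint64_t frame_id;             // frame the rectangles apply to
        std::vector<RoiRect> regions;  // one entry per detected face
    };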

The figure below shows the Processing System (PS) and Programmable Logic (PL) components in this TRD. All PL components are shown in gray.

This design supports the following video interfaces:

Sources:

  • HDMI-Rx capture pipeline implemented in the PL

  • File source (SD card, USB storage, SATA hard disk)

  • Stream-In from network or internet

Sinks:

  • HDMI-Tx display pipeline implemented in the PL

  • Stream-out on network or internet

VCU Codec:

  • Video encode/decode capability using the VCU hard block in the PL

  • H.264/H.265 encoding

  • Encoder/decoder parameter configuration using the OMX interface

DPU:

  • ROI (face) detection using the Densebox network running on the DPU in the PL

Streaming Interfaces:

  • 1G Ethernet PS GEM

Video format:

  • NV12

Supported Resolution:

  • 4kp30

  • 1080p30

1.3 VCU ROI Software

1.3.1 GStreamer Pipeline

The GStreamer plugin demonstrates the DPU capabilities together with the Xilinx VCU encoder's ROI (Region of Interest) feature. The plugin detects the ROI (face coordinates) in input frames using the DPU IP and passes the detected ROI information to the Xilinx VCU encoder. The following figure shows the data flow of the GStreamer pipeline for the stream-out use case.

Block Diagram of Stream-out Pipeline

As shown in the above figure, the stream-out GStreamer pipeline performs the following operations (a minimal launch sketch follows the list):

  1. v4l2src captures data from HDMI-Rx in NV12 format and passes it to the xlnxroivideo1detect GStreamer plugin

  2. The xlnxroivideo1detect GStreamer plugin scales the frame down to 640x480 resolution and converts the data to BGR format

  3. The 640x480 BGR frame is provided to the DPU IP as input to find the ROI (face coordinates)

  4. The extracted ROI information is passed to the VCU encoder

  5. Using the received ROI information, the encoder encodes the input data, coding the ROI regions at higher quality than the non-ROI regions

  6. The encoded data is streamed out using the RTP protocol
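A minimal sketch of how such a stream-out pipeline could be launched from application code is shown below, using gst_parse_launch. Only xlnxroivideo1detect comes from this design; the device node, caps, encoder element (omxh265enc), RTP payloader, host address, and port are assumptions that should be adapted to the actual TRD scripts and encoder settings (for example, an ROI-capable QP mode on the encoder).

    #include <gst/gst.h>

    int main(int argc, char *argv[]) {
        gst_init(&argc, &argv);

        // Illustrative stream-out pipeline: capture NV12 from HDMI-Rx, run ROI
        // detection, encode with the VCU, packetize as RTP and send over UDP.
        // Device node, caps, encoder settings, host and port are placeholders.
        GError *error = nullptr;
        GstElement *pipeline = gst_parse_launch(
            "v4l2src device=/dev/video0 ! "
            "video/x-raw,format=NV12,width=3840,height=2160,framerate=30/1 ! "
            "xlnxroivideo1detect ! "
            "omxh265enc ! h265parse ! rtph265pay ! "
            "udpsink host=192.168.1.100 port=5004",
            &error);
        if (!pipeline) {
            g_printerr("Failed to create pipeline: %s\n", error->message);
            g_clear_error(&error);
            return -1;
        }

        gst_element_set_state(pipeline, GST_STATE_PLAYING);

        // Block until an error or end-of-stream message is posted on the bus.
        GstBus *bus = gst_element_get_bus(pipeline);
        GstMessage *msg = gst_bus_timed_pop_filtered(
            bus, GST_CLOCK_TIME_NONE,
            (GstMessageType)(GST_MESSAGE_ERROR | GST_MESSAGE_EOS));

        if (msg) gst_message_unref(msg);
        gst_object_unref(bus);
        gst_element_set_state(pipeline, GST_STATE_NULL);
        gst_object_unref(pipeline);
        return 0;
    }

The same description string can also be run directly with gst-launch-1.0 for quick experiments.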

The following figure shows the data flow of the GStreamer pipeline for the stream-in use case.

Block Diagram of Stream-in Pipeline

As shown in the above figure, the stream-in GStreamer pipeline performs the following operations (a minimal launch sketch follows the list):

  1. The encoded data is streamed in using the RTP protocol

  2. The Xilinx VCU decoder decodes the data

  3. The decoded data is displayed on the HDMI-Tx display
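A corresponding stream-in sketch, following the same launcher pattern as above, could look like the following; the port, RTP caps, decoder element (omxh265dec), and display sink (kmssink) are assumptions to adapt to the actual TRD scripts.

    #include <gst/gst.h>

    // Illustrative stream-in pipeline: receive an RTP/H.265 stream over UDP,
    // decode it with the VCU and display it through the DRM/KMS sink.
    static const char *kStreamInPipeline =
        "udpsrc port=5004 "
        "caps=\"application/x-rtp,media=video,clock-rate=90000,encoding-name=H265\" ! "
        "rtph265depay ! h265parse ! omxh265dec ! kmssink";

    int main(int argc, char *argv[]) {
        gst_init(&argc, &argv);

        GError *error = nullptr;
        GstElement *pipeline = gst_parse_launch(kStreamInPipeline, &error);
        if (!pipeline) {
            g_printerr("Failed to create pipeline: %s\n", error->message);
            g_clear_error(&error);
            return -1;
        }

        gst_element_set_state(pipeline, GST_STATE_PLAYING);

        GstBus *bus = gst_element_get_bus(pipeline);
        GstMessage *msg = gst_bus_timed_pop_filtered(
            bus, GST_CLOCK_TIME_NONE,
            (GstMessageType)(GST_MESSAGE_ERROR | GST_MESSAGE_EOS));

        if (msg) gst_message_unref(msg);
        gst_object_unref(bus);
        gst_element_set_state(pipeline, GST_STATE_NULL);
        gst_object_unref(pipeline);
        return 0;
    }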

The following figure shows the xlnxroivideo1detect GStreamer plugin data flow.

As shown in the above figure, the xlnxroivideo1detect GStreamer plugin performs the following operations (a sketch of the DPU calls follows the list):

  1. The DPU is initialized and the DPU kernel is loaded using the libn2cube APIs - int dpuOpen(), DPUKernel *dpuLoadKernel(const char *networkName)

  2. The DPU GStreamer plugin receives the data frame from HDMI-Rx through the v4l2src plugin

  3. Create the DPU task - int dpuCreateTask(DPUKernel *kernel, int mode)

  4. Scale the input frame to 640x480 resolution using the Xilinx scaler IP

  5. Convert the input frame data format from NV12 to BGR using the Xilinx Color Space Converter (CSC) soft IP

  6. Prepare the OpenCV image using the BGR data

  7. Pass the intermediate OpenCV image to the DPU - int dpuSetInputImage2(DPUTask *task, const char *nodeName, const cv::Mat &image, int idx=0)

  8. Run the DPU task - int dpuRunTask(DPUTask *task)

  9. Extract the ROI (face) coordinates from the DPU output

  10. Map the detected face coordinates to the original input frame resolution

  11. Fill the ROI metadata buffer using the extracted ROI (face) coordinates

  12. Pass the ROI metadata buffer and the input NV12 frame data buffer to the Xilinx VCU encoder

  13. Destroy the DPU task and kernel - int dpuDestroyTask(DPUTask *task), int dpuDestroyKernel(DPUKernel *kernel)

  14. Close the DPU - int dpuClose()
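The per-frame DPU portion of this flow can be sketched against the libn2cube calls listed above. This is a minimal standalone sketch, not the plugin's actual implementation: the kernel name ("densebox"), the input node name ("input"), the stand-in BGR image, and the omitted output parsing and ROI hand-off are assumptions.

    #include <opencv2/opencv.hpp>
    #include <dnndk/dnndk.h>   // DNNDK runtime header; adjust the path to your DNNDK install

    // Minimal sketch of the per-frame DPU work in the plugin. Assumes a compiled
    // DenseBox kernel named "densebox" with an input node named "input"; both
    // names are placeholders. Output parsing and the ROI hand-off are omitted.
    int main() {
        dpuOpen();                                      // 1. initialize the DPU runtime
        DPUKernel *kernel = dpuLoadKernel("densebox");  //    load the compiled kernel
        DPUTask   *task   = dpuCreateTask(kernel, 0);   // 3. create a task (mode 0)

        // 4-6. In the plugin, scaling and NV12-to-BGR conversion are done by the
        // multi-scaler/CSC hardware; a blank 640x480 BGR image stands in here.
        cv::Mat bgr(480, 640, CV_8UC3, cv::Scalar(0, 0, 0));

        dpuSetInputImage2(task, "input", bgr);          // 7. feed the BGR image to the DPU
        dpuRunTask(task);                               // 8. run DenseBox on the DPU

        // 9-12. Read the output tensors, extract face rectangles, map them back
        // to the full-resolution frame and fill the ROI metadata (omitted).

        dpuDestroyTask(task);                           // 13. release the task and kernel
        dpuDestroyKernel(kernel);
        dpuClose();                                     // 14. shut down the DPU runtime
        return 0;
    }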

1.3.2 Deep Learning Processor Unit (DPU)

The DPU is a programmable engine dedicated to convolutional neural networks. The unit contains a register configuration module, a data controller module, and a convolution computing module. There is a specialized instruction set for the DPU, which enables it to work efficiently with many convolutional neural networks. Convolutional neural networks deployed on the DPU include VGG, ResNet, GoogLeNet, YOLO, SSD, MobileNet, FPN, etc.

To use the DPU, you must place the instructions and the input image data at specific memory addresses that the DPU can access. DPU operation also requires the application processing unit (APU) to service interrupts in order to coordinate data transfer.

Refer to