Automatic Speech Recognition on Zynq UltraScale+ MPSoC

This page benchmarks a real-time speech-to-text solution implemented on a Xilinx edge device using the Vitis accelerated flow. The model chosen for the automatic speech recognition (ASR) task is Baidu’s DeepSpeech2. Four designs are discussed: two run only on the processing system (PS), and two offload part of the computation to the programmable logic (PL), one in single-precision floating point and one in INT16. The designs are evaluated on the Zynq UltraScale+ MPSoC ZCU102 evaluation kit, and comparative results are reported.

Introduction

Machine learning (ML) and deep learning (DL) based solutions have become the new normal. Images, video, speech and text are all processed by ML/DL solutions for a variety of applications. Speech and language processing, categorized as Natural Language Processing (NLP), is an active area where ML/DL is being applied to provide new ASR solutions. Most of these solutions are compute-intensive and require high-end graphics processing units (GPUs) or cloud-based inference. However, GPUs are power hungry, and cloud-based solutions raise questions about privacy and data security. Field Programmable Gate Array (FPGA) devices have evolved to combine processor subsystems with traditional FPGA logic, enabling power-efficient, edge-based solutions for these traditionally compute-intensive tasks. Because of these advantages, there is a trend to target increasingly complex ML/DL solutions at edge-based FPGA system-on-chip (SoC) devices.

This page benchmarks a pre-recorded audio file-based standalone speech-to-text translation application, on a Xilinx SoC device, using the Vitis Unified Software Platform acceleration flow.

Speech-to-Text Translation

A speech recognition task involves three major steps, shown in the top-level diagram of a speech-to-text conversion task in the figure below.

The three major blocks are:

  1. Feature Extraction: The information in the input audio file must be extracted in a form that a deep learning model can interpret. This step is termed feature extraction. For a speech recognition task, a spectrogram is generated from the audio file and given as input to the deep learning model.

  2. Deep Learning Model: A trained deep learning (DL) model receives input from the feature extraction block and predicts the output. For speech recognition tasks, the DL models are generally based on Long Short-Term Memory (LSTM) layers.

  3. Decoder: The output of the DL model is a set of numbers/probabilities that must be interpreted as the final prediction. The decoder performs this task and is generally based on CTC (Connectionist Temporal Classification). It predicts the text equivalent of the input audio.
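As an illustration of the decoder step, the following is a minimal sketch of best-path (greedy) CTC decoding. It is not taken from the benchmarked implementation, which does not state whether greedy or beam-search decoding is used. The alphabet, the blank index and the probability values are hypothetical.

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Collapse per-frame argmax labels into text (best-path CTC decoding)."""
    # Pick the most probable label for every frame.
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    chars = []
    previous = blank
    for label in best_path:
        # Emit a character only when the label changes and is not the blank.
        if label != previous and label != blank:
            chars.append(alphabet[label])
        previous = label
    return "".join(chars)


# Hypothetical 4-symbol alphabet: index 0 is the CTC blank.
ALPHABET = ["_", "a", "b", " "]
probs = [
    [0.1, 0.8, 0.1, 0.0],  # a
    [0.1, 0.7, 0.2, 0.0],  # a again (repeat, collapsed)
    [0.9, 0.1, 0.0, 0.0],  # blank
    [0.2, 0.7, 0.1, 0.0],  # a (kept, because a blank separates it)
    [0.1, 0.1, 0.8, 0.0],  # b
]
decoded = ctc_greedy_decode(probs, ALPHABET)
```

Repeated labels are merged unless a blank separates them, which is how CTC distinguishes a held sound from a doubled letter.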

Speech-to-text translation is a compute-intensive NLP task, and this page benchmarks an implementation of it on the Xilinx Zynq UltraScale+ family of SoCs. The Zynq UltraScale+ family features a heterogeneous processing system that includes a low-power 64-bit quad-core ARM Cortex-A53 processing unit running at up to 1.5 GHz, with support for Advanced SIMD and VFPv4 floating-point extension instructions. Benchmarking this type of task on the Zynq UltraScale+ demonstrates the relationship between the compute requirements of the algorithm and the hardware available to support standalone edge-based implementations.

Brief Overview of DeepSpeech2

Baidu's DeepSpeech2 is an end-to-end solution for automatic speech recognition (ASR). The model takes a normalized sound spectrogram (generated by the feature extraction step) as input and generates a sequence of characters, which a decoder then reduces to the final prediction. The general architecture of the DeepSpeech2 model is shown in the left image of the figure below. The number of CNN and RNN layers can vary, as mentioned in the DeepSpeech2 paper.

 

The normalized spectrogram is first passed through a set of convolutional (CNN) layers. These are followed by a stack of LSTM layers (simple Recurrent Neural Network (RNN) or Gated Recurrent Unit (GRU) layers can be used instead). The LSTM layers are generally bi-directional, processing frames in both the forward and backward directions. The LSTM layers are followed by a fully connected (FC) layer, whose output is passed to a CTC-based decoder to predict the final text output.

Architecture of DeepSpeech2 Model used for Benchmarking

The architecture used is shown on the right side of the figure above. The implemented model consists of a stack of bi-directional LSTM layers along with CNN and FC layers. It is a dense (unpruned) model deployed without quantization, with a 2-layer CNN and a 5-layer bi-directional LSTM. For each convolutional layer, the numbers in the figure give the input channels, output channels, filter height, filter width, stride height and stride width, in that order. For each LSTM layer, the figure gives the input size and the number of hidden units in each gate (input size, number of hidden units).

LSTM cell overview

The LSTM cell used by the DeepSpeech2 model is shown in the below figure. It is a standard LSTM cell with no “peephole” connections (i.e. the gate layers do not read the cell states) as represented by the set of equations in the figure. Depending on the input audio length the LSTM cell calculates the output for a fixed number of timesteps. At every timestep ‘t’, these equations are to be computed, where ‘W’ represents a weight matrix, ‘b’ represents the biases, ‘x’ is the input vector to the LSTM cell and ‘y’ is the output vector of the LSTM cell.

At every timestep ‘t’, the input vector of that timestep, ‘x_t’, is multiplied by a weight matrix, and the output vector of the previous timestep, ‘y_{t-1}’, is multiplied by another weight matrix. There is therefore a feedback path between consecutive timesteps, which is one of the bottlenecks in parallelizing an RNN-based model.

For one LSTM cell, ‘i_t’, ‘f_t’, ‘o_t’ and ‘g_t’ represent the four major gates (input gate, forget gate, output gate and intermediate cell gate respectively).

Every gate has two weight matrices (‘W_x’ and ‘W_r’) and a bias term (‘b’) associated with it (e.g. ‘W_ix’, ‘W_ir’ and ‘b_i’ for the input gate). The ‘W_x’ matrix is multiplied by the input vector ‘x_t’, and the ‘W_r’ matrix is multiplied by the recurrent output vector of the previous timestep, ‘y_{t-1}’. Every gate output is the result of a nonlinear function (sigmoid or tanh) applied to the sum of the two matrix-vector products and the bias term. Finally, the gate outputs (‘i_t’, ‘f_t’, ‘o_t’ and ‘g_t’) and the previous cell state (‘c_{t-1}’) are used to obtain the new cell state (‘c_t’) and the output of the LSTM cell (‘y_t’).
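The description above corresponds to the standard peephole-free LSTM formulation. A minimal pure-Python sketch of one timestep is given below. It assumes the usual choice of sigmoid for the input, forget and output gates and tanh for the intermediate cell gate, and it is not the code used for benchmarking. The `params` layout and the function names are illustrative only.

```python
import math


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, y_prev, c_prev, params):
    """One timestep of a peephole-free LSTM cell.

    params maps each gate name ("i", "f", "o", "g") to a tuple
    (W_x, W_r, b): input weights, recurrent weights and bias.
    """
    gates = {}
    for name, (w_x, w_r, b) in params.items():
        # Pre-activation: W_x * x_t + W_r * y_{t-1} + b
        pre = [a + r + bias
               for a, r, bias in zip(matvec(w_x, x_t), matvec(w_r, y_prev), b)]
        act = math.tanh if name == "g" else sigmoid
        gates[name] = [act(z) for z in pre]
    # New cell state: c_t = f_t * c_{t-1} + i_t * g_t (element-wise)
    c_t = [f * c + i * g
           for f, c, i, g in zip(gates["f"], c_prev, gates["i"], gates["g"])]
    # Cell output: y_t = o_t * tanh(c_t) (element-wise)
    y_t = [o * math.tanh(c) for o, c in zip(gates["o"], c_t)]
    return y_t, c_t


# Tiny example: 2 inputs, 1 hidden unit, all weights and biases zero.
zero = ([[0.0, 0.0]], [[0.0]], [0.0])
params = {name: zero for name in ("i", "f", "o", "g")}
y_t, c_t = lstm_step([1.0, 2.0], [0.0], [1.0], params)
```

With all-zero weights every sigmoid gate evaluates to 0.5 and the cell gate to 0, so the new cell state is half the previous one, which makes the example easy to check by hand.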

This page compares implementations that run only on the PS with implementations that use both the PS and the PL. The application, tested on a ZCU102 board, reads an audio file from the SD card and prints the predicted text on the screen.

Specific Details of Implementation

Hardware and Software requirements

The hardware and software resources used for the implementation:

  • Xilinx ZCU102 evaluation kit

  • USB type-A to USB mini-B cables (for UART communication)

  • Secure Digital (SD) memory card

  • Xilinx Vitis 2019.2

  • Xilinx Vivado 2019.2

  • Xilinx Petalinux 2019.2

  • Serial communication terminal software (such as Tera Term or PuTTY)

Details of DeepSpeech2 model used

The DeepSpeech2 model used for benchmarking the designs:

  • Model Size: ~196 MB

  • Precision: Single Precision Floating Point (32-bit)

  • Pruning: No

  • Quantization: No

  • Training Dataset: LibriSpeech dataset

  • Word Error Rate (WER): 10.347 (on ‘test_clean’ subset of Librispeech)
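For reference, WER is conventionally the word-level edit distance between the reference transcript and the prediction, divided by the number of reference words. The sketch below shows that conventional definition, expressed as a percentage. It assumes a non-empty reference, the sentences are made up, and it is not the evaluation script behind the figure quoted above.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length, in percent."""
    ref, hyp = reference.split(), hypothesis.split()
    # previous[j] = edit distance between the reference prefix seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```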

 Details of Input Audio File

  • File Format: Wav file

  • Format: PCM

  • Format settings:

    • Endianness: Little

    • Sign: Signed

  • Codec ID: 1

  • Bit rate mode: Constant

  • Bit rate: 256 Kbps

  • Channel(s): 1 channel

  • Sampling rate: 16.0 kHz

  • Bit depth: 16 bits
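The listed bit rate follows from the other parameters: 16,000 samples per second × 16 bits × 1 channel = 256 kbps. The sketch below uses Python's standard `wave` module to read these properties from a file and recompute the bit rate. It builds one second of silence in memory instead of reading one of the benchmark files, and the helper names are illustrative.

```python
import io
import wave


def pcm_bit_rate_kbps(sample_rate_hz, bits_per_sample, channels):
    """Constant bit rate of uncompressed PCM audio."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def describe_wav(file_obj):
    """Read the header fields relevant to the list above."""
    with wave.open(file_obj, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": wav.getframerate(),
            "bit_depth": 8 * wav.getsampwidth(),
            "duration_sec": wav.getnframes() / wav.getframerate(),
        }


# Build one second of silence in memory with the parameters listed above.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 2 bytes = 16-bit samples
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)
buffer.seek(0)

info = describe_wav(buffer)
bit_rate = pcm_bit_rate_kbps(info["sample_rate_hz"], info["bit_depth"], info["channels"])
```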

Designs Implemented

We benchmark two broad categories of design: one that runs only on the processing system (PS only) and one that runs on a combination of the processing system and the programmable logic (PS+PL).

PS only solution

The complete DeepSpeech2 pipeline is implemented using C++. The complete pipeline is developed in-house from scratch and no external high-end libraries or IP cores are used. The design is compiled and built using the Vitis acceleration flow and then tested on ZCU102 board using the executables generated by Vitis. The initial version of the implementation was profiled for 10 audio samples, with the samples ranging from 1 second to 11 seconds. The profiling results are shown in the table below.

| Audio File | Audio Length (sec) | Feature Extraction (sec) | CNN Block (sec) | LSTM Block (sec) | Post LSTM (sec) | Total Time (sec) |
|---|---|---|---|---|---|---|
| 1SEC_1.wav | 1.000 | 0.016 | 0.572 | 4.562 | 0.005 | 5.154 |
| 2SEC_1.wav | 2.170 | 0.030 | 1.454 | 11.927 | 0.012 | 13.423 |
| 3SEC_1.wav | 3.540 | 0.047 | 2.508 | 20.654 | 0.021 | 23.230 |
| 4SEC_1.wav | 4.275 | 0.056 | 3.051 | 25.199 | 0.026 | 28.332 |
| 5SEC_1.wav | 5.000 | 0.065 | 3.612 | 29.900 | 0.030 | 33.608 |
| 6SEC_1.wav | 6.220 | 0.079 | 4.522 | 37.667 | 0.038 | 42.307 |
| 7SEC_1.wav | 7.120 | 0.091 | 5.222 | 43.317 | 0.044 | 48.674 |
| 8SEC_1.wav | 8.040 | 0.102 | 5.908 | 49.144 | 0.050 | 55.204 |
| 10SEC_1.wav | 10.390 | 0.131 | 7.668 | 63.863 | 0.065 | 71.727 |
| 11SEC_1.wav | 11.290 | 0.141 | 8.378 | 69.703 | 0.071 | 78.293 |
| Average Time (per 1 sec of audio) | 1.000 (Base) | 0.013 | 0.726 | 6.028 | 0.006 | 6.773 |

Profiling the pipeline resulted in the following major takeaways: 

  1. On average, the initial PS-only solution took 6.77 times the audio length to compute the complete pipeline.

  2. The combined time taken by feature extraction and post LSTM block is less than 1% of the total time taken.

  3. On average, the time spent in computation of the CNN block is around 11% for all the audio samples.

  4. As expected, the LSTM block is the most expensive block and consumes around 89% of the total computation time.
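The figures in the list above can be reproduced from the profiling table. The "Average Time" row appears to be each column's total divided by the total audio length (that is, seconds of compute per second of audio); this is an inference from the numbers, not something the source states. The sketch below recomputes the normalized total and the CNN and LSTM shares under that assumption.

```python
# (audio length, CNN, LSTM, total) in seconds, copied from the PS-only table.
PS_ONLY = [
    (1.000, 0.572, 4.562, 5.154),
    (2.170, 1.454, 11.927, 13.423),
    (3.540, 2.508, 20.654, 23.230),
    (4.275, 3.051, 25.199, 28.332),
    (5.000, 3.612, 29.900, 33.608),
    (6.220, 4.522, 37.667, 42.307),
    (7.120, 5.222, 43.317, 48.674),
    (8.040, 5.908, 49.144, 55.204),
    (10.390, 7.668, 63.863, 71.727),
    (11.290, 8.378, 69.703, 78.293),
]


def column_sum(rows, index):
    return sum(row[index] for row in rows)


audio = column_sum(PS_ONLY, 0)
cnn, lstm, total = (column_sum(PS_ONLY, i) for i in (1, 2, 3))

time_per_audio_second = total / audio   # roughly 6.77
lstm_share = 100 * lstm / total         # roughly 89 %
cnn_share = 100 * cnn / total           # roughly 11 %
```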

After profiling the initial PS-only solution, a multi-threaded design was implemented to reduce the computation time and use the ARM cores more efficiently.

Optimized Multi-threaded PS only solution

The initial PS-only solution is modified to use a multi-threaded approach that makes use of the four available ARM cores. The CNN block and LSTM block computations are parallelized across four threads, which brings down the computation time. The table below shows the profiling numbers for the optimized PS-only solution.
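The source does not describe how the work is divided among the four threads. One common scheme, sketched below as an assumption, is to split the output rows of each matrix-vector product into four contiguous chunks and compute them concurrently. The benchmarked code is C++; this Python version only illustrates the partitioning, and in CPython the global interpreter lock means it would not show a comparable speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def matvec_rows(matrix, vector, start, stop):
    """Compute the output elements for rows start..stop-1."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix[start:stop]]


def parallel_matvec(matrix, vector, workers=4):
    """Split the output rows into contiguous chunks, one per worker."""
    n = len(matrix)
    bounds = [(n * k // workers, n * (k + 1) // workers) for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so results can be concatenated directly.
        chunks = pool.map(lambda b: matvec_rows(matrix, vector, *b), bounds)
    return [value for chunk in chunks for value in chunk]


# 10 x 3 example matrix with entry (r, c) = r + c.
matrix = [[float(r + c) for c in range(3)] for r in range(10)]
vector = [1.0, 2.0, 3.0]
result = parallel_matvec(matrix, vector)
```

Each output row is independent of the others, so no synchronization is needed beyond waiting for all chunks to finish.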

| Audio File | Audio Length (sec) | Feature Extraction (sec) | CNN Block (sec) | LSTM Block (sec) | Post LSTM (sec) | Total Time (sec) |
|---|---|---|---|---|---|---|
| 1SEC_1.wav | 1.000 | 0.016 | 0.184 | 1.243 | 0.005 | 1.448 |
| 2SEC_1.wav | 2.170 | 0.030 | 0.477 | 3.235 | 0.012 | 3.755 |
| 3SEC_1.wav | 3.540 | 0.047 | 0.830 | 5.601 | 0.021 | 6.499 |
| 4SEC_1.wav | 4.275 | 0.051 | 0.991 | 6.850 | 0.026 | 7.918 |
| 5SEC_1.wav | 5.000 | 0.065 | 1.148 | 8.115 | 0.030 | 9.359 |
| 6SEC_1.wav | 6.220 | 0.079 | 1.456 | 10.213 | 0.038 | 11.786 |
| 7SEC_1.wav | 7.120 | 0.090 | 1.666 | 11.742 | 0.044 | 13.542 |
| 8SEC_1.wav | 8.040 | 0.103 | 1.879 | 13.327 | 0.050 | 15.359 |
| 10SEC_1.wav | 10.390 | 0.132 | 2.435 | 17.317 | 0.065 | 19.949 |
| 11SEC_1.wav | 11.290 | 0.142 | 2.669 | 18.891 | 0.071 | 21.772 |
| Average Time (per 1 sec of audio) | 1.000 (Base) | 0.013 | 0.233 | 1.635 | 0.006 | 1.887 |

Key observations after profiling the multi-threaded PS-only solution are as follows:

  1. The multi-threaded implementation of the CNN and LSTM blocks considerably reduces the overall run time.

  2. The overall run time is reduced from the initial 6.77 times to 1.887 times the audio length, a reduction of approximately 72%.

  3. The CNN block now contributes around 12% of the total run time.

  4. The LSTM block is still the dominant block in terms of computation time and consumes around 87% of the total.

The next step is to offload some of the computation to the programmable logic (PL) to achieve further acceleration. As the LSTM block is the most time-consuming block, it is the ideal candidate for acceleration in the PL.

PS+PL Solution (Single Precision Float)

The LSTM block mainly consists of computing the equations shown in the figure in the LSTM cell overview subsection. The most time-consuming operations in these equations are the two matrix-vector multiplications for every gate, referred to here as ‘Wx’ (input weights times input vector) and ‘Wy’ (recurrent weights times previous output vector).

As explained previously, for the ‘Wx’ computation, ‘x’ is the input of the current timestep (‘t’) and does not depend on the previous timestep, so it is termed a non-recurrent computation. For the ‘Wy’ computation, the vector ‘y’ is the output of the previous timestep (‘t-1’), which creates a feedback path, so it is termed a recurrent computation. In the PS+PL design, the complete LSTM layer is accelerated in the PL. The ‘Wx’ computation is moved outside the non-linear functions, which converts its matrix-vector multiplications into a matrix-matrix multiplication. Tiling, pipelining and unrolling are applied to accelerate the matrix-matrix multiplication on the PL. The ‘Wy’ computation and all the gate equations, including the non-linear functions, are also accelerated on the PL.

Two bottlenecks, (1) the feedback nature of the computations and (2) the limited on-chip memory, result in a high communication bandwidth requirement. These requirements are handled using a variety of optimizations.
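The split between non-recurrent and recurrent work can be illustrated in software. The sketch below is a simplified single-gate recurrent layer, not the full LSTM and not the PL kernel: the input projection for all timesteps is computed first as one matrix-matrix product, and only the recurrent product stays inside the sequential loop. The function names and the `activation` parameter are illustrative assumptions.

```python
def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def matmul(matrix, columns):
    """Multiply one weight matrix by many input vectors (one per timestep)."""
    return [matvec(matrix, column) for column in columns]


def run_recurrent_layer(w_x, w_r, bias, inputs, activation, y0):
    # Non-recurrent part: no dependency between timesteps, so it can be
    # computed for the whole sequence up front as a matrix-matrix product.
    projected = matmul(w_x, inputs)
    y_prev, outputs = y0, []
    for wx_t in projected:
        # Recurrent part: needs y from the previous timestep, so it stays in the loop.
        pre = [a + r + b for a, r, b in zip(wx_t, matvec(w_r, y_prev), bias)]
        y_prev = [activation(z) for z in pre]
        outputs.append(y_prev)
    return outputs


# One hidden unit, two timesteps, identity activation for easy checking.
outputs = run_recurrent_layer(
    w_x=[[1.0, 0.0]],
    w_r=[[0.5]],
    bias=[0.0],
    inputs=[[1.0, 0.0], [2.0, 0.0]],
    activation=lambda z: z,
    y0=[0.0],
)
```

Batching the input projection gives the hardware a large, regular workload that can be tiled and pipelined, while the loop body shrinks to the part that genuinely depends on the previous timestep.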

The design is synthesized for an operating frequency of 250 MHz. The resource utilization summary of the PS+PL design is shown in the below table. Apart from Block RAM (BRAM), other resources used are relatively low.

 

|  | BRAM (18k) | DSP48E | FF | LUT |
|---|---|---|---|---|
| Total Available | 1824 | 2520 | 548160 | 274080 |
| Total Used | 1458 | 540 | 168794 | 131246 |
| Utilization (%) | 79.93 | 21.42 | 30.79 | 47.88 |

The figure below shows the top-level block diagram of the PS+PL implementation. The audio file is read from the SD card. Feature extraction and the CNN are then processed on the PS, after which the LSTM block is executed entirely on the PL. The remainder of the pipeline is again executed on the PS, and finally the predicted text is printed on a terminal over UART. External DDR memory is used to store the model weights and the intermediate results. The weights are fetched from DDR and stored locally in the on-chip BRAM.
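A rough calculation shows why the weights have to live in DDR. The sketch below compares the total BRAM capacity from the utilization table with the approximate model size quoted earlier. It assumes an 18 Kb block holds 18 × 1024 bits and uses the size of the whole model (including the CNN and FC layers), so it is only an order-of-magnitude estimate.

```python
BRAM_BLOCKS = 1824            # 18 Kb blocks available (utilization table)
BRAM_BLOCK_BITS = 18 * 1024   # assumption: 18 Kb = 18 * 1024 bits
MODEL_SIZE_MB = 196           # approximate size of the full model

bram_bytes = BRAM_BLOCKS * BRAM_BLOCK_BITS // 8
bram_mb = bram_bytes / 1e6                    # about 4.2 MB
fraction_on_chip = bram_mb / MODEL_SIZE_MB    # about 2 % of the model
```

Even with every BRAM block dedicated to weights, only a small fraction of the model would fit on chip at once, which is consistent with the high communication bandwidth requirement mentioned above.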

The PS+PL design is also profiled for the same ten audio samples and the results are shown in the table below.

 


| Audio File | Audio Length (sec) | Feature Extraction (sec) | CNN Block (sec) | LSTM Block (sec) | Post LSTM (sec) | Total Time (sec) |
|---|---|---|---|---|---|---|
| 1SEC_1.wav | 1.000 | 0.016 | 0.186 | 0.583 | 0.005 | 0.790 |
| 2SEC_1.wav | 2.170 | 0.030 | 0.476 | 1.340 | 0.012 | 1.858 |
| 3SEC_1.wav | 3.540 | 0.047 | 0.828 | 2.199 | 0.021 | 3.095 |
| 4SEC_1.wav | 4.275 | 0.057 | 0.991 | 2.654 | 0.026 | 3.728 |
| 5SEC_1.wav | 5.000 | 0.066 | 1.149 | 3.117 | 0.030 | 4.362 |
| 6SEC_1.wav | 6.220 | 0.079 | 1.457 | 3.874 | 0.038 | 5.448 |
| 7SEC_1.wav | 7.120 | 0.091 | 1.661 | 4.441 | 0.044 | 6.237 |
| 8SEC_1.wav | 8.040 | 0.102 | 1.875 | 5.010 | 0.050 | 7.037 |
| 10SEC_1.wav | 10.390 | 0.130 | 2.456 | 6.457 | 0.065 | 9.108 |
| 11SEC_1.wav | 11.290 | 0.142 | 2.660 | 7.033 | 0.071 | 9.906 |
| Average Time (per 1 sec of audio) | 1.000 (Base) | 0.013 | 0.233 | 0.622 | 0.006 | 0.874 |

The following points are derived from the profiling data: 

  1. As expected, the time taken by the feature extraction, CNN and post-LSTM blocks remains the same for both the PS-only (4 threads) and PS+PL solutions.

  2. The processing time of the LSTM block is reduced by approximately 62% after moving the LSTM layers to the PL.

  3. The overall time taken by the complete pipeline is reduced to around 0.874 times the input audio length.

  4. Compared to PS-only (4 threads), the total computation time is reduced by around 54%.

  5. The PS+PL solution processes every tested audio length in less than the duration of the audio (below 1x).

  6. Real-time performance is achieved for the non-quantized, non-pruned model with floating-point precision.
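The percentages in the list above are relative reductions in run time, which are not the same as speedup factors. The short sketch below computes both from the normalized "Average Time" values of the 4-thread PS-only and PS+PL float tables; the helper names are illustrative.

```python
def reduction_percent(before, after):
    """Relative reduction in run time, in percent."""
    return 100.0 * (before - after) / before


def speedup_factor(before, after):
    """How many times faster the new design is."""
    return before / after


# Normalized times (seconds per second of audio): PS-only (4 threads) vs PS+PL float.
lstm_reduction = reduction_percent(1.635, 0.622)    # about 62 %
lstm_speedup = speedup_factor(1.635, 0.622)         # about 2.6x
total_reduction = reduction_percent(1.887, 0.874)   # about 54 %
```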

The floating-point implementation achieves real-time performance. As an extension to the PS+PL solution, we also worked on an INT16 implementation of the design, covered in the following section.

PS+PL Solution (INT16)

INT16 quantization is generally one of the preferred options for accelerating the inference of deep learning solutions on edge devices. In most scenarios, adopting INT16 quantization causes negligible (or no) loss in accuracy. In terms of performance and resource utilization, fixed-point computation generally offers a clear advantage on FPGAs over a floating-point implementation.

We also explored INT16 quantization for the DeepSpeech2 model. Since only the LSTM part is accelerated on the PL in our implementation, only the LSTM layers are quantized to INT16. The other layers, which execute on the PS, are left untouched and continue to use floating-point operations. For all five bi-directional LSTM layers, both weights and activations use INT16 precision. Lookup tables are used for the sigmoid and tanh calculations on the PL. The INT16 design is also synthesized for a maximum operating frequency of 250 MHz. The table below gives the utilization summary of the INT16 PS+PL solution.
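The source does not give the quantization scheme or the lookup-table parameters. As a hedged illustration, the sketch below shows one common approach: symmetric per-tensor INT16 quantization of weights and a nearest-entry lookup table for the sigmoid. The table size (1024 entries), the input range (-8 to 8) and all function names are assumptions chosen for the example.

```python
import math

INT16_MAX = 32767


def quantize_int16(values):
    """Symmetric per-tensor quantization: value is approximated by q * scale."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / INT16_MAX
    return [round(v / scale) for v in values], scale


def build_sigmoid_lut(entries=1024, x_min=-8.0, x_max=8.0):
    """Precompute the sigmoid at evenly spaced points."""
    step = (x_max - x_min) / (entries - 1)
    return [1.0 / (1.0 + math.exp(-(x_min + k * step))) for k in range(entries)]


def sigmoid_from_lut(x, lut, x_min=-8.0, x_max=8.0):
    # Clamp to the table range, then pick the nearest entry.
    x = min(max(x, x_min), x_max)
    index = round((x - x_min) / (x_max - x_min) * (len(lut) - 1))
    return lut[index]


weights = [0.5, -1.0, 0.25]
q_weights, scale = quantize_int16(weights)
lut = build_sigmoid_lut()
approx = sigmoid_from_lut(0.3, lut)
```

A lookup table replaces the exponential with a single memory read, and a larger table or interpolation between entries trades memory for accuracy.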

 

|  | BRAM (18k) | DSP48E | FF | LUT |
|---|---|---|---|---|
| Total Available | 1824 | 2520 | 548160 | 274080 |
| Total Used | 1001 | 807 | 22307 | 68156 |
| Utilization (%) | 54.87 | 32 | 4.07 | 24.86 |

The table below gives the profiling results for the PS+PL solution with INT16.

| Audio File | Audio Length (sec) | LSTM Block (sec) | Total Time (sec) |
|---|---|---|---|
| 1SEC_1.wav | 1.000 | 0.295 | 0.476 |
| 2SEC_1.wav | 2.170 | 0.610 | 1.118 |
| 3SEC_1.wav | 3.540 | 0.965 | 1.830 |
| 4SEC_1.wav | 4.275 | 1.145 | 2.249 |
| 5SEC_1.wav | 5.000 | 1.345 | 2.609 |
| 6SEC_1.wav | 6.220 | 1.664 | 3.274 |
| 7SEC_1.wav | 7.120 | 1.902 | 3.769 |
| 8SEC_1.wav | 8.040 | 2.145 | 4.240 |
| 10SEC_1.wav | 10.390 | 2.757 | 5.484 |
| 11SEC_1.wav | 11.290 | 3.003 | 5.997 |
| Average Time (per 1 sec of audio) | 1.000 (Base) | 0.268 | 0.52 |

The following points are derived from the profiling data: 

  1. Compared to the PS-only (4 threads) solution, the processing time of the LSTM block is reduced by around 84% with the INT16 implementation on the PL.

  2. Compared to the float implementation, the processing time of the INT16 LSTM block is reduced by approximately 57%.

  3. The overall time taken by the complete pipeline is reduced to around 0.52 times the input audio length.

  4. Compared to the PS+PL float implementation, the total computation time is reduced by around 40%.

  5. The INT16 PS+PL solution also achieves real-time performance, taking around 0.52 times the audio length across the different audio lengths.

Results Comparison

In this section we compare the run time achieved by the four designs for all 10 audio samples.

The table below compares the total run time of all the designs. Both PS+PL designs achieve real-time performance and, as expected, INT16 PS+PL is the fastest.

| Audio Length (sec) | PS only (no threads) (sec) | PS only (4 threads) (sec) | PS + PL with Float (sec) | PS + PL with INT16 (sec) |
|---|---|---|---|---|
| 1.000 | 5.154 | 1.448 | 0.790 | 0.476 |
| 2.170 | 13.423 | 3.755 | 1.858 | 1.118 |
| 3.540 | 23.230 | 6.499 | 3.095 | 1.830 |
| 4.275 | 28.332 | 7.918 | 3.728 | 2.249 |
| 5.000 | 33.608 | 9.359 | 4.362 | 2.609 |
| 6.220 | 42.307 | 11.786 | 5.448 | 3.274 |
| 7.120 | 48.674 | 13.542 | 6.237 | 3.769 |
| 8.040 | 55.204 | 15.359 | 7.037 | 4.240 |
| 10.390 | 71.727 | 19.949 | 9.108 | 5.484 |
| 11.290 | 78.293 | 21.772 | 9.906 | 5.997 |