This page benchmarks the implementation of a real-time speech-to-text solution on a Xilinx edge device using the Vitis accelerated flow. The model chosen for the automatic speech recognition (ASR) task is Baidu’s DeepSpeech2. Results are discussed for three designs: two run only on the processing system (PS), and the third offloads some computation to the programmable logic (PL) in combination with the PS. The designs are evaluated on the Zynq UltraScale+ MPSoC ZCU102 evaluation kit, and the comparative results are reported.

Introduction

Machine learning (ML) and deep learning (DL) based solutions have become the new normal. Be it image, video, speech, or text, all forms of data are being processed by ML/DL solutions and used for a variety of applications. Speech and language processing, categorized as Natural Language Processing (NLP), is an active area where ML/DL is being applied to provide new ASR solutions. Most of these solutions are compute-intensive and require high-end graphics processing units (GPUs) or cloud-based inference. However, GPUs are power hungry, and cloud-based solutions raise questions about privacy and data security. Field Programmable Gate Array (FPGA) solutions have evolved to combine processor subsystems with traditional FPGA logic, enabling power-efficient, edge-based solutions for these traditionally compute-intensive tasks. Because of these advantages, there is a trend toward targeting increasingly complex ML/DL solutions on edge-based FPGA system-on-chip (SoC) devices.

This page benchmarks a pre-recorded audio file-based standalone speech-to-text translation application, on a Xilinx SoC device, using the Vitis Unified Software Platform acceleration flow.

Speech-to-Text Translation

A speech recognition task involves three major steps, as shown in the figure below, which gives a top-level diagram of a speech-to-text conversion task.

The three major blocks are:

Feature Extraction: Given an input audio file, it is required to extract the information in a form which can be interpreted by a deep learning model. This information extraction step is termed as a feature extraction step. For a speech recognition task, a spectrogram is generated from the audio file and it is given as an input to the deep learning model.
Deep Learning Model: A trained deep learning (DL) model receives input from the feature extraction block and predicts the output. For speech recognition tasks the DL models are generally based on a Long Short-Term Memory (LSTM) model.
Decoder: The outputs of the DL model are numbers/probabilities that need to be interpreted as the final prediction. The decoder performs this task and is generally based on a CTC (Connectionist Temporal Classification) decoder. The decoder finally predicts the text equivalent of the input audio.
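
The decoder step can be illustrated with a minimal sketch of greedy (best-path) CTC decoding. This is an assumption made for illustration only: the page states that the decoder is CTC-based but does not say which decoding strategy is used, and the three-symbol alphabet and the probability values below are hypothetical.

```python
def ctc_greedy_decode(probs, alphabet, blank=0):
    """Best-path CTC decoding: take the argmax per frame,
    merge consecutive repeats, then drop the blank symbol."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    out = []
    prev = None
    for idx in best:
        # Emit a character only when it differs from the previous frame
        # and is not the blank symbol.
        if idx != prev and idx != blank:
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)


# Hypothetical alphabet; index 0 is the CTC blank.
alphabet = ["_", "a", "b"]
frames = [
    [0.10, 0.80, 0.10],  # a
    [0.10, 0.70, 0.20],  # a again (merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank (separates the two a's)
    [0.20, 0.70, 0.10],  # a (new character after the blank)
    [0.10, 0.20, 0.70],  # b
]
decoded = ctc_greedy_decode(frames, alphabet)
print(decoded)
```

The blank symbol is what allows a repeated character to survive the merge step: without the blank frame in the middle, the three "a" frames would collapse into one.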

The NLP task of speech-to-text translation is compute-intensive, and this page aims to benchmark an implementation of it on the Xilinx Zynq UltraScale+ series of SoCs. The Zynq UltraScale+ family features a heterogeneous processing system that includes a low-power 64-bit quad-core ARM Cortex-A53 processing unit running at up to 1.5 GHz and supporting Advanced SIMD and VFPv4 floating-point extension instructions. The benchmark serves as an exercise to demonstrate the relationship between the compute requirements of the algorithm and the hardware solutions that support standalone edge-based implementations.

Brief Overview of DeepSpeech2

Baidu's DeepSpeech2 is an end-to-end solution for automatic speech recognition (ASR). The model takes a normalized sound spectrogram as input (generated by the feature extraction step) and generates a sequence of characters, which a decoder then reduces to the final prediction. The general architecture of the DeepSpeech2 model is shown by the left image in the figure below. The number of CNN and RNN layers can vary, as mentioned in the DeepSpeech2 paper.

The model takes a normalized spectrogram as input. The input is passed through a set of convolutional (CNN) layers, which are followed by a stack of LSTM layers (these can also be simple Recurrent Neural Network (RNN) or Gated Recurrent Unit (GRU) layers). The LSTM layers are generally bi-directional, processing frames in both the forward and backward directions. The LSTM layers are followed by a fully connected (FC) layer. The output of the FC layer is then passed to a CTC-based decoder to predict the final text output.

Architecture of DeepSpeech2 Model used for Benchmarking

The architecture used is shown on the right side of the figure above. The model implemented consists of a stack of bi-directional LSTM layers along with other layers such as CNN and FC. The implementation is a dense model (not pruned) that is deployed without quantization and leverages a 2-layer CNN and a 5-layer bi-directional LSTM. The numbers mentioned in the figure correspond to a convolutional layer representing input channels, output channels, filter height, filter width, stride height, and stride width respectively. The input size and the number of hidden units in each gate are also mentioned for all the LSTM layers (input size, number of hidden units).

LSTM cell overview

The LSTM cell used by the DeepSpeech2 model is shown in the below figure. It is a standard LSTM cell with no “peephole” connections (i.e. the gate layers do not read the cell states) as represented by the set of equations in the figure. Depending on the input audio length the LSTM cell calculates the output for a fixed number of timesteps. At every timestep ‘t’, these equations are to be computed, where ‘W’ represents a weight matrix, ‘b’ represents the biases, ‘x’ is the input vector to the LSTM cell and ‘y’ is the output vector of the LSTM cell.

For every timestep ‘t’, the input vector ‘x_t’ of that timestep is multiplied by a weight matrix, while the output vector ‘y_t-1’ of the previous timestep is multiplied by another weight matrix. This creates feedback between consecutive timesteps, which is one of the bottlenecks in achieving maximum parallelization of an RNN-based model.

For one LSTM cell, ‘i_t’, ‘f_t’, ‘o_t’ and ‘g_t’ represent the four major gates (input gate, forget gate, output gate and intermediate cell gate respectively).

Every gate has two weight matrices (‘W_*x’ and ‘W_*r’) and a bias term (‘b_*’) associated with it, where ‘*’ stands for the gate (e.g. ‘W_ix’, ‘W_ir’ and ‘b_i’ for the input gate). The ‘W_*x’ matrix is multiplied with the input vector ‘x_t’ and the ‘W_*r’ matrix is multiplied with the recurrent output vector of the previous timestep, ‘y_t-1’. Every gate output is the result of a nonlinear function (sigmoid or tanh) applied to the sum of the two matrix-vector products and the bias term. Finally, the gate outputs (‘i_t’, ‘f_t’, ‘o_t’ and ‘g_t’) and the previous cell state (‘c_t-1’) are used to obtain the new cell state (‘c_t’) and the output of the LSTM cell (‘y_t’).
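
The per-timestep computation described above can be sketched as follows. This is a minimal illustration that assumes the standard non-peephole LSTM update (sigmoid for the input, forget and output gates, tanh for the intermediate cell gate, c_t = f_t * c_t-1 + i_t * g_t, y_t = o_t * tanh(c_t)); the exact equations used by the design are the ones in the figure. The function names, the dictionary layout and the all-zero weights are hypothetical and chosen only to keep the example small.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def lstm_step(x_t, y_prev, c_prev, params):
    """One timestep of a non-peephole LSTM cell (assumed standard form)."""

    def gate(name, act):
        wx = matvec(params["W_" + name + "x"], x_t)     # non-recurrent term
        wr = matvec(params["W_" + name + "r"], y_prev)  # recurrent term
        b = params["b_" + name]
        return [act(a + r + bias) for a, r, bias in zip(wx, wr, b)]

    i_t = gate("i", sigmoid)
    f_t = gate("f", sigmoid)
    o_t = gate("o", sigmoid)
    g_t = gate("g", math.tanh)
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    y_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return y_t, c_t


# Hypothetical tiny cell: 3 inputs, 2 hidden units, all weights zero.
n_in, n_hid = 3, 2
params = {}
for name in "ifog":
    params["W_" + name + "x"] = [[0.0] * n_in for _ in range(n_hid)]
    params["W_" + name + "r"] = [[0.0] * n_hid for _ in range(n_hid)]
    params["b_" + name] = [0.0] * n_hid

y_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], params)
print(y_t, c_t)
```

With all-zero weights every sigmoid gate evaluates to 0.5 and the cell gate to 0, so the new cell state is simply half of the previous one, which makes the result easy to check by hand.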

This Wiki page compares two kinds of implementation: one that runs only on the PS and another that uses both the PS and the PL. The application is tested on a ZCU102 board; it reads an audio file from the SD card and prints the predicted text on the screen.

Specific Details of Implementation

Hardware and Software requirements

The hardware and software resources used for the implementation:

Xilinx ZCU102 evaluation kit
USB type-A to USB mini-B cables (for UART communication)
Secure Digital (SD) memory card
Xilinx Vitis 2019.2
Xilinx Vivado 2019.2
Xilinx Petalinux 2019.2
Serial communication terminal software (such as Tera Term or PuTTY)

Details of DeepSpeech2 model used

The DeepSpeech2 model used for benchmarking the two designs:

Model Size: ~196 MB
Precision: Single Precision Floating Point (32-bit)
Pruning: No
Quantization: No
Training Dataset: LibriSpeech dataset
Word Error Rate (WER): 10.347 (on the ‘test_clean’ subset of LibriSpeech)

Details of Input Audio File

File Format: Wav file
Format: PCM
Format settings:
- Endianness: Little
- Sign: Signed
Codec ID: 1
Bit rate mode: Constant
Bit rate: 256 Kbps
Channel(s): 1 channel
Sampling rate: 16.0 kHz
Bit depth: 16 bits
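
As a sketch of how these properties relate, the snippet below reads a WAV header with Python's standard wave module and recomputes the bit rate as sampling rate x bit depth x channels (16000 x 16 x 1 = 256 kbps). The helper names and the generated silent test file are hypothetical; they are not part of the demo package.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Read the basic PCM properties from a WAV file header."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        bits = wav.getsampwidth() * 8
        seconds = wav.getnframes() / sample_rate
    return {
        "channels": channels,
        "sample_rate_hz": sample_rate,
        "bit_depth": bits,
        "bit_rate_kbps": sample_rate * bits * channels / 1000,
        "duration_sec": seconds,
    }


def matches_expected_format(info):
    """Check against the format listed above: mono, 16 kHz, 16-bit."""
    return (
        info["channels"] == 1
        and info["sample_rate_hz"] == 16000
        and info["bit_depth"] == 16
    )


# Create one second of silence in the expected format (illustration only).
path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)       # 2 bytes = 16 bits per sample
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

info = describe_wav(path)
print(info, matches_expected_format(info))
```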

Designs Implemented

We benchmark broadly two kinds of design: one that runs only on the processing system (PS only) and another that runs on a combination of the processing system and the programmable logic (PS+PL).

PS only solution

The complete DeepSpeech2 pipeline is implemented using C++. The complete pipeline is developed in-house from scratch and no external high-end libraries or IP cores are used. The design is compiled and built using the Vitis acceleration flow and then tested on ZCU102 board using the executables generated by Vitis. The initial version of the implementation was profiled for 10 audio samples, with the samples ranging from 1 second to 11 seconds. The profiling results are shown in the table below.

Audio File	Audio Length (sec)	Feature Extraction (sec)	CNN Block (sec)	LSTM Block (sec)	Post LSTM (sec)	Total Time (sec)
1SEC_1.wav	1.000	0.016	0.572	4.562	0.005	5.154
2SEC_1.wav	2.170	0.030	1.454	11.927	0.012	13.423
3SEC_1.wav	3.540	0.047	2.508	20.654	0.021	23.230
4SEC_1.wav	4.275	0.056	3.051	25.199	0.026	28.332
5SEC_1.wav	5.000	0.065	3.612	29.900	0.030	33.608
6SEC_1.wav	6.220	0.079	4.522	37.667	0.038	42.307
7SEC_1.wav	7.120	0.091	5.222	43.317	0.044	48.674
8SEC_1.wav	8.040	0.102	5.908	49.144	0.050	55.204
10SEC_1.wav	10.390	0.131	7.668	63.863	0.065	71.727
11SEC_1.wav	11.290	0.141	8.378	69.703	0.071	78.293
Average Time	1.000 (Base)	0.013	0.726	6.028	0.006	6.773

Profiling the pipeline resulted in the following major takeaways:

The time taken by the initial PS-only solution to compute the complete pipeline was, on average, 6.77 times the audio length.
The combined time taken by feature extraction and post LSTM block is less than 1% of the total time taken.
On average, the time spent in computation of the CNN block is around 11% for all the audio samples.
As expected, the LSTM block is the most expensive block and consumes around 89% of the total computation time.
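
The figures in these takeaways can be reproduced from the table with a few lines of arithmetic. The sketch below assumes that the "Average Time" row is the total time normalized by the total audio length, which matches the reported 6.77 figure; the variable names are illustrative.

```python
# (audio length, total time) in seconds, copied from the table above.
runs = [
    (1.000, 5.154), (2.170, 13.423), (3.540, 23.230), (4.275, 28.332),
    (5.000, 33.608), (6.220, 42.307), (7.120, 48.674), (8.040, 55.204),
    (10.390, 71.727), (11.290, 78.293),
]

total_audio = sum(length for length, _ in runs)
total_time = sum(time for _, time in runs)

# Run time per second of audio; values above 1.0 are slower than real time.
real_time_factor = total_time / total_audio

# Per-block averages from the "Average Time" row of the table.
blocks = {"Feature Extraction": 0.013, "CNN": 0.726, "LSTM": 6.028, "Post LSTM": 0.006}
average_total = 6.773
shares = {name: value / average_total for name, value in blocks.items()}

print(round(real_time_factor, 2))
for name, share in shares.items():
    print(f"{name}: {100 * share:.1f}%")
```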

After profiling the initial PS-only solution, a multi-threaded design was implemented to reduce the computation time and use the ARM cores efficiently.

Optimized Multi-threaded PS only solution

The initial PS-only solution is modified using a multi-threaded approach to make use of the four ARM cores available. The CNN block and LSTM block computations are parallelized across four threads, which brings down the computation time. The table below shows the profiling numbers for the optimized PS-only solution.

Audio File	Audio Length (sec)	Feature Extraction (sec)	CNN Block (sec)	LSTM Block (sec)	Post LSTM (sec)	Total Time (sec)
1SEC_1.wav	1.000	0.016	0.184	1.243	0.005	1.448
2SEC_1.wav	2.170	0.030	0.477	3.235	0.012	3.755
3SEC_1.wav	3.540	0.047	0.830	5.601	0.021	6.499
4SEC_1.wav	4.275	0.051	0.991	6.850	0.026	7.918
5SEC_1.wav	5.000	0.065	1.148	8.115	0.030	9.359
6SEC_1.wav	6.220	0.079	1.456	10.213	0.038	11.786
7SEC_1.wav	7.120	0.090	1.666	11.742	0.044	13.542
8SEC_1.wav	8.040	0.103	1.879	13.327	0.050	15.359
10SEC_1.wav	10.390	0.132	2.435	17.317	0.065	19.949
11SEC_1.wav	11.290	0.142	2.669	18.891	0.071	21.772
Average Time	1.000 (Base)	0.013	0.233	1.635	0.006	1.887

Key observations after profiling the multi-threaded PS-only solution are as follows:

The multi-threaded implementation of CNN and LSTM blocks helped in considerably reducing the overall run time.
Overall run time reduced from the initial 6.77 times to 1.887 times the audio length. This amounts to approximately 72% reduction in the run time.
The CNN block now contributes to around 12% of the total run time.
Still, the LSTM block is the major block in terms of computation time and consumes around 87% of the total computation time.
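
The page does not describe how the work is divided among the four threads. One common approach, shown here purely as an assumption, is to split the rows of each matrix-vector product into contiguous chunks, one per worker. The sketch only illustrates the partitioning: in CPython, threads do not run this pure-Python arithmetic in parallel, whereas native threads in a C++ implementation would. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def matvec_rows(W, x, start, stop):
    """Compute the output elements for rows start..stop-1 only."""
    return [sum(w * v for w, v in zip(W[r], x)) for r in range(start, stop)]


def parallel_matvec(W, x, workers=4):
    """Split the rows of W into contiguous chunks, one chunk per worker."""
    n = len(W)
    chunk = max(1, (n + workers - 1) // workers)  # ceiling division
    bounds = [(s, min(s + chunk, n)) for s in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = pool.map(lambda b: matvec_rows(W, x, b[0], b[1]), bounds)
        # pool.map preserves chunk order, so concatenation restores row order.
        return [y for part in parts for y in part]


# Hypothetical 6x3 matrix: row r is [r, r + 1, r + 2].
W = [[r + c for c in range(3)] for r in range(6)]
x = [1, 2, 3]
result = parallel_matvec(W, x)
print(result)
```

Each output element depends on a single row of the matrix, so the chunks can be computed independently and no synchronization is needed until the results are gathered.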

The next step is to offload some of the computations to the programmable logic (PL) to achieve further acceleration. As the LSTM block is the most time-consuming block, it is the ideal candidate for acceleration in the PL.

PS+PL Solution (Single Precision Float)

The LSTM block mainly consists of computing the equations shown in the figure in the LSTM cell overview subsection. The most time-consuming operation in these equations is the matrix-vector multiplication for every gate (‘Wx’ and ‘Wy’).

As explained previously, in the ‘Wx’ computation ‘x’ is the input of the current timestep (‘t’) and has no dependency on the previous timestep, so it is termed a non-recurrent computation. In the ‘Wy’ computation, the vector ‘y’ is the output of the previous timestep (t-1), which creates a feedback path, so it is termed a recurrent computation. In the PS+PL design, the complete LSTM layer is accelerated in the PL. Because the ‘Wx’ computation does not depend on the recurrence, it is brought outside of the non-linear functions and computed for all timesteps together, which converts the matrix-vector multiplications into a matrix-matrix multiplication. Tiling, pipelining, and unrolling are applied to accelerate the matrix-matrix multiplication on the PL. The ‘Wy’ computation and all the gate equations, along with the non-linear functions, are also accelerated on the PL.
Two bottlenecks, (1) the feedback nature of the computations and (2) the limited on-chip memory, result in a high communication bandwidth requirement. These requirements are handled efficiently using a variety of optimizations.
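
The conversion from matrix-vector to matrix-matrix multiplication can be sketched on a simplified recurrence. The example below uses a single tanh gate instead of the four LSTM gates, which is an assumption made to keep it short, and checks that precomputing ‘Wx’ for all timesteps gives the same result as computing it inside the time loop. The function names and the small weight values are hypothetical.

```python
import math


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def matmul(A, B):
    """Matrix-matrix product for matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def run_per_step(W_x, W_r, b, xs):
    """Baseline: both products are computed inside the time loop."""
    y = [0.0] * len(b)
    for x_t in xs:
        wx = matvec(W_x, x_t)
        wr = matvec(W_r, y)
        y = [math.tanh(p + q + c) for p, q, c in zip(wx, wr, b)]
    return y


def run_hoisted(W_x, W_r, b, xs):
    """The non-recurrent product is hoisted out as one matrix-matrix multiply."""
    X = [list(col) for col in zip(*xs)]  # n_in x T, one column per timestep
    WX = matmul(W_x, X)                  # n_hid x T, computed once
    y = [0.0] * len(b)
    for t in range(len(xs)):
        wr = matvec(W_r, y)              # the recurrent part stays sequential
        y = [math.tanh(WX[h][t] + wr[h] + b[h]) for h in range(len(b))]
    return y


W_x = [[0.5, -0.25], [0.1, 0.2]]
W_r = [[0.3, 0.0], [0.0, -0.3]]
b = [0.0, 0.1]
xs = [[1.0, 2.0], [0.5, -1.0], [0.0, 1.0]]  # three timesteps

y_step = run_per_step(W_x, W_r, b, xs)
y_hoist = run_hoisted(W_x, W_r, b, xs)
print(y_step, y_hoist)
```

Only the recurrent product remains inside the loop, so the large non-recurrent product becomes one regular matrix-matrix multiplication, which is the kind of computation that tiling, pipelining, and unrolling are applied to.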

The design is synthesized for an operating frequency of 250 MHz. The resource utilization summary of the PS+PL design is shown in the below table. Apart from Block RAM (BRAM), other resources used are relatively low.

	BRAM (18k)	DSP48E	FF	LUT
Total Available	1824	2520	548160	274080
Total Used	1458	540	168794	131246
Utilization (%)	79.93	21.42	30.79	47.88

Figure below shows the top-level block diagram of the PS+PL implementation. The audio file is read from the SD card. Then feature extraction and CNN are processed on the PS, after which the LSTM block is executed entirely on the PL. The remainder of the pipeline is again executed on PS, and finally the predicted text is printed on a terminal using UART. External DDR memory is used to store the model weights and the intermediate results. These weights are fetched from DDR and stored locally in the on-chip BRAM.

The PS+PL design is also profiled for the same ten audio samples and the results are shown in the table below.

Audio File	Audio Length (sec)	Feature Extraction (sec)	CNN Block (sec)	LSTM Block (sec)	Post LSTM (sec)	Total Time (sec)
1SEC_1.wav	1.000	0.016	0.186	0.583	0.005	0.790
2SEC_1.wav	2.170	0.030	0.476	1.340	0.012	1.858
3SEC_1.wav	3.540	0.047	0.828	2.199	0.021	3.095
4SEC_1.wav	4.275	0.057	0.991	2.654	0.026	3.728
5SEC_1.wav	5.000	0.066	1.149	3.117	0.030	4.362
6SEC_1.wav	6.220	0.079	1.457	3.874	0.038	5.448
7SEC_1.wav	7.120	0.091	1.661	4.441	0.044	6.237
8SEC_1.wav	8.040	0.102	1.875	5.010	0.050	7.037
10SEC_1.wav	10.390	0.130	2.456	6.457	0.065	9.108
11SEC_1.wav	11.290	0.142	2.660	7.033	0.071	9.906
Average Time	1.000 (Base)	0.013	0.233	0.622	0.006	0.874

The following points are derived from the profiling data:

As expected, the time taken by the feature extraction, CNN, and post-LSTM blocks remains the same for both the PS-only (4 threads) and PS+PL solutions.
The processing time for the LSTM block is reduced by approximately 62% (about a 2.6x speedup) after moving the LSTM layer to the PL.
The overall time taken by the complete pipeline is reduced to around 0.874 times the input audio length.
Compared to PS-only (4 threads) there is a reduction of around 54% in the computation time.
The PS+PL solution runs in less than 1x the audio length (faster than real time) for all the audio lengths tested.
Real-time performance is achieved for the non-quantized, non-pruned model with floating-point precision.

The floating-point implementation successfully achieved real-time performance. As an extension to the PS+PL solution, we also worked on an INT16 implementation of the design, which is covered in the following section.

PS+PL Solution (INT16)

Generally, INT16 quantization is one of the most preferred options for accelerating the inference of deep learning solutions on edge devices. In most scenarios, adopting INT16 quantization results in negligible (or no) loss in accuracy. In terms of performance and resource utilization, fixed-point computations generally offer a significant advantage on FPGAs compared to a floating-point implementation.

We also explored INT16 quantization for the DeepSpeech2 model. Since only the LSTM part is accelerated on the PL in our implementation, we quantized the weights of the LSTM layers to INT16. The other layers, which execute on the PS, are left untouched and continue to use floating-point operations. For all five bi-directional LSTM layers, both weights and activations use INT16 precision. Lookup tables are used for the sigmoid and tanh calculations on the PL. The INT16 design is also synthesized for a maximum operating frequency of 250 MHz. The table below gives the utilization summary of the INT16 PS+PL solution.

	BRAM (18k)	DSP48E	FF	LUT
Total Available	1824	2520	548160	274080
Total Used	1001	807	22307	68156
Utilization (%)	54.87	32	4.07	24.86
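
The INT16 weight quantization and lookup-table activations mentioned above can be illustrated with a short sketch. The page does not give the quantization scheme or the table size, so the symmetric per-tensor scale, the [-8, 8] input range, and the 1024-entry table below are all assumptions chosen for illustration.

```python
import math


def quantize_int16(values):
    """Symmetric quantization: map the largest magnitude to 32767 (assumed scheme)."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 32767.0
    q = [max(-32768, min(32767, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    return [v * scale for v in q]


# Sigmoid lookup table over an assumed input range of [-8, 8].
LUT_MIN, LUT_MAX, LUT_SIZE = -8.0, 8.0, 1024
STEP = (LUT_MAX - LUT_MIN) / (LUT_SIZE - 1)
SIGMOID_LUT = [1.0 / (1.0 + math.exp(-(LUT_MIN + k * STEP))) for k in range(LUT_SIZE)]


def sigmoid_lut(x):
    """Nearest-entry lookup; inputs outside the range are clamped."""
    x = max(LUT_MIN, min(LUT_MAX, x))
    return SIGMOID_LUT[round((x - LUT_MIN) / STEP)]


weights = [0.5, -1.0, 0.25, 0.0]  # hypothetical weight values
q_weights, scale = quantize_int16(weights)
restored = dequantize(q_weights, scale)
print(q_weights, restored)
print(sigmoid_lut(1.0), 1.0 / (1.0 + math.exp(-1.0)))
```

The rounding error per weight is at most half a quantization step, and the lookup error is bounded by the table spacing times the maximum slope of the sigmoid, which is why a table of moderate size can be sufficient.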

Table below gives the profiling information for the performance of PS+PL solution with INT16.

Audio File	Audio Length (sec)	LSTM Block (sec)	Total Time (sec)
1SEC_1.wav	1.000	0.295	0.476
2SEC_1.wav	2.170	0.610	1.118
3SEC_1.wav	3.540	0.965	1.830
4SEC_1.wav	4.275	1.145	2.249
5SEC_1.wav	5.000	1.345	2.609
6SEC_1.wav	6.220	1.664	3.274
7SEC_1.wav	7.120	1.902	3.769
8SEC_1.wav	8.040	2.145	4.240
10SEC_1.wav	10.390	2.757	5.484
11SEC_1.wav	11.290	3.003	5.997
Average Time	1.000 (Base)	0.268	0.52

The following points are derived from the profiling data:

Compared to the PS-only (4 threads) solution, the INT16 LSTM block on the PL takes approximately 84% less processing time.
Compared to the float implementation, the INT16 LSTM block takes approximately 57% less processing time.
The overall time taken by the complete pipeline is reduced to around 0.52 times the input audio length.
Compared to the PS+PL float implementation, there is a reduction of around 40% in the total computation time.
The INT16 PS+PL solution also achieves real-time performance, with a run time of around 0.52 times the audio length across the different audio lengths.
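
The percentages in this list describe reductions in run time, which differ from speedup factors. The sketch below derives both from the normalized averages reported in the tables above; the helper names are illustrative.

```python
def reduction_pct(before, after):
    """Percentage of the original run time that was removed."""
    return 100.0 * (before - after) / before


def speedup_factor(before, after):
    """How many times faster the new run time is."""
    return before / after


# Normalized LSTM block averages taken from the profiling tables.
lstm_ps_threads, lstm_pl_float, lstm_pl_int16 = 1.635, 0.622, 0.268
# Normalized total averages taken from the profiling tables.
total_pl_float, total_pl_int16 = 0.874, 0.52

print(round(reduction_pct(lstm_ps_threads, lstm_pl_int16), 1))   # INT16 vs PS (4 threads)
print(round(reduction_pct(lstm_pl_float, lstm_pl_int16), 1))     # INT16 vs float on PL
print(round(reduction_pct(total_pl_float, total_pl_int16), 1))   # total pipeline
print(round(speedup_factor(lstm_ps_threads, lstm_pl_int16), 1))  # as a speedup factor
```

An 84% reduction in LSTM run time corresponds to a speedup of roughly 6.1x, so the two ways of reporting the same measurement should not be mixed.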

Results Comparison

In this section we compare the run time achieved by all the designs for the 10 audio samples.

The table below compares the total run time of all the designs. Both PS+PL designs achieve real-time performance and, as expected, the INT16 PS+PL design is the fastest.

Audio Length (sec)	PS only (no threads) (sec)	PS only (4 threads) (sec)	PS + PL with Float (sec)	PS + PL with INT16 (sec)
1.000	5.154	1.448	0.790	0.476
2.170	13.423	3.755	1.858	1.118
3.540	23.230	6.499	3.095	1.830
4.275	28.332	7.918	3.728	2.249
5.000	33.608	9.359	4.362	2.609
6.220	42.307	11.786	5.448	3.274
7.120	48.674	13.542	6.237	3.769
8.040	55.204	15.359	7.037	4.240
10.390	71.727	19.949	9.108	5.484
11.290	78.293	21.772	9.906	5.997
1.000 (Base)	6.773	1.887	0.874	0.52

Figure below displays the normalized runtime for each of the separate processing blocks of the pipeline. The runtime is normalized with respect to the input audio length. The best runtime of 0.523x is provided by the INT16 PS+PL solution.

How to Run the Demo using Binaries

NOTE: Contact Xilinx Technical Marketing team for binaries.

The top level contents of the zip are as shown in the figure:

Sample audio files for testing the demo are in the “audios” folder, and the model weights are in the “weights” folder. Three executables (“STT_PS.exe”, “STT_PS_NoThreads.exe” and “STT_PSPL_200MHz.exe”) for the three designs are included along with “BOOT.bin” and “image.ub”. The xclbin file “matmul.xclbin”, which is required by “STT_PSPL_200MHz.exe”, is also present.

Steps to setup the demo:

Extract the zip and copy the contents to an SD card.
Insert the SD card into the ZCU102 board.
Make sure the board is set to SD card boot mode (follow page 21, table 2-4 in user guide).
Connect the power cable and the UART cable.
Power ON the board.
Open and set up PuTTY/Tera Term (or any other terminal emulator) on your machine/laptop with a baud rate of '115200'.
You should see a prompt root@xilinx-zcu102-2019_2:~#
Set the XRT environment variable using: export XILINX_XRT=/usr
Now mount the SD card using the command: mount /dev/mmcblk0p1 /mnt/
Enter the /mnt folder using the command: cd /mnt
Run the command “ls”; you should now be able to see the SD card contents.

PS only (no threads)

Run the application and enter the path of the audio file when prompted.
- Ex. to convert audio file 1SEC_1.wav run the command
  - ./STT_PS_No_Threads.exe
  - Enter audio file name: audios/1SEC_1.wav

The predicted text and the time taken will be printed as an output on the terminal as shown below.

PS Only (Multithreaded)

Run the application and enter the path of the audio file when prompted.
- Ex. to convert audio file 1SEC_1.wav run the command
  - ./STT_PS.exe
  - Enter audio file name: audios/1SEC_1.wav

The predicted text and the time taken will be printed as an output on the terminal.

PS+PL (Single Precision Float)

Run the application, passing the xclbin file as an argument, and enter the path of the audio file when prompted.
- Ex. to convert audio file 1SEC_1.wav run the command
  - ./STT_PSPL_200MHz.exe matmul.xclbin
  - Enter audio file name: audios/1SEC_1.wav

The predicted text and the time taken will be printed as an output on the terminal.

Due to context switching, for different runs there will be minor variation (3rd decimal place) in the time taken for conversion.


Automatic Speech Recognition on Zynq UltraScale+ MPSoC
