This page benchmarks the implementation of a real-time speech-to-text solution on a Xilinx edge device using the Vitis accelerated flow. The model chosen for the automatic speech recognition (ASR) task is Baidu’s DeepSpeech2. Results are discussed for three designs: two run only on the processing system (PS), while the third offloads some of the computation to the programmable logic (PL) in combination with the PS. The designs are evaluated on the Zynq UltraScale+ MPSoC ZCU102 evaluation kit, and comparative results are reported.
...
Machine learning (ML) and deep learning (DL) based solutions have become the new normal. Be it image, video, speech or text, all forms of data are being processed by ML/DL solutions and utilized for a variety of applications. Speech and language processing, categorized as Natural Language Processing (NLP), is an active area where ML/DL is being applied to provide new ASR solutions. Generally, most solutions are compute-intensive and require high-end graphics processing units (GPUs) or cloud-based inference. However, GPUs are power hungry, and cloud-based solutions raise questions about privacy and data security. Field Programmable Gate Array (FPGA) solutions have evolved to combine processor subsystems with traditional FPGA logic, enabling power-efficient, edge-based solutions for these traditionally compute-intensive tasks. Due to these advantages, there has been a trend to target increasingly complex ML/DL solutions on edge-based FPGA system-on-chip (SoC) devices.
...
File Format: Wav file
Format: PCM
Format settings:
Endianness: Little
Sign: Signed
Codec ID: 1
Bit rate mode: Constant
Bit rate: 256 Kbps
Channel(s): 1 channel
Sampling rate: 16.0 kHz
Bit depth: 16 bits
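The required input format above (16 kHz, mono, 16-bit linear PCM, giving the 256 kbps constant bit rate) can be verified programmatically before running the pipeline. The sketch below is illustrative only (the function names are ours, not part of the released application) and assumes a canonical 44-byte RIFF/WAVE header.

```python
import struct

def parse_wav_header(header: bytes):
    """Parse a canonical 44-byte PCM WAV header and return the key fields."""
    if header[0:4] != b"RIFF" or header[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    # The 'fmt ' sub-chunk starts at byte 12 in a canonical header;
    # its payload (format, channels, rates, bit depth) starts at byte 20.
    audio_format, channels = struct.unpack_from("<HH", header, 20)
    sample_rate, byte_rate = struct.unpack_from("<II", header, 24)
    bits_per_sample = struct.unpack_from("<H", header, 34)[0]
    return {
        "pcm": audio_format == 1,              # Codec ID 1 = linear PCM
        "channels": channels,
        "sample_rate": sample_rate,
        "bit_rate_kbps": byte_rate * 8 // 1000,
        "bit_depth": bits_per_sample,
    }

def matches_expected_format(info: dict) -> bool:
    """Check against the format required by this demo: 16 kHz, mono, 16-bit PCM."""
    return (info["pcm"] and info["channels"] == 1
            and info["sample_rate"] == 16000 and info["bit_depth"] == 16)
```

For a conforming file, the derived bit rate is 16000 samples/s × 16 bits × 1 channel = 256 kbps, matching the metadata listed above.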
Designs Implemented
...
The next step is to offload some of the computation to the programmable logic (PL) to achieve further acceleration. As the LSTM block is the most time-consuming block, it is the ideal candidate for acceleration in the PL.
PS + PL
...
Solution (Single Precision Float)
The LSTM block mainly consists of computing the equations shown in the figure in the LSTM cell subsection. The most time-consuming operation in these equations is the matrix-vector multiplication for every gate (‘Wx’ and ‘Wy’).
As explained previously, for the ‘Wx’ computation, ‘x’ is the input of the current timestep (‘t’) and has no dependency on the previous timestep, so it is termed a non-recurrent computation. For the ‘Wy’ computation, the vector ‘y’ is the output of the previous timestep (t-1), forming a feedback path, so the ‘Wy’ computation is termed a recurrent computation. In the PS+PL design, the complete LSTM layer is accelerated in the PL. Since the ‘Wx’ computation sits outside of the non-linear functions, its per-timestep matrix-vector multiplications are converted into a single matrix-matrix multiplication. Tiling, pipelining and unrolling are carried out to accelerate the matrix-matrix multiplication on the PL, and all the gate equations, along with the non-linear functions, are also carried out on the PL.
Accelerating the recurrent ‘Wy’ computation is harder because of (1) the feedback nature of the computation, and (2) the limited on-chip memory available on the Zynq UltraScale+ device on the ZCU102 board, which results in a high communication bandwidth requirement. These requirements are efficiently handled by a variety of optimizations, while the rest of the pipeline is carried out on the ARM cores (similar to the multi-threaded PS-only solution).
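The batching idea for the non-recurrent part can be illustrated with a minimal sketch: because ‘x’ at every timestep is known in advance, the T matrix-vector products W·x_t collapse into one matrix-matrix product, which tiles and pipelines far better in hardware. This Python mock-up (plain lists, no HLS) only demonstrates the equivalence; the actual design implements it in the PL.

```python
def matvec(W, x):
    """y = W @ x for a matrix given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def wx_per_timestep(W, xs):
    """Non-batched form: one matrix-vector product per timestep."""
    return [matvec(W, x) for x in xs]

def wx_batched(W, xs):
    """Batched form: a single matrix-matrix product over all timesteps.
    Legal only because 'Wx' has no dependency on previous timesteps --
    the recurrent 'Wy' term cannot be batched this way."""
    T, n = len(xs), len(xs[0])
    # result[t][i] = sum_j W[i][j] * xs[t][j]
    return [[sum(W[i][j] * xs[t][j] for j in range(n))
             for i in range(len(W))] for t in range(T)]
```

Both forms produce identical results; the batched form simply exposes all T columns at once, so a tiled, pipelined multiplier on the PL can keep its compute units busy.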
The design is synthesized for an operating frequency of 250 MHz. The resource utilization summary of the PS+PL design is shown in the table below. Apart from Block RAM (BRAM), the utilization of other resources is relatively low.
| BRAM (18k) | DSP48E | FF | LUT |
Total Available | 1824 | 2520 | 548160 | 274080 |
Total Used | 1458 | 540 | 168794 | 131246 |
Utilization (%) | 79.93 | 21.43 | 30.79 | 47.88 |
The figure below shows the top-level block diagram of the PS+PL implementation. The audio file is read from the SD card. Feature extraction and the CNN are then processed on the PS, after which the LSTM block is executed entirely on the PL. The remainder of the pipeline is again executed on the PS, and finally the predicted text is printed on a terminal over UART. External DDR memory is used to store the model weights and the intermediate results. The weights are fetched from DDR and stored locally in the on-chip BRAM.
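The dataflow just described can be summarized as a simple host-side sequence. The sketch below is purely illustrative: the stage functions are hypothetical placeholders standing in for the real PS code and the PL kernel invocation, shown only to make the stage ordering explicit.

```python
def run_pipeline(audio_path, read_wav, extract_features, run_cnn,
                 run_lstm_pl, decode_text):
    """Mirror the block diagram: PS stages feed the PL-accelerated LSTM,
    and the remainder of the pipeline runs back on the PS."""
    samples = read_wav(audio_path)        # audio read from the SD card
    feats = extract_features(samples)     # feature extraction on the PS
    cnn_out = run_cnn(feats)              # CNN block on the PS
    lstm_out = run_lstm_pl(cnn_out)       # LSTM block offloaded to the PL
    return decode_text(lstm_out)          # post-LSTM decoding on the PS, text to UART
```

In the real design, `run_lstm_pl` would correspond to launching the PL kernel and moving data through DDR, with weights cached in on-chip BRAM.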
...
The PS+PL design is also profiled for the same ten audio samples and the results are shown in the table below.
Audio File | Audio Length (sec) | Feature Extraction (sec) | CNN Block (sec) | LSTM Block (sec) | Post LSTM (sec) | Total Time (sec) |
1SEC_1.wav | 1.000 | 0.016 | 0.186 | 0.583 | 0.005 | 0.790 |
2SEC_1.wav | 2.170 | 0.030 | 0.476 | 1.340 | 0.012 | 1.858 |
3SEC_1.wav | 3.540 | 0.047 | 0.828 | 2.199 | 0.021 | 3.095 |
4SEC_1.wav | 4.275 | 0.057 | 0.991 | 2.654 | 0.026 | 3.728 |
5SEC_1.wav | 5.000 | 0.066 | 1.149 | 3.117 | 0.030 | 4.362 |
6SEC_1.wav | 6.220 | 0.079 | 1.457 | 3.874 | 0.038 | 5.448 |
7SEC_1.wav | 7.120 | 0.091 | 1.661 | 4.441 | 0.044 | 6.237 |
8SEC_1.wav | 8.040 | 0.102 | 1.875 | 5.010 | 0.050 | 7.037 |
10SEC_1.wav | 10.390 | 0.130 | 2.456 | 6.457 | 0.065 | 9.108 |
11SEC_1.wav | 11.290 | 0.142 | 2.660 | 7.033 | 0.071 | 9.906 |
Average Time | 1.000 (Base) | 0.013 | 0.233 | 0.622 | 0.006 | 0.874 |
The following points are derived from the profiling data:
As expected, the time taken by the feature extraction, CNN and post-LSTM blocks remains the same for both the PS-only (4 threads) and PS+PL solutions.
There is approximately a 62% speedup in the processing time of the LSTM blocks after pushing the LSTM layer onto the PL.
The overall time taken by the complete pipeline is reduced to around 0.874 times the input audio length.
Compared to PS-only (4 threads) there is a reduction of around 54% in the computation time.
The PS+PL solution achieves better than real-time performance (runtime below 1x the audio length) across the different audio lengths.
Real-time performance is achieved for the non-quantized and non-pruned model with floating-point precision.
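The headline figures in the bullets above follow directly from the profiling table. The sketch below (our own helper, values transcribed from the tables) computes the normalized runtime and the reduction versus the 4-threaded PS-only solution.

```python
def normalized_runtime(total_times, audio_lengths):
    """Overall pipeline time expressed as a multiple of the input audio length."""
    return sum(total_times) / sum(audio_lengths)

# Values transcribed from the PS+PL (Float) profiling table above
audio_len  = [1.000, 2.170, 3.540, 4.275, 5.000,
              6.220, 7.120, 8.040, 10.390, 11.290]
pspl_float = [0.790, 1.858, 3.095, 3.728, 4.362,
              5.448, 6.237, 7.037, 9.108, 9.906]

rtf = normalized_runtime(pspl_float, audio_len)  # below 1.0, i.e. faster than real time

# Reduction vs PS-only (4 threads), using the reported average normalized runtimes
reduction_vs_ps4 = 1 - 0.874 / 1.887             # roughly a 54% reduction
```

Since the normalized runtime is below 1.0x, the pipeline finishes before the audio would have finished playing, which is the real-time criterion used throughout this page.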
The floating-point implementation successfully achieved real-time performance. As an extension to the PS+PL solution, we also worked on an INT16 implementation of the design, as covered in the following section.
PS + PL Solution (INT16)
Generally, INT16 quantization is one of the most preferred options for accelerating the inference of deep learning solutions on edge devices. In most scenarios there is negligible (or no) loss in accuracy when INT16 quantization is adopted. Regarding performance and resource utilization, fixed-point computations offer a great advantage on FPGAs compared to a floating-point implementation.
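Symmetric INT16 quantization maps each weight w to an integer q with w ≈ q × scale, where the scale is chosen so the largest weight magnitude lands at the edge of the INT16 range. The sketch below is a minimal illustration of that scheme (our own helper functions, assuming per-tensor scaling and a non-zero weight tensor), not the design's actual quantizer.

```python
def quantize_int16(weights):
    """Symmetric per-tensor INT16 quantization: w ~ q * scale."""
    scale = max(abs(w) for w in weights) / 32767.0   # assumes not all-zero weights
    q = [max(-32768, min(32767, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [qi * scale for qi in q]
```

With a per-tensor scale, the worst-case quantization error is half a step (scale/2), which for typical LSTM weight ranges is small enough to explain the negligible accuracy loss noted above.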
We also explored INT16 quantization for the DeepSpeech2 model. Since in our implementation only the LSTM part is accelerated on the PL, we quantized the weights of the LSTM layers to INT16. The other layers executing on the PS are left untouched and continue to use floating-point operations. For all 5 bi-directional LSTM layers, both weights and activations use INT16 precision. Lookup tables are used for the Sigmoid and Tanh calculations on the PL. The INT16 design is also synthesized for a maximum operating frequency of 250 MHz. The table below gives the utilization summary of the INT16 PS+PL solution.
| BRAM (18k) | DSP48E | FF | LUT |
Total Available | 1824 | 2520 | 548160 | 274080 |
Total Used | 1001 | 807 | 22307 | 68156 |
Utilization (%) | 54.87 | 32 | 4.07 | 24.86 |
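The Sigmoid/Tanh lookup tables mentioned above replace exponential evaluation with a table read: the function is sampled over the range where it is non-saturated, and inputs outside that range are clamped. The sketch below is an illustrative Python model of the idea (table size and range are our assumptions), not the actual PL implementation.

```python
import math

def build_sigmoid_lut(lo=-8.0, hi=8.0, n=1024):
    """Sample sigmoid at n evenly spaced points. Outside [lo, hi] the
    function saturates to ~0 or ~1, so clamping is accurate there."""
    step = (hi - lo) / (n - 1)
    table = [1.0 / (1.0 + math.exp(-(lo + i * step))) for i in range(n)]
    return table, lo, step

def sigmoid_lut(x, table, lo, step):
    """Nearest-entry lookup with clamping at the table edges."""
    i = round((x - lo) / step)
    i = max(0, min(len(table) - 1, i))
    return table[i]
```

With 1024 entries over [-8, 8], the step is about 0.016; since sigmoid's slope never exceeds 0.25, the nearest-entry error stays around 0.002, well within what INT16 activations can represent. Tanh can be handled the same way, or derived from sigmoid via tanh(x) = 2·sigmoid(2x) − 1.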
The table below gives the profiling information for the PS+PL solution with INT16.
Audio File | Audio Length (sec) | Feature Extraction (sec) | CNN Block (sec) | LSTM Block (sec) | Post LSTM (sec) | Total Time (sec) |
1SEC_1.wav | 1.000 | 0.016 | 0.186 | 0.295 | 0.005 | 0.476 |
2SEC_1.wav | 2.170 | 0.030 | 0.476 | 0.610 | 0.012 | 1.118 |
3SEC_1.wav | 3.540 | 0.047 | 0.828 | 0.965 | 0.021 | 1.830 |
4SEC_1.wav | 4.275 | 0.057 | 0.991 | 1.145 | 0.026 | 2.249 |
5SEC_1.wav | 5.000 | 0.066 | 1.149 | 1.345 | 0.030 | 2.609 |
6SEC_1.wav | 6.220 | 0.079 | 1.457 | 1.664 | 0.038 | 3.274 |
7SEC_1.wav | 7.120 | 0.091 | 1.661 | 1.902 | 0.044 | 3.769 |
8SEC_1.wav | 8.040 | 0.102 | 1.875 | 2.145 | 0.050 | 4.240 |
10SEC_1.wav | 10.390 | 0.130 | 2.456 | 2.757 | 0.065 | 5.484 |
11SEC_1.wav | 11.290 | 0.142 | 2.660 | 3.003 | 0.071 | 5.997 |
Average Time | 1.000 (Base) | 0.013 | 0.233 | 0.268 | 0.006 | 0.52 |
The following points are derived from the profiling data:
Compared to the PS-only (4 threads) solution, there is an 84% speedup for the INT16 LSTM blocks on the PL.
There is approximately a 57% speedup in the processing time of the INT16 LSTM blocks compared to the Float implementation on the PL.
The overall time taken by the complete pipeline is reduced to around 0.52 times the input audio length.
Compared to the PS+PL Float implementation, there is a reduction of around 40% in the total computation time.
The INT16 PS+PL solution also achieves real-time performance, running at around 0.52x the audio length for the different audio lengths.
Results Comparison
In this section we compare the runtimes achieved by the three designs for all ten audio samples.
The table below compares the total runtime of all the designs. Both PS+PL designs achieve real-time performance, and as expected the INT16 PS+PL design is the fastest.
...
Audio File | Audio Length (sec) | PS only (no threads) (sec) | PS only (4 threads) (sec) | PS + PL with Float (sec) | PS + PL with INT16 (sec) |
1SEC_1.wav | 1.000 | 5.154 | 1.448 | 0.790 | 0.476 |
2SEC_1.wav | 2.170 | 13.423 | 3.755 | 1.858 | 1.118 |
3SEC_1.wav | 3.540 | 23.230 | 6.499 | 3.095 | 1.830 |
4SEC_1.wav | 4.275 | 28.332 | 7.918 | 3.728 | 2.249 |
5SEC_1.wav | 5.000 | 33.608 | 9.359 | 4.362 | 2.609 |
6SEC_1.wav | 6.220 | 42.307 | 11.786 | 5.448 | 3.274 |
7SEC_1.wav | 7.120 | 48.674 | 13.542 | 6.237 | 3.769 |
8SEC_1.wav | 8.040 | 55.204 | 15.359 | 7.037 | 4.240 |
10SEC_1.wav | 10.390 | 71.727 | 19.949 | 9.108 | 5.484 |
11SEC_1.wav | 11.290 | 78.293 | 21.772 | 9.906 | 5.997 |
Average Time | 1.000 (Base) | 6.773 | 1.887 | 0.874 | 0.52 |
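The relative gains between designs follow from the average normalized runtimes in the last row of the table. This small helper (our own, with the averages transcribed from the table) makes the cross-design comparisons explicit.

```python
# Average normalized runtimes (multiples of real time) from the comparison table
designs = {
    "PS only (no threads)": 6.773,
    "PS only (4 threads)": 1.887,
    "PS + PL (Float)": 0.874,
    "PS + PL (INT16)": 0.52,
}

def speedup(baseline, improved):
    """How many times faster the improved design runs than the baseline."""
    return designs[baseline] / designs[improved]
```

For example, the INT16 PS+PL design runs roughly 13x faster than the single-threaded PS-only baseline, and its total runtime is about 40% lower than the Float PS+PL design.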
The figure below displays the normalized runtime for each of the separate processing blocks of the pipeline. The runtime is normalized with respect to the input audio length. The best runtime of 0.52x is provided by the INT16 PS+PL solution.
...
How to Run the Demo using Binaries
...
The predicted text and the time taken will be printed as an output on the terminal.
...
PS+PL (Single Precision Float)
Run the application with the path of the xclbin file passed as an argument; the audio file name is entered when prompted.
For example, to convert the audio file 1SEC_1.wav, run the command:
./STT_PSPL_200MHz.exe matmul.xclbin
Enter audio file name: audios/1SEC_1.wav
The predicted text and the time taken will be printed as an output on the terminal.
...
NOTE:
Due to context switching, there will be minor variation (in the 3rd decimal place) in the time taken for conversion across different runs.