This page benchmarks the implementation of a real-time speech-to-text solution on a Xilinx edge device using the Vitis accelerated flow. The model chosen for the automatic speech recognition (ASR) task is Baidu’s DeepSpeech2. Results are discussed for three designs: two run only on the processing system (PS), while the third offloads some computation to the programmable logic (PL) in combination with the PS. The designs are evaluated on the Zynq UltraScale+ MPSoC ZCU102 evaluation kit, and the comparative results are reported.

...

Machine learning (ML) and deep learning (DL) based solutions have become the new normal. Be it image, video, speech, or text, all forms of data are being processed by ML/DL solutions and utilized for a variety of applications. Speech and language processing, categorized as Natural Language Processing (NLP), is an active area where ML/DL is being applied to provide new ASR solutions. Generally, most solutions are compute-intensive and require high-end graphics processing units (GPUs) or cloud-based inference. However, GPUs are power hungry, and cloud-based solutions raise questions about privacy and data security. Field Programmable Gate Array (FPGA) devices have evolved to combine processor subsystems with traditional FPGA logic, enabling power-efficient, edge-based solutions for these traditionally compute-intensive tasks. Due to these advantages, there has been a trend to target more and more complex ML/DL solutions on edge-based FPGA system-on-chip (SoC) devices.

...

  • File Format: Wav file

  • Format: PCM

  • Format settings:

    • Endianness: Little

    • Sign: Signed

  • Codec ID: 1

  • Bit rate mode: Constant

  • Bit rate: 256 Kbps

  • Channel(s): 1 channel

  • Sampling rate: 16.0 kHz

  • Bit depth: 16 bits

Designs Implemented

...

The next step is to offload some of the computations to the programmable logic (PL) to achieve further acceleration. As the LSTM block is the major time-consuming block, it is the ideal candidate for acceleration in the PL.

PS + PL

...

Solution (Single Precision Float)

The LSTM block mainly consists of computing the equations shown in the figure in the LSTM cell subsection. The major time-consuming operation in these equations is the matrix-vector multiplication for every gate (‘Wx’ and ‘Wy’).
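The figure itself is not reproduced here, but the standard LSTM cell equations it refers to can be sketched as a plain C++ software reference model (all names and sizes below are illustrative, not the actual PL implementation):

```cpp
#include <cmath>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>;  // row-major dense matrix

// Dense matrix-vector product: out = W * v
static Vec matvec(const Mat& W, const Vec& v) {
    Vec out(W.size(), 0.0f);
    for (size_t r = 0; r < W.size(); ++r)
        for (size_t c = 0; c < v.size(); ++c)
            out[r] += W[r][c] * v[c];
    return out;
}

static float sigmoidf(float z) { return 1.0f / (1.0f + std::exp(-z)); }

// One LSTM timestep. Wx stacks the four gate weight matrices (i, f, g, o)
// applied to the input x; Wy stacks them applied to the previous output y;
// b is the stacked bias. H is the hidden size; c and y are updated in place.
void lstm_cell(const Mat& Wx, const Mat& Wy, const Vec& b,
               const Vec& x, Vec& y, Vec& c) {
    const size_t H = y.size();
    Vec zx = matvec(Wx, x);  // non-recurrent part ('Wx' computation)
    Vec zy = matvec(Wy, y);  // recurrent part ('Wy' computation)
    for (size_t h = 0; h < H; ++h) {
        float i = sigmoidf(zx[h]        + zy[h]        + b[h]);
        float f = sigmoidf(zx[H + h]    + zy[H + h]    + b[H + h]);
        float g = std::tanh(zx[2*H + h] + zy[2*H + h]  + b[2*H + h]);
        float o = sigmoidf(zx[3*H + h]  + zy[3*H + h]  + b[3*H + h]);
        c[h] = f * c[h] + i * g;     // cell state update
        y[h] = o * std::tanh(c[h]);  // hidden/output update
    }
}
```

Note how every gate needs both a ‘Wx’ and a ‘Wy’ product per timestep, which is why these multiplications dominate the runtime.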

As explained previously, for the ‘Wx’ computation, ‘x’ is the input of the current timestep (‘t’) and there is no dependency on the previous timestep, so it is termed a non-recurrent computation. For the ‘Wy’ computation, the vector ‘y’ is the output of the previous timestep (t-1), forming a feedback path, so the ‘Wy’ computation is termed a recurrent computation. In the PS+PL design, the complete LSTM layer is accelerated in the PL. The matrix-vector multiplication is converted into matrix-matrix multiplication by bringing the ‘Wx’ computation outside of the non-linear functions. Tiling, pipelining, and unrolling are carried out to accelerate the matrix-matrix multiplication on the PL. The ‘Wy’ computation and all the gate equations, along with the non-linear functions, are also carried out on the PL. Two bottlenecks had to be handled: (1) the feedback nature of the recurrent computation, and (2) the limited on-chip memory available on the Zynq UltraScale+ device on the ZCU102 board. These result in a high communication bandwidth requirement, which is handled efficiently by using a variety of optimizations.
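As a software model of the batching and tiling described above (the tile size is an assumption for illustration; the real PL kernel would additionally use HLS pipelining and unrolling pragmas):

```cpp
#include <algorithm>
#include <vector>

// The per-timestep input vectors x_t are stacked as the columns of X, so the
// non-recurrent 'Wx' computation for all timesteps becomes one matrix-matrix
// product Z = W * X. W is MxK, X is KxN (N = timesteps), Z is MxN, row-major.
constexpr int TILE = 32;  // illustrative tile size

void tiled_matmul(const std::vector<float>& W, const std::vector<float>& X,
                  std::vector<float>& Z, int M, int K, int N) {
    std::fill(Z.begin(), Z.end(), 0.0f);
    for (int i0 = 0; i0 < M; i0 += TILE)
        for (int k0 = 0; k0 < K; k0 += TILE)
            for (int j0 = 0; j0 < N; j0 += TILE)
                // Process one TILE-sized block at a time so that, on the PL,
                // each block fits in on-chip memory and the innermost loop
                // can be pipelined/unrolled.
                for (int i = i0; i < std::min(i0 + TILE, M); ++i)
                    for (int k = k0; k < std::min(k0 + TILE, K); ++k) {
                        const float w = W[i * K + k];  // reused across the j tile
                        for (int j = j0; j < std::min(j0 + TILE, N); ++j)
                            Z[i * N + j] += w * X[k * N + j];
                    }
}
```

Each weight element is reused across a whole tile of timesteps, which is what reduces the DDR bandwidth pressure compared with repeated matrix-vector products.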

The design is synthesized for an operating frequency of 250 MHz. The resource utilization summary of the PS+PL design is shown in the table below. Apart from Block RAM (BRAM), utilization of the other resources is relatively low.

                  BRAM (18k)   DSP48E   FF       LUT
Total Available   1824         2520     548160   274080
Total Used        1458         540      168794   131246
Utilization (%)   79.93        21.42    30.79    47.88

The figure below shows the top-level block diagram of the PS+PL implementation. The audio file is read from the SD card. Feature extraction and the CNN are processed on the PS, after which the LSTM block is executed entirely on the PL. The remainder of the pipeline is again executed on the PS, and finally the predicted text is printed on a terminal over UART. External DDR memory is used to store the model weights and the intermediate results. The weights are fetched from DDR and stored locally in on-chip BRAM.


...

The PS+PL design is also profiled for the same ten audio samples and the results are shown in the table below.

Audio File     Audio Length (sec)   Feature Extraction (sec)   CNN Block (sec)   LSTM Block (sec)   Post LSTM (sec)   Total Time (sec)
1SEC_1.wav     1.000                0.016                      0.186             0.583              0.005             0.790
2SEC_1.wav     2.170                0.030                      0.476             1.340              0.012             1.858
3SEC_1.wav     3.540                0.047                      0.828             2.199              0.021             3.095
4SEC_1.wav     4.275                0.057                      0.991             2.654              0.026             3.728
5SEC_1.wav     5.000                0.066                      1.149             3.117              0.030             4.362
6SEC_1.wav     6.220                0.079                      1.457             3.874              0.038             5.448
7SEC_1.wav     7.120                0.091                      1.661             4.441              0.044             6.237
8SEC_1.wav     8.040                0.102                      1.875             5.010              0.050             7.037
10SEC_1.wav    10.390               0.130                      2.456             6.457              0.065             9.108
11SEC_1.wav    11.290               0.142                      2.660             7.033              0.071             9.906
Average Time   1.000 (Base)         0.013                      0.233             0.622              0.006             0.874

The following points are derived from the profiling data: 

  1. As expected, the time taken by the feature extraction, CNN, and post-LSTM blocks remains the same for both the PS-only (4 threads) and PS+PL solutions.

  2. There is approximately a 62% speedup in the processing time for the LSTM blocks after pushing the LSTM layer to the PL.

  3. The overall time taken by the complete pipeline is reduced to around 0.874 times the input audio length.

  4. Compared to PS-only (4 threads) there is a reduction of around 54% in the computation time.

  5. The PS+PL solution runs faster than real time (total processing time is less than 1x the audio length) across the different audio lengths.

  6. Real-time performance is achieved for the non-quantized and non-pruned model with floating-point precision.

The floating-point implementation successfully achieved real-time performance. As an extension to the PS+PL solution, we also worked on an INT16 implementation of the design, as covered in the following section.
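The real-time claim above can be expressed as a real-time factor (RTF): processing time divided by audio length, where RTF < 1 means the pipeline keeps up with the incoming audio. A trivial helper, shown only to make the metric explicit:

```cpp
// Real-time factor: total processing time divided by input audio length.
// RTF < 1.0 means the pipeline runs faster than real time.
double real_time_factor(double processing_sec, double audio_sec) {
    return processing_sec / audio_sec;
}
```

For example, the 1-second sample above gives 0.790 / 1.000 = 0.79, and on average the float PS+PL design achieves an RTF of 0.874.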

PS + PL Solution (INT16)

Generally, INT16 quantization is one of the most preferred options to accelerate the inference of deep learning solutions on edge devices. In most scenarios there is a negligible loss (or no loss at all) in accuracy when INT16 quantization is adopted. Regarding performance and resource utilization, fixed-point computations offer a great advantage on FPGAs as compared to a floating-point implementation.
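The page does not spell out the quantization scheme used. A common approach, sketched here purely as an illustration, is symmetric per-tensor scaling into the INT16 range:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Symmetric per-tensor INT16 quantization: map [-max|w|, +max|w|] onto
// [-32767, 32767] with a single scale factor. Dequantize with w ≈ q * scale.
float quantize_int16(const std::vector<float>& w, std::vector<int16_t>& q) {
    float maxabs = 0.0f;
    for (float v : w) maxabs = std::max(maxabs, std::fabs(v));
    const float scale = (maxabs > 0.0f) ? maxabs / 32767.0f : 1.0f;
    q.resize(w.size());
    for (std::size_t i = 0; i < w.size(); ++i)
        q[i] = static_cast<int16_t>(std::lround(w[i] / scale));
    return scale;
}
```

With 16 bits the quantization step is tiny relative to typical LSTM weight ranges, which is consistent with the negligible accuracy loss noted above.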

We also explored INT16 quantization for the DeepSpeech2 model. Since in our implementation only the LSTM part is accelerated on the PL, we quantized the weights of the LSTM layers to INT16. The other layers executing on the PS are left untouched and continue to use floating-point operations. For all five bi-directional LSTM layers, both weights and activations use INT16 precision. Lookup tables are used for the sigmoid and tanh calculations on the PL. The INT16 design is also synthesized for a maximum operating frequency of 250 MHz. The table below gives the utilization summary of the INT16 PS+PL solution.

                  BRAM (18k)   DSP48E   FF       LUT
Total Available   1824         2520     548160   274080
Total Used        1001         807      22307    68156
Utilization (%)   54.87        32       4.07     24.86

The table below gives the profiling information for the performance of the PS+PL solution with INT16.

Audio File     Audio Length (sec)   Feature Extraction (sec)   CNN Block (sec)   LSTM Block (sec)   Post LSTM (sec)   Total Time (sec)
1SEC_1.wav     1.000                0.016                      0.186             0.295              0.005             0.476
2SEC_1.wav     2.170                0.030                      0.476             0.610              0.012             1.118
3SEC_1.wav     3.540                0.047                      0.828             0.965              0.021             1.830
4SEC_1.wav     4.275                0.057                      0.991             1.145              0.026             2.249
5SEC_1.wav     5.000                0.066                      1.149             1.345              0.030             2.609
6SEC_1.wav     6.220                0.079                      1.457             1.664              0.038             3.274
7SEC_1.wav     7.120                0.091                      1.661             1.902              0.044             3.769
8SEC_1.wav     8.040                0.102                      1.875             2.145              0.050             4.240
10SEC_1.wav    10.390               0.130                      2.456             2.757              0.065             5.484
11SEC_1.wav    11.290               0.142                      2.660             3.003              0.071             5.997
Average Time   1.000 (Base)         0.013                      0.233             0.268              0.006             0.52

The following points are derived from the profiling data: 

  1. Compared to the PS-only (4 threads) solution, there is an 84% speedup for the INT16 LSTM blocks on the PL.

  2. There is approximately a 57% speedup in the processing time for the INT16 LSTM blocks as compared to the float implementation.

  3. The overall time taken by the complete pipeline is reduced to around 0.52 times the input audio length.

  4. Compared to the PS+PL float implementation, there is a reduction of around 40% in the total computation time.

  5. The INT16 PS+PL solution also achieves real-time performance, at around 0.52x the input audio length.

Results Comparison

In this section we compare the runtimes achieved by the designs for all ten audio samples.

The table below compares the total runtime of the four design variants. Both PS+PL designs are able to achieve real-time performance, and as expected the INT16 PS+PL design is the fastest.

...

Audio File     Audio Length (sec)   PS only (no threads) (sec)   PS only (4 threads) (sec)   PS + PL with Float (sec)   PS + PL with INT16 (sec)
1SEC_1.wav     1.000                5.154                        1.448                       0.790                      0.476
2SEC_1.wav     2.170                13.423                       3.755                       1.858                      1.118
3SEC_1.wav     3.540                23.230                       6.499                       3.095                      1.830
4SEC_1.wav     4.275                28.332                       7.918                       3.728                      2.249
5SEC_1.wav     5.000                33.608                       9.359                       4.362                      2.609
6SEC_1.wav     6.220                42.307                       11.786                      5.448                      3.274
7SEC_1.wav     7.120                48.674                       13.542                      6.237                      3.769
8SEC_1.wav     8.040                55.204                       15.359                      7.037                      4.240
10SEC_1.wav    10.390               71.727                       19.949                      9.108                      5.484
11SEC_1.wav    11.290               78.293                       21.772                      9.906                      5.997
Average Time   1.000 (Base)         6.773                        1.887                       0.874                      0.52

The figure below displays the normalized runtime for each of the separate processing blocks of the pipeline. The runtime is normalized with respect to the input audio length. The best runtime of 0.52x is provided by the INT16 PS+PL solution.

...

How to Run the Demo using Binaries

...

  • The predicted text and the time taken will be printed as an output on the terminal.

...

PS+PL (Single Precision Float)

  • Run the application with the path of the audio passed as an argument.

    • E.g., to convert the audio file 1SEC_1.wav, run the command

      • ./STT_PSPL_200MHz.exe matmul.xclbin

      • Enter audio file name: audios/1SEC_1.wav

  • The predicted text and the time taken will be printed as an output on the terminal.

...

NOTE:

Due to context switching, there will be minor run-to-run variation (in the 3rd decimal place) in the time taken for conversion.