Multi-Attention Bottleneck for Gated Convolutional Encoder-Decoder-Based Speech Enhancement

Nasir Saleem, Teddy Surya Gunawan, Muhammad Shafi, Sami Bourouis, Aymen Trigui

Research output: Contribution to journal › Article › peer-review

12 Citations (Scopus)
1 Download (Pure)

Abstract

Convolutional encoder-decoder (CED) networks have emerged as a powerful architecture for speech enhancement (SE), which aims to improve the quality and intelligibility of noise-contaminated speech. This architecture leverages the strength of convolutional neural networks (CNNs) in capturing high-level features. CED architectures usually employ a gated recurrent unit (GRU) or long short-term memory (LSTM) as a bottleneck to capture temporal dependencies, enabling an SE model to effectively learn the dynamics and long-term temporal dependencies in the speech signal. However, Transformer networks with self-attention capture long-term temporal dependencies more effectively. This study proposes a multi-attention bottleneck (MAB) comprising a self-attention Transformer augmented with a time-frequency attention (TFA) module followed by a channel attention module (CAM) to focus on the important features. The proposed bottleneck is integrated into a CED architecture, and the resulting model is named MAB-CED. MAB-CED uses an encoder-decoder structure with a shared encoder and two decoders, where one decoder is dedicated to spectral masking and the other to spectral mapping. Convolutional gated linear units (ConvGLU) and deconvolutional gated linear units (DeconvGLU) are used to construct the encoder-decoder framework. The outputs of the two decoders are combined by coherent averaging to synthesize the enhanced speech signal. The proposed method is evaluated on two databases, VoiceBank+DEMAND and LibriSpeech. The results show that it outperforms the benchmarks in intelligibility and quality at various input SNRs: MAB-CED improves the average PESQ by 0.55 (22.85%) on VoiceBank+DEMAND and by 0.58 (23.79%) on LibriSpeech, and the average STOI by 9.63% (VoiceBank+DEMAND) and 9.78% (LibriSpeech) over the noisy mixtures.
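
The components named in the abstract can be made concrete with a short sketch. The following PyTorch code is a minimal, illustrative rendering of a ConvGLU unit and a multi-attention bottleneck (Transformer self-attention, then time-frequency and channel attention). All layer sizes, the exact TFA/CAM formulations, and the class names here are assumptions made for illustration, not the paper's actual implementation.

    # Minimal, illustrative PyTorch sketch of the MAB-CED components named in
    # the abstract. Layer sizes and the TFA/CAM formulations are assumptions;
    # the paper's actual implementation may differ.
    import torch
    import torch.nn as nn

    class ConvGLU(nn.Module):
        # Gated convolution: a feature branch modulated by a sigmoid gate branch.
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.feat = nn.Conv2d(in_ch, out_ch, 3, padding=1)
            self.gate = nn.Conv2d(in_ch, out_ch, 3, padding=1)

        def forward(self, x):
            return self.feat(x) * torch.sigmoid(self.gate(x))

    class ChannelAttention(nn.Module):
        # Squeeze-and-excitation style channel attention (assumed CAM form).
        def __init__(self, ch, r=4):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(),
                                     nn.Linear(ch // r, ch), nn.Sigmoid())

        def forward(self, x):                            # x: (B, C, T, F)
            w = self.mlp(x.mean(dim=(2, 3)))             # (B, C) channel weights
            return x * w.view(x.size(0), -1, 1, 1)

    class MultiAttentionBottleneck(nn.Module):
        # Self-attention over time frames, then time-frequency and channel attention.
        def __init__(self, ch, freq, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(ch * freq, heads, batch_first=True)
            self.norm = nn.LayerNorm(ch * freq)
            self.tfa_t = nn.Conv2d(ch, ch, 1)            # assumed TFA: per-axis gates
            self.tfa_f = nn.Conv2d(ch, ch, 1)
            self.cam = ChannelAttention(ch)

        def forward(self, x):                            # x: (B, C, T, F)
            b, c, t, f = x.shape
            s = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
            s = self.norm(s + self.attn(s, s, s)[0])     # Transformer self-attention
            x = s.reshape(b, t, c, f).permute(0, 2, 1, 3)
            tw = torch.sigmoid(self.tfa_t(x.mean(3, keepdim=True)))  # time weights
            fw = torch.sigmoid(self.tfa_f(x.mean(2, keepdim=True)))  # frequency weights
            return self.cam(x * tw * fw)

    # Shared encoder output z feeds both decoders; their outputs are combined.
    # mask_decoder, map_decoder, and noisy_spec are hypothetical placeholders,
    # and the plain average is the simplest (assumed) form of coherent averaging:
    # enhanced = 0.5 * (mask_decoder(z) * noisy_spec + map_decoder(z))

Pairing a masking decoder with a mapping decoder lets the model combine their complementary error behavior; the final averaging step above renders the abstract's coherent-averaging idea in its simplest assumed form.
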
Original language: English
Pages (from-to): 114172-114186
Number of pages: 15
Journal: IEEE Access
Volume: 11
Issue number: 2023
Early online date: 12 Oct 2023
Publication status: Published online - 12 Oct 2023

Keywords

  • Multi-attention
  • time-frequency attention
  • channel attention
  • transformer
  • Speech enhancement
  • gated convolutional encoder-decoder
  • Convolutional neural networks
  • Time-frequency analysis
  • Decoding
  • logic gates
  • Noise measurement
  • Transformers
  • Encoding
