
Multi-Attention Bottleneck for Gated Convolutional Encoder-Decoder-Based Speech Enhancement

  • Nasir Saleem
  • Teddy Surya Gunawan
  • Muhammad Shafi
  • Sami Bourouis
  • Aymen Trigui

Research output: Contribution to journal › Article › peer-review


Abstract

Convolutional encoder-decoder (CED) has emerged as a powerful architecture, particularly in speech enhancement (SE), which aims to improve the quality and intelligibility of noise-contaminated speech. This architecture leverages the strength of convolutional neural networks (CNNs) in capturing high-level features. Usually, CED architectures use a gated recurrent unit (GRU) or long short-term memory (LSTM) as a bottleneck to capture temporal dependencies, enabling an SE model to learn the dynamics and long-term temporal dependencies in the speech signal. However, Transformer neural networks with self-attention effectively capture long-term temporal dependencies. This study proposes a multi-attention bottleneck (MAB) comprising a self-attention Transformer powered by a time-frequency attention (TFA) module, followed by a channel attention module (CAM) to focus on the important features. The proposed bottleneck is integrated into a CED architecture, and the resulting model is named MAB-CED. MAB-CED uses an encoder-decoder structure with a shared encoder and two decoders, where one decoder is dedicated to spectral masking and the other to spectral mapping. Convolutional Gated Linear Units (ConvGLU) and Deconvolutional Gated Linear Units (DeconvGLU) are used to construct the encoder-decoder framework. The outputs of the two decoders are coupled by coherent averaging to synthesize the enhanced speech signal. The proposed SE model is evaluated on two databases, VoiceBank+DEMAND and LibriSpeech. The results show that it outperforms the benchmarks in terms of intelligibility and quality at various input SNRs. The proposed MAB-CED improves the average PESQ by 0.55 (22.85%) on VoiceBank+DEMAND and by 0.58 (23.79%) on LibriSpeech. The average STOI is improved by 9.63% (VoiceBank+DEMAND) and 9.78% (LibriSpeech) over the noisy mixtures.
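
The gating and two-decoder fusion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`glu`, `fuse_decoders`) and the toy values are hypothetical, and it assumes that coherent averaging can be approximated by an element-wise mean of the masking-decoder and mapping-decoder magnitude estimates, which the abstract does not specify.

```python
import math

def sigmoid(x):
    """Logistic function used as the gate activation."""
    return 1.0 / (1.0 + math.exp(-x))

def glu(features, gates):
    """Gated linear unit: scale each feature by the sigmoid of its gate."""
    return [f * sigmoid(g) for f, g in zip(features, gates)]

def fuse_decoders(noisy_mag, mask, mapped_mag):
    """Combine the two decoder outputs per time-frequency bin.

    The masking decoder predicts a mask applied to the noisy magnitude;
    the mapping decoder predicts the magnitude directly. Assumption: the
    two estimates are merged with a plain element-wise mean.
    """
    masked_mag = [m * x for m, x in zip(mask, noisy_mag)]
    return [0.5 * (a + b) for a, b in zip(masked_mag, mapped_mag)]

# Toy values for three frequency bins (illustrative only, not real data).
noisy_mag = [1.0, 2.0, 4.0]
mask = [0.5, 1.0, 0.25]
mapped_mag = [0.7, 1.8, 1.2]

gated = glu([2.0, -1.0], [0.0, 0.0])   # a gate of 0 passes half the feature
enhanced = fuse_decoders(noisy_mag, mask, mapped_mag)
```

In this sketch a gate value of zero halves the feature, and the fused estimate lies midway between the masked and mapped magnitudes in every bin.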
Original language: English
Pages (from-to): 114172-114186
Number of pages: 15
Journal: IEEE Access
Volume: 11
Issue number: 2023
Early online date: 12 Oct 2023
DOIs
Publication status: Published (in print/issue) - 30 Dec 2023

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 11 - Sustainable Cities and Communities
  2. SDG 15 - Life on Land

Keywords

  • Multi-attention
  • time-frequency attention
  • channel attention
  • transformer
  • Speech enhancement
  • gated convolutional encoder-decoder
  • Convolutional neural networks
  • Time-frequency analysis
  • Decoding
  • logic gates
  • Noise measurement
  • Transformers
  • Encoding
