HiACC: Hinglish adult & children code-switched corpus

Shruti Singh, Muskaan Singh, Virender Kadyan

Research output: Contribution to journal > Article > peer-review


Abstract

Code-switching is the frequent alternation between two or more languages within a single utterance and is a widespread phenomenon among bilingual and multilingual speakers. In India, more than 250 million people are estimated to engage in code-switched communication, especially blending English with Hindi (Hinglish), making it one of the largest bilingual populations globally. This makes it challenging to develop accurate and robust Automatic Speech Recognition (ASR) systems. Existing ASR models, typically trained on monolingual corpora, struggle with code-switched input due to a lack of large, balanced, and representative datasets, particularly for diverse age groups. Recent evaluations have shown that ASR models experience a relative increase in Word Error Rate (WER) of 30–50 % when exposed to code-switched speech compared to monolingual input. To address this resource gap, we introduce a benchmark Hinglish speech corpus, HiACC, to improve ASR performance in resource-constrained settings. While several monolingual Hindi and English corpora exist, publicly available code-switched datasets remain scarce, and none to date include children's speech. Our corpus fills this gap by providing the first code-switched Hinglish speech dataset with recordings from both adults and children. It comprises 3,318 audio segments from adult participants and 1,858 segments from children, covering 5.24 hours of read and spontaneous speech. The transcriptions include detailed annotations and code-switching tags to assist in linguistic and computational analysis. The corpus is publicly available at https://zenodo.org/records/15551669, offering segmented audio and aligned transcripts for open research. We also present baseline ASR experiments, which show that standard models trained on monolingual data underperform by approximately 42 % WER on our test set, highlighting the complexity of the task.
To our knowledge, this is the first publicly available resource on code-switched Hinglish speech encompassing both adult and child speakers, designed to catalyse progress in this challenging yet important area of speech recognition.
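
The abstract reports results in terms of Word Error Rate (WER) and relative WER increase. As a minimal sketch of how these two quantities are conventionally computed (this is not the authors' evaluation code; the function names and the example sentences are hypothetical and not taken from the corpus), WER is the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the number of reference words:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_increase(baseline_wer: float, new_wer: float) -> float:
    """Relative change of new_wer with respect to a non-zero baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (new_wer - baseline_wer) / baseline_wer


# Illustrative (made-up) Hinglish sentence; two words are misrecognised.
wer = word_error_rate(
    "mujhe kal office jaana hai",
    "mujhe call office jana hai",
)
```

In this illustrative example, two of the five reference words are substituted, so the WER is 2/5. A relative increase is measured against a baseline: going from a WER of 0.20 on monolingual input to 0.28 on code-switched input would be a relative increase of 40 %, which falls inside the 30–50 % range quoted above.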
Original language: English
Article number: 111886
Pages (from-to): 1-16
Number of pages: 16
Journal: Data in Brief, Elsevier
Volume: 62
Early online date: 17 Jul 2025
Publication status: Published (in print/issue) - 31 Oct 2025

Bibliographical note

Publisher Copyright:
© 2025

Data Access Statement


HiACC: Hinglish Adult & Children Code-switched Corpus (Original data) (Zenodo)

Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or non-profit sectors.

Keywords

  • Code-switching
  • Automatic speech recognition
  • Children speech
  • Adult speech
  • Hinglish corpus
