Abstract
Code-switching, the frequent alternation between two or more languages within a single utterance, is a widespread phenomenon among bilingual and multilingual speakers. In India, more than 250 million people are estimated to engage in code-switched communication, especially blending English with Hindi (Hinglish). This makes it one of the largest bilingual populations globally and poses a challenge for developing accurate and robust Automatic Speech Recognition (ASR) systems. Existing ASR models, typically trained on monolingual corpora, struggle with code-switched input due to a lack of large, balanced, and representative datasets, particularly for diverse age groups. Recent evaluations have shown that ASR models experience a relative increase in Word Error Rate (WER) of 30–50 % when exposed to code-switched speech compared to monolingual input. To address this resource gap, we introduce a benchmark Hinglish speech corpus, HiACC, to improve ASR performance in resource-constrained settings. While several monolingual Hindi and English corpora exist, publicly available code-switched datasets remain scarce, and none to date includes children's speech. Our corpus fills this gap by providing the first code-switched Hinglish speech dataset with recordings from both adults and children. It comprises 3,318 audio segments from adult participants and 1,858 segments from children, covering 5.24 hours of read and spontaneous speech. The transcriptions include detailed annotations and code-switching tags to assist in linguistic and computational analysis. The corpus is publicly available at [https://zenodo.org/records/15551669], offering segmented audio and aligned transcripts for open research. We also present baseline ASR experiments, which show that standard models trained on monolingual data underperform by approximately 42 % WER on our test set, highlighting the complexity of the task.
To our knowledge, this is the first publicly available resource on code-switched Hinglish speech encompassing both adult and child speakers, designed to catalyse progress in this challenging yet important area of speech recognition.
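
For readers unfamiliar with the metric, the following is a minimal sketch of how Word Error Rate and a relative WER increase are conventionally computed. It is not the evaluation code used for the HiACC baselines; the function names, the example sentence, and the numeric values are hypothetical and chosen only for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Computed as the word-level Levenshtein distance divided by the
    reference length. Tokenisation here is plain whitespace splitting,
    which is an assumption; real evaluations usually normalise text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_increase(wer_monolingual: float, wer_code_switched: float) -> float:
    """Relative increase of the code-switched WER over the monolingual WER."""
    return (wer_code_switched - wer_monolingual) / wer_monolingual


# Hypothetical Hinglish utterance (not taken from the corpus): the
# recogniser substitutes two of the five reference words.
example_wer = word_error_rate(
    "mujhe kal office jaana hai",
    "mujhe call office jana hai",
)

# Hypothetical WERs of 0.20 (monolingual) and 0.28 (code-switched).
example_increase = relative_wer_increase(0.20, 0.28)
```

With these illustrative inputs, two substitutions over five reference words give a WER of 0.4, and moving from a WER of 0.20 to 0.28 is a relative increase of 40 %, which falls inside the 30–50 % range quoted in the abstract.
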
| Original language | English |
|---|---|
| Article number | 111886 |
| Pages (from-to) | 1-16 |
| Number of pages | 16 |
| Journal | Data in Brief, Elsevier |
| Volume | 62 |
| Early online date | 17 Jul 2025 |
| DOIs | |
| Publication status | Published (in print/issue) - 31 Oct 2025 |
Bibliographical note
Publisher Copyright: © 2025
Data Access Statement
HiACC: Hinglish Adult & Children Code-switched Corpus (Original data) (Zenodo)
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or non-profit sectors.
Keywords
- Code-switching
- Automatic speech recognition
- Children speech
- Adult speech
- Hinglish corpus