DOLMA-NLP | Community-Driven ASR: Creating Speech Recognition for Low-Resource Middle Eastern Languages

When we started the DOLMA-NLP project, our goal was ambitious yet focused: to develop natural language processing technologies for under-represented Middle Eastern languages. Our initial plan centered on creating parallel text corpora for machine translation. However, as we collaborated with language communities, something unexpected happened: the enthusiasm of our community members carried what began as a text-based initiative into automatic speech recognition (ASR), a critical need for these languages.

Our approach was fundamentally community-driven. We built a network of over 40 volunteers, 30 of whom were highly active contributors passionate about advancing technology for their native languages. Building on this network, we extended our work to collecting audio data. The centerpiece of our data collection was a custom-built Telegram bot that streamlined the recording process. We chose Telegram because of its widespread use in the Middle East and its robust audio recording capabilities. The bot guided each contributor through a fixed workflow:

1. Users selected their language and specified their gender
2. The bot presented sentences from our parallel corpora for recording
3. Users could record, replay, and approve their submissions
4. Quality control measures verified recording length and quality
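
To make the workflow above concrete, here is a minimal Python sketch of the session logic such a bot might keep for each contributor. It is not our bot's actual code: the `Session` class, the `Step` names, and the `MIN_SECONDS` / `MAX_SECONDS` thresholds are all assumptions made for illustration, and the sketch models only the length check, not audio quality. It also leaves out the Telegram layer entirely, so it runs with the standard library alone.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional, Tuple

LANGUAGES = ("Gilaki", "Hawrami", "Laki Kurdish", "Mazandarani",
             "Southern Kurdish", "Zazaki")

# Assumed limits for illustration; the real thresholds are not stated here.
MIN_SECONDS = 1.0
MAX_SECONDS = 30.0


class Step(Enum):
    CHOOSE_LANGUAGE = auto()
    CHOOSE_GENDER = auto()
    RECORD = auto()
    REVIEW = auto()


def length_ok(duration_seconds: float) -> bool:
    """Length-only quality check (audio quality is not modeled)."""
    return MIN_SECONDS <= duration_seconds <= MAX_SECONDS


@dataclass
class Session:
    sentences: List[str]                      # prompts from the parallel corpus
    step: Step = Step.CHOOSE_LANGUAGE
    language: Optional[str] = None
    gender: Optional[str] = None
    index: int = 0                            # position of the current prompt
    pending: Optional[float] = None           # duration of the clip under review
    accepted: List[Tuple[str, float]] = field(default_factory=list)

    def _require(self, expected: Step) -> None:
        if self.step is not expected:
            raise RuntimeError(f"expected {expected.name}, got {self.step.name}")

    def choose_language(self, language: str) -> None:
        self._require(Step.CHOOSE_LANGUAGE)
        if language not in LANGUAGES:
            raise ValueError(f"unsupported language: {language}")
        self.language = language
        self.step = Step.CHOOSE_GENDER

    def choose_gender(self, gender: str) -> None:
        self._require(Step.CHOOSE_GENDER)
        self.gender = gender
        self.step = Step.RECORD

    def current_sentence(self) -> Optional[str]:
        """Return the prompt to record, or None when all prompts are done."""
        if self.index < len(self.sentences):
            return self.sentences[self.index]
        return None

    def submit_recording(self, duration_seconds: float) -> bool:
        """Accept a clip for review only if it passes the length check."""
        self._require(Step.RECORD)
        if self.current_sentence() is None:
            raise RuntimeError("no sentences left to record")
        if not length_ok(duration_seconds):
            return False                      # stay in RECORD so the user retries
        self.pending = duration_seconds
        self.step = Step.REVIEW
        return True

    def approve(self) -> None:
        """User approved the replayed clip: store it and move to the next prompt."""
        self._require(Step.REVIEW)
        self.accepted.append((self.sentences[self.index], self.pending))
        self.pending = None
        self.index += 1
        self.step = Step.RECORD

    def discard(self) -> None:
        """User rejected the replayed clip: record the same prompt again."""
        self._require(Step.REVIEW)
        self.pending = None
        self.step = Step.RECORD
```

In this sketch a contributor moves through `choose_language`, `choose_gender`, and then a loop of `submit_recording` followed by `approve` or `discard`. Modeling the flow as explicit steps keeps out-of-order messages (for example, a voice clip sent before a language is chosen) from being silently accepted.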

This approach yielded over 40 hours of speech data, comprising more than 28,000 utterances across six languages: Gilaki, Hawrami, Laki Kurdish, Mazandarani, Southern Kurdish, and Zazaki. The collection is the first substantial speech dataset for most of these languages; only Zazaki had any previous presence on platforms like Common Voice, and even then with just three hours of recorded speech.
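
For a rough sense of scale, the two headline figures imply an average utterance of about five seconds. The short calculation below is only a back-of-the-envelope estimate: both inputs are the rounded lower bounds quoted above ("over 40 hours", "more than 28,000 utterances"), and the function name is our own, so the result should not be read as a measured statistic.

```python
def mean_utterance_seconds(total_hours: float, n_utterances: int) -> float:
    """Average clip length in seconds, given total hours and utterance count."""
    return total_hours * 3600 / n_utterances


# Rounded lower bounds taken from the figures quoted in the text.
avg = mean_utterance_seconds(40, 28_000)
print(f"{avg:.1f} s per utterance")  # prints "5.1 s per utterance"
```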

Impact and Looking Forward

The success of this initiative demonstrates the power of community-driven approaches for low-resource language technology development. By integrating our speech data with our parallel text corpora, we’ve created resources that can support not just ASR but potentially speech-to-speech translation and other advanced applications.

This project underscores a critical lesson: when given the right tools and opportunities, language communities will enthusiastically participate in preserving and advancing their linguistic heritage through technology. The journey continues, and we’re excited to see how these resources will empower speakers of these languages in the digital age.