Although DOLMA-NLP is interested in all low-resourced languages, mainly those of the Middle East, the languages below are the ones that enthusiastic volunteers have offered to work on. Is your language missing? Let us know and we might be able to help!
Luri Bakhtiari is spoken by more than one million people across southwestern Iran, primarily in Lorestan, Chaharmahal and Bakhtiari, Khuzestan, and Isfahan provinces. The language belongs to the Southwestern branch of West Iranian languages within the Indo-European family, positioned in the Luri continuum between Persian and Kurdish. Despite the pressures of modernization, including forced sedentarization and formal education in Persian, the Bakhtiari language remains vital to cultural identity. Traditionally, Bakhtiari society was nomadic, with long-distance seasonal migrations.
Writing System: Bakhtiari primarily uses the Perso-Arabic script. Recent standardization efforts include the Pāpêrik orthography, which adds special graphemes to the standard script.
Gilaki is an Indo-European language from the Caspian subgroup of Northwestern Iranian languages, spoken by over 4 million people along the southern Caspian Sea, mainly in Gilan province. It features four main varieties: Western, Central, Eastern, and Southern Gilaki, each with distinctive regional characteristics.
Writing System: Gilaki has a documented writing tradition spanning five centuries. Current standardization efforts include two systems: Sarkhat, which prioritizes compatibility with Persian orthography, and Vrg, which introduces specific graphemes for Gilaki phonemes.
Hawrami is spoken by approximately 300,000 speakers in the Hawraman region across the Iran-Iraq border, including parts of Iran’s Kurdistan and Kermanshah provinces and Iraq’s Halabja Governorate. Though sometimes classified as a Kurdish dialect due to shared Kurdish identity, it more commonly falls within the Gorani language group of the Northwestern Iranian languages.
Writing System: Hawrami uses the Kurdish Perso-Arabic script with additional distinctive graphemes specific to Gorani, including <ڎ> and <ۋ>.
Laki is primarily spoken in the Kermanshah and Lorestan provinces of Iran, with additional communities in the Chamchamal and Erbil regions of Iraqi Kurdistan. Recent research supports its status as a distinct language despite the characteristics it shares with both Kurdish and Lurish languages. Laki exhibits interesting linguistic constructions, including ergativity, which contribute to its unique typological profile.
Writing System: Laki employs the Kurdish Perso-Arabic script, distinguished by the characteristic grapheme <ۏ>.
Mazandarani (also known as Mazanderani, Mazani, or Tabari) is a Northwestern Iranian language spoken primarily along the eastern Caspian coastline in Iran’s Mazandaran Province by approximately 2.5-3 million speakers. It belongs to the Caspian language group and shares significant features with Gilaki. Recent studies indicate a concerning decline in language use among younger generations.
Writing System: Mazandarani has a documented literary tradition extending over five centuries. It adopts the Perso-Arabic script with minor modifications from Persian orthography, notably the distinctive diacritic <ˇ>.
Southern Kurdish is spoken by over 4 million people but has received less academic and technological attention than other Kurdish varieties. It encompasses numerous varieties, including Garusi (Bijar), Pehley (Feyli), Badrei, and Khezli (Ilam province), as well as Kolyai, Sanjabi, Krmashani (Kirmaşanî), and Kalhuri (Kermanshah province), with Kalhuri also extending into Iraqi Kurdistan.
Writing System: Southern Kurdish employs the Kurdish Perso-Arabic script, distinguished by the unique grapheme <ۊ>.
Talysh is spoken along the southwestern Caspian coast, divided between Azerbaijan and Iran by the Araxes River. It has three main dialects: Southern and Central (Iran) and Northern (Azerbaijan). Conservative estimates suggest around 1 million speakers in Azerbaijan and 700,000 in Iran. Talysh has faced assimilation pressures, but recent years have seen increased language rights activism.
Writing System: Northern Talysh in Azerbaijan uses a standardized Latin-based script, while Iranian varieties typically employ a Persian-based script with diacritical marks that has not been standardized. “Hamvaj” is another proposed orthography for Talysh.
Zazaki (also known as Dimlî or Kirmanckî) forms the Zaza-Gorani subgroup with Gorani. While linguistically distinct, speakers often identify as Kurdish. Among the languages discussed here, Zazaki has received the most extensive linguistic documentation, including comprehensive grammar books, dictionaries, and corpora.
Writing System: Zazaki employs two distinct orthographic systems: one used on Zazaki Wikipedia (influenced by Turkish orthography) and another based on Bedirxan’s Kurdish orthography used by magazines like Vate.
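For anyone processing text in these languages, the distinctive graphemes noted above for Hawrami, Laki, and Southern Kurdish can be looked up by their Unicode code points. The following is a minimal Python sketch, not part of DOLMA-NLP itself: the names `DISTINCTIVE_GRAPHEMES` and `grapheme_hints` are hypothetical, and treating a grapheme as a hint about the language is an assumption made for illustration only, since the same characters can occur in other orthographies.

```python
# Graphemes named in the writing-system notes above, keyed by code point.
# The language labels follow the text; using them as hints is an
# illustrative assumption, not a reliable language identifier.
DISTINCTIVE_GRAPHEMES = {
    "\u068E": "Hawrami",           # ڎ
    "\u06CB": "Hawrami",           # ۋ
    "\u06CF": "Laki",              # ۏ
    "\u06CA": "Southern Kurdish",  # ۊ
}


def grapheme_hints(text: str) -> dict[str, int]:
    """Count, per language label, the distinctive graphemes found in text."""
    counts: dict[str, int] = {}
    for ch in text:
        label = DISTINCTIVE_GRAPHEMES.get(ch)
        if label is not None:
            counts[label] = counts.get(label, 0) + 1
    return counts
```

Calling `grapheme_hints` on a sentence returns an empty dictionary when none of these characters is present, so the result is best read as weak evidence to combine with other signals rather than as a classifier.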