This shared task at ArabicNLP 2025 brings together three critical challenges in Arabic speech processing: Spoken Dialect Identification, Multidialectal ASR, and Diacritic Restoration. By leveraging diverse dialectal data and supporting both text-only and multimodal systems, the task aims to advance inclusive, robust, and generalizable speech technologies for the Arabic-speaking world.
Three complementary tasks addressing core challenges in Arabic speech processing.
Despite notable progress in speech processing, Arabic dialect identification from speech remains a significant challenge due to the rich linguistic diversity of Arabic and the limited availability of labeled datasets. Although earlier shared tasks on spoken Arabic dialect identification (e.g., Ali et al., 2017; Ali et al., 2019) laid important groundwork, the field has since seen substantial advances, particularly with the rise of large-scale joint ASR and language identification models such as Whisper (Radford et al., 2023) and MMS (Pratap et al., 2024). This subtask is designed to (1) encourage community involvement in Arabic speech technology by providing a benchmark dataset and evaluation framework that support innovation and collaboration, and (2) assess how well language identification systems developed since the i-vector and x-vector era perform on Arabic dialects. The task is of high practical relevance for multilingual AI systems, voice assistants, and digital accessibility, particularly in supporting Arabic speakers across dialect regions. It also holds promise for real-world applications, including automated transcription, conversational AI, and technologies for low-resource languages.
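To make the setup concrete, the sketch below shows one possible (unofficial) baseline for spoken dialect identification: mean-pooled embeddings from a pretrained Whisper encoder feeding a lightweight classifier. The checkpoint name, audio paths, and dialect labels are illustrative assumptions, not part of the task release.

```python
# Hypothetical sketch: Whisper-encoder embeddings + a logistic-regression dialect classifier.
# The checkpoint name, audio paths, and dialect labels below are illustrative placeholders.
import numpy as np
import torch
import librosa
from sklearn.linear_model import LogisticRegression
from transformers import WhisperFeatureExtractor, WhisperModel

MODEL_NAME = "openai/whisper-small"
feature_extractor = WhisperFeatureExtractor.from_pretrained(MODEL_NAME)
encoder = WhisperModel.from_pretrained(MODEL_NAME).encoder.eval()

def utterance_embedding(path: str) -> np.ndarray:
    """Mean-pool the Whisper encoder states of one 16 kHz audio file."""
    audio, _ = librosa.load(path, sr=16000)
    inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        states = encoder(inputs.input_features).last_hidden_state  # (1, frames, dim)
    return states.mean(dim=1).squeeze(0).numpy()

# train_items: (audio_path, dialect_label) pairs taken from the released training data.
train_items = [("audio/egy_0001.wav", "Egyptian"), ("audio/mor_0001.wav", "Moroccan")]
X = np.stack([utterance_embedding(path) for path, _ in train_items])
y = [label for _, label in train_items]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([utterance_embedding("audio/dev_0001.wav")]))
```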
The goal of this subtask is to develop ASR systems capable of accurately transcribing Arabic speech across multiple dialects. Participants must address challenges such as phonetic variation, code-switching, and dialectal diversity, using the Casablanca dataset as a benchmark for building robust, dialect-aware speech recognition models. For the ASR subtask, participants will be provided with training and validation/development data for model development, while a private, previously unseen test set will be hosted on Codabench. The ASR subtask can be approached in a zero-shot setting, where models are evaluated directly on the validation set to prepare systems for blind testing; alternatively, participants can fine-tune models on the training data or draw on it for few-shot learning. For the spoken dialect identification subtask, participants will evaluate their systems directly on the test set via Codabench, without separate training or validation data provided.
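As an illustration of the zero-shot setting described above, the following sketch transcribes development audio with an off-the-shelf Whisper checkpoint and scores it with word error rate via the jiwer package. The checkpoint, file layout, and the choice of WER as the metric are assumptions for this example rather than official task specifications.

```python
# Zero-shot ASR evaluation sketch (illustrative; not the official pipeline).
# The checkpoint, file paths, and WER metric are assumptions for this example.
from transformers import pipeline
from jiwer import wer

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",  # any multilingual ASR checkpoint can be swapped in
    generate_kwargs={"language": "arabic", "task": "transcribe"},
)

# dev_items: (audio_path, reference_transcript) pairs from the development split.
dev_items = [
    ("audio/dev_0001.wav", "مرحبا كيف الحال"),
    ("audio/dev_0002.wav", "وين رايح اليوم"),
]

hypotheses = [asr(path)["text"] for path, _ in dev_items]
references = [ref for _, ref in dev_items]
print(f"Dev WER: {wer(references, hypotheses):.3f}")
```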
The proposed task aims to advance research on automatic vowelization for spoken Arabic varieties. As the vast majority of existing vowelization or diacritic restoration efforts focus on Classical Arabic (CA) or Modern Standard Arabic (MSA), we aim to draw attention to more challenging spoken varieties, such as dialects and code-switching, with a focus on generalization across different varieties. The shared task also encourages multimodal (text/speech) approaches for distant supervision to achieve generalizable performance. Participants will be provided with baseline systems and relevant speech/text resources, and their submitted systems will be evaluated against manually annotated test sets that include CA, MSA, and dialects with code-switching instances.
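As a concrete illustration of what diacritic restoration involves, the sketch below strips Arabic diacritics to produce undiacritized input and computes a simple character-level diacritic error rate between a reference and a system output. The official evaluation metric is not specified here, so treat this scoring function as an assumption.

```python
# Illustrative sketch: diacritic stripping and a simple diacritic error rate (DER).
# The official evaluation metric for the subtask may differ; this scoring is an assumption.
import re

# Arabic diacritic marks: fathatan .. sukun (U+064B-U+0652) plus superscript alef (U+0670).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

def strip_diacritics(text: str) -> str:
    """Remove diacritic marks, leaving only the base characters (the system input)."""
    return DIACRITICS.sub("", text)

def diacritics_per_base(text: str) -> list:
    """Return, for each base character, the (possibly empty) diacritic string attached to it."""
    out = []
    for ch in text:
        if DIACRITICS.match(ch):
            if out:
                out[-1] += ch
        else:
            out.append("")
    return out

def der(reference: str, hypothesis: str) -> float:
    """Fraction of base characters whose diacritics differ; base text must match exactly."""
    assert strip_diacritics(reference) == strip_diacritics(hypothesis), "base text mismatch"
    ref, hyp = diacritics_per_base(reference), diacritics_per_base(hypothesis)
    errors = sum(r != h for r, h in zip(ref, hyp))
    return errors / max(len(ref), 1)

print(strip_diacritics("كَتَبَ"))        # -> "كتب" (undiacritized input)
print(round(der("كَتَبَ", "كَتَبُ"), 3))  # 1 of 3 letters mis-diacritized -> 0.333
```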
| Dataset | Type | Diacritized | Train |
|---|---|---|---|
| MDASPC | Multi-dialectal | True | 60677 |
| TunSwitch | Dialectal, code-switched | True | 5212 |
| ArzEn | Dialectal, code-switched | False | 3344 |
| Mixat | Dialectal, code-switched | False | 3721 |
| ClArTTS | CA | True | 9500 |
| ArVoice | MSA | True | 2507 |
Key milestones for the ArabicNLP 2025 Shared Task.
- Release of training/dev data and evaluation scripts.
- Final registration deadline and test set release.
- Test submission deadline via Codabench.
- Final results released to participants.
- System description papers due.
- ArabicNLP 2025 Workshop in Suzhou, China.
🤝 Please fill out [this form] to register and participate.
---
For each task, please participate through the Codabench link, which can be found in the respective subtask sections above. For any questions or clarifications, please visit our FAQ page. Detailed instructions for preparing and submitting your paper(s) can be found on the Paper Guidelines page.
- Muhammad Abdul-Mageed, University of British Columbia
- Bashar Talafha, University of British Columbia
- Hawau Olamide Toyin, MBZUAI
- Peter Sullivan, University of British Columbia
- AbdelRahim Elmadany, University of British Columbia
- Abdurrahman Juma, Birzeit University
- Amirbek Djanibekov, MBZUAI
- Chiyu Zhang, University of British Columbia
- Hamad Alshehhi, MBZUAI
- Hanan Aldarmaki, MBZUAI
- Mustafa Jarrar, Birzeit University
- Nizar Habash, New York University Abu Dhabi & MBZUAI