نادي ٢٠٢٥ | NADI 2025

Multidialectal Arabic Speech Processing

This shared task at ArabicNLP 2025 brings together three critical challenges in Arabic speech processing: Spoken Dialect Identification, Multidialectal ASR, and Diacritization Restoration. By leveraging diverse dialectal data and supporting both text-only and multimodal systems, the task aims to advance inclusive, robust, and generalizable speech technologies for the Arabic-speaking world.

Shared Task Subtasks

Three complementary tasks addressing core challenges in Arabic speech processing.

1

Spoken Arabic Dialect Identification (ADI)

Open Track

Objective

Despite notable progress in speech processing, Arabic dialect identification from speech remains a significant challenge due to the rich linguistic diversity of Arabic and the limited availability of labeled datasets. Earlier shared tasks on spoken Arabic dialect identification (e.g., Ali et al., 2017; Ali et al., 2019) laid important groundwork, but the field has since seen substantial advances, particularly with the rise of large-scale joint ASR and language identification models such as Whisper (Radford et al., 2023) and MMS (Pratap et al., 2024). This subtask is designed to (1) encourage community involvement in Arabic speech technology by providing a benchmark dataset and evaluation framework that supports innovation and collaboration, and (2) evaluate how language identification systems developed since the i-vector and x-vector eras perform on Arabic dialects. The task is of high practical relevance for multilingual AI systems, voice assistants, and digital accessibility, particularly in supporting Arabic speakers across dialect regions. It also holds promise for real-world applications, including automated transcription, conversational AI, and technologies for low-resource languages.

Dataset

  • Newly created, high-quality multidialectal Arabic speech corpus
  • 8 hours of dialect-annotated speech for adaptation
  • 8 hours for validation
  • 8 hours blind test set
  • No mandatory training data; external resources like ADI-5/ADI-17 are allowed.

Evaluation

  • Primary: Accuracy
  • Secondary: Average Cost (LRE 2022 formulation)
  • Baseline: Pretrained ECAPA-TDNN (VoxLingua107) system fine-tuned on the adaptation split
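As a sketch of how the primary metric might be computed on system outputs, the snippet below scores a list of predicted dialect labels against references. The dialect codes shown are illustrative placeholders, not the official label set, and the per-dialect breakdown is merely a suggestion for error analysis, not part of the official scoring.

```python
from collections import Counter

def accuracy(refs, hyps):
    """Fraction of utterances whose predicted dialect matches the reference."""
    assert len(refs) == len(hyps), "prediction/reference length mismatch"
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)

def per_dialect_accuracy(refs, hyps):
    """Accuracy broken down by reference dialect (useful for error analysis)."""
    totals, hits = Counter(), Counter()
    for r, h in zip(refs, hyps):
        totals[r] += 1
        hits[r] += int(r == h)
    return {d: hits[d] / totals[d] for d in totals}

# Toy example; "EGY", "MOR", "GLF" are illustrative codes only.
refs = ["EGY", "MOR", "EGY", "GLF"]
hyps = ["EGY", "EGY", "EGY", "GLF"]
print(accuracy(refs, hyps))  # 0.75
```

The secondary metric, Average Cost, follows the NIST LRE 2022 formulation and additionally weights per-dialect miss and false-alarm rates; consult the LRE 2022 evaluation plan for its exact definition.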
2

Multidialectal Arabic ASR

Open Track

Objective

The goal of this subtask is to develop ASR systems capable of accurately transcribing Arabic speech across multiple dialects. Participants must address challenges such as phonetic variation, code-switching, and dialectal diversity, using the Casablanca dataset as a benchmark for building robust, dialect-aware speech recognition models. Participants will be provided with training and validation/development data for model development, while a private, previously unseen test set will be hosted on Codabench. The subtask can be approached in a zero-shot setting, where models are evaluated directly on the validation set to prepare systems for blind testing; alternatively, participants can fine-tune models on the training data or use it for few-shot learning. For the SER subtask, participants will be evaluated directly on the test set via Codabench, without separate training or validation data provided.

Dataset

  • Casablanca dataset
  • Train: 12,800 utterances (1,600 per dialect)
  • Dev: 12,800 utterances
  • Test: 10,298 blind utterances
  • Total: 35,898 utterances

Evaluation

  • Primary: Word Error Rate (WER)
  • Secondary: Character Error Rate (CER)
  • Baseline: Zero-shot Whisper-large-v3
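For reference, WER and CER are both edit-distance ratios; the sketch below shows one self-contained way to compute them (participants would more likely use an established scoring tool, and the official scoring script may normalize text differently, e.g. for diacritics or punctuation).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[-1]

def wer(ref, hyp):
    """Word Error Rate: word-level edits divided by reference word count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    """Character Error Rate: character-level edits, with spaces removed."""
    r, h = ref.replace(" ", ""), hyp.replace(" ", "")
    return edit_distance(r, h) / len(r)
```

For example, a hypothesis that substitutes one of three reference words has WER 1/3.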
3

Diacritic Restoration

Open Track · Closed Track

Objective

The proposed task aims to advance research on automatic vowelization for spoken Arabic varieties. As the vast majority of existing vowelization or diacritic restoration efforts focus on Classical Arabic (CA) or Modern Standard Arabic (MSA), we aim to raise attention to more challenging spoken varieties, such as dialects and code-switching, with a focus on generalization across different varieties. The shared task also encourages multimodal (text/speech) approaches for distant supervision to achieve generalizable performance. Participants will be provided with baseline systems and relevant speech/text resources, and their submitted systems will be evaluated against manually annotated test sets that include CA, MSA, and dialects with code-switching instances.

Tracks

  • Track 1 (Open): Participants may use any speech or text data, including external resources, as long as test sets are excluded from training. All used resources must be documented in the system description paper.
  • Track 2 (Closed): Participants may use only the provided training/validation resources, for a fair and controlled comparison.

Datasets

  • Provided training and development sets for closed track are available here: https://huggingface.co/collections/MBZUAI/nadi-2025-sub-task-3-datasets-683739edbf94db861a4d4edf
  • The development sets are named test/development on Hugging Face. The official test set will be provided at a later date.
  • The datasets represent a wide range of Arabic varieties and recording conditions, with approximately 85K training sentences in total. They consist of dialectal, Modern Standard, Classical, and code-switched Arabic speech and diacritized transcriptions.

  Dataset    Type             Diacritized   Train
  MDASPC     Multi-dialectal  True          60,677
  TunSwitch  Dialectal, CS    True           5,212
  ArzEn      Dialectal, CS    False          3,344
  Mixat      Dialectal, CS    False          3,721
  ClArTTS    CA               True           9,500
  ArVoice    MSA              True           2,507

Table 1: Number of sentences in the datasets provided for the shared task. CA refers to Classical Arabic; CS refers to code-switched Arabic.

Evaluation

  • Primary: Word Error Rate (WER) and Character Error Rate (CER)
  • Baselines:
      - Multimodal: ArTST-based attention fusion
      - Text-only: CATT (https://arxiv.org/abs/2407.03236)
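Since Arabic diacritics are Unicode combining marks, a diacritization hypothesis can be sanity-checked by verifying that it changes only the marks and never the base letters. The sketch below illustrates this with the standard library; it is a utility suggestion, not part of the official evaluation.

```python
import unicodedata

def strip_diacritics(text):
    """Remove Arabic diacritics (harakat, shadda, tanwin), which are Unicode
    combining marks, leaving only the base letters."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))

def bases_match(ref, hyp):
    """A diacritization hypothesis should alter only diacritics, never the
    underlying letters: the two strings must agree once marks are stripped."""
    return strip_diacritics(ref) == strip_diacritics(hyp)

print(strip_diacritics("كَتَبَ"))  # كتب
```

WER and CER can then be computed over the fully diacritized strings, so that every missing or incorrect mark counts as an error.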

Possible Research Directions

  • Semi-supervised Data Augmentation: diacritization of speech transcripts using text-based diacritizers / LLMs for model training.
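One minimal way to realize this direction is to generate "silver" training pairs from undiacritized transcripts. In the sketch below, `diacritize` is a hypothetical placeholder for any text-based diacritizer or LLM wrapper returning a diacritized string with a confidence score; the confidence filter is one simple way to limit label noise.

```python
def build_silver_pairs(transcripts, diacritize, min_conf=0.9):
    """Create silver training pairs by auto-diacritizing plain transcripts.

    `diacritize` is a hypothetical callable returning
    (diacritized_text, confidence); low-confidence outputs are discarded.
    """
    pairs = []
    for text in transcripts:
        diacritized, confidence = diacritize(text)
        if confidence >= min_conf:
            pairs.append((text, diacritized))
    return pairs
```

The resulting pairs could then be mixed with the gold-diacritized datasets above for model training.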

Important Dates

Key milestones for the ArabicNLP 2025 Shared Task.

Training Data Release

June 10, 2025 (updated from June 1, 2025)

Release of training/dev data and evaluation scripts.

Registration Deadline

July 20, 2025

Final registration deadline and test set release.

Submission Deadline

July 25, 2025

Test submission deadline via Codabench.

Results Announcement

July 30, 2025

Final results released to participants.

Paper Submission

August 15, 2025

System description papers due.

Workshop

November 5–9, 2025

ArabicNLP 2025 Workshop in Suzhou, China.

Participation Guidelines

🤝 Please fill out [this form] to register and participate.

---

For each task, please participate through the Codabench link in the respective subtask section above. For any questions or clarifications, please visit our FAQ page. Detailed instructions for preparing and submitting your paper(s) can be found on the Paper Guidelines page.

Organizing Committee

Muhammad Abdul-Mageed, University of British Columbia

Bashar Talafha, University of British Columbia

Hawau Olamide Toyin, MBZUAI

Peter Sullivan, University of British Columbia

AbdelRahim Elmadany, University of British Columbia

Abdurrahman Juma, Birzeit University

Amirbek Djanibekov, MBZUAI

Chiyu Zhang, University of British Columbia

Hamad Alshehhi, MBZUAI

Hanan Aldarmaki, MBZUAI

Mustafa Jarrar, Birzeit University

Nizar Habash, New York University Abu Dhabi & MBZUAI
