NADI Shared Tasks

INTRODUCTION

Arabic is a rich language with a wide collection of dialects. Many of these dialects remain under-studied, primarily due to limited resources (research funding, datasets, etc.). The goal of the Nuanced Arabic Dialect Identification (NADI) shared task series (Abdul-Mageed et al., 2020; 2021; 2022) is to alleviate this bottleneck by providing datasets and modeling opportunities for participants to carry out dialect identification and other dialect processing tasks. Dialect identification is the task of automatically detecting the source variety of a given text or speech segment. In addition to nuanced dialect identification at the country level, NADI 2022 offered a new subtask focused on country-level sentiment analysis. NADI 2023 continues this tradition of extending to tasks beyond dialect identification: we propose two new subtasks focused on machine translation (MT) from dialect to Modern Standard Arabic (MSA). One of these two new subtasks is an open track, in which participants may, under particular conditions, create datasets mapping dialectal Arabic (DA) to MSA and use them to train their MT systems.

While we invite participation in any of the subtasks, we hope that teams will submit systems to all three rather than to only one. By offering three subtasks, we hope to receive systems that exploit diverse methods and machine learning architectures. These could include multi-task learning systems as well as single-model sequence-to-sequence architectures such as text-to-text Transformers (e.g., mT5, AraT5). Many other approaches are also possible, and we look forward to creative solutions to the subtasks. We introduce the three subtasks next.
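As one illustration of the single-model, text-to-text idea mentioned above, the sketch below casts both dialect identification and dialect-to-MSA translation as source/target string pairs that a model such as mT5 or AraT5 could be fine-tuned on. The task prefixes, the `Seq2SeqExample` class, and the helper names are hypothetical choices made for this example; the shared task does not prescribe any input format, and the strings shown are placeholders rather than real data.

```python
from dataclasses import dataclass

# Hypothetical task prefixes; the shared task does not prescribe any.
DIALECT_ID_PREFIX = "identify dialect: "
TRANSLATE_PREFIX = "translate {dialect} to MSA: "


@dataclass
class Seq2SeqExample:
    source: str  # text fed to the encoder
    target: str  # text the decoder should produce


def dialect_id_example(text: str, country: str) -> Seq2SeqExample:
    """Cast country-level dialect ID (Subtask 1) as text generation."""
    return Seq2SeqExample(DIALECT_ID_PREFIX + text, country)


def translation_example(text: str, dialect: str, msa: str) -> Seq2SeqExample:
    """Cast dialect-to-MSA translation (Subtasks 2 and 3) as text generation."""
    return Seq2SeqExample(TRANSLATE_PREFIX.format(dialect=dialect) + text, msa)


# Placeholder strings stand in for real tweets and sentences.
mixed_batch = [
    dialect_id_example("<tweet text>", "Egypt"),
    translation_example("<dialect sentence>", "Egyptian", "<MSA sentence>"),
]
for ex in mixed_batch:
    print(ex.source, "->", ex.target)
```

Mixing examples from several subtasks in one batch is only one way to set up multi-task training. Note that under the closed-track rules each subtask may still only draw on its own permitted training data.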

SHARED TASK

This shared task targets country-level dialect ID (Subtask 1) and dialect to MSA machine translation (Subtasks 2 and 3). The subtasks are:

Subtask 1 (Closed Country-level Dialect ID): In this subtask, we provide a new Twitter dataset (TWT-2023) that covers 18 dialects (a total of 23.4K tweets). We split this dataset into Train (18K), Dev (1.8K), and Test (3.6K). In addition, we provide external data from the NADI 2020 (Abdul-Mageed et al., 2020), NADI 2021 (Abdul-Mageed et al., 2021), and MADAR (Bouamor et al., 2018) training datasets. We refer to these additional datasets as NADI-2020-TWT, NADI-2021-TWT, and MADAR-2018, respectively. This subtask is a closed track. In other words, participants may train their systems only on the data we provide and may not use any other external data.

Subtask 2 (Closed Dialect to MSA MT): Sentence-level machine translation from four dialects to MSA. This subtask is a closed track where participants are allowed to use only our provided training data (i.e., MADAR-parallel-corpus). Our new Test datasets for this subtask cover the following four dialects: Egyptian, Emirati, Jordanian, and Palestinian. In total, the Test set contains 2,000 sentences (500 from each dialect), and we also release a Dev set to participants (n=400 sentences, 100 from each of the four dialects). Both our Dev and Test sets for Subtask 2 are new datasets that we manually prepare for the shared task. We do not disclose the domain from which these Dev and Test sets are drawn.
For training, we restrict this subtask to the MADAR parallel dataset. The MADAR dataset can be obtained directly from its authors at MADAR-parallel-corpus. The dataset is described as follows: “The MADAR corpus is a collection of parallel sentences covering the dialects of 25 cities from the Arab World, in addition to English, French, and MSA. The corpus is created by translating selected sentences from the Basic Traveling Expression Corpus (BTEC) (Takezawa et al., 2007) to the different dialects. The exact details on the translation process and source and target languages are described in Bouamor et al. (2018).” Participants are allowed to use only the Train split of the MADAR parallel data for this subtask, and should report results on the Dev and Test sets we provide. In other words, use of the MADAR Dev and Test sets is not allowed for this subtask.

Subtask 3 (Open Dialect to MSA MT): Sentence-level machine translation from four dialects to MSA.
This is the same as Subtask 2, but allows participants to train their systems on any additional datasets so long as these additional training datasets are public at the time of submission. For example, participants are allowed to manually create new parallel datasets. For transparency and wider community benefits, we require researchers participating in the open track subtask to submit the datasets they create along with their Test set submissions. We will collect all new datasets developed by participants and facilitate their distribution from direct download links. All new datasets will be distributed with appropriate credit to their authors. Clearly, one objective of this open track shared task is to encourage the creation of new datasets for machine translation of Arabic dialects.

METRICS

The F1-score will be the official metric for Subtask 1, and the BLEU score will be the official metric for Subtask 2 and Subtask 3.
Evaluation of the shared task will be hosted through CODALAB.
- Subtask 1 (Closed Country-level Dialect ID) Codalab Link
- Subtask 2 (Closed Dialect to MSA MT) Codalab Link
- Subtask 3 (Open Dialect to MSA MT) Codalab Link
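For local sanity checks before submitting to Codalab, the two metrics can be approximated along the following lines. This is a minimal sketch and not the official scoring script: it assumes macro-averaged F1 (the averaging mode is not stated above), and it computes corpus-level BLEU with whitespace tokenization, a single reference per sentence, and no smoothing, so its numbers may differ from those reported on Codalab. The function names `macro_f1`, `ngrams`, and `corpus_bleu` are introduced here for illustration only.

```python
import math
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over classes seen in gold or pred.

    Assumes non-empty, equal-length label lists.
    """
    labels = set(gold) | set(pred)
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus BLEU (0-100): one reference each, whitespace tokens, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Toy inputs only; real labels and sentences come from the released Dev sets.
print(macro_f1(["Egypt", "Egypt", "Jordan"], ["Egypt", "Jordan", "Jordan"]))
print(corpus_bleu(["w1 w2 w3 w4 w5"], ["w1 w2 w3 w4 w6"]))
```

Because BLEU is sensitive to tokenization and smoothing choices, scores from this sketch are best used for comparing a team's own system variants rather than for predicting leaderboard values.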

FINE-TUNING CODE

We provide a fine-tuning example on Google Colab: Colab Link

DOWNLOAD DATASET

The training, development, and (unlabelled) test datasets have already been released to registered participants via email. The evaluation stage is over, but you can still score your system on Codalab through the post-evaluation phase. By downloading the NADI-2023 Shared Task files from HERE, you agree to the terms of the license. The data registration form is available at https://forms.gle/VRh58e2ypMRYzn4K6

IMPORTANT DATES

- July 18, 2023: Shared task announcement. Release of training data and scoring script.
- August 13, 2023: Registration deadline.
- August 14, 2023: Test set made available.
- August 30, 2023: Codalab TEST system submission deadline.
- September 12, 2023 (extended from September 5, 2023): Shared task system paper submissions due.
- October 12, 2023: Notification of acceptance.
- October 20, 2023: Camera-ready version.
- TBA: WANLP 2023 Conference.
* All deadlines are 11:59 PM UTC-12:00 (Anywhere On Earth).

CONTACT

For any questions related to this task, please contact the organizers directly at ubc.nadi2020@gmail.com or join the Google group.

ORGANIZERS

- Muhammad Abdul-Mageed, The University of British Columbia (Canada) and MBZUAI (UAE).
- AbdelRahim Elmadany, The University of British Columbia (Canada).
- Chiyu Zhang, The University of British Columbia (Canada).
- El Moatez Billah Nagoudi, The University of British Columbia (Canada).
- Houda Bouamor, Carnegie Mellon University (Qatar).
- Nizar Habash, New York University Abu Dhabi (UAE).

REFERENCES

@inproceedings{abdul-mageed-etal-2020-nadi,
    title = "{NADI} 2020: The First Nuanced {A}rabic Dialect Identification Shared Task",
    author = "Abdul-Mageed, Muhammad  and
      Zhang, Chiyu  and
      Bouamor, Houda  and
      Habash, Nizar",
    booktitle = "Proceedings of the Fifth Arabic Natural Language Processing Workshop",
    month = dec,
    year = "2020",
    address = "Barcelona, Spain (Online)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.wanlp-1.9",
    pages = "97--110",
}

@inproceedings{abdul-mageed-etal-2021-nadi,
    title = "{NADI} 2021: The Second Nuanced {A}rabic Dialect Identification Shared Task",
    author = "Abdul-Mageed, Muhammad  and
      Zhang, Chiyu  and
      Elmadany, AbdelRahim  and
      Bouamor, Houda  and
      Habash, Nizar",
    booktitle = "Proceedings of the Sixth Arabic Natural Language Processing Workshop",
    month = apr,
    year = "2021",
    address = "Kyiv, Ukraine (Virtual)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.wanlp-1.28",
    pages = "244--259",
}

@inproceedings{abdul-mageed-etal-2022-nadi,
    title = "{NADI} 2022: The Third Nuanced {A}rabic Dialect Identification Shared Task",
    author = "Abdul-Mageed, Muhammad  and
      Zhang, Chiyu  and
      Elmadany, AbdelRahim  and
      Bouamor, Houda  and
      Habash, Nizar",
    booktitle = "Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)",
    month = dec,
    year = "2022",
    address = "Abu Dhabi, United Arab Emirates (Hybrid)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.wanlp-1.9",
    pages = "85--97",
}

@inproceedings{bouamor-etal-2018-madar,
    title = "The {MADAR} {A}rabic Dialect Corpus and Lexicon",
    author = "Bouamor, Houda  and
      Habash, Nizar  and
      Salameh, Mohammad  and
      Zaghouani, Wajdi  and
      Rambow, Owen  and
      Abdulrahim, Dana  and
      Obeid, Ossama  and
      Khalifa, Salam  and
      Eryani, Fadhl  and
      Erdmann, Alexander  and
      Oflazer, Kemal",
    booktitle = "Proceedings of the Eleventh International Conference on Language Resources and Evaluation ({LREC} 2018)",
    month = may,
    year = "2018",
    address = "Miyazaki, Japan",
    publisher = "European Language Resources Association (ELRA)",
    url = "https://aclanthology.org/L18-1535",
}

@inproceedings{takezawa-etal-2007-multilingual,
    title = "Multilingual Spoken Language Corpus Development for Communication Research",
    author = "Takezawa, Toshiyuki  and
      Kikui, Genichiro  and
      Mizushima, Masahide  and
      Sumita, Eiichiro",
    booktitle = "International Journal of Computational Linguistics {\&} {C}hinese Language Processing, Volume 12, Number 3, September 2007: Special Issue on Invited Papers from {ISCSLP} 2006",
    month = sep,
    year = "2007",
    url = "https://aclanthology.org/O07-5005",
    pages = "303--324",
}