Arabic is a rich language with a wide collection of dialects. Many of these dialects remain under-studied, primarily due to limited resources (research funding, datasets, etc.). The goal of the Nuanced Arabic Dialect Identification (NADI) shared task series (Abdul-Mageed et al., 2020; 2021; 2022) is to alleviate this bottleneck by providing datasets and modeling opportunities for participants to carry out dialect identification and other dialect-processing tasks. Dialect identification is the task of automatically detecting the source variety of a given text or speech segment. In addition to nuanced dialect identification at the country level, NADI 2022 offered a new subtask focused on country-level sentiment analysis. NADI 2023 continues this tradition of extending to tasks beyond dialect identification: we propose a new open-track subtask focused on machine translation (MT) from dialectal Arabic to Modern Standard Arabic (MSA). In this open-track subtask, participants may develop datasets under particular conditions and use them to build their systems. Specifically, participants may create datasets mapping MSA into dialectal Arabic (DA) and exploit them to train their MT systems.
While we invite participation in either of the two subtasks, we hope that teams will submit systems to both. By offering two subtasks, we hope to receive systems that exploit diverse methods and machine learning architectures. These could include multi-task learning systems as well as sequence-to-sequence, text-to-text Transformer models (e.g., mT5, AraT5) that handle both subtasks in a single model. Many other approaches are also possible, and we look forward to creative approaches to the subtasks. A hedged sketch of the text-to-text framing is given below, after which we introduce the two subtasks.
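As an illustration only (not an official baseline), the sketch below casts dialect identification as conditional generation with a text-to-text model. The checkpoint name UBC-NLP/AraT5-base, the "identify dialect:" prompt prefix, and country-name text targets are assumptions made for this sketch; a real system would first be fine-tuned on the task data.

```python
# Hedged sketch: dialect ID framed as text-to-text generation with AraT5.
# Checkpoint, prompt prefix, and label format are illustrative assumptions.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/AraT5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/AraT5-base")

# A training pair maps a prompted tweet to its country label as text.
tweet, country = "mock tweet text", "Egypt"  # placeholders
enc = tokenizer("identify dialect: " + tweet, return_tensors="pt")
labels = tokenizer(country, return_tensors="pt").input_ids
loss = model(**enc, labels=labels).loss  # standard seq2seq cross-entropy

# After fine-tuning on such pairs, prediction is plain generation:
pred_ids = model.generate(**enc, max_new_tokens=4)
print(tokenizer.decode(pred_ids[0], skip_special_tokens=True))
```

The same text-to-text model could, in principle, also serve the MT subtask by prefixing inputs with a translation instruction, which is what makes this framing attractive for multi-task setups.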
(To receive access to the data, teams intending to participate are invited to fill in the form on the official website of the NADI shared task.)
This CodaLab competition hosts the Closed Country-level Dialect ID shared task (Subtask 1).
Subtask 1 (Closed Country-level Dialect ID): In this subtask, we provide a new Twitter dataset (NADI-2023-TWT) that covers 18 dialects (a total of 23.4K tweets). We split this dataset into Train (18K), Dev (1.8K), and Test (3.6K). In addition, we provide external data from the NADI 2020 (Abdul-Mageed et al., 2020), NADI 2021 (Abdul-Mageed et al., 2021), and MADAR (Bouamor et al., 2018) train datasets. We refer to these additional datasets as NADI-2020-TWT, NADI-2021-TWT, and MADAR-2018, respectively. In other words, participants are not allowed to use any external data beyond the datasets we provide to train their systems.
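As a minimal fine-tuning sketch for this subtask (not an official baseline): the file names and the "content"/"label" TSV columns below are assumptions about the data layout, and UBC-NLP/MARBERT is just one reasonable Arabic encoder choice.

```python
# Hedged baseline sketch: 18-way dialect classification with an Arabic BERT.
# File names and column names are assumptions; adjust to the released data.
import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "UBC-NLP/MARBERT"  # any Arabic encoder could be swapped in

train_df = pd.read_csv("NADI2023_Subtask1_TRAIN.tsv", sep="\t")  # hypothetical name
dev_df = pd.read_csv("NADI2023_Subtask1_DEV.tsv", sep="\t")      # hypothetical name

labels = sorted(train_df["label"].unique())  # the 18 country labels
label2id = {lab: i for i, lab in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

class TweetDataset(Dataset):
    """Tokenizes tweets once and serves (input, label-id) pairs."""
    def __init__(self, df):
        self.enc = tokenizer(list(df["content"]), truncation=True,
                             padding=True, max_length=128)
        self.y = [label2id[lab] for lab in df["label"]]
    def __len__(self):
        return len(self.y)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.y[i])
        return item

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(labels))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=32),
    train_dataset=TweetDataset(train_df),
    eval_dataset=TweetDataset(dev_df),
)
trainer.train()
```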
Metrics:
For Subtask 1, the evaluation metrics will include precision, recall, F1-score, and accuracy. The macro-averaged F1-score will be the official metric.
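For convenience, a minimal scoring sketch using scikit-learn is shown below; the file names gold.txt and pred.txt (one label per line) are placeholders, and this is not the official scorer.

```python
# Scoring sketch: macro-averaged P/R/F1 plus accuracy, as described above.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

gold = [line.strip() for line in open("gold.txt", encoding="utf-8")]
pred = [line.strip() for line in open("pred.txt", encoding="utf-8")]

precision, recall, f1, _ = precision_recall_fscore_support(
    gold, pred, average="macro", zero_division=0)
print(f"Macro-F1 (official metric): {f1:.4f}")
print(f"Macro precision: {precision:.4f}, macro recall: {recall:.4f}")
print(f"Accuracy: {accuracy_score(gold, pred):.4f}")
```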
This is a closed track. Participating teams will be provided with a common training set and a common development set. No external manually labelled datasets are allowed. A blind test set will be used to evaluate the output of the participating teams. All teams are required to report results on the development and test sets in their write-ups.
The shared task will be hosted through CodaLab. Teams will be provided with a CodaLab link for each subtask.
August 7, 2023: Registration deadline.
All deadlines are 11:59 PM UTC-12:00 (Anywhere On Earth).
Please visit the official website of the NADI shared task for more information.
For any questions related to this task, please contact the organizers directly using the following email address: ubc.nadi2020@gmail.com
Muhammad Abdul-Mageed, Chiyu Zhang, El Moatez Billah Nagoudi, Abdelrahim Elmadany (The University of British Columbia, Canada), Nizar Habash (New York University Abu Dhabi), and Houda Bouamor (Carnegie Mellon University, Qatar)
Copyright (c) 2023 The University of British Columbia, Canada; Carnegie Mellon University Qatar; New York University Abu Dhabi. All rights reserved.
Start: July 1, 2023, midnight
Description: Development phase: Develop your models and submit prediction labels on the DEV set of Subtask 1. Note: Name your submission 'teamname_subtask1_dev_numberOFsubmission.zip', a zip file containing a text file with your predictions (e.g., the submission 'UBC_subtask1_dev_1.zip' is the zip file of the first prediction file, 'UBC_subtask1_dev_1.txt'); see the packaging sketch after the phase list.
Start: Aug. 14, 2023, midnight
Description: Test phase: Submit your prediction labels on the TEST set of Subtask 1. Each team is allowed a maximum of 3 submissions. Note: Name your submission 'teamname_subtask1_test_numberOFsubmission.zip', a zip file containing a text file with your predictions (e.g., the submission 'UBC_subtask1_test_1.zip' is the zip file of the prediction file 'UBC_subtask1_test_1.txt').
Start: Aug. 31, 2023, noon
Description: Post-Evaluation: Submit your prediction labels on the TEST set of Subtask 1 after the competition ends. Note: The naming convention is the same as in the Test phase (e.g., 'UBC_subtask1_test_1.zip' containing 'UBC_subtask1_test_1.txt').
End: Never
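For any of the phases above, a small packaging sketch like the following produces a correctly named submission; the team name "UBC", the phase, and the predictions list are placeholders.

```python
# Packaging sketch: write one predicted label per line, then zip the file
# under the required naming scheme. All values below are placeholders.
import zipfile

team, subtask, phase, run = "UBC", "subtask1", "dev", 1
predictions = ["Egypt", "Morocco", "Iraq"]  # one label per example, in order

txt_name = f"{team}_{subtask}_{phase}_{run}.txt"
with open(txt_name, "w", encoding="utf-8") as f:
    f.write("\n".join(predictions) + "\n")

with zipfile.ZipFile(f"{team}_{subtask}_{phase}_{run}.zip", "w") as zf:
    zf.write(txt_name)
```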
# | Username | Score |
---|---|---|
1 | asalhi85 | 0.8586 |
2 | Samah | 0.8543 |
3 | Dilshod | 0.8476 |