Content moderation is essential on social media platforms to support healthy online discussion. However, although multiple existing works identify offensive content in code-mixed Dravidian languages, they are restricted to classifying whole comments without identifying the parts of the content that contribute to the offensiveness. This restriction is primarily due to the lack of annotated data for offensive spans in code-mixed Dravidian languages. In this shared task, we provide code-mixed social media text in Tamil annotated with offensive spans, for which participants are expected to develop systems.
The following subtasks are part of this offensive span identification task.
Subtask 1 - Supervised Offensive Span Identification
Given YouTube comments with annotated offensive spans for training and development, systems have to identify the offensive spans in each comment in the test data. This task could be approached, for example, as named entity recognition, or by building an offensive language classifier and coupling it with interpretability methods.
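One way to set up the NER-style framing mentioned above is to convert character-offset span annotations into per-token BIO labels, which a standard sequence tagger can then be trained on. The sketch below uses whitespace tokenization and an invented example comment with hypothetical offsets; it is not taken from the shared-task data.

```python
# Minimal sketch: turn character-offset offensive spans into
# token-level BIO labels for an NER-style tagger.
def spans_to_bio(text, offensive_char_offsets):
    """Label whitespace tokens as B-OFF/I-OFF/O given offensive character offsets."""
    offsets = set(offensive_char_offsets)
    labels, pos, inside = [], 0, False
    for token in text.split():
        start = text.index(token, pos)   # character start of this token
        end = start + len(token)
        pos = end
        if any(i in offsets for i in range(start, end)):
            labels.append("I-OFF" if inside else "B-OFF")
            inside = True
        else:
            labels.append("O")
            inside = False
    return list(zip(text.split(), labels))

# Hypothetical example: characters 10-14 ("worst") marked offensive.
print(spans_to_bio("this is a worst movie", list(range(10, 15))))
```

The resulting token/label pairs can be fed to any token-classification model; at prediction time the BIO labels are mapped back to character offsets for scoring.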
Subtask 2 - Less Data Offensive Span Identification
All participants of Subtask 1 are encouraged to also submit a "Less Data approach", in which a model is trained using only part (not all) of the Subtask 1 training data. Participants need to develop systems that achieve competitive performance with limited data. There are many creative ways to do this, ranging from data subset selection and coreset methods to something as simple as selecting representative data points from clusters. Participants are welcome to explore diverse ideas, and we encourage them to release their implementations publicly to foster further research.
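As an illustration of the simplest strategy named above, one could cluster the training comments and keep only the point nearest each cluster centre as a representative subset. The sketch below uses TF-IDF features and k-means from scikit-learn on invented toy comments; it is one possible baseline, not the required approach.

```python
# Minimal sketch of cluster-based subset selection for the
# "Less Data" setting: keep one representative comment per cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def representative_subset(comments, n_clusters):
    """Return indices of the comment nearest each k-means cluster centre."""
    X = TfidfVectorizer().fit_transform(comments)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    picked = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # distance of each member to its cluster centre
        dists = np.linalg.norm(
            X[members].toarray() - km.cluster_centers_[c], axis=1)
        picked.append(int(members[np.argmin(dists)]))
    return sorted(picked)

# Illustrative toy comments (not from the shared-task data).
comments = ["padam nalla illa", "movie is bad", "super movie",
            "semma padam", "worst acting", "acting is worst"]
print(representative_subset(comments, n_clusters=3))
```

The selected indices define the reduced training set; more sophisticated variants (coreset construction, uncertainty-based selection) follow the same select-then-train pattern.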
Dataset
To download the data and participate, go to the Participate tab.
Working Notes
All participating teams are encouraged to submit working notes on their developed systems.
The paper's name format should be TEAM_NAME@DravidianLangTech-RANLP_2023: <Title of the paper>.
Example: NUIG@DravidianLangTech-RANLP_2023: Toxic Span Identification in Tamil
For electronic submission of papers to the DravidianLangTech workshop, please use this link: https://softconf.com/ranlp23/DravidianLangTech/
Each submission will be evaluated using character-level F1, as in [1].
For more details on this metric, please see https://github.com/ipavlopoulos/toxic_spans/blob/master/evaluation/semeval2021.py
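The script linked above computes, for each comment, F1 over the sets of character offsets in the predicted and gold spans. A minimal re-implementation of that per-comment score, using hypothetical offsets for illustration, looks like this:

```python
# Minimal sketch of the per-comment character-offset F1 used by
# the SemEval-2021 Toxic Spans evaluation script.
def char_f1(pred_offsets, gold_offsets):
    """F1 between predicted and gold sets of offensive character indices."""
    pred, gold = set(pred_offsets), set(gold_offsets)
    if not gold:                        # no offensive span annotated:
        return 1.0 if not pred else 0.0 # reward empty predictions only
    overlap = len(pred & gold)
    return 2 * overlap / (len(pred) + len(gold))

# Hypothetical example: gold span covers characters 10-14,
# the system predicts characters 10-19.
print(char_f1(range(10, 20), range(10, 15)))  # ≈ 0.667
```

The system-level score reported on the leaderboard is the average of this per-comment F1 over the test set.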
We accept test results only through a Google Form, which will be shared once evaluation starts; please follow the sample submission format on the Participate page. The test results, along with a summary of the approach used by each system, should be submitted via this Google Form at the end of the evaluation period.
Please cite the following papers if you use our data.
Citations
[1]
@inproceedings{toxicspans-acl,
title={Findings of the Shared Task on {O}ffensive {S}pan {I}dentification in Code-Mixed {T}amil-{E}nglish Comments},
author = "Ravikiran, Manikandan and
Chakravarthi, Bharathi Raja and
Madasamy, Anand Kumar and
Sivanesan, Sangeetha and
Rajalakshmi, Ratnavel and
Thavareesan, Sajeetha and
Ponnusamy, Rahul and
Mahadevan, Shankar",
booktitle = "Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages",
month = may,
year = "2022",
publisher = "Association for Computational Linguistics",
}
[2]
@inproceedings{ravikiran-annamalai-2021-dosa,
title = "{DOSA}: {D}ravidian Code-Mixed Offensive Span Identification Dataset",
author = "Ravikiran, Manikandan and
Annamalai, Subbiah",
booktitle = "Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages",
month = apr,
year = "2021",
address = "Kyiv",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.dravidianlangtech-1.2",
pages = "10--17",
abstract = "This paper presents the Dravidian Offensive Span Identification Dataset (DOSA) for under-resourced Tamil-English and Kannada-English code-mixed text. The dataset addresses the lack of code-mixed datasets with annotated offensive spans by extending annotations of existing code-mixed offensive language identification datasets. It provides span annotations for Tamil-English and Kannada-English code-mixed comments posted by users on YouTube social media. Overall the dataset consists of 4786 Tamil-English comments with 6202 annotated spans and 1097 Kannada-English comments with 1641 annotated spans, each annotated by two different annotators. We further present some of our baseline experimental results on the developed dataset, thereby eliciting research in under-resourced languages, leading to an essential step towards semi-automated content moderation in Dravidian languages. The dataset is available in https://github.com/teamdl-mlsg/DOSA",
}
Important Dates for shared tasks:
Task announcement: Feb 20, 2023
Release of Training data: Feb 28, 2023
Release of Test data: May 13, 2023 (Released please see the participate page)
Run submission deadline: June 1, 2023 (Google Form released; please upload your test results via the form)
Results declared: June 10, 2023
Paper submission: July 1, 2023
Peer review notification: August 5, 2023
Camera-ready paper due: August 20, 2023
Workshop dates: September 7-8, 2023
Manikandan Ravikiran, Georgia Institute of Technology/ R&D Center, Hitachi India Pvt Ltd, India
Bharathi Raja Chakravarthi, Insight SFI Research Centre for Data Analytics, School of Computer Science, University of Galway, Ireland
Anand Kumar Madasamy, National Institute of Technology Karnataka Surathkal, India
Ananth Ganesh, R&D Center, Hitachi India Pvt Ltd, India
Ratnavel Rajalakshmi, School of Computer Science and Engineering, Vellore Institute of Technology, India
Email: manikandan.ravikiran@gmail.com and bharathiraja.akr@gmail.com
Results

Team | Rank | F1 |
AJS | 1 | 0.285893 |
DLRG_RUN1 | 2 | 0.225467 |
DLRG_RUN2 | 3 | 0.213472 |