Offensive Span Identification in Tamil-DravidianLangTech@RANLP 2023

Organized by DravidianLangTech

First phase: Feb. 10, 2023, midnight UTC

Competition ends: June 10, 2023, 8:07 a.m. UTC

Offensive Span Identification in Tamil - DravidianLangTech@RANLP-2023

Content moderation is imperative on social media platforms to support healthy online discussions. Although several existing works identify offensive content in code-mixed Dravidian languages, they are restricted to classifying whole comments and do not identify the parts of the content that contribute to offensiveness. This restriction is primarily due to the lack of data annotated with offensive spans in code-mixed Dravidian languages. In this shared task, we provide code-mixed social media text in Tamil annotated with offensive spans, for which participants are expected to develop span identification systems.

The following subtasks are part of this offensive span identification task.

Subtask 1 - Supervised Offensive Span Identification

Given YouTube comments and annotated offensive spans for training and development, systems have to identify the offensive spans in each comment of the test data. This task could be approached in several ways, for example as a Named Entity Recognition task, or by building an offensive language classifier and coupling it with interpretability methods.
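As an illustration of the Named Entity Recognition framing, the sketch below converts character-offset annotations into token-level BIO tags. It assumes that spans are given as a list of offensive character offsets per comment, uses plain whitespace tokenization, and treats adjacent offensive tokens as one span. The function name `spans_to_bio` and the example sentence are hypothetical and are not taken from the task data.

```python
import re

def spans_to_bio(text, offensive_offsets):
    """Map offensive character offsets to BIO tags over whitespace tokens.

    Assumption: a token is offensive if any of its characters is annotated,
    and adjacent offensive tokens are merged into a single span.
    """
    offensive = set(offensive_offsets)
    tokens, tags = [], []
    prev_offensive = False
    for match in re.finditer(r"\S+", text):
        is_offensive = any(i in offensive for i in range(match.start(), match.end()))
        if not is_offensive:
            tag = "O"
        elif prev_offensive:
            tag = "I"  # continues the span started by an earlier token
        else:
            tag = "B"  # first token of a new span
        tokens.append(match.group())
        tags.append(tag)
        prev_offensive = is_offensive
    return tokens, tags

# Hypothetical example: the span covers "total idiot" (offsets 10 to 20).
tokens, tags = spans_to_bio("you are a total idiot ok", list(range(10, 21)))
```

The resulting tags can be fed to any sequence labelling model; predictions then have to be mapped back to character offsets before submission.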


Subtask 2 - Less Data Offensive Span Identification

All participants of Subtask 1 are encouraged to also submit a "Less Data approach", in which the model is trained using only part (not all) of the Subtask 1 training data. Participants need to develop systems that achieve competitive performance with limited data. There are a plethora of creative ways to do this, ranging from data subset selection and coreset theory to something as simple as selecting representative data points from clusters. Participants are welcome to explore diverse and interesting ideas. We further encourage participants to release their implementations publicly to foster further research.
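One possible way to pick a training subset is greedy k-center selection, a common coreset-style heuristic. The sketch below is only an illustration under assumptions: each comment is assumed to be already encoded as a numeric feature vector (the encoding is left to the participant), and the function name `k_center_greedy` and its parameters are hypothetical. It is not a method prescribed by the organizers.

```python
import math

def k_center_greedy(vectors, budget, start=0):
    """Select `budget` indices so every point lies near some selected point.

    Assumption: `vectors` is a list of equal-length numeric sequences,
    one per training comment. Requires Python 3.8+ for math.dist.
    """
    if not vectors or budget <= 0:
        return []
    budget = min(budget, len(vectors))
    selected = [start]
    # Distance from each point to its nearest selected point so far.
    min_dist = [math.dist(v, vectors[start]) for v in vectors]
    while len(selected) < budget:
        # The point farthest from the current selection is the least covered.
        nxt = max(range(len(vectors)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, v in enumerate(vectors):
            min_dist[i] = min(min_dist[i], math.dist(v, vectors[nxt]))
    return selected

# Toy 2-D features: two tight pairs and one isolated point.
features = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 0)]
subset = k_center_greedy(features, budget=3)
```

On the toy data the selection covers each region once instead of picking near-duplicates, which is the behaviour one usually wants from a reduced training set.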

Dataset

To download the data and participate, go to the Participate tab.

Working Notes

All the participating teams are encouraged to submit working notes of the developed system.

The paper's name format should be TEAM_NAME@DravidianLangTech-RANLP_2023: <Title of the paper>. 

Example: NUIG@DravidianLangTech-RANLP_2023: Toxic Span Identification in Tamil

For electronic submission of papers to the DravidianLangTech workshop, please use this link: https://softconf.com/ranlp23/DravidianLangTech/

 
Guidelines
 
The following are some general guidelines to keep in mind while submitting the working notes.

  • Do a basic sanity check for grammatical errors and reported results.
  • Papers should have sufficient information for reproducing the mentioned results.
  • Papers should follow the appropriate style (we will use the RANLP 2023 style: details below).
  • Check the papers for text reuse and plagiarism. This includes self-plagiarism as well. We would like to stress this point, as ACL is quite strict about it. Any paper found to have plagiarized content will be rejected without further consideration.
  • Please ensure the author names do not have any salutations such as Dr., Prof., etc. in the final version.
 
All submissions should be in the double-column RANLP 2023 format. Authors should use one of the templates below:
 
For more details and queries, please email the organizers.
 
Google Groups
We also have a Google Group: https://groups.google.com/g/dravidianlangtech-toxicspan/ . We request everyone to join it for periodic announcements.

Evaluation Criteria

Each submission will be evaluated with character-level F1, as in [1].

For more details on this metric please see https://github.com/ipavlopoulos/toxic_spans/blob/master/evaluation/semeval2021.py 
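For orientation, character-level F1 can be sketched as follows. This is a minimal sketch in the spirit of the linked script, not a copy of it, and the linked script remains the authoritative implementation. It assumes that predictions and gold annotations are lists of character offsets, that a comment with no gold offsets scores 1 only when the prediction is also empty, and that per-comment scores are averaged. The function names are hypothetical.

```python
def char_f1(predicted, gold):
    """F1 between predicted and gold character offsets for one comment."""
    pred_set, gold_set = set(predicted), set(gold)
    if not gold_set:
        # Nothing offensive in the gold data: only an empty prediction is correct.
        return 1.0 if not pred_set else 0.0
    if not pred_set:
        return 0.0
    overlap = len(pred_set & gold_set)
    return 2 * overlap / (len(pred_set) + len(gold_set))

def mean_char_f1(all_predicted, all_gold):
    """Average the per-comment F1 over all comments (assumed aggregation)."""
    scores = [char_f1(p, g) for p, g in zip(all_predicted, all_gold)]
    return sum(scores) / len(scores) if scores else 0.0

# Two of three predicted offsets match two of three gold offsets.
score = char_f1([1, 2, 3], [2, 3, 4])
```

Because the score is computed over character offsets rather than tokens, predicting a span that is slightly too long or too short is penalized only partially.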

Submission Procedure

We accept test results only through the Google form. Please see the Participate page for the submission format. The Google form for submissions is released here. Please submit according to the sample submission format on the Participate page.

Additional Details

The results, along with a summary of the approach used by each system, should be submitted at the end via the Google form here.

 

Terms and Conditions

  • By downloading the data or by accessing it in any manner, you agree not to redistribute the data except for non-commercial and academic-research purposes. The data must not be used for providing surveillance, analyses or research that isolates a group of individuals or any single individual for any unlawful or discriminatory purpose.
  • This task has a single evaluation phase. To be considered a valid participation/submission in the task's evaluation, you agree to submit a single (possibly empty) list of character offsets (as in the task overview) per test text (post), for every test text. 
  • Each team must create and use exactly one CodaLab account.
  • The organizers and the organizations they are affiliated with make no warranties regarding the datasets provided, including but not limited to being correct or complete. They cannot be held liable for providing access to the datasets or the usage of the datasets.
  • Each task participant will be assigned at least one other team's system description paper for review, using the START system. The papers will thus be peer reviewed.
  • The datasets will not be made publicly available by the participants.

You should cite the following papers if you use our data.

Citations
==========
[1]

@inproceedings{toxicspans-acl,
title = "Findings of the Shared Task on {O}ffensive {S}pan {I}dentification in Code-Mixed {T}amil-{E}nglish Comments",
author = "Ravikiran, Manikandan and
Chakravarthi, Bharathi Raja and
Madasamy, Anand Kumar and
Sivanesan, Sangeetha and
Rajalakshmi, Ratnavel and
Thavareesan, Sajeetha and
Ponnusamy, Rahul and
Mahadevan, Shankar",
booktitle = "Proceedings of the Second Workshop on Speech and Language Technologies for Dravidian Languages",
month = may,
year = "2022",
publisher = "Association for Computational Linguistics",
}


[2]

@inproceedings{ravikiran-annamalai-2021-dosa,
title = "{DOSA}: {D}ravidian Code-Mixed Offensive Span Identification Dataset",
author = "Ravikiran, Manikandan and
Annamalai, Subbiah",
booktitle = "Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages",
month = apr,
year = "2021",
address = "Kyiv",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.dravidianlangtech-1.2",
pages = "10--17",
abstract = "This paper presents the Dravidian Offensive Span Identification Dataset (DOSA) for under-resourced Tamil-English and Kannada-English code-mixed text. The dataset addresses the lack of code-mixed datasets with annotated offensive spans by extending annotations of existing code-mixed offensive language identification datasets. It provides span annotations for Tamil-English and Kannada-English code-mixed comments posted by users on YouTube social media. Overall the dataset consists of 4786 Tamil-English comments with 6202 annotated spans and 1097 Kannada-English comments with 1641 annotated spans, each annotated by two different annotators. We further present some of our baseline experimental results on the developed dataset, thereby eliciting research in under-resourced languages, leading to an essential step towards semi-automated content moderation in Dravidian languages. The dataset is available in https://github.com/teamdl-mlsg/DOSA",
}

Important Dates for shared tasks:

Task announcement: Feb 20, 2023

Release of Training data: Feb 28, 2023

Release of Test data: May 13, 2023 (released; please see the Participate page)

Run submission deadline: June 1, 2023 (the Google form for submissions is released; please upload your test results there)

Results declared: June 10, 2023

Paper submission: 1 July 2023

Peer review notification: 5 August 2023

Camera-ready paper due: 20 August 2023

Workshop Dates: 7-8 September 2023

Manikandan Ravikiran, Georgia Institute of Technology/ R&D Center, Hitachi India Pvt Ltd, India

Bharathi Raja Chakravarthi, Insight SFI Research Centre for Data Analytics, School of Computer Science, University of Galway, Ireland

Anand Kumar Madasamy, National Institute of Technology Karnataka Surathkal, India

Ananth Ganesh, R&D Center, Hitachi India Pvt Ltd, India

Ratnavel Rajalakshmi, School of Computer Science and Engineering, Vellore Institute of Technology

 

 

Email: manikandan.ravikiran@gmail.com and bharathiraja.akr@gmail.com

Team       Rank  F1
AJS        1     0.285893
DLRG_RUN1  2     0.225467
DLRG_RUN2  3     0.213472
