Welcome to the shared task GUA-SPA - Guarani-Spanish Code-Switching Analysis, a task to automatically detect and analyze instances of code-switching between Guarani and Spanish in news and social media. This task is part of IberLEF 2023.
Guarani is a South American indigenous language that belongs to the Tupi-Guarani family, and Spanish is a Romance language that belongs to the Indo-European family. The two languages have been in contact in the South American region for about 500 years [Rodríguez, 2018], resulting in many interesting varieties with different levels of mixture. Paraguay is a South American country where Guarani and Spanish are the two official languages (Ley de Lenguas, [Ley 4251]). According to the most recent census in Paraguay, most of the country's population speaks at least some Guarani, and Guarani-Spanish bilingualism is highly prevalent in urban areas, while Guarani monolingualism is mostly limited to rural areas.
Bilingual speakers often make use of the two languages at the same time, mixing them in different ways, in a phenomenon called code-switching [Joshi, 1982]. This phenomenon is very frequent in situations where two or more languages come into contact. In Paraguay, this has resulted in several identified language varieties that combine Guarani and Spanish [Kallfell, 2016].
There have been a number of competitions focusing on the detection and analysis of code-switching, starting with [Solorio et al., 2014], which addressed language identification in code-switched data for several language pairs, including Spanish-English. Later competitions added more complex tasks in code-switched contexts, such as NER [Aguilar et al., 2020] and MT into the code-switched languages [Chen et al., 2021]. In our case, we propose language identification (Task 1), NER (Task 2), and a novel classification task for Spanish spans in a code-switched Guarani-Spanish context (Task 3).
Guarani is considered a low-resource language [Joshi et al., 2020] because, despite having millions of speakers, it has few digital resources, its written use online is scarce, and it has been largely under-researched from the NLP perspective. The situation of this and other American indigenous languages could change in the future, as there are now some initiatives to build resources for these languages [Mager et al., 2021], but there is still a long way to go. Spanish, on the other hand, is among the very resource-rich languages [Joshi et al., 2020], which benefits this competition: there are many tools for Spanish that could be leveraged to see how they perform in this context.
The expected target audience is NLP researchers interested in working with low-resource languages and code-switched data, as well as researchers interested in NER and MT in general.
We propose a challenge for analyzing code-switched texts in Guarani and Spanish, trying to identify the language used in each span of text, the named entities mentioned in the text, and the way Spanish is used. The challenge will be structured as three tasks:
Given a text (sequence of tokens), label each token of the sequence with one of the following categories: gn, es, mix, ne, foreign, other.
Examples:
→ che kuerai de pagar 6000 gs. por una llamada de 40 segundos . son aliados del gobierno parece ustedes
Could be tagged as: che/gn kuerai/gn de/es pagar/es 6000/other gs./other por/es una/es llamada/es de/es 40/other segundos/es ./other son/es aliados/es del/es gobierno/es parece/es ustedes/es
→ Ministerio de Salud omombe'u ko'ã káso malaria ojuhúva importado Guinea Ecuatorial guive ha oîma jesareko ohapejokóvo jeipyso .
Could be tagged as: Ministerio/ne de/ne Salud/ne omombe'u/gn ko'ã/gn káso/mix malaria/es ojuhúva/gn importado/es Guinea/ne Ecuatorial/ne guive/gn ha/gn oîma/gn jesareko/gn ohapejokóvo/gn jeipyso/gn ./other
The metrics for task 1 are accuracy, weighted precision, weighted recall and weighted F1. The main metric is weighted F1.
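The Task 1 metrics can be illustrated with a minimal sketch. It assumes that "weighted" means each label's score is weighted by its number of gold tokens, which is the usual convention; the official scorer may differ in details. The function name `weighted_scores` is hypothetical.

```python
from collections import Counter


def weighted_scores(gold, pred):
    """Accuracy plus support-weighted precision, recall and F1.

    Assumption: each label is weighted by its share of gold tokens.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("at least one token is required")

    support = Counter(gold)    # gold tokens per label
    predicted = Counter(pred)  # predicted tokens per label
    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    total = len(gold)

    precision = recall = f1 = 0.0
    for label, n in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / n
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {
        "accuracy": sum(correct.values()) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


scores = weighted_scores(
    ["gn", "gn", "es", "es", "other"],
    ["gn", "es", "es", "es", "other"],
)
print(scores)
```

Labels that appear only in the predictions have no gold support, so under this assumption they lower precision for other labels but carry no weight of their own.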
Given a text (sequence of tokens), identify the named entities as spans in the text, and classify each one with a category: person, location or organization. These must be marked in the tokens using BIO labels (only B and I markers are necessary): ne-b-per, ne-b-loc, ne-b-org, ne-i-per, ne-i-loc, ne-i-org.
Examples:
→ [ORG Ministerio de Salud] omombe'u ko'ã káso malaria ojuhúva importado [LOC Guinea Ecuatorial] guive ha oîma jesareko ohapejokóvo jeipyso .
→ [PER Ministra de Hacienda Lea Giménez] he'i oñepromulga léi capitalidad ary 2014
The metrics for task 2 are precision, recall and F1, either labeled or unlabeled. The criterion for finding a named entity is exact match. The main metric is labeled F1.
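The exact-match criterion can be sketched as follows. This is an illustration under stated assumptions, not the official scorer: the helper names `extract_spans` and `span_f1` are hypothetical, a stray I marker without a preceding B is assumed to open a new span, and the functions score one text at a time.

```python
def extract_spans(labels, prefix="ne"):
    """Collect (start, end, category) spans from labels like ne-b-org / ne-i-org.

    `end` is exclusive. Labels without the given prefix are outside any span.
    """
    spans = []
    start = category = None
    for i, label in enumerate(labels):
        parts = label.split("-")
        is_span_label = (
            len(parts) == 3 and parts[0] == prefix and parts[1] in ("b", "i")
        )
        continues = (
            is_span_label
            and parts[1] == "i"
            and start is not None
            and parts[2] == category
        )
        if continues:
            continue
        if start is not None:
            spans.append((start, i, category))
            start = category = None
        if is_span_label:
            # A "b" marker (or, by assumption, a stray "i") opens a new span.
            start, category = i, parts[2]
    if start is not None:
        spans.append((start, len(labels), category))
    return spans


def span_f1(gold_labels, pred_labels, prefix="ne", labeled=True):
    """Exact-match precision, recall and F1 over spans of a single text."""
    gold = set(extract_spans(gold_labels, prefix))
    pred = set(extract_spans(pred_labels, prefix))
    if not labeled:
        # Unlabeled variant: only the span boundaries have to match.
        gold = {(s, e) for s, e, _ in gold}
        pred = {(s, e) for s, e, _ in pred}
    hits = len(gold & pred)
    p = hits / len(pred) if pred else 0.0
    r = hits / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = ["ne-b-org", "ne-i-org", "ne-i-org", "gn", "ne-b-loc", "ne-i-loc", "gn"]
pred = ["ne-b-org", "ne-i-org", "ne-i-org", "gn", "ne-b-org", "ne-i-org", "gn"]
print(span_f1(gold, pred, labeled=True))
print(span_f1(gold, pred, labeled=False))
```

In this example the second predicted entity has the right boundaries but the wrong category, so it counts as a match only in the unlabeled variant. Passing `prefix="es"` would apply the same matching to span labels such as es-b-cc and es-i-ul.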
Given a text (sequence of tokens), identify spans of text in Spanish and label them in one of these categories:
These must be marked in the tokens using BIO labels (only B and I markers are necessary): es-b-cc, es-b-ul, es-i-cc, es-i-ul.
Examples:
→ che kuerai [CC de pagar 6000 gs. por una llamada de 40 segundos . son aliados del gobierno parece ustedes]
→ Okañývo pe Policía Nacional ha orekóva caso omomarandúvo Fiscalía , peteî [UL investigación ámbito penal] .
The metrics for task 3 are precision, recall and F1, either labeled or unlabeled. The criterion for finding a Spanish span is exact match. The main metric is labeled F1.
These are the final results of the competition on the test data:
User | Task 1 - wF1 | Task 2 - Labeled F1 | Task 3 - Labeled F1 |
---|---|---|---|
pughrob | 0.9381 (1) | 0.7028 (1) | 0.3836 (1) |
tsjauhia | 0.9139 (2) | - | - |
amunozo | 0.8500 (3) | 0.4153 (3) | 0.1939 (3) |
baseline | 0.7325 (4) | 0.4946 (2) | 0.2195 (2) |
pakapro | 0.0452 (5) | - | - |
The system description papers will be published in the proceedings of IberLEF 2023. The IberLEF organizers have sent us the following instructions for the papers:
* System description papers should be formatted according to the uniform 1-column CEURART style. Latex and Word templates can be found in: https://ceur-ws.org/Vol-XXX/CEURART.zip
* The minimum length of a regular paper should be 5 pages. There is no maximum page limit.
* Papers must be written in English.
* The copyright year command must be changed to \copyrightyear{2023}.
* The conference command must be changed to \conference{IberLEF 2023, September 2023, Jaén, Spain}
* Eliminate the numbering in the pages of the paper, if there is one, and make sure that there are no headers or footnotes, except the mandatory copyright as a footnote on the first page.
* Authors should be described with their name and their full affiliation (university and country). Names must be complete (no initials), e.g. “María García” instead of “M. García”.
* Titles of papers should be in emphatic capital English notation, i.e., "Filling an Author Agreement by Autocompletion" rather than "Filling an author agreement by autocompletion".
* At least one author of each paper must sign the CEUR copyright agreement. The signed form must be sent along with the paper to the task organizers. Important: it must be physically signed with pen on paper. There are two agreement variants; select the one that fits your case.
1. AUTHOR-AGREEMENT (NTP): Authors shall use this form if they included no copyrighted third party material in their paper text (or accompanying sources, datasets). This is the right variant in most cases.
2. AUTHOR-AGREEMENT (TP): Authors shall use this form if they did include copyrighted third party material in their paper or accompanying material. They must then also attach a copy of the permission by the third party to use this material in the signed author agreement!
* In the field "Name and year of the event" of the CEUR agreement, write: IberLEF 2023.
* In the field "Editors of the proceedings (editors)", the following names must appear: Manuel Montes-y-Gómez, Francisco Rangel, Salud María Jiménez-Zafra, Marco Casavantes, Begoña Altuna, Miguel Ángel Álvarez Carmona, Gemma Bel-Enguix, Luis Chiruzzo, Iker de la Iglesia, Hugo Jair Escalante, Miguel Ángel García-Cumbreras, José Antonio García-Díaz, José Ángel Gónzalez Barba, Roberto Labadie Tamayo, Salvador Lima, Pablo Moral, Flor Miriam Plaza del Arco, Rafael Valencia-García
* In your paper, please cite the overview paper with the following reference:
@article{gua-spa-overview,
title={{Overview of GUA-SPA at IberLEF 2023: Guarani-Spanish Code-Switching Analysis}},
author={ Chiruzzo, Luis and Agüero-Torales, Marvin and Giménez-Lugo, Gustavo and Alvarez, Aldo and Rodríguez, Yliana and Góngora, Santiago and Solorio, Thamar },
journal = {Procesamiento del Lenguaje Natural},
volume = {71},
number = {0},
pages={},
year = {2023},
keywords = {},
issn = {1989-7553},
}
Reference in APA style: Chiruzzo, L., Agüero-Torales, M., Giménez-Lugo, G., Alvarez, A., Rodríguez, Y., Góngora, S., & Solorio, T. (2023). Overview of GUA-SPA at IberLEF 2023: Guarani-Spanish Code-Switching Analysis. Procesamiento del Lenguaje Natural, 71.
* Please remember the deadline to submit the working notes is June 14th, 2023.
You can also find more information here: https://sites.google.com/view/iberlef-2023/working-notes
The submissions in this competition will be evaluated and scored using:
- Task 1: weighted F1 score.
- Task 2: labeled F1 score.
- Task 3: labeled F1 score.
The organizers of the task are:
If you have any questions, contact us at luischir@fing.edu.uy
The submission must be a zip file that contains a single txt file.
The txt file must have the following format: one token per line, including its numeric id, (optionally) the token text, and the predicted label.
The sequence of tokens must follow the development data format. Please include the line breaks after the end of each text.
For example:
1 Péva gn
2 , other
3 ombotove gn
4 mbarete gn
5 odefini mix
6 situación es-b-ul
7 encuesta es-i-ul
8 rupive gn
9 . other

1 Cámara ne-b-org
2 de ne-i-org
3 Diputados ne-i-org
4 omonéî gn
5 ampliación es-b-cc
6 presupuestaria es-i-cc
Notice that there is only one label per token. If you just want to participate in task 1, you can use the labels {gn,es,mix,ne,foreign,other}.
If you also want to participate in task 2, your "ne" tags must include the named entity BIO information {ne-b-per, ne-i-per, ne-b-org, ne-i-org, ne-b-loc, ne-i-loc}.
If you also want to participate in task 3, your "es" tags must include the code mixing BIO information {es-b-cc, es-i-cc, es-b-ul, es-i-ul}.
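A minimal sketch of producing a submission in the format described above. The function names, the file name `predictions.txt`, and the single-space separator are assumptions made for illustration; the separator in particular should be checked against the development data. The label set is taken from the labels listed in this page.

```python
import io
import zipfile

# Label inventory as listed on this page (coarse Task 1 labels plus BIO variants).
VALID_LABELS = {
    "gn", "es", "mix", "ne", "foreign", "other",
    "ne-b-per", "ne-i-per", "ne-b-org", "ne-i-org", "ne-b-loc", "ne-i-loc",
    "es-b-cc", "es-i-cc", "es-b-ul", "es-i-ul",
}


def format_predictions(texts, sep=" "):
    """Render texts (lists of (token, label) pairs) as one token per line.

    Ids restart at 1 for every text, and each text ends with a blank line.
    """
    lines = []
    for text in texts:
        for token_id, (token, label) in enumerate(text, start=1):
            if label not in VALID_LABELS:
                raise ValueError(f"unknown label: {label!r}")
            lines.append(f"{token_id}{sep}{token}{sep}{label}")
        lines.append("")  # line break after the end of each text
    return "".join(line + "\n" for line in lines)


def write_submission(texts, zip_target, txt_name="predictions.txt"):
    """Write a zip archive (path or file-like object) with a single txt file."""
    with zipfile.ZipFile(zip_target, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(txt_name, format_predictions(texts))


texts = [
    [("Péva", "gn"), (",", "other"), ("situación", "es-b-ul")],
    [("Cámara", "ne-b-org"), ("de", "ne-i-org")],
]
buffer = io.BytesIO()
write_submission(texts, buffer)
print(format_predictions(texts))
```

Passing a file path such as `"submission.zip"` instead of the in-memory buffer writes the archive to disk.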
Start: March 22, 2023, midnight
Description: Development phase.
Start: May 24, 2023, midnight
Description: Evaluation phase.
Start: June 7, 2023, midnight
Description: Open Post-Evaluation phase.
# | Username | Score |
---|---|---|
1 | amunozo | 0.8226 |
2 | pughrob | 0.7954 |
3 | katsuki.ohto | 0.0452 |