GUA-SPA@IberLEF2023: Guarani-Spanish Code Switching Analysis

Organized by luischir

Welcome to GUA-SPA@IberLEF2023: Guarani-Spanish Code Switching Analysis

Welcome to the shared task GUA-SPA - Guarani-Spanish Code Switching Analysis, a task to automatically detect and analyze instances of code-switching between Guarani and Spanish in news and social media. This task is part of IberLEF 2023.

Introduction

Guarani is a South American indigenous language that belongs to the Tupi-Guarani family; Spanish is a Romance language that belongs to the Indo-European family. The two languages have been in contact in the South American region for about 500 years [Rodríguez, 2018], resulting in many interesting varieties with different levels of mixture. Paraguay is a South American country where Guarani and Spanish are the two official languages (Ley de Lenguas, [Ley 4251]). According to the most recent census in Paraguay, most of the population speaks at least some Guarani; Guarani-Spanish bilingualism is highly prevalent in urban areas, while Guarani monolingualism is mostly limited to rural areas.

Bilingual speakers often make use of the two languages at the same time, mixing them in different ways, in a phenomenon called code-switching [Joshi, 1982]. This phenomenon is very frequent in situations where two or more languages come into contact. In Paraguay, this has resulted in several identified language varieties that combine Guarani and Spanish [Kallfell, 2016].

There have been a number of competitions focusing on the detection and analysis of code-switching, starting with [Solorio et al., 2014], which covered language identification in code-switched data for several language pairs, including Spanish-English. Later on, these competitions started to include more complex tasks in code-switched contexts, such as NER [Aguilar et al., 2020] and MT into the code-switched languages [Chen et al., 2021]. In our case, we propose language identification (Task 1), NER (Task 2), and a novel classification task for Spanish spans in a code-switched Guarani-Spanish context (Task 3).

Guarani is considered a low resource language [Joshi et al., 2020] because, despite having millions of speakers, it does not have many digital resources to work with, its written use online is scarce, and it has been mostly under-researched from the NLP perspective. The situation of this and other American indigenous languages could change in the future as there are now some initiatives to build resources for these languages [Mager et al., 2021], but there is still a long way to go. Spanish, on the other hand, belongs to the set of very resource-rich languages [Joshi et al., 2020], which is good for this competition as there are many tools for Spanish that could be leveraged to see how they work in this context.

The expected target audience is NLP researchers interested in working with low-resource languages and code-switched data, as well as researchers interested in NER and MT in general.

Task description

We propose a challenge for analyzing code-switched texts in Guarani and Spanish, trying to identify the language used in each span of text, the named entities mentioned in the text, and the way Spanish is used. The challenge will be structured as three tasks:

Task 1: Language identification in code-switched data

Given a text (sequence of tokens), label each token of the sequence with one of the following categories:

  • gn: It is a Guarani token.
  • es: It is a Spanish token.
  • ne: It is part of a named entity (either in Guarani or in Spanish).
  • mix: The token is a mixture of Guarani and Spanish. For example, a verb with a Spanish root that has been adapted to Guarani morphology, like ‘osuspendeta’ (he/she will suspend).
  • foreign: Used for tokens that are in languages other than Guarani or Spanish.
  • other: Used for other types of tokens that are invariant to language, like punctuation, emojis and URLs.

Examples:

che kuerai de pagar 6000 gs. por una llamada de 40 segundos . son aliados del gobierno parece ustedes

could be tagged as:

che/gn kuerai/gn de/es pagar/es 6000/other gs./other por/es una/es llamada/es de/es 40/other segundos/es ./other son/es aliados/es del/es gobierno/es parece/es ustedes/es

Ministerio de Salud omombe'u ko'ã káso malaria ojuhúva importado Guinea Ecuatorial guive ha oîma jesareko ohapejokóvo jeipyso .

could be tagged as:

Ministerio/ne de/ne Salud/ne omombe'u/gn ko'ã/gn káso/mix malaria/es ojuhúva/gn importado/es Guinea/ne Ecuatorial/ne guive/gn ha/gn oîma/gn jesareko/gn ohapejokóvo/gn jeipyso/gn ./other

The metrics for task 1 are accuracy, weighted precision, weighted recall and weighted F1. The main metric is weighted F1.
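As a minimal sketch, the weighted metrics for Task 1 can be computed in pure Python as below (scikit-learn's precision_recall_fscore_support with average='weighted' gives the same numbers; the official scorer may differ in details, so treat this only as an illustration of the definition):

```python
from collections import Counter

def weighted_f1(gold, pred):
    """Token-level weighted F1: per-label F1 averaged, weighted by gold support."""
    support = Counter(gold)
    total = len(gold)
    score = 0.0
    for lab in set(gold):
        tp = sum(1 for g, p in zip(gold, pred) if g == lab and p == lab)
        fp = sum(1 for g, p in zip(gold, pred) if g != lab and p == lab)
        fn = sum(1 for g, p in zip(gold, pred) if g == lab and p != lab)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        score += (support[lab] / total) * f1
    return score

# Toy example: one gn token mislabeled as es.
gold = ["gn", "gn", "es", "es", "other"]
pred = ["gn", "es", "es", "es", "other"]
print(round(weighted_f1(gold, pred), 4))  # → 0.7867
```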

Task 2: Named entity classification

Given a text (sequence of tokens), identify the named entities as spans in the text, and classify each one with a category: person, location or organization. These must be marked in the tokens using BIO labels (only B and I markers are necessary): ne-b-per, ne-b-loc, ne-b-org, ne-i-per, ne-i-loc, ne-i-org.

Examples:

[ORG Ministerio de Salud] omombe'u ko'ã káso malaria ojuhúva importado [LOC Guinea Ecuatorial] guive ha oîma jesareko ohapejokóvo jeipyso .

[PER Ministra de Hacienda Lea Giménez] he'i oñepromulga léi capitalidad ary 2014

The metrics for task 2 are precision, recall and F1, either labeled or unlabeled. The criterion for finding a named entity is exact match. The main metric is labeled F1.
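Under the exact-match criterion, a predicted entity counts only if its start, end, and type all coincide with a gold entity. The sketch below shows one way to extract typed spans from the BIO labels and score labeled F1; it also works for the Task 3 labels (es-b-cc, etc.). The handling of stray I- tags (treated as outside) is one common convention and an assumption here, not necessarily what the official scorer does:

```python
def bio_spans(labels):
    """Extract (start, end, type) spans from labels like 'ne-b-org' / 'ne-i-org'."""
    spans, start, typ = [], None, None
    for i, lab in enumerate(labels + ["o"]):  # sentinel closes a trailing span
        parts = lab.split("-")
        if len(parts) == 3 and parts[1] == "b":
            if start is not None:
                spans.append((start, i, typ))
            start, typ = i, parts[2]
        elif len(parts) == 3 and parts[1] == "i" and start is not None and typ == parts[2]:
            continue  # span continues
        else:
            if start is not None:
                spans.append((start, i, typ))
            start, typ = None, None
    return spans

def span_f1(gold, pred):
    """Labeled F1 with the exact-match criterion on (start, end, type) spans."""
    g, p = set(bio_spans(gold)), set(bio_spans(pred))
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

# Toy example: the ORG span is cut short in the prediction, the LOC span matches.
gold = ["ne-b-org", "ne-i-org", "ne-i-org", "gn", "ne-b-loc", "ne-i-loc"]
pred = ["ne-b-org", "ne-i-org", "gn", "gn", "ne-b-loc", "ne-i-loc"]
print(span_f1(gold, pred))  # → 0.5
```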

Task 3: Spanish code classification

Given a text (sequence of tokens), identify spans of text in Spanish and label them in one of these categories:

  • change in code (CC): the text keeps all the characteristics of Spanish.
  • unadapted loan (UL): the Spanish text may be partially adapted in some ways to Guarani syntax, but it is not fully merged into Guarani; in particular, it does not present orthographic transformations.

These must be marked in the tokens using BIO labels (only B and I markers are necessary): es-b-cc, es-b-ul, es-i-cc, es-i-ul.

Examples:

che kuerai [CC de pagar 6000 gs. por una llamada de 40 segundos . son aliados del gobierno parece ustedes]

Okañývo pe Policía Nacional ha orekóva caso omomarandúvo Fiscalía , peteî [UL investigación ámbito penal] .

The metrics for task 3 are precision, recall and F1, either labeled or unlabeled. The criterion for finding a Spanish span is exact match. The main metric is labeled F1.
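Since each token carries exactly one label, a malformed sequence (an I- tag that does not continue a matching B-/I- tag of the same type) is easy to produce by accident and will cost exact-match points. A small validity check can catch this before submitting; check_bio below is a hypothetical helper sketched for illustration, not part of the official tooling:

```python
def check_bio(labels, prefix):
    """Return indices of I- tags (for the given prefix, e.g. 'es' or 'ne')
    that do not continue a same-type B- or I- tag on the previous token."""
    errors = []
    prev = None
    for i, lab in enumerate(labels):
        parts = lab.split("-")
        if len(parts) == 3 and parts[0] == prefix and parts[1] == "i":
            ok = (prev is not None and len(prev) == 3 and prev[0] == prefix
                  and prev[1] in ("b", "i") and prev[2] == parts[2])
            if not ok:
                errors.append(i)
        prev = parts
    return errors

# es-i-cc at index 4 follows a gn token, so it is flagged.
print(check_bio(["gn", "es-b-ul", "es-i-ul", "gn", "es-i-cc"], "es"))  # → [4]
```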

 

Results

These are the final results of the competition over the test data:

User       Task 1 - wF1   Task 2 - Labeled F1   Task 3 - Labeled F1
pughrob    0.9381 (1)     0.7028 (1)            0.3836 (1)
tsjauhia   0.9139 (2)     -                     -
amunozo    0.8500 (3)     0.4153 (3)            0.1939 (3)
baseline   0.7325 (4)     0.4946 (2)            0.2195 (2)
pakapro    0.0452 (5)     -                     -

 

Information for system description papers

The system description papers will be published in the proceedings of IberLEF 2023. The IberLEF organizers have sent us the following instructions for the papers:

* System description papers should be formatted according to the uniform 1-column CEURART style. Latex and Word templates can be found in: https://ceur-ws.org/Vol-XXX/CEURART.zip

* The minimum length of a regular paper should be 5 pages. There is no maximum page limit.

* Papers must be written in English.

* The copyright year command must be changed to \copyrightyear{2023}.

* The conference command must be changed to \conference{IberLEF 2023, September 2023, Jaén, Spain}.

* Eliminate the numbering in the pages of the paper, if there is one, and make sure that there are no headers or footnotes, except the mandatory copyright as a footnote on the first page.

* Authors should be described with their name and their full affiliation (university and country). Names must be complete (no initials), e.g. “María García” instead of “M. García”.

* Titles of papers should be in emphatic capital English notation, i.e., "Filling an Author Agreement by Autocompletion" rather than "Filling an author agreement by autocompletion".

* At least one author of each paper must sign the CEUR copyright agreement. The signed form must be sent along with the paper to the task organizers. Important: it must be physically signed with pen on paper. These are the two agreement variants, select the one that fits your case.

1. AUTHOR-AGREEMENT (NTP): Authors shall use this form if they included no copyrighted third party material in their paper text (or accompanying sources, datasets). This is the right variant in most cases.

2. AUTHOR-AGREEMENT (TP): Authors shall use this form if they did include copyrighted third party material in their paper or accompanying material. They must then also attach a copy of the permission by the third party to use this material in the signed author agreement!

* In the CEUR agreement, the field Name and year of the event should read: IberLEF 2023.
In the field Editors of the proceedings (editors), the following names must appear: Manuel Montes-y-Gómez, Francisco Rangel, Salud María Jiménez-Zafra, Marco Casavantes, Begoña Altuna, Miguel Ángel Álvarez Carmona, Gemma Bel-Enguix, Luis Chiruzzo, Iker de la Iglesia, Hugo Jair Escalante, Miguel Ángel García-Cumbreras, José Antonio García-Díaz, José Ángel Gónzalez Barba, Roberto Labadie Tamayo, Salvador Lima, Pablo Moral, Flor Miriam Plaza del Arco, Rafael Valencia-García

* In your paper, please cite the overview paper with the following reference:

@article{gua-spa-overview,
  title = {{Overview of GUA-SPA at IberLEF 2023: Guarani-Spanish Code-Switching Analysis}},
  author = {Chiruzzo, Luis and Agüero-Torales, Marvin and Giménez-Lugo, Gustavo and Alvarez, Aldo and Rodríguez, Yliana and Góngora, Santiago and Solorio, Thamar},
  journal = {Procesamiento del Lenguaje Natural},
  volume = {71},
  year = {2023},
  issn = {1989-7553},
}

Reference in APA style: Chiruzzo, L., Agüero-Torales, M., Giménez-Lugo, G., Alvarez, A., Rodríguez, Y., Góngora, S., & Solorio, T. (2023). Overview of GUA-SPA at IberLEF 2023: Guarani-Spanish Code-Switching Analysis. Procesamiento del Lenguaje Natural, 71.

* Please remember the deadline to submit the working notes is June 14th, 2023.

You can also find more information here: https://sites.google.com/view/iberlef-2023/working-notes

Evaluation Criteria

The submissions in this competition will be evaluated and scored using:
- Task 1: weighted F1 score.
- Task 2: labeled F1 score.
- Task 3: labeled F1 score.

Terms and Conditions

  • By submitting results to this competition, you consent to the public release of your scores at this website and at the IberLEF 2023 workshop and in the associated proceedings, at the task organizers' discretion. Scores may include, but are not limited to, automatic and manual quantitative judgments, qualitative judgments, and such other metrics as the task organizers see fit. You accept that the ultimate decision of metric choice and score value is that of the task organizers.
  • You further agree that the task organizers are under no obligation to release scores and that scores may be withheld if it is the task organizers' judgment that the submission was incomplete, erroneous, deceptive, or violated the letter or spirit of the competition's rules. Inclusion of a submission's scores is not an endorsement of a team or individual's submission, system, or science.
  • This task has a single evaluation phase. To be considered a valid participation/submission in the task's evaluation, you agree to submit a single file with the tags associated with all the tokens in the test set.
  • Each team must create and use exactly one CodaLab account.
  • Team constitution (members of a team) cannot be changed after the evaluation phase has begun.
  • During the evaluation phase, each team can submit as many as ten submissions; the top-scoring submission will be considered as the official submission to the competition.
  • The organizers and the organizations they are affiliated with make no warranties regarding the datasets provided, including but not limited to being correct or complete. They cannot be held liable for providing access to the datasets or the usage of the datasets.

Schedule

  • March 22nd, 2023: Training and development set. Development phase begins.
  • May 24th, 2023: Test set and open for submissions. Evaluation phase begins.
  • June 7th, 2023: Evaluation phase ends. Publication of results.
  • June 14th, June 19th, 2023: Paper submission.
  • June 28th, 2023: Notification of acceptance.
  • July 3rd, 2023: Camera-ready paper submission.
  • September 26th, 2023: IberLEF 2023 Workshop.

Organizers

The organizers of the task are:

  • Luis Chiruzzo. Universidad de la República, Montevideo, Uruguay.
  • Marvin Agüero-Torales. Universidad de Granada, Granada, Spain. Global CoE of Data Intelligence, Fujitsu, Spain.
  • Gustavo Giménez-Lugo. Universidade Tecnologica Federal do Paraná, Curitiba, PR, Brasil.
  • Santiago Góngora. Universidad de la República, Montevideo, Uruguay.
  • Aldo Alvarez. Universidad Nacional de Itapúa, Encarnación, Paraguay.
  • Yliana Rodríguez. Universidad de la República, Montevideo, Uruguay.
  • Thamar Solorio. University of Houston, Houston, TX, USA.

Contact

If you have any questions, contact us at luischir@fing.edu.uy

Submissions

The submission must be a zip file that contains a single txt file.

The txt file must have the following format: one token per line, including its numeric id, (optionally) the token text, and the predicted label. The sequence of tokens must follow the development data format. Please include a line break after the end of each text.

For example:

1 Péva gn
2 , other
3 ombotove gn
4 mbarete gn
5 odefini mix
6 situación es-b-ul
7 encuesta es-i-ul
8 rupive gn
9 . other

1 Cámara ne-b-org
2 de ne-b-org
3 Diputados ne-b-org
4 omonéî gn
5 ampliación es-b-cc
6 presupuestaria es-i-cc

Notice that there is only one label per token. If you just want to participate in task 1, you can use the labels {gn,es,mix,ne,foreign,other}.

If you also want to participate in task 2, your "ne" tags must include the named entity BIO information {ne-b-per, ne-i-per, ne-b-org, ne-i-org, ne-b-loc, ne-i-loc}.

If you also want to participate in task 3, your "es" tags must include the code mixing BIO information {es-b-cc, es-i-cc, es-b-ul, es-i-ul}.
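The format above can be produced mechanically. The following is a minimal sketch, where write_submission and its signature are illustrative (not official tooling): it writes one "id token label" line per token, a blank line after each text as required, and zips the result:

```python
import zipfile

def write_submission(texts, path="submission"):
    """texts: one (token, label) list per text,
    e.g. [[("Péva", "gn"), (",", "other")], ...].
    Writes path + '.txt' in the required format, then zips it as path + '.zip'."""
    txt = path + ".txt"
    with open(txt, "w", encoding="utf-8") as f:
        for tokens in texts:
            for i, (tok, lab) in enumerate(tokens, start=1):
                f.write(f"{i} {tok} {lab}\n")  # numeric id, token text, label
            f.write("\n")  # line break after the end of each text
    with zipfile.ZipFile(path + ".zip", "w") as z:
        z.write(txt)

write_submission([[("Péva", "gn"), (",", "other")]], "sub_demo")
```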

