ODESIA Challenge @ SEPLN 2024 (September 26, 2024 - February 2, 2025)
Registration open until January 20, 2025
HOW TO PARTICIPATE
System | Team | Arithmetic mean | Per-task scores (10 ODESIA-CORE tasks) | | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Ixa ehu ixambert base cased | ODESIA | 0.4748 | 0.6743 | 0.4875 | 0.7666 | 0.3796 | 0.0543 | 0.6868 | 0.6117 | 0.3890 | 0.3412 | 0.3570 |
Bertin roberta base spanish | ODESIA | 0.4912 | 0.7280 | 0.4941 | 0.7596 | 0.2532 | 0.1782 | 0.6877 | 0.6465 | 0.4146 | 0.3331 | 0.4172 |
Xlm roberta large | ODESIA | 0.5873 | 0.7663 | 0.5593 | 0.8186 | 0.5343 | 0.4527 | 0.7855 | 0.6564 | 0.4414 | 0.3995 | 0.4589 |
Xlm roberta base | ODESIA | 0.5264 | 0.7395 | 0.4997 | 0.7894 | 0.4504 | 0.2668 | 0.7819 | 0.6236 | 0.4245 | 0.3195 | 0.3691 |
PlanTL GOB ES roberta large bne | ODESIA | 0.5626 | 0.7241 | 0.5668 | 0.8177 | 0.5173 | 0.3894 | 0.6757 | 0.6671 | 0.4237 | 0.3798 | 0.4640 |
PlanTL GOB ES roberta base bne | ODESIA | 0.5453 | 0.7356 | 0.5554 | 0.8149 | 0.4906 | 0.2944 | 0.7169 | 0.6531 | 0.4173 | 0.3688 | 0.4061 |
Distilbert base multilingual cased | ODESIA | 0.4728 | 0.7222 | 0.4669 | 0.7507 | 0.4036 | 0.2222 | 0.6868 | 0.5851 | 0.3823 | 0.2874 | 0.2207 |
Dccuchile bert base spanish wwm cased | ODESIA | 0.5408 | 0.7146 | 0.5370 | 0.7916 | 0.4874 | 0.2931 | 0.7478 | 0.6326 | 0.4182 | 0.3738 | 0.4118 |
CenIA distillbert base spanish uncased | ODESIA | 0.4864 | 0.7203 | 0.5118 | 0.7708 | 0.4198 | 0.1782 | 0.6531 | 0.6128 | 0.4160 | 0.3324 | 0.2484 |
Bert base multilingual cased | ODESIA | 0.5073 | 0.7222 | 0.4693 | 0.7821 | 0.4231 | 0.2562 | 0.7592 | 0.6136 | 0.3917 | 0.3326 | 0.3225 |
Gemma-2B-IT | ixa_taldea | 0.5456 | 0.7548 | 0.5262 | 0.8109 | 0.5283 | 0.4303 | 0.6129 | 0.6257 | 0.4012 | 0.2920 | 0.4738 |
Hermes-3-Llama-3.1-8B_2 | ixa_taldea | 0.6069 | 0.8065 | 0.5736 | 0.8211 | 0.5677 | 0.4855 | 0.7042 | 0.6611 | 0.4378 | 0.3322 | 0.6791 |
Hermes-3-Llama-3.1-8B | ixa_taldea | 0.6017 | 0.8065 | 0.5736 | 0.8168 | 0.5379 | 0.4675 | 0.7056 | 0.6611 | 0.4364 | 0.3322 | 0.6791 |
XLM-RoBERTa-large-v3 | UMUTeam | 0.5462 | 0.7452 | 0.5540 | 0.8224 | 0.5425 | 0.4581 | 0.5967 | 0.5441 | 0.4384 | 0.3609 | 0.4000 |
XLM-RoBERTa-large-2 | UMUTeam | 0.5320 | 0.7452 | 0.5540 | 0.8224 | 0.5425 | 0.4581 | 0.5967 | 0.5441 | 0.4384 | 0.3609 | 0.2581 |
XLM-RoBERTa-large | UMUTeam | 0.4951 | 0.7452 | 0.5540 | 0.8224 | 0.5425 | 0.4581 | 0.5967 | 0.5441 | 0.3371 | 0.0925 | 0.2581 |
CHALLENGE RULES
This challenge aims to promote the development and evaluation of language models in Spanish using the evaluation platform and datasets provided by the ODESIA project (Espacio de Observación del Desarrollo del Español en la Inteligencia Artificial).
The challenge consists of solving 10 discriminative tasks in Spanish that belong to the ODESIA Leaderboard and are evaluated on private data. These tasks form the ODESIA-CORE section of the ODESIA Leaderboard. The ODESIA Leaderboard is an application that provides an evaluation infrastructure for pretrained language models in English and Spanish, allowing a direct comparison of model performance across the two languages. The leaderboard also has an ODESIA-EXTENDED section with 4 tasks based on pre-existing public evaluation data, but these are not part of the challenge. Although ODESIA provides bilingual data (Spanish and English) for all tasks, this challenge focuses only on the Spanish tasks (the Spanish portion of ODESIA-CORE).
The team submitting the best system will receive a cash prize of 3,000 euros, donated by the company Llorente y Cuenca Madrid, SL (see details below).
The ODESIA-CORE benchmark consists of 10 discriminative tasks with public training datasets and private test datasets (not previously distributed by any means) created within the ODESIA initiative. The private nature of the test data guarantees the absence of contamination in the leaderboard results: no LLM should have seen the test set annotations in its pre-training phase. This is a summary of the tasks:
Name | Domain | Task | Abstract Task | Metric
---|---|---|---|---
DIANN 2023 | Biomedical | Disability detection | Sequence labeling | F1 Macro
DIPROMATS 2023 | Politics | Propaganda identification | Binary Classification | ICM-Norm
DIPROMATS 2023 | Politics | Propaganda characterization, coarse-grained | Multiclass Hierarchical Classification, Multilabel | ICM-Norm
DIPROMATS 2023 | Politics | Propaganda characterization, fine-grained | Multiclass Hierarchical Classification, Multilabel | ICM-Norm
EXIST 2022 | Social | Sexism detection | Binary Classification | Accuracy
EXIST 2022 | Social | Sexism categorization | Multiclass Classification | F1 Macro
EXIST 2023 | Social | Sexism detection | Binary Classification | Soft-ICM-Norm
EXIST 2023 | Social | Source intention categorization | Multiclass Hierarchical Classification | Soft-ICM-Norm
EXIST 2023 | Social | Sexism categorization | Multiclass Hierarchical Classification, Multilabel | Soft-ICM-Norm
SQUAD-SQAC 2024 | Scientific | Extractive Question-Answering | Sequence labeling | F1
The winning system will be the one that, at the end of the competition, obtains the best average score across the Spanish versions of the ODESIA-CORE tasks.
All types of Natural Language Processing (NLP) systems are accepted, provided they are applied consistently across all tasks using the same base model architecture or methodological approach. Thus, each participation must consist of a single system that addresses all tasks. Submissions where entirely different models or approaches are used for each task independently are not acceptable.
Here, a "single system" refers to a consistent methodological approach and/or architecture that is applied across all tasks, such as:
- Using the same discriminative base model architecture for all tasks.
- Utilizing the same generative model across all tasks. Participants can use generative large language models (LLMs) as their base models, applying them consistently across tasks (the same model is involved in the resolution of all tasks).
- In decoder-only models, consistent prompting strategies can be used, with task-specific prompts tailored as needed (see below for accepted task-specific adjustments).
- Employing ensembles of models consistently across tasks (the same models are involved in the resolution of all tasks). For example, using base and auxiliary models or a RAG approach is permitted, as long as they are the same across tasks (but with possibly varying strategies per task).
Allowed Task-Specific Adjustments:
- Preprocessing Steps: Different preprocessing steps (before training) per task are permitted.
- Fine-Tuning: Task-specific fine-tuning of the same base model is allowed (see the sketch after this list).
- Hyperparameters and Prompts: Adjusting hyperparameters, prompts, or other parameters for each task is acceptable, provided the same model(s) is/are used consistently across all tasks.
- External Data and Retrieval Strategies: Using different external data sources or retrieval strategies for different tasks is allowed.
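To illustrate the fine-tuning adjustment above, here is a minimal sketch (Python, Hugging Face Transformers) that reuses one shared base checkpoint and prepares a separate task head for each task. The checkpoint name, task names, and label counts are placeholders chosen for the example, not prescriptions of the organizers.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder base checkpoint; participants choose their own, but must reuse the
# same base model (or ensemble) for every task in the challenge.
BASE = "xlm-roberta-base"

# Hypothetical task names and label counts, for illustration only.
TASK_LABELS = {
    "sexism_detection": 2,
    "sexism_categorization": 5,
    "propaganda_identification": 2,
}

tokenizer = AutoTokenizer.from_pretrained(BASE)

# One task-specific copy of the shared base model per task; each copy is then
# fine-tuned on that task's training data (an allowed task-specific adjustment).
models = {
    task: AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=n)
    for task, n in TASK_LABELS.items()
}
```

Preprocessing, hyperparameters, and training data may differ per task; the shared base checkpoint is what makes this a single system under the rules.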
Examples of Acceptable Systems (non-exhaustive; please consult us if unsure about your proposal):
- Fine-Tuned Encoder-Type LLMs:
- Using the same encoder-type large language model (or an ensemble) as a base, with task-specific fine-tuning applied for each challenge task. Training data from the participants’ package or other appropriate external sources can be used.
- Generative LLMs with Uniform Prompting Strategy:
- Utilizing the same generative large language model(s) combined with a zero-shot, one-shot, or few-shot prompting strategy.
- While the overall approach is consistent, prompts can be tailored to each task. For example, using GPT-3 for all tasks, with task-specific prompts, represents a single system applied consistently across tasks; even if prompts or retrieval strategies differ per task, the underlying generative model remains the same (see the sketch after this list of examples).
- Retrieval-Augmented Generation:
- Employing one or more generative LLMs with a retrieval-augmented generation strategy, using external training datasets or sources. Different retrieval strategies or external data can be used for each task.
- Combinations of Methods:
- Any combination of the above methods, as long as the same base model(s) with a consistent methodological approach is/are applied across all tasks.
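To make the "uniform prompting strategy" example concrete, the minimal sketch below drives a single generative model with task-specific prompts. The model name, task identifiers, and prompt wording are illustrative placeholders, not part of the official setup.

```python
from transformers import pipeline

# Placeholder generative model; any single LLM (or fixed ensemble) reused for all
# tasks satisfies the "single system" requirement.
generator = pipeline("text-generation", model="gpt2")

# Task-specific prompts are an allowed adjustment, as long as the same model
# answers every task. Task names and wording below are made up for illustration.
PROMPTS = {
    "sexism_detection": "Does the following tweet contain sexism? Answer yes or no.\nTweet: {text}\nAnswer:",
    "propaganda_identification": "Is the following tweet propagandistic? Answer yes or no.\nTweet: {text}\nAnswer:",
}

def predict(task: str, text: str) -> str:
    prompt = PROMPTS[task].format(text=text)
    output = generator(prompt, max_new_tokens=5, do_sample=False)[0]["generated_text"]
    return output[len(prompt):].strip()

print(predict("sexism_detection", "Example tweet text"))
```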
Examples of Unacceptable Systems:
- Using Different Base Models for Each Task Without a Shared Foundation:
- Example: A team submits a solution where they use a BERT model fine-tuned for Task A, a GPT-2 model for Task B, and an XLNet model for Task C, with no shared base model or overarching methodology.
- Reason: This approach involves entirely different models for each task without a common base, violating the requirement of using the same base model architecture or consistent methodological approach across all tasks.
- Employing Distinct Architectures and Methodologies Per Task:
- Example: For Task A, the participant uses a rule-based system; for Task B, they apply a statistical machine translation model; and for Task C, they use a neural network trained from scratch.
- Reason: Using fundamentally different architectures and methodologies for each task without any common elements does not constitute a single system.
- Independent Systems Developed Separately for Each Task:
- Example: The participant submits three separate codebases, each developed independently for Tasks A, B, and C, with no shared components, code, or models.
- Reason: This submission represents multiple independent systems rather than a unified system applied across all tasks.
- Combining Unrelated Pre-Trained Models Without Integration:
- Example: Using a pre-trained sentiment analysis model for Task A, a separate named entity recognition model for Task B, and a topic modeling algorithm for Task C, without any attempt to integrate them into a unified system.
- Reason: Simply grouping unrelated models together without a common base model or methodological approach does not meet the criteria of a single system.
- Applying Different Machine Learning Paradigms per Task:
- Example: Using supervised learning for Task A, unsupervised learning for Task B, and reinforcement learning for Task C, with no common framework or base model connecting these approaches.
- Reason: This approach lacks a consistent methodological foundation across tasks.
- Using Different Generative Models for Each Task Without a Shared Base Model:
- Example: A participant submits a solution where they use GPT-3 for Task A, PaLM for Task B, and LLaMA for Task C, with no shared base model or overarching methodological approach. Each task is addressed using a different generative AI model independently.
- Reason: This approach involves using entirely different generative models for each task without a common base model or consistent methodology. It violates the requirement of applying the same base model architecture or consistent methodological approach across all tasks.
Reproducibility and Verification:
- To ensure the originality and validity of solutions, the organizers may request participants to supply their implementation code and all materials necessary to reproduce the results. Upon request, code should be provided as a link to a GitHub repository (possibly along with a Docker image to facilitate execution).
- Models or systems for which verification or reproduction materials are not provided when requested will not be eligible for the challenge prize and may ultimately be removed from the leaderboard.
- Teams will have to pre-register for the challenge before they can participate.
- Each team will register a single account on the "ODESIA-Leaderboard" evaluation platform using the form provided for this purpose (link).
- The organizers will provide a username and password on the “ODESIA-Leaderboard” platform upon validation of the registration data.
- The results will be submitted through the ODESIA Leaderboard at https://leaderboard.odesia.uned.es/leaderboard/submit , where they will be automatically evaluated using the metrics corresponding to each task.
- For each submission, teams should format their prediction files following the specifications described in the README files of each dataset (included in the download package).
- In addition, the following fields must be completed on the prediction submission page:
- Team Name: The login name of the team representative on the ODESIA platform, which will be provided when registering for the challenge.
- Email: Contact email used for the challenge registration.
- Affiliation: Institution the participants belong to (if applicable).
- System name: To be formatted as “{team_name}-{submission_number}”, where team_name is a permanent identifier shared by all submissions from the same team, and submission_number is a number from 1 to 20 identifying each of the 20 submissions allowed per team during the contest.
- Model URL: (optional) URL of the model used (e.g. in Hugging Face), if applicable.
- System description: A 300 to 500-word description of the system used to generate the predictions.
- GitHub URL: Optionally, teams can add the link to the source code used to generate the results.
- Leaderboard version: "Challenge" must be selected.
- Submission languages: Check only "Spanish".
- ZIP File: The results of the system are sent as a compressed file with predictions formatted as specified above.
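As a small convenience, the sketch below packages prediction files into a submission ZIP using only the Python standard library. The directory layout and file names are hypothetical; the authoritative format is the one described in each dataset's README.

```python
import zipfile
from pathlib import Path

# Hypothetical layout: one prediction file per task inside ./predictions.
# The required file names and contents are defined by the dataset READMEs.
prediction_files = sorted(Path("predictions").glob("*.json"))

# Archive named after the "{team_name}-{submission_number}" convention, e.g. "myteam-1".
with zipfile.ZipFile("myteam-1.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for f in prediction_files:
        zf.write(f, arcname=f.name)  # store each file at the archive root
```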
The ODESIA Leaderboard uses the PyEvALL evaluation library for the classification tasks. PyEvALL can be installed from the pip package manager and used during the development phase to evaluate the DIPROMATS 2023, EXIST 2022 and EXIST 2023 tasks. The F1 metric implemented to evaluate the tasks of the original SQUAD/SQAC dataset has been adapted to evaluate the tasks of the SQUAD-SQAC 2024 dataset; its original implementation, for use in the development phase, can be found in SQUAD METRIC. The DIANN 2023 sequence labeling task is evaluated with the Macro F1 metric implemented in the HuggingFace Evaluate library, which has also been adapted for the ODESIA Leaderboard.
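For development-time sanity checks, a plain macro F1 can be computed with the Hugging Face Evaluate library mentioned above, as in the sketch below. The toy label sequences are invented for illustration, and the resulting score is not guaranteed to match the adapted metric used on the leaderboard; consult the PyEvALL and SQUAD METRIC documentation for the official classification and QA metrics.

```python
import evaluate  # pip install evaluate

# Toy gold/predicted label sequences, invented for illustration only.
gold = ["O", "B-DIS", "I-DIS", "O", "B-DIS"]
pred = ["O", "B-DIS", "O", "O", "B-DIS"]

# The Evaluate "f1" metric expects integer class ids, so map the string labels.
label2id = {label: i for i, label in enumerate(sorted(set(gold + pred)))}

f1 = evaluate.load("f1")
result = f1.compute(
    predictions=[label2id[l] for l in pred],
    references=[label2id[l] for l in gold],
    average="macro",
)
print(result)  # e.g. {'f1': ...}
```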
The only restrictions on participating teams are:
- All team members must be of legal age.
- No person may be a member of more than one team.
- A single prize of 3,000 euros (donated by the company “Llorente y Cuenca Madrid, SL”) will be awarded to the team that presents the system with the best global average performance in the ODESIA-CORE tasks in Spanish.
- To be eligible for the prize, the following conditions are established:
- Teams must make their results public on the ODESIA Leaderboard before the end date of the contest.
- The winning team must obtain an average score higher than that of the baseline models provided by the organization. Specifically, they must surpass the model that achieves the best average, which is XLM-Roberta-Large with a score of 0.5873.
- There must be a minimum of five teams submitting results; if this number is not met, the organization reserves the right to defer the deadline of the challenge.
- The winning team commits to present its solution at the Award ceremony (see section "Results Presentation and Prize Award Ceremony").
- Employees of UNED, LlyC Madrid S.L., Red.es, SEDIA, and any other entity related to the ODESIA project may participate in the challenge but will not be eligible for the final cash prize.
- Common sense rules of ethics and professional conduct must be respected. The organizers reserve the right to disqualify teams that have violated the rules.
- No limits are imposed on the costs associated with the implementation of the solutions, but the organizers may request information on these costs.
- The organizers reserve the right to update the rules in response to unforeseen circumstances, to better serve the competition's mission.
- The organizers reserve all rights regarding the final decision.
- The winning team, and a selection of other teams submitting innovative solutions, will be asked to submit a technical report in PDF format of at least 4 pages (excluding references) detailing their solution.
- The report will include a discussion of the strategies adopted by the team in the development of their proposal and the evaluation results.
- The report will include a breakdown of the costs of implementing the system and the use of datasets provided by the organization and by third parties, if applicable.
- If there is sufficient material, the option of publishing the technical reports in a special issue or in a joint journal article will be considered.
- The winning team will be invited to present their solution at an award ceremony during the ODESIA Final Project Workshop in Feb 2025. Acceptance of the award implies mandatory attendance (in person or online) at the session.
- All participants will receive certificates of participation at the end of the competition.
This challenge is organized in the framework of the ODESIA Project, a cooperation between the Spanish public university UNED and Red.es, a public business entity attached to the Ministry for Digital Transformation and the Civil Service, through the Secretariat of State for Digitalization and Artificial Intelligence. The project is partially funded by the European Union (NextGenerationEU funds) through the "Plan de Recuperación, Transformación y Resiliencia", by the Ministry of Economic Affairs and Digital Transformation, and by the UNED. It belongs to the activities of the "Plan de Tecnologías del Lenguaje de la Secretaría de Estado de Inteligencia Artificial y Digitalización" of Spain.
- Organizing Committee:
- Alejandro Benito-Santos (co-chair, UNED)
- Roser Morante (co-chair, UNED)
- Julio Gonzalo (UNED)
- Jorge Carrillo-de-Albornoz (UNED)
- Laura Plaza (UNED)
- Enrique Amigó (UNED)
- Víctor Fresno (UNED)
- Andrés Fernández (UNED)
- Adrián Ghajari (UNED)
- Guillermo Marco (UNED)
- Eva Sánchez (UNED)
- Miguel Lucas (LLyC)
- Advisory Board:
- TBA
For questions related to the challenge, please join our Discord server: #odesia-challenge-2024. You can also contact the challenge co-chairs, Alejandro Benito-Santos (al.benito@lsi.uned.es) and Roser Morante (r.morant@lsi.uned.es).
- Registration opens: September 26, 2024
- Registration closes: January 20, 2025*
- Submissions deadline: February 2, 2025*
- Official results announced: mid February 2025
- Award ceremony and presentation of results: end of February 2025