FAQ Leaderboard

What is the ODESIA Leaderboard?

The ODESIA Leaderboard evaluates language models on a diverse collection of Natural Language Processing tasks, which the models address through supervised training of the last layer of neurons (fine-tuning). Each task has an associated metric. The leaderboard reports both the per-task results and the overall results of each participating system. Its unique feature is that it also allows a quantitative comparison of the state of the art in English and Spanish, because each dataset has a comparable version in both languages.
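
As an illustration of the fine-tuning approach mentioned above (not part of the leaderboard itself), the following Python sketch freezes a pre-trained encoder and trains only its classification head; the model name, label count and the single toy training example are placeholders, and it assumes the PyTorch and Hugging Face Transformers libraries.

    # Illustrative sketch: train only the classification head of a pre-trained model.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    model_name = "bert-base-multilingual-cased"  # placeholder choice of model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

    # Freeze the encoder so only the newly added classification layer is trained.
    for param in model.base_model.parameters():
        param.requires_grad = False

    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-3
    )

    # One toy training step on a single made-up example.
    batch = tokenizer(["example input text"], return_tensors="pt")
    labels = torch.tensor([1])
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()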

How useful is the ODESIA Leaderboard for me?

There is currently no leaderboard for evaluating language models in Spanish. In general, each new pre-trained model is evaluated independently by its authors, and the evaluation is usually included in the technical report or scientific article that accompanies the model's publication. Our main goal is to create a leaderboard that enables the comparative evaluation of English and Spanish language models, so that the performance gap between pre-trained models in the two languages can be estimated under equal conditions with respect to the quantity, quality and comparability of the training data.

What tasks are evaluated?

In the ODESIA Leaderboard, two groups of tasks are evaluated: tasks with private evaluation data (Core Tasks) and tasks with public evaluation data (Extended Tasks).

Version 1 of the Leaderboard includes 10 tasks with private evaluation data.

  1. DIANN 2023 - Disability detection (sequence labeling, biomedical domain).
  2. DIPROMATS 2023 - Propaganda identification (classification, politics domain).
  3. DIPROMATS 2023 - Coarse propaganda characterization (classification, politics domain).
  4. DIPROMATS 2023 - Fine propaganda characterization (classification, politics domain).
  5. EXIST 2022 - Sexism detection (classification, social domain).
  6. EXIST 2022 - Sexism categorization (classification, social domain).
  7. EXIST 2023 - Sexism identification (classification, LeWiDi, social domain).
  8. EXIST 2023 - Source intention (classification, LeWiDi, social domain).
  9. EXIST 2023 - Sexism categorization (classification, LeWiDi, social domain).
  10. SQUAD-SQAC 2024 - Question-Answering (extractive) (sequence labeling, diverse scientific domains).

In addition, four tasks with public evaluation data (Extended Tasks) are included.

  1. MLDoc - Multilingual Document Classification (classification, news domain).
  2. MULTICONER 2022 - Named entity recognition (sequence labeling, general domain).
  3. STS 2017 - Sentence similarity (regression, news, captions, forums).
  4. SQAC-SQUAD 2016 - Question-Answering (extractive) (sequence labeling, general domain).

More detailed information on each task can be found on the Tasks page.

How is the gap computed?

To compare the Spanish and English results on an equal footing, we first estimate the intrinsic difficulty of each dataset, understood as the average effectiveness of several learning algorithms that do not use linguistic information. The differences in effectiveness between the two languages are then calibrated to remove this intrinsic difficulty, yielding the difference in linguistic performance between one language and the other. More detailed information can be found on the Methodology page.
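
As a rough illustration of this calibration (not the exact formula used by ODESIA; see the Methodology page), one could subtract each language's baseline from its score before comparing, as in this Python sketch with made-up numbers:

    def calibrated_gap(score_en, score_es, baseline_en, baseline_es):
        """Hypothetical calibration: discount the intrinsic difficulty
        (the baseline average) from each language before comparing."""
        headroom_en = score_en - baseline_en  # improvement over the English baseline
        headroom_es = score_es - baseline_es  # improvement over the Spanish baseline
        return headroom_en - headroom_es      # positive value: larger gain in English

    # Made-up scores: English F1 of 0.82 vs. Spanish F1 of 0.78,
    # with baseline averages of 0.55 and 0.53 respectively.
    print(round(calibrated_gap(0.82, 0.78, 0.55, 0.53), 3))  # 0.02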

How are the baselines computed?

The baseline results are computed by averaging the results of several models that do not use linguistic information. This average is used as a reference to calibrate the effectiveness of language models across English and Spanish. More detailed information is provided on the Methodology page.
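
For illustration only, with made-up per-system scores, the baseline for one task and language would simply be their mean:

    baseline_scores = [0.51, 0.55, 0.58]  # hypothetical scores of non-linguistic systems
    baseline = sum(baseline_scores) / len(baseline_scores)
    print(round(baseline, 3))  # 0.547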

How can I publish my results?

To participate in the Leaderboard, you need to register and submit the form that appears on the Participate page, sending your results as a compressed ZIP file that contains one prediction file per task and language. Detailed instructions can be found on the Participate page.
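
As a minimal packaging sketch in Python (the directory name, archive name and flat layout are assumptions, not stated requirements), the ZIP file could be built like this:

    import zipfile
    from pathlib import Path

    # Collect previously generated Spanish prediction files, named as listed under
    # "How should the prediction files be formatted?".
    prediction_files = sorted(Path("predictions").glob("*_es.json"))

    with zipfile.ZipFile("odesia_submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
        for path in prediction_files:
            zf.write(path, arcname=path.name)  # store each file at the archive root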

Can I publish results anonymously?

Results are associated with a user account whose name is displayed on the Leaderboard, but there are no restrictions on the user name you choose. Contact details are not published.

Is there a deadline?

Results are updated continuously; they can be submitted at any time and as many times as you wish.

Under what license are the datasets distributed?

Each dataset has its own license, so please check the information provided for each dataset. For the Core Tasks, the test partitions are not distributed, in order to avoid model contamination.

How should the prediction files be formatted?

Prediction files must be submitted in JSON format, the same format required by the EvALL 2.0 evaluation platform. Every prediction must contain the fields "test_case", "id" and "value". File names are composed of the task name, the task number and the language.
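
As a hedged sketch of what one such file might look like (the ids, labels and top-level array layout are illustrative; the downloadable sample predictions below are the authoritative reference), a prediction file could be written in Python as follows:

    import json

    # Illustrative records: the "test_case", "id" and "value" contents are made up.
    predictions = [
        {"test_case": "EXIST2022", "id": "000001", "value": "sexist"},
        {"test_case": "EXIST2022", "id": "000002", "value": "non-sexist"},
    ]

    with open("EXIST_2022_T1_es.json", "w", encoding="utf-8") as f:
        json.dump(predictions, f, ensure_ascii=False, indent=2)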

To make a submission for Spanish, the files should be named as follows:

  • DIANN_2023_T1_es.json     
  • DIPROMATS_2023_T1_es.json     
  • DIPROMATS_2023_T2_es.json     
  • DIPROMATS_2023_T3_es.json     
  • EXIST_2022_T1_es.json     
  • EXIST_2022_T2_es.json
  • EXIST_2023_T1_es.json     
  • EXIST_2023_T2_es.json     
  • EXIST_2023_T3_es.json     
  • SQUAD-SQAC_2024_T1_es.json     
  • MLDOC_2018_es.json     
  • MULTICONER_2022_es.json     
  • SQAC_SQUAD_2016_es.json     
  • STS_2017_es.json

Here you can download sample predictions for submissions in English, Spanish and both languages.

By what metrics are the tasks evaluated?

Each task has its own metric, chosen according to the problem being solved, and this metric is used to rank the systems. More details are provided on the Tasks page.

How are submitted results evaluated?

The submitted results are evaluated using the EvALL 2.0 tool. For more information, see http://evall.uned.es/