Methodology

The ODESIA Leaderboard provides an evaluation infrastructure for pre-trained language models in English and Spanish. It allows a direct comparison of model performance in the two languages, and thus measures the English-Spanish effectiveness gap of state-of-the-art Natural Language Processing (NLP) systems. The infrastructure consists of a benchmark against which the models are evaluated and a leaderboard where the results are displayed.


Benchmark

A benchmark for the comparative evaluation of language models usually consists of a diverse collection of Natural Language Processing tasks, which the models address by adding a task-specific output layer and training it in a supervised fashion (fine-tuning). The benchmark comprises datasets with private test partitions and datasets with public test partitions. Every dataset includes one set in English and one set in Spanish.
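
As an illustration of this fine-tuning setup, the sketch below attaches a classification head to a pretrained multilingual encoder with the Hugging Face transformers and datasets libraries. It is a minimal sketch only: the model name, hyperparameters and toy data are placeholders, not the configuration used by ODESIA.

    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    # Placeholder multilingual encoder; any pre-trained checkpoint works.
    model_name = "xlm-roberta-base"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2)  # e.g. a binary classification task

    # Toy stand-in for one benchmark task's training partition.
    train = Dataset.from_dict({"text": ["an example tweet", "otro tuit"],
                               "label": [0, 1]})
    train = train.map(lambda batch: tokenizer(batch["text"],
                                              truncation=True,
                                              padding="max_length",
                                              max_length=64),
                      batched=True)

    args = TrainingArguments(output_dir="out", num_train_epochs=1,
                             per_device_train_batch_size=2)
    Trainer(model=model, args=args, train_dataset=train).train()

Because each dataset exists in both languages, the same script can be run once per language to obtain the pair of scores from which the gap is measured.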

The datasets with private test partitions are the following:

  • DIPROMATS 2023. This dataset was created from scratch to be incorporated into Version 1 of the ODESIA Leaderboard. It is a set of tweets issued by diplomats from four world powers (the European Union, Russia, China and the United States), annotated according to the propaganda techniques they use to convey a particular image of their countries or their competitors at the global level. There are three tasks associated with this dataset: propaganda identification, coarse-grained propaganda characterization (four techniques) and fine-grained characterization (15 techniques subsumed under the previous ones). The characterization tasks are multi-class, multi-label classification problems. The dataset is framed within the problems related to misinformation.
  • EXIST 2022. This dataset contains tweets annotated with information on sexism: a binary label indicating whether the tweet expresses sexism or not, and a multi-class label indicating the type of sexism conveyed. It is framed within the problem of toxicity in social networks.
  • DIANN 2023. The dataset contains abstracts of biomedical articles in which mentions of disabilities are annotated. It supports a sequence labeling task for entity recognition.
  • EXIST 2023. The dataset was created in its entirety for Version 2 of the Leaderboard. It is composed of tweets labeled according to the type of sexism expressed or described in them. It was developed following the "Learning with Disagreement" (LeWiDi) paradigm, which makes it the first dataset for training and testing sexism detection systems built according to this paradigm. It consists of three partitions (training, development, evaluation) and annotations for three tasks: sexism detection, sexism categorization and identification of the sexism sender. It is framed within the problem of toxicity in social networks.
  • SQUAD/SQAC 2024. This dataset consists of an evaluation partition created for Version 2 of the Leaderboard. It contains popular science articles, from CSIC for Spanish and from Cambridge University for English, annotated with questions and extractive answers. The dataset evaluates reading comprehension in the form of question answering: systems must answer questions about a text with a fragment extracted directly from that text, making it a sequence labeling (span extraction) task. A minimal question-answering sketch follows this list.
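
To make the extractive question-answering format concrete, the minimal sketch below runs a generic Hugging Face question-answering pipeline over a short context. The checkpoint is an arbitrary public model chosen for illustration, not a system evaluated on the leaderboard.

    from transformers import pipeline

    # Arbitrary public extractive-QA checkpoint, for illustration only.
    qa = pipeline("question-answering",
                  model="deepset/xlm-roberta-base-squad2")

    context = ("The ODESIA benchmark pairs every English dataset with a "
               "Spanish counterpart built with the same methodology.")
    answer = qa(question="What is each English dataset paired with?",
                context=context)

    # The prediction is a literal fragment of the context plus its offsets.
    print(answer["answer"], answer["start"], answer["end"])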

To construct the datasets, an identical methodology was applied in both languages, both for selecting the source texts and for annotating them manually. Calibration mechanisms have been defined to compensate for possible differences in intrinsic difficulty between the datasets of the two languages. Moreover, for all of these datasets the test sets will be kept hidden indefinitely, to avoid overfitting and possible contamination of the systems during the pre-training phase.

The datasets with public test partitions are listed below. For all of them we use the publicly available training and test partitions.

  • Multilingual Document Classification Corpus (MLDoc) (Schwenk and Li, 2018) contains news items classified into four categories: corporate/industrial, economics, government/social, and markets.
  • MultiCONER 2022 (Malmasi et al., 2022) is a multilingual dataset for complex named entity recognition with six entity categories.
  • STS-2017 (Cer et al., 2017) is a multilingual textual similarity dataset. The task consists of predicting the degree of similarity between pairs of sentences; a minimal similarity sketch follows this list.
  • The SQUAD/SQAC task combines the Spanish Question Answering Corpus (SQAC; Gutiérrez-Fandiño et al., 2021), an extractive question-answering dataset for Spanish in which, given a question and an associated paragraph, the task is to identify the smallest span containing the answer, and SQuAD v1.1 (Rajpurkar et al., 2016), a similar dataset for English.
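
As an illustration of the textual-similarity task, the sketch below scores sentence pairs with cosine similarity over sentence embeddings using the sentence-transformers library. The embedding model is an arbitrary multilingual choice; an actual STS evaluation would compare such predicted scores against gold similarity judgments, typically with a correlation coefficient.

    from sentence_transformers import SentenceTransformer, util

    # Arbitrary multilingual embedding model, not an ODESIA baseline.
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    pairs = [("A man is playing a guitar.", "Un hombre toca la guitarra."),
             ("The market fell sharply.", "El gato duerme en el sofá.")]

    for s1, s2 in pairs:
        emb1, emb2 = model.encode([s1, s2], convert_to_tensor=True)
        score = util.cos_sim(emb1, emb2).item()  # higher = more similar
        print(f"{score:.2f}  {s1!r} / {s2!r}")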

More information about the tasks associated with these datasets is provided on the Tasks page.

Evaluation

This leaderboard uses the infrastructure of EvALL, an evaluation toolkit and online evaluation service. For each task a suitable evaluation metric is chosen, and the leaderboard reports both the per-task results and an aggregated result over all tasks, which is usually some form of average.
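
The snippet below illustrates this reporting scheme with scikit-learn rather than with EvALL itself: one metric per task and an unweighted mean as one possible form of aggregation. The metric assignments and toy labels are purely illustrative, not the leaderboard's actual configuration.

    from statistics import mean
    from sklearn.metrics import accuracy_score, f1_score

    # Toy gold/predicted labels for two hypothetical tasks; on the
    # leaderboard each task has its own designated metric.
    task_scores = {
        "sexism_detection": f1_score([1, 0, 1, 1], [1, 0, 0, 1]),
        "document_classification": accuracy_score([0, 2, 1, 3],
                                                  [0, 2, 1, 1]),
    }

    for task, score in task_scores.items():
        print(f"{task}: {score:.3f}")

    # One simple aggregation: the unweighted average across tasks.
    print(f"aggregate: {mean(task_scores.values()):.3f}")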

A description of the evaluation metric used for each task is provided on the Tasks page.

Computing the gap


Details regarding the gap calculation will be published shortly.