The ODESIA Leaderboard provides an evaluation infrastructure for pretrained language models in English and Spanish that allows a direct comparison between the performance of models in the two languages and, therefore, makes it possible to measure the English-Spanish effectiveness gap. The infrastructure consists of a benchmark on which the models are evaluated, an evaluation platform, and a leaderboard that shows the results per task and language together with the results of the effectiveness gap calculation.
Datasets and Tasks
A benchmark for evaluating language models usually consists of a diverse collection of Natural Language Processing tasks that can be addressed by language models through supervised training (fine-tuning) or other approaches (zero-shot, few-shot, in-context learning, etc.). The ODESIA Leaderboard benchmark consists of two groups of tasks: Core Tasks, which are evaluated on datasets with private test partitions created specifically for this leaderboard, and Extended Tasks, which are evaluated on public test partitions. Comparable datasets in English and Spanish are provided for all tasks.
The datasets with private test partitions, for which it is guaranteed that there is no contamination, are the following:
- DIPROMATS 2023. This dataset is a collection of tweets issued by diplomats from four world powers (the European Union, Russia, China and the United States) annotated according to the propaganda techniques used to convey a particular image of their countries or of their competitors at the global level. The dataset is framed within the problem of disinformation. There are three tasks associated with this dataset:
- Identification of tweets with propaganda content.
- Coarse-grained characterization of propaganda techniques (4 techniques).
- Fine-grained characterization of propaganda techniques (15 techniques subsumed in the previous ones).
- EXIST 2022. This dataset contains tweets annotated with information about sexism: a binary tag that indicates whether the tweet expresses sexism or not, and a multi-class tag that indicates the type of sexism conveyed. The dataset is framed within the problem of toxicity in social networks. There are two tasks associated with this dataset:
- Sexism detection.
- Sexism categorization.
- DIANN 2023. This dataset is a collection of biomedical abstracts, annotated with mentions of disabilities. A sequence labeling task is performed on this dataset:
- Named entity recognition of disabilities.
- EXIST 2023. Like EXIST 2022, this is a collection of tweets labeled according to the type of sexism expressed or described in them. Unlike EXIST 2022, the dataset has been annotated following the "Learning with Disagreement" (LeWiDi) paradigm. It contains annotations for three tasks covering binary classification, multiclass hierarchical classification and multilabel hierarchical classification. It is framed within the problem of toxicity in social networks. The related tasks are:
- Sexism detection in tweets.
- Sexism categorization.
- Source intention categorization.
- SQUAD/SQAC 2024. This dataset is a private test set built with the same methodology used in SQUAD (English) and SQAC (Spanish). The models can be trained on the SQUAD/SQAC datasets. It contains popular science articles from CSIC for Spanish and Cambridge University for English, annotated with extractive questions and answers. It has one associated extractive question answering task:
- Text comprehension in extractive question-answering systems. The task consists of answering questions about a text in such a way that the answer is a fragment extracted directly from the text.
In order to create these datasets in English and Spanish in a comparable way, an identical methodology has been applied in both languages to select the source texts and annotate them, and calibration mechanisms have been defined to compensate for possible differences in the intrinsic difficulty of the datasets in each language. The test sets of these datasets will be kept private indefinitely, to avoid system overfitting effects and possible contamination of the language models in the pretraining phase.
The datasets with public partitions are listed below. For all these datasets we use public training and test data, and therefore they are susceptible to contamination issues.
- Multilingual Document Classification Corpus (MLDoc) (Schwenk and Li, 2018) contains news classified in four categories: corporate/industrial, economics, government/social and markets.
- MultiCoNER 2022 (Malmasi et al., 2022) is a multilingual dataset for complex named entity recognition with six different categories.
- STS-2017 (Cer et al., 2017) is a multilingual textual similarity dataset. The task is to predict the degree of similarity between a pair of sentences.
- SQUAD/SQAC consists of a dataset in Spanish, the Spanish Question Answering Corpus (SQAC) (Gutiérrez-Fandiño et al., 2021), and another in English, SQuAD v1.1 (Rajpurkar et al., 2016). The texts are general domain texts (mainly Wikipedia) annotated with questions and their extractive answers. The associated task is text comprehension in extractive question-answering systems.
Detailed information on the tasks can be found on the Tasks page.
Evaluation
The results of the evaluation are presented in three tables (link) for each set of tasks (Core Tasks and Extended Tasks):
- A table showing the results obtained for the tasks in English by system and task.
- A table showing the results obtained for the tasks in Spanish by system and task.
- A table with the results of the English-Spanish gap calculation per task.
The ODESIA Leaderboard uses the PyEvALL evaluation library for classification tasks. PyEvALL is accessible from the Pip package manager and can be used during the development phase to evaluate the DIPROMATS 2023, EXIST 2022 and EXIST 2023 tasks. The F1 metric implemented to evaluate the tasks of the original SQUAD/SQAC dataset has been adapted to evaluate the tasks of the SQUAD/SQAC 2024 dataset; its original implementation, for use in the development phase, can be found in SQUAD METRIC. To evaluate the DIANN 2023 and MultiCoNER 2022 sequence labeling tasks, the Macro F1 metric implemented in the HuggingFace Evaluate library is used, also adapted for the ODESIA Leaderboard. Finally, the Pearson correlation metric, as implemented in SciPy, has been adapted to evaluate the STS-2017 dataset task.
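As an illustration, the public versions of these metrics can be reproduced during development with a few lines of Python. The inputs below are toy placeholders, and the leaderboard applies its own adapted implementations of the metrics:

```python
# Minimal sketch with toy inputs; the leaderboard uses its own adapted versions of these metrics.
import evaluate                   # HuggingFace Evaluate library
from scipy.stats import pearsonr  # SciPy implementation of the Pearson correlation

# Macro-averaged F1 over illustrative label ids
f1 = evaluate.load("f1")
print(f1.compute(predictions=[0, 1, 2, 1], references=[0, 1, 1, 1], average="macro"))

# Pearson correlation for STS-style similarity scores
gold = [4.5, 2.0, 0.5, 3.75]
pred = [4.1, 2.4, 1.0, 3.5]
r, p_value = pearsonr(gold, pred)
print(f"Pearson r = {r:.3f}")
```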
To obtain the results of the gap calculation, an indicator is used which compares, by language and task, (i) the results of baseline systems that do not use linguistic information with (ii) the best results registered on the leaderboard. This procedure is explained below.
Defining an indicator that measures the effectiveness gap between languages requires taking into account many variables, which in many cases also depend on the problem and on the data available for evaluation. The first problem to solve is that not all effectiveness metrics have the same scaling properties. Most metrics, such as hit rate, are bounded between zero and one; other metrics have no upper bound. Therefore, differences obtained for various problems using different metrics cannot be equated: an interval in one metric may have a completely different relevance than the same interval in another metric. It is therefore necessary to establish a unit or reference interval.
The second problem is that the effectiveness obtained may be sensitive to the intrinsic difficulty of the evaluation data. For example, with a smaller training set it is normal to obtain lower effectiveness values; but the fact that the size of the training data of a dataset is different in two languages does not mean that there is an intrinsic gap in the effectiveness of the systems in each language. To control for this aspect, it is necessary to take as a reference a baseline system that meets certain characteristics. The most crucial one is that the baseline must not use any language technology; that is, it must be a system that is not pretrained for any language and that does not use linguistic processing tools. An example of a suitable baseline is an SVM classifier that works on bags of tokens. The differences in effectiveness between languages of such a baseline are therefore determined solely by the difficulty of the evaluation datasets. With this information, the results produced by language models can be calibrated to measure the gap between languages.
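For illustration, a baseline of this kind could look like the following sketch, assuming scikit-learn and placeholder data; the essential point is that the features are raw token counts, with no language-specific processing:

```python
# Illustrative sketch of a baseline with no language technology:
# raw token counts (bag of tokens) fed to a linear SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = ["a first example text", "a second example text"]  # placeholder data
train_labels = [0, 1]
test_texts = ["an unseen example text"]

baseline = make_pipeline(
    CountVectorizer(),  # simple regex tokenization, no linguistic preprocessing
    LinearSVC(),
)
baseline.fit(train_texts, train_labels)
predictions = baseline.predict(test_texts)
```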
To this end, we take as the reference interval in each language the distance between the effectiveness of the baseline system, which we denote as \( b \), and a reference point on the scale of the metric, which we denote as \( r \). That is, we take \( |b-r| \) as the unit interval.
The reference point \( r \) also requires a prior analysis. In cases where the metric has no upper bound, or where the effectiveness of the systems is very low, this point should be the lower value of the scale of the metric. On the other hand, in cases where there is an upper bound and the effectiveness is high, the upper value of the scale of the metric should be taken as the reference point. For example, given a baseline system with effectiveness close to one in an upper-bounded metric, e.g. an 82% success rate, and taking a 100% success rate as the reference point, the unit interval is 18%.
Once this unit interval \( |b-r| \) has been defined, the effective contribution of a (non-baseline) system in a language is calculated as the difference in effectiveness between the evaluated system and the baseline system, divided by the unit interval \( |b-r| \):
$$\Delta=\frac{s-b}{|b-r|}\cdot 100$$
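For example, continuing the earlier case with a baseline of \( b=0.82 \), the reference point \( r=1 \) and a hypothetical system score of \( s=0.91 \):

$$\Delta=\frac{0.91-0.82}{|0.82-1|}\cdot 100=\frac{0.09}{0.18}\cdot 100=50$$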
The linguistic contribution in each language has the following properties. First, the contribution is zero when the best system behaves exactly like the baseline system that uses no language technology:
$$s=b\Longrightarrow\Delta=0$$
Secondly, given a fixed effectiveness of the baseline system, the contribution is proportional to the difference in effectiveness between the evaluated system and the baseline:
$$b=k\Longrightarrow\Delta\propto s-b$$
Third, given a fixed difference between the system and the baseline system, the contribution will be proportional to the inverse of the unit interval:
$$s-b=k\Longrightarrow\Delta\propto \frac{1}{|b-r|}$$
This means that, taking the maximum score as the reference point (\( r=1 \)), the contribution grows as the effectiveness of the system and of the baseline approach the maximum. For example, an improvement from 0.97 to 0.98 is more important than an improvement from 0.67 to 0.68, because it represents a larger error reduction. Conversely, when the effectiveness values are low and \( r=0 \) is taken as the reference point, an improvement from 0.1 to 0.2 is more significant than an improvement from 0.3 to 0.4: the first is a relative improvement of 100% and the second of 33%.
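Both observations follow directly from the formula for \( \Delta \). With \( r=1 \), the same absolute improvement of 0.01 yields a much larger contribution near the top of the scale, and with \( r=0 \) the improvement from 0.1 to 0.2 doubles the baseline while 0.3 to 0.4 improves it by only a third:

$$r=1:\quad\frac{0.98-0.97}{|0.97-1|}\cdot 100\approx 33.3\qquad\frac{0.68-0.67}{|0.67-1|}\cdot 100\approx 3.0$$

$$r=0:\quad\frac{0.2-0.1}{|0.1-0|}\cdot 100=100\qquad\frac{0.4-0.3}{|0.3-0|}\cdot 100\approx 33.3$$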
Thus, the effectiveness gap indicator between English (EN) and Spanish (ES) is defined as follows:
Calculation of the effectiveness gap
It represents the difference between languages in percentage improvements in relation to a baseline system.
$$Ind(EN,ES)=\Delta_{EN}-\Delta_{ES}=\frac{s_{EN}-b_{EN}}{|b_{EN}-r|}-\frac{s_{ES}-b_{ES}}{|b_{ES}-r|}$$
where \( s_{EN} \), \( b_{EN} \) and \( r \) represent, respectively, the system effectiveness, the baseline system effectiveness and the reference point for English. The notation for Spanish is analogous.
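As a minimal sketch, the indicator can be computed with a few lines of Python; the scores below are hypothetical, reusing the 82% baseline example given earlier:

```python
# Minimal sketch of the gap computation; all scores below are hypothetical.
def delta(s: float, b: float, r: float) -> float:
    """Linguistic contribution of a system s given a baseline b and a reference point r."""
    return (s - b) / abs(b - r) * 100

def gap(s_en: float, b_en: float, s_es: float, b_es: float, r: float) -> float:
    """English-Spanish effectiveness gap: Ind(EN, ES) = delta_EN - delta_ES."""
    return delta(s_en, b_en, r) - delta(s_es, b_es, r)

print(delta(0.91, 0.82, 1.0))            # ≈ 50.0
print(gap(0.91, 0.82, 0.88, 0.80, 1.0))  # ≈ 50.0 - 40.0 = 10.0
```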
Baseline Systems
The baseline systems used as reference to calibrate the effectiveness of the language models in English and Spanish are the following:
- DIPROMATS 2023. For task 1 we trained logistic regression, XGBoost and Support Vector Machine (SVM) models and took the mean of the results obtained as the baseline. For tasks 2 and 3, and continuing with the idea of avoiding semantic information in the models, we used a multi-label classifier based on the K-Nearest Neighbors (KNN) algorithm.
- EXIST 2022. We vectorized the datasets without any preprocessing, to avoid using linguistic information. On the resulting representations we trained logistic regression, XGBoost and Support Vector Machine (SVM) models and took the mean of the results obtained as the baseline.
- DIANN 2023. Conditional Random Fields (CRF) were used without any linguistic information.
- EXIST 2023. This case is more complex because the ground truth is not a single label per instance, but a probability for each label that reflects the degree of agreement between annotators. To establish a baseline for each of the EXIST 2023 tasks, a simple neural network with a single hidden layer was trained for text classification (a sketch of this architecture is given after this list). The network was trained for 20 epochs with a learning rate of 0.001, using the Adam optimizer; the loss function was binary cross-entropy for the three tasks. The architecture consists of two fully connected layers with a rectified linear unit (ReLU) activation between them. The input texts were first converted to 10,000-dimensional vectors using the TF-IDF method, and the last layer produces as many outputs as there are target labels for each specific task: 2 for task 1, 4 for task 2 and 6 for task 3. To obtain a robust baseline, the results of 10 different runs for each task were averaged.
- SQUAD/SQAC 2024. The baseline algorithm simply consists of matching, by means of cosine distance, each sentence of the text with the question, taking the most similar one as the candidate answer.
- MLDoc. For this dataset we used the logistic regression, XGBoost and SVM algorithms. As in the previous baselines, we vectorized the texts avoiding the use of linguistic information.
- MultiCoNER 2022. We used Conditional Random Fields (CRF) without any linguistic information.
- STS-2017. We vectorized the two sentences of each instance using sklearn's TfidfVectorizer, and we calculated the cosine similarity between the two representations as an approximation of their similarity.
- SQUAD/SQAC. The baseline algorithm consists of matching, by cosine distance, each sentence of the context with the question, keeping the whole sentence as the candidate answer (see the sketch after this list). Note that by taking the whole sentence as the answer (to avoid any kind of linguistic processing), the baseline score is very low.
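The two sketches below illustrate the baselines referenced in the list. The first is a minimal version of the EXIST 2023 architecture described above; the toy data, the hidden-layer size (128) and the use of PyTorch are assumptions, and the real data loading, the soft LeWiDi targets and the 10-run averaging are omitted:

```python
# Sketch of a single-hidden-layer TF-IDF classifier along the lines described above.
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["example tweet one", "example tweet two"]   # placeholder inputs
targets = torch.tensor([[0.8, 0.2], [0.1, 0.9]])     # task 1: 2 outputs (soft labels)

vectorizer = TfidfVectorizer(max_features=10_000)    # 10,000-dimensional TF-IDF vectors
X = torch.tensor(vectorizer.fit_transform(texts).toarray(), dtype=torch.float32)

model = nn.Sequential(                               # two fully connected layers with a ReLU
    nn.Linear(X.shape[1], 128),                      # hidden size is an assumption
    nn.ReLU(),
    nn.Linear(128, targets.shape[1]),                # 2, 4 or 6 outputs depending on the task
)

optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.BCEWithLogitsLoss()                     # binary cross-entropy on label probabilities

for epoch in range(20):                              # 20 epochs, as in the description
    optimizer.zero_grad()
    loss = loss_fn(model(X), targets)
    loss.backward()
    optimizer.step()
```

The second sketches the sentence-matching baseline used for the SQUAD/SQAC tasks: the context sentence most similar to the question, by TF-IDF cosine similarity, is returned as the answer. The naive splitting on periods and the use of scikit-learn are assumptions:

```python
# Sketch of the sentence-matching QA baseline: no linguistic processing, just cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def answer(question: str, context: str) -> str:
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    vectors = TfidfVectorizer().fit_transform(sentences + [question])
    sims = cosine_similarity(vectors[len(sentences):], vectors[:len(sentences)])[0]
    return sentences[sims.argmax()]

print(answer("Where is the institute based?", "The institute is based in Madrid. It funds research."))
```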