Leaderboard ODESIA
Evaluation of language models in English and Spanish
Goal: to directly compare the effectiveness of language models in English and Spanish and measure the effectiveness gap between them.
Method: evaluation on the ODESIA Benchmark, a collection of Natural Language Processing tasks with comparable datasets in English and Spanish.
Goals
The ODESIA Leaderboard makes it possible (I) to measure the effectiveness gap of Spanish language models with respect to English and (II) to compare Spanish language models with one another. If you have developed a Spanish language model, submit your results!
Results
The average effectiveness gap between Spanish and English is 20%, with a standard error of ±4%. The gap is more pronounced in the most difficult tasks (exceeding 200% in the most intrinsically difficult task), so the average value is only partly representative.
Tasks
Two sets of tasks are used: (I) ODESIA CORE, bilingual tasks with private test data (this avoids contamination, that is, models having seen the evaluation keys during pre-training); and (II) ODESIA EXTENDED, which adds a set of standard, publicly available bilingual tasks.
Methodology
The ODESIA Leaderboard uses a set of bilingual tasks to compare the state of the art in English and Spanish. For each task, (I) the intrinsic difficulty is estimated by applying several non-linguistic algorithms and (II) the best results in each language are calibrated using that intrinsic difficulty.
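To illustrate why calibrating against intrinsic difficulty matters, here is a minimal Python sketch. It is not the leaderboard's actual formula, which this page does not spell out. It assumes, purely for illustration, that each language has a baseline score produced by a non-linguistic algorithm, that calibrated effectiveness is the relative improvement of the best model over that baseline, and that the gap is the relative difference between the two calibrated values. The function names and the baseline figures are hypothetical.

```python
def calibrated_effectiveness(best_score: float, baseline: float) -> float:
    """Relative improvement of the best model over a non-linguistic baseline.

    Assumption: this normalisation is illustrative, not ODESIA's published formula.
    """
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (best_score - baseline) / baseline


def effectiveness_gap(best_en: float, baseline_en: float,
                      best_es: float, baseline_es: float) -> float:
    """Relative difference between calibrated English and Spanish effectiveness."""
    en = calibrated_effectiveness(best_en, baseline_en)
    es = calibrated_effectiveness(best_es, baseline_es)
    if es <= 0:
        raise ValueError("Spanish best score must exceed its baseline")
    return (en - es) / es


# Hypothetical inputs: identical best scores, but a higher (easier) Spanish baseline.
gap = effectiveness_gap(best_en=0.82, baseline_en=0.40,
                        best_es=0.82, baseline_es=0.42)
print(f"Illustrative calibrated gap: {gap:.1%}")
```

Under these assumptions, two languages with identical raw scores can still show a positive gap when the Spanish version of the task has a higher baseline, which is one way a task with equal best results could end up with a non-zero gap.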
Leaderboard
Odesia Core Tasks
# | System | Arithmetic mean | EXIST 2022: Sexism detection (ES) | EXIST 2022: Sexism categorisation (ES) | DIPROMATS 2023: Propaganda identification (ES) | DIPROMATS 2023: Coarse propaganda characterization (ES) | DIPROMATS 2023: Fine-grained propaganda characterization (ES) | DIANN 2023: Disability detection (ES) | EXIST-2023: Sexism identification (ES) | EXIST-2023: Source Intention (ES) | EXIST-2023: Sexism categorization (ES) | SQAC-SQUAD 2024: Question answering (ES) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | distilbert-base-multilingual-cased | 0.459 | 0.72 | 0.47 | 0.75 | 0.34 | 0.09 | 0.78 | 0.57 | 0.36 | 0.29 | 0.22 |
2 | distillbert-base-spanish-uncased | 0.473 | 0.72 | 0.51 | 0.77 | 0.34 | 0.07 | 0.75 | 0.60 | 0.39 | 0.33 | 0.25 |
3 | xlm-roberta-base | 0.515 | 0.74 | 0.50 | 0.79 | 0.47 | 0.10 | 0.84 | 0.62 | 0.40 | 0.32 | 0.37 |
4 | ixambert-base-cased | 0.485 | 0.71 | 0.49 | 0.77 | 0.32 | 0.06 | 0.83 | 0.60 | 0.37 | 0.34 | 0.36 |
5 | bert-base-multilingual-cased | 0.488 | 0.72 | 0.47 | 0.78 | 0.35 | 0.10 | 0.84 | 0.60 | 0.37 | 0.33 | 0.32 |
6 | bert-base-spanish-wwm-cased | 0.524 | 0.72 | 0.54 | 0.79 | 0.44 | 0.14 | 0.81 | 0.63 | 0.39 | 0.37 | 0.41 |
7 | PlanTL-GOB-ES-roberta-base-bne | 0.521 | 0.74 | 0.56 | 0.81 | 0.42 | 0.12 | 0.75 | 0.63 | 0.40 | 0.37 | 0.41 |
8 | bertin-roberta-base-spanish | 0.493 | 0.73 | 0.49 | 0.76 | 0.36 | 0.08 | 0.75 | 0.62 | 0.39 | 0.33 | 0.42 |
9 | PlanTL-GOB-ES-roberta-large-bne | 0.552 | 0.75 | 0.57 | 0.82 | 0.44 | 0.24 | 0.82 | 0.64 | 0.40 | 0.38 | 0.46 |
10 | xlm-roberta-large | 0.564 | 0.77 | 0.56 | 0.82 | 0.47 | 0.26 | 0.84 | 0.64 | 0.42 | 0.40 | 0.46 |
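
The Arithmetic mean column appears to be the unweighted mean of the ten per-task scores in each row. The short Python check below, using scores copied from two rows of the table above, reproduces the reported values; the variable and function names are illustrative.

```python
from statistics import fmean

# Per-task scores copied from the Spanish ODESIA Core table, in column order.
spanish_core_scores = {
    "distilbert-base-multilingual-cased": [0.72, 0.47, 0.75, 0.34, 0.09,
                                           0.78, 0.57, 0.36, 0.29, 0.22],
    "xlm-roberta-large": [0.77, 0.56, 0.82, 0.47, 0.26,
                          0.84, 0.64, 0.42, 0.40, 0.46],
}


def arithmetic_mean(scores: list[float]) -> float:
    """Unweighted mean of the task scores, rounded to three decimals."""
    return round(fmean(scores), 3)


means = {system: arithmetic_mean(scores)
         for system, scores in spanish_core_scores.items()}

for system, mean in means.items():
    print(f"{system}: {mean:.3f}")
```

Because every task carries the same weight, a single very hard task (such as fine-grained propaganda characterization) pulls the mean down as much as any other task would.
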
# | System | Arithmetic mean | EXIST 2022: Sexism detection (EN) | EXIST 2022: Sexism categorisation (EN) | DIANN 2023: Disability detection (EN) | DIPROMATS 2023: Propaganda identification (EN) | DIPROMATS 2023: Coarse propaganda characterization (EN) | DIPROMATS 2023: Fine-grained propaganda characterization (EN) | EXIST-2023: Sexism identification (ES) | EXIST-2023: Source Intention (ES) | EXIST-2023: Sexism categorization (ES) | EXIST-2023: Sexism categorization (EN) | EXIST-2023: Sexism identification (EN) | EXIST-2023: Source intention (EN) | SQAC-SQUAD 2024: Question answering (EN) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | bert-base-multilingual-cased | 0.485 | 0.76 | 0.50 | 0.73 | 0.80 | 0.48 | 0.18 | 0.60 | 0.37 | 0.33 | 0.34 | 0.60 | 0.32 | 0.30 |
2 | distilbert-base-multilingual-cased | 0.457 | 0.74 | 0.53 | 0.68 | 0.77 | 0.45 | 0.16 | 0.57 | 0.36 | 0.29 | 0.30 | 0.58 | 0.31 | 0.20 |
3 | distilbert-base-uncased | 0.382 | 0.77 | 0.55 | 0.66 | 0.78 | 0.47 | 0.14 | 0.37 | 0.62 | 0.34 | 0.27 | 0.00 | 0.00 | 0.00 |
4 | bert-base-cased | 0.395 | 0.76 | 0.53 | 0.72 | 0.81 | 0.50 | 0.21 | 0.37 | 0.61 | 0.32 | 0.30 | 0.00 | 0.00 | 0.00 |
5 | ixambert-base-cased | 0.488 | 0.75 | 0.53 | 0.73 | 0.78 | 0.49 | 0.14 | 0.60 | 0.37 | 0.34 | 0.36 | 0.61 | 0.32 | 0.32 |
6 | xlm-roberta-base | 0.501 | 0.76 | 0.53 | 0.76 | 0.80 | 0.54 | 0.16 | 0.62 | 0.40 | 0.32 | 0.35 | 0.62 | 0.32 | 0.33 |
7 | roberta-base | 0.408 | 0.78 | 0.53 | 0.75 | 0.81 | 0.52 | 0.19 | 0.38 | 0.63 | 0.33 | 0.38 | 0.00 | 0.00 | 0.00 |
8 | xlm-roberta-large | 0.547 | 0.79 | 0.56 | 0.78 | 0.81 | 0.52 | 0.39 | 0.64 | 0.42 | 0.40 | 0.39 | 0.63 | 0.36 | 0.42 |
9 | roberta-large | 0.452 | 0.81 | 0.58 | 0.79 | 0.82 | 0.55 | 0.47 | 0.40 | 0.64 | 0.35 | 0.46 | 0.00 | 0.00 | 0.00 |
10 | distillbert-base-spanish-uncased | 0.102 | 0.60 | 0.39 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
11 | PlanTL-GOB-ES-roberta-base-bne | 0.108 | 0.63 | 0.40 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
12 | bertin-roberta-base-spanish | 0.103 | 0.62 | 0.39 | 0.33 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
13 | bert-base-spanish-wwm-cased | 0.107 | 0.63 | 0.39 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
14 | PlanTL-GOB-ES-roberta-large-bne | 0.109 | 0.64 | 0.40 | 0.38 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
Odesia Extended Tasks
# | System | Arithmetic mean | MLDOC 2018: Document classification (ES) | Multilingual Complex Named Entity Recognition 2022 (ES) | SQAC-SQUAD 2016: Question answering (ES) | Semantic Textual Similarity 2017 (ES) | DIANN 2018: Negation detection (ES) |
---|---|---|---|---|---|---|---|
1 | xlm-roberta-base | 0.772 | 0.95 | 0.66 | 0.67 | 0.73 | 0.85 |
2 | xlm-roberta-large | 0.832 | 0.96 | 0.71 | 0.77 | 0.80 | 0.92 |
3 | bert-base-multilingual-cased | 0.750 | 0.96 | 0.64 | 0.71 | 0.70 | 0.74 |
4 | distilbert-base-multilingual-cased | 0.724 | 0.94 | 0.61 | 0.55 | 0.69 | 0.83 |
5 | PlanTL-GOB-ES-roberta-base-bne | 0.792 | 0.96 | 0.64 | 0.74 | 0.75 | 0.87 |
6 | PlanTL-GOB-ES-roberta-large-bne | 0.730 | 0.96 | 0.63 | 0.77 | 0.76 | 0.53 |
7 | bertin-roberta-base-spanish | 0.772 | 0.96 | 0.62 | 0.73 | 0.67 | 0.88 |
8 | bert-base-spanish-wwm-cased | 0.810 | 0.96 | 0.63 | 0.71 | 0.79 | 0.96 |
9 | distillbert-base-spanish-uncased | 0.724 | 0.96 | 0.61 | 0.53 | 0.74 | 0.78 |
10 | ixambert-base-cased | 0.768 | 0.96 | 0.63 | 0.71 | 0.81 | 0.73 |
# | System | Arithmetic mean | MLDOC 2018: Document classification (EN) | Multilingual Complex Named Entity Recognition 2022 (EN) | SQAC-SQUAD 2016: Question answering (EN) | Semantic Textual Similarity 2017 (EN) | DIANN 2018: Negation detection (EN) |
---|---|---|---|---|---|---|---|
1 | ixambert-base-cased | 0.804 | 0.98 | 0.65 | 0.80 | 0.82 | 0.77 |
2 | bert-base-cased | 0.784 | 0.97 | 0.68 | 0.78 | 0.82 | 0.67 |
3 | distilbert-base-uncased | 0.800 | 0.97 | 0.67 | 0.77 | 0.81 | 0.78 |
4 | roberta-large | 0.864 | 0.98 | 0.75 | 0.88 | 0.86 | 0.85 |
5 | roberta-base | 0.852 | 0.98 | 0.70 | 0.85 | 0.85 | 0.88 |
6 | distilbert-base-multilingual-cased | 0.774 | 0.97 | 0.63 | 0.75 | 0.76 | 0.76 |
7 | xlm-roberta-large | 0.868 | 0.98 | 0.74 | 0.86 | 0.84 | 0.92 |
8 | xlm-roberta-base | 0.808 | 0.98 | 0.69 | 0.80 | 0.80 | 0.77 |
9 | bert-base-multilingual-cased | 0.784 | 0.97 | 0.67 | 0.81 | 0.80 | 0.67 |
Check all the results on the Leaderboard
Spanish-English Gap
The overall gap between Spanish and English is 21%
Odesia Core Tasks
Tasks | Best result Spanish | Best result English | Gap |
---|---|---|---|
Total mean | 0.60 | 0.60 | 14% |
EXIST 2022: Sexism detection (ES) | 0.77 | 0.81 | 17% |
EXIST 2022: Sexism categorisation (ES) | 0.57 | 0.58 | 10% |
DIPROMATS 2023: Propaganda identification (ES) | 0.82 | 0.82 | 11% |
DIPROMATS 2023: Coarse propaganda characterization (ES) | 0.47 | 0.55 | 48% |
DIPROMATS 2023: Fine-grained propaganda characterization (ES) | 0.26 | 0.47 | 299% |
DIANN 2023: Disability detection (ES) | 0.84 | 0.79 | 1% |
EXIST-2023: Sexism identification (ES) | 0.64 | 0.64 | 10% |
EXIST-2023: Source Intention (ES) | 0.42 | 0.36 | -4% |
EXIST-2023: Sexism categorization (ES) | 0.40 | 0.40 | 12% |
SQAC-SQUAD 2024: Question answering (ES) | 0.46 | 0.46 | 19% |
Odesia Extended Tasks
Tasks | Best result Spanish | Best result English | Gap |
---|---|---|---|
Total mean | 0.84 | 0.88 | 35.2% |
MLDOC 2018: Document classification (ES) | 0.96 | 0.98 | 40% |
Multilingual Complex Named Entity Recognition 2022 (ES) | 0.71 | 0.75 | 5% |
SQAC-SQUAD 2016: Question answering (ES) | 0.77 | 0.88 | 25% |
Semantic Textual Similarity 2017 (ES) | 0.81 | 0.86 | 13% |
DIANN 2018: Negation detection (ES) | 0.96 | 0.92 | 93% |
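
The Total mean gap reported for the ODESIA Extended tasks matches the unweighted mean of the five per-task gaps in the table above. The Python sketch below recomputes it and also estimates a standard error of the mean. Using the sample standard deviation divided by the square root of the number of tasks is an assumption, since this page does not state how its standard error is obtained.

```python
from math import sqrt
from statistics import fmean, stdev

# Per-task gaps (in percent) copied from the ODESIA Extended gap table.
extended_gaps_pct = {
    "MLDOC 2018: Document classification": 40.0,
    "Multilingual Complex Named Entity Recognition 2022": 5.0,
    "SQAC-SQUAD 2016: Question answering": 25.0,
    "Semantic Textual Similarity 2017": 13.0,
    "DIANN 2018: Negation detection": 93.0,
}

gaps = list(extended_gaps_pct.values())

# Unweighted mean of the per-task gaps.
mean_gap = fmean(gaps)

# Assumed estimator: sample standard deviation over sqrt(number of tasks).
standard_error = stdev(gaps) / sqrt(len(gaps))

print(f"Mean gap: {mean_gap:.1f}%")
print(f"Standard error (assumed estimator): {standard_error:.1f}%")
```

With only five tasks and one very large value, the standard error is wide, which is consistent with the caveat that the average gap is only partly representative.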