Leaderboard ODESIA


Evaluation of language models in English and Spanish

NEW: ODESIA CHALLENGE -- Submit the best system for the ODESIA CORE tasks and win a cash prize of 3,000€

Deadline: February 2, 2025

Goal: to compare the effectiveness of language models in English and Spanish directly, and so measure the effectiveness gap between the two languages.
Method: evaluation on the ODESIA Benchmark, a collection of Natural Language Processing tasks with comparable datasets in English and Spanish.

Goals

The ODESIA Leaderboard makes it possible (I) to measure the effectiveness gap of Spanish language models with respect to English, and (II) to evaluate Spanish language models against one another. If you have developed a Spanish language model, submit your results!

For more details check here.

Results

The average effectiveness gap between Spanish and English is 20%, with a standard error of ±4%. Note that the gap is more pronounced in the most difficult tasks (exceeding 200% in the most intrinsically difficult task), so the average value is only partly representative.

For more details check here.

Tasks

Two sets of tasks are used: (I) ODESIA CORE, bilingual tasks with private test data, which avoids contamination (that is, models having seen the evaluation keys during pre-training); and (II) ODESIA EXTENDED, which adds a set of standard, publicly available bilingual tasks.

For more details check here.

Methodology

The ODESIA Leaderboard uses a set of bilingual tasks to compare the state of the art in English and Spanish. On each task, (I) the intrinsic difficulty is estimated by applying several non-linguistic algorithms, and (II) the best results in each language are calibrated using that intrinsic difficulty.

For more details check here.
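
As a rough illustration of why calibrating by intrinsic difficulty matters, the sketch below compares a raw relative gap with a baseline-adjusted one. The formula, the function name `calibrated_gap`, and the baseline value 0.60 are assumptions made for this example only; the page does not state the exact calibration used by ODESIA. The best results 0.77 and 0.82 are the ones reported for EXIST 2022 sexism detection.

```python
def calibrated_gap(best_es: float, best_en: float,
                   baseline_es: float, baseline_en: float) -> float:
    """Relative English-over-Spanish gap after subtracting a per-language baseline.

    Hypothetical formula for illustration; not confirmed as the ODESIA definition.
    The baselines stand in for the "intrinsic difficulty" estimated with
    non-linguistic algorithms.
    """
    gain_es = best_es - baseline_es
    gain_en = best_en - baseline_en
    if gain_es <= 0:
        raise ValueError("Spanish best result must exceed its baseline")
    return (gain_en - gain_es) / gain_es


# Best results taken from the page; the 0.60 baseline is a made-up placeholder.
raw = calibrated_gap(0.77, 0.82, baseline_es=0.0, baseline_en=0.0)
adjusted = calibrated_gap(0.77, 0.82, baseline_es=0.60, baseline_en=0.60)
print(f"raw gap: {raw:.1%}, baseline-adjusted gap: {adjusted:.1%}")
```

Under this assumed formula, the same absolute difference between languages turns into a larger relative gap when the baseline is high, which is in line with the observation that the gap is more pronounced in the most difficult tasks.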

Leaderboard

ODESIA Core Tasks

Results for Spanish. Columns: System, Arithmetic mean, then the score on each of the ten Core tasks.
Ixa ehu ixambert base cased 0.4819 0.6743 0.4875 0.7666 0.3796 0.0543 0.7580 0.6117 0.3890 0.3412 0.3570
Bertin roberta base spanish 0.4984 0.7280 0.4941 0.7596 0.2532 0.2500 0.6877 0.6465 0.4146 0.3331 0.4172
Xlm roberta large 0.5873 0.7663 0.5593 0.8186 0.5343 0.4527 0.7855 0.6564 0.4414 0.3995 0.4589
Xlm roberta base 0.5264 0.7395 0.4997 0.7894 0.4504 0.2668 0.7819 0.6236 0.4245 0.3195 0.3691
PlanTL GOB ES roberta large bne 0.5626 0.7241 0.5668 0.8177 0.5173 0.3894 0.6757 0.6671 0.4237 0.3798 0.4640
PlanTL GOB ES roberta base bne 0.5453 0.7356 0.5554 0.8149 0.4906 0.2944 0.7169 0.6531 0.4173 0.3688 0.4061
Distilbert base multilingual cased 0.4728 0.7222 0.4669 0.7507 0.4036 0.2222 0.6868 0.5851 0.3823 0.2874 0.2207
Dccuchile bert base spanish wwm cased 0.5408 0.7146 0.5370 0.7916 0.4874 0.2931 0.7478 0.6326 0.4182 0.3738 0.4118
CenIA distillbert base spanish uncased 0.4864 0.7203 0.5118 0.7708 0.4198 0.1782 0.6531 0.6128 0.4160 0.3324 0.2484
Bert base multilingual cased 0.5073 0.7222 0.4693 0.7821 0.4231 0.2562 0.7592 0.6136 0.3917 0.3326 0.3225

Results for English. Columns: System, Arithmetic mean, then the score on each of the ten Core tasks.
Ixa ehu ixambert base cased 0.5286 0.7563 0.5300 0.7450 0.7796 0.4430 0.4004 0.3556 0.5913 0.3622 0.3222
Xlm roberta large 0.5723 0.7953 0.5422 0.7740 0.7931 0.4867 0.5123 0.3866 0.6135 0.4029 0.4163
Xlm roberta base 0.5279 0.7661 0.5345 0.7438 0.7791 0.4329 0.3773 0.3487 0.5983 0.3735 0.3251
Roberta large 0.5961 0.8187 0.5846 0.7982 0.7984 0.5204 0.5526 0.4026 0.6262 0.3962 0.4626
Roberta base 0.5522 0.7875 0.5258 0.7612 0.7799 0.4811 0.4406 0.3774 0.6155 0.3779 0.3746
Distilbert base uncased 0.5120 0.7739 0.5486 0.6966 0.7687 0.4054 0.3035 0.3676 0.6044 0.3844 0.2670
Distilbert base multilingual cased 0.4828 0.7388 0.4792 0.6950 0.7471 0.3794 0.3592 0.3041 0.5683 0.3576 0.1994
Bert base cased 0.5329 0.7641 0.5344 0.7364 0.7763 0.4468 0.4271 0.3659 0.6083 0.3701 0.2996
Bert base multilingual cased 0.5171 0.7563 0.5022 0.7384 0.7709 0.4266 0.3884 0.3443 0.5876 0.3618 0.2948
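
Each row in the tables above gives the system name, then the arithmetic mean, then the per-task scores. As a minimal sketch (the helper name `parse_row` is hypothetical and not part of any ODESIA tooling), the following Python snippet splits one row copied from the first Core table and recomputes its mean from the ten task scores.

```python
def parse_row(line: str) -> tuple[str, list[float]]:
    """Split a flattened leaderboard row into (system name, list of numbers).

    Trailing tokens that parse as floats are treated as scores; everything
    before them is the system name.
    """
    tokens = line.split()
    scores: list[float] = []
    while tokens:
        try:
            scores.append(float(tokens[-1]))
        except ValueError:
            break
        tokens.pop()
    scores.reverse()
    return " ".join(tokens), scores


row = ("Xlm roberta large 0.5873 0.7663 0.5593 0.8186 0.5343 0.4527 "
       "0.7855 0.6564 0.4414 0.3995 0.4589")
name, numbers = parse_row(row)
reported_mean, task_scores = numbers[0], numbers[1:]
recomputed_mean = sum(task_scores) / len(task_scores)
print(name, reported_mean, round(recomputed_mean, 4))
```

For this row the recomputed value agrees with the reported arithmetic mean to four decimal places, which suggests the first number is a plain unweighted average over the tasks.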

ODESIA Extended Tasks

Results for Spanish. Columns: System, Arithmetic mean, then the score on each of the four Extended tasks.
Ixa ehu ixambert base cased 0.7764 0.9579 0.5926 0.7429 0.8120
Bertin roberta base spanish 0.7484 0.9605 0.5215 0.7298 0.7818
Xlm roberta large 0.8156 0.9641 0.6801 0.7895 0.8287
Xlm roberta base 0.7646 0.9534 0.6201 0.6988 0.7861
PlanTL GOB ES roberta large bne 0.7922 0.9567 0.6069 0.7818 0.8232
PlanTL GOB ES roberta base bne 0.7823 0.9570 0.6041 0.7584 0.8096
Distilbert base multilingual cased 0.7088 0.9425 0.5580 0.5566 0.7781
Dccuchile bert base spanish wwm cased 0.7661 0.9564 0.5472 0.7276 0.8330
CenIA distillbert base spanish uncased 0.7182 0.9553 0.5894 0.5329 0.7951
Bert base multilingual cased 0.7613 0.9562 0.5992 0.6976 0.7920
distilbert-base-multilingual-cased 0.1375 0.5500 0.0000 0.0000 0.0000

Results for English. Columns: System, Arithmetic mean, then the score on each of the four Extended tasks.
Ixa ehu ixambert base cased 0.8047 0.9756 0.6075 0.8187 0.8170
Xlm roberta large 0.8457 0.9789 0.7007 0.8581 0.8450
Xlm roberta base 0.7984 0.9761 0.6080 0.7998 0.8097
Roberta large 0.8556 0.9832 0.7012 0.8724 0.8656
Roberta base 0.8345 0.9802 0.6577 0.8427 0.8572
Distilbert base uncased 0.8063 0.9726 0.6563 0.7602 0.8360
Distilbert base multilingual cased 0.7681 0.9693 0.5693 0.7467 0.7872
Bert base cased 0.8036 0.9749 0.5993 0.7968 0.8434
Bert base multilingual cased 0.8035 0.9716 0.6252 0.8059 0.8112
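
The per-task best results reported further down the page appear to be the column-wise maxima of these tables. The sketch below checks this on three rows copied from the first Extended table (arithmetic-mean column dropped); the variable names are illustrative, and the assumption that the first table holds the Spanish results is inferred from the models listed in it.

```python
# Per-task scores for three systems from the first Extended table
# (arithmetic-mean column omitted).
spanish_extended = {
    "Xlm roberta large": [0.9641, 0.6801, 0.7895, 0.8287],
    "PlanTL GOB ES roberta large bne": [0.9567, 0.6069, 0.7818, 0.8232],
    "Dccuchile bert base spanish wwm cased": [0.9564, 0.5472, 0.7276, 0.8330],
}

# zip(*rows) transposes rows into per-task columns; take the best score in each.
best_per_task = [max(column) for column in zip(*spanish_extended.values())]
best_rounded = [round(score, 2) for score in best_per_task]
print(best_rounded)
```

Rounded to two decimals, these maxima match the four "Best result Spanish" values listed for the Extended tasks.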

Check all the results on the Leaderboard

Gap Spanish-English

The overall gap between Spanish and English is 21%


ODESIA Core Tasks

Columns: Task, Best result Spanish, Best result English, Gap.
Total mean 0.59 0.60 20%
EXIST 2022: Sexism detection 0.77 0.82 49%
EXIST 2022: Sexism categorisation 0.57 0.58 30%
DIPROMATS 2023: Propaganda identification 0.82 0.80 25%
DIPROMATS 2023: Coarse propaganda characterization 0.53 0.52 -1%
DIPROMATS 2023: Fine-grained propaganda characterization 0.45 0.55 23%
DIANN 2023: Disability detection 0.79 0.80 71%
EXIST-2023: Sexism identification (soft-soft) 0.67 0.63 -5%
EXIST-2023: Source Intention (soft-soft) 0.44 0.40 -4%
EXIST-2023: Sexism categorization (soft-soft) 0.40 0.40 8%
SQAC-SQUAD 2024: Question answering 0.46 0.46 2%

ODESIA Extended Tasks

Columns: Task, Best result Spanish, Best result English, Gap.
Total mean 0.82 0.86 23.5%
MLDOC 2018: Document classification 0.96 0.98 66%
Multilingual Complex Named Entity Recognition 2022 0.68 0.70 -6%
SQAC-SQUAD 2016: Question answering 0.79 0.87 26%
Semantic Textual Similarity 2017 0.83 0.87 8%
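
The summary figures can be checked against the per-task gaps listed in the two tables above. The short Python sketch below simply averages the reported percentages; the variable names are illustrative, and an unweighted mean is an assumption about how the totals were obtained.

```python
# Per-task gaps (in percent) copied from the two tables above.
core_gaps = [49, 30, 25, -1, 23, 71, -5, -4, 8, 2]
extended_gaps = [66, -6, 26, 8]


def mean(values: list[float]) -> float:
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


core_mean = mean(core_gaps)
extended_mean = mean(extended_gaps)
overall_mean = mean(core_gaps + extended_gaps)
print(f"Core: {core_mean:.1f}%  Extended: {extended_mean:.1f}%  "
      f"All tasks: {overall_mean:.1f}%")
```

The Core average is consistent with the 20% total reported for the Core tasks, the Extended average with the 23.5% total, and the average over all fourteen tasks with the 21% overall gap.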

Check all the results on the Leaderboard

Participate

You can participate in several ways:

(1) Evaluating language models in Spanish or English.
(2) Evaluating multilingual models in Spanish and English.

If you want to evaluate your model on a single task, you can do so on EvALL.

Register and participate by sending us your results.