ODESIA Leaderboard
Evaluating language models in English and Spanish
Objectives: establish a direct comparison between model performance in English and Spanish in order to measure the effectiveness gap.
Method: evaluation on the ODESIA Benchmark, a collection of Natural Language Processing tasks with comparable datasets in English and Spanish.
Objectives
The ODESIA Leaderboard makes it possible to (I) measure the effectiveness gap between language models in Spanish and in English, and (II) comparatively evaluate language models in Spanish. If you have developed a language model in Spanish, submit your results!
Results
The average effectiveness gap between Spanish and English is 20%, with a standard error of ±6%. Note that the gap is wider on the more difficult tasks (exceeding 200% on the task with the highest intrinsic difficulty), so the average value is only partially representative.
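The average and standard error reported above can be reproduced from a list of per-task gap percentages. A minimal sketch, using hypothetical gap values for illustration only (the real ODESIA gaps come from the difficulty-calibrated comparison described below):

```python
import statistics

def gap_summary(gaps):
    """Return (mean, standard error) of per-task gap percentages."""
    mean = statistics.mean(gaps)
    # Standard error = sample standard deviation / sqrt(number of tasks)
    stderr = statistics.stdev(gaps) / len(gaps) ** 0.5
    return mean, stderr

# Hypothetical per-task gaps (%), not the published ODESIA figures
gaps = [5.0, 12.0, 30.0, 18.0, 25.0]
mean, stderr = gap_summary(gaps)
print(f"mean gap: {mean:.1f}%  standard error: ±{stderr:.1f}%")
```

A skewed distribution of gaps (a few very hard tasks with extreme values) inflates the standard error, which is why the document cautions that the average is only partially representative.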
Tasks
Two sets of tasks are used: (I) ODESIA CORE, bilingual tasks with private test data (this avoids contamination, i.e. the models having seen the evaluation labels during the pre-training phase); and (II) ODESIA EXTENDED, which adds a set of five standard, publicly available bilingual tasks.
Methodology
The ODESIA Leaderboard uses a set of 14 bilingual tasks to compare the state of the art in English and Spanish. For each task, (I) the intrinsic difficulty is estimated by applying several non-linguistic algorithms, and (II) the best results in each language are calibrated using that intrinsic difficulty.
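The exact difficulty-calibration procedure is not detailed here, so the published per-task gaps cannot be reproduced from the raw scores alone. As a baseline illustration only, an uncalibrated relative gap between the best scores in each language could be computed as:

```python
def relative_gap(best_es: float, best_en: float) -> float:
    """Uncalibrated relative gap (%) of Spanish with respect to English.
    Positive values mean the best English system outperforms the best
    Spanish one. This is NOT the leaderboard's calibrated gap, which
    additionally factors in the task's intrinsic difficulty."""
    return 100 * (best_en - best_es) / best_es

# Example with best scores of 0.77 (ES) vs 0.81 (EN)
print(round(relative_gap(0.77, 0.81), 1))
```

Comparing this naive value against the tables below shows how much the difficulty calibration amplifies the gap on low-scoring (harder) tasks.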
Leaderboard
Odesia Core Tasks
# | System | Arithmetic mean | EXIST 2022: Sexism detection (ES) | EXIST 2022: Sexism categorisation (ES) | DIPROMATS 2023: Propaganda identification (ES) | DIPROMATS 2023: Coarse propaganda characterization (ES) | DIPROMATS 2023: Fine-grained propaganda characterization (ES) | DIANN 2023: Disability detection (ES) | EXIST-2023: Sexism identification (ES) | EXIST-2023: Source intention (ES) | EXIST-2023: Sexism categorization (ES) | SQAC-SQUAD 2024: Question answering (ES) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | distilbert-base-multilingual-cased | 0.459 | 0.72 | 0.47 | 0.75 | 0.34 | 0.09 | 0.78 | 0.57 | 0.36 | 0.29 | 0.22 |
2 | distillbert-base-spanish-uncased | 0.473 | 0.72 | 0.51 | 0.77 | 0.34 | 0.07 | 0.75 | 0.60 | 0.39 | 0.33 | 0.25 |
3 | xlm-roberta-base | 0.515 | 0.74 | 0.50 | 0.79 | 0.47 | 0.10 | 0.84 | 0.62 | 0.40 | 0.32 | 0.37 |
4 | ixambert-base-cased | 0.485 | 0.71 | 0.49 | 0.77 | 0.32 | 0.06 | 0.83 | 0.60 | 0.37 | 0.34 | 0.36 |
5 | bert-base-multilingual-cased | 0.488 | 0.72 | 0.47 | 0.78 | 0.35 | 0.10 | 0.84 | 0.60 | 0.37 | 0.33 | 0.32 |
6 | bert-base-spanish-wwm-cased | 0.524 | 0.72 | 0.54 | 0.79 | 0.44 | 0.14 | 0.81 | 0.63 | 0.39 | 0.37 | 0.41 |
7 | PlanTL-GOB-ES-roberta-base-bne | 0.521 | 0.74 | 0.56 | 0.81 | 0.42 | 0.12 | 0.75 | 0.63 | 0.40 | 0.37 | 0.41 |
8 | bertin-roberta-base-spanish | 0.493 | 0.73 | 0.49 | 0.76 | 0.36 | 0.08 | 0.75 | 0.62 | 0.39 | 0.33 | 0.42 |
9 | PlanTL-GOB-ES-roberta-large-bne | 0.552 | 0.75 | 0.57 | 0.82 | 0.44 | 0.24 | 0.82 | 0.64 | 0.40 | 0.38 | 0.46 |
10 | xlm-roberta-large | 0.564 | 0.77 | 0.56 | 0.82 | 0.47 | 0.26 | 0.84 | 0.64 | 0.42 | 0.40 | 0.46 |
# | System | Arithmetic mean | EXIST 2022: Sexism detection (EN) | EXIST 2022: Sexism categorisation (EN) | DIANN 2023: Disability detection (EN) | DIPROMATS 2023: Propaganda identification (EN) | DIPROMATS 2023: Coarse propaganda characterization (EN) | DIPROMATS 2023: Fine-grained propaganda characterization (EN) | EXIST-2023: Sexism categorization (EN) | EXIST-2023: Sexism identification (EN) | EXIST-2023: Source intention (EN) | SQAC-SQUAD 2024: Question answering (EN) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | bert-base-multilingual-cased | 0.501 | 0.76 | 0.50 | 0.73 | 0.80 | 0.48 | 0.18 | 0.34 | 0.60 | 0.32 | 0.30 |
2 | distilbert-base-multilingual-cased | 0.472 | 0.74 | 0.53 | 0.68 | 0.77 | 0.45 | 0.16 | 0.30 | 0.58 | 0.31 | 0.20 |
3 | distilbert-base-uncased | 0.497 | 0.77 | 0.55 | 0.66 | 0.78 | 0.47 | 0.14 | 0.37 | 0.62 | 0.34 | 0.27 |
4 | bert-base-cased | 0.513 | 0.76 | 0.53 | 0.72 | 0.81 | 0.50 | 0.21 | 0.37 | 0.61 | 0.32 | 0.30 |
5 | ixambert-base-cased | 0.503 | 0.75 | 0.53 | 0.73 | 0.78 | 0.49 | 0.14 | 0.36 | 0.61 | 0.32 | 0.32 |
6 | xlm-roberta-base | 0.517 | 0.76 | 0.53 | 0.76 | 0.80 | 0.54 | 0.16 | 0.35 | 0.62 | 0.32 | 0.33 |
7 | roberta-base | 0.530 | 0.78 | 0.53 | 0.75 | 0.81 | 0.52 | 0.19 | 0.38 | 0.63 | 0.33 | 0.38 |
8 | xlm-roberta-large | 0.565 | 0.79 | 0.56 | 0.78 | 0.81 | 0.52 | 0.39 | 0.39 | 0.63 | 0.36 | 0.42 |
9 | roberta-large | 0.587 | 0.81 | 0.58 | 0.79 | 0.82 | 0.55 | 0.47 | 0.40 | 0.64 | 0.35 | 0.46 |
Odesia Extended Tasks
# | System | Arithmetic mean | MLDOC 2018: Document classification (ES) | Multilingual Complex Named Entity Recognition 2022 (ES) | SQAC-SQUAD 2016: Question answering (ES) | Semantic Textual Similarity 2017 (ES) | DIANN 2018: Negation detection (ES) |
---|---|---|---|---|---|---|---|
1 | xlm-roberta-base | 0.772 | 0.95 | 0.66 | 0.67 | 0.73 | 0.85 |
2 | xlm-roberta-large | 0.832 | 0.96 | 0.71 | 0.77 | 0.80 | 0.92 |
3 | bert-base-multilingual-cased | 0.750 | 0.96 | 0.64 | 0.71 | 0.70 | 0.74 |
4 | distilbert-base-multilingual-cased | 0.724 | 0.94 | 0.61 | 0.55 | 0.69 | 0.83 |
5 | PlanTL-GOB-ES-roberta-base-bne | 0.792 | 0.96 | 0.64 | 0.74 | 0.75 | 0.87 |
6 | PlanTL-GOB-ES-roberta-large-bne | 0.730 | 0.96 | 0.63 | 0.77 | 0.76 | 0.53 |
7 | bertin-roberta-base-spanish | 0.772 | 0.96 | 0.62 | 0.73 | 0.67 | 0.88 |
8 | bert-base-spanish-wwm-cased | 0.810 | 0.96 | 0.63 | 0.71 | 0.79 | 0.96 |
9 | distillbert-base-spanish-uncased | 0.724 | 0.96 | 0.61 | 0.53 | 0.74 | 0.78 |
10 | ixambert-base-cased | 0.768 | 0.96 | 0.63 | 0.71 | 0.81 | 0.73 |
# | System | Arithmetic mean | MLDOC 2018: Document classification (EN) | Multilingual Complex Named Entity Recognition 2022 (EN) | SQAC-SQUAD 2016: Question answering (EN) | Semantic Textual Similarity 2017 (EN) | DIANN 2018: Negation detection (EN) |
---|---|---|---|---|---|---|---|
1 | ixambert-base-cased | 0.804 | 0.98 | 0.65 | 0.80 | 0.82 | 0.77 |
2 | bert-base-cased | 0.784 | 0.97 | 0.68 | 0.78 | 0.82 | 0.67 |
3 | distilbert-base-uncased | 0.800 | 0.97 | 0.67 | 0.77 | 0.81 | 0.78 |
4 | roberta-large | 0.864 | 0.98 | 0.75 | 0.88 | 0.86 | 0.85 |
5 | roberta-base | 0.852 | 0.98 | 0.70 | 0.85 | 0.85 | 0.88 |
6 | distilbert-base-multilingual-cased | 0.774 | 0.97 | 0.63 | 0.75 | 0.76 | 0.76 |
7 | xlm-roberta-large | 0.868 | 0.98 | 0.74 | 0.86 | 0.84 | 0.92 |
8 | xlm-roberta-base | 0.808 | 0.98 | 0.69 | 0.80 | 0.80 | 0.77 |
9 | bert-base-multilingual-cased | 0.784 | 0.97 | 0.67 | 0.81 | 0.80 | 0.67 |
Check all the results on the Leaderboard
Spanish-English Gap
The overall gap between Spanish and English is 21%
Odesia Core Tasks
Tasks | Best result in Spanish | Best result in English | Gap |
---|---|---|---|
Overall mean | 0.60 | 0.60 | 14% |
EXIST 2022: Sexism detection (ES) | 0.77 | 0.81 | 17% |
EXIST 2022: Sexism categorisation (ES) | 0.57 | 0.58 | 10% |
DIPROMATS 2023: Propaganda identification (ES) | 0.82 | 0.82 | 11% |
DIPROMATS 2023: Coarse propaganda characterization (ES) | 0.47 | 0.55 | 48% |
DIPROMATS 2023: Fine-grained propaganda characterization (ES) | 0.26 | 0.47 | 299% |
DIANN 2023: Disability detection (ES) | 0.84 | 0.79 | 1% |
EXIST-2023: Sexism identification (ES) | 0.64 | 0.64 | 10% |
EXIST-2023: Source intention (ES) | 0.42 | 0.36 | -4% |
EXIST-2023: Sexism categorization (ES) | 0.40 | 0.40 | 12% |
SQAC-SQUAD 2024: Question answering (ES) | 0.46 | 0.46 | 19% |
Odesia Extended Tasks
Tasks | Best result in Spanish | Best result in English | Gap |
---|---|---|---|
Overall mean | 0.84 | 0.88 | 35.2% |
MLDOC 2018: Document classification (ES) | 0.96 | 0.98 | 40% |
Multilingual Complex Named Entity Recognition 2022 (ES) | 0.71 | 0.75 | 5% |
SQAC-SQUAD 2016: Question answering (ES) | 0.77 | 0.88 | 25% |
Semantic Textual Similarity 2017 (ES) | 0.81 | 0.86 | 13% |
DIANN 2018: Negation detection (ES) | 0.96 | 0.92 | 93% |