Evaluation of Generative Models
UNED-ACCESO 2024
This bilingual dataset consists of 1,003 multiple-choice questions from university entrance-level exams covering eleven subjects, in Spanish and English. All questions are taken from official, nation-wide examinations in Spain, were originally formulated in Spanish, and have never been publicly released. The English questions are professional, manual translations of the Spanish originals. Unlike similar resources such as MMLU, UNED-ACCESO 2024 enables evaluations with minimal contamination, especially in English: it is very unlikely that LLMs have seen the original Spanish questions and answers, and it is simply not possible that they have seen them in English. To compare results globally and across subjects, we use Cohen's Kappa coefficient instead of accuracy, because subjects differ in the number of multiple-choice options from which the correct answer must be chosen. With M being the number of possible answer choices, Cohen's Kappa is defined as:
\[ \text{Kappa} = \frac{\text{observed accuracy} - \text{expected accuracy}}{1 - \text{expected accuracy}} = \frac{\frac{C}{N} - \frac{1}{M}}{1 - \frac{1}{M}} \]
where C is the number of correct answers and N the total number of responses generated.
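To make the computation concrete, here is a minimal sketch in Python; the function name and the example figures are illustrative assumptions, not part of the dataset or the evaluation:

```python
def cohens_kappa(correct: int, total: int, num_options: int) -> float:
    """Chance-corrected accuracy: Cohen's Kappa against random guessing.

    correct     -- number of correct answers (C)
    total       -- total number of responses generated (N)
    num_options -- number of multiple-choice options for the subject (M)
    """
    observed = correct / total      # observed accuracy, C/N
    expected = 1.0 / num_options    # expected accuracy of a random guesser, 1/M
    return (observed - expected) / (1.0 - expected)

# Hypothetical example: 85 correct answers out of 100 questions, 4 options each.
# Kappa = (0.85 - 0.25) / (1 - 0.25) = 0.80
print(round(cohens_kappa(85, 100, 4), 2))  # -> 0.8
```

Kappa is 0 for a random guesser regardless of M and 1 for perfect accuracy, which is what makes scores comparable across subjects with different numbers of options; it also explains the small negative values in the tables below, which indicate below-chance performance.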
Mean Kappa per system and language:

System | Mean (Spanish) | Mean (English) |
---|---|---|
Claude-3-Opus | 0.81 | 0.79 |
GPT-4o | 0.77 | 0.78 |
GPT-4-Turbo | 0.78 | 0.76 |
Llama-3-70B-Instruct | 0.67 | 0.65 |
Gemma-2-27B-Instruct | 0.66 | 0.64 |
GPT-3.5-Turbo | 0.55 | 0.60 |
Mixtral-8x7B-Instruct | 0.57 | 0.56 |
Llama-3-8B-Instruct | 0.50 | 0.51 |
Mistral-7B-Instruct | 0.43 | 0.46 |
Gemma-7B-It | 0.38 | 0.41 |
Llama-2-7B-Chat | 0.25 | 0.32 |
Leniachat-Gemma-2B | 0.11 | 0.15 |
Kappa per subject (Spanish questions):

System | Arithmetic mean | BAM | Biology | Biochemistry | Economics | Fundamentals of Computing | Spanish Language | Literature | Mathematics | Math Applied to Social Sciences | Advanced Mathematics | Psychology |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Claude-3-Opus | 0.81 | 0.86 | 0.96 | 1.00 | 0.90 | 0.98 | 0.67 | 0.84 | 0.69 | 0.63 | 0.44 | 0.89 |
GPT-4-Turbo | 0.78 | 0.78 | 0.96 | 1.00 | 0.95 | 0.96 | 0.69 | 0.72 | 0.55 | 0.57 | 0.50 | 0.88 |
GPT-4o | 0.77 | 0.84 | 0.97 | 1.00 | 0.90 | 0.96 | 0.74 | 0.81 | 0.51 | 0.38 | 0.50 | 0.91 |
Llama-3-70B-Instruct | 0.67 | 0.83 | 0.89 | 0.95 | 0.79 | 0.94 | 0.39 | 0.66 | 0.38 | 0.46 | 0.25 | 0.82 |
Gemma-2-27B-Instruct | 0.66 | 0.76 | 0.92 | 1.00 | 0.79 | 0.94 | 0.50 | 0.53 | 0.34 | 0.33 | 0.38 | 0.80 |
Mixtral-8x7B-Instruct | 0.57 | 0.72 | 0.84 | 0.87 | 0.58 | 0.87 | 0.32 | 0.52 | 0.32 | 0.23 | 0.25 | 0.78 |
GPT-3.5-Turbo | 0.55 | 0.64 | 0.80 | 0.90 | 0.53 | 0.87 | 0.32 | 0.44 | 0.20 | 0.20 | 0.38 | 0.74 |
Llama-3-8B-Instruct | 0.50 | 0.57 | 0.71 | 0.82 | 0.56 | 0.79 | 0.26 | 0.37 | 0.22 | 0.30 | 0.25 | 0.67 |
Mistral-7B-Instruct | 0.43 | 0.52 | 0.67 | 0.72 | 0.42 | 0.77 | 0.25 | 0.40 | 0.05 | 0.27 | 0.06 | 0.62 |
Gemma-7B-It | 0.38 | 0.40 | 0.63 | 0.77 | 0.32 | 0.66 | 0.12 | 0.36 | 0.12 | 0.16 | 0.06 | 0.58 |
Llama-2-7B-Chat | 0.25 | 0.29 | 0.41 | 0.31 | 0.19 | 0.56 | 0.12 | 0.34 | 0.14 | 0.12 | -0.12 | 0.44 |
Leniachat-Gemma-2B | 0.11 | 0.19 | 0.21 | 0.03 | 0.06 | 0.15 | 0.05 | 0.22 | -0.03 | 0.20 | -0.12 | 0.24 |
Kappa per subject (English questions):

System | Arithmetic mean | BAM | Biology | Biochemistry | Economics | Fundamentals of Computing | Spanish Language | Literature | Mathematics | Math Applied to Social Sciences | Advanced Mathematics | Psychology |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Claude-3-Opus | 0.79 | 0.81 | 0.95 | 1.00 | 0.90 | 0.96 | 0.67 | 0.78 | 0.63 | 0.66 | 0.44 | 0.84 |
GPT-4o | 0.78 | 0.79 | 0.96 | 1.00 | 0.92 | 0.96 | 0.70 | 0.82 | 0.55 | 0.52 | 0.50 | 0.86 |
GPT-4-Turbo | 0.76 | 0.78 | 0.97 | 1.00 | 0.92 | 0.94 | 0.67 | 0.74 | 0.55 | 0.51 | 0.50 | 0.83 |
Llama-3-70B-Instruct | 0.65 | 0.74 | 0.90 | 1.00 | 0.82 | 0.92 | 0.33 | 0.60 | 0.34 | 0.44 | 0.19 | 0.82 |
Gemma-2-27B-Instruct | 0.64 | 0.72 | 0.94 | 1.00 | 0.79 | 0.87 | 0.50 | 0.55 | 0.28 | 0.33 | 0.25 | 0.81 |
GPT-3.5-Turbo | 0.60 | 0.67 | 0.84 | 0.95 | 0.61 | 0.89 | 0.36 | 0.56 | 0.28 | 0.17 | 0.50 | 0.73 |
Mixtral-8x7B-Instruct | 0.56 | 0.71 | 0.81 | 0.92 | 0.61 | 0.87 | 0.32 | 0.52 | 0.22 | 0.33 | 0.13 | 0.73 |
Llama-3-8B-Instruct | 0.51 | 0.52 | 0.77 | 0.90 | 0.61 | 0.79 | 0.38 | 0.43 | 0.20 | 0.28 | 0.13 | 0.67 |
Mistral-7B-Instruct | 0.46 | 0.57 | 0.71 | 0.82 | 0.63 | 0.77 | 0.23 | 0.36 | 0.05 | 0.23 | 0.00 | 0.65 |
Gemma-7B-It | 0.41 | 0.41 | 0.67 | 0.85 | 0.56 | 0.75 | 0.18 | 0.22 | 0.20 | 0.14 | -0.06 | 0.61 |
Llama-2-7B-Chat | 0.32 | 0.43 | 0.62 | 0.39 | 0.27 | 0.62 | 0.15 | 0.30 | 0.12 | 0.15 | 0.00 | 0.48 |
Leniachat-Gemma-2B | 0.15 | 0.29 | 0.32 | 0.24 | 0.03 | 0.13 | 0.05 | 0.22 | 0.08 | 0.07 | -0.06 | 0.27 |