Evaluation of Generative Models
UNED-ACCESO 2024
This bilingual dataset consists of 1,003 multiple-choice questions from university entrance-level exams covering eleven subjects, in Spanish and English. All questions are taken from official, nation-wide examinations in Spain, were originally formulated in Spanish, and have never been publicly released. The English questions are professional, manual translations of the Spanish originals. Unlike similar resources such as MMLU, UNED-ACCESO 2024 enables evaluations with minimal contamination, especially in English: it is very unlikely that LLMs have seen the original Spanish questions and answers, and it is simply not possible that they have seen them in English. To compare results globally and across subjects, we use Cohen's Kappa coefficient instead of accuracy, because subjects differ in the number of multiple-choice options from which the correct answer must be chosen. With M being the number of possible answer choices, Cohen's Kappa is defined as:
\[ \text{Kappa} = \frac{\text{observed accuracy} - \text{expected accuracy}}{1 - \text{expected accuracy}} = \frac{\frac{C}{N} - \frac{1}{M}}{1 - \frac{1}{M}} \]
where C is the number of correct answers and N the total number of responses generated.
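To make the computation concrete, here is a minimal sketch in Python; the function name and the example figures are illustrative assumptions, not part of the dataset or the evaluation:

```python
def cohens_kappa(correct: int, total: int, num_options: int) -> float:
    """Chance-corrected accuracy: Cohen's Kappa against random guessing.

    correct     -- number of correct answers (C)
    total       -- total number of responses generated (N)
    num_options -- number of multiple-choice options for the subject (M)
    """
    observed = correct / total      # observed accuracy, C/N
    expected = 1.0 / num_options    # expected accuracy of a random guesser, 1/M
    return (observed - expected) / (1.0 - expected)

# Hypothetical example: 85 correct answers out of 100 questions, 4 options each.
# Kappa = (0.85 - 0.25) / (1 - 0.25) = 0.80
print(round(cohens_kappa(85, 100, 4), 2))  # -> 0.8
```

Kappa is 0 for a random guesser regardless of M and 1 for perfect accuracy, which is what makes scores comparable across subjects with different numbers of options; it also explains the small negative values in the tables below, which indicate below-chance performance.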
Mean Kappa per system and language:

System | Mean (Spanish) | Mean (English) |
---|---|---|
Claude-3-Opus | 0.81 | 0.79 |
GPT-4o | 0.77 | 0.78 |
GPT-4-Turbo | 0.78 | 0.76 |
Llama-3-70B-Instruct | 0.67 | 0.65 |
Gemma-2-27B-Instruct | 0.66 | 0.64 |
GPT-3.5-Turbo | 0.55 | 0.60 |
Mixtral-8x7B-Instruct | 0.57 | 0.56 |
Llama-3-8B-Instruct | 0.50 | 0.51 |
Mistral-7B-Instruct | 0.43 | 0.46 |
Gemma-7B-It | 0.38 | 0.41 |
Llama-2-7B-Chat | 0.25 | 0.32 |
Leniachat-Gemma-2B | 0.11 | 0.15 |
Kappa per subject (Spanish questions):

System | Arithmetic mean | BAM | Biology | Biochemistry | Economics | Fundamentals of Computing | Spanish Language | Literature | Mathematics | Math Applied to Social Sciences | Advanced Mathematics | Psychology |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Claude-3-Opus | 0.81 | 0.86 | 0.96 | 1.00 | 0.90 | 0.98 | 0.67 | 0.84 | 0.69 | 0.63 | 0.44 | 0.89 |
GPT-4-Turbo | 0.78 | 0.78 | 0.96 | 1.00 | 0.95 | 0.96 | 0.69 | 0.72 | 0.55 | 0.57 | 0.50 | 0.88 |
GPT-4o | 0.77 | 0.84 | 0.97 | 1.00 | 0.90 | 0.96 | 0.74 | 0.81 | 0.51 | 0.38 | 0.50 | 0.91 |
Llama-3-70B-Instruct | 0.67 | 0.83 | 0.89 | 0.95 | 0.79 | 0.94 | 0.39 | 0.66 | 0.38 | 0.46 | 0.25 | 0.82 |
Gemma-2-27B-Instruct | 0.66 | 0.76 | 0.92 | 1.00 | 0.79 | 0.94 | 0.50 | 0.53 | 0.34 | 0.33 | 0.38 | 0.80 |
Mixtral-8x7B-Instruct | 0.57 | 0.72 | 0.84 | 0.87 | 0.58 | 0.87 | 0.32 | 0.52 | 0.32 | 0.23 | 0.25 | 0.78 |
GPT-3.5-Turbo | 0.55 | 0.64 | 0.80 | 0.90 | 0.53 | 0.87 | 0.32 | 0.44 | 0.20 | 0.20 | 0.38 | 0.74 |
Llama-3-8B-Instruct | 0.50 | 0.57 | 0.71 | 0.82 | 0.56 | 0.79 | 0.26 | 0.37 | 0.22 | 0.30 | 0.25 | 0.67 |
Mistral-7B-Instruct | 0.43 | 0.52 | 0.67 | 0.72 | 0.42 | 0.77 | 0.25 | 0.40 | 0.05 | 0.27 | 0.06 | 0.62 |
Gemma-7B-It | 0.38 | 0.40 | 0.63 | 0.77 | 0.32 | 0.66 | 0.12 | 0.36 | 0.12 | 0.16 | 0.06 | 0.58 |
Llama-2-7B-Chat | 0.25 | 0.29 | 0.41 | 0.31 | 0.19 | 0.56 | 0.12 | 0.34 | 0.14 | 0.12 | -0.12 | 0.44 |
Leniachat-Gemma-2B | 0.11 | 0.19 | 0.21 | 0.03 | 0.06 | 0.15 | 0.05 | 0.22 | -0.03 | 0.20 | -0.12 | 0.24 |
Kappa per subject (English questions):

System | Arithmetic mean | BAM | Biology | Biochemistry | Economics | Fundamentals of Computing | Spanish Language | Literature | Mathematics | Math Applied to Social Sciences | Advanced Mathematics | Psychology |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Claude-3-Opus | 0.79 | 0.81 | 0.95 | 1.00 | 0.90 | 0.96 | 0.67 | 0.78 | 0.63 | 0.66 | 0.44 | 0.84 |
GPT-4o | 0.78 | 0.79 | 0.96 | 1.00 | 0.92 | 0.96 | 0.70 | 0.82 | 0.55 | 0.52 | 0.50 | 0.86 |
GPT-4-Turbo | 0.76 | 0.78 | 0.97 | 1.00 | 0.92 | 0.94 | 0.67 | 0.74 | 0.55 | 0.51 | 0.50 | 0.83 |
Llama-3-70B-Instruct | 0.65 | 0.74 | 0.90 | 1.00 | 0.82 | 0.92 | 0.33 | 0.60 | 0.34 | 0.44 | 0.19 | 0.82 |
Gemma-2-27B-Instruct | 0.64 | 0.72 | 0.94 | 1.00 | 0.79 | 0.87 | 0.50 | 0.55 | 0.28 | 0.33 | 0.25 | 0.81 |
GPT-3.5-Turbo | 0.60 | 0.67 | 0.84 | 0.95 | 0.61 | 0.89 | 0.36 | 0.56 | 0.28 | 0.17 | 0.50 | 0.73 |
Mixtral-8x7B-Instruct | 0.56 | 0.71 | 0.81 | 0.92 | 0.61 | 0.87 | 0.32 | 0.52 | 0.22 | 0.33 | 0.13 | 0.73 |
Llama-3-8B-Instruct | 0.51 | 0.52 | 0.77 | 0.90 | 0.61 | 0.79 | 0.38 | 0.43 | 0.20 | 0.28 | 0.13 | 0.67 |
Mistral-7B-Instruct | 0.46 | 0.57 | 0.71 | 0.82 | 0.63 | 0.77 | 0.23 | 0.36 | 0.05 | 0.23 | 0.00 | 0.65 |
Gemma-7B-It | 0.41 | 0.41 | 0.67 | 0.85 | 0.56 | 0.75 | 0.18 | 0.22 | 0.20 | 0.14 | -0.06 | 0.61 |
Llama-2-7B-Chat | 0.32 | 0.43 | 0.62 | 0.39 | 0.27 | 0.62 | 0.15 | 0.30 | 0.12 | 0.15 | 0.00 | 0.48 |
Leniachat-Gemma-2B | 0.15 | 0.29 | 0.32 | 0.24 | 0.03 | 0.13 | 0.05 | 0.22 | 0.08 | 0.07 | -0.06 | 0.27 |