Generative Model Evaluation

UNED-Grados

The UNED-Grados dataset contains 12,436 multiple-choice exam questions, each with four answer options, drawn from 109 subjects across 22 university degrees spanning a range of knowledge areas. It is designed to evaluate language models in a realistic academic setting, using real exam questions from UNED. Unlike synthetic benchmarks, the dataset comes from a private repository and will not be publicly released, which limits the risk of memorization and allows the evaluation to probe reasoning across different disciplines.
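For orientation, each item in such a dataset can be thought of as a question, its four candidate answers, the correct option, and the subject/degree/area metadata used to break down results. The sketch below is purely illustrative; the field names are assumptions, not the actual UNED-Grados schema.

```python
from dataclasses import dataclass

@dataclass
class ExamQuestion:
    """Illustrative record for one exam item (field names are assumed, not the real schema)."""
    question: str        # question statement
    options: list[str]   # the four candidate answers
    answer_index: int    # index (0-3) of the correct answer
    subject: str         # one of the 109 subjects
    degree: str          # one of the 22 university degrees
    area: str            # knowledge area, e.g. "Sciences"
```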

Evaluation is conducted in a zero-shot setting, with no prior training on the dataset, using accuracy as the primary metric to compare performance across disciplines. Model temperature is set to 0 to obtain more deterministic responses.
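A minimal sketch of how such a zero-shot evaluation could be run is shown below. The `model_client.complete` call stands in for whichever API each model is served through; the prompt format and answer parsing are assumptions, not the exact protocol used.

```python
from collections import defaultdict

LETTERS = "ABCD"

def evaluate_zero_shot(model_client, questions):
    """Compute accuracy per knowledge area; no in-context examples are given (zero-shot)."""
    correct, total = defaultdict(int), defaultdict(int)
    for q in questions:
        options = "\n".join(f"{LETTERS[i]}) {opt}" for i, opt in enumerate(q.options))
        prompt = f"{q.question}\n{options}\nAnswer with a single letter (A-D)."
        # temperature=0 for (near-)deterministic responses
        reply = model_client.complete(prompt, temperature=0)
        predicted = reply.strip()[:1].upper()
        total[q.area] += 1
        correct[q.area] += predicted == LETTERS[q.answer_index]
    return {area: correct[area] / total[area] for area in total}
```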

The results show that Gemini-2.0 is the most accurate model in every area and overall (0.77), followed by Llama-3.3 (0.71), Phi-4 (0.67), and QwQ (0.64), which ranks last. Accuracy is highest in Sciences (0.81 for Gemini-2.0), followed by Health Sciences and Arts and Humanities, and drops noticeably in Social and Legal Sciences and Engineering and Architecture, the latter being the weakest area. Across the board, Gemini-2.0 leads in every category, while QwQ shows the lowest performance in each.

| Model | Arts and Humanities | Sciences | Health Sciences | Social and Legal Sciences | Engineering and Architecture |
|---|---|---|---|---|---|
| Gemini-2.0 | 0.778 | 0.807 | 0.784 | 0.760 | 0.719 |
| Llama-3.3 70B | 0.738 | 0.734 | 0.734 | 0.689 | 0.681 |
| Phi-4 14B | 0.680 | 0.729 | 0.712 | 0.661 | 0.645 |
| QwQ 32B | 0.639 | 0.681 | 0.655 | 0.625 | 0.631 |
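The overall scores quoted above aggregate the per-area results; since the five areas contain different numbers of questions, a question-weighted (micro) average and a simple per-area (macro) average need not coincide. A small sketch of both, assuming per-area question counts are available (the counts themselves are not given here):

```python
def overall_accuracy(acc_by_area, n_by_area):
    """acc_by_area: accuracy per area; n_by_area: number of questions per area (assumed known)."""
    macro = sum(acc_by_area.values()) / len(acc_by_area)
    micro = sum(acc_by_area[a] * n_by_area[a] for a in acc_by_area) / sum(n_by_area.values())
    return micro, macro
```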