The Spanish Question Answering Corpus (SQAC) is an extractive QA dataset with no unanswerable questions. Its contexts are texts extracted from the Spanish Wikipedia, encyclopedic articles, newswire articles from Wikinews, and the Spanish section of the AnCora corpus, which mixes different newswire and literature sources. Annotators were commissioned to write 18,817 questions and to mark the answer span for each one, working from 6,247 textual contexts. The guidelines were adapted from SQuAD v1.1 (Rajpurkar et al., 2016), and the annotators were all native Spanish speakers with university studies in various fields related to linguistics. Following the XQuAD (Artetxe, Ruder, and Yogatama, 2019) structure, no additional answers were collected.
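Because the dataset is extractive, every answer should be a literal substring of its context. The sketch below shows one way to check that property on a single record. It assumes SQuAD v1.1-style fields (`context`, `question`, and `answers` entries with `text` and `answer_start`); the exact field names in the SQAC release are an assumption here, and the example record is invented for illustration and is not taken from the corpus.

```python
def spans_are_consistent(record):
    """Return True if the record has at least one answer and every answer
    text equals the slice of the context starting at its character offset."""
    answers = record["answers"]
    if not answers:
        # An extractive dataset with no unanswerable questions should never hit this.
        return False
    context = record["context"]
    return all(
        context[a["answer_start"]:a["answer_start"] + len(a["text"])] == a["text"]
        for a in answers
    )


# Hypothetical record in SQuAD v1.1 style (invented text, not from SQAC).
context = "Miguel de Cervantes nació en Alcalá de Henares en 1547."
answer = "Alcalá de Henares"
record = {
    "context": context,
    "question": "¿Dónde nació Miguel de Cervantes?",
    # The offset is computed rather than hard-coded to avoid off-by-one errors.
    "answers": [{"text": answer, "answer_start": context.index(answer)}],
}

# A deliberately broken copy: the offset is shifted by one character.
broken = {
    **record,
    "answers": [{"text": answer, "answer_start": context.index(answer) + 1}],
}

print(spans_are_consistent(record))
print(spans_are_consistent(broken))
```

A check like this is useful after any preprocessing that alters the context text (for example whitespace or Unicode normalization), since such steps can shift character offsets.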
Language(s)
Spanish
Dataset description link
Year
2022
Domain
General
News
Text types
Encyclopedia entries
News
Annotations
question-answer
Data access
Public
Publication
Asier Gutiérrez Fandiño, Jordi Armengol-Estapé, Marc Pàmies, Joan Llop-Palao, Joaquín Silveira-Ocampo, Casimiro Pio Carrino, Carme Armentano-Oller, Carlos Rodriguez-Penagos, Aitor Gonzalez-Agirre, Marta Villegas (2022). Procesamiento del Lenguaje Natural, no. 68, March 2022, pp. 39-60.
NLP Topic
Number of units
18817