SQUAD/SQAC 2024 is an extension of two datasets: SQuAD v1.1 (Stanford Question Answering Dataset) (Rajpurkar et al., 2016) for English and SQAC (Spanish Question Answering Corpus) (Gutiérrez-Fandiño et al., 2021) for Spanish. The dataset contains academic news from CSIC (Consejo Superior de Investigaciones Científicas) for Spanish and from Cambridge University for English, with questions and extractive answers. The news items belong to different domains and are usually short, between 712 and 2760 tokens in English and between 514 and 2818 tokens in Spanish. Each text has at least 10 questions with their answers. The texts are addressed to the general public, so the language is not specialized. SQUAD/SQAC 2024 EN is the English portion of the dataset.
Language(s)
English
Year
2024
Domain
Diverse
Text types
Academic news
Annotations
Question-extractive answer pairs
Format
json
NLP Topic
Number of units
110
Type of units
News
Tokens
1235638
Documents
110
Test set size
110
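Example
Because the dataset is distributed as json and extends SQuAD v1.1, the question-extractive answer pairs can probably be read with a few lines of code. The sketch below assumes the file follows the public SQuAD v1.1 layout (data > paragraphs > context and qas > answers with text and answer_start); that layout is an assumption, not something confirmed for this release. The inline sample and the helper names iter_qa_pairs and is_extractive are hypothetical and only serve the illustration.

```python
import json

# Tiny inline sample in the SQuAD v1.1 layout.
# Hypothetical content for illustration, not taken from the dataset.
SAMPLE = """
{
  "version": "sample",
  "data": [
    {
      "title": "Example news item",
      "paragraphs": [
        {
          "context": "Researchers at the university studied river sediments in 2023.",
          "qas": [
            {
              "id": "q1",
              "question": "What did the researchers study?",
              "answers": [{"text": "river sediments", "answer_start": 38}]
            }
          ]
        }
      ]
    }
  ]
}
"""


def iter_qa_pairs(dataset):
    """Yield (context, question, answer_text, answer_start) for every answer."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield context, qa["question"], answer["text"], answer["answer_start"]


def is_extractive(context, answer_text, answer_start):
    """Check that the answer is a literal span of the context at the given offset."""
    return context[answer_start:answer_start + len(answer_text)] == answer_text


# Assumption: a real file would be loaded with json.load(open(path, encoding="utf-8")).
dataset = json.loads(SAMPLE)
pairs = list(iter_qa_pairs(dataset))

for context, question, answer_text, answer_start in pairs:
    ok = is_extractive(context, answer_text, answer_start)
    print(question, "->", answer_text, "| span matches:", ok)
```

Checking each answer against its character offset is a cheap way to confirm that a loaded file really is extractive before it is used as a test set.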