SQAC-SQUAD 2016: Question answering

A reading comprehension task on a dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. Systems must select the answer from all possible spans in the passage, thus needing to cope with a fairly large number of candidates.
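
As an illustration of this span-selection setting (not part of the original task page), the sketch below runs extractive QA with the Hugging Face transformers question-answering pipeline. The checkpoint name is only an assumed example of a SQuAD-fine-tuned model; any such checkpoint works.

    # Minimal sketch of extractive QA in the SQuAD setting: the prediction
    # is always a span of the passage, returned with character offsets.
    from transformers import pipeline

    # Assumed example checkpoint fine-tuned on SQuAD-style data.
    qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

    context = (
        "SQuAD was released in 2016 and contains 100,000+ questions posed "
        "by crowdworkers on a set of Wikipedia articles."
    )
    result = qa(question="When was SQuAD released?", context=context)

    # e.g. answer='2016', with start/end offsets into the passage
    print(result["answer"], result["start"], result["end"], result["score"])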

Publication
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas. Association for Computational Linguistics.
Language
English
NLP topic
Abstract task
Dataset
Year
2016
Ranking metric
F1
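
For reference, the ranking F1 follows the SQuAD definition from the cited publication: the harmonic mean of token-level precision and recall between the predicted and gold answer strings. A minimal sketch, omitting the punctuation and article stripping performed by the official SQuAD script:

    from collections import Counter

    def squad_f1(prediction: str, gold: str) -> float:
        # Token-level F1 over whitespace tokens shared by the
        # prediction and the gold answer.
        pred_tokens = prediction.lower().split()
        gold_tokens = gold.lower().split()
        common = Counter(pred_tokens) & Counter(gold_tokens)
        num_same = sum(common.values())
        if num_same == 0:
            return 0.0
        precision = num_same / len(pred_tokens)
        recall = num_same / len(gold_tokens)
        return 2 * precision * recall / (precision + recall)

    # Partial overlap earns partial credit:
    print(squad_f1("in Austin Texas", "Austin Texas"))  # 0.8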

Task results

System                                Precision  Recall  F1      CEM     Accuracy
roberta-large                         0.8724     0.8724  0.8724  0.8724  0.87
xlm-roberta-large                     0.8581     0.8581  0.8581  0.8581  0.86
roberta-base                          0.8427     0.8427  0.8427  0.8427  0.84
ixa-ehu/ixambert-base-cased           0.8187     0.8187  0.8187  0.8187  0.82
bert-base-multilingual-cased          0.8059     0.8059  0.8059  0.8059  0.81
xlm-roberta-base                      0.7998     0.7998  0.7998  0.7998  0.80
bert-base-cased                       0.7968     0.7968  0.7968  0.7968  0.80
distilbert-base-uncased               0.7602     0.7602  0.7602  0.7602  0.76
distilbert-base-multilingual-cased    0.7467     0.7467  0.7467  0.7467  0.75
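
To score a full prediction set the way such a table is built, the evaluate library's "squad" metric is a common reference implementation of the official scorer (exact match and token-level F1, taking the best score over the gold answers per question); the leaderboard's own evaluation pipeline is not shown on this page and may differ in detail.

    import evaluate

    metric = evaluate.load("squad")

    # Toy prediction/reference pair in the SQuAD schema; "q1" is a
    # made-up question id for illustration.
    predictions = [{"id": "q1", "prediction_text": "Austin, Texas"}]
    references = [{"id": "q1",
                   "answers": {"text": ["Austin, Texas"], "answer_start": [42]}}]

    print(metric.compute(predictions=predictions, references=references))
    # {'exact_match': 100.0, 'f1': 100.0}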

If you have published a result better than those on the list, send a message to odesia-comunicacion@lsi.uned.es indicating the result and the DOI of the article, along with a copy of the article if it is not openly published.