Paper, Tags: #nlp
Evaluations are misleading when the test set is drawn from the same distribution (the same annotation process) as the training data. We propose a new annotation paradigm for NLP that helps close systematic gaps in the test data. We recommend that after a dataset is constructed, its authors manually perturb the test instances in small (preserving lexical/syntactic artifacts) but meaningful ways that change the gold label, creating contrast sets.
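As a concrete illustration of the paradigm, here is a minimal Python sketch of what a contrast set could look like: an original test instance paired with small manual perturbations that flip the gold label. The `Instance`/`ContrastSet` types and the NLI-style example text and labels are invented for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Instance:
    text: str   # input (e.g., a premise -> hypothesis pair flattened to one string)
    label: str  # gold label

@dataclass
class ContrastSet:
    original: Instance                                            # the original test instance
    perturbations: List[Instance] = field(default_factory=list)  # small edits that change the gold label

# Hypothetical example: minimal lexical edits that flip the label
# while keeping the surface form (and its artifacts) nearly identical.
example = ContrastSet(
    original=Instance(
        "Two dogs are running through a field. -> Some animals are outdoors.",
        "entailment",
    ),
    perturbations=[
        Instance(
            "Two dogs are sleeping in a field. -> Some animals are running.",
            "contradiction",
        ),
        Instance(
            "Two dogs are running through a field. -> The animals belong to a farmer.",
            "neutral",
        ),
    ],
)
```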
Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets, by up to 25% in some cases.
Systematic gaps in the dataset allow simple decision rules to perform well on test data. This is evident when models achieve high test accuracy but fail on simple input perturbations, challenge examples, and covariate and label shifts.
Perturbed test sets only need to be large enough to draw substantiated conclusions about model behavior (around 1000 instances), and thus do not require undue labor from the original dataset authors. The test sets that we have created are available here.
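A minimal sketch of how the reported performance gap could be measured, assuming the `Instance`/`ContrastSet` types from the sketch above and a placeholder `predict` function (neither is the paper's actual evaluation code): accuracy is computed separately on the original instances and on their perturbations, and the drop between the two is reported.

```python
from typing import Callable, List

def accuracy(instances: List[Instance], predict: Callable[[str], str]) -> float:
    """Fraction of instances whose predicted label matches the gold label."""
    correct = sum(predict(inst.text) == inst.label for inst in instances)
    return correct / len(instances)

def contrast_gap(contrast_sets: List[ContrastSet],
                 predict: Callable[[str], str]) -> dict:
    """Compare accuracy on original instances with accuracy on their perturbations."""
    originals = [cs.original for cs in contrast_sets]
    perturbed = [p for cs in contrast_sets for p in cs.perturbations]
    orig_acc = accuracy(originals, predict)
    pert_acc = accuracy(perturbed, predict)
    return {"original": orig_acc, "contrast": pert_acc, "drop": orig_acc - pert_acc}

# Usage with a hypothetical model:
# results = contrast_gap(my_contrast_sets, my_model.predict_label)
# print(results)  # e.g. {'original': 0.87, 'contrast': 0.65, 'drop': 0.22}
```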
We recommend that the original dataset authors, the experts on the intended linguistic phenomena of their dataset, construct the contrast sets.
Section 3 describes how to create contrast sets for DROP, NLVR2, and UD parsing. The task of creating contrast sets varies for each dataset. Dataset authors should use their best judgment to select which phenomena they are most interested in studying and craft their contrast sets to explicitly test those phenomena.
Our results show that models have 'overfit' to artifacts that are present in existing datasets.