Implementation of the AAAI-21 Workshop on Scientific Document Understanding (SDU@AAAI-21) paper *A Paragraph-level Multi-task Learning Model for Scientific Fact-Verification*. There is a short video available for this work! This work ranked in the top 2 of the SciFact leaderboard as of March 30th, 2021.
We recommend you create an anaconda environment:
```
conda create --name scifact python=3.6 conda-build
```
Then, from the `scifact` project root, run

```
conda develop .
```

which will add the scifact code to your `PYTHONPATH`.
Then, install Python requirements:
```
pip install -r requirements.txt
```
If you encounter any installation problems with `sent2vec`, please check their repo.
The BioSentVec model is available here.
The SciFact claim files and corpus file are available at the SciFact repo. The checkpoint of the Paragraph-Joint model used for the paper (trained on the training set) is available here. The checkpoint of the Paragraph-Joint model used for the leaderboard submission (trained on train+dev) is available here.
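For reference, both the corpus and claim files are JSONL, one record per line, following the schemas documented in the SciFact repo. Illustrative records with made-up IDs and text: a corpus entry

```
{"doc_id": 4983, "title": "...", "abstract": ["Sentence 1.", "Sentence 2."], "structured": false}
```

and a claim entry

```
{"id": 13, "claim": "...", "evidence": {"4983": [{"sentences": [0, 1], "label": "SUPPORT"}]}, "cited_doc_ids": [4983]}
```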
Compute BioSentVec embeddings for the claims and abstracts:

```
python ComputeBioSentVecAbstractEmbedding.py --claim_file /path/to/claims.jsonl --corpus_file /path/to/corpus.jsonl --sentvec_path /path/to/sentvec_model
```
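Internally, this step amounts to embedding every claim and abstract with BioSentVec. A minimal sketch of the idea using the `sent2vec` Python bindings (file paths are placeholders, and mean-pooling over abstract sentences is an assumption, not necessarily what the script does):

```python
import json

import sent2vec

# Load the pretrained BioSentVec model (the large .bin file linked above).
model = sent2vec.Sent2vecModel()
model.load_model("/path/to/BioSentVec_PubMed_MIMICIII-bigram_d700.bin")

# SciFact abstracts are stored as lists of sentences.
with open("/path/to/corpus.jsonl") as f:
    corpus = [json.loads(line) for line in f]

# One vector per abstract: embed each sentence, then mean-pool (an assumption).
doc_embeddings = {
    doc["doc_id"]: model.embed_sentences(doc["abstract"]).mean(axis=0)
    for doc in corpus
}
```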
Then retrieve the top abstracts for each claim:

```
python SentVecAbstractRetriaval.py --claim_file /path/to/claims.jsonl --corpus_file /path/to/corpus.jsonl --k_retrieval 30 --claim_retrieved_file /output/path/of/retrieval_file.jsonl --scifact_abstract_retrieval_file /output/path/of/retrieval_file_scifact_format.jsonl
```
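Retrieval itself is then a nearest-neighbor search over these embeddings. A sketch of top-k retrieval by cosine similarity (names are illustrative, not the script's actual internals):

```python
import numpy as np

def top_k_abstracts(claim_emb, doc_ids, doc_matrix, k=30):
    """Return the doc_ids of the k abstracts most similar to the claim."""
    # Cosine similarity = dot product of L2-normalized vectors.
    claim_norm = claim_emb / (np.linalg.norm(claim_emb) + 1e-8)
    doc_norms = doc_matrix / (np.linalg.norm(doc_matrix, axis=1, keepdims=True) + 1e-8)
    scores = doc_norms @ claim_norm
    top = np.argsort(-scores)[:k]
    return [doc_ids[i] for i in top]
```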
The retrieved abstracts are available here: train, dev, test.
You need to retrieve negative samples for FEVER pre-training. We used the retrieval code from here. Empirically, retrieving only 5 negative examples per claim is enough; retrieving more can be far too time-consuming. You also need to convert the output of the retrieval code into the SciFact input format.
For your convenience, the converted retrieved FEVER examples with `k_retrieval=15` are available: train, dev.
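If you want to run the conversion yourself, the gist is to rewrite each FEVER claim and its retrieved pages as a SciFact-style claim record. A rough sketch; the FEVER-side field names (`retrieved_pages`, per-page sentence ids) are assumptions about the retrieval output rather than a documented schema:

```python
import json

# FEVER labels -> SciFact labels (SciFact has no NOT ENOUGH INFO evidence sets).
LABEL_MAP = {"SUPPORTS": "SUPPORT", "REFUTES": "CONTRADICT"}

def to_scifact_record(fever_record):
    """Rewrite one retrieved-FEVER record as a SciFact-style claim record.

    'retrieved_pages' and the per-page sentence-id mapping are assumed
    field names, not a documented schema; adapt to your retrieval output.
    """
    out = {
        "id": fever_record["id"],
        "claim": fever_record["claim"],
        "evidence": {},
        "cited_doc_ids": fever_record["retrieved_pages"],
    }
    if fever_record["label"] in LABEL_MAP:
        label = LABEL_MAP[fever_record["label"]]
        for page, sentence_ids in fever_record.get("evidence", {}).items():
            out["evidence"][page] = [{"sentences": sentence_ids, "label": label}]
    return out
```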
The checkpoint of the Paragraph-Joint model pre-trained only on the retrieved FEVER examples shared above is available here.
Run `FEVER_joint_paragraph_dynamic.py` to pre-train the model on FEVER. Use `--checkpoint` to specify the checkpoint path. Run `scifact_joint_paragraph_dynamic.py` to fine-tune on the SciFact dataset. Use `--pre_trained_model` to load the pre-trained model. Please check the other options in the source files.
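For example, a typical pre-train-then-fine-tune sequence looks like the following (paths are placeholders; check each script's argparse options for the remaining flags):

```
python FEVER_joint_paragraph_dynamic.py --checkpoint /path/to/fever_checkpoint
python scifact_joint_paragraph_dynamic.py --pre_trained_model /path/to/fever_checkpoint
```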
To make predictions with a trained checkpoint, run:

```
python scifact_joint_paragraph_dynamic_prediction.py --corpus_file /path/to/corpus.jsonl --test_file /path/to/retrieval_file.jsonl --dataset /path/to/scifact/claims_test.jsonl --batch_size 25 --k 30 --prediction /path/to/output.jsonl --evaluate --checkpoint /path/to/checkpoint
```
The file and parameter names should be self-explanatory, and most parameters are set with sensible default values.
- File names with `rationale` and `stance` are the scripts for the rationale-selection and stance-prediction models.
- File names with `FEVER` are scripts for training on the FEVER dataset; the same goes for `domain_adaptation`.
- File names with `prediction` are scripts that take the pre-trained models and perform inference.
- File names with `kgat` denote models with KGAT as the stance predictor.
- You can use `--pre_trained_model path/to/pre_trained.model` to load a model trained on the FEVER dataset and fine-tune it on SciFact.
```
@inproceedings{li2021paragraph,
  title={A Paragraph-level Multi-task Learning Model for Scientific Fact-Verification},
  author={Li, Xiangci and Burns, Gully A and Peng, Nanyun},
  booktitle={SDU@AAAI},
  year={2021}
}
```