Valla: A standardized benchmark for authorship attribution and verification.
Paper: On the State of the Art in Authorship Attribution and Authorship Verification
The name was chosen in memory of Lorenzo Valla, who, in 1440, published De falso credita et ementita Constantini Donatione declamatio. The work proved that the Donation of Constantine (in which Constantine I supposedly gave the whole of the Western Roman Empire to the Roman Catholic Church) was a forgery, using word choice and other vernacular stylistic choices as evidence.
The requirements for this project were managed with conda. See the environment.yml file for more information.
Note that environment_torched_adhom.yml is an environment specifically for working with authoridentification/methods/torched_adhominem.py.
To use any of the datasets in this repository, first download the dataset according to the instructions at the top of the corresponding script. The project expects the data to be in a structure like:
datasets
└───Dataset1
│ └───Raw
│ │ │ Raw Files
│ │ │ ...
│ │
│ └───Processed
│ │ │ Processed Files will be saved here
│ │ │ ...
└───Dataset2
│ ...
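
The layout above can be prepared ahead of time with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name `make_dataset_dirs` and the example folder names are assumptions for illustration, and the exact folder name each processing script expects should be taken from the instructions at the top of that script.

```python
from pathlib import Path


def make_dataset_dirs(root, names):
    """Create <root>/<name>/Raw and <root>/<name>/Processed for each name.

    Hypothetical helper for illustration only; it is not provided by this
    repository. Existing directories are left untouched.
    """
    created = []
    for name in names:
        for sub in ("Raw", "Processed"):
            path = Path(root) / name / sub
            # parents=True builds intermediate folders; exist_ok=True makes
            # the call safe to repeat.
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created


if __name__ == "__main__":
    # Example folder names are assumptions; check each dataset script.
    for p in make_dataset_dirs("datasets", ["Dataset1", "Dataset2"]):
        print(p)
```

After running it, download each dataset and extract it into the matching Raw folder; the Processed folder is where the scripts are expected to write their output.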
Once the data is placed and extracted into the corresponding Raw directory, the scripts can process it. The datasets currently supported are:
- Amazon
- Blogs
- CCAT50
- CMCC
- Guardian
- Gutenberg
- IMDB
- PAN20
- PAN21
- TopicConfusion
This repository holds implementations of several popular authorship attribution and verification methods,
including some based on BERT, Siamese models, multi-headed language models, BiLSTMs, compression models, n-grams, and more.
After processing a raw dataset with the corresponding script, the dataset is ready for use with any of these models.
See each file in valla/methods for more information on the available methods.
This project uses Weights & Biases both for logging and hyperparameter sweeps.
If you use this software, please cite our paper: On the State of the Art in Authorship Attribution and Authorship Verification
Feel free to contribute, and drop me a note at jacob.tyo@gmail.com if you have any questions/comments/concerns.