We analyze the following benchmark datasets in this work: FB15k, FB15k-237, WN18, WN18RR, YAGO3-10, Wikidata5M and DBpedia50k.
This repository contains all the work needed to characterize the most commonly used benchmarks for link prediction:
- Analysis of Test Leakage and Sample Selection Bias
- Topological analysis (Visualization images, network metrics)
- Experiments for analyzing bias in prediction results in FB15k, FB15k-237, WN18 and WN18RR
- Mappings for generating N-Triple files and uploading benchmark data to a knowledge graph
- SPARQL queries for analyzing bias patterns and other irregularities
All datasets except Wikidata5M can be found in the `data` folder. Wikidata5M needs to be imported manually due to its large size; it can be downloaded directly from here. The CSV files can then be generated by placing the text files under `data/Wikidata5M/original` and running the script `data/write_csv.py` (a sketch of this conversion is shown below).
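For reference, this is a minimal sketch of such a conversion, assuming the raw Wikidata5M splits are tab-separated (head, relation, tail) files; the exact file names here are assumptions, not necessarily what `data/write_csv.py` expects:

```python
import csv
from pathlib import Path

SRC = Path("data/Wikidata5M/original")
DST = Path("data/Wikidata5M")

for split in ("train", "valid", "test"):
    # File names are assumptions; adjust them to the downloaded archive.
    src_file = SRC / f"wikidata5m_transductive_{split}.txt"
    with open(src_file) as fin, open(DST / f"{split}.csv", "w", newline="") as fout:
        writer = csv.writer(fout)
        for line in fin:
            # Each raw line is assumed to hold one tab-separated triple.
            writer.writerow(line.rstrip("\n").split("\t"))
```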
Statistics regarding entities, relations and triples for each split can be found under `analysis/data/split_statistics`.
- Relation distributions: `analysis/output/relation_distribution`
- Network visualizations: `analysis/output/network_analysis/visualizations`
- Other network characteristics (degree, components, PageRank, communities): `analysis/output/network_analysis/{partition}` (a sketch for recomputing these metrics is shown below)
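As mentioned above, these metrics can be recomputed with standard graph tooling. A minimal sketch using networkx, where the CSV path and column layout are assumptions:

```python
import networkx as nx
import pandas as pd

# Path and column names are assumptions; adapt them to the split at hand.
triples = pd.read_csv("data/FB15k-237/train.csv", names=["head", "relation", "tail"])

# Treat the split as an undirected entity graph.
G = nx.Graph()
G.add_edges_from(zip(triples["head"], triples["tail"]))

# The same family of metrics reported under analysis/output/network_analysis.
degrees = dict(G.degree())
components = list(nx.connected_components(G))
pagerank = nx.pagerank(G)
communities = nx.community.louvain_communities(G, seed=42)

print(f"nodes={G.number_of_nodes()} edges={G.number_of_edges()} "
      f"components={len(components)} communities={len(communities)}")
```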
In this work, seven bias patterns are defined concerning test leakage and sample selection bias; a detection sketch for the test leakage patterns follows the two lists below.
Test leakage patterns:
- Near-duplicate relations
- Near-inverse relations
- Near-symmetric relations
For sample selection bias, we reused patterns defined in the work of Rossi et al.:
- Overrepresented tail answers (referred to as Type 1 Bias by Rossi et al.)
- Overrepresented head answers
- Default tail answers (referred to as Type 2 Bias by Rossi et al.)
- Default head answers
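As referenced above, the test leakage patterns can be detected by comparing the entity pair sets of relations. The following is a minimal sketch over (head, relation, tail) triples; the overlap threshold is an illustrative assumption, not the value used in this repository:

```python
from collections import defaultdict
from itertools import product

def relation_pairs(triples):
    """Map each relation to the set of (head, tail) pairs it connects."""
    pairs = defaultdict(set)
    for head, relation, tail in triples:
        pairs[relation].add((head, tail))
    return pairs

def near_relations(pairs, threshold=0.8):
    """Detect near-duplicate, near-inverse and near-symmetric relations.

    The overlap threshold is an illustrative assumption.
    """
    duplicates, inverses, symmetric = set(), set(), set()
    for r1, r2 in product(pairs, repeat=2):
        p1, p2 = pairs[r1], pairs[r2]
        reversed_p1 = {(t, h) for h, t in p1}
        if r1 != r2 and len(p1 & p2) / len(p1) >= threshold:
            duplicates.add((r1, r2))   # r2 connects (almost) the same pairs as r1
        if r1 != r2 and len(reversed_p1 & p2) / len(p1) >= threshold:
            inverses.add((r1, r2))     # r2 connects the same pairs in reverse
        if r1 == r2 and len(reversed_p1 & p1) / len(p1) >= threshold:
            symmetric.add(r1)          # r frequently holds in both directions
    return duplicates, inverses, symmetric
```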
Using these patterns, we queried the number of bias-affected triples for each split in every dataset with SPARQL. The queries can be found in the folder `sparql/affectedTriples`; a minimal sketch of running one against an endpoint is shown below.
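This is a hedged sketch of issuing such a count query with SPARQLWrapper; the endpoint URL and the query body are assumptions, the actual queries are the files under `sparql/affectedTriples`:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Endpoint URL is an assumption; point it at the store holding the benchmark data.
sparql = SPARQLWrapper("http://localhost:3030/benchmarks/sparql")
sparql.setReturnFormat(JSON)

# Placeholder count query; in practice, load one of the files
# from sparql/affectedTriples instead.
sparql.setQuery("""
    SELECT (COUNT(*) AS ?affected) WHERE {
        ?head ?relation ?tail .
    }
""")

result = sparql.query().convert()
print(result["results"]["bindings"][0]["affected"]["value"])
```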
The following plots provide a good overview of bias in each benchmark:
The following observations can be made:
- FB15k and WN18 suffer from test leakage due to near-inverse relations
- Test leakage can be found in YAGO3-10 through near-duplicate relations (over 63.3% of test triples have a near-duplicate in the training set)
- WN18RR contains a large share of near-symmetric relations (almost 40%), higher than in WN18
- FB15k-237 features many test triples that have a default tail answer
To understand how these bias types affect prediction results, we then tried to explain each correct prediction (based on the Hits@k metric) of a trained TransE model by one or more of our bias types. If no explanation could be found, the correct prediction is assigned to the bucket *unknown*. The Jupyter notebook for bucketing predictions can be found under `experiments/prediction_analysis.ipynb`; its logic is outlined below.
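In outline, the bucketing assigns each correct prediction to the first bias pattern that explains it. A hedged sketch, not the notebook's actual code; every helper name here is hypothetical:

```python
def bucket_prediction(head, relation, tail, patterns):
    """Assign one correct prediction to a bias bucket.

    `patterns` is assumed to expose boolean checks for the bias
    patterns, precomputed on the training split; all helper names
    are hypothetical.
    """
    if patterns.near_duplicate(head, relation, tail):
        return "near-duplicate"
    if patterns.near_inverse(head, relation, tail):
        return "near-inverse"
    if patterns.near_symmetric(head, relation, tail):
        return "near-symmetric"
    if patterns.overrepresented_answer(head, relation, tail):
        return "overrepresented answer"
    if patterns.default_answer(head, relation, tail):
        return "default answer"
    return "unknown"  # no bias pattern explains the correct prediction
```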
The following observations can be made:
- Correct predictions in FB15k and WN18 can almost completely be explained by our defined bias patterns
- In WN18RR, the majority of correct predictions can be explained with the occurrence of near-symmetric relations
- We further notice that this pattern occurs at a ~100% higher rate than in the input data. This behavior reinforces the idea that the model is biased towards learning these patterns.
First, make sure AmpliGraph version 2.0.0 is correctly installed (e.g., via `pip install ampligraph==2.0.0`).
To reproduce all of our experiments, simply start the reproducibility script via `bash reproducibility/reproduce_experiments.sh`.
The script automatically runs the whole pipeline, from calling SPARQL endpoints for the input-level dataset analysis to learning the embeddings and finally generating the plots shown above. A sketch of the embedding-training step follows.
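For orientation, this is a minimal sketch of training and evaluating a TransE model with AmpliGraph 2.0; the hyperparameters below are illustrative placeholders, not the reused ones referenced in the next sentence:

```python
from ampligraph.datasets import load_fb15k_237
from ampligraph.evaluation import hits_at_n_score, mrr_score
from ampligraph.latent_features import ScoringBasedEmbeddingModel

# Illustrative hyperparameters only; the experiments reuse the ones linked below.
X = load_fb15k_237()
model = ScoringBasedEmbeddingModel(k=150, eta=10, scoring_type="TransE")
model.compile(optimizer="adam", loss="multiclass_nll")
model.fit(X["train"], batch_size=10000, epochs=50)

# Filtered ranks over head and tail corruptions, from which Hits@k is derived.
ranks = model.evaluate(
    X["test"],
    use_filter={"train": X["train"], "valid": X["valid"], "test": X["test"]},
    corrupt_side="s,o",
)
print("MRR:", mrr_score(ranks), "Hits@10:", hits_at_n_score(ranks, n=10))
```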
The hyperparameters from here were reused for our experiments.
The Python version used is 3.8.
- Install the RDFizer:

```
python3 -m pip install rdfizer
```

- Generate the N-Triple files:

```
cd mappings
bash generate_triple_files.sh
```