This repository contains the code and resources for our proposed bias-aware negative sampling strategy. The strategy decreases the level of gender bias in neural ranking models while maintaining a comparable level of retrieval effectiveness, and it does not require any changes to the architecture or loss function of state-of-the-art neural rankers.
Table 1 shows the top-3 documents of the re-ranked lists produced by the BERT model for two fairness-sensitive queries. We can observe that the third document of the first query and the second document of the second query are inclined toward the male gender and portray supervisor and governor as male-oriented positions. However, these biased documents appear at lower positions in the re-ranked list of our proposed fairness-aware version of the model. We note that these biased documents are not among the relevance-judged documents of these queries; therefore, ranking them at a lower position does not impact the performance of the model.
| Query: is a supervisor considered a manager? | Query: How important is a governor? |
|---|---|
| So a manager or supervisor who has control or direction of the employment of an employee, or is directly or indirectly responsible for an employee's employment, is considered an employer and subject to the rights and obligations of an employer under the AEPA. Ranking Position in Ours: 1 | Understanding and Adjusting Your Governor. Generally [...]. With the engine at idle, move the governor lever with your finger to open the throttle and it should push the arm back toward idle if working properly. One way to do this test is with the governor spring removed. Ranking Position in Ours: 2 |
| A manager or supervisor of agricultural employees may also be considered an employer for the purposes of the AEPA. The AEPA defines employer to mean the employer of an agricultural employee, and any other person who, acting on behalf of the employer, has control or direction of, or is directly or indirectly responsible for, the employment … Ranking Position in Ours: 2 | Governor is important because he is the chief executive of the state. He is the little president that implements the law in the state and oversee the operations of all local government units within his area. The Governor is like the president of the state. He makes decisions for his state and makes opinions to the ppl of the state where he is president of the state that he controls.... It's important to a specific state. Not important for Congress. a governor is like a presidnet of the state. Ranking Position in Ours: 88 |
| It becomes clear that the core of the role and responsibility of a supervisor lies in overlooking the activities of others to the satisfaction of laid standards in an organization. The position of a supervisor in a company is considered to be at the lowest rung of management. A supervisor in any department has more or less the same work experience as the other members in his team, but he is considered to be the leader of the group. The word manager comes from the word management, and a manager is a person who manages men. To manage is to control and to organize things, men, and events. Managers do just that. They ensure smooth running of the day to day functioning of a workplace, whether it is business, hospital, or a factory. Ranking Position in Ours: 22 | The governor is the visible official who commands media attention. The governor, along with the lieutenant governor, is also a major legislative player. The governor sets forth a recommended legislative agenda, initiates the budget, and signs or vetoes legislative bills. The governor has several other important roles. These include the role of commander of both the Georgia State Patrol and the Georgia National Guard. Often overlooked is the role of intergovernmental middleman, a fulcrum of power and a center of political gravity. Ranking Position in Ours: 1 |
To investigate how our proposed negative sampling strategy affects retrieval effectiveness, we re-rank the MS MARCO dev small dataset, the common benchmark for testing model performance. As shown in Table 2, retrieval effectiveness remains statistically comparable to the base ranker, and no statistically significant changes in performance occur.
Table 2: Comparison between the performance (MRR@10) of the base ranker and the ranker trained based on our proposed negative sampling strategy on MS MARCO Dev Set.
| Neural Ranker | Original | Ours | Change |
|---|---|---|---|
| BERT-base-uncased | 0.3688 | 0.3583 | -2.84% |
| DistilRoBERTa-base | 0.3598 | 0.3475 | -3.42% |
| ELECTRA-base | 0.3332 | 0.3351 | +0.57% |
We adopt two bias measurement metrics to calculate the level of bias within the retrieved list of documents: (1) the ARaB metrics (Boolean and TF), which measure the level of bias in a document based on the presence (Boolean) and frequency (TF) of gendered terms; and (2) NFaiRR, which calculates the level of fairness within the retrieved list of documents based on each document's neutrality score. We note that lower ARaB and higher NFaiRR values are desirable. To investigate the impact of our proposed negative sampling strategy on reducing gender bias, we re-rank the queries in two sets of neutral queries (QS1 and QS2), which consist of 1765 and 215 queries, respectively. We report retrieval effectiveness as well as the level of fairness and bias (NFaiRR, ARaB) at cut-off 10 for each of the query sets in Table 3.
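To make the presence/frequency distinction concrete, here is a minimal sketch of document-level gender-bias scoring in the spirit of the ARaB metrics. The gendered-term lists and the exact normalization are illustrative assumptions; the official scripts use their own curated lexicons and aggregate the scores over ranked lists.

```python
import re

# Illustrative term lists; the actual ARaB implementation uses curated lexicons.
MALE_TERMS = {"he", "him", "his", "man", "men", "male"}
FEMALE_TERMS = {"she", "her", "hers", "woman", "women", "female"}

def document_bias(text: str):
    """Return (tf_bias, boolean_bias): signed male-vs-female scores."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = max(len(tokens), 1)
    male = sum(t in MALE_TERMS for t in tokens)
    female = sum(t in FEMALE_TERMS for t in tokens)
    tf_bias = (male - female) / n                   # frequency-based (TF)
    boolean_bias = int(male > 0) - int(female > 0)  # presence-based (Boolean)
    return tf_bias, boolean_bias
```

A document with no gendered terms scores zero under both variants; the TF variant additionally grows with how often gendered terms occur.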
Table 3: Retrieval effectiveness and the level of fairness and bias across three neural ranking models on query sets QS1 and QS2 at cut-off 10.
| Query Set | Neural Ranker | Training Schema | MRR@10 | NFaiRR | NFaiRR Improv. | ARaB (TF) | TF Reduction | ARaB (Boolean) | Boolean Reduction |
|---|---|---|---|---|---|---|---|---|---|
| QS1 | BERT (base uncased) | Original (Run) | 0.3494 | 0.7764 | - | 0.1281 | - | 0.0956 | - |
| QS1 | BERT (base uncased) | Ours (Run) | 0.3266 | 0.8673 | 11.71% | 0.0967 | 24.51% | 0.0864 | 9.62% |
| QS1 | DistilRoBERTa (base) | Original (Run) | 0.3382 | 0.7805 | - | 0.1178 | - | 0.0914 | - |
| QS1 | DistilRoBERTa (base) | Ours (Run) | 0.3152 | 0.8806 | 12.83% | 0.0856 | 27.33% | 0.0813 | 11.05% |
| QS1 | ELECTRA (base) | Original (Run) | 0.3265 | 0.7808 | - | 0.1273 | - | 0.0961 | - |
| QS1 | ELECTRA (base) | Ours (Run) | 0.3018 | 0.8767 | 12.28% | 0.0949 | 25.45% | 0.0855 | 11.03% |
| QS2 | BERT (base uncased) | Original (Run) | 0.2229 | 0.8779 | - | 0.0275 | - | 0.0157 | - |
| QS2 | BERT (base uncased) | Ours (Run) | 0.2265 | 0.9549 | 8.77% | 0.0250 | 9.09% | 0.0156 | 0.64% |
| QS2 | DistilRoBERTa (base) | Original (Run) | 0.2198 | 0.8799 | - | 0.0338 | - | 0.0262 | - |
| QS2 | DistilRoBERTa (base) | Ours (Run) | 0.2135 | 0.9581 | 8.89% | 0.0221 | 34.62% | 0.0190 | 27.48% |
| QS2 | ELECTRA (base) | Original (Run) | 0.2296 | 0.8857 | - | 0.0492 | - | 0.0353 | - |
| QS2 | ELECTRA (base) | Ours (Run) | 0.2081 | 0.9572 | 8.07% | 0.0279 | 43.29% | 0.0254 | 28.05% |
We also compare our proposed strategy with AdvBERT, the state-of-the-art model that leverages an adversarial training process to remove gender-related information. Since the authors shared their trained models only for BERT-Tiny and BERT-Mini and only for QS2, we compare our work with AdvBERT in Table 4 on these two shared models and the QS2 query set in terms of retrieval effectiveness, level of fairness, and level of bias.
Table 4: Comparison between AdvBERT and our proposed strategy on the QS2 query set using the BERT-Tiny and BERT-Mini models.

| Neural Ranker | Training Schema | MRR@10 | NFaiRR | NFaiRR Improv. | ARaB (TF) | TF Reduction | ARaB (Boolean) | Boolean Reduction |
|---|---|---|---|---|---|---|---|---|
| BERT-Tiny | Original (Run) | 0.1750 | 0.8688 | - | 0.0356 | - | 0.0296 | - |
| BERT-Tiny | AdvBERT (Run) | 0.1361 | 0.9257 | 6.55% | 0.0245 | 31.18% | 0.0236 | 20.27% |
| BERT-Tiny | Ours (Run) | 0.1497 | 0.9752 | 12.25% | 0.0099 | 72.19% | 0.0115 | 61.15% |
| BERT-Mini | Original (Run) | 0.2053 | 0.8742 | - | 0.0300 | - | 0.0251 | - |
| BERT-Mini | AdvBERT (Run) | 0.1515 | 0.9410 | 7.64% | 0.0081 | 73.00% | 0.0032 | 87.26% |
| BERT-Mini | Ours (Run) | 0.2000 | 0.9683 | 10.76% | 0.0145 | 51.67% | 0.0113 | 54.98% |
We note that, due to GitHub's space limitations, we have uploaded the trained neural ranking models, the original training dataset, the fairness-aware training dataset, and the TREC run files of MS MARCO dev small here.
To train any other neural ranking model with our proposed negative sampling strategy, you can follow the process below:
- You first need to download the fairness-aware training dataset from here, which contains 20 negative sample documents for each query. Among these 20 negative samples, 60% (equivalent to 12 documents) are the ones with the highest level of bias among the top-1000 documents retrieved for that query by BM25.
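The 12-plus-8 composition described above can be sketched as follows. This is an illustrative reconstruction, not the repository's actual code: the function name, the `bias_score` callable, and the random sampling of the remaining negatives are assumptions.

```python
import random

def build_negatives(bm25_top1000, bias_score, n_total=20, biased_ratio=0.6, seed=0):
    """bm25_top1000: ranked list of doc ids; bias_score: doc id -> bias magnitude."""
    n_biased = int(n_total * biased_ratio)        # 60% of 20 = 12 most-biased docs
    by_bias = sorted(bm25_top1000, key=bias_score, reverse=True)
    biased = by_bias[:n_biased]
    rest = [d for d in bm25_top1000 if d not in set(biased)]
    random.Random(seed).shuffle(rest)
    return biased + rest[: n_total - n_biased]    # 12 biased + 8 other negatives
```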
- Then, using the cross-encoder architecture, which passes two sequences (the query and a candidate document) to the transformer model, you can train your desired model. To do so, you can use the cross-encoder architecture of the Sentence Transformers library and set the model's training parameters as well as the training dataset, which would be the fairness-aware training dataset.
- Once your model is trained, you can re-rank any list of documents for a given query using the `reranker.py` script, which takes the latest checkpoint of the trained model and re-ranks the list as follows:

```bash
python reranker.py \
    -checkpoint <path to the checkpoint of the latest model> \
    -queries <path to the queries .tsv file> \
    -run <path to the run .trec file> \
    -res <path to the result folder>
```
- To calculate the MRR of the re-ranked run file, you can use the `MRR_calculator.py` script as follows:

```bash
python MRR_calculator.py \
    -qrels <path to the qrels file> \
    -run <path to the re-ranked run file obtained from the previous step> \
    -result <path to the result>
```
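For reference, the MRR@10 that this step reports can be sketched as follows: for each query, take the reciprocal rank of the first relevant document within the top 10 (0 if none appears), then average over queries. The plain-dict data structures here are illustrative, not the script's actual file formats.

```python
def mrr_at_10(run, qrels):
    """run: qid -> ranked list of doc ids; qrels: qid -> set of relevant doc ids."""
    total = 0.0
    for qid, ranking in run.items():
        for rank, doc in enumerate(ranking[:10], start=1):
            if doc in qrels.get(qid, set()):
                total += 1.0 / rank  # reciprocal rank of the first relevant hit
                break
    return total / max(len(run), 1)
```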
- Finally, to calculate the level of bias with the ARaB and NFaiRR metrics for the re-ranked run files, you can use the provided scripts, which are taken from Navid Rekabsaz's repositories (ARaB and NFaiRR) with minor changes so that they can be applied to our work. To calculate ARaB, you first need to run the `documents_calculate_bias.py` script to calculate the level of bias within each document. Subsequently, you need to use the `runs_calculate_bias.py` and `model_calculate_bias.py` scripts to calculate the TF ARaB and Boolean ARaB metrics. Finally, to calculate the level of fairness for the re-ranked runs, you need to first run `calc_documents_neutrality.py` to calculate the level of fairness within each document and then run `metrics_fairness.py` to compute the NFaiRR metric over the run file.
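The per-document neutrality score that the fairness step relies on can be sketched in the spirit of NFaiRR: the fewer gendered tokens a document contains, the closer its neutrality is to 1. The term list and the linear penalty below are illustrative assumptions; the official script uses its own lexicons and weighting.

```python
import re

# Illustrative term list; the actual NFaiRR scripts use curated lexicons.
GENDERED_TERMS = {"he", "him", "his", "she", "her", "hers",
                  "man", "men", "woman", "women", "male", "female"}

def neutrality(text: str) -> float:
    """Return a score in [0, 1]; 1.0 means fully gender-neutral."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 1.0
    gendered = sum(t in GENDERED_TERMS for t in tokens)
    return 1.0 - gendered / len(tokens)
```

NFaiRR then aggregates these per-document neutrality scores over the ranked list, discounting lower positions.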