A Light-weight Strategy for Restraining Gender Biases in Neural Rankers

This repository contains the code and resources for our proposed bias-aware negative sampling strategy. The strategy reduces the level of gender bias in neural ranking models while maintaining a comparable level of retrieval effectiveness, and it does not require any changes to the architecture or loss function of state-of-the-art neural rankers.

Example of Bias in IR

Table 1 shows the top 3 documents of the lists re-ranked by the BERT model for two fairness-sensitive queries. The third document for the first query and the second document for the second query are inclined towards the male gender and present supervisor and governor as male-oriented positions. These biased documents are ranked lower in the list re-ranked by our proposed fairness-aware version of the model. Note that these biased documents are not among the judged relevant documents for these queries, so ranking them lower does not affect the retrieval effectiveness of the model.

Table 1: Top 3 re-ranked documents by the original BERT model.

Query: is a supervisor considered a manager?
  1. So a manager or supervisor who has control or direction of the employment of an employee, or is directly or indirectly responsible for an employee's employment, is considered an employer and subject to the rights and obligations of an employer under the AEPA. (Ranking Position in Ours: 1)
  2. A manager or supervisor of agricultural employees may also be considered an employer for the purposes of the AEPA.The AEPA defines employer to mean the employer of an agricultural employee , and any other person who , acting on behalf of the employer , has control or direction of , or is directly or indirectly responsible for , the employment … (Ranking Position in Ours: 2)
  3. It becomes clear that the core of the role and responsibility of a supervisor lies in overlooking the activities of others to the satisfaction of laid standards in an organization. The position of a supervisor in a company is considered to be at the lowest rung of management. A supervisor in any department has more or less the same work experience as the other members in his team, but he is considered to be the leader of the group. The word manager comes from the word management, and a manager is a person who manages men. To manage is to control and to organize things, men, and events. Managers do just that. They ensure smooth running of the day to day functioning of a workplace, whether it is business, hospital, or a factory. (Ranking Position in Ours: 22)

Query: How important is a governor?
  1. Understanding and Adjusting Your Governor. Generally [...]. With the engine at idle, move the governor lever with your finger to open the throttle and it should push the arm back toward idle if working properly. One way to do this test is with the governor spring removed. (Ranking Position in Ours: 2)
  2. Governor is important because he is the chief executive of the state. He is the little president that implements the law in the state and oversee the operations of all local government units within his area. The Governor is like the president of the state. He makes decisions for his state and makes opinions to the ppl of the state where he is president of the state that he controls.... It's important to a specific state. Not important for Congress. a governor is like a presidnet of the state. (Ranking Position in Ours: 88)
  3. The governor is the visible official who commands media attention. The governor, along with the lieutenant governor, is also a major legislative player. The governor sets forth a recommended legislative agenda, initiates the budget, and signs or vetoes legislative bills. The governor has several other important roles. These include the role of commander of both the Georgia State Patrol and the Georgia National Guard. Often overlooked is the role of intergovernmental middleman, a fulcrum of power and a center of political gravity. (Ranking Position in Ours: 1)

Retrieval Effectiveness

To investigate how our proposed negative sampling strategy affects retrieval effectiveness, we re-rank the MS MARCO dev small set, which is commonly used for evaluating ranking models. As shown in Table 2, retrieval effectiveness remains comparable to that of the base ranker, with no statistically significant change in performance.

Table 2: Comparison between the performance (MRR@10) of the base ranker and the ranker trained based on our proposed negative sampling strategy on MS MARCO Dev Set.

| Neural Ranker | Training Schema: Original | Training Schema: Ours | Change |
|---|---|---|---|
| BERT-base-uncased | 0.3688 | 0.3583 | -2.84% |
| DistilRoBERTa-base | 0.3598 | 0.3475 | -3.42% |
| ELECTRA-base | 0.3332 | 0.3351 | +0.57% |
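
For readers who want to reproduce the numbers in Table 2 on their own runs, the following is a minimal sketch of how MRR@10 and the relative change column can be computed. It is not the repository's MRR_calculator.py; the function names `mrr_at_k` and `relative_change` and the in-memory dictionary inputs are assumptions made for illustration, and it averages over every query that has relevance judgments.

```python
def mrr_at_k(qrels, run, k=10):
    """Mean reciprocal rank at cut-off k.

    qrels: dict mapping query id -> set of relevant document ids.
    run:   dict mapping query id -> list of document ids, best first.
    Queries with no relevant document in the top k contribute 0.
    """
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, doc_id in enumerate(run.get(qid, [])[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(qrels) if qrels else 0.0


def relative_change(original, ours):
    """Relative change of `ours` with respect to `original`, in percent."""
    return 100.0 * (ours - original) / original
```

For example, `relative_change(0.3332, 0.3351)` gives the +0.57% reported for ELECTRA-base when rounded to two decimals.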

Bias Measurement

We adopt two metrics to measure the level of bias within the retrieved list of documents: (1) ARaB, which measures the level of bias in a document based on the presence (Boolean) and frequency (TF) of gendered terms in the document, and (2) NFaiRR, which measures the level of fairness within the retrieved list of documents based on each document's neutrality score. Lower ARaB and higher NFaiRR values are desirable. To investigate the impact of our proposed negative sampling strategy on reducing gender bias, we re-rank the queries in two sets of neutral queries (QS1 and QS2), which consist of 1765 and 215 queries, respectively. Table 3 reports the retrieval effectiveness as well as the level of fairness and bias (NFaiRR, ARaB) at cut-off 10 for each query set.
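
To make the two kinds of measurement more concrete, here is a simplified sketch in the spirit of the TF and Boolean document bias measures and of a neutrality-based, position-discounted fairness score. It is not the code of the ARaB or NFaiRR scripts used in this repository: the tiny word lists, the whitespace tokenization, the sign convention (positive means female-leaning), and all function names are assumptions chosen for illustration, and the normalization uses the supplied candidate list as the background set.

```python
import math

# Hypothetical, tiny word lists for illustration only; real lexicons
# of gendered terms are much larger.
FEMALE_TERMS = {"she", "her", "woman", "women"}
MALE_TERMS = {"he", "his", "him", "man", "men"}


def gender_counts(text):
    """Count female and male term occurrences (whitespace tokens)."""
    tokens = text.lower().split()
    female = sum(t in FEMALE_TERMS for t in tokens)
    male = sum(t in MALE_TERMS for t in tokens)
    return female, male


def tf_bias(text):
    """TF-style document bias: difference of log-scaled term counts."""
    female, male = gender_counts(text)
    return math.log(female + 1) - math.log(male + 1)


def boolean_bias(text):
    """Boolean-style document bias: difference of presence indicators."""
    female, male = gender_counts(text)
    return int(female > 0) - int(male > 0)


def neutrality(text):
    """Score in [0, 1]: 1 for no gendered terms or a balanced mix."""
    female, male = gender_counts(text)
    if female + male == 0:
        return 1.0
    return 1.0 - 2.0 * abs(female / (female + male) - 0.5)


def rank_bias(docs, doc_bias, k=10):
    """Mean document bias over the top-k documents of one ranked list."""
    top = docs[:k]
    return sum(doc_bias(d) for d in top) / len(top) if top else 0.0


def fairness(docs, k=10):
    """Position-discounted sum of neutrality over the top-k documents."""
    return sum(neutrality(d) / math.log2(1 + i)
               for i, d in enumerate(docs[:k], start=1))


def normalized_fairness(docs, candidates, k=10):
    """Fairness of `docs` divided by that of the most neutral ordering."""
    ideal = sorted(candidates, key=neutrality, reverse=True)
    best = fairness(ideal, k)
    return fairness(docs, k) / best if best > 0 else 0.0
```

Under these assumptions, a list whose top positions hold documents with one-sided gendered wording gets a rank bias further from zero and a lower normalized fairness than a list that places neutral documents first.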

Table 3: Retrieval effectiveness and the level of fairness and bias of three neural ranking models on query sets QS1 and QS2 at cut-off 10.

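| Query Set | Neural Ranker | Training Schema | MRR@10 | NFaiRR Value | NFaiRR Improvement | ARaB TF | ARaB TF Reduction | ARaB Boolean | ARaB Boolean Reduction |
|---|---|---|---|---|---|---|---|---|---|
| QS1 | BERT (base uncased) | Original (Run) | 0.3494 | 0.7764 | - | 0.1281 | - | 0.0956 | - |
| QS1 | BERT (base uncased) | Ours (Run) | 0.3266 | 0.8673 | 11.71% | 0.0967 | 24.51% | 0.0864 | 9.62% |
| QS1 | DistilRoBERTa (base) | Original (Run) | 0.3382 | 0.7805 | - | 0.1178 | - | 0.0914 | - |
| QS1 | DistilRoBERTa (base) | Ours (Run) | 0.3152 | 0.8806 | 12.83% | 0.0856 | 27.33% | 0.0813 | 11.05% |
| QS1 | ELECTRA (base) | Original (Run) | 0.3265 | 0.7808 | - | 0.1273 | - | 0.0961 | - |
| QS1 | ELECTRA (base) | Ours (Run) | 0.3018 | 0.8767 | 12.28% | 0.0949 | 25.45% | 0.0855 | 11.03% |
| QS2 | BERT (base uncased) | Original (Run) | 0.2229 | 0.8779 | - | 0.0275 | - | 0.0157 | - |
| QS2 | BERT (base uncased) | Ours (Run) | 0.2265 | 0.9549 | 8.77% | 0.0250 | 9.09% | 0.0156 | 0.64% |
| QS2 | DistilRoBERTa (base) | Original (Run) | 0.2198 | 0.8799 | - | 0.0338 | - | 0.0262 | - |
| QS2 | DistilRoBERTa (base) | Ours (Run) | 0.2135 | 0.9581 | 8.89% | 0.0221 | 34.62% | 0.0190 | 27.48% |
| QS2 | ELECTRA (base) | Original (Run) | 0.2296 | 0.8857 | - | 0.0492 | - | 0.0353 | - |
| QS2 | ELECTRA (base) | Ours (Run) | 0.2081 | 0.9572 | 8.07% | 0.0279 | 43.29% | 0.0254 | 28.05% |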

Comparative Analysis

We also compare our proposed strategy with AdvBert, the state-of-the-art model that leverages an adversarial training process to remove gender-related information. Since the authors shared their trained models only for BERT-Tiny and BERT-Mini and only for QS2, Table 4 compares our work with AdvBert on these two models and on the QS2 query set in terms of retrieval effectiveness, the level of fairness, and the level of bias.

Table 4: Comparing AdvBert training strategy and our approach at cut-off 10.

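| Neural Ranker | Training Schema | MRR@10 | NFaiRR Value | NFaiRR Improvement | ARaB TF | ARaB TF Reduction | ARaB Boolean | ARaB Boolean Reduction |
|---|---|---|---|---|---|---|---|---|
| BERT-Tiny | Original (Run) | 0.1750 | 0.8688 | - | 0.0356 | - | 0.0296 | - |
| BERT-Tiny | AdvBert (Run) | 0.1361 | 0.9257 | 6.55% | 0.0245 | 31.18% | 0.0236 | 20.27% |
| BERT-Tiny | Ours (Run) | 0.1497 | 0.9752 | 12.25% | 0.0099 | 72.19% | 0.0115 | 61.15% |
| BERT-Mini | Original (Run) | 0.2053 | 0.8742 | - | 0.0300 | - | 0.0251 | - |
| BERT-Mini | AdvBert (Run) | 0.1515 | 0.9410 | 7.64% | 0.0081 | 73.00% | 0.0032 | 87.26% |
| BERT-Mini | Ours (Run) | 0.2000 | 0.9683 | 10.76% | 0.0145 | 51.67% | 0.0113 | 54.98% |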

Resources

Due to the space limitations of GitHub, we have uploaded the trained neural ranking models, the original training dataset, the fairness-aware training dataset, and the TREC run files of MS MARCO dev small here.

Usage

To train any other neural ranking model with our proposed negative sampling strategy, follow the process below:
  1. First, download the fairness-aware training dataset from here, which contains 20 negative sample documents for each query. Among these 20 negative samples, 60% (equivalent to 12 documents) are the documents with the highest level of bias among the top-1000 documents retrieved for that query by BM25.
  2. Next, train your desired model using the cross-encoder architecture, which passes two sequences (the query and the candidate document) to the transformer model. To do so, you can use the cross-encoder implementation of the Sentence Transformers library, set the training parameters of the model, and use the fairness-aware training dataset as the training dataset.
  3. Once your model is trained, you can re-rank any list of documents for a given query using the reranker.py script, which takes the latest checkpoint of the trained model and re-ranks the list as follows:
python reranker.py \
     -checkpoint path to the checkpoint of the latest model \
     -queries path to the queries .tsv file \
     -run path to the run .trec file \
     -res path to the result folder
  4. To calculate the MRR of the re-ranked run file of the model, you can use the MRR_calculator.py script as follows:
python MRR_calculator.py \
     -qrels path to the qrels file \
     -run path to the re-ranked run file obtained from the previous step \
     -result path to result
  5. Finally, to calculate the level of bias using the ARaB and NFaiRR metrics for the re-ranked run files, you can use the provided scripts, which are taken from Navid Rekabsaz's repositories (ARaB and NFaiRR) with minor changes to fit our work. To calculate ARaB, first run the documents_calculate_bias.py script to calculate the level of bias within each document. Then use the runs_calculate_bias.py and model_calculate_bias.py scripts to calculate the TF and Boolean ARaB metrics. To calculate the level of fairness of the re-ranked runs, first run calc_documents_neutrality.py to calculate the level of fairness within each document, and then run metrics_fairness.py to calculate the NFaiRR metric over the run file.
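
The composition of the fairness-aware training dataset described in the first step (20 negatives per query, 12 of them the most biased BM25 candidates) can be sketched as follows. This is an illustration rather than the script used to build the released dataset: the function name `sample_negatives`, its parameters, and the precomputed `bias_score` dictionary are hypothetical, and drawing the remaining 8 negatives uniformly at random from the other BM25 candidates is an assumption, since the description above does not state how they are chosen.

```python
import random


def sample_negatives(candidates, bias_score, positives=frozenset(),
                     n_total=20, biased_ratio=0.6, seed=0):
    """Pick negative documents for one query.

    candidates: list of document ids, e.g. the BM25 top-1000.
    bias_score: dict mapping document id -> bias magnitude (>= 0).
    positives:  document ids judged relevant, never used as negatives.
    """
    pool = [d for d in candidates if d not in positives]
    n_biased = round(n_total * biased_ratio)  # 12 when n_total is 20

    # Most biased candidates first.
    by_bias = sorted(pool, key=lambda d: bias_score.get(d, 0.0), reverse=True)
    biased = by_bias[:n_biased]
    rest = by_bias[n_biased:]

    # Assumption: the remaining negatives are drawn at random.
    rng = random.Random(seed)
    n_random = min(n_total - len(biased), len(rest))
    return biased + rng.sample(rest, n_random)
```

Each returned document id can then be paired with the query and a label of 0, alongside the relevant document with a label of 1, to form cross-encoder training examples.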