This repository contains the code and resources for detecting the gender of queries (Female, Male, Neutral) and for analyzing the psychological characteristics of their relevance judgement documents.
In this work, we propose a query gender classifier. As a first step, to label queries by gender at scale, we used the gender-annotated dataset released by Navid Rekabsaz and Markus Schedl to train the classifiers. This dataset consists of 742 female, 1,202 male, and 1,765 neutral queries. We trained several types of classifiers on this dataset and evaluated them with 5-fold cross-validation.
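As a rough illustration of the evaluation protocol, the sketch below builds 5 cross-validation splits over a label list with the class counts given above. Stratifying the folds by class is an assumption made here for illustration; the function names `stratified_folds` and `cross_validation_splits` are hypothetical and are not taken from this repository.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign example indices to k folds, keeping class proportions similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def cross_validation_splits(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs, one per held-out fold."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, test_idx


# Label list with the class counts reported for the annotated dataset.
labels = ["female"] * 742 + ["male"] * 1202 + ["neutral"] * 1765
splits = list(cross_validation_splits(labels))
```

Each classifier would then be trained on `train_idx` and scored on `test_idx` for every split, with the scores averaged across the five folds.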
Category | Classifier | Accuracy | F1-Score (Female) | F1-Score (Male) | F1-Score (Neutral) |
---|---|---|---|---|---|
Dynamic Embeddings | BERT (base uncased) | 0.856 | 0.816 | 0.872 | 0.862 |
Dynamic Embeddings | DistilBERT (base uncased) | 0.847 | 0.815 | 0.861 | 0.853 |
Dynamic Embeddings | RoBERTa | 0.810 | 0.733 | 0.820 | 0.836 |
Dynamic Embeddings | DistilBERT (base cased) | 0.800 | 0.730 | 0.823 | 0.833 |
Dynamic Embeddings | BERT (base cased) | 0.797 | 0.710 | 0.805 | 0.827 |
Dynamic Embeddings | XLNet (base cased) | 0.795 | 0.710 | 0.805 | 0.826 |
Static Embeddings | Word2Vec | 0.757 | 0.626 | 0.756 | 0.809 |
Static Embeddings | fastText | 0.750 | 0.615 | 0.759 | 0.792 |
The table above reports the performance of each classifier. The fine-tuned uncased BERT model performs best for query gender identification. Finally, to measure bias in relevance judgements, we used our best-performing model to identify the gender of the queries in the MS MARCO Dev set that have at least one associated human-judged relevance judgement document (51,827 queries in total). Note that the queries in the gender-annotated dataset were removed from this set to avoid unintended leakage.
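The leakage-avoidance step can be sketched as a simple filter. This is a minimal illustration, assuming queries are matched by their normalized text; the repository may instead match by query ID. The helper names and the toy queries below are hypothetical.

```python
def normalize(query):
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(query.lower().split())


def remove_annotated(dev_queries, annotated_queries):
    """Drop Dev-set queries whose text also appears in the annotated training set."""
    seen = {normalize(q) for q in annotated_queries}
    return {
        qid: text
        for qid, text in dev_queries.items()
        if normalize(text) not in seen
    }


# Toy data for illustration only (not taken from either dataset).
dev_queries = {
    1: "What is  a neural ranker",
    2: "average rainfall in spring",
}
annotated_queries = ["what is a neural ranker"]

filtered = remove_annotated(dev_queries, annotated_queries)
```

Only the queries left in `filtered` would then be passed to the classifier for gender prediction.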
The following table shows a few queries labeled by our fine-tuned BERT classifier. All 1,405 female, 1,405 male, and 1,405 neutral labeled queries are available here.
QID | Query | Predicted Gender |
---|---|---|
80095 | can you take naproxen during pregnancy | Female |
14757 | aimee osbourne net worth | Female |
189154 | foods that can prevent prostate cancer | Male |
11251 | adam devine net worth | Male |
40234 | average percentage of accepted scholarships | Neutral |
- Training - `code/train.py`: The code for fine-tuning BERT on the queries_gender_annotated dataset or any other dataset.
- Predicting - `codes/predict.py`: If you do not want to train the model yourself, you can download our fine-tuned model and use `predict.py` to predict the gender of queries.
Our approach to quantifying bias is based on measuring different psychological characteristics of the relevance judgement documents associated with each query. To do so, we use the Linguistic Inquiry and Word Count (LIWC) text analytics toolkit to compute the degree to which different psychological characteristics are observed in the relevance judgement documents. The psychological characteristics for the queries of each group can be found in the `results/psychological analysis` folder.
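One way to compare groups, once LIWC scores are available per document, is to average each characteristic within each query gender group. The sketch below assumes one row per relevance judgement document with a `gender` field and numeric score columns; the column names `posemo` and `negemo`, the function name, and all values are illustrative assumptions, not results from this repository.

```python
from collections import defaultdict
from statistics import mean


def mean_scores_by_group(rows, categories):
    """Average each psychological category score within each gender group."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for category in categories:
            grouped[row["gender"]][category].append(float(row[category]))
    return {
        group: {category: mean(values) for category, values in scores.items()}
        for group, scores in grouped.items()
    }


# Made-up scores for illustration only.
rows = [
    {"gender": "Female", "posemo": 2.0, "negemo": 1.0},
    {"gender": "Female", "posemo": 4.0, "negemo": 3.0},
    {"gender": "Male", "posemo": 1.0, "negemo": 2.0},
]

group_means = mean_scores_by_group(rows, ["posemo", "negemo"])
```

Differences between the per-group means for a given characteristic would then indicate how the documents judged relevant for female, male, and neutral queries differ.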