In June 2024, a friend and I developed an alternative to BM25 that uses a clustering approach based on a k-NN algorithm. Both a paired t-test and the Wilcoxon signed-rank test confirmed that the method improves mean average precision (MAP).
This project was part of our Information Retrieval exam. Our approach was inspired by the paper by Lee, K. S., Croft, W. B., & Allan, J. (2008), "A deterministic resampling method using overlapping document clusters for pseudo-relevance feedback". The approach can be summarized in seven steps:
- Initial retrieval on the whole collection.
- Clustering.
- Identification of the "dominant document" based on the clustering.
- Aggregation of documents in the clusters of the "dominant document."
- Retrieval on the aggregated clusters.
- Query expansion based on the first result (pseudo-RF).
- Second retrieval on the whole collection.
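The seven steps above can be sketched on a toy in-memory collection. This is only an illustration under several assumptions and not the code in cluster_pseudo_RF.py: the scorer is a plain term-frequency sum standing in for BM25, each k-NN cluster is a document plus its k nearest neighbours by cosine similarity, the "dominant document" is assumed to be the one that appears in the most clusters, and all function names and parameters (`retrieve`, `cluster_pseudo_rf`, `top_n`, `k`, `n_feedback`, `n_terms`) are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def cosine(a, b):
    # Cosine similarity between two term-frequency Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_terms, docs, top_n):
    # Toy scorer (assumption): sum of query-term frequencies, a stand-in for BM25.
    scored = [(sum(vec[t] for t in query_terms), doc_id) for doc_id, vec in docs.items()]
    scored = [(s, d) for s, d in scored if s > 0]
    scored.sort(key=lambda x: (-x[0], x[1]))
    return [d for _, d in scored[:top_n]]


def cluster_pseudo_rf(query, collection, top_n=4, k=1, n_feedback=1, n_terms=2):
    docs = {doc_id: Counter(tokenize(text)) for doc_id, text in collection.items()}
    q = tokenize(query)

    # Step 1: initial retrieval on the whole collection.
    initial = retrieve(q, docs, top_n)
    if not initial:
        return q, []

    # Step 2: one (possibly overlapping) k-NN cluster per retrieved document.
    clusters = {}
    for d in initial:
        others = [o for o in initial if o != d]
        others.sort(key=lambda o: (-cosine(docs[d], docs[o]), o))
        clusters[d] = {d, *others[:k]}

    # Step 3: dominant document, assumed here to be the most frequent cluster member.
    membership = Counter(m for members in clusters.values() for m in members)
    dominant = max(initial, key=lambda d: membership[d])

    # Step 4: aggregate every cluster that contains the dominant document.
    pool = set().union(*(c for c in clusters.values() if dominant in c))

    # Step 5: retrieval restricted to the aggregated clusters.
    reranked = retrieve(q, {d: docs[d] for d in pool}, len(pool))

    # Step 6: expand the query with frequent terms from the top feedback documents.
    term_counts = Counter()
    for d in reranked[:n_feedback]:
        term_counts.update(docs[d])
    ranked_terms = sorted(term_counts.items(), key=lambda x: (-x[1], x[0]))
    expanded = q + [t for t, _ in ranked_terms if t not in q][:n_terms]

    # Step 7: second retrieval on the whole collection with the expanded query.
    return expanded, retrieve(expanded, docs, len(docs))


toy_collection = {
    "d1": "jaguar car engine speed",
    "d2": "jaguar car dealer price",
    "d3": "jaguar cat jungle prey",
    "d4": "banana fruit yellow",
}
expanded_query, ranking = cluster_pseudo_rf("jaguar car", toy_collection)
```

Here `expanded_query` holds the original query terms followed by the added feedback terms, and `ranking` is the list of document ids from the second retrieval. The values of `top_n`, `k`, `n_feedback` and `n_terms` are exactly the kind of hyperparameters a grid search would tune.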
We then applied this approach to an experimental collection to optimize its hyperparameters and test its effectiveness.
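As an illustration of how effectiveness can be tested on per-query average precision (AP), the sketch below computes the paired t statistic and the Wilcoxon signed-rank statistics using only the standard library. It assumes two equal-length lists of per-query scores, one per system. The function names are hypothetical, and the Wilcoxon p-value uses a plain normal approximation without tie or continuity correction, so it is only a rough guide for small query sets. In practice, `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` are the usual tools.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_t_statistic(a, b):
    # t statistic of the per-query differences (n - 1 degrees of freedom).
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def wilcoxon_signed_rank(a, b):
    # Zero differences are dropped, as in the classic form of the test.
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for m in range(i, j + 1):
            ranks[order[m]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    # Two-sided p-value from a normal approximation (assumption: no corrections).
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (min(w_plus, w_minus) - mu) / sigma
    return w_plus, w_minus, 2 * NormalDist().cdf(z)
```

Both functions take the scores of the proposed method as `a` and those of the baseline as `b`, so a positive t statistic and a large sum of positive ranks indicate that the proposed method scores higher.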
We have uploaded the following files:
- bulk.py and bulk_without_stopwords.py: Used to index the TREC ROBUST 2004 collection, with and without stopwords respectively.
- cluster_pseudo_RF.py: Contains the code for the cluster-based pseudo-RF procedure described above.
- search.py: Used to get the results of the queries on ROBUST 2004 and to perform a grid search to optimize some parameters.
- Quartuccio_Varotto.pdf: Contains the presentation my friend and I used to present our project.