Hi there! 👋 In this repository, we develop a Multilingual Information Retrieval model that support 15 different languages, and it will be used on Yelp search engine after further online experiments.
Figure 1: Yelp's search interface
The current search engine of Yelp is based on NrtSearch. However, inverted index-based lexical matching on Lucene-based search engine such as NrtSearch falls short in several key aspects:
- Lack of understanding of hypernyms, synonyms, and antonyms. For example, "sneaker" might match the intent of the query "running shoes", but may not be retrieved.
- Fragility of morphological variants (e.g. woman vs. women)
- Sensitivity to spelling errors
- Inability to support multilingual search
Although Yelp rewrites the query by query expansion and spelling correction before sending it to search engine, the capacity of this method is still limited. Therefore, we intend to add a neural-network-based model trained with large amount of text to complement the lexical search engine in ad-hoc multilingual retrieval.
Both mBERT and XLM have shown great sucess when fine-tuned on downstream tasks. However, pre-training objectives tailored for ad-hoc information retrieval task have not been well explored. In this repository, we use three pretraining objective specially designed for multilingual retrieval tasks:
- Query Language Modeling Task (QLM)
Mask some query tokens and ask the model to predict the masked tokens based on query contexts and full relevant document.
- Relevance Ranking Task (RR)
Given a query and several documents, the model is asked to rank these documents based on levels of relevance.
- Representative wOrds Prediction (ROP)
Pretrain the model to predict the pairwise preference between two generated pseudo queries from document. Please refer to the detail of PROP in here
In order to be both efficient and effective, we use ColBERT as backbone. ColBERT relies on fine-grained contextual late interaction to enable scalable BERT-based search over large text collections in tens of milliseconds.
Figure 2: ColBERT's late interaction structure
We use multiligual Wiki as pretraining dataset, and our approach is conceptually similar to the Inverse Cloze Task (ICT), where one sentence is sampled from a Wiki paragraph as query, and the rest of the paragraph is treated as document. We also use triples.train.small.tar.gz
from MSMARCO passage ranking dataset as training corpus, and use an in-house translation model to translate it into 15 languages.
The ColXLM-15 model includes these languages: en-fr-es-de-it-pt-nl-sv-pl-ru-ar-zh-ja-ko-hi, represented by ISO 639-2 Code
We continue pretraining our retrieval-oriented language models from the public mBERT checkpoint. Therefore, our multilingual LM is implicitly pretrained with four objectives (MLM, QLM, RR, ROP). We first train with QLM in random order of languange pairs, then train with RR and ROP in random order of language pairs in each iteration. Each epoch contains 200K iteration per language pair for each objective, with batchsize 32.
In order to train the model, you need to run train.sh
:
CUDA_VISIBLE_DEVICES="0" \
python -m \
colXLM.train --doc_maxlen 180 --mask-punctuation --bsize 32 --accum 1 --mlm_probability 0.1 \
--triples /path/to/train.tsv \
--prop /path/to/prop/msmarco_info \
--langs "en,fr,es,de,it,pt,nl,sv,pl,ru,ar,zh,ja,ko,hi" \
--root /path/to/ColXLM --experiment WIKI-psg --similarity l2 --run wiki.psg.l2 --maxsteps 2000000
In this step, we use the model trained in the pretraining phase to embed every document, and then store the embedding on disk.
CUDA_VISIBLE_DEVICES="0" \
python -m colXLM.index_document \
--checkpoint_path /path/to/checkpoints/colbert.dnn \
--index_path /path/to/indexes \
--doc_path /path/to/documents.tsv
We typically recommend that you use ColXLM for end-to-end retrieval, where it directly finds its top-k passages from the full collection. For this, you need FAISS indexing.
For end-to-end retrieval, you should index the document representations into FAISS.
CUDA_VISIBLE_DEVICES="0" \
python -m colXLM.index_faiss \
--dim 128 --index_path /path/to/indexes --faiss_name 'faiss_l2'
run retrieve.sh
to retrieve relevant documents from full collection.
CUDA_VISIBLE_DEVICES="7" \
python -m colXLM.retrieve_faiss \
--checkpoint_path /path/to/checkpoints/colbert.dnn \
--query_doc_path /path/to/queries.tsv \
--submit_path /path/to/submit.tsv \
--index_path /path/to/indexes \
--gold_path /path/to/top1000.tsv \
--faiss_name faiss_l2 \
--batchsize 128 \
--k 1000
Zihan Wang: zw2782@columbia.edu
Columbia Database Group: https://cudbg.github.io/