[Feature Request]: Update BM25 Retriever to Fully Support bm25s Non-ASCII and UTF-8 Options #17461

Restodecoca · 2025-01-08T21:06:02Z

Feature Description

The current implementation of BM25Retriever in LlamaIndex relies on an older version of bm25s (0.2.3 or 0.2.4), which does not support non-ASCII tokenization or UTF-8 encoding. These features are now available in bm25s starting from version 0.2.6. Updating BM25Retriever to leverage these capabilities would significantly improve its handling of multilingual and non-ASCII corpora.

Specifically, this update would:

Allow the user to enable non-ASCII tokenization (e.g., non_ascii=True) for better support of languages such as Chinese, Japanese, or Arabic.
Expose a configurable text encoding (e.g., encoding="utf-8") to avoid encoding-related issues like UnicodeDecodeError.
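
To illustrate the tokenization gap, here is a minimal sketch that uses only Python's `re` module. It is not the bm25s implementation: the word pattern and the `tokenize` helper are assumptions made for this example. It compares an ASCII-only word pattern with a Unicode-aware one on mixed English and Chinese text.

```python
import re

# Hypothetical patterns for illustration; not taken from bm25s.
# With re.ASCII, \w only matches [a-zA-Z0-9_], so CJK characters are skipped.
ASCII_WORDS = re.compile(r"\b\w\w+\b", re.ASCII)
# Without re.ASCII, \w matches any Unicode word character.
UNICODE_WORDS = re.compile(r"\b\w\w+\b")


def tokenize(text, pattern):
    """Lowercase the text and return every match of the word pattern."""
    return pattern.findall(text.lower())


sample = "BM25 检索 works"
ascii_tokens = tokenize(sample, ASCII_WORDS)      # ['bm25', 'works']
unicode_tokens = tokenize(sample, UNICODE_WORDS)  # ['bm25', '检索', 'works']
```

The ASCII-only pattern silently drops the Chinese term, so a query for it can never match. The Unicode-aware pattern keeps it, although text written without spaces would still need word segmentation.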

Comment

The current limitations of BM25Retriever affect users working with non-ASCII or multilingual datasets. For example:

Chinese users often encounter difficulties with tokenization and must resort to external tools like jieba to preprocess their text before passing it to BM25Retriever.
Without UTF-8 support, encoding errors (UnicodeDecodeError) can occur when the text contains special characters or is written in a language other than English.

These issues stem from the older version of bm25s used in BM25Retriever, which lacks support for these critical features.

Reason

LlamaIndex's BM25Retriever implementation is tied to a pre-0.2.6 version of bm25s that:

Does not include non-ASCII tokenization, limiting its effectiveness for languages like Chinese, which require specialized tokenization to split characters correctly.
Defaults to system encoding (e.g., cp1252 on Windows) when reading or writing files, often causing issues when dealing with special characters or multilingual text.
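
The encoding problem can be reproduced without bm25s. The sketch below writes text as UTF-8 and reads it back with a different codec, which is roughly what happens when a file is opened without an explicit `encoding` on a system whose default is cp1252. The `roundtrip` and `describe` helpers and the file name are assumptions made for this example.

```python
import os
import tempfile


def roundtrip(text, write_encoding, read_encoding):
    """Write text with one encoding and read it back with another."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "corpus.txt")
        with open(path, "w", encoding=write_encoding) as f:
            f.write(text)
        with open(path, "r", encoding=read_encoding) as f:
            return f.read()


def describe(text, read_encoding):
    """Classify what happens when UTF-8 text is read with read_encoding."""
    try:
        restored = roundtrip(text, "utf-8", read_encoding)
    except UnicodeDecodeError:
        return "decode error"
    return "ok" if restored == text else "mojibake"


status_utf8 = describe("搜索引擎", "utf-8")
# Depending on the bytes involved, a mismatched codec either raises
# UnicodeDecodeError or silently returns garbled text.
status_cp1252 = describe("搜索引擎", "cp1252")
```

Passing `encoding="utf-8"` explicitly on both the write and the read removes the dependence on the platform default.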

These limitations force users to:

Preprocess their data manually with external tools (e.g., jieba for Chinese).
Patch the underlying bm25s methods to support UTF-8 or non-ASCII handling.

While these workarounds are effective, they add unnecessary complexity and are not user-friendly.

Value of Feature

This feature would bring the following benefits:

Seamless Multilingual Support: By exposing options like non_ascii=True and encoding="utf-8", BM25Retriever would natively support languages that require special tokenization, such as Chinese or Japanese.
Improved Usability: Users would no longer need to rely on external tokenizers like jieba or monkey-patch the library to handle encoding issues.
Compatibility with bm25s: Tracking bm25s 0.2.6 or later keeps BM25Retriever aligned with the upstream library's current features and fixes.

Example Usage

Here’s an example of how the updated API could look:

bm25_retriever = BM25Retriever.from_defaults(
    docstore=docstore,
    similarity_top_k=10,
    language="zh",  # Specify language for stopword removal and tokenization
    non_ascii=True,  # Enable non-ASCII tokenization
    encoding="utf-8",  # Specify encoding to handle multilingual text
    verbose=True
)

This approach would maintain backward compatibility while offering new options for multilingual datasets.

Restodecoca added the enhancement (New feature or request) and triage (Issue needs to be triaged/prioritized) labels on Jan 8, 2025
