[Feature Request]: Update BM25 Retriever to Fully Support bm25s Non-ASCII and UTF-8 Options #17461

Open
Restodecoca opened this issue Jan 8, 2025 · 0 comments
Labels
enhancement New feature or request triage Issue needs to be triaged/prioritized

Feature Description

The current implementation of BM25Retriever in LlamaIndex relies on an older version of bm25s (0.2.3 or 0.2.4), which does not support non-ASCII tokenization or UTF-8 encoding. These features are now available in bm25s starting from version 0.2.6. Updating BM25Retriever to leverage these capabilities would significantly improve its handling of multilingual and non-ASCII corpora.

Specifically, this update would:

  1. Allow the user to enable non-ASCII tokenization (e.g., non_ascii=True) for better support of languages such as Chinese, Japanese, or Arabic.
  2. Expose a configurable text encoding (e.g., encoding="utf-8") to avoid encoding-related issues like UnicodeDecodeError.

Comment

The current limitations of BM25Retriever affect users working with non-ASCII or multilingual datasets. For example:

  • Chinese users often encounter difficulties with tokenization and must resort to external tools like jieba to preprocess their text before passing it to BM25Retriever.
  • Without UTF-8 support, encoding errors (UnicodeDecodeError) can occur when the text contains special characters or is written in a language other than English.
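The tokenization problem can be reproduced without bm25s at all. The sketch below uses only the standard library and assumes a regex-based word tokenizer; the pattern and the sample sentence are made up for illustration and are not taken from bm25s or LlamaIndex.

```python
import re

# Illustrative pattern (an assumption for this sketch):
# a token is a run of two or more word characters.
TOKEN_PATTERN = r"\b\w\w+\b"

sample = "café résumé 搜索引擎"

# With re.ASCII, \w only matches [a-zA-Z0-9_], so accented letters and
# CJK characters are treated as separators and dropped.
ascii_tokens = re.findall(TOKEN_PATTERN, sample.lower(), flags=re.ASCII)

# Without the flag, \w is Unicode-aware and keeps non-ASCII letters.
unicode_tokens = re.findall(TOKEN_PATTERN, sample.lower())

print("ASCII-only:   ", ascii_tokens)
print("Unicode-aware:", unicode_tokens)
```

In the ASCII-only run the accented words are cut into fragments and the Chinese text disappears entirely. The Unicode-aware run keeps them, but the Chinese phrase comes out as one unsegmented token, which is why segmenters such as jieba are still used for word-level splitting.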

These issues stem from the older version of bm25s used in BM25Retriever, which lacks support for these critical features.

Reason

LlamaIndex's BM25Retriever implementation is tied to a pre-0.2.6 version of bm25s that:

  1. Does not include non-ASCII tokenization, limiting its effectiveness for languages like Chinese, which require specialized tokenization to split characters correctly.
  2. Defaults to system encoding (e.g., cp1252 on Windows) when reading or writing files, often causing issues when dealing with special characters or multilingual text.
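The encoding problem in point 2 can be shown with plain file I/O. This is a minimal stdlib-only sketch, not bm25s code: the file name and sample text are hypothetical, and cp1252 is passed explicitly to simulate a Windows default so the example behaves the same on any platform.

```python
import os
import tempfile

text = "Água 搜索"  # sample text with non-ASCII characters

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.txt")  # hypothetical file name

    # Write with an explicit encoding instead of relying on the locale.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Simulate a reader that falls back to a cp1252 system default.
    try:
        with open(path, encoding="cp1252") as f:
            legacy_result = f.read()  # would be mojibake if decoding succeeded
    except UnicodeDecodeError as exc:
        legacy_result = f"UnicodeDecodeError: {exc.reason}"

    # Reading with the same explicit encoding round-trips the text.
    with open(path, encoding="utf-8") as f:
        utf8_result = f.read()

print("cp1252 read:", legacy_result)
print("utf-8 read: ", utf8_result)
```

Passing `encoding="utf-8"` on both the write and the read side is what makes the round trip independent of the machine's locale.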

These limitations force users to:

  • Preprocess their data manually with external tools (e.g., jieba for Chinese).
  • Patch the underlying bm25s methods to support UTF-8 or non-ASCII handling.
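As a rough picture of the manual preprocessing step, the sketch below splits Chinese text into single characters before handing it to a whitespace tokenizer. This is a deliberately crude stand-in for a real segmenter such as jieba, not what jieba does; the helper names are hypothetical, and the character range covers only the basic CJK Unified Ideographs block.

```python
import re

# Assumption for this sketch: only the basic CJK Unified Ideographs block.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def space_out_cjk(text: str) -> str:
    """Insert spaces between consecutive CJK ideographs."""
    return CJK_RUN.sub(lambda m: " " + " ".join(m.group(0)) + " ", text)


def pretokenize(text: str) -> list[str]:
    """Character-level split for CJK, whitespace split for everything else."""
    return space_out_cjk(text).split()


print(pretokenize("BM25 搜索引擎 test"))
```

Character-level splitting loses word boundaries, so a dictionary-based segmenter generally gives better retrieval quality. Any downstream tokenizer also has to keep single-character tokens for this workaround to have an effect.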

While these workarounds are effective, they add unnecessary complexity and are not user-friendly.

Value of Feature

This feature would bring the following benefits:

  • Seamless Multilingual Support: By exposing options like non_ascii=True and encoding="utf-8", BM25Retriever would natively support languages that require special tokenization, such as Chinese or Japanese.
  • Improved Usability: Users would no longer need to rely on external tokenizers like jieba or monkey-patch the library to handle encoding issues.
  • Compatibility with bm25s: Leveraging the latest features in bm25s ensures better performance and alignment with the state-of-the-art BM25 implementation.

Example Usage

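Here’s an example of how the updated BM25Retriever API could look (the non_ascii and encoding arguments are the proposed additions and do not exist yet):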

bm25_retriever = BM25Retriever.from_defaults(
    docstore=docstore,
    similarity_top_k=10,
    language="zh",  # Specify language for stopword removal and tokenization
    non_ascii=True,  # Enable non-ASCII tokenization
    encoding="utf-8",  # Specify encoding to handle multilingual text
    verbose=True
)

This approach would maintain backward compatibility while offering new options for multilingual datasets.
