Feature Description
The current implementation of BM25Retriever in LlamaIndex relies on an older version of bm25s (0.2.3 or 0.2.4), which does not support non-ASCII tokenization or UTF-8 encoding. These features are now available in bm25s starting from version 0.2.6. Updating BM25Retriever to leverage these capabilities would significantly improve its handling of multilingual and non-ASCII corpora.
Specifically, this update would:
Allow the user to enable non-ASCII tokenization (e.g., non_ascii=True) for better support of languages such as Chinese, Japanese, or Arabic.
Expose a configurable text encoding (e.g., encoding="utf-8") to avoid encoding-related issues like UnicodeDecodeError.
Comment
The current limitations of BM25Retriever affect users working with non-ASCII or multilingual datasets. For example:
Chinese users often encounter difficulties with tokenization and must resort to external tools like jieba to preprocess their text before passing it to BM25Retriever.
Without UTF-8 support, encoding errors (UnicodeDecodeError) can occur when the text contains special characters or is written in a language other than English.
These issues stem from the older version of bm25s used in BM25Retriever, which lacks support for these critical features.
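The tokenization problem can be illustrated with a small, self-contained sketch. This is not the bm25s tokenizer; the `tokenize` helper and its `ascii_only` switch are hypothetical and only contrast an ASCII-only word pattern with a Unicode-aware one, using Python's standard `re` module.

```python
import re
from typing import List

# A common "words of two or more characters" pattern (assumed for illustration).
TOKEN_PATTERN = r"\b\w\w+\b"


def tokenize(text: str, ascii_only: bool) -> List[str]:
    """Hypothetical helper: regex tokenization with or without re.ASCII."""
    flags = re.ASCII if ascii_only else 0
    return re.findall(TOKEN_PATTERN, text.lower(), flags=flags)


text = "BM25 检索系统 retrieval"
ascii_tokens = tokenize(text, ascii_only=True)     # Chinese characters are dropped
unicode_tokens = tokenize(text, ascii_only=False)  # Chinese run kept as one token

print(ascii_tokens)
print(unicode_tokens)
```

With the ASCII-only pattern the Chinese text disappears from the index entirely. The Unicode-aware pattern keeps it, but as a single unsegmented run, which is why users still reach for a word segmenter such as jieba.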
Reason
LlamaIndex's BM25Retriever implementation is tied to a pre-0.2.6 version of bm25s that:
Does not include non-ASCII tokenization, limiting its effectiveness for languages like Chinese, which is written without spaces between words and therefore needs specialized tokenization to split text into terms correctly.
Defaults to system encoding (e.g., cp1252 on Windows) when reading or writing files, often causing issues when dealing with special characters or multilingual text.
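The encoding failure is easy to reproduce with the standard library alone. The sketch below assumes nothing about bm25s internals; `roundtrip` is a hypothetical helper that writes text as UTF-8 and reads it back with a chosen encoding, mimicking a reader that falls back to a Windows locale default such as cp1252.

```python
import tempfile
from pathlib import Path


def roundtrip(text: str, read_encoding: str) -> str:
    """Write text as UTF-8, then read it back with the given encoding."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "corpus.txt"
        path.write_text(text, encoding="utf-8")
        return path.read_text(encoding=read_encoding)


sample = "こんにちは"  # UTF-8 bytes include 0x81, which cp1252 leaves undefined

try:
    roundtrip(sample, "cp1252")
    outcome = "decoded without error"
except UnicodeDecodeError as exc:
    outcome = f"UnicodeDecodeError: {exc.reason}"

print(outcome)
print(roundtrip(sample, "utf-8") == sample)
```

Passing `encoding="utf-8"` explicitly on both the write and the read removes the dependence on the platform default.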
These limitations force users to:
Preprocess their data manually with external tools (e.g., jieba for Chinese).
Patch the underlying bm25s methods to support UTF-8 or non-ASCII handling.
While these workarounds are effective, they add unnecessary complexity and are not user-friendly.
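The manual preprocessing workaround typically looks like the sketch below: segment each document first, then join the tokens with spaces so that a whitespace-based tokenizer can index them. The `presegment` and `char_segment` names are hypothetical, and the character-level segmenter is only a stand-in so the example runs without extra dependencies; with jieba installed, `jieba.lcut` could be passed as `segment` instead.

```python
from typing import Callable, Iterable, List


def char_segment(text: str) -> List[str]:
    """Toy stand-in segmenter: each non-space character becomes a token."""
    return [ch for ch in text if not ch.isspace()]


def presegment(
    texts: Iterable[str],
    segment: Callable[[str], List[str]] = char_segment,
) -> List[str]:
    """Join segmented tokens with spaces before handing text to the retriever."""
    return [" ".join(segment(text)) for text in texts]


docs = ["检索系统", "全文 检索"]
prepared = presegment(docs)
print(prepared)
```

Every user who indexes Chinese text has to write and maintain a step like this, which is the overhead the proposed options aim to remove.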
Value of Feature
This feature would bring the following benefits:
Seamless Multilingual Support: By exposing options like non_ascii=True and encoding="utf-8", BM25Retriever would natively support languages that require special tokenization, such as Chinese or Japanese.
Improved Usability: Users would no longer need to rely on external tokenizers like jieba or monkey-patch the library to handle encoding issues.
Compatibility with bm25s: Tracking the current bm25s release keeps BM25Retriever aligned with the upstream BM25 implementation, so it can pick up upstream features and improvements as they land.
Example Usage
Here's an example of how the updated API could look:
bm25_retriever = BM25Retriever.from_defaults(
    docstore=docstore,
    similarity_top_k=10,
    language="zh",     # Specify language for stopword removal and tokenization
    non_ascii=True,    # Enable non-ASCII tokenization
    encoding="utf-8",  # Specify encoding to handle multilingual text
    verbose=True,
)
This approach would maintain backward compatibility while offering new options for multilingual datasets.