Problem: The tokenizer cannot handle special tokens properly when it is initialized directly via `BertSudachipyTokenizer(...)`.
Reproduction:
```python
from pathlib import Path

from sudachitra import BertSudachipyTokenizer

# assume there are vocab.txt and tokenizer_config.json in the model_path directory
model_path = Path("/path/to/model/")

tok1 = BertSudachipyTokenizer.from_pretrained(model_path)
tok2 = BertSudachipyTokenizer(
    vocab_file=model_path / "vocab.txt",
    do_nfkc=True,
    word_form_type="normalized_and_surface",
)

print(tok1.tokenize("吾輩は[MASK]である"))  # -> ['我が', '##輩', 'は', '[MASK]', 'で', '或る']
print(tok2.tokenize("吾輩は[MASK]である"))  # -> ['我が', '##輩', 'は', '[', 'マスク', ']', 'で', 'ある']
```
Cause: `unique_no_split_tokens` is not set for `tok2`.
```python
print(tok1.unique_no_split_tokens)  # -> ['[CLS]', '[MASK]', '[PAD]', '[SEP]', '[UNK]']
print(tok2.unique_no_split_tokens)  # -> []
```
For `tok1`, it was set by calling `sanitize_special_tokens` (https://github.com/huggingface/transformers/blob/0fe17f375a4f0fdd9aea260d0645ccfd4896e958/src/transformers/tokenization_utils_base.py#L1984).
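As a quick (untested) check, and assuming the reproduction objects above, calling `sanitize_special_tokens()` manually on `tok2` should populate `unique_no_split_tokens` and restore the expected behavior:

```python
# Manual workaround: register the special tokens on the directly
# constructed tokenizer, mirroring what from_pretrained already does.
tok2.sanitize_special_tokens()

print(tok2.unique_no_split_tokens)
# expected -> now includes '[CLS]', '[MASK]', '[PAD]', '[SEP]', '[UNK]'
print(tok2.tokenize("吾輩は[MASK]である"))
# expected -> ['我が', '##輩', 'は', '[MASK]', 'で', '或る']
```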
How to solve:

1. Call `sanitize_special_tokens` from the `__init__` of `BertSudachipyTokenizer` (see the sketch after this list).
2. Recommend instantiating `BertSudachipyTokenizer` using the `from_pretrained` method.
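As a rough illustration of option 1 (not the actual sudachitra source; the subclass name and argument handling are assumed), the change would amount to calling `sanitize_special_tokens()` at the end of `__init__`:

```python
from sudachitra import BertSudachipyTokenizer


class PatchedBertSudachipyTokenizer(BertSudachipyTokenizer):
    """Hypothetical subclass sketching option 1."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # sanitize_special_tokens() adds the registered special tokens
        # ([CLS], [SEP], [MASK], ...) to unique_no_split_tokens, so they
        # are no longer split during pre-tokenization, matching what
        # from_pretrained already does.
        self.sanitize_special_tokens()
```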
I want to discuss which approach is suitable.