Tokenizer initializations behave differently #43

mh-northlander (Collaborator) opened this issue Feb 9, 2022 · 0 comments

Problem:
The tokenizer does not handle special tokens properly when it is constructed directly via BertSudachipyTokenizer(...) rather than via from_pretrained.

Reproduction:

```python
from pathlib import Path

from sudachitra import BertSudachipyTokenizer

# Assume vocab.txt and tokenizer_config.json exist in the model_path directory.
model_path = Path("/path/to/model/")

tok1 = BertSudachipyTokenizer.from_pretrained(model_path)
tok2 = BertSudachipyTokenizer(
    vocab_file=model_path / "vocab.txt",
    do_nfkc=True,
    word_form_type="normalized_and_surface",
)

print(tok1.tokenize("吾輩は[MASK]である"))  # -> ['我が', '##輩', 'は', '[MASK]', 'で', '或る']
print(tok2.tokenize("吾輩は[MASK]である"))  # -> ['我が', '##輩', 'は', '[', 'マスク', ']', 'で', 'ある']
```

Cause:
unique_no_split_tokens is not set for tok2. Because the list is empty, tokenize() does not protect special tokens such as [MASK] from pre-tokenization, so they are split and normalized like ordinary text.

```python
print(tok1.unique_no_split_tokens)  # -> ['[CLS]', '[MASK]', '[PAD]', '[SEP]', '[UNK]']
print(tok2.unique_no_split_tokens)  # -> []
```

For tok1, it was set by from_pretrained calling sanitize_special_tokens (https://github.com/huggingface/transformers/blob/0fe17f375a4f0fdd9aea260d0645ccfd4896e958/src/transformers/tokenization_utils_base.py#L1984).
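
For reference, sanitize_special_tokens at the linked commit is essentially a one-liner that re-adds every special token with special_tokens=True, which is what populates unique_no_split_tokens (paraphrased from the transformers source; the comment is mine):

```python
# Paraphrased from transformers/tokenization_utils_base.py:
# re-registering the special tokens routes them through
# _add_tokens(..., special_tokens=True), which inserts them
# into unique_no_split_tokens for slow tokenizers.
def sanitize_special_tokens(self) -> int:
    return self.add_tokens(self.all_special_tokens_extended, special_tokens=True)
```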

How to solve:

  1. Call sanitize_special_tokens from the __init__ of BertSudachipyTokenizer (a workaround along these lines is sketched below).
  2. Require users to create BertSudachipyTokenizer via the from_pretrained method.

I would like to discuss which approach is more suitable.
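
As a stopgap that mirrors option 1, manually calling sanitize_special_tokens on a directly constructed tokenizer should repopulate the list; a sketch continuing the reproduction code above, expected outputs untested:

```python
# Workaround sketch: re-register the special tokens on tok2 by hand.
tok2.sanitize_special_tokens()
print(tok2.unique_no_split_tokens)  # expected: ['[CLS]', '[MASK]', '[PAD]', '[SEP]', '[UNK]']
print(tok2.tokenize("吾輩は[MASK]である"))  # expected: ['我が', '##輩', 'は', '[MASK]', 'で', '或る']
```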
