Tokenizer initializations behave differently #43

mh-northlander (Collaborator) opened this issue Feb 9, 2022 · 0 comments

Problem:
The tokenizer does not handle special tokens properly when it is constructed directly via BertSudachipyTokenizer(...) rather than via from_pretrained.

Reproduction:

```python
from pathlib import Path

from sudachitra import BertSudachipyTokenizer

# Assume vocab.txt and tokenizer_config.json exist in the model_path directory.
model_path = Path("/path/to/model/")

tok1 = BertSudachipyTokenizer.from_pretrained(model_path)
tok2 = BertSudachipyTokenizer(
    vocab_file=model_path / "vocab.txt",
    do_nfkc=True,
    word_form_type="normalized_and_surface",
)

print(tok1.tokenize("吾輩は[MASK]である"))  # -> ['我が', '##輩', 'は', '[MASK]', 'で', '或る']
print(tok2.tokenize("吾輩は[MASK]である"))  # -> ['我が', '##輩', 'は', '[', 'マスク', ']', 'で', 'ある']
```

Cause:
unique_no_split_tokens is not set for tok2. Because the list is empty, tokenize() does not protect special tokens such as [MASK] from pre-tokenization, so they are split and normalized like ordinary text.

```python
print(tok1.unique_no_split_tokens)  # -> ['[CLS]', '[MASK]', '[PAD]', '[SEP]', '[UNK]']
print(tok2.unique_no_split_tokens)  # -> []
```

For tok1, it was set by from_pretrained calling sanitize_special_tokens (https://github.com/huggingface/transformers/blob/0fe17f375a4f0fdd9aea260d0645ccfd4896e958/src/transformers/tokenization_utils_base.py#L1984).
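
For reference, sanitize_special_tokens at the linked commit is essentially a one-liner that re-adds every special token with special_tokens=True, which is what populates unique_no_split_tokens (paraphrased from the transformers source; the comment is mine):

```python
# Paraphrased from transformers/tokenization_utils_base.py:
# re-registering the special tokens routes them through
# _add_tokens(..., special_tokens=True), which inserts them
# into unique_no_split_tokens for slow tokenizers.
def sanitize_special_tokens(self) -> int:
    return self.add_tokens(self.all_special_tokens_extended, special_tokens=True)
```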

How to solve:

  1. Call sanitize_special_tokens from the __init__ of BertSudachipyTokenizer (a workaround along these lines is sketched below).
  2. Require users to create BertSudachipyTokenizer via the from_pretrained method.

I would like to discuss which approach is more suitable.
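
As a stopgap that mirrors option 1, manually calling sanitize_special_tokens on a directly constructed tokenizer should repopulate the list; a sketch continuing the reproduction code above, expected outputs untested:

```python
# Workaround sketch: re-register the special tokens on tok2 by hand.
tok2.sanitize_special_tokens()
print(tok2.unique_no_split_tokens)  # expected: ['[CLS]', '[MASK]', '[PAD]', '[SEP]', '[UNK]']
print(tok2.tokenize("吾輩は[MASK]である"))  # expected: ['我が', '##輩', 'は', '[MASK]', 'で', '或る']
```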
