Cleaner Tokenization Handling in tidybert::tidy_bert_output() #18

Open
jonthegeek opened this issue Nov 7, 2022 · 0 comments
The basic_usage.Rmd vignette still requires a manual tokenization step. That shouldn't be necessary.

Right now we only auto-tokenize during {luz} fit/predict (via the callback that tells it to do so). We need a clean way to tokenize when we use pretrained BERTs more directly, as we do here.

If that worked, we could create an (untokenized) dataset, then use it in the model, at which point it would be updated to match the model (or we'd call a helper, or whatever). Then tidy_bert_output() could accept a dataset_bert_pretrained as its second argument.
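The desired workflow might look something like the sketch below. This is purely illustrative: apart from `tidy_bert_output()` and `dataset_bert_pretrained`, which the issue names, every function and argument here (the dataset constructor signature, the `predict()` call, the output shape) is an assumption about an API that doesn't exist yet, not a description of tidybert's current behavior.

```r
# HYPOTHETICAL sketch of the proposed API -- names and signatures are
# assumptions, not the current tidybert interface.
library(tidybert)

# 1. Create the dataset WITHOUT tokenizing up front.
ds <- dataset_bert_pretrained(my_df)

# 2. Using it with a pretrained model would tokenize it to match that
#    model's vocabulary/tokenizer (or a helper would do so explicitly).
output <- predict(model, ds)

# 3. tidy_bert_output() could then take the dataset directly as its
#    second argument, instead of requiring a manual tokenize step.
tidy_bert_output(output, ds)
```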
