2.5.0-rc0
Pre-release
Release 2.5.0-rc0
Major Features and Improvements
- Add a subwords tokenizer tutorial to text/examples.
- Add a function to generate a BERT vocab from a tf.data.Dataset.
- Add detokenize methods for BertTokenizer and WordpieceTokenizer.
- Let SentencePieceTokenizer optionally return the nbest tokenizations instead of sampling from them.
- Enable NFD and NFKD in NormalizeWithOffset op
- Add an i18n-friendly BasicTokenizer that can preserve accents.
- Create guide for tokenizers.
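
The NFD/NFKD and accent-preservation items above concern Unicode normalization forms. As a rough illustration only, the sketch below uses Python's standard `unicodedata` module, not the tensorflow_text ops, to show what the two forms do and why a tokenizer that strips combining marks loses accents. The helper names `decompose` and `strip_accents` are hypothetical and are not part of the library.

```python
import unicodedata


def decompose(text: str, form: str = "NFD") -> str:
    """Return `text` in the requested Unicode normalization form."""
    return unicodedata.normalize(form, text)


def strip_accents(text: str) -> str:
    """Lossy step some BERT-style basic tokenizers apply:
    decompose canonically, then drop combining marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# NFD splits a precomposed letter into base letter + combining mark.
nfd_e_acute = decompose("\u00e9")           # 'e' followed by U+0301

# NFD leaves compatibility characters alone; NFKD expands them.
nfd_ligature = decompose("\ufb01", "NFD")   # still the single 'fi' ligature
nfkd_ligature = decompose("\ufb01", "NFKD") # two characters: 'f', 'i'

# Stripping marks after NFD removes the accent entirely.
stripped = strip_accents("caf\u00e9")
```

An accent-preserving tokenizer would simply skip the `strip_accents` step, which matters for languages where the accent distinguishes words.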
Breaking Changes
Bug Fixes and Other Changes
- Other:
- For Windows, always include ICU data files since they need to be built in statically.
- Patch TF so that Windows builds do not look for a python3 executable.
- Rename documentation file WordShape.md to WordShape_cls.md. On macOS (and maybe Windows) the original filename collides with wordshape.md, because the filesystem does not distinguish filenames by case. This is purely a quality-of-life change for anybody checking out the library on a non-Linux platform. Fixes #361.
- Convert input to tensor to allow for numpy inputs to state based sentence breaker.
- Add classifiers to py packages and fix header image.
- Fix bad rendering of the add_eos and add_bos descriptions in SentencepieceTokenizer.md.
- Fix for the model server test. Make sure our test tensors have the expected
- Update regression test for break_sentences_with_offsets.
- Add a shape attribute to the ToDense Keras layer.
- Add support for [batch, 1] shaped inputs in StateBasedSentenceBreaker.
- Fix for the model server test. The result of the tokenize() method of
- Refactor saved_model.py to make it easier to comment out blocks of related code to identify problems. Also moved out the vocab for Wordpiece due to a tf bug.
- Update documentation for SplitMergeFromLogitsTokenizer
- Add regression test for Find Source Offsets
- Fix unselectable_ids shape check in ItemSelector.
- Change two tests to debug a failure on the Kokoro Windows build.
- Switch out architecture image in tf.Text documentation.
- Fix regression test for state_based_sentence_breaker_v2
- Update run_build with enable_runfiles flag.
- Update the version of bazel_skylib to match TF's and fix a possible visibility issue.
- Simplify tf-text WORKSPACE, by relying on tf_workspace().
- Update transformer.ipynb to use a saved text.BertTokenizer.
- Fix typos.
- Update mobile targets to use :mobile rather than separate :android & :ios targets.
- Make tools part of the tensorflow_text pip package.
- Import tools from the tf-text package, instead of cloning the git repo.
- Minor cleanups to make some code compile on the android build system.
- Fix pip install command in readme
- Fix tools pip package inclusion.
- Clear outputs
- A tensorflow.org-compatible docs generator for tf-text.
- Formatting fixes for tensorflow.org
- Sample random tokens correctly during MLM.
- Internal repo change
- Treat Sentencepiece ops as stateful in tf.data pipelines.
- Reduce the critical section range. Because the options are
- Replace use of TFT's deprecated dataset_schema.from_feature_spec with its replacement, schema_utils.schema_from_feature_spec.
- Update guide with new template.
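
The WordShape.md rename listed above fixes a collision that only appears on case-insensitive filesystems. A minimal sketch of how such collisions could be detected in a list of filenames follows; the helper `case_collisions` is hypothetical and is not part of the tf-text tooling.

```python
from collections import defaultdict


def case_collisions(filenames):
    """Group names that differ only by case, as they would collide on a
    case-insensitive filesystem. Returns a list of sorted groups."""
    groups = defaultdict(set)
    for name in filenames:
        # casefold() gives a case-insensitive comparison key.
        groups[name.casefold()].add(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Before the rename: the two documentation files collide.
before = case_collisions(["WordShape.md", "wordshape.md", "README.md"])

# After the rename: no collision remains.
after = case_collisions(["WordShape_cls.md", "wordshape.md", "README.md"])
```

A check like this could run on Linux, where both files can coexist and the problem would otherwise go unnoticed.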
Thanks to our Contributors
This release contains contributions from many people at Google, as well as:
Rens, Samuel Marks, thuang513