Release 2.5.0-rc0

Major Features and Improvements

- Add a subwords tokenizer tutorial to text/examples.
- Add a function to generate a BERT vocab from a tf.data.Dataset.
- Add detokenize methods for BertTokenizer and WordpieceTokenizer.
- Let SentencePieceTokenizer optionally return the nbest tokenizations instead of sampling from them.
- Enable NFD and NFKD in the NormalizeWithOffset op.
- Add an i18n-friendly BasicTokenizer that can preserve accents.
- Create a guide for tokenizers.
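
For readers unfamiliar with the NFD and NFKD forms mentioned above, here is a minimal sketch of how they differ. It uses Python's standard `unicodedata` module, not the tf-text op, and the helper names `normalize` and `strip_accents` are illustrative assumptions, not part of this release.

```python
import unicodedata


def normalize(text: str, form: str) -> str:
    """Normalize text with the standard library (not the tf-text op)."""
    return unicodedata.normalize(form, text)


def strip_accents(text: str) -> str:
    """Drop combining marks after NFD decomposition (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# "café" with a precomposed é, followed by the single-codepoint "fi" ligature.
sample = "caf\u00e9 \ufb01"

# NFD splits é into "e" plus a combining acute accent and keeps the ligature.
nfd = normalize(sample, "NFD")

# NFKD does the same and also expands the ligature into "f" + "i".
nfkd = normalize(sample, "NFKD")

print([hex(ord(ch)) for ch in nfd])
print([hex(ord(ch)) for ch in nfkd])
print(strip_accents("caf\u00e9"))
```

A tokenizer that preserves accents would skip a step like `strip_accents`; one that removes them typically decomposes first so the combining marks can be filtered out.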

Breaking Changes

Bug Fixes and Other Changes

Other:
- For Windows, always include ICU data files since they need to be built in statically.
- Patch TF so that Windows builds do not look for a python3 executable.
- Rename documentation file WordShape.md to WordShape_cls.md. On macOS (and maybe Windows) this filename collides with wordshape.md because the filesystem does not distinguish filenames by case. This is purely a quality-of-life change for anyone checking out the library on a non-Linux platform. Fixes #361.
- Convert input to tensor to allow for numpy inputs to state based sentence breaker.
- Add classifiers to py packages and fix header image.
- Fix bad rendering of the add_eos and add_bos descriptions in SentencepieceTokenizer.md.
- Fix for the model server test. Make sure our test tensors have the expected
- Update regression test for break_sentences_with_offsets.
- Add a shape attribute to the ToDense Keras layer.
- Add support for [batch, 1] shaped inputs in StateBasedSentenceBreaker
- Fix for the model server test. The result of the tokenize() method of
- Refactor saved_model.py to make it easier to comment out blocks of related code to identify problems. Also moved out the vocab for Wordpiece due to a tf bug.
- Update documentation for SplitMergeFromLogitsTokenizer
- Add regression test for Find Source Offsets
- Fix unselectable_ids shape check in ItemSelector.
- Change two tests to debug a failure on the Kokoro Windows build.
- Switch out architecture image in tf.Text documentation.
- Fix regression test for state_based_sentence_breaker_v2
- Update run_build with enable_runfiles flag.
- Update the version of bazel_skylib to match TF's and fix a possible visibility issue.
- Simplify tf-text WORKSPACE, by relying on tf_workspace().
- Update transformer.ipynb to use a saved text.BertTokenizer
- Fix typos.
- Update mobile targets to use :mobile rather than separate :android & :ios targets.
- Make tools part of the tensorflow_text pip package.
- Import tools from the tf-text package, instead of cloning the git repo.
- Minor cleanups to make some code compile on the android build system.
- Fix pip install command in readme
- Fix tools pip package inclusion.
- Clear outputs
- Add a tensorflow.org-compatible docs generator for tf-text.
- Formatting fixes for tensorflow.org
- Sample random tokens correctly during MLM.
- Internal repo change
- Treat Sentencepiece ops as stateful in tf.data pipelines.
- Reduce the critical section range. Because the options are
- Replace use of TFT's deprecated dataset_schema.from_feature_spec with its replacement, schema_utils.schema_from_feature_spec.
- Update guide with new template.
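
The WordShape.md rename in the list above addresses filenames that differ only by case. As a rough illustration of that kind of collision, the sketch below groups names by their case-folded form. The function `find_case_collisions` is a hypothetical helper written for this note, not a tool shipped with tf-text.

```python
from collections import defaultdict


def find_case_collisions(names):
    """Return groups of names that collide on a case-insensitive filesystem."""
    groups = defaultdict(list)
    for name in names:
        # casefold() gives a case-insensitive key for comparison.
        groups[name.casefold()].append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Before the rename, two doc files differed only by case.
before = find_case_collisions(["WordShape.md", "wordshape.md"])

# After the rename, the names are distinct even when case is ignored.
after = find_case_collisions(["WordShape_cls.md", "wordshape.md"])

print(before)
print(after)
```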

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Rens, Samuel Marks, thuang513
