
2.5.0-rc0

Pre-release
@gregbillock gregbillock released this 06 Apr 23:41
· 49 commits to 2.5 since this release

Release 2.5.0-rc0

Major Features and Improvements

  • Add a subwords tokenizer tutorial to text/examples.
  • Add a function to generate a BERT vocab from a tf.data.Dataset.
  • Add detokenize methods for BertTokenizer and WordpieceTokenizer.
  • Let SentencePieceTokenizer optionally return the nbest tokenizations instead of sampling from them.
  • Enable NFD and NFKD in the NormalizeWithOffset op.
  • Add an i18n-friendly BasicTokenizer that can preserve accents.
  • Create guide for tokenizers.
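
As a rough illustration of the two normalization forms named above, the sketch below uses Python's standard `unicodedata` module. It does not call the tf.Text op and it tracks no offsets. The sample string and the variable names are assumptions made for this example only.

```python
import unicodedata

# Sample input (an assumption for illustration): a precomposed e-acute (U+00E9),
# a space, and the "fi" ligature (U+FB01).
text = "\u00e9 \ufb01"

# NFD applies canonical decomposition only: the accented letter is split into
# a base letter plus a combining mark, but the ligature is left intact.
nfd = unicodedata.normalize("NFD", text)

# NFKD also applies compatibility decomposition, so the ligature is expanded
# into the two plain letters "f" and "i".
nfkd = unicodedata.normalize("NFKD", text)

print([hex(ord(c)) for c in nfd])
print([hex(ord(c)) for c in nfkd])
```

Both forms change the length of the string, which is why an offset-tracking variant of normalization is useful when results must be mapped back to the original text.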

Breaking Changes

Bug Fixes and Other Changes

  • Other:
    • For Windows, always include ICU data files since they need to be built in statically.
    • Patch TF to fix Windows builds so they do not look for a python3 executable.
    • Rename documentation file WordShape.md to WordShape_cls.md. On macOS (and possibly Windows) the old filename collides with wordshape.md because the filesystem does not distinguish filenames by case. This is purely a quality-of-life change for anyone checking out the library on a non-Linux platform. Fixes #361.
    • Convert input to a tensor to allow NumPy inputs to the state-based sentence breaker.
    • Add classifiers to py packages and fix header image.
    • Fix bad rendering of the add_eos and add_bos descriptions in SentencepieceTokenizer.md.
    • Fix for the model server test. Make sure our test tensors have the expected
    • Update regression test for break_sentences_with_offsets.
    • Add a shape attribute to the ToDense Keras layer.
    • Add support for [batch, 1] shaped inputs in StateBasedSentenceBreaker
    • Fix for the model server test. The result of the tokenize() method of
    • Refactor saved_model.py to make it easier to comment out blocks of related code to identify problems. Also moved out the vocab for Wordpiece due to a tf bug.
    • Update documentation for SplitMergeFromLogitsTokenizer
    • Add regression test for Find Source Offsets
    • Fix unselectable_ids shape check in ItemSelector.
    • Change two tests to debug a failure on the Kokoro Windows build.
    • Switch out architecture image in tf.Text documentation.
    • Fix regression test for state_based_sentence_breaker_v2
    • Update run_build with enable_runfiles flag.
    • Update the version of bazel_skylib to match TF's and fix a possible visibility issue.
    • Simplify tf-text WORKSPACE, by relying on tf_workspace().
    • Update transformer.ipynb to use a saved text.BertTokenizer
    • Fix typos.
    • Update mobile targets to use :mobile rather than separate :android & :ios targets.
    • Make tools part of the tensorflow_text pip package.
    • Import tools from the tf-text package, instead of cloning the git repo.
    • Minor cleanups to make some code compile on the android build system.
    • Fix pip install command in readme
    • Fix tools pip package inclusion.
    • Clear outputs
    • Add a tensorflow.org-compatible docs generator for tf-text.
    • Formatting fixes for tensorflow.org
    • Sample random tokens correctly during MLM.
    • Internal repo change
    • Treat Sentencepiece ops as stateful in tf.data pipelines.
    • Reduce the critical section range. Because the options are
    • Replace use of TFT's deprecated dataset_schema.from_feature_spec with its replacement, schema_utils.schema_from_feature_spec.
    • Updating guide with new template
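
The WordShape.md rename listed above addresses filenames that differ only by case. A minimal sketch of how such collisions could be detected is shown below. The helper `find_case_collisions` and the file list are hypothetical and are not part of tf-text or its tooling.

```python
from collections import defaultdict


def find_case_collisions(names):
    """Group names that become identical once case is ignored.

    Hypothetical helper for illustration; returns a list of sorted groups,
    one per set of names that would clash on a case-insensitive filesystem.
    """
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Before the rename: the two documentation files clash.
before = ["WordShape.md", "wordshape.md"]
print(find_case_collisions(before))

# After the rename: no two names fold to the same key.
after = ["WordShape_cls.md", "wordshape.md"]
print(find_case_collisions(after))
```

A check along these lines could run on Linux, where both files can coexist and the collision would otherwise go unnoticed.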

Thanks to our Contributors

This release contains contributions from many people at Google, as well as:

Rens, Samuel Marks, thuang513