0.6.1

@eiennohito eiennohito released this 08 Dec 08:45
· 180 commits to develop since this release
e13bf75

Highlights

  • Added fuzzing (see the sudachi-fuzz subdirectory). Sudachi.rs seems to be pretty robust against arbitrary inputs (no crashes or panics)
    • Issues like #182 should no longer occur
  • ~5% analysis speed improvement over 0.6.0
  • Added support for Unicode combining symbols. Sudachi.rs/py should now handle emoji (🎅🏾) and more complex Unicode (İstanbul) much better
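
To illustrate why the two inputs mentioned above are tricky, here is a minimal sketch that uses only the Python standard library. It does not show how Sudachi handles them internally; it only shows that one visible symbol can span several code points. The helper name `describe` is hypothetical.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name, canonical combining class) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.combining(ch))
        for ch in text
    ]


# The emoji from the note above: a base emoji followed by a skin tone modifier.
santa = "\U0001F385\U0001F3FE"
# "İstanbul" starts with U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE.
istanbul = "\u0130stanbul"

# One symbol on screen, two code points in the string.
print(len(santa), describe(santa))

# Lowercasing U+0130 yields "i" plus U+0307 COMBINING DOT ABOVE,
# so the lowercased string is one code point longer than the original.
lowered = istanbul.lower()
print(len(istanbul), len(lowered), describe(lowered)[:2])
```

A tokenizer that normalizes or lowercases its input has to keep track of such length changes and must not cut between a base character and the symbols attached to it.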

Rust

  • Added partial dictionary read functionality: it is now possible to skip reading certain fields if they are not needed
  • Improved startup times, especially for debug builds

Python

  • The Morpheme.part_of_speech method now returns a tuple of POS components instead of a list
  • Partial Dictionary Read
  • HuggingFace PreTokenizer support
    • We provide a built-in HuggingFace-compatible pre-tokenizer
    • API: Dictionary.pre_tokenizer()
    • It is multithreading-compatible and supports customization
  • Memory allocation reuse
    • It is possible to reduce re-allocation overhead by using out parameters which accept MorphemeLists
    • Supported API: Tokenizer.tokenize(), Morpheme.split()
    • This is now the recommended way to use both of these APIs
  • PosMatcher
    • New API for checking if a morpheme has a POS tag from a set
    • Strongly prefer using it instead of string comparison of POS components
  • Performance
    • Greatly decreased cost of accessing POS components
  • len(Morpheme) now returns the length of the morpheme in Unicode codepoints. Use it instead of len(m.surface())
  • Morpheme.split() has a new add_single parameter, which can be used to check whether the split has produced anything
    • E.g. if m.split(SplitMode.A, out=res, add_single=False): handle_splits(res)
    • add_single=True keeps the existing behavior of returning a list that contains the current morpheme
  • Morpheme/MorphemeList now have readable __repr__ and __str__
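
The out parameter, the add_single flag, and the matcher-style POS check described above can be combined as in the following sketch. It is an illustration under assumptions, not code from the project: the helper names `count_matching` and `surfaces_after_split` are hypothetical, the tokenizer, matcher, and morpheme objects are expected to be supplied by the caller, and the matcher is assumed to be a callable that takes a morpheme and returns a bool. The sketch also assumes that Tokenizer.tokenize() accepts out=None on the first call and returns the list it filled.

```python
def count_matching(tokenizer, matcher, lines):
    """Count morphemes accepted by `matcher`, reusing one result list.

    Assumption: `tokenizer.tokenize(text, out=...)` fills and returns `out`,
    and allocates a fresh list when `out` is None.
    """
    out = None
    total = 0
    for line in lines:
        # Passing the previous result back in avoids allocating a new list per line.
        out = tokenizer.tokenize(line, out=out)
        total += sum(1 for m in out if matcher(m))
    return total


def surfaces_after_split(morpheme, mode, out):
    """Return the surfaces of the sub-morphemes, or of the morpheme itself.

    Assumption: with add_single=False the returned list is empty (falsy)
    when the split produced nothing, as in the example from the note above.
    """
    if morpheme.split(mode, out=out, add_single=False):
        # Copy the strings out immediately, because `out` may be overwritten later.
        return [part.surface() for part in out]
    return [morpheme.surface()]
```

Both helpers return plain Python values (an int and a list of strings) rather than morphemes, so the reused list can be overwritten by the next call without affecting earlier results.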