Release 0.6.1 · WorksApplications/sudachi.rs

Highlights

Added fuzzing (see the sudachi-fuzz subdirectory). Sudachi.rs seems to be pretty robust against arbitrary inputs (no crashes or panics)
- Issues like #182 should no longer occur
~5% analysis speed improvement over 0.6.0
Added support for Unicode combining symbols. Sudachi.rs/py should now handle emoji (🎅🏾) and more complex Unicode (İstanbul) much better
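
To illustrate why such inputs are tricky, here is a minimal sketch that uses only the Python standard library (it does not call Sudachi). It shows that both examples above consist of more codepoints than visible characters; the variable and function names are illustrative.

```python
import unicodedata

santa = "\U0001F385\U0001F3FE"  # 🎅🏾: base emoji followed by a skin-tone modifier
city = "\u0130stanbul"          # İstanbul, starting with a dotted capital I


def codepoints(text):
    """Return the codepoints of text formatted as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


# The emoji renders as one glyph but is stored as two codepoints.
print(len(santa), codepoints(santa))

# Lowercasing the dotted capital I yields "i" plus a combining dot above,
# so the lowercased word is one codepoint longer than the original.
lowered = city.lower()
print(len(city), len(lowered))
print([f"U+{ord(ch):04X}" for ch in lowered if unicodedata.combining(ch)])
```

A tokenizer that treats every codepoint as an independent character would split such sequences in the middle, which is the kind of input this release is meant to handle better.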

Added partial dictionary read functionality: it is now possible to skip reading certain fields if they are not needed
Improved startup times, especially for debug builds

The Morpheme.part_of_speech method now returns a tuple of POS components instead of a list.
Partial Dictionary Read
- It is possible to ask for a subset of morpheme fields instead of all fields
- Supported API: Dictionary.create(), Dictionary.pre_tokenizer()
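
A minimal sketch of a partial read from Python might look as follows. It assumes that the subset is passed to Dictionary.create() as a set of field names through a keyword argument called fields, and that "pos" is a valid field name; neither detail is stated in these notes, so check them against the API documentation. The helper names are hypothetical.

```python
def create_pos_only_tokenizer():
    """Build a tokenizer that asks only for the POS field of each morpheme."""
    # Imported lazily so this snippet can be loaded without SudachiPy installed.
    from sudachipy import Dictionary

    # ASSUMPTION: the keyword is `fields` and takes a set of field names.
    return Dictionary().create(fields={"pos"})


def pos_tags(tokenizer, text):
    """Return the POS tuple of every morpheme in text."""
    return [m.part_of_speech() for m in tokenizer.tokenize(text)]
```

Usage would be pos_tags(create_pos_only_tokenizer(), some_text), which requires SudachiPy and a dictionary package to be installed.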
HuggingFace PreTokenizer support
- We provide a built-in HuggingFace-compatible pre-tokenizer
- API: Dictionary.pre_tokenizer()
- It is multithreading-compatible and supports customization
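
As a rough sketch, wiring the pre-tokenizer into a HuggingFace tokenizer could look like this. It assumes that the object returned by Dictionary.pre_tokenizer() can be assigned directly to the pre_tokenizer attribute of a tokenizers.Tokenizer; the helper name is hypothetical and the customization options are left at their defaults.

```python
def attach_sudachi_pre_tokenizer(hf_tokenizer, dictionary):
    """Use Sudachi as the pre-tokenization step of a HuggingFace tokenizer."""
    # Dictionary.pre_tokenizer() is the API named above.
    # ASSUMPTION: its result is directly assignable to `pre_tokenizer`.
    hf_tokenizer.pre_tokenizer = dictionary.pre_tokenizer()
    return hf_tokenizer
```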
Memory allocation reuse
- It is possible to reduce re-allocation overhead by using out parameters which accept MorphemeLists
- Supported API: Tokenizer.tokenize(), Morpheme.split()
- This is now the recommended way to use both of these APIs
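
The reuse pattern can be sketched as below. It assumes that Tokenizer.tokenize() takes the list through a keyword argument named out (the name that the Morpheme.split() example in these notes uses), accepts out=None on the first call, and returns the list it filled. The function name is hypothetical.

```python
def count_morphemes(tokenizer, lines):
    """Tokenize many lines while reusing a single MorphemeList."""
    morphemes = None  # the first call allocates; later calls refill the same list
    total = 0
    for line in lines:
        # ASSUMPTION: `out` is the keyword name and the filled list is returned.
        morphemes = tokenizer.tokenize(line, out=morphemes)
        total += len(morphemes)
    return total
```

Because the list is overwritten on every call, anything needed from it has to be read before the next tokenize() call.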
PosMatcher
- New API for checking if a morpheme has a POS tag from a set
- Strongly prefer using it instead of string comparison of POS components
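
A small sketch of the intended usage, under the assumption that a matcher is created with Dictionary.pos_matcher() from a list of (possibly partial) POS tuples and is then called on a morpheme to get a boolean. Those names are not given in these notes, so treat them as assumptions; keep_nouns is a hypothetical helper.

```python
def keep_nouns(dictionary, morphemes):
    """Return only the morphemes whose first POS component is 名詞 (noun)."""
    # ASSUMPTION: pos_matcher() accepts partial POS tuples and returns a callable.
    is_noun = dictionary.pos_matcher([("名詞",)])
    # Build the matcher once, then apply it to every morpheme.
    return [m for m in morphemes if is_noun(m)]
```

Compared with writing m.part_of_speech()[0] == "名詞", the set of POS tags is specified once when the matcher is built rather than compared as strings for every morpheme.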
Performance
- Greatly decreased cost of accessing POS components
len(Morpheme) now returns the length of the morpheme in Unicode codepoints. Use it instead of len(m.surface())
Morpheme.split() has a new add_single parameter, which can be used to check whether the split has produced anything
- For example: if m.split(SplitMode.A, out=res, add_single=False): handle_splits(res)
- With add_single=True, the returned list contains the current morpheme itself when the split produces nothing, which is the existing behavior
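
The one-line example above can be expanded into a small helper. This is a sketch that relies only on the behavior described here (an empty, and therefore falsy, result when add_single=False and nothing was split); split_or_keep is a hypothetical name.

```python
def split_or_keep(morpheme, mode, out):
    """Return the sub-morphemes of morpheme, or [morpheme] if it does not split."""
    # With add_single=False the result is empty when the split produced nothing,
    # so its truthiness tells us whether any sub-morphemes exist.
    if morpheme.split(mode, out=out, add_single=False):
        return list(out)  # copy, because `out` is reused by later calls
    return [morpheme]
```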
Morpheme/MorphemeList now have readable __repr__ and __str__
- #187