Releases
v0.6.1
Highlights
Added fuzzing (see the sudachi-fuzz subdirectory). Sudachi.rs appears to be robust against arbitrary input (no crashes or panics)
Issues like #182 should no longer occur
~5% analysis speed improvement over 0.6.0
Added support for Unicode combining symbols. Sudachi.rs/py should now handle emoji (🎅🏾) and more complex Unicode (İstanbul) much better
Rust
Added partial dictionary read functionality: it is now possible to skip reading certain fields if they are not needed
Improved startup times, especially for debug builds
Python
The Morpheme.part_of_speech method now returns a tuple of POS components instead of a list.
Partial Dictionary Read
HuggingFace PreTokenizer support
We provide a built-in HuggingFace-compatible pre-tokenizer
API: Dictionary.pre_tokenizer()
It is multithreading-compatible and supports customization
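A minimal sketch of plugging the pre-tokenizer into a HuggingFace `tokenizers` pipeline. Only `Dictionary.pre_tokenizer()` comes from these notes; the helper name `attach_sudachi` is hypothetical, and assigning to the `pre_tokenizer` attribute follows the usual `tokenizers` convention.

```python
def attach_sudachi(hf_tokenizer, dictionary):
    """Use Sudachi for pre-tokenization in a HuggingFace tokenizer.

    `hf_tokenizer` is expected to be a `tokenizers.Tokenizer` and
    `dictionary` a `sudachipy.Dictionary`. The helper name is illustrative.
    """
    # Dictionary.pre_tokenizer() is the API named in these release notes.
    hf_tokenizer.pre_tokenizer = dictionary.pre_tokenizer()
    return hf_tokenizer
```

With both packages and a Sudachi dictionary installed, a call would look something like `attach_sudachi(Tokenizer(WordLevel(unk_token="[UNK]")), Dictionary())`.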
Memory allocation reuse
It is possible to reduce re-allocation overhead by using out parameters, which accept MorphemeLists
Supported API: Tokenizer.tokenize(), Morpheme.split()
This is now the recommended way to use both APIs
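The reuse pattern can be sketched as follows. The function name `count_morphemes` is hypothetical; it assumes only that `tokenize()` accepts an `out` keyword and that a MorphemeList supports `len()`.

```python
def count_morphemes(tokenizer, lines):
    """Tokenize every line while reusing a single MorphemeList."""
    buffer = None
    total = 0
    for line in lines:
        if buffer is None:
            # The first call allocates the list that later calls reuse.
            buffer = tokenizer.tokenize(line)
        else:
            # Later calls overwrite the same list instead of allocating.
            tokenizer.tokenize(line, out=buffer)
        total += len(buffer)
    return total
```

In real use `tokenizer` would come from `sudachipy.Dictionary().create()`. Because the buffer is overwritten on every call, anything needed later has to be copied out before the next `tokenize()`.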
PosMatcher
New API for checking if a morpheme has a POS tag from a set
Strongly prefer using it instead of string comparison of POS components
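A short sketch of filtering by POS with a matcher. These notes do not name the constructor; the example assumes SudachiPy's `Dictionary.pos_matcher()`, which takes a callable over the POS tuple and returns an object that can be called on a morpheme. `extract_nouns` is a hypothetical helper.

```python
def extract_nouns(dictionary, tokenizer, text):
    """Return the surfaces of all morphemes whose first POS component is 名詞."""
    # Build the matcher once (assumed API: Dictionary.pos_matcher).
    is_noun = dictionary.pos_matcher(lambda pos: pos[0] == "名詞")
    # Calling the matcher on a morpheme yields True or False.
    return [m.surface() for m in tokenizer.tokenize(text) if is_noun(m)]
```

Building the matcher once and reusing it keeps string comparisons of POS components out of the per-morpheme loop.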
Performance
Greatly decreased cost of accessing POS components
len(Morpheme) now returns the length of the morpheme in Unicode codepoints. Use it instead of len(m.surface())
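For reference, Python's `len()` on a `str` also counts codepoints, so `len(m)` is expected to agree with `len(m.surface())`. The snippet below uses plain strings rather than Sudachi objects to show how codepoint counts differ from UTF-8 byte counts.

```python
# Santa emoji followed by a skin-tone modifier: two codepoints, one glyph.
santa = "\U0001F385\U0001F3FE"
santa_codepoints = len(santa)                # 2
santa_bytes = len(santa.encode("utf-8"))     # 8

# "İstanbul" written with U+0130 (capital I with dot above).
city = "\u0130stanbul"
city_codepoints = len(city)                  # 8
city_bytes = len(city.encode("utf-8"))       # 9
```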
Morpheme.split() has a new add_single parameter, which can be used to check whether the split has produced anything
E.g. with add_single=False: if m.split(SplitMode.A, out=res, add_single=False): handle_splits(res)
add_single=True, which returns a list containing the current morpheme when the split produces nothing, is the current behavior
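A sketch of the add_single=False pattern in a loop, combined with a reusable out list. `split_or_keep` and its `scratch` argument are hypothetical names; `scratch` stands for a MorphemeList supplied by the caller.

```python
def split_or_keep(morphemes, mode, scratch):
    """Collect surfaces, replacing each morpheme by its parts when it splits."""
    surfaces = []
    for m in morphemes:
        # With add_single=False the result is empty (falsy) if nothing was split.
        if m.split(mode, out=scratch, add_single=False):
            surfaces.extend(part.surface() for part in scratch)
        else:
            surfaces.append(m.surface())
    return surfaces
```

In real use `mode` would be a value such as `SplitMode.A`, and `morphemes` the result of `Tokenizer.tokenize()`.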
Morpheme/MorphemeList now have readable __repr__ and __str__