- 36,497 Chinese news articles in total, collected from the Internet
Unlike English, Chinese text has no spaces between words. This project aims to implement Chinese word segmentation without a dictionary.
- develop a Chinese word segmentation algorithm based on entropy
- find new Chinese words (which are not in the corpus)
- see WordInfo.calculateAggregation()
- see WordInfo.calculateEntropy()
- In this project, I set FREQ_MIN = 5 and FREQ_MAX = 10. Words whose frequency falls between FREQ_MIN and FREQ_MAX are treated as new-word candidates
- Change INPUT_FILE in wordSegment.py
- Python 2
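
The aggregation, entropy and frequency steps listed above can be illustrated with a minimal sketch. This is not the code in `WordInfo`; it assumes one common formulation in which aggregation is the smallest ratio P(word) / (P(left) * P(right)) over all two-way splits, entropy is the smaller of the left and right neighbour entropies, and the frequency range is inclusive. The function names (`count_ngrams`, `aggregation`, `boundary_entropy`, `candidates`) and the use of the text length as the probability denominator are assumptions made for illustration only.

```python
import math
from collections import Counter

# Values taken from the README; treating the range as inclusive is an assumption.
FREQ_MIN, FREQ_MAX = 5, 10


def count_ngrams(text, max_len):
    """Count every substring of length 1..max_len in text."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def aggregation(word, counts, total):
    """Smallest P(word) / (P(left) * P(right)) over all two-way splits.

    Expects len(word) >= 2 and counts built with max_len >= len(word),
    so that every part of the word has a non-zero count.
    """
    total = float(total)
    p_word = counts[word] / total
    return min(
        p_word / ((counts[word[:i]] / total) * (counts[word[i:]] / total))
        for i in range(1, len(word))
    )


def neighbor_entropy(neighbors):
    """Shannon entropy (natural log) of a Counter of neighbouring characters."""
    n = float(sum(neighbors.values()))
    return -sum((c / n) * math.log(c / n) for c in neighbors.values())


def boundary_entropy(word, text):
    """Smaller of the left and right neighbour entropies of word in text."""
    left, right = Counter(), Counter()
    start = text.find(word)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(word)
        if end < len(text):
            right[text[end]] += 1
        start = text.find(word, start + 1)
    if not left or not right:
        return 0.0
    return min(neighbor_entropy(left), neighbor_entropy(right))


def candidates(counts, freq_min=FREQ_MIN, freq_max=FREQ_MAX):
    """Multi-character strings whose count lies in [freq_min, freq_max]."""
    return sorted(
        w for w, c in counts.items()
        if len(w) > 1 and freq_min <= c <= freq_max
    )
```

A string with high aggregation and high boundary entropy sticks together internally and appears in varied contexts, which is why both scores are useful when deciding whether a frequency-filtered candidate is a word. In Python 2, pass `unicode` strings so that each Chinese character counts as one unit.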