A Japanese text normalization library for Ruby that follows the conventions of neologd/mecab-ipadic-neologd, with some performance optimizations and no external dependencies. It is designed for preprocessing Japanese text before applying NLP techniques.
The specific rules are documented here: https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja
require "neologdish-normalizer"
Neologdish::Normalizer.normalize("南アルプスの 天然水- Sparking* Lemon+ レモン一絞り")
# => 南アルプスの天然水-Sparking*Lemon+レモン一絞り
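To give a feel for the kind of rules involved, here is a minimal sketch in plain Ruby of two of them: converting full-width alphanumerics to half-width, and dropping spaces that touch Japanese characters. The `sketch_normalize` method is hypothetical and written only for illustration. It is not this library's implementation and covers only a small subset of the documented rules.

```ruby
# Illustrative only: a tiny subset of NEologd-style rules.
# `sketch_normalize` is a hypothetical helper, not part of this library.
JAPANESE_CHAR = /[\p{Hiragana}\p{Katakana}\p{Han}]/

def sketch_normalize(text)
  # Full-width digits and Latin letters to their half-width forms.
  s = text.tr("０-９Ａ-Ｚａ-ｚ", "0-9A-Za-z")
  # Ideographic space (U+3000) to ASCII space, collapse runs, trim the ends.
  # String#strip alone would not remove U+3000, so convert first.
  s = s.tr("\u3000", " ").gsub(/ +/, " ").strip
  # Remove spaces that directly follow a Japanese character...
  s = s.gsub(/(?<=#{JAPANESE_CHAR}) +/, "")
  # ...and spaces that directly precede one.
  s.gsub(/ +(?=#{JAPANESE_CHAR})/, "")
end

puts sketch_normalize("　ＰＲＭＬ　副　読　本　")
# => PRML副読本
```

Unlike the full rule set, this sketch leaves spaces between ASCII symbols and Latin letters untouched, so it does not reproduce the library's output for every input.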
The performance comparison between the official Ruby example (https://github.com/neologd/mecab-ipadic-neologd/wiki/Regexp.ja#ruby-written-by-kimoto-and-overlast) and this library is as follows:
                          user     system      total        real
original normalizer:  4.200670   0.032004   4.232674 (  4.274573)
this library:         1.158801   0.005238   1.164039 (  1.170226)
The benchmark script is here: ./scripts/benchmark.rb
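A table with the columns shown above is the standard output format of Ruby's `Benchmark.bm`. As a rough sketch of how such a comparison could be set up, the following uses two hypothetical stand-in lambdas (`baseline` and `candidate`) instead of the real normalizers. The input text and iteration count are assumptions as well, and the actual script may be structured differently.

```ruby
require "benchmark"

# Hypothetical stand-ins; swap in the two normalizers being compared.
baseline  = ->(s) { s.gsub(/[ \u3000]+/, " ").strip }
candidate = ->(s) { s.tr("\u3000", " ").squeeze(" ").strip }

text = "南アルプスの　天然水-　Ｓｐａｒｋｉｎｇ*　Ｌｅｍｏｎ+　レモン一絞り"
iterations = 10_000 # assumed workload size

# Benchmark.bm prints user, system, total, and real (wall-clock) seconds
# for each labeled block and returns an array of Benchmark::Tms objects.
results = Benchmark.bm(12) do |x|
  x.report("baseline:")  { iterations.times { baseline.call(text) } }
  x.report("candidate:") { iterations.times { candidate.call(text) } }
end
```

Running both implementations on identical input and checking that they return the same result before timing them helps keep the comparison fair.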
Install the gem and add it to the application's Gemfile by executing:
bundle add 'neologdish-normalizer'
If bundler is not being used to manage dependencies, install the gem by executing:
gem install 'neologdish-normalizer'
After checking out the repo, run bin/setup to install dependencies. You can also run bin/console for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and the created tag, and push the .gem file to rubygems.org.
Bug reports and pull requests are welcome on GitHub at https://github.com/moznion/neologdish-normalizer.