-
Notifications
You must be signed in to change notification settings - Fork 59
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Space deduplication #79
Comments
The first issue is whether Python's import re
spaces = (
'\t', # tab
'\n', # newline
'\v', # vertical tab
'\f', # form feed
'\r', # carriage return
' ', # space
'\u0085', # next line
'\u00a0', # non-breaking space
'\u1680', # ogham space mark
'\u2000', # en quad
'\u2001', # em quad
'\u2002', # en space
'\u2003', # em space
'\u2004', # three-per-em space
'\u2005', # four-per-em space
'\u2006', # six-per-em space
'\u2007', # figure space
'\u2008', # punctuation space
'\u2009', # thin space
'\u200a', # hair space
'\u2028', # line separator
'\u2029', # paragraph separator
'\u202f', # narrow no-break space
'\u205f', # medium mathematical space
'\u3000', # ideographic space
)
for sp in spaces:
s = f'a{sp}b'
assert ' '.join(s.split()) == re.sub(r'\s+', ' ', s) I get no The next question is whether >>> import re
>>> s = ' a b '
>>> ' '.join(s.split())
'a b'
>>> re.sub(r'\s+', ' ', s)
' a b ' As long as you are currently always doing The final question (that I came up with) is whether doing so has any performance gains. I think in general a simple string operation is faster than invoking the regex engine, and $ python3 -m timeit -uusec -s 'import re; r = re.compile(r"\s+"); s = "a " * 20' '" ".join(s.split())'
500000 loops, best of 5: 0.521 usec per loop
$ python3 -m timeit -uusec -s 'import re; r = re.compile(r"\s+"); s = "a " * 20' 'r.sub(" ", s)'
100000 loops, best of 5: 3.73 usec per loop So assuming you are currently removing leading or trailing spaces, deduplicating spaces with |
Thanks @goodmami for the analysis!! Hmmm, seems like considering things like em-space and ideographic space might be problematic because I think it affects other regexes in Moses. Have to take a closer look at other regexes that manipulates/uses these "alternative" spaces in the moses' list of regexes. |
Yes, all I have shown was that the basic string operations are suitable to replace Python's |
Moses default behavior is to use a regex to deduplicate spaces, https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L507
But in Python, this can be don't by doing
" ".join(text.split())
Would it be appropriate to change the space de-duplication behavior in sacremoses to use
" ".join(text.split())
?The text was updated successfully, but these errors were encountered: