Optional normalization of dot codes and brace codes #59

Open
lucboruta opened this issue Sep 30, 2017 · 3 comments

Comments

@lucboruta

Brace codes

I noticed the use of "brace codes" in XML data, e.g. {acute over (e)} instead of é. They are used to encode non-ASCII characters, mostly diacritics and mathematical symbols, and they are visible in both PatFT/AppFT (e.g. here) and Google Patents (e.g. here and there).

There's a significant amount of variation in the way the codes are spelled out (insertions, deletions, or substitutions, maybe OCR artifacts?), e.g. {circumflex over (e)} occurring as {circumflexiover (e)} or {circumflexioveri(e)}, or {square root over (n)} occurring as {squaruaroot over (n)}. I've had good results using Damerau-Levenshtein distance to match codes to their canonical form.
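A minimal sketch of that kind of matching, for illustration only. The class name, the short canonical list, and the distance threshold are assumptions made for the example, and the distance used is the optimal string alignment variant of Damerau-Levenshtein (insertions, deletions, substitutions, and adjacent transpositions):

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of PatentPublicData. */
final class BraceCodeMatcher {

    // Illustrative subset of canonical operator names; a real list would be longer.
    static final List<String> CANONICAL = List.of(
            "acute over", "grave over", "circumflex over", "tilde over", "square root over");

    /** Optimal string alignment distance (restricted Damerau-Levenshtein). */
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                // Deletion, insertion, substitution.
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
                // Transposition of two adjacent characters.
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    /** Returns the closest canonical name within maxDistance edits, if there is one. */
    static Optional<String> canonicalize(String observed, int maxDistance) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String candidate : CANONICAL) {
            int dist = distance(observed, candidate);
            if (dist < bestDistance) {
                best = candidate;
                bestDistance = dist;
            }
        }
        return bestDistance <= maxDistance ? Optional.of(best) : Optional.empty();
    }
}
```

With a threshold of 2, `canonicalize("circumflexiover", 2)` and `canonicalize("squaruaroot over", 2)` resolve to `circumflex over` and `square root over`, while an unrelated string such as `alpha` yields an empty result. The threshold trades recall against false matches and would need tuning on real data.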

Dot codes

Green Book documents use "dot codes", e.g. .alpha. to encode α. @bgfeldm developed gov.uspto.patent.doc.greenbook.DotCodes, but the class isn't referenced in the latest version of the parsers/readers.

Integration

I've forked the repo at https://github.com/thunken/PatentPublicData to implement these changes myself, making sure the changes are backward compatible (i.e. normalization is disabled by default). Dot code normalization is already done thanks to Brian's class, and I've started moving our code for brace codes to this codebase. I'll open a PR shortly.

@bgfeldm
Contributor

bgfeldm commented Oct 24, 2017

I just want to note that in the legal space of patents, it is sometimes best to keep things just as they are written. Patent examiners and lawyers only really trust the original application image, which has to remain pixel-for-pixel perfect (even scaling is discouraged), and the text version needs to be as close to it as possible.

Dot codes

Dot Codes conversion has a small potential for changing text that is not a dot code. Dot Codes also largely represent mathematical symbols, so converting them currently offers little improvement for search systems. We could disable conversion of any dot codes that have a likelihood of confusion, such as ".En.".
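A rough sketch of what such selective conversion could look like. This is not the gov.uspto.patent.doc.greenbook.DotCodes class; the class name, the tiny code table, and the skip list are assumptions for illustration, and the value given for `En` is only a placeholder, not the real mapping:

```java
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; not the DotCodes class from PatentPublicData. */
final class DotCodeSketch {

    // Illustrative subset of dot codes. The "En" value is a placeholder assumption.
    static final Map<String, String> CODES = Map.of(
            "alpha", "\u03B1",
            "beta", "\u03B2",
            "gamma", "\u03B3",
            "En", "\u2013");

    // Codes that are never converted because they are easily confused with ordinary text.
    static final Set<String> AMBIGUOUS = Set.of("En");

    private static final Pattern DOT_CODE = Pattern.compile("\\.([A-Za-z]+)\\.");

    /** Replaces known, unambiguous dot codes; everything else is left untouched. */
    static String normalize(String text) {
        Matcher m = DOT_CODE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1);
            String replacement = AMBIGUOUS.contains(name) ? null : CODES.get(name);
            // Fall back to the original match when the code is unknown or skipped.
            String result = replacement != null ? replacement : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(result));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Here `normalize("an .alpha. particle")` converts the code, while `.En.` and any unrecognized `.word.` sequence pass through unchanged, which keeps the text as written wherever there is doubt.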

Brace Codes

I quickly created a class to replace Brace Codes with their Unicode equivalents. The largest problem, and also the main reason Brace Codes are used, is that some characters are not represented in Unicode. Brace Codes within mathematical equations are the hardest to process, since they are often used for constants and variables, and many do not appear in Unicode. The characters covered by Unicode largely represent those used in human languages, not the full range of characters used in mathematics. It may still be useful to perform Brace Code conversion only on names of people and companies, as well as titles of non-patent literature. One question remains: since these characters are highly likely to already have a Unicode equivalent, why does the applicant not use the Unicode character when filing fields such as the inventor name? But since Brace Codes occur frequently enough in the data, conversion is useful.
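A minimal sketch of this kind of replacement, which is not the class mentioned above. It assumes a hypothetical `BraceCodeSketch` class and a small table of accent names, appends the corresponding Unicode combining mark to the base letter, and then applies NFC normalization with `java.text.Normalizer` so that a precomposed character is used when one exists:

```java
import java.text.Normalizer;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch for illustration; not the converter in PatentPublicData. */
final class BraceCodeSketch {

    // Illustrative subset: accent name -> combining diacritical mark.
    static final Map<String, String> MARKS = Map.of(
            "acute", "\u0301",
            "grave", "\u0300",
            "circumflex", "\u0302",
            "tilde", "\u0303");

    // Matches codes of the form {name over (x)} with a single base character.
    private static final Pattern BRACE_CODE =
            Pattern.compile("\\{(\\w+) over \\((\\w)\\)\\}");

    /** Converts recognized accent codes; unrecognized codes are left as written. */
    static String normalize(String text) {
        Matcher m = BRACE_CODE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String mark = MARKS.get(m.group(1));
            String result = mark == null
                    ? m.group()
                    // NFC composes base + mark into one code point when Unicode has one.
                    : Normalizer.normalize(m.group(2) + mark, Normalizer.Form.NFC);
            m.appendReplacement(out, Matcher.quoteReplacement(result));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

For a name such as `Ren{acute over (e)}` this produces the precomposed é. For a base letter with no precomposed form, such as `{circumflex over (x)}`, the result stays a two-code-point combining sequence. Codes that are not simple accents, such as `{square root over (n)}`, are left untouched, which reflects the limitation described above.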

@bgfeldm
Contributor

bgfeldm commented Oct 24, 2017

I came across the following documentation, which discusses how to use accents in mathematics with Unicode: https://unicode.org/reports/tr25/

@lucboruta
Author

> I just want to note that in the legal space of patents, it is sometimes best to keep things just as they are written. Patent examiners and lawyers only really trust the original application image, which has to remain pixel-for-pixel perfect (even scaling is discouraged), and the text version needs to be as close to it as possible.

I agree with this point, but dot codes and brace codes are non-standard encodings, and they make it hard to cross-reference USPTO data with other sources.

So while patent examiners and lawyers rely on the original application images, discovery systems and other data mining applications need the data to be normalized, at least for named entities (persons, organizations, locations). Persistent identifiers would solve many such problems, but we need workarounds for existing data.

I noticed that you have started pushing code that converts brace codes into Unicode, so I will hold off on submitting a PR. I pasted the class I had written into https://gist.github.com/lucboruta/9336cfd4e2f2cfe7d5391aae9e74382d, including the list of diacritics and symbols that I had found in the XML files. I hope you will find it useful.
