
TransformerCli slows, then blows up after processing ~1000 records (with 300MB heap) #88

Closed
zanerock opened this issue Apr 9, 2019 · 3 comments

zanerock commented Apr 9, 2019

While processing the weekly dump of granted patent records, TransformerCli first slows down and then hangs. This has all the hallmarks of a memory issue, but I have not dug into debugging to confirm.

By "slow down", I mean that when watching the output logs to stdout, it processes fewer and fewer records before pausing, eventually processing ~1-4 records before hanging indefinitely.

Fixing this issue or #87 (which would allow processing the data in chunks) appears necessary for processing large data sets.

Executed with:

# Resolve the project root relative to this script.
PROJECTPATH=$( cd "$(dirname "$0")/.." ; pwd -P )
# Project jars plus all dependency jars.
CLASSPATH="${PROJECTPATH}/lib/*:${PROJECTPATH}/lib/dependency-jars/*"
JAVA="java -cp ${CLASSPATH} -Dlog4j.configuration=file:${PROJECTPATH}/conf/log4j.properties"
${JAVA} gov.uspto.patent.TransformerCli --input "$FILE" --stdout

Where FILE was one of ipg190326.zip (consistently hangs at record 991), ipg190402.zip (record 1041), or ipg190409.zip (record 1017), each run with a heap size of 300MB (as I remember it).
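For reference, the heap size actually in effect can be confirmed by printing the JVM's resolved flags (standard HotSpot options, shown here with the reported 300MB setting):

java -Xmx300m -XX:+PrintFlagsFinal -version 2>/dev/null | grep -i maxheapsize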

zanerock commented Apr 9, 2019

The behavior would be consistent with an increasing frequency of garbage collection, though an OutOfMemoryError is never thrown (at least not before I kill the process). If the maintainers could weigh in on which they believe would be easier to address, I may be able to work on this issue or #87 in order to enable processing of arbitrarily large data sets.
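One way to confirm would be to enable GC logging on the same invocation; a sketch, assuming a Java 8 JVM (on Java 9+ these flags are replaced by -Xlog:gc*):

# Same command as in the issue description, with GC logging added;
# -Xmx300m mirrors the reported heap size.
JAVA="java -cp ${CLASSPATH} -Xmx300m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Dlog4j.configuration=file:${PROJECTPATH}/conf/log4j.properties"
${JAVA} gov.uspto.patent.TransformerCli --input "$FILE" --stdout

Back-to-back full GCs that reclaim little space right before the hang would confirm the heap is exhausted.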

zanerock changed the title from "TransformerCli blows up after processing 'too many' records" to "TransformerCli slows, then blows up after processing ~1000 records (with 300MB heap)" Apr 9, 2019

bgfeldm commented Apr 9, 2019

There are some huge patents, some with lots of large tables, which take more than Java's default heap size to process. And because the bulk files are written sequentially, the largest and most complex patents tend to cluster around the same area of each bulk file.

Patents with around 100MB of text are handled within gov.uspto.patent.PatentReader, which either skips them or drops the large fields and continues. A 100MB patent, once read into a DOM, will be 3+ times that size. The description field is read into a DOM twice, which requires a lot of memory on these large patents.

I have two suggestions (example invocations below):

  1. Try setting a larger max heap with -Xmx2g.
  2. Try the newer transformer at gov.uspto.bulkdata.cli.Transformer, which supports skipping records.
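For example, adapting the invocation from the issue description (a sketch reusing its ${JAVA} variable; the newer Transformer's exact flags are assumptions here and should be verified against its usage output):

# Suggestion 1: same CLI, larger heap.
${JAVA} -Xmx2g gov.uspto.patent.TransformerCli --input "$FILE" --stdout

# Suggestion 2: the newer transformer; --input/--stdout are assumed to match
# the old CLI and may differ.
${JAVA} -Xmx2g gov.uspto.bulkdata.cli.Transformer --input "$FILE" --stdout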

@zanerock

That makes sense; I'll give that a try.

bgfeldm closed this as completed Dec 23, 2019