
Issue importing large dataset #34

Open
roelj opened this issue Feb 5, 2019 · 2 comments

roelj commented Feb 5, 2019

First of all, thanks for the dcc-portal and all of the tooling around it.

We're trying to import a dataset of ~2,600 donors with ~70 million mutations across ~30 primary sites.
Importing all donors in one go fails at the JOIN step with the following:

com.esotericsoftware.kryo.KryoException: java.io.IOException: FAILED_TO_UNCOMPRESS(5)
        at com.esotericsoftware.kryo.io.Input.fill(Input.java:142)
        at com.esotericsoftware.kryo.io.Input.require(Input.java:155)
        at com.esotericsoftware.kryo.io.Input.readInt(Input.java:337)
        at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:109)
        at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
        at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
        at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
        at org.apache.spark.serializer.DeserializationStream.readKey(Serializer.scala:169)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.readNextItem(ExternalAppendOnlyMap.scala:477)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.hasNext(ExternalAppendOnlyMap.scala:498)
        at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.org$apache$spark$util$collection$ExternalAppendOnlyMap$ExternalIterator$$readNextHashCode(ExternalAppendOnlyMap.scala:299)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$next$1.apply(ExternalAppendOnlyMap.scala:372)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$next$1.apply(ExternalAppendOnlyMap.scala:370)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.next(ExternalAppendOnlyMap.scala:370)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.next(ExternalAppendOnlyMap.scala:265)
        at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)
        at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:98)
        at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
        at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:465)
        at org.xerial.snappy.Snappy.uncompress(Snappy.java:504)
        at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:422)
        at org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:182)
        at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163)
        at com.esotericsoftware.kryo.io.Input.fill(Input.java:140)
        ... 28 common frames omitted

I believe this is due to there not being enough memory to hold the uncompressed data.
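
In case it's useful for debugging, these are the kinds of Spark settings we have been looking at to relieve shuffle/memory pressure. This is only a rough sketch: the keys are standard Spark configuration options rather than anything specific to dcc-release, and the values are guesses for our cluster.

        import org.apache.spark.SparkConf

        // Sketch only: shuffle/memory settings we are experimenting with to relieve
        // pressure on the spill/decompress path that fails above. Values are illustrative.
        val conf = new SparkConf()
          .set("spark.executor.memory", "48g")           // more heap per executor
          .set("spark.memory.fraction", "0.7")           // larger unified memory region (Spark 1.6+)
          .set("spark.shuffle.spill.compress", "true")   // keep on-disk spills compressed
          .set("spark.io.compression.codec", "lz4")      // lz4 instead of snappy on the shuffle path
          .set("spark.default.parallelism", "2000")      // more partitions -> smaller spill blocks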

Importing per primary site works, except that in the portal UI the links between genes and mutations are lost for everything but the last imported tissue type. As a result, the OncoGrid tool cannot plot the matrix because it has no information on the genes. Is this a known problem?

I wonder how the data in https://dcc.icgc.org/ was imported, and what hardware/cloud requirements would be needed to do so. Is it possible to run dcc-release per primary site?

Thank you for your time.

@junjun-zhang

Dear @roelj, thanks for getting in touch with us and for the feedback. The dataset you are trying to load (~2,600 donors, ~70 million mutations, across ~30 primary sites) should be able to go through the ETL, provided it is processed on a Spark cluster with sufficient resources. @andricDu should be able to provide more details on the Spark cluster we use to run the ETL for the ICGC data portal.

In terms of processing/loading per primary site, it will certainly reduce the required resources because of the smaller data size. However, as you already noticed, the resulting Elasticsearch indexes cannot support the same functionality as indexes built from the full dataset. Not only OncoGrid is affected; many other parts of the portal are affected as well. For example, you will not be able to get the top mutated genes across donors from all primary sites, because building the gene-centric view of the index requires all donors from all primary sites to be present.
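
To illustrate with a simplified sketch (hypothetical field names, not the actual dcc-release jobs): the gene-centric aggregation is essentially a group-by over every donor's mutations, so an index built from a single primary site can only ever count that site's donors.

        // Simplified illustration with hypothetical field names (not the actual dcc-release code):
        // the gene-centric view counts affected donors per gene across *all* primary sites,
        // so an index built from one site's input can only ever see that site's donors.
        case class Mutation(geneId: String, donorId: String, primarySite: String)

        def topMutatedGenes(mutations: org.apache.spark.rdd.RDD[Mutation], n: Int): Array[(String, Long)] =
          mutations
            .map(m => (m.geneId, m.donorId))
            .distinct()                              // count each affected donor once per gene
            .map { case (gene, _) => (gene, 1L) }
            .reduceByKey(_ + _)                      // donors affected per gene, across every site
            .top(n)(Ordering.by(_._2))               // top mutated genes over the full cohort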

andricDu commented Mar 7, 2019

Hi @roelj

Could you describe the Spark cluster you are running this on, in terms of memory and cores per compute node?
