
Issue importing large dataset #34

Open
roelj opened this issue Feb 5, 2019 · 2 comments

roelj commented Feb 5, 2019

First of all, thanks for the dcc-portal and all of the tooling around it.

We're trying to import a dataset of ~2,600 donors with ~70 million mutations across ~30 primary sites.
Importing all donors in one go fails at the JOIN step with the following:

com.esotericsoftware.kryo.KryoException: java.io.IOException: FAILED_TO_UNCOMPRESS(5)
        at com.esotericsoftware.kryo.io.Input.fill(Input.java:142)
        at com.esotericsoftware.kryo.io.Input.require(Input.java:155)
        at com.esotericsoftware.kryo.io.Input.readInt(Input.java:337)
        at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:109)
        at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
        at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
        at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
        at org.apache.spark.serializer.DeserializationStream.readKey(Serializer.scala:169)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.readNextItem(ExternalAppendOnlyMap.scala:477)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.hasNext(ExternalAppendOnlyMap.scala:498)
        at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.org$apache$spark$util$collection$ExternalAppendOnlyMap$ExternalIterator$$readNextHashCode(ExternalAppendOnlyMap.scala:299)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$next$1.apply(ExternalAppendOnlyMap.scala:372)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$next$1.apply(ExternalAppendOnlyMap.scala:370)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.next(ExternalAppendOnlyMap.scala:370)
        at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.next(ExternalAppendOnlyMap.scala:265)
        at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
        at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
        at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
        at org.apache.spark.scheduler.Task.run(Task.scala:89)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)
        at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:98)
        at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
        at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:465)
        at org.xerial.snappy.Snappy.uncompress(Snappy.java:504)
        at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:422)
        at org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:182)
        at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163)
        at com.esotericsoftware.kryo.io.Input.fill(Input.java:140)
        ... 28 common frames omitted

I believe this is due to there not being enough memory to hold the uncompressed data.
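
In case it's useful for debugging, these are the kinds of Spark settings we have been looking at to relieve shuffle/memory pressure. This is only a rough sketch: the keys are standard Spark configuration options rather than anything specific to dcc-release, and the values are guesses for our cluster.

        import org.apache.spark.SparkConf

        // Sketch only: shuffle/memory settings we are experimenting with to relieve
        // pressure on the spill/decompress path that fails above. Values are illustrative.
        val conf = new SparkConf()
          .set("spark.executor.memory", "48g")           // more heap per executor
          .set("spark.memory.fraction", "0.7")           // larger unified memory region (Spark 1.6+)
          .set("spark.shuffle.spill.compress", "true")   // keep on-disk spills compressed
          .set("spark.io.compression.codec", "lz4")      // lz4 instead of snappy on the shuffle path
          .set("spark.default.parallelism", "2000")      // more partitions -> smaller spill blocks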

Importing per primary site works, except that in the portal UI the links between genes and mutations are lost for everything but the last imported tissue type. As a result, the OncoGrid tool cannot plot the matrix because it has no information on the genes. Is this a known problem?

I wonder how the data in https://dcc.icgc.org/ was imported, and what hardware/cloud requirements would be needed to do so. Is it possible to run dcc-release per primary site?

Thank you for your time.

@junjun-zhang

Dear @roelj, thanks for getting in touch with us and for the feedback. The dataset you are trying to load (~2,600 donors, ~70 million mutations, across ~30 primary sites) should be able to go through the ETL, provided it is processed on a Spark cluster with sufficient resources. @andricDu should be able to provide more details on the Spark cluster we use to run the ETL for the ICGC data portal.

In terms of processing/loading per primary site, it will certainly reduce the required resources because of the smaller data size. However, as you already noticed, the resulting Elasticsearch indexes cannot support the same functionality as indexes built from the full dataset. Not only OncoGrid is affected; many other parts of the portal are affected as well. For example, you will not be able to get the top mutated genes across donors from all primary sites, because building the gene-centric view of the index requires all donors from all primary sites to be present.
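
To illustrate with a simplified sketch (hypothetical field names, not the actual dcc-release jobs): the gene-centric aggregation is essentially a group-by over every donor's mutations, so an index built from a single primary site can only ever count that site's donors.

        // Simplified illustration with hypothetical field names (not the actual dcc-release code):
        // the gene-centric view counts affected donors per gene across *all* primary sites,
        // so an index built from one site's input can only ever see that site's donors.
        case class Mutation(geneId: String, donorId: String, primarySite: String)

        def topMutatedGenes(mutations: org.apache.spark.rdd.RDD[Mutation], n: Int): Array[(String, Long)] =
          mutations
            .map(m => (m.geneId, m.donorId))
            .distinct()                              // count each affected donor once per gene
            .map { case (gene, _) => (gene, 1L) }
            .reduceByKey(_ + _)                      // donors affected per gene, across every site
            .top(n)(Ordering.by(_._2))               // top mutated genes over the full cohort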

andricDu commented Mar 7, 2019

Hi @roelj

Could you describe the Spark cluster you are running this on, in terms of memory and cores per compute node?
