First of all, thanks for the dcc-portal and all of the tooling around it.
We're trying to import a dataset of ~2,600 donors and ~70,000,000 mutations across ~30 primary sites.
Importing all donors in one go fails at the JOIN step with the following:
com.esotericsoftware.kryo.KryoException: java.io.IOException: FAILED_TO_UNCOMPRESS(5)
at com.esotericsoftware.kryo.io.Input.fill(Input.java:142)
at com.esotericsoftware.kryo.io.Input.require(Input.java:155)
at com.esotericsoftware.kryo.io.Input.readInt(Input.java:337)
at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:109)
at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721)
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
at org.apache.spark.serializer.DeserializationStream.readKey(Serializer.scala:169)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.readNextItem(ExternalAppendOnlyMap.scala:477)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.hasNext(ExternalAppendOnlyMap.scala:498)
at scala.collection.Iterator$$anon$1.hasNext(Iterator.scala:847)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.org$apache$spark$util$collection$ExternalAppendOnlyMap$ExternalIterator$$readNextHashCode(ExternalAppendOnlyMap.scala:299)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$next$1.apply(ExternalAppendOnlyMap.scala:372)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$next$1.apply(ExternalAppendOnlyMap.scala:370)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.next(ExternalAppendOnlyMap.scala:370)
at org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.next(ExternalAppendOnlyMap.scala:265)
at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:191)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: FAILED_TO_UNCOMPRESS(5)
at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:98)
at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:465)
at org.xerial.snappy.Snappy.uncompress(Snappy.java:504)
at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:422)
at org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:182)
at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:163)
at com.esotericsoftware.kryo.io.Input.fill(Input.java:140)
... 28 common frames omitted
I believe this is due to there not being enough memory to hold the uncompressed data.
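For reference, this class of Kryo/Snappy shuffle failure is often approached by giving executors more memory, splitting the shuffle into more partitions, or moving off the Snappy codec. The sketch below only uses standard Spark configuration keys; the concrete values, and the assumption that dcc-release accepts ordinary Spark configuration (e.g. via spark-submit --conf), are ours rather than documented dcc-release options.

import org.apache.spark.SparkConf;

// Illustrative only: settings commonly adjusted when shuffle deserialization
// fails with FAILED_TO_UNCOMPRESS under memory pressure. Values are examples.
public class ShuffleTuningSketch {
  public static SparkConf tunedConf() {
    return new SparkConf()
        .set("spark.executor.memory", "24g")        // more heap per executor
        .set("spark.default.parallelism", "2000")   // more, smaller shuffle partitions
        .set("spark.shuffle.compress", "true")      // keep shuffle compression on
        .set("spark.io.compression.codec", "lz4");  // a common workaround is switching away from snappy
  }
}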
Importing per primary site works, except that in the portal UI the links between genes and mutations are lost for everything but the last imported tissue type. As a result, the OncoGrid tool cannot plot the matrix because it doesn't have information on the genes. Is this a known problem?
I wonder how the data in https://dcc.icgc.org/ was imported, and what hardware/cloud requirements would be needed to do so. Is it possible to run dcc-release per primary site?
Thank you for your time.
Dear @roelj, thanks for getting in touch with us and giving feedback. The dataset you are trying to load (~2,600 donors, ~70,000,000 mutations, across ~30 primary sites) should be able to go through the ETL, provided it is processed on a Spark cluster with sufficient resources. @andricDu should be able to provide more details on the Spark cluster we use to run the ETL for the ICGC data portal.
In terms of processing/loading per primary site, it will certainly reduce the required resources because of the smaller data size. However, as you already noticed, the resulting Elasticsearch indexes cannot support the same functionality as those built from the full dataset. Not only OncoGrid will be affected; many other parts of the portal will be as well. For example, you will not be able to get the top mutated genes across donors from all primary sites. Building the gene-centric view of the index requires all donors from all primary sites to be present.
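To illustrate why this matters (with made-up type and identifier names, not dcc-release code): a gene-centric document aggregates over donors from every primary site, so a per-site run only ever sees its own donors.

import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: count distinct donors affected per gene. Cross-site
// rankings need this computed over the union of all primary sites; a
// per-site import only sees a subset.
public class GeneCentricSketch {
  record Observation(String geneId, String donorId, String primarySite) {}

  static Map<String, Integer> affectedDonorsPerGene(List<Observation> observations) {
    return observations.stream()
        .collect(Collectors.groupingBy(
            Observation::geneId,
            Collectors.collectingAndThen(
                Collectors.mapping(Observation::donorId, Collectors.toSet()),
                Set::size)));
  }

  public static void main(String[] args) {
    List<Observation> brain = List.of(new Observation("GENE-1", "DO-1", "Brain"));
    List<Observation> liver = List.of(new Observation("GENE-1", "DO-2", "Liver"));

    System.out.println(affectedDonorsPerGene(liver));  // {GENE-1=1}  <- last imported site only
    List<Observation> full = Stream.concat(brain.stream(), liver.stream()).collect(Collectors.toList());
    System.out.println(affectedDonorsPerGene(full));   // {GENE-1=2}  <- what the portal needs
  }
}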