Memory error reading evidence data with Spark


I have some trouble loading the evidence dataset using Spark in R. It seems to be a memory issue, as the error begins with:

evd <- spark_read_parquet(sc, path = evidencePath)
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in stage 1.0 failed 1 times, most recent failure: Lost task 11.0 in stage 1.0 (TID 12) (localhost executor driver): java.lang.OutOfMemoryError: Java heap space

I am not familiar with Spark, so I am also not completely sure how it works. I was not able to find a solution by googling or by searching here on the OpenTargets community.

Thank you in advance!

Hi @Jansen and welcome to our community!

Java heap memory errors are fairly common when working with large datasets. We usually solve them by providing more resources to the Spark Context.

Here you can take a look at the function that we commonly use to initialize Spark. Although it is written in Python, I'd assume that the R API works similarly.
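As a rough sketch of that approach in Python: the idea is simply to set the JVM heap sizes when the session is created. All of the sizes, the config keys' values, and the app name below are hypothetical examples to adjust for your machine; the relevant point is that `spark.driver.memory` defaults to only 1g, and in local mode the driver JVM is where your data ends up.

```python
# Hypothetical memory settings for a local Spark session; tune to your machine.
SPARK_CONF = {
    "spark.driver.memory": "8g",        # default 1g is often too small
    "spark.driver.maxResultSize": "4g",
    "spark.executor.memory": "8g",
}

def make_session(app_name="open-targets"):
    # pyspark is imported lazily so the config above can be inspected
    # even on a machine where pyspark is not installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master("local[*]").appName(app_name)
    for key, value in SPARK_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

If I remember correctly, the sparklyr analogue is to build a `spark_config()` object, set `sparklyr.shell.driver-memory` on it, and pass it to `spark_connect()` — but please double-check that against the sparklyr documentation.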

Another alternative is not to load the whole evidence set and to work instead with the subset of interest. For example, if you have a more specific query on the ClinVar evidence, you could load only the evidence strings where the datasourceId field contains eva. Evidence coming from the literature is a very significant portion of the dataset, and perhaps it is not necessary in your analysis.
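A sketch of that subsetting idea, again in Python. The `datasourceId` column name follows the Open Targets evidence schema; `evidence_path` and the function name are hypothetical. Because the filter is applied before any action, Spark can push it down to the Parquet reader, so most non-matching data never has to fit in memory.

```python
def load_evidence_subset(spark, evidence_path, datasource_id="eva"):
    """Read only the evidence rows for one datasource, e.g. ClinVar ('eva').

    Assumes the Open Targets evidence schema, where each row carries a
    datasourceId column identifying where the evidence came from.
    """
    # Lazy import so this module can be loaded without pyspark installed.
    from pyspark.sql import functions as F

    return (
        spark.read.parquet(evidence_path)
        .filter(F.col("datasourceId") == datasource_id)
    )
```

The same pattern works from R: read the parquet, then `filter(evd, datasourceId == "eva")` with dplyr before collecting anything.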

I hope that helps!


Great, thank you, it is working now!
