Memory error reading evidence data with Spark

Hi!

I am having some trouble loading the evidence dataset using Spark in R. It seems to be a memory issue, as the first line of the error is:

evd <- spark_read_parquet(sc, path = evidencePath)
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 11 in stage 1.0 failed 1 times, most recent failure: Lost task 11.0 in stage 1.0 (TID 12) (localhost executor driver): java.lang.OutOfMemoryError: Java heap space

I am not familiar with Spark, so I am also not completely sure how it works. I was not able to find a solution by googling or by searching here on the OpenTargets community.

Thank you in advance!

Hi @Jansen and welcome to our community!

Java heap memory errors are fairly common when working with large datasets. We usually solve them by providing more resources to the Spark Context.

Here you can take a look at the function we commonly use to initialize Spark; although it is written in Python, I'd assume that the R API works similarly.
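For reference, in sparklyr the equivalent would look roughly like this (a minimal sketch assuming a local connection; the 12G value is just a placeholder, adjust it to the memory you have available):

library(sparklyr)

config <- spark_config()
config$spark.driver.memory <- "12G"   # placeholder value; set it to the RAM you can spare
sc <- spark_connect(master = "local[*]", config = config)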

Another alternative is to avoid loading the whole evidence set and instead work with the subset of interest. For example, if you have a more specific query on the ClinVar evidence, you could load only the evidence strings where the datasourceId field contains eva. Evidence coming from the literature is a very significant portion of the dataset and perhaps it is not necessary in your analysis.
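With sparklyr and dplyr that subsetting could look something like the sketch below (assuming the datasourceId column as in the evidence schema; the filter is pushed down to Spark before anything is collected into R):

library(sparklyr)
library(dplyr)

# memory = FALSE keeps Spark from caching the full table up front
evd <- spark_read_parquet(sc, name = "evidence", path = evidencePath, memory = FALSE)

eva_evd <- evd %>%
  filter(datasourceId == "eva")   # keep only the ClinVar (eva) evidence strings

eva_local <- collect(eva_evd)     # bring just that subset into the R session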

I hope that helps!
Best,
Irene


Great, thank you, it is working now!
