Possible Parquet data format problem with table "molecule" when trying to load with HAIL

thondeboer · 1 November 2021 17:48

I downloaded the Parquet data files for release 21.04 of Open Targets and tried to load the molecule table with Spark and Hail, but got this error:

FatalError: MatchError: MapType(StringType,ArrayType(StringType,true),true) (of class org.apache.spark.sql.types.MapType)

Java stack trace:
scala.MatchError: MapType(StringType,ArrayType(StringType,true),true) (of class org.apache.spark.sql.types.MapType)
	at is.hail.expr.SparkAnnotationImpex$.importType(AnnotationImpex.scala:29)
	at is.hail.expr.SparkAnnotationImpex$.$anonfun$importType$1(AnnotationImpex.scala:39)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
	at is.hail.expr.SparkAnnotationImpex$.importType(AnnotationImpex.scala:39)
	at is.hail.backend.spark.SparkBackend.$anonfun$pyFromDF$1(SparkBackend.scala:462)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
	at is.hail.backend.spark.SparkBackend.pyFromDF(SparkBackend.scala:460)
	at sun.reflect.GeneratedMethodAccessor107.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)



Hail version: 0.2.77-684f32d73643
Error summary: MatchError: MapType(StringType,ArrayType(StringType,true),true) (of class org.apache.spark.sql.types.MapType)

Other tables load fine, so is there possibly a data format error in the molecule table?

This is my code

rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/21.04/output/etl/parquet/* OpenTargets

Then in a Python notebook:

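from hail import Table

# Spark reads the Parquet files without complaint
sdf = spark.read.parquet('OpenTargets/molecule/')
# The MatchError is raised here, during the Spark-to-Hail conversion
ht = Table.from_spark(sdf)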

I am guessing there is some type in the data that is not supported by Hail, but I am not quite sure how to avoid that…

The Spark loading step was fine; it is the conversion from Spark to Hail that fails. (So should I post this on the Hail site instead?)

ahercules · 2 November 2021 13:02

Hello @thondeboer!

Welcome to the Open Targets Community!

I have reviewed the molecule dataset from our 21.04 release along with the error message included in your post.

Side note: thank you for including it as it made it much easier to understand the issue and find solutions.

The error is raised because Hail has no matching case for the Spark SQL map type (MapType), which is used by the crossReferences field in our molecule dataset. Please see below for the full molecule dataset schema, including fields, types, and nullable status:

root
 |-- id: string (nullable = true)
 |-- canonicalSmiles: string (nullable = true)
 |-- inchiKey: string (nullable = true)
 |-- drugType: string (nullable = true)
 |-- blackBoxWarning: boolean (nullable = true)
 |-- name: string (nullable = true)
 |-- yearOfFirstApproval: long (nullable = true)
 |-- maximumClinicalTrialPhase: long (nullable = true)
 |-- parentId: string (nullable = true)
 |-- hasBeenWithdrawn: boolean (nullable = true)
 |-- isApproved: boolean (nullable = true)
 |-- withdrawnNotice: struct (nullable = true)
 |    |-- countries: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- classes: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- year: long (nullable = true)
 |-- tradeNames: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- synonyms: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- crossReferences: map (nullable = true)
 |    |-- key: string
 |    |-- value: array (valueContainsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- childChemblIds: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- linkedTargets: struct (nullable = true)
 |    |-- rows: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- count: integer (nullable = true)
 |-- linkedDiseases: struct (nullable = true)
 |    |-- rows: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- count: integer (nullable = true)
 |-- description: string (nullable = true)

As noted in the Hail Table from_spark() documentation, Hail converts the following Spark SQL data types into Hail types:

BooleanType => tbool
IntegerType => tint32
LongType => tint64
FloatType => tfloat32
DoubleType => tfloat64
StringType => tstr
BinaryType => TBinary
ArrayType => tarray
StructType => tstruct
Unfortunately, the Spark SQL map type is not in that list, and Hail supports only the types listed above.
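If it helps to check other datasets before converting them, here is a minimal sketch that walks a Spark schema and reports anything outside the list above. It is not part of Hail or Spark: the function name unsupported_fields and the constant SUPPORTED_PRIMITIVES are my own, and the sketch assumes the schema is passed in the JSON form returned by sdf.schema.jsonValue(), where primitive types are plain strings and complex types are dicts with a "type" key.

```python
# Spark SQL primitive type names that appear in the Table.from_spark() list.
SUPPORTED_PRIMITIVES = {
    "boolean", "integer", "long", "float", "double", "string", "binary",
}


def unsupported_fields(spark_type, path=""):
    """Return (path, type name) pairs for types outside the documented list.

    `spark_type` is the JSON form of a Spark type, for example the result
    of sdf.schema.jsonValue().
    """
    # Primitive types are represented as plain strings.
    if isinstance(spark_type, str):
        if spark_type in SUPPORTED_PRIMITIVES:
            return []
        return [(path, spark_type)]

    kind = spark_type["type"]
    if kind == "struct":
        # Recurse into every field, extending the dotted path.
        found = []
        for field in spark_type["fields"]:
            child = f"{path}.{field['name']}" if path else field["name"]
            found += unsupported_fields(field["type"], child)
        return found
    if kind == "array":
        # Arrays are supported, so only the element type needs checking.
        return unsupported_fields(spark_type["elementType"], path + "[]")

    # Any other complex type (such as "map") is not in the list.
    return [(path, kind)]
```

You would call it as unsupported_fields(sdf.schema.jsonValue()). For the molecule schema shown above, I would expect it to flag only crossReferences as a map.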

To resolve your issue, I have two recommendations:

1. If you do not require the crossReferences field, you could drop that column from the Spark data frame before loading it into Hail.
2. Alternatively, you could use Spark to create a User Defined Function that transforms the map column into a column of structs, and then import the result into Hail. For more information on how to do this, please see this StackOverflow article:
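Separately from that article, here is a rough sketch of both options. The helper names (map_columns, drop_map_columns, maps_to_entry_arrays) are hypothetical and written for this post. For the second option I have swapped the User Defined Function for Spark SQL's built-in map_entries function, which turns a map into an array of key/value structs; this assumes your Spark version provides map_entries, and I have not run it against the 21.04 files. Only top-level map columns are handled.

```python
def map_columns(df):
    """Names of top-level columns whose Spark type is a map."""
    return [
        field.name
        for field in df.schema.fields
        if field.dataType.typeName() == "map"
    ]


def drop_map_columns(df):
    """Option 1: discard every top-level map column."""
    cols = map_columns(df)
    return df.drop(*cols) if cols else df


def maps_to_entry_arrays(df):
    """Option 2: rewrite each map column as array<struct<key, value>>.

    Uses the Spark SQL built-in map_entries rather than a UDF; all other
    columns are selected unchanged and column order is preserved.
    """
    maps = set(map_columns(df))
    exprs = [
        f"map_entries(`{name}`) AS `{name}`" if name in maps else f"`{name}`"
        for name in df.columns
    ]
    return df.selectExpr(*exprs)
```

With the data frame from the original post, this would be used as Table.from_spark(drop_map_columns(sdf)) or Table.from_spark(maps_to_entry_arrays(sdf)). After the second option, crossReferences should contain only array, struct, and string types, all of which are in the supported list.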

I hope this helps and has answered your question. Feel free to post any follow-up questions below.

Thank you!

~ Andrew
