Possible Parquet data format problem with the "molecule" table when loading it with Hail

I downloaded the Parquet data files for release 21.04 of Open Targets and tried to load the molecule table with Spark and Hail, but got this error:

FatalError: MatchError: MapType(StringType,ArrayType(StringType,true),true) (of class org.apache.spark.sql.types.MapType)

Java stack trace:
scala.MatchError: MapType(StringType,ArrayType(StringType,true),true) (of class org.apache.spark.sql.types.MapType)
	at is.hail.expr.SparkAnnotationImpex$.importType(AnnotationImpex.scala:29)
	at is.hail.expr.SparkAnnotationImpex$.$anonfun$importType$1(AnnotationImpex.scala:39)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
	at is.hail.expr.SparkAnnotationImpex$.importType(AnnotationImpex.scala:39)
	at is.hail.backend.spark.SparkBackend.$anonfun$pyFromDF$1(SparkBackend.scala:462)
	at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
	at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
	at is.hail.backend.spark.SparkBackend.pyFromDF(SparkBackend.scala:460)
	at sun.reflect.GeneratedMethodAccessor107.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

Hail version: 0.2.77-684f32d73643
Error summary: MatchError: MapType(StringType,ArrayType(StringType,true),true) (of class org.apache.spark.sql.types.MapType)

Other tables load fine, so is there possibly a data format error in the molecule table?

This is my code

rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/21.04/output/etl/parquet/* OpenTargets

Then, in a Python notebook:

from hail import Table
# `spark` is the SparkSession already available in the notebook
sdf = spark.read.parquet('OpenTargets/molecule/')
ht = Table.from_spark(sdf)

I am guessing there is some type in the data that is not supported by Hail, but I am not quite sure how to work around that…

The Spark loading step was fine; it is the conversion from Spark to Hail that fails. (So should I post this on the Hail site instead?)

Hello @thondeboer! :wave:

Welcome to the Open Targets Community! :tada:

I have reviewed the molecule dataset from our 21.04 release along with the error message included in your post.

Side note: thank you for including it as it made it much easier to understand the issue and find solutions. :male_detective:

The error occurs because Hail cannot convert the Spark SQL map type (MapType) used by the crossReferences field in our molecule dataset. Below is the full molecule dataset schema, including fields, types, and nullable status:

 |-- id: string (nullable = true)
 |-- canonicalSmiles: string (nullable = true)
 |-- inchiKey: string (nullable = true)
 |-- drugType: string (nullable = true)
 |-- blackBoxWarning: boolean (nullable = true)
 |-- name: string (nullable = true)
 |-- yearOfFirstApproval: long (nullable = true)
 |-- maximumClinicalTrialPhase: long (nullable = true)
 |-- parentId: string (nullable = true)
 |-- hasBeenWithdrawn: boolean (nullable = true)
 |-- isApproved: boolean (nullable = true)
 |-- withdrawnNotice: struct (nullable = true)
 |    |-- countries: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- classes: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- year: long (nullable = true)
 |-- tradeNames: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- synonyms: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- crossReferences: map (nullable = true)
 |    |-- key: string
 |    |-- value: array (valueContainsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- childChemblIds: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- linkedTargets: struct (nullable = true)
 |    |-- rows: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- count: integer (nullable = true)
 |-- linkedDiseases: struct (nullable = true)
 |    |-- rows: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- count: integer (nullable = true)
 |-- description: string (nullable = true)

As noted in the Hail Table from_spark() documentation, Hail converts the following Spark SQL data types into Hail types:

BooleanType => tbool
IntegerType => tint32
LongType => tint64
FloatType => tfloat32
DoubleType => tfloat64
StringType => tstr
BinaryType => TBinary
ArrayType => tarray
StructType => tstruct

Unfortunately, the Spark SQL map type is not in that list, and from_spark() only converts the types listed above.
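
To check other tables for the same problem before converting them, something like the following might help. It is only a sketch: `unsupported_fields` is a hypothetical helper (not part of Hail or Spark) that walks the JSON form of a Spark schema, as returned by `sdf.schema.jsonValue()`, and reports any type outside the list above. The `molecule_schema` dict is a hand-written excerpt of the schema shown earlier, so the example runs without a Spark session.

```python
# Spark JSON type names for the atomic types in the from_spark() list above.
SUPPORTED_ATOMIC = {"boolean", "integer", "long", "float", "double", "string", "binary"}


def unsupported_fields(dtype, path=""):
    """Return (path, type_name) pairs for types outside the supported list.

    `dtype` is the JSON form of a Spark type, e.g. sdf.schema.jsonValue().
    This helper is hypothetical, not a Hail or Spark API.
    """
    if isinstance(dtype, str):
        # Atomic types are serialized as plain strings.
        return [] if dtype in SUPPORTED_ATOMIC else [(path, dtype)]
    kind = dtype["type"]
    if kind == "struct":
        found = []
        for field in dtype["fields"]:
            child = f"{path}.{field['name']}" if path else field["name"]
            found += unsupported_fields(field["type"], child)
        return found
    if kind == "array":
        return unsupported_fields(dtype["elementType"], path + "[]")
    # Anything else (map, user-defined types, ...) is not in the list.
    return [(path, kind)]


# Hand-written excerpt of the molecule schema, in Spark's JSON layout.
molecule_schema = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "string", "nullable": True, "metadata": {}},
        {"name": "tradeNames", "nullable": True, "metadata": {},
         "type": {"type": "array", "elementType": "string", "containsNull": True}},
        {"name": "crossReferences", "nullable": True, "metadata": {},
         "type": {"type": "map", "keyType": "string", "valueContainsNull": True,
                  "valueType": {"type": "array", "elementType": "string",
                                "containsNull": True}}},
    ],
}

print(unsupported_fields(molecule_schema))  # [('crossReferences', 'map')]
```

In a notebook you would pass `sdf.schema.jsonValue()` instead of the hand-written dict.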

To resolve your issue, I have two recommendations:

  1. If you do not require the crossReferences field, you could drop that column from the Spark data frame before loading it into Hail.

  2. Alternatively, you could also use Spark and create a User Defined Function to transform the map column to a column of structs and then import into Hail. For more information on how to do this, please see this StackOverflow article:
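
Both options could be sketched roughly as follows. The function names are hypothetical, and for the second option this sketch swaps in Spark's built-in SQL function `map_entries` (available in Spark SQL 2.4 and later) in place of a User Defined Function; it turns the map into an array of structs with `key` and `value` fields, which are both types in the list above. I have not verified the result against Hail 0.2.77, so treat it as a starting point.

```python
def drop_map_column(sdf, column="crossReferences"):
    """Option 1: drop the map column from the Spark DataFrame."""
    return sdf.drop(column)


def map_column_to_entries(sdf, column="crossReferences"):
    """Option 2: replace the map column with an array of {key, value} structs.

    Uses the built-in Spark SQL function map_entries instead of a UDF.
    The converted column keeps its name and is moved to the end.
    """
    others = [f"`{name}`" for name in sdf.columns if name != column]
    entries = f"map_entries(`{column}`) AS `{column}`"
    return sdf.selectExpr(*others, entries)


# Usage in the notebook (assumes an active `spark` session and Hail installed):
# from hail import Table
# sdf = spark.read.parquet('OpenTargets/molecule/')
# ht = Table.from_spark(map_column_to_entries(sdf))
```

The second function keeps the cross-reference data at the cost of a slightly different shape: lookups by key become a filter over the array on the Hail side.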

I hope this helps and has answered your question. Feel free to post any follow-up questions below.

Thank you! :slight_smile:

~ Andrew
