I downloaded the Parquet data files for release 21.04 of Open Targets and tried to load the molecule table with Spark and Hail, but got this error:
FatalError: MatchError: MapType(StringType,ArrayType(StringType,true),true) (of class org.apache.spark.sql.types.MapType)
Java stack trace:
scala.MatchError: MapType(StringType,ArrayType(StringType,true),true) (of class org.apache.spark.sql.types.MapType)
at is.hail.expr.SparkAnnotationImpex$.importType(AnnotationImpex.scala:29)
at is.hail.expr.SparkAnnotationImpex$.$anonfun$importType$1(AnnotationImpex.scala:39)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
at is.hail.expr.SparkAnnotationImpex$.importType(AnnotationImpex.scala:39)
at is.hail.backend.spark.SparkBackend.$anonfun$pyFromDF$1(SparkBackend.scala:462)
at is.hail.utils.ExecutionTimer$.time(ExecutionTimer.scala:52)
at is.hail.utils.ExecutionTimer$.logTime(ExecutionTimer.scala:59)
at is.hail.backend.spark.SparkBackend.pyFromDF(SparkBackend.scala:460)
at sun.reflect.GeneratedMethodAccessor107.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Hail version: 0.2.77-684f32d73643
Error summary: MatchError: MapType(StringType,ArrayType(StringType,true),true) (of class org.apache.spark.sql.types.MapType)
Other tables load fine, so is there possibly a data format error in the molecule table?
This is my code:
rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/21.04/output/etl/parquet/* OpenTargets
Then in a Python notebook:
from hail import Table
# Read the molecule dataset with Spark, then convert it to a Hail Table.
sdf = spark.read.parquet('OpenTargets/molecule/')
ht = Table.from_spark(sdf)
I am guessing there is some type in the data that is not supported by Hail, but I am not quite sure how to avoid that.
The Spark loading step was fine; it is the conversion from Spark to Hail that is the problem (so should I post this on the Hail site instead?).
Hello @thondeboer!
Welcome to the Open Targets Community!
I have reviewed the molecule dataset from our 21.04 release along with the error message included in your post.
Side note: thank you for including it as it made it much easier to understand the issue and find solutions.
The error message is displayed because Hail cannot convert the Spark SQL map data type (MapType) used by the crossReferences field in our molecule dataset. Please see below for the full molecule dataset schema, including fields, types, and nullable status:
root
|-- id: string (nullable = true)
|-- canonicalSmiles: string (nullable = true)
|-- inchiKey: string (nullable = true)
|-- drugType: string (nullable = true)
|-- blackBoxWarning: boolean (nullable = true)
|-- name: string (nullable = true)
|-- yearOfFirstApproval: long (nullable = true)
|-- maximumClinicalTrialPhase: long (nullable = true)
|-- parentId: string (nullable = true)
|-- hasBeenWithdrawn: boolean (nullable = true)
|-- isApproved: boolean (nullable = true)
|-- withdrawnNotice: struct (nullable = true)
| |-- countries: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- classes: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- year: long (nullable = true)
|-- tradeNames: array (nullable = true)
| |-- element: string (containsNull = true)
|-- synonyms: array (nullable = true)
| |-- element: string (containsNull = true)
|-- crossReferences: map (nullable = true)
| |-- key: string
| |-- value: array (valueContainsNull = true)
| | |-- element: string (containsNull = true)
|-- childChemblIds: array (nullable = true)
| |-- element: string (containsNull = true)
|-- linkedTargets: struct (nullable = true)
| |-- rows: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- count: integer (nullable = true)
|-- linkedDiseases: struct (nullable = true)
| |-- rows: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- count: integer (nullable = true)
|-- description: string (nullable = true)
As noted in the Hail Table from_spark() documentation, Hail converts the following Spark SQL data types into Hail types:
BooleanType => tbool
IntegerType => tint32
LongType => tint64
FloatType => tfloat32
DoubleType => tfloat64
StringType => tstr
BinaryType => TBinary
ArrayType => tarray
StructType => tstruct
Unfortunately, the map Spark SQL data type is not in that list and Hail only supports the types listed above.
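If it helps to check other datasets before converting them, here is a minimal sketch that walks a Spark schema and reports any field whose type is not in the list above. It assumes the schema is passed as the plain dictionary returned by `sdf.schema.jsonValue()`; the `SUPPORTED` set and the `unsupported_fields` helper are names I made up for illustration, not part of Hail or Spark.

```python
# Spark type names that Table.from_spark() converts, per the list above.
SUPPORTED = {
    "boolean", "integer", "long", "float", "double",
    "string", "binary", "array", "struct",
}


def unsupported_fields(spark_type, path=""):
    """Return (path, type name) pairs for types outside SUPPORTED.

    `spark_type` is a node of the dict returned by `sdf.schema.jsonValue()`:
    either a plain string such as "long", or a dict with a "type" key.
    """
    if isinstance(spark_type, str):
        return [] if spark_type in SUPPORTED else [(path, spark_type)]
    kind = spark_type["type"]
    if kind == "struct":
        found = []
        for field in spark_type["fields"]:
            child = f"{path}.{field['name']}" if path else field["name"]
            found += unsupported_fields(field["type"], child)
        return found
    if kind == "array":
        return unsupported_fields(spark_type["elementType"], path + "[]")
    # Maps (and any other complex type) are reported without descending further.
    return [(path, kind)]


# A cut-down version of the molecule schema, in jsonValue() form.
sample = {
    "type": "struct",
    "fields": [
        {"name": "id", "type": "string", "nullable": True, "metadata": {}},
        {"name": "synonyms", "nullable": True, "metadata": {},
         "type": {"type": "array", "elementType": "string", "containsNull": True}},
        {"name": "crossReferences", "nullable": True, "metadata": {},
         "type": {"type": "map", "keyType": "string", "valueContainsNull": True,
                  "valueType": {"type": "array", "elementType": "string",
                                "containsNull": True}}},
    ],
}

print(unsupported_fields(sample))  # [('crossReferences', 'map')]
```

With a real data frame, the call would be `unsupported_fields(sdf.schema.jsonValue())`; an empty list suggests that every field uses one of the listed types.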
To resolve your issue, I have two recommendations:
- If you do not require the crossReferences field, you could drop that column from the Spark data frame before loading it into Hail.
- Alternatively, you could also use Spark and create a User Defined Function to transform the map column to a column of structs and then import into Hail. For more information on how to do this, please see this StackOverflow article:
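As a rough sketch of both recommendations (my own illustration, not taken from the linked article): the helpers below find top-level map columns through `DataFrame.dtypes`, then either drop them or rewrite them. For the second option I have swapped the user-defined function for Spark SQL's built-in `map_entries` function, which I believe is available from Spark 2.4 onward and turns a map into an array of key/value structs. The helper names are hypothetical, and maps nested inside structs or arrays are not handled.

```python
def map_columns(sdf):
    """Names of top-level columns whose Spark type is a map."""
    # DataFrame.dtypes yields (name, type string) pairs, for example
    # ("crossReferences", "map<string,array<string>>").
    return [name for name, dtype in sdf.dtypes if dtype.startswith("map<")]


def drop_map_columns(sdf):
    """Option 1: drop every top-level map column."""
    return sdf.drop(*map_columns(sdf))


def map_columns_to_entries(sdf):
    """Option 2: replace each top-level map column with an array of
    struct<key, value> built by Spark SQL's map_entries function."""
    maps = set(map_columns(sdf))
    exprs = [
        f"map_entries(`{name}`) AS `{name}`" if name in maps else f"`{name}`"
        for name in sdf.columns
    ]
    return sdf.selectExpr(*exprs)
```

With the data frame from your post, the call to try would be `ht = Table.from_spark(map_columns_to_entries(sdf))`, or `Table.from_spark(drop_map_columns(sdf))` if you do not need crossReferences. After the second option the column should contain only array, struct, and string types, all of which appear in the list above, but I have not run this against the full 21.04 dataset.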
I hope this helps and has answered your question. Feel free to post any follow-up questions below.
Thank you!
~ Andrew