I think the proteinIds column is empty in all the parquet files in the following location:
I use Python's fastparquet and pandas packages to parse the parquet files.
Could you please double check?
Tiejun Cheng
Dear Tiejun Cheng,
Thanks for reaching out to us with your question.
The schema of the target dataset contains a set of nested columns (this includes the proteinIds column). For the proteinIds column, the schema looks like this:
root
 |-- proteinIds: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- source: string (nullable = true)
The schema is a nested array of structs, which is not supported by fastparquet; have a look at the package documentation. The fastparquet reader supports only List fields that have primitive element types, not structs. This is why fastparquet reports None instead of the actual values.
If you are using pandas, please use the pyarrow engine.
In [36]: import pandas as pd
In [37]: pd.read_parquet('https://ftp.ebi.ac.uk/pub/databases/opentargets/platform/25.03/output/target/part-00000-4c6c5f36-b7cf-4ea1-804c-33e0c7b43fad-c000.snappy.parquet', engine='pyarrow').proteinIds
Out[37]:
0 [{'id': 'A6NM43', 'source': 'uniprot_obsolete'}]
1 [{'id': 'Q13395', 'source': 'uniprot_swissprot...
2 [{'id': 'P11277', 'source': 'uniprot_swissprot...
3 [{'id': 'Q86US8', 'source': 'uniprot_swissprot...
4 [{'id': 'O94910', 'source': 'uniprot_swissprot...
...
398 None
399 None
400 None
401 None
402 None
Name: proteinIds, Length: 403, dtype: object
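If a flat table is easier to work with than a column of lists of structs, something along these lines may help. It is only a minimal sketch: the small DataFrame below is a hand-built stand-in for what pd.read_parquet(..., engine='pyarrow') returns, the target identifiers are made up, and the function name flatten_protein_ids and the output column names are hypothetical. It also assumes the frame has an id column holding the target identifier.

```python
import pandas as pd

# Hand-built stand-in for the frame read with engine='pyarrow'.
# The target identifiers here are invented for illustration only.
df = pd.DataFrame({
    "id": ["TARGET_A", "TARGET_B", "TARGET_C"],
    "proteinIds": [
        [{"id": "A6NM43", "source": "uniprot_obsolete"}],
        [{"id": "Q13395", "source": "uniprot_swissprot"},
         {"id": "P11277", "source": "uniprot_swissprot"}],
        None,  # targets without protein identifiers come back as None
    ],
})


def flatten_protein_ids(frame):
    """Return one row per (target, protein id) pair; hypothetical helper."""
    # explode() turns each list element into its own row; rows where
    # proteinIds was None stay as a single missing value and are dropped.
    exploded = (
        frame[["id", "proteinIds"]]
        .explode("proteinIds")
        .dropna(subset=["proteinIds"])
    )
    # Each remaining cell is a dict with 'id' and 'source' keys.
    rows = [
        {"targetId": target_id, "proteinId": p["id"], "source": p["source"]}
        for target_id, p in zip(exploded["id"], exploded["proteinIds"])
    ]
    return pd.DataFrame(rows, columns=["targetId", "proteinId", "source"])


flat = flatten_protein_ids(df)
print(flat)
```

Rows where proteinIds is None are dropped here; if you need to keep those targets, skip the dropna step and handle the missing values yourself.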
Thanks a lot. That’s extremely helpful.