I think the proteinIds column in all the parquet files in the following location is empty:
I use Python's fastparquet and pandas packages to parse the parquet files.
Could you please double check?
Tiejun Cheng
Dear Tiejun Cheng,
Thanks for reaching out to us with your question.
The schema of the target dataset contains a set of nested columns, including the proteinIds column. For proteinIds, the schema looks like this:
root
 |-- proteinIds: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- source: string (nullable = true)
The column is an array of structs, which fastparquet does not support (have a look at the package documentation). fastparquet can read only List fields whose elements are primitive types, not structs. This is why fastparquet reports None instead of the actual values.
If you are using pandas, please use the pyarrow engine:
In [36]: import pandas as pd
In [37]: pd.read_parquet('https://ftp.ebi.ac.uk/pub/databases/opentargets/platform/25.03/output/target/part-00000-4c6c5f36-b7cf-4ea1-804c-33e0c7b43fad-c000.snappy.parquet', engine='pyarrow').proteinIds
Out[37]:
0 [{'id': 'A6NM43', 'source': 'uniprot_obsolete'}]
1 [{'id': 'Q13395', 'source': 'uniprot_swissprot...
2 [{'id': 'P11277', 'source': 'uniprot_swissprot...
3 [{'id': 'Q86US8', 'source': 'uniprot_swissprot...
4 [{'id': 'O94910', 'source': 'uniprot_swissprot...
...
398 None
399 None
400 None
401 None
402 None
Name: proteinIds, Length: 403, dtype: object
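Once the column is read with the pyarrow engine, each row holds either None or a list-like of {'id', 'source'} records, so it often helps to flatten it into one row per protein ID. The snippet below is a minimal sketch of that step, not an official recipe. It uses a small hand-built DataFrame as a stand-in for the real file, so the key column name `id`, the target keys, the `source_a`/`source_b` values, and the helper `flatten_protein_ids` are all illustrative assumptions.

```python
import pandas as pd

# Hand-built stand-in for the frame returned by
# pd.read_parquet(..., engine='pyarrow'). The "id" key column and the
# "source_a"/"source_b" values are placeholders, not real dataset values.
df = pd.DataFrame({
    "id": ["target_1", "target_2", "target_3"],
    "proteinIds": [
        [{"id": "A6NM43", "source": "source_a"}],
        [{"id": "Q13395", "source": "source_b"},
         {"id": "P11277", "source": "source_b"}],
        None,  # rows without protein IDs come back as None
    ],
})


def flatten_protein_ids(frame, key="id"):
    """Return one row per (key, protein ID, source); hypothetical helper."""
    # explode() turns each list element into its own row; None rows stay
    # None and are dropped afterwards.
    exploded = (
        frame[[key, "proteinIds"]]
        .explode("proteinIds")
        .dropna(subset=["proteinIds"])
    )
    # Each remaining cell is a dict-like record with "id" and "source".
    return pd.DataFrame({
        key: exploded[key].to_numpy(),
        "proteinId": [record["id"] for record in exploded["proteinIds"]],
        "source": [record["source"] for record in exploded["proteinIds"]],
    })


flat = flatten_protein_ids(df)
print(flat)
```

With the real file, the same helper could be applied to the DataFrame returned by pd.read_parquet, provided its key column is passed as `key`.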
Thanks a lot. That’s extremely helpful.