proteinIds column all empty in parquet file?

I think the proteinIds column is empty in all the parquet files in the following location:

I use Python's fastparquet and pandas packages to parse the parquet files.

Could you please double check?

Tiejun Cheng

Dear Tiejun Cheng,

Thanks for reaching out to us with your question.

The schema of the target dataset contains a number of nested columns, including the proteinIds column. For proteinIds, the schema looks like this:

root
 |-- proteinIds: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- source: string (nullable = true)

This column is an array of structs, a nested type that fastparquet does not support (have a look at the package documentation). fastparquet can read only List fields whose elements are primitive types, not structs, which is why it reports None instead of the actual values.

If you are using pandas, please use the pyarrow engine:


In [36]: import pandas as pd

In [37]: pd.read_parquet('https://ftp.ebi.ac.uk/pub/databases/opentargets/platform/25.03/output/target/part-00000-4c6c5f36-b7cf-4ea1-804c-33e0c7b43fad-c000.snappy.parquet', engine='pyarrow').proteinIds
Out[37]: 
0       [{'id': 'A6NM43', 'source': 'uniprot_obsolete'}]
1      [{'id': 'Q13395', 'source': 'uniprot_swissprot...
2      [{'id': 'P11277', 'source': 'uniprot_swissprot...
3      [{'id': 'Q86US8', 'source': 'uniprot_swissprot...
4      [{'id': 'O94910', 'source': 'uniprot_swissprot...
                             ...                        
398                                                 None
399                                                 None
400                                                 None
401                                                 None
402                                                 None
Name: proteinIds, Length: 403, dtype: object
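Once the column is loaded, each cell holds a list of `{'id', 'source'}` records (or None). If a flat table is easier to work with, something like the following sketch may help. It assumes the frame has a target identifier column named `id`; the function name `flatten_protein_ids` and the toy values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd


def flatten_protein_ids(df, key="id"):
    """Return one row per (target, protein id, source) triple."""
    exploded = (
        df[[key, "proteinIds"]]
        .explode("proteinIds")            # one struct per row
        .dropna(subset=["proteinIds"])    # drop targets without protein IDs
        .reset_index(drop=True)
    )
    # Turn each {'id': ..., 'source': ...} record into two columns.
    structs = pd.DataFrame(exploded["proteinIds"].tolist())
    structs = structs.rename(columns={"id": "proteinId", "source": "proteinSource"})
    return pd.concat([exploded[[key]], structs], axis=1)


# Toy stand-in for the frame returned by pd.read_parquet(..., engine="pyarrow").
targets = pd.DataFrame({
    "id": ["TARGET_A", "TARGET_B", "TARGET_C"],
    "proteinIds": [
        [{"id": "P00001", "source": "toy_source_1"}],
        [{"id": "P00002", "source": "toy_source_1"},
         {"id": "P00003", "source": "toy_source_2"}],
        None,
    ],
})

flat = flatten_protein_ids(targets)
print(flat)
```

Targets whose proteinIds value is None are simply left out of the flattened result.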

Thanks a lot. That’s extremely helpful.
