proteinIds column all empty in parquet file?

need47 · 28 March 2025 21:27

I think the proteinIds column is empty in all the parquet files in the following location:

I use Python's fastparquet and pandas packages to parse the parquet files.

Could you please double check?

Tiejun Cheng

project-defiant · 29 March 2025 09:27

Dear Tiejun Cheng,

Thanks for reaching out to us with your question.

The schema of the target dataset contains a set of nested columns, including the proteinIds column. For proteinIds, the schema looks like this:

root
 |-- proteinIds: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- source: string (nullable = true)

The column is a nested array of structs, which fastparquet does not support (have a look at the package documentation). The fastparquet reader supports only List fields whose elements are primitive types, not structs. This is why fastparquet reports None instead of the actual values.

If you are using pandas, please use the pyarrow engine instead:


In [36]: import pandas as pd

In [37]: pd.read_parquet('https://ftp.ebi.ac.uk/pub/databases/opentargets/platform/25.03/output/target/part-00000-4c6c5f36-b7cf-4ea1-804c-33e0c7b43fad-c000.snappy.parquet', engine='pyarrow').proteinIds
Out[37]: 
0       [{'id': 'A6NM43', 'source': 'uniprot_obsolete'}]
1      [{'id': 'Q13395', 'source': 'uniprot_swissprot...
2      [{'id': 'P11277', 'source': 'uniprot_swissprot...
3      [{'id': 'Q86US8', 'source': 'uniprot_swissprot...
4      [{'id': 'O94910', 'source': 'uniprot_swissprot...
                             ...                        
398                                                 None
399                                                 None
400                                                 None
401                                                 None
402                                                 None
Name: proteinIds, Length: 403, dtype: object
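
If a flat table is easier to work with than a column of lists, one possible approach is to explode the column and expand each struct into its own columns. The snippet below is only a sketch: it uses a small hand-made frame in place of the real file, the target identifiers and the third protein entry are made up for illustration, and the helper name `flatten_protein_ids` and the output column names `proteinId` and `proteinSource` are my own choices, not part of the dataset. It also assumes the frame has an `id` column holding the target identifier.

```python
import pandas as pd

# Toy stand-in for the frame returned by pd.read_parquet(..., engine="pyarrow"):
# each cell holds a list of {"id", "source"} records, or None when nothing is annotated.
targets = pd.DataFrame(
    {
        "id": ["ENSG_A", "ENSG_B", "ENSG_C"],  # hypothetical target identifiers
        "proteinIds": [
            [{"id": "A6NM43", "source": "uniprot_obsolete"}],
            [
                {"id": "Q13395", "source": "uniprot_swissprot"},
                {"id": "X00001", "source": "uniprot_trembl"},  # made-up entry
            ],
            None,
        ],
    }
)


def flatten_protein_ids(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (target, protein) pair; targets without proteins are dropped."""
    # explode() turns each list element into its own row and keeps None as a missing value.
    exploded = (
        df[["id", "proteinIds"]]
        .explode("proteinIds")
        .dropna(subset=["proteinIds"])
        .reset_index(drop=True)
    )
    # Expand each {"id", "source"} record into two separate columns.
    proteins = pd.DataFrame(exploded["proteinIds"].tolist()).rename(
        columns={"id": "proteinId", "source": "proteinSource"}
    )
    return pd.concat([exploded[["id"]], proteins], axis=1)


flat = flatten_protein_ids(targets)
print(flat)
```

The same function should also work on the real frame, since `explode` accepts the array-like cells that the pyarrow engine produces, but I have only checked it against the toy data above.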

need47 · 29 March 2025 11:04

Thanks a lot. That’s extremely helpful.
