File formats and our preference for Parquet

As of the 25.03 release, the Open Targets platform data outputs are only available in Parquet format and we’ve stopped duplicating data in newline-delimited JSON. Parquet offers significant benefits over JSON, such as carrying the schema in the file metadata (and therefore better data typing). It is also much smaller on disk when compressed, is fast to access and is supported by a number of data frame libraries. One drawback we’ve identified is the lack of human-readability, but we think most users of these data are unlikely to require this functionality; nevertheless, we have tried to accommodate this use case (see below).

If you are interested in switching over from JSON to Parquet, the change should be simple: most popular data frame libraries support Parquet, and your pipeline will likely read the data faster. Here are some example Parquet file readers from a few of the popular data frame libraries:
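For instance, a minimal sketch in Python using pandas and polars; the file path is a placeholder for whichever dataset you have downloaded:

```python
# Reading a Parquet file with two popular Python data frame libraries.
# "targets/part-00000.parquet" is a hypothetical local file path.
import pandas as pd
import polars as pl

path = "targets/part-00000.parquet"

# pandas (uses pyarrow or fastparquet under the hood)
df_pandas = pd.read_parquet(path)

# polars
df_polars = pl.read_parquet(path)

print(df_pandas.head())
print(df_polars.head())
```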

We currently have a small number of hive-partitioned data sets in the evidence data, but most of these readers should be able to handle that correctly (see the sketch below). Typically, the readers are built on the Apache Arrow library, which itself has APIs in many languages should you need to interface with the Parquet in a way that is unsupported by your data frame library.
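As an illustration, here is a minimal sketch of reading a hive-partitioned directory with PyArrow's dataset API; the directory layout shown is only an example of the `key=value` folder convention:

```python
# Reading a hive-partitioned Parquet dataset with PyArrow.
# The layout is assumed to look like, e.g.:
#   evidence/sourceId=eva/part-00000.parquet
#   evidence/sourceId=chembl/part-00000.parquet
import pyarrow.dataset as ds

# partitioning="hive" tells Arrow to parse key=value directory names
# into columns of the resulting table.
dataset = ds.dataset("evidence/", format="parquet", partitioning="hive")
table = dataset.to_table()
print(table.schema)
```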

If you don’t wish to read the data into data frames and instead want to read it as human-readable, newline-delimited JSON, there are open source tools for this. Two such examples are parquet2json (Rust) and our own in-house tool p2j (Python), both of which utilise Apache Arrow. Many other tools exist for interrogating, reading and previewing Parquet files.
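If you would rather do the conversion yourself, here is a minimal sketch using pandas (the file names are placeholders, not part of our outputs):

```python
# Converting a Parquet file to newline-delimited JSON with pandas.
import pandas as pd

df = pd.read_parquet("targets.parquet")

# orient="records" with lines=True writes one JSON object per line (NDJSON).
df.to_json("targets.jsonl", orient="records", lines=True)
```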

We’re keeping our documentation up-to-date to help with these changes.

Thanks for reading and please get in touch if you have any questions.
