File formats and our preference for Parquet

As of the 25.03 release, the Open Targets platform data outputs are only available in Parquet format and we’ve stopped duplicating data in newline-delimited JSON. Parquet offers significant benefits over JSON, such as carrying the schema in the file metadata (and therefore better data typing). It is also much smaller on disk when compressed, is fast to access and is supported by a number of data frame libraries. One drawback we’ve identified is the lack of human-readability, but we think most users of these data are unlikely to require this functionality; nevertheless, we have tried to accommodate this use case (see below).

If you are interested in switching over from JSON to Parquet, the change should be simple: most popular data frame libraries support Parquet, and your pipeline will likely read the data faster. Here are some example Parquet file readers from a few of the popular data frame libraries:
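For instance, a minimal sketch in Python using pandas and polars; the file path is a placeholder for whichever dataset you have downloaded:

```python
# Reading a Parquet file with two popular Python data frame libraries.
# "targets/part-00000.parquet" is a hypothetical local file path.
import pandas as pd
import polars as pl

path = "targets/part-00000.parquet"

# pandas (uses pyarrow or fastparquet under the hood)
df_pandas = pd.read_parquet(path)

# polars
df_polars = pl.read_parquet(path)

print(df_pandas.head())
print(df_polars.head())
```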

We currently have a small number of hive-partitioned data sets in the evidence data, but most of these readers should be able to handle that correctly (see the sketch below). Typically, the readers are built on the Apache Arrow library, which itself has APIs in many languages should you need to interface with the Parquet in a way that is unsupported by your data frame library.
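As an illustration, here is a minimal sketch of reading a hive-partitioned directory with PyArrow's dataset API; the directory layout shown is only an example of the `key=value` folder convention:

```python
# Reading a hive-partitioned Parquet dataset with PyArrow.
# The layout is assumed to look like, e.g.:
#   evidence/sourceId=eva/part-00000.parquet
#   evidence/sourceId=chembl/part-00000.parquet
import pyarrow.dataset as ds

# partitioning="hive" tells Arrow to parse key=value directory names
# into columns of the resulting table.
dataset = ds.dataset("evidence/", format="parquet", partitioning="hive")
table = dataset.to_table()
print(table.schema)
```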

If you don’t wish to read the data into data frames and instead want to read it as human-readable, newline-delimited JSON, there are open source tools for this. Two such examples are parquet2json (Rust) and our own in-house tool p2j (Python), both of which utilise Apache Arrow. Many other tools exist for interrogating, reading and previewing Parquet files.
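If you would rather do the conversion yourself, here is a minimal sketch using pandas (the file names are placeholders, not part of our outputs):

```python
# Converting a Parquet file to newline-delimited JSON with pandas.
import pandas as pd

df = pd.read_parquet("targets.parquet")

# orient="records" with lines=True writes one JSON object per line (NDJSON).
df.to_json("targets.jsonl", orient="records", lines=True)
```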

We’re keeping our documentation up-to-date to help with these changes.

Thanks for reading and please get in touch if you have any questions.
