I am unable to find Open Targets Genetics data in the JSON format. I am particularly interested in l2g and lut/study-index. I can only see parquet files here: Index of /pub/databases/opentargets/genetics/latest
At Open Targets Genetics, we export our datasets only in Parquet format, not JSON. Parquet stores columnar data efficiently, which matters when handling large datasets like ours.
Although we don’t offer JSON files directly, the data in Parquet files is equally valid and comprehensive, and you can always convert Parquet to JSON. Let me know if I can help with your specific use case.
Thanks @irene for your reply. That makes sense. My use case is that I want to load this data into a Postgres database, and loading JSON data is easier and faster.
I can load Parquet files in Python using pyarrow and, from there, use pandas either to load the data into the database or to convert it to JSON. However, loading Parquet files in Python is extremely slow, and I am wondering if there is a more efficient way.
Do you have any suggestions or ideas for my use case?
If you’re using Python, you should be able to load Parquet files with pandas directly by specifying the directory. pandas uses PyArrow as the backend to interact with Parquet, and I didn’t find it slow. Bear in mind that you’ll be loading the data into memory; however, since you’re working with the L2G and study index datasets, that shouldn’t cause any problems.
Thanks for the suggestion. I was not aware of the pandas function for reading Parquet files. I will give that a try; it should fix my problem. Many thanks for your help.