Hi! Where can I get the ICD 9/10 codes for the known diseases in the data downloads?
(I can see the ICD codes on the website, but I can't find where they are in the different data downloads. I want to link this downstream to the UK Biobank to get the prevalence of the diseases, but I need the medical codes first.)
Hi dofer, to get the codes you're looking for:
- Download the diseases dataset: Index of /pub/databases/opentargets/platform/22.11/output/etl/json/diseases
- Extract the values from the dbXRefs field using the tool of your choice. Using jq, for instance, a command like
jq '{id: .id, icd: .dbXRefs[]} | select(.icd | startswith("ICD"))' <files>
should give you what you're after.
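If jq is not available, the same extraction can be sketched in plain Python. This is only a minimal sketch: it assumes each JSON file holds one disease object per line, and the function name iter_icd_xrefs and the sample records are made up for illustration. Only the id and dbXRefs fields come from the answer above.

```python
import json
from typing import Iterable, Iterator


def iter_icd_xrefs(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one {"id", "icd"} record per cross-reference starting with "ICD"."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: dbXRefs can be absent or null for some diseases,
        # so fall back to an empty list instead of failing.
        for xref in record.get("dbXRefs") or []:
            if xref.startswith("ICD"):
                yield {"id": record["id"], "icd": xref}


# Made-up sample lines standing in for the contents of one downloaded file.
sample_lines = [
    '{"id": "EFO_0000001", "dbXRefs": ["ICD10:A00", "MESH:D000001", "ICD9:001"]}',
    '{"id": "EFO_0000002", "dbXRefs": ["MESH:D000002"]}',
    '{"id": "EFO_0000003"}',
]

icd_records = list(iter_icd_xrefs(sample_lines))
for rec in icd_records:
    print(rec)
```

To run it on real data, pass an open file handle for each downloaded file in place of sample_lines.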
Great, thanks very much!
(Now to get it working with the parquets/pandas).
This should get you most of the way there:
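# python 3.10.6
# pandas 1.4.4
import pandas as pd

# Directory holding the downloaded diseases parquet files
path = 'some path to files'
# Match cross-references that start with ICD9 or ICD10
ref_filter = r"^ICD(?:9|10).+"

# Load only the two columns that are needed
raw_df: pd.DataFrame = pd.read_parquet(path, columns=['id', 'dbXRefs'])
# One row per (disease id, cross-reference) pair
id_and_xref_df: pd.DataFrame = raw_df.explode('dbXRefs').astype('string')
# Keep the ICD rows; na=False excludes diseases with no cross-references
id_and_icd_df: pd.DataFrame = id_and_xref_df[
    id_and_xref_df["dbXRefs"].str.match(ref_filter, na=False)
]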
The FTP location also needs to change from the JSON directory to the parquet one: http://ftp.ebi.ac.uk/pub/databases/opentargets/platform/22.11/output/etl/parquet/diseases/
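For the downstream linking, it may help to separate the coding system from the code itself. The sketch below assumes the cross-references have the form PREFIX:CODE (for example ICD10:A00). That format, the sample rows, and the column names icd_system and icd_code are assumptions for illustration, so check them against the real data first.

```python
import pandas as pd

# Made-up rows in the shape produced by the filtering step (one xref per row).
id_and_icd_df = pd.DataFrame(
    {
        "id": ["EFO_0000001", "EFO_0000001", "EFO_0000002"],
        "dbXRefs": ["ICD10:A00", "ICD9:001", "ICD10:B01.1"],
    }
)

# Split on the first colon only, so a colon inside the code itself would be kept.
parts = id_and_icd_df["dbXRefs"].str.split(":", n=1, expand=True)
icd_codes_df = id_and_icd_df.assign(icd_system=parts[0], icd_code=parts[1])

# Keep only the ICD10 rows, e.g. for joining against an ICD-10 coded field.
icd10_df = icd_codes_df[icd_codes_df["icd_system"] == "ICD10"]
print(icd10_df)
```

The codes may still need normalizing (punctuation, casing) before they match whatever format the other dataset uses.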
Oh, thanks! That just saved me even more time