Older versions of OpenTargets data

A question regarding archived versions of OpenTargets data. The oldest version of OpenTargets data that I can see is from July 2019:
https://ftp.ebi.ac.uk/pub/databases/opentargets/platform/
I wondered if there were older versions of OpenTargets data available elsewhere or on request, and if so, what is the oldest available collated data?

Many thanks,

Sanjay

Hi Sanjay, the FTP address you pasted holds the complete collection of Open Targets Platform data releases. The earliest release is 16.04. However, keep in mind that the data model has changed significantly over the years, which can make a systematic comparison relatively complicated.

Best,
Daniel

@sanjayb100 you might want to look at how @eczech and colleagues performed a temporal analysis on the Open Targets data.

It wasn’t in that paper, but we have built some consolidated datasets across OT versions going all the way back to 16.04 (i.e. from 2016). We weren’t extensively merging schemas for raw evidence across years, so we were able to avoid most of the problems @dsuveges mentioned. We created a simple, merged view with these columns: ot_version, gene_id, disease_id, datasource_id, score. Here is an example of the schemas/data we were merging:

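# Notebook cell: download a 1,000-line sample of the 16.04 association data
! curl http://ftp.ebi.ac.uk/pub/databases/opentargets/platform/16.04/16.04_association_data.json.gz \
| gzip -dc | head -n 1000 > /tmp/16.04_association_data.sample.json

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
(
    spark.read.json("/tmp/16.04_association_data.sample.json")
    # Keep the gene and disease IDs plus one column per datasource score
    .select(
        F.col("target.id").alias("gene_id"),
        F.col("disease.id").alias("disease_id"),
        F.col("association_score.datasources.*")
    )
    # Pack the per-datasource columns into an array of (datasource_id, score) structs
    .transform(lambda df: (
        df.select(
            "gene_id", "disease_id",
            F.array(*[
                F.struct(F.lit(c).alias("datasource_id"), F.col(c).alias("score"))
                for c in df.columns if c not in {"gene_id", "disease_id"}
            ]).alias("scores")
        )
    ))
    # Explode to one row per (gene, disease, datasource)
    .select("*", F.explode("scores").alias("score"))
    .select("gene_id", "disease_id", "score.datasource_id", "score.score")
    .printSchema()
)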
root
 |-- gene_id: string (nullable = true)
 |-- disease_id: string (nullable = true)
 |-- datasource_id: string (nullable = false)
 |-- score: double (nullable = true)
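
For anyone who wants to see the same reshaping without Spark, here is a minimal plain-Python sketch of the wide-to-long step, with the ot_version column from the merged view added. The function name flatten_association and the sample record are made up for illustration; the field names (target.id, disease.id, association_score.datasources) are the ones used in the Spark snippet above, and I'm assuming each line of the file is one JSON record shaped like that.

```python
import json


def flatten_association(record, ot_version):
    """Turn one association record into long-format rows,
    one row per datasource score (hypothetical helper)."""
    gene_id = record["target"]["id"]
    disease_id = record["disease"]["id"]
    datasources = record["association_score"]["datasources"]
    return [
        {
            "ot_version": ot_version,
            "gene_id": gene_id,
            "disease_id": disease_id,
            "datasource_id": datasource_id,
            "score": float(score),
        }
        for datasource_id, score in datasources.items()
    ]


# Made-up record, for illustration only (not real Open Targets data)
sample_line = json.dumps({
    "target": {"id": "GENE_A"},
    "disease": {"id": "DISEASE_X"},
    "association_score": {"datasources": {"source_1": 0.5, "source_2": 0.0}},
})

rows = flatten_association(json.loads(sample_line), "16.04")
for row in rows:
    print(row)
```

Running this per line and per release, then concatenating the results, gives the merged ot_version, gene_id, disease_id, datasource_id, score view described earlier.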

Here are a few other things to watch out for in doing this:

  1. Older datasets have an is_direct flag that distinguishes direct associations from those obtained through EFO ancestors. Keeping only the rows where the flag is true gives you the equivalent of the more recent “direct” datasets at Open Targets Downloads. Keeping all rows, whether the flag is true or false, gives you the equivalent of the “indirect” datasets in the more recent downloads. That might not be obvious at first, so it’s worth knowing if you’re trying to work across versions.
  2. Some of the folder structures change over time. Starting with version 18.12, the data appears under an output directory, so you’ll need a switch on directory layout to handle that.
  3. I believe older datasets included associations with a score of 0 while newer ones don’t, so you’ll likely want to filter out 0 scores or otherwise deal with that inconsistency.
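
The three points above could be handled with something like the following rough sketch. Everything here is an assumption on my part rather than our actual pipeline: the helper names (version_key, association_url, keep_row) are hypothetical, the file name pattern is extrapolated from the 16.04 example and may not hold for every release, the exact location of the output directory should be checked against the FTP listing, and I'm treating is_direct as a plain key on each row.

```python
BASE_URL = "https://ftp.ebi.ac.uk/pub/databases/opentargets/platform"


def version_key(ot_version):
    """Parse a release label such as '18.12' into a sortable (year, month) tuple."""
    year, month = ot_version.split(".")
    return (int(year), int(month))


def association_url(ot_version):
    """Build a download URL, switching layout at 18.12.

    Assumption: the file name follows the 16.04 pattern, and releases
    from 18.12 onwards sit in an 'output' subdirectory.
    """
    filename = f"{ot_version}_association_data.json.gz"
    if version_key(ot_version) >= (18, 12):
        return f"{BASE_URL}/{ot_version}/output/{filename}"
    return f"{BASE_URL}/{ot_version}/{filename}"


def keep_row(row, direct_only=True):
    """Decide whether a long-format row survives the cross-version filters."""
    # Drop zero scores so older releases line up with newer ones
    if row["score"] == 0:
        return False
    # direct_only=True mimics the newer "direct" datasets;
    # direct_only=False keeps everything, like the "indirect" datasets.
    # Rows without the flag are treated as direct (an assumption).
    if direct_only and not row.get("is_direct", True):
        return False
    return True


print(association_url("16.04"))
print(association_url("18.12"))
```

Comparing releases as (year, month) tuples rather than as strings or floats avoids surprises if a label ever fails to sort lexically.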

Thanks all! I hadn’t seen that preprint, so I’ll be reading it in detail. Thanks also for the pointers on some of the practical challenges in this approach.