Can I create my own GraphQL database for use in the Platform UI from the data download?

I’ll give a broad overview here, but depending on your specific requirements you might need further assistance.

An overview

The GraphQL interface itself is a stateless service and does not contain any data. It is an interface that lets us query the two databases which host the data for OTP; in our documentation we refer to it as the API. The two databases (one Elasticsearch, the other ClickHouse) need to be built using the outputs of the OTP ETL. The web application queries the API to provide a GUI for the OTP.
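To make the "web application queries the API" relationship concrete, here is a minimal Python sketch of what a client does: it POSTs a GraphQL query as JSON to the API, which in turn reads from the two databases. The URL below is an assumption (a placeholder for wherever your own API ends up running), and the function names are hypothetical; the query itself is the example that appears later in this post.

```python
import json
import urllib.request

# Assumption: placeholder address for your own API deployment.
API_URL = "http://localhost:8080/api/v4/graphql"

# The example query from the release guide in this post.
QUERY = """
query target_to_drug_to_disease {
  target(ensemblId: "ENSG00000157764") {
    knownDrugs {
      rows {
        drugId
        disease {
          id
        }
      }
    }
  }
}
"""


def build_request(url, query):
    """Build (but do not send) an HTTP POST carrying a GraphQL query."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(url, query):
    """Send the query and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(url, query)) as response:
        return json.load(response)
```

Calling `run_query(API_URL, QUERY)` only works once an API and both databases are actually running behind that address.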

The databases are prepared using code from this repository. Within that repository, the directory terraform_create_images hosts the code we use to create the GCP VMs that run OTP.

The four components (the two databases, the API and the web application) are deployed using this repository, which again uses Terraform to deploy to GCP.

We deploy our infrastructure on Google Cloud Platform, and as such much of the deployment code is tied quite tightly to that service. Each of the individual components is deployed either as a VM or a Docker image.

A rough guide to releasing

Typically our release process has the following steps:

  1. Collect all the necessary ETL inputs using the Platform Input Support repository. You don’t need to do this yourself, because the outputs of this step are available from either Google Cloud (gs://open-targets-data-releases/22.09/input) or the EBI FTP (http://ftp.ebi.ac.uk/pub/databases/opentargets/platform/22.09/input/).
  2. Using the outputs from step 1, use the ETL to create the datasets. If you are adding in additional data, this is probably going to be where you need to do it. You can examine our ETL outputs from either Google Cloud (gs://open-targets-data-releases/22.09/output) or the EBI FTP (http://ftp.ebi.ac.uk/pub/databases/opentargets/platform/22.09/output/).
  3. Using the outputs from step 2, load the ClickHouse and Elasticsearch databases with the code in platform-output-support. You could potentially add in additional data at this point, but there is a risk in doing so: the datasets interact through ID terms (a target might reference a drug, which in turn references a disease). For example, consider the query:
query target_to_drug_to_disease {
  target(ensemblId: "ENSG00000157764") {
    knownDrugs {
      rows {
        drugId
        disease {
          id
        }
      }
    }
  }
}

Because our inputs are produced in step 2, we can validate that each drug referenced by the target actually exists in the data. The same is true for each disease referenced by the drug. If you add unrecognised entries, the API will not be able to return valid responses.
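If you do add your own data, it may be worth running a similar check yourself before loading the databases. The following is only a sketch of the idea, not our actual validation code: the function name, the field names and the IDs are all made up for illustration, and you would need to adapt them to the real dataset schemas.

```python
def find_dangling(rows, field, known_ids):
    """Return, sorted, the values of `field` in `rows` that are not in `known_ids`."""
    return sorted({row[field] for row in rows if row[field] not in known_ids})


# Toy data with made-up IDs and hypothetical field names.
drugs = [{"id": "DRUG_A"}, {"id": "DRUG_B"}]
diseases = [{"id": "DISEASE_X"}]
known_drug_rows = [
    {"drugId": "DRUG_A", "diseaseId": "DISEASE_X"},
    {"drugId": "DRUG_C", "diseaseId": "DISEASE_Y"},  # both IDs are unknown
]

drug_ids = {drug["id"] for drug in drugs}
disease_ids = {disease["id"] for disease in diseases}

missing_drugs = find_dangling(known_drug_rows, "drugId", drug_ids)
missing_diseases = find_dangling(known_drug_rows, "diseaseId", disease_ids)

print("Unknown drugs:", missing_drugs)
print("Unknown diseases:", missing_diseases)
```

Any ID reported here is a reference the API would not be able to resolve, so it is better to fix or drop those rows before step 3.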

  4. Deploy the created images using the terraform-google-opentargets-platform repository. You’ll likely have to update this quite significantly if you’re not deploying to GCP.
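The download locations in steps 1 and 2 follow the same pattern, so if you are scripting against them, a small helper like the one below can build them for a given release. This is just a convenience sketch based on the 22.09 paths quoted above; the function name is hypothetical, and I am assuming other releases use the same layout.

```python
GCS_BUCKET = "gs://open-targets-data-releases"
FTP_BASE = "http://ftp.ebi.ac.uk/pub/databases/opentargets/platform"


def release_urls(release, stage):
    """Return the Google Cloud and EBI FTP locations for one release stage.

    `stage` is "input" (step 1) or "output" (step 2).
    """
    if stage not in ("input", "output"):
        raise ValueError("stage must be 'input' or 'output'")
    return {
        "gcs": f"{GCS_BUCKET}/{release}/{stage}",
        "ftp": f"{FTP_BASE}/{release}/{stage}/",
    }


print(release_urls("22.09", "output"))
```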