Are there any docs or code to help me create my own GraphQL database from the data download? We would like to extend the information in the database and create our own version of the Platform for internal use, showing that new data.
I’ll give a broad overview here, but depending on your specific requirements you might need further assistance.
An overview
The GraphQL interface itself does not contain any data: it is a stateless service which queries the two databases that host the data for the Open Targets Platform (OTP); in our documentation we refer to it as the API. The two databases (one Elasticsearch, the other ClickHouse) need to be built using the outputs of the OTP ETL. The web application queries the API to provide a GUI for the OTP.
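For a concrete picture of that interaction, here is a minimal sketch of a client querying the API, just as the web application does. It assumes the Python requests package and uses the public Platform endpoint; a private deployment would substitute its own URL.

```python
# Minimal sketch: POST a GraphQL query to the Platform API, the same
# stateless interface the web application talks to. The endpoint below
# is the public one; a private deployment exposes its own equivalent.
import requests

API_URL = "https://api.platform.opentargets.org/api/v4/graphql"

QUERY = """
query targetInfo($ensemblId: String!) {
  target(ensemblId: $ensemblId) {
    approvedSymbol
    approvedName
  }
}
"""

response = requests.post(
    API_URL,
    json={"query": QUERY, "variables": {"ensemblId": "ENSG00000157764"}},
)
response.raise_for_status()
print(response.json()["data"]["target"])
```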
The databases are prepared using code from the platform-output-support repository. Within that repository, the directory terraform_create_images hosts the code we use to create the GCP VMs needed to run OTP.
The four components (Elasticsearch, ClickHouse, the API, and the web application) are deployed using the terraform-google-opentargets-platform repository, which again uses Terraform to deploy to GCP.
We deploy our infrastructure on Google Cloud Platform, and as such much of the deployment code is tied quite tightly to that service. Each of the individual components is deployed either as a VM or a Docker image.
A rough guide to releasing
Typically our release process has the following steps:
1. Collect all the necessary ETL inputs using the Platform Input Support repository. You don't need to do this yourself: the outputs of this step are available from either Google Cloud (gs://open-targets-data-releases/22.09/input) or the EBI FTP (http://ftp.ebi.ac.uk/pub/databases/opentargets/platform/22.09/input/). See the download sketch after this list for one way to fetch them.
2. Using the outputs from step 1, run the ETL to create the datasets. If you are adding additional data, this is probably where you need to do it. You can examine our ETL outputs from either Google Cloud (gs://open-targets-data-releases/22.09/output) or the EBI FTP (http://ftp.ebi.ac.uk/pub/databases/opentargets/platform/22.09/output/).
3. Using the outputs from step 2, load the ClickHouse and Elasticsearch databases with the code in platform-output-support. You could potentially add additional data at this point, but there is a risk in doing so: the datasets interact through ID terms (a target might reference a drug, which in turn references a disease). For example, the query:
```graphql
query target_to_drug_to_disease {
  target(ensemblId: "ENSG00000157764") {
    knownDrugs {
      rows {
        drugId
        disease {
          id
        }
      }
    }
  }
}
```
Because our inputs are produced in step 2, we can validate that each drug referenced by the target actually exists in the data, and the same holds for each disease referenced by the drug. If you add unrecognised entries, the API will not be able to return valid responses. A sketch of such an ID check follows this list.
4. Deploy the created images using the terraform-google-opentargets-platform repository. You'll likely have to update this quite significantly if you're not deploying to GCP.
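To illustrate step 1, here is a hedged sketch of pulling the released input files straight from the public bucket rather than running Platform Input Support yourself. It assumes the google-cloud-storage Python package; the bucket is publicly readable, so an anonymous client is enough (note that the full input tree is large).

```python
# Sketch: download the 22.09 input files from the public release bucket.
# Assumes google-cloud-storage; no credentials are needed because the
# bucket is world-readable.
from pathlib import Path

from google.cloud import storage

client = storage.Client.create_anonymous_client()
out_dir = Path("inputs/22.09")

for blob in client.list_blobs("open-targets-data-releases", prefix="22.09/input/"):
    if blob.name.endswith("/"):  # skip directory placeholder objects
        continue
    destination = out_dir / Path(blob.name).relative_to("22.09/input")
    destination.parent.mkdir(parents=True, exist_ok=True)
    blob.download_to_filename(str(destination))
    print(f"downloaded {blob.name}")
```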
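And for the ID risk described under step 3, a hedged sketch of checking custom evidence against the ETL outputs before loading: every target and disease the evidence references should already exist in the data. The paths and field names here (targetId, diseaseId, JSON-lines part files) are illustrative assumptions; adjust them to your release layout and evidence schema.

```python
# Sketch: verify that custom evidence only references IDs present in the
# ETL outputs. Paths and field names are illustrative, not prescriptive.
import json
from pathlib import Path

def load_ids(directory: str, id_field: str) -> set[str]:
    """Collect the ID column from a dataset written as JSON-lines part files."""
    ids: set[str] = set()
    for part in Path(directory).glob("part-*.json"):
        with part.open() as handle:
            for line in handle:
                ids.add(json.loads(line)[id_field])
    return ids

target_ids = load_ids("outputs/22.09/targets", "id")
disease_ids = load_ids("outputs/22.09/diseases", "id")

with open("custom_evidence.jsonl") as handle:
    for number, line in enumerate(handle, start=1):
        record = json.loads(line)
        if record["targetId"] not in target_ids:
            print(f"line {number}: unknown target {record['targetId']}")
        if record["diseaseId"] not in disease_ids:
            print(f"line {number}: unknown disease {record['diseaseId']}")
```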
Thanks Jarod, this is indeed what I needed! I am getting the input files directly from the bucket and will add my evidence from there, so it gets validated through the pipeline as well. I would use the platform-etl-support Scala code to add my data into the process, then take it all the way to the ClickHouse/ES servers and use the platform-api to create the GraphQL API that will then be used by the UI… Quite the process, but I am getting there!
Thon