Re-writing the drug index

We recently restructured the Open Targets Platform’s drug index, making it more flexible and easier for stakeholders to expand with their own data.

The Platform retrieves raw inputs from ChEMBL using an Elasticsearch instance. Our pipeline then winnows the roughly 2 million available compounds down to almost 12 000 drugs.

We define a drug as any molecule that meets at least one of the following criteria:

  • It has at least one known indication (disease);
  • It has at least one known mechanism of action (targets); or
  • Its ChEMBL ID can be mapped to a DrugBank ID.
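
The three criteria above amount to a simple "any of" filter. The following is a minimal sketch of that logic, not the pipeline's actual code: the field names (`indications`, `mechanisms_of_action`, `drugbank_id`) and the placeholder IDs are hypothetical and chosen only for illustration.

```python
def is_drug(molecule: dict) -> bool:
    """Return True if the molecule meets at least one of the three criteria.

    Field names are assumptions made for this sketch, not the real schema.
    """
    has_indication = len(molecule.get("indications", [])) >= 1
    has_mechanism = len(molecule.get("mechanisms_of_action", [])) >= 1
    has_drugbank_id = bool(molecule.get("drugbank_id"))
    return has_indication or has_mechanism or has_drugbank_id


# Hypothetical records with placeholder identifiers.
molecules = [
    {"chembl_id": "CHEMBL_A", "indications": ["disease_1"]},
    {"chembl_id": "CHEMBL_B", "mechanisms_of_action": ["target_1"]},
    {"chembl_id": "CHEMBL_C", "drugbank_id": "DB_X"},
    {"chembl_id": "CHEMBL_D"},  # meets none of the criteria
]

drugs = [m["chembl_id"] for m in molecules if is_drug(m)]
print(drugs)
```

Only the last record is dropped, because it has no indication, no mechanism of action, and no DrugBank mapping.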

The new data structure is broken down into three broad tranches: molecules, mechanisms of action, and indications. They can be combined as necessary using a ChEMBL ID as a linking field.
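
To illustrate how the three tranches might be recombined, here is a small sketch that groups mechanisms of action and indications by ChEMBL ID and attaches them to each molecule. The record layout, the `chembl_id` key name, and all values are assumptions for illustration; the actual field names are described in the ETL documentation.

```python
from collections import defaultdict


def group_by_chembl_id(records):
    """Index a list of records by their (assumed) 'chembl_id' field."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["chembl_id"]].append(record)
    return grouped


def combine(molecules, mechanisms, indications):
    """Attach mechanisms of action and indications to each molecule."""
    moa_by_id = group_by_chembl_id(mechanisms)
    ind_by_id = group_by_chembl_id(indications)
    return [
        {
            **molecule,
            "mechanisms_of_action": moa_by_id.get(molecule["chembl_id"], []),
            "indications": ind_by_id.get(molecule["chembl_id"], []),
        }
        for molecule in molecules
    ]


# Hypothetical example data, one list per tranche.
molecules = [{"chembl_id": "CHEMBL_A"}, {"chembl_id": "CHEMBL_B"}]
mechanisms = [{"chembl_id": "CHEMBL_A", "targets": ["target_1"]}]
indications = [
    {"chembl_id": "CHEMBL_A", "disease": "disease_1"},
    {"chembl_id": "CHEMBL_A", "disease": "disease_2"},
]

combined = combine(molecules, mechanisms, indications)
```

Keeping the tranches separate and joining on demand means a molecule with no mechanisms or indications simply ends up with empty lists.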

Users who run their own instances of the ETL and Platform can now also add data from external files. The new structure simplifies this process because fewer fields must be supplied to add new data. For detailed instructions on the required fields and formats, please refer to the ETL pipeline configuration’s readme.

This post is based on our recent blog update, where you can find a more detailed explanation of the reasons and process behind the drug index restructuring.