Adding FDA data to Open Targets with Scala and Spark

Last year, the Open Targets team integrated the United States Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) into the Open Targets data ecosystem.

FAERS is a database of adverse event and medication error reports submitted to the FDA, and the publicly available API provides nearly 12 million records of adverse drug reactions. Medical professionals and consumers can voluntarily file reports of adverse medical events that are suspected to be associated with a drug. Because reporting is voluntary, the FDA attaches a number of disclaimers, including that the existence of a report does not establish causation, and that the information in the reports has not been verified.

The Platform team prepared a pipeline using Apache Spark and Scala, making it possible to analyse and extract insights from this information in minutes using Google Cloud’s Dataproc service. We limited the results to reports that did not involve patient death and were submitted by a medical professional. We also excluded events related to treatment, technology or human action, rather than to the action of the drug itself. This left us with approximately 55 000 unique drug-reaction pairs, covering 465 biological compounds.
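The filtering rules above can be sketched in a few lines of plain Scala. This is an illustration only, not the Platform's pipeline: the `Report` record, its field names and the two excluded reaction terms are hypothetical simplifications (the real FAERS records are nested JSON with different field names), and it uses ordinary collections instead of Spark so that it runs without any dependencies.

```scala
object FaersFilterSketch {
  // Hypothetical, flattened view of one drug-reaction mention in a report.
  // The real FAERS schema is nested and uses different field names.
  final case class Report(
    id: String,
    patientDied: Boolean,
    reportedByProfessional: Boolean,
    drug: String,
    reaction: String
  )

  // Hypothetical examples of reaction terms that describe treatment,
  // technology or human action rather than the action of the drug.
  val excludedReactions: Set[String] =
    Set("drug administration error", "device malfunction")

  // Keep reports with no patient death, filed by a medical professional,
  // whose reaction is not on the exclusion list.
  def keep(r: Report): Boolean =
    !r.patientDied &&
      r.reportedByProfessional &&
      !excludedReactions.contains(r.reaction.toLowerCase)

  // Unique (drug, reaction) pairs that survive the filter, lower-cased
  // so that spelling variants in case collapse to one pair.
  def drugReactionPairs(reports: Seq[Report]): Set[(String, String)] =
    reports
      .filter(keep)
      .map(r => (r.drug.toLowerCase, r.reaction.toLowerCase))
      .toSet
}
```

In a Spark job the same predicate could be passed to a `filter` on a typed Dataset of reports; the logic itself would not need to change.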

We implemented a likelihood ratio test to account for how often the event and drug appear in the data set, and to test the relevance of each adverse event associated with the drug. The result is a useful guide to which adverse events are strongly associated with specific drugs, helping researchers to identify potential new links between targets, drugs and their effects.
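To make the idea concrete, here is a minimal sketch of one common log-likelihood ratio statistic for a single drug-event pair, computed from a 2x2 table of report counts. It is an assumption for illustration and not necessarily the exact statistic used in the Platform. The names `a` (reports mentioning both the drug and the event), `nDrug`, `nEvent` and `nTotal` are hypothetical. The statistic compares the event rate among reports for the drug with the rate among all other reports, and is set to zero when the drug's rate is not higher.

```scala
object LrtSketch {
  // x * ln(x / y), with the usual convention that 0 * ln(0) = 0.
  private def xLogRatio(x: Double, y: Double): Double =
    if (x == 0.0) 0.0 else x * math.log(x / y)

  /** Log-likelihood ratio for one drug-event pair.
    *
    * @param a      reports mentioning both the drug and the event
    * @param nDrug  reports mentioning the drug
    * @param nEvent reports mentioning the event
    * @param nTotal all reports in the data set
    */
  def logLikelihoodRatio(a: Long, nDrug: Long, nEvent: Long, nTotal: Long): Double = {
    require(
      a >= 0 && a <= nDrug && a <= nEvent && nDrug + nEvent - a <= nTotal,
      "counts must form a valid 2x2 table"
    )
    val c = nEvent - a          // event reported with other drugs
    val nOther = nTotal - nDrug // reports for all other drugs

    // One-sided: only score pairs where the event is more frequent
    // with this drug than with the rest of the data set.
    if (a * nOther <= c * nDrug) 0.0
    else
      xLogRatio(a.toDouble, nDrug.toDouble) +
        xLogRatio(c.toDouble, nOther.toDouble) -
        xLogRatio(nEvent.toDouble, nTotal.toDouble)
  }
}
```

A larger value means the event is reported with the drug more often than its overall frequency would suggest. Deciding which values count as significant needs a threshold or a null distribution, which this sketch does not cover.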

This post is based on an Open Targets Blog post, where you can find further information on the process of integrating this data into the Platform, including details on why Spark was a particularly useful technology in this case, and how you can run your own analyses.