Creating a pandas DataFrame from an API POST using the Python requests library

How do I transform the API response from JSON into a pandas dataframe? I am currently looking for drug targets within Alzheimer's disease; here is the POST request script:

import requests
import json

# gene_id = "ENSG00000169083"
disease_ids = ["MONDO_0004975", "EFO_1001870"]
for disease_id in disease_ids:
    # Build query string to get general information about AD and knownDrugs and their names and ID's.
    query_string = """
    query getInfoOnAlzheimers($efoId: String!) {
      disease(efoId: $efoId) {
        id
        name
        dbXRefs
        knownDrugs {
          uniqueDrugs
          uniqueTargets
          rows {
            drug {
              id
              name
              tradeNames
            }
          }
        }
      }
    }
    """
    # Set variables object of arguments to be passed to endpoint
    variables = {"efoId": disease_id}

    # Set base URL of GraphQL API endpoint
    base_url = "https://api.platform.opentargets.org/api/v4/graphql"

    # Perform POST request and check status code of response
    r = requests.post(base_url, json={"query": query_string, "variables": variables})
    print(r.status_code)
    #print(r.text)
    # Transform API response from JSON into Python dictionary and print in console
    api_response = json.loads(r.text)
    print(api_response)

Hi, welcome to the OpenTargets community! The problem is that pandas cannot really handle nested data structures; it prefers flat tabular data with simple rows and columns. You can either go for PySpark, or you can parse the returned JSON, select the relevant fields, flatten them, and then convert to pandas. When you say:

data = r.json()
pd.DataFrame(data['data']['disease'])

you would get a pandas dataframe, but you’ll see it is very awkward to process further. Instead, I would recommend parsing `data = r.json()` yourself.
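As a middle ground, `pd.json_normalize` can flatten one nested level for you. A minimal sketch using a mocked, truncated version of the response (the IDs and names below are illustrative, not live API output):

```python
import pandas as pd

# Mocked, truncated version of the API response (not live data):
data = {'data': {'disease': {
    'id': 'MONDO_0004975',
    'name': 'Alzheimer disease',
    'knownDrugs': {'uniqueDrugs': 2, 'rows': [
        {'drug': {'id': 'CHEMBL502', 'name': 'DONEPEZIL'}},
        {'drug': {'id': 'CHEMBL807', 'name': 'RIVASTIGMINE'}},
    ]},
}}}

# json_normalize flattens the nested drug records into dotted columns:
drugs = pd.json_normalize(data['data']['disease']['knownDrugs']['rows'])
print(drugs.columns.tolist())  # ['drug.id', 'drug.name']
```

This gets you a flat table quickly, at the cost of less control over column names than hand-parsing gives you.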

Also, you don’t need to iterate over diseases to retrieve information from the API: the `diseases` GraphQL endpoint allows querying multiple diseases at once. Example:

import requests
import pandas as pd

query = '''
    query getInfoOnAlzheimers($efoIds: [String!]!) {
      diseases(efoIds: $efoIds) {
        id
        name
        dbXRefs
        knownDrugs {
          uniqueDrugs
          uniqueTargets
          rows {
            drug {
              id
              name
              tradeNames
            }
          }
        }
      }
    }
'''

disease_ids = ['MONDO_0004975', 'EFO_1001870']

# Set base URL of GraphQL API endpoint
base_url = "https://api.platform.opentargets.org/api/v4/graphql"

# Perform POST request and check status code of response
r = requests.post(base_url, json = {"query": query, "variables": {"efoIds": disease_ids}})

assert r.status_code == 200, f'Expected status code is 200, got {r.status_code} instead'

# Extract data as json:
data = r.json()

# This variable will collect disease and drug data in a nicer format suitable to convert to pandas:
parsed_disease_data = []

# Looping over all diseases in the returned object:
for disease in data['data']['diseases']:
    disease_data = {
        'diseaseId': disease['id'],
        'diseaseName': disease['name'],
        'diseaseDbXRefs': disease['dbXRefs'],
    }

    # Further steps depend on the existence of the drug object:
    if disease['knownDrugs']:
        disease_data.update({
            'uniqueDrugs': disease['knownDrugs']['uniqueDrugs'],
            'uniqueTargets': disease['knownDrugs']['uniqueTargets']
        })

        # Iterate over all drugs:
        drugs = []
        for drug in disease['knownDrugs']['rows']:
            drugs.append({
                'drugId': drug['drug'].get('id', None),
                'drugName': drug['drug'].get('name', None),
                'tradenames': drug['drug'].get('tradeNames', None)
            })

        # Combining disease and drug data:
        disease_data.update({'drugs': drugs})

    # Collecting data:
    parsed_disease_data.append(disease_data)

df = (
    # Converting to pandas dataframe:
    pd.DataFrame(parsed_disease_data)
    # Exploding drug column to multiple rows:
    .explode('drugs')
    # Resetting index:
    .reset_index(drop=True)
)
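After the explode, each row of the `drugs` column still holds a dict. If you need flat columns instead, one option is to normalize that column and join it back (a sketch with mocked data; IDs and names are illustrative):

```python
import pandas as pd

# Mocked version of the parsed data (not live API output):
parsed_disease_data = [
    {'diseaseId': 'MONDO_0004975', 'diseaseName': 'Alzheimer disease',
     'drugs': [{'drugId': 'CHEMBL502', 'drugName': 'DONEPEZIL'},
               {'drugId': 'CHEMBL807', 'drugName': 'RIVASTIGMINE'}]},
]

df = (
    pd.DataFrame(parsed_disease_data)
    .explode('drugs')
    .reset_index(drop=True)
)

# Expand the dict column into proper columns and join back on the row index:
flat = pd.concat(
    [df.drop(columns='drugs'), pd.json_normalize(df['drugs'].tolist())],
    axis=1,
)
print(flat.columns.tolist())
```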

Please let us know if there’s anything unclear.


This was incredibly helpful. Thank you so much.

I needed to store the drug information in its own local data frame in order to write it to a file, so I could then use the PubChem Identifier Exchange Service to convert ChEMBL IDs to CIDs:

## Creating a file to store ChEMBLIds into
f = open('chembls.txt','w')

# Looping over all diseases in the returned object:
for disease in data['data']['diseases']:
    disease_data = {
        'diseaseId': disease['id'],
        'diseaseName': disease['name'],
        'diseaseDbXRefs': disease['dbXRefs'],
    }

    ChEMBLIds = []

    # Further steps depend on the existence of the drug object:
    if disease['knownDrugs']:
        disease_data.update({
            'uniqueDrugs': disease['knownDrugs']['uniqueDrugs'],
            'uniqueTargets': disease['knownDrugs']['uniqueTargets']
        })

        # Iterate over all drugs:
        drugs = []
        for drug in disease['knownDrugs']['rows']:
            drugs.append({
                'drugId': drug['drug'].get('id', None),
                'drugName': drug['drug'].get('name', None),
                'tradenames': drug['drug'].get('tradeNames', None)
            })
        
        ### Creates a local dataframe containing only drug information
        df = pd.DataFrame(drugs)
        ChEMBLIds = df['drugId'].tolist() # stores ChEMBLIds to a list
        
        ### Writes the list of drug information to a file
        file_data = "\n" + disease_data['diseaseName'] + ":" + disease_data['diseaseId']+ "\n%s\n" % ChEMBLIds
        print(file_data)
        f.write(file_data)

# Close the file once all diseases have been written:
f.close()

If you only need the drug id list saved into a file, I would do something along these lines:

(
    df
    # Dropping rows with missing drug identifier:
    .loc[lambda df: df.drugId.notna()]
    # Select column:
    .drugId
    # Remove duplicates:
    .drop_duplicates()
    # Save list without index and header:
    .to_csv('drugids.txt', index=False, header=False)
)
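For reference, the same result in plain Python, assuming `drugs` is the list of dicts built earlier in the loop (the IDs below are illustrative):

```python
# Illustrative input, in the same shape as the `drugs` list built above:
drugs = [
    {'drugId': 'CHEMBL502'},
    {'drugId': 'CHEMBL502'},
    {'drugId': None},
    {'drugId': 'CHEMBL807'},
]

# Drop missing ids and duplicates while keeping the original order:
seen = set()
drug_ids = []
for d in drugs:
    drug_id = d.get('drugId')
    if drug_id is not None and drug_id not in seen:
        seen.add(drug_id)
        drug_ids.append(drug_id)

# Write one id per line:
with open('drugids.txt', 'w') as fh:
    fh.write('\n'.join(drug_ids) + '\n')
```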

However, I would advise you to take a look at the documentation of the GraphQL API (see DOCS) for all the fields you can retrieve from the drug index we have; there might be something you need afterwards.

Is it possible to retrieve the relevant EFO IDs from a disease or phenotype name search?
In other words, instead of using the database to find EFO IDs by disease name and then plugging them into a list inside my Python script, is it possible to retrieve the top EFO IDs through some requests call?

Hi Sawagema, the search page and the search functionality are powered by the same GraphQL API that provides data for the widgets on the evidence and target pages, so you can actually script against it. Such a script would look like this:

import pandas as pd
import requests

graphql_api_url = 'https://api.platform.opentargets.org/api/v4/graphql'

query = '''
  query search4disease($query: String!, $entities: [String!]!){
    search(
      queryString: $query
      entityNames: $entities
    ){
      hits {
        id
        entity
        name
        description
      }
    }
  }
'''

variables = {
  "query":"alzheim",
  "entities": ["disease"]
}

# Submit query:
r = requests.post(graphql_api_url, json={'query': query, 'variables': variables})
data = r.json()

df = pd.DataFrame(data['data']['search']['hits'])
df.head()
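The hit list can then feed the disease query from earlier directly. A sketch using a mocked response (the shape follows the query above; the IDs and names are illustrative):

```python
# Mocked search response in the shape returned by the query above:
data = {'data': {'search': {'hits': [
    {'id': 'MONDO_0004975', 'entity': 'disease',
     'name': 'Alzheimer disease', 'description': None},
    {'id': 'EFO_1001870', 'entity': 'disease',
     'name': 'late-onset Alzheimer disease', 'description': None},
]}}}

# Collect the disease hit ids to plug into the knownDrugs query as efoIds:
efo_ids = [hit['id'] for hit in data['data']['search']['hits']
           if hit['entity'] == 'disease']
print(efo_ids)
```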

I strongly advise looking at the documentation for this endpoint. I hope this helped.