Creating a pandas DataFrame from an API POST using the Python requests library

How do I transform the API response from JSON into a pandas dataframe? I am currently looking for drug targets within Alzheimer's disease; here is the POST request script:

import requests
import json

# gene_id = "ENSG00000169083"
disease_ids = ["MONDO_0004975", "EFO_1001870"]
for disease_id in disease_ids:
    # Build query string to get general information about AD and knownDrugs and their names and ID's.
    query_string = """
    query getInfoOnAlzheimers($efoId: String!) {
      disease(efoId: $efoId) {
        id
        name
        dbXRefs
        knownDrugs {
          uniqueDrugs
          uniqueTargets
          rows {
            drug {
              id
              name
              tradeNames
            }
          }
        }
      }
    }
    """
    # Set variables object of arguments to be passed to endpoint
    variables = {"efoId": disease_id}

    # Set base URL of GraphQL API endpoint
    base_url = "https://api.platform.opentargets.org/api/v4/graphql"

    # Perform POST request and check status code of response
    r = requests.post(base_url, json={"query": query_string, "variables": variables})
    print(r.status_code)
    #print(r.text)
    # Transform API response from JSON into Python dictionary and print in console
    api_response = json.loads(r.text)
    print(api_response)

Hi, welcome to the OpenTargets community! The problem is that pandas cannot really handle nested data structures; it prefers flat tabular data with simple rows and columns. You can either go for PySpark, or you can parse the returned JSON, select the relevant fields, flatten them, and then convert to pandas. When you say:

data = r.json()
pd.DataFrame(data['data']['disease'])

you would get a pandas dataframe, but you’ll see it is very awkward to process further. Instead, I would recommend parsing `data = r.json()` yourself.
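As a middle ground, `pd.json_normalize` can flatten one nested level for you. A minimal sketch using a mocked, truncated version of the response (the IDs and names below are illustrative, not live API output):

```python
import pandas as pd

# Mocked, truncated version of the API response (not live data):
data = {'data': {'disease': {
    'id': 'MONDO_0004975',
    'name': 'Alzheimer disease',
    'knownDrugs': {'uniqueDrugs': 2, 'rows': [
        {'drug': {'id': 'CHEMBL502', 'name': 'DONEPEZIL'}},
        {'drug': {'id': 'CHEMBL807', 'name': 'RIVASTIGMINE'}},
    ]},
}}}

# json_normalize flattens the nested drug records into dotted columns:
drugs = pd.json_normalize(data['data']['disease']['knownDrugs']['rows'])
print(drugs.columns.tolist())  # ['drug.id', 'drug.name']
```

This gets you a flat table quickly, at the cost of less control over column names than hand-parsing gives you.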

Also, you don’t need to iterate over diseases to retrieve information from the API: the `diseases` GraphQL endpoint allows querying multiple diseases at once. Example:

import requests
import pandas as pd

query = '''
    query getInfoOnAlzheimers($efoIds: [String!]!) {
      diseases(efoIds: $efoIds) {
        id
        name
        dbXRefs
        knownDrugs {
          uniqueDrugs
          uniqueTargets
          rows {
            drug {
              id
              name
              tradeNames
            }
          }
        }
      }
    }
'''

disease_ids = ['MONDO_0004975', 'EFO_1001870']

# Set base URL of GraphQL API endpoint
base_url = "https://api.platform.opentargets.org/api/v4/graphql"

# Perform POST request and check status code of response
r = requests.post(base_url, json = {"query": query, "variables": {"efoIds": disease_ids}})

assert r.status_code == 200, f'Expected status code is 200, got {r.status_code} instead'

# Extract data as json:
data = r.json()

# This variable will collect disease and drug data in a nicer format suitable to convert to pandas:
parsed_disease_data = []

# Looping over all diseases in the returned object:
for disease in data['data']['diseases']:
    disease_data = {
        'diseaseId': disease['id'],
        'diseaseName': disease['name'],
        'diseaseDbXRefs': disease['dbXRefs'],
    }

    # Further steps depend on the existence of the drug object:
    if disease['knownDrugs']:
        disease_data.update({
            'uniqueDrugs': disease['knownDrugs']['uniqueDrugs'],
            'uniqueTargets': disease['knownDrugs']['uniqueTargets']
        })

        # Iterate over all drugs:
        drugs = []
        for drug in disease['knownDrugs']['rows']:
            drugs.append({
                'drugId': drug['drug'].get('id', None),
                'drugName': drug['drug'].get('name', None),
                'tradenames': drug['drug'].get('tradeNames', None)
            })

        # Combining disease and drug data:
        disease_data.update({'drugs': drugs})

    # Collecting data:
    parsed_disease_data.append(disease_data)

df = (
    # Converting to pandas dataframe:
    pd.DataFrame(parsed_disease_data)
    # Exploding drug column to multiple rows:
    .explode('drugs')
    # Resetting index:
    .reset_index(drop=True)
)
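After the explode, each row of the `drugs` column still holds a dict. If you need flat columns instead, one option is to normalize that column and join it back (a sketch with mocked data; IDs and names are illustrative):

```python
import pandas as pd

# Mocked version of the parsed data (not live API output):
parsed_disease_data = [
    {'diseaseId': 'MONDO_0004975', 'diseaseName': 'Alzheimer disease',
     'drugs': [{'drugId': 'CHEMBL502', 'drugName': 'DONEPEZIL'},
               {'drugId': 'CHEMBL807', 'drugName': 'RIVASTIGMINE'}]},
]

df = (
    pd.DataFrame(parsed_disease_data)
    .explode('drugs')
    .reset_index(drop=True)
)

# Expand the dict column into proper columns and join back on the row index:
flat = pd.concat(
    [df.drop(columns='drugs'), pd.json_normalize(df['drugs'].tolist())],
    axis=1,
)
print(flat.columns.tolist())
```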

Please let us know if there’s anything unclear.


This was incredibly helpful. Thank you so much.

I needed to store the drug information in its own local data frame in order to write it to a file, so I could then use the PubChem Identifier Exchange Service to convert ChEMBL IDs to CIDs:

## Creating a file to store ChEMBLIds into
f = open('chembls.txt','w')

# Looping over all diseases in the returned object:
for disease in data['data']['diseases']:
    disease_data = {
        'diseaseId': disease['id'],
        'diseaseName': disease['name'],
        'diseaseDbXRefs': disease['dbXRefs'],
    }

    ChEMBLIds = []

    # Further steps depend on the existence of the drug object:
    if disease['knownDrugs']:
        disease_data.update({
            'uniqueDrugs': disease['knownDrugs']['uniqueDrugs'],
            'uniqueTargets': disease['knownDrugs']['uniqueTargets']
        })

        # Iterate over all drugs:
        drugs = []
        for drug in disease['knownDrugs']['rows']:
            drugs.append({
                'drugId': drug['drug'].get('id', None),
                'drugName': drug['drug'].get('name', None),
                'tradenames': drug['drug'].get('tradeNames', None)
            })
        
        ### Creates a local dataframe containing only drug information
        df = pd.DataFrame(drugs)
        ChEMBLIds = df['drugId'].tolist() # stores ChEMBLIds to a list
        
        ### Writes the list of drug information to a file
        file_data = "\n" + disease_data['diseaseName'] + ":" + disease_data['diseaseId']+ "\n%s\n" % ChEMBLIds
        print(file_data)
        f.write(file_data)

# Close the file once all diseases have been written:
f.close()

If you only need the drug id list saved into a file, I would do something along these lines:

(
    df
    # Dropping rows with missing drug identifier:
    .loc[lambda df: df.drugId.notna()]
    # Select column:
    .drugId
    # Remove duplicates:
    .drop_duplicates()
    # Save list without index and header:
    .to_csv('drugids.txt', index=False, header=False)
)
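For reference, the same result in plain Python, assuming `drugs` is the list of dicts built earlier in the loop (the IDs below are illustrative):

```python
# Illustrative input, in the same shape as the `drugs` list built above:
drugs = [
    {'drugId': 'CHEMBL502'},
    {'drugId': 'CHEMBL502'},
    {'drugId': None},
    {'drugId': 'CHEMBL807'},
]

# Drop missing ids and duplicates while keeping the original order:
seen = set()
drug_ids = []
for d in drugs:
    drug_id = d.get('drugId')
    if drug_id is not None and drug_id not in seen:
        seen.add(drug_id)
        drug_ids.append(drug_id)

# Write one id per line:
with open('drugids.txt', 'w') as fh:
    fh.write('\n'.join(drug_ids) + '\n')
```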

However, I would advise you to take a look at the documentation of the GraphQL API (see DOCS) for all the fields you can retrieve from the drug index we have; there might be something you need afterwards.

Is it possible to retrieve the relevant EFO IDs from a disease or phenotype name search?
In other words, instead of using the database to find EFO IDs by disease name and then plugging them into a list inside my Python script, is it possible to retrieve the top EFO IDs through some requests call?

Hi Sawagema, the search page and the search functionality are powered by the same GraphQL API that provides data for the widgets on the evidence and target pages, so you can actually script against it. Such a script would look like this:

import pandas as pd
import requests

graphql_api_url = 'https://api.platform.opentargets.org/api/v4/graphql'

query = '''
  query search4disease($query: String!, $entities: [String!]!){
    search(
      queryString: $query
      entityNames: $entities
    ){
      hits {
        id
        entity
        name
        description
      }
    }
  }
'''

variables = {
  "query":"alzheim",
  "entities": ["disease"]
}

# Submit query:
r = requests.post(graphql_api_url, json={'query': query, 'variables': variables})
data = r.json()

df = pd.DataFrame(data['data']['search']['hits'])
df.head()
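The hit list can then feed the disease query from earlier directly. A sketch using a mocked response (the shape follows the query above; the IDs and names are illustrative):

```python
# Mocked search response in the shape returned by the query above:
data = {'data': {'search': {'hits': [
    {'id': 'MONDO_0004975', 'entity': 'disease',
     'name': 'Alzheimer disease', 'description': None},
    {'id': 'EFO_1001870', 'entity': 'disease',
     'name': 'late-onset Alzheimer disease', 'description': None},
]}}}

# Collect the disease hit ids to plug into the knownDrugs query as efoIds:
efo_ids = [hit['id'] for hit in data['data']['search']['hits']
           if hit['entity'] == 'disease']
print(efo_ids)
```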

I strongly advise looking at the documentation for this endpoint. I hope this helped.