Publication years for Proteins

Hello there community,

I need to retrieve publication dates for 10 proteins, and I am currently using the following query.

query_string = """
  query targetAnnotation($ensemblId: String!, $cursor: String!) {
    target(ensemblId: $ensemblId) {
      approvedSymbol
      literatureOcurrences(startYear: 1989, endYear: 2025, cursor: $cursor) {
        cursor
        rows {
          publicationDate
        }
      }
    }
  }
"""

I pass the cursor value to a recursive function because this query returns only 25 dates at a time, and I keep going until the returned cursor is None. I found this approach by reading a couple of previous discussions.

The code works, but it is extremely slow, since my time period is 1989-2025. It ran for about 45 minutes and then crashed while still on the first protein. I was appending each returned page to a list.
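For reference, the cursor pagination described above can also be written as a plain loop instead of a recursive function. This is only a minimal sketch: the endpoint URL, the nullable `$cursor: String` declaration (so the first request can send no cursor), and the helper names `fetch_page` and `collect_dates` are assumptions, not something confirmed in this thread. A loop will not make the API any faster, since it still needs one request per 25 rows, but it does avoid Python's recursion limit, which is one possible reason for a crash after many pages.

```python
import json
import urllib.request

# Assumed public GraphQL endpoint of the Open Targets Platform.
API_URL = "https://api.platform.opentargets.org/api/v4/graphql"

# Same query as above, but with a nullable cursor (an assumption) so that
# the first page can be requested without one.
QUERY = """
query targetAnnotation($ensemblId: String!, $cursor: String) {
  target(ensemblId: $ensemblId) {
    approvedSymbol
    literatureOcurrences(startYear: 1989, endYear: 2025, cursor: $cursor) {
      cursor
      rows {
        publicationDate
      }
    }
  }
}
"""


def fetch_page(ensembl_id, cursor):
    """Request one page and return its {'cursor': ..., 'rows': [...]} dict."""
    payload = json.dumps(
        {"query": QUERY, "variables": {"ensemblId": ensembl_id, "cursor": cursor}}
    ).encode("utf-8")
    request = urllib.request.Request(
        API_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["data"]["target"]["literatureOcurrences"]


def collect_dates(ensembl_id, fetch=fetch_page):
    """Follow the cursor page by page until the API returns no cursor."""
    dates = []
    cursor = None  # no cursor on the first request
    while True:
        page = fetch(ensembl_id, cursor)
        dates.extend(row["publicationDate"] for row in page["rows"])
        cursor = page["cursor"]
        if cursor is None:
            return dates
```

Calling `collect_dates` with an Ensembl gene ID would return a flat list of date strings for that target. The `fetch` parameter is there only so the loop can be tried out with a stub instead of the live API.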

I also read that large queries like this should be run in OT’s BigQuery. However, ‘literatureOcurrences’ is not available in the ‘targets’ schema there, although the other fields from the GraphQL API are.

Can someone help me with what I want to achieve? I basically need a dataframe with Proteins as column names and publication dates as row values (preferably). Other structures are also fine.

Europe PMC and PubMed provide similar results on the fly, but the results differ depending on whether you search with a UniProt ID, an HGNC symbol, or a synonym. I guess the OT search does a better job in this respect.

Thanks in advance,
R

Hi, yes, this GraphQL query can indeed be painfully slow because of the large number of publications. Unfortunately, the relevant dataset is not exposed via BigQuery, but you can download it from FTP (available only in Parquet format): ftp://ftp.ebi.ac.uk/pub/databases/opentargets/platform/24.09/output/etl/parquet/literature/literatureIndex.

This is a flat table with the following schema:

root
 |-- pmid: string (nullable = true)
 |-- pmcid: string (nullable = true)
 |-- date: date (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- keywordId: string (nullable = true)
 |-- relevance: double (nullable = true)
 |-- keywordType: string (nullable = true)

You can filter the dataset for the keywordId values you are interested in and extract the dates. The dataset is not painfully large (<2GB) and contains 140M rows.
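As a rough illustration of that filtering step, here is a sketch using pandas with the pyarrow engine. It assumes the literatureIndex directory has already been downloaded to a local path, and that keywordId holds the Ensembl gene ID for target rows; both are assumptions to check against your copy of the data. The function names `load_dates` and `dates_wide` are made up for this example.

```python
import pandas as pd


def load_dates(parquet_dir, keyword_ids):
    """Read only the keywordId and date columns for the requested IDs.

    parquet_dir is a hypothetical local copy of the literatureIndex
    directory. The filter is pushed down to pyarrow, so the full table
    does not have to be loaded into memory first.
    """
    return pd.read_parquet(
        parquet_dir,
        engine="pyarrow",
        columns=["keywordId", "date"],
        filters=[("keywordId", "in", list(keyword_ids))],
    )


def dates_wide(df):
    """Return one column per keywordId, holding its sorted dates.

    Targets have different numbers of publications, so shorter columns
    are padded with missing values at the bottom.
    """
    columns = {
        keyword_id: group["date"].dropna().sort_values().reset_index(drop=True)
        for keyword_id, group in df.groupby("keywordId")
    }
    return pd.concat(columns, axis=1)
```

With the data on disk, something like `dates_wide(load_dates("literatureIndex", my_ids))` would give a dataframe with the targets as column names and publication dates as row values, where `my_ids` is your own list of IDs. A long table (one row per keywordId and date) is usually easier to work with afterwards, so the wide form is mainly for the layout requested in the question.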

Europe PMC and PubMed provide similar results on the fly, but the results differ depending on whether you search with a UniProt ID, an HGNC symbol, or a synonym.

This is one of our specialities: we normalise the diverse universe of disease, gene/protein, and molecule labels to a standard reference. Unfortunately, this is not yet available at Europe PMC.

Please let us know if you have further questions.

Thanks a lot. I will try it out and get back to you soon.