Get V2G genes/scores from variant rsid?

Hi,
I’m new to the platform (and to both GraphQL and SQL). I have a list of genetic variants as rsIDs, for which I would like to retrieve the table of “assigned genes” that is shown on the website when I query each variant individually. I understand that to do this, I will first need to map the rsIDs to the Open Targets variant IDs (e.g. 1_154453788_C_T).

I’m having a hard time working out how to do this from your documentation - perhaps I’m looking in the wrong place?! - and so would appreciate some advice, or a pointer to some tutorials and schema documentation. From what I understand, the best way to achieve this would be via the BigQuery instance, but the relevant tables and column names are a mystery to me at the moment.

Thanks in advance.

Hello @new_user,
for your query you would need to use 2 of our datasets:

  • The variant index dataset, to map your rsIDs to our variant notation.
  • The variant-to-gene (V2G) scored dataset, to annotate your variants with their associated genes and scores.

As always, if you have a large number of variants to annotate, I’d encourage you to use our dataset dumps rather than the API, as they are easier to operate with. To get the variant information, you can join your rsIDs with our variant index on rs_id to pick up chr_id, position, ref_allele and alt_allele (we build our variant ID by concatenating these 4 columns). And to get the V2G scores, you can join these harmonised variant IDs with our scored dataset to pick up the columns gene_id and overall_score.

As explained in the documentation, you should be able to follow this approach either by downloading the data from the FTP or by using BigQuery.
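If it helps, here is a minimal sparklyr sketch of this approach against a local download of the parquet dumps. The paths and the my_rsids input are hypothetical placeholders, the column names are the ones described above, and it assumes v2g_scored carries the same four variant columns as the variant index:

library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local")

# Hypothetical input: one rsID per row
my_rsids <- copy_to(sc, data.frame(rs_id = c("rs12345", "rs67890")))

# Variant index: maps rs_id to chr_id, position, ref_allele, alt_allele
variant_index <- spark_read_parquet(sc, "variant_index", "/path/to/lut/variant-index/")

# Scored V2G dataset: gene assignments and scores per variant
v2g_scored <- spark_read_parquet(sc, "v2g_scored", "/path/to/v2g_scored/")

result <- my_rsids %>%
  inner_join(variant_index, by = "rs_id") %>%
  # The Open Targets variant ID is these 4 columns concatenated with "_"
  mutate(variant_id = paste(chr_id, position, ref_allele, alt_allele, sep = "_")) %>%
  inner_join(v2g_scored, by = c("chr_id", "position", "ref_allele", "alt_allele")) %>%
  select(rs_id, variant_id, gene_id, overall_score)

collect(result)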

I hope you find this helpful!

Thanks Irene, this is very useful. More generally, where is the best place to look to determine which variables are stored in which datasets? I have read the docs, but perhaps I have missed where a schema is conveniently available (i.e. outside of it being stored within the individual datasets).

Hi. I am doing the same thing, and when I try to wget the variant index, it downloads the first few files and then says “Skipping directory ‘variant-index’.” without downloading the necessary data. Have you seen this before?

Hi @Melissa_B, I haven’t come across this problem. Is it solved now? Are you getting the data from the FTP?
I usually do something like wget --recursive --no-parent --no-host-directories --cut-dirs 8 ftp://ftp.ebi.ac.uk/pub/databases/opentargets/genetics/latest/lut/variant-index/

Thank you for the suggestion. We are working on an optimised version of our pipelines where, among other significant changes, documentation of our datasets and the logic to generate them is a key component. Stay tuned!


Hi, I have some problems reproducing the tables of variant scores for multiple SNPs as shown on your website.

I tried to download v2g (ftp://ftp.ebi.ac.uk/pub/databases/opentargets/genetics/latest/v2g/) and managed to read the data, but it seems the scores shown on the website are not in those files. Here are the columns I got; they do not contain the complete information shown on the website.

[1] “chr_id” “position” “ref_allele” “alt_allele”
[5] “gene_id” “feature” “type_id” “source_id”
[9] “fpred_labels” “fpred_scores” “fpred_max_label” “fpred_max_score”
[13] “qtl_beta” “qtl_se” “qtl_pval” “qtl_score”
[17] “interval_score” “qtl_score_q” “interval_score_q” “d”
[21] “distance_score” “distance_score_q”

Then I downloaded v2g_scored via

wget -P xxxxxx --recursive -e robots=off --no-parent --no-host-directories --cut-dirs 8 ftp://ftp.ebi.ac.uk/pub/databases/opentargets/genetics/latest/v2g_scored/ -A "*parquet"

Even though I specified "*parquet", I got 11587 files, and it seems all of them do have "parquet" in their names. However, on your website there are only 200 files (part-00000 to part-00199). I copied all those files and tried to read them with sparklyr::spark_read_parquet(), and I ran into problems.

Error:
! org.apache.spark.SparkException: Job 1 cancelled because SparkContext was shut down

Run `sparklyr::spark_last_error()` to see the full Spark error (multiple lines)
To use the previous style of error message set `options("sparklyr.simple.errors" = TRUE)`

I used quite a lot of memory and tried to tune some parameters, but I could not solve the problem.

library(sparklyr)

conf <- spark_config()
# Memory for the driver/executor processes launched by sparklyr
conf$`sparklyr.shell.executor-memory` <- "32g"
conf$`sparklyr.shell.driver-memory` <- "40g"
# Executor and application-master resources
conf$spark.executor.cores <- 32
conf$spark.executor.memory <- "64G"
conf$spark.yarn.am.cores <- 32
conf$spark.yarn.am.memory <- "64G"
conf$spark.executor.instances <- 16
conf$spark.dynamicAllocation.enabled <- "false"
conf$maximizeResourceAllocation <- "true"
conf$spark.default.parallelism <- 64
# Scratch directory and driver result-size cap
conf$spark.local.dir <- "/mnt/data_beteigeuse4/"
conf$spark.driver.maxResultSize <- "8g"
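For reference, this config is then passed to the connection roughly like this (the master and path are placeholders for my actual setup):

sc <- spark_connect(master = "yarn", config = conf)
v2g <- spark_read_parquet(sc, "v2g_scored", "/path/to/v2g_scored/")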

My concrete questions are:

  • Is v2g_scored the correct set of files?
  • Is my download wrong, given that I got more files than are displayed on the FTP server?
  • Is the error caused by the wrong files or the wrong settings?
  • Any other solutions/suggestions?

Thank you very much for your help! I was stuck here for quite a while.

Dear llg,

Yes, v2g_scored is the dataset you are after. It is unclear where your 11587 files came from, but I managed to download all the required files using the command pasted earlier in the thread:

wget --recursive --no-parent --no-host-directories --cut-dirs 8 ftp://ftp.ebi.ac.uk/pub/databases/opentargets/genetics/latest/v2g_scored/

This gives me all the files required.

It is also unclear from your error message why your Spark job crashed. Running out of memory is the most common issue: v2g_scored as a whole is around 20 GB, with around 1 billion rows. I recommend re-downloading the files using the command above and trying to open them again.
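If memory is the bottleneck, it may also be worth reading the dataset lazily rather than caching it, and filtering down to your variants before collecting. A minimal sketch, assuming a local connection and a placeholder path:

library(dplyr)
library(sparklyr)

sc <- spark_connect(master = "local")

# memory = FALSE registers the table without caching ~20 GB into RAM
v2g <- spark_read_parquet(sc, "v2g_scored", "/path/to/v2g_scored/", memory = FALSE)

# Example: the variant 1_154453788_C_T from earlier in the thread
# (assuming chr_id is stored as a string)
v2g %>%
  filter(chr_id == "1", position == 154453788L,
         ref_allele == "C", alt_allele == "T") %>%
  select(gene_id, overall_score) %>%
  collect()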

Best wishes,
Xiangyu
