Understanding the data available online

Dear community,

I am trying to understand the data available online (specifically v2g_scored) and extract useful information from it. While working on that, a few questions came up:

1. I can infer that source_list and source_score_list are linked, but what exactly do the numbers in source_score_list represent?

   overall_score  source_list           source_score_list
0  0.046479       [canonical_tss]       [0.7]
1  0.086519       [vep, canonical_tss]  [0.1, 1.0]

2. A genetic locus can appear several times, with each row capturing a different piece of information. I found one locus (position “958339”, ref_allele = G, alt_allele = A) that appears twice with source_id = canonical_tss, yet the two rows carry different information. Why does this discrepancy occur? Please check the cells highlighted in blue.

3. Assuming the following portion of the data:

position  ref_allele  alt_allele  d         type_id   source_id      source_list                              source_score_list  feature
958339    G           A           315525.0  distance  canonical_tss  [jung2019, javierre2016, canonical_tss]  [0.0, 0.6, 0.4]    unspecified
958339    G           A           8143.0    distance  canonical_tss  [eqtl, canonical_tss]                    [0.9, 1.0]         unspecified

Could you help me by confirming whether the following JSON format captures all of the information correctly?

[
  {
    "distance": 315525.0,
    "source_scores": {
      "javierre2016": 0.0,
      "jung2019": 0.6,
      "canonical_tss": 0.4
    }
  }
]
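
To make the question concrete, here is a minimal Python sketch of how such records could be built from the two rows shown earlier. It assumes that source_list and source_score_list are aligned by position, which is exactly the point that needs confirming. The row_to_record helper and the hard-coded rows are illustrative only and are not part of the dataset or any official tooling.

```python
import json

# Two rows copied from the excerpt (only the columns used here).
rows = [
    {
        "position": 958339,
        "d": 315525.0,
        "source_list": ["jung2019", "javierre2016", "canonical_tss"],
        "source_score_list": [0.0, 0.6, 0.4],
    },
    {
        "position": 958339,
        "d": 8143.0,
        "source_list": ["eqtl", "canonical_tss"],
        "source_score_list": [0.9, 1.0],
    },
]


def row_to_record(row):
    """Pair each source with the score at the same index (assumed alignment)."""
    sources = row["source_list"]
    scores = row["source_score_list"]
    if len(sources) != len(scores):
        raise ValueError("source_list and source_score_list differ in length")
    return {"distance": row["d"], "source_scores": dict(zip(sources, scores))}


records = [row_to_record(r) for r in rows]
print(json.dumps(records, indent=2))
```

Under that assumption, the first row yields jung2019 = 0.0 and javierre2016 = 0.6, which is the reverse of the hand-written JSON, so the correct pairing is worth double-checking.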

Thank you very much in advance!

Aglaia

Hi Aglaia,

Is this the full dataset? Are these rows for the same gene or different genes?

Best wishes,
Xiangyu