Dear community,
I am trying to make sense of the data available online (specifically v2g_scored) and extract useful information from it. While working on that, a few questions came up:
1. I can infer that source_list and source_score_list are linked, but what exactly do the numbers in source_score_list represent?
 | overall_score | source_list | source_score_list |
---|---|---|---|
0 | 0.046479 | [canonical_tss] | [0.7] |
1 | 0.086519 | [vep, canonical_tss] | [0.1, 1.0] |
2. Each genetic locus may appear several times, with each row capturing a different piece of information. I found that the variant at position 958339 (ref_allele = G, alt_allele = A) has two rows with the same source_id. In particular, when source_id = canonical_tss, the values in the two rows differ. Why does this discrepancy occur? Please check the cells highlighted in blue.
3. Assuming the following portion of the data:
position | ref_allele | alt_allele | d | type_id | source_id | source_list | source_score_list | feature |
---|---|---|---|---|---|---|---|---|
958339 | G | A | 315525.0 | distance | canonical_tss | [jung2019, javierre2016, canonical_tss] | [0.0, 0.6, 0.4] | unspecified |
958339 | G | A | 8143.0 | distance | canonical_tss | [eqtl, canonical_tss] | [0.9, 1.0] | unspecified |
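To make question 2 concrete, the repeated rows can be surfaced with something like the sketch below. It is plain Python with the two rows from the table typed in by hand; the choice of grouping key (position, ref_allele, alt_allele, source_id) is my own assumption about what should identify a unique record, not something taken from the dataset documentation.

```python
from collections import defaultdict

# The two rows from the table above, typed in by hand (only the relevant columns).
rows = [
    {"position": 958339, "ref_allele": "G", "alt_allele": "A",
     "d": 315525.0, "source_id": "canonical_tss"},
    {"position": 958339, "ref_allele": "G", "alt_allele": "A",
     "d": 8143.0, "source_id": "canonical_tss"},
]

# Assumed key: one record per variant and source_id.
distances_by_key = defaultdict(list)
for row in rows:
    key = (row["position"], row["ref_allele"], row["alt_allele"], row["source_id"])
    distances_by_key[key].append(row["d"])

# Report every key that occurs more than once, with the differing distances.
for key, distances in distances_by_key.items():
    if len(distances) > 1:
        print(key, "->", distances)
```

If the key is missing a column (for example a gene identifier), the two rows may not be duplicates at all, which is part of what I would like to understand.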
Could you help me by confirming whether the following JSON format captures all of the information correctly?
[
  {
    "distance": 315525.0,
    "source_scores": {
      "jung2019": 0.0,
      "javierre2016": 0.6,
      "canonical_tss": 0.4
    }
  }
]
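For reference, this is roughly how I would build such a record from one row. It is a minimal sketch that assumes source_list and source_score_list are paired by position (the very point I am asking you to confirm in question 1); the helper name row_to_record and the output keys "distance" and "source_scores" are my own choices.

```python
import json

# First row of the table above, reduced to the columns used here.
row = {
    "d": 315525.0,
    "source_list": ["jung2019", "javierre2016", "canonical_tss"],
    "source_score_list": [0.0, 0.6, 0.4],
}


def row_to_record(row):
    """Pair each source with the score at the same index (assumed pairing)."""
    sources = row["source_list"]
    scores = row["source_score_list"]
    if len(sources) != len(scores):
        raise ValueError("source_list and source_score_list differ in length")
    return {
        "distance": row["d"],
        "source_scores": dict(zip(sources, scores)),
    }


records = [row_to_record(row)]
print(json.dumps(records, indent=2))
```

If the pairing is not positional, the mapping inside "source_scores" would of course need to change.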
Thank you very much in advance!
Aglaia