Defined schema for v2g and variant-index data

Afternoon,

I’m working with the v2g and variant-index data, and I’m wondering whether the schemas for these datasets are available to query programmatically. I’ve found a link here to a GitHub file for the v2g data, but I can’t find anything for the variant-index data.

The schema is easily retrievable for the platform data.

The data will ultimately be loaded and processed in both Python and R, and I don’t want to rely on the data types being inferred implicitly or to hard-code the schema by hand. Hard-coding is doable, but if the data changed between release versions I’d have to update my code.

Any help would be much appreciated!

Thanks,
Kier

Hi @K_J_Finnegan,

You are right, the variant index is not documented anywhere. Thanks for reporting this gap in our documentation. We’ll review it to make sure it is up to date and nothing is missing, though I cannot give a definitive timeline.

In the meantime, this is the schema:


root
 |-- chr_id: string (nullable = true)
 |-- position: integer (nullable = true)
 |-- ref_allele: string (nullable = true)
 |-- alt_allele: string (nullable = true)
 |-- chr_id_b37: string (nullable = true)
 |-- position_b37: integer (nullable = true)
 |-- rs_id: string (nullable = true)
 |-- most_severe_consequence: string (nullable = true)
 |-- cadd: struct (nullable = true)
 |    |-- raw: double (nullable = true)
 |    |-- phred: double (nullable = true)
 |-- af: struct (nullable = true)
 |    |-- gnomad_afr: double (nullable = true)
 |    |-- gnomad_amr: double (nullable = true)
 |    |-- gnomad_asj: double (nullable = true)
 |    |-- gnomad_eas: double (nullable = true)
 |    |-- gnomad_fin: double (nullable = true)
 |    |-- gnomad_nfe: double (nullable = true)
 |    |-- gnomad_nfe_est: double (nullable = true)
 |    |-- gnomad_nfe_nwe: double (nullable = true)
 |    |-- gnomad_nfe_onf: double (nullable = true)
 |    |-- gnomad_nfe_seu: double (nullable = true)
 |    |-- gnomad_oth: double (nullable = true)
 |-- gene_id_any_distance: long (nullable = true)
 |-- gene_id_any: string (nullable = true)
 |-- gene_id_prot_coding_distance: long (nullable = true)
 |-- gene_id_prot_coding: string (nullable = true)
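
If it helps in the meantime, one way to avoid hand-coding the types is to keep the tree above as plain text and parse it. The following is only a minimal sketch in Python using the standard library. `parse_schema` and `to_ddl` are hypothetical helper names (not part of any Open Targets tooling), the sample text is truncated to a few of the fields above, and the type mapping covers only the types that appear in this schema. The resulting DDL string is intended for readers that accept one, for example PySpark's `spark.read.schema(...)`.

```python
import re

# A few fields copied from the tree above; paste the full tree in practice.
SCHEMA_TEXT = """\
root
 |-- chr_id: string (nullable = true)
 |-- position: integer (nullable = true)
 |-- cadd: struct (nullable = true)
 |    |-- raw: double (nullable = true)
 |    |-- phred: double (nullable = true)
 |-- gene_id_any_distance: long (nullable = true)
 |-- gene_id_any: string (nullable = true)
"""

# One field line: leading tree indentation, then "|-- name: type (nullable = ...)".
FIELD_RE = re.compile(
    r"^(?P<indent>[ |]*)\|-- (?P<name>\w+): (?P<type>\w+) "
    r"\(nullable = (?P<nullable>true|false)\)$"
)

# Assumed mapping from the printed type names to Spark SQL DDL type names.
DDL_TYPES = {"string": "STRING", "integer": "INT", "long": "BIGINT", "double": "DOUBLE"}


def parse_schema(text):
    """Parse a printSchema-style tree into a nested list of field dicts."""
    root = []
    stack = [(-1, root)]  # (depth, list that receives fields at depth + 1)
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match is None:
            continue  # skips "root" and blank lines
        # Top-level fields have 1 indent character; each nesting level adds 5.
        depth = (len(match.group("indent")) - 1) // 5
        while stack[-1][0] >= depth:
            stack.pop()
        node = {
            "name": match.group("name"),
            "type": match.group("type"),
            "nullable": match.group("nullable") == "true",
            "fields": [],
        }
        stack[-1][1].append(node)
        stack.append((depth, node["fields"]))
    return root


def to_ddl(fields, top_level=True):
    """Render parsed fields as a DDL-style schema string."""
    parts = []
    for field in fields:
        if field["type"] == "struct":
            dtype = "STRUCT<" + to_ddl(field["fields"], top_level=False) + ">"
        else:
            dtype = DDL_TYPES[field["type"]]
        separator = " " if top_level else ": "
        parts.append(field["name"] + separator + dtype)
    return ", ".join(parts)


if __name__ == "__main__":
    print(to_ddl(parse_schema(SCHEMA_TEXT)))
```

The nested list returned by `parse_schema` can also be compared against whatever types a loader infers, so a change between releases shows up as a mismatch instead of passing silently.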

We are in the process of a larger-scale update to the data release process, including moving the variant index to GnomAD3, so there might be movement in this space as well. However, by the time we get there, the documentation will certainly be updated.

Best,
Daniel

Hi Daniel,

That’s really helpful!

Thanks a lot,
Kier