Missing .parquet files?

Dear OpenTargets Team,

Is it expected that some .parquet files are missing, e.g. in the progeny evidence (Index of /pub/databases/opentargets/platform/latest/output/etl/parquet/evidence/sourceId=progeny)? For example, part-00005 is missing there. Is this by design?
Any advice would be much appreciated. Thank you.

Hi,

I can assure you the datasets are complete. For very small datasets, where the number of expected partitions is in the range of the evidence count, it is expected to see fewer partitions, so gaps in the part numbering are normal. This is the case for

  • progeny: 168 partitions, 378 evidence
  • sysbio: 171 partitions, 389 evidence

How these partitions are created, and how the available data is spread across them, is controlled by Spark, and we have little insight into the details. Please let us know if you encounter anything unexpected.
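
As a rough illustration of why gaps can appear, here is a toy simulation. It is only a sketch under two assumptions that are not confirmed in this thread: that rows are assigned to partitions uniformly at random, and that the target is 200 partitions (Spark's default for `spark.sql.shuffle.partitions`; the value used by the ETL may differ). It does not reproduce Spark's actual partitioner, and the function names are hypothetical.

```python
import random


def expected_nonempty(n_rows: int, n_partitions: int) -> float:
    """Expected number of partitions that receive at least one row,
    assuming each row lands in a uniformly random partition."""
    return n_partitions * (1 - (1 - 1 / n_partitions) ** n_rows)


def simulate_nonempty(n_rows: int, n_partitions: int, seed: int = 0) -> int:
    """Assign rows to random partitions and count the non-empty ones."""
    rng = random.Random(seed)
    used = {rng.randrange(n_partitions) for _ in range(n_rows)}
    return len(used)


if __name__ == "__main__":
    # Evidence counts quoted above; 200 partitions is an assumption.
    for name, rows in [("progeny", 378), ("sysbio", 389)]:
        print(name,
              round(expected_nonempty(rows, 200), 1),
              simulate_nonempty(rows, 200))
```

Under these assumptions the expected number of non-empty partitions is close to 170 for both datasets, which is in the same range as the partition counts listed above. Empty partitions would then simply show up as missing part numbers.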

Best,
Daniel
