ENCODE Ingest Architecture

Schema Design

The schema used for this dataset was almost entirely inspired by the Terra Interoperability Model (TIM). JSON definitions of the resulting tables can be found under schema/. The schema is purposefully very different from the source ENCODE schema: it is sample-centric (instead of experiment-centric), merges and splits ENCODE entities as needed, and does not attempt to capture all of the information published in the source data.

[Figure: Data model]
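As a rough illustration of what "sample-centric instead of experiment-centric" means, the sketch below inverts a list of experiment records into a lookup keyed by sample. The record shapes and field names (`id`, `assay`, `biosamples`) are assumptions made for this example; they are not taken from the ENCODE schema or from the tables under schema/.

```python
from collections import defaultdict

# Hypothetical experiment-centric records; field names are illustrative only.
experiments = [
    {"id": "EXP1", "assay": "ChIP-seq", "biosamples": ["BS1", "BS2"]},
    {"id": "EXP2", "assay": "RNA-seq", "biosamples": ["BS2"]},
]


def to_sample_centric(experiments):
    """Invert experiment -> samples into sample -> experiments."""
    by_sample = defaultdict(list)
    for exp in experiments:
        for sample_id in exp["biosamples"]:
            by_sample[sample_id].append(
                {"experiment": exp["id"], "assay": exp["assay"]}
            )
    return dict(by_sample)


samples = to_sample_centric(experiments)
```

In the resulting view each sample is the top-level key, and a sample shared by several experiments (here `BS2`) carries one entry per experiment.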

Pipeline Architecture

The ingest pipeline is orchestrated through Argo, with most data processing logic delegated to Dataflow and BigQuery. The high-level flow looks like:

[Figure: Architecture diagram]

NOTE: Only metadata ingest is implemented. The system still needs to be extended to ingest data files from ENCODE's S3 archives.

Extracting Metadata

Metadata extraction is the first step of ENCODE ingest. The extraction component is written as a Dataflow pipeline, which begins by querying the metadata of all Biosample entities from human donors. From there, the pipeline traverses known foreign-key fields to scrape the rest of the metadata we care about:

[Figure: Extraction graph]

Some important things to note:

  1. The initial query relies on there being relatively few Biosample entities in ENCODE's system, as they all must be returned in a single response payload. Performance may degrade as ENCODE continues to grow.
  2. The links we follow between different entity types were chosen based on informal experimentation and performance testing. If the number of entities of a given type grows significantly, performance might degrade or the pipeline might crash.
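The traversal described above can be sketched as a breadth-first walk that starts from the Biosample entities and follows a fixed set of foreign-key fields per entity type. This is a simplified, single-process sketch, not the Dataflow implementation: the in-memory `FAKE_ENCODE` store, the `LINKS` map, and all field names and IDs are hypothetical stand-ins for the real API and the real link choices.

```python
from collections import deque

# Hypothetical in-memory stand-in for the ENCODE API: entity ID -> record.
FAKE_ENCODE = {
    "B1": {"type": "Biosample", "donor": "D1", "libraries": ["L1", "L2"]},
    "B2": {"type": "Biosample", "donor": "D1", "libraries": ["L2"]},
    "D1": {"type": "Donor"},
    "L1": {"type": "Library", "files": ["F1"]},
    "L2": {"type": "Library", "files": []},
    "F1": {"type": "File"},
}

# Hypothetical choice of which foreign-key fields to follow per entity type.
LINKS = {
    "Biosample": ["donor", "libraries"],
    "Library": ["files"],
}


def extract(seed_ids, fetch=FAKE_ENCODE.get, links=LINKS):
    """Collect entity IDs by type, following known foreign keys from the seeds."""
    seen = set(seed_ids)
    queue = deque(seed_ids)
    found = {}
    while queue:
        entity_id = queue.popleft()
        record = fetch(entity_id)
        if record is None:
            continue  # dangling reference; skip it in this sketch
        found.setdefault(record["type"], []).append(entity_id)
        for field in links.get(record["type"], []):
            value = record.get(field, [])
            refs = value if isinstance(value, list) else [value]
            for ref in refs:
                if ref not in seen:  # fetch each entity only once
                    seen.add(ref)
                    queue.append(ref)
    return found


result = extract(["B1", "B2"])
```

The `seen` set is what keeps shared entities (a donor referenced by several samples) from being fetched repeatedly. Both caveats from the list still apply to a design like this: the seed list has to fit in one response, and every extra link followed multiplies the number of fetches.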

Transforming Metadata

Metadata transformation is also implemented as a Dataflow pipeline. Instead of one large DAG, this step can be viewed as a collection of tiny processing trees that just happen to run at the same time:

[Figure: Transformation trees]

Note that some raw entity types contribute to more than one output table.
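One way to picture the "collection of tiny trees" is as a mapping from each raw entity type to the independent row-building functions that consume it. The sketch below assumes hypothetical table names (`sample`, `sample_donor`, `donor`) and field names; the real output tables are the ones defined under schema/.

```python
def to_sample_row(raw):
    return {"sample_id": raw["id"], "organ": raw.get("organ")}


def to_sample_donor_row(raw):
    return {"sample_id": raw["id"], "donor_id": raw.get("donor")}


def to_donor_row(raw):
    return {"donor_id": raw["id"], "sex": raw.get("sex")}


# Each raw type is the root of a small tree; each branch feeds one output table.
# Table names and row shapes here are hypothetical.
TREES = {
    "Biosample": [("sample", to_sample_row), ("sample_donor", to_sample_donor_row)],
    "Donor": [("donor", to_donor_row)],
}


def transform(raw_by_type, trees=TREES):
    """Run every branch of every tree; branches do not depend on each other."""
    tables = {}
    for raw_type, records in raw_by_type.items():
        for table_name, row_fn in trees.get(raw_type, []):
            tables.setdefault(table_name, []).extend(row_fn(r) for r in records)
    return tables


raw = {
    "Biosample": [{"id": "BS1", "organ": "liver", "donor": "D1"}],
    "Donor": [{"id": "D1", "sex": "female"}],
}
tables = transform(raw)
```

Here the single Biosample record feeds two output tables, which mirrors the note above. Because no branch reads another branch's output, the branches could run in any order or in parallel.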