build-pipeline-docker-images
celligner
cn_gene
context_explorer
data_page
examples/dsub-exec-profile
predictability
scripts
_run_common.conseq
_run_external.conseq
cell_lines.conseq
context-enrichment.conseq
correlation.conseq
dose_replicate_reformat.conseq
dstat_wrapper.py
exec.conseq
jenkins-run-nonquarterly.sh
jenkins-run-pipeline.sh
make_compound_summary_table.conseq
nonquarterly-processed.conseq
oncokb_import.conseq
pref_essential_genes.conseq
preprocess_raw_biom_matrix.conseq
preprocess_taiga_ids.py
proteomics.conseq
publish.conseq
readme.md
reformat_deps.conseq
reformat_repurposing_data.conseq
rules-to-skip
run_dev.conseq
run_dqa.conseq
run_external.conseq
run_iqa.conseq
run_test.conseq
run_xqa.conseq
sanger_proteomics.conseq
sparkles-config
sparkles-config-n1-highmem-4
summarize_gene_deps.conseq
tda.conseq
tda_table_generator.conseq
validation.conseq
xrefs-common.conseq
xrefs-external.template
xrefs-nonquarterly-unprocessed.conseq
xrefs-public.template

readme.md

Preprocessing Pipeline Overview

We have a preprocessing pipeline that runs via conseq. It takes in taiga datasets specified as xrefs, cleans, transforms, and unifies them, and publishes the output to a GCS bucket. Conseq is available here: https://github.com/broadinstitute/conseq.

See the conseq repo for instructions on using conseq.

This process is now run in a production setting via Jenkins jobs. If you want to make changes, it's often easiest to download the latest artifacts from a given environment as your starting point.

Common conventions

xrefs

All inputs to the pipeline should be registered in taiga and accessed by taiga datafile ID. These IDs should be specified as properties on artifacts (our convention is typically to use dataset_id).

Taiga IDs used by all environments should go into xrefs-common.conseq. Data released only to the DMC or internally should go into xrefs-shared-internal-dmc.conseq. IDs that change every release should go into xrefs-ENVIRONMENT.template.

Executing rules on the cloud

If you have a task which requires a large amount of memory or CPU, it's best to push it to the cloud. If it's an array job (i.e., you want hundreds of jobs to run in parallel), you should have your rule run sparkles to submit the job. Always submit the job with a name that contains a hash of the inputs so that we can gracefully continue if the process is interrupted. (See the predictive pipeline for examples.)

If you have individual tasks which should run in the cloud, you can mark them as using the dsub executor and specify the memory required and the image to use. For example:
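As a rough illustration of the naming convention above, here is a minimal Python sketch of deriving a job name from a hash of the inputs. The function name `job_name_for_inputs`, the `prefix` argument, and the choice of SHA-256 truncated to 12 hex characters are assumptions made for this example; they are not taken from the pipeline or from sparkles itself.

```python
import hashlib
import json


def job_name_for_inputs(prefix, inputs):
    """Return a job name that embeds a short, stable hash of the inputs.

    `prefix` and `inputs` are hypothetical: `inputs` is any JSON-serializable
    description of what the job consumes (for example, taiga dataset IDs).
    """
    # Serialize with sorted keys so the same inputs always produce the same
    # string, regardless of the order in which they were assembled.
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # A truncated digest keeps the name short while still changing whenever
    # any input changes.
    return f"{prefix}-{digest[:12]}"
```

Because the name depends only on the inputs, rerunning the rule after an interruption produces the same name, which is what lets a previously submitted job be recognized and continued rather than submitted again.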

rule process_celligner_inputs:
    executor: dsub {
       "docker_image": "us.gcr.io/broad-achilles/celligner@sha256:6442129dfc136d0d603e8fbd5b1d469a0bf91cc63286132e45975101edbaffa8",
       "min_ram": "50",
       "boot_disk_size": "70",
       "helper_path": "/opt/conseq/bin/conseq-helper" }
    inputs:
       ...

Also, always specify the image by its SHA digest so that we can track which version of the image was used.
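One way to enforce this convention is a small check that an image reference is pinned by digest rather than by tag. The following Python sketch is illustrative only: the helper name `is_pinned_by_digest` is hypothetical, and the pattern assumes references of the form `name@sha256:` followed by 64 hexadecimal characters, as in the example rule above.

```python
import re

# Assumed shape: "<registry/path>@sha256:<64 lowercase hex characters>".
_DIGEST_RE = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def is_pinned_by_digest(image):
    """Return True if the image reference ends in a sha256 digest."""
    return _DIGEST_RE.match(image) is not None
```

A tag such as `celligner:latest` can point at different images over time, so it fails this check, while a digest reference identifies exactly one image.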
