Skip to content

ANiknejad/bgee_pipeline

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Pipeline of Bgee release 14.1

Bgee pipeline documentation: content

General information

  1. Introduction
  2. Directory structure
  3. Pipeline steps

Developer guidelines

  1. Documenting the pipeline
  2. Writing Makefiles
  3. Versioning input/output files

General information

Introduction

Bgee is a database to retrieve and compare gene expression patterns in multiple animal species, produced from multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data).

Bgee is based exclusively on curated "normal", healthy, expression data (e.g., no gene knock-out, no treatment, no disease), to provide a comparable reference of normal gene expression.

Bgee produces ranked calls of presence/absence of expression, and of differential over-/under-expression, integrated along with information of gene orthology, and of homology between organs. This allows comparisons of expression patterns between species.

Bgee pipeline overview

Directory structure:

  • pipeline/: contains the Makefiles and scripts used to generate data. The Bgee pipeline is a mixture of Perl scripts, R scripts, Java code. They are all called through the use of makefiles. This directory is organized by sub-directories, corresponding to steps in the pipeline, that each contains a Makefile to be run. Common parameters are defined in the files pipeline/Makefile.common (for non-sensitive variables) and pipeline/Makefile.Config (for sensitive variables, e.g., passwords).

  • source_files/: contains the source data used for the pipeline, for instance, files from annotators in TSV format, downloaded ontologies to use, downloaded files from MOD, etc.

  • generated_files/: contains the files and outputs generated by the pipeline.

  • java/: contains files related to the Bgee Java project. Some Makefiles use them through command line arguments.

Pipeline steps

Each step of the pipeline is represented by a sub-directory in the pipeline/ directory. See pipeline/README.md for description of the steps, and configuration.

Developer guidelines

  1. Documenting the pipeline
  2. Writing Makefiles
  3. Versioning input/output files

Documenting the pipeline

The pipeline is documented directly in the relevant directories of the pipeline, through README.md files. See pipeline/ directory as a starting point.

Recommended sections:

  • Aim of the step, and previous steps required.
  • Details: explanations about the step
  • Data generation: how to run the step
  • Data verification: how to verify that the step was correctly run
  • Error handling: what to do when confronted to typical errors
  • Other notable Makefile targets: individual Makefile targets that could be useful

Writing Makefiles

Mandatory variables and import

Two variables that are used in pipeline/Makefile.common must be defined by each Makefile:

  • PIPELINEROOT: defines the path to the root folder of the pipeline, relative to the Makefile (i.e., to target the directory pipeline/).
  • DIR_NAME: defines the name of the directory where lives the Makefile (e.g., species/). This allows to use folders of same name in source_files/ and generated_files/.

These variables must be defined before importing pipeline/Makefile.common. They allow to define useful common variables from pipeline/Makefile.common, e.g.: VERIFICATIONFILE, INPUT_DIR, OUTPUT_DIR.

Example start of a Bgee Makefile in pipeline/species/:

PIPELINEROOT := ../
DIR_NAME := species/
include $(PIPELINEROOT)Makefile.common

Step verification file

The all target of the Makefile must always be to generate a step verification file, whose path is accessible through the variable $(VERIFICATIONFILE). The all target should either be the first target defined in the Makefile, or be assigned as the default goal (i.e., .DEFAULT_GOAL := all).

This verification file must contain information allowing to easily assess whether all the targets of the Makefile were successfully run. For instance, to generate this file, a Makefile could launch a query to the database to verify correct insertion of data.

Example Makefile targets:

 all: $(VERIFICATIONFILE)
 
 ...
 
 $(VERIFICATIONFILE): dependency1 dependency2 ...
     @$(MYSQL) -e "SELECT * FROM taxon where bgeeSpeciesLCA = TRUE order by taxonLeftBound" > [email protected]
     @$(MV) [email protected] $@

Input/output folders

In order to more efficiently save input files and files generated by a pipeline run, they are kept in specific folders, not mixed with pipeline scripts, in source_files/ and generated_files/. The directory structure in these folders should be the same as in the pipeline/ directory.

No Makefile should read directly from or write directly into the pipeline/ folder.

Common variables

When a file name, or URL, etc, is used in several Makefiles, it should be assigned to a variable in pipeline/Makefile.common. Notable variables:

  • SOURCE_FILES_DIR: corresponds to $(PIPELINEROOT)../source_files/
  • INPUT_DIR: corresponds to $(SOURCE_FILES_DIR)$(DIR_NAME)
  • GENERATED_FILES_DIR: corresponds to $(PIPELINEROOT)../generated_files/
  • OUTPUT_DIR: corresponds to $(GENERATED_FILES_DIR)$(DIR_NAME)
  • VERIFICATIONFILE: corresponds to $(OUTPUT_DIR)step_verification_$(RELEASE).txt
  • TMPDIR
  • useful commands:
    • $(JAVA)
    • $(MYSQL)
    • $(MYSQLNODBNAME)
    • $(CURL): and see $(APPEND_CURL_COMMAND); allows to download a file only if the remote file is newer than the local file, and to a save it to a tmp file until the transfer is successfully completed.

Secured variables

If a variable contains sensitive information (e.g., a password), it should be defined in pipeline/Makefile.Config. The actual values of these variables should not be versioned! (simpler than to encrypt the file)

Versioning input/output files

Source and generated files should be versioned using git, if not too large. This versioning is not performed automatically by the Makefiles. It is the responsibility of the person in charge to version the relevant files when a step is completed.

About

Source code of the Bgee pipeline

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Perl 42.2%
  • HTML 22.2%
  • Makefile 12.6%
  • Jupyter Notebook 12.4%
  • R 5.1%
  • TSQL 4.5%
  • Other 1.0%