Bgee pipeline

General information:

Introduction
List of pipeline steps
- Pipeline and database initialization
- Taxa, genomes, ontologies (e.g., Uberon)
- Raw expression data analyses and insertion: RNA-Seq, Affymetrix, in situ hybridization, EST analyses.
- Bgee post-processing steps

Shortcut note: for the RNA-Seq analysis pipeline, see RNA_Seq/.

Developer guidelines

Keeping track of data source versions
Pipeline configuration
Running and re-running pipeline steps

General information

Introduction

Throughout the documentation, RELEASE denotes the current Bgee version (e.g., if the current release number is 14, bgee_vRELEASE means bgee_v14).

Each step in the Bgee pipeline is represented by a specific folder, containing a Makefile, and related scripts. Variables common to several steps are defined in the file pipeline/Makefile.common. Sensitive variables are stored in the file pipeline/Makefile.Config.

Each Makefile ultimately generates an output file, called step_verification_RELEASE.txt, in the corresponding output folder. The Makefile uses this file to determine whether a step should be re-run, and developers use it to check that the step was executed correctly. These files are committed to git, so that results can be compared between releases. They are not meant to be the actual output of the Makefiles but, rather, small files that are added to git and serve as a control of the procedures.
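Because the verification files are committed, comparing a step between two releases can be as simple as a diff. The following is a minimal sketch, assuming both files sit in the same output folder; the function name and its arguments are hypothetical and not part of the pipeline.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Bgee pipeline): show what changed in a
# step's verification file between two releases.
# $1: output folder of the step (assumed to hold both files)
# $2: older release number, $3: newer release number
compare_step_verification() {
    output_dir="$1"
    old_release="$2"
    new_release="$3"
    # diff exits with 0 when the files are identical, 1 when they differ,
    # and a value greater than 1 when a file cannot be read.
    diff -u "${output_dir}/step_verification_${old_release}.txt" \
            "${output_dir}/step_verification_${new_release}.txt"
}
```

For example, `compare_step_verification path/to/output 13 14` prints a unified diff of the two files, where `path/to/output` stands for the output folder of the step of interest.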

List of pipeline steps

Pipeline and database initialization

Pipeline initialization: see init/.
Database creation: see db_creation/.

Taxa, genomes, ontologies

Species and taxon information: see species/.
Genomes and gene-related information: see genes/.
Anatomical ontology (Uberon) and developmental stage ontologies: see uberon/.

Raw data analyses and insertion

RNA-Seq data analyses: see RNA_Seq/.
Affymetrix data analyses: see Affymetrix/.
In situ hybridization data analyses: see In_situ/.
EST data analyses: see ESTs/.
Differential expression analyses: see Differential_expression/.

Bgee post-processing steps

Annotation sanity checks: see post_processing/.
Propagation/reconciliation of present/absent expression calls: see post_processing/.
Computations of expression rank scores: see post_processing/.
Generation of files containing data available for download: see download_files/.
Generation of XRefs to UniProt: see download_files/.
Insertion of information about versions of the data sources used: see db_creation/Makefile, target update_data_sources.sql.

Developer guidelines

Keeping track of data source versions
Pipeline configuration
Running and re-running pipeline steps

Keeping track of data source versions

At each step of the pipeline, you need to update the file db_creation/update_data_sources.sql, which keeps track of the versions of the data sources used for the current release. This file is used at the end of the pipeline run to insert this information into the database. This information is not managed by the Makefiles because the ways to obtain it differ too much between data sources: sometimes you have to look at the home page of the website, sometimes at a specific file, and sometimes the modification date of a file cannot be used and you need to look for a release date inside the file.

Configuration

Before running the pipeline on a specific machine, you need to configure the following:

in Makefile.Config: edit this file with the correct logins and passwords. The correct values must not be committed to version control! (This is easier than encrypting the file.)
in Makefile.common: edit the following variables as needed:
- RELEASE: version of Bgee for which the pipeline is being run
- ENSRELEASE: version of Ensembl used
- TMP DIR: where to store (potentially large) temporary files
- Servers and ports configuration:
  - DBHOST and DBPORT: host and port of the MySQL database
  - ANNOTATORHOST: the server storing the Affymetrix raw data and the local version of Ensembl
  - DATAHOST: an additional backup machine
  - PIPEHOST: name of the machine on which the pipeline is run
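
A quick way to catch a forgotten variable before launching a step is to scan Makefile.common for the names listed above. This is only a sketch: the function is hypothetical, and it assumes that each variable is assigned at the start of a line with `=`, `:=` or `?=`, which may not match the actual file. TMP DIR is left out because its exact spelling in the Makefile is not given here.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Bgee pipeline): report which of the
# variables listed in this README have no assignment in a Makefile.
# $1: path to the Makefile to inspect, e.g. Makefile.common
check_makefile_common() {
    makefile="$1"
    missing=0
    for var in RELEASE ENSRELEASE DBHOST DBPORT ANNOTATORHOST DATAHOST PIPEHOST; do
        # Assumption: assignments look like "VAR = value", "VAR := value"
        # or "VAR ?= value", possibly indented.
        if ! grep -Eq "^[[:space:]]*${var}[[:space:]]*[?:]?=" "$makefile"; then
            echo "not assigned in ${makefile}: ${var}" >&2
            missing=1
        fi
    done
    # 0 when every variable was found, 1 otherwise
    return "$missing"
}
```

The check only confirms that an assignment exists. It does not validate the values, and it does not look at Makefile.Config.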

Running and re-running pipeline steps

To re-run the last operation performed by a pipeline step, remove its step_verification_RELEASE.txt file. To re-run a step entirely from scratch, use the command make clean. In that case, data already inserted into the database are not removed automatically, for safety: you need to remove them yourself. The documentation of each step often explains how to do it. The clean target only takes care of the generated files.
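
Removing the verification file by hand can be wrapped in a small helper that refuses to act when the file is missing. This is a sketch under assumptions: the function name is hypothetical, the release number 14 is the example value from the introduction, and the output folder has to be supplied by the caller because its location depends on the step.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Bgee pipeline).
# RELEASE defaults to the example value used in the introduction; set it to
# the release actually being run.
RELEASE="${RELEASE:-14}"

# Remove the verification file of a step, so that the next `make` in the step
# folder re-runs its last operation.
# $1: output folder of the step
forget_last_operation() {
    output_dir="$1"
    verification_file="${output_dir}/step_verification_${RELEASE}.txt"
    if [ -f "$verification_file" ]; then
        rm -- "$verification_file"
        echo "removed ${verification_file}"
    else
        echo "nothing to remove: ${verification_file} not found" >&2
        return 1
    fi
}
```

Running make in the step folder afterwards re-runs the last operation. Neither this helper nor make clean touches the database.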
