GitHub - Gardner-BinfLab/EnTrI

Branches Tags
Name		Name	Last commit message	Last commit date
Latest commit History 155 Commits
bin		bin
data		data
documents		documents
figures		figures
results		results
.gitignore		.gitignore
README		README
core-essential.xlsx		core-essential.xlsx
Repository files navigation

============ CLUSTER PROTEINS
0. Annotate Enterobacter using prokka
	prokka --outdir ~/Desktop/prokka --locustag ENC --genus Enterobacter --species cloacae --kingdom Bacteria --gram neg ~/EnTrI/data/fasta-genome/chromosome/ENC.fa
	Convert the gbf file to embl using seqret:
	seqret
	Read and write (return) sequences
	Input (gapped) sequence(s): ~/Desktop/prokka/ENC_02162017.gbf 
	output sequence(s) [fp929040.fasta]: embl::ENC_02162017.embl
1. Convert gff + fasta files to embl using seqret tool from EMBOSS package (If the fasta and gff files contain chromosome and plasmid sequences, separate them)
	seqret -sequence CS17p1.fna -feature -fformat gff -fopenfile CS17p1.gff -osformat embl -auto
2. If the CDS IDs are not standard (like in CS17p1-4 in our case):
	sed -i -e 's/CS17p1|SC|contig000001://g' CS17p1.embl
3. Remove locus tag "ERS227112_02466:gene" from "Klebsiella_pneumoniae_RH201207_v0.3.embl" embl file. This gene has two locus tags: "ERS227112_02466" and "ERS227112_02466:gene", which probably is a mistake in annotation, and the ":gene" at the end of this tag causes jackhmmer name this gene "gene" which is impossible to track.
4. python correct-annotations.py and manually correct the added genes.
5. Convert embl files to protein fasta files
	cd EnTrI/bin
	mkdir ../data/fasta-protein
	mkdir ../data/fasta-protein/chromosome
	for filename in ../data/embl/chromosome/*.embl; do embl2peptides.py $filename ../data/fasta-protein/chromosome/"$(basename $filename | cut -d. -f1).fasta"; done
6. Add SL3261 to the sequences by copying SL1344 and removing aroA, ycaL and cmk genes from it (Lars' NAR paper) and renaming sequences.
7. Make a fasta file with all sequences in it.
	cd EnTrI/data/fasta-protein/chromosome
	cat * > seqdb.fasta
8. Run homclust to cluster homologous proteins
	cd EnTrI/bin
	homclust.py ../data/fasta-protein/chromosome/Citrobacter_rodentium_ICC168_FN543502_v1.fasta ../data/fasta-protein/chromosome/seqdb.fasta ../results/homclust -ns 14 -s 5
9. Convert embl files to DNA fasta files
	cd EnTrI/bin
	mkdir ../data/fasta-dna
	mkdir ../data/fasta-dna/chromosome
	for filename in ../data/embl/chromosome/*.embl; do embl2dna.py $filename ../data/fasta-dna/chromosome/"$(basename $filename | cut -d. -f1).fasta"; done
10. Add SL3261 to the sequences by copying SL1344 and removing aroA, ycaL and cmk genes from it (Lars' NAR paper) and renaming sequences.
11. Make a fasta file with all sequences in it.
	cd EnTrI/data/fasta-dna/chromosome
	cat * > seqdb.fasta

============ RUN PHYLOSIFT & RAXML
1. cd EnTrI/results/phylosift
2. find ../../data/fasta-genome/chromosome/ -maxdepth 1 -name "*.fa" -exec ~/program-bank/phylosift_v1.0.1/phylosift search --isolate --besthit {} \;
3. find ../../data/fasta-genome/chromosome/ -maxdepth 1 -name "*.fa" -exec ~/program-bank/phylosift_v1.0.1/phylosift align --isolate --besthit {} \;
4.a. for f in PS_temp/*/alignDir/concat.codon.updated.1.fasta; do cat "$f" >> codon_alignment.fa; done
4.b. for f in PS_temp/*/alignDir/concat.updated.1.fasta; do cat "$f" >> protein_alignment.fa; done
5. Run RaXML
	a. raxmlHPC -s protein_alignment.fa -n phylosift-aa.raxmlbootstrap -m PROTGAMMALG4M -p 1234 -f a -x 1234 -# 100
	b. raxmlHPC -s codon_alignment.fa -n phylosift-n.raxmlbootstrap -m GTRCAT -p 1234 -f a -x 1234 -# 100

============ RUN HIERANOID
1. Copy all the protein fasta sequences to program-bank/hieranoid/data/sequences and replace all '<>s with _prime().
2. Copy the species tree to program-bank/hieranoid/data/tree, root it, and add node names to it.
3. Edit program-bank/hieranoid/Configurations/Configuration.pm
4. cd program-bank/hieranoid/
5. perl hieranoid.pl
6. Copy n14.OGTree.txt to results/hieranoid and rename it to hieranoid-result.txt
7. python edit-hieranoid.py
8. python edit-hieranoid-with-inparalogs.py

============ FIND ESSENTIAL GENES
1. cd EnTrI/bin
2. Rscript calculate-insertion-index.R
3. Rscript calculate-insertion-index_not-trimmed.R

============ BENCHMARK DIFFERENT METHODS
1. Download the list of E.coli K-12 essential genes from ecogenes and save it as ecogene-k12.txt in results directory.
2. python find-k12-counterparts.py and put the results in results/ecogenecounterparts
3. Make monte-carlo directory in results and run Rscript montecarlo.R
4. Rscript benchmark-essentiality-calls.R

============ STUDY BIASES
1. Make logos for top 100 most frequent insertion sites:
	mkdir ../results/logos
	python3 make-logos-top-100.py
2. Generate a logo for the sequences resulted from the last step.
	cd EnTrI/results/logos
	weblogo -F pdf -A dna -f 100logos.txt -o ../../figures/100logo-prob.pdf -s large -U probability
	Get nucleotide compositions from count-nucleotides.py
	weblogo -F pdf -A dna -f 100logos.txt -o ../../figures/100logo-bits.pdf -s large --composition "{'A':23, 'C':27, 'G':27, 'T':23}"
3. python3 insertion-position-bias.py
4. Rscript insertion-position-bias.R
5. python3 check-biases.py
6. python3 check-biases_not-trimmed.py
7. Rscript check-biases-ii.R
8. Rscript compare-dbscan-gamma.R
9. Rscript compare-bias-predictions.R

============ EVOLUTION OF ESSENTIALITY
1. python3 define-core-accessory-hieranoid.py
	python3 define-core-accessory-hieranoid-fasta.py (edit to change thresholds to 80,90, and 100)
	python3 define-core-accessory-hieranoid-fasta-2.py --> used for generating table for important genes and testing wether they are core essential or not.
2. Make phylogenetic trees for the subsets of species (in code) and place them in bin/speciestrees, and bin/speciestrees-no-k12
# 3. python3 define-core-accessory-hieranoid-fdollop.py
# 4. python3 define-core-accessory-hieranoid-ancestralii.py
5. python3 define-core-accessory-hieranoid-fitch.py
	cd EnTrI/results/define-core-accessory-hieranoid-fitch-cores/
	mkdir alignments
	for filename in clust*; do mafft --text --quiet $filename > alignments/$filename.msa; done
	mkdir profiles
	cd alignments
	for filename in *.msa; do hmmbuild -o /dev/null ../profiles/$filename $filename; done
	cd ../profiles/
	cat * > corefam.hmm
	hmmpress corefam.hmm
	Go to homology folder on cluster
	qsub embl2peptide.sh
	qsub run_hmmsearch.sh
	qsub run_con-ess.sh
	scp fas31@abacus:/data/people/fas31/homology/homologs.count /home/fatemeh/EnTrI/results/homologs.count
6. Rscript upsetr-hieranoid.R
7. Annotate the phylogenetic tree
8. Correct tree-data.R and run it
9. Generate a table for core essential and ancestrally essential genes
	python3 table-core-genes.py

============ STUDY CLUSTERS
1. Plot cluster size distribution (This is not a general script and can only be used for this case)
	cd EnTrI/bin
	Rscript orfan-single-multiple.R
2. Run merge-clust-plot to add insertion indices to genes in clusters
	cd EnTrI/bin
	merge-clust-plot.py ../results/homclust/EFam-clusters/ ../results/biases/dbscan/ ../results/ecogene-k12.txt ../results/merge-clust-plot
3. Plot the insertion index distribution in different clusters (This is not a general script and can only be used for this case)
	cd EnTrI/bin  
	Rscript clust2plot.R
4. Count multicopy genes that are essential when single copy:
	Rscript multicopy_ess-when-sing-copy.R

============ ENRICHMENT
1. Word enrichment in beneficial losses
	cd EnTrI/bin
	python study-beneficial-losses.py
	Rscript study-beneficial-losses.R
# 2. Word enrichment in low GC genes:
	cd EnTrI/bin
	python study-low-gc.py
	Rscript study-low-gc.R
# 3. Codon enrichment in 5' end of essential genes:
	cd EnTrI/bin
	python essential5prime.py
	Rscript essential5prime.R
4. KEGG analysis
	cd EnTrI/bin
	Rscript buildpathwayDB.R -o KEGG_organism_code -d path_to_save_database
	Set pathanalysis.R parameters: essentiality class, pdf file name, plot main.
	Rscript pathanalysis.R
	Rscript pathanalysis-coregenes.R -> change between core genes and ancestral genes.
# 5. Difference between ancestralii, fdollop, and intersection
	cd EnTrI/bin
	python3 core-definition-method-difference.py
	Rscript core-definition-method-difference.R

============ STUDY INTERESTING GENES
1. python interesting_genes.py
2. python mark-duplications.py run for both universally-conserved_always-essential.tsv and sometimes-essential.tsv
3. Rscript interestinggenes_heatmap.R
# 4. change the first bit of interestinggenes_heatmap (up to write.table) to add pathways to genes.

# ============ ARE CONSERVED GENES ESSENTIAL?
# 1. Download bacterial and archaeal core genes from:
# 	https://figshare.com/articles/Systematically_identify_phylogenetic_markers_at_different_taxonomic_levels_for_bacteria_and_archaea/722713	
# 	paper:
# 	http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0077033
# 2. cd results/core-bacteria_and_archaea-genes
# 	for filename in *.hmm; do hmmpress $filename; done
# 3. for filename in *.hmm; do hmmscan --domtblout $filename.domtblout $filename ~/EnTrI/data/fasta-protein/chromosome/U00096.fasta; done
# 4. make a file containing all genes.

============ Ideal insertion density?
Run density-SL1344.R

============ ARE CONSERVED GENES MORE LIKELY TO BE ESSENTIAL?
1. Make HMM profiles for core genes and core essential genes
	cd EnTrI/results/define-core-accessory-hieranoid-fitch-cores
	mkdir alignments && mkdir profiles
	for filename in clust*; do mafft --text --quiet $filename > alignments/$filename.msa; done
	cd alignments
	for filename in *.msa; do hmmbuild -o /dev/null ../profiles/$filename $filename; done
	cd ../profiles
	cat * > corefam.hmm
	hmmpress corefam.hmm
	cd ..
	mkdir hhmake
	for filename in alignments/*; do hhmake -i $filename -a hhmake/hhmake.out -M 50; done
	Do the same for EnTrI/results/define-core-accessory-hieranoid-fitch-core-essentials
2. Write cluster name, gene name, and pathways in the same file for ancestrally essential genes.
	python which-genes.py
3. Follow what we did on cluster: /data/people/fas31/homology/readme.txt
4. 
	mkdir EnTrI/results/walking-hypergeometric-test
	Make essentiality.txt
	cd EnTrI/bin
	Rscript gsea.R -d essentiality.txt -l conservation.out -o ~/EnTrI/results/walking-hypergeometric-test
5. Do the same for endosymbionts and DEG.
============ COMPARISON OF ANCESTRALLY ESSENTIAL GENES AND ENDOSYMBIONTS
1. Download Buchnera, Sodalis, Wigglesworthia, Baumannia Cicadellinicola, Blochmania fasta and embl files.
2. for filename in ~/EnTrI/results/endosymbionts/embl/*.txt; do embl2peptides.py $filename ~/EnTrI/results/endosymbionts/fasta-protein/"$(basename $filename | cut -d. -f1).fasta"; done
3. find /home/fatemeh/EnTrI/results/endosymbionts/fasta-genome/ -maxdepth 1 -name "*.fasta" -exec ~/program-bank/phylosift_v1.0.1/phylosift search --isolate --besthit {} \;
4. find /home/fatemeh/EnTrI/results/endosymbionts/fasta-genome/ -maxdepth 1 -name "*.fasta" -exec ~/program-bank/phylosift_v1.0.1/phylosift align --isolate --besthit {} \;
5. for f in PS_temp/*/alignDir/concat.updated.1.fasta; do cat "$f" >> protein_alignment.fa; done
6. raxmlHPC -s protein_alignment.fa -n phylosift-aa.raxmlbootstrap -m PROTGAMMALG4M -p 1234 -f a -x 1234 -# 100
7. Get essential genes from DEG and divide the file into different species and then run Hieranoid after rooting and editing the species tree. Follow the steps for running hieranoid and editing output from above.
8. python3 define-core-accessory-hieranoid-endosymbionts.py
9. 
	cd EnTrI/results/endosymbionts/define-core-accessory-hieranoid/core-essential-genomes/
	mkdir alignments && mkdir profiles
	for filename in clust*; do mafft --text --quiet $filename > alignments/$filename.msa; done
	cd alignments
	for filename in *.msa; do hmmbuild -o /dev/null ../profiles/$filename $filename; done
	cd ../profiles
	cat * > corefam.hmm
	cd ..
	mkdir hhmake
	for filename in alignments/*; do hhmake -i $filename -o hhmake/"endo-$(basename $filename | cut -d. -f1).hmm" -M 50; done
	
============ COMPARISON OF ANCESTRALLY ESSENTIAL GENES AND DEG
1. Download DEG embl files and genomes (-enterobacteriaceae) and concat chromosomes.
2. find /home/fatemeh/EnTrI/results/deg/fasta-genome/ -maxdepth 1 -name "*.fasta" -exec ~/program-bank/phylosift_v1.0.1/phylosift search --isolate --besthit {} \;
3. find /home/fatemeh/EnTrI/results/deg/fasta-genome/ -maxdepth 1 -name "*.fasta" -exec ~/program-bank/phylosift_v1.0.1/phylosift align --isolate --besthit {} \;
4. for f in PS_temp/*/alignDir/concat.updated.1.fasta; do cat "$f" >> protein_alignment.fa; done
5. raxmlHPC -s protein_alignment.fa -n phylosift-aa.raxmlbootstrap -m PROTGAMMALG4M -p 1234 -f a -x 1234 -# 100
6. Get essential genes from DEG and divide the file into different species and then run Hieranoid after rooting (for rooting separate gram negatives from gram positives in DEG) and editing the species tree. Follow the steps for running hieranoid and editing output from above.
7. python define-core-accessory-hieranoid-fitch-deg.py
	python3 define-core-accessory-hieranoid-deg.py
8. 
	cd EnTrI/results/deg/define-core-accessory-hieranoid/core-essential-genomes/
	mkdir alignments && mkdir profiles
	for filename in clust*; do mafft --text --quiet $filename > alignments/$filename.msa; done
	cd alignments
	for filename in *.msa; do hmmbuild -o /dev/null ../profiles/$filename $filename; done
	cd ../profiles
	cat * > corefam.hmm
	cd ..
	mkdir hhmake
	for filename in alignments/*; do hhmake -i $filename -o hhmake/"deg-$(basename $filename | cut -d. -f1).hmm" -M 50; done
9. python venn-entero-deg-endo.py 1-3
10. python pathways-for-clusts.py
11. python venn-entero-deg-endo-4.py
12. python entero-deg-endo-heatmap-1.py
13. annotate the output with eggnog mapper
14. python entero-deg-endo-heatmap-2.py
15. Rscript entero-deg-endo-heatmap-3.R (use partition-seqdb.py in results/deg/fasta-proteins beforehand)
16. python entero-deg-endo-heatmap-4.py , 5, 6
17. Manually correct NAs from these files:
	heatmap.tsv, U00096.fasta.mapper.annotations, final-clusters-plus-single-groups.fasta.emapper.annotations, final-clusters-plus-single-groups, and /home/fatemeh/EnTrI/results/define-core-accessory-hieranoid/all/core-essential-genomes

============ COMPARISON OF ACCESSORY ESSENTIAL GENES
accessory-essential-enterobacteriaceae_1..5

============ WHY DO WE HAVE INSERTIONS IN 5' END?
1. python 5prime-start-codon.py

============ COMPARE CORE ESSENTIAL AND ANCESTRALLY ESSENTIAL
1. ancestral-not-core.py
2. ancestral-not-core.R

============ RUN DCGO
1. hmmscan --cut_ga --domtblout ~/EnTrI/results/dcgo/seqdb_Pfam-A.domtblout -o /dev/null ~/data/pfam/Pfam-A.hmm ~/EnTrI/data/fasta-protein/chromosome/seqdb.fasta
2. dcgo.py and dcgo.R

============ ARE BENEFICIAL LOSSES COPIED IN PLASMIDS?
1. for filename in ../data/embl/new-plasmid/*.embl; do embl2peptides.py $filename ../data/fasta-plasmid/"$(basename $filename | cut -d. -f1).fasta"; done
2. change locus tags so that they are similar to chromosomal genomes, concat genes of plasmids for the same chromosome, and rename files.
3. cat ~/EnTrI/results/homclust/EFam-HMMs*.hmm > ~/EnTrI/results/beneficialloss-plasmid/all.hmm
4. hmmpress all.hmm
5. hmmscan -o /dev/null --domtblout ~/EnTrI/results/beneficialloss-plasmid/hmmscan.domtblout -E 1e-10 ~/EnTrI/results/beneficialloss-plasmid/all.hmm ~/EnTrI/data/fasta-plasmid/seqdb.fasta
6. python beneficialloss-plasmid.py

============ Generate giant table
#1. python giant-tab-1-del.py
#2.
	cd results/giant-tab/clusters/
	mkdir alignments && mkdir profiles
	for filename in clust*; do mafft --text --quiet $filename > alignments/$filename.msa; done
	cd alignments
	for filename in *.msa; do hmmbuild -o /dev/null ../profiles/$filename $filename; done
	cd ../profiles
	cat * > corefam.hmm
	cd ..
	mkdir hhmake
	for filename in alignments/*; do hhmake -i $filename -o hhmake/"entcore-$(basename $filename | cut -d. -f1).hmm" -M 50; done

#1. python giant-tab-1.py
1. cd EnTrI/results/all-hieranoid/
	find fasta-genome/ -maxdepth 1 -name "*.fa*" -exec ~/program-bank/phylosift_v1.0.1/phylosift search --isolate --besthit {} \;
	find fasta-genome/ -maxdepth 1 -name "*.fa*" -exec ~/program-bank/phylosift_v1.0.1/phylosift align --isolate --besthit {} \;
	for f in PS_temp/*/alignDir/concat.updated.1.fasta; do cat "$f" >> protein_alignment.fa; done
	raxmlHPC -s protein_alignment.fa -n phylosift-aa.raxmlbootstrap -m PROTGAMMALG4M -p 1234 -f a -x 1234 -# 100
2. Copy all the protein fasta sequences to program-bank/hieranoid/data/sequences and replace all '<>s with _prime().
3. Copy the species tree to program-bank/hieranoid/data/tree, root it, and add node names to it.
4. Edit program-bank/hieranoid/Configurations/Configuration.pm
5. cd program-bank/hieranoid/
6. perl hieranoid.pl
7. for all protein sequences run cat * > seqdb.fasta to have all sequences together
8. edit the addresses and run python3 edit-hieranoid.py
9. run annotate-clust.py and run eggnog-mapper on the output (cluster-representatives.txt) to find gene names for clusters.
10. run giant-tab-1.py, giant-tab-2.py and giant-tab-3.R