# Utilities for SAP (Statistical Assignment Package)

These are scripts for use with the Statistical Assignment Package (SAP). The scripts are geared toward running SAP jobs in parallel on a computing cluster: a large fasta file can be split into individual sequences, and each sequence can be run separately on the cluster.
## split_fasta.pl

Slightly modified code from the Harvard Scriptome that splits a fasta file into smaller, sequentially numbered fasta files.
The default is 10 sequences per file, but this can be adjusted with the `--length` option:

    ./split_fasta.pl input.fasta --length 5
If input.fasta has 100 sequences, this creates files named 1.fasta through 20.fasta, each containing five sequences.
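The splitting behavior can be sketched in Python (an illustrative stand-in, not the Perl script itself; only the numbered-output naming convention above is taken from the script):

```python
# Sketch of split_fasta-style splitting: write every `size` fasta records
# from the input into sequentially numbered files 1.fasta, 2.fasta, ...
# Illustrative Python stand-in for the Perl script described above.

def split_fasta(path, size=10):
    """Split `path` into 1.fasta, 2.fasta, ... with `size` records each."""
    records, current = [], []
    with open(path) as fh:
        for line in fh:
            # a ">" header starts a new record; flush the previous one
            if line.startswith(">") and current:
                records.append("".join(current))
                current = []
            current.append(line)
        if current:
            records.append("".join(current))
    # write records out in chunks of `size`, numbering files from 1
    for i in range(0, len(records), size):
        with open(f"{i // size + 1}.fasta", "w") as out:
            out.writelines(records[i:i + size])
```

With 12 input sequences and `size=5`, this produces 1.fasta and 2.fasta with five sequences each and 3.fasta with the remaining two.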
## SAP-batch.job

An SGE array job designed for the Hydra cluster that executes a series of SAP jobs. Requires sequentially numbered fasta files as generated by split_fasta.pl.
The job file should be modified before running:
- Change the queue and memory options if necessary.
- Change the line with `#$ -t 1-100` to match the number of fasta files you have (e.g. `#$ -t 1-20`, or even `#$ -t 10-20` to start in the middle).
- Change "[email protected]" to your email address in the "sap" line.
- Change the sap options, leaving `$SGE_TASK_ID.fasta &>$SGE_TASK_ID.log` at the end. `$SGE_TASK_ID` refers to the current array task number.
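Putting the steps above together, a job file might look roughly like the sketch below. The queue name, memory request, and sap flags are placeholders, not the file's actual contents; only the `-t` range and the `$SGE_TASK_ID.fasta &>$SGE_TASK_ID.log` ending follow the rules described above:

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -q sThC.q             # placeholder queue name; adjust if necessary
#$ -l mres=2G,h_data=2G  # placeholder memory request; adjust if necessary
#$ -t 1-20               # one array task per numbered fasta file
# the sap options below are illustrative placeholders, not defaults
sap --project $SGE_TASK_ID --email [email protected] $SGE_TASK_ID.fasta &>$SGE_TASK_ID.log
```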
## find-failed.sh

Use this after your jobs have completed. This script finds runs that did not complete: it looks for the file index.html in each sequentially numbered directory. If index.html is missing, the fasta file from that directory is appended to a file failed.fasta for resubmission with the split-and-run.pl script.

To check directories 1 through 10 for index.html:

    ./find-failed.sh 10
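The check itself is simple. In Python the same logic would look roughly like this (the assumption that directory `i` holds a file named `i.fasta` is mine; the real find-failed.sh may name things differently):

```python
# Sketch of find-failed.sh logic: collect the fasta files from numbered
# directories that lack an index.html into failed.fasta for resubmission.
# Illustrative stand-in; directory/file naming here is an assumption.
import os

def find_failed(n, out="failed.fasta"):
    """Append the fasta from each of directories 1..n lacking index.html to `out`."""
    with open(out, "w") as failed:
        for i in range(1, n + 1):
            if not os.path.exists(os.path.join(str(i), "index.html")):
                # run i produced no index.html, so treat it as failed
                with open(os.path.join(str(i), f"{i}.fasta")) as fh:
                    failed.write(fh.read())
```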
## SAP-parse-all-files.py

This script finds all SAP output files in the given directory, extracts the assignments made, and outputs a csv file summarizing the results. It can take an existing csv file (e.g. an OTU table) whose ID numbers match the SAP output and combine that table with the new SAP data.

The script parses the file "classic.html", which is generated by SAP version ≥ 1.9.3. I will soon be modifying this script to use the newer output format, which produces a csv; that will greatly simplify the script, since html will no longer need to be parsed.

Note: this script requires the Python package BeautifulSoup, which can be installed with `easy_install beautifulsoup4` or `pip install beautifulsoup4` on many systems. On the Hydra cluster, run the command `module load bioinformatics/anaconda/2.2.0`.
    usage: SAP-parse-all-files.py [-h] [-out OUT] [-otutable OTUTABLE]
                                  [-l LEVEL] [-p] [-v]
                                  DIRECTORY

    Extracts taxonomic rankings from SAP html output

    positional arguments:
      DIRECTORY           Directory that contains one or more SAP "classic.html"
                          output files. The "classic.html" files can be nested
                          in other directories.

    optional arguments:
      -h, --help          show this help message and exit
      -out OUT            Name of output file of taxonomy (default "SAP-out.csv").
      -otutable OTUTABLE  Name of optional otu table in csv format. If given, the
                          OTU data will be added to the taxonomy.
      -l LEVEL, --level LEVEL
                          The cutoff assignment level (default: 80) (possible
                          values: 80, 90, 95).
      -p, --prob          Omit outputting the probability level for each ranking.
      -v, --verbose       Increased verbosity while parsing the html files.
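The BeautifulSoup extraction pattern the script relies on can be illustrated as follows. The actual layout of classic.html is not reproduced here; the `<table>` structure, column order, and taxon values below are invented for the example, and only the 80% default cutoff comes from the usage text above:

```python
# Illustrative only: the real classic.html layout differs; the table
# structure and column meanings here are assumptions for the example.
from bs4 import BeautifulSoup

html = """
<html><body><table>
  <tr><td>family</td><td>Sphingidae</td><td>0.97</td></tr>
  <tr><td>genus</td><td>Manduca</td><td>0.85</td></tr>
</table></body></html>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all("tr"):
    # each assumed row: rank, assigned taxon, posterior probability
    rank, taxon, prob = (td.get_text(strip=True) for td in tr.find_all("td"))
    if float(prob) >= 0.80:  # mirrors the default cutoff assignment level of 80
        rows.append((rank, taxon, prob))
print(rows)
```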