# Utilities for SAP (Statistical Assignment Package)

These are scripts for use with the Statistical Assignment Package (SAP). The scripts are geared toward running SAP jobs in parallel on a computing cluster: a large fasta file can be split into individual sequences, and each sequence can be run separately on the cluster.
## split_fasta.pl

Slightly modified code from the Harvard Scriptome that splits a fasta file into smaller, sequentially numbered fasta files.
The default is 10 sequences per file, but this can be adjusted with the `--length` option:

    ./split_fasta.pl input.fasta --length 5
If input.fasta has 100 sequences, this creates files named 1.fasta through 20.fasta, each containing five sequences.
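The splitting behavior can be sketched in Python (an illustrative stand-in, not the Perl script itself; only the numbered-output naming convention above is taken from the script):

```python
# Sketch of split_fasta-style splitting: write every `size` fasta records
# from the input into sequentially numbered files 1.fasta, 2.fasta, ...
# Illustrative Python stand-in for the Perl script described above.

def split_fasta(path, size=10):
    """Split `path` into 1.fasta, 2.fasta, ... with `size` records each."""
    records, current = [], []
    with open(path) as fh:
        for line in fh:
            # a ">" header starts a new record; flush the previous one
            if line.startswith(">") and current:
                records.append("".join(current))
                current = []
            current.append(line)
        if current:
            records.append("".join(current))
    # write records out in chunks of `size`, numbering files from 1
    for i in range(0, len(records), size):
        with open(f"{i // size + 1}.fasta", "w") as out:
            out.writelines(records[i:i + size])
```

With 12 input sequences and `size=5`, this produces 1.fasta and 2.fasta with five sequences each and 3.fasta with the remaining two.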
## SAP-batch.job

An SGE array job designed for the Hydra cluster that executes a series of SAP jobs. Requires sequentially numbered fasta files as generated by split_fasta.pl.
The job file should be modified before running:
- Change the queue and memory options if necessary.
- Change the line with `#$ -t 1-100` to match the number of fasta files you have (e.g. `#$ -t 1-20`, or even `#$ -t 10-20` to start in the middle).
- Change "[email protected]" to your email address in the "sap" line.
- Change the sap options, leaving `$SGE_TASK_ID.fasta &>$SGE_TASK_ID.log` at the end. `$SGE_TASK_ID` refers to the current array task number.
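Putting the steps above together, a job file might look roughly like the sketch below. The queue name, memory request, and sap flags are placeholders, not the file's actual contents; only the `-t` range and the `$SGE_TASK_ID.fasta &>$SGE_TASK_ID.log` ending follow the rules described above:

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -q sThC.q             # placeholder queue name; adjust if necessary
#$ -l mres=2G,h_data=2G  # placeholder memory request; adjust if necessary
#$ -t 1-20               # one array task per numbered fasta file
# the sap options below are illustrative placeholders, not defaults
sap --project $SGE_TASK_ID --email [email protected] $SGE_TASK_ID.fasta &>$SGE_TASK_ID.log
```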
## find-failed.sh

Use this after your jobs have completed. This script finds runs that did not complete: it looks for the file index.html in each sequentially numbered directory. If index.html is missing, the fasta file from that directory is appended to a file failed.fasta for resubmission with the split-and-run.pl script.

To check directories 1 through 10 for index.html:

    ./find-failed.sh 10
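The check itself is simple. In Python the same logic would look roughly like this (the assumption that directory `i` holds a file named `i.fasta` is mine; the real find-failed.sh may name things differently):

```python
# Sketch of find-failed.sh logic: collect the fasta files from numbered
# directories that lack an index.html into failed.fasta for resubmission.
# Illustrative stand-in; directory/file naming here is an assumption.
import os

def find_failed(n, out="failed.fasta"):
    """Append the fasta from each of directories 1..n lacking index.html to `out`."""
    with open(out, "w") as failed:
        for i in range(1, n + 1):
            if not os.path.exists(os.path.join(str(i), "index.html")):
                # run i produced no index.html, so treat it as failed
                with open(os.path.join(str(i), f"{i}.fasta")) as fh:
                    failed.write(fh.read())
```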
## SAP-parse-all-files.py

This script finds all SAP output files in the given directory, extracts the assignments made, and outputs a csv file summarizing the results. It can take an existing csv file (e.g. an OTU table) whose ID numbers match the SAP output and combine that table with the new SAP data.

The script parses the file "classic.html", which is generated by SAP version ≥ 1.9.3. I will soon be modifying this script to use the newer output format, which produces a csv; that will greatly simplify the script, since html will no longer need to be parsed.

Note: this script requires the Python package BeautifulSoup, which can be installed with `easy_install beautifulsoup4` or `pip install beautifulsoup4` on many systems. On the Hydra cluster, run the command `module load bioinformatics/anaconda/2.2.0`.
    usage: SAP-parse-all-files.py [-h] [-out OUT] [-otutable OTUTABLE]
                                  [-l LEVEL] [-p] [-v]
                                  DIRECTORY

    Extracts taxonomic rankings from SAP html output

    positional arguments:
      DIRECTORY           Directory that contains one or more SAP "classic.html"
                          output files. The "classic.html" files can be nested
                          in other directories.

    optional arguments:
      -h, --help          show this help message and exit
      -out OUT            Name of output file of taxonomy (default "SAP-out.csv").
      -otutable OTUTABLE  Name of optional otu table in csv format. If given, the
                          OTU data will be added to the taxonomy.
      -l LEVEL, --level LEVEL
                          The cutoff assignment level (default: 80) (possible
                          values: 80, 90, 95).
      -p, --prob          Omit outputting the probability level for each ranking.
      -v, --verbose       Increased verbosity while parsing the html files.
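The BeautifulSoup extraction pattern the script relies on can be illustrated as follows. The actual layout of classic.html is not reproduced here; the `<table>` structure, column order, and taxon values below are invented for the example, and only the 80% default cutoff comes from the usage text above:

```python
# Illustrative only: the real classic.html layout differs; the table
# structure and column meanings here are assumptions for the example.
from bs4 import BeautifulSoup

html = """
<html><body><table>
  <tr><td>family</td><td>Sphingidae</td><td>0.97</td></tr>
  <tr><td>genus</td><td>Manduca</td><td>0.85</td></tr>
</table></body></html>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all("tr"):
    # each assumed row: rank, assigned taxon, posterior probability
    rank, taxon, prob = (td.get_text(strip=True) for td in tr.find_all("td"))
    if float(prob) >= 0.80:  # mirrors the default cutoff assignment level of 80
        rows.append((rank, taxon, prob))
print(rows)
```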