# Utilities for SAP (Statistical Assignment Package)

These scripts are for use with the Statistical Assignment Package (SAP). They are geared toward running SAP jobs in parallel on a computing cluster: a large fasta file can be split into individual sequences, and each sequence can be run separately on the cluster.
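
A rough sketch of the overall workflow, assuming an input file named all_seqs.fasta that splits into 100 single-sequence files (the file name and counts are placeholders):

```
./split_fasta.pl all_seqs.fasta --length 1      # one sequence per file: 1.fasta, 2.fasta, ...
qsub SAP-batch.job                              # submit the SGE array job (edit it first, see below)
./find-failed.sh 100                            # after the jobs finish, collect runs that failed
./SAP-parse-all-files.py -out SAP-out.csv .     # summarize the assignments into a csv
```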

## split_fasta.pl

Slightly modified code from the Harvard Scriptome that splits a fasta file into smaller, sequentially numbered fasta files.

The default is 10 sequences per file, but this can be adjusted with the `--length` option:

./split_fasta.pl input.fasta --length 5

If input.fasta has 100 sequences, this creates files named 1.fasta through 20.fasta, each with five sequences.

## SAP-batch.job

An SGE array job designed for the Hydra cluster that executes a series of SAP jobs. It requires sequentially numbered fasta files as generated by split_fasta.pl.

The job file should be modified before running (see the example sketch after this list):

* Change the queue and memory options if necessary.
* Change the line with "#$ -t 1-100" to match the number of fasta files you have (e.g. #$ -t 1-20, or even #$ -t 10-20 to start in the middle).
* Change "[email protected]" in the "sap" line to your email address.
* Change the sap options, leaving $SGE_TASK_ID.fasta &>$SGE_TASK_ID.log at the end. $SGE_TASK_ID expands to the number of the current array task.
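
A minimal sketch of what the edited job file might look like. The queue name, memory request, and sap options shown here (--project, --email) are assumptions for illustration only; keep the sap options from your own command and adjust just the parts listed above.

```
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -j y
#$ -q sThC.q        # queue (placeholder; use your cluster's queue)
#$ -l mres=2G       # memory request (placeholder)
#$ -t 1-20          # one array task per numbered fasta file

# $SGE_TASK_ID is the current task number (1, 2, ... 20) and matches a fasta file.
sap --project $SGE_TASK_ID --email you@example.com $SGE_TASK_ID.fasta &>$SGE_TASK_ID.log
```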

## find-failed.sh

Use this after your jobs have completed. This script finds runs that did not complete: it looks for the file index.html in each sequentially numbered directory. If index.html is missing, the fasta file from that directory is added to a file failed.fasta for resubmission with the split-and-run.pl script.

For example, to check directories 1 through 10 for index.html:

./find-failed.sh 10
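
The check amounts to something like the following loop (a sketch only; it assumes the per-task fasta files sit next to the numbered output directories as 1.fasta, 2.fasta, ...):

```
# For each numbered run, treat a missing index.html as a failed run and
# collect that run's sequences into failed.fasta for resubmission.
for i in $(seq 1 "$1"); do
  if [ ! -f "$i/index.html" ]; then
    cat "$i.fasta" >> failed.fasta
  fi
done
```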

## SAP-parse-all-files.py

This finds all SAP output files in the given directory, extracts the assignments made, and outputs a csv file summarizing the results. It can also take an existing csv file (e.g. an OTU table) whose ID numbers match the SAP output and combine that table with the new SAP data.

This script parses the file "classic.html", which is generated by SAP version ≥1.9.3. I will soon modify this script to use the newer output format, which produces a csv directly; that will greatly simplify the script, since html will no longer need to be parsed.

Note: this script requires the Python package BeautifulSoup, which can be installed with `easy_install beautifulsoup4` or `pip install beautifulsoup4` on many systems. On the Hydra cluster, run `module load bioinformatics/anaconda/2.2.0`.

usage: SAP-parse-all-files.py [-h] [-out OUT] [-otutable OTUTABLE]
                              [-l LEVEL] [-p] [-v]
                              DIRECTORY

Extracts taxonomic rankings from SAP html output

positional arguments:
  DIRECTORY             Directory that contains one or more SAP "classic.html"
                        output files. The "classic.html" files can be nested
                        in subdirectories.

optional arguments:
  -h, --help            show this help message and exit
  -out OUT              Name of output file of taxonomy (default
                        "SAP-out.csv").
  -otutable OTUTABLE    Name of optional otu table in csv format. If given,
                        the OTU data will be added to the taxonomy.
  -l LEVEL, --level LEVEL
                        The cutoff assignment level (default: 80) (possible
                        values: 80, 90, 95).
  -p, --prob            OMIT outputting the probability level for each ranking
  -v, --verbose         Increased verbosity while parsing the html files.
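
For example, to summarize all classic.html files under the current directory, merge in an OTU table, and use the 95% cutoff (the otu table file name here is only a placeholder):

```
./SAP-parse-all-files.py -otutable otu_table.csv -l 95 -out SAP-taxonomy.csv .
```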
