HybPiper version 2.3.0

chrisjackson-pellicle released this 11 Sep 01:32
  • Add option --compress_sample_folder to command hybpiper assemble. After assembly has completed, the sample folder is archived and compressed to <sample_name>.tar.gz.

    • This is useful when running HybPiper on HPC clusters with file number limits.
    • If both an uncompressed and compressed folder exist for a sample, a warning is shown and HybPiper exits.
    • All HybPiper subcommands (stats, recovery_heatmap, retrieve_sequences, paralog_retriever, filter_by_length) work with either compressed or uncompressed sample files/folders, or a combination of both.
    • If a <sample_name>.tar.gz file already exists for a sample, it will be extracted and used for the current run of hybpiper assemble, and the <sample_name>.tar.gz file will be deleted.
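
The folder/tarball rules above can be checked before launching a run. The following is a minimal sketch, not HybPiper code: the helper names sample_state and compress_sample_folder are hypothetical, and the only detail taken from these notes is the <sample_name>.tar.gz naming convention. Unlike HybPiper's option, this sketch leaves the original folder in place after compressing.

```python
import tarfile
from pathlib import Path


def sample_state(parent_dir, sample_name):
    """Classify a sample as 'folder', 'tarball', 'both', or 'missing'.

    Hypothetical helper for illustration; not part of HybPiper.
    """
    parent = Path(parent_dir)
    has_folder = (parent / sample_name).is_dir()
    has_tarball = (parent / f"{sample_name}.tar.gz").is_file()
    if has_folder and has_tarball:
        # Per the release notes, HybPiper warns and exits in this case.
        return "both"
    if has_folder:
        return "folder"
    if has_tarball:
        return "tarball"
    return "missing"


def compress_sample_folder(parent_dir, sample_name):
    """Write <sample_name>.tar.gz next to the folder and return its path.

    The original folder is NOT removed here (an assumption of this sketch).
    """
    parent = Path(parent_dir)
    tarball = parent / f"{sample_name}.tar.gz"
    with tarfile.open(tarball, "w:gz") as handle:
        # arcname keeps the archive rooted at the sample name.
        handle.add(parent / sample_name, arcname=sample_name)
    return tarball
```

A pre-flight loop over sample names that flags any sample reported as "both" would catch the conflict before HybPiper exits on it.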
  • When using BWA for read mapping, the command samtools flagstat is now run during the hybpiper assemble step, rather than during hybpiper stats, and the results are written to the file(s) <sample_name>_bam_flagstat.tsv and/or <sample_name>_unpaired_bam_flagstat.tsv.

    • If the file(s) <sample_name>_bam_flagstat.tsv and/or <sample_name>_unpaired_bam_flagstat.tsv are not present in a sample directory (i.e. the sample was assembled with HybPiper version <2.3.0), samtools flagstat will be run during hybpiper stats. If the sample is a *.tar.gz file, the *.bam file(s) will first be extracted to a temporary directory called temp_bam_files within your current working directory. This temporary directory will be deleted after samtools flagstat has been run.
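
To see in advance which uncompressed samples will trigger the fallback samtools flagstat run during hybpiper stats, one could look for the flagstat files directly. This is a rough sketch under assumptions: find_flagstat_files is a hypothetical helper, and because these notes do not say where inside the sample directory the files are written, it searches recursively.

```python
from pathlib import Path


def find_flagstat_files(sample_dir, sample_name):
    """Return any flagstat TSV files found under an uncompressed sample folder.

    Hypothetical helper; only the two file names come from the release notes.
    """
    names = (
        f"{sample_name}_bam_flagstat.tsv",
        f"{sample_name}_unpaired_bam_flagstat.tsv",
    )
    root = Path(sample_dir)
    # Search recursively, since the exact location is an unknown here.
    return sorted(path for name in names for path in root.rglob(name))
```

An empty result suggests the sample was assembled with a version earlier than 2.3.0, or without BWA.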
  • Add option --not_protein_coding to hybpiper assemble. When this option is provided, sequences matching your target file references will be extracted from SPAdes contigs using BLASTn, rather than Exonerate. This should improve recovery when using a target file with non-protein-coding sequences. Note that this feature is new and might have bugs - please report any issues.

    • Only nucleotide *.FNA sequences will be produced (i.e. no amino-acid sequences).
    • Intronerate will not be run; intron and supercontig sequences will not be produced.
    • If BLASTx or DIAMOND is selected for read mapping (i.e. protein vs translated-nucleotide searches), a warning will be displayed and read mapping will switch to BWA.
  • Add the following options to control BLASTn searches of SPAdes contigs when option --not_protein_coding is used:

    • --extract_contigs_blast_task. Task to use for blastn searches (blastn, blastn-short, megablast, dc-megablast). Default is blastn.
    • --extract_contigs_blast_evalue. Expectation value (E) threshold for saving hits. Default is 10.
    • --extract_contigs_blast_word_size. Word size for wordfinder algorithm (length of best perfect match).
    • --extract_contigs_blast_gapopen. Cost to open a gap.
    • --extract_contigs_blast_gapextend. Cost to extend a gap.
    • --extract_contigs_blast_penalty. Penalty for a nucleotide mismatch.
    • --extract_contigs_blast_reward. Reward for a nucleotide match.
    • --extract_contigs_blast_perc_identity. Percent identity.
    • --extract_contigs_blast_max_target_seqs. Maximum number of aligned sequences to keep (value of 5 or more is recommended). Default is 500.
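
As an illustration of how these options combine, here is a minimal sketch that assembles an argument list for a non-protein-coding run. The function build_assemble_command and the file names are hypothetical. The -t_dna, -r and --prefix arguments are assumed from typical HybPiper usage rather than taken from these notes, so check them against hybpiper assemble --help. The BLASTn option names, task choices and defaults are those listed above.

```python
# Task names listed in the release notes for --extract_contigs_blast_task.
BLAST_TASKS = {"blastn", "blastn-short", "megablast", "dc-megablast"}


def build_assemble_command(target_file, read_files, prefix,
                           blast_task="blastn", blast_evalue=10,
                           max_target_seqs=500):
    """Return an argument list for `hybpiper assemble --not_protein_coding`.

    Hypothetical helper; -t_dna, -r and --prefix are assumptions.
    """
    if blast_task not in BLAST_TASKS:
        raise ValueError(f"Unsupported BLASTn task: {blast_task}")
    return [
        "hybpiper", "assemble",
        "-t_dna", str(target_file),           # assumed flag
        "-r", *[str(f) for f in read_files],  # assumed flag
        "--prefix", prefix,                   # assumed flag
        "--not_protein_coding",
        "--extract_contigs_blast_task", blast_task,
        "--extract_contigs_blast_evalue", str(blast_evalue),
        "--extract_contigs_blast_max_target_seqs", str(max_target_seqs),
    ]
```

The returned list can be passed to subprocess.run, or joined with shlex.join to print a copy-pasteable command line.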
  • The final step of the hybpiper assemble pipeline has been renamed from exonerate_contigs to extract_contigs (as either Exonerate or BLASTn can now be used).

  • Reorganised grouping of help options when running hybpiper assemble --help to improve clarity.

  • Changed option --timeout_assemble for hybpiper assemble to --timeout_assemble_reads to match the step name.

  • Changed option --timeout_exonerate_contigs for hybpiper assemble to --timeout_extract_contigs to match the step name.

  • Changed option --exonerate_hit_sliding_window_size for hybpiper assemble to --trim_hit_sliding_window_size. This option now applies to either Exonerate hits (and is measured in amino-acids) or BLASTn (measured in nucleotides). Defaults are 5 amino-acids (Exonerate; changed from previous default of 3) or 15 nucleotides (BLASTn).

  • Changed option --exonerate_hit_sliding_window_thresh for hybpiper assemble to --trim_hit_sliding_window_thresh. This option now applies to either Exonerate hits (and is measured via amino-acid similarity) or BLASTn (measured via nucleotide similarity). Defaults are 75 for amino-acids (Exonerate; changed from previous default of 55) or 65 for nucleotides (BLASTn).
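
Existing scripts that call hybpiper assemble with the old option names will need updating. The sketch below shows one way to rewrite an argument list; update_option_names is a hypothetical helper, and the mapping contains only the four renames described above.

```python
# Old option name -> new option name, as listed in these release notes.
RENAMED_OPTIONS = {
    "--timeout_assemble": "--timeout_assemble_reads",
    "--timeout_exonerate_contigs": "--timeout_extract_contigs",
    "--exonerate_hit_sliding_window_size": "--trim_hit_sliding_window_size",
    "--exonerate_hit_sliding_window_thresh": "--trim_hit_sliding_window_thresh",
}


def update_option_names(args):
    """Return a copy of args with renamed options replaced.

    Handles both "--option value" and "--option=value" forms. Only exact
    option names are replaced, so already-updated names pass through.
    """
    updated = []
    for arg in args:
        name, sep, value = arg.partition("=")
        updated.append(RENAMED_OPTIONS.get(name, name) + sep + value)
    return updated
```

Renaming alone does not preserve earlier behaviour when a script relied on the old defaults: the Exonerate defaults changed from 3 to 5 (window size) and from 55 to 75 (threshold), so pass the old values explicitly if they are still wanted.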

  • Fixed a bug in fix_targetfile.py - MAFFT is now called via subprocess rather than Bio.Align.Applications.MafftCommandline when checking for best match translations (see issue#156).

  • Added a more informative error message if running hybpiper retrieve_sequences or hybpiper paralog_retriever from HybPiper version >=2.2.0 on sample folders from HybPiper version <2.2.0. This error occurs because the older sample folders do not contain a <prefix>_chimera_check_performed.txt file (see issue#155).

  • When extracting coding sequences from SPAdes contigs using Exonerate, changed the initial Exonerate run to not use the option --refine full (see Exonerate docs), unless the option --exonerate_refine_full is provided to hybpiper assemble. Although the Exonerate option --refine full should improve output alignments, in some cases it can result in spurious alignment regions (e.g. an intron/non-coding region being included as an "exon" alignment) that can be incorporated into the HybPiper output sequence.