Skip to content

Known issues

Guanliang MENG edited this page Jul 30, 2023 · 34 revisions

0. Additional to-do list

1. Circularity problem and its solution

How to check:

Have a look at the file *.mitoAssemble.K*.overlap_information, it may have something like this:

>C3252271 overlap between 5' and 3' are 52bp
TGAACGGAATAGTTGGTAATTAGTTTAATCAAAACAAATGATTTCGACTCA

and check the file *.mitoAssemble.K*.mitogenome.fa:

$ head -1  *.mitoAssemble.K*.mitogenome.fa
>C3252271 topology=circular

Because this overlapping region is quite long and not simple repeats (say AAAAAAAA), (in most cases) we can safely say that we have got a circular mitochondrial genome (there are methods to verify this, see XXX).

To do:

  • Use another Circularity check script (I remember I have written one?) which can be aware of the simple repeats

2. Simple repeats found in the overlapping region

Say we have this file *.mitoAssemble.K51.overlap_information:

>C3892882 overlap between 5' and 3' are 9bp
TTTTTTTT

and we have this file *.mitoAssemble.K51.mitogenome.fa:

>C3892882 topology=circular

In this case, it is dubious that the assembled mt genome is circular. In this case, you can just treat the sequence as linear, or try to assemble the mt genome with different kmers (larger or smaller), which might be able to overcome the simple repeat problem. This depends on the goal of your study, e.g., if you are going to use only the PCGs for subsequent analysis and you have got all PCGs already, then a circular (complete) mt genome does not help. The changing breakpoint method mentioned 3. Breakpoint and incomplete genes should also help.

3. Breakpoint and incomplete genes

The summary.txt file:

#Seq_id        Length(bp)     Circularity    Closely_related_species
C3252271       16597          yes             Onychomys leucogaster

#Seq_id        Start  End    Length(bp) Direction  Type   Gene_name  Gene_prodcut                      Total_freq_occurred
--------------------------------------------------------------------------------------------------------------------------
C3252271       14     82     69         +          tRNA   trnR(ucg)  tRNA-Arg                          1
C3252271       84     381    298        +          CDS    ND4L       NADH dehydrogenase subunit 4L     1
.
.
.
C3252271       14568  14772  205        +          CDS    ATP8       ATP synthase F0 subunit 8         1
C3252271       14729  15410  682        +          CDS    ATP6       ATP synthase F0 subunit 6         1
C3252271       16193  16262  70         +          tRNA   trnG(ucc)  tRNA-Gly                          1
C3252271       16262  >16598 337        +          CDS    ND3        NADH dehydrogenase subunit 3      1

Here, the ND3 cannot find its stop codon, but because this sequence is actually circular already, the stop codon of ND3 is at the 5' end of this sequence.

In this case, you can simply manually change the breakpoint of the mitochondrial genome. But be careful, you should find a breakpoint where there is no gene (especially overlapping regions of different genes), for example, here sites between 14772 and 14729, or the site 16262 are not good positions. Instead, Any positions between 15410 and 16193 can be chosen.

After this, you can re-annotate the sequence using the mitoz annotate command, by also providing the fastq files (--fq1 and/or --fq2 options), MitoZ will calculate the sequencing depth of all sites along the mitogenome, which in turn provides more evidence to show if the mitogenome is really complete or not (check the abundance track on the circos.svg and cirvos.png files) --- if the sequencing depth (abundance) around the original breakpoint is normally high like other sites of the mitogenome, not sudden dropping or increasing a lot (which indicates repeats), then the mitogenome is complete.

4. core error

During testing, when I submitted the job to SGE for running, sometimes it generated a core dump file (e.g. core.81923) in the annotation step, I do not know the exact reason yet. But when I locally re-ran the command, the problem resolved. Complicated Conda environments can also induce problems, e.g. I once used a conda installed by another person to create the mitozEnv environment, and I always got missing PCGs in the annotation step.

5. No Circos images generated

Do not know why sometimes even circos --modules shows every required Perl module is installed, MitoZ still fails to run Circos (thus no circos.svg and circos.png files), which results in something like this in the stderr output:

2022-06-01 17:18:53,822 - mitoz.utility.utility - INFO -
cp -r /home/gmeng/test/sing/mt_annotation/tmp_ttt_ttt.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.png /home/gmeng/test/sing/mt_annotation/tmp_ttt_ttt.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.svg /home/gmeng/test/sing/mt_annotation/tmp_ttt_ttt.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.depth.txt /home/gmeng/test/sing/mt_annotation/ttt.ttt.megahit.mitogenome.fa.result
cp: der Aufruf von stat für „/home/gmeng/test/sing/mt_annotation/tmp_ttt_ttt.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.png“ ist nicht möglich: Datei oder Verzeichnis nicht gefunden
cp: der Aufruf von stat für „/home/gmeng/test/sing/mt_annotation/tmp_ttt_ttt.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.svg“ ist nicht möglich: Datei oder Verzeichnis nicht gefunden

2022-06-01 17:18:53,842 - mitoz.utility.utility - ERROR -
Error occured when running command:
cp -r /home/gmeng/test/sing/mt_annotation/tmp_ttt_ttt.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.png /home/gmeng/test/sing/mt_annotation/tmp_ttt_ttt.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.svg /home/gmeng/test/sing/mt_annotation/tmp_ttt_ttt.megahit.mitogenome.fa_mitoscaf.fa/mt_visualization/circos.depth.txt /home/gmeng/test/sing/mt_annotation/ttt.ttt.megahit.mitogenome.fa.result

Solution:

For each specimen, find the mt_annotation directory:

$ source activate mitozEnv   # or use "mamba" or "conda" instead of "source" the command here.

$ ls  mt_annotation/tmp_*_mitoscaf.fa/mt_visualization/ -d | while read f ; do cd $f ; circos ; cd ../../../ ; done

# to list the resulting files:
$ ls  mt_annotation/tmp_*_mitoscaf.fa/mt_visualization/circos.{svg,png}

If you have multiple specimens within the same directory, for example, /my/project/SampleID_1, /my/project/SampleID_2,

$ cd /my/project/ # go to the project directory containing the assembly directory of each specimen

$ source activate mitozEnv   # or use "mamba" or "conda" instead of "source" the command here.

$ ls  */mt_annotation/tmp_*_mitoscaf.fa/mt_visualization/ -d | while read f ; do cd $f ; circos ; cd ../../../../ ; done
# the '*' here will match your sample IDs

# to list the resulting files:
$ ls  */mt_annotation/tmp_*_mitoscaf.fa/mt_visualization/circos.{svg,png}

# to copy the SVG/PNG files to the resulting directories
# Warning: The below command assumes there is NO '_' in your sample ID
$ ls  */mt_annotation/tmp_*_mitoscaf.fa/mt_visualization/circos.{png,svg} | perl -a -F'/' -ne 'chomp;  $F[2]=~s/tmp\_//; $F[2]=~s/\_mitoscaf\.fa//; my $sample=(split(/\_/,$F[2]))[0]; $F[2]=~s/$sample\_$sample\.//; my $result_dir="$sample/$sample.result/$sample.$sample.$F[2].result"; `cp $_ $result_dir`; '

updates:.

  1. The Singularity version (MitoZ version 3.3) seems good to me, indicating the above problem is probably because my environmental variables on the cluster are somehow complicated (and I do not know the exact reason), if you have the same problem, please try to install mitozEnv into a clean environment or try the Singularity version.

  2. empty depth file

If the XXX.result/XXX.XXX.megahit.mitogenome.fa.result/circos.depth.txt file is empty, it is normal for Circos not to run properly.

6. The option --assembler spades fails

To specify a specific assembler, use the --assembler option.

Warning: --assembler spades only accepts paired-end data, which means that you need to provide both --fq1 and --fq2!

7. can not find taxid for XXX, maybe it's a misspelling.

This can happen if:

  1. typo for the value of the --requiring_taxa option.
  2. Your etetoolkit database is broken. Please re-install it by checking https://github.com/linzhi2013/MitoZ/wiki/Installation#3-the-etetoolkit-database.

8. Megahit gets very long sequences.

While Megahit can generate circular mitogenomes, in rare cases, however, it could generate very long "mitogenomes", for example:

$ cat summary.txt
#Seq_id        Length(bp)     Circularity    Closely_related_species
k141_58720     44502          no             

#Seq_id        Start  End    Length(bp) Direction  Type   Gene_name  Gene_prodcut                      Total_freq_occurred
--------------------------------------------------------------------------------------------------------------------------
k141_58720     22299  22368  70         +          tRNA   trnI(gau)  tRNA-Ile                          2
k141_58720     22367  22439  73         -          tRNA   trnQ(uug)  tRNA-Gln                          2
k141_58720     22442  22510  69         +          tRNA   trnM(cau)  tRNA-Met                          2
k141_58720     22511  23520  1010       +          CDS    ND2        NADH dehydrogenase subunit 2      2
k141_58720     23523  23589  67         +          tRNA   trnW(uca)  tRNA-Trp                          2
k141_58720     23591  23659  69         -          tRNA   trnA(ugc)  tRNA-Ala                          2
k141_58720     23662  23731  70         -          tRNA   trnN(guu)  tRNA-Asn                          2
k141_58720     23763  23830  68         -          tRNA   trnC(gca)  tRNA-Cys                          1
k141_58720     23831  23897  67         -          tRNA   trnY(gua)  tRNA-Tyr                          1
k141_58720     23899  25443  1545       +          CDS    COX1       cytochrome c oxidase subunit I    2
k141_58720     25441  25509  69         -          tRNA   trnS(uga)  tRNA-Ser                          3
k141_58720     25513  25580  68         +          tRNA   trnD(guc)  tRNA-Asp                          2
k141_58720     25582  26265  684        +          CDS    COX2       cytochrome c oxidase subunit II   2
k141_58720     26269  26332  64         +          tRNA   trnK(uuu)  tRNA-Lys                          2
k141_58720     26333  26536  204        +          CDS    ATP8       ATP synthase F0 subunit 8         2
k141_58720     26494  27174  681        +          CDS    ATP6       ATP synthase F0 subunit 6         2
k141_58720     27174  27958  785        +          CDS    COX3       cytochrome c oxidase subunit III  2
k141_58720     27958  28025  68         +          tRNA   trnG(ucc)  tRNA-Gly                          2
k141_58720     28031  28372  342        +          CDS    ND3        NADH dehydrogenase subunit 3      1
k141_58720     28374  28440  67         +          tRNA   trnR(ucg)  tRNA-Arg                          1
k141_58720     28443  28739  297        +          CDS    ND4L       NADH dehydrogenase subunit 4L     1
k141_58720     28733  30115  1383       +          CDS    ND4        NADH dehydrogenase subunit 4      1
k141_58720     30111  30177  67         +          tRNA   trnH(gug)  tRNA-His                          1
k141_58720     30178  30236  59         +          tRNA   trnS(gcu)  tRNA-Ser                          3
k141_58720     30236  30305  70         +          tRNA   trnL(uag)  tRNA-Leu                          2
k141_58720     30306  32117  1812       +          CDS    ND5        NADH dehydrogenase subunit 5      1
k141_58720     32114  32635  522        -          CDS    ND6        NADH dehydrogenase subunit 6      1
k141_58720     32636  32704  69         -          tRNA   trnE(uuc)  tRNA-Glu                          1
k141_58720     32710  33852  1143       +          CDS    CYTB       cytochrome b                      1
k141_58720     33855  33921  67         +          tRNA   trnT(ugu)  tRNA-Thr                          1
k141_58720     33922  33989  68         -          tRNA   trnP(ugg)  tRNA-Pro                          1
k141_58720     34957  35022  66         +          tRNA   trnF(gaa)  tRNA-Phe                          1
k141_58720     35025  35972  948        +          rRNA   s-rRNA     12S ribosomal RNA                 1
k141_58720     35973  36042  70         +          tRNA   trnV(uac)  tRNA-Val                          1
k141_58720     36041  37605  1565       +          rRNA   l-rRNA     16S ribosomal RNA                 1
k141_58720     37607  37681  75         +          tRNA   trnL(uaa)  tRNA-Leu                          2
k141_58720     37682  38638  957        +          CDS    ND1        NADH dehydrogenase subunit 1      1
k141_58720     38637  38704  68         +          tRNA   trnI(gau)  tRNA-Ile                          2
k141_58720     38702  38773  72         -          tRNA   trnQ(uug)  tRNA-Gln                          2
k141_58720     38775  38843  69         +          tRNA   trnM(cau)  tRNA-Met                          2
k141_58720     38844  39881  1038       +          CDS    ND2        NADH dehydrogenase subunit 2      2
k141_58720     39880  39946  67         +          tRNA   trnW(uca)  tRNA-Trp                          2
k141_58720     39948  40016  69         -          tRNA   trnA(ugc)  tRNA-Ala                          2
k141_58720     40019  40088  70         -          tRNA   trnN(guu)  tRNA-Asn                          2
k141_58720     40153  41685  1533       +          CDS    COX1       cytochrome c oxidase subunit I    2
k141_58720     41681  41749  69         -          tRNA   trnS(uga)  tRNA-Ser                          3
k141_58720     41753  41820  68         +          tRNA   trnD(guc)  tRNA-Asp                          2
k141_58720     41822  42505  684        +          CDS    COX2       cytochrome c oxidase subunit II   2
k141_58720     42509  42572  64         +          tRNA   trnK(uuu)  tRNA-Lys                          2
k141_58720     42573  42776  204        +          CDS    ATP8       ATP synthase F0 subunit 8         2
k141_58720     42734  43414  681        +          CDS    ATP6       ATP synthase F0 subunit 6         2
k141_58720     43414  44187  774        +          CDS    COX3       cytochrome c oxidase subunit III  2
k141_58720     44195  44262  68         +          tRNA   trnG(ucc)  tRNA-Gly                          2


Protein coding genes totally found: 19
tRNA genes totally found:           32
rRNA genes totally found:            2
--------------------------------------
Genes totally found:                53

And then I checked the duplicate genes, for example, the two COX1 genes and the two ATP8 genes, and found out they are different!

For the correct COX1, it has a sequencing depth of > 40,000 X along the whole gene. But for the wrong COX1, it has sequencing depths from 30X to 200X along the whole gene.

When I blast both COX1 to the NCBI NT database, they both mapped to the target genus.

Thus, be cautious when using the Megahit program for mitogenome assembly! We need to fine-tune the assembly parameters for it! For now, to determine if Megahit's mitogenome is reliable, (1) check the mitogenome length, too long and not circular is definitely problematic; (2) check the sequencing depth along the mitogenome! Be cautious with very high or low regions; (3) Compare the gene sequences to NCBI NT database (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome), check if it is your target clade; (4) Compare the results to the results generated by mitoAssemble which are more reliable. (5) adapting the used kmers (see below).

possible solution

By adapting the used kmers.

One of the possible reasons for the ultra-long mitogenome (>40Kbp) is that, when using small kmers, the assembly graph is much more complex, and thus some sequences are "wrongly" connected. When larger kmers are used, the assembly graph is simplified, and these connections can be avoided.

By default, MitoZ uses --kmers_megahit 21 29 39 59 79 99 119 141 if you use Megahit for assembly (--assembler megahit). When I set --kmers_megahit 39 59 79 99 (because my maximum read length is only 100 bp) or --kmers_megahit 59 79 99 119 141, I both got a correct mitogenome (all genes are 100% similar to the ones assembled by --assembler mitoassemble, see below summary). The sequencing depth along the whole mitogenome now is 1600X to 7199X.

You can also use even larger kmers, depending on your maximum read length. I may set larger default kmers in the further versions of MitoZ, and when the assembly is not good, the users can try smaller kmers by themselves.

#Seq_id        Length(bp)     Circularity    Closely_related_species
k141_114419    16359          yes            


#Seq_id        Start  End    Length(bp) Direction  Type   Gene_name  Gene_prodcut                      Total_freq_occurred
--------------------------------------------------------------------------------------------------------------------------
k141_114419    67     141    75         +          tRNA   trnL(uaa)  tRNA-Leu                          2
k141_114419    142    1098   957        +          CDS    ND1        NADH dehydrogenase subunit 1      1
k141_114419    1097   1164   68         +          tRNA   trnI(gau)  tRNA-Ile                          1
k141_114419    1162   1233   72         -          tRNA   trnQ(uug)  tRNA-Gln                          1
k141_114419    1235   1303   69         +          tRNA   trnM(cau)  tRNA-Met                          1
k141_114419    1304   2341   1038       +          CDS    ND2        NADH dehydrogenase subunit 2      1
k141_114419    2340   2406   67         +          tRNA   trnW(uca)  tRNA-Trp                          1
k141_114419    2408   2476   69         -          tRNA   trnA(ugc)  tRNA-Ala                          1
k141_114419    2479   2548   70         -          tRNA   trnN(guu)  tRNA-Asn                          1
k141_114419    2583   2649   67         -          tRNA   trnC(gca)  tRNA-Cys                          1
k141_114419    2650   2716   67         -          tRNA   trnY(gua)  tRNA-Tyr                          1
k141_114419    2718   4262   1545       +          CDS    COX1       cytochrome c oxidase subunit I    1
k141_114419    4260   4328   69         -          tRNA   trnS(uga)  tRNA-Ser                          2
k141_114419    4332   4399   68         +          tRNA   trnD(guc)  tRNA-Asp                          1
k141_114419    4401   5084   684        +          CDS    COX2       cytochrome c oxidase subunit II   1
k141_114419    5088   5151   64         +          tRNA   trnK(uuu)  tRNA-Lys                          1
k141_114419    5152   5355   204        +          CDS    ATP8       ATP synthase F0 subunit 8         1
k141_114419    5313   5993   681        +          CDS    ATP6       ATP synthase F0 subunit 6         1
k141_114419    5993   6777   785        +          CDS    COX3       cytochrome c oxidase subunit III  1
k141_114419    6777   6844   68         +          tRNA   trnG(ucc)  tRNA-Gly                          1
k141_114419    6850   7191   342        +          CDS    ND3        NADH dehydrogenase subunit 3      1
k141_114419    7193   7259   67         +          tRNA   trnR(ucg)  tRNA-Arg                          1
k141_114419    7262   7558   297        +          CDS    ND4L       NADH dehydrogenase subunit 4L     1
k141_114419    7552   8934   1383       +          CDS    ND4        NADH dehydrogenase subunit 4      1
k141_114419    8930   8996   67         +          tRNA   trnH(gug)  tRNA-His                          1
k141_114419    8997   9055   59         +          tRNA   trnS(gcu)  tRNA-Ser                          2
k141_114419    9055   9124   70         +          tRNA   trnL(uag)  tRNA-Leu                          2
k141_114419    9125   10936  1812       +          CDS    ND5        NADH dehydrogenase subunit 5      1
k141_114419    10933  11454  522        -          CDS    ND6        NADH dehydrogenase subunit 6      1
k141_114419    11455  11523  69         -          tRNA   trnE(uuc)  tRNA-Glu                          1
k141_114419    11529  12671  1143       +          CDS    CYTB       cytochrome b                      1
k141_114419    12674  12740  67         +          tRNA   trnT(ugu)  tRNA-Thr                          1
k141_114419    12741  12808  68         -          tRNA   trnP(ugg)  tRNA-Pro                          1
k141_114419    13776  13841  66         +          tRNA   trnF(gaa)  tRNA-Phe                          1
k141_114419    13844  14791  948        +          rRNA   s-rRNA     12S ribosomal RNA                 1
k141_114419    14792  14861  70         +          tRNA   trnV(uac)  tRNA-Val                          1
k141_114419    14860  16359  1500       +          rRNA   l-rRNA     16S ribosomal RNA                 1


Protein coding genes totally found: 13
tRNA genes totally found:           22
rRNA genes totally found:            2
--------------------------------------
Genes totally found:                37

9. Megahit failed and got PermissionError: [Errno 1] Operation not permitted

See https://github.com/linzhi2013/MitoZ/issues/176

10. ValueError: can not find taxid for Nemertea-ribbon-worms, maybe it's a misspelling.

Have a look at the --clade {Chordata,Arthropoda,Echinodermata,Annelida-segmented-worms,Bryozoa,Mollusca,Nematoda,Nemertea-ribbon-worms,Porifera-sponges} option.

If you use Annelida-segmented-worms or Nemertea-ribbon-worms or Porifera-sponges, you may get errors like this:

image

This is a small bug because MitoZ will try to find the taxonomy lineage for Nemertea-ribbon-worms (instead of Nemertea).

Current solution:

Find the /app/anaconda/lib/python3.9/site-packages/mitoz/tools/taxonomy_ranks/get_taxonomy_rank_with_ete3_with_super_and_sub.py, and modify part of its content (around line number 78):

    def __init__(self, user_taxa):
        NCBITaxa.__init__(self)
        self.lineages = dict()
        self.user_taxa = user_taxa
        
        self.taxa_searched = user_taxa
        if '-' in user_taxa: # add this line
            self.taxa_searched = user_taxa.split('-')[0] # add this line

If you are using the Docker or Singularity image, you can copy this file to your host, modify the file, and then, bind it back to the image (so the modified file overwrites the one in the image). For example,

singularity exec --bind $PWD --bind /host/path/modified_get_taxonomy_rank_with_ete3_with_super_and_sub.py:/app/anaconda/lib/python3.9/site-packages/mitoz/tools/taxonomy_ranks/get_taxonomy_rank_with_ete3_with_super_and_sub.py  ~/singularity/images/MitoZ_v3.4.sif 

I will fix the problem in the next release.

11. resuming a run

MitoZ option: --resume_assembly

  1. For megahit, it works when a run is accidentally stopped, e.g. the server was down. It will NOT work if you continue the run with some parameters changed.

  2. Changing the kmers after a running was stopped does not work with Megahit so far. see https://github.com/voutcn/megahit/issues/110

  3. When there is not a previous stopped run and you use the --resume_assembly option from the very beginning: It will not work either. Because this way, megahit only cares about megahit --continue --out-dir ./megahit_out options and omits the others, and it assumes all related parameters and commands have been written to ./megahit_out before using the --continue option ( --resume_assembly for MitoZ).

12. Fixing unpaired reads

Sometimes, for unknown reasons, your read 1 and read 2 fastq files may not be properly paired, e.g. the order of read 1 is not the same as read 2. This will lead to some errors and complaints in MitoZ. At least, I know Megahit and BWA do not like them.

Here are the tools to fix the problem:

  1. https://bioinf.shenwei.me/seqkit/usage/#pair (already installed in the mitoEnv environment, you can use the command directly; however, it can consume a lot of RAM!)

  2. https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/repair-guide/ (the repair.sh script). For example,

    repair.sh in=<fq1> in2=<fq2> out=<paired.R1.fq.gz> out2=<paired.R2.fq.gz>  outs=<singleton output> -Xmx10g
    
  3. https://github.com/linsalrob/fastq-pair

In the next release of MitoZ, it will automatically detect such a problem and fix it on the fly.

12. Inaccurate gene ranges

The accuracy of gene ranges highly depends on our database. Therefore, if you find the CDS ranges of species can not be accurately determined by MitoZ, you may need to modify the database by yourself (see https://github.com/linzhi2013/MitoZ/wiki/Extending-MitoZ-s-database):

Manually correct the annotation results of MitoZ or NCBI's annotaion, and then add them to the MitoZ's database. See https://github.com/linzhi2013/MitoZ/wiki/Extending-MitoZ-s-database. You might need to remove other NCBI's sequences in the database, i.e., creating your own custom database.

Clone this wiki locally