
Extending MitoZ's database

Guanliang MENG edited this page Nov 5, 2023 · 5 revisions

For annotation, MitoZ's default database works well most of the time. If it does not (usually because the protein sequences in the default database are too distant from your samples), you may want to build a custom annotation database for MitoZ. Here is how to do it.

1. Find the path where MitoZ is installed

Run:

$ conda env list
# conda environments:
#
base                  *  /home/guanliang/soft/miniconda3
mitozEnv                 /home/guanliang/soft/miniconda3/envs/mitozEnv

On my machine, the MitoZ package itself is located at: /home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz.
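If you would rather not assemble this path by hand, it can also be looked up from Python. The following is a minimal sketch, assuming it is run with the Python interpreter of the conda environment that contains MitoZ and that the package is importable under the name mitoz (as the site-packages/mitoz directory suggests). find_profiles_dir is a hypothetical helper name, not part of MitoZ.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def find_profiles_dir(package: str = "mitoz") -> Optional[Path]:
    """Return <package dir>/profiles, or None if the package cannot be found.

    Hypothetical helper: it only builds the path and does not check
    that the profiles directory actually exists.
    """
    try:
        spec = importlib.util.find_spec(package)
    except (ImportError, ValueError):
        return None
    # Plain modules (not packages) have no search locations.
    if spec is None or not spec.submodule_search_locations:
        return None
    # For a regular package this is the directory holding __init__.py.
    package_dir = Path(list(spec.submodule_search_locations)[0])
    return package_dir / "profiles"


if __name__ == "__main__":
    # Prints None when MitoZ is not installed in the current environment.
    print(find_profiles_dir())
```

Run it inside the activated mitozEnv environment; with a different interpreter the lookup returns None.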

MitoZ's database is in the profiles subdirectory of that path:

$ ll /home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles
total 16K
-rw-rw-r-- 2 guanliang    0 May 12 06:47 __init__.py
drwxrwxr-x 2 guanliang 4.0K May 24 16:06 CDS_HMM
drwxrwxr-x 2 guanliang 4.0K May 24 16:06 rRNA_CM
drwxrwxr-x 2 guanliang 4.0K May 24 16:06 __pycache__
drwxrwxr-x 2 guanliang 4.0K May 24 17:36 MT_database

To list all the database files for protein-coding gene (PCG) annotation:

$ ls /home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/*_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Animal_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Annelida-segmented-worms_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Arthropoda_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Bryozoa_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Chaetognatha-arrow-worms_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Chordata_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Cnidaria_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Echinodermata_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Mollusca_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Nematoda_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Nemertea-ribbon-worms_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Platyhelminthes-flatworms_CDS_protein.fa
/home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/MT_database/Porifera-sponges_CDS_protein.fa

2. Add protein sequences into MitoZ's PCG annotation database

I suggest that you do NOT touch the /home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles/ directory unless you know what you are doing. Instead, copy this directory to a new location and edit the files there.

$ mkdir ~/mitoz_custom_db
$ cp -a  /home/guanliang/soft/miniconda3/envs/mitozEnv/lib/python3.7/site-packages/mitoz/profiles ~/mitoz_custom_db

$ ls -lhrt ~/mitoz_custom_db/profiles/
total 16K
-rw-rw-r-- 1 guanliang guanliang    0 May 12 06:47 __init__.py
drwxrwxr-x 2 guanliang guanliang 4.0K May 24 16:06 CDS_HMM
drwxrwxr-x 2 guanliang guanliang 4.0K May 24 16:06 rRNA_CM
drwxrwxr-x 2 guanliang guanliang 4.0K May 24 16:06 __pycache__
drwxrwxr-x 2 guanliang guanliang 4.0K May 24 17:36 MT_database

After you update the files in ~/mitoz_custom_db/profiles/, run MitoZ with the --profiles_dir ~/mitoz_custom_db/profiles option to tell it to use this custom database:

$ mitoz annotate --thread_number 8 --fastafiles YOUR_mito_genome.fasta --profiles_dir ~/mitoz_custom_db/profiles  --genetic_code 5 --clade Arthropoda

What if you get an error with the --profiles_dir option? For example:

FileNotFoundError: [Errno 2] No such file or directory: '03_anno_Option_1_test.fasta_mitoscaf.fa.solar.genewise.gff.cds.position.cds

Make sure the value of your --profiles_dir option is correct: the CDS_HMM, rRNA_CM, and MT_database directories should sit directly under that path.

Also make sure that your target clade has these three files in those directories:

CDS_HMM/Arthropoda_CDS.hmm
CDS_HMM/Arthropoda_CDS_length_list
MT_database/Arthropoda_CDS_protein.fa

You can create them yourself. "Arthropoda" here is the clade name.

  • Please provide an absolute path to the --profiles_dir option!
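The checks above can be automated before launching MitoZ. This is a minimal sketch, not part of MitoZ: missing_profile_items is a hypothetical helper that only tests for the directories and file names listed on this page, and it resolves the given directory to an absolute path first.

```python
from pathlib import Path
from typing import List


def missing_profile_items(profiles_dir: str, clade: str) -> List[Path]:
    """Return the expected directories and per-clade files that are absent."""
    # Expand "~" and make the path absolute, as --profiles_dir expects.
    root = Path(profiles_dir).expanduser().resolve()
    expected = [
        root / "CDS_HMM",
        root / "rRNA_CM",
        root / "MT_database",
        root / "CDS_HMM" / f"{clade}_CDS.hmm",
        root / "CDS_HMM" / f"{clade}_CDS_length_list",
        root / "MT_database" / f"{clade}_CDS_protein.fa",
    ]
    return [path for path in expected if not path.exists()]


if __name__ == "__main__":
    # Example values taken from this page; adjust them to your setup.
    for path in missing_profile_items("~/mitoz_custom_db/profiles", "Arthropoda"):
        print("missing:", path)
```

An empty result means every item listed above is present. It does not validate the contents of the files.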

Which file is to be edited?

Have a look at https://github.com/linzhi2013/MitoZ/issues/146.

If your samples are arthropods, then you should add the new protein sequences to this file:

~/mitoz_custom_db/profiles/MT_database/Arthropoda_CDS_protein.fa

For example, add the following sequences to this file:

>gi_NC_KX091860_ND1_Cerapanorpa_obtusa_319_aa
MMMIDFIMPLIGSLLLIICVLVGVAFLTLLERKVLGYIQIRKGPNKVGFMGIPQPFCDAIKLFTKEQTYP
ILSNYVSYYFSPIFSLFLSLTVWLVMPYFTNLYTFNLGLMFFLCCTSLGVYTVMIAGWSSNSNYALLGGL
RAVAQTISYEVSLALILLSFVFLIGNYSLMSFFYYQNYVWFIIITFPLALSWFASCLAETNRTPFDFAEG
ESELVSGFNVEYSSGGFALIFLAEYASILFMSMLFSVIFLGCDLMSFMFFIKLTFLSFLFIWVRGTLPRF
RYDKLMYLAWKSFLPLALNYLIFFLGLKVMLIYLY

The header line must follow this format, which is explained in step 3.

3. Which protein sequences should be used?

The rule is to find mitogenomes that are more closely related to your samples.

For example, you can use mitogenomes on NCBI that belong to the same genus or family as your sample. If you do not know which clade your samples belong to, you can BLAST your mitogenome sequences against NCBI's NT database (https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome) and use the top-hit species.

Normally, adding one species closely related to your sample to MitoZ's database is enough.

  1. Download the protein sequences of some more closely related species.
  2. Add these new protein sequences to the annotation database file: ~/mitoz_custom_db/profiles/MT_database/Arthropoda_CDS_protein.fa

The header line must follow this format:

>gi_NC_XXX_YYY_Cerapanorpa_obtusa_319_aa
  • Replace XXX with the GenBank accession number of the protein sequence, without the version suffix: use KX091860, not KX091860.1, because a dot (.) is not allowed here. The gi_NC_ prefix must always be kept.
  • Replace YYY with the corresponding standard PCG name: ATP6, ATP8, COX1, COX2, COX3, CYTB, ND1, ND2, ND3, ND4, ND4L, ND5, or ND6.
  • Replace Cerapanorpa_obtusa with the new genus and species name. For unknown species, GenusName_sp. is also fine.
  • Replace 319 with the length of the protein sequence.

Only the ND1 gene is shown here; you can do the same for the other PCGs. You do NOT have to add all 13 PCGs, though. For example, the ATP8 gene is usually very divergent and thus difficult for MitoZ to annotate; in that case, you can simply add a new ATP8 protein sequence to your custom MitoZ database.
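The header rules above can be sketched as a small helper that builds the header line from its parts. This is an illustration only: make_header is a hypothetical function, not part of MitoZ. It assumes that dropping everything after the first dot is the right way to remove the accession version, and that spaces in the species name should become underscores.

```python
PCG_NAMES = {
    "ATP6", "ATP8", "COX1", "COX2", "COX3", "CYTB", "ND1",
    "ND2", "ND3", "ND4", "ND4L", "ND5", "ND6",
}


def make_header(accession: str, gene: str, species: str, length: int) -> str:
    """Build a header such as >gi_NC_KX091860_ND1_Cerapanorpa_obtusa_319_aa."""
    # Drop the version suffix: KX091860.1 -> KX091860 (no dot allowed).
    accession = accession.split(".", 1)[0]
    gene = gene.upper()
    if gene not in PCG_NAMES:
        raise ValueError(f"not a standard PCG name: {gene}")
    # "Cerapanorpa obtusa" -> "Cerapanorpa_obtusa"
    species = "_".join(species.split())
    return f">gi_NC_{accession}_{gene}_{species}_{length}_aa"


if __name__ == "__main__":
    print(make_header("KX091860.1", "ND1", "Cerapanorpa obtusa", 319))
    # prints: >gi_NC_KX091860_ND1_Cerapanorpa_obtusa_319_aa
```

Write the returned line, followed by the protein sequence, to the end of your clade's *_CDS_protein.fa file in the custom database.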

Finally, if your samples belong to another clade, say Chordata, then you should edit ~/mitoz_custom_db/profiles/MT_database/Chordata_CDS_protein.fa instead.

See also: https://github.com/linzhi2013/MitoZ/issues/181
