Skip to content

Latest commit

 

History

History
198 lines (114 loc) · 8.87 KB

Concatenation & Model Selection.md

File metadata and controls

198 lines (114 loc) · 8.87 KB

Concatenation & Model Selection


intro:

Currently, two divergent systematic methods are commonly applied for inferring species trees: the supermatrix approach (concatenation) and the coalescent approach (gene trees are calculated and then reconciled in a species tree). You will find a couple of interesting papers on the topic at the end of this tutorial, but here we will focus on the supermatrix approach.

Before starting you should have two aligned & filtered MSA, or if you missed the previous lesson you can grab the ones I generated: 12S_xinsi_op7_aligned.gb.fasta and ND2_p_aligned.n.gb.fasta.


concatenation:

After having aligned and filtere our genes, we will concatenate them using Phyutility: the software has a wide array of functions and can be considere a swiss-knife for phylogeneticists. As you always should, when using a new software, take a look at its manual.

After adjusting the path to the java executable of phyutility, try this string in a folder which contains just the final alignments of your gene (otherwise the wildcard will cause all the fasta files to be concatenated):

java -jar /Applications/bio/phyutility/phyutility.jar -concat -in *.fasta -out concatenation.nxs

Then take a look at the output file concatenation.nxs. It's a nexus file which

  • stores sequence information
  • from which we will extract the block which codes informations for gene boundaries.

We can reformat the nexus line which stores the gene boundaries information,

[12S_xinsi_op7_aligned.gb.fasta_gene1 1-745 ND2_p_aligned.n.gb.fasta_gene2 746-1768 ]

into something similar to:

DNA, 12S = 1-745
DNA, ND2 = 746-1768

and save it to a file named gene.prt using an editor as nano. I strongly discourage manual editing of files, but for us it's a nice way to understand the structure of different formats. Btw if you are curious you can also get familiar with the nexus format. This is a step in phylogenetic pipelines which is often automated, as manual editing of partition files for hundred/thousands of genes is not possible. Later you can explore this in-house script which I made for the purpose, just remember to adjust the path to phyutility in it. Moreover future versions of IQ-TREE can accept a folder of alignments as input, completely removing the need for the user to concatenate and edit partitions.

Nonetheless this step is necessary for you to understand its underlying logic! For example, let's edit the partition file to take into account the different codon position (they evolve under different constrains due to the gen code degeneracy).

Let's use nano to transform our gene.prt file from:

DNA, 12S = 1-745
DNA, ND2 = 746-1768

into:

DNA, 12S = 1-746
DNA, ND2st = 746-1768\3
DNA, ND2nd = 747-1768\3
DNA, ND2rd = 748-1768\3

and save it as gene_and_codon.prt. As you can notice the \3 notation informs the program to consider each three positions, and the start of the partition needs to be adjusted as well.

At the end of this part we should have


model of evolution & partitioning scheme selection:

To carry out the model selection we will use ModelFinder (Kalyaanamoorthy et al., 2017). As for most of the steps which are carried out to build a tree, there is a staggering choice of different tools and is sometimes difficult to decide which to use. The more widespread tool for model selection is PartitionFinder2 but we are using ModelFinder as implemented in IQ-TREE for two key reason:

  • usability
  • speed

Also I think it's a good idea to keep constantly using new and shiny tools. Let's try the string:

iqtree -s ND2_p_aligned.n.gb.fasta -m MF

We can read the best model from the standard output or open the relative file by zcat ND2_p_aligned.n.gb.fasta.model.

The MF word stands for ModelFinder, which tells IQ-TREE to perform ModelFinder: this tool computes the log-likelihoods of an initial parsimony tree for many different models and the Akaike information criterion (AIC), corrected Akaike information criterion (AICc), and the Bayesian information criterion (BIC). Then ModelFinder chooses the model that minimizes the BIC score (you can also change to AIC or AICc by adding the option -AIC or -AICc, respectively).

The -m flag can also specify a model name to use during the analyses, which can be a priori specified by the user (here's a list of models implemented in ModelFinder if you feel you will nail it better than ModelFinder).

iqtree -s ND2.fasta -m 'TN{2.0,3.0}+FQ+G8{0.5}+I{0.15}'

As you see several additional parameters are possible in order to:

  • describe base frequencies, such as +F , +FQ, +FO
  • model rate heterogeneity across sites: +I, +G and +R

What we've seen until now is the process through which we select the "best" model of evolution for our sequence data, according to a metric of choice. In a concatenation framework we should carry out the process on the whole concatenation instead of single alignements, but without loosing the information of the single genes boundaries. Let's try:

iqtree -s concatenation.nxs -sp gene_and_codon.prt -m MF

and the we can take a look at the file gene_and_codon.prt.best_scheme.nex.

The previous analysis will result in separate models for each partion. Nonetheless, there are several reasons for which we wanto to merge partitions which can be described by similar models of evolution, possibly including a better estimation of model parameters.

To carry out simultaneously model of evolution & partitioning scheme selection let's use:

iqtree -s concatenation.nxs -spp gene_and_codon.prt -m MF+MERGE  -redo

We have overwritten the previous analysis using the -redo flag, as the merging of partition will almost certainly result in a better model for our dataset.

Let's take again a look at the file gene_and_codon.prt.best_scheme.nex, which should look like this:

begin sets;
  charset 12S = 1-745;
  charset ND2st = 746-1768\3;
  charset ND2nd = 747-1768\3;
  charset ND2rd = 748-1768\3;
  charpartition mymodels =
    GTR+F+G4: 12S,
    TN+F+I+G4: ND2st,
    HKY+F+G4: ND2nd,
    TIM2+F+R2: ND2rd;
end;

If we give such a .prt.best_scheme.nex file to iqtree (either previously generated by ModelFinder or by other means) it will skip the model & partitioning scheme selection and carry out just the parameter optimisation.

Moreover, IQ-TREE allows different branchlengths between partitions with the flags:

  • -q: all partitions share the same set of branch lengths (unrealistic).
  • -spp: allows each partition to have its own evolution rate (recommended for typical analysis).
  • -sp: has its own set of branch lengths to account for heterotachy (very parameter-rich).

In real scenarios one should perform all three analyses and compare e.g. the BIC scores to determine the best-fit partition model. Even if different branchlengths strategies can have a real impact on phylogenies, we will skip this part and chose the more reasonable assumption of a separate evolutionary rate of each partitions.


further reading:

Some great basic and advanced tutorials are available from the authors themselves on ModelFinder and IQ-TREE.

Here is a very nice tutorial on complex models in IQ-TREE.

Here you'll find few words on the composition test carried out by IQ-TREE.

Interesting paper on how concatenation/coalescence impacts mammalian phylogeny.

Interesting paper on how systematic errors as heterotachy impact plants phylogeny

Some music which has been clearly inspired by phylogenetic methods.