N - Algorithms

 
Six different algorithms can be used to align sequence data.
 
The first is the optimistic algorithm (opt). It computes the similarity over the region between the first and the last bases that both sequences under comparison have in common.
Non-overlapping regions at the start and at the end are not counted, but unaligned stretches inside the overlapping region are counted as mismatches.
 
The pessimistic algorithm (pes) takes all portions of both sequences into account, including those that are not in common.
Mutations, deletions and insertions are therefore all counted as mismatches.
 
The third algorithm, the super optimistic one (SupOpt), uses only the overlapping region(s) between the two sequences under comparison. Like the pessimistic algorithm, it counts mutations, deletions and short insertions as mismatches.
 
The next three algorithms are called opt2dir, pes2dir and supopt2dir because they align the original sequence as well as its reverse complement in an optimistic, a pessimistic and a super optimistic way, respectively.
Only the better-matching of the two orientations is then used for the similarity comparison.
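 
The two-direction idea can be sketched in a few lines of Python. This is only an illustration, not BioloMICS code: the names reverse_complement, best_of_two_directions and toy_identity are hypothetical, and the similarity callback stands in for whichever of the opt, pes or supopt computations is used.

```python
# Complement table for unambiguous DNA bases (lower and upper case).
_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def best_of_two_directions(source, reference, similarity):
    """Score the source as given and reverse-complemented; keep the better score.

    `similarity` is any function (source, reference) -> float. It is an
    assumption of this sketch, standing in for opt, pes or supopt.
    """
    forward = similarity(source, reference)
    reverse = similarity(reverse_complement(source), reference)
    return max(forward, reverse)

def toy_identity(a, b):
    """Hypothetical stand-in: fraction of identical bases, position by position."""
    n = max(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0
```

With the toy similarity, a source stored in the opposite orientation scores 0 in the forward direction and is only recognised once its reverse complement is tried.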
 
When several sequences are available in the same field of a single record and have to be compared with one or several sequences of another record, only the best-matching pair of sequences is used to compute the similarity coefficient.
A threshold value can also be set so that only alignments with a minimum number of base pairs in common are kept.
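 
A minimal Python sketch of this best-pair selection is given here under stated assumptions: best_pair_similarity, the align callback and the min_common parameter are hypothetical names, and the threshold is interpreted as a minimum count of identical bases, which the text above suggests but does not spell out.

```python
from itertools import product

def best_pair_similarity(seqs_a, seqs_b, align, min_common=0):
    """Return the highest similarity over all pairs, one sequence per record.

    `align` is a hypothetical callback (a, b) -> (identical, denominator).
    Pairs with fewer than `min_common` identical bases are discarded.
    Returns None when no pair passes the threshold.
    """
    best = None
    for a, b in product(seqs_a, seqs_b):
        identical, denominator = align(a, b)
        if identical < min_common or denominator == 0:
            continue  # below the threshold, or nothing to compare
        sim = identical / denominator
        if best is None or sim > best:
            best = sim
    return best
```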
 
 
Let’s use a very simple example to demystify these algorithms. A source and a reference DNA sequence are created as follows:
 
Source DNA:
        10        20        30        40        50        60
gcttggagtcaccgcagacgttaacgggaaccgacgttgtcaccggggacaccctcctcttcc
 
 
 
Reference DNA:
ttctttcttggagtcaccgcagacgttaccacggcggacttcgcattatatagcgcatagcgcgcaggcgagagagctct
        10        20        30        40        50        60        70        80   
tcatattatatcgatctcgatcatgccttgacggaaaccgacgttgtcaccggggacacctcagg
        90       100       110       120       130       140    
 
 
 
The alignment by hand of these two sequences gives the following result:
     1       10        20                          30        40        50        60
     gcttggagtcaccgcagacgtta  ac             g  gg aaccgacgttgtcaccggggacaccctcctcttcc
      ::::::::::::::::::::::  ::             :  :: ::::::::::::::::::::::::: ::     
ttctttcttggagtcaccgcagacgttaccacggcg...ttccttgacggaaaccgacgttgtcaccggggacacc tcagg       
1       10        20        30             110       120       130       140         
 
 
 
There are 54 identical nucleotides. The similarity is this value, 54, divided by a denominator that can be computed in different ways.
 
The best alignment returned by BioloMICS is:
     1       10        20                          30        40        50        60
     gcttggagtcaccgcagacgtta  acgg g               aaccgacgttgtcaccggggacaccctcctcttcc
      ::::::::::::::::::::::  :::: :               ::::::::::::::::::::::::: ::     
ttctttcttggagtcaccgcagacgttaccacggcg...ttccttgacggaaaccgacgttgtcaccggggacacc tcagg      
1       10        20        30             110       120       130       140         
 
Note that this alignment is slightly different from the previous one, showing that there isn’t one unique alignment solution.
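 
Although the two alignments differ, they contain the same number of identical positions. The quick Python check below uses the lengths of the consecutive ':' runs in each figure; these lists were transcribed by hand from the figures and should be treated as an assumption.

```python
# Lengths of consecutive ':' runs, transcribed from the two alignment figures.
hand_alignment_runs = [22, 2, 1, 2, 25, 2]
biolomics_alignment_runs = [22, 4, 1, 25, 2]

identical_hand = sum(hand_alignment_runs)
identical_biolomics = sum(biolomics_alignment_runs)
print(identical_hand, identical_biolomics)  # 54 54
```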
 
 
Super-Optimistic
     1       10        20                          30        40        50        60
     gcttggagtcaccgcagacgtta  acgg g               aaccgacgttgtcaccggggacaccctcctcttcc
      ::::::::::::::::::::::  :::: :               ::::::::::::::::::::::::: ::     
ttctttcttggagtcaccgcagacgttaccacggcg...ttccttgacggaaaccgacgttgtcaccggggacacc tcagg     
      <------- Super-optimistic --->       +       <--------- denominator ---->
 
In this case, large insertions are ignored. The denominator is the sum of the sizes of the similar segments. In the example above, there are two similar blocks. Block 1 runs from nucleotide 2 to 28 of the source with 3 insertions, so its size is 28 – 2 + 1 + 3 = 30. Block 2 covers nucleotides 29 to 55 with one insertion, so its size is 55 – 29 + 1 + 1 = 28. The denominator is 30 + 28 = 58 and the similarity is 54 / 58 = 93.10 %.
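 
The arithmetic above can be reproduced with a short Python sketch. The block_size helper and the (start, end, insertions) tuples are illustrative names only; the numbers are those given in the text.

```python
def block_size(start, end, insertions):
    """Size of a similar block: covered source positions plus insertions."""
    return end - start + 1 + insertions

# (first source nucleotide, last source nucleotide, insertions) per block
blocks = [(2, 28, 3), (29, 55, 1)]
identical = 54

denominator = sum(block_size(*block) for block in blocks)
similarity = 100 * identical / denominator
print(f"{denominator} {similarity:.2f} %")  # 58 93.10 %
```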
 
 
Pessimistic
In this case, the denominator is the global rightmost position of a nucleotide minus the global leftmost position of a nucleotide, as shown in the figure below.
     gcttggagtcaccgcagacgtta  acgg g               aaccgacgttgtcaccggggacaccctcctcttcc
      ::::::::::::::::::::::  :::: :               ::::::::::::::::::::::::: ::     
ttctttcttggagtcaccgcagacgttaccacggcg...ttccttgacggaaaccgacgttgtcaccggggacacc tcagg
<------------------------- Pessimistic denominator ---------------------------------->
 
In the example above, the denominator is 150 and the similarity is:
Sim = 54 / 150 = 36.00 %
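 
As a sketch, the pessimistic denominator can be computed from the columns each sequence occupies in the alignment. The column spans below (source 6 to 150, reference 1 to 146) are an assumption read off the figure, and the span is counted inclusively so that it reproduces the value of 150 quoted above; pessimistic_denominator is a hypothetical name.

```python
def pessimistic_denominator(span_a, span_b):
    """Columns from the leftmost start to the rightmost end, counted inclusively."""
    start = min(span_a[0], span_b[0])
    end = max(span_a[1], span_b[1])
    return end - start + 1

# (first column, last column) of each sequence; assumed from the figure
source_span = (6, 150)
reference_span = (1, 146)

pes_denominator = pessimistic_denominator(source_span, reference_span)
pes_similarity = 100 * 54 / pes_denominator
print(f"{pes_denominator} {pes_similarity:.2f} %")  # 150 36.00 %
```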
 
 
Optimistic
In this case, the denominator is the rightmost position of a nucleotide in the sequence that ends first minus the leftmost position of a nucleotide in the sequence that starts last, as shown in the figure below.
 
     gcttggagtcaccgcagacgtta  acgg g               aaccgacgttgtcaccggggacaccctcctcttcc
      ::::::::::::::::::::::  :::: :               ::::::::::::::::::::::::: ::     
ttctttcttggagtcaccgcagacgttaccacggcg...ttccttgacggaaaccgacgttgtcaccggggacacc tcagg 
     <-------------------- Optimistic denominator ------------------------------->
 
In the example above, the denominator is 141 and the similarity is:
Sim = 54 / 141 = 38.30 %
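 
The optimistic denominator covers only the columns where both sequences are present. In the Python sketch below, the column spans (source 6 to 150, reference 1 to 146) are again an assumption read off the figure, counted inclusively; optimistic_denominator is a hypothetical name.

```python
def optimistic_denominator(span_a, span_b):
    """Columns shared by both sequences, counted inclusively (0 if no overlap)."""
    start = max(span_a[0], span_b[0])  # start of the sequence that starts last
    end = min(span_a[1], span_b[1])    # end of the sequence that ends first
    return max(0, end - start + 1)

# (first column, last column) of each sequence; assumed from the figure
source_span = (6, 150)
reference_span = (1, 146)

opt_denominator = optimistic_denominator(source_span, reference_span)
opt_similarity = 100 * 54 / opt_denominator
print(f"{opt_denominator} {opt_similarity:.2f} %")  # 141 38.30 %
```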