Today’s, and tomorrow’s even greater, availability of sequence data provides unprecedented opportunities to understand evolutionary relations; however, the increasing wealth of data also poses challenges to computational methods for data analysis and, thereby also, to algorithm design. The number of sequences involved in an analysis may be very large, since a large number of species or a large gene family may be considered – some species harbor hundreds of members of some gene families. The ultimate goal in species tree construction is to obtain the tree of life, to the extent it exists. As the size of the phylogenies being reconstructed grows, the demand for long sequences increases, which often is satisfied by concatenation of genes. In other applications many trees have to be reconstructed. For instance, comparative genomics studies often involves constructing gene trees for all gene families in a number of genomes, i.e., often thousands of families. Also bootstrapping, which commonly is used to obtain significance values for a single family, requires many reconstructions.
There are two major approaches to phylogenetic tree reconstruction; distance methods and character based methods (such as likelihood and parsimony). Although this paper is concerned with tree reconstruction through distance methods, character based methods are many times the preferred approach. However, since most distance methods are several orders of magnitude faster than character based methods, distance methods are preferred when it comes to extensive evolutionary analysis.
In general, distance methods first estimate pairwise evolutionary distances of the species, represent them using a distance matrix, and, then, compute the tree, or attempt to compute the tree, that realizes the given distance matrix as well as possible. The Neighbor Joining algorithm (NJ) [1] is a distance method, running in time Θ(n3), where n is the number of sequences, which due to its accuracy and speed has been embraced by the phylogenetic Community. However, prior to running the distance method the distance matrix has to be computed from the sequences. Although the distance matrix in theory can be computed in time O(l·n1.376), where l is the sequence length, see [2], the only practical algorithms require time Ω(l·n2). Therefore, the overall running time of tree reconstruction with NJ is Ω(l·n2 + n3). Since the sequence length often is larger than the number of taxa, Computing the distance matrix is the bottleneck. In this paper, we remove this bottleneck by presenting an algorithm which in practice has running time comparable to N.J.
There are various Markov models of sequence evolution describing how sites evolve, typically, independently and identically from the root down toward the leaves. For example, Kimura’s two-parameter model (K2P) [3] distinguishes between two types of events: transitions which are changes within the purine (A and G) and pyrimidine (C and T) groups and transversions which are changes between the groups. The aim in distance estimation is to find the most likely (ML) estimate of the actual number of mutational events given the number of observed events. For the Kimura two-parameter model, this is done by counting the number of transitions and transversions between each pair of sequences and thereafter optimizing the likelihood function, using a closed formula, to attain the corrected distances. The straightforward way of counting the number of observed events between two sequences of length l takes Ω(l) time. Therefore, the straightforward algorithm of Computing all pairwise distances between n sequences has running time Θ(l·n2).
In this paper, a novel divide and conquer algorithm for Computing the number of observed events is presented. The algorithm computes the same function and has the same asymptotic running time as the straightforward algorithm. In practice, however, it is significantly faster than Phylip [4] and Paup [5]. We show on both simulated and real biological data how our algorithm speeds up the reconstruction using NJ from a matter of many minutes and even hours to a matter of seconds. It is important to note that, since the computation of the distance matrix is a prerequisite for all distance methods, our algorithm provides an increase in speed for all distance methods, i.e., not only NJ.
In addition, we present two new methods for handling, so called, ambiguity Symbols, i.e., bases of uncertain identity. Such symbols occur naturally in the form of single nucleotide polymorphisms (SNPs) and also as a result of failure to resolve bases during sequencing. Ambiguity symbols complicates matters in distance estimation and it is not obvious how to include them in the likelihood computations. We show on simulated data that our new methods for ambiguity symbols are significantly more accurate than earlier methods. This is, to the best of our knowledge, the first work which evaluates the accuracy of different ambiguity approaches.