Advertisers
Free Chat Rooms   UK Chat Rooms   Chat Community   Chat   
Free Chat Rooms   Punk Rock T-Shirts   Free Chat   Live Chat   Concert Bands T Shirts   Chat Rooms   Fitness News   Band T Shirts   
Free Web Directory | Directory Submission Service | Buy Text Links | Theaters and Showtimes | News Archive |
Suggest a Site | Check Status

Artifactual Phylogenies Caused By Correlated Distribution of Substitution Rates Among Sites and Line

Current Headlines

Artifactual Phylogenies Caused By Correlated Distribution of Substitution Rates Among Sites and Line

Mar 28, 03:39 AM

Current Headlines: By Ruano-Rubio, Valentin; Fares, Mario A

Abstract.-

Despite the advances in understanding molecular evolution, current phylogenetic methods barely take account of a fraction of the complexity of evolution. We are chiefly constrained by our incomplete knowledge of molecular evolutionary processes and the limits of computational power. These limitations lead to the establishment of either biologically simplistic models that rarely account for a fraction of the complexity involved or overfitting models that add little resolution to the problem. Such oversimplified models may lead us to assign high confidence to an incorrect tree (inconsistency). Rate-across-site (RAS) models are commonly used evolutionary models in phylogenetic studies. These account for heterogeneity in the evolutionary rates among sites but do not account for changing within-site rates across lineages (heterotachy). If heterotachy is common, using RAS models may lead to systematic errors in tree inference. In this work we show possible misleading effects in tree inference when the assumption of constant within-site rates across lineages is violated using maximum likelihood. Using a simulation study, we explore the ways in which gamma stationary models can lead to wrong topology or to deceptive bootstrap support values when the within-site rates change across lineages. More precisely, we show that different degrees of heterotachy mislead phylogenetic inference when the model assumed is stationary. Finally, we propose a geometry-based approach to visualize and to test for the possible existence of bias due to heterotachy. [Covarions; heterotachy; maximum likelihood; phylogenetic bias; rate-across-site; systematic error.]

(ProQuest-CSA LLC: ... denotes formulae omitted.)

There are several factors that may hamper the inference of reliable phylogenies, including insufficient data and oversimplified models. At best, these model deficiencies may lead to random noise that does not result in erroneous tree topologies. However, correlated parameters unaccounted for in the tree inference model can result in systematic biases towards highly supported erroneous topologies, branch lengths, and other parameter values.

Typically, probabilistic models proposed for the molecular evolution at a site are based on a discrete Markov process where each state represents a different character value (a nucleotide or amino acid) and transitions between states represent a site change fixation. Early methods assumed an identical and independent evolution process for all sites. Uzzell and Corbin (1971) accounted for the evolutionary heterogeneity among sites using rate-across- site (RAS) models. In these RAS models, site-specific relative rates are tied under a parametric sampling distribution, usually a gamma Γ (α, β) distribution. Despite their usefulness to approximate empirical data, RAS models are simplistic, as they do not take into account changes in the functional, environmental, and selective constraints acting on amino acid sites of a protein over time.

Covarion-like models were among the first to account for variations in evolutionary constraints among lineages. Fitch and Markowitz (1970) coined the term covarion, but covarion-like models were not given much attention until recently (e.g., Tuffley and Steel, 1998; Lopez et al., 1999; Galtier, 2001; Penny et al., 2001; Huelsenbeck, 2002; Kolaczkowski and Thornton, 2004; Philippe et al., 2005; Ane et al., 2005). Under these models, two separate Markov processes describe molecular evolution. The first is the typical transition between character states (amino acids for protein, nucleotides for DNA), whereas the second allows sites to switch between on-off states. Sites in the off state will remain constant, whereas those in the on state accept changes following the former process. This covarion model can explain an overall RAS-like heterogeneity, as every site relative rate would depend on the time interval it has spent in the on state; the longer the time, the higher the rate. It also justifies the presence of extensive heterotachy, or changing rates, for a particular site across lineages that cannot be well explained by RAS models (Lockhart et al., 1998; Lopez et al., 2002; Philippe et al., 2005; Ane et al., 2005).

Extensions to this initial covarion model may be proposed to convey more accurately the complexity present in real data with a limited set of parameters. For example, Galtier (2001) considered a discrete gamma [Γ(α)] model that adds a fixed instantaneous rate of change, rate-of-rates (υ), between rate categories for each lineage on every site. This extended covarion model does not allow, however, for rate-of-rates change between lineages that may be responsible of prsducing phylogenetic biases.

Currently, most phylogenetic studies use RAS models on simplicity and software availability grounds. Simultaneously, some methods have been proposed to determine significant within-site changes in evolutionary rates across lineages (Gu, 1999, 2001; Knudsen and Miyamoto, 2001; Mooers and Holmes, 2000; Susko et al., 2002). Other authors have carried out work on topologically misleading effects with special focus on shared invariant sites (Lockhart et al., 1998; Susko et al., 2004), on highly divergent or even random sites (Brinkmann and Philippe, 1999; Pisani, 2004; Susko et al., 2005), or without restriction to any of those two extremes (Chang, 1996; Huelsenbeck, 1998; Inagaki et al., 2004; Kolaczkowski and Thorn ton, 2004).

Deviations from the RAS assumptions may lead to systematic errors. Chang (1996) showed that taxa tend to cluster together in a phylogeny when they display similarities in the distribution of rates among sites. Susko et al. (2004) reported that distance correction to account for heterogeneity in distance-based methods might become a one-way trip into long-branch attraction artifacts. Their results pinpoint a tendency towards long-branch attraction or repulsion, even using maximum likelihood (ML) methods, when the model and parameter values induce systematic misestimation of genetic distances. A covarion-like evolving data set, as those described by Tuffley and Steel (1998) or Galtier (2001), does not necessarily lead to an incorrect phylogeny. In fact, true sister groups may well present a greater correlation in their relative site rates and hence tend to be clustered together in the phylogeny. Heterogeneity in the rate-of-rates parameter values among lineages may however lead to mistaken phylogenetic groupings, as we will suggest in this study.

Here, we first investigate possible sources of bias from assuming RAS models in applying the maximum likelihood criterion on data that evolved following a covarion-like model of evolution. More precisely, we focus on evolutionary scenarios where lineages differ in the assignation of rate category per site. Then we propose a tool to visualize and test the presence of such biases.

METHODS

Simulation Model

To reproduce the bias caused by changes in evolutionary constraints, we used a similar approach as that used in previous work (Chang, 1996; Kolaczkowski and Thornton, 2004; Spencer et al, 2005; Philippe et al., 2005). However, we introduced important differences regarding the phylogeny used in simulations: the tree comprised a quartet of subtrees rather than a quartet of taxa, we considered a set of coefficients that account for the conservation of the relative site rate categories at the deepest divergence events (internal nodes) of the tree; we also accounted for possible differences in the shape parameter of the gamma RAS distribution of substitutions among subtrees. In this study we did not consider the effect of bipartitions of the data matrix based on different sets of branch lengths (see Fig. 2) as done in other studies; however, heterotachy is present due to random independent changes in relative rate category per site or difference in rate distributions across lineages.

We focused on unrooted phylogenies presenting four well-resolved monophyletic groups (Fig. 1). Each group is composed of a number of taxa and evolves under a standard RAS model. The shape of each subtree is controlled by three parameters (Fig. 1c): the total height of the tree, h, the number of simultaneous divergence events along every lineage, d (thus the total number of leaves is 2d), and the relative branch length ratio, r, that modulates the temporal spacing between divergence events. The motivation behind considering subtrees rather than the typical unrooted four-taxa phylogeny is twofold: first, to provide enough data to infer back the rate for a site on each subtree and therefore its relative rate on that lineage, and second, to record the influence of different taxa sampling contexts on the bias strength.

We modeled heterogeneity among sites using a discrete approximation to Γ (α) rate distribution with the number of rate categories C = 16 on all branches. Initially we considered an RAS model with each site belonging to the same rate category throughout the tree. However, we included the possibility of changing the rate categories per site instantaneously on the two most internal nodes or divergence events, \which led to the four well-supported subtrees. On both nodes we defined three transition probability matrices P^sub nα^, P^sub nβ^, and P^sub nγ^, one per each incident branch. Each matrix element p^sub ij^ indicates the probability for a site with category i on the node to change to category j that will apply to the incident branch and beyond. As we constrained all branch rates to follow a discrete gamma distribution, all matrices must yield equally probable stationary category frequency vectors F = {C^sup -1^}^sub C^. We further constrained the family of valid matrices in order to be able to parameterize the change in the distribution of rates across the tree with a single tuning value, θ, which we used for the site rate category conservation coefficient throughout the text (Fig. 1a, b). Its value indicates the proportion of sites whose rate category is unchanged at the divergence event, whereas (1 - θ) indicates the percentage of sites that is susceptible to change to another rate category. In the latter case, the new category will be selected at random from all C possible categories including its current value. Thus the resulting transition matrix is:

... (1)

where:

... (2)

Under this restriction, reversion and transitivity are trivial because P^sub xy^ is symmetric:

... (3)

where z is reachable from x through node y. We based our simulations on a JTT (Jones et al., 1992) Markov process. Nonetheless, qualitative conclusions are extensible to other evolutionary models.

Simulation Cases

Due to computational limitations we only investigated covarion bias effects on three different parametric scenarios: (a) presence of changes in rate categories that are equally frequent in nonsister subtrees, (b) presence of changes in rate categories in the inner branch connecting phylogenetic pairs of subtrees; and (c) existence of different site rate sampling distributions. For all the simulations conducted here, we produce amino acid data sets following the model explained above based on a JTT substitution matrix. Each parameter values combination was repeated 50 to 100 times to reduce stochasticity. Unless otherwise indicated, alignments were 1000 amino acids long. We utilized the program CODEML from the PAML package version 3.14 (Yang, 1997) for evaluating the phylogenetic support based on RELL bootstrap proportions (Kishino and Hasegawa, 1989). For each data set we compared the support obtained by the three unrooted possible trees that were constrained so that all four subtrees are monophyletic and their topologies were known. Parameter values were specific for each simulation batch. All diagrams plot mean RELL support index among repetitions. Notice that this is not the same as considering the percentage of cases where that tree topology is the one that maximizes the likelihood, which is typically the statistic reported in this kind of studies. We used the R statistical package to produce all diagrams and specifically the loc-fit module to fit surfaces for the three-dimensional plots, and geneplotter for the triangular plots in Figure 10.

Geometrical Visualization of Bias

Intuitively, sites that may introduce bias towards a particular hypothesis (topology) are those for which the two non-sister lineages have similar rates and different from the other two lineages (Fig. 2). Therefore, assuming that we have enough data to accurately estimate rates per site per lineage, we can classify sites according to the hypothesis they favor. We expect that a proportion of sites support a particular topology due to stochastic events alone. However, their overall effect must cancel out and result in an unbiased sample.

There are three possible topologies for a fully resolved four- subtree-based unrooted phylogeny. A simple way to represent graphically these hypotheses is using the equidistant vertices of a regular triangle (Fig. 3). Others have used triangle geometry previously in a phylogenetic context although with a different purpose (Strimmer and von Haeseler, 1997; Kim, 2000). Within such a layout, impartial sites (those not favoring a particular hypothesis) will fall close to the middle point, the barycenter, whereas partial sites will accumulate towards peripheral areas close to a vertex, depending on which hypothesis or hypotheses they do favor.

In order to compare rates per site between different subtrees, we used empirical Bayesian estimates as a proxy (Mayrose et al., 2004). We calculated these for each subtree independently. At this point we assume a Poisson model (Juke and Cantor, 1969) rather than the actual JTT rate-matrix because using an empirical matrix for low evolving sites would yield rates highly dependent on site composition. This dependence tends to cluster groups of taxa that share a greater number of constant or nearly constant sites due to the fact they are slowevolving lineages in cases that combine subtrees with pronounced height differences. Here it is not as critical to obtain accurate estimators as to produce a reasonable order that is not sensitive to the conserved or convergent site composition. For each subtree we ranked sites by their rate estimator, resolving ties by randomization. Thus, R^sub x^(i), denotes the rank for site i on the subtree x taking values between O and 1 in steps of 1 /(data matrix length).

Plotting Individual Sites

Under the null hypothesis (RAS model) all subtrees should consistently have equal relative rates at each site across lineages; that is, a fast-evolving site on a lineage has to be proportionally fast evolving in all other lineages. In other words, if sites on certain group are ranked based on their individual rates on that lineage, independent ranking performed on other parts of the phylogeny should yield a similar order (positively correlated with the former) depending on the variance of the rate estimator used. Nonetheless, in non-RAS processes, as the covarion process described above, within-site rate changes break to certain degree this site- by-site correspondence between relative site rates among lineages and their ranking. Lineages with more similar assignment of rate categories along sites should yield higher correlated independent site rankings.

Based on previous observations, lineages with more similar evolutionary processes should tend to attract each other. In this context, we may say that a site "prefers" a hypothesis [(x, y), v, w] if for that site, ranks on lineages x, y and v, w are closer within pairs and different between pairs. A way to reflect this in quantitative terms is that a site will prefer a particular hypothesis if |R^sub x^(i) - R^sub y^(i)| and R^sub v^(i) - R^sub w^(i) are small (close to zero) and the average AVG(|R^sub x^(i) - R^sub v^(i)|, |R^sub x^(i)- R^sub w^(i)|, |R^sub y^(i) - R^sub w^(i)|, |R^sub y^(i) - R^sub v^(i)|) is big (close to one).

Instead of determining individual support indices for each site and hypothesis, we calculated coefficients that give to each hypothesis a weight proportional to the similarity of ranking of their two subtree pairs. Under this approach we yield three weight values, with only two degrees of freedom.

Let us consider the three hypotheses:

H^sub 0^ = [(a^sub 1^, a^sub 2^),b^sub 1^,b^sub 2^]

H^sub 1^ =[(a^sub 1^, b^sub 1^), a^sub 2^, b^sub 2^]

H^sub 2^ = [(a^sub 1^, b^sub 2^), a^sub 2^, b^sub 1^] (4)

For each site i we calculated the value F that measures the amount of relative rate change between two subtrees (x and y) as:

F(x, y, i) = 1 -|R^sub x^(i) - R^sub y^(i) (5)

Small differences in rates (rate conservation or convergence) will yield values close to 1, whereas extreme differences should yield F value closer to 0. F values are directly comparable only if they are estimated from subtrees with equivalent shapes (topology and branch lengths) as distributions of rate estimators used to rank sites depend on these subtree shapes. Therefore, in order to normalize ranking differences between all subtrees, we performed parametric simulations (using 1000 replicates) for each rate category based on RAS model assumptions. In these simulations, we used the parameter estimators obtained by ML analysis performed in PAML. Then we substituted the F values at each site by the complement of their corresponding cumulative probability F ' assuming that its rate category is the one with maximum posterior probability. This will yield uniformly distributed values from 0 to 1.

F'(x, y, i) = Prob(F*^sub xy^ > F (x, y, i)) (6)

Then we measure the total amount of change on a site i for each hypothesis with a second function, G, defined as the product of the indices for its two individual phylogenetic subtree pairs:

G^sub H^sub 0^^(i) = F'(a^sub 1^, a^sub 2^, i). F'(b^sub 1^, b^sub 2^, i)

G^sub H^sub 1^^(i) = F'(a^sub 1^, b^sub 1^, i). F'(a^sub 2^, b^sub 2^, i)

G^sub H^sub 2^^(i) = F' (a^sub 1^, b^sub 2^, i) . F'(a^sub 2^, b^sub 1^, i) (7)

The resulting hypothesis indices also vary from 1 (total conservation of the substitution rate categories per site within pairs) to O (extreme change of the substitution rate categories per site at least within one pair). Then, we defined the weight, ω, on site i associated to a particular hypothesis j as:

... (8)

Finally, we used these weights for combining arithmetically the three triangle vertices (hypotheses) in order to plot a single point that represents the site within the triangle. The bidimensional location (p^sub i^) of that point can be calculated as:

... (9)

Here h^sub j^ stands for the two-dimensional coordinates of the vertex that represents hypothesis H^sub j^. Notice that we can implement many alternative G or F functions generating different weight coefficients and plots with different statistical properties. Nonetheless, any alternative must be designed as to allow impartial sites to concentrate around the center of the triangle, whereas partial sites must cluster towards the corners (one hypothe\sis wins) or the edges (two draw and one loses).

Bias Test and Measurement

We can evaluate the assumptions in the RAS models by testing the goodness-of-fit of the expected to the observed distribution densities of the points in the triangle. Even though this may spot nonstandard samples, it would not automatically indicate which hypothesis benefits from that deviation. Besides, some samples may still be balanced and yet have significantly different plot densities. Alternatively, we could take a look at the mass center of the plot, that is, the arithmetic mean of all site points.

In the ideal case of equivalent subtrees with the same shape (topology and branch lengths), and assuming a genuine RAS evolution model, the mass center must be placed on the triangle barycenter equidistant from each tree hypothesis (topology) regardless of the presence or lack of any phylogenetic signal. Asymmetric heavy tails must pull this mean towards the winning alternative hypothesis (or further from the losing one). In consequence, one could use this to infer whether the sample is biased with respect to one or more hypotheses. Provided that the distribution of the mean plot point position approaches a bivariate normal distribution, we can readily obtain a simple formula to calculate a confidence area for the sample mass center as developed in the Appendix:

... (10)

where D designates the Euclidean distance from the sample mean to the barycenter, ? is the standard deviation of the sample's projection on any axis of choice with origin in the barycenter, and L is the sample size (data matrix length).

Nonetheless, distinct branch lengths and different number of taxa may affect the distribution of dots on the plot and change the mass center. This is due to the fact that the G values for each hypothesis, as described in Methods, are not independent from each other because they combine shared subtree estimators. In practice, however, the test starts to become liberal just with a great amount of data (10,000 sites) and severe tree shape height asymmetry (e.g., when trees present subtrees over 8 times taller, or with 4 times greater number of taxa, than others). A general solution to this issue is to further simulate the expected site plot and mass center, given the ML estimators of the full phylogeny, and to use the distance to this alternative center. Even though Equation (10) is still a good approximation under such circumstances, sample mean comparison tests, such as the Wilcoxon test, are probably more appropriate.

Another aspect to bear in mind is that since we sort rank ties by randomization, each execution can potentially return a different P- value. In order to stabilize the outcome, the test must be repeated a number of times and an averaged P-value must be selected. However, under this procedure the test becomes somewhat conservative. In this work we will proceed without repeats, as results for single data sets are not critical in the simulations.

As an alternative to the null hypothesis presented above, data sets can show different degrees of relative site rate correlation among lineages that may cause systematic distortions in the resulting topology and branch estimators. Interpretation of the results of such a test is independent of the amount of actual phylogenetic signal (inner branch length). Therefore, results can only point to the possibility of having inferred an artifactual phylogeny by ML under a RAS model.

The simulation and test programs were implemented in Java programming language utilizing functionality from the Java standard library, PAL library (Drummond and Strimmer, 2001), and Jakarta commons. We delegated on rates4site software (Mayrose et al., 2004) to obtain empirical Bayesian site rate estimates per each subtree.

RESULTS AND DISCUSSION

We investigated the effect of inducing different levels of site rate correlation between subtrees using the simulation model based on four monophyletic groups of taxa a^sub 1^, a^sub 2^, b^sub 1^, and b^sub 2^ as described in Methods and Figure 1. We intuitively expected that lineages with a more similar evolutionary process would attract each other due to a greater concordance on the rates across sites, or alternatively that lineages with more divergent evolutionary processes would repel accordingly. Higher pairwise site rate category conservation coefficients 9s (product of individual θ along connecting nodes) should cluster lineages with increasing confidence as more data becomes available.

The Bad

In the first simulation, we fixed θ^sub ab^ (Fig. 1b) on the inner branch connecting the a and b groups to 1 so that there is no intrinsic rate category shuffle along this branch. θ^sub aa1^ = θ^sub bb1^ = θ^sub 1^ and θ^sub aa2^ = θ^sub bb2^ = θ^sub 2^ were assigned arbitrary values between 0 and 1 in steps of 0.1 (Fig. 4a). Initially, for each of these cases we fixed all subtrees total heights, h, to a common average of one change per site, resulting in a symmetric and molecular clock- compatible phylogeny. The shape parameter value, a, was also set to 1 (i.e., a moderate to high degree of substitution rate heterogeneity among sites). Subtrees were composed of four taxa; the result of two rounds of duplication or speciation events on all lineages (d = 2). Each branch was twice as short as the one that leads to the parent node (r = 2). The inner branch was set to 0.04 changes per site. This last value was chosen so that our simulated data sets are forced towards the inconsistency parametric edge. Consequently, relatively small changes in parameter values can make the results wander from yielding the right topology to the wrong one with increasing probability as the size of the data set increases. Under these conditions, the bootstrap support for the right topology following an RAS model of evolution was 80% out of 100 replicates.

A possible biological interpretation of such an arrangement is that smaller θs indicate a greater dissimilarity between evolutionary constraints on different lineages. For instance, paralogs within a gene family with distinct functions would show a similar degree of disparity in within-site evolutionary rates. If θ^sub 1^ = θ^sub 2^, there is the same degree of covariance between all subtrees. Consequently, we expect to obtain a roughly random type of error with no systematic bias. However, if two lineages have a higher degree of rate category conservation; for example, when the two paralogous groups of sequences are functionally redundant, there will be a higher correlation of site rates and a stronger attraction under non-covarion models between these two lineages. The opposite can be also true for the two lineages with lower site rate category conservation. Thus we would expect to see wrong topology resolution as the difference between θ^sub 1^ and θ^sub 2^ increases. In fact, our simulations, based on a symmetrical phylogeny where all subtrees have the same height, show that such an increase leads from reasonable support for the true topology towards false support for the hypothesis in which lineages with greater site rate conservation appear together (Fig. 4b).

Further, we took a particular combination of θ values (θ^sub 1^ - 1, θ^sub 2^ = 0.2) and explored the effect of changing values for key characteristics of the input phylogeny at different data sets sizes (Fig. 5). As expected, the increase of the amount of phylogenetic signal (inner branch length) neutralizes systematic error effects. On the other hand, the increase of subtree height does reinforce systematic error because of the loss of phylogenetic signal in the innermost branch. Covarion bias relies on the existence of heterogeneity of rates among sites because no site rate categories would be considered otherwise. Accordingly, a lower shape parameter value (more severe heterogeneity) favors resolution for the wrong topology, whereas a higher value does just the opposite. Besides, proper taxon sampling may reduce the effect of bias. In fact, adding more taxa reduces the bias strength. Short deep and longtip branches seem to be less problematic than long deep and short-tip branches are. Thus, we can conclude that the characteristics of the evolutionary process at the subtrees have a remarkable effect on phylogeny recovery because long deep branches seem to have a greater impact, something that is in agreement with previous studies (Kim, 1996,1998; Bremer et al., 1999).

Finally, within-site rate heterogeneity among lineages can also alter the outcome significantly. Among all alternative combinations studied, it is only the ones that elongate or shorten lineages together with the same θ values that have a pronounced effect in comparison with the results obtained with the symmetric settings (Fig. 6). In fact, bias increases when lineages with high θ values are accelerated, whereas it decreases when lineages with low θ values accumulate a greater number of changes. We also studied the effect that different numbers of taxa among subtrees has on producing such phylogeny biases. Results here are qualitatively parallel to the ones obtained by tree height asymmetry, with accelerated lineages and more conserved lineages having equivalent effects to poorly sampled and well-sampled lineages, respectively. This, together with the results presented in Figure 5, suggests that long deep branches (caused by rate acceleration, poor sampling or high tree branch length ratio values) may have a major role in catalyzing phylogenetic bias.

The Good

For the second simulation scenario, we allowed θ^sub ab^ to vary from 0 to 1 in steps of 0.1, while we fixed the θ for each external branch to the same value θ^sub 12^ taking the same set of values as formerly (Fig. 7a). The tree shape is the same as for "the bad" case, although here we set up the internal branch to0.01 to hamper the recovery of the correct topology (48% bootstrap out of 100 repeats). Intuitively, simplistic assumptions may actually assist in obtaining the right topology, as closely related taxa should more often share evolutionary traits not accounted for by the model. Accordingly, decreasing θ^sub ab^ (no rate category conservation across the inner branch) and increasing subtree θ^sub s^ (conservation between sister groups) yields high support for the true topology (Fig. 7b) despite the limited phylogenetic signal available.

Although this may seem to be preferable for obvious reasons, we may well want to have an objective measure of the evidence towards each hypothesis. This same reasoning was used to debunk the arguments concerning the superiority of parsimony over likelihood in the socalled Farris zone (Siddall et al., 1998; Swofford et al., 2001), which is part of the long-standing literature battle regarding long-branch attraction effects and the performance of maximum likelihood and parsimony (e.g., Sullivan and Swofford, 2001; Kolaczkowski and Thornton, 2004; Gadagkar and Kumar, 2005; Spencer et al., 2005; Gaucher and Miyamoto, 2005; Brinkmann et al., 2005).

The Ugly

In the third simulated scenario, we explored the effect of distinct shape parameter values under different degrees of site rate category conservation. Two pairs of phylogenetic groups evolved under a gamma RAS model, with the first pair of subtrees (a^sub 1^, b^sub 1^) using a shape parameter, α^sub 1^, and the other two subtrees using a different shape parameter, α^sub 2^. For the inner branch, we used one of the pair values, α^sub ab^ = α^sub 1^ (Fig. 8b) and an interpolated version of both values, α^sub ab^ = exp[ln(α^sub 1^) + ln(α^sub 2^)]/2 (Fig. 8c-f). This means that even if a site belongs to the same rate category throughout the phylogeny, its relative rate will change among lineages depending on the difference in site rate distributions. Once more, we intuitively expected to reproduce systematic error that would join lineages with convergent rate among site distributions α^sub 1^ ≠ α^sub 2^. Topology and branch lengths parameters are set up as in the good case including weak phylogenetic signal, so that we can see support oscillations in both directions (pro and against the wrong rate distribution convergent topology). We also simulated different shape parameters under different conservation levels between 1 and 0 in steps of 0.25 but maintained them equal among all subtrees (Fig. 8a) so that:

... (11)

The outcome is quite surprising. When θs are high (close to 1), for the most part, different shape parameter values drag together those subtree pairs with equal rate distribution shapes (Fig. 8b). Nonetheless, greater difference in shape parameters does not necessarily produce greater attraction between convergent subtrees. Moreover, there are zones where support falls below the level achieved when there are no differences between rate distributions (α^sub 1^ = α^sub 2^). Consequently, repulsion takes place between lineages with the same shape parameter. In this particular simulation batch, this phenomenon occurred when medium/high heterogeneity (around α = 1) and high heterogeneity (when α approaches zero) are combined. In fact, these sinuous support surface changes as the relative subtree heights vary to break the initial clock like overall structure (Fig. 9) or when the inner branch shape parameter is set up differently (Fig. 8b versus 8c).

This may be due to a delicate equilibrium between counterpoised attraction and repulsion effects between all lineages because a single shape parameter is used to maximize the global likelihood, despite that each combination of two subtrees would have a different marginally best-fitting pairwise shape parameter. Susko et al. (2004) concluded that the bias strength and the inconsistency zone would depend on the relative branch lengths.

Nonetheless, one may question the biochemical relevance of such a particular scenario; changes on the overall distribution of site rates under the RAS model or its tuning parameters must be accompanied by arbitrary changes in site rate categories, resulting in decreasing θs. We simulated the same shape parameter subspace at different overall rate category conservation coefficients (Fig. 8c-f). The outcome revealed that repulsion or attraction between subtrees caused by convergence in the sampling distribution of rates between lineages strongly depends on the conservation of site relative rates. This puts forward the conclusion that site rate category dependency between lineages is required for attraction or repulsion over possible influences of relative rate probability distribution differences or similarities. Therefore, different RAS distributions cannot be used alone as an argument to postulate possible systematic bias in a data set or to indicate what lineages should attract.

Visualization and Test

Using the visualization method described in Methods, we plotted data sets simulated under different model conditions including the first and second parameter subspaces previously investigated. A regular RAS process (θ = 1 for all branches) generates a balanced density surface (Fig. 10a) around the center of the triangle, although segments from the center to the vertices do accumulate a slightly greater amount of points. Covarion unbiased data sets (θ values are 0 for all the subtree-leading branches and θ = 1 for the innermost branch) produce a visibly different point distribution with similar amount of points cumulating towards each hypothesis vertex (Fig. 10b). In contrast, covarion-biased data sets generated by correlating the conservation coefficients between pairs of subtrees yield unbalanced plots (Fig. 10c, d and f). Plot differences between biased and unbiased alignments are evident to the naked eye. The overall plot distribution may change if subtrees have clearly distinct shapes (number of taxa, branch length, and topology) as illustrated in Figures 10e and f. Nonetheless, even in these circumstances, the averaged mean point position remains close to the center of the triangle in genuine RAS data sets.

These features match the resulting real or artifactual tree support obtained by ML tree reconstruction. Accordingly, application of the proposed test on these simulated data sets allows the detection of deviations with increasing sensitivity as the bias strength grows (Fig. 11). Sensitivity here is measured as the percentage of positives with a significance level of 0.05 in 1000 amino acid long alignments.

Regarding "the ugly" case, the subtree pairwise rate difference ranking described above also detects parametric areas where systematic error occurs (Fig. 11d). Nonetheless, it seems not to be able to discern between attraction and repulsion, supporting the hypothesis that joins subtrees with similar site rate distributions. The fact that the parametric area of attraction and repulsion changes as different modeling elements are altered suggests that this will not be easily resolvable with an ad hoc approach. Here, the use of the right or fairly well approximated model of evolution is required. This issue will require further investigation; for the moment, we recommend to perform simulations when there is an important difference between shape distributions in different regions of the phylogeny.

REMARKS

It is important to find solutions to the limitations of current phylogenetic methods and evolutionary models. Incorrect phylogenetic resolution may invalidate any result obtained by downstream analyses. A first step towards this aim is to characterize peculiarities of evolutionary processes that may cause bias and provide assessing tools to localize their occurrence in real data sets.

In this work, we illustrated possible misleading effects of some violations of RAS model assumptions regarding evolutionary rates among sites and lineages. Here we extend the observation done by previous work on this aspect of molecular evolution (Chang, 1996; Kolaczkowski and Thornton, 2004), with a more general and relaxed heterogeneous model of evolution. Bias is caused by different degrees of site rate drift among lineages rather than specially tailored conflictive rate convergence. This phenomenon may also be reproducible when assuming a process accounting for continuous changes in site rates categories (Tuffley and Steel, 1998; Galtier, 2001) that allows for different rate-of-rates among lineages. We investigated different overlapping effects of some model parameters on the intensity of the systematic error caused by covarion-like evolution. It is important to note the importance that taxa sampling may have in reducing the resulting systematic error. Also, we tentatively proposed a tool to visualize and test for the presence of bias on four well-supported subtree phylogenies based on simple geometric concepts. In other words, we test the suitability of using RAS models for resolving the deepest branches of the phylogenetic tree. Reliability is especially dubious at this depth, as any posterior fixed changes (rest of the tree height) constitute noise regarding the resolution of these relationships. The closest procedure to our approach has been to check, normally a posteriori, the changes on the support values after selective resampling of data (Lockhart et al., 1998; Inagaki et al., 2004; Pisani, 2004; Brinkmann et al., 2005).

Although here we have focused only on quartet analysis, it may be extendable to a greater number of monophyletic groups by means of quartet puzzlinglike techniques (Strimmer and von Haeseler, 1996) or porting the geometrical framework described herein to a higher dimensional space. In order to estimate discrepancies in rates among lineages, we needed to have a reasonable number of taxa per each gro\up. Therefore this method cannot be used accurately with poorly taxed groups. Unfortunately, this is not an uncommon situation especially in studies performed on poorly sequenced taxonomy groups. Perhaps this may be avoided using the other dimension of the data, increasing the number of sites compared each time using a sliding window.

In any case, it is important to bear in mind that there is more than a single source of systematic error; for example, compositional heterogeneity (Mooers and Holmes, 2000; Foster, 2004). Moreover, bias may reinforce the right topology, as close phylogenetic relatives should intuitively tend to share traits that may not be accounted for by mainstream models. Therefore, thoughtful phylogeny curation is necessary in order to accurately assess the cause and effect of such model singularities.

We may devise some correcting protocols and alternative methods inspired by the simulation model and test proposed. Certainly, in molecular data sets, the accuracy of ML and Bayesian analyses may increase by including more sets of parameters in existing models such as the conservation coefficient per branch or subtree and by accounting for within-site rate changes. In this context, one could use likelihood-ratio tests or alternative model sorting criteria to determine whether the enhanced model explains the data significantly better and whether the simplified model is susceptible to bias. However, that may unbearably increase the computational cost and risk of overfitting. Model-based distance methods such as neighbor- joining could benefit from the fact that each pair distance need not to be estimated using the same global model. A plausible alternative is then to generate a mixed distance matrix, where many intersubtree or intertaxa models approximate better the evolutionary path between pairs of monophyletic groups. Another possibility is to perform a priori site-weighted bootstrapping or removal in order to improve data fitness for the model of choice. Although this last option may not seem to be the most attractive, as we may need to discard a sizable portion of the data, it is in fact constantly being done; e.g., leaving out the third codon position in coding DNA analyses.

In summary, diagnostic analysis approaches, like the one proposed in this work, may complement existing methods such as the maximum- likelihood and Bayesian ones and provide a visual statistical framework to take into account hidden parameters (covarion evolution, compositional bias, an so forth) that, when ignored, may hamper accurate phylogenetic inference.

ACKNOWLEDGMENTS

We would like to thank Davide Pisani for his valuable comments on the manuscript. We are also thankful to Professor Dan Bradley and Simon A. Travers for revising the grammar and spelling of the manuscript. Also we would like to express our gratitude to the editor and to both anonymous reviewers whose comments and suggestions were vital for the development of the final version of the manuscript. Computational resources were provided by Science Foundation Ireland (SFl) under the basic research grant program awarded to M.A.F. We would also like to express our gratitude to the computational resources hosted by the Bioinformatics Laboratory and Computer Science Department at the National University of Ireland, Maynooth.

REFERENCES

Ane, C., J. G. Burleigh, M. M. MaMahon, and M. J. Sanderson. 2005. Covarion structure in plastid genome evolution: A new statistical test. Mol. Biol. Evol. 22:914-924.

Bremer, B., R. K. Jansen, B. Oxelman, M. Backlund, H. Lantz, and K. J. Kirn. 1999. More characters or more taxa for a robust phylogenyCase study from the coffee family (Rubiaceae). Syst. Biol. 48:413-435.

Brinkmann, H., M. Giezen, Y. Zhou, G. P. Raucourt, and H. Philippe. 2005. An empirical assessment of long-branch attraction artefacts in deep eukaryotic phylogenomics. Syst. Biol. 54:743-757.

Brinkmann, H., and H. Philippe. 1999. Archaea sister group of Bacteria? Indications from tree reconstruction artefacts in ancient phylogenies. Mol. Biol. Evol. 16:817-825.

Chang, J. T. 1996. Inconsistency of evolutionary tree topology reconstruction methods when substitution rates vary across characters. Math. Biosci. 134:191-223.

Drummond, A., and K. Strimmer. 2001. PAL: An object-oriented programming library for molecular evolution and phylogenetics. Bioinformatics 17:662-663.

Fitch, W. M., and E. Markowitz. 1970. An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem. Genet. 4:579- 593.

Foster, P. G. 2004. Modelling compositional heterogeneity. Syst. Biol. 53:485-495.

Gadagkar, S. R., and S. Kumar. 2005. Maximum-likelihood outperforms maximum parsimony even when evolutionary rates are heterotachous. Mol. Biol. Evol. 22:2139-2141.

Galtier, N. 2001. Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol. Biol. Evol. 18:866-873.

Gaucher, E. M., and M. M. Miyamoto. 2005. A call for likelihood phylogenetics even when the process of sequence evolution is heterogeneous. Mol. Phylogenet. Evol. 37:928-931.

Gu, X. 1999. Statistical methods for testing functional divergence after gene duplication. Mol. Biol. Evol. 16:1664-1674.

Gu, X. 2001. Maximum-likelihood approach for gene family evolution under functional divergence. Mol. Biol. Evol. 18:453-464.

Huelsenbeck, J. P. 1998. Systematic bias in phylogenetic analysis: Is the Strepsiptera problem solved? Syst. Biol. 47:519- 537.

Huelsenbeck, J. P. 2002. Testing a covariotide model of DNA substitution. Mol. Biol. Evol. 19:689-707.

Inagaki, Y, E. Susko, N. Fast, and A. Roger. 2004. Covarion shifts cause a long-branch attraction artifact that unites microsporidia and archaebacteria in EF-lalpha phylogenies. Mol. Biol. Evol. 21:1340-1349.

Jones, D. T., W. R. Taylor, and J. M. Thornton. 1992. The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8:275-282.

Juke, T. H., and C. R. Cantor. 1969. Evolution of protein molecule. Pages 21-132 m Mammalian protein metabolism (H. N. Munro, ed.). New York: Academic Press.

Kim, J. 1996. General inconsistency conditions for maximum parsimony: Effects of branch lengths and increasing number of taxa. Syst. Biol. 45:363-374.

Kim J. 1998. Large-scale phylogenies and measuring the performance of phylogenetic estimators. Syst. Biol. 47:43-60.

Kim, J. 2000. Slicing hyperdimensional oranges: The geometry of phylogenetic estimation. Mol. Phylogenet. Evol. 17:58-75.

Kishino, H., and M. Hasegawa. 1989. Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in hominoidea. J. Mol. Evol. 29:170-179.

Knudsen, B., and M. M. Miyamoto. 2001. A likelihood ratio test for evolutionary rate shifts and functional divergence among proteins. Proc. Natl. Acad. Sci. USA 98:14512-14517.

Kolaczkowski, B., and J. W. Thornton. 2004. Performance of maximum parsimony and likelihood phylogenetics when evolution is heterogeneous. Nature 431:980-984.

Lockhart, P. J., M. A. Steel, A. C. Barbrook, D. H. Huson, M. A. Charleston, and C. J. Howe. 1998. A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages. Mol. Biol. Evol. 15:1183-1188.

Lopez, P., D. Casane, and H. Philippe. 2002. Heterotachy, an important process of protein evolution. Mol. Biol. Evol. 19:1-7.

Lopez, P., P. Forterre, and H. Philippe. 1999. The root of the tree of life in the light of the covarion model. J. Mol. Evol. 49:496-508.

Mayrose, I., D. Graur, N. Ben-Tal, and T. Pupko. 2004. Comparison of site-specific rateinference methods for protein sequences: Empirical Bayesian methods are superior. Mol. Biol. Evol. 21:1781- 1791.

Mooers, A. O., and E. C. Holmes. 2000. The evolution of base composition and phylogenetic inference. Trends Ecol. Evol. 15:365- 369.

Penny, D., B. J. McComish, M. A. Charleston, and M. D. Hendy. 2001. Mathematical elegance with biochemical realism: Covarion model of molecular evolution. J. Mol. Evol. 53:711-723.

Philippe, H., Y. Zhou, H. Brinkmann, N. Rodrigue, and F. Delsuc. 2005. Heterotachy and long-branch attraction in phylogenetics. BMC Evol. Biol. 5:50.

Pisani, D. 2004. Identifying and removing fast-evolving sites using compatibility analysis: An example from the Arthropoda. Syst. Biol. 53:978-989.

Pupko, T., I. Pe'er, R. Shamir, and D. Graur. 2000. A fast algorithm for joint reconstruction of ancestral amino acid sequences. Mol. Biol. Evol. 17:890-896.

Siddall, M. E. 1998. Success of parsimony in the four-taxon case: Longbranch repulsion by likelihood in the Farris zone. Cladistics 14:209-302.

Spencer, M., E. Susko, and A. J. Roger. 2005. Likelihood, parsimony, and heterogeneous evolution. Mol. Biol. Evol. 22:1161- 1164.

Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: A quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13:964-969.

Strimmer, K., and A. von Haeseler. 1997. Likelihood-mapping: A simple method to visualize phylogenetic content of a sequence alignment. Proc. Natl. Acad. Sci. USA 94(13):6815-6819.

Sullivan, J., and D. L. Swofford. 2001. Should we use model- based methods for phylogenetic inference when we know that assumptions about among-site rate variation and nucleotide substitution pattern are violated? Syst. Biol. 50:723-729.

Susko, E., Y. Inagaki, C. Field, M. E. Holder, and A. J. Roger. 2002. Testing for differences in rate-across-site distributions in phylogenetic subtrees. Mol. Biol. Evol. 19:1514-1523.

Susko, E., Y. Inagaki, and A. J. Roger. 2004. On inconsistency of the neighbor-joining, least squares, and minimum evolution estimation when substitution processes are incorrectly modeled. Mol. Biol. Evol. 21:1629-1642.

Susko, E., M. Spencer, and A. J. Roger. 2005. Biases in phylogenetic estimation can be caused by random sequence segments. J. Mol. Evol. 61:351-359.

Swofford, D. L., P. J. W\addell, J. P. Huelsenbeck, P. G. Foster, P. O. Lewis, and J. S. Rogers. 2001. Bias in phylogenetic estimation and its relevance to the choice between parsimony and likelihood methods. Syst. Biol. 50:525-539.

Tuffley, C., and M. Steel. 1998. Modeling the covarion hypothesis of nucleotide substitution. Math. Biosci. 147:63-91.

Uzzell, T., and K. W. Corbin. 1971. Fitting discrete probability distributions to evolutionary events. Science 172:1089-1096.

Yang, Z. 1997. PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13:555-556.

Zhang, J., and M. Nei. 1996. Accuracies of ancestral amino acid sequences inferred by the parsimony likelihood and distance methods. J. Mol. Evol. 44:S139-A146.

First submitted 2 November 2005; reviews returned 6 January 2006; final acceptance 25 August 2006

Associate Editor: Tim Collins

VALENTIN RUANO-RUBIO1,2 AND MARIO A. FARES1,2

1 Molecular Evolution and Bioinformatics Laboratory, Department of Biology, National University of Ireland, Maynooth, Ireland

2 Present Address: Evolutionary Genetics and Bioinformatics Laboratory, Department of Genetics, Smurpt Institute of Genetics,

University of Dublin, Trinity College, Dublin 2, Ireland; E- mail: faresm@tcd.ie (M.A.F.)

Copyright Society of Systematic Biologists Feb 2007

(c) 2007 Systematic Biology. Provided by ProQuest Information and Learning. All rights Reserved.

Artifactual Phylogenies Caused By Correlated Distribution of Substitution Rates Among Sites and Line
Back to Current Headlines
Repair Credit   Gate Operator   Harley Davidson Accessories   Wedding DJ Massachusetts