Advertisers
Free Chat Rooms   UK Chat Rooms   Chat Community   Chat   
Free Chat Rooms   Punk Rock T-Shirts   Free Chat   Live Chat   Concert Bands T Shirts   Chat Rooms   Fitness News   Band T Shirts   
Free Web Directory | Directory Submission Service | Buy Text Links | Theaters and Showtimes | News Archive |
Suggest a Site | Check Status
Kiva - loans that change lives

A Step Toward Barcoding Life: A Model-Based, Decision-Theoretic Method to Assign Genes to Preexistin

Current Headlines

A Step Toward Barcoding Life: A Model-Based, Decision-Theoretic Method to Assign Genes to Preexistin

Mar 28, 03:39 AM

Current Headlines: By Abdo, Zaid; Golding, G Brian

Abstract.-

A major part of the barcoding of life problem is assigning newly sequenced or sampled individuals to existing groups that are preidentified externally (by a taxonomist, for example). This problem involves evaluating the statistical evidence towards associating a sequence from a new individual with one group or another. The main concern of our current research is to perform this task in a fast and accurate manner. To accomplish this we have developed a model-based, decision-theoretic framework based on the coalescent theory. Under this framework, we utilized both distance and the posterior probability of a group, given the sequences from members of this group and the sequence from a newly sampled individual to assign this new individual. We believe that this approach makes efficient use of the available information in the data. Our preliminary results indicated that this approach is more accurate than using a simple measure of distance for assignment. [Assignment; barcoding; coalescent; decision theory.]

(ProQuest-CSA LLC: ... denotes formulae omitted.)

DNA barcoding involves the use of a short DNA sequence as a means to taxonomically identify a specimen to the species, genera, or family level (Floyd et al., 2002; Hebert et al., 2003a, 2003b; Remigio and Hebert, 2003). The basic idea is to standardize the segment of DNA that is used for barcoding purposes and to determine the sequence of this segment from as many identified species as possible. Storing the resulting data in a searchable database, then permits sequences from new, unknown specimens to be identified via a comparison with the sequences from characterized species. The recognized utility of this methodology is resulting in a global, synchronized effort with more than 100 member organizations- including museums, zoos, botanical gardens, and universities- involved in setting a global standard in taxonomy and in creating a database of DNA barcodes.

Although the usefulness of this approach is well established (see, for example, Hajibabaei et al., 2006; Hebert et al., 2004b), some taxonomic groups, such as cowries, have shown an unacceptably high error rate for DNA barcodes (Meyer and Paulay, 2005). Part of the reason for this discrepancy is due to different levels of variation in the sequences within each species. It is difficult to discern whether or not differences in the new sequence are due to intraspecific variation or if it is an indication of interspecific differences. Unless the sampled sequence is identical to known sequences within the databases, an inference must be made as to which species (or other taxonomic group) the sequence belongs. There is therefore a need for statistical methods to determine the most appropriate assignment and the degree of confidence with which this assignment can be made.

The problems of taking a query sequence and determining the most appropriate group/species from which this sequence might have originated can be divided into two major parts: clustering and assignment. Clustering is an unsupervised learning problem that involves identifying groups in a training data set or a preexisting database. We will tackle this problem in a subsequent article. In this article, however, we focused on the second major part which is the assignment of newly sequenced individuals to existing groups that are preidentified externally (by a taxonomist or any other mean of identification). This problem is referred to as a supervised learning problem (classification or assignment), which involves evaluating the statistical evidence towards associating a new sequence with a group or another. Our main concern is to perform this task in an accurate and fast manner. In this paper, we developed a model-based, decision-theoretic framework for classification based on coalescent theory. To accomplish this, we utilized both distance and the posterior probability of a group given the data and the newly sampled sequence to assign this new sequence. We believe that this approach makes good use of the available information in the data.

METHODS

Decision Theory, Model-Based Assignment, and the Coalescent

Barcoding of life involves the assignment of an individual to a preexisting group using information drawn from a short DNA sequence, COXI in many cases. This assignment is clearly a decision-making problem. Robert (2001) and Sullivan and Joyce (2005) provide good reviews of decision theory and some of its applications. Ripley (1996) and Dudoit and Fridlyand (2003) provide a good introduction to the use of decision theory in assignment. Assuming that there exists a correct group k that the individual belongs to, assigning the individual to a wrong group j results in an error that should be utilized to reflect the uncertainty of the assignment. A loss function, L(k, j), can be used to quantify this error of assignment (Bain and Engelhardt, 1991; Ripley, 1996; Robert, 2001). Knowing that k is the true group eliminates the decision-making problem but, unfortunately, is generally not available knowledge. Accordingly, our aim is to make an informative decision to correctly assign an individual without knowing which group forms its true origin. An approach to do this is to find the group z that results in minimizing the posterior risk, R^sub i^, of assigning the individual to that group (Bain and Engelhardt, 1991; Dudoit and Fridlyand, 2003; Ripley, 1996; Robert, 2001). This risk is defined to be the expected loss and is presented in the following equation:

... (1)

where PT(x ∈ k | x) is the posterior probability that sequence x originated from group k given the observed sequence x. Ripley (1996) and Dudoit and Fridlyand (2003) refer to this risk as the classifier. Utilizing Bayes rule, we can calculate the posterior using Equation (2):

... (2)

where PT(x | x ∈ k) represents the likelihood of obtaining sequence x given that it belongs to group k and Pr (x ∈ k) is the prior probability that sequence x originated from group k. Calculating the likelihood assumes that a probabilistic model governs the distribution of the individual sequences within the groups.

Equation (1) assumes that individuals within a group are independent and identically distributed, i.i.d., a condition that does not hold when individuals share a common evolutionary history, as in our case. To reflect this dependence we rewrite Equation (1),

... (3)

where D represents the preexisting data. A good probabilistic model should utilize this dependence on the evolutionary history to make the assignment decision. Ideally, both the phylogenetic relationships between the different groups and the population genetic behavior within these groups should be taken into consideration (Felsenstein, 2004; Matz and Nielsen, 2005; Nielsen, 1998; Nielsen and Matz, 2006; Nielsen and Wakeley, 2001). An attractive candidate that describes the evolutionary process within groups or populations is the coalescent.

The coalescent was first introduced by Kingman (1982a, 1982b) to describe the behavior of a sample of individuals, backward in time, until they find their most recent common ancestor (MRCA). It was independently introduced by Hudson (1983) and Tajima (1983). Elaborate reviews of the coalescent are provided by Hudson (1990), Neuhauser (2003), Nordborg (2003), and Felsenstein (2004). The elegance and simplicity of the coalescent comes from looking at the evolutionary process backward in time, starting from the current sample, which eliminates the need to consider the unobserved and extinct individuals that might have lived through the history of evolution. This simplification is quite useful for both data analysis and simulation (Felsenstein, 2004; Hudson, 1990; Nordborg, 2003). This simplified model, along with the above-described decision-theoretic framework, formed the basis for our coalescent assigner.

The Coalescent Assigner

To avoid the increased complexity of adding the phylogenetic history to the required model to calculate the posterior probabilities described in Equation (3), we made the approximating assumption that the ancestors of each population (each group) have been evolving for a long period of time, enough to assume that each group is evolving independently from the other groups. Embedded in this assumption is that the species/groups have evolved long enough such that the stationarity of the nucleotide substitution process has been attained. Although Nielsen and Matz (2006) indicate that, based on some unpublished data, this assumption affects the performance of their proposed method, we found that our method performs very well under this assumption. This assumption also means that these groups are known by the taxonomist to be reciprocally monophyletic. This does not necessarily mean that these groups are too far away such that they do not overlap; hence, a phylogenetic analysis of sequences within these groups might not be able to establish such reciprocal monophyly (see Validation).

We also maintained the following assumptions. First, different groups that are potential targets of the assignment are fully defined and prespecified. Second, each \group forms a panmictic population that follows a Wright-Fisher neutral model of evolution that does not allow recombination, selection, or migration. Hence, the evolutionary process within each group is governed by one parameter, which is the expected number of mutational steps between any two individuals within that group (θ). Finally, because the groups are assumed to be fully defined, the evolutionary parameter, θ, is assumed to be known with certainty.

We chose the difference between the sequence we aim to assign, x, from the consensus sequence of the data within the assumed correct group, D^sub k^, as the basis to construct the loss function required to calculate the posterior risk. This function is not unique; any function can be used to reflect the importance of other factors, such as the location of the sampled individual in combination with the genetic distance or alone. The loss function we used is defined as follows:

... (4)

where LS represents the length of the sequence under study and dist() is the distance calculated as the number of nucleotides that differ between the individual to be assigned and the consensus of the group that it might be assigned to.

Doubt was built into our classifier by setting a limit above which assigning an individual to the group with minimum risk is questionable, which highlights the need to evaluate the origin of that particular individual and the possibility of it being misplaced. Given that the proportion of nucleotides that might match by chance in any two different sequences is 0.25 and that each of the groups are equally likely, our doubt threshold was calculated as 0.25/M, where M is the total number of available groups. Groups with minimum risk larger than this number were considered doubtful.

The posterior probability of membership of the current individual x to group k is given as follows:

... (5)

where θ is a known vector of parameters (θ^sub 1^, θ^sub 2^, . . . , θ^sub k^, . . . , θ^sub M^). Utilizing Bayes rule, this equation translates to

... (6)

where Pr(x | x ε k, D, θ) represents the likelihood of the newly added sequence, x, given that it originated from group k and given the known, preassigned data D and the known evolutionary parameters θ. Pr(x ε k | D, θ) is the prior probability of k being the origin of x given our knowledge of the grouping of the presampled data D and the parameters governing the evolutionary process within each of the preestablished groups, θ. This is a different approach than that taken by Nielsen and Matz (2006) because we use the fact that we already know the grouping of the preexisting data, D, to determine where the new data point, x, should be. This is appropriate as we are only concerned with the allocation of the new data point assuming that the presampled individuals are assigned correctly by external authorities (such as by taxonomists). The likelihood Pr(x | x ε k, D, θ) can be rewritten as in Equation (7):

... (7)

Utilizing the assumption of independence of the evolutionary history between the groups, Equation (7) becomes

... (8)

Assuming that knowing that x belongs to group k will not significantly alter the likelihood of the data, D^sub k^, given the parameter θ^sub k^ and incorporating Equation (8) in (6) results in:

... (9)

If we assume that knowing the presampled data, D, and the true parameter vector, θ, will not give any information about where to assign the new individual, and assume a uniform prior on where the new individual might belong, we get the following equation:

... (10)

which is in stark contrast to that presented in Nielsen and Matz (2006). In Equation (10) we use the known structure of the population to allocate any new individual. In other words, we calculate the likelihood of an individual belonging to a certain group rather than calculating the likelihood of the group once the individual is added. In the second case, the change in the likelihood needs to take into consideration data from all the groups all at once, as Nielsen and Matz (2006) pointed out. This is due to the dependence of the evolutionary history, as stated above, and makes evaluating the posterior probability a harder problem to solve.

Combining Equation (4) with Equation (10) we calculate the posterior risk of assigning individual x to group i as follows:

... (11)

Implementing the Caulescent Assigner

Implementing the coalescent assigner involves the following steps. First, we need to know the true parameter, θ, for each of the target groups of the assignment in order to calculate the likelihood of the presampled data given its group of origin. Because this is impossible in practice, we use an approximate alternative: we estimate this parameter for each group using the available data. An approach to estimate these evolutionary parameters is being developed as part of our clustering method, which provides the likelihood of the within-group data prior to adding the "to be assigned" individual-though current available software, such as COALESCE or FLUCTUATE (Kuhner et al., 1995, 1998), can also be used to accomplish this task. The resulting parameters are the maximum likelihood estimates, MLEs, of each of the θ's associated with each of those groups. Assuming that these MLEs are close to the "truth," the second step is to use them to calculate the likelihood of the data after adding the "to be assigned" individual to each of the available groups. There is no direct method to evaluate the likelihood of the data after adding the "to be assigned" individual, Pr(x, D^sub k^ | x ε k, θ^sub k^) in Equation (11), though we note that this likelihood function is equivalent to:

... (12)

where the sum is over all possible genealogies G. Pr(x, D^sub k^ | x ε k, G), and the prior Pr(G | θ^sub k^) are easily calculated (Durbin et al., 1998; Felsenstein, 1992, 2004; Kuhner et al., 1995, 1998; Swofford et al., 1996). Both a nave method, sampling directly from the posterior Pr(G | θ^sub k^), and a Markov chain Monte Carlo approach, similar to that of Kuhner et al. (1995), were implemented to calculate this sum, with an adjustment that allowed the direct calculation of the likelihood (Raftery, 1996). Keep in mind that we are assuming a full knowledge of θ^sub k^. The first approach is appropriate only for small within group sample sizes and is not efficient for large sample sizes (Stephens, 2003). This approach was used to demonstrate the power of the presented tool as compared to the distance method within that level of sampling. The MCMC allows for sampling from the tail of the genealogy distribution that contributes most to the likelihood providing a more efficient alternative. Our model of choice for calculating the Pr(x, D^sub k^ | x ε k, G) was F81 (Felsenstein, 1981; Swofford et al., 1996). This model was also used in FLUCTUATE to estimate the parameter θ^sub k^ for each of the groups.

Finding the consensus sequence comprised our third step in this implementation. We used a 50% majority rule to accomplish this step. The fourth and final step is to evaluate the risk using Equation (11). This is straightforward once all the likelihoods are evaluated.

A C++ software that integrates steps two through four is available at http://helix.mcmaster.ca/assigner.html.

VALIDATION

We utilized both simulated data and real data to evaluate the performance of the coalescent assigner introduced above. We start by describing the method devised to simulate and analyze data under prespecified evolutionary parameters. We then introduce a real data set and the approach taken to validate the method based on it. We compared the performance of our method to a distance based approach that does not take our understanding of the evolutionary process into consideration. The "correct" group of assignment, k, was chosen to be the one with minimum dist[x - consensus(D^sub k^)]; where x, dist(), and D^sub k^ are as described above. We considered the assignment doubtful if there exists a group j, different from k, with dist[x - consensus(D^sub k^)]/LS less than 3%, or with equal distance from the newly sampled individual, where LS is as defined above. We also considered our assignment in doubt using this approach if the minimum distance was greater than 3%. The use of the 3% threshold conforms to that used by Hebert et al. (2003b). This threshold will only serve to highlight the range of appropriate use of the distance measure (see Results and Discussion).

Simulated Data

Simulation allows control over the parameters of the evolutionary process by which we generate the data. This usually results in a simplification of this process and yet provides the advantage of knowing the truth. The performance of the method is then assessed by its ability to recover the known truth. Our aim was to evaluate the ability of our proposed method to assign an individual to its true group of origin under various degrees of divergence and overlap between the different groups and different levels of variation (different θ's) within each of these groups. The strategy we followed was to simulate data under a more complicated model than the one we based our method on. This gave more realistic data that helped test the robustness of our method against violations of its assumptions (Abdo et al., 2006; Minin et al., 2003).

We utilized the neotropical skipper butterfly Astraptes flugerator data (Hebert et al., 2004a) as a guide to simulate data. First, we assumed the existence of 12 predefined butterfly groups (Hebert et al. 2004a). The consensus sequence for each of these groups was obtained. Employing PAUP* 4b10 (Swofford, 1998), we estimated the phylogeny and the parameters of a model of evolution based on these consensus sequences. The model of evolution of choice was the general time-reversible model (GTR). Any unresolv\ed nodes were resolved arbitrarily to create a bifurcating tree. Only the topology of this tree and the estimated parameters of the model were used in the next steps of the simulation. The depth of the tree was varied to provide different scenarios of divergence and overlap, as described below. Figure 1 presents the topology and the shape of one of the resulting trees; it also shows the resulting substitution rate per site (branch length assignment is discussed below). This tree structure provided groups with different depths; hence, depending on the used phylogenetic scenario, we had multiple levels of difficulty of assignment within this tree. In particular, depending on the allowed level of divergence, groups LONCHO, YESENN, NUMT, and CELT can have high levels of overlap providing a challenge to the capacity of our method, and any other method, to correctly assign an individual that originated from one of these groups.

Any randomly generated sequence can serve as the ancestor of all sequences in the simulated evolutionary process; we chose the consensus of the 12 group-consensus sequences to be the seed sequence for the simulation.

The above setup provided the ancestor of all sequences to be simulated; the topology of the phylogeny, representing the history of evolution of the group-ancestry; and the model of evolution, and its parameters, used to simulate DNA sequences of the group ancestry. We simulated these group-ancestral data under different scenarios of phylogenetic evolution representing different levels of divergence. This was done using multiple rates of nucleotide substitution, ν, to generate appropriate branch lengths associated with the levels of divergence we wanted to test (ν was defined to be the expected number of substitutions per site between any two individuals on the phylogenetic tree). Branch lengths represented the expected number of substitutions, on average, based on a Yule (1924) model with parameter (1/ν). Nucleotide substitution rates, v, employed were 0.001, 0.01, 0.1, and 1 substitutions per site (Table 1), representing deep, moderately deep, moderately short, and short phylogenies (these phylogenies are provided as supporting material with this article). Rates greater than one resulted in large divergences between groups; under such conditions the groups were clearly delimited and were easy to distinguish. Our intention was to focus on situations where assignment was hard due to group overlap and due to the closeness of the evolutionary history resulting from the shortness of the phylogenies under study. Short phylogenies presented strong violation of our assumption of group-independence and provided a tool to test the robustness of our method to violating such fundamental assumption. Group ancestral sequences were simulated using Seq-Gen 1.3.2 (Rambaut and Grassly, 1997) and the GTR model of evolution.

Sequence data within each of the 12 groups were simulated by employing the ancestral sequences to form the most recent common ancestors (MRCAs) of each of the groups, and then utilizing the neutral coalescent to model within-group evolution. Multiple levels of θ (0.001, 0.01, and 0.1 per site) were used, representing slow, moderate, and fast within-group evolution (respectively) (Table 1). All three θ levels were used with each of the phylogenetic scenarios described above. This resulted in a number of highly complex data sets; on one hand we had short phylogenies coupled with fast within-group evolution (ν = 0.001, θ = 0.1), resulting in blurring the boundaries of the groups through high levels of overlap-and on the other hand we had deep phylogenies with slow within-group evolution (ν = 1, θ = 0.001), representing clear divergence translating to an easy assignment problem. The neutral coalescent served to generate the within-group genealogies, mutations were simulated using the aforementioned GTR model with the same parameters as those utilized in generating the ancestral sequences. Seq-Gen 1.3.2 (Rambaut and Grassly, 1997) was employed to generate the simulated, within-group DNA sequences.

To evaluate the effect of the sample size on the power of the analysis we used group sizes of 5, 10, and 25. An extra individual was generated within each group, only one of which (1 of 12) was randomly chosen to be reassigned. This was done to maintain equal group sizes prior to the assignment. We generated 100 replicate data sets for each combination of θ, and sample size at each level of phylogenetic divergence, ν. We used the above introduced implementation procedure to analyze our data. In the case of no within-group variation we assumed that the group's diversity was very small and we set θ equal to 10^sup -8^; the evolution process is very slow; hence, θ is very small within the group.

Real Data

The neotropical skipper butterfly Astraptes flugerator data (Hebert et al., 2004a) was also utilized directly to validate our method. These individuals have traditionally been classified as a single species but, on the basis of barcode data, have been suggested to represent 12 sympatric species. This data set was used due to the low level of divergence between the species and due to the fact that it is one of two data sets used by Nielsen and Matz (2006) to evaluate their method, thus providing a base of comparison between the two methods.

In evaluating our method we drew one "query" sequence at random from all available sequences. The groups were reconstructed in the absence of this removed sequence. We then implemented our method of assignment as described above. Table 2 introduces the groups and the number of sampled sequences within each of them. Because group Astraptes-MYST had only three individuals, it was excluded from the random sampling process to pick an individual to assign. This is a result of a built in limitation in FLUCTUATE that requires the size of the group to be more than two. Our clustering algorithm will overcome this limitation.

RESULTS AND DISCUSSION

Model-based methods capitalize on understanding the process governing the system under study and result in more informative and powerful tools to analyze data generated from such systems. However, in applying any statistical method it is important to understand the boundaries of its appropriate application. Particularly, in applying the coalescent assigner, or any method designed for the analysis of barcoding data, it is imperative to understand the dependence of these methods on the interaction between the phylogenetic evolutionary history between species/groups and the within-species population evolution (Meyer and Paulay, 2005). This section discusses the results of applying our method to the simulated data. It then presents the analysis of the real data, alludes to some pros and cons of the coalescent assigner that need to be taken into consideration, and, finally, to some of the future directions of our research.

Simulated Data

Evaluating the performance of our method involved studying its behavior on some of the relevant boundaries of the parameter space of the between- and within-species/group evolution. In particular, we studied the change in the power of the coalescent assigner when used to assign a presampled individual to groups pertaining to scenarios with high group overlap (short phylogeny with high θ), high divergence and group delimitation (deep phylogeny with low θ), along with an array of scenarios in between these two extremes (Table 1). We benchmarked our method by comparing to the change in power of the simple distance method under the same scenarios. Table 3 introduces the resulting power analysis for both the coalescent assigner (with MCMC chain length of 50,000) and the distance method, measured using the percent correct assignment of a newly sampled individual to its group of origin. The table shows clear improvement of these methods as the divergence of the phylogeny, ν, increases compared to the within group evolution (Fig. 2). This is quite logical in the case of the coalescent assigner; the more divergent these groups are, the greater the distinction between their MRCAs and the more likely that the within group evolutionary processes are independent and easier to distinguish. This is also quite logical in terms of the distance method; progeny of the divergent ancestors should be closer to their ancestor than to other groups, given that the within-group evolution is slower than that of the between-group evolution. This outcome is associated with the left triangle of each part of this table at each sample size where, mostly, 100% correct assignment was attained in both cases. An exception was at the 25 taxa sample size in the case of the coalescent assigner with θ = 0.1 and ν = 1 substitutions per site. This will be discussed in Pros, Cons, and Future Research.

Pronounced differences can be seen in the right triangle of the results of risk and distance methods in Table 3, especially at sample sizes 5 and 10, and θ = 0.01 and 0.1. This is also clearly seen in Figure 2. This part of the table is characterized by a within-group evolution on the same speed, or slightly faster, than that between the different groups. Group overlap is a major force playing under these conditions making correctly identifying the true origin a much harder problem. Table 3 indicates that the coalescent assigner is more powerful than the distance method in distinguishing the origin of the newly sampled individuals when the sample sizes are small. Assigning an individual based on distance alone under these conditions can be quite misleading, as the closeness of the different groups in their history can result in some individuals being closer to the wrong group. The coalescent assigner compensates for such an error by taking the within-group evolutionary history into account. This providesfor a more informed decision, resulting in a correct assignment even though the distance is shorter to the wrong group. It can also result in highlighting the correct group of origin as a candidate group to further consider, even though a wrong assignment had been attained. This occurs when the distance from the incorrect group is not too small compared to that of the true origin. When the distance is too small, both methods result in a wrong assignment. Figure 3 provides an illustration of this phenomenon. The figure plots the distribution of the relative difference between the shortest distance and the distance from the original group for ν = 0.001 substitutions per site, θ = 0.1 per site and sample size n = 10. It is clear from the figure that as the distance increased above 20 (3%), the coalescent assigner tended to correctly assign the presampled individual, whereas the distance method tended to falsely assign it to the closest group. This error was more obvious as the within-group evolutionary process became faster compared to the between-group evolutionary process, causing more overlap between the different groups.

It is worth noting that the power of the coalescent assigner dropped much more slowly than that of the distance method as θ increased (Fig. 4). The reason for such a drop in power of both methods is, again, the increase in overlap between the different groups (as θ increases) , especially as the divergence between the groups decreases (as ν reduces). This difference is quite large at ν = 0.001 and n = 5; although the power gradually dropped from 93% to 82% in the case of the coalescent assigner, it dropped from 92% to about 56% in the case of the distance-based method, with a sharp drop of 31 % between θ = 0.001 and 0.01. The same trend can also be seen at ν = 0.01 and at sample size 10 with ν = 0.001 and 0.01 substitutions.

As stated above, Table 3 shows only the percent correct assignment. In so doing, we combined both the correct assignment and the correct assignment with doubt; we define the correct assignment with doubt to be where assigning the sampled individual to the correct group while another group, or groups, are candidate(s) for the assignment (as described in Methods in the case of the coalescent assigner and in Validation in the case of the distance- based method). This combination works in favor of the distance method as the threshold dilema, referred to by Meyer and Paulay (2005), limits the usefulness of the distance measure to a certain range where the between-group variation is not too small so as a newly sampled individual is close to all groups, and the within- group variation is not too large that the newly sampled individual is too far from all groups including its origin. This problem is overcome when adding the history of evolution to the distance measure. Accordingly, we believe that the above presentation of the results is somewhat conservative. Further examination of this descripancy will be done in a future work.

The above analysis was redone using an assumption of perfect knowledge of the different θ's associated with the different groups (Table 4). The results indicate that perfect knowledge of these parameters will not affect the outcome of the assignment, a witness to the robustness of our method. Moreover, although an improvement in the model of evolution used to calculate the likelihoods might improve the correctness of the assignment, the coalescent assigner seems to be doing a good job in identifying the correct groups based on the assumed simple model of evolution, another witness to its robustness. It is fair to state here that the performance of the distance method might also be improved using corrected distances based on a model of evolution.

Analysis using the nave approach resulted in similar outcome as that presented for the MCMC when the sample size was small, n = 5. As expected, the performance of this approach deteriorated very quickly as the number of taxa (n) and θ increased. This confirms the discussion presented in Stephens (2003). Although the usefulness of such a brute force approach is limited, its speed is quite attractive when analyzing small sample sizes.

Real Data

The neotropical skipper butterfly Astraptes flugerator data (Hebert et al., 2004a) provided a scenario with small between-group divergence (short phylogeny) and low within-group variation (low θ's within the groups). This scenario was also characterized by differences between the within group θ's and differences in the group sizes. Thus, this example complemented the evaluation presented in the previous section where the θ's and the sample sizes were set to be at the same level within each of the groups. It was quite an interesting result to us that our method highlighted a misclassification problem in this data. In some cases there was a mixed signal that resulted in a misassignment between groups SENNOV and YESENN, with an indication that both groups were candidates for the misassigned individuals. This caused us to closely examine the misassigned individuals and the source groups. Based on this close examination, we identified a typo or a mistake in assigning one of the individuals of groups SENNOV to group YESENN; although the analysis presented in Hebert et al. (2004a) was based on 78 individuals in the YESENN group and 103 individuals in the SENNOV group, the data were deposited in the database as 79 individuals in YESENN and 102 individuals in SENNOV, with Astraptes flugerator-94- SRNP-8089-Senna papillosa being misplaced. We believe that this caused the evolutionary history of the two groups to be brought closer to one another than they should have been, resulting in the observed misclassification. This further highlights the importance of using the evolutionary history, along with the distance, to assign individuals to their putative groups. Adjusting the data and rerunning the analysis resulted in 100% correct assignment, with no doubt. Accordingly, we believe that the differences in the θ's and group sizes did not have an impact on the performance of our method, given the data scenario we tested under.

Using this data set also allowed for contrasting against Nielsen and Matz's (2006) method. The percent correct assignment (no ties) obtained in the first step of their two-step method ranged between 25% and 100% under their best performing scenario (CELT at 100%; TRIGO at 100%; SENNOV at 99%; INGCUP at 33%; FABOV at 91%; HIHAMP at 25%; LOHAMP at 83%; LONCHO at 89%; and YESENN at 100%)-compared to 100% assignment in our case, although we recognize that our approach in evaluating the performance of our method was somewhat different than theirs. We performed 1000 replicates, sampling one individual from the total population each time, resulting in higher sampling intensity from the larger groups (102 from LOHAMP; 131 from INGCUP; 172 from YESENN; 120 from TRIGO; 76 from FABOV; 219 from SENNOV; 10 from BYTTNER; 29 from HIHAMP; 89 from LONCHO; 44 from CELT; 0 from MYST; and 8 from NUMT). This strategy gave a fair chance to any of the individuals within the large groups to be sampled. In their simulations, Nielsen and Matz (2006) performed 100 replicates per species sampling a "query" sequence from a species and some "database" sequences, from each species, to compare against. We believe that their results might be improved should they use the consensus of the group rather than single, randomly sampled individuals to form the "database" sequences. The second step of their approach, where they employ a Bayesian method to break ties resulting from the first step, was only implemented to two species (INGCUP and FABOV) of these data; hence it was not possible for us to compare our results to the final product of their approach for all groups. Contrasting the efficiency of the two methods, we note that our approach eliminated the need for multiple testing along with the associated problem of correcting the level of significance when performing such numerous tests and that our approach can handle more than two populations at a time, warranting a faster analysis and eliminating the need for the two step procedure they propose. Though extending their Bayesian approach may result in utilizing more of the information in the data, the potential improvement in the assignment resulting from such an approach might not justify the accumulated computational expense that will be incurred.

Pros, Cons, and Future Research

Although our method utilizes more of the information incorporated in the data than distance methods, this power comes with a computational expense, especially in the case of large sample sizes and large θ. Our method is based on efficiently sampling from the posterior Pr(G | θ) as described in Methods. Large numbers of taxa combined with a large number of nucleotide patterns can result in the need for an extensive search and, hence, a much longer chain to guarantee generating a good sample from the tail of the genealogical distribution. This is quite clear when θ = 0.1 and sample size n = 25 (Tables 3 and 4). The deterioration in the performance of our sampling-based method can be attributed to the enlarged tree-sampling space due to both the increase in the number of taxa within a group and the increase in the number of nucleotide patterns due to the increase in θ. A comparison of the performance of our method to the distance method, Figure 5, and Table 3, at this level of within-group evolution with sample size, n = 25, weighs towards the simpler distance method. Increasing the length of the chain to 100,000 slightly improved these results. We anticipate the need to sample much more rigorously to improve the outcome of our method based on our sampling strategy. More rigorous sampling approaches might be required to improve the performanc\e of this method. An approach such as importance sampling (Chen et al., 2005; Griffiths and Tavar, 1994a, 1994b, 1994c; Stephens and Donnelly, 2000) might be useful to accomplish this task.

Nevertheless, the presented results indicate that our method was more powerful than a simple distance method when overlap might mask the correct assignment. It is quite important to note that our method performed very well even though a number of our major assumptions were violated when simulating the data. This performance is witness to the capacity of this method. Moreover, our method provides a middle ground that is simpler and faster than implementing a full approach that incorporates both phylogenetic and population genetic history (Felsenstien, 2004; Nielsen and Matz, 2006; Nielsen and Wakeley, 2001).

The next natural step is developing a model-based clustering approach that utilizes parts of the machinery introduced in this paper to cluster the different individuals to the correct groups with a certain measure of confidence. This clustering method will be the subject of a future publication.

ACKNOWLEDGMENTS

We are indebted to P. Joyce for useful discussions, R. Heckendorn for some guidance in programming, B. Catherman for attending to every problem we had running our programs and simulation at the Beowulf of the University of Idaho, K. Blair and M. Hahn for keeping the University of Idaho's Beowulf and SHARCnet, respectively, humming, and for their help and patience. Also, we thank J. Felsenstein and M. Kuhner for communication regarding PHYLIP and FLUCTUATE, respectively. Finally, we thank J. Stewart for her insightful comments on the manuscript. We would like also to thank R. Page, D. Faith, and R. Nielsen for reviewing the manuscript and for their insightful comments. This research was funded by an NSERC discovery grant and an CRC award to GBG. ZA is also funded by a research program startup funding provided by the University of Idaho. The University of Idaho's Beowulf is supported by NSF EPSCoR grant EPS-0080935, as well as NIH grants P20 RR16448 and P20 RRl6454 from theCOBRE and INBRE programs, respectively, of the National Center for Research Resources. SHARCnet is supported by CFI and ORDCF grants. This research was also supported through funding to the Canadian Barcode of Life Network from Genome Canada through the Ontario Genomics Institute, NSERC, and other sponsors listed at www.BOLNET.ca.

REFERENCES

Abdo, Z., U. M. E. Shiitte, S. J. Bent, C. J. Williams, L. J. Forney, and P. Joyce. 2006. Statistical methods for characterizing diversity of microbial communities by analysis of terminal restriction fragment length polymorphisms of 16S rRNA genes. Environ. Microbiol. 8:929-938.

Bain, L. J., and M. Engelhardt. 1991. Introduction to mathematical statistics. Duxbury Press, Belmont, California.

Chen, Y, J. Xie, and J. S. Liu. 2005. Stopping-time resampling for sequential Monte Carlo methods. J. R. Stat. Soc. B 67:199-217.

Dudoit, S., and J. Fridlyand. 2003. Classification in microarray experiments. Pages 93-158 m Statistical analysis of gene expression microarray data (T. Speed, ed.). Chapman and Hall/CRC, Boca Raton, Florida.

Durbin, R., S. Eddy, A. Krogh, and G. Mitchison. 1998. Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, Cambridge, UK.

Felsenstein, J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17:368-376.

Felsenstein, J. 1992. Estimating effective population size from samples of sequences: A bootstrap Monte Carlo integration method. Genet. Res. 60:209-220.

Felsenstein, J. 2004. Inferring phytogenies. Sinauer Associates, Sunderland, Massachusetts.

Floyd, R., E. Abebe, A. Papert, and M. Blaxter. 2002. Molecular barcodes for soil nematode identification. Mol. Ecol. 11:839-850.

Griffiths, R. C., and S. Tavar. 1994a. Ancestral inference in population genetics. Stat. Sci. 9:307-319.

Griffiths, R. C., and S. Tavar. 1994b. Sampling theory for neutral alleles in varying environment. Philos. Trans. R. Soc. Lond. B 344:403-410.

Griffiths, R. C., and S. Tavar. 1994e. Simulating probability distributions in the coalescent. Theor. Pop. Biol. 46:131-159.

Hajibabaei, M., D. H. Janzen, J. M. Burns, W. Hallwachs, and P. D. Hebert. 2006. DNA barcodes distinguish species of tropical Lepidoptera. Proc. Natl. Acad. Sci. USA 103:968-971.

Hebert, P. D., A. Cywinska, S. L. Ball, and J. R. deWaard. 2003a. Biological identifications through DNA barcodes. Proc. Biol. Sci. 270:313-321.

Hebert, P. D., E. H. Penton, J. M. Burns, D. H. Janzen, and W. Hallwachs. 2004a. Ten species in one: DNA barcoding reveals cryptic species in the neotropical skipper butterfly Astraptes futyrator. Proc. Natl. Acad. Sci. USA 101:14812-14817.

Hebert, P. D., S. Ratnasingham, and J. R. deWaard. 2003b. Barcoding animal life: Cytochrome c oxidase subunit 1 divergences among closely related species. Proc. Biol. Sci. 270-279.

Hebert, P. D., M. Y. Stoeckle, T. S. Zemlak, and C. M. Francis. 2004b. Identification of birds through DNA barcodes. PLoS. Biol. 2:e312.

Hudson, R. R. 1983. Properties of neutral allele model with intragenic recombination. Theor. Pop. Biol. 23:183-201.

Hudson, R. R. 1990. Gene genealogies and the coalescent process. Evol. Biol. 7:1-44.

Kingman, J. 1982a. The coalescent. Stochast. Proc. Appl. 13:235- 248.

Kingman, J. 1982b. On the genealogy of large populations. J. Appl. Prob. 19A:27-43.

Kuhner, M. K., J. Yamato, and J. Felsenstein.1995. Estimating effective population size and mutation rate from sequence data using Metropolis-Hastings sampling. Genetics 140:1421-1430.

Kuhner, M. K., J. Yamato, and J. Felsenstein. 1998. Maximum likelihood of population growth rates based on the coalescent. Genetics 149:429-434.

Matz, M. V., and R. Nielsen. 2005. A likelihood ratio test for species membership based on DNA sequence data. Philos. Trans. R. Soc. Lond. B Biol. Sei. 360:1969-1974.

Meyer, C. P., and G. Paulay. 2005. DNA barcoding: Error rates based on comprehensive sampling. PLoS. Biol. 3:e422.

Minin, V., Z. Abdo, P. Joyce, and J. M. Sullivan. 2003. Performance-based selection of likelihood models for phytogeny estimation. Syst. Biol. 52:674-683.

Neuhauser, C. 2003. Mathematical models in population genetics. Pages 577-601 in Handbook of statistical genetics, volume 2, 2nd edition (D. J. Balding, M. Bishop, and C. Cannings, eds.). John Wiley and Sons, West Sussex, UK.

Nielsen, R. 1998. Maximum likelihood estimation of population divergence times and population phylogenies under the infinite sites model. Theor. Pop. Biol. 53:143-151.

Nielsen, R., and M. Matz. 2006. Statistical approaches for DNA barcoding. Syst. Biol. 55:162-169.

Nielsen, R., and J. Wakeley. 2001. Distinguishing migration from isolation: A Markov chain Monte Carlo approach. Genetics 158:885- 896.

Nordborg, M. 2003. Coalescent theory. Pages 602-635 in Handbook of statistical genetics, volume 2,2nd edition (D. J. Balding, M. Bishop, and C. Cannings, eds.). John Wiley and Sons, West Sussex, UK.

Raftery, A. E. 1996. Hypothesis testing and model selection. Pages 163188 in Markov chain Monte Carlo in practice (W. R. Gilks, D. J. Spiegelhalter, and S. Richardson, eds.). Chapman and Hall, London, UK.

Rambaut, A., and N. C. Crassly. 1997. Seq-Gen: An application for the Monte Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13:235-238.

Remigio, E. A., and P. D. Hebert. 2003. Testing the utility of partial COI sequences for phylogenetic estimates of gastropod relationships. MoI. Phylogenet. Evol. 29:641-647.

Ripley, B. D. 1996. Pattern recognition and neural networks. Cambridge University Press, Cambridge, UK.

Robert, C. P. 2001. The Bayesian choice. Springer-Verlag, New York.

Stephens, M. 2003. Inference under the coalescent. Pages 636-661 in Handbook of statistical genetics, volume 2,2nd edition (D. J. Bolding, M. Bishop, and C. Cannings, eds.) John Wiley and Sons, West Sussex, UK.

Stephens, M., and P. Donnelly. 2000. Inference in molecular population genetics. J. R. Stat. Soc. B 62:605-655.

Sullivan, J. M., and P. Joyce. 2005. Model selection in phylogenetics. Annu. Rev. Ecol. Evol. Syst. 36:445-466.

Swofford, D. L. 1998. PAUP*: Phylogenetic analysis using parsimony (*and other methods). Version 4.ObIOa. Sinauer Associates, Sunderland, Massachusetts.

Swofford, D. L., G. J. Olsen, P. J. Waddell, and D. M. Hillis. 1996. Phylogenetic infrences. Pages 407-514 in Molecular systematics, 2nd edition (D. M. Hillis, C. Mortiz, and B. K. Mable, eds.). Sinauer Associates, Sunderland, Massachusetts.

Tajima, F. 1983. Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437-460.

Yule, G. U. 1924. A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, R.R.S. Philos. Trans. R. Soc. Lond. B 213:21-87.

First submitted 6 March 2006; reviews returned 10 July 2006; final acceptance 22 August 2006

Associate Editor: Dan Faith

ZAID ABDO1 AND G. BRIAN GOLDING2

1 Department of Mathematics and Department of Statistics, University of Idaho, Moscow, Idaho 83844, USA

2 Department of Biology, McMaster University, Life Science Building, 1280 Main Street West, Hamilton, Ontario L8S 4K1, Canada;

E-mail: golding@mcmaster.ca (C.B.G.)

Copyright Society of Systematic Biologists Feb 2007

(c) 2007 Systematic Biology. Provided by ProQuest Information and Learning. All rights Reserved.

A Step Toward Barcoding Life: A Model-Based, Decision-Theoretic Method to Assign Genes to Preexistin
Back to Current Headlines
Repair Credit   Gate Operator   Harley Davidson Accessories   Wedding DJ Massachusetts