Advertisers
Free Chat Rooms   UK Chat Rooms   Chat Community   Chat   
Free Chat Rooms   Punk Rock T-Shirts   Free Chat   Live Chat   Concert Bands T Shirts   Chat Rooms   Fitness News   Band T Shirts   
Free Web Directory | Directory Submission Service | Buy Text Links | Theaters and Showtimes | News Archive |
Suggest a Site | Check Status

Base-Compositional Heterogeneity in the RAG1 Locus Among Didelphid Marsupials: Implications for Phyl

Current Headlines

Base-Compositional Heterogeneity in the RAG1 Locus Among Didelphid Marsupials: Implications for Phyl

Mar 28, 03:39 AM

Current Headlines: By Gruber, Karl F; Voss, Robert S; Jansa, Sharon A

Abstract.-

Although theoretical studies have suggested that base- compositional heterogeneity can adversely affect phylogenetic reconstruction, only a few empirical examples of this phenomenon, mostly among ancient lineages (with divergence dates >100 Mya), have been reported. In the course of our phylogenetic research on the New World marsupial family Didelphidae, we sequenced 2790 bp of the RAG1 exon from exemplar species of most extant genera. Phylogenetic analysis of these sequences recovered an anomalous node consisting of two clades previously shown to be distantly related based on analyses of other molecular data. These two clades show significantly increased GC content at RAG1 third codon positions, and the resulting convergence in base composition is strong enough to overwhelm phylogenetic signal from other genes (and morphology) in most analyses of concatenated datasets. This base-compositional convergence occurred relatively recently (over tens rather than hundreds of millions of years), and the affected gene region is still in a state of evolutionary disequilibrium. Both mutation rate and substitution rate are higher in GC-rich didelphid taxa, observations consistent with RAG1 sequences having experienced a higher rate of recombination in the convergent lineages. [Biased gene conversion; Didelphidae; nucleotide composition; phylogeny; RAG1; recombination.]

Base-compositional heterogeneity, as indicated by significant taxonomic differences in GC (versus AT) content among homologous sequences, is now widely recognized as a potential problem for phylogenetic analysis (e.g., by Loomis and Smith, 1990; Hasegawa and Hashimoto, 1993; Lake, 1994; Lockhart et al., 1994; Foster and Hickey, 1999; Mooers and Holmes, 2000; Tarrio et al., 2002; Jeffrey et al., 2006; Barrowclough et al., 2006). Both simulation studies (Lake, 1994; Conant and Lewis, 2001) and analyses of biological data sets (Lockhart et al., 1994; Chang and Campbell, 2000; Tarrio et al., 2002) suggest that commonly used phylogenetic methods may cluster taxa based on GC content rather than evolutionary history when similar base composition has evolved convergently in unrelated clades. Despite considerable interest in base-compositional convergence, only a few empirical examples have been reported in the literature (Conant and Lewis, 2001). The most compelling of these involve genes in photosynthetic organelles (Lockhart et al., 1992, 1994), phyla of bacteria (Galtier and Gouy, 1998), major clades of animals (Foster and Hickey, 1999), and ancient lineages of vertebrates (Chang and Campbell, 2000; Lockhart et al., 1994). Convincing examples of base-compositional convergence at lower taxonomic levels are seldom reported and few (Tarrio et al, 2000, 2001) have been analyzed in detail.

Here we report an exceptionally clear case of base-compositional convergence among New World marsupial mammals of the family Didelphidae, commonly known as opossums. Didelphids comprise about 90 species in 18 genera that collectively range from Patagonia to Canada and inhabit a wide variety of terrestrial habitats (Nowak, 1999; Gardner, 2005; Voss et al, 2005). Recent phylogenetic studies based on nuclear genes and morphology have resulted in highly congruent estimates of didelphid relationships (Jansa and Voss, 2000, 2005; Voss and Jansa, 2003; Voss et al., 2005; Steiner et al. 2005; Jansa et al., 2006). In particular, sequence data from the protein-coding genes interphotoreceptor retinoid-binding protein (IRBP) and dentin matrix protein 1 (DMP1); the noncoding nuclear transthyretin intron 1 (TTR); the mitochondrial 12S rDNA gene; and morphological characters all yield remarkably congruent phylogenetic signal (Steiner et al., 2005; Jansa et al., 2006). Certain relationships, however, have proven difficult to resolve with existing data sets.

To address this problem, we sequenced the recombination activating gene (RAGl) from 41 species of didelphids. RAGl is a single-copy, intron-free, proteincoding nuclear locus that has been used for phylogenetic inference in a variety of other vertebrate groups (e.g., Groth and Barrowclough, 1999; Barker etal., 2002; Waddell and Shelley, 2003; Mahon and Carpenter, 2004; San Mauro et al., 2005). Because previous analyses of RAGl sequence data resulted in highly resolved and robustly supported phylogenetic relationships at multiple hierarchical levels, we anticipated a straightforward analysis of our didelphid data. Instead, we discovered a striking example of convergent base composition that affords an unusual opportunity to explore the application of commonly suggested methodological remedies and to examine alternative evolutionary explanations.

MATERIALS AND METHODS

DNA Amplification and Sequencing

We amplified and sequenced a portion of the RAG1 gene corresponding to positions 240 to 3100 of Monodelphis domestica (Genbank accession U51897) using the primers listed in Table 1. For most specimens, we amplified three 1000- to 1200-bp fragments directly from genomic DNA and used each of these amplicons in a second nested PCR to create shorter (500- to 600-bp) fragments for sequencing. For some specimens with degraded DNA, however, these smaller fragments were amplified directly from genomic DNA. The initial amplification from genomic DNA was performed as 20 l reactions using Ampli-Taq Gold DNA polymerase (Perkin-Elmer, Boston, MA) and reaction conditions as previously described for our work with IRBP (Jansa and Voss, 2000). Reamplification reactions were performed as 30-to 40-l reactions with Taq DNA polymerase (Promega, Madison, WI). The resulting PCR products were purified and prepared for sequencing using either a Qiaquick PCR purification kit (QIAGEN, Santa Clarita, CA) or EXO-SAP enzymatic digestion (USB, Cleveland, OH). PCR products were sequenced in both directions using amplification primers and dye-terminator chemistry (BigDye Cycle Sequencing; Applied Biosystems), and sequencing reactions were run on an ABI 3700 automated sequencer. Base-calling ambiguities were resolved either by choosing the base on the cleanest strand or using the appropriate IUB ambiguity code if both strands showed the same ambiguity. All sequences, along with their specimen voucher numbers, have been deposited in GenBank with accession numbers DQ865883 to DQ865923. The data matrix and associated trees have been deposited on TreeBase with accession numbers S1579 and M2841.

Sequence Analysis

We used two approaches to test our RAG1 sequences for departures from base-compositional stationarity across taxa. First, we evaluated each taxon for differences from the average base composition across taxa for each codon position using a χ^sup 2^ goodness-of-fit test (Bonferroni-corrected α = 0.001). Second, we calculated the disparity index (ID) for each codon position (Kumar and Gadagkar, 2001) using MEGA version 3.0 (Kumar et al., 2004). This test compares the observed compositional difference between a given pair of sequences (D^sub c^) with the compositional difference (N^sub d^) that would be expected if the sequences were evolving under equilibrium (ID = D^sub c^ - N^sub d^). Two sequences that are perfectly homogeneous will have an ID value of zero; consequently, higher ID values reflect increasingly different base composition between sequence pairs. The significance of a given ID value is assessed by comparison with a Monte Carlo-generated distribution of expected values assuming a model of equilibrium base substitution (Kumar and Gadagkar, 2001).

Maximum parsimony (MP) analyses were performed using heuristic search algorithms implemented by PAUP* 4.0b10 (Swofford, 2003) with all characters treated as unordered and equally weighted. Heuristic searches were performed with 1000 replicates of random stepwise taxon addition and tree bisection-reconnection (TBR) branch swapping. For maximum likelihood (ML) analyses, we determined the best-fit model of nucleotide substitution for various partitions of the molecular data using the Akaike information criterion (AIC) as implemented in ModelTest 3.6 (Posada and Crandall, 1998). We evaluated the fit of various models to each gene (IRBP, DMP1, and RAG1) independently and to the combined molecular dataset. In addition, we partitioned the RAGl data into third codon positions and first + second codon positions, and we evaluated the best-fit model for each of these partitions. Finally, we employed a likelihood-ratio test to determine whether a molecular clock could be applied to the sequence data. Following model evaluation and selection, we obtained the maximum likelihood tree using a heuristic search in which parameter values were fixed according to the best- fit model, and a neighbor-joining tree was used as starting point for TBR branch swapping.

We assessed nodal support using nonparametric bootstrapping (Felsenstein, 1985) for the MP and ML analysis and Bremer support (Bremer, 1988, 1994) for the MP analysis. Bootstrap values for the MP analysis were calculated from 1000 pseudoreplicated data sets, each of which was analyzed heuristically using 10 randomadditionreplicates, TBR branch swapping, and no more than 10 trees saved per pseudoreplicate. The ML bootstrap analysis was based on 1000 pseudoreplicated data sets analyzed with the fast stepwise option in PAUP* using the same best-fit model employed in the original ML tree search. We calculated partitioned Bremer support (PBS; Baker and DeSalle, 1997) to quantify the relative support and conflict provided by different data partitions for individual nodes in the combineddata (IRBP+DMP1+RAG1+morphology) tree. All Bremer values were calculated using constraint trees and command files generated by TreeRot (Sorenson, 1999). We used likelihood-based Shimodaira-Hasegawa tests (SH; Shimodaira and Hasegawa, 1999) to evaluate the fit of a given data set to alternative topologies. Specifically, we addressed whether the RAG1 data could reject the tree derived from IRBP+DMP1 data set and vice versa. All SH tests were performed using 1000 RELL replicates allowing the parameters for the relevant model of nucleotide substitution to be estimated.

Finally, we applied two phylogeny-reconstruction methods specifically designed to compensate for base-compositional heterogeneity to the RAG1 data set. First, we calculated LogDet- corrected distances (Lockhart et al., 1994) and inferred a neighbor- joining tree from the resulting matrix using PAUP*. Second, we evaluated a likelihood-based method that allows for variation in GC content across lineages, as implemented in the software package nhml (Galtier and Gouy, 1998). Unfortunately, this method (hereafter referred to as GG98) has been reported to search only a small fraction of available tree space when more than a few taxa are analyzed (Galtier and Gouy, 1998). Given this limitation, we tested for convergence by performing two separate tree searches. In the first, we used the tree derived from all data except RAGl as the starting topology for branch swapping; in the second, we used the tree derived from RAG1 sequences alone as the initial topology.

To examine evolutionary rates for third positions of RAG1, we used a penalized likelihood approach as implemented in r8s version 1.50 (Sanderson, 2002). To provide the tree and branch lengths for this analysis, we optimized branch lengths for RAGl third positions on the topology resulting from likelihood analysis of all data except RAGl third positions. We established the best value for the smoothing parameter using the cross-validataion procedure with smoothing values ranging from 10^sup -10^ to 100. Searches using the optimal smoothing value (S = 0.001) employed the truncated Newton algorithm with 10 random restarts and 10 subsequent perturbations. Estimates of per-branch rates were scaled against time by setting the origin of didelphines to 40 million years based on the results published by Steiner et al. (2005).

RESULTS

Phylogenetic Analysis

The results of our previous didelphid phylogenetic research based on the interphotoreceptor retinoid binding protein (IRBP), dentin matrix protein 1 (DMP1), and morphology have been reported elsewhere (Jansa and Voss, 2000, 2005; Voss and Jansa, 2003; Jansa et al., 2006) and require only summary description here. Briefly, although trees resulting from separate analyses of these data sets show some topological differences, none are significant when assessed by comparing bootstrap support values node-by-node or by performing Shimodaira-Hasegawa tests. Therefore, we combined the IRBP, DMP1, and morphological data in a single matrix and analyzed them simultaneously using parsimony. The resulting strict consensus of four minimal-length trees (Fig. 1A) is highly resolved and includes several well-supported groups. In particular, clades C and H are recovered with strong (>90%) bootstrap support in our combined-data analysis. Both of these groups were previously supported by our separate analyses of IRBP (e.g., Jansa and Voss, 2000) and DMP1 (Jansa et al., 2006), and they have also been recovered by scnDNA hybridization (Kirsch and Palma, 1995) and by Bayesian analyses of 12S rRNA and TTR intron sequences (Steiner et al., 2005).

In contrast, clades C and H were not recovered by parsimony analysis of RAG1 sequences (Fig. 1B) because clades B and I form a group. This anomalous node (B+I) is strongly supported by RAG1 (parsimony bootstrap support = 97%) and does not appear to have resulted simply from long-branch attraction artifacts that are known to mislead parsimony analysis. In fact, an ML analysis of the RAG1 data set under its best-fit model of nucleotide substitution (GTR+I+Γ) recovered a tree (not shown) that was fully resolved but otherwise topologically consistent with the parsimony tree; notably, the B+I node was also strongly supported (likelihood bootstrap = 97%). Reciprocal Shimodaira-Hasegawa tests indicate that the tree derived from ML analysis of RAG1 is significantly different from that supported by the combined DMP1+IRBP data set (RAG1 on DMP1+IRBP tree: δ = 113.7, P< 0.001; DMP1+IRBP on RAGl tree: δ = 141.4, P< 0.001).

So strong is RAG1 support for the B+I group that a combined (simultaneous) parsimony analysis of all four data sets (IRBP+DMP1+RAG1+morphology) also recovers this node (Fig. 2). Bootstrap support for B+I, however, drops from 97% (in the RAG1- only analysis) to 78% (in the combined data analysis), presumably because of character conflict between RAG1 and the other datasets. Predictably, partitioned Bremer support (PBS) values are either all positive or only weakly conflicting at most nodes in the combined- data tree, but they are strongly conflicting for the B+I group: this node receives support only from RAGl (PBS^sub RAG1^ = +22.2), whereas the total PBS from the other data sets is strongly negative (PBS^sub IRBP+DMP1+morphology^ = -12.2). Of particular interest is the observation that most (97%) of the support for the B+I group comes from RAG1 third codon positions.

Base Composition Analyses

The frequency of G+C at third codon positions of RAG1 varies by nearly 20% across these didelphid taxa, whereas GC content at first and second positions of this gene varies by only 4% (Fig. 3). In contrast, IRBP and DMP1 show only slight variation in GC content at all codon positions. Both first + second and third positions of IRBP show about 4% variation in G+C frequency; first + second positions of DMPl vary by less than 3%, whereas third positions of DMPl vary by about 8%. Not surprisingly then, neither IRBP nor DMP1 sequences exhibit significant departures from base-compositional homogeneity across taxa at any codon positions (Jansa and Voss, 2000, 2005; Jansa et al., 2006). In contrast, our RAG1 sequences exhibit marked departures from base-compositional homogeneity across taxa. Although this gene showed no significant base-compositional bias at first or second positions for any didelphid taxon, 30 species (of the 41 we sequenced) show a significant departure from base-compositional homogeneity at third positions as assessed by χ^sup 2^ goodness- of-fit tests. Of these, 18 species have GC content at third positions (GC^sub 3^) equal to or greater than 70% (Table 2). Given the significant basecompositional heterogeneity at third positions of RAG1, we analyzed this partition independently using both MP and ML methods. Trees recovered under each analytical approach (the ML tree is shown in Fig. 4) were topologically identical to the parsimony and likelihood trees previously obtained from the entire gene. As expected, the anomalous B+I group receives very strong bootstrap support (97% ML; 98% MP) from RAG1 third positions.

The 18 species with GC^sub 3^ > 70% are all members of clade B or clade I (Fig. 4), suggesting that the anomalous grouping of these taxa is due to shared similarity in base composition. The ID values we computed to quantify base-compositional heterogeneity lend additional credence to this hypothesis. A UPGMA phenogram that clusters taxa based on ID scores reveals three main groupings: (1) taxa with GC^sub 3^ > 70%, (2) taxa with GC^sub 3^ = 61-66%, and (3) taxa with GC^sub 3^ = 57-59% (Fig. 4, inset). The largest differences in ID scores (at least 8.3 ID units) occur between taxa with GC^sub 3^ > 70% and the remaining taxa (with GC^sub 3^ = 57- 66%).

If the anomalous grouping of clades B and I is due to convergence in GC^sub 3^ content, then we would expect to observe that AT [arrow right] GC substitutions exceed AT [arrow left] GC substitutions at the nodes subtending each of the affected clades. To test for this bias, we used parsimony to optimize character-state transitions for RAG1 third positions on a tree derived from analysis of all other data (first and second positions of RAG1+IRBP+DMP1 +morphology). In fact, character-state reconstructions at the nodes subtending clades B and I do show an extraordinary bias in favor of AT [arrow right] GC substitutions (Table 3). This bias continues, although less strongly, along branches within each of these clades: the ratio of the total number of AT [arrow right] GC substitutions is on average three times higher than the number of AT [arrow left] GC substitutions within clades B and I (Table 3). In contrast, exemplar clades that are less GC-rich (i.e., with GCs in the range of 57% to 66%) have a ratio of AT [arrow right] GC to AT [much less than] GC substitutions that is much closer (0.9) to the one-to-one ratio expected under equilibrium.

Interestingly, third positions of RAG1 also appear to be evolving more quickly among taxa that have elevated GCa content than among other didelphids, a phenomenon that seems to be attributable both to an increased rate of nucleotide substitution and to an increased mutation rate. A higher substitution rate is suggested by significant rate heterogeneity across taxa at RAG1 third positions as assessed by a hierarchical likelihood-ratio test designed to assess the fit of a molecular clock (-InL^sub GTR+I+Γ+clock = 9268.74, δ= 547.28, df - 39, P 0.001), together with a significant positive correlation between pairwise patristic distances (based on ML-estimated branch lengths) and the extent of departure from base-compositional homogeneity (represented by ID values) for RAGl third positions (Mantel test: 1000 replicates, r = 0.495, P < 0.001). Moreover, a penalized likelihood analysis of molecular evolutionary rate indicates that third positions of RAG1 are evolving much more quickly in the GCs-rich clades (B and I) than they are in any other taxa (Fig. 5). This rate increase in GC^sub 3^- rich taxa appears to be peculiar to the RAG1 locus because it is not apparent for DMP1 or IRBP. Although DMP1 does exhibit significant among-taxa rate heterogeneity (-1nL^sup GTR+I+Γ+clock = 5966.36^, δ = 117.78, df - 39, P [much less than] 0.001), long branches for this data partition are not associated with the same taxa that show long branches for RAGl third positions. IRBP exhibits no significant rate heterogeneity among didelphids (- InLGTR+I+Γ+clock = 4511.05, δ = 53.67, df = 39, p = 0.059).

Mutation rates are impossible to measure directly from these data, but rough estimates can be obtained from individual heterozygosity values. Under the standard neutral expectation, heterozygosity scales as the product of effective population size and mutation rate (Kimura, 1983). If effective population size does not differ substantially among these taxa (an assumption that we cannot currently evaluate), then heterozygosity could be a reasonable proxy for mutation rate. Individual heterozygosity values for the RAG1 locus across didelphids range from zero to 0.005 and are significantly positively correlated with GC^sub 3^ content (r^sup 2^ = 0.0961, P = 0.049). In contrast, heterozygosity values for the other two loci (IRBP and DMPl) show no correlation with GC^sub 3^ content (r^sup 2^ = 0.004, P = 0.696). Therefore, RAG1 also appears to be unique in having an elevated mutation rate for those taxa with high GC^sub 3^ content.

To assess whether taxa in clade B and clade I are now at base- compositional equilibrium, we used the test described by Eyre- Walker (1994) that compares the relative GC content between pairs of taxa at sites that differ between them. If two sequences are at equilibrium, we expect half of the differing sites to be G or C and half to be A or T, regardless of whether the two lineages show different overall rates of evolution. Departures from this expectation can be assessed using the binomial distribution. We performed this test for six phylogenetically independent pairs, four of which show an elevated GC^sub 3^ content and two that do not. Neither of the comparisons involving taxa in non-GC-rich clades shows a significant departure from equilibrium substitution rates, whereas all of the comparisons involving GC-rich taxa do (Table 4). Therefore, members of the two clades that show shifts to higher GC content have yet to return to base-compositional equilibrium.

Phylogenetic Effects of Correcting for Base-Compositional Heterogeneity

In order to explore the utility of phylogenetic methods designed to correct for base-compositional heterogeneity among taxa, we analyzed third positions of RAGl using both LogDet distance transformations (Lockhart et al, 1994) and the GG98 maximum likelihood method that accounts for nonhomogeneous patterns of substitution among taxa (Galtier and Gouy, 1998). Because LogDet analysis may be affected by among-site rate heterogeneity (Lockhart et al., 1994), we calculated LogDet distances three different ways: including only parsimony informative sites; including a likelihood estimate for the proportion of invariant sites (pinv = 0.32); and including all sites without correction. Unfortunately, none of these approaches seem to be effective for these data. All analyses recover the anomalous B+I group with very high bootstrap support (Fig. 4). Because the GG98 model allows GC content to vary across all branches in the tree, it is parameter rich and too computationally intensive to perform a ML bootstrap analysis. To improve our confidence that we recovered the best tree while searching under this model, we used two different tree topologies (one based on all data except RAG1; one based on RAG1 only) as starting points for the branchswapping algorithm. Both searches converged on the same topology, which contained the anomalous B+I group.

DISCUSSION

Phylogenetic Implications

Our results clearly indicate that one episode of GC enrichment occurred in the ancestral lineage of clade B (Gracilinanus + Thylamys + Cryptonanus), and that another occurred in the ancestor of clade I (Marmosa + Micoureus). The resulting convergence in base composition is sufficiently strong that phylogenetic analyses of RAG1 sequences, separately or in combination with other data sets, strongly support a group (B+I) that contains these two GC-rich but historically unrelated clades. It is of no small concern that two phylogenetic methods explicitly designed to account for base- compositional heterogeneity among taxa not only failed to recover the correct placement of clades B and I but supported their anomalous grouping with high statistical support. Thus, RAG1 sequence evolution in didelphids provides a particularly clear example of convergent base composition misleading phylogenetic inference.

Didelphid RAG1 evolution is of further interest because it shows that severe base-compositional shifts can occur over comparatively short evolutionary timescales. Recent phylogenetic research indicates that the marsupial crown group is not known from pre- Tertiary fossils (Rougier et al., 1998). Moreover, according to Steiner et al.'s (2005) age estimates of didelphid cladogenetic events (based on relaxed-molecular-clock methods applied to nuclear gene sequences), the base-compositional shifts described in this report probably occurred sometime between 12 and 30 million years ago. By contrast, most previously reported examples of convergent base composition (cited in our Introduction) involve clades that are hundreds of million years old. The relative recency of base- compositional shifts for didelphid RAG1 sequences may explain, in part, why the two GC-rich lineages we identified show no evidence of having reached base-compositional equilibrium.

Several workers have suggested that convergent base composition will only be problematic for phylogenetic inference if taxa differ in GC content by more than about 10% (Eyre-Walker, 1998; Galtier and Gouy, 1995). Others (Conant and Lewis, 2001) have emphasized that additional evolutionary processes, including biased transition/ transversion ratios and among-site rate heterogeneity, may interact with base-compositional heterogeneity to lead phylogenetic inference astray. In particular, the inability to correctly infer phylogeny from convergently GC-biased sequences appears to be exacerbated by short internal branches (Conant and Lewis, 2001; Jermiin et al., 2004). For RAG1 third positions, the average difference in GC content between biased and nonbiased clades (ΔGC= 14%) is above the minimum value considered problematic for phylogenetic inference. After we added 4194 bp from other nuclear genes (IRBP, DMP1, and the remaining RAG1 codon positions), however, the average base- compositional difference between these clades dropped well below this value (AGC = 3%). Nevertheless, we were still unable to recover the correct phylogeny from a combined-data analysis.

If small differences in base composition among taxa can have misleading phylogenetic consequences, then the sensitivity of existing tests for this phenomenon is an important methodological issue. For the concatenated-gene data set, both the contingency test for base-compositional stationarity employed by PAUP* and the ID test (Kumar and Gadagkar, 2001) revealed significant base- compositional heterogeneity across taxa, even though ΔGC was only 3%. In contrast, the χ^sup 2^ goodnessof-fit test for individual departures from average GC content was unable to identify particular taxa that deviated from the average base composition of these data. However, we note that these AGC estimates are based on the total number of sites sequenced, rather than the subset of those sites that are parsimony informative. When we recalculated average differences in base composition between clades based on informative sites alone, the average ΔGC for RAG1 third positions between biased and nonbiased clades is 29%. When we performed a similar correction for the concatenated-gene data set, the average GC content between these clades was still high enough to be considered problematic (ΔGC = 15%). Therefore, we echo other workers (Jermiin et al., 2004; Conant and Lewis, 2001; Lockhart et al., 1994; Collins et al., 2005) in cautioning that it is prudent to test carefully for basecompositional bias prior to any phylogenetic analysis of nucleotide sequence data.

Although RAG1 is nonstationary and phylogenetically misleading at one hierarchical level, this gene provides both increased resolution and support for relationships among taxa within the GC-rich clades (clades B and I; compare Fig. 1A and B). Therefore, we would like to include the relevant phylogenetic signal that this gene contains in a combined-data analysis while mitigating the misleading signal due to base-compositional convergence. One possibility involves explicitly modeling (Galtier and Gouy, 1998;, Yang and Roberts, 1995) or correcting for (Lockhart et al., 1994; Lake, 1994) base- compositional heterogeneity across taxa. However, model-based approaches can be parameter-rich and difficult to implement in a combined data analysis; they are also conspicuously ineffective when applied to our nonstationary RAG1 sequences. Short of eliminating the biased gene entirely, different ways of handling such problematic datasets have been proposed (Loomi\s and Smith, 1990; Hasegawa and Hashimoto, 1993; Kumar and Gadagkar, 2001; Collins et al., 2005; Jeffrey et al., 2006). Among the most tractable are (1) analyzing amino acid sequences, (2) eliminating third positions, or (3) including only third position transversions (e.g., by "RY- coding"). As applied to our data, all three methods effectively eliminate the misleading signal due to basecompositional convergence: the spurious B+I clade was not recovered by any method, and nodes C and H were recovered with comparably high bootstrap support in all three analyses (Fig. 6). However, these procedures do affect the degree of resolution and support for recovered relationships within the GC-rich clades.

Evolution of Base Composition

Although RAG1 is a problematic source of data for inferring phylogenetic relationships among didelphids, examining the evolution of this gene on the best estimate of didelphid phylogeny derived from other data provides useful information about the causal mechanisms of base-compositional heterogeneity. RAG1 clearly exhibits very different evolutionary dynamics in two GC-rich lineages than it does in other clades. In particular, the higher rates of substitution and mutation for taxa in clades B and I merit explanation. One possibility invokes recent theory that links recombination rate and the evolution of GC content in mammalian genomes.

During meiotic recombination, different alleles may form heteroduplex DNA that contains T:G and OA mismatches (Lamb, 1984; Galtier et al., 2001). Theoretically, these mismatches should be repaired correctly to T: A and C:G, respectively. However, mismatch repair in mammals appears to be biased by a DNA repair mechanism that preferentially repairs T:G mismatches to C:G (Brown and Jiricny, 1987). This process has been termed biased gene conversion (BGC) and, if it occurs frequently, could cause regions of high recombination to become increasingly GC-rich (Galtier et al., 2001; Montoya-Burgos et al., 2003; Meunier and Duret, 2004). Mounting evidence strongly favors BGC as an important process driving base- compositional differences across the mammalian genome (Eyre-Walker and Hurst, 2001; Birdsell, 2002; Piganeau et al., 2002; Galtier, 2004; Perry and Ashworth, 1999). Moreover, if sequences are not at basecompositional equilibrium (and these GC-rich didelphid sequences are not; Table 4), then BGC predicts that the rate of silent substituiton (K^sub s^) will be both elevated and GC-biased (Piganeau et al., 2002). Consistent with these predictions, the GC- rich didelphid lineages all exhibit an elevated rate of substitution and a biased direction of substitution from AT [arrow right] GC. In addition, heterozygosity values for these taxa suggest that the GC- rich lineages also exhibit a relatively high rate of mutation, which is not necessarily a predicted outcome of BGC. Nonetheless, recent studies have suggested that recombination is itself mutagenic (Hellman et al., 2003; Lercher and Hurst, 2002), and it is possible that this aspect of recombination is directly contributing to the elevated mutation rate for these GC-rich taxa.

Recent evidence suggests that recombination hotspots can change genomic location over short evolutionary timescales; indeed, hotspot location varies even within human populations (Coop, 2005; Jeffreys et al., 2005). Given these findings, one possible explanation for our results is that RAG1 independently acquired a recombination hotspot in the ancestral lineages of clades B and I. Although we do not have direct estimates of recombination rates for these taxa, if RAG1 contains a 1- to 2-kb recombination hotspot, then we would expect the GC-rich region in affected taxa to be localized over only part of the sequenced 2.7-kb fragment. A sliding-window analysis reveals that GC-rich and non-GC-rich sequences have a similar base- compositional profile across the first 600 to 700 bp of the RAG1 exon, and that the GC-rich sequences diverge and adopt their characteristic base composition thereafter (Fig. 7). Although this result may be consistent with alternative explanations, including selection maintaining base composition across the 5' region of the gene, it may be evidence of an elevated recombination rate for part of the gene. If so, and if recombination hotspots are as evolutionarily mobile as recent studies suggest, then the base- compositional convergence that plagues our phylogenetic analysis of the RAG1 locus may be more widespread across nuclear loci than previously anticipated.

ACKNOWLEDGMENTS

We particularly thank Keith Barker and George Barrowclough for helpful discussions and comments. As ever, we are grateful to the curators and collections support personnel of the many museums that provided tissue samples for this work. In particular, we thank Phil Myers and Steve Hinshaw (UMMZ), Bruce Patterson and Bill Stanley (FMNH), Mark Engstrom and Burton Lim (ROM), Jim Patton (MVZ), and Robert Baker (TTU). Andrew Simons, Robert Zink, Mark Hafner, Roderic Page, Peter Foster, and two anonymous reviewers provided helpful comments on an earlier draft. The University of Minnesota Supercomputing Institute provided valuable computing time and resources. This study was supported by funds from the National Science Foundation (DEB-0211952) and the University of Minnesota.

REFERENCES

Baker, R. H., and R. DeSalle. 1997. Multiple sources of character information and the phylogeny of Hawaiian drosophilids. Syst. Biol. 46:654-673.

Barker, F. K., G. F. Barrowclough, and J. G. Groth. 2002. A phylogenetic hypothesis for passerine birds: Taxonomic and biogeographic implications of an analysis of nuclear DNA sequence data. Proc. R. Soc. Lond. B 269:295-308.

Barrowclough, G. F., J. G. Groth, and L. A. Mertz. 2006. The RAG- 1 exon in the avian order Caprimulgiformes: Phylogeny, heterozygosity, and base composition. Mol. Phylog. and Evol. 41:238- 248.

Birdsell, J. A. 2002. Integrating genomics, bioinformatics, and classical genetics to study the effects of recombination on genome evolution. Mol. Biol. Evol. 19:1181-1197.

Bremer, K. 1988. The limits of amino-acid sequence data in angiosperm phylogenetic reconstruction. Evolution 42:795-803.

Bremer, K. 1994. Branch support and tree stability. Cladistics 10:295-304.

Brown, T. C., and J. Jiricny. 1987. A specific mismatch repair event protects mammalian cells from loss of 5-methylcytosine. Cell 50:945-950.

Chang, B. S. W., and D. L. Campbell. 2000. Bias in phylogenetic reconstruction of vertebrate rhodopsin sequences. Mol. Biol. Evol. 17:1220-1231.

Collins, T. M., O. Fedrigo, and G. P. Naylor. 2005. Choosing the best genes for the job: The case for stationary genes in genome- scale phylogenetics. Syst. Biol. 54:493-500.

Conant, G. C., and P. O. Lewis. 2001. Effects of nucleotide composition bias on the success of the parsimony criterion in phylogenetic inference. Mol. Biol. Evol. 18:1024-1033.

Coop, G. 2005. Can a genome change its (hot)spots? Trends Ecol. Evol. 20:643-645.

Duret, L., and L. D. Hurst. 2001. The elevated GC content at exonic third sites is not evidence against neutralist models of isochore evolution. Mol. Biol. Evol. 18:757-762.

Eyre-Walker, A. 1994. DNA mismatch repair and synonymous codon evolution in mammals. Mol. Biol. Evol. 11:88-98.

Eyre-Walker, A. 1998. Problems with parsimony in sequences of biased base composition. J. Mol. Evol. 47:686-690.

Eyre-Walker, A., and L. D. Hurst. 2001. The evolution of isochores. Nat Revi: Genet. 2:549-555.

Felsenstein, J. 1985. Confidence limits on phylogenies: An approach using the bootstrap. Evolution 39:783-791.

Foster, P. G., and D. A. Hickey. 1999. Compositional bias may affect both DNA-based and protein-based phylogenetic reconstructions. J. Mol. Evol. 48:284-290.

Galtier, N. 2004. Recombination, GC-content, and the human pseudoautosomal boundary paradox. Trends Genet. 20:347-349.

Galtier, N., and M. Gouy. 1995. Inferring phylogenies from DNA sequences of unequal base compositions. Proc. Natl. Acad. of Sci., USA 92:11317-11321.

Galtier, N., and M. Gouy. 1998. Inferring pattern and process: Maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol. Biol. Evol. 15:871-879.

Galtier, N., G. Piganeau, D. Mouchiroud, and L. Duret. 2001. GC- content evolution in mammalian genomes: The biased gene conversion hypothesis. Genetics 159:907-911.

Gardner, A. L. 2005. Order Didephimorphia. Pages 3-22 in Mammal species of the world: A taxonomic and geographic reference (D. E. Wilson and D. M. Reeder, eds.). Johns Hopkins University Press, Baltimore, Maryland.

Groth, J. G., and G. F. Barrowclough. 1999. Basal divergences in birds and the phylogenetic utility of the nuclear RAG-1 gene. Mol. Phylogenet. Evol. 12:115-123.

Hasegawa, M., and T. Hashimoto. 1993. Ribosomal RNA trees misleading? Nature 361:23.

Hellman, I., I. Ebersberger, S. E. Ptak, S. Paabo, and M. Przeworski. 2003. A neutral explanation for the correlation of diversity with recombination rates in humans. Am. J. Hum. Genet. 72:1527-1535.

Jansa, S. A., J. F. Forsman, and R. S. Voss. 2006. Different patterns of selection on the nuclear genes IRBP and DMP-1 affect the efficiency but not the outcome of phylogeny estimation for didelphid marsupials. Mol. Phylogen. Evol. 38:363-380.

Jansa, S. A., and R. S. Voss. 2000. Phylogenetic studies on didelphid marsupials I. Introduction and preliminary results from nuclear IRBP gene sequences. J. Mammal. Evol. 7:43-77.

Jansa, S. A., and R. S. Voss. 2005. Phylogenetic relationships of the marsupial genus Hyladelphys based on nuclear gene sequences and morphology. J. Mammal. 86:853-865.

Jeffreys, A. J., and C. A. May. 2004. Intense and highly localized gene conversion activity in human meiotic crossover hot spots. Nat. Genet. 2:151-156.

Jeffrey, O., H. Brinkmann, F. Delsuc, and H. Philippe. 2006. Phylogenomics: The beginning of incongruence? Trends Genet. 22:225- 231.

Jermiin, L. S., S. Y. \W. Ho., F. Ababneh, J. Robinson, and A. W. D. Larkum. 2004. The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Syst. Biol. 53:638- 643.

Kimura, M. 1983. The neutral theory of molecular evolution. Cambridge University Press, Cambridge, UK.

Kumar, S., and S. R. Gadagkar. 2001. Disparity index: A simple statistic to measure and test the homogeneity of substitution patterns between molecular sequences. Genetics 158:1321-1327.

Kumar, S., K. Tamura, and M. Nei. 2004. MEGA3: Integrated software for molecular evolutionary genetics analysis and sequence alignment. Brief. Bioinformatics 5:150-163.

Lake, J. A. 1994. Reconstructing evolutionary trees from DNA and protein sequences: Paralinear distances. Proc. Nat. Acad. Sci., USA 91:1455-1459.

Lamb, B. C. 1984. The properties of meiotic gene conversion important in its effects of evolution. Heredity 53:113-138.

Lercher, M. J., and L. D. Hurst. 2002. Human SNP variability and mutation rate are higher in regions of high recombination. Trends Genet. 18:337-340.

Lockhart, P. J., C. J. Howe, D. A. Bryant, T. J. Beanland, and A. W. D. Larkum. 1992. Substitutional bias confounds inference of cyanel origins from sequence data. J. Molec. Evol. 34:153-162.

Lockhart, P. J., M. A. Steel, M. A. Hendy, and D. Penny. 1994. Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol. 12:605-612.

Loomis, W. R, and D. W. Smith. 1990. Molecular phylogeny of Dictyostelium discoideum by protein sequence comparison. Proc. Nat. Acad. Sci. USA 87:9093-9097.

Mahon, A. R., and K. E. Carpenter. 2004. A phylogeny of putative perciform suborders based on the nuclear RAG1 gene. Integr. Comp. Biol. 44:723-723.

Meunier, J., and L. Duret. 2004. Recombination drives the evolution of GC-content in the human genome. Mol. Biol. Evol. 21:984- 990.

Montoya-Burgos, J. I., P. Boursot, and N. Galtier. 2003. Recombination explains isochores in mammalian genomes. Trends Genet. 19:128-130.

Mooers, A. O., and E. C. Holmes. 2000. The evolution of base composition and phylogenetic inference. Trends Ecol. Evol. 15:365- 369.

Nowak, R.M. 1999. Walker's mammals of the world, 6th edition. Johns Hopkins University Press, Baltimore, Maryland.

Perry, J., and A. Ashworth. 1999. Evolutionary rate of a gene affected by chromosomal position. Curr. Biol. 9:987-989.

Piganeau, G., D. Mouchiroud, L. Duret, and C. Gautier. 2002. Expected relationship between the silent substitution rate and the GC content: Implications for the evolution of isochores. J. Mol. Evol. 54:129-133.

Posada, D., and K. A. Crandall. 1998. ModelTest: Testing the model of DNA substitution. Bioinformatics 14:817-818.

Rougier, G. W., J. R. Wible, and M. J. Novacek. 1998. Implications of Deltatheridium specimens for early marsupial history. Nature 396:459-463.

San Mauro, D., M. Vences, M. Alcobendas, R. Zardoya, and A. Meyer. 2005. Initial diversification of living amphibians predated the breakup of Pangaea. Am. Nat. 165:590-599.

Shimodaira, H., and M. Hasegawa. 1999. Multiple comparisons of logliklehoods with applications to phylogenetic inference. Mol. Biol. Evol. 18:1114-1116.

Sorenson, M. D. 1999. TreeRot. Boston University, Boston, Massachusetts.

Steiner, C., M. Tilak, E. J. P. Douzery, and F. M. Catzeflis. 2005. New DNA data from a transthyretin nuclear intron suggest an Oligocene to Miocene diversification of living South America opossums (Marsupialia: Didelphidae). Mol. Phylogenet. Evol. 35:363- 379.

Swofford, D. L. 2003. PAUP*. Phylogenetic analysis using parsimony (*and other methods), version 4. Sinauer Associates, Sunderland, Massachusetts.

Tarro, R., F. Rodriguez-Trelles, and F. J. Ayala. 2000. Tree rooting with outgroups when they differ in their nucleotide composition from the ingroup: The Drosophila sultans and willistoni groups, a case study. Mol. Phylogenet. Evol. 16:344-349.

Tarr'o, R., F. Rodriguez-Trelles, and F. J. Ayala. 2001. Shared nucleotide composition biases among species and their impact on phylogenetic reconstructions of the Drosophilidae. MoI. Biol. Evol. 18:1464-1473.

Voss, R. S., and S. A. Jansa. 2003. Phylogenetic studies on didelphid marsupials II. Nonmolecular data and new IRBP sequences: Separate and combined analyses of didelphine relationships with denser taxon sampling. Bull. Am. Mus. Nat. Hist. 276:1-82.

Voss, R. S., D. P. Lunde, and S. A. Jansa. 2005. On the contents of Gracilinanus Gardner and Creighton, 1989, with the description of a previously unrecognized clade of small didelphid marsupials. Am. Mus. Novitates 3482:1-34.

Waddell, P. J., and S. Shelley. 2003. Evaluating placental interordinal phylogenies with novel sequences including RAG1, gammafibrinogen, ND6, and mt-tRNA, plus MCMC-driven nucleotide, amino acid, and codon models. Mol. Phylogenet. Evol. 28:197-224.

First submitted 18 May 2006; reviews returned 22 July 2006; final acceptance 5 September 2006

Associate Editor: Mark Hafner

KARL F. GRUBER,1 ROBERT S. VOSS,2 AND SHARON A. JANSA1

1 Bell Museum of Natural History and Department of Ecology, Evolution, and Behavior, University of Minnesota, St. Paul, Minnesota 55708, USA; E-mail: jansa003@umn.edu (S.A.J.).

2 Department of Mammalogy, American Museum of Natural History, New York, New York 20024, USA

Copyright Society of Systematic Biologists Feb 2007

(c) 2007 Systematic Biology. Provided by ProQuest Information and Learning. All rights Reserved.

Base-Compositional Heterogeneity in the RAG1 Locus Among Didelphid Marsupials: Implications for Phyl
Back to Current Headlines
Repair Credit   Gate Operator   Harley Davidson Accessories   Wedding DJ Massachusetts