PHYLOGENETICS - bio-siva homepage [PDF]

PHYLOGENETICS : After working with sequences for a while, one develops an intuitive understanding that, for a given gene, closely related organisms have similar sequences and more distantly related organisms have more dissimilar sequences. It also seems logical that, given a set of sequences, it should be possible to reconstruct the evolutionary (ancestral) relationships among genes and among organisms. This involves creating a branching structure, termed a phylogeny or tree, that illustrates the relationships between the sequences. The study of the relationships between groups of organisms is called taxonomy, an ancient and venerable branch of classical biology. The branch of taxonomy that deals with numerical data such as DNA sequences is known as phylogenetics. This subject also overlaps significantly with a branch of evolutionary biology known as molecular evolution. The study of the evolutionary relationships between a species and its predecessors is referred to as phylogenetic analysis.

TERMINOLOGY :
node : a node represents a taxonomic unit. This can be a taxon (an existing species) or an ancestor (an unknown species that represents the ancestor of 2 or more species).
branch : defines the relationship between the taxa in terms of descent and ancestry.
topology : the branching pattern.
branch length : often represents the number of changes that have occurred in that branch.
root : the common ancestor of all taxa.
distance scale : a scale which represents the number of differences between sequences (e.g. 0.1 means a 10 % difference between two sequences).

GRAPHS AND TREES : The clearest way to visualize the evolutionary relationships among organisms is to use a graph. In mathematics, a graph is a simple diagram used to show relationships between entities, such as numbers, objects or places. Entities are represented by nodes and relationships between them are shown as links or edges. Some simple graphs are as follows:

These graphs can be used for a variety of purposes. For example, G1 might represent a cyclic compound with atoms at the nodes and chemical bonds as the links, and G2 might represent a street map showing one-way streets. Note that in G2 the links have direction; this is a directed graph or digraph. G3 represents a special type of graph known as a tree, because it has n nodes, n-1 links and no circuits. A graph contains a circuit if it is possible to move from one node back to itself along a sequence of links in which no link is used more than once. G1 and G2 are not trees because each contains a circuit of nodes (A-B-C).
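The tree condition above (n nodes, n-1 links, no circuits) can be checked mechanically. The following is a minimal sketch in Python, not part of the original notes; the function name `is_tree` and the example graphs are hypothetical, and links are treated as undirected (so the one-way streets of G2 are not modelled).

```python
from collections import defaultdict, deque

def is_tree(nodes, links):
    """Return True if the undirected graph has n nodes, n-1 links and no circuits."""
    nodes = list(nodes)
    if len(links) != len(nodes) - 1:
        return False
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    # With exactly n-1 links, the graph is free of circuits exactly when
    # every node can be reached from any starting node (it is connected).
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(nodes)

# Hypothetical examples: a three-node circuit and a small branching graph
circuit = is_tree("ABC", [("A", "B"), ("B", "C"), ("C", "A")])
branching = is_tree("ABCD", [("A", "B"), ("A", "C"), ("A", "D")])
```

Here `circuit` is False because three links join three nodes, and `branching` is True.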

DIFFERENT WAYS OF DRAWING TREE : Trees can be drawn in different ways. There are trees with unscaled branches and trees with scaled branches.
Unscaled branches : the length is not proportional to the number of changes. Sometimes the number of changes is indicated on the branches with numbers. The nodes represent divergence events on a time scale.
Scaled branches : the length of the branch is proportional to the number of changes. The distance between 2 species is the sum of the lengths of all branches connecting them.
It is also possible to draw these trees with or without a root. For rooted trees, the root is the common ancestor. For each species, there is a unique path that leads from the root to that species. The direction of each path corresponds to evolutionary time. An unrooted tree specifies the relationships among species and does not define the evolutionary path.
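For scaled branches, the distance between two species is the sum of the branch lengths on the path joining them. A minimal Python sketch of that sum follows; the function name `path_length`, the internal node labels X and Y, and the branch lengths are all hypothetical values chosen for illustration.

```python
def path_length(tree, start, goal):
    """Sum the branch lengths on the unique path between two nodes of a tree."""
    stack = [(start, None, 0)]
    while stack:
        node, parent, travelled = stack.pop()
        if node == goal:
            return travelled
        for neighbour, length in tree[node].items():
            if neighbour != parent:          # never walk back along the same branch
                stack.append((neighbour, node, travelled + length))
    raise ValueError(f"{goal!r} is not reachable from {start!r}")

# Hypothetical unrooted tree: X joins H and C, Y joins X, G and O.
# Each branch length is an assumed number of changes.
branches = [("H", "X", 1), ("C", "X", 1), ("X", "Y", 1), ("G", "Y", 2), ("O", "Y", 4)]
tree = {}
for a, b, length in branches:
    tree.setdefault(a, {})[b] = length
    tree.setdefault(b, {})[a] = length

human_chimp = path_length(tree, "H", "C")    # 1 + 1 = 2
human_gorilla = path_length(tree, "H", "G")  # 1 + 1 + 2 = 4
```

Because a tree has no circuits, the path found is the only one, so no shortest-path search is needed.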

As an example, consider phylogenetic trees for the entities C (chimpanzee), G (gorilla), H (human) and O (orangutan). Different types of phylogenetic tree are as follows:

The first point to note about these trees is that there are two types of node. The ancestral nodes (represented by boxes) give rise to branches. These may link to other ancestral nodes, or they may link to terminal nodes (shown as letters), which are also known as leaves or tips. Leaves represent known species and mark the end of the evolutionary pathway. Ancestral nodes may or may not correspond to a known species.

The second point to note is that T1 and T2 are unrooted trees, whereas T3 is a rooted tree. T1 and T2 are identical except that T2 is drawn in a conventional style with angled branches to look more like a real tree. These are described as unrooted trees because neither of them shows the position of the last common ancestor of all the species. In T3, the position of this ancestor is indicated by the node F.

The third point to note is that each tree is binary, that is, no ancestral node gives rise to more than two branches. Thus, the evolution of species is represented as a series of bifurcations, which fits in with cladistic theory. For this reason, the trees may also be termed cladograms.

The fourth point to note is that the length of the branches may or may not be significant. In T1, T2 and T3, all the branches are of the same length, whereas in T4 the branches are of different lengths. The lengths of the branches may be used to indicate the actual evolutionary distances between taxa. A cladogram which conveys a sense of evolutionary time using branch lengths may be called a phylogram.

Finally, note that T4 shows the same data as T1 and T2, but it is presented as it would appear in the output format of multiple sequence alignment software such as ClustalW/X. In this format, the links are still represented by lines but the ancestral nodes are represented by vertical lines rather than boxes. A clade is defined as a group of organisms descended from a particular common ancestor. The groups of organisms included within a clade are defined arbitrarily. If, for example, the distance represented by the upper of the two lines beneath T4 was thought to be significantly close, H and C would be said to be in the same clade. If, however, the criterion was the length of the lower line, H, C and G would be placed in the same clade.

TREE STYLES : There are six styles possible: Cladogram, Phenogram, Curvogram, Eurogram, Swoopogram, and Circular Tree. The six styles can be described as follows (assuming a vertically growing tree):

Cladogram : nodes are connected to other nodes and to tips by straight lines going directly from one to the other. This gives a V-shaped appearance. The default settings if there are no branch lengths are designed to yield a V-shaped tree with a 90-degree angle at the base.

Phenogram : nodes are connected to other nodes and to tips by a horizontal and then a vertical line. This gives a particularly precise idea of horizontal levels.

Curvogram : nodes are connected to other nodes and to tips by a curve which is one fourth of an ellipse, starting out horizontally and then curving upwards to become vertical. This pattern was suggested by Joan Rudd.

Eurogram : so called because it is a version of the cladogram diagram popular in Europe. Nodes are connected to other nodes and to tips by a diagonal line that goes outwards and goes at most one-third of the way up to the next node, then turns sharply straight upwards and is vertical. Unfortunately it is nearly impossible to guarantee, when branch lengths are used, that the angles of divergence of lines are the same.

Swoopogram : this option connects two nodes, or a node and a tip, using two curves that are each one-quarter of an ellipse. The first part starts out vertical and then bends over to become horizontal. The second part, which is at least two-thirds of the total, starts out horizontal and then bends up to become vertical. The effect is that two lineages split apart gradually, then more rapidly, then both turn upwards.

Circular Tree : this is a style introduced by David Swofford in PAUP*. The tree grows outward from a central point, being essentially a Phenogram style tree in polar coordinates. The tips form a 360-degree circle. The "vertical" lines run outward radially from the center, and the "horizontal" lines are arcs of circles centered on it.

METHODS OF PHYLOGENETIC ANALYSIS : There are two major groups of analyses to examine phylogenetic relationships between sequences :
1. Phenetic methods : trees are calculated by similarities of sequences and are based on distance methods. The resulting tree is called a dendrogram and does not necessarily reflect evolutionary relationships. Distance methods compress all of the individual differences between pairs of sequences into a single number.
2. Cladistic methods : trees are calculated by considering the various possible pathways of evolution and are based on parsimony or likelihood methods. The resulting tree is called a cladogram. Cladistic methods use each alignment position as evolutionary information to build a tree.

PHENETIC METHODS (DISTANCE BASED METHODS) : The phenetic approach is popular with molecular evolutionists because it relies heavily on character data - such as sequences - and requires relatively few assumptions. In this approach, a tree is constructed by considering the phenotypic similarities of the species without trying to understand the evolutionary history that brought the species to their current phenotypes. Since a tree constructed by this "current data only" method does not necessarily reflect evolutionary relationships, but rather is designed to represent phenotypic similarity, trees constructed via this method are called phenograms. In the phenetic method, a tree is built in the following steps:
1. Starting from an alignment, pairwise distances are calculated between DNA sequences as the sum of all base pair differences between two sequences (the most similar sequences are assumed to be closely related). This creates a distance matrix. All base changes can be considered equally, or a matrix of the possible replacements can be used. Insertions and deletions are given a larger weight than replacements. Insertions or deletions of multiple bases at one position are given less weight than multiple independent insertions or deletions. It is possible to correct for multiple substitutions at a single site.
2. From the obtained distance matrix, a phylogenetic tree is calculated with clustering algorithms. These cluster methods construct a tree by linking the least distant pair of taxa, followed by successively more distant taxa.
UPGMA clustering (Unweighted Pair Group Method using Arithmetic averages) : this is the simplest method.
Neighbor Joining : this method tries to correct the UPGMA method for its assumption that the rate of evolution is the same in all taxa.

CLUSTERING ALGORITHMS : Clustering algorithms use distances to calculate phylogenetic trees. These trees are based solely on the relative numbers of similarities and differences between a set of sequences. First, pairwise distances must be computed for all sequences that will be used to build the tree, thus creating a distance matrix. Cluster methods construct a tree by linking the least distant pairs of taxa, followed by successively more distant taxa. Cluster algorithms can be applied to many different types of molecular data, including isozymes, restriction sites, RFLPs, etc.

UPGMA : The simplest of the distance methods is a type of cluster algorithm that is known as UPGMA (Unweighted Pair Group Method using Arithmetic averages). This method has gained popularity mostly because of its simplicity and also because of its speed (though many other distance methods are as fast). The GCG program PILEUP uses UPGMA to create its dendrogram of DNA sequences, and then uses this dendrogram to guide its multiple alignment algorithm. The GCG program DISTANCES calculates pairwise distances between a group of sequences.

CALCULATING DISTANCES :

It is often useful to measure the genetic distance between two species, between two populations, or even between two individuals. For example, if two individuals come to a hospital and both have the same genetic disease, you might want to know whether they are related and might therefore have inherited the same gene. Otherwise, this might be a manifestation of two separate mutations. The entire concept of numerical taxonomy is based on computing phylogenies from a table of distances. In the case of sequence data, pairwise distances must be calculated between all sequences that will be used to build the tree, thus creating a distance matrix. Distance measurements also allow for some measurement of the reliability of the final tree by the calculation of a variance, which is computed from the variances of each entry in the initial distance matrix. Distance methods give a single measure of the amount of evolutionary change between two genomes since divergence from a common ancestor.

Distances between DNA sequences are relatively simple to compute as the sum of all base pair differences between two sequences (this type of algorithm can only work for pairs of sequences that are similar enough to be aligned). Either all base changes are considered equally, or a simple matrix of the frequencies of the 12 possible types of replacements (each of the four bases can be replaced by one of the three other bases) can be used. Differences due to insertions/deletions (indels) are generally given a larger weight than replacements, but indels of multiple bases at one position are given less weight than multiple independent indels. It is also possible to correct for multiple substitutions at a single site, which are more common in distant relationships and for rapidly evolving sites.

Distances between amino acid sequences are a bit more complicated to calculate. Some amino acids can replace one another with relatively little effect on the structure and function of the final protein, while other replacements can be devastating. From the standpoint of the genetic code, changes between some amino acids can be made by a single DNA mutation, while others require two or even three changes in the DNA sequence. In practice, what has been done is to calculate tables of frequencies of all amino acid replacements between sets of related amino acid sequences in the databanks. The most famous of these tables is the PAM250 matrix created by Dayhoff et al. in 1978. PAM stands for "Point Accepted Mutation" (sometimes glossed as "Percent Accepted Mutations"). The table is derived from a mutation probability matrix, i.e. the probability that any amino acid will change to any other amino acid. In the scoring form of the table, a score above 0 indicates that two amino acids replace each other more often than expected by chance; that is, they are functionally equivalent and/or easily inter-mutable. Scores below 0 indicate two amino acids that are seldom interchanged.

DISTANCES writes its output into a matrix file that can then be used by the program GROWTREE to draw a tree based on either UPGMA or the Neighbor-Joining method. Absolute distance calculations can be corrected for multiple substitutions at a site with a variety of formulas.
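The two phenetic steps described in this section (build a distance matrix, then cluster it) can be sketched in a few lines of Python. This is an illustrative sketch only, not the GCG implementation: the function names and the four toy sequences are hypothetical, gaps are not weighted, and the Jukes-Cantor formula is used as one example of a correction for multiple substitutions.

```python
import math
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Proportion of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(seq_a, seq_b)) / len(seq_a)

def jukes_cantor(p):
    """Correct an observed proportion of differences for multiple substitutions."""
    if p >= 0.75:
        raise ValueError("Jukes-Cantor distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def upgma(names, sequences):
    """Cluster sequences by UPGMA and return the topology as nested tuples."""
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (subtree, tips)
    dist = {(i, j): jukes_cantor(p_distance(sequences[names[i]], sequences[names[j]]))
            for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)            # link the least distant pair
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        for k in clusters:
            d_ik = dist.pop((min(i, k), max(i, k)))
            d_jk = dist.pop((min(j, k), max(j, k)))
            # Arithmetic average in which every original tip counts equally
            dist[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        del dist[(i, j)]
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree

# Hypothetical aligned sequences, invented for illustration
toy = {"H": "AAAAAAAAAA", "C": "AAAAAAAAAT", "G": "AAAAAAACCT", "O": "AAAACCCCCT"}
topology = upgma(["H", "C", "G", "O"], toy)   # ("O", ("G", ("H", "C")))
```

The sketch returns only the branching order. A fuller version would also record the height of each merge (half the distance between the merged clusters) to obtain branch lengths.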

NEIGHBOR JOINING : Another very popular distance method is the Neighbor Joining (NJ) method. The NJ algorithm is commonly applied with distance tree building, regardless of the optimization criterion. The fully resolved tree is "decomposed" from a fully unresolved "star" tree by successively inserting a branch between a pair of closest neighbors and the remaining terminals in the tree. The closest neighbor pair is then consolidated, effectively reforming a star tree, and the process is repeated. The method is comparatively rapid, requiring only a few seconds or less for a 50 sequence tree.

This method attempts to correct the UPGMA method for its (frequently invalid) assumption that the same rate of evolution applies to each branch. Because no constant rate is assumed, the position of the root cannot be inferred, and the method yields an unrooted tree. A modified distance matrix is constructed to adjust for these differences in the rate of evolution of each taxon. The least distant pairs of nodes are linked and their common ancestral node is added to the tree; their terminal nodes are pruned from the tree. This continues until only two nodes remain.

ADVANTAGES:
1. It is fast and thus suited for large datasets and for bootstrap analysis.
2. It permits lineages with largely different branch lengths.
3. It permits correction for multiple substitutions.
DISADVANTAGES:
1. The sequence information is reduced.
2. It gives only one possible tree.
3. It is strongly dependent on the model of evolution used.

FITCH-MARGOLIASH METHOD (FM) : The Fitch-Margoliash method seeks to maximize the fit of the observed pairwise distances to a tree by minimizing the squared deviation of all possible observed distances relative to all possible path lengths on the tree. There are several variations that differ in how the error is weighted. The variance estimates are not completely independent because errors in all internal tree branches are counted at least twice.

MINIMUM EVOLUTION (ME) METHOD : Minimum evolution seeks to find the shortest tree that is consistent with the path lengths measured in a manner similar to Fitch-Margoliash; that is, it works by minimizing the squared deviation of observed from tree-based distances. Unlike FM, ME does not use all possible pairwise distances and all possible associated tree path lengths. Rather, it fixes the location of internal tree nodes based on the distance to external nodes, and then optimizes the internal branch lengths according to the minimum measured error between these "observed" points.
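The least-squares criterion behind the Fitch-Margoliash method can be written down directly. The sketch below only scores a given tree; it does not search for the best one. The function name, the `power` parameter used to express the different error weightings, and all distances are hypothetical. With `power=2` each squared deviation is divided by the squared observed distance, which is the weighting usually attributed to Fitch and Margoliash; `power=0` leaves the deviations unweighted.

```python
def least_squares_error(observed, on_tree, power=2):
    """Sum over all pairs of (observed - tree distance)^2 / observed^power."""
    total = 0.0
    for pair, d_obs in observed.items():
        total += (d_obs - on_tree[pair]) ** 2 / d_obs ** power
    return total

# Hypothetical observed distances between four taxa
observed = {("H", "C"): 2, ("H", "G"): 4, ("C", "G"): 4,
            ("H", "O"): 6, ("C", "O"): 6, ("G", "O"): 7}

# Path lengths implied by two hypothetical candidate trees:
# tree_a groups H with C, tree_b groups H with G.
tree_a = {("H", "C"): 2, ("H", "G"): 4, ("C", "G"): 4,
          ("H", "O"): 6, ("C", "O"): 6, ("G", "O"): 6}
tree_b = {("H", "C"): 4, ("H", "G"): 2, ("C", "G"): 4,
          ("H", "O"): 6, ("C", "O"): 6, ("G", "O"): 6}

error_a = least_squares_error(observed, tree_a, power=0)  # only G-O deviates, by 1
error_b = least_squares_error(observed, tree_b, power=0)  # three pairs deviate
```

The tree with the smaller error fits the observed distances better, so `tree_a` would be preferred here.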

CLADISTIC METHODS (CHARACTER-BASED METHODS) : The second approach, known as the cladistic method, relies on a knowledge of ancestral relationships as well as current data. Cladistic methods of phylogenetic analysis are based on the explicit assumption that a set of sequences evolved from a common ancestor by a process of mutation and selection without mixing (hybridization or other horizontal gene transfers). Via cladistic methods, a tree is reconstructed by considering the various possible pathways of evolution and choosing from amongst these the best possible tree. Trees reconstructed via these methods are called cladograms. Computer algorithms based on the cladistic model generally rely on parsimony or maximum likelihood methods for the calculation of relationships and building of trees. In order to use cladistic software with sequence data, certain sequences must be designated as ancestral and others as derived. As a result, changes at certain positions will have a larger effect than others on the location of each sequence in the predicted tree.

Cladistic methods based on Parsimony : For each position in the alignment, all possible trees are evaluated and are given a score based on the number of evolutionary changes needed to produce the observed sequence changes. The most parsimonious tree is the one with the fewest evolutionary changes for all sequences to derive from a common ancestor. This is a more time-consuming method than the distance methods.

Cladistic methods based on Maximum Likelihood : This method also uses each position in an alignment, evaluates all possible trees, and calculates the likelihood for each tree using an explicit model of evolution (parsimony just looks for the fewest evolutionary changes). The likelihoods for each aligned position are then multiplied to provide a likelihood for each tree. The tree with the maximum likelihood, that is, the tree under which the observed data are most probable, is chosen. This is the slowest method of all but seems to give the best result and the most information about the tree.

For character data (physical traits of organisms such as morphology of organs etc.) and for higher (or perhaps we should say deeper) levels of taxonomy, the cladistic approach is almost certainly superior. However, cladistic methods are often difficult to implement, with assumptions that are not always satisfied by molecular data. Phenetic approaches generally lead to faster algorithms, and they often have nicer statistical properties for molecular data.

STEPS IN CLADISTIC METHODS :
1. Choose the taxa whose evolutionary relationships interest you. These taxa must form a clade if you hope to come up with plausible results.
2. Determine the characters (features of the organisms) and examine each taxon to determine the character states (decide whether each taxon does or does not have each character). All taxa must be unique.
3. Determine the polarity of characters (whether each character state is original or derived in each taxon). Note that this step is not absolutely necessary in some computer algorithms. Examining the character states in outgroups to the taxa you are considering helps you determine the polarity.
4. Group taxa by synapomorphies (shared derived characteristics), not plesiomorphies (original, or "primitive", characteristics).
5. Work out conflicts that arise by some clearly stated method, usually parsimony (minimizing the number of conflicts).
6. Build your cladogram, which is NOT an evolutionary tree, following these rules:
-- All taxa go on the endpoints of the cladogram, never at nodes.
-- All cladogram nodes must have a list of synapomorphies which are common to all taxa above the node (unless the character is later modified).
-- All synapomorphies appear on the cladogram only once unless the character state was derived separately by evolutionary parallelism.
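Steps 2 to 4 above (score character states, polarize them against an outgroup, group taxa by shared derived states) can be illustrated with a small character matrix. This is a sketch under simplifying assumptions: characters are binary, the outgroup is assumed to show the original state of every character, and the taxa, characters and function names are chosen only for illustration.

```python
def derived_states(matrix, outgroup):
    """For each ingroup taxon, the indices of characters that differ from the outgroup."""
    original = matrix[outgroup]
    return {
        taxon: {i for i, state in enumerate(states) if state != original[i]}
        for taxon, states in matrix.items()
        if taxon != outgroup
    }

def synapomorphies(matrix, outgroup, character_names):
    """Map each character that is derived in two or more ingroup taxa to those taxa."""
    derived = derived_states(matrix, outgroup)
    shared = {}
    for i, name in enumerate(character_names):
        carriers = sorted(taxon for taxon, chars in derived.items() if i in chars)
        if len(carriers) >= 2:      # a state derived in one taxon groups nothing
            shared[name] = carriers
    return shared

# Illustrative matrix: 1 = character present, 0 = absent
character_names = ["jaws", "four limbs", "hair"]
matrix = {
    "lamprey": [0, 0, 0],   # outgroup, assumed to show the original states
    "shark":   [1, 0, 0],
    "frog":    [1, 1, 0],
    "human":   [1, 1, 1],
}
groups = synapomorphies(matrix, "lamprey", character_names)
```

Here `groups` maps "jaws" to frog, human and shark, and "four limbs" to frog and human. Hair is derived in one taxon only, so it supports no grouping.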

Synapomorphies are the basis for cladistics : Cladistics is a particular method of hypothesizing relationships among organisms. Like other methods, it has its own set of assumptions, procedures, and limitations. Cladistics is now accepted as the best method available for phylogenetic analysis, for it provides an explicit and testable hypothesis of organismal relationships. The basic idea behind cladistics is that members of a group share a common evolutionary history, and are "closely related", more so to members of the same group than to other organisms. These groups are recognized by sharing unique features which were not present in distant ancestors. These shared derived characteristics are called synapomorphies.

Note that it is not enough for organisms to share characteristics; in fact, two organisms may share a great many characteristics and not be considered members of the same group. For example, consider a jellyfish, a starfish, and a human; which two are most closely related? The jellyfish and starfish both live in the water, have radial symmetry, and are invertebrates, so you might suppose that they belong together in a group. This would not reflect evolutionary relationships, however, since the starfish and human are actually more closely related. It is not just the presence of shared characteristics which is important, but the presence of shared derived characteristics. In the example above, all three characteristics are believed to have been present in the common ancestor of all animals, and so are trivial for determining relationships, since all three organisms in question belong to the group "animals". While humans are different from the other two organisms, they differ only in characteristics which arose newly in an ancestor which is not shared with the other two. Choosing the right characters is one of the most important steps in a cladistic analysis.

There are three basic assumptions in cladistics:
1. Any group of organisms is related by descent from a common ancestor.
2. There is a bifurcating pattern of cladogenesis.
3. Change in characteristics occurs in lineages over time.

The first assumption is a general assumption made for all evolutionary biology. It essentially means that life arose on earth only once, and therefore all organisms are related in some way or other. Because of this, we can take any collection of organisms and determine a meaningful pattern of relationships, provided we have the right kind of information. Again, the assumption states that all the diversity of life on earth has been produced through the reproduction of existing organisms.

The second assumption is perhaps the most controversial; that is, that new kinds of organisms may arise when existing species or populations divide into exactly two groups. There are many biologists who hold that multiple new lineages can arise from a single originating population at the same time, or near enough in time to be indistinguishable from such an event. While this model could conceivably occur, it is not currently known how often this has actually happened. The other objection raised against this assumption is the possibility of interbreeding between distinct groups. This, however, is a general problem of reconstructing evolutionary history, and although it cannot currently be handled well by cladistic methods, no other system has yet been devised which accounts for it.

The final assumption, that characteristics of organisms change over time, is the most important assumption in cladistics. It is only when characteristics change that we are able to recognize different lineages or groups. The convention is to call the "original" state of the characteristic plesiomorphic and the "changed" state apomorphic. The terms "primitive" and "derived" have also been used for these states, but they are often avoided by cladists, since those terms have been much abused in the past.

Maximum Parsimony : Parsimony is the most popular method for reconstructing ancestral relationships. Parsimony involves evaluating all possible trees and giving each a score based on the number of evolutionary changes needed to explain the observed data. The most parsimonious tree is the one that requires the fewest evolutionary changes for all sequences to derive from a common ancestor. This is easiest to explain by example. Consider four sequences: ATCG, TTCG, ATCC, and TCCG. Imagine a tree that branches once at the first position, thus grouping ATCG and ATCC on one branch, and TTCG and TCCG on the other branch. Then each branch sub-divides into two sub-branches, for a total of 3 nodes in the tree.

Counting backward from the bottom, each sequence is separated from the root by two nodes, so the sum of the changes is equal to 8. This is a more parsimonious tree than one that first divides ATCC on its own branch, then splits off ATCG, and finally divides TTCG from TCCG.

This tree also has three nodes, but when all of the distances back to the root are summed, the total is equal to 9.

ADVANTAGES:
1. It is based on shared and derived characters. It is therefore a cladistic rather than a phenetic method.
2. It does not reduce sequence information to a single number.
3. It tries to provide information on the ancestral sequences.
4. It evaluates different trees.
DISADVANTAGES:
1. It is slow in comparison with distance methods.
2. It does not use all the sequence information.
3. It does not correct for multiple mutations.
4. It does not provide information on the branch lengths.
5. It is notorious for its sensitivity to codon bias.
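The scoring step of maximum parsimony, counting the evolutionary changes a tree requires, is commonly done with Fitch's algorithm. The Python sketch below applies it to the four sequences of the example. It is a sketch only: the function names and the labels a to d are hypothetical, trees are written as nested tuples, and it counts base substitutions per alignment column, which is a different tally from the node-counting used in the worked example above.

```python
def fitch_site(tree, states):
    """Return (possible ancestral bases, number of changes) for one alignment column."""
    if isinstance(tree, str):                 # a tip: its base is known
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    changes = left_changes + right_changes
    common = left_set & right_set
    if common:                                # children agree: no change needed here
        return common, changes
    return left_set | right_set, changes + 1  # children disagree: one change

def parsimony_score(tree, alignment):
    """Total number of changes the tree requires over all alignment columns."""
    n_columns = len(next(iter(alignment.values())))
    total = 0
    for column in range(n_columns):
        states = {name: seq[column] for name, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total

alignment = {"a": "ATCG", "b": "TTCG", "c": "ATCC", "d": "TCCG"}
balanced = (("a", "c"), ("b", "d"))   # ATCG with ATCC, TTCG with TCCG
ladder = ("c", ("a", ("b", "d")))     # ATCC first, then ATCG, then TTCG from TCCG
other = (("a", "b"), ("c", "d"))      # ATCG with TTCG, ATCC with TCCG

scores = [parsimony_score(t, alignment) for t in (balanced, ladder, other)]  # [3, 3, 4]
```

Counted this way, the two trees described above each require 3 substitutions, while the third grouping requires 4 and is therefore less parsimonious.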

Maximum Likelihood : Maximum Likelihood is a powerful method of inference using sequence data, allowing one to take into account information about some of the processes of molecular evolution by using evolutionary ‘models’ in the phylogenetic analysis, which parsimony does not allow. An evolutionary model might consider the following characteristics of DNA sequences to account for biases in how they evolve over time:
1. Nucleotide frequencies for a gene sequence may not be equal (25% for each nucleotide, G, A, T, C); e.g., mtDNA of many insects is 85-90% A/T-rich.
2. Transitional substitutions (purine-purine; pyrimidine-pyrimidine) are generally more likely to occur than transversional (purine-pyrimidine) substitutions.
3. Not all DNA transformations (substitutional changes) occur at the same rate (e.g., 3rd codon positions change relatively frequently compared to 1st and 2nd codon positions).

Likelihood begins with a tree topology (with known branch lengths) and knowledge about the probability of different types of nucleotide substitution changes (an evolutionary model). It uses this information to predict the likelihood that a given tree and substitution model will produce the observed DNA sequences in your data set. It continues to swap around the branches of the tree until it finds the tree topology that maximizes the likelihood that the tree could produce the observed DNA data. It is a powerful statistical approach to inferring phylogenies, but it is computationally slow, which limits the number of taxa you can include in a phylogenetic analysis.

The method of maximum likelihood attempts to reconstruct a phylogeny using an explicit model of evolution. Certainly, for this given model of evolution, no other method will perform as well nor provide you with as much information about the tree. Unfortunately, this is computationally difficult to do and hence the model of evolution must be a simple one. Even with simple models of evolutionary change the computational task is enormous, and this is thus the slowest of all methods. This method really works best when it is used to test (or improve on) an existing tree.
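The likelihood of one fixed tree can be computed column by column with Felsenstein's pruning algorithm. The sketch below assumes the Jukes-Cantor model (equal base frequencies, all substitutions equally likely), which is simpler than the models with unequal frequencies or transition bias mentioned above. The function names, the tree encoding and the branch lengths are hypothetical, and no search over topologies is performed.

```python
import math

BASES = "ACGT"

def jc69_probability(start, end, t):
    """Jukes-Cantor probability that base `start` has become `end` after branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if start == end else 0.25 - 0.25 * decay

def partial_likelihoods(node, column):
    """Probability of the tips below `node`, for each possible base at `node`."""
    if isinstance(node, str):                      # a tip: the observed base is certain
        return {b: 1.0 if b == column[node] else 0.0 for b in BASES}
    (left, t_left), (right, t_right) = node
    below_left = partial_likelihoods(left, column)
    below_right = partial_likelihoods(right, column)
    result = {}
    for parent in BASES:
        via_left = sum(jc69_probability(parent, c, t_left) * below_left[c] for c in BASES)
        via_right = sum(jc69_probability(parent, c, t_right) * below_right[c] for c in BASES)
        result[parent] = via_left * via_right
    return result

def log_likelihood(tree, alignment):
    """Sum of per-column log-likelihoods (the log of the product over columns)."""
    n_columns = len(next(iter(alignment.values())))
    total = 0.0
    for i in range(n_columns):
        column = {name: seq[i] for name, seq in alignment.items()}
        at_root = partial_likelihoods(tree, column)
        total += math.log(sum(0.25 * at_root[b] for b in BASES))
    return total

# An internal node is ((left subtree, branch length), (right subtree, branch length));
# the branch lengths here are assumed values in expected substitutions per site.
alignment = {"a": "ATCG", "b": "TTCG", "c": "ATCC", "d": "TCCG"}
tree = (((("a", 0.1), ("c", 0.1)), 0.1), ((("b", 0.1), ("d", 0.1)), 0.1))
score = log_likelihood(tree, alignment)
```

A full maximum likelihood analysis would repeat this calculation while adjusting branch lengths and swapping branches, keeping the tree with the highest score. Logarithms are summed rather than multiplying raw likelihoods to avoid numerical underflow on long alignments.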

ADVANTAGES:
1. Likelihood methods often have lower variance than other methods.
2. They tend to be robust to many violations of the assumptions in the evolutionary model. Even with very short sequences they tend to outperform alternative methods such as parsimony or distance methods.
3. The method is statistically well founded.
4. They evaluate different tree topologies.
5. They use all the sequence information.
DISADVANTAGES:
1. The method is very CPU intensive and thus extremely slow.
2. The result is dependent on the model of evolution used.

CLADOGRAMS : The output from a phylogenetic analysis is a hypothesis of the relationships between different taxa. This hypothesis can be represented as a cladogram, a branching diagram. Cladograms have a lot in common with the notion of family trees. In a family tree we trace back our ancestry. For example, in the family tree on the top right, the ancestors of all the rest of the family are the initial black dot and yellow square. These ancestors give rise to three children, one of which mates and has two children. We can all trace our lineages back to one set of ancestors. All species have ancestors too. So, for example, sometime in the past an ancestral species (father) of Homo sapiens walked the earth. This ancestor went extinct (died), but left descendant species (children). In family trees, we can talk coherently about real ancestors. In biology, the ancestors are often gone, sometimes without a trace. All we have left are the children. Reading cladograms is much like reading a family tree. Both are rich in information. Cladograms, like family trees, tell the pattern of ancestry and descent. Unlike family trees, ancestors in cladistics ideally give rise to only two descendant species. Also unlike family trees, new species form from the splitting of old species. In speciation, it does not take two to tango. The formation of the two descendant species is called a splitting event. The ancestor is usually assumed to "die" after the splitting event.

In the first tree, labelled Cladogram A, notice the green dots. Each dot has a letter associated with it. The dots with letters are the nodes of the tree. The stems of the tree end with the taxa under consideration, represented by boxes. At each node a splitting event occurs. The node therefore represents the end of the ancestral taxon, and the stems represent the species that split from the ancestor. The two taxa that split from a node are called sister taxa, because they are like siblings descended from the same parent or ancestor. Sister taxa must be more closely related to one another than to any other group because they share a close common ancestor. In the same way, you are more closely related to your siblings than to anyone else, since you share common parents.

Let's focus on node C in Cladogram A. At this node, the ancestor goes extinct but leaves two siblings, hypothesized to be humans and gorillas. Humans and gorillas are sister taxa and are more closely related to one another than either is to chimpanzees or baboons. Working down the tree we come to node B. At this node the ancestor of the humans and gorillas split from the chimpanzees. Therefore the chimpanzees' sister taxon is the human/gorilla ancestor together with its descendants: a sister taxon can be an ancestor and all its descendants. We call an ancestor plus all its descendants a clade. A cladogram shows us hypothesized clades.

Finally we come to node A. Here we find the splitting event that led to the baboons and to the ancestor of the chimpanzees, humans and gorillas. By working our way down the cladogram we have learned the pattern of splitting. We have found that chimpanzees, humans and gorillas are more closely related to each other than to baboons. In this example, baboons are the outgroup.

If we rearranged the taxa on the tree, we would change the pattern of speciation events. In Cladogram B, humans and chimpanzees are sister taxa, and in Cladogram C, chimpanzees and gorillas are sister taxa. Which of the three cladograms presented above is correct? None of the cladograms can be proved correct, but Cladogram B is the best supported of the three based on character data and is therefore hypothesized to best reflect the true branching pattern.
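The three cladograms discussed above can be sketched as nested pairs, which makes "clade" and "sister taxon" easy to compute. This is only an illustration: the tuple encoding and the function names are assumptions, and placing baboons as the outgroup in Cladograms B and C is assumed from the description of Cladogram A.

```python
# Each internal node is a pair (left, right); each leaf is a taxon name.
CLADOGRAM_A = ("baboon", ("chimpanzee", ("human", "gorilla")))
CLADOGRAM_B = ("baboon", ("gorilla", ("human", "chimpanzee")))  # outgroup assumed
CLADOGRAM_C = ("baboon", ("human", ("chimpanzee", "gorilla")))  # outgroup assumed


def leaves(node):
    """Return the set of taxa found at or below a node."""
    if isinstance(node, str):
        return frozenset([node])
    left, right = node
    return leaves(left) | leaves(right)


def clades(node):
    """List every clade (an ancestor plus all its descendants) as a set of taxa."""
    if isinstance(node, str):
        return []
    left, right = node
    return [leaves(node)] + clades(left) + clades(right)


def sister(node, target):
    """Return the sister taxon of `target` (a frozenset of taxa), or None."""
    if isinstance(node, str):
        return None
    left, right = node
    if leaves(left) == target:
        return leaves(right)
    if leaves(right) == target:
        return leaves(left)
    return sister(left, target) or sister(right, target)
```

For example, `sister(CLADOGRAM_A, frozenset(["chimpanzee"]))` returns the human/gorilla clade, matching the reading of node B given in the text.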

PHYLOGENETIC TREE RELIABILITY : There is no guaranteed way to verify that a phylogenetic tree represents the true path of evolutionary change. However, there are ways to test the reliability of phylogenetic predictions. First, if different methods of tree construction give the same result, this is good evidence that the tree is reliable. Second, the data can be resampled to test their statistical significance. In a technique called bootstrapping, positions are randomly sampled from anywhere within a multiple sequence alignment and assembled into new artificial alignments, which are then tested by tree building.

BOOTSTRAP ANALYSIS (DATA RESAMPLING METHOD) : Bootstrap analysis takes multiple random resamplings, with replacement, of the characters in the data matrix under study; that is, it resamples from the original sample. The resamples are called replicates (or pseudoreplicates), and each replicate character matrix is the same size as the original. Because resampling is done with replacement (each character is returned to the data set before the next one is drawn), some characters may show up several times, while others may not be represented in a replicate at all. A tree is reconstructed from each pseudoreplicate. Usually between 100 and 1000 replicates are run, giving a set of 100 to 1000 bootstrap trees. Some trees show one topology, others show another, and so on.
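The resampling step can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular package: the function names, the dictionary layout of the alignment and the toy sequences are all assumptions made for the example.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Return one bootstrap replicate of an alignment.

    alignment: dict mapping taxon name -> aligned sequence (all the same length).
    rng: a random.Random instance, so that runs can be reproduced.
    """
    n_sites = len(next(iter(alignment.values())))
    # Draw n_sites column indices WITH replacement: some columns are
    # picked several times and others are left out entirely.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Return a list of replicates, each the same size as the original."""
    rng = random.Random(seed)
    return [bootstrap_replicate(alignment, rng) for _ in range(n_replicates)]


# Toy alignment, made up for illustration only.
TOY_ALIGNMENT = {
    "human":      "ACGTACGTAC",
    "chimpanzee": "ACGTACGTAT",
    "gorilla":    "ACGTACTTAT",
    "baboon":     "ACTTACTTGT",
}
```

Each replicate would then be passed to whichever tree-building method is being tested; whole columns are resampled so that the states of all taxa at a site stay together.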

For large data sets all the bootstrap trees are assembled into a bootstrap consensus tree. On the consensus tree (often a 50% majority rule consensus tree), each node is labeled with the frequency of its occurrence among all the bootstrap trees.
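The node frequencies on a majority-rule consensus tree can be sketched as a simple count. In this hypothetical example each bootstrap tree is reduced to the set of clades it contains; the representation, the function names and the ten toy trees are assumptions for illustration, not real results.

```python
from collections import Counter


def clade_support(bootstrap_trees):
    """Return the fraction of bootstrap trees in which each clade occurs.

    bootstrap_trees: list of trees, each given as a set of clades,
    where a clade is a frozenset of taxon names.
    """
    counts = Counter()
    for clades in bootstrap_trees:
        counts.update(set(clades))  # count each clade once per tree
    n_trees = len(bootstrap_trees)
    return {clade: counts[clade] / n_trees for clade in counts}


def majority_rule(support, threshold=0.5):
    """Keep only clades found in more than `threshold` of the trees."""
    return {clade: s for clade, s in support.items() if s > threshold}


# Ten made-up bootstrap trees for three ingroup taxa.
H, C, G = "human", "chimpanzee", "gorilla"
TOY_BOOTSTRAP_TREES = (
    [{frozenset([H, C]), frozenset([H, C, G])}] * 7
    + [{frozenset([H, G]), frozenset([H, C, G])}] * 2
    + [{frozenset([C, G]), frozenset([H, C, G])}] * 1
)
```

With these toy trees the human/chimpanzee clade appears in 7 of 10 replicates, so it would be labelled 70% on the consensus tree, while the two competing groupings fall below the 50% cut-off and are dropped.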

Since sampling is random, some positions may be sampled more than once and others not at all. Ideally, the trees built by bootstrapping should always match the original tree, and this would be defined as '100% bootstrap support'. In practice, bootstrap support of 70% or more for a given branch of a tree is often taken to correspond to roughly 95% confidence that the branch is correct. Jack-knifing is a similar process in which about 50% of the original data are sampled, without replacement, and used to make a new matrix, from which phylogenetic relationships are reconstructed.

USES OF PHYLOGENIES :
1. TAXONOMY : Phylogenies are used to classify organisms on the basis of their evolutionary relationships.

2. ECOLOGY : Phylogenetic analysis can be used to determine socioecological relationships between organisms and the effects of the environment on organisms.

3. ORGANISMS RELATIONSHIPS : Relationships between different characters and organisms are compared with the help of phylogenetic analysis.

4. HOST AND PARASITE RELATIONSHIPS : Relationships between a host and its parasites can be determined with the help of phylogenetic analysis. This is helpful in identifying drugs against parasitic infection.

5. BIOGEOGRAPHY : Information about historical biogeography can be obtained from phylogenetic analysis. With it, relationships between dissimilar organisms can be determined.

COMPUTER SOFTWARE FOR PHYLOGENETICS : Because evolutionary biologists lack consensus about the basic principles of phylogenetic analysis, it is not surprising that a wide array of computer software is available for this purpose. Different theoretical models, different algorithms, different computer platforms, and different interfaces lead to a bewildering array of products from which to choose. Only three such programs are discussed here: PHYLIP, PAUP and MacClade.

PHYLIP : PHYLIP (the PHYLogeny Inference Package) is a package of programs for inferring phylogenies (evolutionary trees). It is available free over the Internet, and is written to work on as many different kinds of computer systems as possible. The source code is distributed (in C), and executables are also distributed. In particular, already-compiled executables are available for Windows (95/98/NT/2000/me/xp), MacOS 8 and 9, MacOS X, and Linux systems. Complete documentation is available in documentation files that come with the package. Methods that are available in the package include parsimony, distance matrix, and likelihood methods, including bootstrapping and consensus trees. Data types that can be handled include molecular sequences, gene frequencies, restriction sites and fragments, distance matrices, and discrete characters.

The programs are controlled through a menu, which asks the users which options they want to set, and allows them to start the computation. The data are read into the program from a text file, which the user can prepare using any word processor or text editor (but it is important that this text file not be in the special format of that word processor; it should instead be in "flat ASCII" or "Text Only" format). Some sequence analysis programs, such as the ClustalW alignment program, can write data files in the PHYLIP format. Most of the programs look for the data in a file called "infile"; if they do not find this file, they ask the user to type in the file name of the data file. Output is written to special files with names like "outfile" and "outtree". Trees written to "outtree" are in the Newick format, an informal standard agreed to in 1986 by authors of a number of major phylogeny packages. At this stage PHYLIP does not have a mouse-windows interface. PHYLIP is the most widely distributed phylogeny package, and competes with PAUP* to be the one responsible for the largest number of published trees.
PHYLIP has been in distribution since 1980, and has over 15,000 registered users. At present it contains 31 programs, which carry out different algorithms on different kinds of data. The programs in the package are:

Programs for molecular sequence data
PROTPARS: Protein parsimony
DNAPARS: Parsimony method for DNA
DNAMOVE: Interactive DNA parsimony
DNAPENNY: Branch and bound for DNA
DNACOMP: Compatibility for DNA
DNAINVAR: Phylogenetic invariants
DNAML: Maximum likelihood method
DNAMLK: DNA ML with molecular clock
DNADIST: Distances from sequences
PROTDIST: Distances from proteins
RESTML: ML for restriction sites
SEQBOOT: Bootstraps sequence data sets
COALLIKE: Coalescent likelihoods from sampled phylogeny estimates

Programs for distance matrix data
FITCH: Fitch-Margoliash and least-squares methods
KITSCH: Fitch-Margoliash and least-squares methods with evolutionary clock
NEIGHBOR: Neighbor-joining and UPGMA methods

Programs for gene frequencies and continuous characters
CONTML: Maximum likelihood method
GENDIST: Computes genetic distances
CONTRAST: Computes contrasts and correlations for comparative method studies

Programs for discrete state data (0/1)
MIX: Wagner, Camin-Sokal, and mixed parsimony criteria
MOVE: Interactive Wagner, Camin-Sokal, and mixed parsimony program
PENNY: Finds all most parsimonious trees by branch-and-bound
DOLLOP, DOLMOVE, DOLPENNY: Counterparts of the preceding three programs (MIX, MOVE, PENNY), but for the Dollo and polymorphism parsimony criteria
CLIQUE: Compatibility method
FACTOR: Recodes multistate characters

Programs for plotting trees and consensus trees
DRAWGRAM: Draws cladograms and phenograms on screens, plotters and printers
DRAWTREE: Draws unrooted phylogenies on screens, plotters and printers
CONSENSE: Majority-rule and strict consensus trees
RETREE: Reroots, changes names and branch lengths, and flips trees
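As a rough illustration of the plain-text input file described above, the sketch below formats an alignment in the sequential PHYLIP layout: a header line with the number of taxa and sites, then one line per taxon with the name padded to ten characters. The function name and the toy sequences are assumptions for the example; the PHYLIP documentation files remain the authority on the accepted variants of the format.

```python
def to_phylip(alignment):
    """Format an alignment as sequential PHYLIP text.

    alignment: dict mapping taxon name -> aligned sequence.
    Names longer than ten characters are truncated, shorter ones padded.
    """
    names = list(alignment)
    n_sites = len(alignment[names[0]])
    if any(len(seq) != n_sites for seq in alignment.values()):
        raise ValueError("all sequences must have the same length")
    lines = [f"{len(names)} {n_sites}"]
    for name in names:
        # The name field occupies exactly ten characters.
        lines.append(f"{name[:10]:<10}{alignment[name]}")
    return "\n".join(lines) + "\n"


# Toy data, made up for illustration.
EXAMPLE = to_phylip({"human": "ACGT", "chimp": "ACGA"})
```

Writing the returned text to a file named "infile" in the working directory would let most PHYLIP programs find it without prompting for a file name.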

PAUP : David Swofford of the School of Computational Science and Information Technology, Florida State University, Tallahassee, Florida has written PAUP* (which originally meant Phylogenetic Analysis Using Parsimony). PAUP* version 4.0beta has been released as a provisional version by Sinauer Associates, of Sunderland, Massachusetts. It has Macintosh, PowerMac, Windows, and Unix/OpenVMS versions. PAUP* has many options and close compatibility with MacClade. It includes parsimony, distance matrix, invariants, and maximum likelihood methods and many indices and statistical tests. It is described in a web page at http://paup.csit.fsu.edu/, which also contains links to its web pages at Sinauer Associates. It is available for the following types of systems:
1. PowerMacs and 68k Macintoshes, in a version with a full mouse-windows user interface.
2. Windows, in a version with a character-based command-line interface (which appears in a Windows window).
3. DOS or a Windows DOS box, in a version with a command-line interface.
4. Unix/Linux, in a version with a command-line interface, for Alpha Compaq/Digital Unix, Alpha Linux, PowerPC Linux, Intel-compatible Linux, Sun SPARC/UltraSPARC Solaris, and Alpha VMS.

MACCLADE : MacClade is a pioneering program for interactive analysis of the evolution of a variety of character types, including discrete characters and molecular sequences. It works on Macintoshes with MacOS (including MacOS X). MacClade enables you to use the mouse-window interface to specify and rearrange phylogenies by hand, and to watch the number of character steps and the distribution of states of a given character on the tree change as you do so. It has many other features beyond this, including the ability to edit data, print out phylogenies, and even simulate the evolution of data on a tree. MacClade was written by Wayne Maddison (now of the Department of Zoology, University of British Columbia) and David Maddison of the Department of Entomology, University of Arizona. It is distributed by Sinauer Associates of Sunderland, Massachusetts, USA (their company web site is http://www.sinauer.com/). MacClade is described on its Web page, at http://phylogeny.arizona.edu/macclade/macclade.html. A demonstration version of MacClade 4 is also available there. A much earlier and less capable version, 2.1 (which, for example, cannot read nucleic acid sequences and has many fewer features for discrete characters), is also available by anonymous ftp from the Indiana and EMBL molecular biology software servers at ftp.bio.indiana.edu and ftp.ebi.ac.uk, in directories molbio/mac and pub/software/mac respectively, as a BinHexed and squeezed archive (macclade-old.hqx and macclade21.hqx, respectively).

WEB BASED PROGRAMS : A PC with a fast CPU can handle some of the internet-based tools for phylogenetic tree construction. WEBPHYLIP, PhyloBLAST, BLAST2 and orthologue search servers are very useful.

TREE CONSTRUCTION THROUGH CLADISTIC METHOD : Consider, again, these taxa, characters and states.

Characters: 1 amnion; 2 legs; 3 scales; 4 blood; 5 internal nostrils; 6 atrial septum; 7 two temporal fenestrations; 8 hemipenes; 9 gizzard; 10 pedicillate teeth; 11 feathers; 12 wings; 13 vertebrae.

Taxon       1    2    3    4     5    6    7    8    9    10   11   12   13
perch       no   no   yes  cold  no   no   no   no   no   no   no   no   yes
coelocanth  no   no   yes  cold  yes  yes  no   no   no   no   no   no   yes
salamander  no   yes  no   cold  yes  yes  no   no   no   yes  no   no   yes
frog        no   yes  no   cold  yes  yes  no   no   no   yes  no   no   yes
turtle      yes  yes  yes  cold  yes  yes  no   no   no   no   no   no   yes
man         yes  yes  no   warm  yes  yes  no   no   no   no   no   no   yes
gecko       yes  yes  yes  cold  yes  yes  yes  yes  no   no   no   no   yes
snake       yes  yes  yes  cold  yes  yes  yes  yes  no   no   no   no   yes
alligator   yes  yes  yes  cold  yes  yes  yes  no   yes  no   no   no   yes
budgy       yes  yes  no   warm  yes  yes  yes  no   yes  no   yes  yes  yes

The preceding matrix, again, can be represented numerically (for convenience). Here 1 marks the state treated as derived, so for character 3 (scales) a 1 means scales absent and for character 4 (blood) a 1 means warm:

Taxon         1  2  3  4  5  6  7  8  9 10 11 12 13
perch         0  0  0  0  0  0  0  0  0  0  0  0  0
coelocanth    0  0  0  0  1  1  0  0  0  0  0  0  0
salamander    0  1  1  0  1  1  0  0  0  1  0  0  0
frog          0  1  1  0  1  1  0  0  0  1  0  0  0
turtle        1  1  0  0  1  1  0  0  0  0  0  0  0
man           1  1  1  1  1  1  0  0  0  0  0  0  0
gecko         1  1  0  0  1  1  1  1  0  0  0  0  0
snake         1  1  0  0  1  1  1  1  0  0  0  0  0
alligator     1  1  0  0  1  1  1  0  1  0  0  0  0
budgy         1  1  1  1  1  1  1  0  1  0  1  1  0

Hennigian argumentation proceeds as follows: each apomorphic (derived) character state defines a relationship. That is, the presence of an amnion defines the group (turtle, man, gecko, snake, alligator, budgy). However, the absence of an amnion does NOT define the group (perch, coelocanth, salamander, frog) as having descended from a common ancestor exclusive of turtle, man, gecko, snake, alligator, and budgy. That is, character 1 defines the group:

but not the group:

We can then enumerate for all apomorphic states, for all characters, the group(s) hypothesized by each:
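The enumeration can be sketched directly from the numeric matrix. This is an illustrative sketch only: the function name is hypothetical, and the labels for characters 3 and 4 ("scales lost", "warm blood") are a reading of how the derived state appears to be coded, not wording from the original.

```python
# Derived-state labels for characters 1-13 (labels 3 and 4 are assumed readings).
CHARACTERS = [
    "amnion", "legs", "scales lost", "warm blood", "internal nostrils",
    "atrial septum", "two temporal fenestrations", "hemipenes", "gizzard",
    "pedicillate teeth", "feathers", "wings", "vertebrae",
]

# Rows of the numeric matrix: one digit per character, 1 = derived state.
MATRIX = {
    "perch":      "0000000000000",
    "coelocanth": "0000110000000",
    "salamander": "0110110001000",
    "frog":       "0110110001000",
    "turtle":     "1100110000000",
    "man":        "1111110000000",
    "gecko":      "1100111100000",
    "snake":      "1100111100000",
    "alligator":  "1100111010000",
    "budgy":      "1111111010110",
}


def groups_by_character(matrix, characters):
    """Map each character to the taxa sharing its derived state (coded 1).

    Characters with no derived state in any taxon define no group
    and are left out of the result.
    """
    groups = {}
    for index, name in enumerate(characters):
        members = [taxon for taxon, states in matrix.items() if states[index] == "1"]
        if members:
            groups[name] = members
    return groups


GROUPS = groups_by_character(MATRIX, CHARACTERS)
```

Only taxa carrying the derived state are grouped; taxa sharing the 0 state are never returned as a group, which mirrors the point made above about the absence of an amnion.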

Having determined the group defined by each character, you can then begin, stepwise, to resolve the tree by adding the characters one at a time...

PHYLOGENETIC TREE DRAWING USING PHENETIC METHODS (DISTANCE MATRIX METHOD) :

CONSTRUCTION OF A DISTANCE TREE USING CLUSTERING WITH THE UNWEIGHTED PAIR GROUP METHOD WITH ARITHMETIC MEAN (UPGMA): UPGMA is the simplest method of tree construction. It was originally developed for constructing taxonomic phenograms, i.e., trees that reflect the phenotypic similarities between operational taxonomic units (OTUs), but it can also be used to construct phylogenetic trees if the rates of evolution are approximately constant among the different lineages. For this purpose the number of observed nucleotide or amino-acid substitutions can be used. UPGMA employs a sequential clustering algorithm, in which local topological relationships are identified in order of similarity, and the phylogenetic tree is built in a stepwise manner. The initial step in tree construction is to form the distance matrix from the similarity studies. The following distance matrix is used for the tree construction:

       A    B    C    D    E
B      2
C      4    4
D      6    6    6
E      6    6    6    4
F      8    8    8    8    8

The first step of tree construction is to find the two most similar OTUs among all the OTUs and treat them as a new single OTU. Such an OTU is referred to as a composite OTU. Then, from the new group of OTUs, identify the pair with the highest similarity (i.e. the lowest distance), and so on, until only two OTUs are left.

FIRST STEP: From the distance matrix, OTUs A and B have the highest similarity, i.e. the lowest distance (2). These two OTUs are therefore clustered to form the new composite OTU (A,B). The branching point is positioned at a distance of 2/2 = 1 substitution. Thus the tree for the first step is as follows:

SECOND STEP : After combining A and B into (A,B), a new distance matrix is built from the composite OTU and the remaining OTUs. The distance between the composite OTU and each other OTU is calculated from the values in the previous distance matrix: the distance between a simple OTU and a composite OTU is the average of the distances between that simple OTU and each simple OTU in the composite, using the following formulae:
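The original formulae are not reproduced here; under the averaging rule just described they would read as below. The notation d with OTU subscripts is an assumption, and the general expression is the usual UPGMA average over all pairs of simple OTUs drawn from two clusters X and Y.

```latex
\[
d_{(X)(Y)} = \frac{1}{|X|\,|Y|} \sum_{i \in X} \sum_{j \in Y} d_{ij}
\]
\begin{align*}
d_{(AB)C} &= \tfrac{1}{2}\,(d_{AC} + d_{BC}) = \tfrac{1}{2}\,(4 + 4) = 4 \\
d_{(AB)D} &= \tfrac{1}{2}\,(d_{AD} + d_{BD}) = \tfrac{1}{2}\,(6 + 6) = 6 \\
d_{(AB)E} &= \tfrac{1}{2}\,(d_{AE} + d_{BE}) = \tfrac{1}{2}\,(6 + 6) = 6 \\
d_{(AB)F} &= \tfrac{1}{2}\,(d_{AF} + d_{BF}) = \tfrac{1}{2}\,(8 + 8) = 8
\end{align*}
```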

The distances among the other OTUs are taken from the old table. The new distance matrix, recomputed from those averages, is as follows:

        (A,B)   C     D     E
C         4
D         6     6
E         6     6     4
F         8     8     8     8

From the table, OTUs D and E have the least distance (4, tied with the distance between (A,B) and C, which are joined in the next step). A subtree is therefore drawn joining D and E into the new composite OTU (D,E). The branching point is positioned at a distance of 4/2 = 2 substitutions. Thus the subtree for the second step is as follows:

THIRD STEP: As in the second step, a new distance matrix is prepared using the composite OTUs (A,B) and (D,E). The distances between the new composite OTU (D,E) and the other OTUs are determined using the following formulae:
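Written out for this step, with each distance taken as the average over all pairs of simple OTUs (the subscript notation is an assumption, since the original formulae are not reproduced):

```latex
\begin{align*}
d_{(DE)(AB)} &= \tfrac{1}{4}\,(d_{AD} + d_{AE} + d_{BD} + d_{BE}) = \tfrac{1}{4}\,(6 + 6 + 6 + 6) = 6 \\
d_{(DE)C} &= \tfrac{1}{2}\,(d_{CD} + d_{CE}) = \tfrac{1}{2}\,(6 + 6) = 6 \\
d_{(DE)F} &= \tfrac{1}{2}\,(d_{DF} + d_{EF}) = \tfrac{1}{2}\,(8 + 8) = 8
\end{align*}
```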

The distances among the other OTUs are taken from the old table. The new distance matrix, recomputed from those averages, is as follows:

        (A,B)   C    (D,E)
C         4
(D,E)     6     6
F         8     8     8

From the table, OTUs (A,B) and C have the least distance. A subtree is therefore drawn joining (A,B) and C into the new composite OTU (AB,C). The branching point is positioned at a distance of 4/2 = 2 substitutions. Thus the subtree for the third step is as follows:

FOURTH STEP: As in the third step, a new distance matrix is formed. The distances between (AB,C) and the other OTUs are determined using the following formulae:
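Written out for this step, again averaging over all pairs of simple OTUs (assumed notation, since the original formulae are not reproduced):

```latex
\begin{align*}
d_{(ABC)(DE)} &= \tfrac{1}{6}\,(d_{AD} + d_{AE} + d_{BD} + d_{BE} + d_{CD} + d_{CE}) = \tfrac{1}{6}\,(6 \times 6) = 6 \\
d_{(ABC)F} &= \tfrac{1}{3}\,(d_{AF} + d_{BF} + d_{CF}) = \tfrac{1}{3}\,(8 + 8 + 8) = 8
\end{align*}
```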

The distances among the other OTUs are taken from the old table. The new distance matrix, recomputed from those averages, is as follows:

        (AB,C)  (D,E)
(D,E)     6
F         8      8

From the table, OTUs (AB,C) and (D,E) have the least distance. A subtree is therefore drawn joining (AB,C) and (D,E) into the new composite OTU (ABC,DE). The branching point is positioned at a distance of 6/2 = 3 substitutions. Thus the subtree for the fourth step is as follows:

FIFTH STEP: As in the other steps, a new distance matrix is formed. The distance between (ABC,DE) and the remaining OTU is determined using the following formula:
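Only one distance remains to be computed at this step; as an average over the five simple OTUs in the composite (assumed notation, since the original formula is not reproduced):

```latex
\[
d_{(ABCDE)F} = \tfrac{1}{5}\,(d_{AF} + d_{BF} + d_{CF} + d_{DF} + d_{EF}) = \tfrac{1}{5}\,(5 \times 8) = 8
\]
```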

No other OTUs remain, so the new distance matrix has a single entry:

        (ABC,DE)
F          8

Since only two OTUs remain, the final tree is drawn by joining (ABC,DE) and F. The branching point is positioned at a distance of 8/2 = 4 substitutions. Thus the tree for the fifth step is as follows:

Although this method leads essentially to an unrooted tree, UPGMA assumes equal rates of mutation along all the branches as its model of evolution. The theoretical root must therefore be equidistant from all OTUs. The root of the entire tree is thus positioned at d((ABCDE),F)/2 = 8/2 = 4. Thus the rooted tree is as follows:
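The whole procedure can be sketched in Python and run on the distance matrix used above. This is a minimal illustration under stated assumptions, not the code of any package: the function name, the frozenset-keyed distance dictionary and the Newick-style output string are choices made for the example. When two pairs are tied, as (A,B)-C and D-E are here, the sketch may join them in a different order from the text, but the resulting tree is the same.

```python
from itertools import combinations


def upgma(names, pair_dist):
    """Cluster OTUs by UPGMA.

    names: list of OTU labels.
    pair_dist: dict mapping frozenset({x, y}) -> distance between OTUs x and y.
    Returns (newick, merges); merges lists (set of members, node height) per join.
    """
    # Active clusters are keyed by a tuple of their member OTUs.
    newick = {(n,): n for n in names}
    height = {(n,): 0.0 for n in names}
    dist = {
        frozenset([(x,), (y,)]): float(pair_dist[frozenset([x, y])])
        for x, y in combinations(names, 2)
    }
    merges = []
    while len(newick) > 1:
        # Join the pair of clusters with the smallest distance.
        a, b = min(combinations(newick, 2), key=lambda p: dist[frozenset(p)])
        new = a + b
        node_height = dist[frozenset([a, b])] / 2
        # Size-weighted mean, equal to the average over all member pairs.
        for c in newick:
            if c not in (a, b):
                dist[frozenset([new, c])] = (
                    len(a) * dist[frozenset([a, c])] + len(b) * dist[frozenset([b, c])]
                ) / len(new)
        label = (
            f"({newick[a]}:{node_height - height[a]:g},"
            f"{newick[b]}:{node_height - height[b]:g})"
        )
        del newick[a], newick[b]
        newick[new] = label
        height[new] = node_height
        merges.append((frozenset(new), node_height))
    (tree,) = newick.values()
    return tree + ";", merges


NAMES = ["A", "B", "C", "D", "E", "F"]
# Lower triangle of the distance matrix from the text, row by row.
LOWER = {"B": [2], "C": [4, 4], "D": [6, 6, 6], "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
PAIRS = {
    frozenset([row, NAMES[i]]): value
    for row, values in LOWER.items()
    for i, value in enumerate(values)
}

TREE, MERGES = upgma(NAMES, PAIRS)
```

The node heights in `MERGES` reproduce the branching points worked out by hand: 1 for (A,B), 2 for (AB,C) and (D,E), 3 for (ABC,DE) and 4 for the root joining F.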



PHYLOGENY AND ONTOGENY: The evolutionary relationships between organisms are referred to as phylogeny, whereas ontogeny describes the development of a particular organism. Ontology, a separate term, refers to a systematic account of the relationships between objects, concepts or other entities, and is used in artificial intelligence and information science to structure information. There are two types of ontologic methods, namely static ontology and dynamic ontology. In a static ontology, there is a formal and explicit specification of the relationships between entities, rather like formal and inflexible classification systems. A dynamic ontology, by contrast, allows the relationships between entities to be progressively refined and updated; that is, the rules can be relaxed to fit around the problem rather than forcing descriptions onto the entities themselves.

MOLECULAR PHYLOGENY: The DNA of organisms in different lines of descent accumulates mutations over evolutionary time, leading to divergence in macromolecular sequences (DNA, RNA and protein sequences). Phylogenetic trees based on differences among macromolecular sequences are known as molecular phylogenies. Generally, the greater the divergence between two sequences, the more ancient their last common ancestor (LCA), so evolutionary trees can be reconstructed on this principle. Molecular phylogenies are very informative compared to those based on traditional anatomical or morphological characters. This is because they are wider in scope: for example, it is possible to compare flowering plants and mammals using protein sequences but not using morphological characters. There are also many different sequences to choose from, and data handling is consistent and objective.
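One simple way to put a number on sequence divergence is the proportion of aligned sites at which two sequences differ. The sketch below is an illustration only: the function name is hypothetical, skipping gap characters is a choice made here, and no correction for multiple substitutions at the same site is applied.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of compared sites at which two aligned sequences differ.

    Sites where either sequence has a gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


# Toy sequences, made up for illustration: one difference in eight sites.
EXAMPLE_DISTANCE = p_distance("ACGTACGT", "ACGTACGA")
```

Pairwise values of this kind, computed for every pair of sequences, are the sort of entries that fill a distance matrix such as the one used in the UPGMA example.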



LIMITATIONS IN PHYLOGENETICS: Neither the theory nor the practical application of any algorithm is universally accepted throughout the scientific community. Applying different software packages to a data set is very likely to give different answers, and minor changes to a data set are also likely to change the result profoundly. Despite all of these caveats, it is possible to calculate phylogenetic trees for data sets. Provided the data are clean, outgroups are correctly specified, appropriate algorithms are chosen, no assumptions are violated, bootstrapping is used, etc., can the true, correct tree be found and proven to be scientifically valid? Unfortunately, it is impossible ever to state conclusively what the "true" tree is for a group of sequences (or a group of organisms); taxonomy is constantly under revision as new data are gathered. Relationships calculated from sequence data actually represent the relationships between genes, which are not necessarily the same as the relationships between whole organisms. Your data (the sequence of some gene or some other form of sequence data) may not have had the same phylogenetic history as the species within which they are contained. Different genes evolve at different speeds, and there is always the possibility of horizontal gene transfer (by hybridization, vector-mediated DNA movement, or direct uptake of DNA).
