Transitions

Transitions: The Evolution of Life

Note from afarensis: RPM at Evolgen has graciously allowed me to cross-post a series he is doing on detecting natural selection using molecular data. The first two posts are available below. Be sure to visit RPM at the link above. Hopefully, this will be the first of many posts, here at Transitions, concerning genetics.

*****************************************************

Detecting Natural Selection (Introduction)

Introduction

This is the first of multiple postings I plan to write about detecting natural selection using molecular data (ie, DNA sequences). A lot of news releases in the popular press recently have dealt with genes under selection in humans. Most people probably don't understand what it means for a gene to be under selection and how researchers detect selection; I often throw around terms like “signature of selection”. These posts will explain some of the basic concepts in molecular evolution and the theory behind how selection is detected. They will be geared toward a scientifically literate reader, and I will do my best to explain the basics of molecular biology and genetics before getting into the nitty gritty. If you can read and understand PZ Myers, you should be in good shape. If John Hawks or Gene Expression are more your taste, this information will probably be old hat. Throughout these posts, I will address the following:

1. The organization of the genome
2. Simple comparative methods for detecting selection
3. More complex comparative methods
4. Nucleotide variation in populations
5. Non-neutral patterns of sequence polymorphism

This is by no means a table of contents, as I have yet to write any of these entries. This will be a sort of dynamic primer of molecular evolution – the next topic will be determined by previous postings. Check back soon for the first post in this series: “The Organization of the Genome”.

Detecting Natural Selection (Part 1)

The Organization of the Genome

This is the second of multiple postings I plan to write about detecting natural selection using molecular data (ie, DNA sequences). The first posting contained a brief introduction and can be found here. Before we can discuss how DNA sequences are used to identify evidence of natural selection, we must have an understanding of how the genome is organized. The genetic material that can be found in each and every one of your cells – and inside every cell in every living organism – is made up of two sugar-phosphate backbones connected by nitrogenous bases. The two “strands” are wrapped around each other in what is known as a double helix. The nitrogenous bases come in four flavors: adenine (A), guanine (G), thymine (T), and cytosine (C). The nitrogenous base along with the sugar and phosphate is known as a “nucleotide”. In the double helix, nucleotides pair – A is always paired with T, and G is always paired with C. One molecule of DNA goes on for millions of nucleotides, and is known as a “chromosome”. Along with the DNA, chromosomes also have proteins bound to them that help them wrap up into neat little packages. Humans have 46 chromosomes in every one of their cells. Each chromosome has a mate, or “homolog”, that contains nearly all of the same information, so we divide 46 chromosomes into 23 pairs. For each of the 23 pairs, one copy comes from your mother and one from your father. One of these pairs is a special set known as sex chromosomes (ie, X and Y). Females have two copies of the X chromosome (one from their mother and one from their father), whereas males have one X (from mom) and one Y (from dad). The National Center for Biotechnology Information has created a nifty genome browser that lets you explore the contents of each chromosome. Some of the sequences of nucleotides contain information that leads to the production of proteins. 
These proteins carry out essential functions within the cell (such as the production of more proteins), allow for communication between cells, and regulate cell division, among the many other tasks they perform. The sequences containing the information to produce a protein are known as “coding sequences”. (Note: some people refer to them as “genes”, but genes may also include sequences that encode RNAs that are never translated into proteins.) Each chromosome contains hundreds of coding sequences interspersed throughout the length of the DNA molecule. In between these coding sequences are non-coding sequences of nucleotides. These non-coding sequences may contain information that determines when and how the coding sequences are translated into proteins. Here’s a little drawing showing two coding sequences separated by a non-coding sequence. That’s it for now. Next time we’ll talk about the organization of coding sequences and their regulatory regions. I promise we’ll be discussing natural selection soon.

Detecting Natural Selection (Part 2)
The Organization of the Genes

This is the third of multiple postings I plan to write about detecting natural selection using molecular data (ie, DNA sequences). The first posting contained a brief introduction and can be found here. The second post described the organization of the genome. In the last entry I mentioned that the term gene is often used interchangeably with protein coding sequence. In this entry, I will describe the structure of protein coding genes. Only a portion of the gene contains protein coding sequences. We will divide the gene into multiple parts: exons, introns, upstream sequence (or 5’ flanking regions), and downstream sequence (3’ flanking regions). The exons contain the protein coding sequence, and they are separated by introns. The introns and exons are transcribed into RNA, the introns are then spliced out to make messenger RNA (mRNA), and then the mRNA (coding sequence) is translated into a protein. (Note: The majority of life on earth is prokaryotic, and prokaryotic genes do not contain introns.) The region upstream of the gene usually contains non-coding sequences that control when and how the coding sequences are transcribed into mRNA (the introns and downstream regions may also contain transcriptional regulatory regions). For more on the regulation of transcription check out this site. For the rest of this discussion, we will refer to two types of sequences: non-coding (introns and upstream and downstream regions) and protein coding. The protein coding sequence of a gene is made up of sets of three nucleotides called codons. When the mRNA transcribed from a gene is translated into a protein, each codon encodes a single amino acid. Just like nucleotides are the building blocks of DNA, amino acids are the building blocks of proteins. There are 64 different 3 nucleotide combinations (4 different nucleotides in combinations of 3, or 4^3 = 64), but there are only 20 amino acids. That means that some amino acids are encoded by multiple codons. 
We refer to this as the redundancy of the genetic code. The U’s in this figure are the RNA equivalent of the T’s in DNA. The three nucleotides in a codon can each be referred to by their position in the codon: first, second, and third. Because of the redundancy of the genetic code, some codons encode the same amino acid as other codons. The codons that encode the same amino acid tend to have the same first and second nucleotide, but differ at the third codon position. For this reason, many mutations of the third codon position do not lead to a change in the amino acid encoded by the codon. Mutations that do not lead to a change in the amino acid encoded by a codon are known as synonymous. Conversely, those that lead to a change in amino acid are called non-synonymous. Next time we will discuss how we can compare synonymous and non-synonymous differences between two coding sequences to infer natural selection.
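To make the redundancy of the genetic code concrete, here is a minimal Python sketch. It assumes the standard genetic code, written with DNA letters (T rather than U); the function names `classify_mutation` and `synonymous_fraction_by_position` are made up for this illustration. It labels a single-nucleotide change as synonymous or non-synonymous and tallies, for each codon position, what fraction of possible changes leave the amino acid untouched.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering.
# '*' marks the three stop codons. DNA letters are used (T instead of U).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_mutation(codon, position, new_base):
    """Label a single-nucleotide change (position is 0, 1 or 2)."""
    mutant = codon[:position] + new_base + codon[position + 1:]
    if CODON_TABLE[mutant] == CODON_TABLE[codon]:
        return "synonymous"
    # Changes to a stop codon are lumped in with non-synonymous here.
    return "non-synonymous"


def synonymous_fraction_by_position():
    """Fraction of single-nucleotide changes that are synonymous,
    computed separately for codon positions 1, 2 and 3 over all
    61 amino-acid-encoding codons."""
    fractions = []
    for position in range(3):
        synonymous = total = 0
        for codon, aa in CODON_TABLE.items():
            if aa == "*":
                continue  # skip stop codons as starting points
            for base in BASES:
                if base == codon[position]:
                    continue
                total += 1
                if classify_mutation(codon, position, base) == "synonymous":
                    synonymous += 1
        fractions.append(synonymous / total)
    return fractions


if __name__ == "__main__":
    print(len(CODON_TABLE), "codons,", len(set(AMINO_ACIDS) - {"*"}), "amino acids")
    print(classify_mutation("CTT", 2, "C"))  # CTT and CTC both encode leucine
    print(classify_mutation("CTT", 1, "C"))  # CCT encodes proline
    print(synonymous_fraction_by_position())
```

Running it shows the pattern described above: the third codon position tolerates far more changes than the first, and under this code no change at the second position of an amino-acid-encoding codon is synonymous.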

Detecting Natural Selection (Part 3)

Codon Based Models for Detecting Selection

This is the fourth of multiple postings I plan to write about detecting natural selection using molecular data (ie, DNA sequences). The first post contained a brief introduction and can be found here. The second post described the organization of the genome, and the third described the organization of genes. As I mentioned in the previous post in this series, we can divide genes into protein coding sequence and non-coding sequence. Protein coding sequences are made up of codons (sets of three nucleotides) which encode amino acids. Because of the redundancy of the genetic code (64 possible codons, but only 20 amino acids), many amino acids are encoded by multiple codons (see the codon table above). Nucleotide substitutions that do not change the amino acid encoded by a codon are said to be synonymous (or silent). Those that do change the amino acid are non-synonymous (some people refer to these as replacement substitutions). Tests for natural selection often compare one set of nucleotides (or sites) which may be under selection to another set of nucleotides (or sites) that are probably not under selection. The sites that are not under selection are said to be evolving under neutral processes, and they act as a scientific control. The patterns of evolution at the sites that may be under selection (the experimental group) are then compared to the neutral sites. If both sets are evolving similarly, then we fail to reject neutrality at the sites being tested. Conversely, if the two types of sites are evolving at different rates, we reject neutrality. If we assume that selection acts on the amino acid sequence (the protein encoded by the gene) then the synonymous substitutions should be selectively neutral. This is [...] (Various corrections have been developed to calculate the actual number of substitutions for more diverged sequences.) 
Assuming that dS (the number of synonymous substitutions per synonymous site) is an adequate estimate of the neutral rate of molecular evolution, we interpret the three different outcomes of comparisons between dN (the number of non-synonymous substitutions per non-synonymous site) and dS thusly:

• If dN < dS : Non-synonymous sites are evolving slower than synonymous sites. We interpret this to mean that the non-synonymous sites are under selective constraint (or purifying selection) because they are evolving at a rate slower than the neutral expectation. This is the case for most genes when comparisons are made between any two species. This means that most amino acid substitutions are deleterious.
• If dN = dS : Non-synonymous and synonymous sites are evolving at equal rates. Hence, non-synonymous substitutions are neutral.
• If dN > dS : Non-synonymous sites are evolving faster than synonymous sites. This is evidence for positive selection because we assume that natural selection is acting on the amino acid sequence of the protein.

The relationship between dN and dS is often summarized by the ratio of the two statistics (dN/dS). If dN < dS then dN/dS < 1; if dN = dS then dN/dS = 1; if dN > dS then dN/dS > 1. There are some obvious limitations (and some more subtle ones) to using dN and dS to identify genes under positive selection. As I mentioned earlier, dS may not be an adequate estimate of neutral evolution. Furthermore, when only one (or very few) amino acids are under positive selection, these statistics are not very sensitive to that selection. If synonymous substitutions accumulate, in general, at a faster rate than non-synonymous substitutions, the one (or few) amino acid substitution that occurs due to positive selection will not influence dN enough to cause it to be significantly greater than dS. One way to get around this problem is to perform a sliding window analysis (examine subsets of the coding sequence, say 30 codons at a time) to detect regions of a gene that have a signature of positive selection.
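For readers who want to see the bookkeeping, the following is a rough Python sketch of one common way to estimate dN and dS for a pair of aligned coding sequences. It is modeled on the Nei-Gojobori counting approach with a Jukes-Cantor correction, which is my choice for illustration and not necessarily the method used in any study mentioned here. It assumes the standard genetic code, in-frame sequences of equal length, no gaps, and no stop codons; all function names are hypothetical.

```python
from itertools import permutations, product
from math import log

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_sites(codon):
    """Synonymous sites in a codon: for each position, the fraction of
    the three possible changes that keep the amino acid the same."""
    count = 0
    for pos in range(3):
        for base in BASES:
            mutant = codon[:pos] + base + codon[pos + 1:]
            if base != codon[pos] and CODE[mutant] == CODE[codon]:
                count += 1
    return count / 3


def count_differences(c1, c2):
    """Average (synonymous, non-synonymous) differences between two codons
    over all orders in which the differing positions could have changed.
    Pathways passing through a stop codon are ignored."""
    differing = [i for i in range(3) if c1[i] != c2[i]]
    syn_total = non_total = paths = 0
    for order in permutations(differing):
        current, syn, non, valid = c1, 0, 0, True
        for pos in order:
            step = current[:pos] + c2[pos] + current[pos + 1:]
            if CODE[step] == "*":
                valid = False
                break
            if CODE[step] == CODE[current]:
                syn += 1
            else:
                non += 1
            current = step
        if valid:
            syn_total, non_total, paths = syn_total + syn, non_total + non, paths + 1
    if paths == 0:  # simplification: such codon pairs contribute no differences
        return 0.0, 0.0
    return syn_total / paths, non_total / paths


def jukes_cantor(p):
    """One simple correction for multiple substitutions at the same site
    (fails if p >= 0.75, i.e. for very diverged sequences)."""
    return -0.75 * log(1 - 4 * p / 3)


def dn_ds(seq1, seq2):
    """Return (dN, dS) for two aligned, in-frame coding sequences."""
    if len(seq1) != len(seq2) or len(seq1) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    syn_sites = non_sites = syn_diffs = non_diffs = 0.0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        if CODE[c1] == "*" or CODE[c2] == "*":
            raise ValueError("stop codons are not handled in this sketch")
        s = (synonymous_sites(c1) + synonymous_sites(c2)) / 2
        syn_sites += s
        non_sites += 3 - s
        sd, nd = count_differences(c1, c2)
        syn_diffs += sd
        non_diffs += nd
    return jukes_cantor(non_diffs / non_sites), jukes_cantor(syn_diffs / syn_sites)


if __name__ == "__main__":
    # Toy sequences invented for this example: one synonymous (CTT/CTC)
    # and one non-synonymous (AAA/AGA) difference.
    dN, dS = dn_ds("CTTGCTAAA", "CTCGCTAGA")
    print("dN =", dN, "dS =", dS, "dN/dS =", dN / dS)
```

A sliding window analysis would simply call `dn_ds` on successive slices of the two sequences (for example, 30 codons at a time), with the caveat that short windows can contain no synonymous differences at all, in which case the ratio is undefined.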
Despite its limitations, this codon based approach for detecting selection is widely used and provides the basis for other methods of detecting selection. Next time we will discuss phylogenetics and relative rates.

Detecting Natural Selection (Part 4)

Phylogenetics and Relative Rates

This is the fifth of multiple postings I plan to write about detecting natural selection using molecular data (ie, DNA sequences). The first post contained a brief introduction and can be found here. The second post described the organization of the genome, and the third described the organization of genes. The fourth post described codon based models for detecting selection. The simple codon based model for detecting natural selection that I described previously (dN/dS) involves comparing two homologous sequences. If we have three or more sequences, we can create a rooted phylogeny, and four or more sequences allow us to create an unrooted phylogeny. With the analysis of dN and dS we were not concerned with which lineage the substitutions occurred on. In our relative rate analysis we will be determining where in the tree (on which branch) the substitutions occurred. I will not get into the detail regarding different algorithms for creating phylogenies, and I will assume that we already know the evolutionary relationship of the sequences we are comparing. If you are interested in learning more about how phylogenies are created, I would recommend starting with this book and following the literature citations therein. I will point out that the length of the branches represents the number of substitutions that have accumulated along a particular lineage. Phylogenies can be created with either DNA sequences or translated protein coding sequences (amino acid sequences) depending on whether the sequences are closely related or not. (DNA sequences are preferred for closely related sequences because they evolve faster and accumulate more substitutions in a shorter period of time, while amino acid sequences are preferred for more distantly related sequences because they evolve slower.) Recall from our codon based comparisons that we have, essentially, three different selective scenarios:

1. Neutral evolution - a sequence is evolving without the constraint or influence of natural selection
2. Purifying selection or selective constraint - natural selection is acting as a conservative force, restricting the evolution of a sequence
3. Positive selection - natural selection is driving the evolution of a sequence, causing it to evolve faster than the neutral expectation

Relative rate tests compare the selective constraint along two lineages of a phylogeny. Assuming all other factors are equal (ie, population size, mutation rate, gametogenesis, generation time), if the selective constraint along both lineages is equal, the branch going from sequence A should be of equal length to the branch leading to sequence B. If, however, the selective constraints differ, the branch lengths should be unequal. If there are more substitutions along one lineage than the other, we must invoke some explanation for these differences. In some cases, differences in rates can be attributed to life history differences of the species from which the two sequences come. For example, mice have much shorter life spans than humans, and this has been used to explain differences in rates of evolution between the two lineages.

Detecting Natural Selection (Part 5)

Allele and Genotype Frequencies in Populations

This is the sixth of multiple postings I plan to write about detecting natural selection using molecular data (ie, DNA sequences). The first post contained a brief introduction and can be found here. The second post described the organization of the genome, and the third described the organization of genes. The fourth post described codon based models for detecting selection, and the fifth detailed how relative rates can be used to detect changes in selective pressure. The previous analytical techniques we have discussed all deal with comparing sequences from different species (or homologous sequences that resulted from gene duplication). From here on out we will be discussing how variation within a species can be used to detect natural selection at the sequence level. To do this we must first address what we expect when there is no selection. This post deals with expected allele and genotype frequencies and changes in allele frequencies between generations (something I have written about before). Subsequent posts will use more recently developed analyses that allow us to detect selection by sampling allele frequencies in a population. Before we begin discussing how to detect natural selection we must lay out our neutral expectation (ie, null hypothesis). Let’s begin by assuming a one locus, two allele model, where the two alleles are given by A and a. We can use a Punnett square to determine the expected frequency of each genotype in a mating between two heterozygotes.

         A      a
    A    AA     Aa
    a    Aa     aa

We expect one quarter of the progeny to have the genotype AA, one quarter to have the genotype aa, and half to be heterozygous (Aa). We can extend this model to determine the expected genotype frequencies in a population after one generation of random mating. Let the frequency of allele A in the first generation be p, and the frequency of allele a is given by q (p + q = 1). 
         p      q
    p    p^2    pq
    q    pq     q^2

As you can see from this table, when there is random mating, a large population, no mutation, no migration, and no natural selection, the genotype frequencies in the second generation can be given by the following equations (in terms of allele frequencies in the previous generation):

* Freq(AA) = p^2
* Freq(Aa) = 2pq
* Freq(aa) = q^2

We can then use these formulas to determine the allele frequencies in the second generation. Let p' and q' be the allele frequencies of A and a in the second generation, such that:

* p' = p^2 + pq
* q' = q^2 + pq

As I have previously described, Hardy and Weinberg showed that random mating on its own does not change allele frequencies. We can see that by rearranging the equation for p' given above:

p' = p(p + q)

As mentioned above, p + q = 1, so p' = p, and we have proof that random mating alone does not alter allele frequencies (the frequency of allele a does not change because q' = 1 - p', which is equivalent to q' = q). Natural selection, however, can lead to changes in allele frequencies between generations. I detailed how to determine the expected allele frequencies after selection given the allele frequencies before selection and the fitness of the different genotypes in my post on mean fitness and genetic load. We can also derive the marginal fitness of the alleles (remember, fitness is a measure of the number of progeny left per individual carrying a particular genotype, and in diploid organisms the genotype consists of two alleles), and we get the following results:

* W_A = pW_AA + qW_Aa
* W_a = qW_aa + pW_Aa

where W_A and W_a are the marginal fitnesses of alleles A and a, and W_AA, W_Aa, and W_aa are the fitness of each of the genotypes (the number of progeny left by an individual carrying that genotype). 
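The formulas above are easy to check numerically. The Python sketch below uses function names invented for this example. The last function applies the standard one-locus selection recursion p' = pW_A / W-bar, where W-bar = pW_A + qW_a is the mean fitness; that recursion is not written out in this post, so treat it as an added assumption here.

```python
def genotype_frequencies(p):
    """Expected genotype frequencies after one generation of random mating."""
    q = 1 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def next_allele_frequency(p):
    """Frequency of A in the next generation with no selection:
    p' = p^2 + pq (all of the AA homozygotes plus half of the heterozygotes)."""
    freqs = genotype_frequencies(p)
    return freqs["AA"] + freqs["Aa"] / 2


def marginal_fitnesses(p, w_AA, w_Aa, w_aa):
    """Marginal fitnesses W_A = pW_AA + qW_Aa and W_a = qW_aa + pW_Aa."""
    q = 1 - p
    return p * w_AA + q * w_Aa, q * w_aa + p * w_Aa


def allele_frequency_after_selection(p, w_AA, w_Aa, w_aa):
    """Frequency of A after one round of selection: p' = p * W_A / mean fitness.
    (Standard result, added here as an assumption; see the lead-in.)"""
    q = 1 - p
    w_A, w_a = marginal_fitnesses(p, w_AA, w_Aa, w_aa)
    mean_fitness = p * w_A + q * w_a
    return p * w_A / mean_fitness


if __name__ == "__main__":
    p = 0.3
    print(genotype_frequencies(p))
    print(next_allele_frequency(p))  # random mating alone leaves p unchanged
    # Hypothetical fitness values in which allele A is favored:
    print(allele_frequency_after_selection(p, w_AA=1.0, w_Aa=0.9, w_aa=0.8))
```

With equal fitnesses for all three genotypes the last function returns p unchanged, which is the neutral expectation; any departure from that is what selection on the locus would look like in this simple model.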
As you can see, by measuring changes in genotype frequencies from generation to generation we can estimate the fitness of each genotype, and by measuring changes in allele frequencies we can estimate the marginal fitness of each allele. These results provide the theoretical framework for all of population genetics, but they are rarely used to detect selection because more powerful techniques have been developed for molecular data. I still felt it necessary to lay out some of these concepts for you so that you can appreciate what will follow: detecting natural selection using nucleotide sequence polymorphism.

Detecting Natural Selection (Part 6)

Calculating Nucleotide Sequence Polymorphism

This is the seventh of multiple postings I plan to write about detecting natural selection using molecular data (ie, DNA sequences). The first post contained a brief introduction and can be found here. The second post described the organization of the genome, and the third described the organization of genes. The fourth post described codon based models for detecting selection, and the fifth detailed how relative rates can be used to detect changes in selective pressure. The sixth post dealt with classical population methods for detecting selection using allele and genotype frequencies. Much of the popular press surrounding recent publications that proclaim to detect natural selection does not adequately detail whether the researchers have identified purifying selection (selective constraint) or positive (aka Darwinian) selection. See here for a particularly poor article that confused the heck out of me (and I claim to understand genome analysis). Recall from our discussion of codon based models for detecting selection that we can distinguish between these two selection regimes, although the codon based models are not very powerful. We can also use relative rates to identify differences in selective constraint between lineages, but it is difficult to differentiate relaxed constraint from positive selection using these analyses. Furthermore, there are a lot of computational biologists interested in identifying highly conserved regions between genomes under the assumption that these sequences are probably under strong purifying selection. But what if we’re interested in finding signatures of positive selection? In my opinion, the best data for this type of analysis is DNA sequence polymorphism. This post will detail some of the core concepts in calculating polymorphism from DNA sequences. Subsequent posts will detail the statistical analyses that can be performed on this data. 
Signatures of selection can be detected using either protein coding or non-coding DNA; the statistical techniques differ for analyzing these two types of sequences, but the initial steps are very similar. We will refer to the region of the genome we are sequencing as a locus. The length of this region can range anywhere from about 500 nucleotides to thousands of nucleotides (the longer the region, the more work the grad student or undergrad doing the lab work must put in). Once a researcher has chosen a particular locus (either because they know sequencing it will be feasible or because it is near a gene of particular interest), she will sequence it in approximately 20-30 individuals from some population (although it is common to sequence in as few as 10 or as many as hundreds of individuals). Choosing which individuals to sample (and which populations to sample from) is beyond the scope of this entry and often depends greatly on the natural history of the species in question.

[Figure: DNA alignment.]

Once all of the “wet-lab” work is completed (which can take weeks if you are lucky or years if you chose a poorly studied taxon and tricky locus), the sequences must be aligned (see above). To read the alignment above, you need a quick primer on the nomenclature. The column on the left contains the identifiers (or names) of the sequences. The first set of sequences is the alignment of the first 50 nucleotides from the locus, followed by numbers that indicate the position of each nucleotide. Below that, we have positions 51-100, then 101-150, etc. Ideally, we would have all the positions aligned in a single block, but the limitations of printed paper prevent such a representation. The color coding is just there to make it easier to distinguish the nucleotides from each other. I will assume we have a good alignment, although the alignment process can be quite tricky. 
Each homologous nucleotide can be compared between all of the sequences in an alignment. Sites at which all sequences have the same nucleotide (position 1 above, where all sequences have a G) are monomorphic. If there are different nucleotides at a site (position 6, where some sequences have a C and the others have a T), that site is said to be polymorphic. We can count the number of sites that are monomorphic and the number that are polymorphic. We will discuss two types of polymorphism. The first, the number of segregating sites (Sn), is just the number of polymorphic nucleotide sites in the data set. As the number of sequences in the data set increases, so too does the number of segregating sites. If this is not obvious to you, imagine we have sampled five sequences from a population (if you need a picture, imagine it’s the first five sequences listed above: D_yakuba, RPU74073, RPU74053, PSU74068, TJU74075). We can calculate the number of segregating sites using these five sequences, and this number will usually be less (and never be more) than if we added five more sequences to our sample. Each time we add a sequence, we identify more segregating sites, although there are diminishing returns as more sequences get added -- eventually (ok, after a really long time) you identify all of the segregating sites in a population, and adding more sequences will only result in adding the same polymorphic sites that you have already identified. The number of segregating sites depends on the number of sequences in your data set, and we will need to apply a simple correction to take this into account (I will address this in a later post). The second type of polymorphism, the average pairwise differences (π), does not suffer from this problem. To calculate π we must first compare all pairwise combinations of sequences in the data set (compare the first sequence to the second, the first to the third, the first to the fourth, all the way to the second to the last and the last). 
In each pairwise comparison we calculate the number of nucleotide sites that differ between those two sequences. Once we have calculated the number of differences between each pair, we divide by the number of comparisons made to get the average pairwise differences. Because we are taking an average, this estimate of polymorphism does not depend on the number of sequences in the sample. In future posts we will discuss the theoretical framework behind detecting selection using nucleotide polymorphism and how our two estimates of polymorphism can be used to detect natural selection.
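Both measures of polymorphism can be computed in a few lines. The Python sketch below uses a small made-up alignment (not the one pictured above) and assumes the sequences are already aligned, are all the same length, and contain no gaps or missing data. The function names are hypothetical.

```python
from itertools import combinations


def segregating_sites(alignment):
    """Number of alignment columns in which more than one nucleotide appears."""
    return sum(1 for column in zip(*alignment) if len(set(column)) > 1)


def average_pairwise_differences(alignment):
    """Mean number of nucleotide differences over all pairs of sequences."""
    pairs = list(combinations(alignment, 2))
    total = sum(
        sum(a != b for a, b in zip(seq1, seq2))
        for seq1, seq2 in pairs
    )
    return total / len(pairs)


if __name__ == "__main__":
    # Toy alignment invented for this example: four sequences, seven sites.
    toy_alignment = [
        "GATTCCA",
        "GATTCTA",
        "GATTCTA",
        "GACTCCA",
    ]
    print("segregating sites:", segregating_sites(toy_alignment))
    print("average pairwise differences:", average_pairwise_differences(toy_alignment))
    # Using only the first two sequences can never reveal more segregating
    # sites than the full sample does.
    print("segregating sites (first two):", segregating_sites(toy_alignment[:2]))
```

In the toy alignment, sites 3 and 6 are polymorphic, and the six pairwise comparisons differ at 1, 1, 1, 0, 2 and 2 sites, so the average is 7/6.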