Bioinformatics and Transcription
INTRODUCTION
Although, in principle, the genome contains all the information one would need to understand a complex metabolic pathway or disease, the information is encoded in a combination of physical and logical constructs that make interpretation very difficult. As you saw in Chapter 3, genes are found in six different reading frames running in both directions and, more often than not, they contain intervening sequences that are subsequently spliced out of the final transcript. Genes are also embedded in multilayer three-dimensional structures. The primary unit of structure, the nucleosome, is composed of chromosomal DNA coiled around a histone protein complex. The location of individual genetic elements within this structure significantly impacts both transcription and replication. The positional effects are subtle because they are related to the topology of a helix coiled around a cylinder. Residues exposed to the interior face of the nucleosome are not accessible to enzymes involved in transcription or replication. Conversely, residues exposed on the surface of the structure and residues contained in the segments that connect nucleosomes are fully accessible. Enzymatic digestion experiments have revealed a level of variability in the structure, and it is now known that protein-coding regions can be hidden or made available for transcription in a time-dependent fashion that relates to the cell cycle. Furthermore, epigenetic factors such as methylated DNA and translocated genes can affect gene expression across multiple generations; as a result, genetically identical cells often exhibit different phenotypes. The combined effect of these genome-level variations makes it difficult to predict the expression of a particular gene and even more difficult to predict the sequence of a final spliced transcript.
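The six reading frames mentioned above can be made concrete with a short sketch. It assumes an uppercase DNA string containing only A, C, G, and T, and it only enumerates codons; it does not attempt to find genes or splice sites. The function names `reverse_complement` and `six_frames` are illustrative, not taken from any particular library.

```python
# Illustrative sketch: enumerate the six reading frames of a DNA sequence.
# Assumes an uppercase string of A, C, G, T; helper names are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement, i.e. the opposite strand read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Map frame labels (+1..+3, -1..-3) to lists of complete codons."""
    frames = {}
    for label, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Step through the strand three bases at a time,
            # dropping any incomplete codon at the end.
            codons = [strand[i:i + 3]
                      for i in range(offset, len(strand) - 2, 3)]
            frames[f"{label}{offset + 1}"] = codons
    return frames


if __name__ == "__main__":
    for name, codons in six_frames("ATGGCCATTGTAATGGGCCGC").items():
        print(name, " ".join(codons))
```

Each strand contributes three frames because a codon can begin at any of three offsets, and the second strand is read in the opposite direction, which is why a single stretch of chromosome can be interpreted in six different ways.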
Unfortunately, very few physical states or disease conditions are monogenic in nature. Even if a physical state were controlled by a single gene coding for a single protein, the upregulation of that gene would perturb the broader system, and many coding regions would be affected. Biology is a systems problem, nonlinear in nature, and the expression of a single gene has very little meaning outside the context of the web of interactions that describes a metabolic state. Each stage in the gene-expression pipeline provides important information about the factors that ultimately determine a phenotype:
- Base sequence information can be used to identify conserved sequences, polymorphisms, promoters, splice sites, and other relevant features that are critical to a complete understanding of the function of any given gene.
- Information about the up- and down-regulation of closely related messages, mRNA interference, life expectancy, and copy count of individual messages can help build a transcriptional view of a specific metabolic state.
- Despite much analysis, it is not yet possible to predict the three-dimensional folded structure of a protein from its gene sequence. Furthermore, the final protein is often a substrate for any of a number of post-translational modifications (acetylation, methylation, carboxylation, glycosylation, etc.). The enzymes that catalyze these reactions recognize structural domains that are difficult to infer using exclusively genomic data.
- Intermediary metabolism is the result of millions of protein:protein interactions. These interactions are context sensitive in the sense that a given protein can exhibit different characteristics and serve completely different functions in different environments. The complex networks that describe these interactions are routinely referred to as systems biology.
The genome-centric view of molecular biology is slowly being replaced by a more comprehensive systems view. One of the most important elements of this approach is a comprehensive understanding of the transcriptional state of all genes involved in a specific metabolic profile. This picture is complicated by the fact that many species of RNA that will be identified as playing an important role in the profile are never translated into protein. As previously discussed in Chapter 3, many messages are degraded by the RNA silencing machinery within the cell and others are prevented from engaging in protein translation. These control mechanisms can cause a high copy-count message, one that is highly abundant within the cell, to be translated into a very small number of protein molecules. Regulatory messages (miRNA, siRNA) are relatively straightforward to spot because they are reproducibly short and lack sequences that are normally associated with ribosomal binding. However, these small regulatory messages are spliced from longer transcripts that certainly have the potential to cause confusion.
Any technique used to study the transcripts within a cell must be capable of spanning the range from single-digit copy counts to very large numbers, often in the thousands. Accuracy is important because, at the single-digit level, small changes in the number of copies of certain messages can have significant effects on metabolism and disease.
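One way to see why accuracy matters at low copy counts is to compare absolute and relative changes. The sketch below uses invented copy counts and a log2 fold change with a pseudocount of 1; both the numbers and the choice of pseudocount are assumptions made for illustration, not values or methods taken from this chapter.

```python
import math


def log2_fold_change(before, after, pseudocount=1.0):
    """Relative change on a log2 scale.

    The pseudocount (an assumption of this sketch) avoids division by
    zero when a message is absent in one condition.
    """
    return math.log2((after + pseudocount) / (before + pseudocount))


if __name__ == "__main__":
    # Hypothetical copy counts per cell: (before, after).
    messages = {
        "rare_message": (2, 4),
        "abundant_message": (2000, 2002),
    }
    for name, (before, after) in messages.items():
        print(name,
              "absolute change:", after - before,
              "log2 fold change:", round(log2_fold_change(before, after), 3))
```

Both hypothetical messages gain two copies, yet the relative change is far larger for the rare message. A counting error of a single copy would therefore distort the rare message's apparent regulation much more than the abundant one's.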
This chapter specifically focuses on transcription, the process of creating a messenger RNA template from a gene sequence. The process has particular significance in the context of this book because it represents the first information-transfer step between gene and protein. In one sense, the step from gene to transcript represents a step up in complexity because many different transcripts can be created from a single coding region. Conversely, each transcript may be viewed as a simplification because much extraneous information has been removed from the raw gene sequence. This simplification is particularly apparent in situations where splicing operations create a message that is substantially different from the original chromosomal sequence. The structural link between mRNA transcript and protein is much more direct than the link between gene and protein. Furthermore, the appearance of a message in the cytoplasm is a clear indication that a particular gene has become involved in a cell's metabolism. This direct link between the appearance of a transcript and metabolic changes within the cell has given rise to an emerging discipline known as transcriptional profiling.
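The two views of transcription described above, more transcripts than genes but less sequence per transcript, can be illustrated with a toy example. The sketch assumes the gene is given as its coding (sense) strand, that exon boundaries are already known, and that the sequence and coordinates are invented; real splice-site detection is far more involved. The name `splice` is hypothetical.

```python
# Toy gene built from invented exon and intron sequences (coding strand).
EXON_1, INTRON_1 = "ATGGCC", "GTAAGTAG"
EXON_2, INTRON_2 = "AAATTT", "GTCCAG"
EXON_3 = "GGGTGA"
GENE = EXON_1 + INTRON_1 + EXON_2 + INTRON_2 + EXON_3


def splice(gene, exons):
    """Join exon intervals (0-based, end-exclusive) and write the result as mRNA.

    Because the input is assumed to be the coding strand, the message has
    the same sequence with U in place of T.
    """
    mature = "".join(gene[start:end] for start, end in exons)
    return mature.replace("T", "U")


if __name__ == "__main__":
    all_exons = [(0, 6), (14, 20), (26, 32)]
    skip_exon_2 = [(0, 6), (26, 32)]  # one possible alternative splicing
    print("gene length:", len(GENE))
    print("full transcript:", splice(GENE, all_exons))
    print("alternative transcript:", splice(GENE, skip_exon_2))
```

The same toy gene yields two distinct messages depending on which exons are retained, and each message is shorter than the chromosomal sequence it came from.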
This discussion begins with a review of the different types of transcripts and their roles in gene expression. The goal is to lay a foundation for the remainder of the chapter, which focuses on various techniques for identifying and counting individual species of mRNA. Transcriptional profiling depends on these techniques in addition to a portfolio of algorithms for data analysis. Over the past few years, the size of a typical expression-profiling experiment has grown to include thousands of transcripts. The result has been a corresponding increase in the level of sophistication of the statistical methods used to analyze the results. These methods are the focus of much of this discussion.