This article describes the development of a novel program to process Affymetrix microarray files, which are used in the biological sciences to establish differences in gene expression between two conditions (e.g., diseased tissue versus healthy tissue).
Introduction and Background
Affymetrix gene expression microarrays (“chips”) are a commercial implementation of a powerful concept originally introduced to the world by Shena and colleagues in 1995 [1]. When successfully implemented, gene expression microarrays let a biologist measure the expression of thousands of genes simultaneously in a biological sample, such as heart tissue, and further, to compare that measure of expression between two biological states, such as diseased heart tissue and healthy heart tissue. In many ways, the introduction of microarray technology has created a revolution in biology, transforming the field into a “big data” science like its sister disciplines of physics and chemistry.
Gene expression microarrays (Figure 1A) are manufactured by attaching strands of deoxyribonucleic acid (DNA), corresponding to different genes of an organism, across the surface of a glass slide. The basic process of identifying genes that are expressed begins with the extraction of messenger RNA (mRNA) from a source, for example, healthy heart cells (Figure 1B). The molecule mRNA is made by cells when a gene is expressed, meaning its physical presence—assuming it can be reliably detected—is an indicator of gene expression. By fluorescently labeling the mRNA and hybridizing it to the surface of the chip, it is possible to quantify the intensity of multiple genes’ expression from that biological source by using a scanner able to measure a fluorescent signal. When this process is repeated on a different biological source, such as diseased heart tissue, a separate gene expression profile is created for the diseased tissue, which can then be computationally compared to the expression profile from the healthy tissue. In this manner, genes that are more highly expressed or more highly repressed in diseased tissue can be identified by comparing their expression profile to the expression profile of the same genes in the healthy tissue. This has obvious implications for determining which genes may be playing a role in disease development or any other biological process of interest, from cancer metastasis and drug resistance in medicine to fruit ripening and drought resistance in agriculture.
Figure 1. (A) Basic microarray design and layout. (B) Extraction of mRNA—expressed from the nucleus of a healthy heart cell—and its subsequent fluorescent labeling. (C) Hybridization of the fluorescently labeled mRNA (“target”) to the surface of an Affymetrix microarray chip, and its subsequent scanning to quantify the fluorescent signal, assumed to be proportional to gene expression.
The Affymetrix implementation of gene expression microarrays utilizes probesets, synthesized in place, on the surface of the microarray chip (Figure 1C). Probesets are groups of small DNA fragments that are complementary to different regions of the same mRNA molecule made whenever a gene is expressed. By combining the fluorescent signal of the probeset group, a single measure of gene expression is arrived at computationally, which is the primary focus of the algorithm presented here. Probesets are composed of groups of perfect match (“PM”) and mismatch (“MM”) probes. In the context of microarray analysis, the term “probe” refers to the strands of DNA physically tethered to the microarray chip and the term “target” refers to the fluorescently labeled sample of mRNA obtained from a biological source, which will be hybridized to the probes to measure gene expression. Perfect match probes are single-strand sequences of DNA, usually 25 nucleotides in length, which have perfect complementarity to the mRNA sequence to which they are designed to hybridize. Mismatch probes are identical to PM probes, with the exception of one nucleotide in the center of the molecule (typically at position 13) that is not a proper match to the mRNA with which it is designed to hybridize. The purpose of MM probes is to measure background fluorescent signal off the surface of the chip, which is one source of technical noise.
The analysis of microarray data involves numerous steps, some of which are not universally performed, but whose combination is collectively referred to as an “analysis pathway.” The steps in an analysis pathway typically involve:
- background correction: performed to remove fluorescent signal not due to biology
- probe normalization: used to place the individual datasets of an experiment on the same scale, so the datasets can be compared accurately
- perfect match (PM) probe correction: used to correct biases in the PM probe signal, often due to differences in DNA sequence between the probes
- summarization: performed to obtain a single measure of gene expression from the multiple measurements obtained by each probeset; this process often attempts to correct “probe” and “chip” technical noise
- probeset normalization: sometimes performed to make the probesets between datasets more directly comparable
- differentially expressed genes (DEG) test: uses a statistical test to identify “true” differentially expressed genes
Despite all of its positive aspects, microarray technology requires considerable knowledge to use effectively, as the raw signal that is generated from the technology is almost always noisy. The scientific literature is rife with algorithms designed to remove various sources of known error, and it is immediately clear that there is no “perfect” algorithm that is universally useful in all experimental situations. Even so, a seminal article by Zhu and colleagues [2] evaluated different combinations of commonly used algorithms—representing over 40,000 different analysis pathways—using a precisely controlled “spike-in” dataset, which allowed the researchers to identify the most important steps common to a good analysis pathway.
The algorithm presented here represents a merging of several of the “best step” analysis pathway practices identified in [2]. Specifically, the algorithm presented here uses the following analysis pathway:
- background correction: none
- probe normalization: performed using quantile normalization [3]
- perfect match (PM) probe correction: none
- summarization: performed using median polish [4]
- probeset normalization: none
- differentially expressed genes (DEG) test: while probability data is provided to aid interpretation, the identification of differentially expressed genes is not performed with statistical methods, but instead relies on graphical interpretation of the processed data
The algorithm presented here does not perform background correction, perfect match probe correction, or probeset normalization because the evidence presented in [2] suggests that these steps are at best unnecessary and sometimes even detrimental. Readers interested in a deeper discussion of microarray technology and data analysis are referred to the excellent reviews in [5, 6].
The Affymetrix Differential Gene Expression Detection (AffyDGED) Algorithm
The AffyDGED algorithm is template-driven, meaning that the algorithm expects several pieces of user-defined information to be provided in a notebook cell used as a template for entering the information.
The example above contains several variables that must be completed by the user to let AffyDGED do its job properly.
The variables requiring user input are:
- cellocation: This variable holds the directory location for finding the CEL data of the microarray hybridizations. CEL files contain the raw fluorescent data from a microarray experiment using Affymetrix technology.
- affycdflocation: This variable holds the directory location for finding the Affymetrix CDF library file. CDF stands for “chip description file” and refers to an Affymetrix file that describes, among other things, the location of the probes on the specific type of Affymetrix chip being used.
- savelocationroot: This variable holds the location where the user would like the final results of the analysis to be saved.
- qnormversion: This variable lets the user select between two options. The option “all” can be entered to perform quantile normalization using all the chips involved in an experiment at once, or alternatively, the option “condition” can be entered, which directs the algorithm to perform quantile normalization by condition, that is, perform quantile normalization twice, once using the data in the experimental condition (such as diseased heart tissue) and then a second time using the control condition data (such as healthy heart tissue). When in doubt, it is recommended that “all” be used.
- studyname: This variable lets the user name the experiment/study being processed by the AffyDGED algorithm. The output of AffyDGED is saved using this name to the location provided in “savelocationroot” above.
- experimentchips: This variable contains a list of the experimental condition datasets being studied.
- controlchips: This variable contains a list of the control condition datasets being studied.
To illustrate the features of the AffyDGED algorithm, we use data from a modestly sized microarray experiment involving the detection of differentially expressed genes between the saliva of pancreatic patients and healthy individuals [7]. All microarray data used in this study and presented here is publicly available at NCBI’s Gene Expression Omnibus portal (www.ncbi.nlm.nih.gov/geo), using the access number GSE14245.
The first tasks completed by AffyDGED include the loading of raw data, the determination of the physical dimensions of the chip data, and the conversion of Affymetrix probe position coordinates to equivalent Mathematica indices.
The chipdimensions and affyindextoMMAindices modules are designed to establish the number of rows and columns of probe data on the microarray chips, as well as to convert the probe position coordinates as assigned by Affymetrix to equivalent Mathematica indices.
For example, the chips used here (Human Genome U133 Plus 2.0) happen to be square, with 1,164 rows of information and 1,164 columns of information.
Affymetrix uses a single-number index (present on the Affymetrix CDF file) created from coordinates of the probes present on their microarray chips. The single-number index is derived from a formula used by Affymetrix that assumes an index of refers to the uppermost-leftmost position of the chip. To successfully load the probeset data into usable groups in Mathematica, it is necessary to shift the coordinates used by Affymetrix into indexing used by Mathematica.
Here is an example of the data contained within the Affymetrix CDF file.
The AffyDGED algorithm parses out the single-number indices for the perfect match probes of each probeset and converts that positional information into usable indices for Mathematica. The mismatch probes are purposely ignored in the AffyDGED algorithm because they often produce signals higher than the perfect match probes, which is an indication that the mismatch probes are not performing as originally intended by Affymetrix engineers.
There are 11 individual numbers, referring to the positions (in Affymetrix coordinates) of 11 perfect match probes of a single probeset used to measure the expression of a specific gene.
Here is the same positional information, now expressed in Mathematica‘s indexing system.
There are now 11 groups of Mathematica indices referring to the same data and accessible by conventional Mathematica indexing.
We can now take a look at the raw data from a single chip used in the pancreatic cancer study that has been loaded. Notice the extreme range of values typically seen in microarray experiments.
For this reason, we look at a histogram of the data using a logarithmic scale (Figure 2).
Figure 2. A histogram of the raw fluorescence intensity data (log scale) contained in the microarray chip GSM356796, used to measure the expression of genes in the saliva of a pancreatic cancer patient.
The majority of the data has an approximately Gaussian appearance (on a log scale), while still harboring extreme values on the rightward tail (look carefully along the axis). This shape is very characteristic of microarray data.
The next steps in processing include transforming the raw data to a scale, determining the probeset size specific for the chip in use, and performing quantile normalization.
The convention of transforming raw microarray data by is almost universally used in the microarray community, because it performs two useful functions. First, it makes the distribution of raw data more Gaussian (although certainly not perfectly so), and it aids interpretation of gene expression ratios for the end user, because it is easier to appreciate that a ratio of and on a log scale indicates the same degree of “up” and “down” regulation for a gene, as opposed to and on an absolute scale.
In the example shown here, this is the probeset size (the number of probes making up each probeset).
Quantile normalization is performed to place each dataset of the experiment on a common scale so the datasets can be appropriately compared to each other.
Using BoxWhiskerChart, this shows the difference between the pre-quantile normalized data and the post-quantile normalized data (Figure 3).
Figure 3. A box-and-whisker comparison of all 24 microarray chips used to compare gene expression between the saliva of pancreatic cancer and healthy patients, before and after quantile normalization.
Following quantile normalization, the probesets are summarized (i.e., a single measure of gene expression is generated) by first performing median polish and then taking the mean of the polished values for all probes within a probeset. After the probesets are summarized, the differential expression of each gene is obtained by subtracting the expression of a gene in the control condition from the expression of the gene in the experimental condition. For example, if the expression of a gene in the saliva of pancreatic cancer patients (the “experimental” condition) is 2.1 and the expression of the same gene in the saliva of healthy patients (the “control” condition) is 1.4, then the differential expression of the gene is .
Here are the distributions of gene expression in the saliva of pancreatic cancer and healthy patients, as well as the distribution of differentially expressed genes between the two states (Figure 4).
Figure 4. Histograms of the summarized (post median polish) gene expression values contained in the saliva of pancreatic cancer and healthy patients, as well as the difference in gene expression between those two biological groups.
From these results, simulations are performed to calculate probability -values, which answer the question, If we were to repeat this experiment many times, under the same experimental conditions as this study, how often would we find results as (or more) extreme than we have observed for each gene in the current study? The simulations performed to calculate the -values use the corrected fluorescent signal values contained within the experimental and control datasets and thus make the assumptions that the corrections are valid and that the data is now a good representative of the biological “truth” being studied. If these assumptions are correct, the probability values reported help the end user to gauge how rare a gene’s measure of differential expression is, but not—as is traditionally used in the microarray community—to make a decision about statistical or biological significance.
Following this, the algorithm organizes the output into a more human-readable table.
Here is a random sample of our current results.
The output columns are as follows:
Column 1: the signal-corrected fluorescence intensity of gene expression in the experimental condition
Column 2: the signal-corrected fluorescence intensity of gene expression in the control condition
Column 3: the measure of differential expression, obtained by subtracting column 2 from column 1
Column 4: the -value obtained through simulation
Column 5: the probeset name as assigned by Affymetrix
Column 6: descriptive information for the probeset including the genbank accession number, gene name, and gene product information
Due to the length of descriptive data in column 6, the output here has been purposefully rearranged to print column 6 underneath each data point’s first five column entries.
The final steps in processing the data include establishing the upper and lower thresholds for determining when a gene is considered up-regulated (turned “on”) in the experimental condition versus the control condition, and when a gene is considered down-regulated (turned “off”). Further, the algorithm saves the final results and provides a summary report of the completed analysis.
AffyDGED saves all the data and a list of differentially expressed genes to the location defined by the user in the template above. Further, both lists of data are saved in two formats, one convenient for users to continue to explore the data in Mathematica and the other conveniently readable in Microsoft Excel.
Support Files for AffyDGED
Affymetrix gene expression array technology utilizes a suite of library files containing important annotation information about the layout and content of its microarray products. AffyDGED can analyze any Affymetrix gene expression experiment as long as the .cdf (chip description file) and .gin (gene information) library files associated with the specific type of microarray chip being used are provided. These files are freely available to the public at www.affymetrix.com/support/technical/libraryfilesmain.affx. The variable “affycdflocation”, defined in the user template, holds the directory location of where the user has stored the .cdf file, and AffyDGED expects the .gin file to also be stored in the same location. Having the .cdf and .gin files together happens naturally, as they are packaged together by Affymetrix and unzip to the same location upon download.
How Accurate Are AffyDGED Results?
As mentioned previously, there is no universally correct algorithm for processing microarray data in all experimental circumstances. AffyDGED was developed under the guidance of [2] because the exact expression of genes in this study was precisely controlled and therefore is a useful gauge of how effectively a microarray analysis algorithm is performing.
Supplemental file 5 in [2] describes the 1,944 differentially expressed genes and 3,426 non-differentially expressed genes that were purposefully “spiked-in” to the microarray experiment. When this dataset is analyzed by the AffyDGED algorithm described here, the thresholds for determining “up” and “down” gene expression are calculated to be 0.225 and , respectively (the red lines in Figure 5). Establishing the thresholds for determining differential expression relies on the observation illustrated in Figure 5, that a plot of the processed data always reveals a tight clustering of data about the line . As the reader scans above and below that axis, note how the density of data noticeably separates from the cluster along that line. This observation was used to develop code that scans vertically up and down in small increments and establishes a breakpoint in each direction any time the density of data at a vertical position is 50% less than it was at the previous increment. These breakpoints become the thresholds for determining differentially expressed up and down genes.
Figure 5. The resulting differential expression threshold determination plot by AffyDGED when processing the data in [2] (accession number GSE21344).
The AffyDGED algorithm identifies 1,832 genes as differentially expressed in [2]. Of these, 1,591 genes overlap with the true 1,944 differentially expressed genes for a true positive rate of . This means AffyDGED was unable to correctly identify 18% of the true list of differentially expressed genes. While perhaps surprising to readers unfamiliar with microarray analysis, this places AffyDGED’s performance among the top performers of leading algorithms identified by [2]. State of the art at this time in microarray analysis means accepting a 1 out of 5 “miscall” rate in differential expression detection. Because of the complexity of microarray technology and of all the steps that occur prior to the actual algorithmic analysis of the data, it is important to realize that at least some of the miscalls by any microarray algorithm are really due to factors that are poorly controlled for by the underlying technology, and do not indicate a weakness in the algorithm itself [8].
Performance Timings on Different Datasets
To gauge the performance of AffyDGED, several publicly available datasets of different sizes and complexity were profiled. The first row of Table 1 shows the series accession number for each dataset available at NCBI’s Gene Expression Omnibus. All timings were obtained using quantile normalization with all chips (i.e. qnormversion set to “all”). Timings were acquired running Mathematica 9.0 under Windows 7 (64 bit) using an Intel Core i5-2500K processor overclocked to 4.48 Ghz.
AffyDGED performs very well, in all cases completing its analysis in less than seven and a half minutes. The largest chip in this comparison is the Human chip, where each of the processed files contains 13.2 Mb of data. AffyDGED is able to process the combined worth of data in a practically usable time frame.
Conclusion
Microarray technology continues to be heavily used by the biomedical and basic science research communities throughout the world. AffyDGED brings a contemporary algorithm useful in the real world to the Mathematica user community interested in exploring fundamental biology questions with their favorite computational tool chest.
References
[1] | M. Schena, D. Shalon, R. W. Davis, and P. O. Brown, “Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray,” Science, 270(5235), 1995 pp. 467-470. www.jstor.org/stable/2889064. |
[2] | Q. Zhu, J. C. Miecznikowsk, and M. S. Halfon, “Preferred Analysis Methods for Affymetrix GeneChips. II. An Expanded, Balanced, Wholly-Defined Spike-in Dataset,” BMC Bioinformatics, 11(285), 2010. doi:10.1186/1471-2105-11-285. |
[3] | B. M. Bolstad, R. A. Irizarry, M. Ästrand, and T. P. Speed, “A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Variance and Bias,” Bioinformatics, 19(2), 2003 pp. 185-193. doi:10.1093/bioinformatics/19.2.185. |
[4] | J. W. Tukey, Exploratory Data Analysis, Reading, MA: Addison-Wesley, 1977. |
[5] | C. A. Harrington, C. Rosenow, and J. Retief, “Monitoring Gene Expression Using DNA Microarrays,” Current Opinion in Microbiology, 3(3), 2000 pp. 285-291. doi:10.1016/S1369-5274(00)00091-6. |
[6] | J. Lovén, D. A. Orlando, A. A. Sigova, C. Y. Lin, P. B. Rahl, C. B. Burge, D. L. Levens, T. I. Lee, and R. A. Young, “Revisiting Global Gene Expression Analysis,” Cell, 151(3), 2012 pp. 476-482. doi:10.1016/j.cell.2012.10.012. |
[7] | L. Zhang, J. J. Farrell, H. Zhou, D. Elashoff, D. Akin, N.-H. Park, D. Chia, and D. T. Wong, “Salivary Transcriptomic Biomarkers for Detection of Resectable Pancreatic Cancer,” Gastroenterology, 138(3), 2010 pp. 949-957.e7. doi:10.1053/j.gastro.2009.11.010. |
[8] | R. A. Rubin, “A First Principles Approach to Differential Expression in Microarray Data Analysis,” BMC Bioinformatics, 10(292), 2009. doi:10.1186/1471-2105-10-292. |
T. Allen, “Detecting Differential Gene Expression Using Affymetrix Microarrays,” The Mathematica Journal, 2013. dx.doi.org/doi:10.3888/tmj.15-11. |
About the Author
Todd Allen is an associate professor of biology at HACC-Lancaster. His interest in computational biology using Mathematica took shape during his postdoctoral research years at the University of Maryland, where he developed a custom cDNA microarray chip to study gene expression changes in the fungal pathogen Cryphonectria parasitica.
Todd D. Allen, Ph.D.
Harrisburg Area Community College (Lancaster Campus)
East 206 R
1641 Old Philadelphia Pike
Lancaster, PA 17602
tdallen@hacc.edu