The Golden Spike Project:

Assessment of Microarray Analysis Methods

A project of the Halfon Lab


The Ag Spike experiment for Agilent two-color microarrays!

DOWNLOAD the new Platinum Spike experiment paper and supporting data.


DOWNLOAD the original Golden Spike experiment paper (Choe et al. 2005. "Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset." Genome Biology 6:R16) and additional data including scripts, CEL files, and clone-to-probe mappings.

DNA microarrays have become a leading research method across a wide variety of biological and biomedical disciplines. However, assessing the accuracy of microarray data has been difficult because the “correct” answers typically are not known: due to the vast number of genes interrogated in a microarray experiment, only a relatively small fraction of reported gene expression differences tend to be validated in any given study. This is of particular concern given the tremendous number of proposed microarray data analysis methods, which have proliferated in tandem with the increased use of the arrays themselves. To illustrate the scale of the problem, consider just a single journal, Bioinformatics: in the six-month period ending December 2004, this journal published over 40 papers involving various aspects of microarray analysis, of which at least half dealt with basic issues such as normalization or the discovery of differentially expressed genes. Unfortunately, despite the large number of proposed algorithms, there have been relatively few studies assessing their relative performance. The microarray user is thus left in the difficult position of choosing from among a large number of analysis options without knowing which methods work best.

The Golden Spike Project represents an attempt to rectify this situation by making available control microarray datasets in which the relative concentrations of all present genes are known. All data, scripts, technical comments, and links to publications will be available through this site.

Our first dataset (Choe et al. 2005) is a wholly defined Affymetrix GeneChip experiment in which 1309 individual cRNAs were “spiked in” at known relative concentrations between the two (spike-in and control) samples. This large number of defined RNAs enabled us to generate accurate estimates of false negative and false positive rates at each fold-change level, beginning at only a 1.2-fold concentration difference. Such small fold changes can be biologically relevant, yet are frequently overlooked in microarray datasets because it is unclear how reliably they can be detected. Our dataset uses a defined background sample of 2551 RNA species present at identical concentrations in both sets of microarrays, rather than a biological RNA sample of unknown composition. This background RNA population was sufficiently large for normalization purposes, and it also enabled us to observe the distribution of truly non-specific signal from probe sets corresponding to RNAs not present in the sample.

We used this dataset to compare several common Affymetrix array analysis algorithms. Our results demonstrated that at several steps of analysis, large differences exist in the effectiveness of the various options we considered. We found a significant limit to the sensitivity of microarray experiments for detecting small changes: in the best case, we could detect approximately 95% of true differentially expressed genes (DEGs) with changes greater than 2-fold, but fewer than 30% of those with changes below 1.7-fold, before exceeding a 10% false discovery rate. Importantly, we found that accurate detection of DEGs was maximized by combining aspects of different published methods, rather than by any single existing method.
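The advantage of a spike-in design is that the true status of every probe set is known, so detection power at a fixed false discovery rate can be measured directly rather than estimated. The following sketch illustrates the idea with simulated scores (the gene counts, fold-change levels, and score model here are illustrative, not the actual Golden Spike data or analysis pipeline):

```python
# Illustrative sketch: measuring sensitivity per fold-change level at a
# fixed empirical FDR, using simulated data in place of real arrays.
import random

random.seed(0)

# Simulated truth: (true_fold_change, test_score). A fold change of 1.0
# means the RNA is unchanged between the two samples; higher scores
# indicate stronger evidence of differential expression.
genes = []
for fc in [1.0] * 2000 + [1.2] * 300 + [1.7] * 300 + [2.0] * 300:
    score = (fc - 1.0) * 4 + random.gauss(0, 1)  # changed genes score higher
    genes.append((fc, score))

# Rank by score and find the largest gene list whose empirical false
# discovery rate (fraction of truly unchanged genes in the list) <= 10%.
genes.sort(key=lambda g: g[1], reverse=True)
best_cut = 0
false_hits = 0
for i, (fc, _) in enumerate(genes, start=1):
    if fc == 1.0:
        false_hits += 1
    if false_hits / i <= 0.10:
        best_cut = i

detected = genes[:best_cut]

# Sensitivity stratified by true fold change: small changes are much
# harder to detect than 2-fold changes at the same FDR.
for level in (1.2, 1.7, 2.0):
    total = sum(1 for fc, _ in genes if fc == level)
    hit = sum(1 for fc, _ in detected if fc == level)
    print(f"{level}-fold: detected {hit}/{total}")
```

Because the truth is known for every gene, the FDR here is an exact count rather than a statistical estimate, which is what makes fold-change-stratified sensitivity figures like those in the paper possible.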

A follow-up dataset, the Platinum Spike, is a second wholly defined Affymetrix experiment with approximately 2100 spiked-in cRNAs and 3643 unchanging control RNAs. A total of 18 microarrays were hybridized with samples from one of two conditions (9 arrays each). Unlike in the Golden Spike experiment, the two conditions are "balanced," in that similar numbers of RNAs are up- and down-regulated in each. We used the Platinum Spike dataset to compare over 40,000 possible analysis routes to determine the best analysis methods. Comparison with the Golden Spike experiment revealed that the choice of analysis methods, particularly with respect to normalization, depends critically on the nature of the underlying RNA distributions: "balanced" and "unbalanced" gene expression require different analysis choices. Therefore, a single "best" route for all microarray experiments may not exist.
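The large number of analysis routes arises combinatorially: each preprocessing and testing step offers several options, and a route is one choice per step. A minimal sketch of this enumeration (the step and option names below are generic examples, not the exact set compared in the Platinum Spike study):

```python
# Illustrative sketch: an "analysis route" is one option chosen at each
# step, so the number of routes is the product of the option counts.
from itertools import product

steps = {
    "background":    ["none", "RMA", "MAS5", "GCRMA"],
    "normalization": ["none", "quantile", "loess", "median", "invariant-set"],
    "PM correction": ["PM-only", "subtract-MM", "ideal-mismatch"],
    "summarization": ["median-polish", "Tukey-biweight", "Li-Wong"],
    "testing":       ["t-test", "SAM", "limma", "Cyber-T", "fold-change"],
}

routes = list(product(*steps.values()))
print(len(routes), "candidate routes")  # 4 * 5 * 3 * 3 * 5 = 900

# Each route would then be scored against the known spike-in truth
# (e.g., by area under an ROC curve) and ranked.
```

Adding a few more options per step, or additional steps, quickly pushes the count into the tens of thousands, which is why an exhaustive comparison is only practical when the correct answers are known in advance.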

Our most recent dataset is from the Ag Spike experiment, which is formulated similarly to the Platinum Spike but conducted using a set of 12 two-channel Agilent microarrays (two independent samples of three replicates each, with dye swaps). We compared 24 different analysis routes to determine the best choices. Comparison of the best results from the Ag Spike with those from the Platinum Spike shows high concordance.