Normalization of cDNA microarray data
Introduction
In this paper, we suppose that an experiment has been conducted using a series of two-color cDNA microarrays. Each microarray has been hybridized with RNA from two sources labeled with different fluors. The two color channels will be referred to by convention as red and green. We suppose that the arrays have been scanned to produce images and that these images have been processed further by an image analysis program to produce measured red and green foreground and background intensities for each spot on each array. Before the gene expression profiles of the RNA samples can be analyzed and interpreted, the red and green intensities must be normalized relative to one another so that the red/green ratios are as far as possible an unbiased representation of relative expression.
The purpose of normalization is to adjust for effects which arise from variation in the microarray technology rather than from biological differences between the RNA samples or between the printed probes. Imbalances between the red and green dyes may arise from differences between the labeling efficiencies or scanning properties of the two fluors complicated perhaps by the use of different scanner settings. If the imbalance is more complicated than a simple scaling of one channel relative to the other, as it usually will be, then the dye bias is a function of intensity and normalization will need to be intensity dependent. The dye-bias will also generally vary with spatial position on the slide. Positions on a slide may differ because of differences between the print-tips on the array printer, variation over the course of the print-run, non-uniformity in the hybridization, or from artifacts on the surface of the array which affect one color more than the other. Finally, differences between arrays may arise from differences in print quality, from differences in ambient conditions when the plates were processed or simply from changes in the scanner settings. Therefore, normalization between as well as within arrays will need to be considered.
Write R and G for the background-corrected red and green intensities for each spot. Normalization is usually applied to the log-ratios of expression, which will be written M = log2(R) − log2(G). The log-intensity of each spot will be written A = (log2(R) + log2(G))/2, a measure of the overall brightness of the spot. (The letter M is a mnemonic for minus while A is a mnemonic for add.) It is convenient to use base-2 logarithms for M and A so that M is in units of 2-fold change and A is in units of 2-fold increase in brightness. On this scale, M = 0 represents equal expression, M = 1 represents a 2-fold change between the RNA samples, M = 2 represents a 4-fold change, and so on.
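The definitions of M and A can be illustrated with a minimal Python sketch. The function name `ma_values` and the example intensities are hypothetical and are not taken from any microarray package; intensities that are not positive are returned as missing because their logarithm is undefined.

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot, or (None, None) if either
    background-corrected intensity is not positive."""
    if red <= 0 or green <= 0:
        return None, None
    log_r = math.log2(red)
    log_g = math.log2(green)
    m = log_r - log_g          # log-ratio: "minus"
    a = (log_r + log_g) / 2    # average log-intensity: "add"
    return m, a


# Hypothetical background-corrected (R, G) pairs for three spots.
spots = [(1024.0, 256.0), (500.0, 500.0), (-12.0, 80.0)]
for red, green in spots:
    print(ma_values(red, green))
```

The first spot is four times brighter in red than in green, so its M-value is 2; the third spot has a negative red intensity and is reported as missing.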
Any negative values for R or G will yield missing values for M and A and the corresponding spots will be excluded from subsequent analysis including normalization. The frequency of negative values depends very much on the image analysis program and the background estimation method used. SPOT [1], for example, using a “morph” background gives very few negative intensities while other programs such as GenePix [2] using a “median” background may often result in 30% or more negative values. The loss of information which results from omitting such spots from the analysis is usually not great because spots with negative values for either R or G are usually too faint to show good evidence of differential expression. In any case, the relative merits of the different background correction methods are beyond the scope of this paper.
The plan of this paper is as follows. Section 2 describes diagnostic plots to visualize intensity and spatial trends. Section 3 describes the basic normalization method, print-tip loess normalization, designed to adjust for intensity and spatial trends. Section 4 describes composite loess normalization in which use is made of control spots known to be not differentially expressed. Section 5 considers normalization for other trends, in particular, correcting for print-order effects. Section 6 describes scale normalization between arrays. Section 7 describes the use of spot quality weights and Section 8 gives detailed commands to implement the normalization techniques using freely available software.
Visualization of intensity and spatial trends
The sub-array loess normalization methods described in this paper are based on the fact that dye balance typically varies with spot intensity and with spatial position on the array. It is a useful trouble-shooting step to display these trends visually as part of the normalization process.
The relationship between dye-bias and intensity can be seen best in an MA-plot, which is a scatterplot of the M-values against the A-values for an array [3]. Fig. 1 shows an MA-plot for an array showing three
Print-tip loess normalization
The idea of print-tip loess normalization can be visualized in Fig. 4. Each M-value is normalized by subtracting from it the corresponding value of the tip group loess curve. The normalized log-ratios N are the residuals from the tip group loess regressions, i.e., N = M − loess_i(A), where loess_i(A) is the loess curve as a function of A for the ith tip group. Each loess curve is constructed by performing a series of local regressions, one local regression for each point in the scatterplot.
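The residual calculation can be sketched in Python as follows. This is a simplified illustration, not the loess implementation used by any particular package: it fits one tricube-weighted local straight line per point, omits the robustness iterations that loess routines usually offer, and assumes a span of 0.4. The names `local_linear_fit` and `print_tip_loess` are hypothetical.

```python
from collections import defaultdict


def local_linear_fit(x0, xs, ys, span=0.4):
    """Tricube-weighted straight-line fit, evaluated at x0.
    Assumes x0 is one of the values in xs."""
    n = len(xs)
    k = min(n, max(2, round(span * n)))       # number of neighbours used
    h = sorted(abs(x - x0) for x in xs)[k - 1]  # distance to k-th neighbour
    if h > 0:
        ws = [max(0.0, 1 - (abs(x - x0) / h) ** 3) ** 3 for x in xs]
    else:
        # The k nearest points all coincide with x0: weight only those.
        ws = [1.0 if x == x0 else 0.0 for x in xs]
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    if sxx < 1e-12:
        return my                              # no spread in x: local mean
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    return my + sxy / sxx * (x0 - mx)


def print_tip_loess(m_values, a_values, tip_groups, span=0.4):
    """Return N = M - loess_i(A), with one curve per tip group i.
    Spots with missing M or A stay missing (None)."""
    by_group = defaultdict(list)
    for idx, group in enumerate(tip_groups):
        if m_values[idx] is not None and a_values[idx] is not None:
            by_group[group].append(idx)
    normalized = [None] * len(m_values)
    for indices in by_group.values():
        xs = [a_values[i] for i in indices]
        ys = [m_values[i] for i in indices]
        for i in indices:
            fitted = local_linear_fit(a_values[i], xs, ys, span)
            normalized[i] = m_values[i] - fitted
    return normalized


# Hypothetical data: two tip groups with different intensity trends.
a = [6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0] * 2
groups = [1] * 8 + [2] * 8
m = [0.5 * x - 3 for x in a[:8]] + [-0.25 * x + 2 for x in a[8:]]
print(print_tip_loess(m, a, groups))
```

Because each tip group gets its own curve, a trend that differs between groups is removed separately within each group.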
Composite loess normalization
It is usual to use all or most of the genes on the array in the normalization methods described above. It can be useful to modify this strategy if a suitable set of control spots is available which are known not to be differentially expressed. To be of most use in loess normalization, the control spots should span as wide a range of intensities as possible. A satisfactory set of controls for this purpose is a specially designed microarray sample pool (MSP) titration series in which the entire
Correcting for other trends
There are many other trends which could be estimated and adjusted for in the normalization step, although normally these are of less importance than the intensity and spatial trends already considered. For example, there can be differences between the purity of DNA from different amplification batches or from different clone libraries. This can mean that different spots on the microarray contain different effective quantities of DNA. Different amplification batches and different clone libraries
Between array normalization
Sometimes there are substantial scale differences between microarrays, because of changes in the photomultiplier tube settings of the scanner or for other reasons. In these circumstances, it is useful to scale-normalize between arrays. Scale-normalization is a simple scaling of the M-values from a series of arrays so that each array has the same median absolute deviation.
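A minimal Python sketch of scale-normalization is given below. It makes several assumptions that the text does not specify: the median absolute deviation is taken about each array's own median, the common target scale is the geometric mean of the per-array values, and missing M-values have already been removed. The function names `mad` and `scale_normalize` are hypothetical.

```python
import math
from statistics import median


def mad(values):
    """Median absolute deviation about the median."""
    centre = median(values)
    return median([abs(v - centre) for v in values])


def scale_normalize(arrays):
    """Rescale each array's M-values so that all arrays share one MAD.
    The target is the geometric mean of the per-array MADs (an assumption)."""
    mads = [mad(m) for m in arrays]
    if any(s == 0 for s in mads):
        raise ValueError("an array has zero MAD and cannot be rescaled")
    target = math.exp(sum(math.log(s) for s in mads) / len(mads))
    return [[v * target / s for v in m] for m, s in zip(arrays, mads)]


# Hypothetical normalized M-values for two arrays with different spreads.
arrays = [[-1.0, -0.5, 0.0, 0.5, 1.0], [-4.0, -2.0, 0.0, 2.0, 4.0]]
scaled = scale_normalize(arrays)
print([mad(m) for m in scaled])
```

Multiplying every M-value on an array by a constant multiplies its median absolute deviation by the same constant, which is why a single scale factor per array is enough.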
Fig. 6 displays side-by-side boxplots of the normalized M-values for a series of six replicate arrays including slide 0924
Weighting for spot quality
Most image analysis programs routinely record a variety of descriptive information about each spot apart from the foreground and background intensities. If this information is used to construct a numeric quality measure for each spot, then lower quality spots can be down-weighted in the normalization process.
Information which is recorded on each spot typically includes morphological details such as area, perimeter, and location plus heterogeneity measures such as standard deviations or
Software
Software to carry out the normalization methods described in this paper is freely available from the Bioconductor project site http://www.bioconductor.org. The Bioconductor packages use the free statistical programming environment R. For normalization of cDNA arrays, the relevant packages are marrayNorm [8], [9] and limma. Here, we give commands from the limma package.
The first step is data input. If all the SPOT output files are in the working directory of the R session and are named after the
Conclusion
Normalization methods for cDNA microarrays will no doubt see further development in the future, but print-tip loess normalization provides a well-tested general purpose normalization method which gives good results on a wide variety of arrays. The method may be refined by using quality weights for individual spots. The method is best combined with diagnostic plots of the data. When diagnostic plots show that biases still remain in the data after normalization, further normalization steps such
Acknowledgements
The authors are grateful to Dr. Lynn Corcoran for permission to use unpublished data from her laboratory at the WEHI and to Henrik Bengtsson for helpful discussions on plate-order normalization.
References (10)
- M.J. Buckley, Spot User’s Guide, CSIRO Mathematical and Information Sciences, Sydney, Australia, 2000. Available from:...
- GenePix Pro Microarray and Array Analysis Software, Axon Instruments Inc. Available from:...
- et al., Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments, Statistica Sinica (2002).
- et al., Local regression models.
- Y.H. Yang, S. Dudoit, P. Luu, T.P. Speed, Normalization for cDNA microarray data, in: M.L. Bittner, Y. Chen, A.N....