The task of analyzing microarray data is often at least as much an art as a science, and it typically consumes considerably more time than the laboratory protocols required to generate the data. Part of the challenge is assessing the quality of the data and ensuring that all samples are comparable for further analysis.
Normalization of the raw data, which controls for technical variation between arrays within a study, is essential [7]. The challenge of normalization is to remove as much of the technical variation as possible while leaving the biological variation untouched. This is a substantial challenge, and here we touch only on the main issues. First, visualization of the raw data is an essential part of assessing data quality, choosing a normalization method, and estimating the effectiveness of the normalization. Many methods for visualization, quality assessment, and data normalization have been developed (see [9] for a review, Text S1, and Figure S1). The related issues of background adjustment and data “summarization” (reducing multiple probes representing a single transcript to a single measurement of expression) for Affymetrix arrays are well introduced in chapter 2 of [10].
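To make the idea concrete, one widely used approach is quantile normalization, which forces every array to share the same empirical intensity distribution. The sketch below is a minimal illustration in Python with NumPy, not a substitute for the established implementations reviewed in [9]; it uses a simplified tie-breaking rule, whereas production implementations typically average over ties.

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize a genes-by-samples expression matrix.

    Each sample (column) is mapped onto a common reference
    distribution: the across-sample mean of each rank's value.
    Note: ties within a column are broken arbitrarily here;
    standard implementations average tied ranks instead.
    """
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # rank of each value within its column
    sorted_cols = np.sort(X, axis=0)                   # each column sorted ascending
    reference = sorted_cols.mean(axis=1)               # mean distribution across samples
    return reference[ranks]                            # map ranks back to reference values

# Tiny example: 4 "genes" measured on 3 "arrays".
X = np.array([[5., 4., 3.],
              [2., 1., 4.],
              [3., 4., 6.],
              [4., 2., 8.]])
Xn = quantile_normalize(X)
```

After normalization, every column of `Xn` has identical sorted values, so between-array distributional differences are removed while the within-array ranking of genes is preserved.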
Clustering is a way of finding and visualizing patterns in the data. Many papers and indeed books have been written on this topic (see e.g., [11]–[13] and Text S1). Different methods highlight different patterns, so trying more than one method can be worthwhile. Note that while clustering finds predominant patterns in the data, those patterns may not correspond to the phenotypic distinction of interest in the experiment. To identify gene expression patterns related to this distinction, more directed methods are appropriate.
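As a concrete illustration of one such method, the sketch below applies average-linkage hierarchical clustering with a correlation-based distance (a common pairing for expression data) to a synthetic matrix containing two gene groups with opposite profiles. The data, metric, and linkage choices are assumptions for this example; as noted above, other methods may highlight different patterns.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Synthetic expression matrix: 20 genes x 6 samples. The first 10 genes
# are "up" in samples 1-3 and "down" in samples 4-6; the last 10 are the reverse.
up = rng.normal(loc=[2, 2, 2, -2, -2, -2], scale=0.5, size=(10, 6))
down = rng.normal(loc=[-2, -2, -2, 2, 2, 2], scale=0.5, size=(10, 6))
X = np.vstack([up, down])

# Correlation distance (1 - Pearson r) groups genes by profile shape,
# ignoring differences in overall expression level.
d = pdist(X, metric="correlation")
Z = linkage(d, method="average")          # average-linkage hierarchical clustering
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two clusters
```

Swapping the metric (e.g., Euclidean) or the linkage (e.g., complete) can yield different trees on the same data, which is why trying more than one method is worthwhile.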