We begin with an example that will be used throughout the chapter. The data come from
Sorlie et al. (2001). The goal of that article was to “classify breast carcinomas based
on variations in gene expression derived from complementary deoxyribonucleic acid
(cDNA) microarrays and to correlate tumor characteristics to clinical outcome.” The
data consist of log fluorescence values for 456 cDNA clones measured on 85 tissue
samples. Of the 85 samples, 4 are normal tissue samples, 78 are carcinomas, and 3 are
fibroadenomas. Three of the four normal tissue samples were pooled normal breast
samples from multiple individuals. Sorlie et al. (2001) selected the 456 genes from
an initial set of 8102 genes so as to optimally identify the intrinsic characteristics
of breast tumors. In Figures 4.1 and 4.2, the data are plotted as heat maps.∗ This
representation assigns a color for every matrix entry, with negative (underexpressed)
values being green, and positive (overexpressed) values red. The data presented in
this plot were preprocessed by Sorlie et al. (2001), adjusting rows and columns to
have median zero. This preprocessing was applied before selection of the subset of
456 genes, so the column (i.e., sample) medians are not precisely zero.
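The effect described above can be checked numerically. The following is a minimal Python sketch on synthetic random data, not the Sorlie et al. (2001) measurements. The alternating row and column centering in `median_center` and the random choice of 456 rows are assumptions made for illustration, since the text does not specify the exact preprocessing or selection procedure.

```python
import numpy as np


def median_center(x, n_iter=10):
    """Alternately subtract row and column medians (assumed procedure).

    The final step centers the columns, so column medians are zero up to
    floating-point rounding; row medians are only approximately zero.
    """
    x = np.asarray(x, dtype=float).copy()
    for _ in range(n_iter):
        x -= np.median(x, axis=1, keepdims=True)  # center each row (gene)
        x -= np.median(x, axis=0, keepdims=True)  # center each column (sample)
    return x


rng = np.random.default_rng(0)

# Synthetic stand-in with the dimensions quoted in the text:
# 8102 genes (rows) by 85 samples (columns).
full = rng.normal(size=(8102, 85))
centered = median_center(full)

# Hypothetical gene selection: 456 rows chosen at random. The actual
# selection criterion of Sorlie et al. (2001) is not reproduced here.
keep = rng.choice(centered.shape[0], size=456, replace=False)
subset = centered[keep]

full_col_medians = np.median(centered, axis=0)
subset_col_medians = np.median(subset, axis=0)

print("max |column median|, all genes:", np.abs(full_col_medians).max())
print("max |column median|, 456 genes:", np.abs(subset_col_medians).max())
```

In this sketch the column medians of the full matrix are zero up to rounding, while those of the 456-row subset are small but nonzero, because the median of a subset of rows need not equal the median of all rows.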
Heat maps are used to look for similarities between genes and between samples. They
are most effective if rows and columns are ordered so that these patterns can be
identified. Clustering is often used to provide this ordering: groups of samples
(or genes) are identified and then arranged so that the closest groups are adjacent.
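The ordering step just described can be sketched in Python with SciPy. This is an illustration on synthetic data with two planted gene groups and two planted sample groups. The choice of average linkage, correlation distance, and SciPy's `optimal_ordering` option are assumptions for the sketch, not necessarily the settings used for the figures in this chapter, and `cluster_order` is a hypothetical helper name.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage


def cluster_order(x):
    """Return a row ordering of x taken from a hierarchical clustering.

    Assumed settings: average linkage on correlation distance, with leaves
    reordered so that adjacent leaves are as similar as possible.
    """
    z = linkage(x, method="average", metric="correlation",
                optimal_ordering=True)
    return leaves_list(z)


rng = np.random.default_rng(1)

# Synthetic matrix: 40 genes by 12 samples, block signal plus noise.
gene_sign = np.repeat([1.0, -1.0], 20)
sample_sign = np.repeat([1.0, -1.0], 6)
data = np.outer(gene_sign, sample_sign) + 0.3 * rng.normal(size=(40, 12))

# Shuffle rows and columns so the block structure is hidden.
shuffled = data[rng.permutation(40)][:, rng.permutation(12)]

# Cluster rows (genes) and columns (samples) separately, then reorder.
row_order = cluster_order(shuffled)
col_order = cluster_order(shuffled.T)
ordered = shuffled[np.ix_(row_order, col_order)]
```

The reordered matrix `ordered` places similar genes in adjacent rows and similar samples in adjacent columns, so it could then be passed to a plotting routine (for example matplotlib's `imshow`) to draw a heat map.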
This ordering is illustrated in Figure 4.1, where rows and columns are arranged according to
separate hierarchical clusterings. Sorlie et al. (2001) used a similar graphic to identify
interesting groups of genes and tumor subtypes. In Figure 4.2, five interesting gene
subgroups are given. These are similar to those identified by Sorlie et al. (2001).
These gene groups were selected because of unusually high or low expression levels
among some of the tumors (columns). The gene groups highlighted in Figure 4.2 are
used to characterize the different tumor subtypes. The six tumor subtypes (indicated
by color from left to right of the dendrogram in Figure 4.2) are Basal-like (red),