RESEARCH REPORT

Diagnostic plots for detecting outlying slides in a cDNA microarray experiment

Taesung Park (1), Sung-Gon Yi (1), SeungYeoun Lee (2), and Jae K. Lee (3)

(1) Seoul National University, (2) Sejong University, Seoul, Korea, and (3) University of Virginia, Charlottesville, VA, USA

BioTechniques 38:463-471 (March 2005)
Vol. 38, No. 3 (2005)

Different sources of systematic and random error variation are often observed in cDNA microarray experiments. A simple scatter plot is commonly used to look for outlying slides that have unusual expression patterns or larger variability than other slides. These outlying slides tend to have a large impact on subsequent analyses, such as identification of differentially expressed genes and clustering analysis. However, it is difficult to select outlying slides rigorously and consistently based on subjective human pattern recognition applied to their scatter plots. A graphical method and a rigorous diagnostic measure are proposed to detect outlying slides. The proposed graphical method is easy to implement and is shown to be quite effective in detecting outlying slides in real microarray data sets. The diagnostic measure is also informative for comparing variability among slides. Two cDNA microarray data sets are carefully examined to illustrate the proposed approach: a 3840-gene microarray experiment for neuronal differentiation of cortical stem cells and a 2076-gene microarray experiment for anticancer compound time-course expression of the NCI-60 cancer cell lines.

INTRODUCTION

Biological processes undergo complex interactions between many genes and gene products. Genome-wide microarray profiling technology has been recognized as a breakthrough for understanding such complex gene regulation and interactions simultaneously in biology and medicine (1–4).

A cDNA microarray slide consists of thousands of cDNA clones spotted on a high-density glass slide. Each slide is competitively hybridized with two independent mRNA samples, labeled with red (Cy™5) and green (Cy3) fluorescent dyes. The expression level of each cDNA clone (or gene) can then be measured by reading the two fluorescence intensities in the green (G) and red (R) channels for the two RNA samples. The ratio of these two fluorescence intensities at each spot represents the relative abundance of the corresponding cDNA probe (5). However, different sources of systematic and random error can arise in cDNA microarray experiments, and these may significantly affect inference on the measured gene expression patterns.

A normalization procedure and a variance-stabilizing transformation are commonly employed to remove (or minimize) the artifacts due to such error variation. Several normalization methods have been proposed using parametric and nonparametric statistical models (6–8). These normalization methods mainly focus on adjusting location parameters such as means or medians. Larger variability is often observed in regions of low log-transformed intensity because, at low intensity levels, the background noise is a larger proportion of the observed expression intensity (i.e., the signal-to-noise ratio is lower), while at high levels the expression intensity dominates the background noise. To obtain homogeneous variability across different intensity regions and genes, variance-stabilizing and other transformation approaches have been suggested, including the generalized log transformation (9–14).

While these normalization approaches and variance-stabilizing transformations are useful for adjusting the bias of each individual slide, they do not provide a rigorous statistical criterion for detecting outlying slides that have unusual expression patterns or show larger variability than other slides. At an early stage of analysis, each microarray slide is often examined graphically, using the scatter plot of the two intensity channels to assess the overall pattern and variability. However, such examination is based on subjective human pattern recognition, and outlying slides can frequently enter the subsequent analysis, resulting in unreliable inference for the whole microarray study. Therefore, the main focus of this study is to identify the outlying slides that have unusual nonlinear expression patterns and/or larger variability than other slides in a microarray data set.

We propose the diagnostic plot (DP) approach, which succinctly summarizes and detects outlying slides in a cDNA microarray study. The proposed DP is motivated by the observation that adjustment of nonlinear trends between the two intensity channels often results in different degrees of correlation between them. Figure 1 shows the log-scatter plots based on Lowess normalization for the slides from the rat neuronal microarray study described later. Slides were chosen based on the pattern (linear and nonlinear) and variability (small and large). Figure 1A shows the results for slide 9: the log-scatter plots of the log-transformed green and red channel intensities. Although the three scatter plots show the same slide, the correlation between the two normalized channels varies greatly depending on the weight parameter γ, introduced later in the Materials and Methods section. A distance measure is derived to summarize both the correlation and the curvature between the two intensity channels as a diagnostic measure for each slide. This diagnostic measure becomes large if the correlation is low and/or the nonlinear expression pattern is severe, and vice versa. That is, if the diagnostic measure of a certain slide is significantly larger than those of other slides, that slide has a different expression pattern from the other slides because of higher curvature and/or larger variability.

The Materials and Methods section addresses several key issues in microarray normalization and introduces our DP approach. In the Results section, the proposed method is illustrated using two cDNA microarray data sets: (i) a 3840-gene cDNA microarray experiment to search for expression changes during neuronal differentiation of cortical stem cells and (ii) a 2076-gene cDNA microarray experiment to investigate anticancer compound time-course expression patterns of the cancer cell lines in the NCI-60 panel. Our concluding remarks and discussion follow in the Discussion section.

MATERIALS AND METHODS

Suppose there are I cDNA microarray slides (or chips), each with J genes. For simplicity, the subscript for different slides is omitted. Let Gj and Rj be the background-corrected green and red intensities for gene j, respectively. Define (Mj, Aj) as:

$$M_j = \log(R_j / G_j), \qquad A_j = \tfrac{1}{2}\,(\log R_j + \log G_j)$$

Here, Mj is the log-ratio of the two intensities, and Aj is the logarithm of their geometric mean.

The purpose of normalization is to remove systematic variation in a microarray experiment that affects the measured gene expression levels (7). Two normalization methods are commonly used: (i) global normalization using the global median of the log intensity ratios and (ii) intensity-dependent nonlinear normalization using a robust scatter plot smoothing (Lowess) curve. Under the assumption that only a small percentage of genes are differentially expressed, without a bias toward over- or underexpression, M is expected to be close to zero. With a correction term c(A), the normalized ratio M* is then obtained by

$$M^*(A) = M - c(A)$$ [Eq. 1]

Note that if c(A) is a constant, this is simply a global normalization that subtracts the constant from all genes on a chip. In general, c(A) is obtained by a nonparametric local regression, such as a Lowess fit to the M versus A plot. The normalization process is therefore performed in two steps: (i) transform (log G, log R) to (M, A) and (ii) obtain M* using a regression model.

Our DP approach then transforms M* back to (log G*, log R*), the normalized log-green and log-red intensities. Such a back-transformation is occasionally performed in practice for data interpretation. The following three steps are implemented for DP:

step 1, (log G, log R) → (M, A);
step 2, (M, A) → M*(A) by normalization; and
step 3, M*(A) → (log G*, log R*).

Steps 1 and 2 are from the usual normalization process. In general, a larger value of c(A) is caused by systematic differences between log G and log R. For step 3, define (log G*, log R*) as follows: for 0 ≤ γ ≤ 1,

$$\log G^*(\gamma) = \log G + \gamma\, c(A), \qquad \log R^*(\gamma) = \log R - (1 - \gamma)\, c(A)$$ [Eq. 2]

The parameter γ is a weight parameter taking values between zero and one. Regardless of the value of γ, log R*(γ) - log G*(γ) = M*. Let A* be the average of the normalized log channel intensities:

$$A^* = \tfrac{1}{2}\,(\log R^* + \log G^*)$$ [Eq. 3]

Substituting Equation 2 into Equation 3 gives A* = A + (γ - 1/2)c(A), so A = A* when γ = 0.5; in that case there is a unique one-to-one transformation between (M*, A*) and (log G*, log R*). Some useful information regarding the pattern of the slide can be obtained by varying γ. When γ = 0, the green channel is fixed and c(A) is subtracted from the red channel. When γ = 1, the red channel is fixed and c(A) is added to the green channel.

Figure 1 shows the log-scatter plots based on Lowess normalization and the density plot of A* for varying γ values. Both slides 9 and 15 show nonlinear patterns, but slide 15 has much larger variability than slide 9. Slides 19 and 21 show linear patterns. Here, a linear pattern implies a constant slope that may not be equal to one. The variability is expressed in terms of how widely the intensities spread around the M* = 0 line after normalization. Slide 19 shows a strong linear relationship, while slide 21 shows a weak linear relationship with large variability. Table 1 summarizes the characteristics of the four slides in terms of pattern, variability, and correlation of the original intensities.

Figure 1A shows the results for slide 9: the log-scatter plots of [log G*(γ), log R*(γ)] and the density plot of A* for different values of γ. These plots show quite different patterns over γ. Especially when γ = 0, they show distinctive patterns. The correlation between the two normalized channels varies greatly over γ. Note that the normalized ratios M* = log R*(γ) - log G*(γ) are the same for all values of γ. On the other hand, the distribution of A* differs considerably over γ. For example, in the last plot of Figure 1A, as γ increases from 0 to 1, the distribution of A* becomes more dispersed. Figure 1B shows the results for slide 15. The plots of [log G*(γ), log R*(γ)] show that there is more variability at low intensities. The patterns observed here are more extreme than those of slide 9. Figure 1, C and D, shows the plots for slides 19 and 21, respectively. Unlike slides 9 and 15, these two slides show consistent patterns and are less sensitive to γ. Even when γ = 0, they show patterns similar to those for γ > 0.
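The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the intensities are made-up toy values, natural logarithms are used (the text does not fix the base), and step 2 uses global median normalization as a simple stand-in for the Lowess fit. The names `to_ma` and `split_correction` are hypothetical helpers, not part of any published software.

```python
import math
import statistics

def to_ma(g, r):
    """Step 1: background-corrected (G, R) -> (M, A), using natural logs."""
    m = [math.log(rj / gj) for gj, rj in zip(g, r)]
    a = [0.5 * (math.log(rj) + math.log(gj)) for gj, rj in zip(g, r)]
    return m, a

def split_correction(g, r, c, gamma):
    """Step 3 (Eq. 2): share the per-gene correction c(A) between the channels."""
    log_g_star = [math.log(gj) + gamma * cj for gj, cj in zip(g, c)]
    log_r_star = [math.log(rj) - (1.0 - gamma) * cj for rj, cj in zip(r, c)]
    return log_g_star, log_r_star

# Toy intensities for five genes (illustrative values only, not from the paper).
G = [120.0, 450.0, 900.0, 3000.0, 15000.0]
R = [200.0, 600.0, 1000.0, 3100.0, 14000.0]

M, A = to_ma(G, R)

# Step 2 stand-in: global normalization, c(A) = median(M) for every gene.
# An intensity-dependent fit such as Lowess would give a different c per gene.
C = [statistics.median(M)] * len(M)
M_star = [mj - cj for mj, cj in zip(M, C)]

for gamma in (0.0, 0.5, 1.0):
    log_g_star, log_r_star = split_correction(G, R, C, gamma)
    a_star = [0.5 * (x + y) for x, y in zip(log_r_star, log_g_star)]
    # M* does not depend on gamma, while A* - A = (gamma - 0.5) * c(A) does.
    print(f"gamma={gamma:.1f}  M*[0]={log_r_star[0] - log_g_star[0]:+.4f}  "
          f"A*-A={a_star[0] - A[0]:+.4f}")
```

Because c(A) is a single constant in this toy case, the shift A* - A is the same for every gene; with an intensity-dependent c(A) the shift changes along the intensity axis, which is what alters the shape of the [log G*(γ), log R*(γ)] scatter.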


Figure 1. Varying (log G, log R) plots based on Lowess normalization and density plots of A* for different values of γ from rat neuronal microarray data. G, green (Cy3) fluorescent dye; R, red (Cy5) fluorescent dye.

As a result, the correlation between the two normalized channels does not vary much over γ. The density plots of A* also show quite consistent patterns for these slides. There are no effects of γ on A* for slide 19. Although there are some minor effects of γ on slide 21, the dispersion of A* does not vary much over γ. Note that being less sensitive to γ implies that the Lowess curve c(A) is within some neighborhood of 0 for all values of A and that Lowess normalization has little effect on these slides.

From these considerations, the correlation coefficient of [log G*(γ), log R*(γ)] is used as a summary diagnostic measure. Two diagnostic plots are proposed using this measure. The first plot is called the within-slide correlation plot, and the second one is called the DP.

Figure 2. Diagnostic plots. (A) Rat neuronal microarray data: within-slide correlation plot. (B) Rat neuronal microarray data: diagnostic plot using d. (C) NCI dose-response microarray data: within-slide correlation plot. (D) NCI dose-response microarray data: diagnostic plot using d.

Table 1. Examples of Slides Showing Different Patterns and Variability Characteristics

Slide No.                 9           15             19       21
Pattern                   nonlinear   nonlinear      linear   linear
Variability               small       large          small    large
Original correlation      middle      low            high     middle
Effect of normalization   large       large          small    small
Effect of γ               large       large          small    small
Mean of ρ                 large       small-middle   large    middle
Range of ρ                large       large          small    small
Outlying slide            No          Yes            No       Yes

Within-Slide Correlation Plot

The within-slide correlation plot shows the values of the correlation coefficient over γ ∈ [0, 1] (refer to Figure 2A). It is a plot of γ (0 ≤ γ ≤ 1) on the x-axis against ρ(γ) on the y-axis, where ρ(γ) is the sample correlation coefficient of [log G*(γ), log R*(γ)]. Thus, each slide is represented by a line in this plot. If there is a strong positive linear relationship between the two original channels, the values of ρ(γ) tend to be constant and close to one (Figure 1C). If there is a weak linear relationship, the values of ρ(γ) also tend to be constant but much smaller than one (Figure 1D). On the other hand, if there is a nonlinear pattern in a slide, the values of ρ(γ) tend to vary, depending on the variability of the slide. For example, when this nonlinearity can be fixed by Lowess normalization, the correlation can increase, and its maximum can be close to one.

This plot can be understood intuitively as follows. The slides represented by lines at the top that change little over γ are those showing a strong linear relationship between the two channels. The slides represented by lines in the middle that change little are those showing a weak linear relationship between the two channels. The slides represented by lines with steep slopes are those showing a nonlinear relationship between the two channels; depending on the variability of the slide, the maximum of such a line can be close to one or much smaller than one. Thus, the patterns of slides can be easily identified from the lines in the within-slide correlation plot.
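As a rough illustration of how one line of the within-slide correlation plot could be computed, the sketch below evaluates the correlation of the two normalized channels on the grid γ = 0, 0.1, …, 1 for two simulated slides. Everything here is an assumption made for illustration: the data are synthetic, the known trend is used as c(A) in place of a fitted Lowess curve, and `pearson`, `correlation_curve`, and `make_slide` are hypothetical helper names.

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_curve(log_g, log_r, c, gammas):
    """Correlation of [log G*(gamma), log R*(gamma)] for each gamma (Eq. 2)."""
    curve = []
    for gamma in gammas:
        lg = [g + gamma * cj for g, cj in zip(log_g, c)]
        lr = [r - (1.0 - gamma) * cj for r, cj in zip(log_r, c)]
        curve.append(pearson(lg, lr))
    return curve

GAMMAS = [k / 10 for k in range(11)]

# Synthetic slide: A on a grid, M = trend(A) + noise, then back to log G, log R.
rng = random.Random(0)
A = [8.0 + 4.0 * k / 499 for k in range(500)]
noise = [rng.gauss(0.0, 1.0) for _ in A]

def make_slide(trend):
    m = [trend(a) + e for a, e in zip(A, noise)]
    log_g = [a - 0.5 * mj for a, mj in zip(A, m)]
    log_r = [a + 0.5 * mj for a, mj in zip(A, m)]
    c = [trend(a) for a in A]  # stand-in for the Lowess fit c(A)
    return log_g, log_r, c

linear = make_slide(lambda a: 0.0)                          # no intensity-dependent bias
curved = make_slide(lambda a: 3.0 * math.exp(-(a - 8.0)))   # bias at low intensity

rho_linear = correlation_curve(*linear, GAMMAS)
rho_curved = correlation_curve(*curved, GAMMAS)

print("linear slide range:", max(rho_linear) - min(rho_linear))
print("curved slide range:", max(rho_curved) - min(rho_curved))
```

In this toy setting the slide without a trend gives a flat line, because c(A) is zero and γ has nothing to redistribute, while the slide with a low-intensity bias gives a line whose height changes with γ. The direction of the change depends on the shape of the trend chosen here and should not be read as a general property.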

Figure 3. Plots of (log G, log R) for the original 36 slides from rat neuronal microarray data. The rows represent six time sequences. The first three columns represent the triplicated slides from the ciliary neurotrophic factor (CNTF)-treated group. The next three columns show the triplicated slides from the no CNTF-treated group. The number indicates the slide identification number. G, green (Cy3) fluorescent dye; R, red (Cy5) fluorescent dye.


Diagnostic Plot

From the within-slide correlation plot, more condensed summary measures can be derived. First, the mean correlation coefficient of each slide can be calculated. Second, the difference between the maximum and minimum values of ρ(γ) can be evaluated. The former summarizes the overall correlation between the two channels, and the latter summarizes the variability (or range) of that correlation. Let ρ̄ be the mean value of the sample correlation coefficients and R be the range of the correlation coefficients. In the examples, the within-slide correlation plots are monotonic (Figure 2); in this case R is simply |ρ(0) - ρ(1)|. The DP is a plot of 1 - ρ̄ (x-axis) against R (y-axis) (refer to Figure 2B). In the examples, ρ̄ was calculated from the values of ρ(γ) at γ = 0, 0.1, 0.2, …, 1.

For slides with linear patterns, R is close to zero. For slides showing nonlinear patterns, R becomes large. The values of 1 - ρ̄ depend on variability. Thus, the dots in the lower left corner (close to the origin) represent slides with a strong linear pattern. The dots in the upper left corner represent slides with a nonlinear pattern and small variability. The dots in the lower right corner represent slides with a weak linear pattern, while the dots in the upper right corner represent slides with a nonlinear pattern and large variability.

Diagnostic Measure

From the proposed diagnostic plot, outlying slides can be easily detected visually. However, in order to identify outlying slides more objectively, we propose a diagnostic measure using the distance defined as follows. Let (x1, y1), …, (xI, yI) denote the slides in the DP. In this plot, the point (0, 0) represents the ideal condition of a slide. For each slide, we can thus compute the following distance:

$$d_i = \sqrt{x_i^2 + y_i^2}$$

for i = 1, …, I, where di is the Euclidean distance from the origin, summarizing the pattern and overall variability of each slide.

Detection of Outlying Slides

Outlying slides can be detected as those having large d values. A cutoff criterion can be derived if a distributional assumption can be made for di. For example, if di (i = 1, …, I) follows a Gaussian distribution with mean μ and variance σ², then a one-sided confidence limit can be obtained:

$$d_{CL} = \bar{d} + t_{\alpha, I-1}\, S_d / \sqrt{I}$$ [Eq. 4]

where d̄ is the sample mean of the di, Sd² is their sample variance, and tα,I-1 is the upper 100(1 - α) percentile of the t-distribution with I - 1 degrees of freedom. In order to minimize the effect of outliers in estimating means and variances, a robust inference can be applied by using trimmed means and variances. A more rigorous approach can also assume that all (xi, yi)ᵀ follow the bivariate normal distribution and construct a distance measure based on the chi-square distribution. A detailed description of this method is given in the supplementary material (see the BioTechniques' web site at www.BioTechniques.com/March05/ParkSupplementary.html). Alternatively, a simple clustering approach such as the K-means algorithm can be applied to identify outlying slides without any distributional assumption.

In order to understand the characteristics of the diagnostic plots more systematically, we performed a simulation study by generating microarray data with a certain biased error. We leave the details of the simulation results to the supplementary material. Through the simulation studies, these diagnostic plots are shown to effectively represent the pattern and variability of the slides, both with a simple (constant-variance) bias that can be easily adjusted by a normalization method and with a nonconstant-variance bias that cannot be easily adjusted.

RESULTS

Neuronal Differentiation Data

We first apply our approach to an array data set from a study of rat cortical stem cells (biostats.snu.ac.kr/data). The objective of the study was to identify differentially expressed genes between two experimental groups (15). A cDNA microarray experiment was performed to study the 3840-gene expression profile for the neuronal differentiation of cortical stem cells. There are two experimental groups [ciliary neurotrophic factor (CNTF) and no CNTF] and six time points (12 h, 1 day, 2 days, 3 days, 4 days, and 5 days) for comparison. All the experiments were replicated three times. Here, we mainly focus on drawing diagnostic plots to detect outlying slides.

Figure 3 shows the log-scatter plots, (log G, log R), of the 36 slides. As shown in this figure, the patterns of the slides vary greatly even among replicated arrays. Lowess normalization was performed for each slide. From the normalized M*, the normalized (log G*, log R*) were obtained with varying γ values in Equation 2.

Figure 2A shows the within-slide correlation plot. For the Lowess curve, the user-defined smoothing parameter f, which is the fraction of the data used for smoothing at each point, was set at 0.2. We tried different values of f, but the plots were not sensitive to f because of the large number of genes. This plot contains 36 lines, where each line represents one slide. As expected, the 36 slides show quite a variety of patterns. In general, the correlation coefficient increases slowly as γ increases. Slides 15, 21, 31, and 36 seem to have quite different patterns from those of the other slides.

Figure 2B shows the DP. This plot summarizes the information in Figure 2A; that is, each line becomes one point. As observed from this figure, slides 15, 21, 31, and 36 also show patterns quite distinct from those of the other slides. The original plots of these slides show substantial variation, as shown in Figure 3. Using the diagnostic measures, we compute the 95% confidence region, assuming that the diagnostic measures follow a normal distribution. The dots outside the corresponding circle can be regarded as problematic slides. Those dots are slides 15, 21, 31, and 36.

The normality assumption for di or (xi, yi) can be checked by using simple quantile-quantile (QQ) plots. The QQ plots of the x-axis variable, 1 - ρ̄, and the y-axis variable, R, do not provide strong evidence against the normality assumption for these variables (refer to the supplementary material). The QQ plot of d also supports the normality assumption.

We also applied the K-means clustering analysis. Both K = 3 and K = 4 provide reasonable results. Figure 2B shows the results for K = 3. The triangle, plus, and circle represent the three clusters. As expected, the groups are well separated. The four outlying slides are the ones detected by the normal distributional approach.

NCI-60 Cell Line Data

We also apply our approach to a microarray data set of the NCI-60 cancer cell lines (NCI-60). These cell lines, derived from human tumors, have been widely used for investigations of various drugs and molecular targets (discover.nci.nih.gov). The National Cancer Institute's Developmental Therapeutic Programs (dtp.nci.nih.gov) has been studying a large number of anticancer drug compounds and molecular targets on the 60 cancer cell lines (16–18). In particular, the NCI-60 microarray data have been frequently reanalyzed as an experimental model because of the inaccessibility of human tumor tissues for various studies on cancer. Using HCT116, one of the colon cell lines in the NCI-60 panel, gene expression profiling of the dose- and time-dependent effects of the topoisomerase I inhibitor camptothecin (CPT) was performed (19). Among the multiple subsets of slides, one subset consisting of 14 slides was randomly selected to demonstrate the proposed approach. This subset was reported to have been obtained under a homogeneous experimental condition.

Figure 4. NCI dose-response microarray data. Plots of (log G, log R) for the original 14 slides. G, green (Cy3) fluorescent dye; R, red (Cy5) fluorescent dye.

Figure 4 shows the log-scatter plots of these 14 slides; slides 37, 38, and 71 are outlying slides. Figure 2, C and D, shows the within-slide correlation plot and the DP for the 14 slides, respectively. As observed in the scatter plots, slides 37 and 38 are identified by the diagnostic measure at a 5% significance level, and slide 71 is the next slide, found close to the boundary of the critical region. Therefore, in this relatively small NCI-60 microarray data set, the patterns and overall variability of the individual slides could be effectively examined by our DP method.
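The quantities used in these analyses, namely the DP coordinates (1 - ρ̄, R), the distance d, and the one-sided limit of Equation 4, can be sketched as follows. This is a minimal illustration, not the authors' code: the correlation curves are made-up numbers, the helper names `dp_point`, `distance`, and `upper_limit` are hypothetical, and the limit is computed as the sample mean plus t times Sd / sqrt(I), which is one reading of Equation 4 and should be treated as an assumption. The critical value 2.015 is the upper 5% point of the t-distribution with 5 degrees of freedom, taken from a standard table.

```python
import math

def dp_point(rho_curve):
    """DP coordinates of one slide: (1 - mean of rho, range of rho)."""
    mean_rho = sum(rho_curve) / len(rho_curve)
    return 1.0 - mean_rho, max(rho_curve) - min(rho_curve)

def distance(point):
    """Euclidean distance of a DP point from the ideal point (0, 0)."""
    x, y = point
    return math.hypot(x, y)

def upper_limit(d, t_crit):
    """One-sided limit: mean(d) + t_crit * S_d / sqrt(I) (assumed form of Eq. 4)."""
    n = len(d)
    mean = sum(d) / n
    s_d = math.sqrt(sum((v - mean) ** 2 for v in d) / (n - 1))
    return mean + t_crit * s_d / math.sqrt(n)

# Made-up correlation curves for six slides (three gamma values each for brevity).
curves = [
    [0.95, 0.95, 0.96],  # strong linear pattern
    [0.93, 0.94, 0.94],
    [0.94, 0.95, 0.95],
    [0.92, 0.93, 0.94],
    [0.96, 0.96, 0.96],
    [0.40, 0.55, 0.70],  # nonlinear pattern with large variability
]

points = [dp_point(c) for c in curves]
d = [distance(p) for p in points]

limit = upper_limit(d, t_crit=2.015)  # I = 6 slides, so 5 degrees of freedom
flagged = [i for i, v in enumerate(d) if v > limit]
print("limit:", round(limit, 4), "flagged slide indices:", flagged)
```

With these toy values only the last slide lies beyond the limit. Because a single extreme slide inflates both the mean and the variance of d, trimmed estimates, as suggested in the text, would be a natural refinement.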

We also performed the K-means clustering analysis. Figure 2D shows the results for K = 3. The triangle, plus, and circle represent the three different clusters. As expected, the groups are well separated. Unlike the normal distributional approach, the clustering groups three slides together, including the two slides detected by the normal distributional approach. Similar results were obtained for K = 4. As shown in the supplementary material, the QQ plots for the NCI microarray data yielded results similar to those for the neuronal differentiation data. That is, the QQ plots of 1 - ρ̄, R, and d do not provide strong evidence against the normality assumption for these variables.
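For readers who want to try the clustering alternative, the sketch below runs a plain K-means (Lloyd's algorithm) on DP points with K = 3. It is an illustration under assumptions: the points are made-up (1 - ρ̄, R) pairs, the starting centroids are picked by hand rather than at random, and `kmeans` is a hypothetical helper, not the implementation used for Figure 2.

```python
import math

def kmeans(points, centers, n_iter=100):
    """Plain Lloyd's algorithm on 2-D points; `centers` are the starting centroids."""
    centers = list(centers)
    labels = None
    for _ in range(n_iter):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster is empty
                centers[k] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centers

# Made-up DP points: near the origin, upper left, and upper right.
points = [
    (0.04, 0.01), (0.05, 0.02), (0.06, 0.01),  # strong linear pattern
    (0.05, 0.30), (0.06, 0.33),                # nonlinear, small variability
    (0.45, 0.30),                              # nonlinear, large variability
]
start = [points[0], points[3], points[5]]      # hand-picked starting centroids

labels, centers = kmeans(points, start)
print(labels)
```

K-means results depend on the starting centroids, so in practice several starts would normally be compared; the fixed start here only keeps the example reproducible.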


DISCUSSION

In microarray experiments, undesirable systematic variation is observed from different sources. A nonparametric regression normalization method is often applied on an M versus A plot to adjust for such biased variation. However, normalization alone cannot control all systematic variation. The main idea of the proposed DP rests on two findings. First, some slides show considerable differences and nonlinear relationships between the two channel expression patterns as the transformation constant γ varies. Second, even though such a general nonlinear trend can be artificially adjusted by the normalization methods above, significant differences in the expression patterns of a few outlying slides indicate fundamental problems in analyzing the whole microarray data set. Note that the normalized expression difference M* is occasionally transformed back to the two normalized channel intensity values G* and R*, but the γ-dependent effect of this transformation, which we have found quite useful in microarray data analysis, has not been pointed out by other studies.

Statistical analysis of cDNA arrays is typically based on the normalized log-ratio (M*) of the two channels. From the normalized M*, it is difficult to distinguish whether the original slide has a linear or a nonlinear pattern. For example, slides 9 and 19 in Figure 1 originally have totally different patterns, but their normalized M* have similar distributions. Thus, a blind use of normalization may ignore such information, even though the patterns themselves may sometimes be useful.

The DP method can also be applied to microarray data adjusted by a variance-stabilizing transformation. For example, for the green channel intensity, a more general transformation with a parameter λ can be used in place of the standard log transformation log G (9). The parameter λ can be estimated by the maximum likelihood method (10). The parameter λ is expected to play a key role in applying the DP to variance stabilization, as γ does for normalization. The applicability of the plot in this setting remains a problem for future work.

In our applications to the two microarray data sets, we found that the DP results were consistent with the observations from the scatter plot of each slide, summarizing those observations quantitatively and objectively. The remaining issue is how to treat the outlying slides detected by the DP. The simplest approach is to eliminate these outlying slides. However, data reduction in a small-sample microarray experiment may result in a significant loss of power in detecting differentially expressed genes. An alternative approach is to use the diagnostic measure d as a weight variable, wi = (1/di) / Σk (1/dk) for the ith slide, where the sum runs over all I slides. That is, the larger the diagnostic measure, the smaller the weight the slide has in the analysis. Weighted analysis is commonly used in many statistical models, where the weight is usually chosen to be the inverse of the variance. The diagnostic measure can be treated as a summary measure of the variability of a slide.

To investigate such a possibility, we reanalyzed the neuronal differentiation data by eliminating the outlying slides and by using d as a weight variable. Figure 5 summarizes the number of significant genes detected by three methods: (i) all 36 slides in the analysis; (ii) 32 slides in the analysis after eliminating four slides; and (iii) 36 slides in the analysis using a weight proportional to 1/d. Forty-three genes are identified by all three methods. In this preliminary experiment, we found that the analysis after eliminating four slides may provide more valid results (refer to the supplementary material for more details).
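The weighting scheme described above can be written down directly. The sketch below assumes every diagnostic measure is strictly positive and uses made-up d values; `slide_weights` is a hypothetical helper name.

```python
def slide_weights(d):
    """Weights w_i = (1/d_i) / sum_k (1/d_k); assumes every d_i > 0."""
    inverse = [1.0 / v for v in d]
    total = sum(inverse)
    return [v / total for v in inverse]

# Made-up diagnostic measures for three slides.
weights = slide_weights([0.05, 0.10, 0.50])
print(weights)
```

The weights sum to one, and the slide with the largest d receives the smallest weight, so an outlying slide is down-weighted rather than discarded.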

ACKNOWLEDGMENTS

The authors wish to thank four anonymous referees whose comments were extremely helpful. The work was supported in part by the Korea Science and Engineering Foundation (KOSEF) through the Statistical Research Center for Complex Systems at Seoul National University to T.P. and by the U.S. American Cancer Society grant no. RSG-02-182-01-MGO to J.K.L.

Figure 5. Further analyses for rat neuronal microarray data. Number of significant genes using three methods: all method, elimination method, and weighting method.


COMPETING INTERESTS STATEMENT

The authors declare no competing interests.

REFERENCES

1. Friddle, C.J., T. Koga, E.M. Rubin, and J. Bristow. 2000. Expression profiling reveals distinct sets of genes altered during induction and regression of cardiac hypertrophy. Proc. Natl. Acad. Sci. USA 97:6745-6750.

2. Galitski, T., A.J. Saldanha, C.A. Styles, E.S. Lander, and J.P. Mesirov. 1999. Ploidy regulation of gene expression. Science 285:251-254.

3. Golub, T.R., D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, et al. 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286:531-537.

4. Scherf, U., D.T. Ross, M. Waltham, L.H. Smith, J.K. Lee, L. Tanabe, K.W. Kohn, W.C. Reinhold, et al. 2000. A gene expression database for the molecular pharmacology of cancer. Nat. Genet. 24:236-244.

5. Chen, Y., E.R. Dougherty, and M.L. Bittner. 1997. Ratio-based decisions and quantitative analysis of cDNA microarray images. J. Biomed. Opt. 2:364-374.

6. Kerr, M.K., M. Martin, and G.A. Churchill. 2000. Analysis of variance for gene expression microarray data. J. Comput. Biol. 7:819-837.

7. Yang, Y.H., S.D. Dudoit, P. Luu, and T.P. Speed. 2001. Normalization for cDNA microarray data. SPIE BioE 2:141-152.

8. Wolfinger, R.D., G. Gibson, E.D. Wolfinger, L. Bennett, H. Hamadeh, P. Bushel, C. Afshari, and R.S. Paules. 2001. Assessing gene significance from cDNA microarray expression data via mixed models. J. Comput. Biol. 8:625-637.

9. Durbin, B.P., J.S. Hardin, D.M. Hawkins, and D.M. Rocke. 2002. A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 18:S105-S110.

10. Durbin, B. and D.M. Rocke. 2003. Estimation of transformation parameters for microarray data. Bioinformatics 19:1360-1367.

11. Durbin, B.P. and D.M. Rocke. 2003. Variance-stabilizing transformations for two-color microarrays. Bioinformatics 20:660-667.

12. Huber, W., A. von Heydebreck, H. Sultmann, A. Poustka, and M. Vingron. 2002. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18:S96-S104.

13. Rocke, D.M. and B.P. Durbin. 2001. A model for measurement errors for gene expression arrays. J. Comput. Biol. 8:557-569.

14. Rocke, D.M. and B.P. Durbin. 2003. Approximate variance-stabilizing transformations for gene-expression microarray data. Bioinformatics 19:966-972.

15. Park, T., S.-G. Yi, S.-M. Lee, S.Y. Lee, D.-H. Yoo, M.-Y. Chang, and Y.-S. Lee. 2003. Statistical tests for identifying differentially expressed genes in time course microarray experiments. Bioinformatics 19:694-703.

16. Paull, K.D., R.H. Shoemaker, L. Hodes, A. Monks, D.A. Scudiero, L. Rubinstein, J. Plowman, and M.R. Boyd. 1989. Display and analysis of patterns of differential activity of drugs against human tumor cell lines: development of mean graph and COMPARE algorithm. J. Natl. Cancer Inst. 81:1088-1092.

17. Weinstein, J.N., T.G. Myers, P.M. O'Connor, S.H. Friend, A.J. Fornace Jr., K.W. Kohn, T. Fojo, S.E. Bates, et al. 1997. An information-intensive approach to the molecular pharmacology of cancer. Science 275:343-349.

18. Shi, L.M., Y. Fan, J.K. Lee, T. Myers, M. Waltham, A. Andrews, U. Scherf, K.D. Paull, et al. 2000. Mining and visualizing large anticancer drug databases. J. Chem. Inf. Comput. Sci. 40:367-379.

19. Zhou, Y., F.G. Gwadry, W.C. Reinhold, L.D. Miller, L.H. Smith, U. Scherf, E.T. Liu, K.W. Kohn, et al. 2002. Transcriptional regulation of mitotic genes by camptothecin-induced DNA damage: microarray analysis of dose- and time-dependent effects. Cancer Res. 62:1688-1695.

Received 1 May 2004; accepted 9 September 2004.

Address correspondence to:
Taesung Park
Department of Statistics
Seoul National University
56-1 Shilim-Dong, Kwanak-Gu
Seoul, 151-742, Korea
e-mail: tspark@stats.snu.ac.kr