Program an analysis workflow Day 1. Basic functionality of - - PowerPoint PPT Presentation


Microarray data analysis with Chipster, 16.-17.4.2008

Jarno Tuimala, Eija Korpelainen

Program – an analysis workflow

Day 1.

  • Basic functionality of Chipster (Eija)
  • Data import (Eija)
  • Quality control (Jarno)
  • Normalization (Jarno)
  • Describing the experiment
  • Filtering and missing value considerations (Jarno)

Day 2.

  • Statistical testing (Jarno)
  • Clustering and visualization (Jarno)
  • Annotation (Eija)
  • Promoter analysis (Eija)
  • Experimental design (Jarno) – if time allows

Demo data

Affymetrix

  • Kidney cancer
  • 8 controls, 9 cancer patients

Agilent

  • Acute leukemia
  • 7 controls, 7 FLT mutated

Illumina

  • Teratozoospermia
  • 5 controls, 8 affected

Introduction to microarrays

Research using microarrays

Plan!

  • Experimental design

Laboratory work

  • Extract, label, hybridize

Computer work

  • Scanning, image analysis
  • Bioinformatics

Laboratory work

  • Confirmation

Publish

  • Submit data to public databases

Introduction to Chipster

Chipster

  • Goal: Easy access to leading analysis tools such as those developed in the R/Bioconductor project

  • Features
  • Easy to use graphical user interface
  • Comprehensive selection of tools
  • Support for different array types (Affymetrix, Agilent, Illumina, cDNA)
  • Compatible with Windows, Linux and Mac OS X
  • Easy to install and update
  • Wizards and workflows
  • Interactive graphics
  • Transparency (as opposed to “black box”)
  • Alternative annotations for Affymetrix arrays
  • Automatic tracking of performed analyses
  • http://www.csc.fi/english/customers/university/useraccounts/scientificservices.pdf
  • http://chipster.csc.fi

How does it work?

[Figure: architecture diagram. Labels: desktop client (Java Web Start installs and updates the client automatically); internet, SSL, SOAP, international Web Services; front end, security and analyser at CSC (Corona/Murska); analysis and visualisation.]

Acknowledgements

  • Aleksi Kallio, Jarno Tuimala, Taavi Hupponen
  • Mika Rissanen, Janne Käki, Mikko Koski, Petri Klemelä
  • All the pilot users
  • Department of computer science (HY)
  • Dario Greco (HY)
  • Prof. Olli Yli-Harja’s group (TUT)
  • GeneCruiser team (MIT Broad Institute)
  • Tekes/SA SYSBIO-program

Data Tools Visualization

Phenodata – describing your experiment

Phenodata file is created during normalization

Fill in the group column with numbers describing your experimental setup

  • e.g. 1 = healthy control, 2 = cancer sample
  • necessary for the statistical tests to work

If you bring in previously created normalized data and phenodata:

  • Choose ”import directly” in the import tool
  • Right click on normalized data, choose ”Link to” phenodata and link type ”Annotation”

If you brought in normalized data and need to create phenodata for it:

  • Utilities/ Generate phenodata (fill in the chiptype parameter!)
  • Right click on normalized data, choose ”Link to” phenodata and link type ”Annotation”
  • Fill in the group column

Visualizing the data

Data visualization panel

  • Maximize and redraw for better viewing

Two types of visualizations

  • 1. Interactive visualizations produced by the client program
  • Select the visualization method from the pulldown menu of the data visualization panel

  • Save by right clicking on the image
  • 2. Static images produced by R/Bioconductor, Weeder, etc
  • Select from Analysis tools/ Visualisation
  • View by double clicking on the image file
  • Save by right clicking on the file name and choosing ”Export”

Interactive visualizations by the client

  • Spreadsheet
  • Histogram
  • Scatterplot
  • 3D scatterplot
  • Expression profiles
  • Clustered profiles
  • Hierarchical clustering
  • SOM clustering
  • Array pseudo-image

Available actions:

  • Change titles, colors etc
  • Zoom in/out
  • Select and annotate genes using the MIT GeneCruiser

Static images produced by R/Bioconductor

  • Volcano plot
  • Box plot
  • Histogram
  • Heatmap
  • Venn diagram
  • Idiogram
  • Chromosomal position
  • Correlogram
  • Dendrogram
  • QC stats plot
  • RNA degradation plot
  • K-means clustering
  • SOM-clustering

Automatic tracking of analysis history

Running many analyses simultaneously

You can have max 5 analysis jobs running at the same time. Use Task manager to

  • view parameters, status,…
  • cancel jobs

Workspace – continue later/elsewhere

Saving your workspace allows you to continue later

  • File/ Save workflow
  • File/ Load workflow

Currently it is possible to have only one workspace saved at a time. If you would like to continue your work on another computer, you need to transfer the workspace-snapshot folder to the corresponding location

  • C:\Documents and Settings\ekorpela\nami-work-files\workspace-snapshot

Workflow – reusing your analysis pipeline

  • Creates a ”macro” that can be applied to another normalized dataset and phenodata
  • Choose a dataset, and the workflow records the analysis steps that led to that dataset
  • You can give the workflow a meaningful name (ending in .bsh), but it has to be located in the chipster-scripts folder under nami-work-files
  • You can run a workflow on another computer by making it visible to Chipster with ”Reload workflows from disk”
  • You can change parameters directly in the workflow file

Wizard – autopilot for analysis

Wizard for Affymetrix data

Ready-made workflow to find differentially expressed genes

  • Normalization
  • Phenodata creation
  • Statistical test
  • Hierarchical clustering

Importing files

Affymetrix CEL-files are imported to Chipster automatically. Other files are imported using the Import tool.

Import tool, step 1

Define

  • Header
  • Footer
  • Title row
  • Delimiter

Import tool, step 2

  • Define columns
  • Modify flags

Importing Agilent files

  • Sample (rMeanSignal)
  • Sample background (rBGMedianSignal)
  • Control (gMeanSignal)
  • Control background (gBGMedianSignal)
  • Identifier (ProbeUID)
  • Annotation (ControlType)

https://extras.csc.fi/biosciences/chipster-manual/data-formats.html

Exercise I

  • 1. Import the demo data of your favorite type in Chipster

Affymetrix Agilent

  • 2. Save the workspace
  • 3. Have lunch (back at 13.00)

Quality control

Quality control tools

  • Affymetrix basic: RNA degradation + Affy QC
  • Affymetrix RLE & NUSE (might take a long time to run): fits a model to expression values
  • Agilent: MA-plot + density plot + boxplot
  • Visualization: dendrogram
  • Statistics: NMDS

Affymetrix I

Quality control tools are run on raw data (CEL files).

  • Dendrogram and NMDS on normalized data

Affymetrix II

Agilent

General QC – dendrogram and NMDS

Scatterplots

Heatmaps (this took an hour to calculate)

QC-tools in Chipster

Quality control

  • Affymetrix basic
  • Affymetrix RLE and NUSE
  • Agilent 2-color

Visualization

  • Dendrogram
  • Heatmap
  • Correlogram

Statistics

  • NMDS

Normalization

What is normalization?

Normalization is the process of removing systematic variation from the data. Typically you would normalize your data so that all the chips become comparable.

Methods

Affymetrix

  • Background correction + expression estimation + summarization
  • RMA (default) uses only PM probes, fits a model to them, and gives out expression values after quantile normalization and median polishing
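
The quantile normalization step mentioned above can be sketched in a few lines. This is a minimal illustration only, not the RMA implementation used by Chipster or Bioconductor: the function name and the toy intensities are made up, and ties are not given any special treatment.

```python
def quantile_normalize(chips):
    """chips: list of equal-length lists, one list of intensities per chip.

    Every chip ends up with the same set of values (the rank-wise means),
    placed according to the original ranks of its probes.
    """
    n_probes = len(chips[0])
    # Sort each chip, then average across chips rank by rank.
    sorted_chips = [sorted(chip) for chip in chips]
    rank_means = [sum(s[i] for s in sorted_chips) / len(chips)
                  for i in range(n_probes)]
    normalized = []
    for chip in chips:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda i: chip[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = rank_means[rank]
        normalized.append(out)
    return normalized


chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]  # made-up intensities, 2 chips x 3 probes
normalized = quantile_normalize(chips)
```

After the call both chips share the same distribution of values, which is what makes them comparable.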

Agilent

  • Background correction + averaging duplicate spots + normalization

After normalization the expression values are always expressed on log2-scale

Affymetrix

Methods: MAS5, Plier, RMA, GCRMA, Li-Wong

  • MAS5 is the older Affymetrix method, Plier is a newer one
  • RMA is the default, and works rather nicely if you have more than a few chips
  • GCRMA is similar to RMA, but also takes the GC content of the probes into account
  • Li-Wong is the method implemented in dChip

Variance stabilization makes the variance over all the chips similar

  • Works only with MAS5 and Plier, since all others output log2-transformed data by default (and are thus already corrected for the same phenomenon)

Custom chiptype

  • If you want to use reannotated probes (they are really assigned to the genes where they belong), select one from this menu.

Agilent I

Background correction

  • Background treatment: None, Subtract, Edwards, Normexp
  • Background offset: 0 or 50

Normalize chips

  • None, median, loess

Normalize genes (not typically used)

  • None, scale (to median), quantile

Chiptype

  • A must setting!

Agilent II

Background treatment typically generates many negative values that are coded as missing values after log2-transformation.

  • The usual subtract option does this
  • Using normexp + offset 50 will generate no negative values, and gives rather good estimates (best method reported)

Loess removes curvature from the data (suggested)
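
Why plain subtraction produces missing values can be shown with a small sketch. The function name and spot intensities below are made up. Note that this is simple subtraction plus an offset, not the model-based normexp method, so unlike normexp it does not guarantee positive values in general.

```python
import math


def log2_signal(foreground, background, offset=0.0):
    """Subtract the background, add an offset and log2-transform.

    Non-positive corrected values cannot be log-transformed,
    so they are returned as None (a missing value).
    """
    corrected = foreground - background + offset
    return math.log2(corrected) if corrected > 0 else None


# Made-up (foreground, background) intensities for three spots.
spots = [(120.0, 100.0), (90.0, 100.0), (1124.0, 100.0)]
plain = [log2_signal(f, b) for f, b in spots]
with_offset = [log2_signal(f, b, offset=50.0) for f, b in spots]
```

In `plain` the second spot is missing because its background exceeds its foreground; with the offset of 50 all three spots keep a value.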

Checking normalization

Exercise II

Normalize your dataset

  • Use two different normalization schemes

Describe the experiment (fill in phenodata)

Check the quality of your dataset

  • Is there a difference between the normalization schemes?
  • If there is, select the better one, and continue with it

Filtering

Gene filtering

Removing probes for genes that are

  • Not expressed
  • Expressed at constant level (not changing)

Often a good idea, and necessary before multiple testing correction can be adequately applied

  • Some controversy on this…

Non-specific filtering

  • Expression, flags, SD, …

Specific filtering

  • Statistical testing

Non-specific filtering

Often used for removing bad quality data:

  • Intensity value too low
  • Intensity value saturated
  • Appearance of the spot is abnormal

Typically, non-changing genes are also removed. These can be removed using

  • Filter by standard deviation
  • Filter by interquartile range
  • Filter by expression
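
A minimal sketch of the first two filters, assuming log2 expression values stored per gene. The function names, gene names and cut-offs are made up for illustration, and the Chipster tools may define their parameters differently.

```python
import statistics


def filter_by_sd(expression, fraction_to_remove):
    """Drop the given fraction of genes with the lowest standard deviation."""
    ranked = sorted(expression, key=lambda g: statistics.stdev(expression[g]))
    n_remove = int(len(ranked) * fraction_to_remove)
    return set(ranked[n_remove:])


def filter_by_iqr(expression, min_iqr):
    """Keep genes whose interquartile range is at least min_iqr."""
    kept = set()
    for gene, values in expression.items():
        q1, _, q3 = statistics.quantiles(values, n=4)
        if q3 - q1 >= min_iqr:
            kept.add(gene)
    return kept


# Made-up log2 expression values for four genes on four chips.
expression = {
    "geneA": [7.0, 7.1, 6.9, 7.0],  # nearly constant
    "geneB": [5.0, 9.0, 5.5, 8.5],  # changing
    "geneC": [3.0, 3.2, 6.0, 6.1],  # changing
    "geneD": [8.0, 8.0, 8.1, 8.0],  # nearly constant
}
kept_sd = filter_by_sd(expression, 0.5)
kept_iqr = filter_by_iqr(expression, 1.0)
common = kept_sd & kept_iqr  # the overlap a Venn diagram would show
```

Here both filters keep the two changing genes and drop the two nearly constant ones.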

Specific filtering

Selecting genes that are associated with some phenotype

Typically involves statistical testing

Biologists typically concentrate on fold change (magnitude of effect), statisticians on p-value.

  • Both tell a slightly different story. Fold change ignores knowledge of variability, p-value ignores the size of the effect.
  • Take both into account by combining the filters.
  • Filter on expression value (what is biologically significant) and test for differences (what is statistically significant)

Unspecific filtering in Chipster

Pre-processing

  • Filter by expression
  • Select the upper and lower cut-offs
  • Select the number of chips this rule has to be fulfilled on
  • Select whether to return genes inside or outside the range
  • Filter by SD
  • Select the percentage of genes to filter out
  • Filter by interquartile range (IQR)
  • Select the IQR
  • Filter by coefficient of variation (CV)
  • Median is used for filtering on CV (cannot be changed)

Utilities

  • 1. Calculate descriptive statistics
  • 2. Filter using a column

Venn diagram

Select three datasets in Chipster

Run the Venn diagram tool from the Visualization tool category

SD CV IQR

Exercise III

Filter your dataset using unspecific filtering

  • Use two different schemes
  • Compare the schemes using Venn diagram
  • Are there any common genes?

Statistics

Some terminology

  • Usually tests for comparing means of two or more groups are used
  • Variance might be of interest too, but in practice this is never done.
  • Parametric tests (assume data normally distributed)
  • Typically used for microarray data
  • Non-parametric tests (assume no normality)
  • P-value
  • Risk of saying that there is a difference when there really isn’t
  • Traditionally 0.05 is used as a cut-off for significance
  • False discovery rate (FDR) is a p-value corrected for multiple tests (more on this later)

Mean and variance, an example for 1 gene

[Figure: two density plots, density.default(x = x1) and density.default(x = y1), each with N = 100000.]

Statistical testing

  • Needs replication (>2 chips per group)
  • Replication makes it possible to estimate uncertainty or variability in the measurements. This is typically measured by standard deviation.
  • Comparing means (parametric tests)
  • One-group tests
  • Compare to a known mean
  • Example: One-sample t-test
  • Two-group tests
  • Compare two groups’ means
  • Example: Two-sample t-test
  • Several group tests
  • Compare several groups’ means
  • Example: Analysis of variance (ANOVA)
  • Two or more groups, two or more factors
  • Compare means in the groups according to both factors simultaneously
  • Example: multiple linear regression (linear modeling in Chipster)

t-test

  • Compares means of two groups
  • If the p-value is small (conventionally <0.05), the data suggest a difference between the groups.
  • If the p-value is large (>0.05), no significant difference was detected; this does not prove that the groups are the same.
  • The p-value is the risk of saying that there is a difference when there actually isn’t.
  • A test for every gene is run separately -> thousands of tests and p-values

$$t = \frac{\bar{x}_1 - \bar{x}_2}{SE}$$
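
The t statistic for one gene can be computed as in the sketch below. The slide does not say how SE is estimated; the unpooled (Welch) estimate used here is an assumption, and the expression values are made up. Turning t into a p-value needs the t distribution, which is left out.

```python
import math
import statistics


def welch_t(group1, group2):
    """t = (mean1 - mean2) / SE, with SE from the unpooled variance estimate."""
    mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
    se = math.sqrt(statistics.variance(group1) / len(group1)
                   + statistics.variance(group2) / len(group2))
    return (mean1 - mean2) / se


# Made-up log2 expression values of one gene.
controls = [7.1, 6.9, 7.0, 7.2]
patients = [8.0, 8.3, 7.9, 8.2]
t_statistic = welch_t(controls, patients)
```

The statistic is negative here because the control mean is lower than the patient mean; the same calculation is repeated for every gene.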

ANOVA

A generalization of t-test. Compares means of several groups. Tells whether the means are different, but not which means differ from each other.

  • For this you can use post-hoc tests (not implemented in Chipster) or linear modelling (implemented in Chipster)

A test for every gene is run separately -> thousands of tests and p-values

Multiple testing correction I

  • After getting the results for all the genes, p-values are adjusted for the number of tests conducted.
  • When making several comparisons using the same test, some of the results will be chance findings.
  • Example: if the p threshold is 0.05, every 20th test on a gene with no real difference is expected to be significant by chance alone. If 10000 genes were tested, up to 500 genes would be expected to be chance findings. If we found 550 genes to be significant, most of those (500) would be false positives, and only a minority (50) true positives.
  • This can be corrected for (to some extent) by using a multiple testing correction.
  • Benjamini and Hochberg FDR: If the FDR threshold is 0.05, 5% of significant results are expected to be false positives (chance findings). If we tested 10000 genes, and 500 genes were significant after FDR correction, 25 of those are expected to be false positives, and 475 are expected to be true positives.
  • Thus, the FDR can be much higher than the p-value, and the results can still be meaningful and worth investigating.
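
The Benjamini and Hochberg adjustment itself is short enough to sketch. The function name and the five p-values are made up, and this is an illustration of the procedure rather than the code Chipster runs.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that the adjusted values
    # never increase when moving towards smaller p-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Expected chance findings in the example: 10000 tests at threshold 0.05.
expected_chance_findings = 10000 * 0.05

p_values = [0.001, 0.04, 0.03, 0.5, 0.008]  # made-up raw p-values
fdr = benjamini_hochberg(p_values)
```

Each adjusted value is at least as large as the raw p-value, and sorting the genes by adjusted value never reverses the order given by the raw p-values.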

Multiple testing correction II

The ranking of the genes does not change after multiple testing correction!

  • If you know that you can validate, say, 10 genes, then there’s no difference if you select the most significant genes before or after the multiple testing correction.
  • If there are no significant genes left after multiple testing correction, you probably have some differences, but not enough power in your experiment to detect those differences. In that case the top 10 genes are still the ones that are most likely to validate.

Gene set test (”global test”)

A typical result of a microarray experiment is a list of differentially expressed genes. Biologically, grouping these genes in pathways or functional categories would be more interesting. Are pathways associated with our endpoints of interest?

  • Is there a difference in nucleotide metabolism between 5-FU-treated cancer patients and their healthy controls?

Works on the expression values data.

Gene enrichment analysis

A typical result of a microarray experiment is a list of differentially expressed genes. Biologically, grouping these genes in pathways or functional categories would be more interesting. Gene enrichment analysis takes a list of differentially expressed genes, and tests whether they are enriched in any functional categories. Works on the gene list.
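
One common way to test enrichment of a gene list in a category is the hypergeometric test, sketched below. The slides do not state which test Chipster uses, so the choice of test is an assumption, and all the counts are made up.

```python
from math import comb


def enrichment_p_value(n_genes, n_in_category, n_selected, n_overlap):
    """Hypergeometric upper tail: probability of seeing at least n_overlap
    category genes when n_selected genes are drawn at random from n_genes."""
    total = comb(n_genes, n_selected)
    upper = min(n_in_category, n_selected)
    tail = sum(comb(n_in_category, k)
               * comb(n_genes - n_in_category, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / total


# Made-up numbers: 10000 genes on the chip, 100 of them in the pathway,
# 200 differentially expressed genes, 10 of which are in the pathway.
p = enrichment_p_value(10000, 100, 200, 10)
```

With these numbers only about 2 pathway genes would be expected in the list by chance, so finding 10 gives a very small p-value.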

Statistical tests in Chipster

Statistics

  • One sample tests
  • Are the genes expressed at all (different from 0)?
  • Two group tests
  • Several group tests
  • Linear modeling

Visualization

  • Volcano plot

Exercise

Exercise IV

To find differentially expressed genes, run a suitable statistical test for your (filtered) data set.

Are these differentially expressed genes enriched in some KEGG pathways?

  • There is a separate test for this.

Clustering

Clustering methods

Hierarchical clustering

Non-hierarchical clustering

  • K-means
  • QT-clustering
  • Self-organizing maps

Classification aka class prediction

  • K-nearest neighbor (KNN)

Unsupervised v. supervised

Hierarchical clustering

Two phases:

  • Pick a distance measure
  • Euclidean distance
  • Standard / Pearson correlation
  • Pick the dendrogram drawing method
  • Average linkage
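
The two phases can be illustrated with a small sketch of the distance measures and of average linkage. The function names and expression profiles are made up, and a full clustering would repeat the linkage step until a single tree remains.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_distance(a, b):
    """1 - Pearson correlation: profiles with the same shape are close,
    regardless of their expression level."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                     * sum((y - mean_b) ** 2 for y in b))
    return 1.0 - cov / norm


def average_linkage(cluster1, cluster2, distance):
    """Mean of all pairwise distances between members of two clusters."""
    pairs = [(a, b) for a in cluster1 for b in cluster2]
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


low = [1.0, 2.0, 3.0]       # made-up expression profiles
high = [11.0, 12.0, 13.0]   # same shape as `low`, higher level
flipped = [3.0, 2.0, 1.0]   # opposite shape, same level

d_euclid = euclidean(low, high)
d_pearson = pearson_distance(low, high)
d_linkage = average_linkage([low], [high, flipped], euclidean)
```

The choice of distance measure matters: `low` and `high` are far apart in Euclidean distance but identical in Pearson distance, while `low` and `flipped` are the other way round.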

Average linkage example

Hierarchical clustering - heatmap

K-means clustering

Finds K clusters from the data. User has to specify the number of clusters (K).
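
A minimal sketch of the idea, using single numbers instead of full expression profiles to keep it short. The function name and values are made up, and taking the first K points as starting centres is an assumption made only to keep the example deterministic (real tools usually start from random centres).

```python
def k_means(points, k, iterations=20):
    """Plain K-means on 1-D values: assign each point to the nearest
    centre, then move each centre to the mean of its cluster."""
    centres = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centres[c]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centre.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters


values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]  # made-up values with two obvious groups
centres, clusters = k_means(values, k=2)
```

With K = 2 the low and the high values end up in separate clusters; a different K would force a different partition of the same data.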

K-means clustering

Clustering in Chipster

Clustering

  • Hierarchical
  • Includes reliability checking of the resulting tree with bootstrapping

  • K-means

Statistics

  • PCA (principal component analysis)
  • NMDS (non-metric multidimensional scaling)

Exercise V

Cluster your differentially expressed genes using hierarchical clustering

Annotation

  • Annotation = descriptive text used for labeling features.
  • For genes, extra information about their location in chromosomes, biological functions, etc.
  • Retrieved from multiple biological databases and stored as a single database in Chipster (generated by the Bioconductor project).
  • Not available for all chiptypes, but required by certain analyses (annotation, gene enrichment analysis, promoter analysis)
  • For Affymetrix: either built-in or GeneCruiser
  • For other chiptypes: built-in

Alternative CDF environments for Affy

A CDF is a file that links individual probes to their location in genes (probesets)

Affymetrix default annotations use old CDF files that map a sizable number of probes to wrong genes

Alternative CDFs (custom chiptype in Affymetrix normalization) fix this problem

After using the alt CDFs, you can’t use gene set enrichment or promoter analysis tools

  • No annotation files exist for alt CDFs

Promoter analysis

Promoter analysis with Chipster

Promoter sequences = sequences upstream of the annotated transcription start site of RefSeq genes (from UCSC Golden Path)

Pattern discovery: Weeder

  • looks for common sequence motifs in a set of promoters

Pattern matching: ClusterBuster

  • looks for clusters of known transcription factor binding sites using the JASPAR matrices

Promoters from genes with similar expression patterns

Pattern discovery

Program to find common motifs

  • Tool comparison: Nature Biotech. (2005) 23:137 => Weeder

Weeder

Enumerates all oligos of given length, determines which appear in a significant fraction of seqs, ranks them according to statistical significance

Pavesi et al (2004) Nuc Acids Res. Jul (W199-203)

Parameters (default in brackets):

  • Species (human, mouse, rat, yeast) [human]
  • Background frequency files (oligo count of intergenic regions of a given organism)
  • Promoter size (short, medium, long) [short]
  • Analyze strands (single, both) [single]
  • Motif appears more than once per sequence (yes, no) [no]
  • Number of motifs to return (1-100) [10]
  • Percentage of sequences the motif should appear in (1-100) [50]
  • Transcription factor binding site size (small, medium) [small]
  • Small= 6 (1 mismatch allowed) and 8 (2 mismatches allowed)
  • Medium= 10 (3 mismatches allowed)
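
The enumeration idea can be sketched as below. This is not Weeder's algorithm: the sketch counts exact matches only, allows no mismatches, and does not rank motifs by statistical significance against background frequencies. The function name and promoter sequences are made up.

```python
def common_oligos(sequences, length, min_fraction):
    """Return oligos of the given length that occur (exact match) in at
    least min_fraction of the sequences, most widespread first."""
    counts = {}
    for seq in sequences:
        # A set, so that each oligo is counted once per sequence.
        oligos = {seq[i:i + length] for i in range(len(seq) - length + 1)}
        for oligo in oligos:
            counts[oligo] = counts.get(oligo, 0) + 1
    threshold = min_fraction * len(sequences)
    hits = [(oligo, n) for oligo, n in counts.items() if n >= threshold]
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Made-up promoter fragments; three of them share a TATA-like motif.
promoters = ["GGTATAAACC", "CCTATAAAGG", "AATATAAATT", "GCGCGCGCGC"]
hits = common_oligos(promoters, length=6, min_fraction=0.5)
```

The `min_fraction` argument plays the role of the "percentage of sequences the motif should appear in" parameter, here set to 50%.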

Pattern matching

  • Collection of known binding motifs for TFs (Genomatix, Transfac, JASPAR)
  • Program to scan the sequence for binding sites

TTTTTATA

ClusterBuster

Looks for clusters of transcription factor binding sites

Uses the JASPAR open access matrix database

  • http://jaspar.cgb.ki.se/cgi-bin/jaspar_db.pl

Frith et al (2003) Nuc Acids Res, 31(13):3666-8

Parameters (default in brackets):

  • Species (human, mouse, rat, yeast) [human]
  • Promoter size (short, medium, long) [short]
  • Cluster score threshold [5]
  • Motif score threshold [6]
  • Expected distance between motifs in a cluster [35]
  • Range for counting nucleotide frequencies [100]
  • Pseudocounts [0.375]

ClusterBuster output

Exercise VI

Search your list of differentially expressed genes for binding sites of known transcription factors

Extra material

Linear modeling in Chipster

Linear model

Y = a + bx1 + cx2 + dx1x2

  • Like a normal multiple regression
  • Intercept (a) is included by default
  • Can contain both main effects (b, c) and interaction effects (d)
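
For two factors with two groups each, the model above can be worked through by hand. The sketch assumes the factors are coded 0/1 (Chipster's phenodata uses group numbers such as 1 and 2, so this coding is an assumption made for readability), and the cell means are made up.

```python
def model_row(x1, x2):
    """One row of the model matrix: intercept, two main effects, interaction."""
    return [1, x1, x2, x1 * x2]


def fit_two_by_two(cell_means):
    """cell_means[(x1, x2)] is the mean expression for that combination.

    With four parameters and four cells the model reproduces the
    cell means exactly, so the coefficients follow directly."""
    a = cell_means[(0, 0)]
    b = cell_means[(1, 0)] - a
    c = cell_means[(0, 1)] - a
    d = cell_means[(1, 1)] - a - b - c
    return a, b, c, d


# Made-up mean log2 expression of one gene in the four groups.
cell_means = {(0, 0): 6.0, (1, 0): 7.0, (0, 1): 8.0, (1, 1): 8.0}
a, b, c, d = fit_two_by_two(cell_means)

# Predicted value for each group: Y = a + b*x1 + c*x2 + d*x1*x2
predicted = {cell: sum(coef * x for coef, x in zip((a, b, c, d), model_row(*cell)))
             for cell in cell_means}
```

A non-zero `d` means that the effect of one factor depends on the level of the other, which is what the interaction term captures.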

Linear modeling in Chipster can take into account at most three main effects, their interactions, one technical replication level, and one level of pairing

  • This is enough for all the experiments I’ve encountered in GEO so far.
  • Technical replication: one biological sample is hybridized on more than one array
  • Pairing: before-after type of setting. Measurements available just prior to treatment and after it from exactly the same cell culture flasks.

Setting up the model I

All columns (max. three) in the phenodata can be either tested as linear (is there a trend towards higher numbers?) or as a factor (are there differences between the groups?).

  • With 2 groups there’s no difference in these settings.

1 2 3 1 2 3 linear factor

Linear modeling tool

Columns 1…3

  • Main effects

Column 4

  • Technical repl.

Column 5

  • Pairing

One main effect – 3 groups

linear factor

Setting up the model

If you want to include more than one main effect, you need to add new columns to your phenodata.

Two main effects – both have two groups

No interactions

Two-way interactions, with significant genes returned for all effects (main effects and interactions)

Pairing or technical replication

All samples in the same pairing or replication groups are coded with the same number. Different groups are coded with a running number.

Result files

A model matrix and one result file are saved.

Experimental design

Some things to ponder

Bad experimental design is bad science!

  • Wasted money
  • More animal or human suffering
  • Unreliable results

The main aspects of experimental design are

  • Randomization and balancing (often neglected)
  • Replication (usually rather well handled)
  • Blocking (not even known of)
  • Factorial experiments (sometimes considered)

You also need to consider

  • Sample size
  • Controls (direct or indirect measurements)

Before running the experiment

Define the principal hypothesis to test. Everything cannot be tested!

  • ”I ran this experiment for comparing two treatments on Arabidopsis. Now coming to think of it, these plants were of different age. Can you also test for the effect of it?”

Which are the main sources of variability? They need to be taken into account in the experimental design!

  • Laboratory personnel (more than one person involved?)
  • Chips (from more than one batch?)
  • Biological samples (inter- or intraindividual variability?)
  • Hybridization conditions (is the method standardized?)
  • Day (often the greatest source of variation)
  • Intermingled with variation from chips, biological samples, etc. if not properly taken into account

Replication

Technical replication:

  • Take a sample per animal, and hybridize every sample to several chips.

Biological replication:

  • Take a sample per animal, and hybridize every sample to one chip.

Replication does not mean taking repeated measurements from the same experimental units. That typically generates a time series. A technical replicate analysed as a biological replicate is a pseudoreplicate. Pseudoreplication generates more problems than it solves.

Balancing

Balancing means that there should be an equal number of experimental units in all groups.

Balanced designs are statistically more powerful than unbalanced designs. Example:

  • In the study of breast cancer, 30 individuals were recruited from the cancer cohort, and 30 individuals as their healthy controls (balanced for the disease).
  • 60 Affymetrix chips are available for hybridizing these samples. Affymetrix station only takes 8 chips at a time, so 4 cancer patients and 4 healthy controls are randomly picked to be hybridized in every batch (balanced for day effect).
  • Two laboratory technicians are making the hybridizations. Both process 30 samples, half being cancer patients and half healthy controls (balanced for the technician).

Randomization

Randomization is a way to control for effects of factors not explicitly taken care of in the experimental design. In randomization experimental units are randomly allocated to treatment groups.

  • Sixty cell culture vials are randomly divided into control and treatment groups. They retain their places in the incubator regardless of the group (completely randomized trial).

Random does not mean haphazard. Randomization takes some effort. Use e.g., dice, playing cards, random number generator, random number tables, etc. for randomization. In the best case the randomization is blind. The experimenter must not be able to identify the samples before the whole experiment has been concluded.

Completely randomized design

Row #   A   B   C   D
1       1   2   2   2
2       2   1   2   1
3       2   1   1   2
4       1   1   2   1

Let’s divide 16 samples into two groups of equal size. I’ve created a random number table on the right. Reading the table from the top left to the bottom right, the cell culture vials are assigned to two groups. We might then arrange the vials on the tray in the same order and put the tray in the incubator.
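
The same allocation can be done with a random number generator instead of a printed table. A minimal sketch, in which the function name, vial labels and seed are made up:

```python
import random


def completely_randomized(units, n_groups, seed=None):
    """Shuffle the units and deal them into n_groups groups of equal size."""
    if len(units) % n_groups:
        raise ValueError("units cannot be divided evenly into groups")
    rng = random.Random(seed)
    shuffled = list(units)
    rng.shuffle(shuffled)
    size = len(units) // n_groups
    return [shuffled[i * size:(i + 1) * size] for i in range(n_groups)]


vials = [f"vial{i:02d}" for i in range(1, 17)]  # 16 cell culture vials
control, treatment = completely_randomized(vials, 2, seed=2008)
```

Fixing the seed makes the allocation reproducible, so the randomization can be documented together with the experiment.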

Blocking

Blocking is arranging experimental units into similar groups. Blocking is used for controlling for factors that cannot be manipulated, but are known. Example:

  • While studying a response to a drug treatment, both males and females were recruited for the study. Response might depend on sex, so individuals were first divided into two groups according to their sex, and then randomly assigned to treatment groups (randomized block design).
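
A randomized block design can be sketched by randomizing separately within each block. The function name, patient labels, treatment names and seed below are made up for illustration.

```python
import random


def randomized_block(units_by_block, treatments, seed=None):
    """Within each block, shuffle the units and assign the treatments in
    turn, so every treatment gets (nearly) equally many units per block."""
    rng = random.Random(seed)
    assignment = {}
    for block, units in units_by_block.items():
        shuffled = list(units)
        rng.shuffle(shuffled)
        for i, unit in enumerate(shuffled):
            assignment[unit] = treatments[i % len(treatments)]
    return assignment


patients = {
    "male": ["m1", "m2", "m3", "m4"],
    "female": ["f1", "f2", "f3", "f4"],
}
assignment = randomized_block(patients, ["drug", "placebo"], seed=1)
```

Because the shuffling happens inside each block, both sexes are split evenly between the treatments, which a completely randomized design does not guarantee.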

Factorial designs

In factorial design several factors are manipulated at the same time. The factors are better analyzed together than separately, because a factorial design allows one to assess possible interactions. Example:

  • Cells were treated with vitamin-C and hydrogen peroxide. Culturing cells with either chemical alone leads to missing the interaction, where vitamin-C prevents peroxide-induced cell death to some extent.

Main effects: vitamin-C and peroxide

Interaction: vitamin-C * peroxide

Sample size

We need to use a sufficient number of samples to reach reliable conclusions. Using too small or too big a sample size is a waste of resources. Finding out the correct sample size for DNA microarray experiments is tricky. Use of previous experiments for the same chip type and biological material is often needed. In epidemiological studies estimating the sample size is a must. It might be hard to get published otherwise. To estimate the sample size, we need an estimate of

  • Effect size
  • Variability
  • Desired false positive rate
  • Desired false negative rate
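
How these four quantities combine can be sketched with the textbook normal-approximation formula for comparing the means of two groups. This is an assumption for illustration, not a method given in the slides: it considers a single gene and ignores multiple testing, so real microarray experiments need more care. The false negative rate enters as power = 1 - false negative rate.

```python
import math
from statistics import NormalDist


def samples_per_group(effect_size, sd, alpha=0.05, power=0.8):
    """Samples needed per group for a two-sided comparison of two means.

    effect_size: difference in means to detect (e.g. in log2 units)
    sd: standard deviation within a group
    alpha: desired false positive rate
    power: 1 - desired false negative rate
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * ((z_alpha + z_beta) * sd / effect_size) ** 2
    return math.ceil(n)


n_large_effect = samples_per_group(effect_size=1.0, sd=1.0)
n_small_effect = samples_per_group(effect_size=0.5, sd=1.0)
```

Halving the effect size to be detected roughly quadruples the number of samples needed, which is why a realistic estimate of effect size and variability matters.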

Sample size – a comparison of two experiments

Sample size – a rule of thumb

In statistics, variability is intrinsically associated with statistical significance. The lower the variability of replicates, the higher the significance. Doubling the sample size halves the variance of the group mean, making the detection of differences easier.

Direct or indirect measurements?

Reference Sample Reference Sample Sample 2 Sample 2

An example of a better…good design

Comparing two groups of samples.

  • 20 samples in each group (40 in total).
  • You’re interested in comparing the two states (diseased, healthy).

  • Interindividual variability (due to sex) can be expected.
  • Using Affymetrix chips (all from the same batch).
  • You’re doing all the wetlab work.

Hybridize (randomly ordered):

  • 12122211
  • 22112112
  • 21211212
  • 22221111
  • 12211212

1=healthy, 2=diseased

1=male, 2=female