Introduction to microarrays (Thierry Sengstag, PhD, Bioinformatics Core Facility, Swiss Institute of Bioinformatics)



slide-1
SLIDE 1 Swiss Institute of Bioinformatics

EMBnet's introduction to bioinformatics

Introduction to microarrays

Thierry Sengstag, PhD Bioinformatics Core Facility

slide-2
SLIDE 2 Swiss Institute of Bioinformatics

Part I

Technology of microarrays

slide-3
SLIDE 3 Swiss Institute of Bioinformatics

Biology Fundamentals - Genes

slide-4
SLIDE 4 Swiss Institute of Bioinformatics

Biology Fundamentals - Expression

Transcriptome: Genes Proteome: Proteins

Microarrays

slide-5
SLIDE 5 Swiss Institute of Bioinformatics

Genomics Fundamentals - Complexity

Difficulties:

  • Contaminations
  • Alternative splicing
  • Alternative polyadenylation

mRNA purification

slide-6
SLIDE 6 Swiss Institute of Bioinformatics

RNA abundance in mammalian cells

[Figure: RNA classes in a cell. Labels: rRNA, tRNA, mRNA; 80%; 1%; 3 x 10^6 molecules/cell; 3 x 10^5 molecules/cell; 1-2 x 10^4 different genes. Abundance classes: 1-50, 50-500, 500+ molecules/cell.]

slide-7
SLIDE 7 Swiss Institute of Bioinformatics

What are DNA Microarrays?

A DNA microarray is a technology that allows scientists to simultaneously detect thousands of genes in a small sample and to analyze the expression of those genes.

Microarrays are ordered sets of DNA molecules of known sequence attached to a surface at a known position (spot).

In a microarray experiment one hybridizes mRNA molecules extracted from a sample to the spots of the microarray.

Main families of microarrays:

  • Spotted arrays, PCR products spotted on chip, ~500 nt
  • Oligo-arrays, e.g. Agilent, oligos ~60 nt length, in-situ
  • Affymetrix, short sequence oligos, 25 nt, in-situ

slide-8
SLIDE 8 Swiss Institute of Bioinformatics

1. Samples
2. Extracting mRNA
3. Labeling
4. Hybridizing
5. Scanning
6. Visualizing

slide-9
SLIDE 9 Swiss Institute of Bioinformatics

What are DNA Microarrays?

Various technological choices:

  • 10^4 to 10^6 features on a single array
  • Single- vs two-color approach
  • Hybridization protocols

Questions addressed:

  • What are the differences (in gene expression) between cell lines?
  • What is the difference between knock-out and wild-type mice?
  • What is the difference between a tumor and a healthy tissue?
  • Are there different tumor types?

Key concept: compare gene expression in two (or more) cell/tissue types. Gene expression is assessed by measuring the number of RNA transcripts in a tissue sample. (Primary goal of this course.)

slide-10
SLIDE 10 Swiss Institute of Bioinformatics

Two major phases of a microarray experiment

Phase 1: Preparation of the microarray environment

  • Which sequences do we want to interrogate on the arrays?
  • Other technically important questions (choice of scanner, chemistry, etc.)
  • Presently (2006): commercial platforms are standard for "usual" organisms; "exotic" organisms still require custom-made arrays

Phase 2: Use of the microarray

  • "Experimental design" (with a statistician)
  • Preparation of RNA samples
  • Hybridization, scanning, signal extraction
  • Statistical analysis

slide-11
SLIDE 11 Swiss Institute of Bioinformatics

Phase 1: Which sequences should we spot ?

Depends on the organism:

  • Human, mouse, rat, … have large databases of sequences that can be used to design probes by bioinformatics means
  • "Exotic" organisms require cloning of 100s to 1000s of mRNA transcripts, then spotting of DNA after PCR amplification (sequencing of interesting genes can be done after the microarray experiment)

Depends on platform:

  • Oligo arrays can probe any region of an mRNA transcript
  • Affymetrix arrays require the sequence to be in the 3'-UTR region
slide-12
SLIDE 12 Swiss Institute of Bioinformatics

Phase 2: Scientific Process

[Flowchart: Biological question (e.g. differentially expressed genes, sample class prediction, etc.); Experimental design; Microarray experiment; Pre-processing steps: Image analysis / Quality assessment (with a "failed" branch), Normalization; Data analysis: Testing, Significance, Clustering, Discrimination; Biological verification and interpretation]

slide-13
SLIDE 13 Swiss Institute of Bioinformatics

Spotted array preparation

[Diagram labels: “average” mouse mRNA; RT-PCR (conversion mRNA to cDNA, amplification); cDNA isolation; test sequence (probe) production, ~100 to ~2000 bp]

slide-14
SLIDE 14 Swiss Institute of Bioinformatics

Array Production: Spotting

slide-15
SLIDE 15 Swiss Institute of Bioinformatics

Spotting in action…

1. Some rounds of pin cleaning
2. Pick up PCR products from plate
3. Spot one feature on each subarray

Spotting arrays

slide-16
SLIDE 16 Swiss Institute of Bioinformatics

Oligo array preparation

Sequence databases; millions of experiments worldwide; probe (sequence) design

  • known genes
  • putative genes
  • alternative splicing
  • GC contents

Gene-specific sequences

~60 bp sequences

In-situ synthesis

slide-17
SLIDE 17 Swiss Institute of Bioinformatics

Spotted and oligo array usage

[Diagram: Cy5-labeled cDNA and Cy3-labeled cDNA → mix → hybridization → washing → scanning → relative mRNA levels]

slide-18
SLIDE 18 Swiss Institute of Bioinformatics

Affymetrix chip preparation

Sequence databases Millions of experiments worldwide Probe (sequence) design

  • known genes
  • putative genes
  • alternative splicing
  • GC contents

Bioinformatics thinking yields gene-specific sequences (3’-end)

25 nt sequences

In-situ synthesis

~100s of nt “consensus” sequences

slide-19
SLIDE 19 Swiss Institute of Bioinformatics

Affymetrix chip usage

[Diagram: hybridization → washing → scanning → relative mRNA levels]

slide-20
SLIDE 20 Swiss Institute of Bioinformatics

Affymetrix system

[Diagram: 25-mer probes (11 to 16) placed along the transcript, usually in the most 3' area, often the UTR, next to the poly-A tail (AAAA...)]

slide-21
SLIDE 21 Swiss Institute of Bioinformatics

Probe preparation & hybridization

  • Extract mRNA or total RNA
  • RT, add 5’ anchor
  • PCR with labelled nucleotide (Cy3, Cy5, DIG, …)
  • Overlay probe on the chip, put in the hybridization chamber, wash

slide-22
SLIDE 22 Swiss Institute of Bioinformatics

Scanner basics

  • Based on fluorescence
    – 1 or 2 lasers: Cy3, Cy5 (seldom more)
  • Most scanners are confocal
    – Target a very limited volume of space (signal only from focal plane)
    – Need to “scan” the surface
  • 16-bit A/D converters
    – Range of values: 0-65535
    – Log2 range: 0 – 16
  • Scan various supports
    – Glass slide (e.g. Agilent, PerkinElmer)
    – Affymetrix

slide-23
SLIDE 23 Swiss Institute of Bioinformatics

Confocal scanner

[Diagram: laser (excitation) → dye → photons → PMT (amplification) → electrons → A/D converter → signal, with filtering and time-space averaging]

slide-24
SLIDE 24 Swiss Institute of Bioinformatics

Images from Scanner

  • Resolution
    – standard 10 µm [currently, best ~5 µm]
    – 100 µm spot on chip = 10 pixels in diameter
  • Image format
    – Typically: TIFF (tagged image file format), 16 bit (65,536 levels of gray)
    – also other formats
    – 1 cm x 1 cm image at 16 bit = 2 MB
  • Separate image for each fluorescent sample
    – channel 1, channel 2, etc.
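
The figures on this slide can be checked with a little arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes an uncompressed image with square pixels and no file-format overhead.

```python
def image_size_bytes(width_cm, height_cm, pixel_um=10, bits_per_pixel=16):
    """Uncompressed size of one scanned channel, ignoring any header overhead."""
    um_per_cm = 10_000
    width_px = round(width_cm * um_per_cm / pixel_um)
    height_px = round(height_cm * um_per_cm / pixel_um)
    return width_px * height_px * bits_per_pixel // 8


def spot_diameter_pixels(spot_um, pixel_um=10):
    """Number of pixels across one spot at the given scan resolution."""
    return spot_um / pixel_um


print(image_size_bytes(1, 1))     # 1000 x 1000 pixels x 2 bytes
print(spot_diameter_pixels(100))  # 100 µm spot scanned at 10 µm
```

Halving the pixel size to 5 µm quadruples the number of pixels, and therefore the file size.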

slide-25
SLIDE 25 Swiss Institute of Bioinformatics

Images in analysis software

  • The two 16-bit images (Cy3, Cy5) are viewed as 8-bit images
  • Display fluorescence intensities for both wavelengths using a 24-bit RGB overlay image
  • RGB image:
    – Blue values (B) are set to 0
    – Red values (R) are used for Cy5 intensities
    – Green values (G) are used for Cy3 intensities
  • Qualitative representation of results
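
A minimal sketch of how such an overlay could be built. The slide does not say how the 16-bit values are reduced to 8 bits; the code assumes a plain linear rescaling (dropping the 8 low-order bits), and the function names are hypothetical. Real analysis software may apply a different, often nonlinear, display mapping.

```python
def to_8bit(value16):
    """Assumed linear reduction from 0..65535 to 0..255 (drop the low byte)."""
    return value16 >> 8


def overlay_pixel(cy5, cy3):
    """One RGB pixel: R from Cy5, G from Cy3, B fixed at 0."""
    return (to_8bit(cy5), to_8bit(cy3), 0)


def overlay_image(cy5_image, cy3_image):
    """Combine two co-registered 16-bit images (lists of rows) into RGB rows."""
    return [
        [overlay_pixel(r, g) for r, g in zip(row5, row3)]
        for row5, row3 in zip(cy5_image, cy3_image)
    ]


# A spot equally bright in both channels comes out yellow (red + green).
print(overlay_pixel(65535, 65535))
```
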
slide-26
SLIDE 26 Swiss Institute of Bioinformatics

Images : examples

Channels: Cy3 (control), Cy5 (treated)

Gene expression   Signal strength     Spot color
repressed         Control > Treated   green
induced           Control < Treated   red
unchanged         Control = Treated   yellow

slide-27
SLIDE 27 Swiss Institute of Bioinformatics

Image analysis (scanner variability)

ScanArray 4000 Agilent G2565AA

slide-28
SLIDE 28 Swiss Institute of Bioinformatics

Image processing

  • Align channels
  • Identify spot pixels
  • Identify background pixels
  • Compute representative value, e.g.
    – Mean foreground value
    – Median background value

slide-29
SLIDE 29 Swiss Institute of Bioinformatics

2-color Arrays Image Processing

GenePix

slide-30
SLIDE 30 Swiss Institute of Bioinformatics

2-color Arrays Image Processing

A difficult case…

slide-31
SLIDE 31 Swiss Institute of Bioinformatics

Quantification of Expression

For each spot on the slide, calculate

  Red intensity = Rfg - Rbg
  Green intensity = Gfg - Gbg

(fg = foreground, bg = background) and combine them in the log (base 2) ratio

  log2(Red/Green)

Often, fg = mean and bg = median of the relevant pixel intensities.
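
The calculation on this slide, written out as a small sketch. The function name and the choice to return `None` when a corrected intensity is not positive are assumptions made for the example; the slide does not say how such spots are handled.

```python
import math
import statistics


def spot_log_ratio(red_fg, red_bg, green_fg, green_bg):
    """log2(Red/Green) for one spot, with fg = mean and bg = median of pixels."""
    red = statistics.mean(red_fg) - statistics.median(red_bg)
    green = statistics.mean(green_fg) - statistics.median(green_bg)
    if red <= 0 or green <= 0:
        return None  # log ratio undefined after background subtraction
    return math.log2(red / green)


# Red: 1000 - 100 = 900, Green: 550 - 100 = 450, so the ratio is 2.
print(spot_log_ratio([900, 1100], [100, 100, 120], [500, 600], [90, 100, 110]))
```
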

slide-32
SLIDE 32 Swiss Institute of Bioinformatics

M and A values

  • M-value is the log2 of the ratio of expression values
    – Properties:
      • If the gene is expressed with the same intensity in the red and green conditions: M = 0
      • If the gene is more expressed in the red condition: M > 0
      • If the gene is more expressed in the green condition: M < 0
  • A-value is the average of the log2 expression values

M = log2( Red / Green ) = log2( Red ) – log2( Green )

A = 1/2 ( log2( Red ) + log2( Green ) ) = log2( sqrt( Red * Green ) )
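
As a quick illustration of the two definitions, assuming `red` and `green` are background-corrected, strictly positive intensities (the function name is hypothetical):

```python
import math


def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    log_red, log_green = math.log2(red), math.log2(green)
    m = log_red - log_green          # log2 ratio
    a = 0.5 * (log_red + log_green)  # average log2 intensity
    return m, a


# Red is 4 times green: M = 2 (two doublings); A = (10 + 8) / 2 = 9.
print(ma_values(1024, 256))
```
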

slide-33
SLIDE 33 Swiss Institute of Bioinformatics

Why we take logs

[Figure: the same data on a linear scale and on a log scale]

Better representation of genes with "medium" expression. Biologically, a unit change in log2 represents a 2-fold change.

slide-34
SLIDE 34 Swiss Institute of Bioinformatics

MvA plots

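  • Relationship between intensity and MvA plots

[Figure: scatter plot of log2(Red) against log2(Green), both axes running from 0 to 16, with the red and green saturation limits marked, next to the corresponding MvA plot (A from 0 to 16, lines at M = +1 and M = -1). The MvA plot is obtained by a 45° rotation (+ stretching by a factor 2 along the M axis).]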
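
The "45° rotation plus stretching by a factor 2" can be checked with a short derivation. The symbols x, y, u and v are introduced here only for this check; they are not part of the original slide.

```latex
\begin{align*}
x &= \log_2(\mathrm{Green}), \qquad y = \log_2(\mathrm{Red}) \\
u &= \frac{x + y}{\sqrt{2}}, \qquad v = \frac{y - x}{\sqrt{2}}
  && \text{(coordinates along axes rotated by } 45^\circ\text{)} \\
A &= \frac{x + y}{2} = \frac{u}{\sqrt{2}}, \qquad M = y - x = \sqrt{2}\, v \\
\frac{M / v}{A / u} &= \frac{\sqrt{2}}{1 / \sqrt{2}} = 2
\end{align*}
```

Relative to the A axis, the M axis is therefore stretched by a factor 2.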

slide-35
SLIDE 35 Swiss Institute of Bioinformatics

Hybridization of extra material with known concentrations (spikes)

A real-life MvA plot

slide-36
SLIDE 36 Swiss Institute of Bioinformatics

End of Part I

slide-37
SLIDE 37 Swiss Institute of Bioinformatics

Part II

Extraction of gene signal from microarrays

slide-38
SLIDE 38 Swiss Institute of Bioinformatics

Scientific Process

[Flowchart: Biological question (e.g. differentially expressed genes, sample class prediction, etc.); Experimental design; Microarray experiment; Pre-processing steps: Image analysis / Quality assessment (with a "failed" branch), Normalization; Data analysis: Testing, Significance, Clustering, Discrimination; Biological verification and interpretation]

slide-39
SLIDE 39 Swiss Institute of Bioinformatics

Steps in Image Processing

  • Addressing (or Gridding)
    – Assigning coordinates to each spot
  • Segmentation
    – Classification of pixels as either foreground (signal) or background
  • Information Extraction
    – Foreground fluorescence intensity pairs (R,G)
    – Background intensities
    – Quality measures

slide-40
SLIDE 40 Swiss Institute of Bioinformatics

Addressing

This is the process of assigning coordinates to each of the spots. Automating this part of the procedure permits high throughput analysis.

4 by 4 grids, 19 by 21 spots per grid

slide-41
SLIDE 41 Swiss Institute of Bioinformatics

Addressing

slide-42
SLIDE 42 Swiss Institute of Bioinformatics

Problems in automatic addressing

  • Misregistration of the red and green channels
  • Rotation of the array in the image

[Figure: two examples of arrays rotated in the image, each labeled "Rotation"]
slide-43
SLIDE 43 Swiss Institute of Bioinformatics

Problems in automatic addressing

  • Skew in the array
slide-44
SLIDE 44 Swiss Institute of Bioinformatics

Addressing

  • Basic structure of images known (determined by the arrayer)
  • Parameters to address spot positions
    – Separation between rows and columns of grids
    – Individual translation of grids
    – Separation between rows and columns of spots within each grid
    – Small individual translation of spots
    – Overall position of the array in the image
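
A minimal sketch of how nominal spot positions follow from these parameters. It only covers the regular part of the layout (overall position, grid separation, spot separation); the individual translations of grids and spots, rotation and skew are deliberately left out. The function name, the pixel values and the reading of "19 by 21" as rows by columns are assumptions for illustration.

```python
def spot_centres(origin, grid_shape, grid_pitch, spot_shape, spot_pitch):
    """Nominal (x, y) centre of every spot, keyed by (grid row, grid col, spot row, spot col).

    origin:     (x, y) of the first spot of the first grid (overall array position)
    grid_shape: (rows, cols) of grids; grid_pitch: (dx, dy) between grids
    spot_shape: (rows, cols) of spots per grid; spot_pitch: (dx, dy) between spots
    """
    x0, y0 = origin
    centres = {}
    for gr in range(grid_shape[0]):
        for gc in range(grid_shape[1]):
            for sr in range(spot_shape[0]):
                for sc in range(spot_shape[1]):
                    x = x0 + gc * grid_pitch[0] + sc * spot_pitch[0]
                    y = y0 + gr * grid_pitch[1] + sr * spot_pitch[1]
                    centres[(gr, gc, sr, sc)] = (x, y)
    return centres


# 4 by 4 grids of 19 by 21 spots; pitches in pixels are made-up values.
centres = spot_centres((100, 100), (4, 4), (4500, 4500), (19, 21), (200, 200))
print(len(centres))
```

In practice these nominal positions would only be a starting point that the software then refines grid by grid and spot by spot.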

slide-45
SLIDE 45 Swiss Institute of Bioinformatics

Steps in Image Processing

  • Addressing (or Gridding)
    – Assigning coordinates to each spot
  • Segmentation
    – Classification of pixels as either foreground (signal) or background
  • Information Extraction
    – Foreground fluorescence intensity pairs (R,G)
    – Background intensities
    – Quality measures

slide-46
SLIDE 46 Swiss Institute of Bioinformatics

Segmentation Methods

  • Fixed circles
  • Adaptive circles
  • Adaptive shape

    – Edge detection
    – Seeded Region Growing (R. Adams and L. Bischof, 1994): regions grow outwards from seed points preferentially according to the difference between a pixel’s value and the running mean of values in an adjoining region

  • Histogram methods
slide-47
SLIDE 47 Swiss Institute of Bioinformatics

Fixed circle segmentation

  • Fits a circle with a constant diameter to all spots in the image

  • Easy to implement
  • The spots should be of the same shape and size
slide-48
SLIDE 48 Swiss Institute of Bioinformatics

Adaptive circle segmentation

  • The circle diameter is estimated separately for each spot
  • Dapple finds spots by detecting edges of spots (second derivative)
  • Problematic if spots have oval shapes

slide-49
SLIDE 49 Swiss Institute of Bioinformatics

Limitation of circular segmentation

  • Small spot
  • Not circular

Result of Seeded Region Growing

slide-50
SLIDE 50 Swiss Institute of Bioinformatics

Adaptive shape segmentation

  • Specification of starting points or seeds
    – Bonus: already know geometry of array
  • Regions grow outwards from the seed points preferentially according to the difference between a pixel’s value and the running mean of values in an adjoining region
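
The growing rule described above can be sketched with a priority queue. This is a simplified illustration, not the implementation used by any particular package: it uses one seed per region, 4-connected neighbours, and computes each candidate's priority once, when it is queued. The function name and the toy image are made up.

```python
import heapq


def seeded_region_growing(image, seeds):
    """Label every pixel of a 2D list `image` from `seeds` = {label: (row, col)}."""
    rows, cols = len(image), len(image[0])
    labels = [[None] * cols for _ in range(rows)]
    total, count = {}, {}  # running sum and pixel count per region
    heap = []              # entries: (|pixel - region mean|, row, col, label)

    def push_neighbours(r, c, label):
        mean = total[label] / count[label]
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if 0 <= nr < rows and 0 <= nc < cols and labels[nr][nc] is None:
                heapq.heappush(heap, (abs(image[nr][nc] - mean), nr, nc, label))

    for label, (r, c) in seeds.items():
        labels[r][c] = label
        total[label], count[label] = image[r][c], 1
    for label, (r, c) in seeds.items():
        push_neighbours(r, c, label)

    while heap:
        _, r, c, label = heapq.heappop(heap)
        if labels[r][c] is not None:
            continue  # already claimed by a better-matching region
        labels[r][c] = label
        total[label] += image[r][c]
        count[label] += 1
        push_neighbours(r, c, label)
    return labels


image = [
    [10, 10, 10, 10, 10],
    [10, 90, 95, 10, 10],
    [10, 92, 99, 91, 10],
    [10, 10, 94, 10, 10],
    [10, 10, 10, 10, 10],
]
labels = seeded_region_growing(image, {"fg": (2, 2), "bg": (0, 0)})
print(sum(row.count("fg") for row in labels))  # size of the irregular spot
```

Because the regions follow the pixel values, the irregular (non-circular) spot in the toy image is recovered exactly, which a fixed or adaptive circle could not do.
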
slide-51
SLIDE 51 Swiss Institute of Bioinformatics

Histogram segmentation

  • Choose target mask larger than any spot
  • Fg and bg intensities determined from the histogram of pixel values for pixels within the masked area
  • Example: QuantArray
    – Background: mean between 5th and 20th percentile
    – Foreground: mean between 80th and 95th percentile
  • May not work well when a large target mask is set to compensate for variation in spot size
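
A small sketch of the percentile rule quoted for QuantArray. How the percentile boundaries are turned into array indices is an assumption here (simple truncation), and the function name is hypothetical; the actual software may interpolate differently.

```python
def percentile_band_mean(pixels, low_pct, high_pct):
    """Mean of the sorted pixel values lying between two percentiles."""
    ordered = sorted(pixels)
    lo = int(len(ordered) * low_pct / 100)
    hi = int(len(ordered) * high_pct / 100)
    band = ordered[lo:hi]  # assumes the mask holds enough pixels for a non-empty band
    return sum(band) / len(band)


# Toy mask: 100 pixels with values 0..99.
mask_pixels = list(range(100))
background = percentile_band_mean(mask_pixels, 5, 20)
foreground = percentile_band_mean(mask_pixels, 80, 95)
print(background, foreground)
```

The caveat on the slide is visible in this formulation: if the mask is much larger than the spot, fewer than 20% of the pixels may be true foreground, and the 80th to 95th percentile band then mixes in background pixels.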

slide-52
SLIDE 52 Swiss Institute of Bioinformatics

Steps in Image Processing

  • Addressing (or Gridding)
    – Assigning coordinates to each spot
  • Segmentation
    – Classification of pixels as either foreground (signal) or background
  • Information Extraction
    – Foreground fluorescence intensity pairs (R,G)
    – Background intensities
    – Quality measures

slide-53
SLIDE 53 Swiss Institute of Bioinformatics

Information Extraction

  • Spot intensities
    – Mean of pixel intensities
    – Median of pixel intensities
    – Pixel variation (e.g. IQR)
  • Background values
    – None
    – Local
    – Constant (global)
  • Quality information

Take the average

slide-54
SLIDE 54 Swiss Institute of Bioinformatics

Spot ‘foreground’ intensity

  • The total amount of hybridization for a spot is proportional to the total fluorescence generated by the spot
  • Spot intensity = sum of pixel intensities within the spot mask
  • Since later calculations are based on ratios between Cy5 and Cy3, we compute the average* pixel value over the spot mask

*Alternative: ratios of medians may be better than ratios of means if bright specks are present

slide-55
SLIDE 55 Swiss Institute of Bioinformatics

Background intensity

  • The measured fluorescence intensity includes a contribution of non-specific hybridization and other chemicals on the glass
  • Fluorescence from regions not occupied by DNA should be different from regions occupied by DNA

→ One solution is to use local negative controls (spotted DNA that should not hybridize)

slide-56
SLIDE 56 Swiss Institute of Bioinformatics

BG: None

  • Do not consider the background
    – Can be better than some forms of local background determination with good quality arrays

With a loose mathematical notation (σ denoting the noise attached to each measured value), the background-corrected log ratio

$$M = \log_2 \frac{R_{fg} + \sigma_{R,fg} - R_{bg} + \sigma_{R,bg}}{G_{fg} + \sigma_{G,fg} - G_{bg} + \sigma_{G,bg}} \approx \log_2 \frac{R_{fg} + (\sigma_{R,fg} + \sigma_{R,bg})}{G_{fg} + (\sigma_{G,fg} + \sigma_{G,bg})}$$

is worse than

$$M = \log_2 \frac{R_{fg} + \sigma_{R,fg}}{G_{fg} + \sigma_{G,fg}}$$

slide-57
SLIDE 57 Swiss Institute of Bioinformatics

BG: Local

  • Focus on small regions surrounding the spot mask
  • Median of pixel values in this region
  • Most software implements such an approach
  • By ignoring pixels immediately surrounding the spots, the bg estimate is less sensitive to the performance of the segmentation procedure
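
A minimal sketch of a local background estimate of this kind, assuming a circular spot mask and a ring-shaped background region separated from the spot by a gap. The function name, the ring geometry and the toy image are illustrative assumptions; actual packages use various region shapes.

```python
import statistics


def local_background(image, centre, spot_radius, gap, ring_width):
    """Median of the pixels in a ring around the spot, skipping a gap next to it."""
    centre_row, centre_col = centre
    inner = spot_radius + gap    # pixels closer than this are ignored
    outer = inner + ring_width
    ring = [
        value
        for r, row in enumerate(image)
        for c, value in enumerate(row)
        if inner < ((r - centre_row) ** 2 + (c - centre_col) ** 2) ** 0.5 <= outer
    ]
    return statistics.median(ring)


def toy_pixel(r, c):
    """Bright spot (200) with a dimmer halo (100) on a flat background (20)."""
    d = ((r - 4) ** 2 + (c - 4) ** 2) ** 0.5
    return 200 if d <= 1.5 else 100 if d <= 2 else 20


image = [[toy_pixel(r, c) for c in range(9)] for r in range(9)]
print(local_background(image, (4, 4), spot_radius=1, gap=1, ring_width=2))
```

The halo pixels fall inside the gap, so an imperfect spot mask does not contaminate the background estimate.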

slide-58
SLIDE 58 Swiss Institute of Bioinformatics

Background can matter

Without BG correction With BG correction

slide-59
SLIDE 59 Swiss Institute of Bioinformatics

Summary

  • Image analysis is a crucial preprocessing step
    – Association of a "geographic" location (and corresponding annotation) with signal intensities
    – Several non-trivial technical choices (scanner, image analysis software, etc.) can affect the quality of the signal
  • Bg correction is sometimes not desirable (low bg arrays)

slide-60
SLIDE 60 Swiss Institute of Bioinformatics

Scientific Process

[Flowchart: Biological question (e.g. differentially expressed genes, sample class prediction, etc.); Experimental design; Microarray experiment; Pre-processing steps: Image analysis / Quality assessment (with a "failed" branch), Normalization; Data analysis: Testing, Significance, Clustering, Discrimination; Biological verification and interpretation]

slide-61
SLIDE 61 Swiss Institute of Bioinformatics

Quality assessment overview

  • Visual inspection of images
  • Evaluation of MvA plots
  • Comparison of statistical summaries for the chips

slide-62
SLIDE 62 Swiss Institute of Bioinformatics

Red/Green overlay images

Co-registration and overlay offer a quick visualization, revealing information on color balance, uniformity of hybridization, spot uniformity, background, and artifacts such as dust or scratches

slide-63
SLIDE 63 Swiss Institute of Bioinformatics

Spatial plots: background from two slides

slide-64
SLIDE 64 Swiss Institute of Bioinformatics

Practical Problems 1

Comet Tails

  • Likely cause: insufficiently rapid immersion of the slides in the succinic anhydride blocking solution

slide-65
SLIDE 65 Swiss Institute of Bioinformatics

Practical Problems 2


slide-66
SLIDE 66 Swiss Institute of Bioinformatics

Practical Problems 3

High Background

  • 2 likely causes:
    – Insufficient blocking
    – Precipitation of the labeled probe

Weak Signals

slide-67
SLIDE 67 Swiss Institute of Bioinformatics

Practical Problems 4


slide-68
SLIDE 68 Swiss Institute of Bioinformatics

Practical Problems 5

slide-69
SLIDE 69 Swiss Institute of Bioinformatics

Artifacts in microarrays

  • We are interested in finding true biologically meaningful differences between sample types
  • Due to other sources of systematic variation, there are also usually artifactual differences
  • Sources of artifacts include:
    – print tips - differences in subarrays
    – plate effects
    – differences in rows within subarray
    – batch effects
    – hybridization artifacts

slide-70
SLIDE 70 Swiss Institute of Bioinformatics

Sample boxplot


slide-71
SLIDE 71 Swiss Institute of Bioinformatics

Boxplots of log2(R/G)

(Example data associated with the limmaGUI package.)

slide-72
SLIDE 72 Swiss Institute of Bioinformatics

Scientific Process

[Flowchart: Biological question (e.g. differentially expressed genes, sample class prediction, etc.); Experimental design; Microarray experiment; Pre-processing steps: Image analysis / Quality assessment (with a "failed" branch), Normalization; Data analysis: Testing, Estimation, Clustering, Discrimination; Biological verification and interpretation]

slide-73
SLIDE 73 Swiss Institute of Bioinformatics

Pin group (sub-array) effects

slide-74
SLIDE 74 Swiss Institute of Bioinformatics

Boxplots, highlighting pin group effects

Clear example of spatial bias

slide-75
SLIDE 75 Swiss Institute of Bioinformatics

Preprocessing: Normalization

  • Why?
    To correct for systematic differences between samples on the same slide, or between slides, which do not represent true biological variation between samples
  • How do we know it is necessary?
    By examining self-self hybridizations, where no true differential expression is occurring. There are dye biases which vary with spot intensity, location on the array, plate origin, pins, scanning parameters, etc.

slide-76
SLIDE 76 Swiss Institute of Bioinformatics

What is self-self hybridization?

  • In dual channel (2-color) microarrays, such as cDNA arrays, two samples are each labeled with a different fluorescent dye
  • In most studies, the samples are from different sources (e.g. cancer vs. normal)
  • However, it is also possible to co-hybridize two samples from the same source (but differently labeled)

slide-77
SLIDE 77 Swiss Institute of Bioinformatics

Dual channel co-hybridizations

[Diagram: a standard co-hybridization (control sample vs. treated sample) and a self-self co-hybridization (control sample vs. control sample)]

slide-78
SLIDE 78 Swiss Institute of Bioinformatics

Self-self hybridizations

[Figure panels: false color overlay; boxplots within pin-groups; scatter (MA-)plots]

slide-79
SLIDE 79 Swiss Institute of Bioinformatics

Normalization: global

  • Normalization based on a global adjustment:
    log2 R/G → log2 R/G - c = log2 R/(kG)
  • Common choices for k or c = log2 k are c = median or mean of log ratios for a particular gene set (e.g. all genes, or control, or ‘housekeeping’ genes)
  • Another possibility is total intensity normalization, where k = ∑Ri / ∑Gi
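
Both choices of the constant can be written in a few lines. This is a sketch under simple assumptions: `red` and `green` are lists of background-corrected, positive intensities for the chosen gene set, and the function names are hypothetical.

```python
import math
import statistics


def global_median_normalize(red, green):
    """Subtract c = median log ratio from every log2(R/G); returns (normalized M, c)."""
    m_values = [math.log2(r / g) for r, g in zip(red, green)]
    c = statistics.median(m_values)
    return [m - c for m in m_values], c


def total_intensity_k(red, green):
    """k = sum(R_i) / sum(G_i), the total intensity normalization constant."""
    return sum(red) / sum(green)


# Toy data in which the red channel is uniformly twice as bright as the green.
red, green = [200, 400, 800], [100, 200, 400]
normalized, c = global_median_normalize(red, green)
print(normalized, c, total_intensity_k(red, green))
```

Here c = 1 and k = 2 describe the same correction, since c = log2 k.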

slide-80
SLIDE 80 Swiss Institute of Bioinformatics

Effect of global normalization

slide-81
SLIDE 81 Swiss Institute of Bioinformatics

Normalization: intensity-dependent

  • Here, run a line through the middle of the MA plot, shifting the M value of the pair (A,M) by c = c(A), i.e.
    log2 R/G → log2 R/G - c(A) = log2 R/(k(A)G)
  • One estimate of c(A) is made using the LOWESS (or loess) function of Cleveland (1979): LOcally WEighted Scatterplot Smoothing
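
To make the idea of c(A) concrete, here is a deliberately simplified smoother: a tricube-weighted local straight-line fit, which is the core step of LOWESS but without Cleveland's robustness iterations. It is a sketch for illustration, not a substitute for an established implementation; the function names and the `span` default are assumptions.

```python
def tricube(u):
    """Tricube weight: largest at u = 0, zero for |u| >= 1."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0


def local_fit(a_values, m_values, a0, span):
    """Weighted straight-line fit to the points nearest a0, evaluated at a0."""
    n = len(a_values)
    k = min(n, max(3, round(span * n)))  # number of neighbours in the window
    bandwidth = sorted(abs(a - a0) for a in a_values)[k - 1] * 1.0001 + 1e-12
    sw = swx = swy = swxx = swxy = 0.0
    for a, m in zip(a_values, m_values):
        w = tricube((a - a0) / bandwidth)
        x = a - a0  # centre on a0 so that the intercept is the fitted value
        sw += w
        swx += w * x
        swy += w * m
        swxx += w * x * x
        swxy += w * x * m
    denom = sw * swxx - swx ** 2
    if abs(denom) < 1e-12:  # degenerate window: fall back to the weighted mean
        return swy / sw
    slope = (sw * swxy - swx * swy) / denom
    return (swy - slope * swx) / sw


def intensity_dependent_normalize(a_values, m_values, span=0.5):
    """Return M - c(A) for every spot, with c(A) estimated by local fits."""
    return [m - local_fit(a_values, m_values, a, span)
            for a, m in zip(a_values, m_values)]


# Toy MA data with a purely intensity-dependent bias M = 0.5 * A - 2.
a_values = [6 + 0.5 * i for i in range(20)]
m_values = [0.5 * a - 2 for a in a_values]
normalized = intensity_dependent_normalize(a_values, m_values)
print(max(abs(m) for m in normalized))
```

On this toy data the bias is removed completely; on real data the curve only follows the bulk of the spots, which is appropriate as long as most genes are not differentially expressed.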

slide-82
SLIDE 82 Swiss Institute of Bioinformatics

Effect of lowess normalization

slide-83
SLIDE 83 Swiss Institute of Bioinformatics

Comparison between arrays

  • Different arrays often do not show identical distributions of M values
    – Various technical reasons (e.g. labeling efficiency, amount of labelled RNA, scanner settings, etc.)
  • Need to normalize the signal between chips
    – Multiple possibilities, one often used: "scale normalization"
slide-84
SLIDE 84 Swiss Institute of Bioinformatics

Scale normalization: between slides

Idea: make the median spread of M values identical by multiplying them by a chip-specific constant

Boxplots of log ratios from 3 replicate self-self hybs:

  • Left panel: before normalization
  • Middle panel: after within print-tip group normalization
  • Right panel: after a further between-slide scale normalization

slide-85
SLIDE 85 Swiss Institute of Bioinformatics

Taking scale into account

Assume: all slides have the same spread in M

  • True log ratio is m_ij, where i represents different slides and j represents different spots
  • Observed is M_ij, where M_ij = a_i m_ij
  • Robust estimate of a_i (up to a common factor) is
    MAD_i = median_j { |M_ij - median_j(M_ij)| }
  • Could instead make the same assumption for print-tip groups (rather than slides)
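
A sketch of scale normalization built on the MAD. The slide leaves the common reference scale open; dividing each MAD by the geometric mean of all MADs is one common convention and is an assumption here, as are the function names. The code also assumes that no slide has a MAD of zero.

```python
import math
import statistics


def mad(values):
    """Median absolute deviation from the median."""
    centre = statistics.median(values)
    return statistics.median(abs(v - centre) for v in values)


def scale_normalize(slides):
    """Rescale each slide's M values so that all slides share the same MAD."""
    mads = [mad(m_values) for m_values in slides]
    reference = math.exp(sum(math.log(s) for s in mads) / len(mads))  # geometric mean
    factors = [s / reference for s in mads]                           # estimates of a_i
    return [[m / a for m in m_values] for m_values, a in zip(slides, factors)]


# Two toy slides; the second has twice the spread of the first.
slides = [[-1.0, 0.0, 1.0, 0.5, -0.5], [-2.0, 0.0, 2.0, 1.0, -1.0]]
scaled = scale_normalize(slides)
print(mad(scaled[0]), mad(scaled[1]))
```
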

slide-86
SLIDE 86 Swiss Institute of Bioinformatics

NCI 60 experiments

slide-87
SLIDE 87 Swiss Institute of Bioinformatics

Same normalization on another data set

slide-88
SLIDE 88 Swiss Institute of Bioinformatics

Normalization: Summary

  • Reduces systematic (not random) effects
  • Makes it possible to compare several arrays
  • Use logratios (MVA plots)
  • Lowess normalization (dye bias)
  • Pin-group location normalization
  • Pin-group scale normalization
  • Between slide scale normalization
  • Control Spots
  • Normalization introduces more variability
  • Outliers (bad spots) handled with replication
slide-89
SLIDE 89 Swiss Institute of Bioinformatics

cDNA gene expression data

Data on p genes for n samples:

Gene   sample1  sample2  sample3  sample4  sample5  …
1         0.46     0.30     0.80     1.51     0.90  ...
2        -0.10     0.49     0.24     0.06     0.46  ...
3         0.15     0.74     0.04     0.10     0.20  ...
4        -0.45    -1.03    -0.79    -0.56    -0.32  ...
5        -0.06     1.06     1.35     1.09    -1.09  ...

slide-90
SLIDE 90 Swiss Institute of Bioinformatics

Software for Microarray Analysis

  • Very large number of commercial and free software packages (GeneSpring, PathwayAssist, …)
  • There are several R packages for microarray analysis available as part of the open source Bioconductor project: http://www.bioconductor.org/
  • BioC software is often created by the author of the methodology

slide-91
SLIDE 91 Swiss Institute of Bioinformatics

Scientific Process

[Flowchart: Biological question (e.g. differentially expressed genes, sample class prediction, etc.); Experimental design; Microarray experiment; Pre-processing steps: Image analysis / Quality assessment (with a "failed" branch), Normalization; Data analysis: Testing, Significance, Clustering, Discrimination; Biological verification and interpretation]

slide-92
SLIDE 92 Swiss Institute of Bioinformatics

cDNA gene expression data

Data on p genes for n samples:

Gene   sample1  sample2  sample3  sample4  sample5  …
1         0.46     0.30     0.80     1.51     0.90  ...
2        -0.10     0.49     0.24     0.06     0.46  ...
3         0.15     0.74     0.04     0.10     0.20  ...
4        -0.45    -1.03    -0.79    -0.56    -0.32  ...
5        -0.06     1.06     1.35     1.09    -1.09  ...

slide-93
SLIDE 93 Swiss Institute of Bioinformatics

Replicated experiments

  • Have n replicates
  • For each gene, have n values of M = log2 fold change, one from each array
  • Summarize M1, ..., Mn for each gene by
    – M = average(M1, ..., Mn)
    – s = SD(M1, ..., Mn)
  • Rank genes in order of strength of evidence in favor of DE (differential expression)
  • How might we do this?
slide-94
SLIDE 94 Swiss Institute of Bioinformatics

Which genes are DE?

  • Difficult to judge significance
    – massive multiple testing problem
    – genes dependent
    – don’t know null distribution of M
  • Strategy
    – aim to rank genes
    – assume most genes are not DE (depending on type of experiment and array)
    – find genes separated from the majority

slide-95
SLIDE 95 Swiss Institute of Bioinformatics

Ranking criteria

  • Genes i = 1, ..., p
  • Mi = average log2 fold change for gene i
    – Problem: genes with large variability are likely to be selected, even if not DE
  • Fix that by taking variability into account: use ti = Mi / (si/√n)
    – Problem: genes with extremely small variances make very large t
    – When the number of replicates is small, the smallest si are likely to be underestimates
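
The two ranking criteria can be compared on a toy example. The sketch assumes each gene has at least two replicate M values; the function name, the gene names and the handling of a zero standard deviation are choices made for the example.

```python
import math
import statistics


def rank_by_t(replicates):
    """replicates: {gene: [M1, ..., Mn]} -> [(gene, mean M, t)], largest |t| first."""
    rows = []
    for gene, m_values in replicates.items():
        n = len(m_values)
        mean_m = statistics.mean(m_values)
        s = statistics.stdev(m_values)  # sample SD, needs n >= 2
        if s == 0:
            t = math.copysign(math.inf, mean_m) if mean_m else 0.0
        else:
            t = mean_m / (s / math.sqrt(n))
        rows.append((gene, mean_m, t))
    return sorted(rows, key=lambda row: abs(row[2]), reverse=True)


replicates = {
    "g1": [1.0, 1.1, 0.9],   # consistent 2-fold change
    "g2": [2.0, -1.0, 2.0],  # same average M, but highly variable
    "g3": [0.0, 0.1, -0.1],  # no change
}
ranking = rank_by_t(replicates)
print([gene for gene, _, _ in ranking])
```

g1 and g2 have essentially the same average M, so the average alone cannot separate them, while t ranks the consistent gene first. The second problem on the slide is also visible in the formula: a gene whose s happens to be tiny gets a very large t whatever its M.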

slide-96
SLIDE 96 Swiss Institute of Bioinformatics

Summary

  • Image analysis is important to extract information from the array
    – Background may or may not be taken into account
  • Normalization procedures are always needed
    – To remove systematic (technical) effects
    – To allow comparisons between chips
  • Identification of differentially expressed genes is difficult
    – No absolute estimation of significance is possible
    – Ranking of genes by significance is possible

slide-97
SLIDE 97 Swiss Institute of Bioinformatics

End of part II

slide-98
SLIDE 98 Swiss Institute of Bioinformatics

Part III

Higher-level analysis

slide-99
SLIDE 99 Swiss Institute of Bioinformatics

Finding biological information

Once the matrix of gene expression vs. samples is available, statistical tools can be used to:

  • Find similarity (or difference) of expression pattern in differentially expressed genes
  • Find differentially expressed functional groups of genes (pathway analysis, gene ontology)
  • Find classes in the set of samples (unsupervised analysis)
  • Use differentially expressed genes as a means to classify samples in known categories (supervised analysis)
  • Find genes significantly related to survival in a pool of patients
slide-100
SLIDE 100 Swiss Institute of Bioinformatics

Unsupervised analysis: Cluster analysis

  • data matrix (n,p)
  • distance matrix (n,n), similarity matrix (n,n)
  • cluster formation:
    – mutually exclusive clusters
    – hierarchical clusters
  • comparison of clusters, means and variances

[Figure: dendrogram]

slide-101
SLIDE 101 Swiss Institute of Bioinformatics

Hierarchical Clustering (real case)

Sorlie et al. Proc Natl Acad Sci U S A 2001 Sep 11;98(19):10869-74

slide-102
SLIDE 102 Swiss Institute of Bioinformatics
slide-103
SLIDE 103 Swiss Institute of Bioinformatics

Unsupervised analysis: PCA

  • Principal Components Analysis (PCA)
  • Columns (resp. rows) of expression matrix viewed as points in multidimensional space
  • Find “dominant” directions in space and hope these directions can be associated with known parameters
  • Genes (resp. samples) with largest projection on those vectors explain the parameter

[Figure: scatter plot with axes X1, X2 and principal directions PC1, PC2]
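
To illustrate what a "dominant direction" is, the sketch below finds the first principal component of a small point cloud by power iteration on the covariance matrix. This is just one simple way to obtain PC1 and is an illustrative choice, not necessarily what analysis packages do; the function name and the data are made up, and the starting vector is assumed not to be orthogonal to PC1.

```python
import math


def first_principal_component(points, iterations=200):
    """Unit vector along the direction of largest variance of `points`."""
    n, d = len(points), len(points[0])
    means = [sum(p[j] for p in points) / n for j in range(d)]
    centred = [[p[j] - means[j] for j in range(d)] for p in points]
    cov = [[sum(row[a] * row[b] for row in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance at all
        v = [x / norm for x in w]
    return v


# Points scattered around the diagonal X2 = X1.
points = [(0.0, 0.0), (1.0, 1.1), (2.0, 1.9), (3.0, 3.0), (4.0, 4.1)]
pc1 = first_principal_component(points)
print(pc1)
```

PC1 points along the diagonal: the two variables mostly vary together, and projecting each point on PC1 summarizes both in a single coordinate.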

slide-104
SLIDE 104 Swiss Institute of Bioinformatics

Supervised analysis example: KNN

  • Based on a measure of distance between observations (e.g. Euclidean distance or one minus correlation)
  • k-nearest neighbor rule (Fix and Hodges (1951)) classifies an observation X as follows:
    – find the k observations in the learning set closest to X
    – predict the class of X by majority vote, i.e., choose the class that is most common among those k observations.
  • The number of neighbors k can be chosen by cross-validation

[Figure: k-Nearest Neighbor (knn): data matrix (gene x sample) and a scatter plot with axes Gene 1 and Gene 2, in which an unclassified observation is marked "?"]
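
The rule translates almost directly into code. The sketch uses Euclidean distance and a fixed k; the function name, the class labels and the two-gene toy data are invented for the example, and ties in the vote are resolved in favor of the class encountered first among the nearest neighbors.

```python
import math
from collections import Counter


def knn_predict(learning_set, x, k=3):
    """learning_set: list of (expression vector, class label); x: vector to classify."""
    nearest = sorted(learning_set, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy learning set: expression of two genes in six samples of known class.
learning_set = [
    ((1.0, 1.0), "tumor"), ((1.2, 0.8), "tumor"), ((0.9, 1.1), "tumor"),
    ((-1.0, -1.0), "normal"), ((-0.8, -1.2), "normal"), ((-1.1, -0.9), "normal"),
]
print(knn_predict(learning_set, (0.8, 0.9)))
```

With real expression data the vectors have one coordinate per selected gene, and k would be chosen by cross-validation as noted on the slide.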

slide-105
SLIDE 105 Swiss Institute of Bioinformatics

High-level analysis (summary)

  • Needed to extract biological information after low-level processing has been conducted
  • Can be used to:
    – Discover new interactions between genes (unsupervised analysis)
    – Validate biological hypotheses (supervised analysis)
  • Contact your local statistician for more on this topic!

slide-106
SLIDE 106 Swiss Institute of Bioinformatics

End of Part III