EMBnet's introduction to bioinformatics
Introduction to microarrays
Thierry Sengstag, PhD Bioinformatics Core Facility
Swiss Institute of Bioinformatics
Part I: Technology of microarrays
Biology Fundamentals - Genes
Biology Fundamentals - Expression
Transcriptome: Genes Proteome: Proteins
Microarrays
Genomics Fundamentals - Complexity
mRNA purification
Difficulties:
– Contaminations
– Alternative splicing
– Alternative polyadenylation
RNA abundance in mammalian cells
[Figure: RNA classes (rRNA, tRNA, mRNA) with the labels 80% and 1%; mRNA abundance classes of 1-50, 50-500 and 500+ molecules/cell; 3 x 10^6 molecules/cell; 3 x 10^5 molecules/cell; 1-2 x 10^4 different genes]
DNA microarrays are a technology that allows scientists to simultaneously detect thousands of genes in a small sample and to analyze the expression of those genes.
Microarrays are ordered sets of DNA molecules of known sequence, attached to a surface at a known position (spot).
In a microarray experiment, mRNA molecules extracted from a sample are hybridized to the spots of the microarray.
Main families of microarrays:
What are DNA Microarrays ?
1. Samples
2. Extracting mRNA
3. Labeling
4. Hybridizing
5. Scanning
6. Visualizing
Various technological choices:
Questions addressed:
Key concept: compare gene expression in two (or more) cell/tissue types. Gene expression is assessed by measuring the number of RNA transcripts in a tissue sample. (Primary goal of this course.)
What are DNA Microarrays ?
Phase 1: Preparation of the microarray environment
chemistry, etc…)
"usual" organisms; "exotic" organisms still require custom-made arrays
Phase 2: Use of the microarray
Two major phases of a microarray experiment
Phase 1: Which sequences should we spot ?
Depends on the organism:
can be used to design probes by bioinformatics means
transcripts, then spotting of DNA after PCR amplification (sequencing of interesting genes can be done after microarray experiment)
Depends on platform:
Scientific Process
[Flow diagram: Biological question (e.g. differentially expressed genes, sample class prediction, etc.); Experimental design; Microarray experiment; Pre-processing steps: Image analysis / Quality assessment (with a "(failed)" branch), Normalization; Data analysis: Testing, Significance, Clustering, Discrimination; Biological verification and interpretation]
Phase 2:
Spotted array preparation
[Diagram labels: “Average” mouse mRNA; cDNA isolation; RT-PCR (conversion of mRNA to cDNA, amplification); test sequence (probe) production, ~100 to ~2000 bp]
Array Production: Spotting
Spotting in action…
Spotting arrays
1. Some rounds of pin cleaning
2. Pick up PCR products from plate
3. Spot one feature on each subarray
Oligo array preparation
Sequence databases Millions of experiments worldwide Probe (sequence) design
Gene-specific sequences
~60 bp sequences
In-situ synthesis
Spotted and oligo array usage
[Diagram: Cy5-labeled cDNA and Cy3-labeled cDNA are mixed, hybridized to the array, washed and scanned to give relative mRNA levels]
Affymetrix chip preparation
Sequence databases Millions of experiments worldwide Probe (sequence) design
Bioinformatics thinking yields gene-specific sequences (3’-end)
25 nt sequences
In-situ synthesis
~100s of nt “consensus” sequences
Affymetrix chip usage
[Diagram: hybridization, washing and scanning give relative mRNA levels]
Affymetrix system
[Diagram: a transcript ending in its poly-A tail (AAAA...), covered by a set of 25mer probes (11 to 16), usually in the most 3' area, often the UTR]
Probe preparation & hybridization
chamber, wash
Scanner basics
– 1 or 2 lasers: cy3 cy5 (seldom more)
– Target a very limited volume (signal only from focal plane)
– Need to “scan” the surface
– Range of values: 0-65535
– Log2 range: 0-16
– Glass slide (e.g. Agilent, PerkinElmer)
– Affymetrix
Confocal scanner
[Diagram: Laser (excitation), Dye, Photons, PMT (amplification), Electrons, A/D converter, Signal; with filtering and time-space averaging]
Images from Scanner
– standard 10 µm [currently, best ~5 µm]
– 100 µm spot on chip = 10 pixels in diameter
– Typically: TIFF (tagged image file format), 16 bit (65,536 levels of gray)
– also other formats
– 1 cm x 1 cm image at 16 bit = 2 MB
– channel 1, channel 2, etc.
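The image size quoted above follows directly from the pixel size. A small sketch of the arithmetic, assuming an uncompressed single-channel image at the standard 10 µm resolution (the function name is illustrative only):

```python
def image_size_bytes(width_cm, height_cm, pixel_um=10, bits_per_pixel=16):
    """Uncompressed size of one scanned channel, in bytes."""
    um_per_cm = 10_000
    pixels_x = round(width_cm * um_per_cm / pixel_um)
    pixels_y = round(height_cm * um_per_cm / pixel_um)
    return pixels_x * pixels_y * bits_per_pixel // 8

# 1 cm x 1 cm at 10 µm per pixel: 1000 x 1000 pixels, 2 bytes each
size = image_size_bytes(1, 1)
# A 100 µm spot scanned at 10 µm per pixel spans 10 pixels
spot_diameter_px = 100 // 10
print(size, spot_diameter_px)
```

Halving the pixel size to 5 µm quadruples the number of pixels, and therefore the file size per channel.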
Images in analysis software
images
using a 24-bit RGB overlay image
– Blue values (B) are set to 0
– Red values (R) are used for Cy5 intensities
– Green values (G) are used for Cy3 intensities
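The overlay rule can be sketched in a few lines. This is a minimal illustration, assuming the two channels are co-registered 16-bit images held as nested lists and that the 16-bit values are reduced to 8 bits by a plain linear scaling (analysis software may apply a different transformation); the function names are hypothetical.

```python
def overlay_pixel(cy5, cy3):
    """Map one pair of 16-bit intensities to a 24-bit (R, G, B) pixel."""
    red = cy5 >> 8        # keep the 8 most significant bits of Cy5
    green = cy3 >> 8      # keep the 8 most significant bits of Cy3
    return (red, green, 0)  # blue is always set to 0

def overlay(cy5_image, cy3_image):
    """Combine two co-registered channel images into one RGB image."""
    return [[overlay_pixel(r, g) for r, g in zip(row5, row3)]
            for row5, row3 in zip(cy5_image, cy3_image)]

cy5 = [[65535, 0], [40000, 300]]
cy3 = [[0, 65535], [40000, 300]]
rgb = overlay(cy5, cy3)
print(rgb)
```

A pixel with equal Cy5 and Cy3 intensities has equal red and green components, which is why unchanged genes appear yellow in the overlay.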
Images : examples
Gene expression   Signal strength (Cy3 vs. Cy5)   Spot color
repressed         Control > Treated               green
induced           Control < Treated               red
unchanged         Control = Treated               yellow
Image analysis (scanner variability)
ScanArray 4000 Agilent G2565AA
Image processing
– Mean foreground value
– Median background value
2-color Arrays Image Processing
GenePix
2-color Arrays Image Processing
A difficult case…
Quantification of Expression
For each spot on the slide, calculate
Red intensity = Rfg - Rbg
Green intensity = Gfg - Gbg
(fg = foreground, bg = background) and combine them in the log (base 2) ratio
log2( Red / Green )
Often, fg = mean and bg = median of the relevant pixel intensities.
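As a minimal sketch of this calculation for one spot, assuming the foreground and background pixel intensities of each channel have already been collected into lists (the function name and the handling of non-positive intensities are illustrative choices, not part of the slides):

```python
import math
from statistics import mean, median

def spot_log_ratio(red_fg, red_bg, green_fg, green_bg):
    """log2(Red/Green) with fg = mean and bg = median of the pixel values."""
    red = mean(red_fg) - median(red_bg)
    green = mean(green_fg) - median(green_bg)
    if red <= 0 or green <= 0:
        # The log ratio is undefined once background exceeds foreground
        return None
    return math.log2(red / green)

m = spot_log_ratio(red_fg=[900, 1100, 1000], red_bg=[180, 200, 260],
                   green_fg=[500, 520, 480], green_bg=[90, 100, 150])
print(m)
```

Here Red = 1000 - 200 = 800 and Green = 500 - 100 = 400, so the spot shows a two-fold excess of red signal.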
M and A values
– Properties:
and green conditions: M=0
M = log2( Red / Green )
= log2( Red ) – log2( Green )
A = 1/2 ( log2( Red ) + log2( Green ) )
= log2( sqrt( Red * Green ) )
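The two definitions translate directly into code. A short sketch, assuming Red and Green are positive, background-corrected intensities (function names are illustrative):

```python
import math

def m_value(red, green):
    """M = log2(Red) - log2(Green): the log ratio."""
    return math.log2(red) - math.log2(green)

def a_value(red, green):
    """A = 1/2 (log2(Red) + log2(Green)): the average log intensity."""
    return 0.5 * (math.log2(red) + math.log2(green))

m = m_value(4096, 1024)   # four-fold more red than green
a = a_value(4096, 1024)
print(m, a)
```

Swapping the two channels changes the sign of M but leaves A unchanged, and equal intensities give M = 0.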
Why we take logs
Linear scale Log scale
Better representation of genes with "medium" expression: Biologically, a unit change in log2 represents a 2-fold change.
MvA plots
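[Figure: scatter plot of log2(Red) against log2(Green), both axes running up to 16, and the corresponding M versus A plot obtained by a 45° rotation (+ stretching by a factor 2 along the M axis); the red and green saturation limits and the +1 level are marked]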
Hybridization of extra material with known concentrations (spikes)
A real-life MvA plot
Steps in Image Processing
1. Addressing
– Assigning coordinates to each spot
2. Segmentation
– Classification of pixels as either foreground (signal) or background
3. Information extraction
– Foreground fluorescence intensity pairs (R,G)
– Background intensities
– Quality measures
Addressing
This is the process of assigning coordinates to each of the spots. Automating this part of the procedure permits high throughput analysis.
4 by 4 grids, 19 by 21 spots per grid
Addressing
Problems in automatic addressing
Rotation
Problems in automatic addressing
– Separation between rows and columns of grids
– Individual translation of grids
– Separation between rows and columns of spots within each grid
– Small individual translation of spots
– Overall position of the array in the image
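The parameters listed above can be pictured as the arguments of a function that generates the nominal spot layout, which automatic addressing then has to fit to the image. The sketch below is only an illustration: the pitch, gap and origin values are hypothetical, individual translations and rotation are ignored, and "19 by 21" is read as 19 rows by 21 columns.

```python
def nominal_spot_centres(grid_rows=4, grid_cols=4, spot_rows=19, spot_cols=21,
                         spot_pitch=20.0, grid_gap=30.0, origin=(50.0, 50.0)):
    """Return {(grid_row, grid_col, spot_row, spot_col): (x, y)} in pixels."""
    # Distance between the first spots of two neighbouring grids
    grid_width = (spot_cols - 1) * spot_pitch + grid_gap
    grid_height = (spot_rows - 1) * spot_pitch + grid_gap
    centres = {}
    for gr in range(grid_rows):
        for gc in range(grid_cols):
            for sr in range(spot_rows):
                for sc in range(spot_cols):
                    x = origin[0] + gc * grid_width + sc * spot_pitch
                    y = origin[1] + gr * grid_height + sr * spot_pitch
                    centres[(gr, gc, sr, sc)] = (x, y)
    return centres

centres = nominal_spot_centres()
print(len(centres))
```

With 4 by 4 grids of 19 by 21 spots the layout contains 16 x 399 = 6384 spot positions.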
Addressing
(determined by the arrayer)
Segmentation Methods
– Edge detection
– Seeded Region Growing (R. Adams and L. Bischof, 1994): regions grow outwards from seed points preferentially according to the difference between a pixel’s value and the running mean of values in an adjoining region
Fixed circle segmentation
the image
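A minimal sketch of fixed circle segmentation, assuming the spot centre is already known from addressing and that one radius is used for every spot; pixels inside the circle are treated as foreground and all other pixels of the spot's image patch as background. The function names and the toy patch are illustrative only.

```python
def circle_mask(centre, radius, width, height):
    """Set of (x, y) pixel coordinates lying inside the circle."""
    cx, cy = centre
    return {(x, y) for y in range(height) for x in range(width)
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2}

def split_pixels(image, centre, radius):
    """Return (foreground, background) intensity lists for one spot patch."""
    height, width = len(image), len(image[0])
    mask = circle_mask(centre, radius, width, height)
    fg = [image[y][x] for (x, y) in sorted(mask)]
    bg = [image[y][x] for y in range(height) for x in range(width)
          if (x, y) not in mask]
    return fg, bg

# Toy 5 x 5 patch: a small bright spot on a dim background
patch = [[100] * 5 for _ in range(5)]
for x, y in [(2, 2), (1, 2), (3, 2), (2, 1), (2, 3)]:
    patch[y][x] = 1000
fg, bg = split_pixels(patch, centre=(2, 2), radius=1)
print(len(fg), len(bg))
```

If the real spot is smaller than the circle or not circular, background pixels end up in the foreground list, which is the limitation the adaptive methods try to address.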
Adaptive circle segmentation
estimated separately for each spot
Dapple finds spots by detecting edges of spots (second derivative)
exhibits oval shapes
Limitation of circular segmentation
– Small spot
– Not circular
Result of Seed Region Growing
Adaptive shape segmentation
– Bonus: already know geometry of array
preferentially according to the difference between a pixel’s value and the running mean
Histogram segmentation
histogram of pixel values for pixels within the masked area
– Background: mean between 5th and 20th percentile
– Foreground: mean between 80th and 95th percentile
target mask is set to compensate for variation in spot size
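The percentile rule can be sketched as follows, assuming the pixel values inside the target mask are available as a list. How a percentile is converted to a list index is a simplifying assumption here, and the function names are illustrative.

```python
from statistics import mean

def percentile_band_mean(sorted_values, low_pct, high_pct):
    """Mean of the values lying between two percentiles of a sorted list."""
    n = len(sorted_values)
    lo = int(n * low_pct / 100)
    hi = max(int(n * high_pct / 100), lo + 1)  # keep at least one value
    return mean(sorted_values[lo:hi])

def histogram_segmentation(mask_pixels):
    """Return (foreground, background) estimates for one target mask."""
    values = sorted(mask_pixels)
    background = percentile_band_mean(values, 5, 20)
    foreground = percentile_band_mean(values, 80, 95)
    return foreground, background

# Toy mask with 100 pixels taking the values 0..99
fg, bg = histogram_segmentation(list(range(100)))
print(fg, bg)
```

Because the extreme 5% at both ends are discarded, a few saturated or dead pixels do not affect either estimate.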
Information Extraction
– mean of pixel intensities – median of pixel intensities – pixel variation (e.g. IQR)
– None – Local – Constant (global)
Take the average
Spot ‘foreground’ intensity
proportional to the total fluorescence generated by the spot
the spot mask
between Cy5 and Cy3, we compute the average* pixel value over the spot mask
*alternative: ratios of medians may be better than means if bright specks are present
Background intensity
a contribution of non-specific hybridization and other chemicals on the glass
DNA should be different from regions
→ one solution is to use local negative controls (spotted DNA that should not hybridize)
BG: None
– Can be better than some forms of local background determination with good quality arrays
With a loose mathematical notation:
$$M = \log_2 \frac{R_{fg} + \sigma_{R,fg} - R_{bg} + \sigma_{R,bg}}{G_{fg} + \sigma_{G,fg} - G_{bg} + \sigma_{G,bg}} \approx \log_2 \frac{R_{fg} + (\sigma_{R,fg} + \sigma_{R,bg})}{G_{fg} + (\sigma_{G,fg} + \sigma_{G,bg})}$$
worse than
$$M = \log_2 \frac{R_{fg} + \sigma_{R,fg}}{G_{fg} + \sigma_{G,fg}}$$
BG: Local
spots, bg estimate is less sensitive to the performance of the segmentation procedure
Background can matter
Without BG correction With BG correction
Summary
– Association of a "geographic" location (and corresponding annotation) with signal intensities
– Several non-trivial technical choices (scanner, image analysis software, etc…) can affect the quality of the signal
(low bg arrays)
Quality assessment overview
– Visual inspection of images
– Evaluation of MvA plots
– Compare statistical summaries for the chips
Co-registration and overlay offers a quick visualization, revealing information on color balance, uniformity of hybridization, spot uniformity, background, and artifacts such as dust or scratches
Red/Green overlay images
Spatial plots: background from two slides
Practical Problems 1
Comet Tails
– Likely cause: insufficiently rapid immersion of the slides in the succinic anhydride blocking solution
Practical Problems 2
Practical Problems 3
High Background
– Insufficient blocking
– Precipitation of the labeled probe
Weak Signals
Practical Problems 4
Practical Problems 5
Artifacts in microarrays
meaningful differences between sample types
there are also usually artifactual differences
– print tips: differences in subarrays
– plate effects: differences in rows within subarray
– batch effects
– hybridization artifacts
Sample boxplot
Boxplots of log2(R/G)
(Example data associated to limmaGUI package.)
Pin group (sub-array) effects
Boxplots, highlighting pin group effects
Clear example of spatial bias
Preprocessing: Normalization
To correct for systematic differences between samples that do not represent true biological variation between samples
By examining self-self hybridizations, where no true differential expression is occurring. There are dye biases which vary with spot intensity, location on the array, plate origin, pins, scanning parameters, etc.
What is self-self hybridization?
cDNA arrays, two samples are each labeled with a different fluorescent dye
sources (e.g. cancer vs. normal)
samples from the same source (but differently labeled)
Dual channel co-hybridizations
[Diagram: standard co-hybridization of control sample and treated sample; self-self co-hybridization of control sample and control sample]
[Figure panels: false color overlay; boxplots within pin-groups; scatter (MA-)plots]
Self-self hybridizations
Normalization: global
log2 R/G → log2 R/G - c = log2 R/(kG)
median or mean of log ratios for a particular gene set (e.g. all genes, or control, or ‘housekeeping’ genes)
normalization, where k = ∑Ri/ ∑Gi
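Global normalization amounts to subtracting one constant from every log ratio. A minimal sketch, assuming background-corrected intensities for the chosen gene set are given as two lists and that c is taken as the median or mean of the log ratios (function names are illustrative):

```python
import math
from statistics import median

def global_normalize(red, green, use_median=True):
    """Shift all log ratios by a single constant c; return (M_norm, c)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    c = median(m) if use_median else sum(m) / len(m)
    return [mi - c for mi in m], c

def total_intensity_k(red, green):
    """k = sum(R_i) / sum(G_i), the total intensity normalization factor."""
    return sum(red) / sum(green)

red = [200, 400, 800]
green = [100, 100, 100]
m_norm, c = global_normalize(red, green)
k = 2 ** c   # subtracting c is the same as rescaling green by k
print(m_norm, c, k)
```

After the shift the median log ratio of the gene set is zero; the intensity dependence of the bias, if any, is left untouched.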
Effect of global normalization
Normalization: intensity-dependent
plot, shifting the M value of the pair (A,M) by c=c(A), i.e. log2 R/G → log2 R/G - c (A) = log2 R/(k(A)G)
(or loess) function of Cleveland (1979): LOcally WEighted Scatterplot Smoothing
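To make the idea of an intensity-dependent shift c(A) concrete, here is a simplified locally weighted fit written with the standard library only. It is a sketch under stated assumptions, not Cleveland's full procedure: it uses tricube weights and local straight lines but no robustness iterations, and the window fraction of 0.4 is an arbitrary illustrative value.

```python
def loess_fit(a, m, frac=0.4):
    """Locally weighted linear fit of m on a, evaluated at every a[i]."""
    n = len(a)
    k = max(2, int(round(frac * n)))   # number of points in each window
    fitted = []
    for x0 in a:
        # indices of the k points closest to x0 along the A axis
        nearest = sorted(range(n), key=lambda i: abs(a[i] - x0))[:k]
        h = max(abs(a[i] - x0) for i in nearest) or 1.0
        w = [(1 - (abs(a[i] - x0) / h) ** 3) ** 3 for i in nearest]
        xs = [a[i] for i in nearest]
        ys = [m[i] for i in nearest]
        # weighted least-squares straight line through the window
        sw = sum(w)
        xbar = sum(wi * xi for wi, xi in zip(w, xs)) / sw
        ybar = sum(wi * yi for wi, yi in zip(w, ys)) / sw
        sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, xs))
        sxy = sum(wi * (xi - xbar) * (yi - ybar)
                  for wi, xi, yi in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(ybar + slope * (x0 - xbar))
    return fitted

def loess_normalize(a, m, frac=0.4):
    """Replace each M by M - c(A), with c(A) the local fit."""
    return [mi - ci for mi, ci in zip(m, loess_fit(a, m, frac))]

# Toy data: a purely intensity-dependent dye bias, no real differences
a = [6 + 0.5 * i for i in range(20)]
m = [0.5 * x - 3 for x in a]
m_norm = loess_normalize(a, m)
print(max(abs(v) for v in m_norm))
```

In this toy case the bias is removed completely; with real data the fit follows the bulk of the genes, so the approach relies on most genes not being differentially expressed.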
Effect of lowess normalization
Comparison between arrays
distribution of M values
– Various technical reasons (e.g. labeling efficiency, amount of labelled RNA, scanner settings, etc…)
between chips
– Multiple possibilities, one
Boxplots of log ratios from 3 replicate self-self hybridizations
Left panel: before normalization
Middle panel: after within print-tip group normalization
Right panel: after a further between-slide scale normalization
Scale normalization: between slides
Idea: make the median spread of M values identical by multiplying them by a chip-specific constant
Assume: All slides have the same spread in M
slides and j represents different spots
$\mathrm{MAD}_i = \mathrm{median}_j \{ | m_{ij} - \mathrm{median}_j(m_{ij}) | \}$
groups (rather than slides)
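A minimal sketch of scale normalization between slides, assuming the M values of each slide have already been location-normalized. The choice of target spread is an assumption made here for illustration: each slide is rescaled so that its MAD equals the geometric mean of all slide MADs.

```python
import math
from statistics import median

def mad(values):
    """Median absolute deviation of one slide's M values."""
    centre = median(values)
    return median(abs(v - centre) for v in values)

def scale_normalize(slides):
    """Multiply each slide's M values by a slide-specific constant."""
    mads = [mad(m) for m in slides]
    # Assumed target spread: geometric mean of the slide MADs
    target = math.exp(sum(math.log(d) for d in mads) / len(mads))
    factors = [target / d for d in mads]
    return [[f * v for v in m] for f, m in zip(factors, slides)]

slides = [[-1, 0, 1, -2, 2],      # MAD = 1
          [-2, 0, 2, -4, 4]]      # MAD = 2
scaled = scale_normalize(slides)
print([mad(m) for m in scaled])
```

The same function can be applied to print-tip groups instead of slides by passing one list of M values per group.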
Taking scale into account
NCI 60 experiments
Same normalization on another data set
Normalization: Summary
cDNA gene expression data
Data on p genes for n samples:
sample1 sample2 sample3 sample4 sample5 …
1 0.46 0.30 0.80 1.51 0.90 ...
2 0.49 0.24 0.06 0.46 ...
3 0.15 0.74 0.04 0.10 0.20 ...
4 ...
5 1.06 1.35 1.09 ...
Software for Microarray Analysis
softwares (GeneSpring, PathwayAssist,…)
analysis available as part of the open source BioConductor project http://www.bioconductor.org/
the methodology
Replicated experiments
change, one from each array
– M = average(M1, ..., Mn)
– s = SD(M1, ..., Mn)
Which genes are DE?
– massive multiple testing problem – genes dependent – don’t know null distribution of M
– aim to rank genes – assume most genes are not DE (depending on type of experiment and array) – find genes separated from the majority
Ranking criteria
– Problem : genes with large variability likely to be selected, even if not DE
use ti = Mi/ (si/√n)
– Problem: genes with extremely small variances make very large t
– When the number of replicates is small, the smallest si are likely to be underestimates
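The two ranking criteria can be compared on a toy example. This is a sketch, assuming each gene has n replicate M values stored in a dictionary; the gene names and numbers are invented for illustration.

```python
import math
from statistics import mean, stdev

def t_statistic(m_values):
    """t = M / (s / sqrt(n)) for one gene's replicate log ratios."""
    n = len(m_values)
    # stdev() is 0 when all replicates are identical, so t is undefined then
    return mean(m_values) / (stdev(m_values) / math.sqrt(n))

def rank_genes(replicates):
    """Genes sorted by |t|, largest first."""
    scores = {g: t_statistic(ms) for g, ms in replicates.items()}
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)

replicates = {
    "A": [2.0, 2.2, 1.8, 2.0],      # consistent two-log-unit change
    "B": [6.0, -2.0, 8.0, 0.0],     # large average, very variable
    "C": [0.1, -0.1, 0.05, -0.05],  # no change
}
by_t = rank_genes(replicates)
by_mean = sorted(replicates, key=lambda g: abs(mean(replicates[g])),
                 reverse=True)
print(by_t, by_mean)
```

Ranking by the average M puts the highly variable gene B first, while the t-statistic favours the reproducible gene A. The price is the sensitivity to very small s noted above.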
Summary
from the array
– Background may or may not be taken into account
– To remove systematic (technical) effects
– To allow comparisons between chips
difficult
– No absolute estimation of significance is possible
– Ranking of genes by significance is possible
Finding biological information
Once the matrix of gene-expression vs samples is available, statistical tools can be used to:
differentially expressed genes
(pathway analysis, gene ontology)
(unsupervised analysis)
samples in known categories (supervised analysis)
Unsupervised analysis: Cluster analysis
similarity matrix (n,n)
– mutually exclusive clusters
– hierarchical clusters
clusters, means and variances
Dendrogram
Hierarchical Clustering (real case)
Sorlie et al. Proc Natl Acad Sci U S A 2001 Sep 11;98(19):10869-74
Unsupervised analysis: PCA
(PCA)
expression matrix viewed as points in multidimensional space
space and hope these directions can be associated with known parameters
projection on those vectors explain the parameter
[Figure: data points in the (X1, X2) plane, with the principal component axes PC1 and PC2]
Supervised analysis example: KNN
Euclidean distance or one minus correlation)
– find the k observations in the learning set closest to X
– predict the class of X by majority vote, i.e., choose the class that is most common among those k observations.
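The two steps above can be sketched as a small classifier. This is a minimal illustration using Euclidean distance only (one minus correlation could be passed in as the `distance` argument instead); the learning set, the class labels and the two-gene profiles are invented.

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_predict(learning_set, x, k=3, distance=euclidean):
    """learning_set: list of (profile, class_label) pairs."""
    # Step 1: the k observations of the learning set closest to x
    neighbours = sorted(learning_set,
                        key=lambda item: distance(item[0], x))[:k]
    # Step 2: majority vote among their class labels
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical learning set: (Gene 1, Gene 2) expression per sample
learning_set = [
    ((1.0, 1.0), "tumor"), ((1.2, 0.8), "tumor"), ((0.9, 1.1), "tumor"),
    ((3.0, 3.2), "normal"), ((3.1, 2.9), "normal"), ((2.8, 3.0), "normal"),
]
print(knn_predict(learning_set, (1.1, 1.0)))
print(knn_predict(learning_set, (2.9, 3.1)))
```

An odd k avoids tied votes when there are two classes; the value of k itself is usually chosen by cross-validation on the learning set.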
[Figure: k-Nearest Neighbor (knn): samples plotted against Gene 1 and Gene 2, with an unclassified observation marked "?"]
[Figure: data matrix, genes by samples]
High-level analysis (summary)
low-level processing conducted
– Discover new interactions between genes (unsupervised analysis)
– Validate biological hypotheses (supervised analysis)
topic !