Does Sequence Similarity Predict Expression Similarity?
Kui Zhang
Section on Statistical Genetics, University of Alabama at Birmingham
May 28, 2004
Motivation
- Microarray technology allows us to monitor the expression of thousands of genes simultaneously
- Estimates of individual gene effect sizes generally have very low precision because of small sample sizes
- The completion of the Human Genome Project provides another type of information about genes
- Ultimate goal: use sequence information to improve the estimation of individual gene effect sizes in microarray data analysis
- Does sequence similarity predict expression similarity?
Some Studies for Correlation of Expression Data and Sequence Data
- Correlation between gene expression and gene location (Kruglyak and Tang, 2000; Fukuoka et al., 2004)
- Correlation between the co-expression of genes and the presence of common sequence elements in their upstream regions (Bussemaker et al., 2001; Ge et al., 2001)
Methods
- Choose four Affymetrix HG-U133 microarray data sets
- Define the sequence similarity by pair-wise E-value
from BLAST search
- Define the expression similarity by pair-wise
correlation coefficient
- Investigate the relationship between sequence-similar
pairs and expression-similar pairs
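The expression-similarity measure named above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the study; the function names and the expression values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = average_rank
        start = end + 1
    return result

def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical expression values for two genes across five arrays.
gene_a = [2.1, 3.4, 1.9, 5.0, 4.2]
gene_b = [1.0, 2.2, 0.8, 9.5, 3.1]
print(pearson(gene_a, gene_b), spearman(gene_a, gene_b))
```

Because the two hypothetical genes have the same rank order across arrays, their Spearman coefficient is at its maximum even though the Pearson coefficient is lower, which is one reason to examine both measures.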
Affymetrix HG-U133 Microarray
- Provides 18,400 transcripts and variants
- Represents 22,283 genes, including 14,500
well-characterized genes
- Contains more than 22,000 probe sets and 500,000
distinct oligonucleotide features
- Has 8,645 genes with consensus sequences
Define Sequence Similarity
- Use 8,645 genes having consensus sequences in
Affymetrix HG-U133 array
- Translate each sequence to 6 reading frames
- Run the program tblastx (without gap) for all
translated sequences against themselves
- Provide the bit score and e-value for similar sequences
- Set the cut-off e-value at 10^-5
- Find 7,396 sequences (genes) with at least one similar sequence other than themselves; only these genes are retained for further analysis
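The filtering step can be sketched as below. The slides do not show the BLAST output format, so this assumes the common 12-column tab-separated layout (query, subject, ..., e-value in column 11, bit score in column 12). It also assumes that hits at exactly the cut-off are kept and that similarity is treated as symmetric. The gene names and hit lines are hypothetical.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-5  # cut-off stated on the slide

def similar_genes(tabular_lines, cutoff=E_VALUE_CUTOFF):
    """Map each gene to the set of other genes it matches at or below the cut-off.

    Assumes 12-column tab-separated BLAST output with the e-value in column 11.
    """
    neighbours = defaultdict(set)
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        e_value = float(fields[10])
        if query == subject:
            continue  # a sequence always matches itself; ignore self-hits
        if e_value <= cutoff:
            # Assumption: record the relation in both directions.
            neighbours[query].add(subject)
            neighbours[subject].add(query)
    return dict(neighbours)

# Hypothetical hits: a self-hit, one strong hit, and one weak hit.
example = [
    "geneA\tgeneA\t100.0\t50\t0\t0\t1\t150\t1\t150\t0.0\t120",
    "geneA\tgeneB\t80.0\t50\t10\t0\t1\t150\t1\t150\t1e-20\t95.1",
    "geneA\tgeneC\t30.0\t20\t14\t0\t1\t60\t1\t60\t0.5\t20.3",
]
retained = similar_genes(example)
print(sorted(retained))
```

Genes that do not appear as keys in the result have no similar sequence other than themselves and would be dropped from the later analysis.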
Distribution of E-Value
[Figure: two histograms. (1) Histogram of Number of Similar Genes; x-axis: number of similar genes; y-axis: number of genes. (2) Histogram of E-Value; x-axis: natural log of e-value; y-axis: number of genes.]
Microarray Data Sets
- Cancer Study (A): 15 normal oral mucosal samples, 41 squamous cell carcinomas of the head and neck, and 5 adenocarcinomas of the head and neck; published in Cancer Res. 64:55-63 (2004)
- Affymetrix HG-U133 Serial Dilution (B): 42 arrays, processed by RMA
- Gene Therapy Study (C): 20 arrays divided into 4 groups, each treated with a different viral vector
- Breast Cancer Data (D): 10 breast tumors from women older than 49 years and 9 from women younger than 40 years
Initial Investigation
- Calculate the Pearson and Spearman correlation
coefficient for each pair of genes
- Divide [-1,1] into 20 bins with equal length
- Count the number of pairs with correlation coefficient
falling in each bin
- Count the number of sequence-similar pairs based
upon BLAST search results in each bin
- Calculate the percentage of sequence-similar pairs in
each bin
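The binning steps above can be sketched as follows. This is an illustrative outline, not the original analysis code. It assumes gene pairs are stored as tuples in a consistent order and that a coefficient of exactly 1 is placed in the last bin. All names and values are hypothetical.

```python
def bin_index(r, n_bins=20):
    """Index of the equal-width bin of [-1, 1] that contains r."""
    index = int((r + 1.0) / 2.0 * n_bins)
    return min(max(index, 0), n_bins - 1)  # r == 1.0 goes into the last bin

def similar_fraction_per_bin(correlations, similar_pairs, n_bins=20):
    """Count pairs and sequence-similar pairs per correlation bin.

    correlations: dict mapping a gene pair (tuple) to its correlation coefficient.
    similar_pairs: set of gene pairs flagged as sequence-similar by BLAST.
    Returns (pair counts, similar-pair counts, fractions) with one entry per bin;
    the fraction is None for an empty bin.
    """
    totals = [0] * n_bins
    similar = [0] * n_bins
    for pair, r in correlations.items():
        b = bin_index(r, n_bins)
        totals[b] += 1
        if pair in similar_pairs:
            similar[b] += 1
    fractions = [s / t if t else None for s, t in zip(similar, totals)]
    return totals, similar, fractions

# Hypothetical correlations for four gene pairs, one of them sequence-similar.
correlations = {
    ("g1", "g2"): 0.95,
    ("g1", "g3"): 0.92,
    ("g2", "g3"): 0.05,
    ("g1", "g4"): -0.55,
}
similar_pairs = {("g1", "g2")}
totals, similar, fractions = similar_fraction_per_bin(correlations, similar_pairs)
print(totals[19], similar[19], fractions[19])
```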
Distribution of Pearson Correlation Coefficient
[Figure: four panels (Data A, B, C, D); x-axis: correlation bins from -1.0 to 1.0; y-axis: number of pairs]
Distribution of Spearman Correlation Coefficient
[Figure: four panels (Data A, B, C, D); x-axis: correlation bins from -1.0 to 1.0; y-axis: number of pairs]
Percentage of Sequence-similar Pairs in Each Bin - Pearson Correlation Coefficient
[Figure: four panels (Data A, B, C, D); x-axis: correlation bins from -1.0 to 1.0; y-axis: percentage of sequence-similar pairs]
Percentage of Sequence-similar Pairs in Each Bin - Spearman Correlation Coefficient
[Figure: four panels (Data A, B, C, D); x-axis: correlation bins from -1.0 to 1.0; y-axis: percentage of sequence-similar pairs]
Hierarchical Clustering Of Sequence-Similar Pairs
- Group the 7,396 genes using hierarchical clustering
- Define the distance between each pair of genes as their e-value
- Take the distance between two clusters as the geometric average of the pairwise e-values between sequences in the two clusters
- Cut the tree at 37 different distance values
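The clustering rule can be sketched as below. The geometric average of e-values equals the exponential of the arithmetic mean of their natural logs, so the sketch works in log space. The slides do not name the software used or say how pairs without a BLAST hit are handled, so this is a naive illustration under assumptions: every gene pair has a log e-value available, and merging stops once the closest pair of clusters exceeds the cut value. The loop is far too slow for 7,396 genes and is meant only to show the linkage rule. All names and values are hypothetical.

```python
import itertools

def cluster_distance(cluster_a, cluster_b, log_e):
    """Mean ln(e-value) over all cross-cluster pairs, which is the log of the
    geometric average of the pairwise e-values."""
    total = sum(log_e[frozenset((a, b))] for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def cut_clusters(genes, log_e, log_cutoff):
    """Merge the closest pair of clusters until no pair is within log_cutoff."""
    clusters = [frozenset([g]) for g in genes]
    while len(clusters) > 1:
        (i, j), best = min(
            (((i, j), cluster_distance(clusters[i], clusters[j], log_e))
             for i, j in itertools.combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > log_cutoff:
            break  # every remaining pair of clusters is farther apart than the cut
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

# Hypothetical natural-log e-values for every pair among four genes.
genes = ["g1", "g2", "g3", "g4"]
log_e = {
    frozenset(("g1", "g2")): -200.0,
    frozenset(("g3", "g4")): -90.0,
    frozenset(("g1", "g3")): -20.0,
    frozenset(("g1", "g4")): -12.0,
    frozenset(("g2", "g3")): -16.0,
    frozenset(("g2", "g4")): -12.0,
}
print(cut_clusters(genes, log_e, log_cutoff=-80.0))
```

Repeating the call with each of the 37 cut values would give one partition of the genes per hierarchical level.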
The Distance Used for Cutting Trees
Level    Natural Log of Distance
1        -450
5        -250
10       -80
15       -30
20       -7
25       -2
30       3
35       8
Distribution of Number of Clusters and Number of Genes
[Figure: Clustering Process; number of clusters and number of genes at each hierarchical level; x-axis: hierarchical level number]
Distribution of Cluster Size
[Figure: histogram of cluster size; x-axis: cluster size; y-axis: number of clusters]
Methods
- Calculate the average correlation coefficient for all
possible gene pairs at each hierarchical level
- Compute the average correlation coefficient for gene
pairs in the same cluster at each hierarchical level
- At each hierarchical level, calculate the percentage of gene pairs in the same cluster among all gene pairs with a correlation coefficient less than 0.30
- At each hierarchical level, calculate the percentage of gene pairs in the same cluster among all gene pairs with a correlation coefficient greater than 0.60
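For a single hierarchical level, the summaries listed above might be computed as in this sketch. It covers the two averages and the share of highly correlated pairs (coefficient greater than 0.60) that fall in the same cluster; the low-correlation variant only changes the comparison. The function name, the dictionary layout and the data are hypothetical.

```python
def same_cluster_summary(correlations, cluster_of, threshold=0.60):
    """Summaries for one hierarchical level.

    correlations: dict mapping a gene pair (tuple) to its correlation coefficient.
    cluster_of: dict mapping each gene to its cluster label at this level.
    """
    all_values = list(correlations.values())
    # Correlations of pairs whose two genes share a cluster.
    within = [r for (a, b), r in correlations.items() if cluster_of[a] == cluster_of[b]]
    # Pairs above the threshold, and the subset that also shares a cluster.
    high = [pair for pair, r in correlations.items() if r > threshold]
    high_within = [(a, b) for a, b in high if cluster_of[a] == cluster_of[b]]
    return {
        "mean_all": sum(all_values) / len(all_values),
        "mean_within": sum(within) / len(within) if within else None,
        "share_high_within": len(high_within) / len(high) if high else None,
    }

# Hypothetical correlations for all pairs among four genes in two clusters.
correlations = {
    ("g1", "g2"): 0.80,
    ("g3", "g4"): 0.70,
    ("g1", "g3"): 0.65,
    ("g1", "g4"): 0.10,
    ("g2", "g3"): -0.20,
    ("g2", "g4"): 0.05,
}
cluster_of = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}
summary = same_cluster_summary(correlations, cluster_of)
print(summary)
```

Comparing `mean_within` with `mean_all` at each level shows whether genes grouped by sequence similarity are more strongly co-expressed than gene pairs in general.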
Average Pearson Correlation Coefficient in Same Cluster
[Figure: four panels (Data A, B, C, D); x-axis: hierarchical level; y-axis: average correlation coefficient]
Average Spearman Correlation Coefficient in Same Cluster
[Figure: four panels (Data A, B, C, D); x-axis: hierarchical level; y-axis: average correlation coefficient]
Percentage of Gene Pairs in Same Cluster - Pearson Correlation Coefficient (I)
[Figure: four panels (Data A, B, C, D); x-axis: hierarchical level; y-axis: percentage]
Percentage of Gene Pairs in Same Cluster - Pearson Correlation Coefficient (II)
[Figure: four panels (Data A, B, C, D); x-axis: hierarchical level; y-axis: ratio]
Percentage of Gene Pairs in Same Cluster - Spearman Correlation Coefficient (I)
[Figure: four panels (Data A, B, C, D); x-axis: hierarchical level; y-axis: percentage]
Percentage of Gene Pairs in Same Cluster - Spearman Correlation Coefficient (II)
[Figure: four panels (Data A, B, C, D); x-axis: hierarchical level; y-axis: ratio]
Conclusions
- Higher percentage of sequence-similar pairs in each
bin of correlation-coefficient
- Higher average of correlation coefficient in the same
cluster of sequence-similar gene pairs
- Higher percentage of gene pairs having high
correlation coefficient in the same cluster
- The gene pairs with high sequence similarity are likely
to be co-expressed
- This may therefore be useful in improving estimates of gene effect in microarray data analysis
Future Study
- Use different cut-off e-values
- Use other information about genes, such as upstream sequences and gene functional annotations
- Extend this study to other species, such as mouse and rat
- Assess significance using analytical or simulation methods
- Develop novel methods that combine sequence data to improve the estimation of the effect size for each gene in microarray studies
Acknowledgements
- Section on Statistical Genetics
David B. Allison, Grier P. Page, Jelai Wang
- Department of Microbiology