Does Sequence Similarity Predict Expression Similarity (Kui Zhang)



slide-1
SLIDE 1

Does Sequence Similarity Predict Expression Similarity

Kui Zhang Section on Statistical Genetics University of Alabama at Birmingham

May 28, 2004

slide-2
SLIDE 2

Motivation

  • Microarray technology allows us to monitor the expression of thousands of genes simultaneously
  • The estimates of each individual gene's effect size are generally very low in precision because of small sample sizes
  • The completion of the Human Genome Project provides another type of information about genes
  • Ultimate goal: incorporate sequence information to improve the estimation of individual gene effect sizes in microarray data analysis
  • Does sequence similarity predict expression similarity?
slide-3
SLIDE 3

Some Studies for Correlation of Expression Data and Sequence Data

  • Correlation between gene expression and gene location:
    – Kruglyak and Tang, 2000
    – Fukuoka et al., 2004
  • Correlation between the co-expression of genes and the presence of common sequence elements in their upstream regions:
    – Bussemaker et al., 2001
    – Ge et al., 2001

slide-4
SLIDE 4

Methods

  • Choose 4 Affymetrix HG-U133 microarray data sets
  • Define sequence similarity by the pair-wise E-value from a BLAST search
  • Define expression similarity by the pair-wise correlation coefficient
  • Investigate the relationship between sequence-similar pairs and expression-similar pairs

slide-5
SLIDE 5

Affymetrix HG-U133 Microarray

  • Provides 18,400 transcripts and variants
  • Represents 22,283 genes, including 14,500 well-characterized genes
  • Contains more than 22,000 probe sets and 500,000 distinct oligonucleotide features
  • Has 8,645 genes with consensus sequences
slide-6
SLIDE 6

Define Sequence Similarity

  • Use the 8,645 genes that have consensus sequences on the Affymetrix HG-U133 array
  • Translate each sequence in all 6 reading frames
  • Run the program tblastx (without gaps) for all translated sequences against themselves
  • Record the bit score and E-value for similar sequences
  • Set the cut-off E-value to 10^-5
  • Find 7,396 sequences (genes) that have at least one similar sequence other than themselves; only these genes are retained for further analysis
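The filtering step on this slide can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the `(query, subject, e_value)` tuple layout, the function name `similar_partners`, and the toy hits are all assumptions, and whether the cut-off was applied as "less than" or "less than or equal to" is not stated on the slide.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 1e-5  # cut-off E-value stated on the slide


def similar_partners(hits, cutoff=E_VALUE_CUTOFF):
    """Map each gene to the set of other genes it matches at or below the cut-off.

    `hits` is an iterable of (query, subject, e_value) tuples. This layout is
    an assumption for illustration, not the actual tblastx output format.
    """
    partners = defaultdict(set)
    for query, subject, e_value in hits:
        if query == subject:          # ignore self-hits
            continue
        if e_value <= cutoff:         # "<=" is an assumption; the slide only gives the value
            partners[query].add(subject)
            partners[subject].add(query)  # treat similarity as symmetric
    return dict(partners)


# Toy hits (made up for illustration)
hits = [
    ("g1", "g1", 0.0),    # self-hit, ignored
    ("g1", "g2", 1e-40),  # passes the cut-off
    ("g2", "g3", 1e-3),   # fails the cut-off
    ("g4", "g4", 0.0),    # self-hit only, so g4 is not retained
]
partners = similar_partners(hits)
retained = sorted(partners)  # genes with at least one similar sequence
print(retained)
```

Genes that never appear as a key in `partners` correspond to the sequences dropped before further analysis.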

slide-7
SLIDE 7

Distribution of E-Value

[Figure: two histograms. "Histogram of Number of Similar Genes" (x-axis: Number of Similar Genes; y-axis: No. of Genes) and "Histogram of E-Value" (x-axis: Natural Log of E-Value; y-axis: No. of Genes)]

slide-8
SLIDE 8

Microarray Data Sets

  • Cancer Study (A): 15 normal oral mucosal samples, 41 squamous cell carcinomas of the head and neck, and 5 adenocarcinomas of the head and neck; published in Cancer Res. 64:55-63 (2004)
  • Affymetrix HG-U133 Serial Dilution (B): 42 arrays, processed by RMA
  • Gene Therapy Study (C): 20 arrays, divided into 4 groups, each treated with a different viral vector
  • Breast Cancer Data (D): 10 breast tumors from women older than 49 years and 9 from women younger than 40 years

slide-9
SLIDE 9

Initial Investigation

  • Calculate the Pearson and Spearman correlation coefficients for each pair of genes
  • Divide [-1, 1] into 20 bins of equal width
  • Count the number of pairs whose correlation coefficient falls in each bin
  • Count the number of sequence-similar pairs (based on the BLAST search results) in each bin
  • Calculate the percentage of sequence-similar pairs in each bin
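The binning procedure above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the toy expression profiles, and the toy set of sequence-similar pairs are hypothetical, the handling of bin edges (r = 1 placed in the last bin) is a choice made here, and genes with constant expression are assumed absent because their correlation is undefined.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def bin_index(r, n_bins=20):
    """Index of the equal-width bin of [-1, 1] containing r (r = 1 goes in the last bin)."""
    return min(int((r + 1.0) / 2.0 * n_bins), n_bins - 1)


def fraction_similar_per_bin(expr, similar_pairs, corr=pearson, n_bins=20):
    """Count all pairs and sequence-similar pairs per correlation bin."""
    total = [0] * n_bins
    similar = [0] * n_bins
    for a, b in combinations(sorted(expr), 2):
        k = bin_index(corr(expr[a], expr[b]), n_bins)
        total[k] += 1
        if frozenset((a, b)) in similar_pairs:
            similar[k] += 1
    frac = [s / t if t else None for s, t in zip(similar, total)]
    return total, similar, frac


# Toy data (made up): expression profiles over 4 arrays
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
similar_pairs = {frozenset(("g1", "g2"))}  # hypothetical BLAST-similar pair
total, similar, frac = fraction_similar_per_bin(expr, similar_pairs)
print(total[0], total[19], frac[19])
```

Passing `corr=spearman` gives the rank-based version of the same counts. With thousands of genes the number of pairs runs into the millions, so a vectorized correlation matrix would be the practical choice; the loop here is kept for readability.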

slide-10
SLIDE 10

Distribution of Pearson Correlation Coefficient

[Figure: four panels (Data A, Data B, Data C, Data D); x-axis: Bins (correlation coefficient, −1.0 to 1.0); y-axis: Number of Pairs]

slide-11
SLIDE 11

Distribution of Spearman Correlation Coefficient

[Figure: four panels (Data A, Data B, Data C, Data D); x-axis: Bins (correlation coefficient, −1.0 to 1.0); y-axis: Number of Pairs]

slide-12
SLIDE 12

Percentage of Sequence-similar Pairs in Each Bin - Pearson Correlation Coefficient

[Figure: four panels (Data A, Data B, Data C, Data D); x-axis: Bins (correlation coefficient, −1.0 to 1.0); y-axis: Percentage of Sequence-similar Pairs]

slide-13
SLIDE 13

Percentage of Sequence-similar Pairs in Each Bin - Spearman Correlation Coefficient

[Figure: four panels (Data A, Data B, Data C, Data D); x-axis: Bins (correlation coefficient, −1.0 to 1.0); y-axis: Percentage of Sequence-similar Pairs]

slide-14
SLIDE 14

Hierarchical Clustering Of Sequence-Similar Pairs

  • Group the 7,396 genes using hierarchical clustering
  • Define the distance between each pair of genes as their E-value
  • Take the distance between two clusters as the geometric average of the pair-wise E-values between the sequences in each cluster
  • Cut the tree at 37 different distance values
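The cluster-to-cluster distance described above, a geometric average of pair-wise E-values, is most conveniently computed on the log scale, since the geometric mean equals the exponential of the mean of the logs. The sketch below is an illustration only: the function name, the toy E-values, and the lookup layout are assumptions. It also assumes every cross-cluster pair has a strictly positive E-value; how pairs with no BLAST hit, or with a reported E-value of 0, were handled is not stated on the slides.

```python
import math


def log_geometric_mean_linkage(cluster_a, cluster_b, e_value):
    """Natural log of the geometric mean of pair-wise E-values between two clusters.

    `e_value` maps frozenset({a, b}) to a strictly positive E-value
    (an assumed layout for illustration).
    """
    logs = [
        math.log(e_value[frozenset((a, b))])
        for a in cluster_a
        for b in cluster_b
    ]
    return sum(logs) / len(logs)  # mean of logs = log of geometric mean


# Toy E-values (made up) between clusters {g1, g2} and {g3, g4}
e_value = {
    frozenset(("g1", "g3")): 1e-10,
    frozenset(("g1", "g4")): 1e-20,
    frozenset(("g2", "g3")): 1e-30,
    frozenset(("g2", "g4")): 1e-40,
}
d = log_geometric_mean_linkage(["g1", "g2"], ["g3", "g4"], e_value)
print(d)  # equals ln(1e-25), the log of the geometric mean of the four values
```

Working in logs avoids floating-point underflow when multiplying very small E-values and yields distances on a natural-log scale.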
slide-15
SLIDE 15

The Distance Used for Cutting Trees

Level    Natural Log of Distance
1        −450
5        −250
10       −80
15       −30
20       −7
25       −2
30       3
35       8

slide-16
SLIDE 16

Distribution of Number of Clusters and Number of Genes

[Figure: "Clustering Process"; x-axis: Hierarchical Level Number]

slide-17
SLIDE 17

Distribution of Cluster Size

[Figure: histogram; x-axis: Cluster Size; y-axis: No. of Clusters]

slide-18
SLIDE 18

Methods

  • Calculate the average correlation coefficient for all possible gene pairs at each hierarchical level
  • Compute the average correlation coefficient for gene pairs in the same cluster at each hierarchical level
  • At each hierarchical level, calculate the percentage of gene pairs in the same cluster among all gene pairs with a correlation coefficient less than 0.30
  • At each hierarchical level, calculate the percentage of gene pairs in the same cluster among all gene pairs with a correlation coefficient greater than 0.60
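For a single hierarchical level, the quantities listed above could be computed along these lines. This is a sketch under assumptions: the function name `cluster_summary`, the dictionary layouts, and the toy correlations and cluster labels are all hypothetical, and only the "greater than 0.60" percentage is shown (the "less than 0.30" case follows by flipping the comparison).

```python
from itertools import combinations


def cluster_summary(corr, cluster_of, high=0.60):
    """Summaries for one hierarchical level.

    corr: dict mapping frozenset({a, b}) -> correlation coefficient
    cluster_of: dict mapping gene -> cluster label at this level
    (both layouts are assumptions for illustration)
    """
    all_r, same_r = [], []
    high_total = high_same = 0
    for a, b in combinations(sorted(cluster_of), 2):
        r = corr[frozenset((a, b))]
        same = cluster_of[a] == cluster_of[b]
        all_r.append(r)
        if same:
            same_r.append(r)
        if r > high:
            high_total += 1
            high_same += int(same)
    return {
        "mean_all": sum(all_r) / len(all_r),
        "mean_same_cluster": sum(same_r) / len(same_r) if same_r else None,
        "frac_high_in_same_cluster": high_same / high_total if high_total else None,
    }


# Toy data (made up): two clusters of two genes each
cluster_of = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}
corr = {
    frozenset(("g1", "g2")): 0.80,
    frozenset(("g3", "g4")): 0.70,
    frozenset(("g1", "g3")): 0.10,
    frozenset(("g1", "g4")): 0.65,
    frozenset(("g2", "g3")): -0.20,
    frozenset(("g2", "g4")): 0.00,
}
summary = cluster_summary(corr, cluster_of)
print(summary)
```

Repeating the call with the cluster labels from each of the 37 cut levels gives one value per level, which is the kind of series plotted against hierarchical level on the following slides.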

slide-19
SLIDE 19

Average Pearson Correlation Coefficient in Same Cluster

[Figure: four panels (Data A, Data B, Data C, Data D); x-axis: Hierarchical Level; y-axis: Average Correlation Coefficient]

slide-20
SLIDE 20

Average Spearman Correlation Coefficient in Same Cluster

[Figure: four panels (Data A, Data B, Data C, Data D); x-axis: Hierarchical Level; y-axis: Average Correlation Coefficient]

slide-21
SLIDE 21

Percentage of Gene Pairs in Same Cluster - Pearson Correlation Coefficient (I)

[Figure: four panels (Data A, Data B, Data C, Data D); x-axis: Hierarchical Level; y-axis: Percentage]

slide-22
SLIDE 22

Percentage of Gene Pairs in Same Cluster - Pearson Correlation Coefficient (II)

[Figure: four panels (Data A, Data B, Data C, Data D); x-axis: Hierarchical Level; y-axis: Ratio]

slide-23
SLIDE 23

Percentage of Gene Pairs in Same Cluster - Spearman Correlation Coefficient (I)

[Figure: four panels (Data A, Data B, Data C, Data D); x-axis: Hierarchical Level; y-axis: Percentage]

slide-24
SLIDE 24

Percentage of Gene Pairs in Same Cluster - Spearman Correlation Coefficient (II)

[Figure: four panels (Data A, Data B, Data C, Data D); x-axis: Hierarchical Level; y-axis: Ratio]

slide-25
SLIDE 25

Conclusions

  • Higher percentage of sequence-similar pairs in each bin of correlation coefficient
  • Higher average correlation coefficient within the same cluster of sequence-similar gene pairs
  • Higher percentage of gene pairs with a high correlation coefficient in the same cluster
  • Gene pairs with high sequence similarity are likely to be co-expressed
  • This may therefore be useful in improving estimates of gene effect in microarray data analysis
slide-26
SLIDE 26

Future Study

  • Use different cut-off E-values
  • Use other information about genes, such as upstream sequences and gene functional annotations
  • Extend this study to other species, such as mouse and rat
  • Assess significance using analytical or simulation methods
  • Develop novel methods that combine sequence data to improve the estimation of effect size for each gene in a microarray study

slide-27
SLIDE 27

Acknowledgements

  • Section on Statistical Genetics: David B. Allison, Grier P. Page, Jelai Wang
  • Department of Microbiology: Elliot J. Lefkowitz