SLIDE 1

Does Sequence Similarity Predict Expression Similarity

Kui Zhang, Section on Statistical Genetics, University of Alabama at Birmingham

May 28, 2004

SLIDE 2

Motivation

Microarray technology allows us to monitor the expression of thousands of genes simultaneously

The estimates of individual gene effect sizes generally have very low precision because of small sample sizes

The completion of the Human Genome Project provides another type of information about genes

Ultimate goal: incorporate sequence information to improve the estimation of individual gene effect sizes in microarray data analysis

Does sequence similarity predict expression similarity?

SLIDE 3

Some Studies for Correlation of Expression Data and Sequence Data

Correlation between gene expression and gene location: Kruglyak and Tang, 2000; Fukuoka et al., 2004

Correlation between the co-expression of genes and the presence of common sequence elements in their upstream regions: Bussemaker et al., 2001; Ge et al., 2001

SLIDE 4

Methods

Choose 4 Affymetrix HG-U133 microarray data sets

Define sequence similarity by the pairwise E-value from a BLAST search

Define expression similarity by the pairwise correlation coefficient

Investigate the relationship between sequence-similar pairs and expression-similar pairs

SLIDE 5

Affymetrix HG-U133 Microarray

Provides 18,400 transcripts and variants

Represents 22,283 genes, including 14,500 well-characterized genes

Contains more than 22,000 probe sets and 500,000 distinct oligonucleotide features

Has 8,645 genes with consensus sequences

SLIDE 6

Define Sequence Similarity

Use the 8,645 genes that have consensus sequences on the Affymetrix HG-U133 array

Translate each sequence in all 6 reading frames

Run tblastx (ungapped) for all translated sequences against themselves

Obtain the bit score and E-value for each pair of similar sequences

Set the E-value cut-off to 10^-5

Find 7,396 sequences (genes) with at least one similar sequence other than themselves; only these genes are retained for further analysis
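
The filtering step above can be sketched in a few lines of Python. This is only an illustration, not the pipeline used in the talk: it assumes the BLAST hits are available in the standard 12-column tab-separated tabular format (query ID in column 1, subject ID in column 2, E-value in column 11), and it assumes that "cut-off" means keeping hits with an E-value at or below the threshold. The function names `similar_pairs` and `retained_genes` are hypothetical.

```python
E_VALUE_CUTOFF = 1e-5  # cut-off stated on the slide


def similar_pairs(lines, cutoff=E_VALUE_CUTOFF):
    """Return the set of unordered sequence-similar gene pairs.

    Assumes each line is a 12-column tab-separated BLAST hit:
    column 1 = query ID, column 2 = subject ID, column 11 = E-value.
    """
    pairs = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        e_value = float(fields[10])
        if query == subject:
            continue  # a sequence always matches itself; ignore self hits
        if e_value <= cutoff:  # assumption: the cut-off is inclusive
            pairs.add(frozenset((query, subject)))
    return pairs


def retained_genes(pairs):
    """Genes that have at least one similar sequence other than themselves."""
    genes = set()
    for pair in pairs:
        genes.update(pair)
    return genes
```

Storing each pair as a `frozenset` makes the A-versus-B and B-versus-A hits collapse into one unordered pair, which matches the pair-level counting used later in the analysis.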

SLIDE 7

Distribution of E-Value

[Figure: two histograms. "Histogram of Number of Similar Genes" (x-axis: Number of Similar Genes; y-axis: No. of Genes) and "Histogram of E-Value" (x-axis: Natural Log of E-Value; y-axis: No. of Genes).]

SLIDE 8

Microarray Data Sets

Cancer Study (A): 15 normal oral mucosal samples, 41 squamous cell carcinomas of the head and neck, and 5 adenocarcinomas of the head and neck; published in Cancer Res. 64:55-63 (2004)

Affymetrix HG-U133 Serial Dilution (B): 42 arrays, processed by RMA

Gene Therapy Study (C): 20 arrays divided into 4 groups, each treated with a different viral vector

Breast Cancer Data (D): 10 breast tumors from older women (>49 years) and 9 from younger women (<40 years)

SLIDE 9

Initial Investigation

Calculate the Pearson and Spearman correlation coefficients for each pair of genes

Divide [-1, 1] into 20 bins of equal width

Count the number of gene pairs whose correlation coefficient falls in each bin

Count the number of sequence-similar pairs (based on the BLAST search results) in each bin

Calculate the percentage of sequence-similar pairs in each bin
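
The binning procedure above can be sketched as follows. This is a minimal standard-library illustration under stated assumptions, not the code behind the slides: `expression` is a hypothetical dictionary mapping each gene ID to its expression values across arrays, `similar` is a hypothetical set of `frozenset` gene pairs that passed the BLAST cut-off, and every expression profile is assumed to be non-constant so the correlation is defined. Ties in the Spearman ranks are given average ranks, which is one common convention.

```python
import math
from itertools import combinations

N_BINS = 20  # number of equal-width bins on [-1, 1], as on the slide


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def bin_index(r, n_bins=N_BINS):
    """Map r in [-1, 1] to a bin 0..n_bins-1; r = 1 goes in the last bin."""
    idx = int((r + 1.0) / 2.0 * n_bins)
    return min(max(idx, 0), n_bins - 1)  # clamp against rounding error


def similar_fraction_per_bin(expression, similar, corr=pearson, n_bins=N_BINS):
    """Return (percentages, totals) per bin; percentage is None for empty bins."""
    totals = [0] * n_bins
    hits = [0] * n_bins
    for g1, g2 in combinations(sorted(expression), 2):
        b = bin_index(corr(expression[g1], expression[g2]), n_bins)
        totals[b] += 1
        if frozenset((g1, g2)) in similar:
            hits[b] += 1
    percentages = [100.0 * h / t if t else None for h, t in zip(hits, totals)]
    return percentages, totals
```

Passing `corr=spearman` repeats the same count with rank correlations. With 7,396 retained genes there are roughly 27 million pairs, so a real analysis would likely use a vectorized correlation matrix rather than this pairwise loop.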

SLIDE 10

Distribution of Pearson Correlation Coefficient

[Figure: four panels (Data A, Data B, Data C, Data D), each plotting Number of Pairs (y-axis) against Pearson correlation bins from -1.0 to 1.0 (x-axis).]

SLIDE 11

Distribution of Spearman Correlation Coefficient

[Figure: four panels (Data A, Data B, Data C, Data D), each plotting Number of Pairs (y-axis) against Spearman correlation bins from -1.0 to 1.0 (x-axis).]

SLIDE 12