
SLIDE 1

1

Gene regulation, protein networks and disease – a computational perspective

Ron Shamir, School of Computer Science, Tel Aviv University

CPM, Helsinki, July 3, 2012


SLIDE 2

Outline

Finding regulatory motifs I, II, III

Utilizing case-control expression profiles and networks I, II

Chromosomal aberrations in cancer

DEGAS

2

SLIDE 3

Regulation of Transcription

A gene’s transcription regulation is mainly encoded in the DNA, in a region called the promoter.

Each promoter contains several short DNA subsequences, called binding sites (BSs), that are bound by specific proteins called transcription factors (TFs).

[Figure: a promoter upstream of a gene, drawn 5’ to 3’, with two binding sites (BS), each bound by a TF]

SLIDE 4

4

Position Weight Matrix (PWM)

0.2 0.7 0.8 0.1 A 0.6 0.4 0.1 0.5 0.1 C 0.1 0.4 0.1 0.5 G 0.3 0.1 0.1 0.9 T

ATGCAGGATACACCGATCGGTA 0.0605
GGAGTAGAGCAAGTCCCGTGA 0.0605
AAGACTCTACAATTATGGCGT 0.0151

Score: product of base probabilities. A score threshold is needed to call hits.
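The scoring rule (product of base probabilities, then a threshold for hits) can be sketched in Python. This is a minimal illustration, not the speaker's implementation: the 4-column matrix `PWM`, the helper names `window_score` and `find_hits`, the example sequence and the threshold 0.1 are all assumptions made for the example, and the matrix is deliberately not the one on the slide.

```python
# Hypothetical 4-position PWM (NOT the matrix from the slide).
# Each column (position) sums to 1 over the bases A, C, G, T.
PWM = {
    "A": [0.7, 0.1, 0.1, 0.2],
    "C": [0.1, 0.1, 0.6, 0.2],
    "G": [0.1, 0.7, 0.2, 0.1],
    "T": [0.1, 0.1, 0.1, 0.5],
}


def window_score(window, pwm=PWM):
    """Product of the per-position base probabilities for one window."""
    score = 1.0
    for pos, base in enumerate(window):
        score *= pwm[base][pos]
    return score


def find_hits(sequence, threshold, pwm=PWM):
    """Return (offset, window, score) for every window scoring >= threshold."""
    width = len(next(iter(pwm.values())))
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = window_score(window, pwm)
        if score >= threshold:
            hits.append((start, window, score))
    return hits


# Example: scan a short, made-up sequence with an assumed threshold of 0.1.
hits = find_hits("TTAGCTAGCA", threshold=0.1)
for offset, window, score in hits:
    print(offset, window, round(score, 4))
```

In practice log-probabilities are usually summed instead of multiplying raw probabilities, which avoids underflow for long motifs; the product form is kept here because it is the one stated on the slide.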

SLIDE 5

I. Finding Regulatory Motifs

5

C. Linhart, Y. Halperin, Genome Research 08

SLIDE 6

Motif discovery: The two-step strategy

Gene expression microarrays
Clustering
Cluster I, Cluster II, Cluster III
Location analysis (ChIP-chip, …)
Functional group (e.g., GO term)
Co-regulated gene set
Promoter sequences
Motif discovery

6

SLIDE 7

Amadeus

A Motif Algorithm for Detecting Enrichment in mUltiple Species

Supports diverse motif discovery tasks:

1. Find over-represented motifs in given sets of genes.
2. Identify motifs with global spatial features given only the genomic sequences.

How?

A general pipeline architecture for enumerating motifs.

Different statistical scoring schemes of motifs for different motif discovery tasks.

7

SLIDE 8

Motif search algorithm

• Pipeline of refinement phases of increased complexity

Phases: Preprocess, Mismatch, Merge, Optimization

Motif model: k-mer, List of k-mers, PWM, PWM

Cutoff = 0.005

8

SLIDE 9

Scoring over-represented motifs

• Input: Target set (size T) = co-regulated genes; Background (BG) set (size B) = entire genome

• Motif enrichment scoring:

– Hyper-geometric
– Binned enrichment score
– Binomial

[Figure: sets labeled B, T, b, t; for the binned score, genes are binned by Length (0.4-0.7 kbp, 0.7-1 kbp) and GC-content (20-40%, 40-60%), with per-bin labels B1-B4, T1-T4, b1-b4]
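The hyper-geometric enrichment score can be sketched as an upper-tail probability. The sketch assumes the usual reading of the figure labels, which the slide does not spell out: b is the number of BG genes containing a motif hit and t is the number of target genes containing one. The function name `hypergeom_enrichment_pvalue` and the toy numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(B, T, b, t):
    """P(X >= t) for X ~ Hypergeometric: T genes drawn without replacement
    from B genes, of which b carry a motif hit (assumed meaning of b, t)."""
    if not (0 <= T <= B and 0 <= b <= B and 0 <= t <= T):
        raise ValueError("expected 0 <= t <= T <= B and 0 <= b <= B")
    total = comb(B, T)
    # comb(n, k) is 0 when k > n, so impossible terms drop out on their own.
    tail = sum(comb(b, i) * comb(B - b, T - i) for i in range(t, min(b, T) + 1))
    return tail / total


# Toy numbers: 10 BG genes, 5 with a hit; target set of 4 genes, 3 with a hit.
p_value = hypergeom_enrichment_pvalue(B=10, T=4, b=5, t=3)
print(round(p_value, 4))
```

A binned variant would apply the same idea within each length / GC-content bin, so that a motif is not rewarded merely because target promoters are longer or more GC-rich than the background.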

SLIDE 10

Metazoan motif discovery benchmark:

42 target sets of 26 TFs, 8 miRNAs from 29 studies (expression, ChIP-chip, ...) in human, mouse, fly, worm.

All motifs are experimentally verified

Average target set size: 400 genes (383 Kbp)

10

SLIDE 11

11

SLIDE 12

12

SLIDE 13

Amadeus – Global spatial analysis

Gene expression microarrays
Location analysis (ChIP-chip, …)
Functional group (e.g., GO term)
Co-regulated gene set
Promoter sequences
Output: Motif(s)

13

SLIDE 14

Task II: Global analyses

Input: Sequences (no target-set / expression data)

Motif scoring: Scores for spatial features of motif occurrences

• Localization w.r.t. the TSS
• Strand-bias
• Chromosomal preference

[Figure: a sequence with its 5’ end and TSS marked]

14

SLIDE 15

Global analysis: Chromosomal preference in C. elegans

• Input: All worm promoters (~18,000)

• Score: chromosomal preference

• Results: Novel motif on chrom IV

15

SLIDE 16

Global analysis: Chromosomal preference in C. elegans

• Input: All worm promoters (~18,000)

• Score: chromosomal preference

• Results: Novel motif on chrom IV

16

SLIDE 17

II. Finding Transcriptional Programs

17

Y. Halperin, C. Linhart, I. Ulitsky, NAR 10

SLIDE 18

Goal

Given expression profiles, find the transcriptional programs active in them:

– the co-regulated genes,
– the motifs that govern their co-regulation

SLIDE 19

Our goal: bypass the two-step approach

Cluster I Cluster II Cluster III

Gene expression microarrays

Clustering

Promoter sequences Expression data Output

Motif(s)

Co-regulated gene set

19

Simultaneous inference of the motifs and the expression profiles of their targets

SLIDE 20

Allegro: expression model

• Discretization of expression patterns

e1 = Up (U): ≥ 1.0; e2 = Same (S): (-1.0, 1.0); e3 = Down (D): ≤ -1.0

Expression pattern of gene g:

     c1     c2    …   cm
g    -2.3   0.8   …   1.5

Discrete expression pattern (DEP) of gene g:

     c1     c2    …   cm
g    D      S     …   U

• Condition frequency matrix (CFM)

F =
     c1     c2    …   cm
U    0.05   0.1   …   0.78
S    0.9    0.2   …   0.14
D    0.05   0.7   …   0.08

• Condition weight matrix (CWM)

$W = \{w_{ij}\}$, $w_{ij} = \log\left(\frac{f_{ij}}{r_{ij}}\right)$ ($R = \{r_{ij}\}$ is the BG CFM)

⇒ Log-likelihood ratio (LLR) score
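The expression model on this slide (discretize, build a CFM for the target genes and for the background, take the log ratio, score a gene by summing weights) can be sketched as follows. This is an illustrative reading of the slide rather than Allegro's code: the function names, the pseudocount used to avoid log(0), and the six-gene toy data set are all assumptions.

```python
from math import log

LEVELS = ("U", "S", "D")


def discretize(value, cutoff=1.0):
    """Map an expression value to Up / Same / Down with the slide's cutoffs."""
    if value >= cutoff:
        return "U"
    if value <= -cutoff:
        return "D"
    return "S"


def condition_frequency_matrix(deps, pseudocount=1.0):
    """CFM: per condition, the frequency of each level among the given DEPs.
    The pseudocount is an assumption added to keep every frequency positive."""
    n_conditions = len(deps[0])
    denom = len(deps) + pseudocount * len(LEVELS)
    return {
        lvl: [
            (sum(1 for dep in deps if dep[j] == lvl) + pseudocount) / denom
            for j in range(n_conditions)
        ]
        for lvl in LEVELS
    }


def condition_weight_matrix(F, R):
    """CWM: w_ij = log(f_ij / r_ij), where R is the background CFM."""
    return {lvl: [log(f / r) for f, r in zip(F[lvl], R[lvl])] for lvl in LEVELS}


def llr_score(dep, W):
    """Log-likelihood ratio of one DEP: sum of the matching CWM entries."""
    return sum(W[lvl][j] for j, lvl in enumerate(dep))


# Toy data (made up): 6 genes x 3 conditions; the first two genes are targets.
values = [
    [1.5, 0.2, -2.0], [1.2, 0.1, -1.5], [0.0, 0.3, 0.4],
    [-0.2, 1.1, 0.0], [0.5, -1.3, 0.2], [0.1, 0.0, -0.3],
]
deps = [tuple(discretize(v) for v in row) for row in values]
F = condition_frequency_matrix(deps[:2])   # target CFM
R = condition_frequency_matrix(deps)       # background CFM
W = condition_weight_matrix(F, R)
print(deps[0], round(llr_score(deps[0], W), 3))
```

A gene whose DEP resembles the target pattern gets a positive score and a gene that looks like the background gets a negative one, which is what lets the score rank candidate targets.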

SLIDE 21

Allegro overview

21

SLIDE 22

Yeast osmotic shock pathway

• Allegro can discover multiple motifs with diverse expression patterns, even if the response is in a small fraction of the conditions

• Extant two-step techniques recovered only 4 of the above motifs:

– K-means/CLICK + Amadeus/Weeder: RRPE, PAC, MBF, STRE
– Iclust + FIRE: RRPE, PAC, Rap1, STRE

• ~6,000 genes, 133 conditions [O’Rourke et al. ’04]

22

SLIDE 23

3’ UTR analysis: Human stem cells

• ~14,000 genes, 124 conditions (various types of proliferating cells) [Mueller et al., Nature ’08]

• Biases in length / GC-content of 3’ UTRs, e.g.:

100 highly-expressed genes in…    3’ UTR length    GC
Embryoid bodies                   584              47%
Undifferentiated ESCs             774              44%
ESC-derived fibroblasts           1240             39%
Fetal NSCs                        1422             43%

(ESCs = embryonic stem cells, NSCs = neural stem cells)

• Extant methods / Allegro with HG score: report only false positives

23

SLIDE 24

Human stem cells: results using binned score

Most highly expressed miRNAs in human/mouse ESCs

Current knowledge

Abundant & functional in neural cell lineage

Expressed specifically in neural lineage; active role in neurogenesis

[Figure panels: miRNA expression; targets expression]

miRNA expression from [Laurent ’08]

24

SLIDE 25

Yaron Orenstein, Chaim Linhart, Yonit Halperin, Igor Ulitsky

SLIDE 26

Open questions

• Better PWM inference: new scores, algs
• Richer models for in vivo / in vitro data – really helpful or diminishing return?
• How to evaluate model quality: match to literature? Ranking based? In vivo? In vitro?
• Integration of motif finding & expression
• Principled means to find motif pairs

26

SLIDE 27

27

Using expression profiles and protein networks to understand cancer I

I. Ulitsky, R. M. Karp, RECOMB 09

I. Ulitsky, A. Krishnamurthy, R. M. Karp, PLoS One 10

27

SLIDE 28

DNA chips / Microarrays

Simultaneous measurement of expression levels of all genes.

Global view of cellular processes.

> 800,000 profiles available in ArrayExpress

28

SLIDE 29

Protein-protein interactions (PPIs)

A regulates/binds to B
High throughput: abundant, noisy
Large, readily available resource

29

SLIDE 30

Case/control studies

A typical study: 100s of expression profiles of sick (case) & healthy (control) individuals

Classification: Given a partition of the samples into types, classify the types of new samples

Can the network help?

[Figure: expression matrix (genes × samples) with sick and healthy samples and an unlabeled sample (?)]

30

SLIDE 31

The network angle

Integrate case-control profiles with network information

Extract dysregulated pathways specific to the cases

Account for heterogeneity among cases
Meaningful pathway: connected

31

SLIDE 32

Preprocessing

For each gene, use the distribution of values among the controls to decide if the gene is dysregulated in each of the cases

[Figure: expression values of genes A–E in Controls 1–4 and Cases 1–3, reduced to a binary matrix (1 = gene dysregulated in that case) and drawn as a bipartite graph between genes A–E and Cases 1–3]
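One way to turn "use the distribution of values among the controls" into code is a per-gene z-score test. The slide does not say which statistic or cutoff is used, so the z-score rule, the cutoff of 2.0, the function names and the toy expression values below are all assumptions for illustration.

```python
from statistics import mean, stdev


def dysregulation_calls(controls, cases, z_cutoff=2.0):
    """For one gene, flag each case whose value lies far from the controls.
    The z-score rule and the cutoff are assumptions, not the slide's method."""
    mu = mean(controls)
    sigma = stdev(controls)
    if sigma == 0:
        return [value != mu for value in cases]
    return [abs(value - mu) / sigma >= z_cutoff for value in cases]


def build_bipartite_edges(expression, control_ids, case_ids, z_cutoff=2.0):
    """Return the set of (gene, case) edges: gene is dysregulated in case."""
    edges = set()
    for gene, values in expression.items():
        controls = [values[s] for s in control_ids]
        cases = [values[s] for s in case_ids]
        calls = dysregulation_calls(controls, cases, z_cutoff)
        for case, flagged in zip(case_ids, calls):
            if flagged:
                edges.add((gene, case))
    return edges


# Toy expression values (made up) for two genes.
control_ids = ["Control 1", "Control 2", "Control 3", "Control 4"]
case_ids = ["Case 1", "Case 2", "Case 3"]
expression = {
    "A": dict(zip(control_ids + case_ids, [1.0, 1.2, 0.8, 1.0, 3.0, 1.1, 2.9])),
    "B": dict(zip(control_ids + case_ids, [5.0, 5.5, 4.5, 5.0, 5.1, 2.0, 5.2])),
}
edges = build_bipartite_edges(expression, control_ids, case_ids)
print(sorted(edges))
```

The resulting edge set is the bipartite gene-case graph that the dysregulated-pathway search takes as input.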

32

SLIDE 33

Dysregulated pathway

Input:

– Bipartite graph: genes, cases
– Edge (gene g, case c) if g is dysregulated in c
– A network over the genes

Dysregulated pathway (DP): smallest connected subnetwork s.t. sufficiently many genes (≥ k) are dysregulated in all but few cases (≤ l)

Small pathway ⇒ focused disease explanation

Min connected set cover problem

[Figure: bipartite graph between genes A–E and Cases 1–3, with an example for k = 2, l = 1]
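The DP definition can be made concrete with an exhaustive search on a toy instance. This brute-force sketch only illustrates the definition; it is exponential in the number of genes and is not the DEGAS algorithm. The network, the dysregulation sets and all function names are hypothetical.

```python
from itertools import combinations


def is_connected(genes, network):
    """Depth-first search restricted to the chosen genes."""
    genes = set(genes)
    start = next(iter(genes))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbor in network[node]:
            if neighbor in genes and neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == genes


def covered_cases(genes, dysregulated, cases, k):
    """Cases in which at least k of the chosen genes are dysregulated."""
    return [c for c in cases
            if sum(1 for g in genes if c in dysregulated[g]) >= k]


def smallest_dysregulated_pathway(network, dysregulated, cases, k, l):
    """Smallest connected gene set covering all but at most l cases k times."""
    all_genes = sorted(network)
    for size in range(1, len(all_genes) + 1):
        for subset in combinations(all_genes, size):
            if not is_connected(subset, network):
                continue
            if len(covered_cases(subset, dysregulated, cases, k)) >= len(cases) - l:
                return set(subset)
    return None


# Toy instance (made up): a path network A-B-C-D-E and three cases.
network = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"},
           "D": {"C", "E"}, "E": {"D"}}
dysregulated = {"A": {"Case 1", "Case 2"}, "B": {"Case 1"},
                "C": {"Case 2", "Case 3"}, "D": {"Case 3"}, "E": {"Case 1"}}
cases = ["Case 1", "Case 2", "Case 3"]
pathway = smallest_dysregulated_pathway(network, dysregulated, cases, k=2, l=1)
print(sorted(pathway))
```

With k = 2 and l = 1 the search tolerates one uncovered case; setting l = 0 on the same toy data forces a larger pathway, which shows how l trades pathway size against outlier cases.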

33

SLIDE 34

Complexity

Set cover problem: Given sets of elements, find fewest sets that cover all elements

k    l     G        Problem
1    0     Clique   Set cover
k    0     Clique   Set k-cover
1    >0    Clique   Partial set cover
1    0     Any      Connected set cover (Shuai & Hu 06)

All are NP-Hard
Devised approximation and heuristic algs

DysrEgulated Gene set Analysis via Subnetworks

DEGAS
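For the plain set cover special case (k = 1, l = 0, clique network), the textbook greedy heuristic gives a logarithmic-factor approximation. The sketch below shows that standard heuristic only; it is not claimed to be one of the DEGAS algorithms, and the function name and toy sets are hypothetical. In the DP setting the elements would be cases and each gene would be the set of cases in which it is dysregulated.

```python
def greedy_set_cover(universe, sets):
    """Repeatedly pick the set that covers the most still-uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(sets, key=lambda name: len(sets[name] & uncovered))
        gained = sets[best] & uncovered
        if not gained:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Toy instance (made up): elements 1..5 and four candidate sets.
universe = {1, 2, 3, 4, 5}
sets = {"A": {1, 2, 3}, "B": {2, 4}, "C": {3, 4}, "D": {4, 5}}
cover = greedy_set_cover(universe, sets)
print(cover)
```

The connected variant is harder because each picked set must also keep the chosen genes connected in the network, which is why a greedy step that ignores connectivity is not enough on its own.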

34

SLIDE 35

Huntington Disease down-regulated pathway

Brain exp profiles of 38 patients, 32 controls (Hodges et al 06)

The most significant pathway found for k=25 (p < 0.005)

Enriched with:

– HD modifiers
– HD relevant genes
– Calcium signaling

[Figure labels: Huntingtin; Outlier]

35

SLIDE 36

Breast cancer meta-analysis

6 breast cancer studies comparing poor and good prognosis

– Van’t Veer et al. Nature 2002
– Van de Vijver et al. NEJM 2002
– Wang et al. Lancet 2005
– Minn et al. Nature 2005
– Sotiriou et al. PNAS 2003
– Pawitan et al. Breast Cancer Research 2005

Poor prognosis = metastases within 5 years
1,004 patients in total
Elements = studies
Discovered 2 significant DPs associated with poor prognosis and one associated with good prognosis

36

SLIDE 37

Poor prognosis network 1

k = 40, l = 2, p < 0.005

Enriched with cell-cycle associated genes (p = 2·10^-26) & YY1 targets (p = 2.42·10^-16)

Enriched with genes localized to the nucleus

37

SLIDE 38

Poor prognosis network 2

Found by removing network 1 and repeating the search (k=50; p < 0.005)

Also significantly enriched with cell cycle genes
Not merely a segmentation of a single network:

– Only DP2 network is strongly enriched with stem cell genes
– DP2 enriched with cytoplasmic genes

HMMR: recently discovered breast cancer risk factor

38

SLIDE 39

Summary

A method for finding subnetworks of dysregulated genes

Specific to cases, but allows outliers and exceptions

Connected set cover paradigm
Better approximations?

39

SLIDE 40

Igor Ulitsky, Whitehead Inst
Dick Karp, Berkeley
Akshay Krishnamurthy, CMU

SLIDE 41

41

Support: Israel Academy of Sciences, Wolfson Foundation, Edmond J. Safra Foundation, US-Israel BSF, German-Israeli Fund, EU 6th and 7th Frameworks, Intel, IBM, I-CORE Gene regulation & disease.

Postdocs available