Gene regulation, protein networks and disease: a computational perspective



slide-1
SLIDE 1

Gene regulation, protein networks and disease – a computational perspective

Ron Shamir, School of Computer Science, Tel Aviv University

CPM, Helsinki, July 3, 2012

slide-2
SLIDE 2

Outline

  • Finding regulatory motifs I, II, III
  • Utilizing case-control expression profiles and networks I, II
  • Chromosomal aberrations in cancer

DEGAS

slide-3
SLIDE 3

Regulation of Transcription

  • A gene’s transcription regulation is mainly encoded in the DNA in a region called the promoter
  • Each promoter contains several short DNA subsequences, called binding sites (BSs), that are bound by specific proteins called transcription factors (TFs)

[Figure: a promoter upstream (5’) of a gene, containing two binding sites (BS), each bound by a transcription factor (TF)]

slide-4
SLIDE 4

Position Weight Matrix (PWM)

Position   1     2     3     4     5     6
A          0.1   0.8   0     0.7   0.2   0
C          0     0.1   0.5   0.1   0.4   0.6
G          0     0     0.5   0.1   0.4   0.1
T          0.9   0.1   0     0.1   0     0.3

Sequence                  Score
ATGCAGGATACACCGATCGGTA    0.0605
GGAGTAGAGCAAGTCCCGTGA     0.0605
AAGACTCTACAATTATGGCGT     0.0151

Score: product of base probabilities. Need score threshold for hits.
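A minimal sketch of the scoring rule above, in Python. The matrix is the one on this slide, read with rows A, C, G, T and with empty cells taken as 0 (an assumption, chosen because every column then sums to 1). The function names `site_score`, `best_hit` and `hits` are hypothetical, and the threshold is a free parameter that the slide leaves open.

```python
# PWM read from the slide: rows are bases, columns are motif positions 1..6.
# Empty cells on the slide are assumed to be 0.
PWM = {
    "A": [0.1, 0.8, 0.0, 0.7, 0.2, 0.0],
    "C": [0.0, 0.1, 0.5, 0.1, 0.4, 0.6],
    "G": [0.0, 0.0, 0.5, 0.1, 0.4, 0.1],
    "T": [0.9, 0.1, 0.0, 0.1, 0.0, 0.3],
}
WIDTH = 6


def site_score(site, pwm=PWM):
    """Score of one candidate site: the product of its base probabilities."""
    score = 1.0
    for position, base in enumerate(site):
        score *= pwm[base][position]
    return score


def windows(sequence):
    """All length-WIDTH substrings of the sequence, left to right."""
    return [sequence[i:i + WIDTH] for i in range(len(sequence) - WIDTH + 1)]


def best_hit(sequence, pwm=PWM):
    """Highest-scoring window as a (score, site) pair."""
    return max((site_score(w, pwm), w) for w in windows(sequence))


def hits(sequence, threshold, pwm=PWM):
    """All windows whose score reaches the chosen threshold."""
    return [w for w in windows(sequence) if site_score(w, pwm) >= threshold]


for seq in ("ATGCAGGATACACCGATCGGTA", "AAGACTCTACAATTATGGCGT"):
    score, site = best_hit(seq)
    print(seq, site, round(score, 4))
```

Under this reading, the best window of each example sequence reproduces the score listed next to it.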

slide-5
SLIDE 5

I. Finding Regulatory Motifs

  • C. Linhart, Y. Halperin, Genome Research 08

slide-6
SLIDE 6

Motif discovery: The two-step strategy

Step 1: obtain a co-regulated gene set from one of:

  • Gene expression microarrays, followed by clustering (Cluster I, Cluster II, Cluster III)
  • Location analysis (ChIP-chip, …)
  • Functional group (e.g., GO term)

Step 2: motif discovery on the promoter sequences of the co-regulated gene set

slide-7
SLIDE 7

Amadeus

A Motif Algorithm for Detecting Enrichment in mUltiple Species

Supports diverse motif discovery tasks:

  1. Find over-represented motifs in given sets of genes.
  2. Identify motifs with global spatial features given only the genomic sequences.

How?

A general pipeline architecture for enumerating motifs.

Different statistical scoring schemes of motifs for different motif discovery tasks.

slide-8
SLIDE 8

Motif search algorithm

  • Pipeline of refinement phases of increased complexity
  • Phases: Preprocess, Mismatch, Merge, PWM Optimization
  • Motif model: k-mer, list of k-mers, PWM
  • Cutoff = 0.005

slide-9
SLIDE 9

Scoring over-represented motifs

  • Input: Target set (size T) = co-regulated genes; Background (BG) set (size B) = entire genome
  • Motif enrichment scoring:
      – Hyper-geometric
      – Binned enrichment score
      – Binomial

[Figure: set diagram labeled B, T, b, t]

[Figure: sequences binned by length (0.4-0.7 kbp, 0.7-1 kbp) and GC-content (20-40%, 40-60%), with per-bin counts B1..B4, T1..T4, b1..b4]
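A minimal sketch of the hyper-geometric enrichment score, in Python. It assumes that b and t count the background and target sequences containing a motif hit (the slide shows the symbols but does not define them), and it returns the upper-tail probability of seeing at least t hits in the target set. The function name `hypergeometric_tail` is hypothetical.

```python
from math import comb


def hypergeometric_tail(B, T, b, t):
    """P[X >= t] for X ~ Hypergeometric: T sequences are drawn without
    replacement from B background sequences, b of which contain the motif."""
    total = comb(B, T)
    upper = min(b, T)
    favourable = sum(comb(b, i) * comb(B - b, T - i) for i in range(t, upper + 1))
    return favourable / total


# Toy numbers (hypothetical): 10 background genes, 4 with a hit,
# a target set of 3 genes, 2 of which have a hit.
print(hypergeometric_tail(B=10, T=3, b=4, t=2))
```

A small tail probability means the motif occurs in the target set more often than expected from the background. Exact integer binomials keep the sketch dependency-free, at the cost of speed on genome-sized inputs.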

slide-10
SLIDE 10

Metazoan motif discovery benchmark:

42 target sets of 26 TFs, 8 miRNAs from 29 studies (expression, ChIP-chip, ...) in human, mouse, fly, worm.

All motifs are experimentally verified
Average target set size: 400 genes (383 Kbp)


slide-13
SLIDE 13

Amadeus – Global spatial analysis

[Figure: workflow diagram with the labels: Gene expression microarrays; Location analysis (ChIP-chip, …); Functional group (e.g., GO term); Co-regulated gene set; Promoter sequences; Output: Motif(s)]

slide-14
SLIDE 14

Task II: Global analyses

Input: Sequences (no target-set / expression data)

Motif scoring: Scores for spatial features of motif occurrences

  • Localization w.r.t. the TSS
  • Strand-bias
  • Chromosomal preference

[Figure: promoter with the TSS and the 5’ end marked]

slide-15
SLIDE 15

Global analysis: Chromosomal preference in C. elegans

Input:

  • All worm promoters (~18,000)
  • Score: chromosomal preference

Results: Novel motif on chrom IV

slide-17
SLIDE 17

II. Finding Transcriptional Programs

  • Y. Halperin, C. Linhart, I. Ulitsky, NAR 10

slide-18
SLIDE 18

Goal

Given expression profiles, find the transcriptional programs active in them:

  • the co-regulated genes,
  • the motifs that govern their co-regulation

slide-19
SLIDE 19

Our goal: bypass the two-step approach

[Figure: workflow diagram with the labels: Gene expression microarrays; Clustering (Cluster I, Cluster II, Cluster III); Co-regulated gene set; Promoter sequences; Expression data; Output: Motif(s)]

Simultaneous inference of the motifs and the expression profiles of their targets

slide-20
SLIDE 20

Allegro: expression model

  • Discretization of expression patterns

    Level        e1 = Up (U)    e2 = Same (S)    e3 = Down (D)
    Expression   ≥ 1.0          (-1.0, 1.0)      ≤ -1.0

    Gene g                               c1      c2      …    cm
    Expression pattern                   -2.3    -0.8    …    1.5
    Discrete expression pattern (DEP)    D       S       …    U

  • Condition frequency matrix (CFM), F = {f_ij}

         c1      c2     …    cm
    U    0.05    0.1    …    0.78
    S    0.9     0.2    …    0.14
    D    0.05    0.7    …    0.08

  • Condition weight matrix (CWM), W = {w_ij}

    $w_{ij} = \log \frac{f_{ij}}{r_{ij}}$    (R = {r_ij} is the BG CFM)

⇒ Log-likelihood ratio (LLR) score
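A minimal sketch of the expression model, in Python. It discretizes a pattern with the slide's cutoffs, builds the CWM as w_ij = log(f_ij / r_ij), and scores a DEP by summing the weights it selects. That summation is the usual form of a log-likelihood ratio and is assumed here, since the slide only names the score. The uniform background R and all function names are hypothetical. F holds the three columns visible on the slide (c1, c2, cm).

```python
import math


def discretize(pattern, cutoff=1.0):
    """Map each expression value to U (>= cutoff), D (<= -cutoff) or S (otherwise)."""
    return ["U" if v >= cutoff else "D" if v <= -cutoff else "S" for v in pattern]


def condition_weights(F, R):
    """CWM entries w_ij = log(f_ij / r_ij) for every level i and condition j."""
    return {lvl: [math.log(f / r) for f, r in zip(F[lvl], R[lvl])] for lvl in F}


def llr_score(dep, W):
    """Sum of the weights picked out by the DEP, one per condition (assumed form)."""
    return sum(W[level][j] for j, level in enumerate(dep))


# CFM columns c1, c2, cm as shown on the slide.
F = {"U": [0.05, 0.1, 0.78], "S": [0.9, 0.2, 0.14], "D": [0.05, 0.7, 0.08]}
# Hypothetical background CFM: every level equally likely in every condition.
R = {lvl: [1 / 3] * 3 for lvl in F}

W = condition_weights(F, R)
dep = discretize([-2.3, -0.8, 1.5])
print(dep, llr_score(dep, W))
```

A positive score means the gene's DEP is better explained by the motif's CFM than by the background; a negative one favors the background.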

slide-21
SLIDE 21

Allegro overview

slide-22
SLIDE 22

Yeast osmotic shock pathway

  • Allegro can discover multiple motifs with diverse expression patterns, even if the response is in a small fraction of the conditions
  • Extant two-step techniques recovered only 4 of the above motifs:
      – K-means/CLICK + Amadeus/Weeder: RRPE, PAC, MBF, STRE
      – Iclust + FIRE: RRPE, PAC, Rap1, STRE
  • ~6,000 genes, 133 conditions [O’Rourke et al. ’04]

slide-23
SLIDE 23

3’ UTR analysis: Human stem cells

  • ~14,000 genes, 124 conditions (various types of proliferating cells) [Mueller et al., Nature ’08]
  • Biases in length / GC-content of 3’ UTRs, e.g.:

    100 highly-expressed genes in…    3’ UTR length    GC
    Embryoid bodies                   584              47%
    Undifferentiated ESCs             774              44%
    ESC-derived fibroblasts           1240             39%
    Fetal NSCs                        1422             43%

    (ESCs = embryonic stem cells, NSCs = neural stem cells)

  • Extant methods / Allegro with HG score: report only false positives

slide-24
SLIDE 24

Human stem cells: results using binned score

[Figure: panels labeled "miRNA expression" and "targets expression"]

Current knowledge:

  • Most highly expressed miRNAs in human/mouse ESCs
  • Abundant & functional in neural cell lineage
  • Expressed specifically in neural lineage; active role in neurogenesis

miRNA expression from [Laurent ’08]

slide-25
SLIDE 25

Yaron Orenstein, Chaim Linhart, Yonit Halperin, Igor Ulitsky

slide-26
SLIDE 26

Open questions

  • Better PWM inference: new scores, algs
  • Richer models for in vivo / in vitro data – really helpful or diminishing return?
  • How to evaluate model quality: match to literature? Ranking based? In vivo? In vitro?
  • Integration of motif finding & expression
  • Principled means to find motif pairs

slide-27
SLIDE 27

Using expression profiles and protein networks to understand cancer I

  • I. Ulitsky, R. M. Karp, RECOMB 09
  • I. Ulitsky, A. Krishnamurthy, R. M. Karp, PLoS One 10

slide-28
SLIDE 28

DNA chips / Microarrays

  • Simultaneous measurement of expression levels of all genes.
  • Global view of cellular processes.
  • > 800,000 profiles available in ArrayExpress

slide-29
SLIDE 29

Protein-protein interactions (PPIs)

  • A regulates/binds to B
  • High throughput: abundant, noisy
  • Large, readily available resource


slide-30
SLIDE 30

Case/control studies

  • A typical study: 100s of expression profiles of sick (case) & healthy (control) individuals
  • Classification: Given a partition of the samples into types, classify the types of new samples
  • Can the network help?

[Figure: expression matrix of genes × samples, with samples labeled sick or healthy and a new, unlabeled sample (?)]

slide-31
SLIDE 31

The network angle

  • Integrate case-control profiles with network information
  • Extract dysregulated pathways specific to the cases
  • Account for heterogeneity among cases
  • Meaningful pathway: connected

slide-32
SLIDE 32

Preprocessing

  • For each gene, use the distribution of values among the controls to decide if the gene is dysregulated in each of the cases

[Figure: expression values of genes A to E in Controls 1 to 4 and Cases 1 to 3 are converted into a binary gene × case dysregulation matrix, drawn as a bipartite graph between genes A to E and Cases 1 to 3]
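A minimal sketch of this preprocessing step for one gene, in Python. The slide does not say which rule is applied to the control distribution. This sketch assumes a z-score rule: a case is called dysregulated when its value lies more than `z_cutoff` control standard deviations from the control mean. Both the rule and the function name `dysregulated_cases` are hypothetical.

```python
from statistics import mean, stdev


def dysregulated_cases(control_values, case_values, z_cutoff=2.0):
    """For one gene, flag each case whose value deviates from the control
    mean by more than z_cutoff control standard deviations."""
    mu = mean(control_values)
    sd = stdev(control_values)  # sample standard deviation of the controls
    return [abs(value - mu) > z_cutoff * sd for value in case_values]


# Toy values (hypothetical): one gene measured in 4 controls and 3 cases.
controls = [1.0, 1.2, 0.8, 1.0]
cases = [1.1, 2.0, 0.2]
print(dysregulated_cases(controls, cases))
```

Running this per gene yields one row of the binary gene × case matrix. Separate up- and down-regulation calls would only need the sign of the deviation.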

slide-33
SLIDE 33

Dysregulated pathway

  • Input:
      – Bipartite graph: genes, cases
      – Edge (gene g, case c) if g is dysregulated in c
      – A network over the genes
  • Dysregulated pathway (DP): smallest connected subnetwork s.t. sufficiently many genes (≥ k) are dysregulated in all but few (≤ l) cases
  • Small pathway ⇒ focused disease explanation
  • Min connected set cover problem

[Figure: bipartite graph between genes A to E and Cases 1 to 3, illustrating k = 2, l = 1]
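A minimal sketch of the DP definition, in Python. It checks whether a gene set is connected and has at least k dysregulated genes in all but at most l cases, then grows such a set greedily from a seed gene. The greedy growth is an illustrative heuristic only and is not claimed to be the DEGAS algorithm. The toy network, the dysregulation sets and all function names are hypothetical.

```python
def covered_cases(genes, dysreg, cases, k):
    """Cases in which at least k of the chosen genes are dysregulated."""
    return {c for c in cases if sum(c in dysreg.get(g, ()) for g in genes) >= k}


def is_connected(genes, network):
    """True if the chosen genes induce a connected subnetwork."""
    genes = set(genes)
    if not genes:
        return False
    stack, seen = [next(iter(genes))], set()
    while stack:
        g = stack.pop()
        if g in seen:
            continue
        seen.add(g)
        stack.extend(n for n in network.get(g, ()) if n in genes and n not in seen)
    return seen == genes


def is_dp_candidate(genes, network, dysreg, cases, k, l):
    """Connected, and at most l cases have fewer than k dysregulated genes."""
    missed = len(cases) - len(covered_cases(genes, dysreg, cases, k))
    return is_connected(genes, network) and missed <= l


def greedy_dp(seed, network, dysreg, cases, k, l):
    """Grow from a seed, adding the neighbour dysregulated in most uncovered cases."""
    chosen = {seed}
    while not is_dp_candidate(chosen, network, dysreg, cases, k, l):
        frontier = {n for g in chosen for n in network.get(g, ())} - chosen
        if not frontier:
            return None  # the seed's component cannot satisfy (k, l)
        uncovered = set(cases) - covered_cases(chosen, dysreg, cases, k)
        best = max(sorted(frontier),
                   key=lambda n: len(dysreg.get(n, set()) & uncovered))
        chosen.add(best)
    return chosen


# Toy instance (hypothetical): gene network and per-gene sets of dysregulated cases.
network = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D", "E"}, "D": {"C"}, "E": {"C"}}
dysreg = {"A": {1, 2}, "B": {1, 2}, "C": {3}, "D": {3}, "E": {2, 3}}
print(greedy_dp("A", network, dysreg, {1, 2, 3}, k=2, l=1))
```

Growing only through network neighbours keeps the set connected by construction, which is the constraint that separates this problem from plain set cover.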

slide-34
SLIDE 34

Complexity

  • Set cover problem: Given sets of elements, find fewest sets that cover all elements

    k    l     G         Problem
    1    0     Clique    Set cover
    k    0     Clique    Set k-cover
    1    >0    Clique    Partial set cover
    1    0     Any       Connected set cover (Shuai & Hu 06)

  • All are NP-Hard
  • Devised approximation and heuristic algs

DEGAS: DysrEgulated Gene set Analysis via Subnetworks
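A minimal sketch of the classic greedy approximation for plain set cover (the k = 1, l = 0, clique case), in Python. It is shown only to make the problem concrete and is not claimed to be one of the algorithms devised for DEGAS. The function name `greedy_set_cover` and the toy sets are hypothetical.

```python
def greedy_set_cover(universe, sets):
    """Repeatedly pick the set covering the most still-uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical by set name)
        name = max(sorted(sets), key=lambda s: len(sets[s] & uncovered))
        if not sets[name] & uncovered:
            raise ValueError("the given sets cannot cover the universe")
        chosen.append(name)
        uncovered -= sets[name]
    return chosen


# Toy instance (hypothetical): elements 1..5 and four candidate sets.
sets = {"A": {1, 2, 3}, "B": {2, 4}, "C": {3, 4}, "D": {4, 5}}
print(greedy_set_cover({1, 2, 3, 4, 5}, sets))
```

In the pathway setting, genes play the role of sets and cases the role of elements. The connectivity requirement over the gene network is what this plain version leaves out.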

slide-35
SLIDE 35

Huntington Disease down-regulated pathway

  • Brain expression profiles of 38 patients, 32 controls (Hodges et al 06)
  • The most significant pathway found for k=25 (p < 0.005)
  • Enriched with:
      – HD modifiers
      – HD relevant genes
      – Calcium signaling

[Figure: the pathway subnetwork, with the labels "Huntingtin" and "outlier"]

slide-36
SLIDE 36

Breast cancer meta-analysis

  • 6 breast cancer studies comparing poor and good prognosis
      – Van’t Veer et al. Nature 2002
      – Van de Vijver et al. NEJM 2002
      – Wang et al. Lancet 2005
      – Minn et al. Nature 2005
      – Sotiriou et al. PNAS 2003
      – Pawitan et al. Breast Cancer Research 2005
  • Poor prognosis = metastases within 5 years
  • 1,004 patients in total
  • Elements = studies
  • Discovered 2 significant DPs associated with poor prognosis and one associated with good prognosis

slide-37
SLIDE 37

Poor prognosis network 1

  • k = 40, l = 2, p < 0.005
  • Enriched with cell-cycle associated genes (p = 2·10^−26) & YY1 targets (p = 2.42·10^−16)
  • Enriched with genes localized to the nucleus

slide-38
SLIDE 38

Poor prognosis network 2

  • Found by removing network 1 and repeating the search (k=50; p < 0.005)
  • Also significantly enriched with cell cycle genes
  • Not merely a segmentation of a single network:
      – Only DP2 network is strongly enriched with stem cell genes
      – DP2 enriched with cytoplasmic genes
  • HMMR: recently discovered breast cancer risk factor

slide-39
SLIDE 39

Summary

  • A method for finding subnetworks of dysregulated genes
  • Specific to cases, but allows outliers and exceptions
  • Connected set cover paradigm
  • Better approximations??

slide-40
SLIDE 40

Igor Ulitsky, Whitehead Inst
Dick Karp, Berkeley
Akshay Krishnamurthy, CMU

slide-41
SLIDE 41

Support: Israel Academy of Sciences, Wolfson Foundation, Edmond J. Safra Foundation, US-Israel BSF, German-Israeli Fund, EU 6th and 7th Frameworks, Intel, IBM, I-CORE Gene regulation & disease.

Postdocs available