CS345a: Data Mining. Jure Leskovec and Anand Rajaraman, Stanford University.


SLIDE 1

CS345a: Data Mining
Jure Leskovec and Anand Rajaraman
Stanford University

SLIDE 2

Feature selection:
  • Given a set of features X1, …, Xn
  • Want to predict Y from a subset A = (Xi1, …, Xik)
  • What are the k most informative features?

Active learning:
  • Want to predict a medical condition
  • Each test has a cost (but also reveals information)
  • Which tests should we perform to make the most effective decisions?

SLIDE 3

Influence maximization:
  • In a social network, which nodes should we advertise to?
  • Which are the most influential blogs?

Sensor placement:
  • Given a water distribution network
  • Where should we place sensors to quickly detect contaminations?

SLIDE 4

Given:
  • A finite set V
  • A function F: 2^V → ℝ

Want:
  A* = argmax_A F(A)
  s.t. some constraints on A

For example:
  • Influence maximization: V = nodes of the network, F(A) = expected size of the cascade started at A
  • Sensor placement: V = possible sensor locations, F(A) = sensing quality of placement A
  • Feature selection: V = features, F(A) = informativeness I(A; Y)

SLIDE 5

Given random variables Y, X1, …, Xn
Want to predict Y from a subset A = (Xi1, …, Xik)

Naïve Bayes model: P(Y, X1, …, Xn) = P(Y) ∏i P(Xi | Y)
(Figure: class variable Y "Sick" with observed features X1 "Fever", X2 "Rash", X3 "Cough")

Want the k most informative features:
  A* = argmax_A I(A; Y)  s.t. |A| ≤ k
  where I(A; Y) = H(Y) − H(Y | A)
  (H(Y): uncertainty before knowing A; H(Y | A): uncertainty after knowing A)

SLIDE 6

Given: a finite set V of features and a utility function F(A) = I(A; Y)
(Figure: Y "Sick" with features X1 "Fever", X2 "Rash", X3 "Cough")

Want: A* ⊆ V such that
  A* = argmax_A F(A)  s.t. |A| ≤ k
Typically NP-hard!

Greedy hill-climbing:
  Start with A0 = {}
  For i = 1 to k:
   s* = argmax_s F(Ai-1 ∪ {s})
   Ai = Ai-1 ∪ {s*}

How well does this simple heuristic do?
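
To make the greedy procedure above concrete, here is a minimal Python sketch (not from the slides). It assumes only a set-function oracle F and a finite ground set V; since the mutual-information oracle is not spelled out here, a toy coverage function stands in for F, and the `sets` dictionary is purely hypothetical.

```python
def greedy_maximize(F, V, k):
    """Greedy hill-climbing: repeatedly add the element with the largest
    marginal gain F(A | {s}) - F(A), up to k elements."""
    A = set()
    for _ in range(k):
        best_s, best_gain = None, float("-inf")
        for s in V - A:
            gain = F(A | {s}) - F(A)
            if gain > best_gain:
                best_s, best_gain = s, gain
        if best_s is None:
            break
        A.add(best_s)
    return A

# Toy coverage utility (monotone submodular): F(A) = size of the union
# of the chosen sets. The sets themselves are made up for illustration.
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1, 7}}
F = lambda A: len(set().union(*(sets[s] for s in A))) if A else 0

print(greedy_maximize(F, set(sets), k=2))  # picks {'c', 'a'}
```

Note that this naive loop evaluates F on the order of n·k times; that cost is exactly what the lazy evaluations later in the deck are designed to avoid.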

SLIDE 7

Greedy hill climbing produces a solution A where F(A) ≥ (1 − 1/e) of the optimal value (~63%)
[Nemhauser, Fisher, Wolsey '78]

The claim holds for functions F with 2 properties:
  • F is monotone: if A ⊆ B then F(A) ≤ F(B), and F({}) = 0
  • F is submodular: adding an element to a set gives less improvement than adding it to one of its subsets

SLIDE 8

Definition:
A set function F on V is called submodular if:
  For all A, B ⊆ V:  F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B)
(Figure: Venn diagram of A and B, illustrating A ∪ B and A ∩ B)

SLIDE 9

Diminishing returns characterization

Definition:
A set function F on V is called submodular if:
  For all A ⊆ B and s ∉ B:
  F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B)
  (gain of adding s to a small set) ≥ (gain of adding s to a large set)
(Figure: adding s to the small set A gives a large improvement; adding s to the large set B gives a small improvement)
SLIDE 10

Given random variables X1, …, Xn
Mutual information:
  F(A) = I(A; V\A) = H(V\A) − H(V\A | A) = Σy,A P(y, A) [log P(y | A) − log P(y)]

Mutual information F(A) is submodular [Krause-Guestrin '05]:
  F(A ∪ {s}) − F(A) = H(s | A) − H(s | V\(A ∪ {s}))
  • A ⊆ B ⇒ H(s | A) ≥ H(s | B)
  • "Information never hurts"

SLIDE 11

Let Y = Σi αi Xi + ε, where (X1, …, Xn, ε) ~ N(·; μ, Σ)
Want to pick a subset A to predict Y
  • Var(Y | XA = xA): conditional variance of Y given XA = xA
  • Expected variance: Var(Y | XA) = ∫ p(xA) Var(Y | XA = xA) dxA
  • Variance reduction: FV(A) = Var(Y) − Var(Y | XA)

Then [Das-Kempe '08]:
  • FV(A) is monotonic
  • FV(A) is submodular* (*under some conditions on Σ)
  • Orthogonal matching pursuit [Tropp, Donoho] is near optimal!
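
As a side illustration of the variance-reduction objective (not from the slides): for jointly Gaussian variables, Var(Y | XA) is a Schur complement of the covariance matrix and does not depend on the observed values, so FV(A) can be computed directly from Σ. The covariance values below are made up for the example.

```python
import numpy as np

def variance_reduction(Sigma, y_idx, A):
    """F_V(A) = Var(Y) - Var(Y | X_A) for jointly Gaussian variables.
    Sigma: joint covariance matrix, y_idx: index of Y, A: list of indices."""
    var_y = Sigma[y_idx, y_idx]
    if not A:
        return 0.0
    A = list(A)
    S_AA = Sigma[np.ix_(A, A)]
    S_yA = Sigma[y_idx, A]
    # For Gaussians the conditional variance is the Schur complement.
    cond_var = var_y - S_yA @ np.linalg.solve(S_AA, S_yA)
    return var_y - cond_var

# Made-up covariance over (X1, X2, X3, Y); Y has index 3.
Sigma = np.array([[1.0, 0.2, 0.1, 0.5],
                  [0.2, 1.0, 0.2, 0.3],
                  [0.1, 0.2, 1.0, 0.1],
                  [0.5, 0.3, 0.1, 1.0]])
print(variance_reduction(Sigma, 3, [0]))     # gain from observing X1
print(variance_reduction(Sigma, 3, [0, 1]))  # gain from observing X1 and X2
```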

SLIDE 12

F1, …, Fm submodular functions on V, and λ1, …, λm > 0
Then: F(A) = Σi λi Fi(A) is submodular!
Submodularity is closed under nonnegative linear combinations.

Extremely useful fact:
  • Fθ(A) submodular ⇒ Σθ P(θ) Fθ(A) submodular!
  • Multicriterion optimization: F1, …, Fm submodular, λi > 0 ⇒ Σi λi Fi(A) submodular

SLIDE 13

Each element covers some area
Observation: diminishing returns
(Figure: a new element S'. With A = {S1, S2}, adding S' helps a lot; with B = {S1, S2, S3, S4}, adding S' helps very little.)

SLIDE 14

F is submodular: A ⊆ B ⇒
  F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B)
  (gain of adding a set s to a small solution) ≥ (gain of adding s to a large solution)

Natural example:
  • Sets s1, s2, …, sn
  • F(A) = size of the union of the sets in A (size of the covered area)
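
A quick way to convince yourself of this claim is to brute-force the diminishing-returns inequality on a small coverage instance; this is only a sanity-check sketch, and the specific sets below are invented for illustration.

```python
from itertools import combinations

# Coverage utility from the slide: F(A) = size of the union of the sets in A.
sets = {"s1": {1, 2, 3}, "s2": {3, 4, 5}, "s3": {5, 6}, "s4": {6, 7, 8}}
F = lambda A: len(set().union(*(sets[s] for s in A))) if A else 0

def has_diminishing_returns(F, V):
    """Brute-force check of F(A | {s}) - F(A) >= F(B | {s}) - F(B)
    for every A subset of B and every s outside B."""
    subsets = [set(c) for r in range(len(V) + 1) for c in combinations(V, r)]
    for B in subsets:
        for A in (X for X in subsets if X <= B):
            for s in V - B:
                if F(A | {s}) - F(A) < F(B | {s}) - F(B):
                    return False
    return True

print(has_diminishing_returns(F, set(sets)))  # True: coverage is submodular
```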

SLIDE 15

Most influential set of size k: the set S of k nodes producing the largest expected cascade size F(S) if activated [Domingos-Richardson '01]
(Figure: a small social network with activation probabilities 0.2-0.4 on the edges)

Optimization problem:
  max_S F(S)  s.t. |S| = k

SLIDE 16

Fix outcome i of the coin flips
Let Fi(S) be the size of the cascade from S given these coin flips
  • Let Fi(v) = set of nodes reachable from v on live-edge paths
  • Fi(S) = size of the union of the Fi(v) → Fi is submodular
  • F = Σi Fi → F is submodular [Kempe-Kleinberg-Tardos '03]
(Figure: the same network under one outcome of the edge coin flips)
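
The live-edge argument also suggests a simple way to estimate F(S) in practice: sample outcomes of the coin flips and average the reachable-set sizes. The sketch below is not from the slides; it assumes an independent-cascade-style model with a per-edge activation probability, and the toy edge list is hypothetical.

```python
import random

def estimate_spread(edges, S, trials=1000, seed=0):
    """Monte Carlo estimate of the expected cascade size F(S): sample
    live-edge graphs (each directed edge (u, v) with probability p is kept
    with probability p), then count the nodes reachable from the seeds S."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # Sample one outcome of the coin flips (a live-edge graph).
        live = {}
        for (u, v), p in edges.items():
            if rng.random() < p:
                live.setdefault(u, []).append(v)
        # Traverse live edges from the seed set.
        reached, frontier = set(S), list(S)
        while frontier:
            u = frontier.pop()
            for v in live.get(u, []):
                if v not in reached:
                    reached.add(v)
                    frontier.append(v)
        total += len(reached)
    return total / trials

# Hypothetical toy graph with activation probabilities on directed edges.
edges = {("a", "b"): 0.4, ("b", "c"): 0.3, ("a", "d"): 0.2,
         ("d", "e"): 0.3, ("c", "e"): 0.2}
print(estimate_spread(edges, {"a"}))
```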

SLIDE 17

Given a real city water distribution network
And data on how contaminants spread in the network
Problem posed by the US Environmental Protection Agency
(Figure: water network with candidate sensor locations S)

SLIDE 18

[Leskovec et al., KDD '07]
Real metropolitan-area water network:
  • V = 21,000 nodes
  • E = 25,000 pipes

Water flow simulator provided by the EPA
3.6 million contamination events
Multiple objectives:
  • Detection time, affected population, …
Place sensors that detect well "on average"

SLIDE 19

Utility of placing sensors:
  • Water flow dynamics, demands of households, …
For each subset A ⊆ V compute the utility F(A)
(Figure: the model predicts high/medium/low impact for each contamination location; a sensor reduces impact through early detection. V = set of all network junctions. One placement {S1, …, S4} has high sensing quality F(A) = 0.9, another has low sensing quality F(A) = 0.01.)

SLIDE 20

Given:
  • Graph G(V, E), budget B
  • Data on how outbreaks o1, …, oi, …, oK spread over time

Select a set of nodes A maximizing the expected reward (the reward for detecting outbreak i), subject to cost(A) ≤ B

SLIDE 21

Cost:
  • The cost of monitoring is node dependent

Reward:
  • Minimize the number of affected nodes
  • If A are the monitored nodes, let R(A) denote the number of nodes we save

SLIDE 22

Claim [Krause et al. '08]:
  • The reward function is submodular

Consider cascade i:
  • Ri(uk) = set of nodes saved when cascade i is detected at uk
  • Ri(A) = size of the union of the Ri(uk), uk ∈ A → Ri is submodular
(Figure: cascade i with monitored nodes u1, u2 and saved sets Ri(u1), Ri(u2))

Global optimization:
  • R(A) = Σi Prob(i) Ri(A)
  • ⇒ R is submodular

SLIDE 23

(Figure: water network; population protected F(A) (higher is better) vs. number of sensors placed, comparing the greedy solution to the offline (Nemhauser) bound)

The (1 − 1/e) bound is quite loose… can we get better bounds?

SLIDE 24

Suppose A is some solution to argmax_A F(A) s.t. |A| ≤ k, and A* = {s1, …, sk} is the OPT solution

Then: For each s ∈ V\A, let δs = F(A ∪ {s}) − F(A)
Order so that δ1 ≥ δ2 ≥ … ≥ δn
Then: F(A*) ≤ F(A) + Σi=1..k δi
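
This bound is easy to evaluate alongside any candidate solution. A minimal sketch (not from the slides), again using the coverage toy from earlier as a stand-in utility:

```python
def data_dependent_bound(F, V, A, k):
    """Upper bound from this slide: F(A*) <= F(A) plus the sum of the
    k largest marginal gains of the elements not yet in A."""
    base = F(A)
    deltas = sorted((F(A | {s}) - base for s in V - A), reverse=True)
    return base + sum(deltas[:k])

# Coverage toy again as a stand-in utility.
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1, 7}}
F = lambda A: len(set().union(*(sets[s] for s in A))) if A else 0

A = {"c"}  # some candidate solution for k = 2
print(F(A), data_dependent_bound(F, set(sets), A, k=2))  # 4, and an upper bound of 8 on OPT
```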

SLIDE 25

(Figure: sensing quality F(A) (higher is better) vs. number of sensors placed, comparing the greedy solution, the offline (Nemhauser) bound, and the data-dependent bound)

Submodularity gives data-dependent bounds on the performance of any algorithm.

SLIDE 26

13 participants
Performance measured in 30 different criteria
(Figure: total score (higher is better) for each entry; G: genetic algorithm, H: other heuristic, D: domain knowledge, E: "exact" method (MIP))

24% better performance than the runner-up!

SLIDE 27

Simulated 3.6M contaminations on 40 machines for 2 weeks [Krause et al. '08]
  • 152 GB of simulation data
  • 16 GB in RAM (compressed)

Very accurate computation of F(A)
Very slow evaluation of F(A):
  • Would take 6 weeks for all 30 settings

SLIDE 28

The hill-climbing algorithm is slow:
  • At each iteration we need to re-evaluate the gains of all sensors
  • It scales as O(n·k)
Naive greedy: add the element with the highest marginal gain
(Figure: marginal rewards of elements a-e; running time in minutes (lower is better) vs. number of sensors selected, for exhaustive search (all subsets) and naive greedy)

SLIDE 29

In round i+1:
  • We have so far picked Ai = {s1, …, si}
  • Pick si+1 = argmax_s F(Ai ∪ {s}) − F(Ai)
   i.e., maximize the "marginal benefit" δs(Ai) = F(Ai ∪ {s}) − F(Ai)

Observation: Submodularity implies
  i ≤ j ⇒ δs(Ai) ≥ δs(Aj)
Marginal benefits δs never increase!

SLIDE 30

Lazy hill climbing algorithm:
  • First iteration as usual
  • Keep an ordered list of marginal benefits δi from the previous iteration
  • Re-evaluate δi only for the top element
  • If δi stays on top, use it; otherwise re-sort
(Figure: elements a-e ordered by benefit δs(A); the ordering is updated lazily across iterations)
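
A compact way to implement this is a max-heap of possibly stale marginal gains, re-evaluating only the popped top element. The sketch below is a simplified lazy-greedy implementation in the spirit of the idea on this slide, not the authors' code; it reuses the coverage toy as the utility.

```python
import heapq

def lazy_greedy(F, V, k):
    """Lazy hill climbing: keep marginal gains in a max-heap; re-evaluate
    only the top element and accept it if it stays on top (valid because
    submodularity guarantees the gains can only shrink)."""
    A, base = set(), F(set())
    # Heap entries: (-gain, element, round in which the gain was computed).
    heap = [(-(F({s}) - base), s, 0) for s in V]
    heapq.heapify(heap)
    for it in range(1, k + 1):
        while True:
            neg_gain, s, last = heapq.heappop(heap)
            if last == it:            # gain is already fresh for this round
                A.add(s)
                base += -neg_gain
                break
            gain = F(A | {s}) - base  # re-evaluate just this element
            heapq.heappush(heap, (-gain, s, it))
    return A

# Same coverage stand-in utility as before.
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1, 7}}
F = lambda A: len(set().union(*(sets[s] for s in A))) if A else 0
print(lazy_greedy(F, set(sets), k=2))  # matches the plain greedy result
```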

SLIDE 31

Using "lazy evaluations" [Krause et al. '08]:
  • 1 hour / 20 sensors
Done in 2 days!
(Figure: running time in minutes (lower is better) vs. number of sensors selected, for exhaustive search (all subsets), naive greedy, and lazy greedy)

SLIDE 32

For each s ∈ V, let c(s) > 0 be its cost (e.g., feature acquisition costs, …)
Cost of a set: C(A) = Σs∈A c(s) (a modular function)

Want to solve:
  A* = argmax_A F(A)  s.t. C(A) ≤ B (budget)

Cost-benefit greedy algorithm:
  Start with A = {}
  While there is an s ∈ V\A s.t. C(A ∪ {s}) ≤ B:
   s* = argmax_s [F(A ∪ {s}) − F(A)] / c(s) over the feasible s
   A = A ∪ {s*}
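
A minimal sketch of the cost-benefit greedy rule (not from the slides); the per-element costs below are hypothetical, and the coverage toy again stands in for F.

```python
def cost_benefit_greedy(F, V, cost, B):
    """Cost-benefit greedy: repeatedly add the affordable element with the
    largest marginal gain per unit cost, until the budget B is exhausted."""
    A, spent = set(), 0.0
    while True:
        feasible = [s for s in V - A if spent + cost[s] <= B]
        if not feasible:
            return A
        s_star = max(feasible, key=lambda s: (F(A | {s}) - F(A)) / cost[s])
        A.add(s_star)
        spent += cost[s_star]

# Toy coverage utility with hypothetical per-element costs.
sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1, 7}}
cost = {"a": 3.0, "b": 1.0, "c": 4.0, "d": 1.5}
F = lambda A: len(set().union(*(sets[s] for s in A))) if A else 0
print(cost_benefit_greedy(F, set(sets), cost, B=5.0))
```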

SLIDE 33

Consider the following problem:
  max_A F(A)  s.t. C(A) ≤ 1

  Set A   F(A)   C(A)
  {a}     2ε     ε
  {b}     1      1

Cost-benefit greedy picks a. Then it cannot afford b!
⇒ Cost-benefit greedy performs arbitrarily badly!

SLIDE 34

Theorem [Leskovec-Krause et al. '07]:
  • A_CB: cost-benefit greedy solution, and
  • A_UC: unit-cost greedy solution (i.e., ignore costs)
Then: max { F(A_CB), F(A_UC) } ≥ ½ (1 − 1/e) OPT

Can still compute online bounds and speed up using lazy evaluations.

Note: Can also get
  • a (1 − 1/e) approximation in time O(n^4) [Sviridenko '04]
  • slightly better than ½(1 − 1/e) in O(n^2) [Wolsey '82]
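
To apply the theorem in code, one can run both greedy variants and keep whichever solution scores higher. This short sketch is only an illustration of the rule, not the authors' implementation, and it reuses `cost_benefit_greedy` from the earlier example.

```python
def unit_cost_greedy(F, V, cost, B):
    """Plain greedy by raw marginal gain, adding only elements that still fit in budget B."""
    A, spent = set(), 0.0
    while True:
        feasible = [s for s in V - A if spent + cost[s] <= B]
        if not feasible:
            return A
        s_star = max(feasible, key=lambda s: F(A | {s}) - F(A))
        A.add(s_star)
        spent += cost[s_star]

def best_of_both(F, V, cost, B):
    """max{F(A_CB), F(A_UC)} is at least 1/2 (1 - 1/e) of OPT for monotone submodular F."""
    a_cb = cost_benefit_greedy(F, V, cost, B)   # from the earlier sketch
    a_uc = unit_cost_greedy(F, V, cost, B)
    return max((a_cb, a_uc), key=F)
```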

SLIDE 35

[Leskovec-Krause et al. '07]
I have 10 minutes. Which blogs should I read to be most up to date?
Who are the most influential bloggers?

SLIDE 36

Want to read things before others do.
(Figure: one selection detects the blue and yellow stories soon but misses the red one; another detects all stories, but late.)

SLIDE 37

Blog selection, ~45k blogs
(Figure, left: cascades captured (higher is better) vs. number of blogs selected, comparing the submodular algorithm ("Our Alg") with heuristics: # in-links, all out-links, # posts, random. Right: running time in seconds (lower is better) vs. number of blogs selected, for exhaustive search (all subsets), naive greedy, and lazy greedy.)

Submodular formulation outperforms heuristics
700x speedup using lazy evaluations

SLIDE 38

Naïve approach: just pick the 10 best blogs
Selects big, well-known blogs (Instapundit, etc.)
These contain many posts and take long to read!
(Figure: cascades captured vs. number of posts (time) allowed, comparing cost/benefit analysis with ignoring cost)

SLIDE 39

Maximization of submodular functions:
  • NP-hard
  • But can use greedy hill climbing to get ~63% of OPT

Minimization of submodular functions:
  • Polynomial-time solvable
  • Best known algorithm: O(n^5) function evaluations

SLIDE 40

A set function F on V is called submodular if:
  1) For all A, B ⊆ V: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B)
  2) For all A ⊆ B, s ∉ B: F(A ∪ {s}) − F(A) ≥ F(B ∪ {s}) − F(B)

F is called supermodular if −F is submodular
F is called modular if F is both sub- and supermodular
  E.g., for modular ("additive") F: F(A) = Σi∈A w(i)

SLIDE 41

Optimize the worst case:
  • [Krause et al. '07]

Online maximization of submodular functions:
  • [Golovin-Streeter '08]
(Figure: at each time step t = 1, …, T, pick a set At; a submodular function Ft is then revealed and reward rt = Ft(At) is collected. Goal: Total Σt rt → max.)
slide-42
SLIDE 42

 Most of the slides borrowed from  Most of the slides borrowed from

Andreas Krause

 http://www blogcascades org  http://www.blogcascades.org  http://www.submodularity.org

3/9/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 43