CS345a: Data Mining
Jure Leskovec and Anand Rajaraman
Stanford University
Feature selection:
Given a set of features X1, …, Xn
Want to predict Y from a subset A of X1, …, Xn
3/9/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 2
Naïve Bayes Model: P(Y, X1, …, Xn) = P(Y) ∏i P(Xi | Y)
Uncertainty before knowing A
Uncertainty after knowing A
Example: Y = “Sick”; X1 = “Fever”, X2 = “Rash”, X3 = “Cough”
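A minimal sketch of how a feature subset A could be scored on this example, assuming the score is the information gain F(A) = H(Y) − H(Y | XA), that is, uncertainty before knowing A minus uncertainty after knowing A, measured in bits. The probabilities, the lowercase feature keys and the helper names `entropy`, `joint` and `info_gain` are hypothetical and chosen only for illustration.

```python
import itertools
import math

# Hypothetical numbers for illustration only (not taken from the slides).
P_Y = {1: 0.3, 0: 0.7}                 # P(Y = sick), P(Y = healthy)
P_X_GIVEN_Y = {                        # P(Xi = 1 | Y = y)
    "fever": {1: 0.8, 0: 0.1},
    "rash":  {1: 0.3, 0: 0.05},
    "cough": {1: 0.6, 0: 0.3},
}

def entropy(dist):
    """Shannon entropy (in bits) of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def joint(y, assignment):
    """Naive Bayes joint P(Y = y, XA = assignment) = P(y) * prod_i P(xi | y)."""
    p = P_Y[y]
    for name, x in assignment.items():
        p1 = P_X_GIVEN_Y[name][y]
        p *= p1 if x == 1 else 1.0 - p1
    return p

def info_gain(features):
    """F(A) = H(Y) - H(Y | XA) for a list of feature names A."""
    h_before = entropy(P_Y.values())
    h_after = 0.0
    # Sum over every 0/1 assignment of the selected features.
    for values in itertools.product([0, 1], repeat=len(features)):
        assignment = dict(zip(features, values))
        p_xy = [joint(y, assignment) for y in P_Y]
        p_x = sum(p_xy)
        if p_x > 0:
            h_after += p_x * entropy([p / p_x for p in p_xy])
    return h_before - h_after

for subset in ([], ["fever"], ["fever", "cough"]):
    print(subset, round(info_gain(subset), 4))
```

Because the features are conditionally independent given Y, leaving a feature out of `assignment` is the same as summing it out, which is why the joint only multiplies over the selected features.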
Start with A0 = {}
For i = 1 to k
    s* = argmaxs F(Ai‐1 ∪ {s})
    Ai = Ai‐1 ∪ {s*}
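The greedy loop above can be sketched in Python as follows. This is an illustrative version, not the lecture's code: the function name `greedy`, the toy objective `coverage` and the contents of `SETS` are assumptions made for the example.

```python
def greedy(F, V, k):
    """Pick up to k elements of V, each time adding the one that maximizes F."""
    A = []                                    # chosen elements, in pick order
    for _ in range(k):
        candidates = sorted(set(V) - set(A))  # sorted only for reproducible ties
        if not candidates:
            break
        # s* = argmax_s F(A ∪ {s})
        s_star = max(candidates, key=lambda s: F(set(A) | {s}))
        A.append(s_star)
    return A

# Hypothetical objective: number of items covered by the chosen sets.
SETS = {"s1": {1, 2, 3}, "s2": {3, 4}, "s3": {4, 5, 6, 7}, "s4": {1, 2}}

def coverage(chosen):
    return len(set().union(*(SETS[s] for s in chosen)))

picked = greedy(coverage, SETS, k=2)
print(picked, coverage(picked))   # ['s3', 's1'] 7
```

Each round evaluates F once per remaining candidate, so the sketch makes on the order of k·|V| calls to F.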
Gain of adding s to a small set: large improvement
Gain of adding s to a large set: small improvement
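Written as an inequality, and assuming the small set A is contained in the large set B and s lies outside B, the comparison reads as below. This diminishing-returns property is usually called submodularity. The statement is the standard textbook form, given here as an aid and not as a transcription of the slide.

```latex
F(A \cup \{s\}) - F(A) \;\ge\; F(B \cup \{s\}) - F(B),
\qquad A \subseteq B \subseteq V,\quad s \in V \setminus B
```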
*under some conditions on
New element: S’
A = {S1, S2}: adding S’ helps a lot
B = {S1, S2, S3, S4}: adding S’ helps very little
Gain of adding a set s to a small solution
Gain of adding a set s to a large solution
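A small numeric check of the set example with A = {S1, S2} and B = {S1, S2, S3, S4}, assuming the objective counts how many elements the chosen sets cover. The element contents of each set are hypothetical; only the names come from the slide.

```python
# Hypothetical element sets; only the names S1..S4 and S' come from the slide.
SETS = {
    "S1": {1, 2, 3},
    "S2": {3, 4, 5},
    "S3": {5, 6, 7},
    "S4": {7, 8, 9},
    "S'": {2, 4, 6, 8, 10},
}

def coverage(names):
    """Number of distinct elements covered by the named sets."""
    covered = set()
    for name in names:
        covered |= SETS[name]
    return len(covered)

def gain(new, solution):
    """Marginal gain of adding the set `new` to `solution`."""
    return coverage(set(solution) | {new}) - coverage(solution)

A = {"S1", "S2"}
B = {"S1", "S2", "S3", "S4"}

print(gain("S'", A))   # 3: S' adds elements 6, 8 and 10
print(gain("S'", B))   # 1: only element 10 is new
```

With these numbers S’ contributes three new elements to the small solution but only one to the large solution, which is the pattern the slide describes.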
[Figure: example network on nodes a through i with edge weights between 0.2 and 0.4; a set S of size k is selected]
[Leskovec et al., KDD ’07]
[Figure: water network. Model predicts high-, medium-, and low-impact locations for a contamination; sensors S1 to S4 are placed on the set V of all network junctions]
Sensor reduces impact through early detection!
High sensing quality: F(A) = 0.9
Low sensing quality: F(A) = 0.01
Reward for detecting outbreak i
[Figure: cascade i reaching nodes u1 and u2, with rewards Ri(u1) and Ri(u2)]
[Figure: population protected F(A) (higher is better) vs. number of sensors placed, water network]
[Figure: sensing quality F(A) (higher is better) vs. number of sensors placed]
[Figure: total score (higher is better); entries labeled D, E, G, H]
Greedy: add element with highest marginal gain
[Figure: reward of elements a, b, c, d, e]
[Figure: running time in minutes (lower is better) vs. number of sensors selected, for exhaustive search (all subsets) and naive greedy]
Marginal benefit of s: δs(Ai) ≥ δs(Ai+1)
[Figure: elements a, b, c, d, e ordered by benefit δs(A)]
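The "lazy" variant of greedy selection is commonly implemented with a priority queue of marginal benefits. The sketch below assumes F has diminishing returns, so a benefit computed against an earlier, smaller solution is an upper bound on the current one and only the top of the queue needs re-evaluation. The names `lazy_greedy`, `coverage` and `SETS` and the data are hypothetical.

```python
import heapq

def lazy_greedy(F, V, k):
    """Greedy selection that re-evaluates only the most promising element."""
    A = []
    current = F(set())
    # Max-heap via negated gains; stored gains are upper bounds on true gains.
    heap = [(-(F({s}) - current), s) for s in sorted(V)]
    heapq.heapify(heap)
    while heap and len(A) < k:
        _, s = heapq.heappop(heap)
        gain = F(set(A) | {s}) - current      # fresh marginal benefit of s
        if not heap or gain >= -heap[0][0]:
            # s still beats every other upper bound, so it is the best choice.
            A.append(s)
            current += gain
        else:
            # The bound was stale; push the updated value back and retry.
            heapq.heappush(heap, (-gain, s))
    return A

# Hypothetical objective: number of items covered by the chosen sets.
SETS = {"s1": {1, 2, 3}, "s2": {3, 4}, "s3": {4, 5, 6, 7}, "s4": {1, 2}}

def coverage(chosen):
    return len(set().union(*(SETS[s] for s in chosen)))

picked = lazy_greedy(coverage, SETS, k=2)
print(picked, coverage(picked))   # ['s3', 's1'] 7
```

When the top element keeps its place after re-evaluation, the round finishes with a single call to F instead of one call per remaining candidate, which is where the running-time savings come from.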
[Figure: running time in minutes (lower is better) vs. number of sensors selected, for exhaustive search (all subsets), naive greedy, and lazy]
Blog selection, ~45k blogs
[Figure: running time in seconds (lower is better) vs. number of blogs selected, for exhaustive search (all subsets), naive greedy, and lazy]
[Figure: cascades captured (higher is better) vs. number of blogs, for Our Alg, in-links, all outlinks, # posts, and random]
[Figure: cascades captured vs. number of posts (time) allowed (x 10^4), for cost/benefit analysis and ignoring cost]
Time:      1, 2, 3, …, T
Pick sets: A1, A2, A3, …, AT
SFs:       F1, F2, F3, …, FT
Reward:    r1 = F1(A1), r2, r3, …, rT
Total: Σt rt → max