

  1. CS345a: Data Mining. Jure Leskovec and Anand Rajaraman, Stanford University

  2. Feature selection:
     - Given a set of features X1, ..., Xn
     - Want to predict Y from a subset A = (X_i1, ..., X_ik)
     - What are the k most informative features?
     Active learning:
     - Want to predict a medical condition
     - Each test has a cost (but also reveals information)
     - Which tests should we perform to make the most effective decisions?

  3. Influence maximization:
     - In a social network, which nodes should we advertise to?
     - Which are the most influential blogs?
     Sensor placement:
     - Given a water distribution network
     - Where should we place sensors to quickly detect contaminations?

  4. Given:
     - A finite set V
     - A function F: 2^V → R
     Want: A* = argmax_A F(A), subject to some constraints on A
     For example:
     - Influence maximization: V = nodes of the network, F(A) = expected size of the cascade started from A
     - Sensor placement: V = possible sensor locations, F(A) = sensing quality of placement A
     - Feature selection: V = features, F(A) = informativeness of A about the target Y

  5. Given random variables Y, X1, ..., Xn
     - Want to predict Y from a subset A = (X_i1, ..., X_ik)
     - Naive Bayes model: P(Y, X1, ..., Xn) = P(Y) · ∏_i P(X_i | Y)
       (Y = "Sick"; X1 = "Fever", X2 = "Rash", X3 = "Cough")
     - Want the k most informative features:
       A* = argmax_{|A| ≤ k} I(A; Y)
       where I(A; Y) = H(Y) - H(Y | A)
       = (uncertainty before knowing A) - (uncertainty after knowing A)
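
For small k this objective can be evaluated exactly by enumerating the values of X_A. A minimal sketch, assuming binary features and a naive Bayes parameterization (the names p_y and p_x_given_y are illustrative, not from the course):

```python
# Sketch: I(A; Y) = H(Y) - H(Y | A) under P(Y, X) = P(Y) * prod_i P(X_i | Y),
# assuming binary features. Exponential in |A|, which is fine for small k.
from itertools import product
from math import log2

def mutual_information(A, p_y, p_x_given_y):
    """A: indices of selected features; p_y[y]: prior of class y;
    p_x_given_y[i][y]: probability that feature i equals 1 given class y."""
    def h(dist):
        return -sum(p * log2(p) for p in dist if p > 0)

    h_y = h(p_y.values())
    h_y_given_a = 0.0
    for x_a in product([0, 1], repeat=len(A)):
        # joint P(Y = y, X_A = x_a) from the naive Bayes factorization
        joint = {}
        for y, py in p_y.items():
            p = py
            for idx, val in zip(A, x_a):
                p *= p_x_given_y[idx][y] if val == 1 else 1 - p_x_given_y[idx][y]
            joint[y] = p
        p_xa = sum(joint.values())
        if p_xa > 0:
            h_y_given_a += p_xa * h(p / p_xa for p in joint.values())
    return h_y - h_y_given_a
```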

  6. Given: finite set V of features, utility function F(A) = I(A; Y)
     - Want: A* ⊆ V such that A* = argmax_{|A| ≤ k} F(A)
       (same naive Bayes example: Y = "Sick"; X1 = "Fever", X2 = "Rash", X3 = "Cough")
     - Typically NP-hard!
     Greedy hill-climbing:
     - Start with A_0 = {}
     - For i = 1 to k:
         s* = argmax_s F(A_{i-1} ∪ {s})
         A_i = A_{i-1} ∪ {s*}
     How well does this simple heuristic do?
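
The heuristic above is a few lines of code. A direct transcription (F stands for any set-function oracle, such as I(A; Y); assumes k ≤ |V|):

```python
def greedy(F, V, k):
    """Greedy hill-climbing: add, k times, the element with the largest marginal gain."""
    A = set()
    for _ in range(k):
        best = max((s for s in V if s not in A), key=lambda s: F(A | {s}))
        A.add(best)
    return A
```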

  7. Greedy hill-climbing produces a solution A where
     F(A) ≥ (1 - 1/e) · OPT, i.e. about 63% of the optimal value [Nemhauser, Fisher, Wolsey '78]
     The claim holds for functions F with two properties:
     - F is monotone: if A ⊆ B then F(A) ≤ F(B), and F({}) = 0
     - F is submodular: adding an element to a set gives less improvement than adding it to one of its subsets

  8. Definition:
     A set function F on V is called submodular if, for all A, B ⊆ V:
     F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B)
     (Diagram: Venn-style illustration of A, B, A ∪ B, and A ∩ B.)

  9. Diminishing returns characterization
     Definition: a set function F on V is called submodular if, for all A ⊆ B and s ∉ B:
     F(A ∪ {s}) - F(A) ≥ F(B ∪ {s}) - F(B)
     (gain of adding s to a small set) ≥ (gain of adding s to a large set)
     (Diagram: adding s to the small set A gives a large improvement; adding s to the
     larger set B gives a small improvement.)
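
For a small ground set the diminishing-returns condition can be checked by brute force. A small sketch, not from the slides, with an assumed test function (any concave function of |A| is submodular):

```python
# Brute-force check of F(A ∪ {s}) - F(A) >= F(B ∪ {s}) - F(B) over a tiny ground set.
from itertools import combinations
from math import sqrt

def is_submodular(F, V, tol=1e-9):
    subsets = [frozenset(c) for r in range(len(V) + 1) for c in combinations(V, r)]
    for B in subsets:
        for A in subsets:
            if not A <= B:
                continue
            for s in V:
                if s in B:
                    continue
                gain_small = F(A | {s}) - F(A)   # gain on the small set
                gain_large = F(B | {s}) - F(B)   # gain on the large set
                if gain_small + tol < gain_large:
                    return False
    return True

print(is_submodular(lambda S: sqrt(len(S)), {1, 2, 3, 4}))  # True
```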

  10. Given random variables X1, ..., Xn
      - Mutual information: F(A) = I(A; V\A) = H(V\A) - H(V\A | A)
        = Σ_{y, x_A} P(y, x_A) [log P(y | x_A) - log P(y)]
      - Mutual information F(A) is submodular [Krause-Guestrin '05]:
        F(A ∪ {s}) - F(A) = H(s | A) - H(s | V \ (A ∪ {s}))
        and A ⊆ B ⇒ H(s | A) ≥ H(s | B)  ("information never hurts")

  11. Let Y = Σ_i α_i X_i + ε, where (X1, ..., Xn, ε) ~ N(μ, Σ)
      - Want to pick a subset A to predict Y
      - Var(Y | X_A = x_A): conditional variance of Y given X_A = x_A
      - Expected variance: Var(Y | X_A) = ∫ p(x_A) Var(Y | X_A = x_A) dx_A
      - Variance reduction: F_V(A) = Var(Y) - Var(Y | X_A)
      - Then [Das-Kempe '08]:
        F_V(A) is monotone, and F_V(A) is submodular* (*under some conditions on Σ)
      - Related: orthogonal matching pursuit is near-optimal [Tropp, Donoho]
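
For jointly Gaussian variables the conditional variance does not depend on the observed value x_A, so the expected variance above reduces to a Schur complement. A minimal numpy sketch, assuming the joint covariance of (X_1, ..., X_n, Y) has already been formed from α and Σ:

```python
# Sketch: F_V(A) = Var(Y) - Var(Y | X_A) for jointly Gaussian (X_1, ..., X_n, Y).
import numpy as np

def variance_reduction(Sigma, y_idx, A):
    """Sigma: covariance of (X_1, ..., X_n, Y); y_idx: index of Y; A: selected feature indices."""
    A = list(A)
    var_y = Sigma[y_idx, y_idx]
    if not A:
        return 0.0
    S_AA = Sigma[np.ix_(A, A)]
    S_yA = Sigma[y_idx, A]
    # Var(Y | X_A) = Var(Y) - Sigma_{YA} Sigma_{AA}^{-1} Sigma_{AY}
    cond_var = var_y - S_yA @ np.linalg.solve(S_AA, S_yA)
    return var_y - cond_var
```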

  12. F1, ..., Fm submodular functions on V and λ1, ..., λm > 0
      - Then F(A) = Σ_i λ_i F_i(A) is submodular!
      - Submodularity is closed under nonnegative linear combinations
      - Extremely useful fact: if F_θ(A) is submodular for every θ, then Σ_θ P(θ) F_θ(A) is submodular!
      - Multicriterion optimization: F1, ..., Fm submodular, λ_i > 0 ⇒ Σ_i λ_i F_i(A) submodular

  13. Each element covers some area
      Observation: diminishing returns
      (Diagram: new element S'; A = {S1, S2}, B = {S1, S2, S3, S4}.
      Adding S' to A helps a lot; adding S' to B helps very little.)

  14. F is submodular: A ⊆ B ⇒
      F(A ∪ {s}) - F(A) ≥ F(B ∪ {s}) - F(B)
      (gain of adding a set s to a small solution) ≥ (gain of adding a set s to a large solution)
      Natural example:
      - Sets s1, s2, ..., sn
      - F(A) = size of the union of the s_i in A (size of covered area)
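
A tiny worked instance of this coverage example, showing the diminishing-returns inequality numerically (the sets below are made up for illustration):

```python
# Coverage function: F(A) = size of the union of the chosen sets (the "covered area").
sets = {
    "S1": {1, 2, 3}, "S2": {3, 4, 5}, "S3": {5, 6}, "S4": {6, 7},
    "S'": {2, 3, 6, 7, 8},
}

def coverage(A):
    return len(set().union(*(sets[s] for s in A)))

A = {"S1", "S2"}
B = {"S1", "S2", "S3", "S4"}
print(coverage(A | {"S'"}) - coverage(A))  # gain on the small set A: 3
print(coverage(B | {"S'"}) - coverage(B))  # gain on the large set B: 1
```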

  15. Most influential set of size k: the set S of k nodes producing the largest
      expected cascade size F(S) if activated [Domingos-Richardson '01]
      Optimization problem: max_{S of size k} F(S)
      (Diagram: a small network on nodes a, ..., i with activation probabilities
      such as 0.2, 0.3, 0.4 on the edges.)

  16. Fix an outcome i of the coin flips
      - Let F_i(S) be the size of the cascade from S given these coin flips
      - Let F_i(v) = set of nodes reachable from v on live-edge paths
      - F_i(S) = size of the union of the F_i(v), v ∈ S → F_i is submodular
      - F(S) = Σ_i P(outcome i) · F_i(S) → F is submodular [Kempe-Kleinberg-Tardos '03]
      (Same example network as on the previous slide.)
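
The live-edge construction also gives a simple Monte Carlo estimator of F(S): flip every edge's coin once, count the nodes reachable from S, and average over samples. A sketch with an assumed edge-list representation (not the course's code):

```python
import random

def estimate_spread(edges, S, n_samples=1000, seed=0):
    """edges: list of (u, v, p) with activation probability p; S: seed set."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        # one outcome of the coin flips: keep each edge (u, v) with probability p
        live = {}
        for u, v, p in edges:
            if rng.random() < p:
                live.setdefault(u, []).append(v)
        # F_i(S) = number of nodes reachable from S over live edges
        reached, frontier = set(S), list(S)
        while frontier:
            u = frontier.pop()
            for v in live.get(u, []):
                if v not in reached:
                    reached.add(v)
                    frontier.append(v)
        total += len(reached)
    return total / n_samples

edges = [("a", "b", 0.4), ("b", "c", 0.3), ("a", "d", 0.2), ("d", "c", 0.3)]
print(estimate_spread(edges, {"a"}))
```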

  17. Given a real city water distribution network,
      and data on how contaminants spread in the network.
      Problem posed by the US Environmental Protection Agency.

  18. [Leskovec et al., KDD '07]
      Real metropolitan-area water network:
      - V = 21,000 nodes
      - E = 25,000 pipes
      - Water flow simulator provided by the EPA
      - 3.6 million contamination events
      Multiple objectives:
      - Detection time, affected population, ...
      - Place sensors that detect well "on average"

  19. Utility of placing sensors:
      - Water flow dynamics, demands of households, ...
      - For each subset A ⊆ V (V = set of all network junctions), compute the utility F(A)
      - The model predicts, for a given contamination location, the impact at each
        junction (high / medium / low); a sensor reduces impact through early detection
      - Example placements: low sensing quality F(A) = 0.01 vs. high sensing quality F(A) = 0.9
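
One concrete way to read F(A) (an assumed sketch, not the EPA simulator): average, over the simulated contamination events, the impact reduction achieved by whichever sensor in A detects the event first.

```python
# Hedged sketch. `scenarios` is an assumed structure: for each contamination event,
# a dict mapping junction -> impact that remains if a sensor at that junction is the
# first to detect the event. `baseline_impact` is the damage if the event is never
# detected (assumed the same for all events, purely for simplicity).
def placement_utility(A, scenarios, baseline_impact):
    """F(A): average impact reduction over all contamination events."""
    if not scenarios:
        return 0.0
    total = 0.0
    for impact_if_detected_at in scenarios:
        remaining = min((impact_if_detected_at.get(a, baseline_impact) for a in A),
                        default=baseline_impact)
        total += baseline_impact - remaining
    return total / len(scenarios)
```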

  20. Given:
      - Graph G(V, E), budget B
      - Data on how outbreaks o_1, ..., o_i, ..., o_K spread over time
      Select a set of nodes A maximizing the reward:
      max_A Σ_i P(o_i) · R_i(A)   (R_i(A) = reward for detecting outbreak i)
      subject to cost(A) ≤ B

  21. Cost:
      - The cost of monitoring is node-dependent
      Reward:
      - Minimize the number of affected nodes:
        if A is the set of monitored nodes, let R(A) denote the number of nodes we save
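
A simple way to handle the node-dependent cost is a cost-benefit greedy rule: repeatedly add the node with the best marginal reward per unit cost that still fits in the budget. This sketch shows only the basic idea; the KDD '07 paper combines it with a uniform-cost greedy pass and lazy evaluations (CELF) to get a guarantee.

```python
def budgeted_greedy(R, cost, V, B):
    """R: reward set function; cost: dict node -> cost; V: candidate nodes; B: budget."""
    A, spent = set(), 0.0
    while True:
        candidates = [s for s in V if s not in A and spent + cost[s] <= B]
        if not candidates:
            return A
        # pick the node with the largest marginal reward per unit cost
        best = max(candidates, key=lambda s: (R(A | {s}) - R(A)) / cost[s])
        A.add(best)
        spent += cost[best]
```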
