Data science for networked data

Data science for networked data. Po-Ling Loh, University of Wisconsin-Madison, Department of Statistics. AISTATS, Okinawa, Japan, April 16, 2019. Joint work with: Justin Khim (UPenn), Varun Jog (UW-Madison), Ashley Hou (UW-Madison), Wen Yan


  1. Graph testing
     - Observations: infection status of $n$ nodes in a graph:
       - $k$ infected nodes (1)
       - $c$ censored (nonreporting) nodes ($\star$)
       - $n - k - c$ uninfected nodes (0)
     - [Figure: candidate graphs compared against one another ("vs.").]
     Po-Ling Loh (UW-Madison) Data science for networked data Apr 16, 2019 25 / 45

  3. Graph testing
     - [Figure: candidate graphs $H_0$ vs. $H_1$ vs. $H_2$, labeled $T = 10$, $T = 0$, $T = 3$.]
     - Compute the test statistic $T$ = number of edges between infected nodes.
     - We need to construct a proper rejection rule based on $T$ and establish the validity of the hypothesis test.
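
The test statistic on this slide is simple to compute once a candidate graph and an observation vector are fixed. The sketch below is illustrative only: the function name `edges_between_infected`, the three example graphs, and the status encoding (`1`, `0`, `"*"`) are assumptions chosen to mirror the slide's notation, not code from the talk.

```python
from itertools import combinations


def edges_between_infected(edges, status):
    """Return T, the number of edges whose two endpoints are both infected (status 1)."""
    return sum(1 for u, v in edges if status[u] == 1 and status[v] == 1)


# Hypothetical observation on 5 nodes: 1 = infected, 0 = uninfected, "*" = censored.
status = [1, 1, 1, 0, "*"]

# Three hypothetical candidate graphs on the same node set.
complete_graph = list(combinations(range(5), 2))
path_graph = [(0, 1), (1, 2), (2, 3), (3, 4)]
empty_graph = []

for name, edges in [("complete", complete_graph),
                    ("path", path_graph),
                    ("empty", empty_graph)]:
    print(name, edges_between_infected(edges, status))
```

The same observation yields a different value of $T$ on each candidate graph, which is what makes $T$ usable for distinguishing hypotheses.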

  5. Infection model
     - Parameters: $\lambda$, $\eta$.
     - For each node $v$, generate $T_v \sim \mathrm{Exp}(\lambda)$.
     - For each edge $(u, v)$, generate $T_{uv} \sim \mathrm{Exp}(\eta)$.
     - The infection time of any vertex $v$ is $t_v = \min_{u \in N(v)} \{t_u + T_{uv}\} \wedge T_v$.
     - The observation vector corresponds to the infection states at a certain time.
     - The subset of censored nodes is chosen uniformly at random.
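
The recursion for $t_v$ is a shortest-path problem: each node can be infected spontaneously at time $T_v$ or through a neighbor at time $t_u + T_{uv}$. A minimal simulation sketch follows, assuming an undirected graph given as an edge list and solving the recursion with Dijkstra's algorithm. The function names, the observation time `t_obs`, and the status encoding are hypothetical.

```python
import heapq
import random


def simulate_infection_times(n, edges, lam, eta, rng):
    """Draw node clocks T_v ~ Exp(lam) and edge clocks T_uv ~ Exp(eta),
    then solve t_v = min(T_v, min_u (t_u + T_uv)) with Dijkstra's algorithm."""
    node_clock = [rng.expovariate(lam) for _ in range(n)]
    edge_clock = {(u, v): rng.expovariate(eta) for u, v in edges}
    adj = {v: [] for v in range(n)}
    for (u, v), w in edge_clock.items():
        adj[u].append((v, w))
        adj[v].append((u, w))

    times = list(node_clock)          # spontaneous infection is always possible
    heap = [(t, v) for v, t in enumerate(times)]
    heapq.heapify(heap)
    done = [False] * n
    while heap:
        t, v = heapq.heappop(heap)
        if done[v]:
            continue                  # stale heap entry
        done[v] = True
        for u, w in adj[v]:
            if t + w < times[u]:
                times[u] = t + w
                heapq.heappush(heap, (times[u], u))
    return times, node_clock, edge_clock


def observe(times, t_obs, c, rng):
    """Infection states at time t_obs, with c nodes censored uniformly at random."""
    status = [1 if t <= t_obs else 0 for t in times]
    for v in rng.sample(range(len(times)), c):
        status[v] = "*"
    return status


rng = random.Random(0)
cycle = [(i, (i + 1) % 6) for i in range(6)]
times, node_clock, edge_clock = simulate_infection_times(6, cycle, 1.0, 2.0, rng)
print(observe(times, t_obs=0.5, c=1, rng=rng))
```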

  8. Permutation test
     - Goal: for $\alpha \in (0, 1)$, construct a rejection rule such that $P(\text{reject} \mid H_0 \text{ is true}) \le \alpha$.
     - Use a permutation test that computes $T$ for the $\binom{n}{k,\, c,\, n-k-c}$ reassignments of infected/nonreporting/uninfected nodes.
     - [Figure: graph $H_1$ with example assignments labeled $T = 0$, $T = 4$, $T = 4$, $T = 4$.]
     - Based on (randomly chosen) permutations, compute the $p$-value/rejection region and reject $H_0$ if the $p$-value of $T$ is at most $\alpha$.

  10. Permutation test
      - [Figure: distribution of $T(I)$ with the level $\alpha$ marked, split into a "do not reject $H_0$" region and a "reject $H_0$" region.]
      - In practice, it suffices to compute the empirical distribution from a large number of random permutations.
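
A Monte Carlo version of the permutation test can be sketched as follows. Two choices here are assumptions rather than details from the slides: which tail of $T$ counts as extreme (exposed as the hypothetical flag `small_is_extreme`), and the `(1 + count) / (1 + num_perm)` form of the $p$-value, which is a common convention for randomly sampled permutations.

```python
import random


def edges_between_infected(edges, status):
    """T = number of edges whose two endpoints are both infected (status 1)."""
    return sum(1 for u, v in edges if status[u] == 1 and status[v] == 1)


def permutation_p_value(edges, status, num_perm, rng, small_is_extreme=True):
    """Estimate a p-value for T on the null graph `edges` by randomly
    reassigning the infected / censored / uninfected labels to nodes."""
    observed = edges_between_infected(edges, status)
    labels = list(status)             # copy, so the caller's data is untouched
    count = 0
    for _ in range(num_perm):
        rng.shuffle(labels)           # keeps the counts (k, c, n - k - c) fixed
        t = edges_between_infected(edges, labels)
        if (t <= observed) if small_is_extreme else (t >= observed):
            count += 1
    return (1 + count) / (1 + num_perm)


# Hypothetical example: null graph is a path on 6 nodes; infected nodes are spread out.
path = [(i, i + 1) for i in range(5)]
status = [1, 0, 1, "*", 1, 0]
alpha = 0.05
p = permutation_p_value(path, status, num_perm=2000, rng=random.Random(0))
print("p-value:", p, "reject H0:", p <= alpha)
```

Shuffling the label vector samples uniformly from the reassignments counted by the multinomial coefficient, so no explicit enumeration is needed.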

  13. Theory for permutation test
      - Success depends on the symmetries of the underlying networks rather than on the parameters $\lambda$, $\eta$.
      - Consider $\Pi_0 = \mathrm{Aut}(G_0)$ and $\Pi_1 = \mathrm{Aut}(G_1)$, subsets of $S_n$.
      - [Figure: an 8-node graph $G$ and its relabeling under a permutation $\pi \in \mathrm{Aut}(G)$.]
      - Theorem: Let $\pi$ be drawn uniformly from $S_n$. If $\Pi_1 \Pi_0 = S_n$, the permutation test controls Type I error at level $\alpha$.
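
For very small graphs, the condition $\Pi_1 \Pi_0 = S_n$ in the theorem can be checked by brute force. The sketch below enumerates all $n!$ permutations, so it is only feasible for tiny $n$. The function names and the two example pairs (a 4-cycle against a star, and two 3-node paths) are illustrative assumptions.

```python
import math
from itertools import permutations


def automorphisms(n, edges):
    """All permutations of range(n) that map the edge set onto itself."""
    edge_set = {frozenset(e) for e in edges}
    result = []
    for perm in permutations(range(n)):
        image = {frozenset((perm[u], perm[v])) for u, v in edges}
        if image == edge_set:
            result.append(perm)
    return result


def product_is_symmetric_group(n, edges0, edges1):
    """Check whether {pi1 o pi0 : pi1 in Aut(G1), pi0 in Aut(G0)} equals S_n."""
    aut0 = automorphisms(n, edges0)
    aut1 = automorphisms(n, edges1)
    products = {tuple(p1[p0[i]] for i in range(n)) for p1 in aut1 for p0 in aut0}
    return len(products) == math.factorial(n)


# Hypothetical pair 1: 4-cycle (automorphism group of order 8) vs. star centered at node 0.
cycle4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
star4 = [(0, 1), (0, 2), (0, 3)]
print(product_is_symmetric_group(4, cycle4, star4))

# Hypothetical pair 2: two paths on 3 nodes with different centers.
path_a = [(0, 1), (1, 2)]
path_b = [(1, 0), (0, 2)]
print(product_is_symmetric_group(3, path_a, path_b))
```

In the first pair the two groups have orders 8 and 6 and share only 2 elements, so their product set has $8 \cdot 6 / 2 = 24 = 4!$ elements. In the second pair each group has 2 elements, so the product set cannot reach $3! = 6$.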

  15. Extensions and open directions
      - Characterization of the condition $\Pi_1 \Pi_0 = S_n$ for various graph families
      - Bounds on Type II error for specific graphs
      - Conditioning on the identity of censored nodes
      - Open directions:
        - How to identify which graphs to use as null/alternative hypotheses?
        - Inhomogeneous $\lambda$ and $\eta$?
        - Confidence sets for the underlying network?

  16. Resource allocation. Collaborators: Justin Khim (UPenn), Varun Jog (UW-Madison), Ashley Hou (UW-Madison), Wen Yan (Southeast University)

  21. Influence maximization (with Justin Khim and Varun Jog)
      - New goal: seed a network to “infect” as many nodes as possible.
      - Useful for information dissemination, marketing, etc.
      - [Figure: spread of an infection through a network at $t = 0$, $t = 1$, $t = 2$.]
      - Questions:
        1. If $k$ nodes may be infected initially, which nodes should be selected to maximize infection spread?
        2. How can the maximal set be determined efficiently?

  25. Model: Linear threshold model (broadly, triggering models)
      - Edges have weights $(b_{ij})$, satisfying $\sum_j b_{ji} \le 1$.
      - Nodes choose thresholds $\theta_i \in [0, 1]$ i.i.d., uniformly at random.
      - [Figure: example weighted network shown at $t = 0$, $t = 1$, $t = 2$.]
      - On each round, uninfected nodes compute the total weight of their infected neighbors and become infected if $\sum_{j \text{ is infected}} b_{ji} > \theta_i$.
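
The round-by-round dynamics of the linear threshold model can be sketched directly from the update rule. In this illustration the weights are stored in a dictionary keyed by `(j, i)` for $b_{ji}$, and the thresholds are passed in explicitly so the run is deterministic. The function name and the example numbers are assumptions.

```python
def linear_threshold_spread(weights, thresholds, seeds):
    """Run the linear threshold dynamics until no new node becomes infected.

    weights[(j, i)] is b_ji, the influence of node j on node i.
    thresholds[i] is theta_i; its keys define the node set.
    Returns the final set of infected nodes.
    """
    infected = set(seeds)
    while True:
        newly = set()
        for i in thresholds:
            if i in infected:
                continue
            total = sum(weights.get((j, i), 0.0) for j in infected)
            if total > thresholds[i]:
                newly.add(i)
        if not newly:
            return infected
        infected |= newly


# Hypothetical 3-node chain 0 -> 1 -> 2.
weights = {(0, 1): 0.6, (1, 2): 0.5}
print(linear_threshold_spread(weights, {0: 0.9, 1: 0.5, 2: 0.7}, seeds={0}))
print(linear_threshold_spread(weights, {0: 0.9, 1: 0.5, 2: 0.4}, seeds={0}))
```

In the first call node 1 is infected (0.6 > 0.5) but node 2 is not (0.5 does not exceed 0.7). Lowering node 2's threshold to 0.4 lets the infection reach it in the second round.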

  28. Previous work
      - Monotonicity and submodularity of the influence function in triggering models (Kempe et al. ’03)
      - $\Longrightarrow$ The greedy algorithm yields a $\left(1 - \frac{1}{e}\right)$-approximation to $\max_{A \subseteq V : |A| \le k} I(A)$.
      - However, the method involves approximating $I$ at each iteration of the greedy algorithm via simulations.
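
To make the cost of the simulation-based approach concrete, here is a sketch of greedy seed selection in which $I(A)$ is estimated by Monte Carlo runs of the linear threshold model. This is a generic illustration of the greedy scheme, not the authors' implementation. All names and the star-graph example are assumptions.

```python
import random


def linear_threshold_spread(weights, thresholds, seeds):
    """Final infected set under the linear threshold model (weights[(j, i)] = b_ji)."""
    infected = set(seeds)
    while True:
        newly = {i for i in thresholds
                 if i not in infected
                 and sum(weights.get((j, i), 0.0) for j in infected) > thresholds[i]}
        if not newly:
            return infected
        infected |= newly


def estimate_influence(weights, nodes, seeds, num_sims, rng):
    """Monte Carlo estimate of I(A): average final infection size over random thresholds."""
    total = 0
    for _ in range(num_sims):
        thresholds = {i: rng.random() for i in nodes}   # theta_i uniform on [0, 1)
        total += len(linear_threshold_spread(weights, thresholds, seeds))
    return total / num_sims


def greedy_seeds(weights, nodes, k, num_sims, rng):
    """Add, k times, the node whose inclusion gives the largest estimated influence."""
    chosen = []
    for _ in range(k):
        candidates = [v for v in nodes if v not in chosen]
        best = max(candidates,
                   key=lambda v: estimate_influence(weights, nodes, chosen + [v], num_sims, rng))
        chosen.append(best)
    return chosen


# Hypothetical star: node 0 influences each of nodes 1..4 with weight 1.
nodes = [0, 1, 2, 3, 4]
weights = {(0, i): 1.0 for i in range(1, 5)}
print(greedy_seeds(weights, nodes, k=1, num_sims=20, rng=random.Random(0)))
```

Each greedy step calls `estimate_influence` once per candidate node, and each call runs `num_sims` simulations. That repeated simulation is the expense the slide refers to.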

  31. Key contributions
      1. Computable upper and lower bounds for the influence function in general triggering models
      2. Characterization of the gap between the bounds
      3. Proof of monotonicity and submodularity for a family of lower bounds $\Longrightarrow$ $\left(1 - \frac{1}{e}\right)$-approximation for the sequential greedy algorithm
      - Leads to significant speed-ups:

        | Graph                   | LB 1 | LB 2 | UB    | Simulation |
        |-------------------------|------|------|-------|------------|
        | Erdős-Rényi             | 1.00 | 2.36 | 27.43 | 710.58     |
        | Preferential attachment | 1.00 | 2.56 | 28.49 | 759.83     |
        | 2D-grid                 | 1.00 | 2.43 | 47.08 | 1301.73    |

  34. Budget allocation (with Ashley Hou)
      - Problem: given a fixed budget to distribute amongst influencers, how should resources be allocated optimally?
      - [Figure: source nodes $S$ and target nodes $T$, with example allocations $y(1) = 2$ and $y(4) = 3$.]
      - Mathematical formulation: if resources $\{y(s)\}_{s \in S}$ are allocated among source nodes $S$, the probability of influencing customer $t$ is
        $$I_t(y) = 1 - \prod_{(s,t) \in E} (1 - p_{st})^{y(s)},$$
        so we solve $\max \sum_{t \in T} I_t(y)$ subject to $\sum_{s \in S} y(s) \le B$.
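
The objective in this formulation is cheap to evaluate for a given allocation. The sketch below computes $I_t(y)$ and the total $\sum_{t \in T} I_t(y)$, and then applies a simple unit-by-unit greedy heuristic for integer budgets. The greedy routine is an assumed illustration, not the algorithm from the talk. The edge probabilities and node names are made up.

```python
def influence_probability(t, y, p):
    """I_t(y) = 1 - prod over edges (s, t) of (1 - p_st) ** y(s)."""
    prob_not_influenced = 1.0
    for (s, target), p_st in p.items():
        if target == t:
            prob_not_influenced *= (1.0 - p_st) ** y.get(s, 0)
    return 1.0 - prob_not_influenced


def total_influence(y, p, targets):
    """Objective: expected number of influenced customers, sum over t of I_t(y)."""
    return sum(influence_probability(t, y, p) for t in targets)


def greedy_allocation(p, sources, targets, budget):
    """Heuristic: spend the integer budget one unit at a time on the source
    that gives the largest objective value after the increment."""
    y = {s: 0 for s in sources}
    for _ in range(budget):
        def value_after_increment(s):
            trial = dict(y)
            trial[s] += 1
            return total_influence(trial, p, targets)
        best = max(sources, key=value_after_increment)
        y[best] += 1
    return y


# Hypothetical instance: p[(s, t)] is the edge probability p_st.
p = {("a", "x"): 0.5, ("a", "y"): 0.5, ("b", "x"): 0.1}
sources, targets = ["a", "b"], ["x", "y"]
print(total_influence({"a": 2, "b": 0}, p, targets))
print(greedy_allocation(p, sources, targets, budget=2))
```

With two units on source `a`, each customer is influenced with probability $1 - 0.5^2 = 0.75$, giving an objective of 1.5.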

  38. Robust variant
      - In practice, we might not know the edge parameters $p = \{p_{st}\}$, or even the edge structure.
      - Robust optimization framework:
        $$\max_{\sum_{s \in S} y(s) \le B} \; \min_{p \in \Sigma} \left\{ \sum_{t \in T} I_t^p(y) \right\}$$
      - Goal: develop efficient algorithms for robust budget allocation with provable approximation guarantees.
      - Ingredients: maximization of the minimum of submodular functions, extensions to integer lattices and budget constraints.

  40. Network immunization (with Wen Yan)
      - Goal: given a budget of interventions at nodes/edges of a graph, how should resources be distributed optimally to slow an epidemic?
      - We are interested in fractional immunization, which only decreases the infectiveness of nodes/edges.
      - [Figure: example graph labeled with the fractional values 0.2, 0.5, 0.4.]

  42. Network immunization
      - Formulation as an influence maximization problem:
        $$\min_{\sum \theta_{ij} \le B} \left\{ \max_{A \subseteq V : |A| \le k} I\left(A; \{b_{ij}\} - \{\theta_{ij}\}\right) \right\}$$
      - Challenges:
        1. Bilevel optimization problem involving discrete and continuous variables
        2. No computable closed-form expression for $I$ or $\nabla I$

  43. Local algorithms. Collaborators: Muni Pydi (UW-Madison), Varun Jog (UW-Madison)

  46. Maximizing graph functions
      - Given: a function $f$ defined on the nodes of a graph.
      - Examples: degree, age of node, power/population level, etc.
      - [Figure: example graph with integer node values.]
      - Goal: maximize $f$ by “walking” along edges and querying values.
      - We could use a “vanilla random walk” with transition probabilities $P_{ij} = \frac{w_{ij}}{d_i}$, but can we leverage the smoothness/structure of the graph function?
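
As a baseline, the vanilla random walk can be sketched in a few lines. The example assumes a weight matrix `W` stored as a list of lists with every node having positive degree $d_i = \sum_j w_{ij}$. The function names and the 3-node path are illustrative.

```python
import random


def random_walk_matrix(W):
    """Transition matrix with P[i][j] = w_ij / d_i, where d_i is the i-th row sum of W."""
    P = []
    for row in W:
        d_i = sum(row)                # assumed positive for every node
        P.append([w / d_i for w in row])
    return P


def walk_and_track_max(P, f, start, steps, rng):
    """Walk along edges according to P, querying f at each visited node."""
    i = start
    best = f[i]
    for _ in range(steps):
        i = rng.choices(range(len(P)), weights=P[i])[0]
        best = max(best, f[i])
    return best


# Hypothetical path graph 0 - 1 - 2 with unit weights and node values f.
W = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
f = [2, 3, 6]
P = random_walk_matrix(W)
print(P)
print(walk_and_track_max(P, f, start=0, steps=10, rng=random.Random(0)))
```

This walk ignores the values of $f$ when choosing where to go next, which is the limitation the slide's closing question points at.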

  50. Metropolis-Hastings algorithm
      - The MH algorithm is specified by a target density $p_f$ and a proposal distribution $Q$ (stochastic matrix).
      - Transition matrix:
        $$P_{ij} = \begin{cases} Q_{ij} \min\left\{1, \dfrac{p_f(j)\, Q_{ji}}{p_f(i)\, Q_{ij}}\right\}, & j \ne i, \\ 1 - \sum_{j \ne i} P_{ij}, & j = i. \end{cases}$$
      - The MH algorithm is known to converge to $p_f$.
      - Idea: build a density $p_f$ that is maximized wherever $f$ is maximized, and hope that the MH algorithm finds the maximizers quickly.
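
The transition matrix can be built explicitly on a small state space, which makes it easy to check that $p_f$ satisfies detailed balance. The sketch below assumes $p_f(i) > 0$ for all $i$ and sets $P_{ij} = 0$ whenever $Q_{ij} = 0$ to avoid dividing by zero. The function name and example values are hypothetical. Note that $p_f$ only enters through ratios, so it need not be normalized.

```python
def metropolis_hastings_matrix(p_f, Q):
    """Build the MH transition matrix from target weights p_f and proposal matrix Q."""
    n = len(p_f)
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j != i and Q[i][j] > 0:
                ratio = (p_f[j] * Q[j][i]) / (p_f[i] * Q[i][j])
                P[i][j] = Q[i][j] * min(1.0, ratio)
        # Remaining mass stays at i (rejected proposals).
        P[i][i] = 1.0 - sum(P[i][j] for j in range(n) if j != i)
    return P


# Hypothetical example: unnormalized target (1, 2, 3), random-walk proposal on a path.
p_f = [1.0, 2.0, 3.0]
Q = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]
P = metropolis_hastings_matrix(p_f, Q)
for row in P:
    print(row)

# Detailed balance: p_f(i) * P[i][j] equals p_f(j) * P[j][i] for every pair.
print(all(abs(p_f[i] * P[i][j] - p_f[j] * P[j][i]) < 1e-12
          for i in range(3) for j in range(3)))
```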

  55. Local algorithm
      1. Initialize at a random vertex $i_0$.
      2. Take $T$ steps of the MH algorithm according to the transition matrix $P$.
      3. Output the maximum among $\{f(i_0), \dots, f(i_T)\}$.
      - Exponential walk: $p_f(i) \propto \exp(\gamma f(i))$ and $Q = D^{-1} W$.
      - Laplacian walk: $p_f(i) \propto f^2(i)$ and $Q$ defined with respect to the eigenvectors of the graph Laplacian $L = D - W$.
      - Theoretical results: rates of convergence in TV distance and hitting time bounds for both algorithms, in terms of graph/function characteristics.
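
The three steps of the local algorithm, specialized to the exponential walk, can be sketched as below. The code assumes a symmetric weight matrix `W` with positive degrees, so that for $Q = D^{-1} W$ the proposal ratio $Q_{ji} / Q_{ij}$ simplifies to $d_i / d_j$. The function name, the value of `gamma`, and the example graph are assumptions. The Laplacian walk is not covered here.

```python
import math
import random


def exponential_walk_max(W, f, gamma, steps, rng):
    """Local maximization of f via Metropolis-Hastings with target
    p_f(i) proportional to exp(gamma * f(i)) and proposal Q = D^{-1} W."""
    n = len(W)
    degree = [sum(row) for row in W]
    i = rng.randrange(n)                      # step 1: random initial vertex i_0
    best = f[i]
    for _ in range(steps):                    # step 2: T steps of the MH chain
        j = rng.choices(range(n), weights=W[i])[0]   # propose a neighbor from Q[i]
        # log of p_f(j) Q_ji / (p_f(i) Q_ij), using Q_ji / Q_ij = d_i / d_j
        log_ratio = gamma * (f[j] - f[i]) + math.log(degree[i]) - math.log(degree[j])
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            i = j                             # accept the proposed move
        best = max(best, f[i])
    return best                               # step 3: maximum over visited vertices


# Hypothetical 5-cycle with unit weights and node values f.
W = [[0, 1, 0, 0, 1],
     [1, 0, 1, 0, 0],
     [0, 1, 0, 1, 0],
     [0, 0, 1, 0, 1],
     [1, 0, 0, 1, 0]]
f = [1, 2, 6, 2, 1]
print(exponential_walk_max(W, f, gamma=1.0, steps=50, rng=random.Random(0)))
```

Larger values of `gamma` concentrate the target on the maximizers of $f$ but also make downhill moves less likely to be accepted, so the walk can linger at local maxima.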

  56. Summary: Many interesting data analysis problems involving network-structured data
