Tim Althoff, UW CS547: Machine Learning for Big Data (5/7/2020)
http://www.cs.washington.edu/cse547
SLIDE 1

Announcements:

  • Thank you for participating in our mid-quarter evaluation
  • Thank you for participating in our homework feedback polls! ☺
  • Course project
  • Average was ~80%
  • Don’t worry about grade but take feedback seriously
  • Project Milestone due Thu Sun
  • No late days and no exceptions
  • Consider meeting with your assigned TA
SLIDE 2

 We often think of networks as being organized into modules, clusters, communities:

SLIDE 4

[Figure: a network and its adjacency matrix; rows and columns indexed by nodes, with community block structure]

SLIDE 5

 Find micro-markets by partitioning the query-to-advertiser graph:

[Figure: bipartite advertiser-query graph partitioned into micro-market clusters]

[Andersen, Lang: Communities from seed sets, 2006]

SLIDE 6

 Clusters in Movies-to-Actors graph:


[Andersen, Lang: Communities from seed sets, 2006]

SLIDE 7

 Discovering social circles, circles of trust:


[McAuley, Leskovec: Discovering social circles in ego networks, 2012]

SLIDE 8

 Graph is large

▪ Assume the graph fits in main memory

▪ For example, to work with a 200M-node, 2B-edge graph one needs approx. 16 GB of RAM

▪ But the graph is too big to run anything more than linear-time algorithms

 We will cover a PageRank-based algorithm for finding dense clusters

▪ The runtime of the algorithm will be proportional to the cluster size (not the graph size!)

SLIDE 9

 Discovering clusters based on seed nodes

▪ Given: seed node s
▪ Compute (approximate) Personalized PageRank (PPR) around node s (teleport set = {s})
▪ Idea: if s belongs to a nice cluster, the random walk will get trapped inside the cluster

[Figure: random walk spreading out from the seed node into its surrounding cluster]

SLIDE 10

 Algorithm outline:

▪ Pick a seed node s of interest
▪ Run PPR with teleport set = {s}
▪ Sort the nodes by decreasing PPR score
▪ Sweep over the nodes and find good clusters

[Figure: sweep curve from the seed node; x-axis: node rank in decreasing PPR score, y-axis: cluster “quality” (lower is better); local minima mark good clusters]

SLIDE 11

 Undirected graph 𝐺(𝑉, 𝐸)

 Partitioning task:

▪ Divide the vertices into two disjoint groups 𝐴 and 𝐵 = 𝑉∖𝐴

 Question:

▪ How can we define a “good” cluster in 𝐺?

[Figure: example graph on nodes 1-6 split into 𝐴 and 𝐵 = 𝑉∖𝐴]

SLIDE 12

 What makes a good cluster?

▪ Maximize the number of within-cluster connections
▪ Minimize the number of between-cluster connections

[Figure: example graph on nodes 1-6 with cluster 𝐴 and complement 𝑉∖𝐴]

SLIDE 13

 Express cluster quality as a function of the “edge cut” of the cluster

 Cut: set of edges (edge weights) with only one node in the cluster:

  $\mathrm{cut}(A) = \sum_{i \in A,\, j \notin A} w_{ij}$

[Figure: example graph on nodes 1-6 with cut(𝐴) = 2]

Note: This works for weighted and unweighted (set all $w_{ij} = 1$) graphs
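The cut score is easy to compute directly. Below is a minimal Python sketch (the adjacency-dict format and the 6-node example graph are our own assumptions, not from the slides):

```python
def cut_score(adj, A):
    """Weight of edges with exactly one endpoint in cluster A.

    adj: node -> dict of neighbor -> edge weight
         (use weight 1.0 everywhere for an unweighted graph)
    A:   set of nodes forming the candidate cluster
    """
    A = set(A)
    return sum(w for i in A
                 for j, w in adj[i].items()
                 if j not in A)

# A small 6-node example (edges are our assumption), unweighted:
adj = {
    1: {2: 1.0, 3: 1.0},
    2: {1: 1.0, 3: 1.0},
    3: {1: 1.0, 2: 1.0, 4: 1.0},
    4: {3: 1.0, 5: 1.0, 6: 1.0},
    5: {4: 1.0, 6: 1.0},
    6: {4: 1.0, 5: 1.0},
}
print(cut_score(adj, {1, 2, 3}))  # -> 1.0 (only edge 3-4 crosses the cut)
```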

SLIDE 14

 Partition quality: cut score

▪ Quality of a cluster is the weight of connections pointing outside the cluster

 Degenerate case / problem:

▪ Only considers external cluster connections
▪ Does not consider internal cluster connectivity

[Figure: the “optimal cut” compared to the minimum cut]

SLIDE 15

 Criterion: Conductance [Shi-Malik]

Connectivity of the group to the rest of the network, relative to the density of the group:

  $\phi(A) = \dfrac{|\{(i,j) \in E;\ i \in A,\, j \notin A\}|}{\min(\mathrm{vol}(A),\ 2m - \mathrm{vol}(A))}$

▪ vol(𝐴): total weight of the edges with at least one endpoint in 𝐴: $\mathrm{vol}(A) = \sum_{i \in A} d_i$

◼ Equivalently: vol(𝐴) = 2 · #edges inside 𝐴 + #edges pointing out of 𝐴

◼ Why use this criterion? It produces more balanced partitions

where 𝑚 is the number of edges of the graph, $d_i$ is the degree of node 𝑖, and 𝐸 is the edge set of the graph
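Under these definitions, conductance is a few lines of Python. A minimal sketch (function and argument names are ours):

```python
def conductance(adj, A, m):
    """phi(A) = cut(A) / min(vol(A), 2m - vol(A)).

    adj: node -> dict of neighbor -> weight (undirected: store both directions)
    A:   candidate cluster (set of nodes)
    m:   total edge weight of the graph
    """
    A = set(A)
    vol = sum(w for i in A for w in adj[i].values())   # sum of degrees in A
    cut = sum(w for i in A for j, w in adj[i].items()
              if j not in A)                            # edges leaving A
    denom = min(vol, 2 * m - vol)
    return cut / denom if denom > 0 else float("inf")
```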
SLIDE 16

[Figure: two example cuts with conductance 𝜙 = 2/4 = 0.5 and 𝜙 = 6/92 ≈ 0.065]

SLIDE 17

 Algorithm outline:

▪ Pick a seed node s of interest
▪ Run PPR with teleport set = {s}
▪ Sort the nodes by decreasing PPR score
▪ Sweep over the nodes and find good clusters

 Sweep:

▪ Sort nodes in decreasing PPR score: $r_1 > r_2 > \cdots > r_n$
▪ For each 𝑖 compute the conductance $\phi(A_i)$ of the set $A_i$ of the first 𝑖 nodes
▪ Local minima of $\phi(A_i)$ correspond to good clusters

[Figure: sweep curve; x-axis: node rank 𝑖 in decreasing PPR score, y-axis: conductance $\phi(A_i)$; local minima mark good clusters]

SLIDE 18

 The whole sweep curve can be computed in linear time:

▪ For loop over the nodes
▪ Keep a hash table of the nodes in the current set $A_i$
▪ To compute $\phi(A_{i+1}) = \mathrm{Cut}(A_{i+1}) / \mathrm{Vol}(A_{i+1})$, update incrementally:
  ▪ $\mathrm{Vol}(A_{i+1}) = \mathrm{Vol}(A_i) + d_{i+1}$
  ▪ $\mathrm{Cut}(A_{i+1}) = \mathrm{Cut}(A_i) + d_{i+1} - 2\,\#(\text{edges of } u_{i+1} \text{ to } A_i)$

[Figure: sweep curve; x-axis: node rank 𝑖 in decreasing PPR score, y-axis: conductance $\phi(A_i)$; local minima mark good clusters]
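Putting the incremental updates together, a minimal sketch of the linear-time sweep (our naming; it reuses the adjacency-dict format from the sketches above and uses the full min(vol, 2m - vol) denominator from the conductance definition):

```python
def sweep(adj, order, m):
    """Incrementally compute phi(A_i) for each prefix of `order`.

    order: nodes sorted by decreasing PPR score
    m:     total edge weight of the graph
    Returns the list phi(A_1), phi(A_2), ...; local minima mark good clusters.
    """
    in_set, vol, cut, phis = set(), 0.0, 0.0, []
    for u in order:
        d_u = sum(adj[u].values())                       # degree of u
        to_set = sum(w for v, w in adj[u].items() if v in in_set)
        vol += d_u                                       # Vol grows by d_u
        cut += d_u - 2 * to_set                          # u's edges into A_i flip sides
        in_set.add(u)
        denom = min(vol, 2 * m - vol)
        phis.append(cut / denom if denom > 0 else float("inf"))
    return phis
```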

SLIDE 19

 How to compute Personalized PageRank (PPR) without touching the whole graph?

▪ The power method won’t work, since each single iteration accesses all nodes of the graph:

  $r^{(t+1)} = \beta M \cdot r^{(t)} + (1 - \beta)\, a$

  ▪ 𝑎 is the teleport vector: 𝑎 = [0 … 0 1 0 … 0]ᵀ, with the 1 at index s
  ▪ 𝑟 is the personalized PageRank vector

 Approximate PageRank [Andersen, Chung, Lang, ’07]

▪ A fast method for computing approximate Personalized PageRank (PPR) with teleport set = {s}
▪ ApproxPageRank(s, β, ε)
  ▪ s … seed node
  ▪ β … teleportation parameter
  ▪ ε … approximation error parameter

SLIDE 20

 Overview of approximate PPR:

▪ Based on the lazy random walk, a variant of a random walk that stays put with probability 1/2 at each time step, and walks to a random neighbor the other half of the time
▪ Keep track of the residual PPR score $q_u = p_u - r_u^{(t)}$
  ▪ The residual tells us how well the PPR score of node 𝑢 is approximated
  ▪ $p_u$ … the “true” PageRank of node 𝑢
  ▪ $r_u^{(t)}$ … the PageRank estimate of node 𝑢 at round 𝑡
▪ If the residual $q_u$ of node 𝑢 is too big, $q_u / d_u \geq \varepsilon$ (with $d_u$ the degree of 𝑢), then push the walk further (distribute some of the residual $q_u$ to all of 𝑢’s neighbors along outgoing edges); else don’t touch the node

SLIDE 21

 A different way to look at PageRank:

[Jeh & Widom. Scaling Personalized Web Search, 2002]

  $p_\beta(a) = (1 - \beta)\, a + \beta\, p_\beta(M \cdot a)$

▪ $p_\beta(a)$ is the true PageRank vector with teleport parameter β and teleport vector 𝑎
▪ $p_\beta(M \cdot a)$ is the PageRank vector with teleport vector 𝑀 ⋅ 𝑎 and teleport parameter β
▪ where 𝑀 is the stochastic PageRank transition matrix
▪ Notice: 𝑀 ⋅ 𝑎 is one step of a random walk
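The identity can be sanity-checked numerically. A small self-contained sketch (ours; the 3-node column-stochastic matrix M is an arbitrary illustration):

```python
import numpy as np

def pagerank(M, a, beta=0.8, iters=200):
    """Power iteration for p = beta * M @ p + (1 - beta) * a."""
    p = np.copy(a)
    for _ in range(iters):
        p = beta * (M @ p) + (1 - beta) * a
    return p

# Tiny 3-node example; columns of M sum to 1 (column-stochastic)
M = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
a = np.array([1.0, 0.0, 0.0])   # teleport to node 0
beta = 0.8

lhs = pagerank(M, a, beta)
rhs = (1 - beta) * a + beta * pagerank(M, M @ a, beta)
print(np.allclose(lhs, rhs))    # -> True
```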

SLIDE 22

 Proving $p_\beta(a) = (1 - \beta)\, a + \beta\, p_\beta(M \cdot a)$:

▪ We can break this probability into two cases:
  ▪ Walks of length 0, and
  ▪ Walks of length longer than 0
▪ The probability of a length-0 walk is (1 − β), and the walk ends where it started, with walker distribution 𝑎
▪ The probability of a walk of length > 0 is β; such a walk starts at distribution 𝑎, takes a step (so it has distribution 𝑀𝑎), and then takes the rest of the random walk with distribution $p_\beta(Ma)$
▪ Note that we used the memoryless nature of the walk: once we know that the position after the first step has distribution 𝑀𝑎, the rest of the walk can forget where it started and behave as if it started at 𝑀𝑎. This is the key idea of the proof.

SLIDE 23

 Idea:

▪ 𝑟 … approx. PageRank, 𝑞 … its residual PageRank (residual PPR score $q_u = p_u - r_u$)
▪ Start with the trivial approximation: 𝑟 = 0 and 𝑞 = 𝑎
▪ Iteratively push PageRank from 𝑞 to 𝑟 until 𝑞 is small

 Push: one step of a lazy random walk from node 𝑢:

  Push(u, r, q):
    r′ = r, q′ = q
    # update r: move (1 − β) of the probability from q_u to r_u
    r′_u = r_u + (1 − β) q_u
    # one step of the lazy walk: stay at u with prob. ½
    q′_u = ½ β q_u
    # spread the remaining ½ β fraction of q_u as if a single step
    # of a random walk were applied to u
    for each v such that u → v:
      q′_v = q_v + ½ β q_u / d_u
    return r′, q′
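A direct Python transcription of the push step, as a sketch under the slide’s definitions (dict-based sparse vectors are our choice):

```python
def push(u, r, q, adj, beta):
    """One lazy-random-walk push from node u (updates r and q in place).

    r:    dict node -> approximate PPR score
    q:    dict node -> residual PPR score
    adj:  node -> dict of neighbor -> edge weight
    beta: teleportation parameter
    """
    qu = q.get(u, 0.0)
    d_u = sum(adj[u].values())
    r[u] = r.get(u, 0.0) + (1 - beta) * qu   # move (1 - beta) of q_u into r_u
    q[u] = 0.5 * beta * qu                   # lazy walk: stay at u with prob. 1/2
    for v, w in adj[u].items():              # spread the other (1/2) * beta * q_u
        q[v] = q.get(v, 0.0) + 0.5 * beta * qu * w / d_u
    return r, q
```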

SLIDE 24

 If $q_u$ is large, this means that we have underestimated the importance of node 𝑢

 Then we want to take some of that residual ($q_u$) and give it away, since we know that we have too much of it

 So we keep $\frac{1}{2} \beta\, q_u$ and give the rest away to our neighbors, to get rid of it

▪ This corresponds to spreading the $\frac{1}{2} \beta\, q_u / d_u$ term

 Each node keeps giving away this excess PageRank until all nodes have no, or only a very small, excess of PageRank

SLIDE 25

 ApproxPageRank(s, β, ε):

  Set r = 0, q = [0 … 0 1 0 … 0]   (the 1 at index s)
  While max_{u∈V} q_u / d_u ≥ ε:
    Choose any vertex u where q_u / d_u ≥ ε
    Push(u, r, q):
      r′ = r, q′ = q
      r′_u = r_u + (1 − β) q_u
      q′_u = ½ β q_u
      for each v such that u → v:
        q′_v = q_v + ½ β q_u / d_u
    r = r′, q = q′
  Return r

▪ r … PPR vector, r_u … PPR score of u
▪ q … residual PPR vector, q_u … residual of node u
▪ d_u … degree of u
▪ Update r: move (1 − β) of the probability from q_u to r_u
▪ One step of a lazy random walk: stay at u with prob. ½; spread the remaining ½ β fraction of q_u as if a single step of a random walk were applied to u
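Combining the pieces, a runnable sketch of the full loop; it reuses the push() sketch above, and the worklist strategy for picking the next vertex is our own choice (the pseudocode allows any violating vertex):

```python
def approx_pagerank(s, beta, eps, adj):
    """Approximate PPR with teleport set {s}; touches only nodes near s."""
    def deg(x):                          # weighted degree, computed on demand
        return sum(adj[x].values())
    r, q = {}, {s: 1.0}                  # r = 0, q = indicator vector of s
    work = [s]                           # candidates with q_u / d_u >= eps
    while work:
        u = work.pop()
        if q.get(u, 0.0) / deg(u) < eps:
            continue                     # residual already small enough
        push(u, r, q, adj, beta)         # the push() sketched above
        for v in list(adj[u]) + [u]:     # only u and its neighbors changed
            if q.get(v, 0.0) / deg(v) >= eps:
                work.append(v)
    return r
```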

SLIDE 26

 Runtime:

▪ ApproxPageRank (PageRank-Nibble) computes PPR in time $O\left(\frac{1}{\varepsilon(1-\beta)}\right)$ with residual error ≤ ε
▪ The power method would take time $O\left(\frac{\log n}{\varepsilon(1-\beta)}\right)$

 Graph cut approximation guarantee:

▪ If there exists a cut of conductance 𝜙 and volume 𝑘, then the method finds a cut of conductance $O(\sqrt{\phi \log k})$
▪ Details in [Andersen, Chung, Lang. Local graph partitioning using PageRank vectors, 2007]
  http://www.math.ucsd.edu/~fan/wp/localpartfull.pdf

SLIDE 27

 The smaller the ε, the farther the random walk will spread!

[Figure: PPR mass spreading farther from the seed node as ε decreases]

SLIDE 28


[Andersen, Lang: Communities from seed sets, 2006]

SLIDE 29

 Algorithm summary:

▪ Pick a seed node s of interest
▪ Run PPR with teleport set = {s}
▪ Sort the nodes by decreasing PPR score
▪ Sweep over the nodes and find good clusters

[Figure: sweep curve from the seed node; x-axis: node rank in decreasing PPR score, y-axis: cluster “quality” (lower is better); local minima mark good clusters]

SLIDE 31

 Communities: sets of tightly connected nodes

 Define: Modularity 𝑄

▪ A measure of how well a network is partitioned into communities
▪ Given a partitioning of the network into groups 𝑠 ∈ 𝑆:

  𝑄 ∝ ∑_{𝑠∈𝑆} [ (# edges within group 𝑠) − (expected # edges within group 𝑠) ]

Need a null model!

SLIDE 32

 Given a real graph 𝐺 on 𝑛 nodes and 𝑚 edges, construct a rewired network 𝐺′

▪ Same degree distribution but random connections
▪ Consider 𝐺′ as a multigraph
▪ The expected number of edges between nodes 𝑖 and 𝑗 of degrees $k_i$ and $k_j$ equals $k_i \cdot \frac{k_j}{2m} = \frac{k_i k_j}{2m}$
▪ The expected number of edges in (multigraph) 𝐺′:

  $\frac{1}{2} \sum_{i \in N} \sum_{j \in N} \frac{k_i k_j}{2m} = \frac{1}{2} \cdot \frac{1}{2m} \sum_{i \in N} k_i \sum_{j \in N} k_j = \frac{1}{4m} \cdot 2m \cdot 2m = m$

Note: $\sum_{u \in V} k_u = 2m$

SLIDE 33

 Modularity of a partitioning 𝑆 of graph 𝐺:

▪ 𝑄 ∝ ∑_{𝑠∈𝑆} [ (# edges within group 𝑠) − (expected # edges within group 𝑠) ]

  $Q(G, S) = \frac{1}{2m} \sum_{s \in S} \sum_{i \in s} \sum_{j \in s} \left( A_{ij} - \frac{k_i k_j}{2m} \right)$

  where $A_{ij} = 1$ if 𝑖 → 𝑗 and 0 otherwise, and $\frac{1}{2m}$ is a normalizing constant ensuring $-1 \leq Q \leq 1$

 Modularity values take the range [−1, 1]

▪ 𝑄 is positive if the number of edges within groups exceeds the expected number
▪ 𝑄 greater than 0.3-0.7 means significant community structure
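As a concrete reading of this formula, a minimal Python sketch (naming is ours; `communities` maps each node to its group id; the double loop is O(n²), fine for small examples):

```python
def modularity(adj, communities):
    """Q(G, S) = (1/2m) * sum over same-community pairs of (A_ij - k_i*k_j/(2m))."""
    deg = {u: sum(adj[u].values()) for u in adj}   # k_u
    two_m = sum(deg.values())                      # 2m = sum of degrees
    Q = 0.0
    for i in adj:
        for j in adj:
            if communities[i] == communities[j]:
                A_ij = adj[i].get(j, 0.0)
                Q += A_ij - deg[i] * deg[j] / two_m
    return Q / two_m
```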

SLIDE 34

Equivalently, modularity can be written as:

  $Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$

where $\delta(c_i, c_j)$ is an indicator function that equals 1 when nodes 𝑖 and 𝑗 are in the same community ($c_i = c_j$) and 0 otherwise.

Idea: We can identify communities by maximizing modularity

SLIDE 36

 Greedy algorithm for community detection

▪ O(𝑛 log 𝑛) run time (observed empirically)

 Supports weighted graphs
 Provides hierarchical partitions
 Widely utilized to study large networks because:

▪ It is fast
▪ It has rapid convergence properties
▪ It yields high-modularity output (i.e., “better communities”)

[Fast unfolding of communities in large networks, Blondel et al. (2008)]

SLIDE 37

 The Louvain algorithm greedily maximizes modularity

 Each pass is made of 2 phases:

▪ Phase 1: Modularity is optimized by allowing only local changes of communities
▪ Phase 2: The identified communities are aggregated to build a new network of communities
▪ Go to Phase 1

The passes are repeated iteratively until no increase of modularity is possible!

SLIDE 38

 Put each node of the graph into a distinct community (one node per community)

 For each node 𝑖, the algorithm performs two calculations:

▪ Compute the modularity gain (Δ𝑄) when moving node 𝑖 from its current community into the community of some neighbor 𝑗 of 𝑖
▪ Move 𝑖 to the community that yields the largest modularity gain Δ𝑄

 The loop runs until no movement yields a gain

This first phase stops when a local maximum of the modularity is attained, i.e., when no individual move can improve the modularity. Note that the output of the algorithm depends on the order in which the nodes are considered; research indicates that the ordering of the nodes does not have a significant influence on the modularity that is obtained.

SLIDE 39

What is Δ𝑄 if we move node 𝑖 to community 𝐶?

  $\Delta Q(i \to C) = \left[ \frac{\Sigma_{in} + 2 k_{i,in}}{2m} - \left( \frac{\Sigma_{tot} + k_i}{2m} \right)^{2} \right] - \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^{2} - \left( \frac{k_i}{2m} \right)^{2} \right]$

▪ where:
  ▪ $\Sigma_{in}$ … sum of link weights between nodes in 𝐶
  ▪ $\Sigma_{tot}$ … sum of all link weights of nodes in 𝐶
  ▪ $k_{i,in}$ … sum of link weights between node 𝑖 and 𝐶
  ▪ $k_i$ … sum of all link weights (i.e., degree) of node 𝑖

 We also need to derive Δ𝑄(𝐷 → 𝑖), the gain of taking node 𝑖 out of its current community 𝐷

 And then: Δ𝑄 = Δ𝑄(𝑖 → 𝐶) + Δ𝑄(𝐷 → 𝑖)
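In code, the gain collapses to a compact expression; the algebraic simplification below (expand both brackets and cancel) is ours:

```python
def delta_q_join(i, C, adj, deg, m):
    """Modularity gain from moving isolated node i into community C.

    Expanding the bracketed formula above gives
        dQ = k_i_in / m - (sigma_tot * k_i) / (2 * m * m)
    C: set of nodes; deg: node -> weighted degree; m: total edge weight.
    """
    k_i = deg[i]
    k_i_in = sum(w for j, w in adj[i].items() if j in C)   # links from i into C
    sigma_tot = sum(deg[j] for j in C)                     # degree mass of C
    return k_i_in / m - (sigma_tot * k_i) / (2 * m * m)
```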

SLIDE 40

 The partitions obtained in the first phase are contracted into super-nodes, and a weighted network is created as follows:

▪ Super-nodes are connected if there is at least one edge between nodes of the corresponding communities
▪ The weight of the edge between two super-nodes is the sum of the weights of all edges between their corresponding partitions

 The loop runs until the community configuration does not change anymore
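A minimal sketch of this aggregation step (our naming; representing internal community weight as self-loops is a common Louvain convention, assumed here rather than stated on the slide):

```python
from collections import defaultdict

def aggregate(adj, communities):
    """Contract communities into super-nodes of a weighted graph.

    communities: node -> community id. The weight between two super-nodes
    sums all weights between their member nodes; internal edges become
    self-loop weight (each undirected edge is seen from both endpoints).
    """
    super_adj = defaultdict(lambda: defaultdict(float))
    for u in adj:
        for v, w in adj[u].items():
            cu, cv = communities[u], communities[v]
            super_adj[cu][cv] += w
    return {c: dict(nbrs) for c, nbrs in super_adj.items()}
```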
