
Bigger, Faster, Random(ized): Computing in the Era of Big Data

Ioana Dumitriu
Department of Mathematics, University of Washington (Seattle)

Joint work with Grey Ballard, Gerandy Brito, James Demmel, Maryam Fazel, Roy Han, Kameron Harris, and Amin Jalali

MIDAS Seminar Series, University of Michigan, January 12, 2018

Outline

1. Intro/Overarching Theme: Large Data and Randomization
2. The Stochastic Block Model (results and improvements)
3. Graph Expanders and the Spectral Gap (results; applications)
4. Random Matrices in Numerical Linear Algebra (why is communication bad?; randomized spectral divide and conquer)
5. Conclusions

Intro/Overarching Theme: Large Data and Randomization

Data, Data, Data

- Large corporations accumulate and store massive amounts of data, some of which gets mined in order to inform decision-making.
- Some of the implications of this are very worrisome (see "Weapons of Math Destruction" by Cathy O'Neil), but most are already ingrained in the way business is conducted, research is done, etc. The world is data-driven.
- Data Mining (~ a subset of Machine Learning) includes:
  - Clustering/community detection (social, biological networks)
  - Association rule learning (e.g., extrapolation of preferences for the purposes of marketing)
  - Classification, regression, anomaly detection, etc.

Data Algorithms

- In many ways, randomization is a key factor in understanding how to do these things:
  - Devising mathematical models for analysis, threshold studies, theoretical guarantees, and benchmarking (e.g., the Stochastic Block Model for clustering).
  - Extrapolating from incomplete data (e.g., matrix completion for marketing algorithms uses random matrix results; new results point to the usefulness of graph expanders).
  - Speeding up algorithms by using only a random subset of the data, etc.

Use of Numerical Linear Algebra for Data Algorithms

- Most algorithms for data mining make heavy use of numerical linear algebra, sometimes for very large matrices (10^6 entries).
- Parallelism and state-of-the-art algorithms are available in LAPACK/Matlab.
- But there is a less-known cost to algorithms that relates to communication, and not all algorithms are optimized for it.
- Randomization can also help with that (e.g., a randomized non-symmetric eigenvalue solver).

Part 1: Clustering in the Stochastic Block Model

The Clustering Problem

- Input: a network with clusters (possibly also overlapping); the problem asks whether it is possible to detect/recover them accurately and efficiently.
- Applications in machine learning, community detection, synchronization, channel transmission, etc.
- The questions are many and subtle.
- Huge body of work: OR, EE, theoretical CS, math.

The Stochastic Block Model (SBM)

- A.k.a. the "planted partition" model.
- Classically uses the Erdős-Rényi random graph G(n, p), in which each edge between a pair of vertices in an n-set occurs independently with probability p.
- Consider K independent, non-overlapping graphs G(n_i, p_i), joined by a multipartite graph G(n_1, ..., n_K, q).
- Under what sort of conditions on the n_i, p_i, K, q can one (almost) recover/approximate/detect the presence of the partition? (A minimal sampler is sketched below.)
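To make the model concrete, here is a minimal NumPy sketch of sampling an adjacency matrix from the heterogeneous SBM just described. The function name sample_sbm and all parameter values are illustrative choices, not from the talk.

```python
import numpy as np

def sample_sbm(sizes, p_in, q, rng=None):
    """Sample an adjacency matrix from the SBM described above.

    sizes : cluster sizes n_1, ..., n_K
    p_in  : within-cluster edge probabilities p_1, ..., p_K
    q     : between-cluster edge probability
    """
    rng = np.random.default_rng(rng)
    n = sum(sizes)
    # Start with the between-cluster probability everywhere ...
    P = np.full((n, n), q)
    # ... then overwrite the diagonal blocks with the within-cluster p_i.
    start = 0
    for n_i, p_i in zip(sizes, p_in):
        P[start:start + n_i, start:start + n_i] = p_i
        start += n_i
    # Symmetric 0/1 adjacency matrix, no self-loops.
    upper = np.triu(rng.random((n, n)) < P, k=1)
    return (upper | upper.T).astype(int)

A = sample_sbm(sizes=[50, 30, 20], p_in=[0.5, 0.6, 0.7], q=0.05, rng=42)
```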

SBM Analysis

Recovery:

- Huge body of literature in OR/EE/theoretical CS; the possibility of recovery has been studied via the Maximum Likelihood Estimator (MLE) and convex relaxations using semidefinite programming (SDPs); also multiple-structure SDPs (sparse + low-rank, e.g., Vinayak, Oymak, Hassibi (2014)).
- The most general analysis for recovery, via information-theoretic impossibility bounds and a convex relaxation of the MLE, is in Chen and Xu (2015); various order-sharp bounds for K equivalent clusters (K may grow with n).
- Other work treats more restricted models, including thresholds (e.g., Abbe, Sandon (2015)) and partial recovery/approximation/detectability (e.g., Yun, Proutiere (2014), Coja-Oghlan (2010), Le, Levina, Vershynin (2015), Guédon and Vershynin (2015), Decelle, Krzakala, Moore, Zdeborová (2011)).

- The only case completely solved so far, in terms of all the various thresholds, is the two "equal" cluster (binary) case: Mossel, Neeman, Sly (2012-2014), Massoulié (2013), Abbe, Bandeira, Hall (2014), Coja-Oghlan (2010).
- Other thresholds are known for exact recovery/weak recovery with O(n) blocks, etc.

Contributions to SBM

- Assume a partition V_1, ..., V_K of the n vertices, with |V_i| = n_i. Connect u to v with probability

      P(u ~ v) = p_i,  if there exists i such that u, v ∈ V_i;
                 q,    otherwise.

- No restrictions on the growth of the V_i (heterogeneous SBM).
- Find the recovery regimes: when is recovery possible? efficiently possible? impossible?

Our results

With Fazel, Han, Jalali (NIPS 2017), we worked on the heterogeneous SBM to obtain:

- Lower bounds on the impossibility threshold (via information-theoretic means) and upper bounds on the recovery and efficient-recovery thresholds (via an MLE-like estimator and its convexification, respectively), in terms of all involved parameters.
- The crucial parameter ρ_i = n_i(p_i − q) ("relative density") appears in most bounds. All ρ_i must be at least logarithmic in n for recovery.
- We showed that small, dense clusters are recoverable up to size O(√(log n)) (previous work implied an O(log n) threshold).
- We proved that the heterogeneous case cannot be approximated by previous, homogeneous approaches (heuristics are insufficient).
- We used convex optimization and state-of-the-art spectral bounds for random matrices (Bandeira, van Handel '14).

Part 2: Spectral Gap in Random Graph Expanders, and Applications

Bipartite, biregular graphs

A graph is (m, n, d_1, d_2) bipartite biregular if the vertex set splits into two classes of sizes m and n, respectively, with all edges going between the classes. Moreover, the degree of each vertex in the m-class is d_1 and the degree of each vertex in the n-class is d_2 (so m d_1 = n d_2).

Random bipartite, biregular graphs

- Let G(d_1, d_2, m, n) be a random bipartite graph generated with the configuration model (Bender, Canfield '78; Bollobás '80), which is "asymptotically uniform". (A stub-matching sketch follows below.)
- If m/n ~ d_2/d_1 ~ γ ∈ [0, 1] as m, n → ∞, the limiting empirical spectral distribution (ESD) exists (Godsil-Mohar '88).
- Examine the adjacency matrix A with A_ij = δ_{i~j} (a symmetric matrix); its spectrum is symmetric around 0. The expanding qualities are determined by the third-largest eigenvalue, relative to the first/second.
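Here is a minimal stub-matching sketch of the configuration model just described. The function name bipartite_configuration_model and the parameter values are illustrative (not from the talk), and multi-edges are kept, as in the raw configuration model.

```python
import numpy as np

def bipartite_configuration_model(m, n, d1, d2, rng=None):
    """Configuration model for an (m, n, d1, d2) bipartite biregular
    multigraph: pair up half-edge "stubs" uniformly at random.
    Requires m * d1 == n * d2. May produce multi-edges; conditioning on
    simplicity (or erasing duplicates) gives the simple-graph model.
    """
    assert m * d1 == n * d2, "handshake condition m*d1 == n*d2 must hold"
    rng = np.random.default_rng(rng)
    left_stubs = np.repeat(np.arange(m), d1)    # d1 stubs per left vertex
    right_stubs = np.repeat(np.arange(n), d2)   # d2 stubs per right vertex
    rng.shuffle(right_stubs)                    # uniform random pairing
    X = np.zeros((m, n), dtype=int)             # biadjacency (multi)matrix
    np.add.at(X, (left_stubs, right_stubs), 1)
    return X

X = bipartite_configuration_model(m=300, n=200, d1=4, d2=6, rng=0)
```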

Spectral gap in random bipartite, biregular graphs

- Let G(d_1, d_2, m, n) be a random bipartite graph generated with the configuration model. The largest-modulus eigenvalues are ±λ = ±√((d_1 − 1)(d_2 − 1)). What is the third largest?

Theorem (BDH '17). λ_3 ≤ √(d_1 − 1) + √(d_2 − 1) + o(1), with high probability.

- Note the sum instead of the product. Also, the bound is the upper limit of the bulk spectrum.
- The proof follows in the footsteps of Bordenave '15 (who simplified Friedman's proof of Alon's conjecture for random regular graphs).
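A rough empirical check of the theorem, assuming one reuses bipartite_configuration_model from the sketch above and collapses multi-edges (an approximation of the biregular model, made for simplicity):

```python
import numpy as np

m, n, d1, d2 = 300, 200, 4, 6
X = (bipartite_configuration_model(m, n, d1, d2, rng=1) > 0).astype(float)
# Full (m+n) x (m+n) adjacency matrix of the bipartite graph.
A = np.block([[np.zeros((m, m)), X], [X.T, np.zeros((n, n))]])
eigs = np.sort(np.abs(np.linalg.eigvalsh(A)))[::-1]
print(f"lambda_3   = {eigs[2]:.3f}")                      # third-largest modulus
print(f"bulk edge  = {np.sqrt(d1 - 1) + np.sqrt(d2 - 1):.3f}")
```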

Random bipartite, biregular graphs (RBBG)

- Idea: examine the "non-backtracking" matrix B, whose rows/columns are indexed by directed edges, with B_ef = 1 iff e = (v_1, v_2) and f = (v_2, v_3) with v_3 ≠ v_1. Non-symmetric!
- One can relate the eigenvalues of B to those of the adjacency matrix A via the Ihara-Bass formula

      det(B − λI) = (λ^2 − 1)^{|E|−n} det(D − λA + λ^2 I),

  with |E| the number of edges and D the matrix of degrees.
- A spectral gap for B yields a spectral gap for A.
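As a concrete illustration of the definition (not from the slides), a small sketch that builds B from a 0/1 adjacency matrix:

```python
import numpy as np

def non_backtracking_matrix(A):
    """Non-backtracking matrix B of a simple undirected graph with 0/1
    adjacency matrix A. Rows/columns are indexed by directed edges;
    B[e, f] = 1 iff e = (u, v), f = (v, w), and w != u."""
    n = A.shape[0]
    edges = [(u, v) for u in range(n) for v in range(n) if A[u, v]]
    idx = {e: i for i, e in enumerate(edges)}
    B = np.zeros((len(edges), len(edges)), dtype=int)
    for (u, v) in edges:
        for w in np.nonzero(A[v])[0]:
            if w != u:                       # forbid immediate backtracking
                B[idx[(u, v)], idx[(v, w)]] = 1
    return B

# Tiny example: the 4-cycle (bipartite, 2-regular on both sides).
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]])
B = non_backtracking_matrix(A)
print(np.round(np.sort(np.abs(np.linalg.eigvals(B)))[::-1], 3))
```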

- Show that B has a spectral gap. (This is easier to do than for A, yet still very technical.)
- Subtract off a "centering" matrix that has the effect of zeroing out the two largest eigenvalues, to get B̄.
- Bound the largest eigenvalue of B̄ by the trace method:

      E ||B̄^ℓ||^{2k} ≤ E Tr( (B̄^ℓ (B̄^ℓ)^*)^k ).

- The rest is highly sophisticated path-counting.

Applications of RBBG: community detection

- Frame graphs: given a small, edge-weighted graph, use it to define community structure in a larger, random graph. Each class of the large graph is represented by a frame vertex, and the weights in the frame define the number of edges between classes. Quasi-regular.

[Figure: a small weighted frame and the random regular frame graph it generates; class proportions p_A = 1/8, p_B = 1/8, p_C = 3/4, with edge multiplicities 3, 3, 6, 1, 12, 2.]

- Such graphs are known as "equitable graphs" (Mohar '91, Newman & Martin '10, Barucca '17, Meila & Wan '15); they have been studied as objects for community detection (with lots of assumptions).
- Using a very general theorem of Meila '15 (under certain conditions, the highest eigenvalues of the random graph are those of the frame), we concluded that community detection is possible in such graphs (removing assumptions).
- The conditions are not optimal, but they are a starting point for further study.

Applications of RBBG: expander codes

- Expander codes (Tanner codes) were introduced in Tanner '62.
- They are linear error-correcting codes whose parity-check matrix is encoded in an expander graph.
- Using Tanner '81 and Janwa and Lal '03, one may construct codes with decent relative minimum distance and rate by using bipartite biregular graphs.

Applications of RBBG: matrix completion

- Idea: given a large matrix Y with "low complexity" (e.g., sparse, low-rank, etc.), observe some of Y's entries and, based on them, find Y′ such that ||Y − Y′|| is small (or even 0) in some norm || · ||. (The Netflix problem; Amazon, etc.)
- This is the matrix version of compressed sensing (Candès and Plan '10, Candès and Tao '10).
- Recent idea: sample entries according to a random regular graph (Heiman et al. '14, Bhojanapalli and Jain '14, Gamarnik et al. '17). (A biregular variant is sketched below.)
- If one uses an RBBG instead (simple-mindedly), the bounds improve by a factor of 2 (as compared to Heiman et al. '14; the comparison with Gamarnik et al. '17 is under study). Possibly more?...
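A small sketch of this sampling idea, reusing bipartite_configuration_model from the earlier sketch; the rank, sizes, and degrees are illustrative, and no completion algorithm is run here, only the observation pattern is built:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, r = 300, 200, 5
Y = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r target
# Observe entries along a bipartite biregular pattern: (up to multi-edge
# collapses) d1 entries per row and d2 per column.
mask = bipartite_configuration_model(m, n, d1=20, d2=30, rng=4) > 0
observed = np.where(mask, Y, 0.0)    # a completion algorithm sees only these
print(f"observed fraction: {mask.mean():.3f}")
```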

Part 3: Randomize to Minimize Communication in Numerical Linear Algebra

Why is communication bad? The Communication Cost Model

Algorithms have two costs:

1. Arithmetic (flops)
2. Communication: moving data between
   - levels of a memory hierarchy (sequential case)
   - processors over a network (parallel case)

The running time of an algorithm is the sum of three terms (a toy calculator follows below):

- # flops × time per flop
- # words moved / bandwidth
- # messages × latency
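As a toy illustration of this cost model, here is a minimal sketch; the hardware parameters below are made-up illustrative defaults, not figures from the talk.

```python
def running_time(flops, words, messages,
                 time_per_flop=1e-9,   # seconds per flop (illustrative)
                 bandwidth=1e9,        # words per second (illustrative)
                 latency=1e-6):        # seconds per message (illustrative)
    """Sum of the three cost terms in the communication cost model above."""
    return flops * time_per_flop + words / bandwidth + messages * latency

# Same flop count, very different communication profiles:
print(running_time(flops=1e9, words=1e8, messages=1e3))  # communication-heavy
print(running_time(flops=1e9, words=1e6, messages=10))   # communication-light
```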

Exponentially growing gaps:

- In parallel: time per flop ≪ 1 / network bandwidth ≪ network latency
  (improving ~59% per year vs. ~26% per year vs. ~15% per year)
- Sequentially: time per flop ≪ 1 / memory bandwidth ≪ memory latency
  (improving ~59% per year vs. ~23% per year vs. ~5.5% per year)

We need to reorganize linear algebra to avoid communication (# words and # messages moved).

Randomized Spectral Divide and Conquer

Divide-and-conquer for the non-symmetric eigenproblem

- Start with A; drive some eigenvalues to 1 and the others to 0, then do a rank-revealing decomposition to get the eigenspace. This amounts to a spectral divide-and-conquer.
- One can use lines and circles for splitting the space and localizing eigenvalues.
- To optimize communication, we need to use only simple QR, RQ, and matrix multiplication.

Overview of the (Ballard, D., Demmel '15) algorithm

One step of divide-and-conquer:

1. Compute (I + (A^{-1})^{2^k})^{-1} implicitly; this maps the eigenvalues of A (roughly) to 0 and 1. (A scalar illustration follows below.)
2. Compute a randomized rank-revealing decomposition (RURV) to find the invariant subspace.
3. Output the block-triangular matrix

       A_new = U* A U = [ A11  A12 ]
                        [  ε   A22 ],

   with block sizes chosen to minimize the norm of ε. The eigenvalues of A11 all lie outside the unit circle, the eigenvalues of A22 lie inside the unit circle, and the subproblems are solved recursively.
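A scalar sketch of what step 1 does to an eigenvalue z of A, assuming the map is z ↦ 1/(1 + z^{−2^k}) as reconstructed above; it tends to 1 for |z| > 1 and to 0 for |z| < 1 as k grows:

```python
# Illustrative only: apply the scalar version of the spectral map in step 1.
k = 6
for z in [0.5, 0.9, 1.1, 2.0]:
    print(z, 1.0 / (1.0 + z ** (-2 ** k)))   # ~0 inside, ~1 outside unit circle
```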

Rank-revealing decomposition

- We need a rank-revealing decomposition (e.g., A = URV with U, V orthogonal/unitary and R upper triangular) that will work on products of matrices and inverses, e.g. AB^{-1}, without forming the inverse.
- Randomize!

RURV

Starting with a matrix A, generate a decomposition A = URV with R upper triangular and U, V orthogonal/unitary:

1. Generate a random Gaussian matrix B.
2. [V, R̂] = QR(B)  (this generates a Haar orthogonal/unitary V).
3. Â = A · V^H.
4. [U, R] = QR(Â).
5. Output U, R, V.
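A minimal NumPy sketch of RURV as just described; rurv is an illustrative name, and the sign correction needed for V to be exactly Haar-distributed is glossed over:

```python
import numpy as np

def rurv(A, rng=None):
    """Sketch of RURV: A = U R V with R upper triangular and U, V orthogonal;
    V is (up to a standard sign convention) Haar, being the Q factor of a
    Gaussian matrix."""
    rng = np.random.default_rng(rng)
    n = A.shape[1]
    B = rng.standard_normal((n, n))        # step 1: random Gaussian B
    V, _ = np.linalg.qr(B)                 # step 2: [V, R-hat] = QR(B)
    U, R = np.linalg.qr(A @ V.T.conj())    # steps 3-4: QR of A-hat = A V^H
    return U, R, V

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5)) @ rng.standard_normal((5, 8))  # rank-5 matrix
U, R, V = rurv(A)
print(np.allclose(U @ R @ V, A))           # A = U R V holds by construction
```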

Generalized RURV (GRURV)

We want to find a rank-revealing factorization of A^{-1}B, but only need the left space:

1. [U_2, R_2, V] = RURV(B).
2. [R_1, U_1] = RQ(U_2^H A).
3. Output U_1.

Note that

    A^{-1}B = (U_2 R_1 U_1)^{-1} (U_2 R_2 V) = U_1^H (R_1^{-1} R_2) V.
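A matching sketch of GRURV, reusing rurv from above together with SciPy's RQ factorization; the explicit solves at the end only verify the identity above numerically and are not part of the algorithm:

```python
import numpy as np
from scipy.linalg import rq

def grurv(A, B, rng=None):
    """Sketch of GRURV for A^{-1} B as described above; never forms A^{-1}."""
    U2, R2, V = rurv(B, rng)           # rurv from the previous sketch
    R1, U1 = rq(U2.T.conj() @ A)       # RQ factorization: U2^H A = R1 U1
    return U1, R1, R2, V

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6)) + 6 * np.eye(6)   # comfortably invertible A
B = rng.standard_normal((6, 6))
U1, R1, R2, V = grurv(A, B, rng=2)
lhs = np.linalg.solve(A, B)                        # A^{-1} B (verification only)
rhs = U1.T.conj() @ np.linalg.solve(R1, R2) @ V    # U1^H (R1^{-1} R2) V
print(np.allclose(lhs, rhs))
```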

Why it works

Theorem (BDD '15). GRURV computes the RURV of A^{-1}B, and it is backward stable.

Theorem (BDD '15). RURV computes a strong rank-revealing decomposition for A, and it is backward stable.

RURV is strong

Let A have numerical rank k (with a large gap between σ_k and σ_{k+1}). Pick a Haar matrix V and then do QR on A V^H to get U, R. Then A = URV with

    R = [ R11  R12 ]
        [  0   R22 ],

and the following hold:

- σ_min(R11) is a good approximation to σ_k;
- σ_max(R22) is a good approximation to σ_{k+1};
- ||R11^{-1} R12|| is small.

All this happens with probability 1 − δ; making δ smaller increases the arithmetic costs. The analysis hinges on knowing the distribution of the smallest singular value of the k × k principal minor of V (D. '12). (A numerical illustration follows below.)
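A quick numerical look at the three claims above, reusing rurv from the earlier sketch on a matrix with a planted singular-value gap (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 4, 10
sigma = np.array([10.0, 9, 8, 7, 1e-4, 1e-5, 1e-6, 1e-7, 1e-8, 1e-9])
Q1, _ = np.linalg.qr(rng.standard_normal((n, n)))
Q2, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q1 @ np.diag(sigma) @ Q2                        # known singular values
U, R, V = rurv(A, rng=rng)
R11, R12, R22 = R[:k, :k], R[:k, k:], R[k:, k:]
print(np.linalg.svd(R11, compute_uv=False).min())   # ~ sigma_k     = 7
print(np.linalg.svd(R22, compute_uv=False).max())   # ~ sigma_{k+1} = 1e-4
print(np.linalg.norm(np.linalg.solve(R11, R12)))    # should be small
```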

Going back

- Good bounds on the smallest singular value of a minor of a Haar matrix V make RURV a strong rank-revealing factorization.
- This randomized RURV competes with the best known deterministic strong rank-revealing factorizations*. It is the only one that fulfills all of these conditions simultaneously:
  - It works when k = O(n).
  - It is random AND strong.
  - It works for products of matrices and inverses without computing inverses.
  - It is backward stable.
  - It uses only QR, RQ, and matrix multiplication, and is therefore communication-optimal.

Conclusions

What to take home

- Randomization has many uses, at different levels, in the analysis of "Big Data": modeling, testing, sampling, providing theoretical guarantees, and computing.
- Old algorithms need to be revamped/reorganized to deal with the realities of unevenly evolving computer architectures (which is why one must avoid communication).
- Random matrices and random graph/network theory are expanding quickly; new applications for rather theoretical results are being found each day.
- Graduate students: "Data Science" is a somewhat ill-defined field that lies wide open for someone with a good basic background in probability/statistics, combinatorics/algorithms, and numerical analysis. Go in and make it your own.