

SLIDE 1

Data-Intensive Distributed Computing

Part 6: Data Mining (1/4)

CS 431/631 451/651 (Winter 2019)
Ali Abedi
October 29, 2019

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

These slides are available at https://www.student.cs.uwaterloo.ca/~cs451

SLIDE 2

Structure of the Course

“Core” framework features and algorithm design

Analyzing Text
Analyzing Graphs
Analyzing Relational Data
Data Mining

SLIDE 3

Descriptive vs. Predictive Analytics

SLIDE 4

[Diagram: users and external APIs interact with multiple Frontend/Backend services, each backed by an OLTP database; ETL (Extract, Transform, and Load) moves data into a Data Warehouse / "Data Lake", which data scientists query with "traditional" BI tools, SQL-on-Hadoop, and other tools.]

SLIDE 5

Supervised Machine Learning

The generic problem of function induction given sample instances of input and output

Classification: output draws from a finite set of discrete labels
Regression: output is a continuous value
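In symbols (a standard way to write the distinction, not on the original slide):

\[ \text{classification: } y \in \{c_1, \dots, c_k\} \qquad\qquad \text{regression: } y \in \mathbb{R} \]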

This is not meant to be an exhaustive treatment of machine learning!

SLIDE 6

Source: Wikipedia (Sorting)

Classification

SLIDE 7

Applications

Spam detection
Sentiment analysis
Content (e.g., topic) classification
Link prediction
Fraud detection
Document ranking
Object recognition
And much, much more!

SLIDE 8

[Diagram: labeled training data is fed to a Machine Learning Algorithm, which produces a Model; at testing/deployment time the Model maps new, unlabeled inputs ("?") to predictions.]

Supervised Machine Learning

SLIDE 9

Objects are represented in terms of features:

“Dense” features: sender IP, timestamp, # of recipients, length of message, etc.
“Sparse” features: contains the term “Viagra” in message, contains “URGENT” in subject, etc.

Feature Representations
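As a concrete illustration of the two styles (a hypothetical sketch in Scala, not from the slides; all names and values are made up for a spam-detection example):

// Hypothetical sketch: dense vs. sparse feature representations for one email.
object FeatureExamples {
  // Dense: every object has a value for every feature, in a fixed order,
  // e.g., (sender IP as an integer, timestamp, # of recipients, message length).
  val dense: Array[Double] = Array(167772161.0, 1.5723e9, 3.0, 1042.0)

  // Sparse: store only the (feature -> value) pairs that actually fire,
  // e.g., binary indicators such as "contains 'Viagra' in message".
  val sparse: Map[String, Double] = Map(
    "body_contains_viagra"    -> 1.0,
    "subject_contains_URGENT" -> 1.0
  )

  def main(args: Array[String]): Unit = {
    println(dense.mkString(", "))
    println(sparse)
  }
}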

SLIDE 10

Applications

Spam detection
Sentiment analysis
Content (e.g., genre) classification
Link prediction
Fraud detection
Document ranking
Object recognition
And much, much more!

SLIDE 11

Components of a ML Solution

Data
Features
Model
Optimization

SLIDE 12

(Banko and Brill, ACL 2001) (Brants et al., EMNLP 2007)

No data like more data!

SLIDE 13

Limits of Supervised Classification?

Why is this a big data problem?

Isn’t gathering labels a serious bottleneck?

Solutions

Crowdsourcing
Bootstrapping, semi-supervised techniques
Exploiting user behavior logs

The virtuous cycle of data-driven products

SLIDE 14

[Diagram: the virtuous cycle — a useful service → analyze user behavior to extract insights (data science) → transform insights into action (data products) → back into the service, (hopefully) generating $ along the way. Examples: Google, Facebook, Twitter, Amazon, Uber.]

Virtuous Product Cycle

SLIDE 15

What’s the deal with neural networks?

Data Features Model Optimization

SLIDE 16

Supervised Binary Classification

Restrict output label to be binary

Yes/No, 1/0

Binary classifiers form primitive building blocks for multi-class problems…

SLIDE 17

Binary Classifiers as Building Blocks

Example: four-way classification

One vs. rest classifiers:
A or not? B or not? C or not? D or not? (each classifier asked independently)

Classifier cascades:
A or not? B or not? C or not? D or not? (asked in sequence, stopping at the first “yes”)
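A minimal sketch of the one-vs-rest scheme (hypothetical names, not from the slides): train one binary scorer per class and predict the class whose scorer is most confident.

// Hypothetical sketch: one-vs-rest multi-class prediction from binary scorers.
object OneVsRest {
  // Each binary classifier returns a confidence that the input belongs to its class.
  type BinaryScorer = Array[Double] => Double

  def predict(scorers: Map[String, BinaryScorer], x: Array[Double]): String =
    scorers.maxBy { case (_, score) => score(x) }._1   // pick the most confident class

  def main(args: Array[String]): Unit = {
    // Toy scorers for classes A..D (in practice these would be trained models).
    val scorers: Map[String, BinaryScorer] = Map(
      "A" -> (x => x(0)), "B" -> (x => x(1)),
      "C" -> (x => x(2)), "D" -> (x => x(3))
    )
    println(predict(scorers, Array(0.1, 0.7, 0.3, 0.2)))  // prints "B"
  }
}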

SLIDE 18

The Task

Given:

a (sparse) feature vector and a label

Induce:

Such that a loss function is minimized

Typically, we consider functions of a parametric form:

model parameters
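The formulas on this slide are images; a standard way to write the setup described here (symbols are my own, assuming n training examples):

Given a training set \( D = \{(x_i, y_i)\}_{i=1}^{n} \), where \( x_i \) is a (sparse) feature vector and \( y_i \) its label, induce a function \( f : X \rightarrow Y \) such that the total loss \( \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big) \) is minimized, where \( \ell \) is the loss function. Typically \( f \) has a parametric form \( f(x; \theta) \), where \( \theta \) are the model parameters.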

SLIDE 19

Key insight: machine learning as an optimization problem!

(closed form solutions generally not possible)
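In symbols (a standard statement of the idea; the slide's own equation is an image), training amounts to solving

\[ \theta^{*} = \operatorname*{argmin}_{\theta} \; \sum_{i=1}^{n} \ell\big(f(x_i; \theta),\, y_i\big) \]

and since closed-form solutions are generally not possible, we solve it iteratively, e.g., by gradient descent.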

SLIDE 20

Gradient Descent: Preliminaries

Rewrite: Compute gradient:

“Points” to fastest increasing “direction”

So, at any point:
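The equations on this slide are images; a standard formulation consistent with the surrounding text (with θ the model parameters): rewrite the objective as \( L(\theta) = \sum_i \ell\big(f(x_i; \theta), y_i\big) \) and compute its gradient

\[ \nabla L(\theta) = \left( \frac{\partial L}{\partial \theta_1}, \frac{\partial L}{\partial \theta_2}, \dots, \frac{\partial L}{\partial \theta_d} \right) \]

The gradient points in the direction of fastest increase of L, so at any point θ, a small step in the direction of \( -\nabla L(\theta) \) decreases the loss fastest.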

SLIDE 21

Gradient Descent: Iterative Update

Start at an arbitrary point, iteratively update:
We have:
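The update equation itself is an image; the standard gradient-descent iterate implied by the text is (γ is the step size):

\[ \theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \, \nabla L\big(\theta^{(t)}\big) \]

starting from an arbitrary \( \theta^{(0)} \).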

SLIDE 22

[Equation annotated: new weights = old weights − update based on the gradient]

Intuition behind the math…

SLIDE 23

Gradient Descent: Iterative Update

Start at an arbitrary point, iteratively update:

Lots of details:

Figuring out the step size
Getting stuck in local minima
Convergence rate
…

We have:

SLIDE 24

Repeat until convergence:

Gradient Descent

Note: sometimes formulated as ascent, but entirely equivalent

SLIDE 25

Gradient Descent

Source: Wikipedia (Hills)

SLIDE 26

Even More Details…

Gradient descent is a “first order” optimization technique

Often, slow convergence

Newton and quasi-Newton methods:

Intuition: Taylor expansion
Requires the Hessian (the square matrix of second-order partial derivatives): impractical to compute in full
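For reference, a standard statement of the Newton update implied by the text (H is the Hessian of the loss; not reproduced from the slide itself):

\[ \theta^{(t+1)} \leftarrow \theta^{(t)} - H^{-1}\big(\theta^{(t)}\big)\, \nabla L\big(\theta^{(t)}\big) \]

Quasi-Newton methods (e.g., L-BFGS, mentioned later) approximate \( H^{-1} \) from recent gradients instead of computing the full Hessian.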

SLIDE 27

Source: Wikipedia (Hammer)

Logistic Regression

SLIDE 28

Logistic Regression: Preliminaries

Given: Define: Interpretation:
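The definitions on this slide are images; a standard presentation, assuming binary labels \( y \in \{0, 1\} \) and a weight vector w:

Given: training examples \( (x_i, y_i) \) with \( y_i \in \{0, 1\} \).
Define: the model as linear in the log-odds,

\[ \ln \frac{\Pr(y = 1 \mid x)}{\Pr(y = 0 \mid x)} = w \cdot x \]

Interpretation: the score \( w \cdot x \) is the log of the odds that x belongs to the positive class.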

SLIDE 29

Relation to the Logistic Function

After some algebra: The logistic function:
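A standard reconstruction of the two equations referenced here (the originals are images):

\[ \Pr(y = 1 \mid x) = \frac{1}{1 + e^{-w \cdot x}} \qquad\text{and}\qquad \mathrm{logistic}(z) = \frac{1}{1 + e^{-z}} \]

so the conditional probability is just the logistic function applied to \( w \cdot x \).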

[Plot: the logistic function, logistic(z) vs. z]

SLIDE 30

Training an LR Classifier

Maximize the conditional likelihood: Define the objective in terms of conditional log likelihood: We know: So: Substituting:
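A standard version of the derivation sketched by these labels (the slide's equations are images; again assuming \( y \in \{0, 1\} \)). Maximize the conditional likelihood \( \prod_i \Pr(y_i \mid x_i; w) \), i.e., define the objective as the conditional log likelihood

\[ L(w) = \sum_{i} \ln \Pr(y_i \mid x_i; w). \]

We know \( \Pr(y{=}1 \mid x; w) = \frac{1}{1 + e^{-w \cdot x}} \), so

\[ \Pr(y \mid x; w) = \Pr(y{=}1 \mid x; w)^{\,y}\,\big(1 - \Pr(y{=}1 \mid x; w)\big)^{\,1-y}. \]

Substituting and simplifying:

\[ L(w) = \sum_{i} \Big( y_i \,(w \cdot x_i) - \ln\big(1 + e^{\,w \cdot x_i}\big) \Big). \]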

SLIDE 31

LR Classifier Update Rule

Take the derivative: General form of update rule: Final update rule:
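A standard reconstruction of the equations named here, continuing from the log likelihood above. Take the derivative:

\[ \frac{\partial L(w)}{\partial w} = \sum_{i} x_i \Big( y_i - \Pr(y_i{=}1 \mid x_i; w) \Big) \]

General form of the (gradient ascent) update rule:

\[ w^{(t+1)} \leftarrow w^{(t)} + \gamma^{(t)} \, \nabla L\big(w^{(t)}\big) \]

Final update rule:

\[ w^{(t+1)} \leftarrow w^{(t)} + \gamma^{(t)} \sum_{i} x_i \Big( y_i - \Pr(y_i{=}1 \mid x_i; w^{(t)}) \Big) \]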

SLIDE 32

Want more details? Take a real machine-learning course!

Lots more details…

Regularization
Different loss functions
…

SLIDE 33

[Diagram: mappers compute partial gradients over their input splits; a single reducer sums them and updates the model; iterate until convergence]

MapReduce Implementation
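The structure in the diagram can be sketched in plain Scala. This is a local simulation of the map/reduce pattern under my own (hypothetical) names, not actual Hadoop code: each "mapper" computes a partial gradient over its split, a single "reducer" sums them and updates the model, and the driver iterates.

// Hypothetical local simulation of the MapReduce structure for batch gradient
// ascent on logistic regression (labels y in {0, 1}).
object MapReduceStyleGD {
  case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  def main(args: Array[String]): Unit = {
    val data   = Seq(Point(Array(1.0, 2.0), 1.0), Point(Array(1.0, -1.0), 0.0))
    val splits = data.grouped(1).toSeq          // each "mapper" gets one input split
    var w      = Array(0.0, 0.0)
    val gamma  = 0.1

    for (_ <- 1 to 100) {                       // one MapReduce job per iteration
      // Map phase: each mapper computes a partial gradient over its split.
      val partials = splits.map { split =>
        split.map { p =>
          val pred = 1.0 / (1.0 + math.exp(-dot(w, p.x)))
          p.x.map(_ * (p.y - pred))             // per-example gradient of the log likelihood
        }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      }
      // Reduce phase: a single reducer sums partial gradients and updates the model.
      val gradient = partials.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      w = w.zip(gradient).map { case (wi, gi) => wi + gamma * gi }
    }
    println(w.mkString("w = [", ", ", "]"))
  }
}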

SLIDE 34

Shortcomings

Hadoop is bad at iterative algorithms

High job startup costs
Awkward to retain state across iterations

High sensitivity to skew

Iteration speed bounded by slowest task

Potentially poor cluster utilization

Must shuffle all data to a single reducer

Some possible tradeoffs

Number of iterations vs. complexity of computation per iteration
E.g., L-BFGS: faster convergence, but more to compute

SLIDE 35

val points = spark.textFile(...).map(parsePoint).persist()
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
  }.reduce((a, b) => a + b)
  w -= gradient
}

[Diagram: mappers compute partial gradients; reducer updates the model]

Spark Implementation
