Lecture 1: Introduction
Felix Held, Mathematical Sciences
MSA220/MVE440 Statistical Learning for Big Data
25th March 2019

What is Big Data?
▶ Is it just a buzzword?
▶ Is it a cure for everything? See e.g. [1]
▶ Big Data - Big Problems?
  ▶ Big Data does not mean correct answers, see e.g. [2]
  ▶ Privacy concerns, see e.g. [3]
  ▶ Big Data is often not collected systematically, see e.g. [4]
▶ It’s a huge topic in science! Almost 5 million hits on Google Scholar.
[1] https://www.businessinsider.com/big-data-and-cancer-2015-9?r=US&IR=T&IR=T
[2] Lazer et al. (2014). The Parable of Google Flu: Traps in Big Data Analysis. Science 343(6176):1203–1205. doi:10.1126/science.1248506
[3] https://www.nytimes.com/2018/03/22/opinion/democracy-survive-data.html
[4] https://www.ft.com/content/21a6e7d8-b479-11e3-a09a-00144feabdc0#axzz2yQ2QQfQX
So Big Data is about size?
Yes and no. Note that size is a flexible term. Here it mostly means:
▶ Size as in: number of observations
  Big-𝑛 setting
▶ Size as in: number of variables
  Big-𝑝 setting
▶ Size as in: number of observations and variables
  Big-𝑛 / Big-𝑝 setting

Is this all?
The Four Vs of Big Data
Four attributes commonly assigned to Big Data.
Volume: Large scale of the data. Challenges are storage, computation, finding the interesting parts, …
Variety: Different data types, data sources, many variables, …
Veracity: Uncertainty of data due to convenience sampling, missing values, varying data quality, insufficient data cleaning/preparation, …
Velocity: Data arriving at high speed that needs to be dealt with immediately (e.g. a production plant, self-driving cars)
See also https://www.ibmbigdatahub.com/infographic/four-vs-big-data
How does statistics come into play?
Statistics as a science has always been concerned with…
▶ sampling designs
▶ modelling of data and underlying assumptions
▶ inference of parameters
▶ uncertainty quantification in estimated parameters/predictions
The focus in this course is on the last three.
Statistical challenges in Big Data
▶ An increase in sample size often leads to an increase in complexity and variety of the data (𝑝 grows with 𝑛)
▶ More data ≠ less uncertainty
▶ A lot of classical theory is for fixed 𝑝 and growing 𝑛
▶ Exploration and visualisation of Big Data can already require statistics
▶ Probability of extreme values: unlikely results become much more likely with an increase in 𝑛
▶ Curse of dimensionality: lots of space between data points in high-dimensional space (see the sketch below)
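The last two points are easy to check numerically. Below is a minimal numpy sketch (not from the lecture; the sample sizes, dimensions and the uniform design are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Extreme values: the maximum of n standard normal draws keeps growing with n.
for n in [100, 10_000, 1_000_000]:
    print(n, rng.standard_normal(n).max())

# Curse of dimensionality: with n fixed, the distance from a point to its
# nearest neighbour grows quickly with the dimension p.
n = 1000
for p in [2, 10, 100]:
    X = rng.uniform(size=(n, p))               # n points in the unit cube [0, 1]^p
    dists = np.linalg.norm(X - X[0], axis=1)   # distances from the first point
    print(p, np.sort(dists)[1])                # nearest-neighbour distance
```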
Course Overview & Expectations
A clarification upfront
This course focusses on statistics, not on the logistics of data processing.
▶ Understanding of algorithms, modelling assumptions and reasonable interpretations are our main goals.
▶ We will focus on well-understood methods supported by theory, and their modifications for big data sets.
▶ No neural networks or deep learning. There are specialised courses for this (e.g. FFR135/FIM720 or TDA231/DIT380).
Themes
▶ Statistical learning/prediction: regression and classification
▶ Unsupervised classification, i.e. clustering
▶ Variable selection, both explicit and implicit
▶ Data representations/dimension reduction
▶ Large sample methods
Who’s involved
Felix Held, felix.held@chalmers.se
Rebecka Jörnsten, jornsten@chalmers.se
Juan Inda Diaz, inda@chalmers.se
A course in three parts
1. Lectures
2. Projects
3. Take-home exam
Course literature
Hastie, T, Tibshirani, R, and Friedman, J (2009) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer Science+Business Media, LLC
▶ Covers a lot of statistical methods
▶ Freely available online
▶ Balanced presentation of theory and application
▶ Not always very detailed. Other suggestions on the course website.
Projects
▶ Five (small) projects throughout the course
▶ Purpose:
  ▶ Hands-on experience in data analysis
  ▶ Further exploration of course topics
  ▶ Practice in presenting statistical results
▶ You will work in groups and have at least one week per project
▶ Projects will be presented in class
▶ Attendance at (and presenting during) the project presentations is mandatory to be allowed to take the exam
▶ More information next week
Exam
▶ Take-home exam
▶ Structure:
  ▶ 50% of the exam/grade: revise your projects individually
  ▶ 50% of the exam/grade: additional data analysis/statistical tasks
▶ The exam will be handed out on 24th May
▶ Hard deadline on 14th June
Statistical Learning
Basics about random variables
▶ We will consider discrete and continuous random quantities
▶ Probability mass function (pmf) 𝑝(𝑥) for a discrete variable
▶ Probability density function (pdf) 𝑝(𝑥) for a continuous variable
Two important rules (and a consequence)
Marginalisation: For a joint density 𝑝(𝑥, 𝑦) it holds that
  𝑝(𝑥) = ∑_𝑦 𝑝(𝑥, 𝑦)  (discrete case)
or
  𝑝(𝑥) = ∫ 𝑝(𝑥, 𝑦) d𝑦  (continuous case)
Conditioning: For a joint density 𝑝(𝑥, 𝑦) it holds that
  𝑝(𝑥, 𝑦) = 𝑝(𝑥|𝑦)𝑝(𝑦) = 𝑝(𝑦|𝑥)𝑝(𝑥)
Both rules together imply Bayes’ law
  𝑝(𝑥|𝑦) = 𝑝(𝑦|𝑥)𝑝(𝑥) / 𝑝(𝑦)
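These rules are easy to verify numerically on a small discrete example. A minimal numpy sketch (the joint pmf table is made up for illustration):

```python
import numpy as np

# Joint pmf p(x, y) on a 2x3 grid; rows index x, columns index y.
p_xy = np.array([[0.10, 0.20, 0.10],
                 [0.25, 0.15, 0.20]])

p_x = p_xy.sum(axis=1)  # marginalisation: p(x) = sum over y of p(x, y)
p_y = p_xy.sum(axis=0)  # marginalisation: p(y) = sum over x of p(x, y)

p_x_given_y = p_xy / p_y           # conditioning: p(x|y) = p(x, y) / p(y)
p_y_given_x = p_xy / p_x[:, None]  # conditioning: p(y|x) = p(x, y) / p(x)

# Bayes' law: p(x|y) = p(y|x) p(x) / p(y)
assert np.allclose(p_y_given_x * p_x[:, None] / p_y, p_x_given_y)
```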
Expectation and variance
Expectations and variance depend on an underlying pdf/pmf. Notation:
▶ 𝔼𝑝(𝑥)[𝑓(𝑥)] = ∫ 𝑓(𝑥)𝑝(𝑥) d𝑥
▶ Var𝑝(𝑥)[𝑓(𝑥)] = 𝔼𝑝(𝑥)[(𝑓(𝑥) − 𝔼𝑝(𝑥)[𝑓(𝑥)])²]
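When these integrals have no closed form, they are often approximated by Monte Carlo. A minimal sketch (taking 𝑝(𝑥) standard normal and 𝑓(𝑥) = 𝑥², arbitrary choices for which the exact answers are known):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)  # draws from p(x) = N(0, 1)
fx = x**2                           # f(x) = x^2

print(fx.mean())  # estimates E[f(x)]; exactly 1 here
print(fx.var())   # estimates Var[f(x)]; exactly 2 here
```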
What is Statistical Learning?
Learn a model from data by minimizing expected prediction error determined by a loss function.
▶ Model: find a model that is suitable for the data
▶ Data: data with known outcomes is needed
▶ Expected prediction error: focus on the quality of prediction (predictive modelling)
▶ Loss function: quantifies the discrepancy between observed data and predictions
Linear regression - An old friend
[Figure: scatter plot of 𝑦 against 𝑥, illustrating simple linear regression]
Statistical Learning and Linear Regression
▶ Data: Training data consists of independent pairs (𝑦ᵢ, 𝐱ᵢ), 𝑖 = 1, …, 𝑛, with observed response 𝑦ᵢ ∈ ℝ for predictors 𝐱ᵢ ∈ ℝᵖ, and the design matrix 𝐗 has rank 𝑝 + 1
▶ Model:
  𝑦ᵢ = 𝐱ᵢᵀ𝜷 + 𝜀ᵢ
  where 𝜀ᵢ ∼ 𝑁(0, 𝜎²) independent
▶ Loss function: Least squares solves the standard linear regression problem, i.e. squared error loss
  𝐿(𝑦, ŷ) = (𝑦 − ŷ)² = (𝑦 − 𝐱ᵀ𝜷̂)²  with 𝜷̂ = (𝐗ᵀ𝐗)⁻¹𝐗ᵀ𝐲
  (see the code sketch below)
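A minimal numpy sketch of the least-squares fit on simulated data (the sample size, dimension and true coefficients are made up; lstsq is used rather than inverting 𝐗ᵀ𝐗 explicitly, which is the numerically preferable route to the same 𝜷̂):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])  # design matrix with intercept
beta = np.array([1.0, 2.0, -1.0, 0.5])                          # true coefficients
y = X @ beta + rng.standard_normal(n)                           # y_i = x_i^T beta + eps_i

# Solves the least-squares problem; equals (X^T X)^{-1} X^T y when X has full rank
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to the true coefficients
```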
Statistical decision theory for regression (I)
▶ Squared error loss between outcome 𝑦 and a prediction 𝑓(𝐱) depending on the variable(s) 𝐱:
  𝐿(𝑦, 𝑓(𝐱)) = (𝑦 − 𝑓(𝐱))²
▶ Assume we want to find the “best” 𝑓 that can be learned from training data
▶ When a new pair of data (𝑦, 𝐱) from the same distribution (population) as the training data arrives, the expected prediction loss for a given 𝑓 is
  𝐽(𝑓) = 𝔼𝑝(𝐱,𝑦)[𝐿(𝑦, 𝑓(𝐱))] = 𝔼𝑝(𝐱)[𝔼𝑝(𝑦|𝐱)[𝐿(𝑦, 𝑓(𝐱))]]
▶ Define “best” by
  𝑓̂ = arg min_𝑓 𝐽(𝑓)
Statistical decision theory for regression (II)
▶ It can be derived (see blackboard) that
  𝑓̂(𝐱) = 𝔼𝑝(𝑦|𝐱)[𝑦],
  the expectation of 𝑦 given that 𝐱 is fixed (conditional mean)
▶ Regression methods approximate the conditional mean
▶ For many observations 𝑦 with identical 𝐱 we could use
  𝔼𝑝(𝑦|𝐱)[𝑦] ≈ (1 / |{𝑦ᵢ : 𝐱ᵢ = 𝐱}|) ∑_{𝐱ᵢ=𝐱} 𝑦ᵢ
▶ It is probably more realistic to look for the 𝑘 closest neighbours of 𝐱 in the training data, 𝑁ₖ(𝐱) = {𝐱ᵢ₁, …, 𝐱ᵢₖ}. Then
  𝔼𝑝(𝑦|𝐱)[𝑦] ≈ (1/𝑘) ∑_{𝐱ᵢₘ ∈ 𝑁ₖ(𝐱)} 𝑦ᵢₘ
  (a code sketch follows below)
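A minimal numpy sketch of this 𝑘-nearest-neighbour average (the training data and the choice of 𝑘 are made up for illustration):

```python
import numpy as np

def knn_regress(X_train, y_train, x, k):
    """Estimate E[y|x] by averaging y over the k nearest training points."""
    dists = np.linalg.norm(X_train - x, axis=1)  # distances to all training points
    nearest = np.argsort(dists)[:k]              # indices of N_k(x)
    return y_train[nearest].mean()

rng = np.random.default_rng(3)
X_train = rng.uniform(0, 3, size=(50, 1))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.standard_normal(50)
print(knn_regress(X_train, y_train, np.array([1.5]), k=5))  # roughly sin(1.5)
```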
Average of 𝑘 neighbours

[Figure: 𝑘-nearest-neighbour averages of 𝑦 against 𝑥 for 𝑘 = 2 and 𝑘 = 5]
Back to linear regression
Linear regression is a model-based approach and assumes that the dependence of 𝑦 on 𝐱 can be written as a weighted sum
  𝔼𝑝(𝑦|𝐱)[𝑦] ≈ 𝐱ᵀ𝜷
A simple example of classification
[Figure: scatter plot of two classes in the (𝑥₁, 𝑥₂)-plane]

How do we classify a pair of new coordinates 𝐱 = (𝑥₁, 𝑥₂)?
𝑘-nearest neighbour classifier (kNN)

▶ Find the 𝑘 predictors 𝑁ₖ(𝐱) = {𝐱ᵢ₁, …, 𝐱ᵢₖ} in the training sample that are closest to 𝐱 in the Euclidean norm.
▶ Majority vote: assign 𝐱 to the class that most predictors in 𝑁ₖ(𝐱) belong to (highest frequency); a minimal sketch follows below.
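A minimal numpy sketch of the majority vote (the two Gaussian classes and the choice of 𝑘 are made up for illustration):

```python
import numpy as np

def knn_classify(X_train, c_train, x, k):
    """Assign x to the most frequent class among its k nearest neighbours."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distances to training points
    nearest = np.argsort(dists)[:k]              # indices of N_k(x)
    classes, counts = np.unique(c_train[nearest], return_counts=True)
    return classes[np.argmax(counts)]            # majority vote

rng = np.random.default_rng(4)
X_train = np.vstack([rng.normal(-2.0, 1.0, size=(25, 2)),   # class 1
                     rng.normal(+2.0, 1.0, size=(25, 2))])  # class 2
c_train = np.repeat([1, 2], 25)
print(knn_classify(X_train, c_train, np.array([0.5, 0.5]), k=5))
```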
kNN and its decision boundaries
[Figure: kNN decision boundaries in the (𝑥₁, 𝑥₂)-plane for 𝑘 = 1, 5 and 20]
Classification and Statistical Learning
Classification: Learn a rule 𝑐(𝐱) from data which maps observed features 𝐱 to classes {1, …, 𝐾}.

Remember: Statistical Learning. Learn a model from data by minimizing expected prediction error determined by a loss function.

Here: rule ≃ model, and observed classes give us the required outcomes for learning.

What is a suitable loss?
Statistical decision theory for classification
▶ 0-1 misclassification loss: let 𝑗 be the actual class of an object and 𝑐(𝐱) a rule that returns the class for the variable(s) 𝐱; then
  𝐿(𝑗, 𝑐(𝐱)) = 0 if 𝑗 = 𝑐(𝐱), and 1 if 𝑗 ≠ 𝑐(𝐱)
▶ As for regression, minimizing the expected prediction error leads to the rule (see blackboard)
  𝑐̂(𝐱) = arg max_{1≤𝑗≤𝐾} 𝑝(𝑗|𝐱)
  This is called Bayes’ rule; a minimal sketch follows below.
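When the class-conditional densities and priors are known, Bayes’ rule can be evaluated exactly. A minimal numpy sketch with two made-up one-dimensional Gaussian classes and equal priors:

```python
import numpy as np

# Known class-conditional densities p(x|j) and priors p(j) for j = 1, 2
priors = np.array([0.5, 0.5])
means = np.array([-1.0, 1.0])
sds = np.array([1.0, 1.0])

def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def bayes_rule(x):
    # c_hat(x) = arg max_j p(j|x), where p(j|x) is proportional to p(x|j) p(j)
    posteriors = priors * normal_pdf(x, means, sds)
    return np.argmax(posteriors) + 1  # classes labelled 1, ..., K

print(bayes_rule(0.3))  # the decision boundary is at x = 0 here, so class 2
```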
Back to kNN
▶ kNN solves the classification problem by approximating 𝑝(𝑗|𝐱) with the frequency of class 𝑗 among the 𝑘 closest neighbours of 𝐱.
▶ Given data (𝑗ᵢ, 𝐱ᵢ) for 𝑖 = 1, …, 𝑛 it holds that
  𝑐̂(𝐱) = arg max_{1≤𝑗≤𝐾} (1/𝑘) ∑_{𝐱ᵢ ∈ 𝑁ₖ(𝐱)} 1(𝑗ᵢ = 𝑗)
Take-home message
▶ Big Data is complex and multi-faceted
▶ Regression and classification can be formulated in the framework of Statistical Learning
▶ In both cases, the focus is on prediction