

SLIDE 1

Introduction to Data Science

Winter Semester 2019/20 Oliver Ernst

TU Chemnitz, Fakultät für Mathematik, Professur Numerische Mathematik

Lecture Slides

SLIDE 2

Contents I

1 What is Data Science?
2 Learning Theory
2.1 What is Statistical Learning?
2.2 Assessing Model Accuracy
3 Linear Regression
3.1 Simple Linear Regression
3.2 Multiple Linear Regression
3.3 Other Considerations in the Regression Model
3.4 Revisiting the Marketing Data Questions
3.5 Linear Regression vs. K-Nearest Neighbors
4 Classification
4.1 Overview of Classification
4.2 Why Not Linear Regression?
4.3 Logistic Regression
4.4 Linear Discriminant Analysis
4.5 A Comparison of Classification Methods
5 Resampling Methods

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 3 / 462

SLIDE 3

Contents II

5.1 Cross Validation
5.2 The Bootstrap
6 Linear Model Selection and Regularization
6.1 Subset Selection
6.2 Shrinkage Methods
6.3 Dimension Reduction Methods
6.4 Considerations in High Dimensions
6.5 Miscellanea
7 Nonlinear Regression Models
7.1 Polynomial Regression
7.2 Step Functions
7.3 Regression Splines
7.4 Smoothing Splines
7.5 Generalized Additive Models
8 Tree-Based Methods
8.1 Decision Tree Fundamentals
8.2 Bagging, Random Forests and Boosting


SLIDE 4

Contents III

9 Unsupervised Learning
9.1 Principal Components Analysis
9.2 Clustering Methods


SLIDE 5

Contents

2 Learning Theory
2.1 What is Statistical Learning?
2.2 Assessing Model Accuracy



SLIDE 7

Learning Theory

Example: Advertising channels

  • Given a data set containing the sales numbers for a given product in 200 markets, allocate an advertising budget across the three media channels TV, radio and newspaper.
  • The sales numbers for each medium are available for different advertising budget values.
  • We will try to model the dependence of sales on advertising budgets.
  • Terminology:
    X1 : TV budget, X2 : radio budget, X3 : newspaper budget (independent variables, input variables, predictors, features)
    Y : sales (dependent variable, target variable, response)



SLIDE 9

Learning Theory

Example: Advertising channels

[Figure: scatterplots of Sales vs. TV, Radio and Newspaper advertising budgets]

Y = f(X) + ε,   X = (X1, . . . , Xp),   (2.1)

where p = number of predictors, ε is a random error term with E[ε] = 0, and f represents the systematic information X provides about Y.
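Model (2.1) can be illustrated with a short simulation. The linear f below is a hypothetical stand-in (the true f for the Advertising data is unknown), as are the coefficient values and noise level:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(X):
    # Hypothetical "true" systematic relationship; in practice f is unknown.
    return 5.0 + 0.05 * X[:, 0] + 0.1 * X[:, 1]

n, p = 200, 3
X = rng.uniform(0.0, 100.0, size=(n, p))   # budgets for TV, radio, newspaper
eps = rng.normal(0.0, 1.0, size=n)         # random error term with E[eps] = 0
Y = f(X) + eps                             # the model Y = f(X) + eps of (2.1)
```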


SLIDE 10

Learning Theory

Example: Income

  • Data set shows income against years of education for 30 people.
  • Objective: determine a function f relating income as response to years of education as predictor.
  • f is generally unknown and must be estimated from the data.
  • Here: the data are simulated, so f is available.
  • In another data set, income is given with respect to two input variables: years of education and seniority.
  • Statistical learning is concerned with techniques for estimating f from a data set.


SLIDE 11

Learning Theory

Example: Income

[Figure: two scatterplots of Income vs. Years of Education]



SLIDE 13

Learning Theory

Example: Income

[Figure: 3-D plot of Income over Years of Education and Seniority, with the surface f]

Two main reasons for estimating f : prediction and inference.



SLIDE 15

Learning Theory

Prediction

  • Suppose inputs X are readily available, but outputs Y are difficult to obtain.
  • Since errors average out, predict Y using Ŷ = f̂(X), where f̂ is an estimate for f and Ŷ the resulting prediction for Y = f(X).
  • Often f̂ is only available as a black box, i.e., a procedure for generating Ŷ given X.

Example: X1, . . . , Xp : characteristics of a patient's blood samples, measured in the lab; Y : the patient's risk for a severe adverse reaction to a particular drug. For obvious reasons, having an accurate estimate Ŷ = f̂(X) is preferable to evaluating Y = f(X).


SLIDE 20

Learning Theory

Prediction

Accuracy of Ŷ ≈ Y depends on reducible error and irreducible error.

  • Reducible error: f − f̂. Can be made smaller and smaller by employing increasingly sophisticated statistical learning techniques.
  • Irreducible error: ε. Present even for f̂ = f and cannot be predicted from X. Possible sources:
    • additional variables Y may depend on but which are not observed/measured;
    • unmeasurable variation (e.g., an adverse reaction may depend on manufacturing variations in the drug or variations in the patient's sensitivity over time).
  • Quantitative measure: mean squared error (MSE)

    E[(Y − Ŷ)²] = [f(X) − f̂(X)]² + Var ε,

    where the first term is the reducible and the second the irreducible error.
  • Note: the irreducible error is always a lower bound on the achievable prediction error.
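The split of the expected squared error into a reducible and an irreducible part can be checked numerically. The functions f_true, f_hat and the noise level below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

f_true = lambda x: 2.0 * x + 1.0   # hypothetical true f
f_hat  = lambda x: 2.0 * x + 1.5   # imperfect estimate: reducible error 0.5 at every x
sigma = 1.0                        # standard deviation of the irreducible error eps

# Repeated noisy observations of Y at a fixed input x0.
x0 = 3.0
y = f_true(x0) + rng.normal(0.0, sigma, size=200_000)
mse = np.mean((y - f_hat(x0)) ** 2)

reducible = (f_true(x0) - f_hat(x0)) ** 2   # [f(x0) - fhat(x0)]^2 = 0.25
irreducible = sigma ** 2                    # Var eps = 1.0
```

The sample MSE agrees with reducible + irreducible up to Monte Carlo error.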


SLIDE 24

Learning Theory

Inference

Inference seeks to determine how the individual predictors X1, . . . , Xp affect the response Y. This requires more detailed knowledge about f̂ than simply considering it a black box. Things to investigate:

  • Identify those predictors with the strongest effect on Y. Can be a small subset of X1, . . . , Xp.
  • Determine the relationship between the response and each predictor. Is it monotone increasing or decreasing with respect to an individual predictor? For more complex dependencies, such monotonicities can be affected by the values of the remaining predictors.
  • Is a linear model sufficient? Historically, most estimation methods have produced a linear (affine) function f̂. If the true dependence of Y on X is more complicated, a linear model may not be accurate enough.


SLIDE 25

Learning Theory

Prediction example: direct-marketing campaign


  • A company plans a direct-marketing campaign and wishes to identify individuals who would respond positively to a mailing.
  • Response is Y ∈ {positive, negative}.
  • Predictors Xj are demographic variables.
  • The detailed relationship of the response to the demographic variables is not of interest.
  • A model which generates accurate predictions is all that is needed.


SLIDE 27

Learning Theory

Inference examples: advertising data set, purchase behavior

  • In our first example (advertising across the media channels TV, radio and newspaper), one may also be interested in answers to:
    • Which media increase sales?
    • Of those, which has the strongest positive effect?
    • At what rate do sales increase when the TV budget is raised?
  • Another example: model the brand of a product chosen by a customer as a function of the predictor variables price, store location, discount levels, competitor pricing etc. Here detailed knowledge of how each variable affects the outcome is of interest, e.g.:
    • What effect will changing the price of a product have on sales?


SLIDE 29

Learning Theory

Example: combination of prediction and inference

  • There are also mixed situations, involving both prediction and inference: consider the value of a house depending on the predictor values size, crime rate, zoning, distance from a river/ocean, air quality, schools, income level of the community, . . .
  • How much does an ocean view increase the value of a house? (inference)
  • Is this house over- or undervalued? (prediction)
  • Note: the two objectives may be competing. Linear models allow for easier inference, but may not be accurate enough for a given prediction goal. More sophisticated (highly nonlinear) approaches may yield high prediction accuracy, but the models they produce are often difficult to interpret.


SLIDE 31

Learning Theory

Prediction techniques

Denote by
n : number of available data observations ("training data"),
xij : value of the j-th predictor in the i-th observation, i = 1, . . . , n; j = 1, . . . , p,
yi : value of the response variable in the i-th observation.
The training data then consist of the predictor-response pairs {(x1, y1), (x2, y2), . . . , (xn, yn)}, where xi = (xi1, . . . , xip)ᵀ.
Goal: estimate a function f̂ such that Y ≈ f̂(X) for all observations (X, Y). Two basic approaches: parametric and non-parametric.


SLIDE 32

Learning Theory

Parametric methods

Two-step model-based approach

1 Assume a specific functional form for f; a popular example is the linear model

  f(X) = β0 + β1X1 + β2X2 + · · · + βpXp.  (2.2)

  Estimation of f now consists only in determining values of the p + 1 parameters β0, β1, . . . , βp (a huge simplification).

2 Train or fit the chosen model to the data, i.e., choose the parameters βj, j = 0, . . . , p, such that (here for the linear model (2.2)) f(X) ≈ β0 + β1X1 + β2X2 + · · · + βpXp. Most common fitting technique: (ordinary) least squares, but many other techniques exist. The problem of estimating f is thus reduced to estimating a finite number of parameters.
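A minimal sketch of the two-step parametric approach on synthetic data. The linear truth, coefficient values and noise level are assumptions for illustration; the fit is ordinary least squares via numpy:

```python
import numpy as np

rng = np.random.default_rng(2)

# Step 0: synthetic training data generated from a linear truth.
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([4.0, 3.0, -2.0, 0.5])        # beta_0, beta_1, ..., beta_p
Y = beta_true[0] + X @ beta_true[1:] + rng.normal(0.0, 0.1, size=n)

# Step 2: fit the assumed linear form (2.2) by ordinary least squares;
# prepend a column of ones so that beta_0 is estimated too.
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, Y, rcond=None)
```

Estimating f has been reduced to estimating the p + 1 entries of beta_hat.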


SLIDE 33

Learning Theory

Parametric methods

Fundamental difficulty:

  • The simplification comes at the expense of a strong restriction on the type of dependence.
  • For a bad choice, the model cannot match the data well.
  • More flexible models can better adapt to the given data, but will generally involve more parameters to be estimated.
  • Moreover, even if we are willing to fit our data extremely well with a flexible model, we may be adapting the model only to the fluctuations due to the random error contained in the data ("fitting the noise"). In this case, our model will not generalize well, i.e., it will have low prediction value for new data. This phenomenon is called overfitting.


SLIDE 34

Learning Theory

Parametric model example: income data

Linear model: income ≈ β0 + β1 × education + β2 × seniority.

[Figure: linear fit of Income over Years of Education and Seniority]


SLIDE 35

Learning Theory

Non-parametric methods

  • Non-parametric methods make no a priori assumptions on the functional form of f.
  • Instead, they try to achieve as close an approximation to f as possible without being too rough or too oscillatory.
  + A bad a priori assumption cannot limit the approximation accuracy.
  − Far more observations are necessary than for parametric methods.
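As a concrete non-parametric sketch, k-nearest-neighbor averaging (a method treated in detail later in the course) estimates f(x0) locally from the data without assuming any functional form. The sine truth and sample sizes below are illustrative assumptions:

```python
import numpy as np

def knn_regress(x0, X, Y, k=5):
    """Estimate f(x0) as the average response of the k training points closest to x0."""
    d = np.abs(X - x0)
    idx = np.argsort(d)[:k]
    return Y[idx].mean()

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0.0, 10.0, 200))
Y = np.sin(X) + rng.normal(0.0, 0.1, size=200)   # truth sin(x), unknown in practice

fhat = knn_regress(5.0, X, Y, k=15)              # local estimate of f(5.0)
```

No functional form was assumed, but note that the estimate rests on having enough observations near x0.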


SLIDE 36

Learning Theory

Non-parametric model example: income data

Smooth thin-plate spline model (later):

[Figure: smooth thin-plate spline fit of Income over Years of Education and Seniority]


SLIDE 38

Learning Theory

Non-parametric model example: income data

Rough thin-plate spline model:

[Figure: rough thin-plate spline fit of Income over Years of Education and Seniority]

Near perfect fit. Are we overfitting?


SLIDE 40

Learning Theory

Tradeoff: prediction accuracy vs. model interpretability

  • Less flexible / more restrictive models can only produce a small range of shapes for f. E.g.: linear regression always provides a linear approximation to f.
  • More flexible methods (e.g. thin-plate splines) offer a larger variety of function shapes.
  • Advantage of restrictive methods:
  + For inference, restrictive models are much more interpretable.
    • Linear least-squares is easy to interpret.
    • Lasso: linear model, different way of selecting coefficients, sets some to zero. More restrictive than least squares, but also more interpretable.
    • Generalized additive models (GAMs): extend the model by certain nonlinear relationships. More flexible, less easy to interpret.
    • Bagging, boosting, support-vector machines: fully nonlinear methods, very flexible, very difficult to interpret.
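The lasso's way of setting coefficients exactly to zero can be seen in the special case of an orthonormal design matrix, where the lasso solution is soft-thresholding of the least-squares coefficients (a standard special case; the sparse truth below is synthetic):

```python
import numpy as np

rng = np.random.default_rng(4)

# Orthonormal design: for min (1/2)||y - X b||^2 + lam * ||b||_1 with X'X = I,
# the lasso solution is soft-thresholding of the least-squares coefficients.
n, p = 100, 4
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))   # Q has orthonormal columns
beta_true = np.array([5.0, -3.0, 0.0, 0.0])    # sparse truth
y = Q @ beta_true + rng.normal(0.0, 0.1, size=n)

beta_ls = Q.T @ y                              # least-squares coefficients
lam = 0.5
beta_lasso = np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)
```

The small least-squares coefficients are thresholded to exactly zero, which is what makes the lasso fit easy to interpret.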


SLIDE 41

Learning Theory

Tradeoff: prediction accuracy vs. model interpretability

[Figure: methods arranged by flexibility (low to high) and interpretability (high to low): Subset Selection, Lasso; Least Squares; Generalized Additive Models, Trees; Bagging, Boosting; Support Vector Machines]


SLIDE 44

Learning Theory

Supervised vs. unsupervised learning

Up to now: observation pairs (xi, yi), i = 1, . . . , n. Seek a model f̂ such that Y ≈ f̂(X) for all observations. This is called supervised learning. In unsupervised learning only the predictor variables X are observed, but no associated responses Y.

  • No fitting is possible (nothing to fit to); we are, in a sense, working blind.
  • Less ambitious goal: discover relationships between the observations, draw conclusions for the predictor variables.
  • Cluster analysis (clustering): a statistical learning tool to ascertain whether the observations {xi}, i = 1, . . . , n, fall into (more or less) distinct groups.
  • Example: market segment analysis; observe multiple characteristics of potential customers (zip code, family income, shopping habits). Possible groups: big spenders, low spenders. In the absence of spending pattern data, clustering may reveal whether potential big spenders may be distinguished by the available data.
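A minimal k-means sketch illustrates the idea (k-means is one of the clustering methods treated later; the three synthetic groups below mimic the well-separated case):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal k-means sketch: alternate nearest-center assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Distance of every observation to every current center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned points (keep it if empty).
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

rng = np.random.default_rng(9)
# Three well-separated synthetic groups of 50 observations each.
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2))
               for loc in ([0, 0], [5, 0], [0, 5])])
labels, centers = kmeans(X, k=3)
```

Note there were no response values anywhere; the groups are inferred from the predictors alone.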


SLIDE 45

Learning Theory

Example: clustering

[Figure: scatterplots of X2 vs. X1 for two simulated data sets]

n = 150 observations of two variables X1 and X2, each belonging to one of three groups (colored for better distinction). Left: well-separated clusters, easily identified. Right: some overlap between groups, more challenging; some observations will likely be mis-classified.


SLIDE 46

Learning Theory

Supervised vs. unsupervised learning

Note:

  • Clustering is more challenging in p > 2 dimensions, e.g. there are p(p − 1)/2 possible scatterplots to look at. Automated methods needed.
  • Semi-supervised learning: only m < n observations come with responses. (Responses could be very expensive to obtain compared to the predictor observations.) Goal: incorporate both types of observations in an optimal way.


SLIDE 47

Learning Theory

Regression vs. classification problems

Another useful distinction is between continuous and discrete predictor and response variables.

  • Continuous or quantitative variables – such as a person's height, age, income, the price of a house or stock – typically take on values in the real numbers.
  • Discrete or qualitative variables – a person's gender, whether or not an event occurs, a cancer diagnosis – take on values in one of a finite number of different classes or categories.
  • Problems with a quantitative response variable are typically referred to as regression problems, those with a qualitative response as classification problems.
  • The distinction is not always sharp, e.g. logistic regression is used for (two-valued) qualitative responses (it estimates class probabilities).


SLIDE 48

Contents

2 Learning Theory
2.1 What is Statistical Learning?
2.2 Assessing Model Accuracy


SLIDE 52

Assessing Model Accuracy

Mean squared error

Most common error metric in regression: the mean squared error (MSE)

MSE = (1/n) ∑_{i=1}^n (yi − f̂(xi))².  (2.3)

  • When applied to the training data: training MSE.
  • More interesting (particularly for prediction): the test MSE, computed from data not used to train (fit) the model f̂.
  • If a test data set is available in addition to the training data, different learning (fitting) methods can be compared with respect to their test MSE values.
  • In the absence of a test data set, choosing a learning method based solely on the training MSE can be deceptive.
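How deceptive the training MSE can be is easy to demonstrate with polynomial fits of increasing flexibility; the sine truth, degrees and noise level are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)

def mse(y, yhat):
    return np.mean((y - yhat) ** 2)

f = lambda x: np.sin(1.5 * x)   # hypothetical true f

# Small training set, larger independent test set, both from model (2.1).
x_tr = rng.uniform(-3, 3, 30);  y_tr = f(x_tr) + rng.normal(0, 0.3, 30)
x_te = rng.uniform(-3, 3, 500); y_te = f(x_te) + rng.normal(0, 0.3, 500)

train_mse, test_mse = {}, {}
for deg in (1, 7, 15):          # increasing flexibility (polynomial degree)
    c = np.polyfit(x_tr, y_tr, deg)
    train_mse[deg] = mse(y_tr, np.polyval(c, x_tr))
    test_mse[deg] = mse(y_te, np.polyval(c, x_te))
```

The training MSE keeps shrinking with the degree, while the test MSE is what actually measures predictive quality.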


SLIDE 53

Assessing Model Accuracy

Example: smoothing spline models

[Figure: data with three smoothing-spline fits (left); training and test MSE vs. flexibility (right)]

Left: observations from model (2.1), true f in black, estimates in orange, blue, green. Right: average MSE for training data (gray) and test data (red) vs. flexibility parameter. Horizontal dashed line denotes Var ε.


SLIDE 54

Assessing Model Accuracy

Example: smoothing spline models

[Figure: data with three fits (left); training and test MSE vs. flexibility (right)]

Same plots as in the previous figure, but with a true model that is nearly linear. The initial estimate (with few degrees of freedom) is already quite accurate.


SLIDE 55

Assessing Model Accuracy

Example: smoothing spline models

[Figure: data with three fits (left); training and test MSE vs. flexibility (right)]

Another such plot; now the true model is highly nonlinear. Maximal accuracy for training and test data is not attained until the model contains many degrees of freedom.


SLIDE 56

Assessing Model Accuracy

Example: smoothing spline models

Recap:

  • Monotone decrease of the training MSE as the model becomes more flexible (more degrees of freedom, DoF) and can more closely follow variation in the data.
  • The test MSE curve is typically U-shaped: it rises again once overfitting sets in.
  • This is a fundamental property of statistical learning, regardless of the data set and regardless of the statistical technique being used.
  • Interpretation: in overfitting, the estimate is finding patterns (signal variation) where there are none. James et al.: "When we overfit the training data, the test MSE will be very large because the supposed patterns that the method found in the training data simply don't exist in the test data."
  • Overfitting: a less flexible model would have yielded a smaller test MSE.

Note: many estimation methods are based on minimizing the MSE with respect to the DoF in the method, hence the training MSE is almost always less than the test MSE.


SLIDE 57

Assessing Model Accuracy

Apophenia

The tendency to misclassify random events as systematic or, more generally, to see patterns where there are none, is common to human experience and known as apophenia.

  • It is believed to be an advantage in the process of natural selection.
  • It encourages conspiracy theories.
  • It is used to explain the gambler's fallacy in probability theory.
  • In his bestselling book Thinking, Fast and Slow, the behavioral economist Kahneman calls this phenomenon the "law of small numbers".

[Image: Jesus and Mary in an orange. Source: anorak.co.uk]


SLIDE 59

Assessing Model Accuracy

Trade-off: bias vs variance

Can show: the expected test MSE for a new value x0 of the test data has the representation

E[(y0 − f̂(x0))²] = Var f̂(x0) + [Bias f̂(x0)]² + Var ε.  (2.4)

  • E[(y0 − f̂(x0))²] is the expected test MSE at x0, i.e., the average test MSE we would obtain by repeatedly estimating f using a large number of training sets and testing each at x0.
  • (2.4) implies that a good statistical learning method needs to achieve both low bias and low variance.
  • Variance: the amount by which f̂ would change if estimated using a different training data set. A method with high variance is sensitive to small changes in the data set.
  • Bias: the error introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model (systematic error). More flexible methods have lower bias.
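The decomposition (2.4) can be verified by Monte Carlo: repeatedly draw training sets, fit a deliberately too simple model, and compare variance + squared bias + Var ε at a fixed x0. The sine truth, linear fit and noise level below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(6)
f = lambda x: np.sin(x)        # hypothetical true f
sigma = 0.5                    # std of the irreducible error
x0 = 1.0                       # fixed test input

# Repeatedly draw training sets, fit a (too simple) linear model,
# and record the prediction fhat(x0) for each training set.
preds = []
for _ in range(2000):
    x = rng.uniform(-3, 3, 50)
    y = f(x) + rng.normal(0, sigma, 50)
    a, b = np.polyfit(x, y, 1)           # slope, intercept
    preds.append(a * x0 + b)
preds = np.array(preds)

variance = preds.var()                   # Var fhat(x0)
bias_sq = (preds.mean() - f(x0)) ** 2    # [Bias fhat(x0)]^2
expected_test_mse = variance + bias_sq + sigma ** 2   # right-hand side of (2.4)
```

The linear model has low variance here but a substantial squared bias, so neither term can be ignored.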


SLIDE 60

Assessing Model Accuracy

Trade-off: bias vs variance

[Figure: squared bias, variance and test MSE vs. flexibility for the last three examples]

Bias-variance decomposition for the last three examples. Horizontal dashed line: Var ε. The flexibility level achieving minimal test MSE varies due to different rates of change in bias and variance.


SLIDE 62

Assessing Model Accuracy

Bias vs. variance for classification

For a qualitative (discrete) response variable Y, replace the MSE with the training error rate

$$ \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{y_i \neq \hat y_i\}} \tag{2.5} $$

expressing the fraction of incorrect classifications, where $\hat y_i$ is the predicted class label for the i-th observation using $\hat f$, and

$$ \mathbf{1}_{\{y_i \neq \hat y_i\}} = \begin{cases} 1 & y_i \neq \hat y_i, \\ 0 & y_i = \hat y_i \end{cases} \qquad \text{(indicator variable).} $$

As in the regression setting, of more interest than the training error rate (2.5) is the test error rate, which averages the classification errors $\mathbf{1}_{\{y_0 \neq \hat y_0\}}$ over a test set of observations $(x_0, y_0)$ with classification prediction $\hat y_0$ for predictor value $x_0$.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 61 / 462
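The training error rate (2.5) is just the fraction of mismatched labels. A minimal sketch, with the labels assumed for illustration:

```python
# Training error rate (2.5): fraction of observations whose predicted
# class label differs from the true one.
import numpy as np

y    = np.array([1, 2, 1, 1, 2, 2])   # true class labels y_i (assumed toy data)
yhat = np.array([1, 2, 2, 1, 2, 1])   # predicted labels yhat_i

# (1/n) * sum of indicators 1{y_i != yhat_i}; the boolean mask is the indicator
train_error_rate = np.mean(y != yhat)
print(train_error_rate)  # 2 misclassifications out of 6
```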

slide-63
SLIDE 63

Assessing Model Accuracy

Classification: Bayes classifier

One can show (we won't) that the expected test error rate is minimized by the Bayes classifier: assign to a test observation with predictor vector $x_0$ the class j for which the conditional probability $P(Y = j \mid X = x_0)$ is maximal over all j.

Special case: two-class problem, i.e., Y ∈ {1, 2}; predict

$$ \hat y_0 = \begin{cases} 1 & \text{if } P\{Y = 1 \mid X = x_0\} > 0.5, \\ 2 & \text{otherwise.} \end{cases} $$

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 62 / 462
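The two-class rule above is a one-liner once the conditional probability is known. A minimal sketch (the function name and example probabilities are assumptions for illustration):

```python
# Two-class Bayes rule: given p1 = P(Y = 1 | X = x0) (assumed known here),
# predict class 1 iff p1 > 0.5, else class 2.
def bayes_classify(p1):
    """p1: conditional probability P(Y = 1 | X = x0). Returns 1 or 2."""
    return 1 if p1 > 0.5 else 2

print(bayes_classify(0.7))  # -> 1
print(bayes_classify(0.3))  # -> 2
```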

slide-64
SLIDE 64

Assessing Model Accuracy

Example: Bayes classifier, 2 classes

[Figure: simulated two-class data in the (X1, X2)-plane.]

Predictors: X1, X2. Response: Y ∈ {orange, blue}. Observations: circles. Orange shading: P{Y = orange | X} > 0.5; blue shading: P{Y = orange | X} < 0.5 (simulated data). Dashed line: Bayes decision boundary P{Y = orange | X} = 0.5.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 63 / 462

slide-65
SLIDE 65

Assessing Model Accuracy

Bayes error rate

  • The Bayes classifier produces the lowest possible test error rate, the Bayes error rate.

  • By definition, the error rate at $X = x_0$ is $1 - \max_j P\{Y = j \mid X = x_0\}$.

  • Overall Bayes error rate:

$$ \mathbb{E}\Bigl[1 - \max_j P\{Y = j \mid X\}\Bigr] = 1 - \mathbb{E}\Bigl[\max_j P\{Y = j \mid X\}\Bigr], $$

    the expectation taken with respect to the distribution of X.

  • Previous example: the Bayes error rate is 0.1304; it is positive since some observations lie on the wrong side of the decision boundary, hence $\max_j P\{Y = j \mid X = x_0\} < 1$ for some $x_0$.

  • The Bayes error rate is analogous to the irreducible error.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 64 / 462
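When the conditional distribution is known, the Bayes error rate can be estimated by Monte Carlo. The sketch below uses an assumed toy model (two unit-variance Gaussian classes with equal priors), not the slides' simulated data; for this model the exact Bayes error is 1 − Φ(1) ≈ 0.1587.

```python
# Monte Carlo estimate of the Bayes error rate E[1 - max_j P(Y=j|X)]
# for an assumed model: Y uniform on {1,2}, X|Y=1 ~ N(-1,1), X|Y=2 ~ N(+1,1).
import numpy as np

def phi(x, mu):
    """Gaussian density N(mu, 1) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / 2) / np.sqrt(2 * np.pi)

rng = np.random.default_rng(1)
y = rng.integers(1, 3, 200_000)                    # labels 1 or 2, equal priors
x = rng.normal(np.where(y == 1, -1.0, 1.0), 1.0)   # class-conditional samples

# Posterior P(Y=1 | X=x) via Bayes' theorem (equal priors cancel).
p1 = phi(x, -1) / (phi(x, -1) + phi(x, 1))
bayes_error = np.mean(1 - np.maximum(p1, 1 - p1))
print(bayes_error)  # close to the exact value 1 - Phi(1) ~ 0.1587
```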

slide-66
SLIDE 66

Assessing Model Accuracy

K-nearest neighbors

  • The Bayes classifier is not realizable, since it is based on the unknown conditional distribution; it represents an unattainable reference value.

  • K-nearest neighbors (KNN) classifier: classify based on an estimate of the conditional distribution.

  • Given $X = x_0$, denote by $N_0$ the set of the $K \in \mathbb{N}$ training points closest to $x_0$ and estimate

$$ P\{Y = j \mid X = x_0\} \approx \frac{1}{K} \sum_{i \in N_0} \mathbf{1}_{\{y_i = j\}}, $$

    i.e., by the fraction of the K nearest neighbors belonging to class j.

  • Now proceed as in the Bayes classifier with this approximate conditional distribution.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 65 / 462
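The KNN estimate can be written in a few lines. A minimal from-scratch sketch; the function name and the tiny 2-D data set are assumptions for illustration:

```python
# From-scratch KNN: estimate P(Y=j | X=x0) as the fraction of the K nearest
# training points belonging to class j, then predict the majority class.
import numpy as np

def knn_predict(X_train, y_train, x0, K=3):
    """Return (predicted class, estimated class probabilities) at x0."""
    dists = np.linalg.norm(X_train - x0, axis=1)     # Euclidean distances
    N0 = np.argsort(dists)[:K]                       # indices of K nearest points
    labels, counts = np.unique(y_train[N0], return_counts=True)
    probs = dict(zip(labels.tolist(), (counts / K).tolist()))
    return labels[np.argmax(counts)], probs

# Assumed toy data: three class-1 points near the origin, two class-2 far away.
X_train = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.0], [1.0, 1.0], [1.1, 0.9]])
y_train = np.array([1, 1, 1, 2, 2])
label, probs = knn_predict(X_train, y_train, np.array([0.05, 0.05]), K=3)
print(label, probs)  # all 3 nearest neighbors are class 1
```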

slide-67
SLIDE 67

Assessing Model Accuracy

Example: KNN, K=3

Left: ×: test point $x_0$; green circle: neighborhood $N_0$. Right: KNN applied to all shaded points, with the resulting decision boundary.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 66 / 462

slide-68
SLIDE 68

Assessing Model Accuracy

Example: KNN, K=10

[Figure: KNN decision boundary in the (X1, X2)-plane.]

KNN decision boundary for K = 10 applied to the data set from Slide 63. Test error rates: Bayes: 0.1304; KNN: 0.1363.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 67 / 462

slide-69
SLIDE 69

Assessing Model Accuracy

Example: KNN, K=1,100

KNN applied to data set from Slide 63:

Left: K = 1, high variance; test error rate 0.1695. Right: K = 100, high bias; test error rate 0.1925.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 68 / 462

slide-70
SLIDE 70

Assessing Model Accuracy

Example: KNN error rates against K

[Figure: Error Rate plotted against 1/K (log scale), with Training Errors and Test Errors curves.]

Training and test errors of KNN classification for the same data, plotted against 1/K.

Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 69 / 462
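The qualitative shape of these curves is easy to reproduce. The sketch below uses assumed toy Gaussian data (not the slides' data set): training error vanishes at K = 1, since each training point is its own nearest neighbor, while test error is typically smallest at some intermediate K.

```python
# KNN training vs. test error for several K on assumed two-class Gaussian data.
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    y = rng.integers(1, 3, n)                      # labels 1 or 2
    means = np.where(y == 1, -1.0, 1.0)[:, None]   # class-dependent mean (n, 1)
    X = means + rng.normal(0.0, 1.0, (n, 2))       # broadcast over 2 features
    return X, y

def knn_error(X_tr, y_tr, X_ev, y_ev, K):
    """Misclassification rate of K-nearest-neighbor majority vote."""
    errs = 0
    for x0, y0 in zip(X_ev, y_ev):
        N0 = np.argsort(np.linalg.norm(X_tr - x0, axis=1))[:K]
        pred = 1 if np.mean(y_tr[N0] == 1) > 0.5 else 2
        errs += (pred != y0)
    return errs / len(y_ev)

X_tr, y_tr = make_data(200)
X_te, y_te = make_data(1000)
results = {}
for K in (1, 10, 100):
    results[K] = (knn_error(X_tr, y_tr, X_tr, y_tr, K),
                  knn_error(X_tr, y_tr, X_te, y_te, K))
    print(f"K={K:3d}  train={results[K][0]:.3f}  test={results[K][1]:.3f}")
```

Plotting these rates against 1/K (with more values of K) reproduces the figure above: training error decreases monotonically in 1/K, test error does not.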