

SLIDE 1

Introduction to Data Science

Winter Semester 2019/20 Oliver Ernst

TU Chemnitz, Fakultät für Mathematik, Professur Numerische Mathematik

Lecture Slides

SLIDE 2

Contents I

1 What is Data Science?
2 Learning Theory
  2.1 What is Statistical Learning?
  2.2 Assessing Model Accuracy
3 Linear Regression
  3.1 Simple Linear Regression
  3.2 Multiple Linear Regression
  3.3 Other Considerations in the Regression Model
  3.4 Revisiting the Marketing Data Questions
  3.5 Linear Regression vs. K-Nearest Neighbors
4 Classification
  4.1 Overview of Classification
  4.2 Why Not Linear Regression?
  4.3 Logistic Regression
  4.4 Linear Discriminant Analysis
  4.5 A Comparison of Classification Methods
5 Resampling Methods
Oliver Ernst (NM) Introduction to Data Science Winter Semester 2018/19 3 / 463

SLIDE 3

Contents II

  5.1 Cross Validation
  5.2 The Bootstrap
6 Linear Model Selection and Regularization
  6.1 Subset Selection
  6.2 Shrinkage Methods
  6.3 Dimension Reduction Methods
  6.4 Considerations in High Dimensions
  6.5 Miscellanea
7 Nonlinear Regression Models
  7.1 Polynomial Regression
  7.2 Step Functions
  7.3 Regression Splines
  7.4 Smoothing Splines
  7.5 Generalized Additive Models
8 Tree-Based Methods
  8.1 Decision Tree Fundamentals
  8.2 Bagging, Random Forests and Boosting

SLIDE 4

Contents III

9 Unsupervised Learning
  9.1 Principal Components Analysis
  9.2 Clustering Methods

SLIDE 5

Contents

9 Unsupervised Learning
  9.1 Principal Components Analysis
  9.2 Clustering Methods

SLIDE 8

Unsupervised Learning

Introduction

  • Supervised learning: n observations {(xi, yi) : i = 1, . . . , n}, each consisting of a feature vector xi ∈ Rp and a response observation yi.
  • Construct a prediction model f̂ such that yi ≈ f̂(xi) in order to predict y = f̂(x) for values x not among the data set.
  • Unsupervised learning: only feature observations available, no response data.
  • Prediction not possible.
  • Instead: statistical techniques for “discovering interesting things” about the observations {xi : i = 1, . . . , n}:
  • informative visualization of the data;
  • identification of subgroups in the data/variables.
  • Here: principal components analysis (PCA) and clustering.

SLIDE 10

Unsupervised Learning

Challenges

  • For supervised learning tasks, e.g., binary classification, there is a large selection of well-developed algorithms (logistic regression, LDA, classification trees, SVMs) as well as assessment techniques (CV, validation set, . . . ).
  • Unsupervised learning is more subjective.
  • No clear goal of analysis (such as response prediction).
  • Often performed as part of exploratory data analysis.
  • Results harder to assess (by their very nature).
  • Examples:
  • finding patterns in gene expression data for cancer patients;
  • identifying subgroups of customers of an online shopping platform which display similar behavior/interests;
  • determining which content a search engine should display to which individuals.

SLIDE 11

Contents

9 Unsupervised Learning
  9.1 Principal Components Analysis
  9.2 Clustering Methods

SLIDE 13

Unsupervised Learning

Principal components analysis

  • Many correlated feature/predictor variables X1, . . . , Xp.
  • Form new predictor variables Zm (components) as linear combinations of the original variables.
  • Construct the Zm to be uncorrelated, ordered by decreasing variance.
  • Ideal situation: the first few M < p components (principal components) explain a large part of the total variance of the original variables. In this case the data set is well explained by restriction to the principal components.
  • Have used this idea for principal components regression (Chapter 6). There, principal components served as new (fewer) predictor variables.
  • PCA: the process by which principal components are derived; also a technique for data visualization.
  • Unsupervised, since it applies only to feature/predictor variables.

SLIDE 15

Unsupervised Learning

Principal components

  • To visualize p-variate data using bivariate scatterplots, there are (p choose 2) = p(p − 1)/2 pairs to examine.
  • Besides the effort involved, individual scatterplots are not necessarily that informative, each containing only a small fraction of the information carried by the complete data.
  • Ideal: find a low (1-, 2- or 3-)dimensional representation of the data containing all (most) relevant information.
  • First principal component: the linear combination

      Z1 = φ1,1 X1 + · · · + φp,1 Xp,    Σ_{j=1}^p φj,1² = 1,    (9.1)

    of the original feature variables Xj with normalized coefficients (“loadings”) chosen to maximize variance. Loading vector φ1 := (φ1,1, . . . , φp,1)⊤.

SLIDE 16

Unsupervised Learning

Computing the first principal component

  • Given data set X ∈ Rn×p, i.e., n samples of the p features X1, . . . , Xp.
  • Each column xj = (x1,j, . . . , xn,j)⊤ ∈ Rn, j = 1, . . . , p, contains the n samples (observations) of the j-th feature.
  • Each row x̃i⊤ = (xi,1, . . . , xi,p) ∈ R1×p, i = 1, . . . , n, contains one sample of the p features.
  • Here information is synonymous with variance, hence assume centered columns, i.e., e⊤xj = 0, j = 1, . . . , p, where e = (1, . . . , 1)⊤ ∈ Rn, so the sample mean of each column is zero.
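The centering step can be sketched in NumPy (not part of the slides; the 50 × 4 matrix here is synthetic stand-in data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, size=(50, 4))   # n = 50 samples of p = 4 features

Xc = X - X.mean(axis=0)                 # subtract each column's sample mean

# every column of the centered matrix now has sample mean zero
assert np.allclose(Xc.mean(axis=0), 0.0)
```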

SLIDE 19

Unsupervised Learning

Computing the first principal component

  • Loadings φ1,1, . . . , φp,1 for the first principal component are determined as the (normalized) coefficients in the linear combination z1 = φ1,1 x1 + · · · + φp,1 xp = Xφ1 such that z1 has the largest sample variance (the mean remains zero).
  • In other words, the loadings solve the optimization problem

      max { (1/n) Σ_{i=1}^n ( Σ_{j=1}^p φj,1 xi,j )² : Σ_{j=1}^p φj,1² = 1 }.    (9.2)

  • In other words, the loading vector φ1 solves the optimization problem

      max_{‖φ‖₂=1} ‖Xφ‖₂² = max_{‖φ‖₂=1} φ⊤X⊤Xφ.

  • In other words (Courant–Fischer max-min principle), φ1 is a normalized eigenvector associated with the largest eigenvalue of X⊤X.
SLIDE 22

Unsupervised Learning

Computing the first principal component

  • Equivalent characterization: φ1 is a right singular vector associated with the largest singular value of the (centered) data matrix X.
  • The components z1,1, . . . , zn,1 of z1 are referred to as the scores of the first principal component.
  • Geometric interpretation: the loading vector φ1 defines the direction in feature space along which the data varies the most. “Projection of the data points x̃1, . . . , x̃n (rows of X) onto this direction yields the principal component scores z1.” This is simply the dual interpretation of the matrix-vector product z1 = Xφ1: rather than as a linear combination of the columns {xj : j = 1, . . . , p} ⊂ Rn of X, it is viewed as the vector of inner products of φ1 with the rows {x̃i : i = 1, . . . , n} ⊂ R1×p of X:

      z1 = Xφ1 = (x̃1⊤φ1, . . . , x̃n⊤φ1)⊤.

SLIDE 23

Unsupervised Learning

Computing the first principal component

[Figure: scatterplot of Ad Spending vs. Population.] First principal component loading vector in the advertising data set (green). Here p = 2 and the observation data can be viewed along with the principal component vectors.

SLIDE 24

Unsupervised Learning

Computing the second principal component

  • Second principal component Z2: the linear combination of X1, . . . , Xp with largest variance, subject to the condition that it is uncorrelated with Z1.
  • Scores: z2 = φ1,2 x1 + · · · + φp,2 xp = Xφ2 with second principal component loading vector φ2 = (φ1,2, . . . , φp,2)⊤.
  • Uncorrelatedness is equivalent to orthogonality in the Euclidean inner product.
  • Hence φ2 is a normalized eigenvector associated with the second-largest eigenvalue of X⊤X, or a normalized right singular vector associated with the second-largest singular value of X.
  • Previous figure: p = 2, only one possibility for φ2 (dashed blue line).
  • Remaining components Zm are defined analogously: the linear combination of X1, . . . , Xp with maximal variance uncorrelated with Z1, . . . , Zm−1 (Euclidean orthogonality of the combined sample vectors).
  • There are at most min{n − 1, p} principal components.
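The defining property that distinct components are uncorrelated can be sketched numerically (synthetic data, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(60, 5))
X = X - X.mean(axis=0)           # centered columns

U, s, Vt = np.linalg.svd(X, full_matrices=False)
z1, z2 = X @ Vt[0], X @ Vt[1]    # scores of the first two components

# centered score vectors: uncorrelatedness = Euclidean orthogonality
assert np.isclose(z1.mean(), 0.0) and np.isclose(z2.mean(), 0.0)
assert np.isclose(z1 @ z2, 0.0, atol=1e-8)
```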

SLIDE 25

Unsupervised Learning

Principal components and the SVD

  • Denoting the SVD of the centered data matrix by X = UΣV⊤ gives X⊤X = V Σ⊤Σ V⊤.
  • The eigenvalues of X⊤X in descending order are displayed on the diagonal of Σ⊤Σ = diag(σ1², . . . , σp²).
  • The total variance in the data represented by X is given by ‖X‖F² = σ1² + · · · + σp².
  • The principal component loading vectors {φj : j = 1, . . . , p} are given by the normalized eigenvectors of X⊤X or, equivalently, the right singular vectors of X, i.e., φj = vj, j = 1, . . . , min{n − 1, p}.
  • For the scores zm, we have

      zm = Xφm = UΣV⊤vm = σm um,    m = 1, . . . , min{n − 1, p}.
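The relation zm = σm um and the total-variance identity can be verified in NumPy (synthetic data; not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# scores: z_m = X phi_m = sigma_m u_m for every component m
for m in range(4):
    assert np.allclose(X @ Vt[m], s[m] * U[:, m])

# total variance: ||X||_F^2 = sigma_1^2 + ... + sigma_p^2
assert np.isclose(np.sum(X**2), np.sum(s**2))
```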

SLIDE 27

Unsupervised Learning

Example: USArrests data set

  • USArrests data set: arrests per 100,000 residents of each of the 50 states of the USA for each of the crimes Assault, Murder and Rape.
  • Also records UrbanPop, the percentage of each state’s population living in urban areas.
  • Number of samples = length of the PC score vectors: n = 50.
  • Dimension of feature space = length of the PC loading vectors: p = 4.
  • PCA performed after standardizing the data matrix (column mean zero, standard deviation one).
  • PC loading vectors:

                    PC1         PC2
    Murder     0.5358995  −0.4181809
    Assault    0.5831836  −0.1879856
    Rape       0.5434321   0.1673186
    UrbanPop   0.2781909   0.8728062
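USArrests ships with R rather than Python, so as a hypothetical sketch assume the table has been loaded into a NumPy array `X` of shape (50, 4) (here replaced by synthetic stand-in data); standardization and the loading vectors then follow from the SVD:

```python
import numpy as np

# Hypothetical stand-in for the USArrests matrix (50 states x 4 features);
# replace with the real data, e.g. loaded from a CSV file.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))

# standardize: column mean zero, standard deviation one
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PC loading vectors = right singular vectors of the standardized matrix
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
loadings = Vt.T                  # column m is the m-th loading vector
scores = Z @ loadings            # PC scores for each state

assert np.allclose(np.linalg.norm(loadings, axis=0), 1.0)
```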

SLIDE 28

Unsupervised Learning

Example: USArrests data set

[Figure: biplot of the 50 states in the space of the first two principal components.]

  • Biplot of the data in the space of the first two principal components.
  • Blue state names: scores in the first 2 PCs.
  • Orange arrows: first two PC loading vectors (axes on right and top).
  • Biplot: displays both PC scores and PC loadings.

SLIDE 33

Unsupervised Learning

Example: USArrests data set: Interpretation of figure

  • First loading vector places approximately equal weight on Assault, Murder and Rape, much less weight on UrbanPop. Hence the first PC roughly corresponds to a measure of the overall rate of serious violent crime.
  • Second loading vector has more weight on UrbanPop, much less on the remaining three features, hence roughly corresponds to the level of urbanization of each state.
  • Overall, the crime-related variables are close to each other in the space spanned by the first two PCs, UrbanPop far from these: indicates the crime-related variables are highly correlated with each other, weakly correlated with UrbanPop.
  • State differences in the first PC: states with high scores in the first component tend to have high crime rates (e.g. California, Nevada, Florida); those with negative first PC scores tend to have low crime rates (e.g. North Dakota).
  • State differences in the 2nd PC: a high score in the 2nd PC (e.g. California) indicates a high level of urbanization, a low score a low level (e.g. Mississippi).
  • States close to the origin?

SLIDE 35

Unsupervised Learning

PCA: another interpretation

[Figure: 3D observations with the plane spanned by the first two principal components.]

  • First two PC loading vectors of a 3D data set along with the observations.
  • They span a plane along which the observations have the highest variance.
  • Alternative interpretation: PCs provide low-dimensional surfaces that are closest to the observations.
  • Projection of the observations onto this closest plane: variance is maximized.

SLIDE 36

Unsupervised Learning

PCA: another interpretation

[Figure: scatterplot of Ad Spending vs. Population with the first PC line and projection distances.]

  • Example from Chapter 6 (ad spending vs. population).
  • 1st PC loading vector: the line in Rp closest to the observations (in Euclidean distance).
  • Dashed lines: distance between each observation and the first PC loading vector.
  • In this sense: a good summary of the data.

SLIDE 38

Unsupervised Learning

PCA: another interpretation

Summary: the first M principal components and associated score vectors together yield a best approximation of the observational data:

    xi,j ≈ Σ_{m=1}^M zi,m φj,m.    (9.3)

Explanation: writing all n × p equations (9.3) in matrix form yields

    X ≈ Σ_{m=1}^M zm φm⊤ = Σ_{m=1}^M σm um vm⊤,

which is simply the singular value expansion of X truncated after M terms. Recalling the best approximation property of the truncated SVD in the spectral and Frobenius norms explains the nearness of the expression (9.3) to the data.
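The truncated singular value expansion and its Frobenius-norm error can be sketched as follows (synthetic data, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(40, 6))
X = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

M = 2  # keep the first M principal components
X_M = U[:, :M] @ np.diag(s[:M]) @ Vt[:M]   # sum of sigma_m u_m v_m^T, m <= M

# Frobenius error of the rank-M truncation: sqrt of the discarded sigma_m^2
assert np.isclose(np.linalg.norm(X - X_M), np.sqrt(np.sum(s[M:] ** 2)))
```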

SLIDE 39

Unsupervised Learning

PCA: scaling

  • Data matrix is centered before applying PCA.
  • Individual scaling of the predictor variables (columns) will affect the outcome of PCA.
  • Contrast with linear regression, where rescaling of a variable is exactly compensated by the associated coefficient.
  • In the USArrests example, each variable was rescaled to have standard deviation one.
  • Reason: the variables have different units (Murder, Rape and Assault in occurrences per 100,000 people, UrbanPop in percentage living in urban areas). Also: the variances 18.97, 87.73, 6945.16 and 209.5, respectively, display large variation. Hence without scaling, the first PC loading vector would have a very large weight on Assault.
  • Scaling is recommended, but doing so should be deliberate.

SLIDE 40

Unsupervised Learning

PCA: scaling

[Figure: two biplots of the USArrests data, scaled (left) and unscaled (right).]

Biplots for PCA applied to the USArrests data set. Left: all variables scaled to have standard deviation one. Right: PCA performed on the unscaled variables.

SLIDE 42

Unsupervised Learning

PCA: uniqueness

  • Singular vectors and normalized eigenvectors are unique only up to sign; hence the same holds for principal components.
  • Different software packages will yield the same PC loading vectors up to sign.
  • Sign flipping is harmless, as PCs represent directions in Euclidean space.
  • Note that flipping the sign of φm in (9.3) will result in a sign flip in zm, leaving the product unchanged.

SLIDE 44

Unsupervised Learning

PCA: proportion of variance explained

  • How much information is lost by replacing the original data with the PC approximation (projecting the observations onto the first M < p principal components)?
  • More precisely: how much of the variance of the original data is missing in the PC approximation? What is the proportion of variance explained (PVE)?
  • Define the total variance in (centered) X by

      Σ_{j=1}^p Var Xj := Σ_{j=1}^p (1/n) Σ_{i=1}^n xi,j² = (1/n) ‖X‖F².

  • Variance explained by the m-th PC:

      (1/n) ‖zm‖₂² = (1/n) Σ_{i=1}^n zi,m² = (1/n) Σ_{i=1}^n ( Σ_{j=1}^p φj,m xi,j )² = (1/n) ‖Xφm‖₂².

  • Hence the PVE of the m-th PC is given by ‖Xφm‖₂² / ‖X‖F².
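Since ‖Xφm‖₂² = σm², the PVE of each component follows directly from the singular values; a sketch with synthetic data (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized columns

s = np.linalg.svd(X, compute_uv=False)

pve = s**2 / np.sum(s**2)     # PVE of each PC: ||X phi_m||^2 / ||X||_F^2
cum_pve = np.cumsum(pve)      # data for the cumulative scree plot

assert np.isclose(pve.sum(), 1.0)
assert np.all(np.diff(pve) <= 0)   # singular values are sorted descending
```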

SLIDE 46

Unsupervised Learning

PCA: proportion of variance explained

  • In the USArrests data set: the first PC explains 62% of the total variance, the 2nd explains 24.7%. Hence the first two explain ≈ 87%, the remaining two only 13%.
  • Therefore, the biplot gives an accurate summary of the data (using just 2 dimensions).
  • Scree plots: display the PVE of each PC as well as the cumulative PVE.

[Figure: scree plots; left: proportion of variance explained per principal component; right: cumulative proportion of variance explained.]

SLIDE 48

Unsupervised Learning

PCA: sufficient number of principal components

  • Can choose M between 1 and min{p, n − 1}.
  • Ideal: the smallest M conveying a good understanding of the data.
  • A scree plot can provide guidance: fix M at an “elbow”, i.e., where the proportion of variance explained has a noticeable drop. In the previous example, an elbow after M = 2 could be argued.
  • Such visual analysis is heuristic, subjective and ad hoc, but there is no general answer for determining how many PCs are enough (exploratory data analysis).
  • In supervised learning, M is a tuning parameter, which can be determined by CV or a similar validation technique.

SLIDE 49

Unsupervised Learning

PCA: further uses for PCs

  • Supervised learning: new features, smaller in number than the original ones.
  • A low-rank approximation of X obtained by truncating the SVD after M < p terms is often better than the full X due to noise reduction (e.g. latent semantic indexing).
  • The signal of a data set is often contained in the first few PCs; the rest can be noise.

SLIDE 50

Contents

9 Unsupervised Learning
  9.1 Principal Components Analysis
  9.2 Clustering Methods

SLIDE 51

Unsupervised Learning

Clustering methods

  • Broad set of techniques for finding clusters or subgroups in a data set.
  • Partition the data into distinct subsets of similar observations, where the notion of similarity is problem-dependent.
  • Unsupervised problem of finding structure in a data set.
  • Clustering and PCA both seek to simplify the data via a small number of summaries, but via different mechanisms:
  • PCA seeks a low-dimensional representation of the observations explaining a good fraction of their variance;
  • clustering seeks homogeneous subgroups among the observations.
  • Example: given marketing measurements (median household income, occupation, distance from nearest urban area, etc.) for a large population, perform market segmentation to identify subgroups of people more receptive to a particular form of advertising or more likely to buy a particular product (cluster the people in a data set).
  • Here: 2 approaches, K-means clustering and hierarchical clustering.

SLIDE 53

Unsupervised Learning

K-means clustering

  • Partition the data into K ∈ N disjoint clusters.
  • Upon fixing K, the algorithm assigns each observation to one of the K clusters.
  • Let C1, . . . , CK denote the sets containing the indices of the n observations in each cluster, such that

      C1 ∪ · · · ∪ CK = {1, . . . , n},    Ck ∩ Cℓ = ∅ for k ≠ ℓ, k, ℓ = 1, . . . , K.

  • A good clustering is one for which the within-cluster variation is small.
  • With W(Ck) denoting a measure of the amount by which the observations in cluster k differ, K-means clustering tries to determine

      arg min_{C1,...,CK} Σ_{k=1}^K W(Ck).

SLIDE 54

Unsupervised Learning

K-means clustering

  • Common measure for within-cluster variation: squared Euclidean distance

      W(Ck) = (1/|Ck|) Σ_{i,i′∈Ck} Σ_{j=1}^p (xi,j − xi′,j)²,

    with |Ck| denoting the cardinality of Ck.
  • The cluster optimization problem thus becomes

      arg min_{C1,...,CK} Σ_{k=1}^K (1/|Ck|) Σ_{i,i′∈Ck} Σ_{j=1}^p (xi,j − xi′,j)².    (9.4)

  • The number of possible clusterings of n observations into K clusters grows¹⁰ like K^n. There are, however, simple heuristics for finding good approximations of the solution.

¹⁰These are known as the Stirling numbers of the second kind; S(n, K) ∼ K^n/K! as n → ∞.

SLIDE 55

Unsupervised Learning

K-means clustering

Algorithm 6: K-means clustering.

1 Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations.

2 Iterate until the cluster assignments stop changing:
  a For each of the K clusters, compute the cluster centroid. The k-th cluster centroid is the vector of the p feature means for the observations in the k-th cluster.
  b Assign each observation to the cluster whose centroid is closest (where closest is defined by Euclidean distance).

The name of the algorithm derives from the computation of the centroids in step (2a), which are the means across all observations currently assigned to each cluster.

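Algorithm 6 can be transcribed into a few lines of NumPy (an illustrative implementation, not the course's reference code; for simplicity it assumes no cluster becomes empty during the iteration):

```python
import numpy as np

def kmeans(X, K, rng=None, max_iter=100):
    """K-means following Algorithm 6: start from a random cluster
    assignment, then alternate centroid computation (step 2a) and
    reassignment to the nearest centroid (step 2b)."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    labels = rng.integers(0, K, size=n)          # step 1: random initial assignment
    for _ in range(max_iter):
        # step 2a: centroid of each cluster = mean of its observations
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # step 2b: assign each observation to the closest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids
```

Because only convergence to a local optimum is guaranteed, the result depends on the random initial assignment in step 1.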
SLIDE 56

Unsupervised Learning

K-means clustering


Simulated data in 2D, n = 150. Results of applying K-means clustering with K = 2, 3, 4.

SLIDE 57

Unsupervised Learning

K-means clustering

  • Algorithm 6 is guaranteed to decrease the value of the objective (9.4) in each step.
  • Introducing the cluster means

        x̄_{k,j} := (1/|C_k|) Σ_{i ∈ C_k} x_{i,j},    j = 1, . . . , p,

    there holds

        (1/|C_k|) Σ_{i,i′ ∈ C_k} Σ_{j=1}^{p} (x_{i,j} − x_{i′,j})² = 2 Σ_{i ∈ C_k} Σ_{j=1}^{p} (x_{i,j} − x̄_{k,j})².

  • In step (2a), the cluster means for each feature are the constants minimizing the sum-of-squares deviations.
  • In step (2b), reallocating the observations among the clusters can only decrease the objective.
  • As Algorithm 6 runs, the objective improves monotonically until it no longer changes, ending in a local optimum.

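The identity above is easy to check numerically (a small NumPy sanity check on randomly generated data, for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
Xk = rng.standard_normal((8, 3))        # observations in one cluster C_k

# left-hand side: (1/|C_k|) * sum over all pairs (i, i') and features j
diffs = Xk[:, None, :] - Xk[None, :, :]
lhs = (diffs ** 2).sum() / len(Xk)

# right-hand side: 2 * sum of squared deviations from the cluster mean
xbar = Xk.mean(axis=0)
rhs = 2 * ((Xk - xbar) ** 2).sum()

assert np.isclose(lhs, rhs)
```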
SLIDE 58

Unsupervised Learning

K-means clustering

[Figure panels: Data | Step 1 | Iteration 1, Step 2a]

Progress of the K-means algorithm on the running example, K = 3: beginning with just the observations, initial random assignment to clusters, centroid computation (large colored disks), reassignment to clusters, recomputation of the centroids, and the final result after 10 iterations.

SLIDE 59

Unsupervised Learning

K-means clustering

[Figure panels: Iteration 1, Step 2b | Iteration 2, Step 2a | Final Results]

Progress of the K-means algorithm on the running example, K = 3: beginning with just the observations, initial random assignment to clusters, centroid computation (large colored disks), reassignment to clusters, recomputation of the centroids, and the final result after 10 iterations.

SLIDE 60

Unsupervised Learning

K-means clustering, dealing with local minima

[Figure: six K-means runs with objective values 320.9, 235.8, 235.8, 235.8, 235.8, 310.9.]

Since the result of K-means is typically only a local minimum, it is advisable to run the algorithm multiple times using different random initial clusterings and pick the outcome with the smallest objective.

Here K-means with K = 3 was run on the data of the previous toy example with different random initializations. Three outcomes achieved the same (suboptimal) objective value.

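The restart strategy can be sketched as follows (illustrative NumPy code, not from the slides; as a slight variant of step 1, the initial assignment is a balanced random labeling so that no cluster starts out empty):

```python
import numpy as np

def lloyd(X, K, rng):
    """One K-means run from a balanced random initial assignment;
    returns the labels and the final value of objective (9.4)."""
    labels = rng.permutation(np.arange(len(X)) % K)   # balanced initial clusters
    for _ in range(100):
        centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        new = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(1)
        if np.array_equal(new, labels):
            break
        labels = new
    # objective via the centroid identity: 2 * sum of squared deviations
    obj = sum(2.0 * ((X[labels == k] - centroids[k]) ** 2).sum() for k in range(K))
    return labels, obj

def kmeans_restarts(X, K, n_init=20, seed=0):
    """Run K-means n_init times, keep the clustering with smallest objective."""
    rng = np.random.default_rng(seed)
    return min((lloyd(X, K, rng) for _ in range(n_init)), key=lambda r: r[1])
```

Library implementations do the same thing; the number of restarts is simply another tuning parameter.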
SLIDE 61

Unsupervised Learning

Hierarchical clustering

  • Alternative to K-means algorithm, does not require K to be specified in

advance.

  • Results in tree-based cluster representation called a dendrogram.
  • Here: bottom-up or agglomerative clustering.

[Figure: left, scatter plot of the simulated data (axes X1, X2); right, the resulting dendrogram (heights 2–10).]

Simulated data, 45 observations, 3 classes; hierarchical clustering yields the dendrogram on the right.

SLIDE 62

Unsupervised Learning

Hierarchical clustering, interpreting a dendrogram

  • Each leaf corresponds to one of the original 45 observations.
  • Moving up the tree, some leaves begin to fuse into branches, reflecting the similarity of these leaves.
  • Advancing further up, branches fuse with leaves or with other branches.
  • Earlier fusion (moving bottom-up) indicates stronger similarity of (groups of) observations.
  • More precisely: for any pair of observations, the height (from the bottom) at which their subtrees are first joined is a measure of their dissimilarity.

SLIDE 63

Unsupervised Learning

Hierarchical clustering, interpreting a dendrogram

[Figure: left, dendrogram with leaves ordered 3, 4, 1, 6, 9, 2, 8, 5, 7 (heights 0.0–3.0); right, scatter plot of the 9 observations (axes X1, X2).]

Left: dendrogram of 9 observations of two-dimensional data. Right: original data. Observations 1 and 6 as well as 5 and 7 are very similar; 9 is no more similar to 2 than to 8, 5 and 7, even though 9 and 2 are close horizontally in the dendrogram; 2, 8, 5, 7 all fuse with 9 at the same height, ≈ 1.8.

SLIDE 64

Unsupervised Learning

Hierarchical clustering, identifying clusters from a dendrogram

Cutting a dendrogram horizontally, the distinct sets of observations beneath the cut can be interpreted as clusters.


On the left, cutting the dendrogram at a height of 9 yields 2 clusters. On the right, cutting at height 5 yields 3 clusters. Further cuts can be made at different heights, yielding between 1 cluster (no cut) and n clusters (cut at height 0). The cut height plays the same role as K in K-means clustering.

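Using SciPy's hierarchical clustering routines, cutting the dendrogram at a given height is a one-liner (an illustrative sketch; the synthetic data here are not the toy example from the slides):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# three well-separated 2-D groups of 15 observations each
X = np.vstack([rng.normal(loc, 0.3, size=(15, 2)) for loc in (0.0, 4.0, 8.0)])

Z = linkage(X, method="complete")   # builds the dendrogram (fusion heights)

# cutting at a large height yields few clusters, at a small height many
labels_high = fcluster(Z, t=9.0, criterion="distance")
labels_low = fcluster(Z, t=1.0, criterion="distance")
assert len(np.unique(labels_high)) <= len(np.unique(labels_low))
```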
SLIDE 65

Unsupervised Learning

Hierarchical clustering, identifying clusters from a dendrogram

  • A single dendrogram yields any number of clusterings.
  • The cut is usually chosen by inspection.
  • Hierarchical refers to the fact that the clusterings obtained from different heights of the same dendrogram are nested. However, this nested structure is not always realistic. (Example: a group split 50-50 among males and females, but split equally among 3 nationalities.) In such situations K-means may yield better results.

SLIDE 67

Unsupervised Learning

Hierarchical clustering algorithm

  • Introduce a measure of dissimilarity between observation pairs, e.g. Euclidean distance.
  • Start at the bottom: each observation is treated as its own cluster.
  • The two most similar clusters are fused, yielding n − 1 clusters.
  • The next fusion yields n − 2 clusters.
  • Proceed until a single cluster remains.

Algorithm 7: Hierarchical clustering.

1 Begin with n observations and a measure of all n(n − 1)/2 pairwise dissimilarities. Treat each observation as its own cluster.

2 For i = n, n − 1, . . . , 2:
  a Examine all pairwise inter-cluster dissimilarities among the i clusters and identify the pair of clusters that are least dissimilar (that is, most similar). Fuse these two clusters. The dissimilarity between these two clusters indicates the height in the dendrogram at which the fusion should be placed.
  b Compute the new pairwise inter-cluster dissimilarities among the i − 1 remaining clusters.

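A direct (deliberately naive and inefficient) transcription of Algorithm 7 with complete linkage, for illustration only:

```python
import numpy as np

def agglomerate_complete(X):
    """Naive Algorithm 7 with complete linkage: repeatedly fuse the two
    least dissimilar clusters, recording each fusion and its height
    (the inter-cluster dissimilarity at fusion time)."""
    clusters = [[i] for i in range(len(X))]   # each observation its own cluster
    fusions = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete linkage: largest pairwise Euclidean distance
                d = max(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        fusions.append((list(clusters[a]), list(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]   # fuse the pair
        del clusters[b]
    return fusions

# two tight pairs in 1-D: the pairs fuse first, the final fusion is highest
fusions = agglomerate_complete(np.array([[0.0], [0.1], [5.0], [5.1]]))
```

The recorded fusion heights are exactly what a dendrogram plots; production code would use a library routine instead of this quadruple loop.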
SLIDE 68

Unsupervised Learning

Hierarchical clustering algorithm: linkage

  • How is distance measure between groups of observations defined?
  • Different notions of linkage possible

Complete: Maximal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the largest of these dissimilarities.

Single: Minimal intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the smallest of these dissimilarities. Single linkage can result in extended, trailing clusters in which single observations are fused one at a time.

Average: Mean intercluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B, and record the average of these dissimilarities.

Centroid: Dissimilarity between the centroid for cluster A (a mean vector of length p) and the centroid for cluster B. Centroid linkage can result in undesirable inversions.

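The four linkage notions simply reduce the matrix of pairwise dissimilarities in different ways; a compact sketch with Euclidean distance (illustrative helper functions, not from the slides):

```python
import numpy as np

def pairwise(A, B):
    """Matrix of Euclidean distances between observations in A and in B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def complete(A, B): return pairwise(A, B).max()    # largest pairwise dissimilarity
def single(A, B):   return pairwise(A, B).min()    # smallest pairwise dissimilarity
def average(A, B):  return pairwise(A, B).mean()   # mean pairwise dissimilarity

def centroid(A, B):
    """Distance between the two cluster centroids (mean vectors of length p)."""
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
```

By construction, single linkage always gives the smallest and complete linkage the largest of the three pairwise-based values.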
SLIDE 69

Unsupervised Learning

Hierarchical clustering algorithm: linkage

[Figure: scatter plot of the 9 observations (axes X1, X2).]

First few steps of the hierarchical clustering algorithm on the previous data, using complete linkage and Euclidean distance.

SLIDE 73

Unsupervised Learning

Hierarchical clustering algorithm: linkage

[Figure panels: Average Linkage | Complete Linkage | Single Linkage]

Dendrograms resulting from the hierarchical clustering algorithm using average, complete and single linkage applied to the same data set. Average and complete linkage tend to produce more balanced dendrograms.

SLIDE 75

Unsupervised Learning

Hierarchical clustering: choice of dissimilarity measure

  • Alternative to Euclidean distance: correlation-based distance, which considers two observations similar if their features are highly correlated.
  • This may be true even if their Euclidean distance is large.

[Figure: feature profiles of three observations across 20 variables (x-axis: Variable Index).]

3 observations of 20 variables each. Observations 1 and 3 have similar values (small Euclidean distance) but are weakly correlated. Observations 1 and 2 have a large Euclidean distance but are highly correlated.

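The contrast between the two dissimilarity notions fits in a few lines (illustrative NumPy sketch with made-up feature vectors):

```python
import numpy as np

def correlation_distance(x, y):
    """Dissimilarity 1 - r, with r the sample correlation between the two
    feature vectors; small when the profiles move together."""
    return 1.0 - np.corrcoef(x, y)[0, 1]

# observation 2 is observation 1 shifted upward: far apart in Euclidean
# distance, yet perfectly correlated (hypothetical feature vectors)
x1 = np.array([1.0, 3.0, 2.0, 5.0])
x2 = x1 + 100.0
euclid = np.linalg.norm(x1 - x2)          # 200.0
corr_dist = correlation_distance(x1, x2)  # ~0.0
```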
SLIDE 76

Unsupervised Learning

Hierarchical clustering: choice of dissimilarity measure

Example: Online retailer clustering customers

  • Objective: cluster shoppers based on their past shopping histories; identify

subgroups of similar shoppers so each group can be shown items/ads of shared interest.

  • Data as matrix: rows shoppers, columns items for sale, entries # times

shopper has purchased item.

  • In Euclidean distance, shoppers who have purchased very few items would

be close (may not be desirable).

  • In correlation-based distance, shoppers with similar preferences (e.g. who

bought items A and B but never C and D) would be close, even if some have purchased in higher volume than others.

  • Here correlation-based distance probably better.

SLIDE 77

Unsupervised Learning

Hierarchical clustering: scaling issues

Scale data to standard deviation one before applying dissimilarity measure?

  • Online store again: some items likely purchased more often than others

(socks vs. computers).

  • High-frequency purchases tend to have stronger effect on inter-shopper

dissimilarity.

  • Scaling to unit standard deviation before computing inter-observation dissimilarity gives each variable equal importance.

  • Also advisable when observation features measured in different scales/units.
  • Applies to K-means clustering as well.

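The per-variable scaling itself is a one-liner (sketch; the purchase counts below are made up):

```python
import numpy as np

def standardize(X):
    """Scale each feature (column) to unit standard deviation, so that
    high-volume variables do not dominate the dissimilarity measure."""
    return X / X.std(axis=0)

# hypothetical purchase counts: many socks, few computers
X = np.array([[ 8.0, 0.0],
              [11.0, 1.0],
              [ 7.0, 1.0],
              [ 6.0, 0.0]])
Xs = standardize(X)   # every column now has standard deviation 1
```

Centering (subtracting the column means) before scaling is the other common standardization step; for distance-based dissimilarities it cancels out, but it matters for correlation-based ones.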
SLIDE 78

Unsupervised Learning

Hierarchical clustering: scaling issues

[Figure panels (Socks vs. Computers): raw purchase counts; variables scaled to unit standard deviation; amount spent.]

Online retailer selling only socks and computers. Left: number of socks/computers purchased by 8 customers (distinguished by color). In Euclidean distance on the raw data, computer purchases have little or no effect (less informative, although computers have higher margins). Center: each variable scaled by its standard deviation. Right: the same data, with the y-axis showing the amount spent on each item.

SLIDE 80

Unsupervised Learning

Practical issues in clustering

Decisions to make a priori

  • Standardize observations/features before measuring similarity? (Centering,

scaling)

  • For hierarchical clustering:
  • Choice of dissimilarity measure?
  • Choice of linkage?
  • Choice of dendrogram cutting height?
  • For K-means: choice of K?

Validating obtained clusters

  • Have we found meaningful subgroups or only clustered the noise?
  • Some proposals for assigning p-values to clusters are given in [Hastie et al., 2009].

SLIDE 81

Unsupervised Learning

Practical issues in clustering

Further issues

  • Sometimes assigning all observations to clusters may be inappropriate.
  • Example: most observations belong to a small number of (unknown) subgroups, while a few observations are very different from the rest. The presence of such outliers, which shouldn't be in any cluster, can heavily distort the clustering outcome.
  • This issue is addressed by mixture models (a soft version of K-means clustering), described in [Hastie et al., 2009].
  • Non-robustness to data perturbations: perform clustering on n observations, then repeat after randomly removing some observations. Often the result will differ strongly.
  • Recommendations: perform clustering repeatedly with different parameter choices and look for patterns which consistently emerge. Also cluster subsets of the data to obtain a sense of robustness. View results not as absolute truth, but as a starting point for further investigation.
