SLIDE 1 Data Analysis with Geometry
A common situation: an outcome attribute (variable) Y, and one or more independent covariate or predictor attributes X1, …, Xp. One usually observes these variables for multiple "instances" (or entities).
Data Analysis with Geometry
One may be interested in various things: What effects do the covariates Xi have on the outcome Y? How well can we quantify these effects? Can we predict the outcome Y using the covariates Xi? Etc.
Data Analysis with Geometry Motivating Example: Credit Analysis
default  student  balance    income
No       No        729.5265  44361.625
No       Yes       817.1804  12106.135
No       No       1073.5492  31767.139
No       No        529.2506  35704.494
No       No        785.6559  38463.496
No       Yes       919.5885   7491.559
Data Analysis with Geometry
Task: predict account default. What is the outcome Y? What are the predictors Xj?
From data to feature vectors
The vast majority of ML algorithms we see in class treat instances as "feature vectors": we can represent each instance as a vector ⟨x1, …, xp, y⟩ in Euclidean space. Every measurement is represented as a continuous value; in particular, categorical variables become numeric (e.g., one-hot encoding).
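The encoding step can be sketched in a few lines of Python. This is a minimal illustration, not a library routine: the `encode` helper and the 0/1 mapping for the binary variables are hypothetical names chosen for this example, and a categorical variable with more than two levels would get one indicator column per level.

```python
# Sketch: turning credit records into numeric feature vectors.
# Binary categorical variables (default, student) are mapped to 0/1.

def encode(record):
    """Map one credit record (a dict) to a numeric feature vector."""
    yes_no = {"No": 0.0, "Yes": 1.0}
    return [
        yes_no[record["default"]],
        yes_no[record["student"]],
        record["balance"],
        record["income"],
    ]

records = [
    {"default": "No", "student": "No",  "balance": 729.5265, "income": 44361.625},
    {"default": "No", "student": "Yes", "balance": 817.1804, "income": 12106.135},
]

vectors = [encode(r) for r in records]
print(vectors[1][:2])  # [0.0, 1.0]
```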
From data to feature vectors
Here is the same credit data represented as a matrix of feature vectors:

default  student  balance    income
1        0        1717.0716  38408.89
1        1        1983.2345  25687.93
1                  883.1573  18213.08
1        0        1975.6530  38221.84
                     0.0000  32809.33
                   528.0893  46389.34
Technical notation
Observed values will be denoted in lower case. So xi means the i-th observation of the random variable X. Matrices are represented with bold face upper case. For example X will represent all observed predictors. N (or n) will usually mean the number of observations, or the length of Y. i will be used to denote which observation and j to denote which covariate or predictor.
Technical notation
Vectors will not be bold; for example xi may mean all predictors for subject i, unless it is the vector of a particular predictor xj. All vectors are assumed to be column vectors, so the i-th row of X will be x′i, i.e., the transpose of xi.
Geometry and Distances
Now that we think of instances as vectors we can do some interesting things. Let's try a first one: define a distance between two instances using Euclidean distance:

d(x1, x2) = √( ∑_{j=1}^p (x1j − x2j)² )
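The distance above translates directly into code. A minimal sketch (the function name `euclidean_distance` is illustrative):

```python
import math

def euclidean_distance(x1, x2):
    """d(x1, x2) = sqrt(sum_j (x1j - x2j)^2); assumes equal lengths."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```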
Geometry and Distances
K-nearest neighbor classification
Now that we have a distance between instances we can create a classifier. Suppose we want to predict the class Y for an instance x. K-nearest neighbors uses the k closest points in predictor space to predict Y:

Ŷ = (1/k) ∑_{xk ∈ Nk(x)} yk

Nk(x) represents the k-nearest points to x. How would you use Ŷ to make a prediction?
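One common answer to the question on this slide is a majority vote among the k neighbors (for 0/1 labels this is the same as thresholding the average Ŷ at 1/2). A minimal sketch, with toy data and illustrative function names:

```python
import math
from collections import Counter

def euclidean(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_predict(x, X, y, k=3):
    """Majority vote among the k training points nearest to x."""
    neighbors = sorted(range(len(X)), key=lambda i: euclidean(x, X[i]))[:k]
    votes = Counter(y[i] for i in neighbors)
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters with labels 0 and 1.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]
print(knn_predict([0.5, 0.5], X, y, k=3))  # 0
print(knn_predict([5.5, 5.5], X, y, k=3))  # 1
```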
Geometry and Distances
Inductive bias
The assumptions we make about our data that allow us to make predictions. In KNN, our inductive bias is that points that are nearby will be of the same class.
Geometry and Distances
Parameter K is a hyper-parameter; its value may affect prediction accuracy significantly. Question: which situation may lead to overfitting, high or low values of K? Why?

The importance of transformations
Which of these two features will affect distance the most? Feature scaling is an important issue in distance-based methods.
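The scaling issue is easy to demonstrate with the credit features: income is measured on a scale roughly 40 times larger than balance, so it dominates raw Euclidean distance. A minimal sketch, standardizing each column (centering and dividing by its standard deviation); the `standardize` helper is illustrative:

```python
import math
import statistics

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Two accounts: [balance, income]. The raw distance is almost
# entirely the income gap, regardless of how balances compare.
a = [729.53, 44361.63]
b = [1073.55, 31767.14]
raw_d = euclidean(a, b)  # ~12599, dominated by income

def standardize(rows):
    """Center each column and divide by its (population) standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

a_s, b_s = standardize([a, b])
scaled_d = euclidean(a_s, b_s)  # both features now contribute equally
```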
Quick vector algebra review
A (real-valued) vector is just an array of real values; for instance x = ⟨1, 2.5, −6⟩ is a three-dimensional vector. Vector sums are computed pointwise, and are only defined when dimensions match, so ⟨1, 2.5, −6⟩ + ⟨2, −2.5, 3⟩ = ⟨3, 0, −3⟩. In general, if c = a + b then cd = ad + bd for every dimension d.
Quick vector algebra review
Vector addition can be viewed geometrically as taking a vector a, then tacking b on to the end of it; the new end point is exactly c.
Quick vector algebra review
Scalar Multiplication: vectors can be scaled by real values; 2⟨1, 2.5, −6⟩ = ⟨2, 5, −12⟩. In general, ax = ⟨ax1, ax2, …, axp⟩.
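Both operations are one-liners in code. A minimal sketch using the slide's example vectors (the helper names `vadd` and `vscale` are illustrative):

```python
def vadd(a, b):
    """Pointwise vector sum; only defined when dimensions match."""
    assert len(a) == len(b), "dimension mismatch"
    return [x + y for x, y in zip(a, b)]

def vscale(c, x):
    """Scalar multiplication: c*x = <c*x1, ..., c*xp>."""
    return [c * v for v in x]

print(vadd([1, 2.5, -6], [2, -2.5, 3]))  # [3, 0.0, -3]
print(vscale(2, [1, 2.5, -6]))           # [2, 5.0, -12]
```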
Quick vector algebra review
The norm of a vector x, written ‖x‖, is its length. Unless otherwise specified, this is its Euclidean length, namely:

‖x‖ = √( ∑_{j=1}^p xj² )
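In code the norm is just the distance formula applied to a single vector. A minimal sketch:

```python
import math

def norm(x):
    """Euclidean norm ||x|| = sqrt(sum_j x_j^2), the length of x."""
    return math.sqrt(sum(v * v for v in x))

print(norm([3, 4]))  # 5.0
```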
Quick vector algebra review
Quiz
Write the Euclidean distance of vectors u and v as a vector norm.
Quick vector algebra review
The dot product, or inner product, of two vectors u and v is defined as

u′v = ∑_{j=1}^p ujvj

A useful geometric interpretation of the inner product v′u is that it gives the projection of v onto u (when ‖u‖ = 1).
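The projection interpretation is easiest to see with a unit vector along one axis. A minimal sketch (the `dot` helper is illustrative):

```python
def dot(u, v):
    """Inner product u'v = sum_j u_j * v_j."""
    return sum(a * b for a, b in zip(u, v))

# When ||u|| = 1, dot(v, u) is the signed length of v's projection onto u.
u = [1.0, 0.0]    # a unit vector along the first axis
v = [3.0, 4.0]
print(dot(v, u))  # 3.0, the component of v along u
```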
The curse of dimensionality
Distance-based methods like KNN can be problematic in high-dimensional problems. Consider the case where we have many covariates and we want to use k-nearest neighbor methods. Basically, we need to define a distance and look for small multi-dimensional "balls" around the target points. With many covariates this becomes difficult.
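A quick simulation illustrates the difficulty: as the number of covariates p grows, the distances from a target point to random points concentrate, so the "nearest" neighbors are barely nearer than the farthest ones. This is a sketch under assumed settings (uniform points in the unit cube, the origin as the target, illustrative helper names):

```python
import math
import random

random.seed(0)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def spread(p, n=200):
    """Farthest-to-nearest distance ratio from the origin to n random
    points in [0, 1]^p. A ratio near 1 means "nearest" carries little
    information."""
    pts = [[random.random() for _ in range(p)] for _ in range(n)]
    dists = [euclidean([0.0] * p, q) for q in pts]
    return max(dists) / min(dists)

print(spread(2))    # large: some points are much closer than others
print(spread(100))  # close to 1: all points are roughly equidistant
```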
Summary
We will represent many ML algorithms geometrically as vectors
Vector math review
K-nearest neighbors
The curse of dimensionality
Introduction to Data Science: Data Analysis with Geometry
Héctor Corrada Bravo
University of Maryland, College Park, USA 2020-04-05