One-Hot Encoding — Machine Learning with PySpark — Andrew Collier — PowerPoint PPT Presentation



SLIDE 1

One-Hot Encoding

MACHINE LEARNING WITH PYSPARK

Andrew Collier

Data Scientist, Exegetic Analytics

SLIDE 2

The problem with indexed values

# Counts for 'type' category

+-------+-----+
|   type|count|
+-------+-----+
|Midsize|   22|
|  Small|   21|
|Compact|   16|
| Sporty|   14|
|  Large|   11|
|    Van|    9|
+-------+-----+

# Numerical indices for 'type' category

+-------+--------+
|   type|type_idx|
+-------+--------+
|Midsize|     0.0|
|  Small|     1.0|
|Compact|     2.0|
| Sporty|     3.0|
|  Large|     4.0|
|    Van|     5.0|
+-------+--------+

SLIDE 3

Dummy variables

+-------+      +-------+-------+-------+-------+-------+-------+
|   type|      |Midsize| Small |Compact| Sporty| Large |  Van  |
+-------+      +-------+-------+-------+-------+-------+-------+
|Midsize|      |   X   |       |       |       |       |       |
|  Small|      |       |   X   |       |       |       |       |
|Compact| ===> |       |       |   X   |       |       |       |
| Sporty|      |       |       |       |   X   |       |       |
|  Large|      |       |       |       |       |   X   |       |
|    Van|      |       |       |       |       |       |   X   |
+-------+      +-------+-------+-------+-------+-------+-------+

Each categorical level becomes a column.

SLIDE 4

Dummy variables: binary encoding

+-------+      +-------+-------+-------+-------+-------+-------+
|   type|      |Midsize| Small |Compact| Sporty| Large |  Van  |
+-------+      +-------+-------+-------+-------+-------+-------+
|Midsize|      |   1   |   0   |   0   |   0   |   0   |   0   |
|  Small|      |   0   |   1   |   0   |   0   |   0   |   0   |
|Compact| ===> |   0   |   0   |   1   |   0   |   0   |   0   |
| Sporty|      |   0   |   0   |   0   |   1   |   0   |   0   |
|  Large|      |   0   |   0   |   0   |   0   |   1   |   0   |
|    Van|      |   0   |   0   |   0   |   0   |   0   |   1   |
+-------+      +-------+-------+-------+-------+-------+-------+

Binary values indicate the presence (1) or absence (0) of the corresponding level.
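The level-to-column mapping above can be sketched in plain Python (an illustration only — Spark's own encoder is used in the slides that follow):

```python
levels = ['Midsize', 'Small', 'Compact', 'Sporty', 'Large', 'Van']

def one_hot(value, levels):
    """Return a binary vector with a 1 in the position of `value`."""
    return [1 if level == value else 0 for level in levels]

print(one_hot('Compact', levels))  # [0, 0, 1, 0, 0, 0]
```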

SLIDE 5

Dummy variables: sparse representation

+-------+      +-------+-------+-------+-------+-------+-------+      +------+-----+
|   type|      |Midsize| Small |Compact| Sporty| Large |  Van  |      |Column|Value|
+-------+      +-------+-------+-------+-------+-------+-------+      +------+-----+
|Midsize|      |   1   |   0   |   0   |   0   |   0   |   0   |      |     0|    1|
|  Small|      |   0   |   1   |   0   |   0   |   0   |   0   |      |     1|    1|
|Compact| ===> |   0   |   0   |   1   |   0   |   0   |   0   | ===> |     2|    1|
| Sporty|      |   0   |   0   |   0   |   1   |   0   |   0   |      |     3|    1|
|  Large|      |   0   |   0   |   0   |   0   |   1   |   0   |      |     4|    1|
|    Van|      |   0   |   0   |   0   |   0   |   0   |   1   |      |     5|    1|
+-------+      +-------+-------+-------+-------+-------+-------+      +------+-----+

Sparse representation: store column index and value.
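A minimal sketch of the sparse idea in plain Python (not Spark's SparseVector implementation — just the "keep the non-zero entries" principle):

```python
def to_sparse(dense):
    """Keep only the (column index, value) pairs for non-zero entries."""
    return [(i, v) for i, v in enumerate(dense) if v != 0]

print(to_sparse([0, 0, 1, 0, 0, 0]))  # [(2, 1)]
```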

SLIDE 6

Dummy variables: redundant column

+-------+      +-------+-------+-------+-------+-------+      +------+-----+
|   type|      |Midsize| Small |Compact| Sporty| Large |      |Column|Value|
+-------+      +-------+-------+-------+-------+-------+      +------+-----+
|Midsize|      |   1   |   0   |   0   |   0   |   0   |      |     0|    1|
|  Small|      |   0   |   1   |   0   |   0   |   0   |      |     1|    1|
|Compact| ===> |   0   |   0   |   1   |   0   |   0   | ===> |     2|    1|
| Sporty|      |   0   |   0   |   0   |   1   |   0   |      |     3|    1|
|  Large|      |   0   |   0   |   0   |   0   |   1   |      |     4|    1|
|    Van|      |   0   |   0   |   0   |   0   |   0   |      |      |     |
+-------+      +-------+-------+-------+-------+-------+      +------+-----+

Levels are mutually exclusive, so drop one.

SLIDE 7

One-hot encoding

from pyspark.ml.feature import OneHotEncoderEstimator

onehot = OneHotEncoderEstimator(inputCols=['type_idx'], outputCols=['type_dummy'])

(In Spark 3.x this class was renamed to OneHotEncoder.)

Fit the encoder to the data.

onehot = onehot.fit(cars)

# How many category levels?

onehot.categorySizes

[6]

SLIDE 8

One-hot encoding

cars = onehot.transform(cars)
cars.select('type', 'type_idx', 'type_dummy').distinct().sort('type_idx').show()

+-------+--------+-------------+
|   type|type_idx|   type_dummy|
+-------+--------+-------------+
|Midsize|     0.0|(5,[0],[1.0])|
|  Small|     1.0|(5,[1],[1.0])|
|Compact|     2.0|(5,[2],[1.0])|
| Sporty|     3.0|(5,[3],[1.0])|
|  Large|     4.0|(5,[4],[1.0])|
|    Van|     5.0|    (5,[],[])|
+-------+--------+-------------+

SLIDE 9

Dense versus sparse

from pyspark.mllib.linalg import DenseVector, SparseVector

Store this vector: [1, 0, 0, 0, 0, 7, 0, 0].

DenseVector([1, 0, 0, 0, 0, 7, 0, 0])

DenseVector([1.0, 0.0, 0.0, 0.0, 0.0, 7.0, 0.0, 0.0])

SparseVector(8, [0, 5], [1, 7])

SparseVector(8, {0: 1.0, 5: 7.0})
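The relationship between the two representations can be checked in plain Python — a sketch of exactly what a sparse vector stores (its size, plus the non-zero indices and values):

```python
dense = [1.0, 0.0, 0.0, 0.0, 0.0, 7.0, 0.0, 0.0]

# Sparse form: overall length plus the non-zero indices and their values.
size = len(dense)
indices = [i for i, v in enumerate(dense) if v != 0.0]
values = [v for v in dense if v != 0.0]

print(size, indices, values)  # 8 [0, 5] [1.0, 7.0]
```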

SLIDE 10

One-Hot Encode categoricals

SLIDE 11

Regression

Andrew Collier

Data Scientist, Exegetic Analytics

SLIDE 12

Consumption versus mass: scatter

SLIDE 13

Consumption versus mass: fit

SLIDE 14

Consumption versus mass: alternative fits

SLIDE 15

Consumption versus mass: residuals

SLIDE 16

Loss function

MSE = "Mean Squared Error"

SLIDE 17

Loss function: Observed values

yᵢ — observed values

SLIDE 18

Loss function: Model values

yᵢ — observed values

ŷᵢ — model values

SLIDE 19

Loss function: Mean

yᵢ — observed values

ŷᵢ — model values

MSE = (1/N) Σᵢ (yᵢ − ŷᵢ)²
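In plain Python, the MSE calculation is a one-liner. The y and y_hat values below are made up for illustration:

```python
# Made-up observed (y) and model (y_hat) values.
y     = [9.05, 6.53, 7.84]
y_hat = [8.90, 6.80, 7.50]

# Mean of the squared differences between observed and model values.
mse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / len(y)
print(round(mse, 4))  # 0.0703
```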

SLIDE 20

Assemble predictors

Predict consumption using mass, cyl and type_dummy. Consolidate the predictors into a single column.

+------+---+-------------+----------------------------+-----------+
|mass  |cyl|type_dummy   |features                    |consumption|
+------+---+-------------+----------------------------+-----------+
|1451.0|6  |(5,[0],[1.0])|(7,[0,1,2],[1451.0,6.0,1.0])|9.05       |
|1129.0|4  |(5,[2],[1.0])|(7,[0,1,4],[1129.0,4.0,1.0])|6.53       |
|1399.0|4  |(5,[2],[1.0])|(7,[0,1,4],[1399.0,4.0,1.0])|7.84       |
|1147.0|4  |(5,[1],[1.0])|(7,[0,1,3],[1147.0,4.0,1.0])|7.84       |
|1111.0|4  |(5,[3],[1.0])|(7,[0,1,5],[1111.0,4.0,1.0])|9.05       |
+------+---+-------------+----------------------------+-----------+

SLIDE 21

Build regression model

from pyspark.ml.regression import LinearRegression

regression = LinearRegression(labelCol='consumption')

Fit to cars_train (training data).

regression = regression.fit(cars_train)

Predict on cars_test (testing data).

predictions = regression.transform(cars_test)

SLIDE 22

Examine predictions

+-----------+------------------+
|consumption|prediction        |
+-----------+------------------+
|7.84       |8.92699470743403  |
|9.41       |9.379295891451353 |
|8.11       |7.23487264538364  |
|9.05       |9.409860194333735 |
|7.84       |7.059190923328711 |
|7.84       |7.785909738591766 |
|7.59       |8.129959405168547 |
|5.11       |6.836843743852942 |
|8.11       |7.17173702652015  |
+-----------+------------------+

SLIDE 23

Calculate RMSE

from pyspark.ml.evaluation import RegressionEvaluator

# Find RMSE (Root Mean Squared Error)
RegressionEvaluator(labelCol='consumption').evaluate(predictions)

0.708699086182001

A RegressionEvaluator can also calculate the following metrics:

mae (Mean Absolute Error)
r2 (R²)
mse (Mean Squared Error).
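These metrics can also be reproduced by hand. The sketch below uses made-up values and plain Python lists rather than a Spark DataFrame:

```python
import math

# Made-up observed (y) and predicted (y_hat) values.
y     = [7.84, 9.41, 8.11]
y_hat = [8.00, 9.30, 8.00]

errors = [yi - yh for yi, yh in zip(y, y_hat)]

mse  = sum(e ** 2 for e in errors) / len(errors)   # Mean Squared Error
rmse = math.sqrt(mse)                              # Root Mean Squared Error
mae  = sum(abs(e) for e in errors) / len(errors)   # Mean Absolute Error

# R²: 1 minus (residual sum of squares / total sum of squares).
y_mean = sum(y) / len(y)
ss_tot = sum((yi - y_mean) ** 2 for yi in y)
ss_res = sum(e ** 2 for e in errors)
r2 = 1 - ss_res / ss_tot
```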

SLIDE 24

Consumption versus mass: intercept

SLIDE 25

Examine intercept

regression.intercept

4.9450616833727095

This is the fuel consumption in the (hypothetical) case that:

mass = 0, cyl = 0, and vehicle type is 'Van'.

SLIDE 26

Consumption versus mass: slope

SLIDE 27

Examine Coefficients

regression.coefficients

DenseVector([0.0027, 0.1897, -1.309, -1.7933, -1.3594, -1.2917, -1.9693])

mass      0.0027
cyl       0.1897
Midsize  -1.3090
Small    -1.7933
Compact  -1.3594
Sporty   -1.2917
Large    -1.9693
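Putting the intercept from the previous slide together with these coefficients, a prediction is just a weighted sum. A sketch for a hypothetical Midsize car (mass 1451 kg, 6 cylinders), using the rounded coefficients above:

```python
intercept = 4.9450616833727095
coef = {'mass': 0.0027, 'cyl': 0.1897, 'Midsize': -1.3090}

# Prediction = intercept + sum of coefficient * predictor value.
# The Midsize dummy is 1, all other type dummies are 0.
prediction = intercept + coef['mass'] * 1451 + coef['cyl'] * 6 + coef['Midsize']
print(round(prediction, 2))  # 8.69
```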

SLIDE 28

Regression for numeric predictions

SLIDE 29

Bucketing & Engineering

Andrew Collier

Data Scientist, Exegetic Analytics

SLIDE 30

Bucketing

SLIDE 31

Bucketing heights

+------+
|height|
+------+
|  1.42|
|  1.45|
|  1.47|
|  1.50|
|  1.52|
|  1.57|
|  1.60|
|  1.75|
|  1.85|
|  1.88|
+------+



SLIDE 34

Bucketing heights

+------+----------+
|height|height_bin|
+------+----------+
|  1.42|     short|
|  1.45|     short|
|  1.47|     short|
|  1.50|     short|
|  1.52|   average|
|  1.57|   average|
|  1.60|   average|
|  1.75|   average|
|  1.85|      tall|
|  1.88|      tall|
+------+----------+

SLIDE 35

RPM histogram

Car RPM has "natural" breaks:

RPM < 4500 — low
RPM > 6000 — high
otherwise — medium.
SLIDE 36

RPM buckets

from pyspark.ml.feature import Bucketizer

bucketizer = Bucketizer(splits=[3500, 4500, 6000, 6500],
                        inputCol='rpm',
                        outputCol='rpm_bin')

Apply buckets to rpm column.

cars = bucketizer.transform(cars)
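What Bucketizer does to a single value can be sketched with the standard library's bisect. This is a simplification — Spark has its own rules for values outside the splits — so it only mirrors the behaviour for interior values:

```python
from bisect import bisect_right

splits = [3500, 4500, 6000, 6500]

def bucketize(value, splits):
    """Return the bucket index for value; bucket i covers [splits[i], splits[i+1])."""
    return float(bisect_right(splits, value) - 1)

print([bucketize(r, splits) for r in [3800, 4500, 5750, 6200]])  # [0.0, 1.0, 1.0, 2.0]
```

The output matches the rpm_bin column on the next slide.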

SLIDE 37

RPM buckets

cars.select('rpm', 'rpm_bin').show(5)

+----+-------+
| rpm|rpm_bin|
+----+-------+
|3800|    0.0|
|4500|    1.0|
|5750|    1.0|
|5300|    1.0|
|6200|    2.0|
+----+-------+

cars.groupBy('rpm_bin').count().show()

+-------+-----+
|rpm_bin|count|
+-------+-----+
|    0.0|    8|  <- low
|    1.0|   67|  <- medium
|    2.0|   17|  <- high
+-------+-----+

SLIDE 38

One-hot encoded RPM buckets

The RPM buckets are one-hot encoded to dummy variables.

+-------+-------------+
|rpm_bin|    rpm_dummy|
+-------+-------------+
|    0.0|(2,[0],[1.0])|  <- low
|    1.0|(2,[1],[1.0])|  <- medium
|    2.0|    (2,[],[])|  <- high
+-------+-------------+

The 'high' RPM bucket is the reference level and doesn't get a dummy variable.

SLIDE 39

Model with bucketed RPM

regression.coefficients

DenseVector([1.3814, 0.1433])

regression.intercept

8.1835

+-------+-------------+
|rpm_bin|    rpm_dummy|
+-------+-------------+
|    0.0|(2,[0],[1.0])|  <- low
|    1.0|(2,[1],[1.0])|  <- medium
|    2.0|    (2,[],[])|  <- high
+-------+-------------+

Consumption for 'low' RPM:

8.1835 + 1.3814 = 9.5649

Consumption for 'medium' RPM:

8.1835 + 0.1433 = 8.3268
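These sums can be checked directly; a quick sketch using the intercept and coefficients above:

```python
intercept = 8.1835
coef_low, coef_medium = 1.3814, 0.1433

low    = intercept + coef_low     # dummy = [1, 0]
medium = intercept + coef_medium  # dummy = [0, 1]
high   = intercept                # reference level, dummy = [0, 0]

print(round(low, 4), round(medium, 4), round(high, 4))  # 9.5649 8.3268 8.1835
```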

SLIDE 40

More feature engineering

Operations on a single column:

log(), sqrt(), pow()

Operations on two columns: product, ratio.

SLIDE 41

Mass & Height to BMI

SLIDE 42

Mass & Height to BMI

bmi = mass / height^2

+------+-----+----+
|height| mass| bmi|
+------+-----+----+
|  1.52| 77.1|33.2|
|  1.60| 58.1|22.7|
|  1.57|122.0|49.4|
|  1.75| 95.3|31.0|
|  1.80| 99.8|30.7|
|  1.65| 90.7|33.3|
|  1.60| 70.3|27.5|
|  1.78| 81.6|25.8|
|  1.65| 77.1|28.3|
|  1.78|128.0|40.5|
+------+-----+----+
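The formula can be checked against the table; the helper below is just an illustration:

```python
def bmi(mass, height):
    """Body mass index: mass (kg) divided by height (m) squared."""
    return mass / height ** 2

# Second row of the table: height 1.60 m, mass 58.1 kg.
print(round(bmi(58.1, 1.60), 1))  # 22.7
```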

SLIDE 43

Engineering density

# Linear density
cars = cars.withColumn('density_line', cars.mass / cars.length)
# Area density
cars = cars.withColumn('density_quad', cars.mass / cars.length**2)
# Volume density
cars = cars.withColumn('density_cube', cars.mass / cars.length**3)

+------+------+------------+------------+------------+
|  mass|length|density_line|density_quad|density_cube|
+------+------+------------+------------+------------+
|1451.0| 4.775|303.87434554|63.638606397|13.327456837|
|1129.0| 4.623|244.21371403|52.825808790|11.426737787|
|1399.0| 4.547|307.67539036|67.665579583|14.881367843|
+------+------+------------+------------+------------+

SLIDE 44

Let's engineer some features!

SLIDE 45

Regularization

Andrew Collier

Data Scientist, Exegetic Analytics

SLIDE 46

Features: Only a few

SLIDE 47

Features: Too many

SLIDE 48

Features: Selected

SLIDE 49

Loss function (revisited)

Linear regression aims to minimise the MSE.

SLIDE 50

Loss function with regularization

Linear regression aims to minimise the MSE. Add a regularization term which depends on the coefficients.

SLIDE 51

Regularization term

An extra regularization term is added to the loss function. The regularization term can be either:

Lasso — absolute value of the coefficients
Ridge — square of the coefficients

It's also possible to have a blend of Lasso and Ridge regression. The strength of regularization is determined by the parameter λ:

λ = 0 — no regularization (standard regression)
λ = ∞ — complete regularization (all coefficients zero).
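The two penalties are easy to compute by hand. The sketch below uses made-up coefficients, a made-up MSE, and λ (lam) to show how each penalty enters the loss:

```python
# Made-up values for illustration.
coeffs = [-0.012, 0.174, -0.897]
mse = 0.50
lam = 0.1

l1 = sum(abs(c) for c in coeffs)  # Lasso penalty: sum of absolute values
l2 = sum(c ** 2 for c in coeffs)  # Ridge penalty: sum of squares

loss_lasso = mse + lam * l1
loss_ridge = mse + lam * l2
```

Larger coefficients mean a larger penalty, so minimising the penalised loss pushes coefficients towards zero.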

SLIDE 52

Cars again

from pyspark.ml.feature import VectorAssembler

assembler = VectorAssembler(inputCols=[
    'mass', 'cyl', 'type_dummy',
    'density_line', 'density_quad', 'density_cube'
], outputCol='features')

cars = assembler.transform(cars)

+-----------------------------------------------------------------------------+-----------+
|features                                                                     |consumption|
+-----------------------------------------------------------------------------+-----------+
|[1451.0,6.0,1.0,0.0,0.0,0.0,0.0,303.8743455497,63.63860639785,13.32745683724]|9.05       |
|[1129.0,4.0,0.0,0.0,1.0,0.0,0.0,244.2137140385,52.82580879050,11.42673778726]|6.53       |
|[1399.0,4.0,0.0,0.0,1.0,0.0,0.0,307.6753903672,67.66557958374,14.88136784335]|7.84       |
|[1147.0,4.0,0.0,1.0,0.0,0.0,0.0,264.1031545014,60.81122599620,14.00212433714]|7.84       |
+-----------------------------------------------------------------------------+-----------+

SLIDE 53

Cars: Linear regression

Fit a (standard) Linear Regression model to the training data.

regression = LinearRegression(labelCol='consumption').fit(cars_train)

# RMSE on testing data
0.708699086182001

Examine the coefficients:

regression.coefficients

DenseVector([-0.012, 0.174, -0.897, -1.445, -0.985, -1.071, -1.335, 0.189, -0.780, 1.160])

SLIDE 54

Cars: Ridge regression

# λ = 0.1 | α = 0 -> Ridge
ridge = LinearRegression(labelCol='consumption', elasticNetParam=0, regParam=0.1)
ridge = ridge.fit(cars_train)

# RMSE
0.724535609745491

# Ridge coefficients
DenseVector([0.001, 0.137, -0.395, -0.822, -0.450, -0.582, -0.806, 0.008, 0.029, 0.001])

# Linear Regression coefficients
DenseVector([-0.012, 0.174, -0.897, -1.445, -0.985, -1.071, -1.335, 0.189, -0.780, 1.160])

SLIDE 55

Cars: Lasso regression

# λ = 0.1 | α = 1 -> Lasso
lasso = LinearRegression(labelCol='consumption', elasticNetParam=1, regParam=0.1)
lasso = lasso.fit(cars_train)

# RMSE
0.771988667026998

# Lasso coefficients
DenseVector([0.0, 0.0, 0.0, -0.056, 0.0, 0.0, 0.0, 0.026, 0.0, 0.0])

# Ridge coefficients
DenseVector([0.001, 0.137, -0.395, -0.822, -0.450, -0.582, -0.806, 0.008, 0.029, 0.001])

# Linear Regression coefficients
DenseVector([-0.012, 0.174, -0.897, -1.445, -0.985, -1.071, -1.335, 0.189, -0.780, 1.160])
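Lasso's feature-selection effect is visible by counting non-zero coefficients; the values below are copied from the slides:

```python
lasso_coeffs = [0.0, 0.0, 0.0, -0.056, 0.0, 0.0, 0.0, 0.026, 0.0, 0.0]
ridge_coeffs = [0.001, 0.137, -0.395, -0.822, -0.450, -0.582, -0.806, 0.008, 0.029, 0.001]

def count_nonzero(coeffs):
    """Number of coefficients that were not driven to exactly zero."""
    return sum(1 for c in coeffs if c != 0.0)

print(count_nonzero(lasso_coeffs), count_nonzero(ridge_coeffs))  # 2 10
```

Ridge shrinks coefficients but keeps all ten predictors; Lasso retains only two.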

SLIDE 56

Regularization → simple model