Data distributions: Feature Engineering for Machine Learning in Python



slide-1
SLIDE 1

Data distributions

FEATURE ENGINEERING FOR MACHINE LEARNING IN PYTHON

Robert O'Callaghan

Director of Data Science, Ordergroove

slide-2
SLIDE 2

Distribution assumptions

slide-3
SLIDE 3

Observing your data

    import matplotlib.pyplot as plt

    # Plot a histogram of every numeric column in the DataFrame
    df.hist()
    plt.show()

slide-4
SLIDE 4

Delving deeper with box plots

slide-5
SLIDE 5

Box plots in pandas

    df[['column_1']].boxplot()
    plt.show()
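As a rough illustration of what a box plot summarizes, the sketch below computes the quartiles, the interquartile range (IQR), and the whisker limits by hand with Python's standard library. It assumes the common 1.5 × IQR whisker rule, which is matplotlib's default; the `values` list is made-up data, not taken from the course.

```python
import statistics

# Hypothetical column values; 40 is an obvious extreme point
values = [1, 2, 3, 4, 5, 6, 7, 8, 40]

# Quartiles with linear interpolation between data points
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Assumed whisker rule: points more than 1.5 * IQR beyond the box
# are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(q1, median, q3)  # 3.0 5.0 7.0
print(outliers)        # [40]
```

The box spans `q1` to `q3`, the line inside it marks the median, and anything outside the fences shows up as a separate point.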

slide-6
SLIDE 6

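Pairing distributions

    import seaborn as sns

    # Scatter plot for every pair of numeric columns, with each column's
    # own distribution on the diagonal
    sns.pairplot(df)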

slide-7
SLIDE 7

Further details on your distributions

df.describe()
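For numeric columns, `describe()` reports the count, mean, standard deviation, minimum, quartiles, and maximum. The sketch below reproduces those statistics for a single made-up list using only the standard library; the `values` data and the `summary` dictionary are illustrative names, not part of the course material.

```python
import statistics

# Hypothetical values standing in for one numeric column
values = [2, 4, 4, 5, 7, 9, 11]

# Linear-interpolation quartiles, the same convention pandas uses by default
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")

summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "std": statistics.stdev(values),  # sample standard deviation (n - 1)
    "min": min(values),
    "25%": q1,
    "50%": median,
    "75%": q3,
    "max": max(values),
}

for name, value in summary.items():
    print(f"{name:>5}: {value}")
```

A mean well above the median, or a maximum far beyond the 75% mark, is a quick hint that the column is skewed or contains outliers.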

slide-8
SLIDE 8

Let's practice!

slide-9
SLIDE 9

Scaling and transformations

Robert O'Callaghan

Data Scientist

slide-10
SLIDE 10

Scaling data

slide-11
SLIDE 11

Min-Max scaling

slide-12
SLIDE 12

Min-Max scaling
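Min-Max scaling maps each value to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. A minimal hand-rolled sketch follows; `min_max_scale` and the `ages` data are hypothetical, and returning zeros for a constant column is a choice made for this example.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Constant column: no spread to rescale (a choice made for this sketch)
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


ages = [18, 25, 40, 62]  # made-up ages
scaled = min_max_scale(ages)
print(scaled[0], scaled[2], scaled[-1])  # 0.0 0.5 1.0
```

Because the result depends only on the minimum and maximum, a single extreme value can squeeze every other point into a narrow band.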

slide-13
SLIDE 13

Min-Max scaling in Python

    from sklearn.preprocessing import MinMaxScaler

    # Learn the column's minimum and maximum, then rescale it to [0, 1]
    scaler = MinMaxScaler()
    scaler.fit(df[['Age']])
    df['normalized_age'] = scaler.transform(df[['Age']])

slide-14
SLIDE 14

Standardization
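Standardization subtracts the mean and divides by the standard deviation, giving a column with mean 0 and standard deviation 1. The sketch below does this by hand; `standardize` and the `ages` data are hypothetical. It uses the population standard deviation, which is what scikit-learn's `StandardScaler` uses, whereas pandas' `.std()` defaults to the sample version.

```python
import statistics


def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (ddof=0)
    if std == 0:
        # Constant column: every value equals the mean
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


ages = [20, 30, 40, 50, 60]  # made-up ages
z_scores = standardize(ages)
print([round(z, 3) for z in z_scores])
```

Unlike Min-Max scaling, the output is not bounded to a fixed range; values far from the mean simply receive large positive or negative scores.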

slide-15
SLIDE 15

Standardization in Python

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    scaler.fit(df[['Age']])
    df['standardized_col'] = scaler.transform(df[['Age']])

slide-16
SLIDE 16

Log Transformation
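A log transformation compresses large values much more than small ones, which pulls in the long right tail of a skewed column. The sketch below applies a plain natural log to made-up salaries; it is meant only to show the effect and is not the same operation as scikit-learn's `PowerTransformer`. It assumes all values are strictly positive.

```python
import math

# Hypothetical right-skewed salaries: one value dwarfs the rest
salaries = [30_000, 45_000, 60_000, 90_000, 1_000_000]

# Natural log of each value (requires strictly positive inputs)
log_salaries = [math.log(s) for s in salaries]

raw_spread = max(salaries) - min(salaries)
log_spread = max(log_salaries) - min(log_salaries)

print(raw_spread)            # 970000
print(round(log_spread, 2))
```

The ordering of the values is preserved, but the extreme salary now sits only a few units from the rest instead of hundreds of thousands away.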

slide-17
SLIDE 17

Log transformation in Python

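    from sklearn.preprocessing import PowerTransformer

    # PowerTransformer defaults to the Yeo-Johnson power transform, which
    # reduces skew much like a log transform and also standardizes the output
    log = PowerTransformer()
    log.fit(df[['ConvertedSalary']])
    df['log_ConvertedSalary'] = log.transform(df[['ConvertedSalary']])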

slide-18
SLIDE 18

Final Slide

slide-19
SLIDE 19

Removing outliers

Robert O'Callaghan

Director of Data Science, Ordergroove

slide-20
SLIDE 20

What are outliers?

slide-21
SLIDE 21

Quantile based detection

slide-22
SLIDE 22

Quantiles in Python

    q_cutoff = df['col_name'].quantile(0.95)
    mask = df['col_name'] < q_cutoff
    trimmed_df = df[mask]
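To see the quantile cutoff on concrete numbers, here is a standard-library sketch that trims everything at or above the 95th percentile of the integers 1 to 100. The data is made up, and `statistics.quantiles` with `method="inclusive"` is used because it interpolates linearly, as pandas' `quantile` does by default.

```python
import statistics

values = list(range(1, 101))  # hypothetical column: 1, 2, ..., 100

# 99 cut points for n=100; index 94 is the 95th percentile
percentiles = statistics.quantiles(values, n=100, method="inclusive")
q_cutoff = percentiles[94]

# Keep only the values strictly below the cutoff
trimmed = [v for v in values if v < q_cutoff]

print(len(values), len(trimmed))  # 100 95
```

This approach always removes a fixed share of the data (here the top 5%), whether or not those points are genuinely unusual.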

slide-23
SLIDE 23

Standard deviation based detection

slide-24
SLIDE 24

Standard deviation detection in Python

    mean = df['col_name'].mean()
    std = df['col_name'].std()
    cut_off = std * 3
    lower, upper = mean - cut_off, mean + cut_off
    new_df = df[(df['col_name'] < upper) & (df['col_name'] > lower)]
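The same three-standard-deviation rule can be checked on a small made-up list with the standard library. This sketch uses the sample standard deviation, matching pandas' default for `.std()`; the `values` data is hypothetical.

```python
import statistics

# Hypothetical column: 19 values near 11, plus one extreme value
values = [10, 12, 11, 13, 9, 10, 12, 11, 10, 12,
          11, 9, 13, 10, 11, 12, 10, 11, 12, 95]

mean = statistics.fmean(values)
std = statistics.stdev(values)  # sample standard deviation (n - 1)
cut_off = std * 3
lower, upper = mean - cut_off, mean + cut_off

# Keep only the values strictly inside the three-sigma band
kept = [v for v in values if lower < v < upper]

print(len(values), len(kept))  # 20 19
```

Note that the outlier itself inflates both the mean and the standard deviation, so with very small samples a single extreme point may not fall outside three standard deviations at all.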

slide-25
SLIDE 25

Let's practice!

slide-26
SLIDE 26

Scaling and transforming new data

Robert O'Callaghan

Director of Data Science, Ordergroove

slide-27
SLIDE 27

Reuse training scalers

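    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Fit the scaler on the training data only
    scaler = StandardScaler()
    scaler.fit(train[['col']])
    train['scaled_col'] = scaler.transform(train[['col']])

    # FIT SOME MODEL
    # ....

    # Reuse the already-fitted scaler on the test data; do not refit it
    test = pd.read_csv('test_csv')
    test['scaled_col'] = scaler.transform(test[['col']])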

slide-28
SLIDE 28

Training transformations for reuse

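    # Compute the outlier thresholds from the training data only.
    # Selecting with train['col'] (a Series) yields scalar statistics, so the
    # boolean masks below filter rows instead of blanking cells with NaN.
    train_mean = train['col'].mean()
    train_std = train['col'].std()
    cut_off = train_std * 3
    train_lower = train_mean - cut_off
    train_upper = train_mean + cut_off

    # Subset train data
    train = train[(train['col'] < train_upper) & (train['col'] > train_lower)]

    # Subset test data using the same training thresholds
    test = pd.read_csv('test_csv')
    test = test[(test['col'] < train_upper) & (test['col'] > train_lower)]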
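The idea of reusing training thresholds can be sketched without pandas. In the example below the `train` and `test` lists are made up; the bounds are computed from `train` alone and then applied unchanged to `test`.

```python
import statistics

train = [10, 12, 11, 13, 9, 10, 12, 11]  # hypothetical training column
test = [11, 50, 9, 12]                   # hypothetical unseen data

# Thresholds come from the training data only
train_mean = statistics.fmean(train)
train_std = statistics.stdev(train)
train_lower = train_mean - 3 * train_std
train_upper = train_mean + 3 * train_std

# Apply the training thresholds to the test data without recomputing them
test_kept = [v for v in test if train_lower < v < train_upper]

print(test_kept)  # [11, 9, 12]
```

Had the bounds been recomputed from `test`, the value 50 would have widened them and survived the filter.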

slide-29
SLIDE 29

Why only use training data?

Data leakage: Using data that you won't have access to when assessing the performance of your model
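One way to see the leak is to compare scaling statistics fitted on the training data alone with statistics fitted on training and test data together. The sketch below uses made-up numbers and plain standardization; the variable names are illustrative.

```python
import statistics

train = [10.0, 12.0, 14.0, 16.0, 18.0]  # hypothetical training column
test = [11.0, 15.0, 40.0]               # hypothetical held-out data

# Correct: statistics learned from the training data only
train_mean = statistics.fmean(train)
train_std = statistics.pstdev(train)
test_scaled = [(v - train_mean) / train_std for v in test]

# Leaky: statistics computed with the test data included
leaky_mean = statistics.fmean(train + test)
leaky_std = statistics.pstdev(train + test)
test_scaled_leaky = [(v - leaky_mean) / leaky_std for v in test]

print(train_mean, leaky_mean)  # 14.0 17.0
```

The leaky version shifts every scaled value because information from the test set has entered the preprocessing, which tends to make evaluation results look better than they would on truly unseen data.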

slide-30
SLIDE 30

Avoid data leakage!
