CSE217 INTRODUCTION TO DATA SCIENCE
LECTURE 10: DATA ENGINEERING
Spring 2019
Marion Neumann
RECAP: FEATURE ENGINEERING
6 Good Reasons for Feature Engineering:
1) get better represented features (scaling, standardization)
→ improve model training and prediction quality
2) get more expressive features
→ improve model training and prediction quality → ACTIVITY 1
3) get more representative features
→ remove noise
4) get fewer features
→ more efficient computation
5) represent features in 2d or 3d
→ visualization
6) bring data into vector representation
→ not all data necessarily comes in vector form: x⃗ = (x₁, …, x_d)ᵀ
What data does not come in vector form?
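Reason 1 above (standardization) can be sketched in a few lines of plain Python; the input numbers here are made up, and the sample standard deviation (n − 1 in the denominator) is used:

```python
import math

def standardize(xs):
    """Rescale a feature to zero mean and unit variance (z-scores)."""
    mean = sum(xs) / len(xs)
    # sample standard deviation: divide by n - 1
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))
    return [(x - mean) / sd for x in xs]

print(standardize([2.0, 4.0, 6.0]))  # [-1.0, 0.0, 1.0]
```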
COMPLEX DATA
- data that does not come as numerical vectors
→ requires feature extraction
Example: images
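For images, the simplest possible feature extraction is flattening the pixel grid into one numerical vector; a tiny sketch with made-up pixel values:

```python
# A tiny 3x3 grayscale "image" as nested lists (made-up pixel values)
image = [
    [0, 255, 0],
    [255, 0, 255],
    [0, 255, 0],
]

# Flatten the 2d pixel grid into a single feature vector
features = [pixel for row in image for pixel in row]
print(features)  # [0, 255, 0, 255, 0, 255, 0, 255, 0]
```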
COMPLEX DATA
Example: text

"Same great flavor and friendly service as in the S 18th street location. This location is not as small but it's hard to talk to friends. Thankfully there is great outdoor seating to escape the noise."
→ word counts over a vocabulary (great, small, location, friends, …)
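The word list shown with the review suggests a bag-of-words representation; a minimal sketch, assuming a small hand-picked vocabulary and very naive tokenization:

```python
# Bag-of-words sketch: turn the review into a count vector over a
# small, hand-picked vocabulary (the words shown on the slide).
review = ("Same great flavor and friendly service as in the S 18th street "
          "location. This location is not as small but it's hard to talk to "
          "friends. Thankfully there is great outdoor seating to escape the noise.")

vocabulary = ["great", "small", "location", "friends"]

# naive tokenization: lowercase, strip periods, split on whitespace
tokens = review.lower().replace(".", "").split()
features = [tokens.count(word) for word in vocabulary]
print(dict(zip(vocabulary, features)))  # {'great': 2, 'small': 1, 'location': 2, 'friends': 1}
```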
COMPLEX DATA
Example: street networks
COMPLEX DATA
Example: chemical compounds
[figure: mutagenic vs. non-mutagenic compounds, with an unknown compound to classify (???)]
COMPLEX DATA
Example: point clouds
COMPLEX DATA
Example: social networks
FEATURE EXTRACTION
- feature extraction is challenging
  - which features to compute and use?
  - too many choices and combinations
- NNs can be used to learn features
  - sometimes those features are even interpretable
  - works well for images and text
FEATURE EXTRACTION VS LEARNING

"Same great flavor and friendly service as in the S 18th street location. This location is not as small but it's hard to talk to friends. Thankfully there is great outdoor seating to escape the noise."
[figure: the same review represented by hand-extracted features vs. features learned by a NN]

FEATURE EXTRACTION VS LEARNING
- Train a NN on a very large corpus (e.g. classifying politics / sports / culture)
- Use the pre-trained NN on new (small) data to extract learned features, e.g. z₁ = 0.23, z₂ = 3.42, z₃ = 0.89
- Train a separate simple ML model on these features for the new task (positive / negative):
  - kNN
  - random forest
  - …
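A minimal sketch of this pipeline: `pretrained_features` below is a hypothetical stand-in for a real pre-trained NN (it just computes toy features), and the small labelled dataset is made up:

```python
def pretrained_features(text):
    """Stand-in for a pre-trained NN: maps text to a feature vector.
    A real pipeline would run the text through a trained network."""
    return [len(text) / 100.0, float(text.count("great")), float(text.count("bad"))]

# small labelled dataset for the NEW task (positive / negative)
train_texts = ["great food, great staff", "bad service", "great view"]
train_labels = ["positive", "negative", "positive"]
train_X = [pretrained_features(t) for t in train_texts]

def knn_predict(x, X, y):
    """1-nearest-neighbour on the extracted features (squared distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in X]
    return y[dists.index(min(dists))]

print(knn_predict(pretrained_features("great coffee"), train_X, train_labels))
```

The design point is the division of labour: the expensive model is trained once on a big corpus, while the cheap model (kNN, random forest, …) is retrained per task on its features.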
DATA ENGINEERING
- instead of removing or creating features,
  create or remove data points!
- Why could that be useful?
  - remove outliers
  - create more training examples → data augmentation
CAUSES OF OUTLIERS
- data entry errors (human errors)
- measurement errors (instrument errors)
- experimental errors
  - data extraction or experiment planning/executing errors
- intentional
  - dummy outliers made to test detection methods
- data processing errors
  - data manipulation or unintended data set mutations
- sampling errors
  - extracting or mixing data from wrong or various sources
- natural → not an error, novelties in data
OUTLIER DETECTION
Z-Score Analysis (parametric approach → makes the parametric assumption that the features are normally distributed)
- standardize your data:
  z_i = (x_i − x̄) / s
  with sample mean x̄ = (x_1 + x_2 + ⋯ + x_n) / n
  and sample standard deviation s = sqrt( Σ_{i=1..n} (x_i − x̄)² / (n − 1) )
- empirical rule: for normally distributed data, about 68% / 95% / 99.7% of the samples fall within 1 / 2 / 3 standard deviations of the mean
→ samples with very large |z| (e.g. |z| > 3) are flagged as outliers
[figure: samples on a number line, with extreme z-scores flagged as outliers]
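The z-score analysis above can be sketched directly from the formulas; the data and the threshold here are made up:

```python
import math

def zscore_outliers(xs, threshold=3.0):
    """Flag samples whose |z-score| exceeds the threshold."""
    n = len(xs)
    mean = sum(xs) / n                                   # sample mean
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))  # sample std
    zs = [(x - mean) / s for x in xs]                    # standardize
    return [x for x, z in zip(xs, zs) if abs(z) > threshold]

data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 95]
print(zscore_outliers(data, threshold=2.0))  # [95]
```

Note the caveat from the slide: this is a parametric method, so it is only reliable when the feature is roughly normally distributed.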
DATA AUGMENTATION
- easy for images
→ perform image transformations
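A sketch of two simple label-preserving image transformations, with nested lists of made-up pixel values standing in for real image data:

```python
# A tiny 2x3 grayscale "image" (made-up pixel values)
image = [
    [10, 20, 30],
    [40, 50, 60],
]

def flip_horizontal(img):
    """Mirror each row: preserves the label for most image tasks."""
    return [row[::-1] for row in img]

def flip_vertical(img):
    """Reverse the row order."""
    return img[::-1]

# each transformation yields an extra training example with the same label
augmented = [image, flip_horizontal(image), flip_vertical(image)]
print(flip_horizontal(image))  # [[30, 20, 10], [60, 50, 40]]
```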
DATA AUGMENTATION
Why? → NNs need a ton of training data
- NN performance is very sensitive to adversarially created noise
- collecting (and labelling) data is very time consuming/expensive

[figure: adversarial example: an image x classified as "panda" w/ 57.7% confidence, plus 0.007 · sign(∇ₓ J(θ, x, y)) (the perturbation alone is classified as "nematode" w/ 8.2% confidence), gives x + ε · sign(∇ₓ J(θ, x, y)), classified as "gibbon" w/ 99.3% confidence]
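The perturbation in the figure, x + ε · sign(∇ₓ J(θ, x, y)), can be sketched in a few lines; the "gradient" below is made up, since a real attack would use the network's actual gradient of the loss w.r.t. the input:

```python
def sign(v):
    """Sign of a number: -1, 0, or +1."""
    return (v > 0) - (v < 0)

def fgsm_perturb(x, grad, eps=0.007):
    """Fast gradient sign perturbation: x + eps * sign(grad_x J)."""
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

x = [0.2, 0.5, 0.9]      # a tiny "image" of 3 pixel intensities (made up)
grad = [0.3, -1.2, 0.0]  # hypothetical gradient of the loss w.r.t. x
print(fgsm_perturb(x, grad))
```

Each pixel moves by at most ε, so the perturbed input looks unchanged to a human while the loss increases as fast as possible.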
SUMMARY & READING
- More expressive features (even simple transformations) can greatly improve the training time and model quality.
- Feature learning is useful to deal with non-vectorial input data.
- Data engineering can improve supervised models by removing outliers or by data augmentation.
- Neural Networks are tricky to train and very sensitive to noise.

Reading:
- [DSFS] Ch18: Neural Networks (p213-218)
- http://playground.tensorflow.org