CSE217 INTRODUCTION TO DATA SCIENCE LECTURE 10: DATA ENGINEERING - PowerPoint PPT Presentation



SLIDE 1

CSE217 INTRODUCTION TO DATA SCIENCE

LECTURE 10: DATA ENGINEERING

Spring 2019, Marion Neumann

SLIDE 2

RECAP: FEATURE ENGINEERING


6 Good Reasons for Feature Engineering:

1) get better represented features (scaling, standardization)
   → improve model training and prediction quality
2) get more expressive features
   → improve model training and prediction quality → ACTIVITY 1
3) get more representative features
   → remove noise
4) get fewer features
   → more efficient computation
5) represent features in 2D or 3D
   → visualization
6) bring data into vector representation
   → not all data necessarily comes in vector form

x⃗ = (x₁, …, x_d)ᵀ

What data does not come in vector form?

slide-3
SLIDE 3
COMPLEX DATA

  • data that does not come as numerical vectors
    → requires feature extraction

Example: images

slide-4
SLIDE 4
COMPLEX DATA

  • data that does not come as numerical vectors
    → requires feature extraction

Example: text

vocabulary: great, small, location, friends, …

"Same great flavor and friendly service as in the S 18th street location. This location is not as small but it's hard to talk to friends. Thankfully there is great outdoor seating to escape the noise."
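One common way to turn such a review into a numerical vector is a bag-of-words count over a fixed vocabulary. A minimal sketch (the four-word vocabulary is illustrative):

```python
from collections import Counter
import re

def bag_of_words(text, vocabulary):
    """Map free text to a count vector over a fixed vocabulary."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vocab = ["great", "small", "location", "friends"]
review = ("Same great flavor and friendly service as in the S 18th street "
          "location. This location is not as small but it's hard to talk to "
          "friends. Thankfully there is great outdoor seating to escape the noise.")
vector = bag_of_words(review, vocab)  # one count per vocabulary word
```

The resulting vector ignores word order entirely; that loss of structure is exactly why richer, learned text features (next slides) can help.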

SLIDE 5
COMPLEX DATA

  • data that does not come as numerical vectors
    → requires feature extraction

Example: street networks

SLIDE 6
COMPLEX DATA

  • data that does not come as numerical vectors
    → requires feature extraction

Example: chemical compounds (labeled mutagenic / non-mutagenic; predict the label of a new compound: ???)

SLIDE 7
COMPLEX DATA

  • data that does not come as numerical vectors
    → requires feature extraction

Example: point clouds

SLIDE 8
COMPLEX DATA

  • data that does not come as numerical vectors
    → requires feature extraction

Example: social networks

SLIDE 9

FEATURE EXTRACTION

  • feature extraction is challenging
      • which features to compute and use?
      • too many choices and combinations
  • NNs can be used to learn features
      • sometimes those features are even interpretable
      • works well for images and text

SLIDE 10

FEATURE EXTRACTION VS LEARNING


SLIDE 11

FEATURE EXTRACTION VS LEARNING

[figure: the same review processed by two pipelines (hand-crafted feature extraction vs. learned features), each scoring the topics politics, sports, culture]

"Same great flavor and friendly service as in the S 18th street location. This location is not as small but it's hard to talk to friends. Thankfully there is great outdoor seating to escape the noise."

SLIDE 12

FEATURE EXTRACTION VS LEARNING

[figure: transfer-learning pipeline]

  • Train a NN on a very large corpus (labels: politics, sports, culture)
  • Use the pre-trained NN as a feature extractor on new (small) data
    → e.g. x₁ = 0.23, x₂ = 3.42, x₃ = 0.89
  • Train a separate simple ML model on these features for the new task (positive / negative):
      • kNN
      • random forest
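The pipeline above can be sketched end to end. Here `extract_features` is a hypothetical stand-in for the pre-trained NN (a real one would be, e.g., a sentence encoder); the downstream "simple ML model" is a hand-rolled 1-nearest-neighbor classifier:

```python
import math

def extract_features(text):
    """Stand-in for a pre-trained NN feature extractor (hypothetical):
    cheap surface statistics playing the role of the learned vector."""
    words = text.split()
    return [len(words),                                   # token count
            sum(len(w) for w in words) / max(len(words), 1),  # mean word length
            text.count("!")]                              # exclamation count

def nearest_neighbor_predict(train_texts, train_labels, query_text):
    """1-NN on top of the extracted features: the 'simple ML model'."""
    q = extract_features(query_text)
    best = min(range(len(train_texts)),
               key=lambda i: math.dist(extract_features(train_texts[i]), q))
    return train_labels[best]

train = ["loved it great food", "terrible service never again!"]
labels = ["positive", "negative"]
pred = nearest_neighbor_predict(train, labels, "really great food here")
```

The point of the design is the split: the expensive feature learning happens once on big data, while the cheap model on top can be retrained quickly on each new small dataset.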

SLIDE 13

DATA ENGINEERING

  • instead of removing or creating features, create or remove data points!
  • Why could that be useful?
      • remove outliers
      • create more training examples → data augmentation

SLIDE 14

CAUSES OF OUTLIERS

  • data entry errors (human errors)
  • measurement errors (instrument errors)
  • experimental errors
      • data extraction errors
      • experiment planning/execution errors
  • intentional
      • dummy outliers made to test detection methods
  • data processing errors
      • data manipulation
      • unintended data set mutations
  • sampling errors
      • extracting or mixing data from wrong or various sources
  • natural → not an error, but novelties in the data

SLIDE 15

OUTLIER DETECTION

  • standardize your data
  • empirical rule: for normally distributed data, ~99.7% of values lie within 3 standard deviations of the mean, so samples with |zᵢ| > 3 are flagged as outliers

Z-Score Analysis (parametric approach → makes the parametric assumption that the features are normally distributed)

samples: 7, 8, 48, 55, 68

x̄ = (x₁ + x₂ + … + x_n) / n          (sample mean)
s = √( Σᵢ (xᵢ − x̄)² / (n − 1) )      (sample standard deviation)
zᵢ = (xᵢ − x̄) / s                     (z-score)
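A minimal implementation of z-score outlier detection with the sample mean and sample standard deviation:

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values whose |z-score| exceeds the threshold (empirical rule)."""
    mu = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation, divides by n - 1
    return [v for v in values if abs((v - mu) / s) > threshold]

# The five samples from the slide: their spread is large relative to n,
# so none of them reaches |z| > 3.
flagged_small = zscore_outliers([7, 8, 48, 55, 68])

# Many tight inliers plus one extreme value: the extreme value is flagged.
flagged_big = zscore_outliers([9, 10, 11] * 6 + [10, 10, 100])
```

Note the caveat: with very few samples, a single extreme value inflates s itself and caps how large any |z| can get, so the 3-sigma rule only bites once there are enough inliers.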

SLIDE 16

DATA AUGMENTATION

  • easy for images
    → perform image transformations (e.g. flips, rotations, crops)
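For example, flipping an image (stored as a row-major pixel grid) yields a new, equally valid training example with the same label. A minimal pure-Python sketch:

```python
def flip_horizontal(image):
    """Mirror each row of a row-major pixel grid (left-right flip)."""
    return [list(reversed(row)) for row in image]

def flip_vertical(image):
    """Reverse the row order (up-down flip)."""
    return [list(row) for row in reversed(image)]

img = [[1, 2, 3],
       [4, 5, 6]]
augmented = [img, flip_horizontal(img), flip_vertical(img)]  # 3x the data
```

Rotations and random crops work the same way; the key requirement is that the transformation must not change the label.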

SLIDE 17

DATA AUGMENTATION

Why? → NNs need a ton of training data

  • NN performance is very sensitive to adversarially created noise
  • collecting (and labelling) data is very time consuming/expensive

[figure: adversarial example] x ("panda", 57.7% confidence) + 0.007 · sign(∇ₓJ(θ, x, y)) ("nematode", 8.2% confidence) = x + ε · sign(∇ₓJ(θ, x, y)) ("gibbon", 99.3% confidence)
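The perturbation in that figure is the Fast Gradient Sign Method: add ε · sign(∇ₓJ) to the input. A minimal sketch on a toy logistic model (the weights and input below are made up for illustration):

```python
import math

def fgsm_attack(x, y, w, eps):
    """One Fast Gradient Sign Method step against a logistic model.

    Loss J(x) = log(1 + exp(-y * w.x)); its input gradient is
    grad_x J = -y * sigmoid(-y * w.x) * w, so we step eps in sign(grad)."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    sig = 1.0 / (1.0 + math.exp(y * score))              # sigmoid(-y * score)
    grad = [-y * sig * wi for wi in w]
    step = [eps * (1.0 if g > 0 else -1.0 if g < 0 else 0.0) for g in grad]
    return [xi + si for xi, si in zip(x, step)]

w = [1.0, -2.0, 0.5]   # toy "trained" weights (illustrative)
x = [0.3, 0.1, 0.4]    # input correctly classified as y = +1 (score > 0)
x_adv = fgsm_attack(x, +1, w, eps=0.25)
```

Even though each coordinate moves by at most ε, every move is chosen to increase the loss, which is why such small perturbations can flip the prediction.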

SLIDE 18

SUMMARY & READING

  • More expressive features (even simple transformations) can greatly improve training time and model quality.
  • Feature learning is useful for dealing with non-vectorial input data.
  • Data engineering can improve supervised models by removing outliers or augmenting the data.
  • Neural networks are tricky to train and very sensitive to noise.

Reading:
  • [DSFS] Ch18: Neural Networks (p213-218)
  • http://playground.tensorflow.org