CSE217 INTRODUCTION TO DATA SCIENCE
LECTURE 10: DATA ENGINEERING
Spring 2019
Marion Neumann
RECAP: FEATURE ENGINEERING
6 Good Reasons for Feature Engineering:
1) get better represented features (scaling, standardization)
→ improve model training and prediction quality
2) get more expressive features
→ improve model training and prediction quality → ACTIVITY 1
3) get more representative features
→ remove noise
4) get fewer features
→ more efficient computation
5) represent features in 2d or 3d
→ visualization
6) bring data into vector representation
→ not all data necessarily comes in vector form: x⃗ = (x₁, …, x_d)ᵀ
What data does not come in vector form?
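Reason 1 above (standardization) can be sketched in a few lines of plain Python; the input numbers here are made up, and the sample standard deviation (n − 1 in the denominator) is used:

```python
import math

def standardize(xs):
    """Rescale a feature to zero mean and unit variance (z-scores)."""
    mean = sum(xs) / len(xs)
    # sample standard deviation: divide by n - 1
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (len(xs) - 1))
    return [(x - mean) / sd for x in xs]

print(standardize([2.0, 4.0, 6.0]))  # [-1.0, 0.0, 1.0]
```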
COMPLEX DATA
- data that does not come as numerical vectors
→ requires feature extraction
Example: images
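For images, the simplest possible feature extraction is flattening the pixel grid into one numerical vector; a tiny sketch with made-up pixel values:

```python
# A tiny 3x3 grayscale "image" as nested lists (made-up pixel values)
image = [
    [0, 255, 0],
    [255, 0, 255],
    [0, 255, 0],
]

# Flatten the 2d pixel grid into a single feature vector
features = [pixel for row in image for pixel in row]
print(features)  # [0, 255, 0, 255, 0, 255, 0, 255, 0]
```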
COMPLEX DATA
Example: text

"Same great flavor and friendly service as in the S 18th street location. This location is not as small but it's hard to talk to friends. Thankfully there is great outdoor seating to escape the noise."
→ word counts over a vocabulary (great, small, location, friends, …)
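The word list shown with the review suggests a bag-of-words representation; a minimal sketch, assuming a small hand-picked vocabulary and very naive tokenization:

```python
# Bag-of-words sketch: turn the review into a count vector over a
# small, hand-picked vocabulary (the words shown on the slide).
review = ("Same great flavor and friendly service as in the S 18th street "
          "location. This location is not as small but it's hard to talk to "
          "friends. Thankfully there is great outdoor seating to escape the noise.")

vocabulary = ["great", "small", "location", "friends"]

# naive tokenization: lowercase, strip periods, split on whitespace
tokens = review.lower().replace(".", "").split()
features = [tokens.count(word) for word in vocabulary]
print(dict(zip(vocabulary, features)))  # {'great': 2, 'small': 1, 'location': 2, 'friends': 1}
```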
COMPLEX DATA
Example: street networks
COMPLEX DATA
Example: chemical compounds
[figure: mutagenic vs. non-mutagenic compounds, with an unknown compound to classify (???)]
COMPLEX DATA
Example: point clouds
COMPLEX DATA
Example: social networks
FEATURE EXTRACTION
- feature extraction is challenging
  - which features to compute and use?
  - too many choices and combinations
- NNs can be used to learn features
  - sometimes those features are even interpretable
  - works well for images and text
FEATURE EXTRACTION VS LEARNING

"Same great flavor and friendly service as in the S 18th street location. This location is not as small but it's hard to talk to friends. Thankfully there is great outdoor seating to escape the noise."
[figure: the same review represented by hand-extracted features vs. features learned by a NN]

FEATURE EXTRACTION VS LEARNING
- Train a NN on a very large corpus (e.g. classifying politics / sports / culture)
- Use the pre-trained NN on new (small) data to extract learned features, e.g. z₁ = 0.23, z₂ = 3.42, z₃ = 0.89
- Train a separate simple ML model on these features for the new task (positive / negative):
  - kNN
  - random forest
  - …
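A minimal sketch of this pipeline: `pretrained_features` below is a hypothetical stand-in for a real pre-trained NN (it just computes toy features), and the small labelled dataset is made up:

```python
def pretrained_features(text):
    """Stand-in for a pre-trained NN: maps text to a feature vector.
    A real pipeline would run the text through a trained network."""
    return [len(text) / 100.0, float(text.count("great")), float(text.count("bad"))]

# small labelled dataset for the NEW task (positive / negative)
train_texts = ["great food, great staff", "bad service", "great view"]
train_labels = ["positive", "negative", "positive"]
train_X = [pretrained_features(t) for t in train_texts]

def knn_predict(x, X, y):
    """1-nearest-neighbour on the extracted features (squared distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in X]
    return y[dists.index(min(dists))]

print(knn_predict(pretrained_features("great coffee"), train_X, train_labels))
```

The design point is the division of labour: the expensive model is trained once on a big corpus, while the cheap model (kNN, random forest, …) is retrained per task on its features.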
DATA ENGINEERING
- instead of removing or creating features,
  create or remove data points!
- Why could that be useful?
  - remove outliers
  - create more training examples → data augmentation
CAUSES OF OUTLIERS
- data entry errors (human errors)
- measurement errors (instrument errors)
- experimental errors
  - data extraction or experiment planning/executing errors
- intentional
  - dummy outliers made to test detection methods
- data processing errors
  - data manipulation or unintended data set mutations
- sampling errors
  - extracting or mixing data from wrong or various sources
- natural → not an error, novelties in data
OUTLIER DETECTION
Z-Score Analysis (parametric approach → makes the parametric assumption that the features are normally distributed)
- standardize your data:
  z_i = (x_i − x̄) / s
  with sample mean x̄ = (x_1 + x_2 + ⋯ + x_n) / n
  and sample standard deviation s = sqrt( Σ_{i=1..n} (x_i − x̄)² / (n − 1) )
- empirical rule: for normally distributed data, about 68% / 95% / 99.7% of the samples fall within 1 / 2 / 3 standard deviations of the mean
→ samples with very large |z| (e.g. |z| > 3) are flagged as outliers
[figure: samples on a number line, with extreme z-scores flagged as outliers]
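The z-score analysis above can be sketched directly from the formulas; the data and the threshold here are made up:

```python
import math

def zscore_outliers(xs, threshold=3.0):
    """Flag samples whose |z-score| exceeds the threshold."""
    n = len(xs)
    mean = sum(xs) / n                                   # sample mean
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))  # sample std
    zs = [(x - mean) / s for x in xs]                    # standardize
    return [x for x, z in zip(xs, zs) if abs(z) > threshold]

data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 95]
print(zscore_outliers(data, threshold=2.0))  # [95]
```

Note the caveat from the slide: this is a parametric method, so it is only reliable when the feature is roughly normally distributed.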
DATA AUGMENTATION
- easy for images
→ perform image transformations
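A sketch of two simple label-preserving image transformations, with nested lists of made-up pixel values standing in for real image data:

```python
# A tiny 2x3 grayscale "image" (made-up pixel values)
image = [
    [10, 20, 30],
    [40, 50, 60],
]

def flip_horizontal(img):
    """Mirror each row: preserves the label for most image tasks."""
    return [row[::-1] for row in img]

def flip_vertical(img):
    """Reverse the row order."""
    return img[::-1]

# each transformation yields an extra training example with the same label
augmented = [image, flip_horizontal(image), flip_vertical(image)]
print(flip_horizontal(image))  # [[30, 20, 10], [60, 50, 40]]
```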
DATA AUGMENTATION
Why? → NNs need a ton of training data
- NN performance is very sensitive to adversarially created noise
- collecting (and labelling) data is very time consuming/expensive

[figure: adversarial example: an image x classified as "panda" w/ 57.7% confidence, plus 0.007 · sign(∇ₓ J(θ, x, y)) (the perturbation alone is classified as "nematode" w/ 8.2% confidence), gives x + ε · sign(∇ₓ J(θ, x, y)), classified as "gibbon" w/ 99.3% confidence]
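The perturbation in the figure, x + ε · sign(∇ₓ J(θ, x, y)), can be sketched in a few lines; the "gradient" below is made up, since a real attack would use the network's actual gradient of the loss w.r.t. the input:

```python
def sign(v):
    """Sign of a number: -1, 0, or +1."""
    return (v > 0) - (v < 0)

def fgsm_perturb(x, grad, eps=0.007):
    """Fast gradient sign perturbation: x + eps * sign(grad_x J)."""
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

x = [0.2, 0.5, 0.9]      # a tiny "image" of 3 pixel intensities (made up)
grad = [0.3, -1.2, 0.0]  # hypothetical gradient of the loss w.r.t. x
print(fgsm_perturb(x, grad))
```

Each pixel moves by at most ε, so the perturbed input looks unchanged to a human while the loss increases as fast as possible.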
SUMMARY & READING
- More expressive features (even simple transformations) can greatly improve the training time and model quality.
- Feature learning is useful to deal with non-vectorial input data.
- Data engineering can improve supervised models by removing outliers or by data augmentation.
- Neural Networks are tricky to train and very sensitive to noise.

Reading:
- [DSFS] Ch18: Neural Networks (p213-218)
- http://playground.tensorflow.org