CSE217 INTRODUCTION TO DATA SCIENCE
Spring 2019 Marion Neumann
LECTURE 6: LEARNING PRINCIPLES
RECAP: MACHINE LEARNING Workflow
NOISE
noisy samples from true function
WHY IS NOISE A PROBLEM?
small random sample from the noisy
→ fitting the noise instead of the true function
PDSH p393 Linear Regression
Error on training set: linear model >> quadratic >> 6th-order polynomial
← error is zero! Is the model with zero (training) error the best?
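The ordering of training errors can be reproduced with a small sketch. The data here are hypothetical: 7 noisy samples from an assumed true function sin(x), not the data shown on the slide. With 7 points, a 6th-order polynomial passes through every point, so its training error is (numerically) zero even though it fits the noise.

```python
import numpy as np

# Hypothetical data: 7 noisy samples from an assumed true function sin(x).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 3.0, 7)
y = np.sin(x) + rng.normal(scale=0.2, size=x.shape)


def training_rmse(degree):
    """Fit a polynomial of the given degree and return its RMSE on the training data."""
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x)
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))


# Linear, quadratic and 6th-order polynomial models.
train_errors = {d: training_rmse(d) for d in (1, 2, 6)}
for d, err in train_errors.items():
    print(f"degree {d}: training RMSE = {err:.6f}")
```

A zero training error only says the model reproduces the training points; it says nothing yet about new data.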
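$\mathrm{RMSE}(\hat{y}, y) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2}$
$\mathrm{MAE}(\hat{y}, y) = \frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i|$
$\hat{y} = f(x)$ predictions for test data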
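As a minimal sketch, RMSE and MAE can be computed directly from their standard definitions (root of the mean squared difference, and mean absolute difference, between predictions and true values). The function names and the example numbers are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true values and predictions."""
    n = len(y_true)
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error between true values and predictions."""
    n = len(y_true)
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / n


# Hypothetical test labels and predictions.
y_test = [3.0, 5.0, 2.0, 7.0]
y_hat = [2.0, 5.0, 4.0, 7.0]

print("RMSE:", rmse(y_test, y_hat))
print("MAE: ", mae(y_test, y_hat))
```

Because the differences are squared before averaging, RMSE weights large errors more heavily than MAE does.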
[Handwritten sketch; legible labels: linear, high poly]
(# misclassified test points) / (# test points)
from? → noisy labels
again, we have training and test error (accuracy)
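For classification, a rough sketch of the test error is the fraction of test points whose predicted label differs from the true label, with accuracy as its complement. The labels below are invented, and the +1/-1 encoding is an assumption.

```python
def error_rate(y_true, y_pred):
    """Fraction of points whose predicted label differs from the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)


# Hypothetical test labels and predictions (assumed +1 / -1 encoding).
y_test = [+1, -1, +1, +1, -1]
y_hat = [+1, +1, +1, -1, -1]

test_error = error_rate(y_test, y_hat)
test_accuracy = 1.0 - test_error
print("test error:", test_error, "test accuracy:", test_accuracy)
```

The same function applied to the training labels and training predictions gives the training error.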
                 true label +1                 true label −1
prediction +1    ✓ true positive prediction    ✘ false positive prediction
prediction −1    ✘ false negative prediction   ✓ true negative prediction
Can you define accuracy using these measures?
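One possible way to count the four outcomes, and to express accuracy with them, is sketched below. The helper name `confusion_counts`, the +1/-1 label encoding, and the example labels are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for binary labels; `positive` marks the positive class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1  # true positive prediction
        elif p == positive:
            fp += 1  # false positive prediction
        elif t == positive:
            fn += 1  # false negative prediction
        else:
            tn += 1  # true negative prediction
    return tp, fp, tn, fn


# Hypothetical labels (assumed +1 / -1 encoding).
y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, -1, 1, -1, 1, -1, -1, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + tn + fn)  # correct predictions / all predictions
print(tp, fp, tn, fn, accuracy)
```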
$\mathrm{TPR} = \frac{TP}{P}$, $\mathrm{FNR} = \frac{FN}{P}$
$\mathrm{FPR} = \frac{FP}{N}$, $\mathrm{TNR} = \frac{TN}{N}$
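Assuming the abbreviations stand for the usual rates (true positive rate, false positive rate, true negative rate, false negative rate), with P the number of actual positives and N the number of actual negatives, they can be sketched from the four counts as follows. The counts are hypothetical.

```python
def rates(tp, fp, tn, fn):
    """Compute TPR, FNR, FPR and TNR from the four confusion-matrix counts."""
    p = tp + fn  # number of actual positives
    n = fp + tn  # number of actual negatives
    return {
        "TPR": tp / p,  # true positive rate
        "FNR": fn / p,  # false negative rate
        "FPR": fp / n,  # false positive rate
        "TNR": tn / n,  # true negative rate
    }


# Hypothetical counts.
r = rates(tp=3, fp=1, tn=3, fn=1)
print(r)
```

By construction TPR + FNR = 1 and FPR + TNR = 1, so two of the four rates determine the other two.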
compare training and test errors for all three models
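A minimal sketch of such a comparison, assuming the three models are polynomials of degree 1, 2 and 6 and using hypothetical noisy samples from an assumed true function sin(x):

```python
import numpy as np

rng = np.random.default_rng(1)


def noisy_samples(n):
    """Draw n noisy samples from the assumed true function sin(x) on [0, 3]."""
    x = rng.uniform(0.0, 3.0, size=n)
    return x, np.sin(x) + rng.normal(scale=0.2, size=n)


def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y_hat - y) ** 2)))


x_train, y_train = noisy_samples(10)   # small training set
x_test, y_test = noisy_samples(200)    # held-out test set

results = {}
for degree in (1, 2, 6):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = rmse(y_train, np.polyval(coeffs, x_train))
    test_err = rmse(y_test, np.polyval(coeffs, x_test))
    results[degree] = (train_err, test_err)
    print(f"degree {degree}: train RMSE = {train_err:.3f}, test RMSE = {test_err:.3f}")
```

The training error cannot increase as the degree grows, while the test error typically stops improving or gets worse once the model starts fitting the noise.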
Draw this yourself
Several Strategies:
1) prefer simpler models over more complicated ones
2) use validation set for model selection
3) add a regularization term to your optimization problem during training
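Strategy 2 can be sketched as follows: hold out part of the training data as a validation set, fit each candidate model on the rest, and keep the model with the lowest validation error. The data, the 40/20 split, and the candidate degrees are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: noisy samples from an assumed true function sin(x).
x = rng.uniform(0.0, 3.0, size=60)
y = np.sin(x) + rng.normal(scale=0.2, size=x.shape)

# Random split: 40 points for training, 20 for validation.
indices = rng.permutation(len(x))
train_idx, val_idx = indices[:40], indices[40:]


def validation_rmse(degree):
    """Fit on the training part only, then measure RMSE on the validation part."""
    coeffs = np.polyfit(x[train_idx], y[train_idx], degree)
    residuals = np.polyval(coeffs, x[val_idx]) - y[val_idx]
    return float(np.sqrt(np.mean(residuals ** 2)))


candidate_degrees = (1, 2, 3, 6)
val_errors = {d: validation_rmse(d) for d in candidate_degrees}
best_degree = min(val_errors, key=val_errors.get)
print("validation errors:", val_errors)
print("selected degree:", best_degree)
```

The test set stays untouched during this selection, so it can still give an honest estimate of the chosen model's error afterwards.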
[Figure residue; legible labels: A: ground truth, B: prediction, C: performance evaluation; "Validation" appears next to each]
vs penalize large weights in
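Penalizing large weights can be illustrated with ridge regression, one common form of regularization (named here as an example, not necessarily the one used in the lecture). The sketch below uses the closed-form solution $w = (X^\top X + \lambda I)^{-1} X^\top y$ on hypothetical data with degree-6 polynomial features; for simplicity it also penalizes the intercept weight.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: noisy samples from an assumed true function sin(3x) on [-1, 1].
x = np.linspace(-1.0, 1.0, 10)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=x.shape)

# Degree-6 polynomial features: columns x^0, x^1, ..., x^6.
X = np.vander(x, N=7, increasing=True)


def ridge_fit(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 via the closed-form solution."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)


w_weak = ridge_fit(X, y, lam=1e-3)    # almost no penalty
w_strong = ridge_fit(X, y, lam=10.0)  # strong penalty on large weights

print("weight norm, weak penalty:  ", np.linalg.norm(w_weak))
print("weight norm, strong penalty:", np.linalg.norm(w_strong))
```

A larger penalty shrinks the weights, which makes the fitted curve smoother at the cost of a higher training error.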
A population is the entire set of objects or events under study,
e.g. the hypothetical “all students” or all students in this class.
A sample is a (representative) subset of the objects or events under study.
→ needed because it’s impossible or intractable to
What are problems with sample data?
this might have a (negative) impact!
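One such problem can be sketched with a simulation: estimates computed from a small sample vary much more from sample to sample than estimates from a large one. The population below (10,000 simulated "student heights") is entirely hypothetical.

```python
import random
import statistics

random.seed(0)

# Hypothetical population: heights in cm of 10,000 simulated students.
population = [random.gauss(170, 10) for _ in range(10_000)]
population_mean = statistics.mean(population)


def sample_mean(n):
    """Mean of a random sample of size n drawn without replacement."""
    return statistics.mean(random.sample(population, n))


# Repeat the sampling 200 times for a small and a large sample size.
small_sample_means = [sample_mean(5) for _ in range(200)]
large_sample_means = [sample_mean(500) for _ in range(200)]

spread_small = statistics.stdev(small_sample_means)
spread_large = statistics.stdev(large_sample_means)
print("population mean:", round(population_mean, 2))
print("spread of sample means, n=5:  ", round(spread_small, 2))
print("spread of sample means, n=500:", round(spread_large, 2))
```

This only covers random variation; a sample that is not representative of the population would add a systematic error that more data of the same kind does not remove.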
influences model selection
looking at the error.