Machine Learning

1
Machine Learning in a Nutshell

[Diagram: the Machine Learner takes Data and a Performance Measure and produces a Model]
2
Machine Learning in a Nutshell

Data with attributes:

ID  A1   Reflex     RefLow  RefHigh  Label
1   5.6  Normal     3.4     7        No
2   5.5  Normal     2.4     5.7      No
3   5.3  Normal     2.4     5.7      Yes
4   5.3  Elevated   2.4     5.7      No
5   6.3  Normal     3.4     7        No
6   3.3  Normal     2.4     5.7      Yes
7   5.1  Decreased  2.4     5.7      Yes
8   4.2  Normal     2.4     5.7      Yes
…   …    …          …       …        …

Each row is an instance x_i ∈ X with its label y_i ∈ Y.
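To make the notation concrete, here is a minimal Python sketch of the data as instances x_i paired with labels y_i; the values are copied from the table above, while the variable names and structure are my own choices for illustration:

```python
# Each instance x_i is a dict of attribute values; each label y_i is "Yes"/"No".
X = [
    {"A1": 5.6, "Reflex": "Normal",    "RefLow": 3.4, "RefHigh": 7.0},
    {"A1": 5.5, "Reflex": "Normal",    "RefLow": 2.4, "RefHigh": 5.7},
    {"A1": 5.3, "Reflex": "Normal",    "RefLow": 2.4, "RefHigh": 5.7},
    {"A1": 5.3, "Reflex": "Elevated",  "RefLow": 2.4, "RefHigh": 5.7},
    {"A1": 6.3, "Reflex": "Normal",    "RefLow": 3.4, "RefHigh": 7.0},
    {"A1": 3.3, "Reflex": "Normal",    "RefLow": 2.4, "RefHigh": 5.7},
    {"A1": 5.1, "Reflex": "Decreased", "RefLow": 2.4, "RefHigh": 5.7},
    {"A1": 4.2, "Reflex": "Normal",    "RefLow": 2.4, "RefHigh": 5.7},
]
y = ["No", "No", "Yes", "No", "No", "Yes", "Yes", "Yes"]

# The dataset the learner consumes is the list of (x_i, y_i) pairs.
data = list(zip(X, y))
```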
3
Machine Learning in a Nutshell

Model: the learner's output. Example model families: logistic regression, support vector machines, hierarchical Bayesian networks, mixture models.
A model is a function f : X → Y learned from the data with attributes (table as on the previous slide): it maps each instance x_i ∈ X to a label y_i ∈ Y.
4
Machine Learning in a Nutshell

Evaluation: measure predicted labels vs. actual labels on test data.

[Figure: learning curve plotting performance against the number of training examples]
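One way to produce such a learning curve is to train on ever larger prefixes of the training set and score each resulting model on held-out test data. A minimal sketch, using a majority-class baseline as a stand-in for any learner (the labels here are illustrative):

```python
from collections import Counter

def majority_class(train_labels):
    """Trivial 'model': always predict the most common training label."""
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(predicted_label, test_labels):
    return sum(1 for y in test_labels if y == predicted_label) / len(test_labels)

# Illustrative labels only; a real curve would use a real model and dataset.
train_y = ["No", "No", "Yes", "No", "No", "Yes", "Yes", "Yes"]
test_y  = ["Yes", "No", "Yes", "No"]

for n in range(1, len(train_y) + 1):
    model = majority_class(train_y[:n])   # train on the first n examples
    print(n, accuracy(model, test_y))     # performance vs. # training examples
```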
5
(Recap slide: the model f : X → Y, the example model families, and the labeled data table from the previous slides.)
6
A training set
7
ID3-induced decision tree
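ID3 grows such a tree greedily, splitting at each node on the attribute with the highest information gain. A minimal sketch of that scoring step, assuming the standard entropy and gain definitions (the toy rows are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum over classes of p * log2(p)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    gain = entropy(labels)
    n = len(rows)
    for v in set(r[attr] for r in rows):
        subset = [y for r, y in zip(rows, labels) if r[attr] == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Illustrative data: ID3 splits on the attribute with the largest gain.
rows   = [{"Color": "red"}, {"Color": "red"}, {"Color": "blue"}, {"Color": "blue"}]
labels = ["+", "+", "-", "+"]
print(information_gain(rows, labels, "Color"))  # ≈ 0.311
```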
8
Model spaces
[Figure: the same + and − training points partitioned by three model spaces, one panel each: nearest neighbor, version space, decision tree]
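Of the three spaces, nearest neighbor is the simplest to sketch: the "model" is just the stored training set, and a query point takes the label of its closest stored example. A minimal 1-NN sketch (the points are illustrative):

```python
def nearest_neighbor(train, query):
    """1-NN: return the label of the training point closest to the query."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    point, label = min(train, key=lambda pl: dist2(pl[0], query))
    return label

# Illustrative +/- points in the plane, as in the figure.
train = [((1.0, 1.0), "+"), ((2.0, 1.5), "+"), ((4.0, 4.0), "-")]
print(nearest_neighbor(train, (1.5, 1.2)))  # -> "+"
```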
9
Decision tree-induced partition – example
[Figure: a decision tree testing Color (red / green / blue), then Shape (round / square) and Size (big / small), and the +/− partition of the instance space it induces]
19
The Naïve Bayes Classifier
Some material adapted from slides by Tom Mitchell, CMU.
20
The Naïve Bayes Classifier
- Recall Bayes rule:

    P(Y_i | X_j) = P(X_j | Y_i) P(Y_i) / P(X_j)

- Which is short for:

    P(Y = y_i | X = x_j) = P(X = x_j | Y = y_i) P(Y = y_i) / P(X = x_j)

- We can re-write this as:

    P(Y = y_i | X = x_j) = P(X = x_j | Y = y_i) P(Y = y_i) / Σ_k P(X = x_j | Y = y_k) P(Y = y_k)
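A quick numeric check of the re-written rule, with made-up prior and likelihood values; note that the denominator is the same product summed over every value of Y:

```python
# Made-up numbers, purely to exercise the formula above.
prior = {"yes": 0.75, "no": 0.25}      # P(Y = y_k)
likelihood = {"yes": 0.9, "no": 0.2}   # P(X = x_j | Y = y_k)

# P(Y = yes | X = x_j) = P(x_j | yes) P(yes) / sum_k P(x_j | y_k) P(y_k)
evidence = sum(likelihood[y] * prior[y] for y in prior)
posterior_yes = likelihood["yes"] * prior["yes"] / evidence
print(posterior_yes)  # 0.675 / 0.725 ≈ 0.931
```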
21
Deriving Naïve Bayes
- Idea: use the training data to directly estimate P(Y) and P(X | Y).
- Then, we can use these values to estimate P(Y | X_new) using Bayes rule.
- Recall that representing the full joint probability P(X_1, X_2, …, X_n | Y) is not practical.
22
Deriving Naïve Bayes
- However, if we make the assumption that the attributes are independent, estimation is easy!

    P(X_1, …, X_n | Y) = Π_i P(X_i | Y)

- In other words, we assume all attributes are conditionally independent given Y.
- Often this assumption is violated in practice, but more on that later…
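The payoff is that the conditional joint collapses into a product of per-attribute terms, each easy to estimate. A tiny sketch with made-up conditionals:

```python
from math import prod

# Made-up conditionals P(X_i = x_i | Y = y) for three attributes.
p_given_y = [0.9, 0.6, 0.8]

# Under conditional independence: P(X_1..X_3 | Y = y) = prod_i P(X_i | Y = y).
print(prod(p_given_y))  # 0.432
```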
23
Deriving Naïve Bayes
- Let X_1, …, X_n and label Y be discrete.
- Then, we can estimate P(Y_i) and P(X_i | Y_i) directly from the training data by counting!

Sky    Temp  Humid   Wind    Water  Forecast  Play?
sunny  warm  normal  strong  warm   same      yes
sunny  warm  high    strong  warm   same      yes
rainy  cold  high    strong  warm   change    no
sunny  warm  high    strong  cool   change    yes

P(Sky = sunny | Play = yes) = ?
P(Humid = high | Play = yes) = ?
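Answering the slide's two questions by counting, a minimal sketch with the four rows above hard-coded:

```python
# The four training rows from the table above.
rows = [
    {"Sky": "sunny", "Temp": "warm", "Humid": "normal", "Wind": "strong", "Water": "warm", "Forecast": "same",   "Play": "yes"},
    {"Sky": "sunny", "Temp": "warm", "Humid": "high",   "Wind": "strong", "Water": "warm", "Forecast": "same",   "Play": "yes"},
    {"Sky": "rainy", "Temp": "cold", "Humid": "high",   "Wind": "strong", "Water": "warm", "Forecast": "change", "Play": "no"},
    {"Sky": "sunny", "Temp": "warm", "Humid": "high",   "Wind": "strong", "Water": "cool", "Forecast": "change", "Play": "yes"},
]

def cond_prob(attr, value, label):
    """Estimate P(attr = value | Play = label) by counting."""
    matching = [r for r in rows if r["Play"] == label]
    return sum(1 for r in matching if r[attr] == value) / len(matching)

print(cond_prob("Sky", "sunny", "yes"))   # 3/3 = 1.0
print(cond_prob("Humid", "high", "yes"))  # 2/3 ≈ 0.67
```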
24
The Naïve Bayes Classifier
- Now we have:

    P(Y = y_j | X_1, …, X_n) = P(Y = y_j) Π_i P(X_i | Y = y_j) / Σ_k P(Y = y_k) Π_i P(X_i | Y = y_k)

  which is just a one-level Bayesian network:

  [Figure: the label Y (hypotheses) as the root node with the attributes X_1 … X_i … X_n (evidence) as its children; the parameters are the prior P(Y_j) and the conditionals P(X_i | Y_j)]

- To classify a new point X_new:

    Y_new ← argmax_{y_k} P(Y = y_k) Π_i P(X_i | Y = y_k)
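Given the estimated prior and conditionals, classification is exactly this argmax. A minimal sketch (the probability values are made up for illustration):

```python
from math import prod

# Made-up parameters: prior over labels and per-attribute conditionals
# evaluated at the new point's attribute values.
prior = {"yes": 0.75, "no": 0.25}
cond = {
    "yes": [1.0, 0.67],  # P(X_i = x_i_new | Y = yes) for each attribute i
    "no":  [0.0, 1.0],   # P(X_i = x_i_new | Y = no)
}

# Y_new <- argmax_k P(Y = y_k) * prod_i P(X_i | Y = y_k)
y_new = max(prior, key=lambda y: prior[y] * prod(cond[y]))
print(y_new)  # "yes": 0.75 * 1.0 * 0.67 beats 0.25 * 0.0 * 1.0
```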
25
The Naïve Bayes Algorithm
- For each value y_k:
  - Estimate P(Y = y_k) from the data.
  - For each value x_ij of each attribute X_i:
    - Estimate P(X_i = x_ij | Y = y_k).
- Classify a new point via:

    Y_new ← argmax_{y_k} P(Y = y_k) Π_i P(X_i | Y = y_k)

- In practice, the independence assumption doesn't often hold true, but Naïve Bayes performs very well despite it.
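Putting the algorithm together, a sketch that estimates every probability by counting over two columns of the slide-23 table and classifies a new day. Real implementations typically also smooth the counts (e.g. Laplace smoothing) and work with log-probabilities to avoid underflow; neither is shown here:

```python
from collections import Counter, defaultdict
from math import prod

rows = [  # Sky and Humid columns of the slide-23 table (others omitted for brevity)
    {"Sky": "sunny", "Humid": "normal", "Play": "yes"},
    {"Sky": "sunny", "Humid": "high",   "Play": "yes"},
    {"Sky": "rainy", "Humid": "high",   "Play": "no"},
    {"Sky": "sunny", "Humid": "high",   "Play": "yes"},
]
attrs = ["Sky", "Humid"]

# Estimate P(Y = y_k) and P(X_i = x_ij | Y = y_k) by counting.
label_counts = Counter(r["Play"] for r in rows)
prior = {y: c / len(rows) for y, c in label_counts.items()}
cond = defaultdict(lambda: defaultdict(Counter))
for r in rows:
    for a in attrs:
        cond[r["Play"]][a][r[a]] += 1

def p_cond(attr, value, label):
    return cond[label][attr][value] / label_counts[label]

def classify(x):
    # Y_new <- argmax_k P(y_k) * prod_i P(x_i | y_k)
    return max(prior, key=lambda y: prior[y] * prod(p_cond(a, x[a], y) for a in attrs))

print(classify({"Sky": "sunny", "Humid": "high"}))  # -> "yes"
```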
26
Naïve Bayes Applications
- Text classification:
  - Which e-mails are spam? (see the sketch after this list)
  - Which e-mails are meeting notices?
  - Which author wrote a document?
- Classifying mental states:
  - Learning P(BrainActivity | WordCategory), e.g. brain activity evoked by "people words" vs. "animal words".
  - Pairwise classification accuracy: 85%.
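For the spam example, a common recipe is bag-of-words counts fed into a multinomial Naïve Bayes model. A sketch using scikit-learn (the miniature corpus is made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up miniature corpus, for illustration only.
emails = [
    "win money now claim your prize",
    "meeting notice agenda attached for monday",
    "cheap prize win win",
    "please review the meeting agenda",
]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()             # bag-of-words counts
X = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X, labels)     # estimates P(word | class) with smoothing

print(model.predict(vectorizer.transform(["claim your prize now"])))  # likely "spam"
```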