
Machine Learning: Algorithms and Applications

Floriano Zini Free University of Bozen-Bolzano Faculty of Computer Science Academic Year 2011-2012 Lecture 3: 12th March 2012

Naïve Bayes classifier (1)

› Problem definition

  • A training set X, where each training instance x is represented

as an n-dimensional attribute vector: (x1, x2, ..., xn)

  • A pre-defined set of classes: C={c1, c2, ..., cm}
  • Given a new instance z, which class should z be classified to?

› We want to find the most probable class for instance z

cMAP = argmax_{ci∈C} P(ci | z)

cMAP = argmax_{ci∈C} P(ci | z1, z2, ..., zn)

cMAP = argmax_{ci∈C} [P(z1, z2, ..., zn | ci) · P(ci)] / P(z1, z2, ..., zn)    (by Bayes theorem)

cMAP = argmax_{ci∈C} P(z1, z2, ..., zn | ci) · P(ci)    (P(z1, z2, ..., zn) is the same for all classes)


Naïve Bayes classifier (2)

Assumption in the Naïve Bayes classifier: the attributes are conditionally independent given the class

P(z1, z2, ..., zn | ci) = ∏_{j=1..n} P(zj | ci)

The Naïve Bayes classifier finds the most probable class for z

cNB = argmax_{ci∈C} P(ci) · ∏_{j=1..n} P(zj | ci)

Naïve Bayes classifier - Algorithm

› The learning (training) phase (given a training set)

For each class ci∈C

  • Estimate the prior probability: P(ci)
  • For each attribute value zj, estimate the probability of that

attribute value given class ci: P(zj|ci)

› The classification phase

  • For each class ci∈C, compute the formula

    P(ci) · ∏_{j=1..n} P(zj | ci)

  • Select the most probable class c*

    c* = argmax_{ci∈C} P(ci) · ∏_{j=1..n} P(zj | ci)
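The two phases above can be sketched in plain Python. This is a minimal illustration for categorical attributes; the function names and the `(attributes, class)` pair representation are my own choices, not from the slides:

```python
from collections import Counter, defaultdict

def train_nb(examples):
    """Learning phase: estimate P(ci) and P(zj|ci) by counting."""
    class_counts = Counter(c for _, c in examples)
    total = len(examples)
    priors = {c: class_counts[c] / total for c in class_counts}
    value_counts = defaultdict(int)
    for x, c in examples:
        for j, v in enumerate(x):
            value_counts[(c, j, v)] += 1
    # cond[(c, j, v)] = P(attribute j has value v | class c)
    cond = {k: cnt / class_counts[k[0]] for k, cnt in value_counts.items()}
    return priors, cond

def classify_nb(priors, cond, z):
    """Classification phase: argmax over P(ci) * prod_j P(zj|ci)."""
    def score(c):
        p = priors[c]
        for j, v in enumerate(z):
            p *= cond.get((c, j, v), 0.0)  # unseen value -> probability 0
        return p
    return max(priors, key=score)
```

Run on the “buy computer” example that follows, this reproduces the hand computation.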


Naïve Bayes classifier – Example (1)

Will a young student with medium income and fair credit rating buy a computer?

Rec. ID  Age     Income  Student  Credit_Rating  Buy_Computer
1        Young   High    No       Fair           No
2        Young   High    No       Excellent      No
3        Medium  High    No       Fair           Yes
4        Old     Medium  No       Fair           Yes
5        Old     Low     Yes      Fair           Yes
6        Old     Low     Yes      Excellent      No
7        Medium  Low     Yes      Excellent      Yes
8        Young   Medium  No       Fair           No
9        Young   Low     Yes      Fair           Yes
10       Old     Medium  Yes      Fair           Yes
11       Young   Medium  Yes      Excellent      Yes
12       Medium  Medium  No       Excellent      Yes
13       Medium  High    Yes      Fair           Yes
14       Old     Medium  No       Excellent      No

http://www.cs.sunysb.edu/~cse634/lecture_notes/07classification.pdf

Naïve Bayes classifier – Example (2)

› Representation of the problem

  • z = (Age=Young,Income=Medium,Student=Yes,Credit_Rating=Fair)
  • Two classes: c1 (buy a computer) and c2 (not buy a computer)

› Compute the prior probability for each class

  • P(c1) = 9/14
  • P(c2) = 5/14

› Compute the probability of each attribute value given each class

  • P(Age=Young|c1) = 2/9;            P(Age=Young|c2) = 3/5
  • P(Income=Medium|c1) = 4/9;        P(Income=Medium|c2) = 2/5
  • P(Student=Yes|c1) = 6/9;          P(Student=Yes|c2) = 1/5
  • P(Credit_Rating=Fair|c1) = 6/9;   P(Credit_Rating=Fair|c2) = 2/5


Naïve Bayes classifier – Example (3)

› Compute the likelihood of instance z given each class

  • For class c1

P(z|c1)= P(Age=Young|c1)*P(Income=Medium|c1)*P(Student=Yes|c1)* P(Credit_Rating=Fair|c1) = (2/9)*(4/9)*(6/9)*(6/9) = 0.044

  • For class c2

P(z|c2)= P(Age=Young|c2)*P(Income=Medium|c2)*P(Student=Yes|c2)* P(Credit_Rating=Fair|c2) = (3/5)*(2/5)*(1/5)*(2/5) = 0.019

› Find the most probable class

  • For class c1

P(c1)*P(z|c1) = (9/14)*(0.044) = 0.028

  • For class c2

P(c2)*P(z|c2) = (5/14)*(0.019) = 0.007

→ Conclusion: The person z (a young student with medium income and fair credit rating) will buy a computer!
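The products above can be checked with exact fractions (a quick sketch of the same arithmetic):

```python
from fractions import Fraction as F

# Likelihoods P(z|ci), products of the estimated conditional probabilities
p_z_c1 = F(2, 9) * F(4, 9) * F(6, 9) * F(6, 9)   # ~0.044
p_z_c2 = F(3, 5) * F(2, 5) * F(1, 5) * F(2, 5)   # ~0.019

# Unnormalized posteriors P(ci) * P(z|ci)
post_c1 = F(9, 14) * p_z_c1                       # ~0.028
post_c2 = F(5, 14) * p_z_c2                       # ~0.007

print(float(post_c1) > float(post_c2))  # True: class c1 (buy) wins
```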

Naïve Bayes classifier – Issues (1)

› What happens if no training instances associated with class ci have attribute value xj?

› E.g., in the “buy computer” example, suppose no young students had bought computers; then

    P(xj|ci) = n(ci,xj)/n(ci) = 0, and hence: P(ci) · ∏_{j=1..n} P(xj|ci) = 0

› Solution: use a Bayesian approach (the m-estimate) to estimate P(xj|ci)

    P(xj|ci) = (n(ci,xj) + m·p) / (n(ci) + m)

  • n(ci): number of training instances associated with class ci
  • n(ci,xj): number of training instances associated with class ci that have attribute value xj
  • p: a prior estimate for P(xj|ci)
    → Assume uniform priors: p = 1/k, if attribute fj has k possible values
  • m: a weight given to the prior
    → Augments the n(ci) actual observations with m additional virtual samples distributed according to p
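The m-estimate can be written as a one-line helper (a sketch; the argument names are mine):

```python
def m_estimate(n_ci_xj, n_ci, p, m):
    """Smoothed estimate of P(xj|ci): observed counts augmented by
    m virtual samples distributed according to the prior p."""
    return (n_ci_xj + m * p) / (n_ci + m)

# A value never observed with class ci (k = 3 possible values, m = 3)
# now gets probability (0 + 3*(1/3)) / (9 + 3) = 1/12 instead of 0
print(m_estimate(0, 9, p=1/3, m=3))
```

With m = 0 the estimate falls back to the plain relative frequency n(ci,xj)/n(ci).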


Naïve Bayes classifier – Issues (2)

  • P(xj|ci) < 1, for every attribute value xj and class ci
  • So, when the number of attributes n is very large

    lim_{n→∞} ∏_{j=1..n} P(xj|ci) = 0

› Solution: use a logarithmic function of probability

    cNB = argmax_{ci∈C} log( P(ci) · ∏_{j=1..n} P(xj|ci) )

    cNB = argmax_{ci∈C} [ log P(ci) + Σ_{j=1..n} log P(xj|ci) ]
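The underflow and its log-space fix can be demonstrated directly (a sketch; `log_score` is an illustrative name):

```python
import math

def log_score(prior, cond_probs):
    """log P(ci) + sum_j log P(xj|ci): same argmax as the product,
    but immune to floating-point underflow."""
    return math.log(prior) + sum(math.log(p) for p in cond_probs)

probs = [0.1] * 1000          # 1000 conditional probabilities
product = 1.0
for p in probs:
    product *= p              # underflows to exactly 0.0
print(product, log_score(0.5, probs))
```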

Naïve Bayes classifier – Summary

› One of the most practical learning methods
› Based on the Bayes theorem
› Parameter estimation for Naïve Bayes models uses maximum likelihood estimation

› Computationally very fast

  • Training: only one pass over the training set
  • Classification: linear in the number of attributes

› Despite its conditional independence assumption, Naïve Bayes classifier

shows a good performance in several application domains

› When to use?

  • A moderate or large training set available
  • Instances are represented by a large number of attributes
  • Attributes that describe instances are conditionally independent given

classification


Linear regression

Linear regression – Introduction

› Goal: to predict a real-valued output given an input instance
› A simple-but-effective learning technique when the target function is a linear function
› The learning problem is to learn (i.e., approximate) a real-valued function f: X → Y

  • X: The input domain (i.e., an n-dimensional vector space – Rn)
  • Y: The output domain (i.e., the real values domain – R)
  • f: The target function to be learned (i.e., a linear mapping function)

› Essentially, to learn the weights vector

w = (w0, w1, w2, …, wn)

f(x) = w0 + w1x1 + w2x2 + ... + wnxn = w0 + Σ_{i=1..n} wi·xi    (wi, xi ∈ R)


Linear regression – Example

What is the linear function f(x)?

x        f(x)
0.13    -0.91
1.02    -0.17
3.17     1.61
-2.76   -3.31
1.44     0.18
5.28     3.36
-1.74   -2.46
7.93     5.56
...      ...

E.g., f(x) = -1.02 + 0.83x
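The example weights can be recovered from the (x, f(x)) pairs with the closed-form least-squares solution for a single attribute (a pure-Python sketch; `fit_line` is an illustrative name):

```python
def fit_line(xs, ys):
    """Closed-form least squares for f(x) = w0 + w1*x (one attribute)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    w1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    w0 = my - w1 * mx
    return w0, w1

xs = [0.13, 1.02, 3.17, -2.76, 1.44, 5.28, -1.74, 7.93]
ys = [-0.91, -0.17, 1.61, -3.31, 0.18, 3.36, -2.46, 5.56]
w0, w1 = fit_line(xs, ys)
print(round(w0, 2), round(w1, 2))  # close to -1.02 and 0.83
```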

Linear regression – Training / test instances

› For each training instance x=(x1,x2,...,xn) ∈ X, where xi∈R

  • The desired (target) output value cx (∈R)
  • The actual output value: yx = w0 + Σ_{i=1..n} wi·xi

→ Here, wi are the system’s current estimates of the weights
→ The actual output value yx is desired to (approximately) be cx

› For a test instance z=(z1,z2,...,zn)

  • To predict the output value
  • By applying the learned target function f


Linear regression – Error function

› The learning algorithm requires to define an error

function

→ To measure the error made by the system in the training phase

› Definition of the training square error E

  • Error computed on each training example x:

    E(x) = (1/2)·(cx − yx)² = (1/2)·(cx − w0 − Σ_{i=1..n} wi·xi)²

  • Error computed on the entire training set X:

    E = Σ_{x∈X} E(x) = (1/2)·Σ_{x∈X} (cx − yx)² = (1/2)·Σ_{x∈X} (cx − w0 − Σ_{i=1..n} wi·xi)²

Least-square linear regression

› Learning the target function f is equivalent to learning the

weights vector w that minimizes the training square error E

→ Hence the name “Least-Square Linear Regression”

› Training phase

  • Initialize the weights vector w (small random values)
  • Compute the training error E
  • Update the weights vector w according to the delta rule
  • Repeat until converging to a (locally) minimum error E

› Prediction phase

For a new instance z, the (predicted) output value is:

    f(z) = w*0 + Σ_{i=1..n} w*i·zi

where w* = (w*0, w*1, ..., w*n) is the learned weights vector


The delta rule

› To update the weights vector w in the direction that

decreases the training error E

  • η is the learning rate (i.e., a small positive constant)

→ To decide the degree to which the weights are changed at each training step

  • Instance-to-instance update: wi ← wi + η(cx−yx)xi
  • Batch update: wi ← wi + η·Σ_{x∈X} (cx−yx)xi

› Other names of the delta rule

  • LMS (least mean square) rule
  • Adaline rule
  • Widrow-Hoff rule

LSLR_batch(X, η)
    for each attribute i
        wi ← an initial (small) random value
    while not CONVERGENCE
        for each attribute i
            delta_wi ← 0
        for each training example x∈X
            compute the actual output value yx
            for each attribute i
                delta_wi ← delta_wi + η(cx−yx)xi
        for each attribute i
            wi ← wi + delta_wi
    end while
    return w
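The batch pseudocode translates to runnable Python, assuming each example is an (attribute-list, target) pair and a fixed number of epochs stands in for CONVERGENCE (both assumptions are mine):

```python
def lslr_batch(examples, eta=0.01, epochs=500):
    """Batch delta rule: accumulate eta*(c_x - y_x)*x_i over the whole
    training set, then apply the summed update once per epoch."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                  # w[0] is the bias weight w0
    for _ in range(epochs):
        delta = [0.0] * (n + 1)
        for x, c in examples:
            y = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            delta[0] += eta * (c - y)    # the input paired with w0 is 1
            for i, xi in enumerate(x, start=1):
                delta[i] += eta * (c - y) * xi
        w = [wi + di for wi, di in zip(w, delta)]
    return w
```

The incremental variant applies the update inside the example loop instead of accumulating it.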


Batch vs. incremental update

› The previous algorithm follows a batch update approach

› Batch update

  • At each training step (cycle), the weights are updated after all

the training instances are inputted to the system

  • First, the error is computed cumulatively on all the training

instances

  • Then, the weights are updated according to the overall (cumulated) error

› Incremental update

  • At each training step, the weights are updated immediately

after each training instance is inputted to the system

  • The individual error is computed for the training instance
  • The weights are updated immediately according to the individual

error

LSLR_incremental(X, η)
    for each attribute i
        wi ← an initial (small) random value
    while not CONVERGENCE
        for each training example x∈X
            compute the actual output value yx
            for each attribute i
                wi ← wi + η(cx−yx)xi
    end while
    return w


Training termination conditions

› In the LSLR_batch and LSLR_incremental learning

algorithms, the training process terminates when the conditions indicated by CONVERGENCE are met

› The (training) termination conditions are typically defined

based on some kind of system performance measure

  • Stop, if the error is less than a threshold value
  • Stop, if the error at a learning step is greater than that at the

previous step

  • Stop, if the difference between the errors at two consecutive steps is

less than a threshold value

  • Stop, if ...

Nearest neighbor learner


Nearest neighbor learner – Introduction (1)

› Some alternative names

  • Instance-based learning
  • Lazy learning
  • Memory-based learning

› Nearest neighbor learner

  • Given a set of training instances

─ Just store the training instances
─ Do not construct a general, explicit description (model) of the target function based on the training instances

  • Given a test instance (to be classified/predicted)

─ Examine the relationship between the test instance and the

training instances to assign a target function value

Nearest neighbor learner – Introduction (2)

› The input representation

  • Each instance x is represented as a vector in an n-

dimensional vector space X∈Rn

  • x = (x1,x2,…,xn), where xi (∈R) is a real number

› We consider two learning tasks

  • Nearest neighbor learner for classification

─ To learn a discrete-valued target function
─ The output is one of the pre-defined nominal values (i.e., class labels)

  • Nearest neighbor learner for prediction

─ To learn a continuous-valued target function
─ The output is a real number


Nearest neighbor learner – Example

› 1 nearest neighbor

→ Assign z to c2

› 3 nearest neighbors

→ Assign z to c1

› 5 nearest neighbors

→ Assign z to c1

[Figure: test instance z surrounded by training instances of classes c1 and c2]

k-Nearest neighbor classifier – Algorithm

› For the classification task

› Each training instance x is represented by

  • The description: x=(x1,x2,…,xn), where xi∈R
  • The class label: c (∈C, where C is a pre-defined set of class labels)

› Training phase

  • Just store the training instances set X = {x}

› Test phase. To classify a new instance z

  • For each training instance x∈X, compute distance between x and z
  • Compute the set NB(z) – the neighbourhood of z

→ The k instances in X nearest to z according to a distance function d

  • Classify z to the majority class of the instances in NB(z)
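The test phase above can be sketched in Python with the Euclidean distance (the function names and the (vector, label) pair representation are my own choices):

```python
import math
from collections import Counter

def knn_classify(train, z, k=3):
    """Classify z by majority vote among its k nearest training instances.
    train: list of (vector, class_label) pairs."""
    def dist(x):
        return math.sqrt(sum((xi - zi) ** 2 for xi, zi in zip(x, z)))
    neighbors = sorted(train, key=lambda xc: dist(xc[0]))[:k]  # NB(z)
    votes = Counter(c for _, c in neighbors)
    return votes.most_common(1)[0][0]
```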

k-Nearest neighbor predictor – Algorithm

› For the regression task (i.e., to predict a real output value)

› Each training instance x is represented by

  • The description: x=(x1,x2,…,xn), where xi∈R
  • The output value: yx∈R (i.e., a real number)

› Training phase

  • Just store the training examples set X

› Test phase. To predict the output value for new instance z

  • For each training instance x∈X, compute distance between x and z
  • Compute the set NB(z) – the neighbourhood of z

→ The k instances in X nearest to z according to a distance function d

  • Predict the output value of z:

    yz = (1/k)·Σ_{x∈NB(z)} yx
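The prediction variant differs only in the last step, averaging the neighbors’ outputs instead of voting (a sketch under the same assumed data representation):

```python
import math

def knn_predict(train, z, k=3):
    """Predict a real value for z: the mean of the outputs y_x of the
    k nearest training instances (Euclidean distance)."""
    def dist(x):
        return math.sqrt(sum((xi - zi) ** 2 for xi, zi in zip(x, z)))
    neighbors = sorted(train, key=lambda xy: dist(xy[0]))[:k]  # NB(z)
    return sum(y for _, y in neighbors) / k
```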

One vs. More than one neighbor

› Using only a single neighbor (i.e., the training

instance closest to the test instance) to determine the classification is subject to errors

  • E.g., noise (i.e. error) in the class label of a single training

instance

› Consider the k (>1) nearest training instances, and

return the majority class label of these k instances

› The value of k is typically odd to avoid ties

  • For example, k=3 or k=5

Distance function (1)

› The distance function d

  • Play a very important role in the nearest neighbor

learning approach

  • Typically defined before, and fixed through, the training

and test phases – i.e., not adjusted based on data

› Choice of the distance function d

  • Geometry distance functions, for continuous-valued input

space (xi∈R)

  • Hamming distance function, for binary-valued input space

(xi∈{0,1})

Distance function (2)

› Geometry distance functions

  • Manhattan distance:    d(x,z) = Σ_{i=1..n} |xi − zi|
  • Euclidean distance:    d(x,z) = ( Σ_{i=1..n} (xi − zi)² )^(1/2)
  • Minkowski (p-norm) distance:    d(x,z) = ( Σ_{i=1..n} |xi − zi|^p )^(1/p)
  • Chebyshev distance:    d(x,z) = lim_{p→∞} ( Σ_{i=1..n} |xi − zi|^p )^(1/p) = max_i |xi − zi|
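The four distances translate directly to Python (a sketch for equal-length numeric vectors):

```python
def manhattan(x, z):
    return sum(abs(xi - zi) for xi, zi in zip(x, z))

def euclidean(x, z):
    return sum((xi - zi) ** 2 for xi, zi in zip(x, z)) ** 0.5

def minkowski(x, z, p):
    return sum(abs(xi - zi) ** p for xi, zi in zip(x, z)) ** (1 / p)

def chebyshev(x, z):
    # limit of the Minkowski distance as p -> infinity
    return max(abs(xi - zi) for xi, zi in zip(x, z))
```

For growing p, `minkowski` approaches `chebyshev`, matching the limit on the slide.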


Distance function (3)

› Hamming distance

function

  • For binary-valued input

space

  • E.g., x=(0,1,0,1,1)

    d(x,z) = Σ_{i=1..n} Difference(xi, zi),  where Difference(a,b) = 1 if (a ≠ b), 0 if (a = b)
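A one-line Python version (the second vector z is an illustrative choice, not from the slide):

```python
def hamming(x, z):
    """Number of positions where the binary vectors differ."""
    return sum(1 for xi, zi in zip(x, z) if xi != zi)

print(hamming((0, 1, 0, 1, 1), (1, 1, 0, 0, 1)))  # 2
```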

Attribute value normalization

› The Euclidean distance function

› Assume that an instance is represented by 3 attributes: Age, Income (per month), and Height (in meters)

  • x = (Age=20, Income=12000, Height=1.68)
  • z = (Age=40, Income=1300, Height=1.75)

› The distance between x and z

  • d(x,z) = [(20−40)² + (12000−1300)² + (1.68−1.75)²]^(1/2)
  • The distance is dominated by the local distance (difference) on the

Income attribute → Because the Income attribute has a large range of values

› To normalize the values of all the attributes to the same range

  • Usually the value range [0,1] is used
  • E.g., for every attribute i: xi = xi/max_value_of_attribute_i

    d(x,z) = ( Σ_{i=1..n} (xi − zi)² )^(1/2)


Attribute importance weight

› The Euclidean distance function

  • All the attributes are considered equally important in the distance

computation

› Different attributes may have different degrees of influence on

the distance metric

› To incorporate attribute importance weights in the distance

function

  • wi is the importance weight of attribute i:

› How to achieve the attribute importance weights?

  • By the domain-specific knowledge (e.g., indicated by experts in

the problem domain)

  • By an optimization process (e.g., using a separate validation set

to learn an optimal set of attribute weights)

(standard Euclidean distance)

    d(x,z) = ( Σ_{i=1..n} (xi − zi)² )^(1/2)

(attribute-weighted Euclidean distance)

    d(x,z) = ( Σ_{i=1..n} wi·(xi − zi)² )^(1/2)

Distance-weighted NN learner (1)

› Consider NB(z) – the set of the k

training instances nearest to the test instance z

  • Each (nearest) instance has a

different distance to z

  • Should these (nearest) instances influence the classification/prediction of z equally? → No!

› Weight the contribution of each of the k neighbors according to their distance to z

  • Larger weight for nearer neighbors!


Distance-weighted NN learner (2)

› Let v denote a distance-based weighting function

  • Given a distance d(x,z) – the distance of x to z
  • v(x,z) is inversely proportional to d(x,z)

› For the classification task:

    c(z) = argmax_{cj∈C} Σ_{x∈NB(z)} v(x,z)·δ(cj, c(x)),  where δ(a,b) = 1 if (a = b), 0 if (a ≠ b)

› For the prediction task:

    f(z) = [ Σ_{x∈NB(z)} v(x,z)·f(x) ] / [ Σ_{x∈NB(z)} v(x,z) ]

› Select a distance-based weighting function, e.g.:

    v(x,z) = 1/(α + d(x,z))
    v(x,z) = 1/(α + [d(x,z)]²)
    v(x,z) = e^(−[d(x,z)]²/σ²)
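The classification rule with the first weighting function, v(x,z) = 1/(α + d(x,z)), can be sketched as follows (function and parameter names are illustrative):

```python
import math
from collections import defaultdict

def weighted_knn_classify(train, z, k=3, alpha=1.0):
    """Distance-weighted k-NN: each of the k nearest neighbors adds
    v(x,z) = 1/(alpha + d(x,z)) to the score of its class."""
    def dist(x):
        return math.sqrt(sum((xi - zi) ** 2 for xi, zi in zip(x, z)))
    neighbors = sorted(train, key=lambda xc: dist(xc[0]))[:k]  # NB(z)
    votes = defaultdict(float)
    for x, c in neighbors:
        votes[c] += 1.0 / (alpha + dist(x))
    return max(votes, key=votes.get)
```

A single very close neighbor can now outweigh two distant ones, unlike the unweighted majority vote.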

Lazy learning vs. Eager learning

› Lazy learning. The target function estimation (i.e., generalization)

is postponed until the test instance is introduced

  • E.g., Nearest neighbor learner, Locally weighted regression
  • Estimate (i.e., approximate) the target function locally and differently for each test

instance – i.e., performed at the classification/prediction time

  • Compute many local approximations of the target function
  • Typically take longer time to answer queries, and require more memory space

› Eager learning. The target function estimation is completed before

any test instance is introduced

  • E.g., Linear regression, Support vector machines, Neural networks, etc.
  • Estimate (i.e., approximate) the target function globally for the entire instance

space – i.e., performed at the training time

  • Compute a single (global) approximation of the target function

Nearest neighbor learner – When?

› Instances are represented as vectors in Rn
› The dimensionality of the input space is not large
› A large set of training instances is available

› Advantages

  • No training is needed (i.e., just store the training instances)
  • Scales well with a large number of classes

→ No need to learn n separate classifiers for n classes

  • k-NN (k >>1) learner is robust to noisy data

→ Classification/prediction is performed considering the k nearest neighbors

› Disadvantages

  • Distance function must be carefully chosen
  • Computational cost (in time and memory) at the classification/prediction time
  • May be misled by irrelevant attributes