The bits the whirlwind tour left out ... BMVA Summer School 2016


Slide 1

The bits the whirlwind tour left out ...

BMVA Summer School 2016 – extra background slides (from teaching material at Durham University)

Slide 2

Machine Learning

Definition:

– “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

[Mitchell, 1997]

Slide 3

Algorithm to construct decision trees ….

Slide 4

Building Decision Trees – ID3

 node = root of tree

 Main loop:

A = “best” decision attribute for next node .....

But which attribute is best to split on?

Slide 5

Entropy in machine learning

Entropy: a measure of impurity

– S is a sample of training examples
– p⊕ is the proportion of positive examples in S
– p⊖ is the proportion of negative examples in S

Entropy measures the impurity of S:

$Entropy(S) = -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus}$
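As a small runnable sketch of this definition (illustrative Python, not from the slides):

```python
import math

def entropy(pos: int, neg: int) -> float:
    """Entropy of a sample with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:  # 0 log 0 is taken as 0 by convention
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy(9, 5))   # ~0.940 (the classic [9+, 5-] sample from Mitchell, 1997)
print(entropy(7, 7))   # 1.0 - maximally impure
print(entropy(14, 0))  # 0.0 - pure
```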

Slide 6

Information Gain – reduction in Entropy

Gain(S, A) = expected reduction in entropy due to splitting on attribute A

– i.e. expected reduction in impurity in the data
– (improvement in consistent data sorting)

Slide 7

Information Gain – reduction in Entropy

– reduction in entropy in the set of examples S if split on attribute A
– Sv = subset of S for which attribute A has value v

$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, Entropy(S_v)$

– i.e. Gain(S, A) = original entropy − weighted sum of the entropies of the sub-nodes created by splitting on A
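A corresponding Python sketch (illustrative, not the slides' code; examples are represented as attribute dicts):

```python
from collections import Counter
import math

def entropy(labels) -> float:
    """Multi-class entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain(rows, labels, attr) -> float:
    """Information gain of splitting `rows` (list of dicts) on `attr`."""
    g = entropy(labels)
    for v in set(r[attr] for r in rows):
        sub = [y for r, y in zip(rows, labels) if r[attr] == v]
        g -= (len(sub) / len(labels)) * entropy(sub)
    return g

rows = [{"wind": "weak"}, {"wind": "strong"}, {"wind": "weak"}, {"wind": "weak"}]
y = ["+", "-", "+", "-"]
print(gain(rows, y, "wind"))  # ~0.311
```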

Slide 8

Information Gain – reduction in Entropy

Information Gain:

– “information provided about the target function given the value of some attribute A”
– How well does A sort the data into the required classes?

Generalise to c classes:

– (not just ⊕ or ⊖)

EntropyS=−∑

i=1 c

pi log pi

Slide 9

Building Decision Trees

 Selecting the Next Attribute – which attribute should we split on next?

Slide 10

Building Decision Trees

 Selecting the Next Attribute – which attribute should we split on next?
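Putting the pieces together, a minimal ID3-style recursive tree builder might look like the sketch below (a hedged illustration, not the slides' code; the entropy and gain helpers repeat the earlier sketches so this runs standalone, and real implementations add stopping criteria, pruning and continuous-attribute handling):

```python
from collections import Counter
import math

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def gain(rows, labels, attr):
    g = entropy(labels)
    for v in set(r[attr] for r in rows):
        sub = [y for r, y in zip(rows, labels) if r[attr] == v]
        g -= (len(sub) / len(labels)) * entropy(sub)
    return g

def id3(rows, labels, attrs):
    """Grow a decision tree: nested dicts for nodes, class labels for leaves."""
    if len(set(labels)) == 1:            # pure node -> leaf
        return labels[0]
    if not attrs:                        # nothing left to split on -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))  # "best" = highest gain
    node = {best: {}}
    for v in set(r[best] for r in rows):
        branch = [(r, y) for r, y in zip(rows, labels) if r[best] == v]
        node[best][v] = id3([r for r, _ in branch], [y for _, y in branch],
                            [a for a in attrs if a != best])
    return node
```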

Slide 11

Boosting and Bagging …. + Forests

Slide 12

Learning using Boosting

Learning a boosted classifier (AdaBoost algorithm):

– Assign equal weight to each training instance
– For t iterations:
  • apply the learning algorithm to the weighted training set; store the resulting (weak) classifier
  • compute the classifier’s error e on the weighted training set
  • if e = 0 or e > 0.5: terminate classifier generation
  • for each instance in the training set: if classified correctly by the classifier, multiply the instance’s weight by e/(1−e)
  • normalise the weights of all instances

Classification using the boosted classifier:

– Assign weight = 0 to all classes
– For each of the t (or fewer) classifiers: for the class this classifier predicts, add −log(e/(1−e)) to that class’s weight
– Return the class with the highest weight

(e = error of the classifier on the training set)
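A hedged Python sketch of the two procedures above (not the slides' code; `learn_weak` is a placeholder for any weak learner that accepts instance weights and returns a classifier function):

```python
import math

def train_boosted(examples, labels, learn_weak, t=10):
    """AdaBoost-style training, following the slide's weighting scheme."""
    n = len(examples)
    weights = [1.0 / n] * n                      # equal initial weights
    ensemble = []                                # list of (classifier, error)
    for _ in range(t):
        clf = learn_weak(examples, labels, weights)
        preds = [clf(x) for x in examples]
        e = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
        if e == 0 or e > 0.5:                    # terminate classifier generation
            break
        ensemble.append((clf, e))
        # down-weight correctly classified instances by e/(1-e)
        weights = [w * e / (1 - e) if p == y else w
                   for w, p, y in zip(weights, preds, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]   # normalise
    return ensemble

def classify_boosted(ensemble, x):
    """Weighted vote: each classifier adds -log(e/(1-e)) to its predicted class."""
    votes = {}
    for clf, e in ensemble:
        label = clf(x)
        votes[label] = votes.get(label, 0.0) - math.log(e / (1 - e))
    return max(votes, key=votes.get)
```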

Slide 13

Learning using Boosting

 Some things to note:

– Weight adjustment means the (t+1)th classifier concentrates on the examples the tth classifier got wrong
– Each classifier must be able to achieve greater than 50% success
  • (i.e. error below 0.5 in the normalised error range {0..1})
– Results in an ensemble of t classifiers
  • i.e. a boosted classifier made up of t weak classifiers
  • boosting/bagging classifiers are often called ensemble classifiers
– Training error decreases exponentially (theoretically)
  • prone to over-fitting (need diversity in the test set)
  • several additions/modifications exist to handle this
– Works best with weak classifiers

Boosted Trees

– a set of t decision trees of limited complexity (e.g. depth) .....
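As an illustration of boosted shallow trees (not from the slides), a minimal sketch using scikit-learn, assuming it is installed; note the `estimator` keyword is named `base_estimator` in scikit-learn versions before 1.2:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# t = 50 weak classifiers, each a depth-2 tree (limited complexity)
boosted = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=2), n_estimators=50)
boosted.fit(X_train, y_train)
print("test accuracy:", boosted.score(X_test, y_test))
```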

Slide 14

Decision Forests (a.k.a. Random Forests/Trees)

Bagging using multiple decision trees, where each tree in the ensemble classifier ...

– is trained on a random subset of the training data
– computes a node split on a random subset of the available attributes

Each tree is grown as follows:

– Select a training set T' (size N) by randomly selecting (with replacement) N instances from the training set T
– Select a number m < M, where a subset of m attributes out of the available M attributes is used to compute the best split at a given node (m is constant across all trees in the forest)
– Grow each tree using T' to the largest extent possible, without any pruning

[Breiman 2001]
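For comparison, a decision-forest sketch using scikit-learn's RandomForestClassifier (an illustration, not the slides' code), which bootstraps N instances per tree and considers m = `max_features` attributes at each split:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=16, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each grown on a bootstrap sample of the training data (bagging),
# each split computed over m = 4 of the M = 16 attributes, no pruning.
forest = RandomForestClassifier(n_estimators=100, max_features=4,
                                bootstrap=True, random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```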

Slide 15

Backpropagation Algorithm ….

Slide 16

Backpropagation Algorithm

Assume we have:

– input examples d = {1 ... D}
  • each is a pair {xd, td} = {input vector, target vector}
– node index n = {1 … N}
– weight wji connects node j → i
– input xji is the input on the connection node j → i
  • corresponding weight = wji
– output error for node n is δn
  • similar to (o – t)

[Figure: network with input layer (input x), hidden layer (node indices {1 … N}) and output layer (output vector Ok)]

Slide 17

Backpropagation Algorithm

(1) Input example d

(2) Output layer error, based on:
– the difference between output and target, (t − o)
– the derivative of the sigmoid function

(3) Hidden layer error
– proportional to the node’s contribution to the output error

(4) Update weights wji
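For reference, the standard sigmoid-unit forms of steps (2)-(4), following Mitchell (1997) and using the slides' notation where possible, are:

$\delta_k = o_k (1 - o_k)(t_k - o_k)$  (output layer error, output unit k)

$\delta_h = o_h (1 - o_h) \sum_k w_{hk} \, \delta_k$  (hidden layer error, hidden unit h)

$w_{ji} \leftarrow w_{ji} + \eta \, \delta_j \, x_{ji}$  (weight update, learning rate η)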

Slide 18

Backpropagation

Termination criteria

– number of iterations reached
– or error below a suitable bound

[Figure: output layer error and hidden layer error equations; all weights updated using the relevant error]

Slide 19

Backpropagation

[Figure: network with input layer (input x), hidden layer unit h, output layer unit k and output vector Ok]

Slide 20

Backpropagation

δh is expressed as a weighted sum of the output layer errors δk to which it contributes (i.e. whk > 0)

[Figure: network with input layer (input x), hidden layer unit h, output layer unit k and output vector Ok]

Slide 21

Backpropagation

Error is propagated backwards from the network output ....

– to the weights of the output layer ....
– to the weights of the hidden layer …

Hence the name: backpropagation

[Figure: network with input layer (input x), hidden layer unit h, output layer unit k and output vector Ok]

Slide 22

Backpropagation

Repeat these stages for every hidden layer in a multi-layer network (using error δi where xji > 0) .......

[Figure: network with input layer (input x), hidden layer(s) unit h, output layer unit k and output vector Ok]

Slide 23

Backpropagation

Error is propagated backwards from the network output ....

– to the weights of the output layer ....
– over the weights of all N hidden layers …

Hence the name: backpropagation .......

[Figure: network with input layer (input x), hidden layer(s) unit h, output layer unit k and output vector Ok]

Slide 24

Backpropagation

Will perform gradient descent over the weight space of {wji}, for all connections i → j in the network.

Stochastic gradient descent

– as updates are based on training one sample at a time
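A minimal runnable sketch of stochastic-gradient backpropagation for one hidden layer (illustrative numpy code, not from the slides; the XOR data, layer sizes and learning rate are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy network for XOR: 2 inputs (+ bias) -> 3 sigmoid hidden units (+ bias) -> 1 sigmoid output
W_h = rng.normal(scale=0.5, size=(3, 3))   # hidden weights (incl. bias column)
W_o = rng.normal(scale=0.5, size=(1, 4))   # output weights (incl. bias column)
eta = 0.5                                  # learning rate

data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]

for epoch in range(10000):
    for x, t in data:                      # stochastic: one sample at a time
        x = np.append(x, 1.0)              # bias input
        t = np.asarray(t, dtype=float)
        # forward pass
        o_h = np.append(sigmoid(W_h @ x), 1.0)          # hidden outputs + bias
        o_k = sigmoid(W_o @ o_h)                        # network output
        # backward pass: output error, then hidden error
        d_k = o_k * (1 - o_k) * (t - o_k)
        d_h = o_h[:-1] * (1 - o_h[:-1]) * (W_o[:, :-1].T @ d_k)
        # weight updates: w <- w + eta * delta * input
        W_o += eta * np.outer(d_k, o_h)
        W_h += eta * np.outer(d_h, x)

for x, t in data:                          # outputs should approach 0, 1, 1, 0
    print(x, sigmoid(W_o @ np.append(sigmoid(W_h @ np.append(x, 1.0)), 1.0)))
```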

Slide 25

Future and current concepts

This is beyond the scope of this introductory tutorial, but the following are recommended as good places to start:

Convolutional Neural Networks

– http://deeplearning.net/tutorial/lenet.html

Deep Learning

– http://www.deeplearning.net/tutorial/

Slide 26

Understanding (and believing) the SVM stuff ….

Slide 27

Remedial Note: equations of 2D lines

Line: $\vec{w} \cdot \vec{x} + b = 0$, where $\vec{w}$ and $\vec{x}$ are 2D vectors.

– b: offset from origin
– $\vec{w}$: normal to line

2D LINES REMINDER

Slide 28

Remedial Note: equations of 2D lines

http://www.mathopenref.com/coordpointdisttrig.html

2D LINES REMINDER

Slide 29

Remedial Note: equations of 2D lines

For a defined line equation ($\vec{w}$ and b fixed), insert a point $\vec{x}$ into the equation …...

– Result is +ve if the point is on one side of the line (i.e. > 0)
– Result is -ve if the point is on the other side of the line (< 0)
– Result is the distance (+ve or -ve) of the point from the line, given by:

$d = \frac{\vec{w} \cdot \vec{x} + b}{\|\vec{w}\|}$

for: $\|\vec{w}\| = \sqrt{w_1^2 + w_2^2}$ ($\vec{w}$ = normal to line)

2D LINES REMINDER
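A quick numeric check (illustrative Python, not from the slides):

```python
import numpy as np

w = np.array([3.0, 4.0])   # normal to the line
b = -5.0                   # offset from origin

def signed_distance(x):
    """Signed distance of point x from the line w.x + b = 0."""
    return (w @ x + b) / np.linalg.norm(w)

print(signed_distance(np.array([3.0, 4.0])))   # +4.0 (on the +ve side)
print(signed_distance(np.array([0.0, 0.0])))   # -1.0 (on the -ve side)
```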

Slide 30

Linear Separator

Instances (i.e. examples) {xi, yi}

– xi = point in instance space ($R^n$) made up of n attributes
– yi = class value for classification of xi
  • classification of an example: function f(x) = y = {+1, −1}, i.e. 2 classes

Want a linear separator; N.B. we have a vector of weight coefficients $\vec{w}$. Can view this as a constraint satisfaction problem:

$\vec{w} \cdot \vec{x}_i + b \ge +1$ for $y_i = +1$
$\vec{w} \cdot \vec{x}_i + b \le -1$ for $y_i = -1$

Equivalently,

$y_i (\vec{w} \cdot \vec{x}_i + b) \ge 1$

Slide 31

Linear Separator

If we define the distance of the nearest point to the margin as 1, the width of the margin is $\frac{2}{\|\vec{w}\|}$ (i.e. equal width on each side).

We thus want to maximize $\frac{2}{\|\vec{w}\|}$, finding the parameters $\vec{w}$ and b, subject to $y_i(\vec{w} \cdot \vec{x}_i + b) \ge 1$, for the 2-class classification function f(x) = y = {+1, −1}.

Slide 32

which is equivalent to minimizing:

$\frac{1}{2}\|\vec{w}\|^2$ subject to $y_i(\vec{w} \cdot \vec{x}_i + b) \ge 1$

Slide 33

…............. back to main slides

Slide 34

So …. find the “hyperplane” (i.e. boundary) with:

a) maximum margin
b) minimum number of (training) examples on the wrong side of the chosen boundary (i.e. minimal penalties due to C)

Solve via optimization (in polynomial time/complexity).
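Standard libraries package this optimization; a minimal sketch (not from the slides) using scikit-learn's SVC with a linear kernel, where C penalises training examples on the wrong side of the boundary:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny 2-class, 2D toy set; labels y in {+1, -1}
X = np.array([[1.0, 1.0], [2.0, 1.5], [2.5, 2.0],
              [4.0, 4.0], [5.0, 4.5], [5.5, 5.0]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)               # the learned w and b
print(clf.predict([[1.5, 1.0], [5.0, 5.0]]))   # -> [-1, +1]
```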

Slide 35

Example: non-linear separation (red/blue data items on a 2D plane).

– Kernel projection to a higher dimensional space
– Find a hyperplane separator (a plane in 3D) via optimization
– The non-linear boundary in the original dimension (e.g. a circle in 2D) is defined by the planar boundary (cut) in 3D

[Figure: 2D data projected to 3D and separated by a plane]

Slide 36

Non Linear SVMs

Suppose we have instance space X = $R^n$ and need a non-linear separator.

– project X into some higher dimensional space X' = $R^m$ where the data will be linearly separable
– let $\Phi : X \rightarrow X'$ be this projection

Interestingly,

– Training depends only on dot products of the form $\Phi(x_i) \cdot \Phi(x_j)$
  • i.e. dot products, computable from the instances in $R^n$
– So we can train in $R^m$ with the same computational complexity as in $R^n$, provided we can find a kernel function K such that:
  • $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$
(the kernel trick)

– Classifying a new instance x now requires calculating the sign of:

$f(x) = \mathrm{sign}\!\left(\sum_i \alpha_i y_i K(x_i, x) + b\right)$

(where the $\alpha_i$ are the learned coefficients, non-zero only for the support vectors)
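A quick numeric check of the kernel trick (illustrative, not from the slides): for the quadratic kernel K(a, b) = (a · b)², the explicit projection Φ(a) = (a₁², √2·a₁a₂, a₂²) into $R^3$ gives exactly the same dot product while only ever computing in $R^2$:

```python
import math
import numpy as np

def phi(a):
    """Explicit projection R^2 -> R^3 for the quadratic kernel."""
    return np.array([a[0] ** 2, math.sqrt(2) * a[0] * a[1], a[1] ** 2])

def K(a, b):
    """Quadratic kernel (a . b)^2, computed entirely in R^2."""
    return float(a @ b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.5])

print(K(a, b))                 # 16.0
print(float(phi(a) @ phi(b)))  # 16.0 - same value, without forming R^3 explicitly
```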

Slide 37

.... but it is all about the data!

Slide 38

Desirable Data Properties

Machine learning is a data-driven approach. The data is important! Ideally, the training/testing data used for learning must be:

– Unbiased

  • towards any given subset of the space of examples ...

– Representative

  • of the “real-world” data to be encountered in use/deployment

– Accurate

  • inaccuracies in training/testing data produce inaccurate results

– Available

  • the more training/testing data available the better the results
  • greater confidence in the results can be achieved
Slide 39

Data Training Methodologies

Simple approach: Data Splits

– split the overall data set into separate training and test sets
  • No established rule, but 80%:20%, 70%:30% or ⅔:⅓ training-to-testing splits are common
– Train on one, test on the other
– Test error = error on the test set
– Training error = error on the training set
– Weakness: susceptible to bias in the data sets or “over-fitting”
  • Also, less data is available for training
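A minimal sketch of such a split (illustrative scikit-learn code, not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 70%:30% training-to-testing split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)
print("training error:", 1 - clf.score(X_train, y_train))
print("test error:    ", 1 - clf.score(X_test, y_test))
```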
Slide 40

Data Training Methodologies

More advanced (and robust): K-fold Cross Validation

– Randomly split (all) the data into k subsets
– For 1 to k:
  • train using all the data not in the kth subset
  • test the resulting learned [classifier|function …] using the kth subset
– Report the mean error over all k tests
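A minimal sketch (illustrative scikit-learn code, not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# k = 10 cross-validation: 10 train/test rounds, one per held-out subset
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=10)
print("mean error:", 1 - scores.mean())
```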

Slide 41

Key Summary Statistics #1

tp = true positive / tn = true negative
fp = false positive / fn = false negative

Often quoted or plotted when comparing ML techniques
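The formulas themselves did not survive extraction; assuming the slide listed the usual measures, the standard definitions in terms of tp, tn, fp and fn are:

$\text{accuracy} = \frac{tp + tn}{tp + tn + fp + fn}$

$\text{precision} = \frac{tp}{tp + fp}$

$\text{recall (true positive rate)} = \frac{tp}{tp + fn}$

$\text{false positive rate} = \frac{fp}{fp + tn}$

$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$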

Slide 42

Kappa Statistic

A measure of the classification of “N items into C mutually exclusive categories”

– Pr(a) = probability of success of classification (= accuracy)
– Pr(e) = probability of success due to chance
  • e.g. 2 categories = 50% (0.5), 3 categories = 33% (0.33) ….. etc.
– Pr(e) can be replaced with Pr(b) to measure agreement between classifiers/techniques a and b

[Cohen, 1960]
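The formula itself did not survive extraction; for reference, Cohen's kappa combines these terms as:

$\kappa = \frac{\Pr(a) - \Pr(e)}{1 - \Pr(e)}$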