SLIDE 1

Machine Learning

sanparith.marukatat@nectec.or.th

SLIDE 2

Today

  • Example of intelligent system: OCR
  • k-Nearest Neighbor Classifier
  • Generative model
  • Maximum likelihood
  • Naïve Bayes model
  • Gaussian model
SLIDE 3

What is OCR?

  • Optical Character Recognition

– Input: scanned images, photos, video frames
– Output: text file

  • Alternative input

– Electronic stylus, e.g. on a PDA
– Online handwriting recognition

SLIDE 4

OCR process (1)

  • Preprocessing

– Image enhancement, denoising, deskewing, ...
– Binarization

  • Layout analysis and character segmentation
  • Character recognition
  • Spell correction
SLIDE 5

[Figure: example of non-text content that should be removed]

SLIDE 6

[Figure: page layout with a picture and two text columns]

SLIDE 7

[Figure: a single segmented text line]

SLIDE 8

[Figure: a single segmented character]

SLIDE 9

OCR process (2)

  • Preprocessing uses image processing techniques
  • Layout analysis uses rules + some statistics
  • Character recognition

– Classifier (trained from a training corpus)
– Look-up table: class no. -> ASCII or Unicode code (given by the system designer)

  • Spell correction uses a dictionary + some statistics + some NLP techniques

SLIDE 10

How to build the recognition module? (1)

  • Technical choice

– All characters separated: Neural Network, SVM
– A few touching characters: classes for touching characters
– Some broken characters (e.g. อำ): classes for sub-characters
– Rule-based segmentation
– Many touching characters (e.g. Arabic, Urdu): 2D-HMM

SLIDE 11

How to build the recognition module? (2)

  • Normalize character images (reduce variation, get a fixed size)

  • Select features

– High-level features: e.g. head, tail

  • Contain more information
  • Cannot be reliably detected

– Low-level features: pixel colors, edges

  • Single feature is not meaningful
  • Can be easily detected
  • Can be improved: PCA, LDA, NLDA, ...
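As a concrete illustration of the low-level route, here is a minimal sketch (not from the slides): flatten fixed-size character images into pixel vectors, then improve them with PCA via scikit-learn. The image size, the random stand-in data, and the component count are all arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in corpus: 100 normalized 16x16 grayscale "images".
images = [np.random.rand(16, 16) for _ in range(100)]

# Low-level features: one vector of pixel values per image.
X = np.stack([img.reshape(-1) for img in images])   # shape (100, 256)

# Improve the raw pixel features with PCA, as the slide suggests.
pca = PCA(n_components=40)          # 40 components is an arbitrary choice
X_reduced = pca.fit_transform(X)    # shape (100, 40)
```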
SLIDE 12

How to build the recognition module? (3)

  • Design class
  • Build a feature extractor, e.g. a vector of pixel colors
  • Construct a training corpus

– 1 example = 1 vector and 1 class
– Very large number of examples
– Cover all conditions: dpi, fonts, font sizes, styles (e.g. slant, bold), writing styles, pen styles, with and without noise

SLIDE 13

How to build the recognition module? (4)

  • Building corpus

– Handwritten

  • Collect samples
  • Segment from forms or manual segmentation

– Printed

  • Print different fonts, font sizes, ...
  • Scan, scan of copy, ...
  • Time consuming
SLIDE 14

How to build the recognition module? (5)

  • Select a classifier

– Select tools

  • SNNS or fann for Neural Network
  • libsvm or svmlight for SVM
  • weka

– Format of the training corpus
– Parameters and their values
– How to use it in your code

SLIDE 15

What is Neural Network?

  • Biologically inspired multi-class classifier
  • Set of nodes with oriented connections

[Figures: Multi-Layer Perceptron, Diabolo network, Recurrent network]

SLIDE 16

Using neural network

  • Try MLP with 1 hidden layer first
  • 1 parameter = number of hidden nodes
  • Training with gradient descent
  • 1 training parameter = learning rate
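A minimal sketch of this recipe, using scikit-learn's MLPClassifier instead of the SNNS/fann tools mentioned earlier, and the small built-in digits dataset as a stand-in for a character corpus; the node count and learning rate are arbitrary values:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small built-in digits dataset standing in for a character corpus.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer; the two knobs from this slide are the number of
# hidden nodes and the learning rate of gradient descent.
mlp = MLPClassifier(hidden_layer_sizes=(64,),   # 64 hidden nodes (arbitrary)
                    solver='sgd',               # gradient-descent training
                    learning_rate_init=0.01,    # learning rate (arbitrary)
                    max_iter=500)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))
```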
SLIDE 17

What is SVM?

  • Linear classifier using the kernel trick, trained to trade off between error and generalization

– Linear classifier

  • Output is a linear combination of the input features
  • y = sign(w^T x)
  • Use multiple linear classifiers for the multi-class problem

– Kernel trick

  • Replace all dot products with a kernel function
  • K(x1,x2) = <g(x1), g(x2)> for some (possibly unknown) mapping g

SLIDE 18

Using SVM

  • Kernel function

– Radial Basis Function (RBF): K(x,y) = exp(-γ||x−y||^2)
– Polynomial: K(x,y) = (<x,y> + 1)^d

  • Tradeoff parameter C

– Small C: generalization is more important than error
– Large C: error is more important
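For reference, a minimal numpy sketch of the two kernels above; the gamma and d values used here are arbitrary:

```python
import numpy as np

# The two kernels from this slide, written out in numpy.
def rbf_kernel(x, y, gamma=0.5):
    # K(x,y) = exp(-gamma * ||x - y||^2)
    diff = x - y
    return np.exp(-gamma * np.dot(diff, diff))

def poly_kernel(x, y, d=3):
    # K(x,y) = (<x,y> + 1)^d
    return (np.dot(x, y) + 1.0) ** d

x, y = np.array([1.0, 2.0]), np.array([1.5, 1.0])
print(rbf_kernel(x, y), poly_kernel(x, y))
```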

SLIDE 19

Exercise

  • MNIST dataset with libsvm using the RBF kernel

  • SVM's parameters

– gamma = inverse of the area of influence around each example; try 0.005
– C = trade-off parameter between error on the training set and generalization; try 1000
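A possible starting point for this exercise, assuming scikit-learn (whose SVC wraps libsvm) and network access for the OpenML download; the subsample size is an arbitrary choice to keep training time short:

```python
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# MNIST via OpenML; scikit-learn's SVC wraps libsvm internally.
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X = X / 255.0   # scale pixels to [0, 1]

# Subsample: training an SVM on all 70k examples is slow.
X_train, X_test, y_train, y_test = train_test_split(
    X[:5000], y[:5000], test_size=0.2, random_state=0)

# RBF kernel with the values suggested in the exercise; they may need
# retuning depending on how the pixels are scaled.
clf = SVC(kernel='rbf', gamma=0.005, C=1000)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```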


SLIDE 22

How to build the recognition module? (6)

  • Multi-pass classifier

– Rough classification: upper vowel, mid-level character, lower vowel
– Fine classification: กถภ, ปฝฟ, ...
– Finer classification

SLIDE 23

1-Nearest Neighbor classifier

  • Prototype-based classifier, template-based classifier
  • Distance function
  • Useful when

– We have a very limited number of training examples and cannot train another classifier
– We have a large number of training examples and just want a baseline

  • What is the performance of this model?

– As n → ∞, the 1-NN error is less than twice the Bayes error

SLIDE 24

Bayes error

  • If we do know P(Class1|x),...,P(ClassM|x), then the Bayes classification rule f(x) = arg max_{i=1,...,M} P(Classi|x) produces the minimum possible expected error, called the Bayes error

[Figure: posteriors P(Y=1|x) and P(Y=2|x) plotted against x, with the overlap region marked as the expected error]

SLIDE 25

k-NN and Bayes classification rule

  • P(x) ≈ k / (NV)
  • N number of examples in training set
  • V volume around x
  • k number of points in V
  • P(x|y) ≈ ky / (NyV)
  • ky number of examples of class y in V
  • Ny number of examples of class y in training set
  • P(y) ≈ Ny/N
  • P(y|x) ≈ ky / k (why??)
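To answer the slide's "why??": substitute the three estimates above into Bayes' rule; the V and N factors cancel:

```latex
P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}
\approx \frac{\frac{k_y}{N_y V} \cdot \frac{N_y}{N}}{\frac{k}{N V}}
= \frac{k_y}{k}
```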
SLIDE 26

k-NN algorithm

  • Put the input object into the same class as most of its k nearest neighbors

  • Implementation

– Compute the distance between the input and each training example
– Sort in increasing order
– Count the number of instances from each class amongst the k nearest neighbors

  • There is no k which is always optimal
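The three implementation steps translate almost line for line into code; a minimal numpy sketch over a made-up 2D training set:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    # 1. Compute the distance between the input and every training example.
    dists = np.linalg.norm(X_train - x, axis=1)
    # 2. Sort in increasing order; keep the indices of the k nearest.
    nearest = np.argsort(dists)[:k]
    # 3. Count class occurrences among the k nearest neighbors and vote.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2D training set with two classes.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(np.array([0.2, 0.1]), X_train, y_train, k=3))  # -> 0
```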
SLIDE 27

Some useful distances

  • Norm-p distance: ||x − y||_p = (Σ_{i=1}^{n} |x_i − y_i|^p)^{1/p}
  • Mahalanobis distance: dist(x,y) = (x − y)^T Σ^{-1} (x − y)
  • Kernel-based distance: dist^2(x,y) = K(x,x) + K(y,y) − 2K(x,y)

– WHY?

SLIDE 28

Generative approach (1)

  • Solving classification problem = build P(class|input)
  • P(classi|input) = P(input|classi) P(classi) / P(input)
  • P(input) = Σi P(input|classi) P(classi)
  • P(classi) = percentage of examples from class i in the training set

  • Solving classification problem = build P(input|classi)
  • P(input|classi) = likelihood of classi
  • P(classi|input) = posterior probability of classi
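A tiny numeric illustration of Bayes' rule above, with made-up likelihood and prior values for three classes:

```python
import numpy as np

# Made-up numbers: likelihoods P(input|class_i) for three classes, and
# priors P(class_i) taken as class proportions in the training set.
likelihoods = np.array([0.02, 0.10, 0.01])   # P(input | class_i)
priors      = np.array([0.50, 0.30, 0.20])   # P(class_i)

# P(input) = sum_i P(input|class_i) P(class_i)
evidence = np.sum(likelihoods * priors)

# P(class_i|input) = P(input|class_i) P(class_i) / P(input)
posteriors = likelihoods * priors / evidence
print(posteriors, posteriors.argmax())       # class index 1 wins here
```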
SLIDE 29

Generative approach (2)

  • To build P(input|classi) we usually make an assumption about

– How the data from class i is distributed, e.g. Binomial, Gaussian, Mixture of Gaussians
– How the data is created, e.g. HMM

  • Example: Document classification

– How is each input, i.e. a document, represented?
– What is the likelihood model for these data?

SLIDE 30

Document classification (1)

  • Spam/not-spam
  • Document = set of words
  • Preprocessing

– word segmentation
– remove stop-words
– stemming
– word selection

SLIDE 31

Document classification (2)

  • Naïve Bayes assumption: all words are independent

  • P(w1,...,wn|Spam) = Πi P(wi|Spam)
  • Same hypothesis for all classes
  • How to compute P(wi|Spam), why?
  • What is the process of building a Naïve Bayes model for spam classification?
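One possible answer, as a minimal sketch: estimate P(wi|Spam) from word counts (here with add-one smoothing, an extra assumption not on the slide, so unseen words do not zero out the product) and compare log-posteriors. The toy corpus is invented.

```python
import math
from collections import Counter

# Toy corpus (invented): each document is a list of preprocessed words.
spam_docs = [["viagra", "win", "money"], ["win", "money", "now"]]
ham_docs  = [["meeting", "monday", "report"], ["report", "money", "project"]]

def word_counts(docs):
    return Counter(w for d in docs for w in d)

spam_counts, ham_counts = word_counts(spam_docs), word_counts(ham_docs)
spam_total, ham_total = sum(spam_counts.values()), sum(ham_counts.values())
vocab = set(spam_counts) | set(ham_counts)

def log_likelihood(words, counts, total):
    # Naive Bayes: P(w1,...,wn|class) = prod_i P(wi|class); in log space.
    # P(wi|class) is estimated from word counts, with add-one smoothing.
    return sum(math.log((counts[w] + 1) / (total + len(vocab)))
               for w in words)

def classify(words, prior_spam=0.5):
    s = math.log(prior_spam) + log_likelihood(words, spam_counts, spam_total)
    h = math.log(1 - prior_spam) + log_likelihood(words, ham_counts, ham_total)
    return "spam" if s > h else "not-spam"

print(classify(["win", "viagra"]))   # -> spam
```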

SLIDE 32

Maximum Likelihood (1)

  • x1,...,xN are i.i.d. according to P(x|θ), where θ is the parameter of this model

  • Q: What is the proper value for θ?
  • A: The value which gives maximum P(x1,...,xN|θ)
  • Q: We know P(x|θ), how to compute P(x1,...,xN|θ)?
  • A: P(x1,...,xN|θ) = Πi P(xi|θ), Why?
  • Q: How to find the maximum value?
  • Q: How to get rid of the product?
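For the last two questions: taking the logarithm turns the product into a sum without changing where the maximum is, and for a Bernoulli/Binomial model with m successes in N trials the maximum has a closed form:

```latex
\log P(x_1,\dots,x_N \mid \theta) = \sum_{i=1}^{N} \log P(x_i \mid \theta)

\log L(\theta) = m \log\theta + (N-m)\log(1-\theta),
\qquad
\frac{d \log L}{d\theta} = \frac{m}{\theta} - \frac{N-m}{1-\theta} = 0
\;\Longrightarrow\;
\hat{\theta}_{\mathrm{ML}} = \frac{m}{N}
```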
SLIDE 33

Maximum Likelihood (2)

  • Exercise: Binomial distribution

– Q: What does this mean? What is P(w|Spam)?
– word “viagra”
– {T, F, F, T, T, T, F, T, F, T}
– Find the proper parameter for P(w|Spam)
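Applying the maximum-likelihood formula above, with m = 6 occurrences of T out of N = 10 observations:

```latex
\hat{\theta}_{\mathrm{ML}} = \frac{6}{10} = 0.6
```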

SLIDE 34

Maximum Likelihood (3)

  • Exercise: coin-toss

– {H, T, H, H, T, H, H, H, T, H}
– Q: What is the parameter of the Binomial distribution that fits this data?
– Q: What is the conclusion?
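Likewise, 7 heads out of 10 tosses give the estimate below; the intended conclusion is presumably that with few samples ML simply reproduces the observed frequency, even for a coin we believe to be fair, which motivates the prior introduced on the next slide.

```latex
\hat{\theta}_{\mathrm{ML}} = \frac{7}{10} = 0.7
```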

SLIDE 35

Maximum A Posteriori

  • Sometimes, we have prior knowledge about the model, i.e. we have P(θ)

  • We search for maximum P(θ|x1,...,xN) instead
  • Q: How to compute P(θ|x1,...,xN) from P(θ) and P(x|θ)?
  • Exercise: coin-toss problem where θ is distributed as a Gaussian with mean 5/10 and standard deviation 1/10

– Q: What is the Gaussian model?
– Q: What is the proper value for θ?
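By Bayes' rule (the denominator does not depend on θ):

```latex
P(\theta \mid x_1,\dots,x_N) \propto P(\theta) \prod_{i=1}^{N} P(x_i \mid \theta),
\qquad
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}
\Big[ \log P(\theta) + \sum_{i=1}^{N} \log P(x_i \mid \theta) \Big]
```

For the coin-toss exercise, the Gaussian prior adds a penalty −(θ − 0.5)²/(2 · 0.1²) to the log-likelihood, pulling the estimate from the ML value 0.7 back toward 0.5; the resulting stationarity condition is a cubic equation, typically solved numerically.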

SLIDE 36

ML or MAP?

  • ML is good when we have a large enough dataset
  • MAP is preferred when we have little data
  • The prior can be estimated from data too