

SLIDE 1
SLIDE 2

SVM: Algorithms of Choice for Challenging Data

Boriana Milenova, Joseph Yarmus, Marcos Campos
Data Mining Technologies, ORACLE Corp.

SLIDE 3

Overview

SVM theoretical framework

ORACLE data mining technology

– SVM parameter estimation
– SVM optimization strategy

SVM on challenging data

SLIDE 4

SVM Model Defines a Hyperplane

Linear models in feature space

Hyperplane defined by a set of coefficients and a bias term:

$f(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$
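As a minimal sketch of the decision rule, with an illustrative weight vector and bias (the values of `w` and `b` below are made up, not from the slides):

```python
import numpy as np

# Illustrative hyperplane f(x) = w . x + b (w and b are made-up values)
w = np.array([2.0, -1.0])
b = 0.5

def f(x):
    """Decision function: signed distance from the hyperplane, scaled by ||w||."""
    return np.dot(w, x) + b

def predict(x):
    """Predicted class is the sign of the decision function."""
    return 1 if f(x) >= 0 else -1

print(predict(np.array([1.0, 1.0])))   # f = 1.5 -> +1
print(predict(np.array([-1.0, 1.0])))  # f = -2.5 -> -1
```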

SLIDE 5

Maximum Margin Models

Functional margin: $\min_i \, y_i f(\mathbf{x}_i)$

Geometric margin (normalized): $\min_i \, y_i f(\mathbf{x}_i) = 1 / \|\mathbf{w}\|$

Maximizing the margin $\Leftrightarrow$ minimizing $\|\mathbf{w}\|$

The examples closest to the hyperplane are the support vectors.

SLIDE 6

SVM Optimization Problem

Minimize $\|\mathbf{w}\|$ subject to $y_i f(\mathbf{x}_i) \geq 1$

Lagrangian in primal space:

$L_p = \tfrac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i \left[ y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right]$

Setting the derivatives to zero:

$\frac{\partial L_p}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i$

$\frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0$

SLIDE 7

Duality

Lagrangian in dual space:

$L_D = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, \mathbf{x}_i \cdot \mathbf{x}_j$

subject to $\alpha_i \geq 0$ and $\sum_i \alpha_i y_i = 0$

Dot products!

– dimension-insensitive optimization
– generalized dot products via a non-linear map $\Phi$:

$K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i) \cdot \Phi(\mathbf{x}_j)$

SLIDE 8

Towards Higher Dimensionality via Kernels

1. Transform data via a non-linear mapping $\Phi$ to an inner product feature space
2. Train a linear machine in the new feature space

Mercer’s kernels:

– symmetry: $K(\mathbf{x}_i, \mathbf{x}_j) = K(\mathbf{x}_j, \mathbf{x}_i)$
– positive semi-definite kernel matrix
– reproducing property: $\langle K(\mathbf{x}_i, \cdot), K(\mathbf{x}_j, \cdot) \rangle = K(\mathbf{x}_i, \mathbf{x}_j)$
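The Mercer conditions can be checked numerically for a concrete kernel. A sketch with the Gaussian kernel on random data (the kernel choice, data, and tolerance here are my own, for illustration):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
K = gaussian_kernel_matrix(X)

print(np.allclose(K, K.T))                    # symmetry -> True
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # positive semi-definite -> True
```

The Gaussian kernel is also bounded in (0, 1], which is the "good numeric properties" point made later in the deck.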

SLIDE 9

Soft Margin: Non-Separable Data

$L_p = \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_k \xi_k$

subject to $y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \geq 1 - \xi_i$, $\xi_i \geq 0$

The capacity parameter C trades off complexity and empirical risk.

SLIDE 10

1-Norm Dual Problem

Lagrangian in dual space:

$L_D = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$

subject to $0 \leq \alpha_i \leq C$ and $\sum_i \alpha_i y_i = 0$

Quadratic problem

– linear equality and inequality constraints

SLIDE 11

SVM Regression

$L_p = \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_k (\xi_k + \hat{\xi}_k)$

subject to

$(\mathbf{w} \cdot \mathbf{x}_i + b) - y_i \leq \epsilon + \xi_i$

$y_i - (\mathbf{w} \cdot \mathbf{x}_i + b) \leq \epsilon + \hat{\xi}_i$

$\xi_i, \hat{\xi}_i \geq 0$
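The two slack constraints amount to the epsilon-insensitive loss: residuals inside the epsilon tube are free, those outside pay linearly. A minimal sketch (function name and sample values are my own):

```python
import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Zero inside the epsilon tube; linear penalty |r| - eps outside it."""
    return np.maximum(np.abs(y_true - y_pred) - eps, 0.0)

r = eps_insensitive_loss(np.array([1.0, 1.0, 1.0]),
                         np.array([1.05, 1.3, 0.7]))
print(r)  # the first residual (0.05) is inside the tube and costs nothing
```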

SLIDE 12

SVM Fundamental Properties

Convexity

– single global minimum

Regularization

– trades off structural and empirical risk to avoid overfitting

Sparse solution

– usually only a fraction of the training data become support vectors

Not probabilistic

Solvable in polynomial time…

SLIDE 13

SVM in the Database

ORACLE Data Mining (ODM)

– commercial SVM implementation in the database
– product targets application developers and data mining practitioners
– focuses on ease of use and efficiency

Challenges:

– effective and inexpensive parameter tuning
– computationally efficient SVM model optimization
SLIDE 14

SVM Out-Of-The-Box

Inexperienced users can get dramatically poor results.

LIBSVM examples:

                               Vehicle   Bioinformatics   Astroparticle Physics
Out-of-the-box correct rate     0.02        0.57               0.67
After tuning correct rate       0.88        0.79               0.97

SLIDE 15

SVM Parameter Tuning

Grid search (+ cross-validation or generalization error estimates)

– naive
– guided (Keerthi & Lin, 2002)

Parameter optimization

– gradient descent (Chapelle et al., 2001)

Heuristics

SLIDE 16

ODM On-the-Fly Estimates

Standard deviation for Gaussian kernel

– single kernel parameter
– kernel has good numeric properties (bounded, no overflow)

Capacity

– key to good classification generalization

Epsilon estimate for regression

– key to good regression generalization

SLIDE 17

ODM Standard Deviation Estimate

Goal: estimate the distance between classes

1. Pick random pairs from opposite classes
2. Measure distances
3. Order descending
4. Exclude tail (90th percentile)
5. Select minimum distance
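The steps above might look like this in code. This is a sketch: the pair count, the synthetic test data, and the reading of "exclude tail" as dropping the smallest 10% of the descending list are my assumptions.

```python
import numpy as np

def estimate_sigma(X, y, n_pairs=50, seed=0):
    """Sketch of the slide's sigma heuristic: sample opposite-class pairs,
    drop the tail of the sorted distance list, return the minimum survivor."""
    rng = np.random.default_rng(seed)
    pos, neg = X[y == 1], X[y == -1]
    i = rng.integers(0, len(pos), n_pairs)
    j = rng.integers(0, len(neg), n_pairs)
    d = np.linalg.norm(pos[i] - neg[j], axis=1)
    d = np.sort(d)[::-1]             # order descending
    d = d[: int(len(d) * 0.9)]       # exclude tail (90th percentile)
    return d[-1]                     # select minimum remaining distance

# Two degenerate clusters: every opposite-class distance is sqrt(18)
X = np.vstack([np.zeros((30, 2)), np.full((30, 2), 3.0)])
y = np.array([1] * 30 + [-1] * 30)
print(estimate_sigma(X, y))  # sqrt(18) ~ 4.24
```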
SLIDE 18

ODM Capacity Estimate

Goal: allocate sufficient capacity to separate typical examples

1. Pick m random examples per class
2. Compute $y_i f(\mathbf{x}_i)$ assuming $\alpha = C$:

$f(\mathbf{x}_i) = \sum_{j=1}^{2m} C \, y_j K(\mathbf{x}_i, \mathbf{x}_j)$

3. Exclude noise (incorrect sign)
4. Scale C so that non-bounded support vectors lie on the margin, $y_i f(\mathbf{x}_i) = 1$:

$C = 1 \Big/ \left( y_i \sum_{j=1}^{2m} y_j K(\mathbf{x}_i, \mathbf{x}_j) \right)$

5. Order descending
6. Exclude tail (90th percentile)
7. Select minimum value
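A sketch of the capacity heuristic in code. The Gaussian kernel, the kernel width, the sample size, and the tail handling are my assumptions; only the overall recipe follows the slide.

```python
import numpy as np

def estimate_capacity(X, y, m=10, sigma=1.0, seed=0):
    """Sketch of the slide's C heuristic: evaluate y_i * f(x_i) with all
    alphas set to a common C, drop noisy examples, rescale C so typical
    examples sit on the margin, and take a robust minimum."""
    rng = np.random.default_rng(seed)
    idx = np.concatenate([rng.choice(np.flatnonzero(y == c), m) for c in (1, -1)])
    cs = []
    for i in idx:
        k = np.exp(-np.sum((X[idx] - X[i])**2, axis=1) / (2 * sigma**2))
        s = y[i] * np.sum(y[idx] * k)      # y_i * f(x_i) with C = 1
        if s > 0:                          # exclude noise (incorrect sign)
            cs.append(1.0 / s)             # scale C so that y_i * f(x_i) = 1
    cs = np.sort(cs)[::-1]                 # order descending
    cs = cs[: max(1, int(len(cs) * 0.9))]  # exclude tail (90th percentile)
    return cs[-1]                          # select minimum value

# Well-separated synthetic clusters give a small, stable C
X = np.vstack([np.zeros((30, 2)), np.full((30, 2), 3.0)])
y = np.array([1] * 30 + [-1] * 30)
print(estimate_capacity(X, y))
```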

SLIDE 19

Some Comparison Numbers

LIBSVM examples (correct rate):

                        Vehicle   Bioinformatics   Astroparticle Physics
Out-of-the-box           0.02        0.57               0.67
Grid search + xval       0.88        0.85               0.97
On-the-fly estimates     0.71        0.84               0.97

SLIDE 20

ODM Epsilon Estimate

Goal: estimate target noise by fitting a preliminary model

1. Pick m random examples
2. Train an SVM model with an initial $\epsilon$
3. Compute residuals on the remaining data
4. Scale: $\epsilon_{t+1} = \left( \sum_i r_i^2 \,/\, n \right)^{1/2}$
5. Retrain

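The scaling step can be sketched in a couple of lines. Note the RMS-of-residuals form of the update is a reconstruction from a garbled slide equation, not a documented ODM formula:

```python
import numpy as np

def scale_epsilon(residuals):
    """Set the new epsilon to the RMS of the held-out residuals
    (reconstructed form of the slide's scaling step)."""
    r = np.asarray(residuals, dtype=float)
    return float(np.sqrt(np.mean(r**2)))

print(scale_epsilon([0.1, -0.1, 0.1, -0.1]))  # RMS of the residuals
```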
SLIDE 21

Comparison Numbers: Regression

RMSE                    Pumadyn   Computer activity   Boston housing
Grid search              0.02        0.33                6.26
On-the-fly estimates     0.02        0.35                6.57

SLIDE 22

Optimization Approaches

QP solvers

– MINOS, LOQO, quadprog (Matlab)

Gradient descent methods

– sequentially update one $\alpha$ coefficient at a time

Chunking and decomposition

– optimize small “working sets” towards the global solution

– analytic solution possible (SMO - Platt, 1998)

SLIDE 23

Chunking strategy

/* WS: working set */
select initial WS randomly;
while (violations) {
    solve QP on WS;
    select new WS;
}

SLIDE 24

ODM Working Set Selection

Avoid oscillations

– overlap across chunks
– retain non-bounded support vectors

Choose among violators

– add large violators

Computational efficiency

– avoid sorting

SLIDE 25

Who to Retain?

/* Examine previous working set */
if (non-bounded sv < 50%) {
    retain all non-bounded sv;
    add other randomly selected up to 50%;
} else {
    randomly select non-bounded sv;
}

SLIDE 26

Who to Add?

create violator list;
/* Scan I - pick largest violators */
while (new examples < 50% AND WS not full) {
    if (violation > avg_violation) add to WS;
}
/* Scan II - pick other violators */
while (new examples < 50% AND WS not full) {
    add randomly selected violators to WS;
}
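The two selection rules can be sketched together in Python. The 50% thresholds and the two-scan structure follow the slides; the data structures (index lists, a dict of violation magnitudes) and helper name are my own:

```python
import random

def select_working_set(prev_ws, nonbound_sv, violations, ws_size, seed=0):
    """Retain non-bounded SVs from the previous chunk for overlap, then fill
    the rest with violators: large ones first, then random ones (no sorting)."""
    rng = random.Random(seed)
    # -- Who to retain? --
    nb = [i for i in prev_ws if i in nonbound_sv]
    if len(nb) < ws_size // 2:
        others = [i for i in prev_ws if i not in nonbound_sv]
        rng.shuffle(others)
        ws = nb + others[: ws_size // 2 - len(nb)]
    else:
        rng.shuffle(nb)
        ws = nb[: ws_size // 2]
    # -- Who to add? --
    pool = [i for i in violations if i not in ws]
    avg = sum(violations.values()) / len(violations) if violations else 0.0
    ws += [i for i in pool if violations[i] > avg][: ws_size - len(ws)]  # Scan I
    rest = [i for i in pool if i not in ws]
    rng.shuffle(rest)
    ws += rest[: ws_size - len(ws)]                                      # Scan II
    return ws

ws = select_working_set([0, 1, 2, 3], {0, 1}, {4: 5.0, 5: 0.1, 6: 0.2}, 4)
print(ws)  # the large violator (index 4) is always picked up in Scan I
```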

SLIDE 27

SVM in Feed-Forward Framework

$f(\mathbf{x}) = \sum_j \alpha_j y_j K(\mathbf{x}, \mathbf{x}_j) + b$

– hidden units: $K(\mathbf{x}, \mathbf{x}_j)$, one per support vector
– output layer weights: $\alpha_j y_j$
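The network view above, as code: one "hidden unit" per support vector, combined by the output weights. A sketch with a Gaussian kernel and a made-up two-support-vector model:

```python
import numpy as np

def svm_output(x, sv, sv_y, alpha, b, sigma=1.0):
    """Hidden layer: K(x, x_j) for each support vector x_j.
    Output layer: weighted sum with weights alpha_j * y_j, plus bias b."""
    k = np.exp(-np.sum((sv - x)**2, axis=1) / (2 * sigma**2))
    return float(np.sum(alpha * sv_y * k) + b)

# Made-up model: one positive and one negative support vector
sv = np.array([[0.0, 0.0], [2.0, 2.0]])
sv_y = np.array([1.0, -1.0])
alpha = np.array([1.0, 1.0])
b = 0.0

print(svm_output(np.array([0.0, 0.0]), sv, sv_y, alpha, b) > 0)  # True
print(svm_output(np.array([2.0, 2.0]), sv, sv_y, alpha, b) < 0)  # True
```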

SLIDE 28

DOF in Neural Nets / RBF

SLIDE 29

DOF in SVM

SLIDE 30

SVM vs. Neural Net / RBF

                   NN / RBF   SVM
Compact model         ✓        –
Global minimum        –        ✓
Regularization        –        ✓

SLIDE 31

Text Mining

Domain characteristics:

– thousands of features
– hundreds of topics
– sparse data

(example topics: Science, Sport, Art)

SLIDE 32

SVM in Text Mining

Reuters corpus: ~10K documents, ~10K terms, 115 classes

Accuracy (recall/precision breakeven point):

SVM non-linear   0.86
SVM linear       0.84
K-NN             0.82
C4.5             0.79
Rocchio          0.80
Naive Bayes      0.72

Joachims, 1998

SLIDE 33

Biomining

Domain characteristics:

– thousands of features
– very few data points
– dense data (e.g., microarray data)

SLIDE 34

SVM on Microarray Data

Multiple tumor types: 144 samples, 16063 genes, 14 classes

Accuracy (correct rate):

SVM linear        0.78
K-NN              0.68
Weighted voting   0.62
Naive Bayes       0.43

Ramaswamy et al., 2001

SLIDE 35

Other domains

High dimensionality problems:

– image (color and texture histograms)
– satellite remote sensing
– speech

Linear kernels sufficient in most cases

– data separability
– single parameter tuning (capacity)
– small model size

SLIDE 36

Final Note

SVM classification and regression algorithms are available in the ORACLE 10G database.

Two APIs:

– JAVA (J2EE) – PL/SQL

SLIDE 37

References

Chapelle, O., Vapnik, V., Bousquet, O., & Mukherjee, S. (2001). Choosing Multiple Parameters for Support Vector Machines.

Hsu, C., Chang, C., & Lin, C. (2003). A Practical Guide to Support Vector Classification.

Joachims, T. (1998). Text Categorization with Support Vector Machines: Learning with Many Relevant Features.

Keerthi, S. & Lin, C. (2002). Asymptotic Behaviors of Support Vector Machines with Gaussian Kernel.

Platt, J. (1998). Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines.

Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J., Poggio, T., Gerald, W., Loda, M., Lander, E., & Golub, T. (2001). Multi-Class Cancer Diagnosis Using Tumor Gene Expression Signatures.
