Support Vector Machines for Classification of Flow Data (PowerPoint Presentation)


SLIDE 1

Support Vector Machines for Classification of Flow Data

Funded by SBIR Grant # R43 RR024094-01A1

FlowCap 2010

John Quinn, Ph.D., TreeStar (john@treestar.com)

SLIDE 2

Our Objective

  • Demonstrate that supervised training algorithms can effectively replicate user-created gates

    – Very useful for high-throughput settings
    – Can increase robustness

  • We believe this will be the first application in which algorithmic gate placement becomes the norm.

SLIDE 3

Selected Algorithm

  • Support Vector Machine (SVM)

    – Radial kernel

  • A supervised linear classifier that solves an optimization problem to find the hyperplane(s) that separate classes with the maximum distance between classes

    – With a non-linear mapping, data that is not linearly separable can be classified
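The selected setup can be sketched in code. This is a minimal illustration using scikit-learn's SVC with a radial (RBF) kernel, not the MATLAB implementation the talk actually used; the ring-shaped toy data and the gamma/C values are made up for demonstration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class "gate" data: class 0 clustered near the origin,
# class 1 in a ring around it -- not linearly separable.
rng = np.random.default_rng(0)
inner = rng.normal(0.0, 0.3, size=(50, 2))             # class 0
angles = rng.uniform(0, 2 * np.pi, size=50)
outer = np.c_[2 * np.cos(angles), 2 * np.sin(angles)]  # class 1
X = np.vstack([inner, outer])
y = np.array([0] * 50 + [1] * 50)

# Radial (Gaussian) kernel, as chosen in the talk; gamma and C are
# illustrative choices, not the talk's settings.
clf = SVC(kernel="rbf", gamma=1.0, C=10.0)
clf.fit(X, y)

# Expect class 0 for a point near the origin, class 1 for a ring point.
print(clf.predict([[0.1, 0.0], [2.0, 0.0]]))
```

The radial kernel lets the linear machinery separate the ring from the core without any explicit feature engineering.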

SLIDE 4

SVM Operation

Optimization:

  • Determine which elements of the training data mark the boundary of maximum distance D between two classes

[Figure: support vectors from Class 1 and Class 2 marking the maximum separation D]

SLIDE 5

SVM Operation

  • Optimization problem

For data (x_i, c_i), i = 1, …, n, with class labels c_i ∈ {−1, 1}, a hyperplane that separates the two classes can be defined as w·x − b = 0, with:

  w·x_i − b ≥ 1  for c_i = 1
  w·x_i − b ≤ −1  for c_i = −1

Knowing that the data points should be outside of the margin, we can impose the combined constraint c_i(w·x_i − b) ≥ 1.

SLIDE 6

SVM Operation

We know that the support vectors will have a perpendicular distance from the hyperplane of 1/‖w‖ on either side. The distance between support vectors across the margin can then be expressed as D = 2/‖w‖, so the optimization is the minimization of ‖w‖.
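The margin relations above can be checked numerically. A minimal sketch with made-up numbers (the hyperplane w, b and the labeled points are illustrative, not from the talk's data):

```python
import numpy as np

# Separating hyperplane w.x - b = 0 for a toy problem.
w = np.array([2.0, 0.0])
b = 0.0

# Points with labels c_i in {-1, +1}; the support vectors sit exactly
# on the margin, where c_i * (w.x_i - b) == 1.
X = np.array([[0.5, 0.0], [1.5, 0.3], [-0.5, 0.0], [-2.0, 1.0]])
c = np.array([1, 1, -1, -1])

# The combined constraint c_i (w.x_i - b) >= 1 holds for every point.
margins = c * (X @ w - b)
print(margins)  # the two support vectors achieve exactly 1

# Each support vector lies at perpendicular distance 1/||w|| from the
# hyperplane, so the total separation is D = 2/||w||.
D = 2.0 / np.linalg.norm(w)
print(D)
```

Minimizing ‖w‖ under the constraint is exactly what widens D.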

SLIDE 7

SVM Operation

We then use the inequality c_i(w·x_i − b) ≥ 1 as a constraint to fix a critical point, and use Lagrange multipliers α_i to express w as a linear combination of the training vectors:

  w = Σ_i α_i c_i x_i

The N_SV support vectors are then the x_i associated with non-zero Lagrange multipliers.

SLIDE 8

SVM Operation

Once w is known, and the support vectors have been identified, b can be solved from any support vector x_i as:

  b = w·x_i − c_i

  • If there are more than two classes, the operation remains the same, but the hyperplanes are determined either as one-versus-all or pairwise
  • We chose a one-versus-all format
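The recovery of w and b from the dual solution, and the one-versus-all decision, can be sketched as follows. The support vectors, labels, and multiplier values are illustrative numbers, not output of a real solver:

```python
import numpy as np

# After solving the dual, w = sum_i alpha_i c_i x_i over the support
# vectors, and b = w.x_i - c_i for any support vector x_i.
sv = np.array([[1.0, 1.0], [-1.0, -1.0]])  # support vectors x_i
c = np.array([1.0, -1.0])                  # their labels c_i
alpha = np.array([0.25, 0.25])             # Lagrange multipliers (made up)

w = (alpha * c) @ sv   # w = sum_i alpha_i c_i x_i
b = w @ sv[0] - c[0]   # b from the first support vector

def predict(x):
    """Binary decision: which side of the hyperplane x falls on."""
    return np.sign(w @ x - b)

print(predict(np.array([2.0, 3.0])))    # +1 side
print(predict(np.array([-1.0, -2.0])))  # -1 side

# One-versus-all with K classes: train K binary machines ("class k vs
# the rest") and pick the class with the largest decision value.
def predict_ova(x, ws, bs):
    scores = [w_k @ x - b_k for w_k, b_k in zip(ws, bs)]
    return int(np.argmax(scores))

ws = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, -1.0])]
bs = [0.0, 0.0, 0.0]
print(predict_ova(np.array([2.0, 0.5]), ws, bs))  # class 0 scores highest here
```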
SLIDE 9

SVM Operation

  • Data not linearly separable? Map it to a space where it is!

    – We assume that flow data will have a Gaussian distribution and selected a Gaussian mapping

[Figure: data in the input space vs. the mapped space]
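The Gaussian mapping is used implicitly through its kernel: the SVM only ever needs inner products in the mapped space, which the kernel supplies directly. A minimal sketch (gamma is an illustrative choice), using XOR as the classic example of data that is not linearly separable in the input space:

```python
import numpy as np

def gaussian_kernel(x, y, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2): the inner product in the
    mapped space, computed without ever forming the mapping itself."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

# XOR layout: opposite corners share a class.
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])

# Kernel (Gram) matrix: all the SVM sees of the mapped space.
K = np.array([[gaussian_kernel(a, b) for b in X] for a in X])
print(K.round(3))
# k(x, x) = 1 on the diagonal; similarity decays with distance,
# which is what makes the mapped classes separable.
```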

SLIDE 10

Why use an SVM?

  • SVMs are deterministic
  • They find the global maximum, not a local maximum

    – If the training data are representative of the real data, you cannot do better.

  • SVMs are fast

    – They solve a maximization problem, as opposed to doing an iterative fitting
SLIDE 11

Preprocessing

  • To prepare the training data, we:

    – Normalized the data to a range of −1 to 1
    – Identified the training data set with the largest number of clusters
    – Used this data set as the reference set
    – Calculated the centroid of each cluster in the reference set
    – In all other training data, calculated the Euclidean distance of each cluster to the clusters in the reference set and assigned cluster IDs matching the reference cluster with the smallest distance measure
    – Took a sample of each training data set and combined them into one training vector to present to the SVM

SLIDE 12

Algorithm choice

MATLAB has a free file-share repository.

Someone has already put almost any algorithm you can think of into code.

I used the SVM coded by Junshui Ma and Yi Zhao of Ohio State University.

It received 5 stars.

SLIDE 13

Training Data

  • Example training data

    – Showing parameters 1 & 2, and 3 & 4 of the stem cell data set

SLIDE 14

Results

SLIDE 15

Results

Speed:

  Data set   Training time   Classification time
  CFSE       4 sec           2 min 48 sec (13 files)
  DLBCL      5 sec           67 sec (30 files)
  GvHD       5 sec           38 sec (12 files)
  NDD        11 sec          27 min 28 sec (30 files)
  Stem cell  4 sec           19 sec (30 files)

SLIDE 16

Room for improvement…

  • The SVMs are highly dependent on identifying a transform that maps the data to a linearly separable space.
  • We could experiment with a number of different transforms

SLIDE 17

FlowCap Feedback

  • What went well

    – Data easily available
    – Submission process easy
    – Questions answered immediately!

  • What could be improved

    – Wider publicity, particularly outside of our domain

SLIDE 18

Questions?