Support Vector Machines for Classification of Flow Data
Funded by SBIR Grant # R43 RR024094-01A1
FlowCap 2010
John Quinn, Ph.D.
Treestar
john@treestar.com
Our Objective
- Demonstrate that supervised training algorithms can effectively replicate user-created gates
– Very useful for high-throughput settings
– Can increase robustness
- We believe this will be the first application in which algorithmic gate placement becomes the norm.
Selected Algorithm
- Support Vector Machine (SVM)
– Radial kernel
- Supervised linear classifier that solves an optimization problem to find the hyperplane(s) that separate classes with the maximum distance between classes
– With non-linear mapping, data that is not linearly separable can be classified
SVM Operation
Optimization:
- Determine which elements of the training data mark the boundary of maximum distance between two classes
[Figure labels: Support vectors; Class 1; Class 2; Maximum separation D]
SVM Operation
- Optimization problem
For data:
A hyperplane that separates any two classes can be defined as:
For c_i = 1:
For c_i = -1:
Knowing that the data points should be outside of the margin, we can impose the constraint:
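One standard way to write the statements above is sketched here. It assumes training points x_i in R^p with labels c_i in {-1, 1} and the sign convention w·x - b = 0; the symbols p and n and the sign of b are assumptions, not taken from the slides.

```latex
\begin{align*}
\mathcal{D} &= \{(\mathbf{x}_i, c_i) \mid \mathbf{x}_i \in \mathbb{R}^p,\ c_i \in \{-1, 1\}\}_{i=1}^{n} \\
\mathbf{w} \cdot \mathbf{x} - b &= 0 && \text{(separating hyperplane)} \\
\mathbf{w} \cdot \mathbf{x}_i - b &\ge 1 && \text{for } c_i = 1 \\
\mathbf{w} \cdot \mathbf{x}_i - b &\le -1 && \text{for } c_i = -1 \\
c_i(\mathbf{w} \cdot \mathbf{x}_i - b) &\ge 1 && \text{for all } i
\end{align*}
```

The last line folds the two class-specific inequalities into a single constraint, because multiplying by c_i = -1 flips the second inequality.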
SVM Operation
We know that the support vectors will have a perpendicular distance from the hyperplane of:
and
The distance between SVs can then be expressed as:
So optimization is the minimization of
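A small numeric sketch of the margin geometry, assuming the hyperplane is written w·x - b = 0 and the support vectors satisfy c_i(w·x_i - b) = 1. Under those assumptions each support vector lies 1/||w|| from the hyperplane and the two classes are separated by 2/||w||, so widening the margin means minimizing ||w||. The toy values of w, b and the two points are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(dot(v, v))

def distance_to_hyperplane(x, w, b):
    """Perpendicular distance from x to the hyperplane w.x - b = 0."""
    return abs(dot(w, x) - b) / norm(w)

# Hypothetical toy hyperplane with ||w|| = 5
w = [3.0, 4.0]
b = 1.0

# Hypothetical support vectors, one per class
sv_pos = [0.0, 0.5]   # w.x - b = +1, so c = +1
sv_neg = [0.0, 0.0]   # w.x - b = -1, so c = -1

# Width of the gap between the two classes
margin_width = 2.0 / norm(w)

print(distance_to_hyperplane(sv_pos, w, b))  # distance of the +1 support vector
print(distance_to_hyperplane(sv_neg, w, b))  # distance of the -1 support vector
print(margin_width)
```

Both distances equal 1/||w||, and their sum is the margin width 2/||w||.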
SVM Operation
We then use the inequality as a constraint to fix a critical point and use Lagrangian multipliers α_i to express w as a linear combination of the training vectors:
The N_SV support vectors are then the x_i associated with non-zero Lagrange multipliers
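A standard form of this step is sketched below. It assumes the objective is written as (1/2)||w||^2 and the constraint as c_i(w·x_i - b) >= 1; the factor 1/2 and the sign convention for b are assumptions.

```latex
\begin{align*}
L(\mathbf{w}, b, \boldsymbol{\alpha}) &= \tfrac{1}{2}\|\mathbf{w}\|^2
  - \sum_{i=1}^{n} \alpha_i \left[ c_i(\mathbf{w} \cdot \mathbf{x}_i - b) - 1 \right],
  \qquad \alpha_i \ge 0 \\
\frac{\partial L}{\partial \mathbf{w}} = 0 \;&\Rightarrow\;
  \mathbf{w} = \sum_{i=1}^{n} \alpha_i c_i \mathbf{x}_i \\
\frac{\partial L}{\partial b} = 0 \;&\Rightarrow\;
  \sum_{i=1}^{n} \alpha_i c_i = 0
\end{align*}
```

Only training vectors with a strictly positive multiplier contribute to the sum for w, which is why they are singled out as support vectors.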
SVM Operation
Once w is known, and the support vectors have been identified, b can be solved as:
If there are more than two classes, the operation remains the same but the hyperplanes are determined either as one versus all or pairwise
- We chose a one versus all format
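A minimal sketch of these two steps for the linear case, assuming the convention w·x - b = 0, under which each support vector satisfies w·x_i - b = c_i and b can be averaged over the support vectors. The function names, the toy vectors and the argmax rule for combining the one-versus-all classifiers are illustrative assumptions, not the code used in this work.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def solve_bias(w, support_vectors, labels):
    """Average b = w.x_i - c_i over the support vectors (c_i in {-1, +1})."""
    values = [dot(w, x) - c for x, c in zip(support_vectors, labels)]
    return sum(values) / len(values)

def one_vs_all_predict(x, classifiers):
    """classifiers maps class ID -> (w, b), one hyperplane per class.

    Each hyperplane scores "this class versus all others"; the class
    with the largest score w.x - b wins.
    """
    scores = {k: dot(w, x) - b for k, (w, b) in classifiers.items()}
    return max(scores, key=scores.get)

# Hypothetical two-class example: recover b from two support vectors
w = [0.0, 2.0]
b = solve_bias(w, [[0.0, 1.0], [0.0, 0.0]], [1, -1])

# Hypothetical three-class example with one hyperplane per class
classifiers = {
    1: ([1.0, 0.0], 0.0),
    2: ([0.0, 1.0], 0.0),
    3: ([-1.0, -1.0], 0.0),
}
print(b)
print(one_vs_all_predict([2.0, 0.5], classifiers))
print(one_vs_all_predict([-1.0, -2.0], classifiers))
```

With a kernel, the dot products would be replaced by kernel evaluations against the support vectors; the one-versus-all selection stays the same.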
SVM Operation
- Data not linearly separable? Map it to a space where it is!
– We assume that flow data will have a Gaussian distribution and selected a Gaussian mapping
[Figure panels: Input Space; Mapped Space]
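A toy sketch of the idea, assuming the Gaussian mapping takes the usual radial form K(x, y) = exp(-gamma * ||x - y||^2). The value of gamma, the data and the single landmark are hypothetical; a real Gaussian-kernel SVM compares each point against all support vectors rather than one landmark.

```python
import math

def gaussian_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2); gamma is a hypothetical width parameter."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# 1-D toy data: class +1 sits between two groups of class -1,
# so no single threshold on x separates the classes.
points = [[-2.0], [0.0], [2.0]]
labels = [-1, 1, -1]

# Similarity to a landmark at the origin acts as one mapped coordinate.
landmark = [0.0]
mapped = [gaussian_kernel(p, landmark) for p in points]

# In the mapped coordinate a single threshold separates the classes.
threshold = 0.5
predicted = [1 if m > threshold else -1 for m in mapped]

print(mapped)
print(predicted == labels)
```

The point at the origin maps to 1 and the two outer points map to exp(-2), so the classes that were interleaved in the input space fall on opposite sides of the threshold in the mapped space.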
Why use an SVM?
- SVMs are deterministic
- They find the global maximum, not a local maximum
– If the training data are representative of the real data, you cannot do better.
- SVMs are fast
– They solve a maximization problem, as opposed to doing an iterative fitting
Preprocessing
- To prepare the training data, we:
– Normalized the data to a range of -1 to 1
– Identified the training data set with the largest number of clusters
- Used this data set as the reference set
– Calculated the centroid of each cluster in the reference set
– In all other training data, calculated the Euclidean distance of each cluster to the clusters in the reference set and assigned them cluster IDs matching the reference cluster with the smallest distance measure
– Took a sample of each training data set and combined them into one training vector to present to the SVM
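The steps above can be sketched as follows. This is an illustrative reading of the list, not the code used in this work: the function names, the dictionary layout (cluster ID mapped to a list of event vectors), the sample size and the random seed are all assumptions. Each non-reference cluster is represented by its centroid when measuring distance to the reference clusters, which is also an assumption.

```python
import math
import random

def normalize(values):
    """Linearly rescale one parameter to the range [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]

def centroid(events):
    """Mean of a list of equal-length event vectors."""
    n = len(events)
    return [sum(col) / n for col in zip(*events)]

def match_to_reference(data_set, reference_centroids):
    """Relabel each cluster with the ID of the nearest reference centroid."""
    relabeled = {}
    for events in data_set.values():
        c = centroid(events)
        nearest = min(reference_centroids,
                      key=lambda k: math.dist(c, reference_centroids[k]))
        # Two clusters that land on the same reference ID are merged.
        relabeled.setdefault(nearest, []).extend(events)
    return relabeled

def build_training_set(data_sets, sample_size, seed=0):
    """data_sets: list of {cluster_id: [event, ...]}, events already normalized."""
    rng = random.Random(seed)
    reference = max(data_sets, key=len)  # data set with the most clusters
    ref_centroids = {k: centroid(v) for k, v in reference.items()}
    X, y = [], []
    for ds in data_sets:
        aligned = ds if ds is reference else match_to_reference(ds, ref_centroids)
        for cluster_id, events in aligned.items():
            for e in rng.sample(events, min(sample_size, len(events))):
                X.append(e)
                y.append(cluster_id)
    return X, y

# Hypothetical toy data: a 3-cluster reference set and a 2-cluster set
ref = {1: [[-0.9, -0.9], [-0.8, -0.8]],
       2: [[0.8, 0.8], [0.9, 0.9]],
       3: [[0.8, -0.8], [0.9, -0.9]]}
other = {7: [[0.85, 0.85], [0.7, 0.9]],
         8: [[-0.7, -0.9], [-0.9, -0.7]]}
X, y = build_training_set([other, ref], sample_size=2)
print(len(X), sorted(set(y)))
```

In the toy run, clusters 7 and 8 of the second data set are relabeled 2 and 1, so the combined training labels use only the reference IDs.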
Algorithm choice
Matlab has a free file share repository
Someone has already put almost any algorithm you can think of into code
I used the SVM coded by Junshui Ma and Yi Zhao of Ohio St. University
It received 5 stars
Training Data
- Example training data
– Showing parameters 1 & 2, and 3 & 4 of the stem cell data set
Results
Results
Speed:
Data set     Training time   Classification time
CFSE         4 sec           2 min 48 sec (13 files)
DLBCL        5 sec           67 sec (30 files)
GvHD         5 sec           38 sec (12 files)
NDD          11 sec          27 min 28 sec (30 files)
Stem cell    4 sec           19 sec (30 files)
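To compare data sets with different file counts, the classification times above can be converted to seconds per file. The sketch below only rearranges the reported numbers; the dictionary layout and function name are illustrative.

```python
# (training seconds, classification seconds, number of files), from the speed table
timings = {
    "CFSE": (4, 2 * 60 + 48, 13),
    "DLBCL": (5, 67, 30),
    "GvHD": (5, 38, 12),
    "NDD": (11, 27 * 60 + 28, 30),
    "Stem cell": (4, 19, 30),
}

def seconds_per_file(classification_seconds, n_files):
    """Average classification time per file."""
    return classification_seconds / n_files

per_file = {name: seconds_per_file(c, n) for name, (_, c, n) in timings.items()}

for name, s in per_file.items():
    print(f"{name}: {s:.1f} s per file")
```

This works out to roughly 12.9 s per file for CFSE, 2.2 s for DLBCL, 3.2 s for GvHD, 54.9 s for NDD and 0.6 s for the stem cell data.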
Room for improvement…
- The SVMs are highly dependent on identifying a transform that maps the data to a linearly separable space.
- We could experiment with a number of different transforms
FlowCap Feedback
- What went well
– Data easily available
– Submission process easy
– Questions answered immediately!
- What could be improved