Alternatives to support vector machines in neuroimaging ensembles - PowerPoint PPT Presentation

Alternatives to support vector machines in neuroimaging ensembles of decision trees for classification and information mapping with predictive models Jonas Richiardi FINDlab / LabNIC http://www.stanford.edu/~richiard Dept. of Neurology & Dept. of Neuroscience Neurological Sciences Dept. of Clinical Neurology PRNI2013 tutorial

Decision trees deserve more attention Scopus june 2013: (mapping OR "brain decoding" OR "brain reading" OR classification OR MVPA OR "multi-voxel pattern analysis") AND (neuroimaging OR "brain imaging" OR fMRI OR MRI or "magnetic resonance imaging") + ("support vector machine" OR "SVM") = 657 docs (1282 if adding EEG OR electroencephalography OR MEG OR magnetoencephalography) + ("random forest" OR "decision tree") = 71 docs (199 if adding EEG OR electroencephalography OR MEG OR magnetoencephalography) Roughly speaking, more used at MICCAI (segmentation, geometry extraction, image reconstruction, skull stripping...) than at HBM 4

Tutorial agenda Lecture Practical Basics Datasets Growing trees Matlab/ PRoNTo Ensembling Python/Scikit The random forest Other forests Tuning your forests Information mapping Correlated features 5

Information mapping relates model input to output In supervised learning for classification, we seek a function mapping voxels to class labels f ( x ) : R D → y y = { 0 , 1 } , x ∈ R D S = { ( x n , y n ) } , n = 1 , . . . , N As neuroimagers, if some voxels in x contain information about y, the function should reflect it, and we are interested in mapping it f ( x ) = sgn( w T x + b ) tells us about relative discriminative importance [Mourao-Miranda et al., NeuroImage, 2005] 6

Mutual information measures uncertainty reduction We can also explicitly measure the amount of information x and y share using mutual information . I ( Y ; X ) ≡ H ( Y ) − H ( Y | X ) 1 1 X X H ( Y ) ≡ P ( y ) log H ( Y | X ) ≡ P ( y, x ) log P ( y ) P ( y | x ) y y,x high if classes are balanced average uncertainty remaining about (high uncertainty about class class label if we know the voxel intensity membership) P ( y, x ) log P ( y, x ) X I ( Y ; X ) ≡ P ( y ) P ( x ) y,x average reduction in uncertainty about class label if we know voxel intensity 7

Growing trees recursively reduces entropy Decision trees seek to partition voxel space by intensity values to decrease uncertainty about class label x2 x2 x2 � 3 A A C f(x1,x2) f(x1,x2) A > � 1 > � 1 f(x1,x2) B B C > � 1 x2 x2 x2 > � 2 > � 2 > � 3 � 2 � 2 B B � 1 � 1 � 1 x1 x1 x1 A A A This leaves a few questions... How to measure goodness of splits? How to choose voxels and where to cut ? When to stop growing? >rpart::rpart >>stats::classregtree 8 >>>sklearn::tree

Split goodness can be measured 1 Entropy impurity: X H ( S ) ≡ P ( y ) log P ( y ) y Gini impurity: X G ( S ) ≡ P ( y i ) P ( y j ) y i 6 = y j [Hastie et al., 2001] Information Gain (decrease in impurity): |S i | X |S| H ( S i ) ∆ I ( S ; x, τ ) ≡ H ( S ) − i = L,R 9

Information gain helps choose the best split Split&1& Before&split& Split&2& [Criminisi & Shotton, 2013] 10

Stopping and pruning matter We can stop growing the tree When the info gain is small When the number of points in a leaf is small (relative or absolute) - dense regions of voxel space will be split more When (nested) CV error does not improve any more ... but stopping criteria are hard to set We can grow fully and then prune Merge leaves where miminal impurity increase ensues We can leave unpruned These choices generally matter more than split goodness criterion (see CART vs C4.5) 11

Trees relate to other models We can view trees as kernels : build a feature space mapping with indicator functions Φ ( x ) = ( 1 1 ( x ) . . . 1 B ( x )) T Then is a positive kernel (only = 1 if x and x’ k b ( x , x 0 ) = Φ ( x ) T Φ ( x 0 ) in same leaf). Can also do ‘soft’ version We can also view trees as encoding conditional probability distributions, e.g. represented by Bayesian Networks : ... x 1 x 2 x D 1 p t (y| x ) 0.5 0 y 300 200 320 150 340 360 100 380 50 400 P ( x 1 , . . . , x D , y ) = P ( x 1 ) · . . . · P ( x D ) P ( y | x 1 , . . . , x D ) Voxel 1 Voxel 2 420 12 Kernel view: [Geurts et al., 2006] BN view: [Richiardi, 2007], others

Trees are rule-based classifiers [Douglas et al. 2011] 13

Many trees are more complex Multivariate trees query >1 voxels per node Splits don’t have to be axis-parallel (can be oblique) Model trees use MV regression in leaves Functional Trees can use several voxels either at nodes or leaves At each node, use Δ I to split on either a voxel x, or a logistic regression estimate of class probability P(y) Multivariate nodes (FT -inner) reduce bias Multivariate leaves (FT -leaves) reduces variance 14 Functional trees: [Gama, 2004]

Single trees vs SVMs SVM Tree Interpretability + + Irrelevant voxels +- + Input scaling - + Speed + (linear) + Generalisation error ++ - Information mapping +- - [Kuncheva&Rodriguez 2010] 15

Single trees tend to have high variance The bias-variance tradeoff applies as usual: We can decrease prediction error arbitrarily on a given dataset, thus yielding low bias . However, this systematically comes at a price in variance : the parameters of f can change a lot if the training set varies. Single trees are not ‘stable’ - they tend to reduce bias by increasing variance 17 [Duda et al, 2001]

Ensembling exploits diversity Train set of classifiers, combine predictions, get reduced ensemble variance and/or bias Tree diversity has multiple sources Training set variability / resampling , random projection , choice of cut point (, pruning strategy...) 18

Bagging classifiers generates diversity Bagging = B oostrap agg regat ing 1. Resample with replacement B times from training dataset , yielding {S b } , b = 1 , . . . , B S 2. Train B base classifiers { f b } 3. Get B predictions 4. Combine by majority vote If the base classifiers have high variance, accuracy tends to improve with bagging since this generates diversity Good news for trees! >::sample >>stats::ClassificationBaggedEnsemble 19 [Breiman, 1996] >>>sklearn::ensemble

The random forest bags trees RF combines several diversity-producing methods 1. Generate B bootstrap replicates log 2 (D) 300 sqrt(D) 250 random projection 2. At each node, randomly select a few 200 voxels. Typically or . 150 100 Since K << D, randomisation is high. 50 2 4 6 8 10 3. No pruning number of voxels 4 x 10 With a ‘large enough’ number of [Kuncheva&Rodriguez 2010] trees, RFs typically performs well with no tuning on many datasets >adabag::bagging >>stats::TreeBagger, PRoNTo::machine_RT_bin 21 Random projection: [Ho, 1998] >>>sklearn::RandomForestClassifier

RFs can be seen as probabilistic models The leaf of each tree b can be seen as a posterior (multinomial distribution) p b (y| x ) Ensemble probability: 1 0.5 0 300 200 150 350 100 Voxel 1 50 400 Voxel 2 More trees = smoother posterior = less over-confidence 22

RF works for fMRI classification (1) Data: event related fMRI, belief vs disbelief in statements, 14 subjects Features: ICA timecourse value at button press [Douglas et al. 2011] 23

RF works for multimodal classification Data: 14 HC young, 13 AD old, 14 HC old. Visual stimulation + keypress. fMRI. Features: fMRI GLM activation-related (n suprathreshold voxels, peak z-score,...), RT, demographics... + feature selection Classifiers: RF + variants of split criterion. Group classification. 97-99% acc [Tripoliti et al. 2011] activation features only + other features 24

RF really works for multimodal classification Data: ADNI, 37 AD, 75 MCI, 35 HC. MRI, FDG-PET, CSF measures, 1 SNP Features: RF as kernel + MDS 25 [Gray et al 2013]

Extremely Randomised Trees increase diversity Tree variance is due in large part to cutpoint choice. We can generate even more diversity with Extra-trees Select both K voxels and cutpoints at random, pick best*. Stop growing when leaves are small When K=1, called totally randomized trees + Accuracy and variance reduction competitive with and sometimes better than RF, faster than RF >extraTrees::extraTrees Extra-trees: [Geurts et al., 2006] >>PRoNTo::machine_RT_bin *Dietterich 1998 - the opposite: select top-K best 27 >>>sklearn::ExtraTreesClassifier splits, then pick at random

Alternatives to support vector machines in neuroimaging ensembles - PowerPoint PPT Presentation

Alternatives to support vector machines in neuroimaging ensembles of decision trees for classification and information mapping with predictive models Jonas Richiardi FINDlab / LabNIC http://www.stanford.edu/~richiard Dept. of Neurology &

Kernel Machines Support Vector Machines 1 Kernel Machines Optimal Separating HyperPlanes Soft

Kernel Machines Steven J Zeil Old Dominion Univ. Fall 2010 1 Support Vector Machines Kernel

? 17.10.2018 3 17.10.2018 4 Support Vector Machines (SVM): Background Support Vector Machines

Support Vector Machines October 16, 2018 Support Vector Machines October 16, 2018 1 / 31

Relevance Vector Machines Jukka Lankinen LUT February 21, 2011 Jukka Lankinen Relevance Vector

Vector addition: The zero vector The D -vector whose entries are all zero is the zero vector ,

Support Vector Machines & Kernelization Barna Saha Most of the slides are made using David

Introduction Kailash Awati Instructor DataCamp Support Vector Machines in R Preliminaries

Support Vector Machines Support Vector Machines Hypothesis Space Hypothesis Space variable

Support Vector Machines (Ch. 18.9) SVM Basics Support Vector Machines (SVMs) try to do our

Support vector machines CS 446 Part 1: linear support vector machines 1.0 1.0 1.0 0.8 0.8

SUPPORT VECTOR MACHINES SUPPORT VECTOR MACHINES Matthieu R Bloch Tuesday, February 25, 2020 1

RBF Kernels: Generating a complex dataset DataCamp Support Vector Machines in R A bit about RBF

Machine Learning for NLP Support Vector Machines Aurlie Herbelot 2019 Centre for Mind/Brain

Generating a radially separable dataset DataCamp Support Vector Machines in R Generating a 2d

Support Vector Machines 290N, 2014 Support Vector Machines (SVM) Supervised learning

Topological Data Analysis for Brain Networks Relating Functional Brain Network Topology to

Deep Model Generalization for Medical Image Computing at Scale DOU Qi Department of Computer

AI IN HEALTHCARE What is Artificial Intelligence (AI)? How is AI different from Machine Learning

Signal Processing for Functional Brain Imaging: ICA Lab Exercise Dimitri Van De Ville Medical

Lec No: 3 Subject: Brainstem Motor Function Doctor: Faisal Mohammed 00/00/2017 Brainstem Motor

Pre-motor nuclei and the medial longitudinal fasciculus (MLF) I Hierarchy of control Planning-

Lecture 2: Neurobiology 101 Image from http://clasdean.la.asu.edu/news/images/ubep2001/neuron3.jpg

The High Yield Neurologic Examination Sadly, I still have nothing new to disclose since Tuesday

Alternatives to support vector machines in neuroimaging ensembles - PowerPoint PPT Presentation

Alternatives to support vector machines in neuroimaging ensembles of decision trees for classification and information mapping with predictive models Jonas Richiardi FINDlab / LabNIC http://www.stanford.edu/~richiard Dept. of Neurology &

Kernel Machines Support Vector Machines 1 Kernel Machines Optimal Separating HyperPlanes Soft

Kernel Machines Steven J Zeil Old Dominion Univ. Fall 2010 1 Support Vector Machines Kernel

? 17.10.2018 3 17.10.2018 4 Support Vector Machines (SVM): Background Support Vector Machines

Support Vector Machines October 16, 2018 Support Vector Machines October 16, 2018 1 / 31

Relevance Vector Machines Jukka Lankinen LUT February 21, 2011 Jukka Lankinen Relevance Vector

Vector addition: The zero vector The D -vector whose entries are all zero is the zero vector ,

Support Vector Machines &amp; Kernelization Barna Saha Most of the slides are made using David

Introduction Kailash Awati Instructor DataCamp Support Vector Machines in R Preliminaries

Support Vector Machines Support Vector Machines Hypothesis Space Hypothesis Space variable

Support Vector Machines (Ch. 18.9) SVM Basics Support Vector Machines (SVMs) try to do our

Support vector machines CS 446 Part 1: linear support vector machines 1.0 1.0 1.0 0.8 0.8

SUPPORT VECTOR MACHINES SUPPORT VECTOR MACHINES Matthieu R Bloch Tuesday, February 25, 2020 1

RBF Kernels: Generating a complex dataset DataCamp Support Vector Machines in R A bit about RBF

Machine Learning for NLP Support Vector Machines Aurlie Herbelot 2019 Centre for Mind/Brain

Generating a radially separable dataset DataCamp Support Vector Machines in R Generating a 2d

Support Vector Machines 290N, 2014 Support Vector Machines (SVM) Supervised learning

Topological Data Analysis for Brain Networks Relating Functional Brain Network Topology to

Deep Model Generalization for Medical Image Computing at Scale DOU Qi Department of Computer

AI IN HEALTHCARE What is Artificial Intelligence (AI)? How is AI different from Machine Learning

Signal Processing for Functional Brain Imaging: ICA Lab Exercise Dimitri Van De Ville Medical

Lec No: 3 Subject: Brainstem Motor Function Doctor: Faisal Mohammed 00/00/2017 Brainstem Motor

Pre-motor nuclei and the medial longitudinal fasciculus (MLF) I Hierarchy of control Planning-

Lecture 2: Neurobiology 101 Image from http://clasdean.la.asu.edu/news/images/ubep2001/neuron3.jpg

The High Yield Neurologic Examination Sadly, I still have nothing new to disclose since Tuesday

Support Vector Machines & Kernelization Barna Saha Most of the slides are made using David