Lawrence Mitchell EPCC lawrence.mitchell@ed.ac.uk
A parallel random forest classifier for R in SPRINT
Outline
- SPRINT
- Random Forests
- Parallelisation results/lessons
- Future work
SPRINT
- The Simple Parallel R Interface www.r-sprint.org
[Hill et al, BMC Bioinformatics (2008)]
- Enable R users to easily exploit HPC resources
- Motivating use case: analysis of gene expression data
- R library: install.packages("sprint")
- Parallel replacements for time-consuming analysis functions
- Written in C with MPI for parallelisation
- Portable
– multicore desktops; clusters; supercomputers; EC2
Example: permutation adjusted p-values
Serial, using multtest:
    data(golub)
    library(multtest)
    ## Can take a long time
    resT <- mt.maxT(golub, golub.cl, test="t", side="abs")
    quit(save="no")
Parallel, using SPRINT:
    data(golub)
    library(sprint)
    ## Run in parallel
    resT <- pmaxT(golub, golub.cl, test="t", side="abs")
    pterminate()
    quit(save="no")
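To show where the time goes in a permutation test like the one above, here is a minimal base-R sketch of permutation-adjusted p-values. It is a simplified single-step maxT variant, not the step-down procedure that mt.maxT implements, and the names row_t and maxT_single_step and the synthetic data are assumptions made for illustration. Each permutation recomputes every test statistic independently of the others, which is what makes this kind of computation a natural candidate for spreading across processes.

```r
# Welch t-statistic for every row of X (rows = genes, columns = samples).
# cl is a 0/1 class label vector with one entry per column.
row_t <- function(X, cl) {
  a <- X[, cl == 0, drop = FALSE]
  b <- X[, cl == 1, drop = FALSE]
  va <- apply(a, 1, var)
  vb <- apply(b, 1, var)
  (rowMeans(b) - rowMeans(a)) / sqrt(va / ncol(a) + vb / ncol(b))
}

# Single-step maxT adjusted p-values from B random label permutations
# (a simplification; mt.maxT uses a step-down procedure).
maxT_single_step <- function(X, cl, B = 1000) {
  t_obs <- abs(row_t(X, cl))
  # For each permutation, record the largest |t| over all rows.
  max_perm <- replicate(B, max(abs(row_t(X, sample(cl)))))
  # Adjusted p-value: fraction of permutations whose maximum reaches |t_obs|.
  sapply(t_obs, function(s) mean(max_perm >= s))
}

set.seed(1)
X <- matrix(rnorm(50 * 20), nrow = 50)   # 50 synthetic "genes", 20 samples
cl <- rep(c(0, 1), each = 10)
X[1, cl == 1] <- X[1, cl == 1] + 5       # make row 1 strongly differential
adjp <- maxT_single_step(X, cl, B = 200)
```

The cost grows with the number of permutations times the number of rows, so a realistic run with many thousands of permutations on a full expression matrix can take a long time in serial.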
Supported functionality
- Available now on CRAN
– Pearson Correlation (replacing cor)
– Permutation adjusted p-values (replacing mt.maxT, from multtest)
– Partitioning around medoids (replacing pam)
- Coming soon: implemented, in test phase
– Parallel apply (like apply)
– Bootstrapping (replacing boot)
– Random Forests (builds on randomForest package)
– Rank Products (implements functionality from Bioconductor package RP)
- Under development
– RMA analysis (affy package from Bioconductor)
Classifying data
- Have probe data from 100 patients in two groups
- Want to classify further patient data into correct group
– e.g. Susceptibility to some disease
- Construct classification model using training data
- Predict classification of unseen data
- Random Forests provide such a method
Random Forests
- An ensemble tree classifier
- Bootstrap samples of a dataset
– genetic data typically: O(100) cases; O(10000) probes
- One tree per sample, giving a forest
- Compute classifications over this ensemble
...
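The bullets above can be illustrated with a toy ensemble in base R. This is only a sketch under strong assumptions: each "tree" is a one-split stump chosen from a random subset of features, whereas a real random forest (for example the randomForest package) grows full trees and resamples features at every split. The function names fit_stump, predict_stump, fit_forest and predict_forest, and the synthetic data, are hypothetical.

```r
# Fit a one-split "stump" on a random subset of mtry features
# (a stand-in for a full decision tree). y must be coded 0/1.
fit_stump <- function(X, y, mtry) {
  best <- list(err = Inf)
  for (j in sample(ncol(X), mtry)) {
    for (thr in unique(X[, j])) {
      pred <- ifelse(X[, j] > thr, 1, 0)
      for (flip in c(FALSE, TRUE)) {
        p <- if (flip) 1 - pred else pred
        err <- mean(p != y)
        if (err < best$err) best <- list(err = err, j = j, thr = thr, flip = flip)
      }
    }
  }
  best
}

predict_stump <- function(s, X) {
  pred <- ifelse(X[, s$j] > s$thr, 1, 0)
  if (s$flip) 1 - pred else pred
}

# One stump per bootstrap sample of the cases, giving a (toy) forest.
fit_forest <- function(X, y, ntree = 25, mtry = floor(sqrt(ncol(X)))) {
  lapply(seq_len(ntree), function(i) {
    idx <- sample(nrow(X), replace = TRUE)   # bootstrap: sample cases with replacement
    fit_stump(X[idx, , drop = FALSE], y[idx], mtry)
  })
}

# Classify by majority vote over the ensemble.
predict_forest <- function(forest, X) {
  votes <- matrix(unlist(lapply(forest, predict_stump, X = X)), nrow = nrow(X))
  as.integer(rowMeans(votes) > 0.5)
}

set.seed(42)
n <- 60; p <- 30
y <- rep(c(0, 1), each = n / 2)
X <- matrix(rnorm(n * p), nrow = n)
X[, 1:5] <- X[, 1:5] + 3 * y                 # first five synthetic "probes" carry signal
forest <- fit_forest(X, y, ntree = 51)
pred <- predict_forest(forest, X)
acc <- mean(pred == y)
```

Because every stump is fitted on its own bootstrap sample, the loop in fit_forest has no dependencies between iterations; that independence is what a task-parallel implementation exploits.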
Existing implementations
- randomForest [Liaw & Wiener, R News (2002)]
– serial, R version
- parf [Topić et al, Parallel Numerics (2005)]
– task parallel F90, no R version
- randomjungle [Schwarz et al, Bioinformatics (2010)]
– task parallel C++, designed for microarray data, no R version
Parallelisation
- Build on existing serial R package
- Spread the generation of bootstraps across processes
- Combine the results on single (master) process
- 96000 probes; 65 cases
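The strategy listed above (spread the bootstraps across processes, then combine on a master) can be sketched in base R as follows. This is not SPRINT's implementation: lapply stands in for the MPI processes, and grow_trees is a hypothetical placeholder in which each "tree" is just the bootstrap sample of case indices it would be trained on. With real forests, the randomForest package's combine() function plays the role of the concatenation step.

```r
# Split ntree as evenly as possible over nproc workers.
split_trees <- function(ntree, nproc) {
  base <- ntree %/% nproc
  extra <- ntree %% nproc
  base + (seq_len(nproc) <= extra)           # the first `extra` workers get one more tree
}

# Hypothetical stand-in for growing k trees: each "tree" is represented
# only by its bootstrap sample of case indices.
grow_trees <- function(k, ncases) {
  lapply(seq_len(k), function(i) sample(ncases, replace = TRUE))
}

set.seed(7)
counts <- split_trees(ntree = 500, nproc = 8)
# lapply stands in for the worker processes; each call is independent.
partial <- lapply(counts, grow_trees, ncases = 65)
# The master concatenates the sub-forests into a single forest.
forest <- do.call(c, partial)
```

In this scheme all sub-forests arrive at one process, so the combine step is serial and its cost grows with the number of workers.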
Combining takes time
Combine in parallel
- Do a tree reduction (compare MPI_Reduce)
- 24000 probes; 65 cases
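A tree reduction, in the sense of MPI_Reduce, combines partial results pairwise in rounds, so P partial results need about log2(P) rounds instead of P - 1 sequential combines on the master. The following base-R sketch simulates the pattern in a single process; tree_reduce is a hypothetical helper, and in an MPI implementation the combines within one round would run concurrently on different processes.

```r
# Pairwise (tree) reduction over a list of partial results.
# `combine` is any associative two-argument function.
tree_reduce <- function(parts, combine) {
  rounds <- 0
  while (length(parts) > 1) {
    n <- length(parts)
    nxt <- vector("list", ceiling(n / 2))
    for (i in seq_along(nxt)) {
      left <- parts[[2 * i - 1]]
      # Combine with the right-hand neighbour if there is one;
      # an unpaired last element is carried into the next round.
      nxt[[i]] <- if (2 * i <= n) combine(left, parts[[2 * i]]) else left
    }
    parts <- nxt
    rounds <- rounds + 1
  }
  list(result = parts[[1]], rounds = rounds)
}

# Eight simulated processes, each holding a one-tree sub-forest.
parts <- lapply(1:8, function(p) list(paste0("tree-from-proc-", p)))
out <- tree_reduce(parts, c)
out$rounds           # 3 rounds for 8 partial results
length(out$result)   # 8 trees in the combined forest
```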
Faster time to solution
- More efficient exploitation of computing resource
- 96000 probes; 65 cases
Lessons
- HPC resources are hard to exploit efficiently
- Profile and benchmark your implementation carefully
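As a minimal sketch of the benchmarking advice above, base R's system.time can be wrapped to take a median over repeated runs, and parallel efficiency can be computed from serial and parallel timings. The helper names bench and parallel_efficiency are assumptions, and the timings passed to parallel_efficiency below are made-up inputs, not measurements from SPRINT.

```r
# Median elapsed wall-clock time of f() over several repetitions.
bench <- function(f, reps = 5) {
  times <- vapply(seq_len(reps),
                  function(i) system.time(f())[["elapsed"]],
                  numeric(1))
  median(times)
}

# Parallel efficiency: speedup divided by the number of processes.
parallel_efficiency <- function(t_serial, t_parallel, nproc) {
  (t_serial / t_parallel) / nproc
}

t_example <- bench(function() sum(sort(runif(1e5))))

# Hypothetical timings: 100 s serial, 25 s on 8 processes.
eff <- parallel_efficiency(100, 25, 8)   # speedup 4 on 8 processes
```

An efficiency well below 1 suggests that a serial phase, such as combining results on one process, or communication is dominating, which is where profiling should start.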
Future work in SPRINT
- Better (transparent) support for large datasets: starts October 2011
- More analysis routines – we need user input here
- More efficient serial random forest (use randomjungle?)
- ...
Thanks
- Wellcome Trust grant 086696/Z/08/Z and edikt2
- The HECToR distributed CSE service operated by NAG Ltd
- The Centre for Numerical Algorithms and Intelligent Software
DPM Team
- Peter Ghazal
- Thorsten Forster
- Muriel Mewissen
EPCC Team
- Terry Sloan
- Michal Piotrowski
- Lawrence Mitchell
- Savvas Petrou
- Bartek Dobrzelecki
- Jon Hill
- Florian Scharinger
Questions?