A parallel random forest classifier for R in SPRINT (Lawrence Mitchell, EPCC)


SLIDE 1

Lawrence Mitchell EPCC lawrence.mitchell@ed.ac.uk

A parallel random forest classifier for R in SPRINT

SLIDE 2

Outline

  • SPRINT
  • Random Forests
  • Parallelisation results/lessons
  • Future work
SLIDE 3

SPRINT

  • The Simple Parallel R Interface (www.r-sprint.org)

[Hill et al, BMC Bioinformatics (2008)]

  • Enable R users to easily exploit HPC resources
  • Motivating use case: analysis of gene expression data
  • R library: install.packages("sprint")
  • Parallel replacements for time-consuming analysis functions
  • Written in C with MPI for parallelisation
  • Portable

– multicore desktops; clusters; supercomputers; EC2

SLIDE 4

Example: permutation adjusted p-values

Serial, using multtest:

    library(multtest)
    data(golub)
    ## Can take a long time
    resT <- mt.maxT(golub, golub.cl, test="t", side="abs")
    quit(save="no")

Parallel, using SPRINT:

    library(sprint)
    ## The golub dataset ships with multtest
    data(golub, package="multtest")
    ## Run in parallel
    resT <- pmaxT(golub, golub.cl, test="t", side="abs")
    pterminate()
    quit(save="no")

SLIDE 5

Supported functionality

  • Available now on CRAN

– Pearson correlation (replacing cor)
– Permutation adjusted p-values (replacing mt.maxT, from multtest)
– Partitioning around medoids (replacing pam)

  • Coming soon: implemented, in test phase

– Parallel apply (like apply)
– Bootstrapping (replacing boot)
– Random Forests (builds on randomForest package)
– Rank Products (implements functionality from Bioconductor package RP)

  • Under development

– RMA analysis (affy package from Bioconductor)

SLIDE 6

Classifying data

  • Have probe data from 100 patients in two groups
  • Want to classify further patient data into correct group

– e.g. susceptibility to some disease

  • Construct classification model using training data
  • Predict classification of unseen data
  • Random Forests provide such a method
SLIDE 7

Random Forests

  • An ensemble tree classifier
  • Bootstrap samples of a dataset

– genetic data typically: O(100) cases; O(10000) probes

  • One tree per sample, giving a forest
  • Compute classifications over this ensemble

...
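
The ensemble idea above can be sketched in a few lines of base R. This is a deliberately simplified illustration, not the randomForest algorithm: each "tree" is a single-split stump, the random probe subset is drawn once per stump rather than per split, and the data are synthetic. All function names (make_data, fit_stump, grow_forest, predict_forest) are hypothetical.

```r
set.seed(42)

# Synthetic stand-in for expression data (an assumption, not data from the talk):
# only probe 1 separates the two groups.
make_data <- function(n, p = 20) {
  y <- factor(rep(c("A", "B"), length.out = n))
  x <- matrix(rnorm(n * p), nrow = n)
  x[y == "B", 1] <- x[y == "B", 1] + 3
  list(x = x, y = y)
}

# Depth-1 "tree" (a stump): among mtry randomly chosen probes, pick the
# probe/threshold pair with the fewest errors on the given sample.
fit_stump <- function(x, y, mtry) {
  cls  <- levels(y)
  best <- list(err = Inf)
  for (j in sample(ncol(x), mtry)) {
    for (thr in x[, j]) {
      err <- mean((x[, j] > thr) != (y == cls[2]))  # "above" predicts cls[2]
      hi  <- 2
      if (err > 0.5) { err <- 1 - err; hi <- 1 }    # the flipped rule is better
      if (err < best$err) best <- list(probe = j, thr = thr, hi = hi, err = err)
    }
  }
  best
}

predict_stump <- function(stump, x, cls) {
  ifelse(x[, stump$probe] > stump$thr, cls[stump$hi], cls[3 - stump$hi])
}

# One stump per bootstrap sample of the cases, giving a (very small) forest.
grow_forest <- function(x, y, ntree, mtry) {
  lapply(seq_len(ntree), function(i) {
    idx <- sample(nrow(x), replace = TRUE)
    fit_stump(x[idx, , drop = FALSE], y[idx], mtry)
  })
}

# Classify each case by majority vote over the ensemble.
predict_forest <- function(forest, x, cls) {
  votes <- sapply(forest, predict_stump, x = x, cls = cls)  # cases x trees
  factor(apply(votes, 1, function(v) names(which.max(table(v)))), levels = cls)
}

train  <- make_data(60)
test   <- make_data(40)
forest <- grow_forest(train$x, train$y, ntree = 51, mtry = 10)
pred   <- predict_forest(forest, test$x, levels(train$y))
accuracy <- mean(pred == test$y)
print(accuracy)
```

An odd number of trees avoids tied votes in the two-class case. Real random forests grow full trees and resample the candidate probes at every split, which this sketch leaves out.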

SLIDE 8

Existing implementations

  • randomForest [Liaw & Wiener, R News (2002)]

– serial, R version

  • parf [Topić et al, Parallel Numerics (2005)]

– task parallel F90, no R version

  • randomjungle [Schwarz et al, Bioinformatics (2010)]

– task parallel C++, designed for microarray data, no R version

SLIDE 9

Parallelisation

  • Build on existing serial R package
  • Spread the generation of bootstraps across processes
  • Combine the results on single (master) process
  • 96000 probes; 65 cases
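
A minimal sketch of this scheme, assuming the randomForest package is installed. It is not the SPRINT implementation (which is written in C with MPI): a serial lapply stands in for the separate processes, the data are synthetic and much smaller than the 96000-probe set, and split_trees and grow_split_forest are hypothetical helper names. randomForest() and combine() are the package's own functions.

```r
# Split ntree_total trees as evenly as possible across n_workers processes.
split_trees <- function(ntree_total, n_workers) {
  base  <- ntree_total %/% n_workers
  extra <- ntree_total %% n_workers
  base + as.integer(seq_len(n_workers) <= extra)
}

# Stand-in for the parallel run: each "worker" grows its share of the trees
# (each on its own bootstrap samples), then the sub-forests are merged on a
# single (master) process. Needs at least two workers for combine().
grow_split_forest <- function(x, y, ntree_total, n_workers) {
  shares  <- split_trees(ntree_total, n_workers)
  forests <- lapply(shares, function(nt) {
    randomForest::randomForest(x, y, ntree = nt)
  })
  do.call(randomForest::combine, forests)
}

shares <- split_trees(500, 8)
print(shares)

if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  x <- matrix(rnorm(65 * 200), nrow = 65)          # 65 cases, 200 synthetic probes
  colnames(x) <- paste0("probe", seq_len(ncol(x)))
  y <- factor(rep(c("A", "B"), length.out = 65))
  forest <- grow_split_forest(x, y, ntree_total = 500, n_workers = 8)
  print(forest$ntree)
}
```

Because every tree is grown independently, the split only changes where the work happens. The merge on the master process is the step that remains serial in this layout.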
SLIDE 10

Combining takes time

SLIDE 11

Combine in parallel

  • Do a tree reduction (compare MPI_Reduce)
  • 24000 probes; 65 cases
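
The shape of a tree reduction can be illustrated in base R. This is a serial sketch of the communication pattern only, not SPRINT's MPI code: tree_reduce is a hypothetical helper, and the character vectors stand in for the partial forests held by each process.

```r
# Pairwise tree reduction: in each round, neighbouring partial results are
# combined. The pairs within a round are independent, so on P processes they
# could run concurrently, giving ceiling(log2(P)) rounds instead of P - 1
# sequential combines on one master process.
tree_reduce <- function(parts, combine) {
  rounds <- 0
  while (length(parts) > 1) {
    n      <- length(parts)
    left   <- seq(1, n - 1, by = 2)
    merged <- lapply(left, function(i) combine(parts[[i]], parts[[i + 1]]))
    if (n %% 2 == 1) merged <- c(merged, parts[n])  # odd one out passes through
    parts  <- merged
    rounds <- rounds + 1
  }
  list(result = parts[[1]], rounds = rounds)
}

# Eight "processes", each holding one partial result.
parts   <- lapply(1:8, function(rank) paste0("forest", rank))
reduced <- tree_reduce(parts, combine = c)
print(reduced$rounds)
print(reduced$result)
```

The total number of combine calls is unchanged at P - 1. The gain comes from spreading them over the processes so that no single process performs them all.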
SLIDE 12

Faster time to solution

  • More efficient exploitation of computing resource
  • 96000 probes; 65 cases
SLIDE 13

Lessons

  • HPC resources are hard to exploit efficiently
  • Profile and benchmark your implementation carefully
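
As a small illustration of the benchmarking advice, the sketch below times two ways of merging partial results with base R's system.time. The workload and the function names (combine_incremental, combine_at_once, bench) are hypothetical and unrelated to the SPRINT code. The point is only that two routes to the same answer can differ in cost, and that measuring is how one finds out.

```r
# Hypothetical partial results: 200 chunks of 1000 values each.
set.seed(1)
parts <- replicate(200, rnorm(1000), simplify = FALSE)

combine_incremental <- function(parts) Reduce(c, parts)  # grows the result chunk by chunk
combine_at_once     <- function(parts) unlist(parts)     # builds the result in one go

# Repeat each measurement and keep the median elapsed time (in seconds).
bench <- function(f, reps = 5) {
  median(replicate(reps, system.time(f(parts))[["elapsed"]]))
}

timings <- c(incremental = bench(combine_incremental),
             at_once     = bench(combine_at_once))
print(timings)

# Check that the faster route still gives the same answer.
same_answer <- identical(combine_incremental(parts), combine_at_once(parts))
print(same_answer)
```

Timings vary from machine to machine, so none are quoted here. For a finer breakdown, R's built-in Rprof profiler reports where the time is spent inside a call.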
SLIDE 14

Future work in SPRINT

  • Better (transparent) support for large datasets: starts October 2011
  • More analysis routines – we need user input here
  • More efficient serial random forest (use randomjungle?)
  • ...
SLIDE 15

Thanks

  • Wellcome Trust grant 086696/Z/08/Z and edikt2
  • The HECToR distributed CSE service operated by NAG Ltd
  • The Centre for Numerical Algorithms and Intelligent Software

DPM Team

  • Peter Ghazal
  • Thorsten Forster
  • Muriel Mewissen

EPCC Team

  • Terry Sloan
  • Michal Piotrowski
  • Lawrence Mitchell
  • Savvas Petrou
  • Bartek Dobrzelecki
  • Jon Hill
  • Florian Scharinger
SLIDE 16

Questions?