A parallel random forest classifier for R in SPRINT (Lawrence Mitchell, EPCC)


  1. A parallel random forest classifier for R in SPRINT Lawrence Mitchell EPCC lawrence.mitchell@ed.ac.uk

  2. Outline
     • SPRINT
     • Random Forests
     • Parallelisation results/lessons
     • Future work

  3. SPRINT
     • The Simple Parallel R Interface, www.r-sprint.org [Hill et al, BMC Bioinformatics (2008)]
     • Enable R users to easily exploit HPC resources
     • Motivating use case: analysis of gene expression data
     • R library: install.packages("sprint")
     • Parallel replacements for time-consuming analysis functions
     • Written in C with MPI for parallelisation
     • Portable – multicore desktops; clusters; supercomputers; EC2

  4. Example: permutation adjusted p-values

     Serial, with multtest:

         library(multtest)
         data(golub)
         ## Can take a long time
         resT <- mt.maxT(golub, golub.cl, test="t", side="abs")
         quit(save="no")

     Parallel, with SPRINT:

         library(sprint)
         data(golub)
         ## Run in parallel
         resT <- pmaxT(golub, golub.cl, test="t", side="abs")
         pterminate()
         quit(save="no")
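To give a feel for what the call on this slide computes, here is a minimal Python sketch of a single-step maxT permutation adjustment using a two-sample Welch t-statistic. It is a simplification under stated assumptions: `mt.maxT` implements a step-down variant, and nothing here reproduces how `pmaxT` distributes permutations across processes. The function names `welch_t` and `maxt_adjusted` and the toy data are hypothetical.

```python
import random
from statistics import mean, variance

def welch_t(row, labels):
    """Two-sample Welch t-statistic for one probe, given 0/1 class labels."""
    a = [x for x, g in zip(row, labels) if g == 0]
    b = [x for x, g in zip(row, labels) if g == 1]
    se2 = variance(a) / len(a) + variance(b) / len(b)
    return (mean(b) - mean(a)) / se2 ** 0.5

def maxt_adjusted(data, labels, n_perm=200, seed=0):
    """Single-step maxT adjusted p-values based on |t| (two-sided)."""
    rng = random.Random(seed)
    observed = [abs(welch_t(row, labels)) for row in data]
    exceed = [0] * len(data)
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)  # permute the class labels
        max_t = max(abs(welch_t(row, perm)) for row in data)
        # A probe is penalised whenever the permuted maximum reaches its statistic.
        for i, t_obs in enumerate(observed):
            if max_t >= t_obs:
                exceed[i] += 1
    return [count / n_perm for count in exceed]

# Hypothetical toy data: 3 probes, 8 cases; only the first probe differs by class.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
data = [
    [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1],
    [1.0, 2.0, 1.5, 0.5, 1.2, 1.8, 0.7, 1.4],
    [3.1, 2.9, 3.3, 2.7, 3.0, 3.2, 2.8, 3.05],
]
adj = maxt_adjusted(data, labels)
```

Each permutation is independent of the others, which is what makes this kind of computation a natural candidate for parallelisation.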

  5. Supported functionality
     • Available now on CRAN
       – Pearson correlation (replacing cor)
       – Permutation adjusted p-values (replacing mt.maxT, from multtest)
       – Partitioning around medoids (replacing pam)
     • Coming soon: implemented, in test phase
       – Parallel apply (like apply)
       – Bootstrapping (replacing boot)
       – Random Forests (builds on randomForest package)
       – Rank Products (implements functionality from Bioconductor package RP)
     • Under development
       – RMA analysis (affy package from Bioconductor)

  6. Classifying data
     • Have probe data from 100 patients in two groups
     • Want to classify further patient data into the correct group
       – e.g. susceptibility to some disease
     • Construct a classification model using training data
     • Predict the classification of unseen data
     • Random Forests provide such a method

  7. Random Forests
     • An ensemble tree classifier
     • Bootstrap samples of a dataset
       – genetic data typically: O(100) cases; O(10000) probes
     • One tree per sample, giving a forest ...
     • Compute classifications over this ensemble
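The bootstrap, one-tree-per-sample, and ensemble-vote steps on this slide can be sketched in a few lines of Python. This is an illustration only, not the randomForest algorithm: each "tree" is reduced to a single split (a decision stump) over a random subset of probes, and all names (`bootstrap_indices`, `fit_stump`, `fit_forest`, `predict_forest`) and the toy data are hypothetical.

```python
import random
from collections import Counter

def bootstrap_indices(n_cases, rng):
    """Draw n_cases case indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_cases) for _ in range(n_cases)]

def fit_stump(X, y, idx, rng, n_try):
    """Fit a one-split 'tree' on a bootstrap sample, trying n_try random probes."""
    best = None
    for p in rng.sample(range(len(X[0])), n_try):
        for i in idx:
            thr = X[i][p]
            left = [y[j] for j in idx if X[j][p] <= thr]
            right = [y[j] for j in idx if X[j][p] > thr]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(v != l_lab for v in left) + sum(v != r_lab for v in right)
            if best is None or errors < best[0]:
                best = (errors, p, thr, l_lab, r_lab)
    if best is None:  # degenerate sample with no valid split: predict the majority class
        maj = Counter(y[j] for j in idx).most_common(1)[0][0]
        return (0, float("inf"), maj, maj)
    return best[1:]

def predict_stump(stump, x):
    p, thr, l_lab, r_lab = stump
    return l_lab if x[p] <= thr else r_lab

def fit_forest(X, y, n_trees=25, n_try=2, seed=0):
    """One stump per bootstrap sample, giving a (very small) forest."""
    rng = random.Random(seed)
    return [fit_stump(X, y, bootstrap_indices(len(X), rng), rng, n_try)
            for _ in range(n_trees)]

def predict_forest(forest, x):
    """Classify by majority vote over the ensemble."""
    return Counter(predict_stump(s, x) for s in forest).most_common(1)[0][0]

# Hypothetical toy data: 6 cases, 3 probes; every probe separates groups A and B.
X = [[0.1, 1.0, 2.0], [0.2, 1.5, 2.2], [0.3, 1.2, 2.1],
     [0.9, 3.0, 4.0], [1.0, 3.5, 4.4], [1.1, 3.2, 4.1]]
y = ["A", "A", "A", "B", "B", "B"]
forest = fit_forest(X, y)
low = predict_forest(forest, [0.0, 0.5, 1.5])
high = predict_forest(forest, [2.0, 5.0, 6.0])
```

Because each stump depends only on its own bootstrap sample, the members of the ensemble can be built independently of one another.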

  8. Existing implementations
     • randomForest [Liaw & Wiener, R News (2002)]
       – serial, R version
     • parf [Topić et al, Parallel Numerics (2005)]
       – task parallel F90, no R version
     • randomjungle [Schwarz et al, Bioinformatics (2010)]
       – task parallel C++, designed for microarray data, no R version

  9. Parallelisation
     • Build on the existing serial R package
     • Spread the generation of bootstraps across processes
     • Combine the results on a single (master) process
     • 96000 probes; 65 cases
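A minimal Python sketch of this decomposition follows, assuming the trees are divided as evenly as possible between processes and that the master appends each sub-forest in turn. It is not SPRINT's code: the processes are simulated serially, `grow_subforest` is a hypothetical placeholder that returns labels rather than fitted trees, and the counts of 500 trees and 8 processes are arbitrary.

```python
def trees_per_process(n_trees, n_procs):
    """Split n_trees as evenly as possible; the first (n_trees % n_procs) ranks get one extra."""
    base, extra = divmod(n_trees, n_procs)
    return [base + (1 if rank < extra else 0) for rank in range(n_procs)]

def grow_subforest(rank, count):
    """Placeholder for the per-process work: labels stand in for fitted trees."""
    return [f"tree(rank={rank}, k={k})" for k in range(count)]

def combine_on_master(subforests):
    """Serial combine: the master appends one sub-forest after another."""
    forest = []
    for sub in subforests:
        forest.extend(sub)
    return forest

counts = trees_per_process(500, 8)
subforests = [grow_subforest(rank, count) for rank, count in enumerate(counts)]
forest = combine_on_master(subforests)
```

In this scheme the master performs one combine step per process, so the combine phase grows with the process count even though the tree-growing phase shrinks.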

  10. Combining takes time

  11. Combine in parallel
      • Do a tree reduction (compare MPI_Reduce)
      • 24000 probes; 65 cases
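The tree reduction named on this slide can be sketched in Python as a pairwise combine that finishes in about log2(P) rounds instead of P - 1 sequential steps on the master. The sketch simulates the ranks serially in one process; `tree_reduce` is a hypothetical helper, not an MPI or SPRINT function, and it assumes a non-empty input and an associative `combine`.

```python
def tree_reduce(values, combine):
    """Pairwise reduction over per-rank values; returns (result, number_of_rounds)."""
    values = list(values)
    rounds = 0
    stride = 1
    while stride < len(values):
        # In each round, rank r (a multiple of 2 * stride) absorbs rank r + stride.
        # These merges touch disjoint pairs, so on real hardware they can run concurrently.
        for r in range(0, len(values) - stride, 2 * stride):
            values[r] = combine(values[r], values[r + stride])
        stride *= 2
        rounds += 1
    return values[0], rounds

# Eight simulated ranks, each holding a one-tree sub-forest.
subforests = [[f"t{rank}"] for rank in range(8)]
forest, rounds = tree_reduce(subforests, lambda a, b: a + b)
```

With 8 ranks the result ends up on rank 0 after 3 rounds, and the same loop handles process counts that are not a power of two.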

  12. Faster time to solution
      • More efficient exploitation of computing resource
      • 96000 probes; 65 cases

  13. Lessons
      • HPC resources are hard to exploit efficiently
      • Profile and benchmark your implementation carefully

  14. Future work in SPRINT
      • Better (transparent) support for large datasets: starts October ’11
      • More analysis routines – we need user input here
      • More efficient serial random forest (use randomjungle?)
      • ...

  15. Thanks
      • DPM Team: Peter Ghazal, Thorsten Forster, Muriel Mewissen
      • EPCC Team: Terry Sloan, Michal Piotrowski, Lawrence Mitchell, Savvas Petrou, Bartek Dobrzelecki, Jon Hill, Florian Scharinger
      • Wellcome Trust grant 086696/Z/08/Z and edikt2
      • The HECToR distributed CSE service operated by NAG Ltd
      • The Centre for Numerical Algorithms and Intelligent Software

  16. Questions?
