GPU Acceleration for Machine Learning
John Canny*^
* Computer Science Division, University of California, Berkeley  ^ Google Research, 2016
Outline:
- BIDMach on single machines
- BIDMach on clusters
- DNNs for Power-Law data
- MCMC for massive
- Yahoo [Chen, Pavlov, Canny, KDD 2009]*
- Ebay [Chen, Canny, SIGIR 2011]**
- Quantcast 2011-2013
- Microsoft 2014
- Yahoo 2015 [KDD 2015]
- Google 2016-

* Best application paper prize  ** Best paper honorable mention
Performance, Interactivity, Scripting, Existing Codebase, Object-Oriented, Productivity, Concurrency, Natural Syntax, Functional, Platform Mobility
Single-threaded language benchmark (Sandy Bridge, 2.2 GHz CPU, 1 thread):
C 1600 Mflops, Julia 800 Mflops, Scala 500 Mflops, Lua 5 Mflops, Python 1 Mflops.

Hardware peak throughput: Dual Haswell (20-core, 2.4 GHz) 400 Gflops; Titan-X GPU 5 Tflops.
[Figure: BIDMach architecture. A Learner couples a Model, Optimizer, and Mixins, running on one or more GPU threads. Data blocks stream in from a DataSource (local disk, memory, or HDFS over the network) and results stream out to a DataSink (local disk, memory, or HDFS over the network).]
Zhao+Canny, SIAM DM 2013, KDD 2013, KDD 2015, IEEE BigData 2015.
1. Regression (logistic, linear)
2. Support Vector Machines
3. k-Means Clustering
4. Topic Modeling - Latent Dirichlet Allocation
5. Collaborative Filtering
6. NMF - Non-Negative Matrix Factorization
7. Factorization Machines
8. Random Forests
9. IPTW (Causal Estimation)

= Likely the fastest implementation available
[Benchmark charts, time in seconds (log scale, 1-1000 s):
- Single machine: Vowpal Wabbit, LibLinear, Scikit-Learn vs. BIDMach.
- Cluster: Spark (72-core cluster and 136-core cluster) vs. BIDMach (1 Titan-X GPU).
- Cluster: Spark (384-core cluster), Petuum (32-node cluster), GraphLab (576-core cluster) vs. BIDMach (1 GTX 680 GPU; 4 Titan-X).]
The cluster communication layer is asynchronous and fault tolerant; the reduce runs along the first (longest) dimension.
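The dimension-ordered reduce can be illustrated with plain arrays. This is a hedged sketch, not BIDMach's cluster code: the worker count, shapes, and values are illustrative assumptions.

```python
import numpy as np

# Sketch: each of k workers holds a local gradient matrix; the reduce sums
# contributions along the first (longest) dimension by splitting that
# dimension into k contiguous ranges, one per worker. Shapes are illustrative.
k = 4
rows, cols = 1000, 8                      # rows is the longest dimension
grads = [np.ones((rows, cols)) * (i + 1) for i in range(k)]

# Split the first dimension into k contiguous index ranges.
splits = np.array_split(np.arange(rows), k)

# Each worker reduces (sums) its assigned row range across all workers...
reduced_parts = [sum(g[idx] for g in grads) for idx in splits]
# ...then the parts are concatenated back into the full reduced matrix.
reduced = np.concatenate(reduced_parts)   # every entry equals 1+2+3+4 = 10
```

Splitting the longest dimension keeps the per-worker ranges large and contiguous, which is the property the slide highlights.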
[Figure: feature frequency vs. feature rank (features sorted by frequency, descending), showing a power law: Freq ∝ 1/rank^p.]
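The power-law frequency model (Freq ∝ 1/rank^p) can be sketched directly; the vocabulary size and exponent p below are illustrative assumptions, not values from the talk.

```python
import numpy as np

def power_law_freqs(n_features=10_000, p=1.0):
    # Freq ~ 1 / rank^p, normalized to a probability distribution.
    ranks = np.arange(1, n_features + 1)
    freqs = 1.0 / ranks ** p
    return freqs / freqs.sum()

freqs = power_law_freqs()

# A small "head" of features carries most of the probability mass,
# while the long "tail" of rare features carries the rest.
head_mass = freqs[:100].sum()
tail_mass = freqs[100:].sum()
```

With p = 1 and 10,000 features, the top 100 features alone carry over half the mass, which is why head and tail features are worth treating differently.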
[Figure: features reduced on each round vs. minibatch number, split into head features (reduced every round) and tail features (reduced less often).]
[Clusters: 90-node Yahoo M45; 64-node EC2; 64-node EC2.]
System                   Algorithm   Passes   AUC    Time(s)
Spark (17x m3.2xlarge)   LR-LBFGS    3        0.68   3300
BIDMach (1x g2.xlarge)   LR-SGD      3        0.74   3000
BIDMach (17x g2.xlarge)  LR-SGD      3        0.74   220
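The LR-SGD rows refer to logistic regression trained with minibatch SGD. A minimal self-contained sketch of that training loop follows; the synthetic data, sizes, learning rate, and batch size are illustrative assumptions, and this is not BIDMach's API.

```python
import numpy as np

# Synthetic, nearly linearly separable data (illustrative only).
rng = np.random.default_rng(0)
n, d = 2000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Minibatch SGD on the logistic loss; 3 passes, matching the "Passes" column.
w = np.zeros(d)
lr, batch = 0.1, 100
for epoch in range(3):
    for i in range(0, n, batch):
        xb, yb = X[i:i+batch], y[i:i+batch]
        grad = xb.T @ (sigmoid(xb @ w) - yb) / len(yb)
        w -= lr * grad

acc = float(((sigmoid(X @ w) > 0.5) == y).mean())  # training accuracy
```

The per-minibatch update is also the unit of work that streams through a DataSource in the architecture above.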
System (node type)     Nodes   Inertia   Time(s)
Spark (m3.2xlarge)     97      1.5E6     1130
Petuum (m3.2xlarge)    17      ??        2500
BIDMach (g2.xlarge)    1       1.5E6     460
BIDMach (g2.xlarge)    17      1.5E6     60
[Figure: feature frequency vs. feature rank (features sorted by frequency, descending): Freq ∝ 1/rank^p.]
[Figures: output features vs. input feature frequency (1st and 15th).]
[Figures: parameter posteriors P(θ).]
With ΔU = log(l2/l1):
- Metropolis-Hastings test: Pr(accept) = min(1, exp(ΔU)).
- Barker test: Pr(accept) = 1/(1 + exp(-ΔU)).

The Barker test is equivalent to accepting when ΔU + X > 0, where X is drawn from the standard logistic density. When ΔU is estimated from a minibatch it carries noise U that is approximately normal, so X is decomposed into that normal noise plus a correction variable Xcorrection.
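The two acceptance tests can be sketched numerically, including the equivalence of the Barker test with "accept if ΔU + X > 0, X logistic". This is a sketch with an assumed ΔU = 0.5; the normal-noise/correction decomposition is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

def mh_prob(dU):
    # Metropolis-Hastings test: Pr(accept) = min(1, exp(dU))
    return min(1.0, float(np.exp(dU)))

def barker_prob(dU):
    # Barker test: Pr(accept) = 1 / (1 + exp(-dU))
    return 1.0 / (1.0 + float(np.exp(-dU)))

# Equivalent form of the Barker test: accept iff dU + X > 0,
# with X drawn from the standard logistic density.
dU = 0.5
xs = rng.logistic(size=200_000)
emp = float(np.mean(dU + xs > 0))
# emp is close to barker_prob(0.5) ≈ 0.622
```

The logistic form is what makes the minibatch version possible: the logistic variable can be split into an (approximately normal) noise term from the minibatch estimate of ΔU plus a correction.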