
SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems - PowerPoint PPT Presentation



1. SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems. Beidi Chen. Collaborators: Tharun Medini*, James Farwell†, Sameh Gobriel†, Charlie Tai†, Anshumali Shrivastava*. *Rice University, †Intel. MLSys 2020

2. Our SLIDE system (C++ from scratch) on a 44-core CPU beats TensorFlow on a V100 GPU (1 hour vs 3.5 hours) on 100+ million parameter networks. TensorFlow on the same CPU takes 16 hours even with all HPC optimizations (Intel MKL-DNN). That is 3.5x faster on CPU than TF on a V100. [Figure: wall-clock training time comparison, log scale in time.]

3. The Age of Large Networks • More data • Large models • Tons of engineering • Backpropagation (a.k.a. simple gradient descent)

4. Fully Connected NN: a giant matrix multiplication for every data point in each epoch (forward + backward); each layer computes activations g(Wᵀx). [Figure: fully connected network with Input, Hidden 1, Hidden 2, ..., Output layers.]
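
To make the cost concrete, here is a minimal dense-layer sketch (my own illustration, not the authors' code; the layer sizes and weights are arbitrary): every one of the hidden neurons is computed for every input, even though only a few activations end up mattering.

```python
import numpy as np

def dense_layer(x, W, b):
    """Dense layer: every neuron computes g(w.x + b), here g = ReLU."""
    z = W @ x + b          # O(n_hidden * n_in) multiply-adds per input
    return np.maximum(z, 0)

rng = np.random.default_rng(0)
n_in, n_hidden = 784, 1000                 # sizes are illustrative only
x = rng.standard_normal(n_in)
W = rng.standard_normal((n_hidden, n_in)) * 0.01
b = np.zeros(n_hidden)

a = dense_layer(x, W, b)                   # all 1000 activations computed
print("nonzero activations:", np.count_nonzero(a))
```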

5. Challenges: do we really need all the computations? No!! Good news: only the high activations are important. • Sampling a few neurons in proportion to their activations is enough (adaptive dropout) (Ba et al., NeurIPS 2013; Makhzani et al., NeurIPS 2015). • ReLU filters out negative activations (50% sparsity by design). • Softmax. Bad news: we need to compute all activations to identify (or sample) the high-activation neurons. NO SAVINGS.

6. The Fundamental Sampling Puzzle. Given N fixed sampling weights x_1, x_2, ..., x_N: • Task: sample y_i with probability x_i. • Cost of 1 sample: O(N). • Cost of K samples: O(N). Given N time-varying sampling weights (activations) x_1^t, x_2^t, ..., x_N^t: • Task: at time t, sample y_i with probability x_i^t. • Cost of sampling: O(N) at every time t. • The last few years of work in Locality Sensitive Hashing show: if x_i^t = f(sim(θ_t, y_i)) for a specific family of f and sim, then the cost is O(1) per query after an initial preprocessing cost of O(N).
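
A small sketch of the puzzle (illustrative only; the variable names and sizes are my assumptions): proportional sampling over fixed weights pays the O(N) preparation once, but with time-varying weights the O(N) work recurs at every step.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000

def sample_proportional(weights, k=1):
    """Draw k indices i with probability weights[i] / sum(weights).
    Building the cumulative distribution costs O(N) on every call."""
    cdf = np.cumsum(weights)              # O(N)
    u = rng.random(k) * cdf[-1]
    return np.searchsorted(cdf, u)        # cheap once the O(N) prep is done

x = rng.random(N)                         # fixed weights: O(N) prep once
idx = sample_proportional(x, k=10)

# Time-varying weights (activations): the O(N) preparation repeats at every step t.
for t in range(3):
    x_t = rng.random(N)                   # activations change with the query / time step
    idx_t = sample_proportional(x_t, k=10)
```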

7. Textbook Hashing (Dictionary). Hashing: a function h that maps a given data point x ∈ R^D to an integer key, h : R^D → {0, 1, 2, ..., N}. h(x) serves as a discrete fingerprint. Property (ideal hash functions): • If x = y, then h(x) = h(y). • If x ≠ y, then h(x) ≠ h(y).

8. Probabilistic Fingerprinting (Hashing) (late 90s). Hashing: a (randomized) function h that maps a given data point x ∈ R^D to an integer key, h : R^D → {0, 1, 2, ..., N}. h(x) serves as a discrete fingerprint. Locality Sensitive Property: • If sim(x, y) is high, then Pr(h(x) = h(y)) is high (a collision is likely). • If sim(x, y) is low, then Pr(h(x) = h(y)) is low (a collision is unlikely).

9. Example 1: Signed Random Projection (SRP). Pr(h(x) = h(y)) = 1 − cos⁻¹(θ)/π, which is monotonic in θ (the cosine similarity of x and y). A classical result from Goemans–Williamson (1995). [Figure: points x, y and two random hyperplanes H1, H2 with + / − sides.]
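
A minimal SRP/SimHash sketch (my own illustration; the dimension and bit count are arbitrary) that empirically checks the per-bit collision rate against 1 − cos⁻¹(θ)/π:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 64, 2000                      # illustrative sizes

def srp_hash(v, planes):
    """Signed random projection: one bit per random hyperplane."""
    return (planes @ v >= 0).astype(np.uint8)

planes = rng.standard_normal((n_bits, d))

x = rng.standard_normal(d)
y = x + 0.5 * rng.standard_normal(d)      # a vector correlated with x

angle = np.arccos(np.clip(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)), -1, 1))
empirical = np.mean(srp_hash(x, planes) == srp_hash(y, planes))
print(f"empirical P[h(x)=h(y)] = {empirical:.3f}")
print(f"theory 1 - angle/pi    = {1 - angle / np.pi:.3f}")
```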

10. Example 2: (Densified) Winner-Take-All (WTA). [Figure: original vectors with their K=3 WTA hash codes (ICCV 2011) and DWTA hash codes (UAI 2018).] Yagnik (ICCV 2011), Chen and Shrivastava (UAI 2018).
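
A sketch of the plain WTA hash along the lines of Yagnik (ICCV 2011) (my own simplified illustration; it omits the densification step of DWTA and uses arbitrary sizes). The codes depend only on the rank order of coordinates, not their magnitudes:

```python
import numpy as np

rng = np.random.default_rng(0)

def wta_hash(v, perms, K=3):
    """Winner-Take-All: for each random permutation, look at the first K
    permuted coordinates and record the index (0..K-1) of the largest one."""
    return np.array([int(np.argmax(v[p[:K]])) for p in perms])

d, n_codes = 16, 8
perms = [rng.permutation(d) for _ in range(n_codes)]

x = rng.standard_normal(d)
print(wta_hash(x, perms))            # e.g. one code per permutation
print(wta_hash(3.0 * x, perms))      # identical: positive rescaling preserves rank order
```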

11. Probabilistic Hash Tables. Given: Pr[h(x) = h(y)] = f(sim(x, y)), where f is monotonic. • Given a query q, if h1(q) = 11 and h2(q) = 01, then probe the bucket with index 1101. It is a good bucket!! • (Locality sensitive) h_i(q) = h_i(x) is a noisy indicator of high similarity. • Doing better than random!!
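
A sketch of how K hash bits per table and L tables might be assembled and probed (illustrative only; the SRP bits, parameter values, and table layout are my assumptions, not SLIDE's implementation):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d, K, L = 32, 4, 8                          # illustrative parameters

# One SRP bit per hash function; K bits are concatenated into a bucket index.
planes = rng.standard_normal((L, K, d))

def bucket_index(v, table):
    bits = (planes[table] @ v >= 0).astype(int)   # h_1(v), ..., h_K(v)
    return int("".join(map(str, bits)), 2)        # e.g. bits 1,1,0,1 -> bucket 0b1101

# Build: insert every item id into one bucket per table.
data = rng.standard_normal((1000, d))
tables = [defaultdict(set) for _ in range(L)]
for item_id, v in enumerate(data):
    for t in range(L):
        tables[t][bucket_index(v, t)].add(item_id)

# Probe: the union of the query's L buckets is a candidate set biased toward
# items similar to q -- "doing better than random".
q = data[0] + 0.1 * rng.standard_normal(d)
candidates = set().union(*(tables[t][bucket_index(q, t)] for t in range(L)))
print(len(candidates), 0 in candidates)
```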

12. LSH for Search (Known Theory) • Super-linear O(N^(1+ρ)) memory. • Sub-linear O(N^ρ) query time. • ρ < 1, but generally large (close to 1) and often hard to determine. Practical issues: • Needs a lot of hash tables and distance computations for good accuracy on near neighbors. • Buckets can be quite heavy, due to poor randomness or unfavorable data distributions.

13. New View: Data Structures for Efficient Sampling! Is LSH really a search algorithm? • Given the query θ_t, LSH samples x_i from the dataset with probability x_i^t = 1 − (1 − p(x_i, θ_t)^K)^L. • x_i^t grows with p(x_i, θ_t), and hence with the similarity of x_i and θ_t. • LSH is considered a black box for nearest-neighbor search. It is not!!
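
A quick numeric check (my own, not from the slides) that the amplified collision probability 1 − (1 − p^K)^L increases with the per-hash collision probability p, which is what makes the retrieved items a similarity-biased sample:

```python
def sample_probability(p, K, L):
    """Probability an item lands in at least one probed bucket,
    given per-hash collision probability p = Pr[h(x) = h(q)]."""
    return 1 - (1 - p**K) ** L

for p in (0.5, 0.7, 0.9):                 # higher similarity -> higher p -> higher sample prob
    print(p, round(sample_probability(p, K=4, L=8), 3))
```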

14. LSH as Samplers. We can pre-process the dataset D such that: • Given any query q, we can sample x ∈ D with probability Const × (1 − (1 − p(q, x)^K)^L), using K·L hash computations and L bucket probes. • Even K = 1, L = 1 is adaptive, so sampling is O(1)-time adaptive. • Adaptive: x is sampled with higher probability than y if and only if sim(q, x) > sim(q, y). We can exactly compute the sampling probability: • Const = (number of elements sampled) / (number of elements in the probed buckets). (Chen et al., NeurIPS 2019.) Sufficient for importance-sampling estimation. Sampling cost O(1).
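
A rough sampler sketch with K = L = 1 (my own illustration; the single SRP bit, bucket layout, and sample size m are assumptions). It probes one bucket, samples uniformly from it, and reports the empirical Const = (elements sampled) / (elements in the probed bucket):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d = 32
K, L = 1, 1                               # even K = L = 1 is adaptive

plane = rng.standard_normal(d)            # a single SRP hash: h(v) = sign(plane . v)
def h(v):
    return int(plane @ v >= 0)

# Pre-process the dataset once.
data = rng.standard_normal((10_000, d))
table = defaultdict(list)
for i, v in enumerate(data):
    table[h(v)].append(i)

def lsh_sample(q, m=32):
    """Probe q's bucket and draw m elements uniformly from it.  Roughly,
    an element x ends up in the sample with probability Const * p(q, x),
    where p(q, x) = Pr[h(q) = h(x)] and Const = m / len(bucket), so more
    similar items are sampled more often at O(1) cost per query."""
    bucket = table[h(q)]
    const = m / len(bucket)
    picks = rng.choice(bucket, size=m, replace=False)
    return picks, const

q = rng.standard_normal(d)
picks, const = lsh_sample(q)
print(len(picks), const)
```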

15. SLIDE: Sub-LInear Deep learning Engine. Step 1 – Build the hash tables by processing the weights of the hidden layers (initialization). Subtlety: the items stored in the hash tables are neurons (weight vectors), not data vectors; we are reorganizing the neurons. [Figure: hash tables H1 and H2 indexing the neurons of Hidden 1 and Hidden 2.]
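
A minimal sketch of this initialization step (illustrative only; the SRP fingerprint, table layout, and sizes are my assumptions rather than SLIDE's actual C++ data structures). The keys are hashes of neuron weight vectors and the stored items are neuron ids:

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
n_in, n_hidden, K = 128, 1000, 6          # illustrative sizes

# Layer weights: one weight vector per hidden neuron.
W = rng.standard_normal((n_hidden, n_in)) * 0.01
planes = rng.standard_normal((K, n_in))   # one SRP hash function for this layer

def fingerprint(v):
    """K sign bits concatenated into one integer bucket id."""
    bits = (planes @ v >= 0).astype(int)
    return int("".join(map(str, bits)), 2)

# Step 1: neuron ids are inserted under the hash of their WEIGHT vectors --
# a reorganization of neurons, not of data points.
hash_table = defaultdict(list)
for neuron_id in range(n_hidden):
    hash_table[fingerprint(W[neuron_id])].append(neuron_id)
```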

16. SLIDE: Sub-LInear Deep learning Engine. Step 2 – Hash the input to any given layer using that layer's randomized hash function.

17. SLIDE: Sub-LInear Deep learning Engine. Step 3 – Query the hidden layer's hash table(s) for the active set using the integer fingerprint; this samples neurons in proportion to their activations.
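
Continuing the hypothetical sketch from Step 1 (it reuses the fingerprint, hash_table, W, rng, and sizes defined there), Steps 2 and 3 hash the layer input with the same randomized hash function and read the matching bucket as the active set:

```python
# Steps 2-3 (continuing the sketch above): hash the layer input with the SAME
# randomized hash function used for the weights, then read the matching bucket.
x = rng.standard_normal(n_in)                 # input to this layer
active_set = hash_table[fingerprint(x)]       # neurons likely to have large w.x

# Weight vectors that collide with x under the LSH function are (probabilistically)
# the high-activation neurons, found without touching the other neurons.
print(f"{len(active_set)} of {n_hidden} neurons selected")
```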

18. SLIDE: Sub-LInear Deep learning Engine. Step 4 – Perform forward and back propagation only on the nodes in the active set. Computation is proportional to the number of active neurons.
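
Continuing the same hypothetical sketch, a sparse forward pass and an equally sparse weight update restricted to the active set (the toy gradient is an assumption, just to show the shape of the computation):

```python
# Step 4 (sketch): forward pass restricted to the active set.  Cost is
# O(len(active_set) * n_in) instead of O(n_hidden * n_in).
a = np.zeros(n_hidden)
a[active_set] = np.maximum(W[active_set] @ x, 0)      # ReLU on active neurons only

# Backward (sketch): only the rows of W belonging to active neurons receive
# gradients, so the update is as sparse as the forward computation.
grad_out = rng.standard_normal(len(active_set))       # stand-in for the upstream gradient
grad_W_active = np.outer(grad_out * (a[active_set] > 0), x)
W[active_set] -= 0.01 * grad_W_active
```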

19. SLIDE: Sub-LInear Deep learning Engine. Step 5 – Update the hash tables by rehashing the updated node weights. Computation is again proportional to the number of active neurons.
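
And the rehashing step in the same sketch; a real implementation would also delete stale entries or rebuild the tables periodically, which is omitted here:

```python
# Step 5 (sketch): after the sparse update, re-insert only the touched neurons
# under their new fingerprints.  Work is again proportional to the active set.
for neuron_id in active_set:
    hash_table[fingerprint(W[neuron_id])].append(neuron_id)
# Omitted bookkeeping: removing each neuron from its old bucket (or periodically
# rebuilding the tables from scratch).
```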

20. We can go very sparse if the sparsity is adaptive. • Reduce both training and inference cost by 95%! • Significantly more for larger networks (the wider, the better). [Experimental setup: 2 hidden layers, 1000 nodes per layer.]

21. Sparsity + Randomness → Asynchronous Updates. [Experimental setup: 3 hidden layers, 1000 nodes per layer.]

22. Less Computation + Asynchronous Parallelism • Each update is computationally very small (100x+ reduction in computation and energy). • Updates are near-independent, with a very low chance of conflict. Hence, parallel SGD!

23. SLIDE: Sub-LInear Deep learning Engine. [Figure: SLIDE architecture. Each network layer holds an array of neurons and L hash tables (hash functions h_1, ..., h_K over buckets). Each neuron stores its active inputs, an activation for each input, accumulated gradients, and its weights (arrays sized by the batch size and the previous layer size).]

24. Parallelism with OpenMP. Parallel across training samples in a batch (extreme sparsity and randomness in the gradient updates). Thanks to the theory of HOGWILD! (Recht et al., NeurIPS 2011). [Figure: per-node arrays of active inputs, activations for each input, accumulated gradients, and weights, sized by batch size.]
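
SLIDE itself parallelizes the batch with OpenMP in C++; as a language-neutral illustration of the HOGWILD!-style idea (my sketch, with made-up sizes and a random stand-in for the LSH-selected active set), a thread pool processes the samples of a batch and writes sparse updates into shared weights without locks:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(0)
n_in, n_hidden, batch = 128, 1000, 64
W = rng.standard_normal((n_hidden, n_in)) * 0.01      # shared weights
X = rng.standard_normal((batch, n_in))                # one batch of inputs

def train_one_sample(i):
    """Gradient step for one sample, touching only a few neurons.  No locks:
    because each sample's active set is tiny and random, two threads rarely
    write the same rows -- the HOGWILD! argument."""
    x = X[i]
    local_rng = np.random.default_rng(i)              # per-task RNG
    active = local_rng.choice(n_hidden, size=20, replace=False)  # stand-in for LSH sampling
    a = np.maximum(W[active] @ x, 0)
    grad = np.outer(a > 0, x)                         # toy gradient, shapes only
    W[active] -= 0.001 * grad                         # lock-free update to shared W

with ThreadPoolExecutor(max_workers=8) as pool:       # analogue of an OpenMP parallel-for
    list(pool.map(train_one_sample, range(batch)))
```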

25. Flexible Choices of Hash Functions. SLIDE supports four different LSH hash functions: • SimHash (cosine similarity) • Winner-Take-All hashing (order) • Densified Winner-Take-All hashing (for sparse data) • MinHash (Jaccard similarity). Easily add more!
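
For the Jaccard case, a minimal MinHash sketch (my own illustration; SLIDE's MinHash implementation details may differ): the probability that two sets agree on a MinHash value equals their Jaccard similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
universe, n_hashes = 10_000, 64

# One random permutation per hash function; the MinHash of a set is the
# smallest permuted index among its elements.
perms = [rng.permutation(universe) for _ in range(n_hashes)]

def minhash(items):
    items = np.fromiter(items, dtype=int)
    return np.array([p[items].min() for p in perms])

A = set(rng.choice(universe, 300, replace=False))
B = set(list(A)[:200]) | set(rng.choice(universe, 100, replace=False))

jaccard = len(A & B) / len(A | B)
agreement = np.mean(minhash(A) == minhash(B))         # should be close to the Jaccard similarity
print(f"Jaccard {jaccard:.2f}  MinHash agreement {agreement:.2f}")
```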
