

  1. Yong Cao, Debprakash Patnaik, Sean Ponce, Jeremy Archuleta, Patrick Butler, Wu-chun Feng, and Naren Ramakrishnan Virginia Polytechnic Institute and State University

  2.  Reverse-engineer the brain (National Academy of Engineering Top 5 Grand Challenges). [Figure: neuron diagram, cited from Sciseek.com: dendrites (receiver), axon (wires), axon terminal (transmitter); action potentials (spikes).] Question: How are the neurons connected?

  3.  Reverse-engineer the brain (National Academy of Engineering Top 5 Grand Challenges). Multi-Electrode Array (MEA): neurons grown on an MEA chip produce a spike train stream, one train per electrode (A, B, C, ...) over time.

  4.  Reverse-engineer the brain (National Academy of Engineering Top 5 Grand Challenges): find repeating patterns in the spike trains, then infer network connectivity.

  5.  Fast data mining of spike train streams on Graphics Processing Units (GPUs): from the Multi-Electrode Array (MEA) chip to the GPU chip (NVIDIA GTX 280 graphics card).

  6.  Fast data mining of spike train streams on Graphics Processing Units (GPUs). Two key algorithmic strategies address the scalability problem on the GPU: a hybrid mining approach, and a two-pass elimination approach.

  7.  Event stream data: a sequence of neuron firings (E1, t1), (E2, t2), ..., (En, tn), where Ei is the event type and ti its time of occurrence. Example: an event of type D occurred at t = 5; an event of type A occurred at t = 6.

  8.  Pattern or episode: an ordered sequence of event types with inter-event constraints, e.g. A -> B -> C -> D. Occurrences are counted non-overlapped. In the example event stream, the episode appears twice.

  9.  Data mining problem: find all possible episodes/patterns that occur more than X times in the event sequence. Challenge: combinatorial explosion, a very large number of episodes to count. Episodes by size/length: size 1: A, B, ...; size 2: A -> B, B -> A, A -> C, ...; size 3: A -> B -> C, A -> C -> B, B -> A -> C, ...; size 4: A -> B -> C -> D, A -> C -> B -> D, A -> C -> D -> B, A -> D -> B -> C, ...

  10.  Mining algorithm (a level-wise procedure to control the combinatorial explosion). Generate an initial list of candidate size-1 episodes, then repeat until no candidate episodes remain: Count the occurrences of the size-M candidate episodes (the computational bottleneck); Prune, retaining only the frequent episodes; Candidate generation: build size-(M+1) candidate episodes from the size-M frequent episodes. Finally, output all the frequent episodes.
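The level-wise loop above can be sketched in plain Python. This is a serial illustration, not the paper's GPU implementation; `count_serial` and `gen_candidates` are simplified stand-ins for the counting and candidate-generation steps.

```python
def count_serial(events, episode):
    """Greedy left-to-right scan counting non-overlapped occurrences of
    an episode (a tuple of event types that must appear in order)."""
    idx, count = 0, 0
    for etype, _t in events:
        if etype == episode[idx]:
            idx += 1
            if idx == len(episode):   # episode completed: count and reset
                count, idx = count + 1, 0
    return count

def gen_candidates(frequent):
    """Join size-M frequent episodes whose (M-1)-suffix/prefix match
    into size-(M+1) candidates (simplified Apriori-style join)."""
    return sorted({a + (b[-1],) for a in frequent for b in frequent
                   if a[1:] == b[:-1]})

def mine(events, event_types, support):
    """Repeat count / prune / generate until no candidates remain."""
    candidates = [(t,) for t in sorted(event_types)]
    frequent = []
    while candidates:
        level = [ep for ep in candidates if count_serial(events, ep) >= support]
        frequent.extend(level)              # prune: keep only frequent episodes
        candidates = gen_candidates(level)  # size-(M+1) candidates
    return frequent
```

For example, `mine([('A', 1), ('B', 2), ('A', 3), ('B', 4)], {'A', 'B'}, 2)` returns `[('A',), ('B',), ('A', 'B')]`.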

  11.  Counting algorithm (for one episode). The episode A -> B -> C -> D is tracked by a chain of state-machine acceptors Accept_A(), Accept_B(), Accept_C(), Accept_D(); events from the stream (A@1, A@2, B@4, A@5, C@10, B@12, C@13, D@17) advance the automaton toward a counted occurrence.
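In serial form, the acceptor chain on this slide behaves like the sketch below. This is an illustrative reconstruction; the variable names and the returned occurrence windows are my own, not from the slides.

```python
def count_episode(events, episode):
    """Track one episode (e.g. ('A','B','C','D')) with a single chain of
    accept states: each matching event advances the automaton; completing
    the chain records one non-overlapped occurrence and resets."""
    waiting = 0          # index of the next event type the automaton accepts
    start = None         # time of the event that started the current attempt
    occurrences = []     # (start_time, end_time) per counted occurrence
    for etype, t in events:
        if etype == episode[waiting]:
            if waiting == 0:
                start = t
            waiting += 1
            if waiting == len(episode):
                occurrences.append((start, t))
                waiting = 0
    return occurrences

stream = [('A', 1), ('A', 2), ('B', 4), ('A', 5),
          ('C', 10), ('B', 12), ('C', 13), ('D', 17)]
print(count_episode(stream, ('A', 'B', 'C', 'D')))   # [(1, 17)]
```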

  12.  Goal: find an efficient counting algorithm on the GPU to count the occurrences of N size-M episodes in an event stream, and address the scalability problem on the GPU's massively parallel execution architecture.

  13.  One episode per GPU thread (PTPE): each thread counts one episode, a simple extension of serial counting. N episodes map to N GPU threads across the multiprocessors, with the event stream in global memory. Efficient when the number of episodes is larger than the number of GPU cores.

  14.  When there are not enough episodes, some GPU cores sit idle. Solution: increase the level of parallelism. Multiple Threads per Episode (MTPE): the event stream is split into M segments, so N episodes map to N x M GPU threads.

  15.  Problem with simple count merge: occurrences that span segment boundaries are not handled by simply summing the per-segment counts.

  16.  Choose the right algorithm with respect to the number of episodes N. Define a switching threshold, the crossover point (CP): if N < CP, use MTPE; otherwise use PTPE. CP = MP x B_MP x T_B x f(size), where MP is the number of multiprocessors (the GPU's computing capacity), B_MP the number of blocks per multiprocessor, T_B the number of threads per block, and f(size) a penalty factor.
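As a numeric illustration of this rule: MP = 30 is the GTX 280's multiprocessor count from the hardware slide, but the B_MP, T_B, and f(size) values below are assumed for illustration only, not figures from the paper.

```python
def crossover_point(mp, b_mp, t_b, f_size):
    """CP = MP x B_MP x T_B x f(size)."""
    return mp * b_mp * t_b * f_size

def choose_algorithm(n_episodes, cp):
    """If N < CP use MTPE, otherwise PTPE."""
    return "MTPE" if n_episodes < cp else "PTPE"

cp = crossover_point(mp=30, b_mp=1, t_b=32, f_size=1.0)   # 960 with these example values
print(choose_algorithm(500, cp))    # few episodes: split the stream (MTPE)
print(choose_algorithm(5000, cp))   # enough episodes to fill the GPU (PTPE)
```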

  17.  Problem: the original counting algorithm is too complex for a GPU kernel function. The per-episode state machine (Accept_A(), Accept_B(), Accept_C(), Accept_D() over the event stream) is expensive to run as a kernel.

  18.  Problem: the original counting algorithm is too complex for a GPU kernel function: large shared memory usage, large register file usage, and a large number of branching instructions.

  19.  Solution: the PreElim algorithm: less constrained counting with a simple kernel function, producing an upper bound only. Example episode with inter-event constraints: A -> B -> C -> D with intervals (-inf, 5], (-inf, 10], (-inf, 5].
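A serial sketch of the idea follows. The exact relaxation PreElim performs is not spelled out on the slide; here the relaxed counter simply ignores the inter-event time constraints, which is enough to show why its count is an upper bound.

```python
def count_relaxed(events, episode):
    """PreElim-style counting: event order only, no time constraints."""
    i, count = 0, 0
    for etype, _t in events:
        if etype == episode[i]:
            i += 1
            if i == len(episode):
                count, i = count + 1, 0
    return count

def count_constrained(events, episode, max_gaps):
    """Normal counting: consecutive events must also satisfy
    gap <= max_gaps[k], i.e. the (-inf, g] intervals on the slide."""
    i, count, prev_t = 0, 0, None
    for etype, t in events:
        if etype == episode[i]:
            if i > 0 and t - prev_t > max_gaps[i - 1]:
                # Constraint violated: restart (the event may begin a new attempt).
                i = 1 if etype == episode[0] else 0
                prev_t = t if i == 1 else None
                continue
            i, prev_t = i + 1, t
            if i == len(episode):
                count, i, prev_t = count + 1, 0, None
    return count

stream = [('A', 1), ('B', 4), ('C', 10), ('D', 17)]
print(count_relaxed(stream, ('A', 'B', 'C', 'D')))                    # 1
print(count_constrained(stream, ('A', 'B', 'C', 'D'), [5, 10, 5]))    # 0 (17 - 10 > 5)
```

Because the relaxed count can never be smaller than the constrained count, any episode whose relaxed count falls below the support threshold can be eliminated without ever running the expensive kernel.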

  20.  A simpler kernel function (per-thread resource usage):
                        Shared Memory        Registers   Local Memory
       PreElim          4 x episode size     13          0
       Normal Counting  44 x episode size    17          80

  21.  Solution: the two-pass elimination approach. Pass 1: less constrained counting over all candidate episodes. Pass 2: normal counting over the far fewer episodes that survive pass 1.
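Structurally, the two passes compose as below. This is a serial sketch; the dictionaries are toy stand-ins for the GPU kernels, with the pass-1 counter over-approximating the pass-2 counter as on the surrounding slides.

```python
def two_pass(candidates, support, cheap_count, full_count):
    """Pass 1: cull episodes whose cheap upper-bound count is already
    below support. Pass 2: run the expensive count on survivors only."""
    survivors = [ep for ep in candidates if cheap_count(ep) >= support]
    return [ep for ep in survivors if full_count(ep) >= support]

# Toy stand-ins for the two kernels (hypothetical counts).
upper_bound = {"AB": 5, "AC": 1, "BC": 4}
exact = {"AB": 3, "AC": 0, "BC": 1}
print(two_pass(["AB", "AC", "BC"], 2, upper_bound.get, exact.get))   # ['AB']
```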

  22.  A simpler kernel function.
       Compile-time difference:
                        Shared Memory        Registers   Local Memory
       PreElim          4 x episode size     13          0
       Normal Counting  44 x episode size    17          80
       Run-time difference:
                  Local Memory Loads/Stores   Divergent Branching
       Two Pass   24,770,310                  12,258,590
       Hybrid     210,773,785                 14,161,399

  23.  Hardware. Computer (custom-built): Intel Core2 Quad @ 2.33 GHz, 4 GB memory. Graphics card (NVIDIA GTX 280 GPU): 240 cores (30 MPs x 8 cores) @ 1.3 GHz, 1 GB global memory, 16 KB shared memory per MP.

  24.  Datasets. Synthetic (Sym26): 60 seconds with 50,000 events. Real (culture grown for 5 weeks): Day 33: 2-1-33 (333,478 events); Day 34: 2-1-34 (406,795 events); Day 35: 2-1-35 (526,380 events).

  25.  PTPE vs. MTPE: crossover points.

  26.  Performance of the hybrid approach. [Chart: running time (ms) vs. episode size (1-7) for PTPE, MTPE, and Hybrid, with crossover points marked. Sym26 dataset, support = 100.]

  27.  Crossover point estimation: f(size) = a/size + b is a better fit; a and b are determined by a least-squares fit.

  28.  Two-pass approach vs. hybrid approach: 99.9% fewer episodes remain after the first pass.

  29.  Performance of the two-pass approach (2-1-35 dataset, support = 3150):
       Episode size       1      2       3        4         5
       One Pass (ms)      93.2   1839.8  16139.7  132752.6  7036.6
       Two Pass (ms)      160.4  1716.6  12602.6  41581.7   1844.6
       Total # episodes   64     6210    33623    173408    6288
       First-pass cull    18     2677    21442    169360    6288

  30.  Percentage of episodes eliminated by each pass. [Chart: share eliminated by the first vs. second pass (91%-100%) across support thresholds 3000-4000. 2-1-35 dataset, episode size = 4.]

  31.  GPU vs. CPU: the GPU is always faster than the CPU, with a 5x-15x speedup. Fair comparison: the two-pass algorithm and maximum threading were used on both.

  32.  Massive parallelism is required to conquer the near-exponential search space, and GPUs are far more accessible than high-performance clusters. Frequent episode mining is not data-parallel, so the algorithm had to be redesigned. Result: a framework for real-time and interactive analysis of spike train experimental data.

  33.  A fast temporal data mining framework on GPUs: a commoditized system with a massively parallel execution architecture. Two programming strategies: the hybrid approach, which increases the level of parallelism (data segmentation + map-reduce), and the two-pass elimination approach, which decreases algorithm complexity (task decomposition).

  34.  Questions.

  35.  ACE: parallel execution via pthreads, optimized for CPU execution (minimize disk access, cache performance). Implements the two-pass approach: PreElim, a simpler and quicker state machine, followed by the full state machine, which is slower but required to eliminate all unsupported episodes. [Diagram: candidate episodes of increasing size, e.g. A, B, ..., ACDE, AEF, EFG.]

  36.  Level-wise candidate generation: N-size frequent episodes => (N+1)-size candidates, formed by joining overlapping frequent episodes.
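The join on this slide can be sketched as follows. This is a standard Apriori-style suffix/prefix join; the paper's exact generation rule may differ in detail.

```python
def generate_candidates(frequent):
    """Combine two size-N frequent episodes into a size-(N+1) candidate
    when the (N-1)-suffix of one equals the (N-1)-prefix of the other."""
    return [a + (b[-1],) for a in frequent for b in frequent
            if a[1:] == b[:-1]]

print(generate_candidates([('A', 'B'), ('B', 'C'), ('B', 'D')]))
# [('A', 'B', 'C'), ('A', 'B', 'D')]
```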
