Yong Cao, Debprakash Patnaik, Sean Ponce, Jeremy Archuleta, Patrick Butler, Wu-chun Feng, and Naren Ramakrishnan
Virginia Polytechnic Institute and State University


SLIDE 1

Yong Cao, Debprakash Patnaik, Sean Ponce, Jeremy Archuleta, Patrick Butler, Wu-chun Feng, and Naren Ramakrishnan Virginia Polytechnic Institute and State University

SLIDE 2

• Reverse-engineer the brain

National Academy of Engineering Top 5 Grand Challenges

Cited from Sciseek.com

Action Potentials (Spikes)

Neuron anatomy: Dendrites (receiver), Axon (wires), Axon Terminal (transmitter)

Question: How are the neurons connected?

SLIDE 3

• Reverse-engineer the brain

National Academy of Engineering Top 5 Grand Challenges

Neurons grown on a Multi-Electrode Array (MEA) chip; each electrode (A, B, C) produces a spike train stream over time.

SLIDE 4

• Reverse-engineer the brain

National Academy of Engineering Top 5 Grand Challenges

Find Repeating Patterns → Infer Network Connectivity

SLIDE 5

• Fast data mining of spike train streams on Graphics Processing Units (GPUs)

Multi-Electrode Array (MEA) chip
NVIDIA GTX 280 graphics card (GPU chip)

SLIDE 6

• Fast data mining of spike train streams on Graphics Processing Units (GPUs)
• Two key algorithmic strategies to address the scalability problem on the GPU:
  • A hybrid mining approach
  • A two-pass elimination approach

SLIDE 7

• Event stream data: a sequence of neuron firings

(E1, t1), (E2, t2), ..., (En, tn)

Raster plot: each row is a neuron (A, B, C, D), each mark a firing time. For example, an event of type A occurred at t = 6, and an event of type D occurred at t = 5.
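As a minimal sketch, the stream on this slide can be held as a time-ordered list of (event type, time) pairs; the labels and times below are illustrative, not taken from the real dataset.

```python
# The slide's event stream (E1,t1), (E2,t2), ..., (En,tn) as a plain,
# time-ordered list of (event_type, timestamp) pairs.
stream = [("D", 5), ("A", 6), ("B", 8)]

def events_of_type(stream, etype):
    """Return the firing times of one neuron (event type)."""
    return [t for e, t in stream if e == etype]
```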

SLIDE 8

• Pattern or Episode
• Occurrences (non-overlapped)
• Inter-event constraint

Raster plot: neurons A-D firing over time; the highlighted episode appears twice in the event stream.
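The non-overlapped occurrence count can be sketched as a short greedy routine. This is a minimal version that ignores the inter-event time constraints shown on the slide; tracking restarts after each completed occurrence, so no event is shared between two counted occurrences.

```python
def count_nonoverlapped(stream, episode):
    """Greedy left-to-right count of non-overlapped occurrences of a
    serial episode (e.g. A -> B -> C) in a time-ordered event stream.
    Inter-event constraints are ignored in this sketch."""
    count, pos = 0, 0          # pos: index of the next symbol we wait for
    for etype, _t in stream:
        if etype == episode[pos]:
            pos += 1
            if pos == len(episode):   # full occurrence completed
                count += 1
                pos = 0
    return count
```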

SLIDE 9

• Data mining problem: find all possible episodes / patterns which occur more than X times in the event sequence.
• Challenge: combinatorial explosion, a large number of episodes to count

Episode size/length:
1: A, B, ...
2: A → B, B → A, A → C, ...
3: A → B → C, A → C → B, B → A → C, B → C → A, ...
4: A → B → C → D, A → C → B → D, A → C → D → B, A → D → B → C, A → D → C → B, ...
SLIDE 10

• Mining Algorithm (a level-wise procedure to control combinatorial explosion)

Generate an initial list of candidate size-1 episodes
Repeat until no more candidate episodes:
  Count: occurrences of size-M candidate episodes  <- computational bottleneck
  Prune: retain only the frequent episodes
  Candidate Generation: size-(M+1) candidate episodes from size-M frequent episodes
Output all the frequent episodes
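The level-wise loop can be sketched in a few lines. Here count_fn is a stand-in for whatever counter performs the bottleneck Count step (the part the talk moves to the GPU); any Python counter works for illustration.

```python
def mine_frequent_episodes(stream, event_types, threshold, count_fn, max_size):
    """Level-wise mining loop: count size-M candidates, prune the
    infrequent ones, then generate size-(M+1) candidates from the
    size-M survivors."""
    frequent = []
    candidates = [(e,) for e in event_types]                # size-1 candidates
    while candidates:
        survivors = [ep for ep in candidates
                     if count_fn(stream, ep) >= threshold]  # count + prune
        frequent.extend(survivors)
        candidates = [ep + (e,)                             # candidate generation
                      for ep in survivors if len(ep) < max_size
                      for e in event_types if e not in ep]
    return frequent
```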

SLIDE 11

• Counting Algorithm (for one episode)

Episode: A → B → C → D, tracked by a chain of state-machine acceptors Accept_A(), Accept_B(), Accept_C(), Accept_D()

Event stream: A1 A2 B4 A5 C10 B12 C13 D17 (subscripts are firing times)

SLIDE 12

• Find an efficient counting algorithm on the GPU to count the occurrences of N size-M episodes in an event stream.
• Address the scalability problem on the GPU's massively parallel execution architecture.

SLIDE 13

• One episode per GPU thread (PTPE)
  • Each thread counts one episode
  • Simple extension of serial counting

N episodes map to N GPU threads across the multiprocessors (SM/SP); the event stream resides in global memory.

• Efficient when the number of episodes is larger than the number of GPU cores.

SLIDE 14

• If there are not enough episodes (one per thread), some GPU cores will be idle.
• Solution: increase the level of parallelism with Multiple Threads per Episode (MTPE).

The event stream is split into M segments: N episodes × M segments map to N × M GPU threads instead of N.
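A minimal sketch of the MTPE data split; the plain Python slicing stands in for the GPU's segment assignment and is illustrative only. As the next slide notes, naively summing the per-segment counts is problematic, since an occurrence can straddle a segment boundary.

```python
def split_into_segments(stream, m):
    """Cut the event stream into m contiguous segments so that each
    (episode, segment) pair can get its own GPU thread, N x M threads
    in total instead of N."""
    k = -(-len(stream) // m)             # ceil(len(stream) / m)
    return [stream[i:i + k] for i in range(0, len(stream), k)]
```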

SLIDE 15

• Problem with a simple count merge: occurrences that cross segment boundaries are miscounted.

SLIDE 16

• Choose the right algorithm with respect to the number of episodes N.
• Define a switching threshold, the crossover point (CP):

If N < CP, use MTPE; otherwise, use PTPE.

CP = MP × B_MP × T_B × f(size)

MP: number of multiprocessors; B_MP: blocks per multiprocessor; T_B: threads per block (together, the GPU computing capacity); f(size): performance penalty factor.
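The dispatch can be sketched directly from the formula. The multiprocessor, block, and thread numbers in the test values below are assumptions for illustration, not measured configuration.

```python
def choose_algorithm(n_episodes, mp, b_mp, t_b, f_size):
    """Hybrid dispatch: the crossover point CP = MP * B_MP * T_B * f(size)
    combines the GPU's thread capacity (MP * B_MP * T_B) with the
    performance penalty factor f(size). Below CP, MTPE's extra
    parallelism wins; above it, PTPE already saturates the GPU."""
    cp = mp * b_mp * t_b * f_size
    return "MTPE" if n_episodes < cp else "PTPE"
```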

SLIDE 17

Counting example: episode A → B → C → D tracked by acceptors Accept_A(), Accept_B(), Accept_C(), Accept_D() over the event stream A1 A2 B4 A5 C10 B12 C13 D17.

• Problem: the original counting algorithm is too complex for a GPU kernel function.

SLIDE 18

• Problem: the original counting algorithm is too complex for a GPU kernel function.
  • Large shared memory usage
  • Large register file usage
  • Large number of branching instructions

SLIDE 19

Episode with inter-event constraints: A →(−,5] B →(−,10] C →(−,5] D, counted over the event stream A1 A2 B4 A5 C10 B12 C13 D17.

• Solution: PreElim algorithm
  • Less constrained counting → a simple kernel function
  • Upper bound only

SLIDE 20

• A simpler kernel function

Per-thread resource usage:

                  Shared Memory       Register   Local Memory
PreElim           4 × episode size    13         -
Normal Counting   44 × episode size   17         80

SLIDE 21

• Solution: two-pass elimination approach

PASS 1 (less constrained counting): all episodes, one thread per episode, over the event stream
PASS 2 (normal counting): only the surviving, far fewer episodes, one thread per episode
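A sketch of the two-pass flow, with the two counters passed in as ordinary functions; pre_elim_fn and exact_fn are hypothetical stand-ins for the PreElim and normal-counting kernels. Because relaxing the constraints can only add occurrences, the PreElim count is an upper bound, so culling below threshold in pass 1 is safe.

```python
def two_pass_count(stream, episodes, threshold, pre_elim_fn, exact_fn):
    """Two-pass elimination: pass 1 culls with the cheap upper-bound
    counter, pass 2 runs the expensive exact counter on the survivors."""
    survivors = [ep for ep in episodes
                 if pre_elim_fn(stream, ep) >= threshold]   # pass 1: cull
    result = {}
    for ep in survivors:                                    # pass 2: exact
        c = exact_fn(stream, ep)
        if c >= threshold:
            result[ep] = c
    return result
```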

SLIDE 22

• A simpler kernel function

Compile-time difference (per-thread resources):

                  Shared Memory       Register   Local Memory
PreElim           4 × episode size    13         -
Normal Counting   44 × episode size   17         80

Run-time difference (profiler counters):

           Local Memory Load and Store   Divergent Branching
Two Pass   24,770,310                    12,258,590
Hybrid     210,773,785                   14,161,399

SLIDE 23

• Hardware
  • Computer (custom-built)
    • Intel Core2 Quad @ 2.33 GHz
    • 4 GB memory
  • Graphics card (NVIDIA GTX 280 GPU)
    • 240 cores (30 MPs × 8 cores) @ 1.3 GHz
    • 1 GB global memory
    • 16 KB shared memory per MP

SLIDE 24

• Datasets
  • Synthetic (Sym26)
    • 60 seconds with 50,000 events
  • Real (culture grown for 5 weeks)
    • Day 33: 2-1-33 (333,478 events)
    • Day 34: 2-1-34 (406,795 events)
    • Day 35: 2-1-35 (526,380 events)

SLIDE 25

• PTPE vs MTPE

Chart: PTPE vs MTPE runtimes, with crossover points marked.

SLIDE 26

• Performance of the Hybrid Approach

Chart: runtime (ms, 200-1200) vs episode size (1-7) for PTPE, MTPE, and Hybrid; episode numbers and crossover points annotated. Sym26 dataset, support = 100.

SLIDE 27

• Crossover Point Estimation
  • A linear function f(size) = a · size + b is a better fit.
  • A least-squares fit is performed.
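The least-squares fit can be reproduced with the closed-form normal equations for a single predictor. This pure-Python sketch assumes measured (size, penalty) pairs are available; the data in the test is synthetic.

```python
def fit_linear(sizes, penalties):
    """Ordinary least-squares fit of f(size) = a * size + b to measured
    performance-penalty factors (closed-form, no numpy)."""
    n = len(sizes)
    sx, sy = sum(sizes), sum(penalties)
    sxx = sum(x * x for x in sizes)
    sxy = sum(x * y for x, y in zip(sizes, penalties))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b
```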

SLIDE 28

• Two-pass approach vs hybrid approach

99.9% fewer episodes

SLIDE 29

• Performance of the two-pass approach

Runtime (ms) by episode size:

Episode size:   1      2        3        4         5
One Pass        93.2   1839.8   16139.7  132752.6  7036.6
Two Pass        160.4  1716.6   12602.6  41581.7   1844.6

Episodes culled by the first pass:

Episode size:     1    2     3      4       5
Total #           64   6210  33623  173408  6288
First Pass Cull   18   2677  21442  169360  6288

2-1-35 dataset, support = 3150

SLIDE 30

• Percentage of episodes eliminated by each pass

Chart: percentage eliminated by the first vs second pass (91%-100%) as support varies from 3000 to 4000. 2-1-35 dataset, episode size = 4.

SLIDE 31

• GPU vs CPU
  • The GPU is always faster than the CPU: 5x - 15x speedup
  • Fair comparison:
    • Two-pass algorithm used
    • Maximum threading for both

SLIDE 32

• Massive parallelism is required for conquering the near-exponential search space
• GPUs are far more accessible than high-performance clusters
• Frequent episode mining is not data-parallel; the algorithm was redesigned
• Framework for real-time and interactive analysis of spike train experimental data

SLIDE 33

• A fast temporal data mining framework on GPUs
  • Commoditized system
  • Massively parallel execution architecture
• Two programming strategies
  • A hybrid approach: increase the level of parallelism (data segmentation + map-reduce)
  • A two-pass elimination approach: decrease algorithm complexity (task decomposition)

SLIDE 34

Questions?

SLIDE 35

• Parallel execution via pthreads
• Optimized for CPU execution
  • Minimize disk access
  • Cache performance
• Implements the two-pass approach
  • PreElim: simpler/quicker state machine
  • Full state machine: slower, but required to eliminate all unsupported episodes

SLIDE 36

• Level-wise candidate generation
  • Size-N frequent episodes => size-(N+1) candidates

Diagram: two size-N episodes are combined (+) into a size-(N+1) candidate.