  1. In-Place Associative Computing. Avidan Akerib, Ph.D., Vice President, Associative Computing BU, aakerib@gsitechnology.com. All images are public on the Web.

  2. Agenda • Introduction to associative computing • Use case examples • Similarity search • Large-scale attention computing • Few-shot learning • Software model • Future approaches

  3. The Challenge in AI Computing (matrix multiplication is not enough!). AI requirement, with a use-case example for each: • High-precision floating point: neural network learning • Multi-precision: real-time inference, saving memory • Linearly scalable: big data • Sort/search: top-K, recommendation, speech, image/video classification • Heavy computation: non-linearity, softmax, exponent, normalization • Bandwidth/power tradeoff: high speed at low power

  4. Von Neumann Architecture: the CPU is lower density (lots of logic) and faster; memory is high density (repeated cells) and slower. Both leverage Moore's Law.

  5. Von Neumann Architecture: memory is high density (repeated cells) and slower; the CPU is lower density (lots of logic) and faster. CPU frequency outpaces memory speed, so caches must be added. Both continue to leverage Moore's Law.

  6. Since 2006, clock speeds have flattened sharply. Source: Intel

  7. Thinking parallel: 2 cores and more. However, memory utilization becomes an issue.

  8. More and more memory is added to solve the utilization problem: local and global memory.

  9. Memory is still growing rapidly, and it becomes a larger part of each chip.

  10. The same concept holds even with GPGPUs and their memories: very high power, large die, expensive. What's next?

  11. Most of the power goes to bandwidth. Source: Song Han, Stanford University

  12. Changing the rules of the game: standard memory cells are smarter than we thought!

  13. APU (Associative Processing Unit): a simple CPU sends a question over a simple, narrow bus, and the APU, with millions of associative processors, returns the answer. • Computes in-place, directly in the memory array, removing the I/O bottleneck • Significantly increases performance • Reduces power

  14. How computers work today: an address (e.g., 0101) goes through the address decoder, a read/write enable (RE/WE) selects a single row (e.g., 1000, 1100, or 1001), and the sense amps / IO drivers move that row's data to the ALU.

  15. Accessing multiple rows simultaneously: raising several read-enable (RE) lines at once makes each bit line combine the selected cells. Bus contention is not an error! It is simply a NOR/NAND satisfying De Morgan's law.

  16. Truth table example: D = !A!C + BC. Every minterm takes one clock, and all bit lines execute their Karnaugh tables in parallel. By De Morgan's law, !A!C + BC = !!( !A!C + BC ) = !( !(!A!C) · !(BC) ) = NAND(NAND(!A,!C), NAND(B,C)). 1 clock: Read(!A,!C), Write T1; Read(B,C), Write T2. 1 clock: Read(T1,T2), Write D.
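The two-clock computation above can be sketched in NumPy, modeling a simultaneous multi-row read as a per-bit-line NAND of the selected rows (an assumed idealization of the wired bit-line behavior; the slide's derivation is exactly this NAND-of-NANDs form):

```python
import numpy as np

def multi_row_read(*rows):
    """Simultaneous read of several rows sharing bit lines, modeled as a
    per-bit-line NAND of the selected rows (all enabled word lines at once)."""
    acc = np.ones_like(rows[0])
    for r in rows:
        acc &= r
    return 1 - acc  # NAND

# One bit line per element: 8 (A, B, C) triples cover the truth table in parallel.
A = np.array([0, 0, 0, 0, 1, 1, 1, 1])
B = np.array([0, 0, 1, 1, 0, 0, 1, 1])
C = np.array([0, 1, 0, 1, 0, 1, 0, 1])
nA, nC = 1 - A, 1 - C  # complement rows, assumed stored alongside the data

T1 = multi_row_read(nA, nC)   # clock 1: NAND(!A, !C)
T2 = multi_row_read(B, C)     # clock 1: NAND(B, C)
D = multi_row_read(T1, T2)    # clock 2: NAND(T1, T2) = !A!C + BC

assert np.array_equal(D, ((1 - A) & (1 - C)) | (B & C))
```

Every bit line evaluates its own row of the truth table, so the clock count depends only on the number of minterms, not on the number of elements.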

  17. Vector add example: C[] = A[] + B[], with vector A (8 bits, 32M elements), vector B (8 bits, 32M elements), vector C (9 bits, 32M elements). Number of clocks = 4 × 8 = 32; clocks per byte = 32/32M = 1/1M; OPS = 1 GHz × 1M = 1 Peta-OPS.
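The vector add can be sketched as a bit-serial ripple carry across bit planes, each loop iteration standing in for the roughly 4 clocks per bit (a NumPy software model, not APU code; the element values are made up for illustration):

```python
import numpy as np

def bitline_add(A_bits, B_bits):
    """Bit-serial ripple-carry add: one full-adder step per bit position,
    applied to every bit line (element) in parallel."""
    n_bits, n_lines = A_bits.shape
    carry = np.zeros(n_lines, dtype=np.uint8)
    out = np.zeros((n_bits + 1, n_lines), dtype=np.uint8)  # 8-bit + 8-bit = 9-bit
    for b in range(n_bits):  # LSB to MSB; ~4 clocks per iteration on an APU
        a, bb = A_bits[b], B_bits[b]
        out[b] = a ^ bb ^ carry
        carry = (a & bb) | (carry & (a ^ bb))
    out[n_bits] = carry
    return out

def to_bits(x, n):
    """LSB-first bit planes of an integer vector."""
    return np.array([(x >> b) & 1 for b in range(n)], dtype=np.uint8)

A = np.array([200, 17, 255, 3])
B = np.array([100, 40, 255, 250])
C = bitline_add(to_bits(A, 8), to_bits(B, 8))
vals = sum(C[b].astype(int) << b for b in range(9))
assert np.array_equal(vals, A + B)
```

The clock count scales with the bit width only; with 32M bit lines the same 32 clocks add 32M element pairs at once.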

  18. CAM / associative search: each 1 in the combined key drives the read enable (RE) of a record row. The values are duplicated with their inverse data, and the key is duplicated with its inverse, placing the original key next to the inverse data. Searching for key 0110, a bit line reads 1 (= match) only where the stored value equals the key.
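The key/inverse trick can be sketched in software as follows. The wired-NOR bit line is modeled as "match iff every enabled cell is 0": the key's 1s gate the inverse data and the key's 0s gate the original data, so only an exact match leaves all enabled cells at 0 (function name and data are illustrative):

```python
import numpy as np

def cam_search(records, key):
    """Exact-match CAM built from standard memory cells.
    A record matches iff no enabled cell holds a 1:
      key bit 1 enables the inverse-data cell (fails if record bit is 0),
      key bit 0 enables the original-data cell (fails if record bit is 1)."""
    records = np.asarray(records, dtype=np.uint8)
    key = np.asarray(key, dtype=np.uint8)
    enabled_inv = key[None, :] & (1 - records)    # key gates inverse data
    enabled_orig = (1 - key)[None, :] & records   # inverse key gates data
    cells = np.concatenate([enabled_inv, enabled_orig], axis=1)
    return (cells.sum(axis=1) == 0).astype(np.uint8)  # wired-NOR per bit line

records = [[0, 1, 1, 0], [1, 0, 1, 0], [0, 1, 1, 1], [0, 1, 1, 0]]
match = cam_search(records, [0, 1, 1, 0])  # search 0110 over all records at once
assert match.tolist() == [1, 0, 0, 1]
```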

  19. TCAM search by standard memory cells: ternary cells hold 0, 1, or don't-care, so a stored pattern such as 0 / don't-care / 1 / 0 can match several keys.

  20. TCAM search by standard memory cells: a 1 in the combined key goes to the read enable. Insert zero instead of don't-care, and duplicate the data, inverting only those bits that are not don't-care; duplicate the key with its inverse, with the original key next to the inverse data. Searching for 0110, every bit line whose enabled cells are all zero reads 1 = match.
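Extending the CAM model with don't-cares as the slide describes: a don't-care position stores 0 in both the data plane and the inverse plane, so it can never pull the wired-NOR bit line low (the mask representation and names are illustrative assumptions):

```python
import numpy as np

def tcam_search(patterns, masks, key):
    """TCAM on standard memory cells. mask[i] = 1 means bit i is significant;
    don't-care positions store 0 in both planes and thus always match."""
    patterns = np.asarray(patterns, dtype=np.uint8)
    masks = np.asarray(masks, dtype=np.uint8)
    key = np.asarray(key, dtype=np.uint8)
    data = patterns & masks            # insert zero instead of don't-care
    inv = (1 - patterns) & masks       # invert only the significant bits
    enabled = (key[None, :] & inv) | ((1 - key)[None, :] & data)
    return (enabled.sum(axis=1) == 0).astype(np.uint8)  # wired-NOR per line

# pattern 0?10 (second bit don't-care) matches key 0110; pattern 0111 does not
patterns = [[0, 0, 1, 0], [0, 1, 1, 1]]
masks    = [[1, 0, 1, 1], [1, 1, 1, 1]]
assert tcam_search(patterns, masks, [0, 1, 1, 0]).tolist() == [1, 0]
```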

  21. Computing in the bit lines: with vector A (a0 ... a7) and vector B (b0 ... b7) stored on the same bit lines, compute C = f(A, B). Each bit line becomes a processor plus storage; millions of bit lines = millions of processors.

  22. Neighborhood computing: C = f(A, SL(B, 1)), a parallel shift of the bit lines across sections in 1 cycle. This enables neighborhood operations such as convolutions.
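A sketch of neighborhood computing, with `np.roll` standing in for the one-cycle parallel bit-line shift (the 3-tap kernel, the choice of f = addition, and the circular boundary are illustrative assumptions, not part of the slide):

```python
import numpy as np

def shifted_combine(A, B, shift=1):
    """C = f(A, SL(B, shift)): every bit line combines its own value with a
    neighbor's, using a parallel shift; f is chosen as '+' for illustration."""
    return A + np.roll(B, -shift)

def conv3(x, k):
    """A 1-D 3-tap convolution built from two shifts and two adds,
    i.e. repeated neighborhood operations (circular boundary)."""
    return k[0] * np.roll(x, 1) + k[1] * x + k[2] * np.roll(x, -1)

x = np.array([0, 1, 2, 3, 4, 5])
y = conv3(x, [1, 2, 1])  # all positions computed in parallel
```

Each shift touches every bit line at once, so the cost of the convolution depends on the kernel size, not on the vector length.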

  23. Search & count: searching for 20 in [5, -17, 20, 3, 20, 54, 8, 20] marks 3 matches, so count = 3. • Search (binary or ternary) all bit lines in 1 cycle • 128M bit lines => 128 Peta-searches/sec • Key applications of search and count for predictive analytics: recommender systems, K-nearest neighbors (using cosine similarity search), random forest, image histogram, regular expressions.
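A minimal software model of search & count on the slide's example data; the vectorized comparison stands in for the one-cycle parallel search, and the sum for the hardware population count:

```python
import numpy as np

def search_and_count(column, key):
    """Compare every bit line against the key in one parallel step,
    then count the match markers."""
    marks = (np.asarray(column) == key).astype(np.uint8)
    return marks, int(marks.sum())

data = np.array([5, -17, 20, 3, 20, 54, 8, 20])
marks, count = search_and_count(data, 20)
assert count == 3 and marks.tolist() == [0, 0, 1, 0, 1, 0, 0, 1]
```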

  24. CPU vs GPU vs FPGA vs APU

  25. CPU/GPGPU (current solution) vs. in-place computing (APU): • Send an address to memory vs. search by content • Fetch the data from memory and send it to the processor vs. mark it in place • Compute serially per core (thousands of cores at most) vs. compute in place on millions of processors (the memory itself becomes millions of processors) • Write the data back to memory, further wasting IO resources vs. no need to write data back, the result is already in the memory • Send data to each location that needs it vs. distribute or broadcast at once, if needed

  26. ARCHITECTURE

  27. Communication between sections: shifts between sections enable neighborhood operations (filters, CNNs, etc.), and data can be stored, computed, searched, and transported anywhere.

  28. Section computing to improve performance: the array is split into MLB sections of 24 rows each (MLB section 0, MLB section 1, ...), joined by connecting muxes, with memory control and an instruction buffer.

  29. APU chip layout: 2M bit processors (or 128K vector processors) running at 1 GHz, with up to 2 Peta-OPS peak performance.

  30. APU layout vs. GPU layout: multi-functional, programmable blocks vs. blocks that accelerate FP operations.

  31. EXAMPLE APPLICATIONS

  32. K-nearest neighbors (k-NN). Simple example: N = 36 points, 3 groups, 2 dimensions (D = 2, for X and Y), K = 4; group green is selected as the majority. In actual applications: N = billions, D = tens, K = tens of thousands.

  33. k-NN use case in an APU: the features and label of each item (item 1, item 2, ..., item N) are stored in the computing area, one item per bit-line group. Distribute the query q to all items: 2 ns. Compute the cosine distances for all N in parallel, D_j = (q · R_j) / (||q|| ||R_j||) = (Σ_i q_i R_j,i) / (√(Σ_i q_i²) √(Σ_i R_j,i²)), in about 10 µs assuming D = 50 features. Find the K minimums at O(1) complexity (about 3 µs), rank them in place (4, 1, 3, 2), and take the majority of their labels. With the database in an APU, the computation for all N items is done in about 0.05 ms, independent of K (a 1000X improvement over current solutions).
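The pipeline above can be sketched in NumPy as a plain software model (not APU code): the dataset, seed, and sizes are made up for illustration, and `argsort` stands in for the hardware K-MINS step:

```python
import numpy as np

def knn_cosine(R, q, k):
    """Top-k items by cosine similarity to the query: one dot product per
    stored item, computed for all N items at once (vectorized here)."""
    sims = (R @ q) / (np.linalg.norm(R, axis=1) * np.linalg.norm(q))
    idx = np.argsort(-sims)[:k]   # the APU replaces this sort with K-MINS
    return idx, sims[idx]

rng = np.random.default_rng(0)
R = rng.normal(size=(1000, 50))   # N = 1000 items, D = 50 features
q = rng.normal(size=50)
idx, sims = knn_cosine(R, q, 4)   # K = 4 nearest by cosine similarity
assert len(idx) == 4 and np.all(np.diff(sims) <= 0)
```

On the APU the dot products, the K-minimum search, and the majority vote all run on the bit lines, so the latency is independent of N up to the array capacity.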

  34. K-MINS: an O(1) algorithm. C is the vector of values, scanned from the MSB down to the LSB; M marks the live candidates and V marks the confirmed winners.

     KMINS(int K, vector C){
       M := 1; V := 0;
       FOR b = msb DOWNTO lsb:
         D := not(C[b]);
         N := M & D;
         cnt := COUNT(N | V);
         IF cnt > K:
           M := N;               // too many: keep only the 0-bit candidates
         ELIF cnt < K:
           V := N | V;           // all 0-bit candidates win
           M := M & C[b];        // remaining slots must have a 1 at this bit
         ELSE:                   // cnt == K
           V := N | V;
           EXIT;
         ENDIF
       ENDFOR
     }

     Any ties still marked in M after the last bit fill the remaining K - COUNT(V) slots.

  35. K-MINS: the algorithm, worked example (first iteration, bit plane C[0], the MSB): for the 16 six-bit values in the example, COUNT(N | V) = 11 > K, so M := N keeps only the rows with a 0 in the MSB.

  36. K-MINS: the algorithm (second iteration, bit plane C[1]): COUNT(N | V) = 8 > K, so M := N again.

  37. K-MINS: the algorithm (final output): after the last bit, V marks the K = 4 minimum values, here the row holding 000001 and three of the tied 000101 rows. O(1) complexity in the number of elements.
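A minimal software model of the K-MINS loop, run on the 16 six-bit values from the worked example. Each iteration is O(1) in the number of elements because COUNT and the bitwise updates touch all bit lines at once; the tie-filling step after the loop is an assumption inferred from the final output shown on the slides:

```python
import numpy as np

def k_mins(C, K, n_bits):
    """Bit-serial K-minimum search, MSB first.
    M marks live candidates, V marks confirmed winners."""
    C = np.asarray(C)
    M = np.ones(len(C), dtype=bool)
    V = np.zeros(len(C), dtype=bool)
    for b in range(n_bits - 1, -1, -1):
        Cb = ((C >> b) & 1).astype(bool)
        N = M & ~Cb                    # candidates with a 0 at this bit
        cnt = int((N | V).sum())
        if cnt > K:
            M = N                      # too many: narrow the candidates
        elif cnt < K:
            V |= N                     # all 0-bit candidates win
            M &= Cb                    # the rest must have a 1 here
        else:
            V |= N                     # exactly K: done
            break
    for i in np.flatnonzero(M & ~V):   # ties fill the remaining slots
        if int(V.sum()) >= K:
            break
        V[i] = True
    return V

C = [0b110101, 0b010100, 0b000101, 0b000111, 0b000001, 0b111011,
     0b000101, 0b010110, 0b001100, 0b011100, 0b111100, 0b101101,
     0b000101, 0b111101, 0b000101, 0b000101]
V = k_mins(C, K=4, n_bits=6)
assert int(V.sum()) == 4
assert sorted(np.asarray(C)[V].tolist()) == sorted(C)[:4]
```

Feeding this to the k-NN pipeline (with distances as C) selects the K nearest items without ever sorting all N values.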
