SLIDE 1

In-Place Associative Computing

Avidan Akerib, Ph.D., Vice President, Associative Computing BU, aakerib@gsitechnology.com

All images are publicly available on the web.

SLIDE 2

Agenda

  • Introduction to associative computing
  • Use case examples
  • Similarity search
  • Large Scale Attention Computing
  • Few-shot learning
  • Software model
  • Future Approaches
SLIDE 3

The Challenge in AI Computing (Matrix Multiplication Is Not Enough!)

AI Requirement → Use Case Example

  • High-precision floating point → Neural network learning
  • Multi-precision → Real-time inference, saving memory
  • Linearly scalable → Big data
  • Sort/search → Top-K, recommendation, speech, image/video classification
  • Heavy computation → Non-linearity, softmax, exponent, normalization
  • Bandwidth/power tradeoff → High speed at low power

SLIDE 4

Von Neumann Architecture

[Diagram: CPU connected to memory. CPU: lower density (lots of logic), faster. Memory: high density (repeated cells), slower.]

Leveraging Moore's Law

SLIDE 5

Von Neumann Architecture

[Diagram: CPU with cache connected to memory. CPU: lower density (lots of logic), faster. Memory: high density (repeated cells), slower.]

CPU frequency outpaces memory, so caches are added… continuing to leverage Moore's Law.

SLIDE 6

Since 2006, Clock Speeds Have Flattened Sharply…

Source: Intel

SLIDE 7

Thinking Parallel: 2 Cores and More

[Diagram: multiple CPU cores sharing one memory]

However, memory utilization becomes an issue…

SLIDE 8

More and More Memory to Solve the Utilization Problem

[Diagram: CPU cores with local and global memories]

SLIDE 9

Memory Still Growing Rapidly

[Diagram: CPU and memory on one chip]

Memory becomes a larger part of each chip.

SLIDE 10

Same Concept Even with GPGPUs

[Diagram: GPGPU surrounded by its memories]

Very high power, large die, expensive… What's next?

SLIDE 11

Most of the Power Goes to Bandwidth

Source: Song Han, Stanford University

SLIDE 12

Changing the Rules of the Game!

Standard memory cells are smarter than we thought!

SLIDE 13

APU: Associative Processing Unit

  • Computes in place, directly in the memory array, removing the I/O bottleneck
  • Significantly increases performance
  • Reduces power

[Diagram: a simple CPU sends a question over a simple, narrow bus; the APU answers it with millions of in-memory processors]

SLIDE 14

How Computers Work Today

[Diagram: an address (e.g., 1000 1100 1001) drives the address decoder, which selects one row for read/write (RE/WE); the sense amps / IO drivers move the stored word (e.g., 0101) to the ALU]

SLIDE 15

Accessing Multiple Rows Simultaneously

[Diagram: rows 0101 and 1000 are read-enabled (RE) at once; each shared bit line resolves to the NOR of the selected cells, 0010, which is write-enabled (WE) into another row]

Bus contention is not an error! It is simply a NOR/NAND, satisfying De Morgan's laws.
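As a software analogue (a NumPy sketch of the effect, not of the actual circuit), reading several rows at once yields the bitwise NOR of those rows on every bit line simultaneously; for the rows 0101 and 1000 shown above, the bit lines resolve to 0010:

    import numpy as np

    # Memory array: each row is a stored word, each column a bit line.
    mem = np.array([[0, 1, 0, 1],
                    [1, 0, 0, 0],
                    [1, 1, 0, 0],
                    [1, 0, 0, 1]], dtype=np.uint8)

    def multi_row_read(mem, rows):
        # With several rows read-enabled (RE), each shared bit line
        # resolves to the NOR of the selected cells, all in parallel.
        return 1 - np.bitwise_or.reduce(mem[rows], axis=0)

    result = multi_row_read(mem, [0, 1])  # NOR(0101, 1000) -> [0 0 1 0]
    mem[3] = result                       # WE: write the result to a row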

SLIDE 16

Truth Table Example

[Truth table and Karnaugh map (AB columns 00/01/11/10, C rows) for D = !A!C + BC; D = 1 in four of the eight rows]

D = !A!C + BC = !!( !A!C + BC ) = !( !(!A!C) · !(BC) ) = NAND( NAND(!A,!C), NAND(B,C) )

Read(B,C); WRITE T2 (1 clock)
Read(!A,!C); WRITE T1 (1 clock)
Read(T1,T2); WRITE D (1 clock)

  • Every minterm takes one clock
  • All bit lines execute the Karnaugh table in parallel
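To make the three-clock sequence concrete, here is a minimal NumPy sketch (my illustration; each nand call stands for one parallel read-and-write cycle across all bit lines):

    import numpy as np

    rng = np.random.default_rng(0)
    A, B, C = (rng.integers(0, 2, 32, dtype=np.uint8) for _ in range(3))

    def nand(x, y):
        # One clock: all bit lines evaluate NAND of two rows in parallel.
        return 1 - (x & y)

    T2 = nand(B, C)          # clock 1: Read(B, C);   WRITE T2
    T1 = nand(1 - A, 1 - C)  # clock 2: Read(!A, !C); WRITE T1
    D  = nand(T1, T2)        # clock 3: D = !A!C + BC on every bit line

    assert np.array_equal(D, ((1 - A) & (1 - C)) | (B & C))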

SLIDE 17

Vector Add Example

A[] + B[] = C[]

vector A(8, 32M); vector B(8, 32M); vector C(9, 32M); C = A + B

  • Number of clocks = 4 × 8 = 32 (4 Boolean steps per bit, 8 bits)
  • Clocks per byte = 32 / 32M = 1/1M
  • OPS = 1 GHz × 1M = 1 PetaOPS
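The clock count comes from a bit-serial full adder: each of the 8 bit positions costs a constant number of Boolean steps (4 here), and every element sits on its own bit line. A NumPy sketch of the arithmetic (the array stands in for the 32M parallel bit lines):

    import numpy as np

    M = 1 << 20  # stands in for 32M bit lines
    rng = np.random.default_rng(1)
    A = rng.integers(0, 256, M, dtype=np.uint16)
    B = rng.integers(0, 256, M, dtype=np.uint16)

    C = np.zeros(M, dtype=np.uint16)
    carry = np.zeros(M, dtype=np.uint16)
    for b in range(8):                         # 8 bit positions
        a, bb = (A >> b) & 1, (B >> b) & 1
        s = a ^ bb ^ carry                     # sum bit
        carry = (a & bb) | (carry & (a ^ bb))  # carry-out
        C |= s << b
    C |= carry << 8                            # 9th bit: vector C(9, 32M)

    assert np.array_equal(C, A + B)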

SLIDE 18

CAM / Associative Search

KEY: search for 0110 among the stored values (records); 1 = match.

  • Duplicate the values with their inverse data
  • Duplicate the key with its inverse, and place the original key next to the inverse data
  • Each 1 in the combined key drives a read enable (RE); bit lines that read back 1 are the matching records
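In software terms (a NumPy sketch of the scheme as I read the slide), storing each record alongside its complement and searching with the key concatenated to its complement reduces exact match to one parallel multi-row read per bit line:

    import numpy as np

    # Each column is a record on its own bit line; each row a stored bit.
    records = np.array([[0, 1, 1, 0],    # record 0
                        [1, 0, 0, 1],    # record 1
                        [0, 1, 1, 0]],   # record 2
                       dtype=np.uint8).T               # bits x records
    stored = np.vstack([records, 1 - records])         # values + inverses

    def cam_search(stored, key):
        key = np.asarray(key, dtype=np.uint8)
        combined = np.concatenate([key, 1 - key])      # key + inverse key
        selected = stored[combined == 1]               # rows with RE = 1
        return np.bitwise_and.reduce(selected, axis=0) # 1 = match

    print(cam_search(stored, [0, 1, 1, 0]))  # -> [1 0 1]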

SLIDE 19

TCAM Search by Standard Memory Cells

[Diagram: stored records in which some bit positions are marked Don't Care; those positions never affect the match]

SLIDE 20

TCAM Search by Standard Memory Cells

KEY: search for 0110; 1 = match.

  • Duplicate the data, inverting only the bits that are not don't-care
  • Insert zero instead of each don't-care bit
  • Duplicate the key with its inverse, and place the original key next to the inverse data
  • Each 1 in the combined key drives a read enable (RE); bit lines that read back 1 are the matching records

SLIDE 21

Computing in the Bit Lines

Each bit line becomes both a processor and storage. Millions of bit lines = millions of processors.

[Diagram: vectors A (a0…a7) and B (b0…b7) stored along each bit line; C = f(A,B) is computed per bit line]

SLIDE 22

Neighborhood Computing

A parallel shift of the bit lines (one cycle between sections) enables neighborhood operations such as convolutions: C = f(A, SL(B,1)).

[Diagram: shift vector]
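As a sketch of why a one-cycle shift is enough for convolutions (NumPy; SL stands for the slide's shift operation, and the filter taps are my example):

    import numpy as np

    B = np.arange(8, dtype=np.float64)

    def SL(v, n):
        # Parallel shift across bit lines (one cycle per step).
        # Note: np.roll wraps around; real hardware handles edges.
        return np.roll(v, -n)

    # A 3-tap filter built purely from shifted copies and adds:
    w = (0.25, 0.5, 0.25)
    C = w[0] * SL(B, -1) + w[1] * B + w[2] * SL(B, 1)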

SLIDE 23

Search & Count

Example: records [5, 20, 17, 3, 20, 54, 20, 8]; search for 20 → three bit lines flag a match (1 = match), count = 3.

  • Search (binary or ternary) all bit lines in 1 cycle
  • 128M bit lines => 128 Peta-searches/sec

Key applications of search and count for predictive analytics:
  • Recommender systems
  • K-nearest neighbors (using cosine similarity search)
  • Random forest
  • Image histograms
  • Regular expressions
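A plain software analogue of the example above (the APU performs the compare on every bit line in one cycle and counts the match flags in hardware):

    import numpy as np

    records = np.array([5, 20, 17, 3, 20, 54, 20, 8])

    match = (records == 20)    # associative search: all bit lines at once
    count = int(match.sum())   # count the match flags
    print(count)               # -> 3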

SLIDE 24

CPU vs GPU vs FPGA vs APU

SLIDE 25

CPU/GPGPU vs APU

CPU/GPGPU (current solution) → In-place computing (APU):

  • Send an address to memory → Search by content
  • Fetch the data from memory and send it to the processor → Mark in place
  • Compute serially per core (thousands of cores at most) → Compute in place on millions of processors (the memory itself becomes millions of processors)
  • Write the data back to memory, further wasting I/O resources → No need to write data back; the result is already in the memory
  • Send data to each location that needs it → If needed, distribute or broadcast at once

SLIDE 26

ARCHITECTURE

SLIDE 27

Store, compute, search, and transport data anywhere.

Shifts between sections enable neighborhood operations (filters, CNNs, etc.).

[Diagram: communication between sections]

SLIDE 28

Section Computing to Improve Performance

[Diagram: the memory array is split into MLB sections of 24 rows each (MLB section 0, MLB section 1, …) joined by connecting muxes, under a control block with an instruction buffer]

SLIDE 29

APU Chip Layout

2M bit processors (or 128K vector processors) running at 1 GHz, with up to 2 PetaOPS peak performance.

SLIDE 30

APU Layout vs GPU Layout

APU: multi-functional, programmable blocks. GPU: acceleration of FP operation blocks.
SLIDE 31

EXAMPLE APPLICATIONS

SLIDE 32

K-Nearest Neighbors (k-NN)

Simple example: N = 36 points, 3 groups, 2 dimensions (D = 2: X and Y), K = 4. Group green is selected as the majority. In actual applications: N = billions, D = tens, K = tens of thousands.

SLIDE 33

k-NN Use Case in an APU

Cosine similarity between the query embedding $E^q$ and each stored record $R$:

$D_q = \frac{E^q \cdot R}{\|E^q\|\,\|R\|} = \frac{\sum_{j=0}^{o} E^q_j R_j}{\sqrt{\sum_{j=0}^{o} (E^q_j)^2}\,\sqrt{\sum_{j=0}^{o} R_j^2}}$

Flow (all N items processed in parallel):
  • Distribute the query to all bit lines: 2 ns (to all)
  • Compute cosine distances for all N in parallel: ~10 µs, assuming D = 50 features
  • In-place ranking, K mins at O(1) complexity: ~3 µs
  • Majority calculation over the K labels

[Diagram: a computing area above the storage of item features and labels (features of item 1, item 2, … item N)]

With the database in an APU, computation for all N items is done in ~0.05 ms, independent of K (1000X improvement over current solutions).
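A NumPy sketch of the same flow (names and sizes are mine; on the APU the similarity, Top-K, and majority steps happen in place):

    import numpy as np

    N, D, K = 1_000_000, 50, 100
    rng = np.random.default_rng(2)
    db = rng.standard_normal((N, D)).astype(np.float32)  # item features
    labels = rng.integers(0, 10, N)                      # item labels
    q = rng.standard_normal(D).astype(np.float32)        # query E^q

    # Cosine similarity of the query to all N items at once
    sims = (db @ q) / (np.linalg.norm(db, axis=1) * np.linalg.norm(q))

    top_k = np.argpartition(-sims, K)[:K]       # K best (cf. K-MINS below)
    pred = np.bincount(labels[top_k]).argmax()  # majority vote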

SLIDE 34

K-MINS: O(1) Algorithm

KMINS(int K, vector C) {
  M := 1; V := 0;
  FOR b = msb TO lsb:
    D := not(C[b]);
    N := M & D;
    cnt := COUNT(N | V);
    IF cnt > K:
      M := N;
    ELIF cnt < K:
      V := N | V;
    ELSE:            // cnt == K
      V := N | V;
      EXIT;
    ENDIF
  ENDFOR
}

[Diagram: bit columns C0, C1, C2, …, scanned from MSB to LSB; N marks the current candidates]
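A runnable translation of the pseudocode (my Python sketch; on the APU, D, N, and COUNT are each single-cycle vector operations regardless of length, so the loop is O(number of bits), effectively O(1) in N):

    import numpy as np

    def k_mins(K, C, bits):
        # M: candidates still in the running; V: confirmed minima.
        M = np.ones(len(C), dtype=bool)
        V = np.zeros(len(C), dtype=bool)
        for b in range(bits - 1, -1, -1):    # msb -> lsb
            D = ((C >> b) & 1) == 0          # D := not(C[b])
            N = M & D
            cnt = np.count_nonzero(N | V)    # COUNT(N|V)
            if cnt > K:
                M = N
            elif cnt < K:
                V = N | V
            else:
                return N | V                 # exactly K marked
        # Loop exhausted: remaining candidates tie for the last slots
        # (assumption: the slide does not specify tie-breaking).
        return M | V

    C = np.array([0b110101, 0b010100, 0b000101, 0b000111,
                  0b000001, 0b111011, 0b000101, 0b010110,
                  0b001100, 0b011100, 0b111100, 0b101101,
                  0b000101, 0b111101, 0b000101, 0b000101])
    print(np.count_nonzero(k_mins(8, C, 6)))  # -> 8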

SLIDE 35

K-MINS: The Algorithm (step 1: bit C[0], the MSB)

[Trace over 16 six-bit values (110101, 010100, 000101, 000111, 000001, 111011, 000101, 010110, 001100, 011100, 111100, 101101, 000101, 111101, 000101, 000101): M starts with all rows marked; D and N mark the 11 rows whose MSB is 0; COUNT(N|V) = 11]

SLIDE 36

K-MINS: The Algorithm (step 2: bit C[1])

[Trace: among the 11 surviving candidates, D and N mark the 8 rows whose next bit is also 0; COUNT(N|V) = 8]

SLIDE 37

K-MINS: The Algorithm (final step)

[Trace: the final output vector marks the K minimum values of C]

O(1) complexity: the scan is one pass over the bit positions, independent of the number of values.

SLIDE 38

Similarity Search and Top-K for Recognition

[Diagram: a text/image database in which every image/sentence/document has a label; a feature extractor (a neural network convolution layer, or a word/sentence/document embedding) produces the stored features]

SLIDE 39

Dense (1×N) Vector by Sparse N×M Matrix

[Diagram: an input vector such as (4, -2, 3, -1) multiplied by a sparse matrix stored as (row, column, value) triples, producing the output vector]

APU representation and computing:
  • Search all columns for row = 2; distribute -2: 2 cycles
  • Search all columns for row = 3; distribute 3: 2 cycles
  • Search all columns for row = 4; distribute -1: 2 cycles
  • Multiply in parallel: 100 cycles
  • Shift and add all products belonging to the same column

Complexity including I/O: O(N + log β), where β is the number of nonzero elements in the sparse matrix. In general N << M for recommender systems.
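A NumPy sketch of the search-and-distribute flow (the triples below are illustrative, not the slide's exact matrix):

    import numpy as np

    x = np.array([4, -2, 3, -1], dtype=np.float32)  # dense 1xN input
    # Sparse matrix as (row, col, value) triples, one per bit line:
    row = np.array([0, 1, 1, 2, 2, 3, 3])
    col = np.array([0, 0, 2, 1, 3, 2, 3])
    val = np.array([3., 5., 9., 17., 6., 6., 15.], dtype=np.float32)

    # "Search all columns for row = r; distribute x[r]" for each r,
    # then multiply in parallel and sum the products per output column.
    prod = val * x[row]                      # parallel multiply
    y = np.zeros(int(col.max()) + 1, dtype=np.float32)
    np.add.at(y, col, prod)                  # shift-and-add per column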

SLIDE 40

Two N×N Sparse Matrix Multiplication

[Diagram: sparse input matrices In-DB1 and In-DB2 stored as (row, column, value) triples, producing the output matrix Out-DB]

  1. Choose the next free entry from In-DB1
  2. Read its row value
  3. Search and mark similar rows
  4. For all marked rows, search where Col(In-DB1) = Row(In-DB2)
  5. Broadcast the selected value to the output-table bit lines
  6. Multiply in parallel
  7. Shift and add all products belonging to the same column
  8. Update Out-DB
  9. Go back to step 1 if there are more free entries; otherwise exit

Complexity including I/O: O(β + log β), compared to O(β^0.7 N^1.2 + N^2) on a CPU (> 1000X improvement).

SLIDE 41

Softmax

  • Used in many neural network applications, especially attention networks
  • The softmax function takes an N-dimensional vector of scores and generates probabilities between 0 and 1, as defined by the function below
  • Each score is the dot product between a query vector and a feature vector (for example, the word embedding of an English vocabulary)

$T_j = \frac{f(a_j)}{\sum_{k=1}^{O} f(a_k)}$, where $f(a) = e^{a}$

SLIDE 42

The Difficulties in Softmax Computing

  1. Dot products for millions of vectors
  2. A non-linear function (exp)
  3. Dependency: every score depends on all others in the database
  4. Dynamic range: fast overflow, requiring high-precision calculations
  5. Speed and latency
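Point 4 is the familiar overflow problem; the standard software mitigation (a sketch, not GSI's proprietary method) subtracts the maximum score before exponentiating:

    import numpy as np

    def softmax(z):
        z = z - z.max()      # tame the dynamic range: exp never overflows
        e = np.exp(z)
        return e / e.sum()   # point 3: every output depends on all inputs

    print(softmax(np.array([1000.0, 1001.0, 1002.0])))  # no overflow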
SLIDE 43

Taylor Series: $f(y) = 1 + y + \frac{y^2}{2!} + \frac{y^3}{3!} + \cdots$

Very expensive: more than 20 coefficients and double precision are required for good accuracy.

SLIDE 44

1M Softmax Performance

  • A proprietary algorithm leverages the APU's lookup capability
  • Provides 1M high-accuracy, exact softmax values in ≤ 5 µs, vs 10-100 ms on a GPU
  • More than 3 orders of magnitude improvement
SLIDE 45

Associative Memory for Natural Language Processing (NLP)

  • Q&A, dialog, language translation, speech recognition, etc.
  • Requires learning from past events
  • Needs a large array with attention capabilities
SLIDE 46

Examples

  • Q&A: "Dan put the book in his car, … long story here … Mike took Dan's car … long story here …. He drove to SF"
    Q: Where is the book now?
    A: Car, SF
  • Language translation:
    "The cow ate the hay because it was delicious."
    "The cow ate the hay because it was hungry."
    Resolving what "it" refers to is attention computing.

Source: Łukasz Kaiser

SLIDE 47

Example of Associative Attention Computing

Input data (e.g., a sentence in English, for translation or Q&A) → Encoder (NN) → feature-vector embedding → stored as the sentence-feature representation (the Key).

SLIDE 48

Example of Associative Attention Computing

A query is encoded (NN) into a feature vector, and its dot product is taken against all stored keys at once: O(1). A softmax over the dot-product results: O(1). Top-K selection: O(1). The softmax weights (e.g., 0.1, 0.01, 0.03, …) applied to the selected values (V1 … V6) feed the next stage (encoder or decoder).
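A NumPy sketch of this flow (dimensions and names are mine; on the APU the dot product, softmax, and Top-K are each O(1), as claimed above):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    rng = np.random.default_rng(3)
    keys   = rng.standard_normal((6, 64))  # stored features (the Key)
    values = rng.standard_normal((6, 64))  # associated values V1..V6
    query  = rng.standard_normal(64)       # encoded query

    scores  = keys @ query                 # dot product against all keys
    weights = softmax(scores)              # attention weights
    top     = np.argsort(-weights)[:3]     # Top-K (K = 3)
    context = weights[top] @ values[top]   # weighted sum to next stage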

SLIDE 49

Q&A: End-to-End Network (Weston)

Source: Weston et al.

SLIDE 50

GSI Associative Solution for End-to-End

Constant time of 3 µs per iteration, for any memory size: a few orders of magnitude improvement.

Source: Weston et al.

SLIDE 51

Associative Computing for Low-Shot Learning

  • Gradient-based optimization has achieved impressive results on supervised tasks such as image classification
  • But these models need a lot of data, whereas people can learn efficiently from a few examples
  • Associative computing:
    • Like people, can measure similarity to features stored in memory
    • Can also create a new label for similar features in the future
SLIDE 52

Zero-Shot Learning with k-NN

  • Extract features using any pre-trained CNN, for example VGG/Inception trained on ImageNet
  • The new data set is embedded using the pre-trained model and stored in memory with its labels
  • Queries (test images) are input without labels, and their features are cosine-similarity searched to predict the label

[Pipeline: input images with labels → feature extractor (convolution layers) → embedded input features stored; an unlabeled query image → cosine similarity search + Top-K → the similar image's label]
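A sketch of the pipeline in Python (embed is a hypothetical stand-in for the pre-trained extractor; everything here is my illustration):

    import numpy as np

    def embed(image_id):
        # Hypothetical stand-in for a pre-trained VGG/Inception extractor.
        rng = np.random.default_rng(abs(hash(image_id)) % (2**32))
        return rng.standard_normal(200)

    gallery = {"cat_1": "cat", "cat_2": "cat", "dog_1": "dog"}
    keys = {name: embed(name) for name in gallery}  # stored with labels

    def predict(image_id, k=1):
        q = embed(image_id)
        sims = {n: (v @ q) / (np.linalg.norm(v) * np.linalg.norm(q))
                for n, v in keys.items()}           # cosine similarity
        top = sorted(sims, key=sims.get, reverse=True)[:k]
        votes = [gallery[n] for n in top]
        return max(set(votes), key=votes.count)     # majority label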

SLIDE 53

Dimension Reduction

  • The output of the convolution layer is large (~20,000 features in VGG, very sparse)
  • Reduce it with a simple matrix or a multi-layer non-linear transformation
  • Learned simply, with a loss function built from:
    • The cosine distance between any two records
    • The difference between that distance at the input and at the output
  • The transform thus learns to preserve the cosine distance

20,000 features → 200 (associative)
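A sketch of such a loss (my formulation of the bullets above; X holds the raw ~20,000-dim features, Z the reduced 200-dim outputs):

    import numpy as np

    def cosine(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

    def distance_preserving_loss(X, Z):
        # Penalize any change in pairwise cosine distance through the
        # reduction, so the transform learns to preserve it.
        n, loss = len(X), 0.0
        for i in range(n):
            for j in range(i + 1, n):
                loss += (cosine(X[i], X[j]) - cosine(Z[i], Z[j])) ** 2
        return loss / (n * (n - 1) / 2)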

SLIDE 54

Low-Shot: Train the Network on Distance

  • Start with an untrained network
  • The network's output is already the reduced-dimension keys for the k-NN database
  • Train the network only to keep similar-valued keys close

[Diagram: network output feeding the k-NN database]

SLIDE 55

Cut Short

  • Stop training when the system starts to converge ("cut short")
  • Use similarity search instead of a fully connected layer
  • Requires less complete training

[Diagram: network output feeding the k-NN database (associative)]

SLIDE 56

PROGRAMMING MODEL

SLIDE 57

Programming Model

  1. Write the application on a standard host using the TensorFlow / Tensor2Tensor framework
  2. Generate a TensorFlow graph for execution in device memory
  3. The APU chip/card executes the graph using its fused capabilities

SLIDE 58

PCIe Development Boards

  • 4 APU chips
  • 8 million bit lines (processors)
  • 8 Peta Boolean OPS
  • 6.4 TFLOPS
  • 2 Petabit/s internal I/O
  • 16-64 GB device memory
  • TensorFlow framework (basic functions)
  • GNL (GSI Numeric Library)
SLIDE 59

FUTURE APPROACH – NON-VOLATILE CONCEPT

SLIDE 60

Computing in Non-Volatile Cells

[Diagram: non-volatile bit cells on a shared bit line, with a sense unit & write control and a selectable reference (REF)]

Read:
  • Select multiple lines for read (as NOR/NAND inputs)
  • Ref = V-read
  • The sense unit senses the bit line for logic "1" or "0"

Write:
  • Select one or multiple lines for write (the NOR/NAND results)
  • Ref = V-write
  • The write control generates logic "1" or "0" for the bit line

SLIDE 61

Solutions for Future Data Centers

[Diagram: the memory hierarchy (CPU register file, L1/L2/L3, DRAM, Flash, HDD) with associative memory alongside it]

Associative memory options:
  • Standard SRAM-based (volatile), high endurance: full computing (floating point, etc.), which requires both read and write
  • STT-RAM-based (non-volatile), mid endurance: machine learning, malware detection, etc., with much more read and much less write
  • PC-RAM- or ReRAM-based (non-volatile), low endurance: data search engines (reading most of the time)

SLIDE 62

Summary

The APU enables state-of-the-art, next-generation machine learning:

  • In-place computing, from basic Boolean algebra to complex algorithms
  • O(1) dot-product computation
  • O(1) min/max
  • O(1) Top-K
  • O(1) softmax
  • Ultra-high internal bandwidth: 0.5 Petabit/s
  • Up to 2 PetaOPS of Boolean algebra in a single chip
  • Fully scalable
  • Fully programmable
  • Efficient TensorFlow-based capabilities
SLIDE 63

Summary

Extending Moore's Law by leveraging the growth of advanced memory technology.

M.Sc./Ph.D. students who would like to collaborate on research: please contact me at aakerib@gsitechnology.com

SLIDE 64

Thank You! Any Questions?

APU Page – http://www.gsitechnology.com/node/123377