SLIDE 1

IN-MEMORY ASSOCIATIVE COMPUTING

AVIDAN AKERIB, GSI TECHNOLOGY

AAKERIB@GSITECHNOLOGY.COM

SLIDE 2

AGENDA

  • The AI computational challenge
  • Introduction to associative computing
  • Examples
  • An NLP use case
  • What's next?

SLIDE 3

THE CHALLENGE IN AI COMPUTING

AI Requirement → Use Case Example

  • 32-bit FP → neural network learning
  • Multi-precision → neural network inference, data mining, etc.
  • Scaling → data center
  • Sort/search → Top-K, recommendation, speech, image/video classification
  • Heavy computation → non-linearity, Softmax, exponent, normalization
  • Bandwidth → required for speed and power

SLIDE 4

CURRENT SOLUTION

A CPU (tens of cores) or general-purpose GPU (thousands of cores) answers a question by shuttling data to and from DRAM over a very wide bus.

  • Bottleneck when register file data needs to be replaced on a regular basis:
    ➢ Limits performance
    ➢ Increases power consumption
  • Does not scale with the search, sort, and rank requirements of applications like recommender systems, NLP, speech recognition, and data mining that require functions like Top-K and Softmax.

SLIDE 5

GPU VS CPU VS FPGA

SLIDE 6

GPU VS CPU VS FPGA VS APU

SLIDE 7

GSI’S SOLUTION
 APU—ASSOCIATIVE PROCESSING UNIT

  • Computes in-place, directly in the memory array, removing the I/O bottleneck
  • Significantly increases performance
  • Reduces power

A simple CPU drives the APU associative memory over a simple, narrow bus; the question goes in and the answer comes out, with millions of processors working inside the memory.

SLIDE 8

IN-MEMORY COMPUTING CONCEPT

SLIDE 9

THE COMPUTING MODEL FOR THE PAST 80 YEARS

(Diagram: a memory array with an address decoder and sense amps/IO drivers; data is read out to an ALU, computed on, and written back, so every operation pays the read/write round trip.)

SLIDE 10

THE CHANGE—IN-MEMORY COMPUTING

  • Patented in-memory logic using only read/write operations
  • Any logic/arithmetic function can be generated internally

(Diagram: two reads feed a NOR whose result is written back into the array; a simple controller sequences the read/read/write pattern.)
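
Because NOR is functionally complete, the read/read/write pattern above is enough to build every other gate. A minimal software sketch of that claim (NumPy arrays stand in for the parallel bit lines; this is an illustration, not GSI code):

import numpy as np

def nor(a, b):
    # the only primitive the memory array has to provide
    return ~(a | b)

a = np.array([False, False, True, True])
b = np.array([False, True, False, True])

not_a  = nor(a, a)                  # NOT from NOR
and_ab = nor(nor(a, a), nor(b, b))  # AND from NOR
or_ab  = nor(nor(a, b), nor(a, b))  # OR from NOR
t      = nor(a, b)
xnor   = nor(nor(a, t), nor(b, t))  # XNOR from NOR
xor_ab = nor(xnor, xnor)            # XOR = NOT(XNOR)

assert (not_a  == ~a).all()
assert (and_ab == (a & b)).all()
assert (or_ab  == (a | b)).all()
assert (xor_ab == (a ^ b)).all()

Every expression above is a composition of NOR calls, mirroring how the array composes arbitrary logic out of read/write-generated NORs.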

SLIDE 11

CAM / ASSOCIATIVE SEARCH

Records are stored along the bit lines and searched in parallel against a key (KEY: search 0110); 1 = match.

  • Duplicate the values with inverse data
  • Duplicate the key with its inverse, and place the original key next to the inverse data
  • A 1 in the combined key goes to the read enable (RE)

SLIDE 12

TCAM SEARCH WITH STANDARD MEMORY CELLS

(Diagram: the same associative search, now with don't-care positions in the stored bits; a don't-care cell matches either key value.)

SLIDE 13

TCAM SEARCH WITH STANDARD MEMORY CELLS (CONTINUED)

KEY: search 0110; 1 = match.

  • Duplicate the data, inverting only the bits that are not don't-care
  • Duplicate the key with its inverse, and place the original key next to the inverse data
  • A 1 in the combined key goes to the read enable (RE)
  • Insert zeros in place of don't-care bits
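
A toy software model of my reading of this encoding (the cell-level details are simplified and the helper names are invented): each stored bit b becomes the cell pair (not b, b), a don't-care becomes (0, 0), the key k is applied as read enables (k, not k), and a line matches when no enabled cell reads a 1.

import numpy as np

DC = None  # don't-care marker for a stored bit

def encode_record(bits):
    # each bit b -> (not b, b); don't-care -> (0, 0)
    cells = []
    for b in bits:
        cells += [0, 0] if b is DC else [1 - b, b]
    return cells

def tcam_search(records, key):
    table = np.array([encode_record(r) for r in records])
    re = np.array([c for k in key for c in (k, 1 - k)])  # read enables
    # match line = NOR over the enabled cells: 1 iff no enabled cell stores 1
    return (table & re).sum(axis=1) == 0

records = [[0, 1, 1, 0],
           [0, 1, DC, 0],
           [1, 1, 1, 0]]
print(tcam_search(records, [0, 1, 1, 0]))  # [ True  True False]

All rows are compared at once; the don't-care cells store zeros everywhere, so they can never pull a match line down.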

SLIDE 14

COMPUTING IN THE BIT LINES

Vector A and Vector B are laid out along the bit lines, and C = f(A, B) is computed in place. Each bit line becomes both processor and storage, so millions of bit lines = millions of processors.

SLIDE 15

NEIGHBORHOOD COMPUTING

Bit lines can be shifted between sections in parallel at one cycle per shift, enabling neighborhood operations such as convolutions: C = f(A, SL(B, 1)), where SL(B, 1) is vector B shifted by one position.
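
For instance, a 3-tap 1-D convolution can be phrased entirely as whole-vector shifts plus elementwise multiply-adds, which is exactly the C = f(A, SL(B, 1)) pattern. A small sketch (NumPy stands in for the bit lines; zero boundary handling is an assumption):

import numpy as np

def shift(v, n):
    # whole-vector shift, zero-filled: the one-cycle bit-line shift
    out = np.zeros_like(v)
    if n > 0:
        out[n:] = v[:-n]
    elif n < 0:
        out[:n] = v[-n:]
    else:
        out = v.copy()
    return out

def conv3(signal, taps):
    # every output element is computed in parallel from its neighborhood
    return (taps[0] * shift(signal, 1)
          + taps[1] * signal
          + taps[2] * shift(signal, -1))

x = np.array([0., 1., 2., 3., 4.])
print(conv3(x, [1., 2., 1.]))  # [ 1.  4.  8. 12. 11.]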

SLIDE 16

SEARCH & COUNT

Example: searching a column of values (5, 20, 17, 3, 20, 54, 20, 8, …) for the key 20 marks three match lines: count = 3.

  • Search (binary or ternary) all bit lines in 1 cycle
  • 128M bit lines => 128 peta-searches/sec
  • Key applications of search and count for predictive analytics:
  • Recommender systems
  • K-nearest neighbors (using cosine similarity search)
  • Random forest
  • Image histogram
  • Regular expression
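
In software terms the primitive is a one-shot parallel compare plus a population count. A toy version using the slide's numbers, including the histogram use it enables:

import numpy as np

values = np.array([5, 20, 17, 3, 20, 54, 20, 8])

# search: every bit line compares against the key at once
print((values == 20).sum())        # count = 3

# swept over all distinct keys, the same primitive yields a histogram
for key in np.unique(values):
    print(key, (values == key).sum())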

SLIDE 17

DATABASE SEARCH AND UPDATE

Content-based search: a record can be placed anywhere. Update, modify, insert, and delete are immediate.

  • Exact match (CAM/TCAM)
  • Similarity match
  • In-place aggregate

SLIDE 18

TRADITIONAL STORAGE CAN DO MUCH MORE

  • 1 standard memory cell: 1 bit
  • 2 standard memory cells: 2 bits, a 2-input NOR, or 1 TCAM cell
  • 3 standard memory cells: 3 bits, a 3-input NOR, a 2-input NOR + 1 output, or a 4-state CAM
  • … and so on

SLIDE 19

CPU/GPGPU VS APU

SLIDE 20

ARCHITECTURE

SLIDE 21

SECTION COMPUTING TO IMPROVE PERFORMANCE

(Diagram: the memory is divided into MLB sections of 24 rows, each with its own connecting mux; a control unit with an instruction buffer drives all sections in parallel.)

SLIDE 22

COMMUNICATION BETWEEN SECTIONS

Store, compute, search, and move data anywhere. Shifts between sections enable neighborhood operations (filters, CNNs, etc.).

SLIDE 23

APU CHIP LAYOUT

2M bit processors (or 128K vector processors) running at 1 GHz, with up to 2 peta-OPS peak performance.

SLIDE 24

EVALUATION BOARD PERFORMANCE

  • Precision: unlimited, from 1 bit to 160 bits or more; 6.4 TOPS (FP), up to 8 peta-OPS for one-bit computing or 16-bit exact search
  • Similarity search, Top-K, min, max, Softmax: O(1) complexity in μs for any size of K, compared to ms with current solutions
  • In-memory IO: 2 petabit/sec, > 100X GPGPU/CPU/FPGA
  • Sparse matrix multiplication: > 100X GPGPU/CPU/FPGA

SLIDE 25

APU SERVER

  • 64 APU chips, 256-512 GB DDR
  • From 100 TFLOPS up to 128 peta-OPS peak performance
  • 128 TOPS/W
  • O(1) Top-K, min, max
  • 32 petabit/sec internal IO
  • < 1 kW
  • > 1000X GPGPUs on average
  • Linearly scalable
  • Currently a 28nm process, scalable to 7nm or less
  • Well suited to advanced memory technology such as non-volatile ReRAM, and more

SLIDE 26

EXAMPLE APPLICATIONS

SLIDE 27

K-NEAREST NEIGHBORS (K-NN)

Simple example: N = 36 points in 3 groups, 2 dimensions (D = 2, X and Y), K = 4; the green group is selected as the majority. For actual applications: N = billions, D = tens, K = tens of thousands.
slide-28
SLIDE 28

K-NN USE CASE IN AN APU

Items 1 … N are stored one per bit-line group, each with its features and label, next to the computing area. For a query Q:

1. Distribute the query data to all items: 2 ns
2. Compute cosine distances for all N items in parallel: ≤ 10 μs, assuming D = 50 features
3. Find the K minimum distances at O(1) complexity: ≤ 3 μs
4. Majority calculation with in-place ranking

With the database in an APU, computation for all N items is done in ≤ 0.05 ms, independent of K.
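
Functionally, the whole flow is cosine similarity for every item in parallel, a K-smallest selection, and a majority vote. A plain-NumPy sketch over made-up data (the APU's timings and layout obviously don't carry over; argpartition stands in for the O(1) K-mins):

import numpy as np

def knn_classify(query, items, labels, k):
    # 1. cosine distance from the query to every item, "in parallel"
    sims = items @ query / (np.linalg.norm(items, axis=1)
                            * np.linalg.norm(query))
    dist = 1.0 - sims
    # 2. K minimum distances (the APU does this in O(1))
    nearest = np.argpartition(dist, k)[:k]
    # 3. majority vote over the K labels
    return np.bincount(labels[nearest]).argmax()

rng = np.random.default_rng(0)
items = rng.normal(size=(1000, 50))      # N items, D = 50 features
labels = rng.integers(0, 3, size=1000)   # 3 groups
print(knn_classify(items[0], items, labels, k=4))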

SLIDE 29

LARGE DATABASE EXAMPLE USING APU SERVERS

  • Number of items: billions
  • Features per item: tens to hundreds
  • Latency: ≤ 1 msec
  • Throughput: Scales to 1M similarity searches/sec
  • k-NN: Top 100,000 nearest neighbors
SLIDE 30

EXAMPLE: K-NN FOR RECOGNITION

Text (bag-of-words, word embedding) or an image (convolution layers) feeds a feature extractor (neural network), whose features go to a K-NN classifier (associative memory).

SLIDE 31

K-MINS: O(1) ALGORITHM

KMINS(int K, vector C) {
    M := 1; V := 0;                 // M: candidate mask, V: accepted minima
    FOR b = msb to lsb:
        D := not(C[b]);             // 1 where bit b of the value is 0
        N := M & D;                 // candidates with a 0 at bit b
        cnt := COUNT(N | V);
        IF cnt > K:                 // too many: tighten the candidate set
            M := N;
        ELIF cnt < K:               // all of N qualify; keep scanning
            V := N | V;
        ELSE:                       // cnt == K: exactly K found
            V := N | V;
            EXIT;
        ENDIF
    ENDFOR
}

(Diagram: the value bit columns C0, C1, C2, … are scanned from MSB to LSB, producing a tag vector N per column.)
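
A software rendering of the pseudocode, with NumPy boolean vectors standing in for the tag registers M, V, D, and N (boundary ties would need the hardware's index tie-break, which the slides don't show and this sketch omits):

import numpy as np

def k_mins(values, k, bits):
    M = np.ones(len(values), dtype=bool)   # candidate mask
    V = np.zeros(len(values), dtype=bool)  # accepted minima
    for b in reversed(range(bits)):        # MSB -> LSB, one pass per bit
        D = ((values >> b) & 1) == 0       # D = not(C[b])
        N = M & D                          # candidates with a 0 at bit b
        cnt = np.count_nonzero(N | V)
        if cnt > k:
            M = N                          # too many: tighten candidates
        elif cnt < k:
            V = N | V                      # all of N qualify; keep going
        else:
            return N | V                   # exactly k: done
    return V

vals = np.array([0b110101, 0b010100, 0b000101, 0b000111,
                 0b000001, 0b111011, 0b000101, 0b010110])
print(np.flatnonzero(k_mins(vals, 4, 6)))  # -> [2 3 4 6]

The loop runs once per bit of precision, so the cost is independent of both the number of records and K, which is the sense in which the APU's Top-K is O(1).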

SLIDE 32

K-MINS: THE ALGORITHM (WALKTHROUGH)

16 six-bit values, K = 4:

110101 010100 000101 000111 000001 111011 000101 010110 001100 011100 111100 101101 000101 111101 000101 000101

First iteration, bit column C[0] (the MSB): M starts as all ones and V as all zeros. D marks the 11 values whose MSB is 0, so cnt = COUNT(N | V) = 11 > K, and the candidate mask M narrows to those 11 values.

SLIDE 33

K-MINS: THE ALGORITHM (WALKTHROUGH, CONTINUED)

Second iteration, bit column C[1]: of the 11 remaining candidates, 8 also have a 0 in this bit, so cnt = 8 > K and M narrows again to those 8 values.

SLIDE 34

K-MINS: THE ALGORITHM (WALKTHROUGH, FINAL)

After scanning the remaining bit columns, V marks exactly the K = 4 minimum values as the final output. The number of passes depends only on the bit width, not on the number of records or on K: O(1) complexity.

SLIDE 35

DENSE (1XN) VECTOR BY SPARSE NXM MATRIX

Example: a dense input vector (4, −2, 3, −1) multiplies a sparse matrix that the APU stores as (row, column, value) triples, one triple per bit line.

  • Search all bit lines for row = 2, distribute −2 to the matches: 2 cycles
  • Search all bit lines for row = 3, distribute 3: 2 cycles
  • Search all bit lines for row = 4, distribute −1: 2 cycles
  • Multiply all bit lines in parallel: 100 cycles
  • Shift and add all products belonging to the same column to form the output vector

Complexity including IO: O(N + log β), where β is the number of nonzero elements in the sparse matrix. N << M in general for recommender systems.
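
A functional sketch of that flow (hypothetical triples, not the slide's exact matrix; NumPy's gather and add.at play the roles of the associative search/distribute and the shift-and-add reduction):

import numpy as np

# the sparse matrix as (row, col, val) triples, one per "bit line"
row = np.array([0, 1, 1, 2, 3])
col = np.array([0, 0, 2, 1, 2])
val = np.array([3.0, 5.0, 9.0, 17.0, 6.0])
x   = np.array([4.0, -2.0, 3.0, -1.0])   # dense input vector

# 1. search + distribute: every triple picks up its row's input element
#    (on the APU: one associative search per nonzero input, ~2 cycles each)
xd = x[row]

# 2. multiply all bit lines in parallel
prod = xd * val

# 3. shift-and-add: reduce products that share an output column
y = np.zeros(3)
np.add.at(y, col, prod)
print(y)   # [  2.  51. -24.]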

SLIDE 36

SPARSE MATRIX MULTIPLICATION PERFORMANCE ANALYSIS

  • G3 circuit matrix
  • 1.5M x 1.5M sparse matrix
  • Roughly 8M nonzero elements
  • 20–100 GFLOPS with a GPGPU solution
  • The APU solution provides 64 TFLOPS using the same amount of power as the GPGPU solution above
  • > 500x improvement with the APU solution
SLIDE 37

ASSOCIATIVE MEMORY FOR NATURAL LANGUAGE PROCESSING (NLP)

Q&A, dialog, language translation, speech recognition’ etc. Requires learning things from the past – needs memory More memory, more accuracy i.e. “Dan put the book in his car, ….. Long story here…. Mike took Dan’s car … Long story

here …. He drove to SF”

Q : Where is the book now? A: Car, SF

SLIDE 38

END-TO-END MEMORY NETWORKS

End-To-End Memory Networks (Sukhbaatar et al., NIPS 2015). (a): single hop, (b): 3 hops.

SLIDE 39

Q&A : END TO END NETWORK

SLIDE 40

REQUIREMENTS FOR AUGMENTED MEMORY

  • Cosine similarity search + Top-K
  • Compute softmax over the selected columns
  • Vertical multiplication of the selected columns and a horizontal sum
  • Output
  • Embed input features into the next column, into any other location, or into a location chosen based on content

One memory "hop" chains exactly these primitives, as sketched below.
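
A compact sketch of one hop with invented shapes (plain NumPy; restricting the softmax to the Top-K columns is the associative-memory-friendly step):

import numpy as np

def hop(query, memory, values, k):
    # 1. cosine similarity of the query against every memory column
    sims = memory @ query / (np.linalg.norm(memory, axis=1)
                             * np.linalg.norm(query) + 1e-9)
    # 2. Top-K: keep only the K best-matching slots
    top = np.argpartition(-sims, k)[:k]
    # 3. softmax over the selected slots only
    w = np.exp(sims[top] - sims[top].max())
    w /= w.sum()
    # 4. vertical multiply + horizontal sum: the weighted readout
    return w @ values[top]

rng = np.random.default_rng(1)
memory = rng.normal(size=(100, 16))   # embedded input sentences
values = rng.normal(size=(100, 16))   # output embeddings
q = rng.normal(size=16)
print(hop(q, memory, values, k=8).shape)   # (16,)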

SLIDE 41

APU MEMORY FOR NLP

(Diagram: a controller drives memory columns 0 … N−1, each holding a memory vector mᵢ with a tag Tᵢ.)

  • Broadcast input I to selected columns
  • Compute any function at the selected columns
  • Generate output O
  • Generate tags for selection

SLIDE 42

GSI SOLUTION FOR END TO END

Constant time of 3 µsec per iteration, for any memory size.

SLIDE 43

PROGRAMMING MODEL

SLIDE 44

PROGRAMMING MODEL

Application (C++, Python)
Graph execution and task-scheduling framework (TensorFlow)
APU (associative processing unit hardware)

The application and the framework run on the host; the APU is the device.

SLIDE 45

A TF EXAMPLE: MATMUL

a = tf.placeholder(tf.int32, shape=[3, 4])
b = tf.placeholder(tf.int32, shape=[4, 6])
c = tf.matmul(a, b)  # shape = [3, 6]

(Graph: a and b feed a matmul node that produces c.)

SLIDE 46

A TF EXAMPLE: MATMUL
GRAPH PREPARATION

During graph preparation, the APU device (tf + eigen) allocates device-space arrays for each tensor in the graph:

a = tf.placeholder(tf.int32, shape=[3, 4])   # gnl_create_array(a)
b = tf.placeholder(tf.int32, shape=[4, 6])   # gnl_create_array(b)
c = tf.matmul(a, b)                          # gnl_create_array(c)

(Diagram: a, b, and c allocated in APU device space; data moves from the host through L4 to L1 (MMB) in the apuc's.)
slide-47
SLIDE 47

A TF EXAMPLE: MATMUL
GVML_SET, GVML_MUL

with tf.Session() as sess:
    result = sess.run(c, feed_dict={
        a: [[ 2,  -4,  15,  3],
            [11,   1, -30,  4],
            [-8,  23,  -9,  7]],
        b: [[ 27,  -8, -14,  2,  9, -32],
            [ -7,  52,  -6, 21,  0,   4],
            [-81,   1,  20,  6, 19,  -3],
            [-38,  90,   5,  2, 13,  77]]})

On the device, the controller executes gnlpd_mat_mul(c, a, b). gnlpd_dma_16b_start(GNLPD_SYS_2_VMR, …) keeps the DMA loading data into the apuc while the matmul is being computed in the apuc. gvml_set_16(……) broadcasts a scalar element of a across a vector register, and gvml_mul_s16(……) multiplies it by a row of b in parallel; for example, broadcasting 2 against the row 27 -8 -14 2 9 -32 yields 54 -16 -28 4 18 -64.

SLIDE 48

TENSORFLOW ENHANCEMENT: FUSED OPERATIONS

a = tf.placeholder(tf.int32, shape=[3, 4])
b = tf.placeholder(tf.int32, shape=[4, 6])
c = tf.matmul(a, b)       # shape = [3, 6]
d = tf.nn.top_k(c, k=2)   # shape = [3, 2]

The matmul and top_k nodes are replaced by a single fused(matmul, top_k) node:

  • The two operations are computed inside the apuc
  • Data stays in L1
  • No IO operations between them
  • Saves valuable data transfer time and power

SLIDE 49

A TF EXAMPLE: MATMUL
CODE EXAMPLE

Inside the APU device (MMB/L1), the kernels are written as APL fragments, such as this 16-bit add:

APL_FRAG add_u16(RN_REG x, RN_REG y, RN_REG t_xory, RN_REG t_ci0)
{
    SM_0XFFFF: RL = SB[x];                         // RL[0-15] = x
    SM_0XFFFF: RL |= SB[y];                        // RL[0-15] = x|y
    {
        SM_0XFFFF: SB[t_xory] = RL;                // t_xory[1-15] = x|y
        SM_0XFFFF: RL = SB[x, y];                  // RL[0-15] = x&y
    }
    // Add init state:
    //   0: RL = co[0]
    //   1..15: RL = x&y
    {
        (SM_0X1111 << 1): SB[t_ci0] = NRL;         // t_ci0[5,9,13] = x&y
        (SM_0X1111 << 1): RL |= SB[t_xory] & NRL;  // RL[1] = Cout[1] = x&y | ci(x|y)
                                                   // 5,9,13: RL = Cout0[5,9,13] = x&y | ci(x|y)
        (SM_0X1111 << 4): RL |= SB[t_xory];        // RL[4,8,12] = Cout1[4,8,12] = x&y | 1&(x|y)
    }
    {
        (SM_0X1111 << 2): SB[t_ci0] = NRL;         // propagate Cin0
        (SM_0X1111 << 2): RL |= SB[t_xory] & NRL;  // propagate Cout0
        (SM_0X1111 << 5): RL |= SB[t_xory] & NRL;  // propagate Cout1
    }
    {
        (SM_0X1111 << 3): SB[t_ci0] = NRL;         // propagate Cin0
        (SM_0X1111 << 3): RL |= SB[t_xory] & NRL;  // propagate Cout0
        (SM_0X1111 << 6): RL |= SB[t_xory] & NRL;  // propagate Cout1
        (SM_0X0001 << 15): GL = RL;
    }
    {
        (SM_0X1111 << 4): SB[t_ci0] = NRL;         // t_ci0[8,12,16] = Cout0[7, 11, 15]
        SM_0X0001: SB[t_ci0] = GL;                 // t_ci0[8,12,16] = Cout0[7, 11, 15]
        (SM_0X1111 << 7): RL |= SB[t_xory] & NRL;  // propagate Cout1
        (SM_0X0001 << 15): GL = RL;
    }
    . . .

SLIDE 50

FUTURE APPROACH – NON-VOLATILE CONCEPT

SLIDE 51

SOLUTIONS FOR FUTURE DATA CENTERS

(Diagram: the traditional hierarchy of CPU register file, L1/L2/L3 caches, DRAM, Flash, and HDD, with associative-memory tiers alongside the DRAM level.)

  • Standard SRAM-based (volatile, high endurance): full computing (floating point etc.), requiring both reads and writes
  • STT-RAM-based (non-volatile, mid endurance): machine learning, malware detection, etc.; much more read and much less write
  • PC-RAM- and ReRAM-based (non-volatile, low endurance): data search engines (read most of the time)

SLIDE 52

THANK YOU