SLIDE 1

IN-MEMORY ASSOCIATIVE COMPUTING

AVIDAN AKERIB, GSI TECHNOLOGY

AAKERIB@GSITECHNOLOGY.COM

SLIDE 2

AGENDA

  • The AI computational challenge
  • Introduction to associative computing
  • Examples
  • An NLP use case
  • What's next?

SLIDE 3

THE CHALLENGE IN AI COMPUTING

AI Requirement → Use Case Example

  • 32-bit FP → neural network learning
  • Multi-precision → neural network inference, data mining, etc.
  • Scaling → data center
  • Sort/search → Top-K, recommendation, speech, image/video classification
  • Heavy computation → non-linearity, Softmax, exponent, normalization
  • Bandwidth → required for speed and power

SLIDE 4

CURRENT SOLUTION

A CPU (tens of cores) or general-purpose GPU (thousands of cores) answers a question by shuttling data to and from DRAM over a very wide bus.

  • Bottleneck when register file data needs to be replaced on a regular basis:
    ➢ Limits performance
    ➢ Increases power consumption
  • Does not scale with the search, sort, and rank requirements of applications like recommender systems, NLP, speech recognition, and data mining that require functions like Top-K and Softmax.

SLIDE 5

GPU VS CPU VS FPGA

SLIDE 6

GPU VS CPU VS FPGA VS APU

SLIDE 7

GSI’S SOLUTION
 APU—ASSOCIATIVE PROCESSING UNIT

  • Computes in-place, directly in the memory array, removing the I/O bottleneck
  • Significantly increases performance
  • Reduces power

A simple CPU drives the APU associative memory over a simple, narrow bus; the question goes in and the answer comes out, with millions of processors working inside the memory.

SLIDE 8

IN-MEMORY COMPUTING CONCEPT

SLIDE 9

THE COMPUTING MODEL FOR THE PAST 80 YEARS

(Diagram: a memory array with an address decoder and sense amps/IO drivers; data is read out to an ALU, computed on, and written back, so every operation pays the read/write round trip.)

SLIDE 10

THE CHANGE—IN-MEMORY COMPUTING

  • Patented in-memory logic using only read/write operations
  • Any logic/arithmetic function can be generated internally

(Diagram: two reads feed a NOR whose result is written back into the array; a simple controller sequences the read/read/write pattern.)
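
Because NOR is functionally complete, the read/read/write pattern above is enough to build every other gate. A minimal software sketch of that claim (NumPy arrays stand in for the parallel bit lines; this is an illustration, not GSI code):

import numpy as np

def nor(a, b):
    # the only primitive the memory array has to provide
    return ~(a | b)

a = np.array([False, False, True, True])
b = np.array([False, True, False, True])

not_a  = nor(a, a)                  # NOT from NOR
and_ab = nor(nor(a, a), nor(b, b))  # AND from NOR
or_ab  = nor(nor(a, b), nor(a, b))  # OR from NOR
t      = nor(a, b)
xnor   = nor(nor(a, t), nor(b, t))  # XNOR from NOR
xor_ab = nor(xnor, xnor)            # XOR = NOT(XNOR)

assert (not_a  == ~a).all()
assert (and_ab == (a & b)).all()
assert (or_ab  == (a | b)).all()
assert (xor_ab == (a ^ b)).all()

Every expression above is a composition of NOR calls, mirroring how the array composes arbitrary logic out of read/write-generated NORs.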

SLIDE 11

CAM / ASSOCIATIVE SEARCH

Records are stored along the bit lines and searched in parallel against a key (KEY: search 0110); 1 = match.

  • Duplicate the values with inverse data
  • Duplicate the key with its inverse, and place the original key next to the inverse data
  • A 1 in the combined key goes to the read enable (RE)

SLIDE 12

TCAM SEARCH WITH STANDARD MEMORY CELLS

(Diagram: the same associative search, now with don't-care positions in the stored bits; a don't-care cell matches either key value.)

SLIDE 13

TCAM SEARCH WITH STANDARD MEMORY CELLS (CONTINUED)

KEY: search 0110; 1 = match.

  • Duplicate the data, inverting only the bits that are not don't-care
  • Duplicate the key with its inverse, and place the original key next to the inverse data
  • A 1 in the combined key goes to the read enable (RE)
  • Insert zeros in place of don't-care bits
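
A toy software model of my reading of this encoding (the cell-level details are simplified and the helper names are invented): each stored bit b becomes the cell pair (not b, b), a don't-care becomes (0, 0), the key k is applied as read enables (k, not k), and a line matches when no enabled cell reads a 1.

import numpy as np

DC = None  # don't-care marker for a stored bit

def encode_record(bits):
    # each bit b -> (not b, b); don't-care -> (0, 0)
    cells = []
    for b in bits:
        cells += [0, 0] if b is DC else [1 - b, b]
    return cells

def tcam_search(records, key):
    table = np.array([encode_record(r) for r in records])
    re = np.array([c for k in key for c in (k, 1 - k)])  # read enables
    # match line = NOR over the enabled cells: 1 iff no enabled cell stores 1
    return (table & re).sum(axis=1) == 0

records = [[0, 1, 1, 0],
           [0, 1, DC, 0],
           [1, 1, 1, 0]]
print(tcam_search(records, [0, 1, 1, 0]))  # [ True  True False]

All rows are compared at once; the don't-care cells store zeros everywhere, so they can never pull a match line down.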

SLIDE 14

COMPUTING IN THE BIT LINES

Vector A and Vector B are laid out along the bit lines, and C = f(A, B) is computed in place. Each bit line becomes both processor and storage, so millions of bit lines = millions of processors.

SLIDE 15

NEIGHBORHOOD COMPUTING

Bit lines can be shifted between sections in parallel at one cycle per shift, enabling neighborhood operations such as convolutions: C = f(A, SL(B, 1)), where SL(B, 1) is vector B shifted by one position.
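
For instance, a 3-tap 1-D convolution can be phrased entirely as whole-vector shifts plus elementwise multiply-adds, which is exactly the C = f(A, SL(B, 1)) pattern. A small sketch (NumPy stands in for the bit lines; zero boundary handling is an assumption):

import numpy as np

def shift(v, n):
    # whole-vector shift, zero-filled: the one-cycle bit-line shift
    out = np.zeros_like(v)
    if n > 0:
        out[n:] = v[:-n]
    elif n < 0:
        out[:n] = v[-n:]
    else:
        out = v.copy()
    return out

def conv3(signal, taps):
    # every output element is computed in parallel from its neighborhood
    return (taps[0] * shift(signal, 1)
          + taps[1] * signal
          + taps[2] * shift(signal, -1))

x = np.array([0., 1., 2., 3., 4.])
print(conv3(x, [1., 2., 1.]))  # [ 1.  4.  8. 12. 11.]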

SLIDE 16

SEARCH & COUNT

Example: searching a column of values (5, 20, 17, 3, 20, 54, 20, 8, …) for the key 20 marks three match lines: count = 3.

  • Search (binary or ternary) all bit lines in 1 cycle
  • 128M bit lines => 128 peta-searches/sec
  • Key applications of search and count for predictive analytics:
  • Recommender systems
  • K-nearest neighbors (using cosine similarity search)
  • Random forest
  • Image histogram
  • Regular expression
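
In software terms the primitive is a one-shot parallel compare plus a population count. A toy version using the slide's numbers, including the histogram use it enables:

import numpy as np

values = np.array([5, 20, 17, 3, 20, 54, 20, 8])

# search: every bit line compares against the key at once
print((values == 20).sum())        # count = 3

# swept over all distinct keys, the same primitive yields a histogram
for key in np.unique(values):
    print(key, (values == key).sum())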

SLIDE 17

DATABASE SEARCH AND UPDATE

Content-based search: a record can be placed anywhere. Update, modify, insert, and delete are immediate.

  • Exact match (CAM/TCAM)
  • Similarity match
  • In-place aggregate

SLIDE 18

TRADITIONAL STORAGE CAN DO MUCH MORE

  • 1 standard memory cell: 1 bit
  • 2 standard memory cells: 2 bits, a 2-input NOR, or 1 TCAM cell
  • 3 standard memory cells: 3 bits, a 3-input NOR, a 2-input NOR + 1 output, or a 4-state CAM
  • … and so on

SLIDE 19

CPU/GPGPU VS APU

SLIDE 20

ARCHITECTURE

SLIDE 21

SECTION COMPUTING TO IMPROVE PERFORMANCE

(Diagram: the memory is divided into MLB sections of 24 rows, each with its own connecting mux; a control unit with an instruction buffer drives all sections in parallel.)

SLIDE 22

COMMUNICATION BETWEEN SECTIONS

Store, compute, search, and move data anywhere. Shifts between sections enable neighborhood operations (filters, CNNs, etc.).

SLIDE 23

APU CHIP LAYOUT

2M bit processors (or 128K vector processors) running at 1 GHz, with up to 2 peta-OPS peak performance.

SLIDE 24

EVALUATION BOARD PERFORMANCE

  • Precision: unlimited, from 1 bit to 160 bits or more; 6.4 TOPS (FP), up to 8 peta-OPS for one-bit computing or 16-bit exact search
  • Similarity search, Top-K, min, max, Softmax: O(1) complexity in μs for any size of K, compared to ms with current solutions
  • In-memory IO: 2 petabit/sec, > 100X GPGPU/CPU/FPGA
  • Sparse matrix multiplication: > 100X GPGPU/CPU/FPGA

SLIDE 25

APU SERVER

  • 64 APU chips, 256-512 GB DDR
  • From 100 TFLOPS up to 128 peta-OPS peak performance
  • 128 TOPS/W
  • O(1) Top-K, min, max
  • 32 petabit/sec internal IO
  • < 1 kW
  • > 1000X GPGPUs on average
  • Linearly scalable
  • Currently a 28nm process, scalable to 7nm or less
  • Well suited to advanced memory technology such as non-volatile ReRAM, and more

SLIDE 26

EXAMPLE APPLICATIONS

SLIDE 27

K-NEAREST NEIGHBORS (K-NN)

Simple example: N = 36 points in 3 groups, 2 dimensions (D = 2, X and Y), K = 4; the green group is selected as the majority. For actual applications: N = billions, D = tens, K = tens of thousands.
slide-28
SLIDE 28

K-NN USE CASE IN AN APU

Items 1 … N are stored one per bit-line group, each with its features and label, next to the computing area. For a query Q:

1. Distribute the query data to all items: 2 ns
2. Compute cosine distances for all N items in parallel: ≤ 10 μs, assuming D = 50 features
3. Find the K minimum distances at O(1) complexity: ≤ 3 μs
4. Majority calculation with in-place ranking

With the database in an APU, computation for all N items is done in ≤ 0.05 ms, independent of K.
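
Functionally, the whole flow is cosine similarity for every item in parallel, a K-smallest selection, and a majority vote. A plain-NumPy sketch over made-up data (the APU's timings and layout obviously don't carry over; argpartition stands in for the O(1) K-mins):

import numpy as np

def knn_classify(query, items, labels, k):
    # 1. cosine distance from the query to every item, "in parallel"
    sims = items @ query / (np.linalg.norm(items, axis=1)
                            * np.linalg.norm(query))
    dist = 1.0 - sims
    # 2. K minimum distances (the APU does this in O(1))
    nearest = np.argpartition(dist, k)[:k]
    # 3. majority vote over the K labels
    return np.bincount(labels[nearest]).argmax()

rng = np.random.default_rng(0)
items = rng.normal(size=(1000, 50))      # N items, D = 50 features
labels = rng.integers(0, 3, size=1000)   # 3 groups
print(knn_classify(items[0], items, labels, k=4))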

SLIDE 29

LARGE DATABASE EXAMPLE USING APU SERVERS

  • Number of items: billions
  • Features per item: tens to hundreds
  • Latency: ≤ 1 msec
  • Throughput: Scales to 1M similarity searches/sec
  • k-NN: Top 100,000 nearest neighbors
SLIDE 30

EXAMPLE: K-NN FOR RECOGNITION

Text (bag-of-words, word embedding) or an image (convolution layers) feeds a feature extractor (neural network), whose features go to a K-NN classifier (associative memory).

SLIDE 31

K-MINS: O(1) ALGORITHM

KMINS(int K, vector C) {
    M := 1; V := 0;                 // M: candidate mask, V: accepted minima
    FOR b = msb to lsb:
        D := not(C[b]);             // 1 where bit b of the value is 0
        N := M & D;                 // candidates with a 0 at bit b
        cnt := COUNT(N | V);
        IF cnt > K:                 // too many: tighten the candidate set
            M := N;
        ELIF cnt < K:               // all of N qualify; keep scanning
            V := N | V;
        ELSE:                       // cnt == K: exactly K found
            V := N | V;
            EXIT;
        ENDIF
    ENDFOR
}

(Diagram: the value bit columns C0, C1, C2, … are scanned from MSB to LSB, producing a tag vector N per column.)
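
A software rendering of the pseudocode, with NumPy boolean vectors standing in for the tag registers M, V, D, and N (boundary ties would need the hardware's index tie-break, which the slides don't show and this sketch omits):

import numpy as np

def k_mins(values, k, bits):
    M = np.ones(len(values), dtype=bool)   # candidate mask
    V = np.zeros(len(values), dtype=bool)  # accepted minima
    for b in reversed(range(bits)):        # MSB -> LSB, one pass per bit
        D = ((values >> b) & 1) == 0       # D = not(C[b])
        N = M & D                          # candidates with a 0 at bit b
        cnt = np.count_nonzero(N | V)
        if cnt > k:
            M = N                          # too many: tighten candidates
        elif cnt < k:
            V = N | V                      # all of N qualify; keep going
        else:
            return N | V                   # exactly k: done
    return V

vals = np.array([0b110101, 0b010100, 0b000101, 0b000111,
                 0b000001, 0b111011, 0b000101, 0b010110])
print(np.flatnonzero(k_mins(vals, 4, 6)))  # -> [2 3 4 6]

The loop runs once per bit of precision, so the cost is independent of both the number of records and K, which is the sense in which the APU's Top-K is O(1).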

SLIDE 32

K-MINS: THE ALGORITHM (WALKTHROUGH)

16 six-bit values, K = 4:

110101 010100 000101 000111 000001 111011 000101 010110 001100 011100 111100 101101 000101 111101 000101 000101

First iteration, bit column C[0] (the MSB): M starts as all ones and V as all zeros. D marks the 11 values whose MSB is 0, so cnt = COUNT(N | V) = 11 > K, and the candidate mask M narrows to those 11 values.

SLIDE 33

K-MINS: THE ALGORITHM (WALKTHROUGH, CONTINUED)

Second iteration, bit column C[1]: of the 11 remaining candidates, 8 also have a 0 in this bit, so cnt = 8 > K and M narrows again to those 8 values.

SLIDE 34

K-MINS: THE ALGORITHM (WALKTHROUGH, FINAL)

After scanning the remaining bit columns, V marks exactly the K = 4 minimum values as the final output. The number of passes depends only on the bit width, not on the number of records or on K: O(1) complexity.

SLIDE 35

DENSE (1XN) VECTOR BY SPARSE NXM MATRIX

Example: a dense input vector (4, −2, 3, −1) multiplies a sparse matrix that the APU stores as (row, column, value) triples, one triple per bit line.

  • Search all bit lines for row = 2, distribute −2 to the matches: 2 cycles
  • Search all bit lines for row = 3, distribute 3: 2 cycles
  • Search all bit lines for row = 4, distribute −1: 2 cycles
  • Multiply all bit lines in parallel: 100 cycles
  • Shift and add all products belonging to the same column to form the output vector

Complexity including IO: O(N + log β), where β is the number of nonzero elements in the sparse matrix. N << M in general for recommender systems.
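
A functional sketch of that flow (hypothetical triples, not the slide's exact matrix; NumPy's gather and add.at play the roles of the associative search/distribute and the shift-and-add reduction):

import numpy as np

# the sparse matrix as (row, col, val) triples, one per "bit line"
row = np.array([0, 1, 1, 2, 3])
col = np.array([0, 0, 2, 1, 2])
val = np.array([3.0, 5.0, 9.0, 17.0, 6.0])
x   = np.array([4.0, -2.0, 3.0, -1.0])   # dense input vector

# 1. search + distribute: every triple picks up its row's input element
#    (on the APU: one associative search per nonzero input, ~2 cycles each)
xd = x[row]

# 2. multiply all bit lines in parallel
prod = xd * val

# 3. shift-and-add: reduce products that share an output column
y = np.zeros(3)
np.add.at(y, col, prod)
print(y)   # [  2.  51. -24.]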

SLIDE 36

SPARSE MATRIX MULTIPLICATION PERFORMANCE ANALYSIS

  • G3 circuit matrix
  • 1.5M x 1.5M sparse matrix
  • Roughly 8M nonzero elements
  • 20–100 GFLOPS with a GPGPU solution
  • The APU solution provides 64 TFLOPS using the same amount of power as the GPGPU solution above
  • > 500x improvement with the APU solution
SLIDE 37

ASSOCIATIVE MEMORY FOR NATURAL LANGUAGE PROCESSING (NLP)

Q&A, dialog, language translation, speech recognition’ etc. Requires learning things from the past – needs memory More memory, more accuracy i.e. “Dan put the book in his car, ….. Long story here…. Mike took Dan’s car … Long story

here …. He drove to SF”

Q : Where is the book now? A: Car, SF

SLIDE 38

END-TO-END MEMORY NETWORKS

End-To-End Memory Networks (Sukhbaatar et al., NIPS 2015). (a): single hop, (b): 3 hops.

SLIDE 39

Q&A : END TO END NETWORK

SLIDE 40

REQUIREMENTS FOR AUGMENTED MEMORY

  • Cosine similarity search + Top-K
  • Compute softmax over the selected columns
  • Vertical multiplication of the selected columns and a horizontal sum
  • Output
  • Embed input features into the next column, into any other location, or into a location chosen based on content

One memory "hop" chains exactly these primitives, as sketched below.
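
A compact sketch of one hop with invented shapes (plain NumPy; restricting the softmax to the Top-K columns is the associative-memory-friendly step):

import numpy as np

def hop(query, memory, values, k):
    # 1. cosine similarity of the query against every memory column
    sims = memory @ query / (np.linalg.norm(memory, axis=1)
                             * np.linalg.norm(query) + 1e-9)
    # 2. Top-K: keep only the K best-matching slots
    top = np.argpartition(-sims, k)[:k]
    # 3. softmax over the selected slots only
    w = np.exp(sims[top] - sims[top].max())
    w /= w.sum()
    # 4. vertical multiply + horizontal sum: the weighted readout
    return w @ values[top]

rng = np.random.default_rng(1)
memory = rng.normal(size=(100, 16))   # embedded input sentences
values = rng.normal(size=(100, 16))   # output embeddings
q = rng.normal(size=16)
print(hop(q, memory, values, k=8).shape)   # (16,)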

SLIDE 41

APU MEMORY FOR NLP

(Diagram: a controller drives memory columns 0 … N−1, each holding a memory vector mᵢ with a tag Tᵢ.)

  • Broadcast input I to selected columns
  • Compute any function at the selected columns
  • Generate output O
  • Generate tags for selection

SLIDE 42

GSI SOLUTION FOR END TO END

Constant time of 3 µsec per iteration, for any memory size.

SLIDE 43

PROGRAMMING MODEL

SLIDE 44

PROGRAMMING MODEL

Application (C++, Python)
Graph execution and task-scheduling framework (TensorFlow)
APU (associative processing unit hardware)

The application and the framework run on the host; the APU is the device.

SLIDE 45

A TF EXAMPLE: MATMUL

a = tf.placeholder(tf.int32, shape=[3, 4])
b = tf.placeholder(tf.int32, shape=[4, 6])
c = tf.matmul(a, b)  # shape = [3, 6]

(Graph: a and b feed a matmul node that produces c.)

SLIDE 46

A TF EXAMPLE: MATMUL
GRAPH PREPARATION

During graph preparation, the APU device (tf + eigen) allocates device-space arrays for each tensor in the graph:

a = tf.placeholder(tf.int32, shape=[3, 4])   # gnl_create_array(a)
b = tf.placeholder(tf.int32, shape=[4, 6])   # gnl_create_array(b)
c = tf.matmul(a, b)                          # gnl_create_array(c)

(Diagram: a, b, and c allocated in APU device space; data moves from the host through L4 to L1 (MMB) in the apuc's.)
slide-47
SLIDE 47

A TF EXAMPLE: MATMUL
GVML_SET, GVML_MUL

with tf.Session() as sess:
    result = sess.run(c, feed_dict={
        a: [[ 2,  -4,  15,  3],
            [11,   1, -30,  4],
            [-8,  23,  -9,  7]],
        b: [[ 27,  -8, -14,  2,  9, -32],
            [ -7,  52,  -6, 21,  0,   4],
            [-81,   1,  20,  6, 19,  -3],
            [-38,  90,   5,  2, 13,  77]]})

On the device, the controller executes gnlpd_mat_mul(c, a, b). gnlpd_dma_16b_start(GNLPD_SYS_2_VMR, …) keeps the DMA loading data into the apuc while the matmul is being computed in the apuc. gvml_set_16(……) broadcasts a scalar element of a across a vector register, and gvml_mul_s16(……) multiplies it by a row of b in parallel; for example, broadcasting 2 against the row 27 -8 -14 2 9 -32 yields 54 -16 -28 4 18 -64.

SLIDE 48

TENSORFLOW ENHANCEMENT: FUSED OPERATIONS

a = tf.placeholder(tf.int32, shape=[3, 4])
b = tf.placeholder(tf.int32, shape=[4, 6])
c = tf.matmul(a, b)       # shape = [3, 6]
d = tf.nn.top_k(c, k=2)   # shape = [3, 2]

The matmul and top_k nodes are replaced by a single fused(matmul, top_k) node:

  • The two operations are computed inside the apuc
  • Data stays in L1
  • No IO operations between them
  • Saves valuable data transfer time and power

SLIDE 49

A TF EXAMPLE: MATMUL
CODE EXAMPLE

Inside the APU device (MMB/L1), the kernels are written as APL fragments, such as this 16-bit add:

APL_FRAG add_u16(RN_REG x, RN_REG y, RN_REG t_xory, RN_REG t_ci0)
{
    SM_0XFFFF: RL = SB[x];                         // RL[0-15] = x
    SM_0XFFFF: RL |= SB[y];                        // RL[0-15] = x|y
    {
        SM_0XFFFF: SB[t_xory] = RL;                // t_xory[1-15] = x|y
        SM_0XFFFF: RL = SB[x, y];                  // RL[0-15] = x&y
    }
    // Add init state:
    //   0: RL = co[0]
    //   1..15: RL = x&y
    {
        (SM_0X1111 << 1): SB[t_ci0] = NRL;         // t_ci0[5,9,13] = x&y
        (SM_0X1111 << 1): RL |= SB[t_xory] & NRL;  // RL[1] = Cout[1] = x&y | ci(x|y)
                                                   // 5,9,13: RL = Cout0[5,9,13] = x&y | ci(x|y)
        (SM_0X1111 << 4): RL |= SB[t_xory];        // RL[4,8,12] = Cout1[4,8,12] = x&y | 1&(x|y)
    }
    {
        (SM_0X1111 << 2): SB[t_ci0] = NRL;         // propagate Cin0
        (SM_0X1111 << 2): RL |= SB[t_xory] & NRL;  // propagate Cout0
        (SM_0X1111 << 5): RL |= SB[t_xory] & NRL;  // propagate Cout1
    }
    {
        (SM_0X1111 << 3): SB[t_ci0] = NRL;         // propagate Cin0
        (SM_0X1111 << 3): RL |= SB[t_xory] & NRL;  // propagate Cout0
        (SM_0X1111 << 6): RL |= SB[t_xory] & NRL;  // propagate Cout1
        (SM_0X0001 << 15): GL = RL;
    }
    {
        (SM_0X1111 << 4): SB[t_ci0] = NRL;         // t_ci0[8,12,16] = Cout0[7, 11, 15]
        SM_0X0001: SB[t_ci0] = GL;                 // t_ci0[8,12,16] = Cout0[7, 11, 15]
        (SM_0X1111 << 7): RL |= SB[t_xory] & NRL;  // propagate Cout1
        (SM_0X0001 << 15): GL = RL;
    }
    . . .

SLIDE 50

FUTURE APPROACH – NON-VOLATILE CONCEPT

SLIDE 51

SOLUTIONS FOR FUTURE DATA CENTERS

(Diagram: the traditional hierarchy of CPU register file, L1/L2/L3 caches, DRAM, Flash, and HDD, with associative-memory tiers alongside the DRAM level.)

  • Standard SRAM-based (volatile, high endurance): full computing (floating point etc.), requiring both reads and writes
  • STT-RAM-based (non-volatile, mid endurance): machine learning, malware detection, etc.; much more read and much less write
  • PC-RAM- and ReRAM-based (non-volatile, low endurance): data search engines (read most of the time)

SLIDE 52

THANK YOU