In-Place Associative Computing
Avidan Akerib, Ph.D., Vice President, Associative Computing BU (aakerib@gsitechnology.com)
All images are publicly available on the web.
Agenda
- Introduction to associative computing
- Use case examples:
  - Similarity search
  - Neural network learning
  - Real-time inference, saving memory
  - Big data: Top-K, recommendation, speech, image/video classification
  - Non-linearity: Softmax, exponent, normalization
- High speed at low power
Memory: high density (repeated cells), slower. Logic: lower density (lots of logic), faster.
Source: Intel
[Figure: CPU connected to memory; data must travel back and forth between them]

[Figure: GPGPU with multiple memories]
Source: Song Han, Stanford University
Simple CPU: data moves over a simple, narrow bus.
APU (associative processing): send a question, get an answer; the memory itself acts as millions of processors.
[Figure: a memory array (rows 1000, 1100, 1001, ...) with sense amps / IO drivers. Asserting read enable (RE) on several rows at once makes each bit line compute the NOR of the selected cells (e.g., NOR of rows 1000, 1100, 1001 is 0010); write enable (WE) then stores the result into another row.]
16
A B C 1 1 1 1 1 1 1 1 1 1 1 1 D 1 1 1 1
AB C
00 01 11 10
1 !A!C + BC = !!( !A!C + BC ) = ! (!(!A!C)!(BC))
= NAND( NAND(!A,!C),NAND(B,C))
Read (B,C) ; WRITE T2 Read (!A,!C) ; WRITE T1 Read (T1,T2) ; WRITE D 1 CLOCK 1 CLOCK
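The three-clock sequence can be checked in software. A minimal sketch with NumPy boolean vectors standing in for memory rows (the `nand` helper models the bit-line operation; variable names are illustrative):

```python
import numpy as np

def nand(x, y):
    # Reading two rows at once (with inverted data) NANDs them on each bit line.
    return ~(x & y)

rng = np.random.default_rng(0)
A = rng.integers(0, 2, 8).astype(bool)
B = rng.integers(0, 2, 8).astype(bool)
C = rng.integers(0, 2, 8).astype(bool)

T2 = nand(B, C)      # clock 1: Read(B, C)   ; WRITE T2
T1 = nand(~A, ~C)    # clock 2: Read(!A, !C) ; WRITE T1
D = nand(T1, T2)     # clock 3: Read(T1, T2) ; WRITE D

# D equals the truth-table function !A!C + BC at every bit position
assert np.array_equal(D, (~A & ~C) | (B & C))
```

Every bit position is computed simultaneously, which is why the clock count depends on the logic depth, not the vector length.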
Vector addition: A[] + B[] = C[]

vector A(8, 32M)
vector B(8, 32M)
vector C(9, 32M)
C = A + B

Adding two 32M-element 8-bit vectors takes about 32 clocks in total, so clocks/byte = 32/32M = 1/1M; at 1 GHz this yields roughly 1 GHz x 1M = 10^15 operations per second.
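Why the clock budget scales with word width rather than vector length can be sketched as a bit-serial ripple-carry add over bit planes. This is a plain NumPy illustration of the idea, not the device's actual microcode; names and shapes are assumptions:

```python
import numpy as np

def bitserial_add(A_bits, B_bits):
    """Bit-serial ripple-carry add over bit planes.

    A_bits, B_bits: boolean arrays of shape (n_bits, n_elems),
    one row per bit plane, LSB first. Each plane costs a constant
    number of steps regardless of n_elems.
    """
    n_bits, n_elems = A_bits.shape
    out = np.zeros((n_bits + 1, n_elems), dtype=bool)
    carry = np.zeros(n_elems, dtype=bool)
    for b in range(n_bits):          # step count grows with word width only
        a, bb = A_bits[b], B_bits[b]
        out[b] = a ^ bb ^ carry      # full-adder sum
        carry = (a & bb) | (carry & (a ^ bb))
    out[n_bits] = carry              # 9th plane holds the final carry
    return out

def to_planes(v, n_bits):
    return np.array([(v >> b) & 1 for b in range(n_bits)], dtype=bool)

def from_planes(p):
    return sum((p[b].astype(int) << b) for b in range(p.shape[0]))

A = np.array([5, 200, 17, 255])
B = np.array([3, 100, 40, 1])
C = from_planes(bitserial_add(to_planes(A, 8), to_planes(B, 8)))
# C == A + B element-wise, with a 9-bit result
```

With 32M elements per plane, the same 8 plane-steps would add the whole vector at once.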
Exact-match search over records:
- Duplicate the stored values with their inverse data alongside.
- Duplicate the key, with the inverted key next to the inverse data.
- Each 1 in the combined key drives the read enable (RE) of the corresponding row; matching records are marked in place.
Search with don't-cares:
- Duplicate the data, inverting only the bits that are not don't-care.
- Duplicate the key with its inverse data; each 1 in the combined key drives the read enable (RE).
- Insert zeros in place of don't-care bits so they never affect the match.
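The masked search described above behaves like a ternary CAM lookup. A small simulation under that reading (function and argument names are hypothetical):

```python
import numpy as np

def cam_search(records, key, care_mask):
    """Content search with don't-cares, TCAM-style.

    records: (n, w) boolean array; key, care_mask: length-w boolean arrays.
    A record matches if it equals the key on every 'care' bit. This mirrors
    the slide's trick: zeroed don't-care positions never raise a mismatch.
    """
    # A record mismatches iff some care bit differs from the key.
    mismatch = ((records ^ key) & care_mask).any(axis=1)
    return ~mismatch   # in-place match markers, one per record

recs = np.array([[1, 0, 1, 0],
                 [1, 1, 1, 0],
                 [0, 0, 1, 1]], dtype=bool)
key  = np.array([1, 0, 1, 0], dtype=bool)
care = np.array([1, 0, 1, 1], dtype=bool)   # second bit is don't-care
hits = cam_search(recs, key, care)
# marks records 0 and 1; record 2 differs on a care bit
```

All records are tested at once; the returned marker vector is the in-place "mark" of the associative search.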
[Figure: vectors A = (a0 ... a7) and B = (b0 ... b7) stored across the array]

Shift vector: a vector's elements can be shifted along the array so that neighboring elements line up.
[Figure: searching the records for the value 20 marks each matching row in place; counting the marks gives the number of hits (Count = 3).]
Current solution vs. in-place associative computing:
- Send an address to memory → search by content.
- Fetch the data from memory and send it to the processor → mark in place.
- Compute serially per core (thousands of cores at most) → compute in place on millions of processors (the memory itself becomes millions of processors).
- Write the data back to memory, further wasting IO resources → no need to write data back; the result is already in the memory.
- Send data to each location that needs it → if needed, distribute or broadcast at once.
Shifting between sections enables neighborhood operations.
[Figure: APU architecture; MLB sections (section 0, section 1, ...) of 24 rows each, joined by connecting muxes, driven by an instruction buffer and control logic.]
2M bit processors (or 128K vector processors) running at 1 GHz, with up to 2 peta-OPS of peak performance.
Multi-functional, programmable blocks; acceleration of floating point (FP).
Cosine similarity between a query E_q and a stored record R:

    D_q = (E_q · R) / (‖E_q‖ ‖R‖)
        = Σ_{j=0..o} E_j^q R_j / ( √(Σ_{j=0..o} (E_j^q)²) · √(Σ_{j=0..o} (R_j)²) )
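The same score can be computed against all stored records at once; a NumPy sketch of the formula above (names are illustrative):

```python
import numpy as np

def cosine_scores(E, R):
    """Cosine similarity of query R against every row of E at once.

    E: (N, D) item-feature matrix; R: (D,) query vector.
    Vectorized over all N items, mirroring the in-memory parallel compute.
    """
    num = E @ R                                   # all dot products E_q . R
    den = np.linalg.norm(E, axis=1) * np.linalg.norm(R)
    return num / den

E = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
R = np.array([1.0, 1.0])
scores = cosine_scores(E, R)
# the row parallel to R scores 1.0; the axis-aligned rows score ~0.707
```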
[Figure: a query Q = (4, 1, 3, 2) is compared against the stored features of item 1, item 2, item 3, ..., item N.]

k-NN on the APU:
- Item features and labels are stored in the computing area.
- Distribute the query data to all rows: 2 ns.
- Compute cosine distances for all N items in parallel (10s, assuming D = 50 features).
- In-place ranking: K minima at O(1) complexity (3s).
- Majority calculation over the K nearest labels.
KMINS(int K, vector C) {
    M := 1; V := 0;
    FOR b = msb downto lsb:
        D := not(C[b]);
        N := M & D;
        cnt := COUNT(N | V);
        IF cnt > K:
            M := N;
        ELIF cnt < K:
            V := N | V;
        ELSE:            // cnt == K
            V := N | V;
            EXIT;
        ENDIF
    ENDFOR
}

[Figure: bit columns C0, C1, C2 of the candidates, scanned from MSB to LSB; N marks the surviving candidates.]
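The pseudocode translates almost line-for-line into Python. A sketch that follows it faithfully (it assumes the K smallest values become separable at some bit, as in the slide's example):

```python
import numpy as np

def k_mins(K, C, n_bits):
    """Mark the K smallest values in C, scanning bits MSB to LSB.

    Follows the KMINS pseudocode: M marks remaining candidates,
    V marks confirmed winners, COUNT is a global population count.
    """
    C = np.asarray(C)
    M = np.ones(len(C), dtype=bool)    # M := 1
    V = np.zeros(len(C), dtype=bool)   # V := 0
    for b in range(n_bits - 1, -1, -1):       # msb -> lsb
        D = ~((C >> b) & 1).astype(bool)      # D := not(C[b])
        N = M & D
        cnt = np.count_nonzero(N | V)         # COUNT(N|V)
        if cnt > K:
            M = N                             # too many: narrow candidates
        elif cnt < K:
            V = N | V                         # all of these are winners
        else:
            return N | V                      # exactly K marked
    return V

vals = [53, 20, 5, 7, 1, 59, 22, 12, 28, 60, 45, 61]
marked = k_mins(4, vals, 6)
# marks the four smallest values: 1, 5, 7, 12
```

Each iteration touches one bit plane of every candidate at once, so the cost is O(word width), independent of the number of records.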
[Worked example: 16 six-bit values (110101, 010100, 000101, 000111, 000001, 111011, 000101, 010110, 001100, 011100, 111100, 101101, 000101, 111101, 000101, 000101). Scanning bit C[0] gives cnt = 11, so M is narrowed to those candidates; scanning C[1] gives cnt = 8, narrowing again; the process repeats bit by bit until exactly K values remain marked as the final output.]
Database: every image, sentence, or document has a label. A feature extractor (a neural network: convolution layers for images, or word/sentence/document embeddings for text) turns each item into a feature vector.
APU representations and computing: sparse matrix x vector.

The sparse matrix is stored as (row, column, value) entries next to the input vector, and the output vector is built in place:
- Search all columns for row = 2, distribute -2: 2 cycles
- Search all columns for row = 3, distribute 3: 2 cycles
- Search all columns for row = 4, distribute -1: 2 cycles
- Multiply in parallel: 100 cycles
- Shift and add all partial products belonging to the same column.

[Figure: example input vector, sparse matrix entries, and output vector.]
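The search-and-distribute flow can be mimicked in software. A sketch over (row, column, value) triples, using the slide's convention of distributing by row index and accumulating by column index (so it computes y[j] = sum_i x[i]*A[i,j]); names are illustrative:

```python
import numpy as np

def coo_spmv(rows, cols, vals, x, n_out):
    """Sparse matrix-vector product, mirroring the APU flow:
    for each row index, 'search' the matching entries and 'distribute'
    x[r] to them; then multiply all entries in parallel and accumulate
    (shift-and-add) per output column.
    """
    vals = np.asarray(vals, dtype=float)
    dist = np.zeros_like(vals)
    for r in np.unique(rows):            # one search + distribute per row value
        dist[np.asarray(rows) == r] = x[r]
    prod = vals * dist                   # multiply in parallel
    y = np.zeros(n_out)
    np.add.at(y, cols, prod)             # accumulate per column
    return y

rows, cols = [0, 0, 1], [0, 1, 1]
vals, x = [2.0, 3.0, 4.0], [10.0, 20.0]
y = coo_spmv(rows, cols, vals, x, 2)
# y[j] = sum_i x[i] * A[i, j] under this row/column convention
```

The loop runs once per distinct row index (2 cycles each on the APU); the multiply and the per-column accumulation are the fully parallel steps.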
Sparse matrix x sparse matrix (In-DB1 x In-DB2 → Out-DB):
1. Choose the next free entry from In-DB1.
2. Read its row value.
3. Search and mark similar rows.
4. For all marked rows, search where Col(In-DB1) = Row(In-DB2).
5. Broadcast the selected value to the output-table bit lines.
6. Multiply in parallel.
7. Shift and add all products belonging to the same column.
8. Update Out-DB.
9. Go back to step 1 if there are more free entries; otherwise exit.

[Figure: example In-DB1, In-DB2, and Out-DB entry tables for the product X = In-DB1 x In-DB2.]
Complexity including IO: O(β + log β), compared to O(β^0.7 N^1.2 + N^2) on a CPU (> 1000x improvement).
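The steps above can be sketched over COO triples. This simplified version iterates entries one at a time rather than marking whole rows in parallel; names are illustrative:

```python
from collections import defaultdict

def coo_matmul(db1, db2):
    """Sparse x sparse multiply over (row, col, value) triples,
    following the steps above: take each entry of In-DB1, search
    In-DB2 for rows equal to its column, multiply the matched
    values, and accumulate into Out-DB by (row, col).
    """
    out = defaultdict(float)
    for (r1, c1, v1) in db1:           # choose next entry from In-DB1
        for (r2, c2, v2) in db2:       # search where Col(In-DB1) == Row(In-DB2)
            if c1 == r2:
                out[(r1, c2)] += v1 * v2   # multiply, then shift-and-add
    return dict(out)

A = [(0, 1, 2.0), (1, 0, 3.0)]         # entries (row, col, value)
B = [(1, 0, 4.0), (0, 2, 5.0)]
out = coo_matmul(A, B)
# A@B: out[(0, 0)] = 2*4 = 8; out[(1, 2)] = 3*5 = 15
```

On the APU the inner search, the multiply, and the accumulation each run over all marked entries at once, which is where the O(β + log β) bound comes from.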
“Dan put the book in his car, ….. Long story here…. Mike took Dan’s car … Long story here …. He drove to SF”
Source: Łukasz Kaiser
Input data (e.g., a sentence in English, for translation or Q&A) is embedded into a feature vector; the sentence feature representations serve as keys.
Attention on the APU:
- Compute the dot product of the query feature vector with every key at once: O(1).
- Over the dot-product results, compute Top-K: O(1), and SoftMax: O(1).
- The SoftMax-weighted values (V1, V2, ..., V6, ...) feed the next stage (encoder or decoder).
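The flow above (dot products, Top-K, SoftMax, weighted sum of values) can be sketched in NumPy. Shapes and names are assumptions:

```python
import numpy as np

def topk_attention(query, keys, values, k):
    """One attention step as in the flow above: dot products of the
    query against all keys, keep only the top-k scores, SoftMax them,
    and return the weighted sum of the corresponding values.
    keys: (N, D); values: (N, Dv); query: (D,).
    """
    scores = keys @ query                  # all dot products at once
    top = np.argsort(scores)[-k:]          # top-k indices
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()                           # SoftMax over the k survivors
    return w @ values[top]                 # weighted sum of values

q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
V = np.array([[1.0], [10.0], [100.0]])
out = topk_attention(q, K, V, k=2)
```

Restricting SoftMax to the top-k survivors is what keeps the exponent and normalization cheap on the device.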
Source: Weston et al
Image labeling by similarity:
- Input images with labels → feature extractor (convolution layers) → embedded input features (pixels → features).
- A similar image without a label is matched against the labeled features by cosine-similarity search + Top-K.
- The label of the most similar images becomes the label of the query image.
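The pipeline's final stage can be sketched as cosine search plus a top-k majority vote (data and names are illustrative):

```python
import numpy as np
from collections import Counter

def knn_label(db_feats, db_labels, query, k):
    """Label an unlabeled item by cosine-similarity search + Top-K
    majority vote over a labeled feature database.
    db_feats: (N, D); db_labels: length-N; query: (D,).
    """
    sims = (db_feats @ query) / (
        np.linalg.norm(db_feats, axis=1) * np.linalg.norm(query))
    top = np.argsort(sims)[-k:]                       # k nearest neighbors
    return Counter(db_labels[i] for i in top).most_common(1)[0][0]

feats = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.0, 0.9]])
labels = ["cat", "cat", "dog", "dog"]
pred = knn_label(feats, labels, np.array([1.0, 0.0]), 3)
# a query near the first cluster is labeled "cat"
```

On the APU, the similarity computation and the Top-K selection are the in-place parallel steps; only the majority vote touches k items.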
[Slide fragment: feature vectors (... features in VGG, very sparse), a transformation step, associative processing.]
k-NN Data Base
k-NN Data Base (Associative)
Write the application on a standard host using the TensorFlow / Tensor2Tensor framework. The framework generates a TensorFlow graph for execution in device memory; the APU chip/card executes the graph using its fused capabilities.
[Figure: non-volatile bit cell with a sense unit and write control; a select/REF line provides the NOR/NAND input, and the bit line carries logic "1" or "0" as the NOR/NAND result.]
Memory hierarchy: CPU register file → L1/L2/L3 → DRAM → Flash/HDD. Associative computing can be built on several cell technologies:

- Standard SRAM-based (volatile, high endurance): full computing (floating point, etc.), which requires both read and write.
- STT-RAM-based (non-volatile, mid endurance): machine learning, malware detection, etc.; much more read and much less write.
- PC-RAM / ReRAM-based (non-volatile, low endurance): data search engines, which read most of the time.
The APU enables state-of-the-art, next-generation machine learning.