SLIDE 1

In-Place Associative Computing

Avidan Akerib, Ph.D., Vice President, Associative Computing BU, aakerib@gsitechnology.com

All images are publicly available on the web.

SLIDE 2

Agenda

  • Introduction to associative computing
  • Use case examples
  • Similarity search
  • Large Scale Attention Computing
  • Few-shot learning
  • Software model
  • Future Approaches
SLIDE 3

The Challenge in AI Computing (Matrix Multiplication Is Not Enough!)

AI Requirement → Use Case Example

  • High-precision floating point → Neural network learning
  • Multi-precision → Real-time inference, saving memory
  • Linearly scalable → Big data
  • Sort/search → Top-K, recommendation, speech, image/video classification
  • Heavy computation → Non-linearity, softmax, exponent, normalization
  • Bandwidth/power tradeoff → High speed at low power

SLIDE 4

Von Neumann Architecture

[Diagram: CPU connected to memory. CPU: lower density (lots of logic), faster. Memory: high density (repeated cells), slower.]

Leveraging Moore's Law

SLIDE 5

Von Neumann Architecture

[Diagram: CPU with cache connected to memory. CPU: lower density (lots of logic), faster. Memory: high density (repeated cells), slower.]

CPU frequency outpaces memory, so caches are added… continuing to leverage Moore's Law.

SLIDE 6

Since 2006, Clock Speeds Have Flattened Sharply…

Source: Intel

SLIDE 7

Thinking Parallel: 2 Cores and More

[Diagram: multiple CPU cores sharing one memory]

However, memory utilization becomes an issue…

SLIDE 8

More and More Memory to Solve the Utilization Problem

[Diagram: CPU cores with local and global memories]

SLIDE 9

Memory Still Growing Rapidly

[Diagram: CPU and memory on one chip]

Memory becomes a larger part of each chip.

SLIDE 10

Same Concept Even with GPGPUs

[Diagram: GPGPU surrounded by its memories]

Very high power, large die, expensive… What's next?

SLIDE 11

Most of the Power Goes to Bandwidth

Source: Song Han, Stanford University

SLIDE 12

Changing the Rules of the Game!

Standard memory cells are smarter than we thought!

SLIDE 13

APU: Associative Processing Unit

  • Computes in place, directly in the memory array, removing the I/O bottleneck
  • Significantly increases performance
  • Reduces power

[Diagram: a simple CPU sends a question over a simple, narrow bus; the APU answers it with millions of in-memory processors]

SLIDE 14

How Computers Work Today

[Diagram: an address (e.g., 1000 1100 1001) drives the address decoder, which selects one row for read/write (RE/WE); the sense amps / IO drivers move the stored word (e.g., 0101) to the ALU]

SLIDE 15

Accessing Multiple Rows Simultaneously

[Diagram: rows 0101 and 1000 are read-enabled (RE) at once; each shared bit line resolves to the NOR of the selected cells, 0010, which is write-enabled (WE) into another row]

Bus contention is not an error! It is simply a NOR/NAND, satisfying De Morgan's laws.
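As a software analogue (a NumPy sketch of the effect, not of the actual circuit), reading several rows at once yields the bitwise NOR of those rows on every bit line simultaneously; for the rows 0101 and 1000 shown above, the bit lines resolve to 0010:

    import numpy as np

    # Memory array: each row is a stored word, each column a bit line.
    mem = np.array([[0, 1, 0, 1],
                    [1, 0, 0, 0],
                    [1, 1, 0, 0],
                    [1, 0, 0, 1]], dtype=np.uint8)

    def multi_row_read(mem, rows):
        # With several rows read-enabled (RE), each shared bit line
        # resolves to the NOR of the selected cells, all in parallel.
        return 1 - np.bitwise_or.reduce(mem[rows], axis=0)

    result = multi_row_read(mem, [0, 1])  # NOR(0101, 1000) -> [0 0 1 0]
    mem[3] = result                       # WE: write the result to a row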

SLIDE 16

Truth Table Example

[Truth table and Karnaugh map (AB columns 00/01/11/10, C rows) for D = !A!C + BC; D = 1 in four of the eight rows]

D = !A!C + BC = !!( !A!C + BC ) = !( !(!A!C) · !(BC) ) = NAND( NAND(!A,!C), NAND(B,C) )

Read(B,C); WRITE T2 (1 clock)
Read(!A,!C); WRITE T1 (1 clock)
Read(T1,T2); WRITE D (1 clock)

  • Every minterm takes one clock
  • All bit lines execute the Karnaugh table in parallel
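To make the three-clock sequence concrete, here is a minimal NumPy sketch (my illustration; each nand call stands for one parallel read-and-write cycle across all bit lines):

    import numpy as np

    rng = np.random.default_rng(0)
    A, B, C = (rng.integers(0, 2, 32, dtype=np.uint8) for _ in range(3))

    def nand(x, y):
        # One clock: all bit lines evaluate NAND of two rows in parallel.
        return 1 - (x & y)

    T2 = nand(B, C)          # clock 1: Read(B, C);   WRITE T2
    T1 = nand(1 - A, 1 - C)  # clock 2: Read(!A, !C); WRITE T1
    D  = nand(T1, T2)        # clock 3: D = !A!C + BC on every bit line

    assert np.array_equal(D, ((1 - A) & (1 - C)) | (B & C))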

SLIDE 17

Vector Add Example

A[] + B[] = C[]

vector A(8, 32M); vector B(8, 32M); vector C(9, 32M); C = A + B

  • Number of clocks = 4 × 8 = 32 (4 Boolean steps per bit, 8 bits)
  • Clocks per byte = 32 / 32M = 1/1M
  • OPS = 1 GHz × 1M = 1 PetaOPS
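The clock count comes from a bit-serial full adder: each of the 8 bit positions costs a constant number of Boolean steps (4 here), and every element sits on its own bit line. A NumPy sketch of the arithmetic (the array stands in for the 32M parallel bit lines):

    import numpy as np

    M = 1 << 20  # stands in for 32M bit lines
    rng = np.random.default_rng(1)
    A = rng.integers(0, 256, M, dtype=np.uint16)
    B = rng.integers(0, 256, M, dtype=np.uint16)

    C = np.zeros(M, dtype=np.uint16)
    carry = np.zeros(M, dtype=np.uint16)
    for b in range(8):                         # 8 bit positions
        a, bb = (A >> b) & 1, (B >> b) & 1
        s = a ^ bb ^ carry                     # sum bit
        carry = (a & bb) | (carry & (a ^ bb))  # carry-out
        C |= s << b
    C |= carry << 8                            # 9th bit: vector C(9, 32M)

    assert np.array_equal(C, A + B)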

SLIDE 18

CAM / Associative Search

KEY: search for 0110 among the stored values (records); 1 = match.

  • Duplicate the values with their inverse data
  • Duplicate the key with its inverse, and place the original key next to the inverse data
  • Each 1 in the combined key drives a read enable (RE); bit lines that read back 1 are the matching records
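In software terms (a NumPy sketch of the scheme as I read the slide), storing each record alongside its complement and searching with the key concatenated to its complement reduces exact match to one parallel multi-row read per bit line:

    import numpy as np

    # Each column is a record on its own bit line; each row a stored bit.
    records = np.array([[0, 1, 1, 0],    # record 0
                        [1, 0, 0, 1],    # record 1
                        [0, 1, 1, 0]],   # record 2
                       dtype=np.uint8).T               # bits x records
    stored = np.vstack([records, 1 - records])         # values + inverses

    def cam_search(stored, key):
        key = np.asarray(key, dtype=np.uint8)
        combined = np.concatenate([key, 1 - key])      # key + inverse key
        selected = stored[combined == 1]               # rows with RE = 1
        return np.bitwise_and.reduce(selected, axis=0) # 1 = match

    print(cam_search(stored, [0, 1, 1, 0]))  # -> [1 0 1]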

SLIDE 19

TCAM Search by Standard Memory Cells

[Diagram: stored records in which some bit positions are marked Don't Care; those positions never affect the match]

SLIDE 20

TCAM Search by Standard Memory Cells

KEY: search for 0110; 1 = match.

  • Duplicate the data, inverting only the bits that are not don't-care
  • Insert zero instead of each don't-care bit
  • Duplicate the key with its inverse, and place the original key next to the inverse data
  • Each 1 in the combined key drives a read enable (RE); bit lines that read back 1 are the matching records

SLIDE 21

Computing in the Bit Lines

Each bit line becomes both a processor and storage. Millions of bit lines = millions of processors.

[Diagram: vectors A (a0…a7) and B (b0…b7) stored along each bit line; C = f(A,B) is computed per bit line]

SLIDE 22

Neighborhood Computing

A parallel shift of the bit lines (one cycle between sections) enables neighborhood operations such as convolutions: C = f(A, SL(B,1)).

[Diagram: shift vector]
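As a sketch of why a one-cycle shift is enough for convolutions (NumPy; SL stands for the slide's shift operation, and the filter taps are my example):

    import numpy as np

    B = np.arange(8, dtype=np.float64)

    def SL(v, n):
        # Parallel shift across bit lines (one cycle per step).
        # Note: np.roll wraps around; real hardware handles edges.
        return np.roll(v, -n)

    # A 3-tap filter built purely from shifted copies and adds:
    w = (0.25, 0.5, 0.25)
    C = w[0] * SL(B, -1) + w[1] * B + w[2] * SL(B, 1)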

SLIDE 23

Search & Count

Example: records [5, 20, 17, 3, 20, 54, 20, 8]; search for 20 → three bit lines flag a match (1 = match), count = 3.

  • Search (binary or ternary) all bit lines in 1 cycle
  • 128M bit lines => 128 Peta-searches/sec

Key applications of search and count for predictive analytics:
  • Recommender systems
  • K-nearest neighbors (using cosine similarity search)
  • Random forest
  • Image histograms
  • Regular expressions
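A plain software analogue of the example above (the APU performs the compare on every bit line in one cycle and counts the match flags in hardware):

    import numpy as np

    records = np.array([5, 20, 17, 3, 20, 54, 20, 8])

    match = (records == 20)    # associative search: all bit lines at once
    count = int(match.sum())   # count the match flags
    print(count)               # -> 3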

SLIDE 24

CPU vs GPU vs FPGA vs APU

SLIDE 25

CPU/GPGPU vs APU

CPU/GPGPU (current solution) → In-place computing (APU):

  • Send an address to memory → Search by content
  • Fetch the data from memory and send it to the processor → Mark in place
  • Compute serially per core (thousands of cores at most) → Compute in place on millions of processors (the memory itself becomes millions of processors)
  • Write the data back to memory, further wasting I/O resources → No need to write data back; the result is already in the memory
  • Send data to each location that needs it → If needed, distribute or broadcast at once

SLIDE 26

ARCHITECTURE

SLIDE 27

Store, compute, search, and transport data anywhere.

Shifts between sections enable neighborhood operations (filters, CNNs, etc.).

[Diagram: communication between sections]

SLIDE 28

Section Computing to Improve Performance

[Diagram: the memory array is split into MLB sections of 24 rows each (MLB section 0, MLB section 1, …) joined by connecting muxes, under a control block with an instruction buffer]

SLIDE 29

APU Chip Layout

2M bit processors (or 128K vector processors) running at 1 GHz, with up to 2 PetaOPS peak performance.

SLIDE 30

APU Layout vs GPU Layout

APU: multi-functional, programmable blocks. GPU: acceleration of FP operation blocks.
SLIDE 31

EXAMPLE APPLICATIONS

SLIDE 32

K-Nearest Neighbors (k-NN)

Simple example: N = 36 points, 3 groups, 2 dimensions (D = 2: X and Y), K = 4. Group green is selected as the majority. In actual applications: N = billions, D = tens, K = tens of thousands.

SLIDE 33

k-NN Use Case in an APU

Cosine similarity between the query embedding $E^q$ and each stored record $R$:

$D_q = \frac{E^q \cdot R}{\|E^q\|\,\|R\|} = \frac{\sum_{j=0}^{o} E^q_j R_j}{\sqrt{\sum_{j=0}^{o} (E^q_j)^2}\,\sqrt{\sum_{j=0}^{o} R_j^2}}$

Flow (all N items processed in parallel):
  • Distribute the query to all bit lines: 2 ns (to all)
  • Compute cosine distances for all N in parallel: ~10 µs, assuming D = 50 features
  • In-place ranking, K mins at O(1) complexity: ~3 µs
  • Majority calculation over the K labels

[Diagram: a computing area above the storage of item features and labels (features of item 1, item 2, … item N)]

With the database in an APU, computation for all N items is done in ~0.05 ms, independent of K (1000X improvement over current solutions).
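A NumPy sketch of the same flow (names and sizes are mine; on the APU the similarity, Top-K, and majority steps happen in place):

    import numpy as np

    N, D, K = 1_000_000, 50, 100
    rng = np.random.default_rng(2)
    db = rng.standard_normal((N, D)).astype(np.float32)  # item features
    labels = rng.integers(0, 10, N)                      # item labels
    q = rng.standard_normal(D).astype(np.float32)        # query E^q

    # Cosine similarity of the query to all N items at once
    sims = (db @ q) / (np.linalg.norm(db, axis=1) * np.linalg.norm(q))

    top_k = np.argpartition(-sims, K)[:K]       # K best (cf. K-MINS below)
    pred = np.bincount(labels[top_k]).argmax()  # majority vote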

SLIDE 34

K-MINS: O(1) Algorithm

KMINS(int K, vector C) {
  M := 1; V := 0;
  FOR b = msb TO lsb:
    D := not(C[b]);
    N := M & D;
    cnt := COUNT(N | V);
    IF cnt > K:
      M := N;
    ELIF cnt < K:
      V := N | V;
    ELSE:            // cnt == K
      V := N | V;
      EXIT;
    ENDIF
  ENDFOR
}

[Diagram: bit columns C0, C1, C2, …, scanned from MSB to LSB; N marks the current candidates]
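A runnable translation of the pseudocode (my Python sketch; on the APU, D, N, and COUNT are each single-cycle vector operations regardless of length, so the loop is O(number of bits), effectively O(1) in N):

    import numpy as np

    def k_mins(K, C, bits):
        # M: candidates still in the running; V: confirmed minima.
        M = np.ones(len(C), dtype=bool)
        V = np.zeros(len(C), dtype=bool)
        for b in range(bits - 1, -1, -1):    # msb -> lsb
            D = ((C >> b) & 1) == 0          # D := not(C[b])
            N = M & D
            cnt = np.count_nonzero(N | V)    # COUNT(N|V)
            if cnt > K:
                M = N
            elif cnt < K:
                V = N | V
            else:
                return N | V                 # exactly K marked
        # Loop exhausted: remaining candidates tie for the last slots
        # (assumption: the slide does not specify tie-breaking).
        return M | V

    C = np.array([0b110101, 0b010100, 0b000101, 0b000111,
                  0b000001, 0b111011, 0b000101, 0b010110,
                  0b001100, 0b011100, 0b111100, 0b101101,
                  0b000101, 0b111101, 0b000101, 0b000101])
    print(np.count_nonzero(k_mins(8, C, 6)))  # -> 8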

SLIDE 35

K-MINS: The Algorithm (step 1: bit C[0], the MSB)

[Trace over 16 six-bit values (110101, 010100, 000101, 000111, 000001, 111011, 000101, 010110, 001100, 011100, 111100, 101101, 000101, 111101, 000101, 000101): M starts with all rows marked; D and N mark the 11 rows whose MSB is 0; COUNT(N|V) = 11]

SLIDE 36

K-MINS: The Algorithm (step 2: bit C[1])

[Trace: among the 11 surviving candidates, D and N mark the 8 rows whose next bit is also 0; COUNT(N|V) = 8]

SLIDE 37

K-MINS: The Algorithm (final step)

[Trace: the final output vector marks the K minimum values of C]

O(1) complexity: the scan is one pass over the bit positions, independent of the number of values.

SLIDE 38

Similarity Search and Top-K for Recognition

[Diagram: a text/image database in which every image/sentence/document has a label; a feature extractor (a neural network convolution layer, or a word/sentence/document embedding) produces the stored features]

SLIDE 39

Dense (1×N) Vector by Sparse N×M Matrix

[Diagram: an input vector such as (4, -2, 3, -1) multiplied by a sparse matrix stored as (row, column, value) triples, producing the output vector]

APU representation and computing:
  • Search all columns for row = 2; distribute -2: 2 cycles
  • Search all columns for row = 3; distribute 3: 2 cycles
  • Search all columns for row = 4; distribute -1: 2 cycles
  • Multiply in parallel: 100 cycles
  • Shift and add all products belonging to the same column

Complexity including I/O: O(N + log β), where β is the number of nonzero elements in the sparse matrix. In general N << M for recommender systems.
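A NumPy sketch of the search-and-distribute flow (the triples below are illustrative, not the slide's exact matrix):

    import numpy as np

    x = np.array([4, -2, 3, -1], dtype=np.float32)  # dense 1xN input
    # Sparse matrix as (row, col, value) triples, one per bit line:
    row = np.array([0, 1, 1, 2, 2, 3, 3])
    col = np.array([0, 0, 2, 1, 3, 2, 3])
    val = np.array([3., 5., 9., 17., 6., 6., 15.], dtype=np.float32)

    # "Search all columns for row = r; distribute x[r]" for each r,
    # then multiply in parallel and sum the products per output column.
    prod = val * x[row]                      # parallel multiply
    y = np.zeros(int(col.max()) + 1, dtype=np.float32)
    np.add.at(y, col, prod)                  # shift-and-add per column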

SLIDE 40

Two N×N Sparse Matrix Multiplication

[Diagram: sparse input matrices In-DB1 and In-DB2 stored as (row, column, value) triples, producing the output matrix Out-DB]

  1. Choose the next free entry from In-DB1
  2. Read its row value
  3. Search and mark similar rows
  4. For all marked rows, search where Col(In-DB1) = Row(In-DB2)
  5. Broadcast the selected value to the output-table bit lines
  6. Multiply in parallel
  7. Shift and add all products belonging to the same column
  8. Update Out-DB
  9. Go back to step 1 if there are more free entries; otherwise exit

Complexity including I/O: O(β + log β), compared to O(β^0.7 N^1.2 + N^2) on a CPU (> 1000X improvement).

SLIDE 41

Softmax

  • Used in many neural network applications, especially attention networks
  • The softmax function takes an N-dimensional vector of scores and generates probabilities between 0 and 1, as defined by the function below
  • Each score is the dot product between a query vector and a feature vector (for example, the word embedding of an English vocabulary)

$T_j = \frac{f(a_j)}{\sum_{k=1}^{O} f(a_k)}$, where $f(a) = e^{a}$

SLIDE 42

The Difficulties in Softmax Computing

  1. Dot products for millions of vectors
  2. A non-linear function (exp)
  3. Dependency: every score depends on all others in the database
  4. Dynamic range: fast overflow, requiring high-precision calculations
  5. Speed and latency
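Point 4 is the familiar overflow problem; the standard software mitigation (a sketch, not GSI's proprietary method) subtracts the maximum score before exponentiating:

    import numpy as np

    def softmax(z):
        z = z - z.max()      # tame the dynamic range: exp never overflows
        e = np.exp(z)
        return e / e.sum()   # point 3: every output depends on all inputs

    print(softmax(np.array([1000.0, 1001.0, 1002.0])))  # no overflow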
SLIDE 43

Taylor Series: $f(y) = 1 + y + \frac{y^2}{2!} + \frac{y^3}{3!} + \cdots$

Very expensive: more than 20 coefficients and double precision are required for good accuracy.

SLIDE 44

1M Softmax Performance

  • A proprietary algorithm leverages the APU's lookup capability
  • Provides 1M high-accuracy, exact softmax values in ≤ 5 µs, vs 10-100 ms on a GPU
  • More than 3 orders of magnitude improvement
SLIDE 45

Associative Memory for Natural Language Processing (NLP)

  • Q&A, dialog, language translation, speech recognition, etc.
  • Requires learning from past events
  • Needs a large array with attention capabilities
SLIDE 46

Examples

  • Q&A: "Dan put the book in his car, … long story here … Mike took Dan's car … long story here …. He drove to SF"
    Q: Where is the book now?
    A: Car, SF
  • Language translation:
    "The cow ate the hay because it was delicious."
    "The cow ate the hay because it was hungry."
    Resolving what "it" refers to is attention computing.

Source: Łukasz Kaiser

SLIDE 47

Example of Associative Attention Computing

Input data (e.g., a sentence in English, for translation or Q&A) → Encoder (NN) → feature-vector embedding → stored as the sentence-feature representation (the Key).

SLIDE 48

Example of Associative Attention Computing

A query is encoded (NN) into a feature vector, and its dot product is taken against all stored keys at once: O(1). A softmax over the dot-product results: O(1). Top-K selection: O(1). The softmax weights (e.g., 0.1, 0.01, 0.03, …) applied to the selected values (V1 … V6) feed the next stage (encoder or decoder).
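A NumPy sketch of this flow (dimensions and names are mine; on the APU the dot product, softmax, and Top-K are each O(1), as claimed above):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    rng = np.random.default_rng(3)
    keys   = rng.standard_normal((6, 64))  # stored features (the Key)
    values = rng.standard_normal((6, 64))  # associated values V1..V6
    query  = rng.standard_normal(64)       # encoded query

    scores  = keys @ query                 # dot product against all keys
    weights = softmax(scores)              # attention weights
    top     = np.argsort(-weights)[:3]     # Top-K (K = 3)
    context = weights[top] @ values[top]   # weighted sum to next stage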

SLIDE 49

Q&A: End-to-End Network (Weston)

Source: Weston et al.

SLIDE 50

GSI Associative Solution for End-to-End

Constant time of 3 µs per iteration, for any memory size: a few orders of magnitude improvement.

Source: Weston et al.

SLIDE 51

Associative Computing for Low-Shot Learning

  • Gradient-based optimization has achieved impressive results on supervised tasks such as image classification
  • But these models need a lot of data, whereas people can learn efficiently from a few examples
  • Associative computing:
    • Like people, can measure similarity to features stored in memory
    • Can also create a new label for similar features in the future
SLIDE 52

Zero-Shot Learning with k-NN

  • Extract features using any pre-trained CNN, for example VGG/Inception trained on ImageNet
  • The new data set is embedded using the pre-trained model and stored in memory with its labels
  • Queries (test images) are input without labels, and their features are cosine-similarity searched to predict the label

[Pipeline: input images with labels → feature extractor (convolution layers) → embedded input features stored; an unlabeled query image → cosine similarity search + Top-K → the similar image's label]
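A sketch of the pipeline in Python (embed is a hypothetical stand-in for the pre-trained extractor; everything here is my illustration):

    import numpy as np

    def embed(image_id):
        # Hypothetical stand-in for a pre-trained VGG/Inception extractor.
        rng = np.random.default_rng(abs(hash(image_id)) % (2**32))
        return rng.standard_normal(200)

    gallery = {"cat_1": "cat", "cat_2": "cat", "dog_1": "dog"}
    keys = {name: embed(name) for name in gallery}  # stored with labels

    def predict(image_id, k=1):
        q = embed(image_id)
        sims = {n: (v @ q) / (np.linalg.norm(v) * np.linalg.norm(q))
                for n, v in keys.items()}           # cosine similarity
        top = sorted(sims, key=sims.get, reverse=True)[:k]
        votes = [gallery[n] for n in top]
        return max(set(votes), key=votes.count)     # majority label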

SLIDE 53

Dimension Reduction

  • The output of the convolution layer is large (~20,000 features in VGG, very sparse)
  • Reduce it with a simple matrix or a multi-layer non-linear transformation
  • Learned simply, with a loss function built from:
    • The cosine distance between any two records
    • The difference between that distance at the input and at the output
  • The transform thus learns to preserve the cosine distance

20,000 features → 200 (associative)
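A sketch of such a loss (my formulation of the bullets above; X holds the raw ~20,000-dim features, Z the reduced 200-dim outputs):

    import numpy as np

    def cosine(u, v):
        return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

    def distance_preserving_loss(X, Z):
        # Penalize any change in pairwise cosine distance through the
        # reduction, so the transform learns to preserve it.
        n, loss = len(X), 0.0
        for i in range(n):
            for j in range(i + 1, n):
                loss += (cosine(X[i], X[j]) - cosine(Z[i], Z[j])) ** 2
        return loss / (n * (n - 1) / 2)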

SLIDE 54

Low-Shot: Train the Network on Distance

  • Start with an untrained network
  • The network's output is already the reduced-dimension keys for the k-NN database
  • Train the network only to keep similar-valued keys close

[Diagram: network output feeding the k-NN database]

SLIDE 55

Cut Short

  • Stop training when the system starts to converge ("cut short")
  • Use similarity search instead of a fully connected layer
  • Requires less complete training

[Diagram: network output feeding the k-NN database (associative)]

SLIDE 56

PROGRAMMING MODEL

SLIDE 57

Programming Model

  1. Write the application on a standard host using the TensorFlow / Tensor2Tensor framework
  2. Generate a TensorFlow graph for execution in device memory
  3. The APU chip/card executes the graph using its fused capabilities

SLIDE 58

PCIe Development Boards

  • 4 APU chips
  • 8 million bit lines (processors)
  • 8 Peta Boolean OPS
  • 6.4 TFLOPS
  • 2 Petabit/s internal I/O
  • 16-64 GB device memory
  • TensorFlow framework (basic functions)
  • GNL (GSI Numeric Library)
SLIDE 59

FUTURE APPROACH – NON-VOLATILE CONCEPT

SLIDE 60

Computing in Non-Volatile Cells

[Diagram: non-volatile bit cells on a shared bit line, with a sense unit & write control and a selectable reference (REF)]

Read:
  • Select multiple lines for read (as NOR/NAND inputs)
  • Ref = V-read
  • The sense unit senses the bit line for logic "1" or "0"

Write:
  • Select one or multiple lines for write (the NOR/NAND results)
  • Ref = V-write
  • The write control generates logic "1" or "0" for the bit line

SLIDE 61

Solutions for Future Data Centers

[Diagram: the memory hierarchy (CPU register file, L1/L2/L3, DRAM, Flash, HDD) with associative memory alongside it]

Associative memory options:
  • Standard SRAM-based (volatile), high endurance: full computing (floating point, etc.), which requires both read and write
  • STT-RAM-based (non-volatile), mid endurance: machine learning, malware detection, etc., with much more read and much less write
  • PC-RAM- or ReRAM-based (non-volatile), low endurance: data search engines (reading most of the time)

SLIDE 62

Summary

The APU enables state-of-the-art, next-generation machine learning:

  • In-place computing, from basic Boolean algebra to complex algorithms
  • O(1) dot-product computation
  • O(1) min/max
  • O(1) Top-K
  • O(1) softmax
  • Ultra-high internal bandwidth: 0.5 Petabit/s
  • Up to 2 PetaOPS of Boolean algebra in a single chip
  • Fully scalable
  • Fully programmable
  • Efficient TensorFlow-based capabilities
SLIDE 63

Summary

Extending Moore's Law by leveraging the growth of advanced memory technology.

M.Sc./Ph.D. students who would like to collaborate on research: please contact me at aakerib@gsitechnology.com

SLIDE 64

Thank You! Any Questions?

APU Page – http://www.gsitechnology.com/node/123377