

slide-1
SLIDE 1
  • Dr. Avidan Akerib, VP Associative Computing BU

June 2nd 2019

slide-2
SLIDE 2
slide-3
SLIDE 3

CORPORATE SUMMARY

1 – FOUNDED IN 1995. HIGH PERFORMANCE: leader in supplying high-performance memories to demanding industries such as aerospace, defense, and high-performance datacenters. Acquired MikaMonu and its Associative Computing IP in 2015.

2 – PUBLIC COMPANY. Consistent profitability & zero debt.

3 – ~150 EMPLOYEES WORLDWIDE. Design/R&D in Sunnyvale, CA & Israel; operations in Taiwan.

APU: developed the APU, a massively parallel processor for big-data similarity search, based on Computational Memory technology.

slide-4
SLIDE 4
slide-5
SLIDE 5

Can someone recommend a… I recommend this, or that, or… maybe nothing…

slide-6
SLIDE 6

To do that, machine learning alone is not enough.

LET'S UNDERSTAND THE CONCEPT FIRST

slide-7
SLIDE 7

[Diagram: client-to-cloud similarity search flow. A client asks a question in any modality (speech, text, photo, sketch, video); AI converts the question to a feature vector, a meaningful "fingerprint". The database contents are converted to fingerprints the same way. A similarity search engine, backed by DRAM storage on a cloud server, compares fingerprints and returns the answer.]

slide-8
SLIDE 8

[Diagram: similarity search engine architecture. OFF-LINE COMPUTING: an embedding step converts the database to feature vectors held in in-memory storage (< 1 TByte DRAM) attached to a CPU plus APU (Associative Computing Unit). ON-LINE COMPUTING: the client asks, the cloud server searches the fingerprints in memory, and the answer is returned.]
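To make the offline/online split concrete, here is a minimal NumPy sketch (not GSI code; `embed` is a placeholder for whatever model produces the fingerprints):

```python
import numpy as np

# Offline: embed the database once and keep the fingerprints in memory.
# embed() stands in for any model that maps items to feature vectors.
def embed(items):
    rng = np.random.default_rng(0)          # placeholder embedding for the sketch
    return rng.standard_normal((len(items), 128)).astype(np.float32)

db_items = [f"item-{i}" for i in range(100_000)]
db = embed(db_items)
db /= np.linalg.norm(db, axis=1, keepdims=True)   # normalize so dot product = cosine

# Online: embed the query and rank the whole database by cosine similarity.
def search(query_vec, k=10):
    q = query_vec / np.linalg.norm(query_vec)
    scores = db @ q                          # one similarity score per database row
    topk = np.argpartition(-scores, k)[:k]   # unordered top-k, then sort
    return topk[np.argsort(-scores[topk])]

print(search(embed(["query"])[0]))
```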

slide-9
SLIDE 9

Computes in place, directly in the memory array, removing the I/O bottleneck:

  • Significantly increases performance
  • Reduces power consumption
  • Data compression (binary reduction)
  • Query parallelism for production systems

[Diagram: CPU connected to the APU (associative processing) over a simple & narrow bus; questions go in, answers come out.]

slide-10
SLIDE 10

Associative Computing fundamentals

Today, storage simply holds the data. What is needed is an intelligent cache that takes over tedious preprocessing from the main processor (CPU or GPGPU), and beyond that an associative processor that replaces the main processor for these workloads. Computing inside the memory unit, with lower latency and lower voltage, makes it an essential part of any datacenter architecture.

slide-11
SLIDE 11

CRITICAL COMPONENT ACROSS APPS

  • Large-scale similarity search is becoming common
  • Similarity underlies visual search, voice, and text apps
  • It cuts across applications in all industries: consumer, bioinformatics, cyber, automotive

"The future of online product research: visuals and voice. The rise of voice searches fueled by technology like Google Home and Amazon's Alexa has been well-documented. But visual searches are also on the rise. Products like Pinterest Lens use machine learning to aid in brand and product discovery."

slide-12
SLIDE 12

WE'RE EXPERIENCING SIMILARITY AND VISUAL SEARCH

Netflix – uses similarity search to figure out our taste in TV and retain us by offering personal content.

Facebook – tries to tailor our newsfeed to our interests.

Spotify – builds our playlists according to what we listen to.

Pinterest – lets us upload a picture and offers us similar products.

Google – tries to constantly improve its visual search to be more relevant in search results.

slide-13
SLIDE 13
slide-14
SLIDE 14

[Diagram: a conventional memory array. An address decoder selects one row (e.g., 0101); stored words (e.g., 1000, 1100, 1001) are read out through sense amps / IO drivers to an external ALU, under read/write enable (RE/WE) control.]

slide-15
SLIDE 15

[Diagram: two rows of the memory array (storing words such as 1101, 1001, 1100, 1001) are read-enabled (RE) at once while a third row is write-enabled (WE); the shared bit lines resolve to the NOR/NAND of the selected rows (the illustration shows results 0011 and 1110).]

Bus contention is not an error!!! It's a simple NOR/NAND satisfying De Morgan's law.

slide-16
SLIDE 16

Truth table (D = !A·!C + B·C):

  A B C | D
  0 0 0 | 1
  0 0 1 | 0
  0 1 0 | 1
  0 1 1 | 1
  1 0 0 | 0
  1 0 1 | 0
  1 1 0 | 0
  1 1 1 | 1

Karnaugh map (columns AB = 00, 01, 11, 10; rows C = 0, 1): the four 1s group into the two terms !A·!C and B·C.

!A!C + BC = !!( !A!C + BC ) = !( !(!A!C) · !(BC) ) = NAND( NAND(!A,!C), NAND(B,C) )

Read (B,C);   WRITE T2 – 1 clock
Read (!A,!C); WRITE T1 – 1 clock
Read (T1,T2); WRITE D  – 1 clock

  • Every minterm takes one clock
  • All bit lines execute Karnaugh tables in parallel
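A quick way to see the three-clock evaluation: simulate each bit line as one lane of a NumPy array, with NAND as the only primitive (an illustration of the scheme, not GSI microcode):

```python
import numpy as np

# Each array element stands for one bit-line processor; all bit lines
# compute the same boolean function on their own data simultaneously.
rng = np.random.default_rng(1)
A = rng.integers(0, 2, 1_000_000, dtype=np.uint8)
B = rng.integers(0, 2, 1_000_000, dtype=np.uint8)
C = rng.integers(0, 2, 1_000_000, dtype=np.uint8)

nand = lambda x, y: 1 - (x & y)   # the primitive a simultaneous row-read provides

T1 = nand(1 - A, 1 - C)           # clock 1: T1 = NAND(!A, !C)
T2 = nand(B, C)                   # clock 2: T2 = NAND(B, C)
D  = nand(T1, T2)                 # clock 3: D = !A!C + BC, by De Morgan

assert np.array_equal(D, ((1 - A) & (1 - C)) | (B & C))   # matches the K-map
```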

slide-17
SLIDE 17

A[ ] + B[ ] = C[ ]

vector A(8 bits, 32M elements); vector B(8, 32M); vector C(9, 32M); C = A + B

  • Number of clocks = 4 per bit × 8 bits = 32
  • Clocks per byte-add = 32 / 32M = 1/1M, so OPS = 1 GHz × 1M = 1 PetaOPS

A single APU chip has 2M bit-line processors – 64 TOPS, or >> 2 TOPS/Watt.
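Spelled out, the slide's throughput arithmetic (a restatement of the numbers above, nothing new):

```python
clocks_per_add = 4 * 8           # 4 clocks per bit position, 8-bit operands
elements = 32_000_000            # one element per bit-line processor
clock_hz = 1_000_000_000         # 1 GHz

ops_per_second = elements // clocks_per_add * clock_hz
print(f"{ops_per_second:.1e} adds/sec")   # 1.0e+15 -> 1 PetaOPS
```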
slide-18
SLIDE 18

Vector A, Vector B

Each bit line becomes both a processor and storage. Millions of bit lines = millions of processors.

C = f(A, B)

slide-19
SLIDE 19

Parallel shift of bit-line sections @ 1 cycle enables neighborhood operations such as convolutions.

C = f(A, shift(B, 1))
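How a shift turns elementwise operations into a neighborhood operation, in a minimal NumPy sketch (the kernel weights are arbitrary stand-ins):

```python
import numpy as np

# A 1D convolution with kernel [w0, w1] built from elementwise ops plus a
# one-position shift -- the same pattern the bit-line shift enables in hardware.
def shift(v, n=1):
    return np.roll(v, n)   # neighbor's value arrives on each "bit line"
                           # (np.roll wraps at the ends; a hardware shift would not)

B = np.arange(8, dtype=np.float32)
w0, w1 = 0.5, 0.25

C = w0 * B + w1 * shift(B, 1)   # every position computed in parallel
print(C)
```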

slide-20
SLIDE 20

[Diagram: a query vector is broadcast against the associative-memory database; each record's cosine distance to the query is computed in-memory, and the top K = 3 nearest records are selected.]

In-memory compute of cosine distance, top K = 3.

> 100,000 queries/sec, any K size, 128K records, single chip @ 10 Watts

slide-21
SLIDE 21

[Chart: APU performance relative to current solutions (log scale, 1X to 1000X) by workload: low-precision/binary ops, in-memory bandwidth, SP floating point, SoftMax/non-linear, top-K search, and scalability.]

slide-22
SLIDE 22

CPU/GPGPU vs APU

CPU/GPGPU (current solution)                                 | APU (in-place computing)
-------------------------------------------------------------|-------------------------------------------------------------
Send an address to memory                                    | Search by content
Fetch the data from memory and send it to the processor      | Mark in place
Compute serially per core (thousands of cores at most)       | Compute in place on millions of processors (the memory itself becomes millions of processors)
Write the data back to memory, further wasting IO resources  | No need to write data back; the result is already in the memory
Send data to each location that needs it                     | If needed, distribute or broadcast at once

slide-23
SLIDE 23

slide-24
SLIDE 24

[Diagram: APU chip organization. Four cores (CORE 0–3), each pairing an ARC serial processor with banks of bit-line processors; the chip connects to DRAM and peripherals via PCIe through an FPGA that includes an ARM core.]

2M bit processors (or 128K vector processors) running at 1 GHz – from 2 TeraFLOPS to 2 PetaOPS.
slide-25
SLIDE 25

Multi-functional, programmable blocks; acceleration of FP operation blocks.
slide-26
SLIDE 26
slide-27
SLIDE 27

[Diagram: a server with four APU boards. Each board pairs an APU with an FPGA/ARM, 16GB DDR4, and PCIe peripherals.]

slide-28
SLIDE 28
slide-29
SLIDE 29

Simple example: N = 36 points in 3 groups, 2 dimensions (D = 2, X and Y), K = 4; group Green is selected as the majority. For actual applications: N = billions, D = tens, K = tens of thousands.

slide-30
SLIDE 30

[Diagram: KNN on the APU. Item features and labels are stored along the bit lines; the query Q is broadcast and processed in the computing area.]

  • Distribute data – 2 ns (to all)
  • Compute cosine distances for all N in parallel – ≤ 10 μs, assuming D = 50 features
  • K mins at O(1) complexity – ≤ 3 μs
  • In-place ranking and majority calculation

With the database in an APU, computation for all N items is done in ≤ 0.05 ms, independent of K (a 1000X improvement over current solutions).
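The whole flow in a few lines of NumPy, for reference (a sequential sketch of what the APU does in parallel; the data is random):

```python
import numpy as np
from collections import Counter

# K-nearest-neighbor classification by cosine distance: all N distances,
# then the K minima, then the majority vote over their labels.
def knn_predict(db, labels, q, k=4):
    db_n = db / np.linalg.norm(db, axis=1, keepdims=True)
    q_n = q / np.linalg.norm(q)
    dist = 1.0 - db_n @ q_n                    # cosine distance to every item
    nearest = np.argpartition(dist, k)[:k]     # K minima (order not needed)
    return Counter(labels[nearest]).most_common(1)[0][0]   # majority label

rng = np.random.default_rng(2)
db = rng.standard_normal((36, 2)).astype(np.float32)   # N=36, D=2 as in the slide
labels = np.array(["red", "green", "blue"])[rng.integers(0, 3, 36)]
print(knn_predict(db, labels, rng.standard_normal(2).astype(np.float32)))
```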

slide-31
SLIDE 31
slide-32
SLIDE 32

[Diagram: deep-learning feature hierarchy. Pixel values → patterns of local contrast / edges → combinations of edges → features → combinations of features → output scores: 95% human face, 1% cat, 3% mask, 1% dog.]

Deep learning can classify up to thousands of clusters, but what if we have millions or billions?

slide-33
SLIDE 33

DEEP LEARNING IS NOT ENOUGH

[Diagram: labeled classes (cars, dogs, horses) alongside unlabeled images marked "?".] Updating with unlabeled images requires new training – which costs latency, power, and performance.

slide-34
SLIDE 34

Gradient-Based Optimization has achieved impressive results on supervised tasks such as image classification

These models need a lot of data

People can learn efficiently from few examples

ASSOCIATIVE COMPUTING

Like people, it can measure similarity to features stored in memory, and it can create a new label for similar features in the future. Visual search, face recognition, and NLP are some of the use cases shown on the next slides.

slide-35
SLIDE 35

[Diagram: the same feature hierarchy (pixel values → patterns of local contrast / edges → combinations of edges → features), but instead of category scores (95% human face, 1% cat, 3% mask, 1% dog) the features are matched by min(dist) against stored identities: Sharon, Tom, Mary, John, Chris, Michael, Jerry, Guy, Laura, Nathan, Rachel, Jenifer, Brittney, Dianna, Hannah, Bob, Kelly, Ross.]

⮚ Need to identify people rather than object categories

  ! Thousands to millions of different identities
  ! Classes may frequently change (avoid retraining for every added identity)
  ! Identification should occur from as little as one previously seen image – a one/low-shot learning problem
  ! Based on a pre-trained network – ideal for similarity search

slide-36
SLIDE 36

Pipeline: query image → face detection (MTCNN) → face alignment → deep feature extraction (128 float32) → hashing (LSH) → n-bit binary vector → similarity search on the APU (Hamming distance + top-K) → identification/retrieval.

Database faces pass through the same procedure offline and are stored in the APU.
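A minimal sketch of the hashing and search stages; random-hyperplane LSH is an assumption for the hashing step, and random vectors stand in for the face features:

```python
import numpy as np

rng = np.random.default_rng(3)
D, NBITS, N = 128, 256, 10_000

# LSH: random hyperplanes turn the 128-float feature vector into a binary code.
planes = rng.standard_normal((NBITS, D)).astype(np.float32)

def lsh(x):
    return (planes @ x > 0).astype(np.uint8)        # one sign bit per hyperplane

db_feats = rng.standard_normal((N, D)).astype(np.float32)
db_codes = (db_feats @ planes.T > 0).astype(np.uint8)   # hashed offline, stored in the APU

def top_k(query_feat, k=5):
    q = lsh(query_feat)
    hamming = np.count_nonzero(db_codes != q, axis=1)   # Hamming distance to every record
    idx = np.argpartition(hamming, k)[:k]               # unordered top-k, then sort
    return idx[np.argsort(hamming[idx])]

print(top_k(db_feats[42]))   # record 42 comes back first, at distance 0
```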

slide-37
SLIDE 37

The same pipeline generalizes beyond faces: feature extractor (pre-trained network, 128 float32) → hashing (LSH) → n-bit binary vector → similarity search on the APU (distance measure + top-K) → identification/retrieval. The database passes through the same procedure offline and is stored in the APU.

slide-38
SLIDE 38

Query images → face feature extraction → similarity search.

Database:

  • 13247 images of 5755 identities
  • Between 1 and 500 images per person
slide-39
SLIDE 39

[Diagram: the same similarity search engine architecture as slide 8 – offline embedding of the database into feature vectors held in in-memory storage (< 1 TByte DRAM) on a CPU plus APU; online, a query arriving at the cloud server is matched and the answer returned.]

slide-40
SLIDE 40

[Diagram: the four-core APU layout from slide 24, with DRAM storing the fingerprint database.]

Example: 64M records = 64K centroids × 1000 records each.

  1. Preload the 64K centroids into cores 0 and 1 (each centroid connected to 1K records).
  2. Query → output the indices of the top K centroids.
  3. Load the K × 1000 candidate records from L4 into L1 of cores 2 and 3.
  4. Output the final top K over the whole DB.

Up to 100,000 queries/sec.
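This two-stage scheme is essentially inverted-file (IVF) search; a scaled-down sketch (NumPy stands in for the cores, and the cluster assignments are synthetic):

```python
import numpy as np

rng = np.random.default_rng(4)
N_CENTROIDS, PER_CLUSTER, D = 64, 100, 32   # scaled-down stand-ins for 64K x 1000

centroids = rng.standard_normal((N_CENTROIDS, D)).astype(np.float32)
# records[c] holds the (ids, vectors) assigned to centroid c, built offline.
records = {
    c: (np.arange(c * PER_CLUSTER, (c + 1) * PER_CLUSTER),
        centroids[c] + 0.1 * rng.standard_normal((PER_CLUSTER, D)).astype(np.float32))
    for c in range(N_CENTROIDS)
}

def search(q, k_centroids=4, k=10):
    # Stage 1 (cores 0-1): rank centroids against the query.
    top_c = np.argsort(((centroids - q) ** 2).sum(axis=1))[:k_centroids]
    # Stage 2 (cores 2-3): rank only the candidate records under those centroids.
    ids = np.concatenate([records[c][0] for c in top_c])
    vecs = np.concatenate([records[c][1] for c in top_c])
    d = ((vecs - q) ** 2).sum(axis=1)
    return ids[np.argsort(d)[:k]]

print(search(centroids[7]))   # hits should come from cluster 7's records
```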

slide-41
SLIDE 41

HW: 1 server with 4 APU boards (one APU 1.1 ASIC per board).

Database: 256 million images, i.e. 256M binary vectors with nBit = 512 → total 16GB (256M × 64 bytes).

Pre-search preparation: DB clustering into 256K clusters × 1000 records each. Clusters (centroid) size: 16MB; total records size: 16GB. Two APUs will be used for top-K clusters and two APUs for top-K records.

slide-42
SLIDE 42
slide-43
SLIDE 43

[Diagram: software stack. APU config tool, APU search applications, and 3rd-party applications (FAISS, TensorFlow) sit on top of System Management APIs (C++, Python); beneath them run the resource management, search, and numerical services; each of the four APU boards is driven through a system management / execution layer and the GNL driver.]
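For orientation, here is stock FAISS usage of the kind that would sit at the top of this stack (this is the standard FAISS API; how GSI's backend hooks in underneath is not shown on the slide):

```python
import numpy as np
import faiss   # the 3rd-party similarity-search library named on the slide

d = 128
xb = np.random.rand(100_000, d).astype(np.float32)   # database vectors
xq = np.random.rand(5, d).astype(np.float32)         # query vectors

index = faiss.IndexFlatL2(d)            # exact L2 search over the whole DB
index.add(xb)                           # "embed offline, store in memory"
distances, ids = index.search(xq, 10)   # top-10 neighbors per query
print(ids[0])
```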

slide-44
SLIDE 44

Comprehensive list of numerical function algorithms supported:

  • Wide range of algorithms
  • Multiple clustering techniques
  • Range of interfaces supported

Stack: in-memory DB → GSI API → GSI Numeric Library → GSI APU driver → GSI APU.

slide-45
SLIDE 45
slide-46
SLIDE 46

DB size for the pilot: 38M compounds.

Vector size:

  • 512 bits – search time 12 sec instead of 6 minutes
  • 1024 bits – search time 24 sec instead of a practically endless time
  • Performance measured on the GSI prototype chip

For the commercial chip, search time is 0.4 sec per 100 queries at 512 bits, or 0.8 sec per 100 queries at 1K bits.

The solution is scalable to any DB size, any fingerprint size, and any type of search algorithm.

Search:

  • Algorithm: Tanimoto
  • Supports threshold search
  • K-nearest neighbors (KNN), K = 1, 10, 100, 1000
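Tanimoto similarity over bit-vector fingerprints is just AND/OR popcounts; a small sketch with random stand-in fingerprints (not real compound data):

```python
import numpy as np

rng = np.random.default_rng(5)
db = rng.integers(0, 2, (38_000, 512), dtype=np.uint8)   # scaled-down stand-in for 38M compounds
q = db[123]                                              # query fingerprint

# Tanimoto(a, b) = |a AND b| / |a OR b|, computed against every record at once.
inter = (db & q).sum(axis=1)
union = (db | q).sum(axis=1)
tanimoto = inter / union

print(np.argsort(-tanimoto)[:10])            # KNN, K=10: compound 123 ranks first
print(np.nonzero(tanimoto >= 0.8)[0][:10])   # threshold search at 0.8
```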

slide-47
SLIDE 47

Visual Search · Facial Recognition · Molecule Search · Documents

slide-48
SLIDE 48

Recommendations · Video Search

slide-49
SLIDE 49

QUESTIONS?