
  1. Dr. Avidan Akerib, VP Associative Computing BU June 2nd 2019


  3. CORPORATE SUMMARY: FOUNDED IN 1995. 1) PUBLIC COMPANY 2) Consistent profitability & zero debt; ~150 EMPLOYEES WORLDWIDE 3) Design/R&D in Sunnyvale, CA & Israel; operations in Taiwan. Leader in supplying high-performance memories to demanding industries such as aerospace, defense, and high-performance datacenters. Developed the APU, a massively parallel processor for big-data similarity search, based on computational-memory technology. Acquired MikaMonu and its Associative Computing IP in 2015.


  5. Can someone recommend a …? I recommend this, or that, or… maybe nothing…

  6. For doing that, machine learning is not enough. LET'S UNDERSTAND THE CONCEPT FIRST

  7. [Client/cloud diagram: the client takes speech, text, photo, sketch, video, or fingerprint input and converts it to a feature vector (AI translates the question into a meaningful fingerprint); it asks the server's similarity search engine, which holds the database in storage (DRAM), likewise converted to feature vectors (AI translates the DB into meaningful fingerprints), and returns the answer.]

  8. [Client/server diagram: on-line computing on the client, off-line computing in the cloud. The client converts input to a feature vector (embedding) and asks; the server's APU (Associative Computing Unit) runs the similarity search engine over in-memory storage (DRAM, < 1 TByte) alongside the CPU and returns the answer.]
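A minimal sketch of that flow in code (hypothetical throughout: the 64-bit binary fingerprints, the database size, and the brute-force Hamming scan are stand-ins for the APU's in-memory search, not its actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical database of binary "fingerprints" (feature vectors), one per row.
db = rng.integers(0, 2, size=(1000, 64), dtype=np.uint8)

def similarity_search(query, db, k=3):
    """Return the indices of the k fingerprints closest to the query
    (smallest Hamming distance)."""
    distances = np.count_nonzero(db != query, axis=1)
    return np.argsort(distances)[:k]

query = db[42].copy()
query[:3] ^= 1  # flip a few bits: the query is similar, not identical
print(similarity_search(query, db))  # row 42 ranks first
```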

  9. Computes in place, directly in the memory array, removing the Question→Answer I/O bottleneck. APU = simple CPU & narrow bus + associative processing.  Significantly increases performance  Reduces power consumption  Data compression (binary reduction)  Query parallelism for production systems

  10. The current state is that storage simply holds the data. What is needed is an intelligent cache that preprocesses tedious tasks for the main processor (CPU or GPGPU), and ultimately replaces the main processor with an associative processor. Calculation within the memory unit, the fundamental of associative computing, with lower latency and lower voltage, is making it an essential part of any datacenter architecture.

  11. CRITICAL COMPONENT ACROSS APPS  As large-scale similarity search becomes common  Similarity is in visual search, voice, and text apps  Across applications in all industries – consumer, bioinformatics, cyber, automotive. “The future of online product research: visuals and voice. The rise of voice searches fueled by technology like Google Home and Amazon’s Alexa has been well-documented. But visual searches are also on the rise. Products like Pinterest Lens use machine learning to aid in brand and product discovery.”

  12. WE'RE EXPERIENCING SIMILARITY AND VISUAL SEARCH. Netflix uses similarity search to figure out our taste in TV and retain us by offering personal content. Facebook tries to tailor our newsfeed to our interests. Spotify builds our playlists according to what we listen to. Pinterest lets us upload a picture and offers us similar products. Google tries to constantly improve its visual search to be more relevant in search results.


  14. [Memory-array diagram: an address decoder drives RE/WE word lines across stored rows (0101, 1000, 1100, 1001); sense amps / IO drivers connect the bit lines to an ALU.]

  15. [Bit-line example: several rows (RE 1101, RE 1001, RE 1100) are read simultaneously, with a write-back (WE); the shared bit line resolves to a NOR/NAND of the selected cells.] Bus contention is not an error!!! It's a simple NOR/NAND satisfying De Morgan's law.
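An idealized software model of that effect (the 4-bit rows and the helper function are illustrative, not GSI's implementation): driving several read-enable rows at once makes each bit line resolve to the NOR of the selected cells, which by De Morgan's law equals the NAND of their complements.

```python
def read_bitlines_nor(selected_rows, width=4):
    """Simultaneous read: each bit line resolves to the NOR of the
    selected cells in its column ("bus contention" doing useful work)."""
    result = 0
    for bit in range(width):
        column = [(row >> bit) & 1 for row in selected_rows]
        nor = 0 if any(column) else 1
        result |= nor << bit
    return result

# Reading rows 1101, 1001, 1100 at once: only bit position 1 holds a 0 in
# every row, so the NOR result is 0010.
print(format(read_bitlines_nor([0b1101, 0b1001, 0b1100]), "04b"))  # 0010
```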

  16. [Karnaugh map over inputs A, B, C for output D] • Every minterm takes one clock • All bit lines execute Karnaugh tables in parallel. !A!C + BC = !!(!A!C + BC) = !( !(!A!C) · !(BC) ) = NAND(NAND(!A, !C), NAND(B, C)). 1 clock: Read(!A, !C); Write T1; Read(B, C); Write T2. 1 clock: Read(T1, T2); Write D.
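The slide's NAND decomposition can be checked bitwise over whole arrays, each array position standing in for one bit line (the sample inputs below are arbitrary):

```python
import numpy as np

def nand(x, y):
    return 1 - (x & y)

# Arbitrary sample contents of three rows, one bit per bit line.
A = np.array([0, 0, 1, 1, 0, 1, 0, 1], dtype=np.uint8)
B = np.array([0, 1, 0, 1, 1, 0, 1, 0], dtype=np.uint8)
C = np.array([1, 0, 1, 0, 1, 1, 0, 0], dtype=np.uint8)

# One NAND per read/write pair, executed on all bit lines in parallel.
T1 = nand(1 - A, 1 - C)  # Read(!A, !C); Write T1
T2 = nand(B, C)          # Read(B, C);   Write T2
D = nand(T1, T2)         # Read(T1, T2); Write D

# D must equal the sum-of-products form !A!C + BC.
expected = ((1 - A) & (1 - C)) | (B & C)
assert np.array_equal(D, expected)
print(D)
```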

  17. A[ ] + B[ ] = C[ ]: vector A(8, 32M), vector B(8, 32M), vector C(9, 32M). C = A + B. No. of clocks = 4 × 8 = 32. Clocks/byte = 32/32M = 1/1M. OPS = 1 GHz × 1M = 1 PetaOPS. A single APU chip has 2M bit-line processors – 64 TOPS, or > 2 TOPS/Watt.
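The slide's throughput arithmetic, reproduced step by step (all numbers come from the slide itself):

```python
# C = A + B over 8-bit elements, one element per bit line.
bits_per_element = 8
clocks_per_bit = 4                    # the slide assumes 4 clocks per bit
elements = 32_000_000                 # 32M-element vectors
clock_hz = 1_000_000_000              # 1 GHz

clocks_total = clocks_per_bit * bits_per_element      # 4 * 8 = 32 clocks
clocks_per_byte = clocks_total / elements             # 32 / 32M = 1/1M
ops_per_second = clock_hz * elements // clocks_total  # 1 GHz * 1M
print(ops_per_second)  # 1000000000000000, i.e. 1 PetaOPS
```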

  18. Vector A, Vector B, C = f(A, B). Each bit line becomes a processor and storage; millions of bit lines = millions of processors.

  19. Shift vector: C = f(A, shift(B, 1)). Parallel shift of bit-line sections @ 1 cycle. Enables neighborhood operations such as convolutions.
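A sketch of why a one-cycle shift is enough for neighborhood operations (the 3-tap kernel and sample vector are illustrative; np.roll wraps at the edges, so only interior elements match a zero-padded convolution):

```python
import numpy as np

# One element per bit line; shifting B lets every "processor" see a neighbor.
B = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([0.25, 0.5, 0.25])

# A 3-tap convolution built from shifted, scaled copies of B:
result = kernel[0] * np.roll(B, 1) + kernel[1] * B + kernel[2] * np.roll(B, -1)
print(result)  # interior elements: 2.0, 3.0, 4.0
```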

  20. [Associative memory: the DB feature vectors and the query are held in memory; in-memory compute produces one score per record (22, 47, −5, −5, 2, 45, −15, 59, 36, −22), followed by cosine distance and TOP K=3.] > 100,000 queries/sec, any K size, 128K records, single chip @ 10 Watts.
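The score-then-rank flow can be sketched as follows (a toy stand-in: 1,000 random 64-dimensional records instead of 128K, with cosine similarity reduced to a dot product by pre-normalizing the vectors):

```python
import numpy as np

rng = np.random.default_rng(1)
db = rng.standard_normal((1000, 64))
db /= np.linalg.norm(db, axis=1, keepdims=True)  # normalize offline

query = db[7] + 0.01 * rng.standard_normal(64)   # a query near record 7
query /= np.linalg.norm(query)

scores = db @ query                   # one pass scores every record at once
top_k = np.argsort(scores)[::-1][:3]  # TOP K=3
print(top_k)  # record 7 is the best match
```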

  21. [Chart: relative gains per workload on a 1 / 10X / 100X / 1000X scale – in-memory BW, SoftMax / non-linear, SP floating point, Top-K / search, low-precision/binary OPS, scalability.]

  22. CPU/GPGPU (current solution) vs. APU (in-place computing): the CPU/GPGPU sends an address to memory; the APU searches by content. The CPU/GPGPU fetches the data from memory and sends it to the processor; the APU marks matches in place. The CPU/GPGPU computes serially per core (thousands of cores at most); the APU computes in place on millions of processors (the memory itself becomes millions of processors). The CPU/GPGPU writes the data back to memory, further wasting IO resources; the APU has no need to write data back, since the result is already in the memory. The CPU/GPGPU sends data to each location that needs it; the APU, if needed, distributes or broadcasts at once.
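The contrast in the comparison above, reduced to a toy model (illustrative only):

```python
data = [5, 9, 7, 9, 2]

# CPU/GPGPU model: send an address, fetch the value, compute on it serially.
value = data[3]                       # address 3 -> value 9

# APU model: search by content; every cell compares against the query at
# once, and matches are simply marked in place.
query = 9
marks = [x == query for x in data]    # [False, True, False, True, False]
print(value, marks)
```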


  24. [Chip diagram: DRAM organized as blocks of bit-line processors – 2M bit-line processors (128K vector processors) in total – with ARC serial processor cores 0–3, ARM and PCIe peripherals; the FPGA runs at 1 GHz including peripherals; from 2 TeraFlops to 2 PetaOps.]

  25. Multi-functional programmable blocks; blocks for acceleration of FP operations.

  26. [Board diagram: four APUs, each paired with 16GB DDR4, FPGA/ARM, peripherals, and a PCIe connection.]


  28. Simple example: N = 36, 3 groups, 2 dimensions (D = 2) for X and Y, K = 4; group green is selected as the majority. For actual applications: N = billions, D = tens, K = tens of thousands.
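A toy version of the example (the cluster centers and noise level are made-up parameters):

```python
import numpy as np
from collections import Counter

# N = 36 points in D = 2 dimensions, 3 labeled groups of 12; classify a
# query by majority vote among its K = 4 nearest neighbors.
rng = np.random.default_rng(2)
centers = {"red": (0.0, 0.0), "green": (5.0, 5.0), "blue": (0.0, 5.0)}
points, labels = [], []
for label, (cx, cy) in centers.items():
    points.append(rng.standard_normal((12, 2)) * 0.5 + (cx, cy))
    labels += [label] * 12
points = np.vstack(points)

def knn_classify(query, k=4):
    dist = np.linalg.norm(points - query, axis=1)  # all N distances at once
    nearest = np.argsort(dist)[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

print(knn_classify(np.array([4.8, 5.2])))  # lands in the green cluster
```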

  29. [Item features and label storage: items 1…N, each with its features.] Distribute data – 2 ns (to all). Compute cosine distances for all N in parallel (≤ 10 μs, assuming D = 50 features). K mins at O(1) complexity (≤ 3 μs); in-place ranking; majority calculation. With the database in an APU, computation for all N items is done in ≤ 0.05 ms, independent of K (1000X improvement over current solutions).
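The latency budget on the slide adds up consistently (numbers taken from the slide):

```python
distribute = 2e-9    # 2 ns to broadcast the query to all items
cosine_all = 10e-6   # <= 10 us: cosine distances for all N in parallel (D = 50)
top_k_mins = 3e-6    # <= 3 us: K minimums at O(1) complexity
total = distribute + cosine_all + top_k_mins
print(total)         # ~13 us, well inside the slide's <= 0.05 ms claim
```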


  31. Deep learning can classify up to thousands of clusters, but what if we have millions or billions? [Feature-hierarchy figure: pixel values → patterns of local contrast/edges → combinations of edges → features → combinations of features → class scores such as 1% cat, 95% human face, 1% dog, 3% mask.]

  32. Updating with unlabeled images (horses? dogs? cars?) requires new training, which costs latency, power, and performance. DEEP LEARNING IS NOT ENOUGH

  33. Gradient-based optimization has achieved impressive results on supervised tasks such as image classification, but these models need a lot of data, whereas people can learn efficiently from few examples. ASSOCIATIVE COMPUTING: like people, it can measure similarity to features stored in memory, and it can also create a new label for similar features in the future. Visual search, face recognition, and NLP are some of the use cases shown on the next slides.

  34. ⮚ Need to identify people rather than object categories (e.g. Ross, Dianna, Hannah, Bob, Kelly, Brittney, Jenifer, Mary, Tom, Sharon, John, Michael, Jerry, Rachel, Guy, Laura, Nathan, Chris) ⮚ Thousands to millions of different identities! ⮚ Classes may frequently change (avoid retraining for every added identity) ⮚ Identification should occur from as little as one previously seen image – a one/low-shot learning problem. Based on a pre-trained network; min(dist) over stored features; ideal for similarity search.

  35. Identification/retrieval pipeline: Face Detection (MTCNN) → Face Alignment → Deep Feature Extraction (128 float32) → Hashing (LSH, n-bit binary vector) → Similarity Search (Hamming distance + TopK) on the APU. Database faces pass through the same procedure (offline) and are stored in the APU.
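A sketch of the hashing and retrieval stages under stated assumptions: random-hyperplane LSH (one common LSH family; the slide does not say which variant is used), 256-bit codes, and a brute-force Hamming scan standing in for the APU:

```python
import numpy as np

rng = np.random.default_rng(3)
n_bits = 256
planes = rng.standard_normal((n_bits, 128))  # fixed random hyperplanes

def lsh_hash(vec):
    """One bit per hyperplane: the sign of the projection."""
    return (planes @ vec > 0).astype(np.uint8)

# Offline: hash the database descriptors once and store the binary codes.
db_vecs = rng.standard_normal((500, 128))
db_codes = np.array([lsh_hash(v) for v in db_vecs])

# Online: hash the query, then Hamming distance + TopK.
query = db_vecs[123] + 0.05 * rng.standard_normal(128)
q_code = lsh_hash(query)
hamming = np.count_nonzero(db_codes != q_code, axis=1)
top_k = np.argsort(hamming)[:5]
print(top_k)  # descriptor 123 ranks first
```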

  36. Identification/retrieval, generalized: Feature Extractor (pre-trained network, 128 float32) → Hashing (LSH, n-bit binary vector) → Similarity Search (distance measure + TopK) on the APU. Database items pass through the same procedure (offline) and are stored in the APU.

  37. Database: • 13247 images of 5755 identities • Between 1 and 500 images per person. Query images go through face feature extraction and then similarity search.
