SLIDE 1 The Architecture of the DIVA Processing-In-Memory Chip
Jeff Draper, Jacqueline Chame, Mary Hall, Craig Steele, Tim Barrett, Jeff LaCoss, John Granacki, Jaewook Shin, Chun Chen, Chang Woo Kang, Ihn Kim, Gokhan Daglikoca
USC Information Sciences Institute
ICS’02, June 24, 2002
SLIDE 2
Outline
Overview of Project Goals and System Architecture
PIM Chip Architecture
Applications and Simulation Results
Prototype Chip Implementation
Conclusion
SLIDE 3
Increasing Bandwidth
The Problem
[Figure: host processor and memory]
Solutions
[Figure: row of memory (M) and processor (P) pairs]
Processor-memory pairs with wide datapaths
Multiple nodes per chip
Memory-to-memory interconnect
SLIDE 4 System Architecture
[Figure: host processor (PowerPC) attached through the processor memory bus to an array of PIMs, which are linked to one another by a PIM-to-PIM interconnect]
PIMs as “smart-memory” co-processors
SLIDE 5
DIVA Key Ideas
First smart-memory PIM device that is
– Capable of executing independent threads of control
– Designed to support in-memory virtual addressing
Target Applications
– Image processing and multimedia (streaming)
– Irregular memory accesses (sparse-matrix and pointer-based)
Evolutionary application development
– PIMs also support standard memory accesses
– System supports familiar parallel programming paradigms
SLIDE 6
PIM Chip Organization
[Figure: PIM chip block diagram. Labels: nodes (each with memory and processing logic), PBUF, memory port, PIM memory bus, host interface (to host system memory bus), PIM routing component (parcel interconnect, to neighboring PIM)]
SLIDE 7
Host Memory Interface
Even with an extra arbitration cycle, DIVA PIMs satisfy SDRAM timing by:
– Using high-bandwidth embedded memory macros
– Running arbitration logic at 4X clock speed
– Exploiting long latency allowed by SDRAM standard
[Figure: SDRAM timing diagram showing CLK, RAS, CAS, ADDR (ROW, COL), and DATA (D0, D1, D2, D3)]
SLIDE 8 Node Architecture
[Figure: node block diagram. Labels: memory; memory control & arbiter (serving node memory requests, host memory requests, and ICache memory requests); host memory port; parcel buffer (“PBUF”) with header, data, and CTL; ICache; 5-stage DLX-like instruction pipeline; scalar datapath with scalar registers; WideWord datapath with WideWord registers; node data bus; bus widths marked 32b and 256b]
SLIDE 9
WideWord Unit
256-bit datapath using a 32x256 register file
WideWord operand treated as a packed array of 8-, 16-, or 32-bit objects
– Object size specified by instruction
Features
– Transfers to/from scalar register file
– Data rearrangement instructions
– Selective execution
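The packed-array view of a WideWord operand can be sketched in Python as follows. This is an illustrative model only, not the DIVA datapath: the function names, the choice of lane 0 as the least-significant subfield, and the wraparound behavior on overflow are all assumptions made for the sketch.

```python
WIDEWORD_BITS = 256  # datapath width stated on the slide

def unpack(word, size):
    """Split a 256-bit integer into subfields of `size` bits.
    Assumption: lane 0 is the least-significant subfield."""
    assert size in (8, 16, 32)
    mask = (1 << size) - 1
    return [(word >> (i * size)) & mask for i in range(WIDEWORD_BITS // size)]

def pack(lanes, size):
    """Inverse of unpack; each lane is truncated to `size` bits."""
    mask = (1 << size) - 1
    word = 0
    for i, value in enumerate(lanes):
        word |= (value & mask) << (i * size)
    return word

def packed_add(a, b, size):
    """Lane-wise add. Carries never cross a subfield boundary.
    Assumption: overflow wraps around within each subfield."""
    sums = [x + y for x, y in zip(unpack(a, size), unpack(b, size))]
    return pack(sums, size)
```

The same 256 bits give different results depending on the object size carried by the instruction: adding a word of all 0xFF bytes to a word of all 0x01 bytes yields zero when treated as 8-bit objects, but 0x0100 in every lane when treated as 16-bit objects.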
SLIDE 10
WideWord Permutation Capability
Permutation instructions rearrange subfields
– Rearrangement pattern specified by a permutation vector
Two “flavors” of permutation instructions:
– General-purpose
  – Construction of permutation vector in a WideWord register
– Hardwired
  – Permutation vector found in a lookup table of common patterns, e.g., shifts, rotates, shuffles, reductions
  – Scalar register value serves as index into the lookup table
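The two flavors can be sketched as below, assuming a permutation vector holds, for each destination subfield, the index of the source subfield to copy. The table contents, the index values, and the function names are hypothetical; the slide does not list the actual hardwired patterns or their encodings.

```python
def permute(src, perm):
    """General-purpose flavor: dest[i] = src[perm[i]], with the
    permutation vector supplied by the caller (on DIVA it would be
    built in a WideWord register)."""
    return [src[p] for p in perm]

def rotate_pattern(n, k):
    """Permutation vector that rotates n subfields by k positions."""
    return [(i + k) % n for i in range(n)]

# Hypothetical lookup table for 8 subfields; the real table contents
# and index assignments are not given in the source.
PATTERNS = {
    0: list(range(8)),            # identity
    1: rotate_pattern(8, 1),      # rotate by one subfield
    2: [4, 5, 6, 7, 0, 1, 2, 3],  # swap halves
    3: [7, 6, 5, 4, 3, 2, 1, 0],  # reverse
}

def permute_hardwired(src, index):
    """Hardwired flavor: a scalar value selects a prebuilt pattern."""
    return permute(src, PATTERNS[index])
```

The hardwired flavor trades generality for setup cost: a common pattern needs only a scalar index, while an arbitrary pattern requires constructing the whole vector first.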
SLIDE 11
WideWord Selective Execution
Only certain subfields of a result are committed during writeback
Subfields participating are determined by a combination of:
– Condition codes
  – EQ, GT, LT, OV
– User-settable mask register
  – Useful for a priori subfield specification
– Bits in instruction
  – Specify whether and what type of selective execution is to be used for that instruction
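A minimal sketch of the writeback rule, assuming one enable bit per subfield that is the AND of a per-subfield condition code and the mask register. How the hardware actually combines the two, and the function names used here, are assumptions for illustration.

```python
def compare_gt(a, b):
    """Per-subfield GT condition code: True where a[i] > b[i]."""
    return [x > y for x, y in zip(a, b)]

def selective_writeback(dest, result, cond, mask=None):
    """Commit result[i] into dest[i] only where the subfield is enabled.
    Assumption: enable = condition code AND mask bit; a missing mask
    means every subfield is allowed by the mask register."""
    if mask is None:
        mask = [True] * len(dest)
    return [r if (c and m) else d
            for d, r, c, m in zip(dest, result, cond, mask)]

def lanewise_max(a, b):
    """Branch-free maximum: overwrite a[i] with b[i] where b[i] > a[i]."""
    return selective_writeback(a, b, compare_gt(b, a))
```

A lane-wise maximum shows why this is useful: the per-element decision is made by the condition codes during writeback, so no branch is needed for each subfield.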
SLIDE 12
Example Using Permutations and Selective Execution
[Figure: WideWord register contents before and after each step]
Initial: wr1 = a0 a1 a2 a3 a4 a5 a6 a7, wr2 = b0 b1 b2 b3 b4 b5 b6 b7, wr3 = c0 c1 c2 c3 c4 c5 c6 c7
After first wprmi_se: wr1 = a0 a1 a2 a3 a4 a5 a6 a7, wr3 = c0 c1 c2 c3 a0 a1 a2 a3
After second wprmi_se: wr2 = b0 b1 b2 b3 b4 b5 b6 b7, wr3 = a0 a1 a2 a3 b4 b5 b6 b7
r1 = upper_enable
r2 = lower_enable
r3 = swap_upper_lower
mtspr mask,r2
wprmi_se wr3,wr1,r3 // permute using selective execution
mtspr mask,r1
wprmi_se wr3,wr2,r3
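The instruction sequence on this slide can be traced with a small Python model. The semantics given to wprmi_se here (permute the source, then commit only the subfields enabled by the mask) are inferred from the slide, and the choice of subfields 0 to 3 as the "upper" half is an assumption; the real lane numbering may differ.

```python
# Assumed encodings of the three scalar values used on the slide.
SWAP_UPPER_LOWER = [4, 5, 6, 7, 0, 1, 2, 3]  # permutation vector
UPPER_ENABLE = [1, 1, 1, 1, 0, 0, 0, 0]      # assumption: lanes 0-3 are "upper"
LOWER_ENABLE = [0, 0, 0, 0, 1, 1, 1, 1]

def wprmi_se(dest, src, perm, mask):
    """Model of a permute with selective execution (assumed semantics):
    permute src, then write only the mask-enabled subfields into dest."""
    permuted = [src[p] for p in perm]
    return [p if m else d for d, p, m in zip(dest, permuted, mask)]

wr1 = [f"a{i}" for i in range(8)]
wr2 = [f"b{i}" for i in range(8)]
wr3 = [f"c{i}" for i in range(8)]

# mtspr mask,r2 ; wprmi_se wr3,wr1,r3
wr3 = wprmi_se(wr3, wr1, SWAP_UPPER_LOWER, LOWER_ENABLE)
after_first = list(wr3)

# mtspr mask,r1 ; wprmi_se wr3,wr2,r3
wr3 = wprmi_se(wr3, wr2, SWAP_UPPER_LOWER, UPPER_ENABLE)
after_second = list(wr3)
```

Under these assumptions, wr3 ends up holding a0 to a3 in one half and b4 to b7 in the other, so two instructions merge halves of two different registers into a third. Which half appears first depends on the lane numbering assumed above.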
SLIDE 13 Applications
Program | Description | Source | Data Set Size | WideWord Usage
Template Matching (TM) | image correlation | Sandia | 4-Kbyte image, 32 1-Kbyte templates | parallelism, selective, reuse in registers, page mode
Cornerturn (CT) | matrix transpose | Atlantic Aerospace | 32-Mbyte matrix | parallelism, permutation
CG | sparse conjugate gradient | NAS | 2M double-precision elements | parallelism, floating-point, page mode
Transitive Closure (TC) | Floyd’s all-pairs shortest paths | Atlantic Aerospace | 256 Kbytes | parallelism, selective, reuse in registers
Neighborhood (NH) | image processing stencil | Atlantic Aerospace | 500,000 bytes |
Natural Join (NJ) | relational database join | Alphatech | 72 Kbytes |
Pointer (P) | random walk | Atlantic Aerospace | 4 Mbytes |
OO7 | database query | University of Wisconsin | 888 Kbytes |
SLIDE 14 Simulation Environment
System simulator based on RSIM
– Detailed memory subsystem simulation
– PIM processor with WideWord, full ISA
– Communication: 1-D ring based on PiRCs
Conservative assumption:
– PIM speed 1/2 of host speed
SLIDE 15
1-PIM Speedups
Speedups over host-only execution
[Figure: bar chart of 1-PIM speedup (axis ticks 5, 10, 15) for TM, CT, CG, TC, NJ, NH, P, OO7]
SLIDE 16 Host Execution Time
Busy and memory stall times for host-only execution
[Figure: stacked bar chart (axis ticks 20 to 120) of busy and memory time for TM, CT, CG, TC, NJ, NH, P, OO7]
SLIDE 17 Memory Hierarchy Times
Host-only and 1-PIM memory stall times
[Figure: stacked bar chart (axis ticks 20 to 120) of stall time split into L1, L2, and memory, with a host-only (-H) and a 1-PIM (-P) bar for each of TM, CT, CG, TC, NJ, NH, P, OO7]
SLIDE 18 WideWord Performance Gains
Speedup of 1 PIM with WideWord over 1 PIM Scalar
[Figure: bar chart of speedup (axis ticks 5, 10, 15, 20) for TM, CT, CG, TC]
SLIDE 19
1st DIVA Prototype PIM Chip
[Figure: chip layout. Labels: SRAM, node processing logic, Pbuf, SDRAM interface, PiRC]
Fabrication technology
– TSMC 0.18µm
Size
– 9.8mm x 9.8mm, 55 million transistors (2 million logic)
Package
– 35mm, 352 BGA
– 241 signal I/O, 111 Vdd or Gnd
Lab results
– Running Cornerturn application at 160MHz while dissipating 0.8W
– Using WideWord permutations and selective execution
SLIDE 20
Conclusions and Future Work
DIVA accelerates multimedia (streaming) and irregular computations (sparse, pointer-based)
– Average speedup of 3.3 using just 1 PIM
First prototype PIM chip is demonstrating encouraging results
Ongoing Work
– Demonstration system incorporating PIMs
– Future PIMs with WideWord floating-point capability and address translation