USC Information Sciences Institute - PowerPoint PPT Presentation



SLIDE 1

The Architecture of the DIVA Processing-In-Memory Chip

Jeff Draper, Jacqueline Chame, Mary Hall, Craig Steele, Tim Barrett, Jeff LaCoss, John Granacki, Jaewook Shin, Chun Chen, Chang Woo Kang, Ihn Kim, Gokhan Daglikoca

USC Information Sciences Institute

ICS’02, June 24, 2002

SLIDE 2

Outline

Overview of Project Goals and System Architecture
PIM Chip Architecture
Applications and Simulation Results
Prototype Chip Implementation
Conclusion

SLIDE 3

Increasing Bandwidth

The Problem

[Figure: a host processor connected to memory]

Solutions

[Figure: an array of memory (M) and processor (P) pairs]

– Processor-memory pairs with wide datapaths
– Multiple nodes per chip
– Memory-to-memory interconnect

SLIDE 4

System Architecture

[Figure: system diagram. Labels: Host Processor (PowerPC); Processor Memory Bus; array of PIM chips; PIM-to-PIM Interconnect]

PIMs as “smart-memory” co-processors

SLIDE 5

DIVA Key Ideas

First smart-memory PIM device that is
– Capable of executing independent threads of control
– Designed to support in-memory virtual addressing

Target Applications
– Image processing and multimedia (streaming)
– Irregular memory accesses (sparse-matrix and pointer-based)

Evolutionary application development
– PIMs also support standard memory accesses
– System supports familiar parallel programming paradigms

SLIDE 6

PIM Chip Organization

[Figure: chip block diagram. Labels: Node (Memory, Processing Logic); Host Interface; PIM Routing Component (to neighboring PIMs); PBUF; Memory Port; PIM Memory Bus; Parcel Interconnect; To Host System Memory Bus]

SLIDE 7

Host Memory Interface

Even with an extra arbitration cycle, DIVA PIMs satisfy SDRAM timing by:
– Using high-bandwidth embedded memory macros
– Running arbitration logic at 4X clock speed
– Exploiting long latency allowed by SDRAM standard

[Figure: SDRAM timing diagram. Signals: CLK, RAS, CAS, ADDR (ROW, COL), DATA (D0, D1, D2, D3)]

SLIDE 8

Node Architecture

[Figure: node block diagram. Labels: Memory; Memory Control & Arbiter (Node Memory Requests, Host Memory Requests, ICache Mem Requests); Host Memory Port; Parcel Buffer (“PBUF”) with Header, Data, and CTL; Node Data Bus; ICache; Instructions; 5-Stage DLX-like Instruction Pipeline; Scalar Datapath; Scalar Registers; WideWord Datapath; WideWord Registers; bus-width labels 32b, 256b, 256b]

SLIDE 9

WideWord Unit

256-bit datapath using a 32x256 register file

WideWord operand treated as a packed array of 8-, 16-, or 32-bit objects
– Object size specified by instruction

Features
– Transfers to/from scalar register file
– Data rearrangement instructions
– Selective execution
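
The packed-array view of a WideWord operand can be illustrated with a small software model. This is only a sketch, not the DIVA ISA: the helper names (`unpack`, `pack`, `packed_add`) are hypothetical, and wraparound on overflow within each subfield is an assumption, since the slide does not say how overflow is handled.

```python
WIDTH = 256  # WideWord width in bits, as stated on the slide


def unpack(word: int, size: int) -> list[int]:
    """Split a WIDTH-bit integer into unsigned subfields of `size` bits, lowest bits first."""
    mask = (1 << size) - 1
    return [(word >> (i * size)) & mask for i in range(WIDTH // size)]


def pack(fields: list[int], size: int) -> int:
    """Reassemble subfields into one integer, truncating each to `size` bits."""
    mask = (1 << size) - 1
    word = 0
    for i, field in enumerate(fields):
        word |= (field & mask) << (i * size)
    return word


def packed_add(a: int, b: int, size: int) -> int:
    """Add two WideWords subfield by subfield; carries never cross a subfield boundary.

    Wraparound on overflow is an assumption made for this sketch.
    """
    if size not in (8, 16, 32):
        raise ValueError("object size must be 8, 16, or 32 bits")
    sums = [x + y for x, y in zip(unpack(a, size), unpack(b, size))]
    return pack(sums, size)
```

The same 256 bits hold 32, 16, or 8 objects depending on the `size` argument, which plays the role of the object size that the slide says is specified by the instruction.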

SLIDE 10

WideWord Permutation Capability

Permutation instructions rearrange subfields
– Rearrangement pattern specified by a permutation vector

Two “flavors” of permutation instructions:
– General-purpose
  – Construction of permutation vector in a WideWord register
– Hardwired
  – Permutation vector found in a lookup table of common patterns, e.g., shifts, rotates, shuffles, reductions, etc.
  – Scalar register value serves as index into the lookup table
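
The two flavors can be modeled in a few lines. This sketch assumes a "gather" reading of the permutation vector (result subfield i takes source subfield perm[i]); the slides do not state the direction. The table contents and indices in `HARDWIRED` are hypothetical and do not reflect the chip's actual lookup table.

```python
N = 8  # number of subfields, e.g. eight 32-bit objects in a 256-bit WideWord


def permute(src: list, perm: list[int]) -> list:
    """General-purpose flavor: the caller supplies the permutation vector.

    Assumed semantics: result[i] = src[perm[i]].
    """
    return [src[p] for p in perm]


# Hardwired flavor: a small table of common patterns.
# Both the indices and the patterns chosen here are illustrative only.
HARDWIRED = {
    0: [(i + 1) % N for i in range(N)],       # rotate by one subfield
    1: [(i + N // 2) % N for i in range(N)],  # swap upper and lower halves
    2: list(reversed(range(N))),              # reverse subfield order
}


def permute_hardwired(src: list, index: int) -> list:
    """Hardwired flavor: a scalar value selects a pattern from the table."""
    return permute(src, HARDWIRED[index])
```

In this model the hardwired flavor trades flexibility for not having to build a vector in a register first; only a scalar index is needed.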

SLIDE 11

WideWord Selective Execution

Only certain subfields of a result are committed during writeback

Subfields participating are determined by a combination of:
– Condition codes
  – EQ, GT, LT, OV
– User-settable mask register
  – Useful for a priori subfield specification
– Bits in instruction
  – Specify whether and what type of selective execution is to be used for that instruction
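
As a rough illustration of selective writeback, the sketch below computes per-subfield condition codes and commits a result only where both a chosen code and the mask bit are set. Combining the two with a logical AND is an assumption (the slide only says "a combination"), the OV code is omitted because it requires an arithmetic operation, and all function names are hypothetical.

```python
def compare(a: list[int], b: list[int]) -> list[dict]:
    """Per-subfield condition codes from comparing a with b (OV omitted in this sketch)."""
    return [{"EQ": x == y, "GT": x > y, "LT": x < y} for x, y in zip(a, b)]


def selective_write(dest: list, result: list, enable: list[bool]) -> list:
    """Commit result subfields only where enabled; other subfields keep the old value."""
    return [r if e else d for d, r, e in zip(dest, result, enable)]


def packed_max(a: list[int], b: list[int], mask: list[int]) -> list[int]:
    """Subfield-wise maximum, restricted to the subfields allowed by the mask.

    b overwrites a only where a < b (LT code) AND the mask bit is set.
    """
    codes = compare(a, b)
    enable = [c["LT"] and bool(m) for c, m in zip(codes, mask)]
    return selective_write(a, b, enable)
```

For example, `packed_max([1, 5, 3, 9], [4, 2, 3, 10], [1, 1, 1, 0])` leaves the last subfield at 9 because the mask disables it, even though the LT code is set there.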

SLIDE 12

Example Using Permutations and Selective Execution

[Figure: register contents before and after each step, as extracted]

Initial values:
wr1: a0 a1 a2 a3 a4 a5 a6 a7
wr2: b0 b1 b2 b3 b4 b5 b6 b7
wr3: c0 c1 c2 c3 c4 c5 c6 c7

After the first wprmi_se (source wr1):
wr3: c0 c1 c2 c3 a0 a1 a2 a3

After the second wprmi_se (source wr2):
wr3: a0 a1 a2 a3 b4 b5 b6 b7

r1 = upper_enable
r2 = lower_enable
r3 = swap_upper_lower
mtspr mask,r2
wprmi_se wr3,wr1,r3 // permute using selective execution
mtspr mask,r1
wprmi_se wr3,wr2,r3
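
The instruction sequence above can be traced with a small software model. This is a sketch under stated assumptions, not a definition of `wprmi_se`: it assumes subfield 0 is written leftmost, that "upper" means subfields 0-3, and that the permutation gathers (result[i] = src[perm[i]]). With these choices the intermediate value matches the figure. The final value contains the same subfields the figure lists (a0-a3 and b4-b7), but their left-to-right placement depends on the assumed convention, and the extracted figure is ambiguous on that point.

```python
# Assumed encodings for the three scalar operands (hypothetical values).
UPPER_ENABLE = [1, 1, 1, 1, 0, 0, 0, 0]      # assumption: "upper" = subfields 0-3
LOWER_ENABLE = [0, 0, 0, 0, 1, 1, 1, 1]      # assumption: "lower" = subfields 4-7
SWAP_UPPER_LOWER = [4, 5, 6, 7, 0, 1, 2, 3]  # gather pattern that exchanges the halves


def wprmi_se(dest: list, src: list, perm: list[int], mask: list[int]) -> list:
    """Model of a permute with selective execution.

    Permute src, then commit only the subfields enabled by the mask;
    disabled subfields of dest keep their previous contents.
    """
    permuted = [src[p] for p in perm]
    return [p if m else d for d, p, m in zip(dest, permuted, mask)]


def run_example() -> tuple[list[str], list[str]]:
    """Trace the slide's sequence; returns (wr3 after step 1, wr3 after step 2)."""
    wr1 = [f"a{i}" for i in range(8)]
    wr2 = [f"b{i}" for i in range(8)]
    wr3 = [f"c{i}" for i in range(8)]

    mask = LOWER_ENABLE                               # mtspr mask,r2
    wr3 = wprmi_se(wr3, wr1, SWAP_UPPER_LOWER, mask)  # wprmi_se wr3,wr1,r3
    step1 = list(wr3)

    mask = UPPER_ENABLE                               # mtspr mask,r1
    wr3 = wprmi_se(wr3, wr2, SWAP_UPPER_LOWER, mask)  # wprmi_se wr3,wr2,r3
    return step1, wr3
```

Under these assumptions, step 1 yields `c0 c1 c2 c3 a0 a1 a2 a3` and step 2 yields `b4 b5 b6 b7 a0 a1 a2 a3`: each step moves one half of a source register into the opposite half of wr3 without disturbing the other half.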

SLIDE 13

Applications

Program | Description | Source | Data Set Size | WideWord Usage
Template Matching (TM) | image correlation | Sandia | 4-Kbyte image, 32 1-Kbyte templates | parallelism, selective, reuse in registers, page mode
Cornerturn (CT) | matrix transpose | Atlantic Aerospace | 32-Mbyte matrix | parallelism, permutation
CG | sparse conjugate gradient | NAS | 2M double-precision elements | parallelism, floating-point, page mode
Transitive Closure (TC) | Floyd’s all-pairs shortest paths | Atlantic Aerospace | 256 Kbytes | parallelism, selective, reuse in registers
Neighborhood (NH) | image processing stencil | Atlantic Aerospace | 500,000 bytes |
Natural Join (NJ) | relational database join | Alphatech | 72 Kbytes |
Pointer (P) | random walk | Atlantic Aerospace | 4 Mbytes |
OO7 | object-oriented database query | University of Wisconsin | 888 Kbytes |

SLIDE 14

Simulation Environment

System simulator based on RSIM
– Detailed memory subsystem simulation
– PIM processor with WideWord, full ISA
– Communication: 1-D ring based on PiRCs

Conservative assumption:
– PIM speed 1/2 of host speed

SLIDE 15

1-PIM Speedups

Speedups over host-only execution

[Figure: bar chart of speedup per application (TM, CT, CG, TC, NJ, NH, P, OO7); y-axis ticks at 5, 10, 15]

SLIDE 16

Host Execution Time

Busy and memory stall times for host-only execution

[Figure: stacked bar chart per application (TM, CT, CG, TC, NJ, NH, P, OO7); y-axis ticks from 20 to 120; series: busy, memory]

SLIDE 17

Memory Hierarchy Times

Host-only and 1-PIM memory stall times

[Figure: stacked bar chart with a host-only (-H) and a 1-PIM (-P) bar per application: TM, CT, CG, TC, NJ, NH, P, OO7; y-axis ticks from 20 to 120; series: L1, L2, memory]

SLIDE 18

WideWord Performance Gains

Speedup of 1 PIM with WideWord over 1 PIM Scalar

[Figure: bar chart of speedup for TM, CT, CG, TC; y-axis ticks at 5, 10, 15, 20]

SLIDE 19

1st DIVA Prototype PIM Chip

[Figure: chip photo. Labels: SRAM; Node Processing Logic; Pbuf; SDRAM Interface; PiRC]

Fabrication technology
– TSMC 0.18µm

Size
– 9.8mm x 9.8mm, 55 million transistors (2 million logic)

Package
– 35mm, 352 BGA
– 241 signal I/O, 111 Vdd or Gnd

Lab results
– Running Cornerturn application at 160MHz while dissipating 0.8W
  – Using WideWord permutations and selective execution
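
A quick consistency check on the figures above: the pin counts sum to the package size, and the reported power and clock imply an energy per clock cycle. The per-cycle figure is derived here, not stated on the slide, and assumes 0.8W is the average power at 160MHz.

```latex
\begin{align*}
N_{\text{pins}} &= 241 + 111 = 352 \\
E_{\text{cycle}} &= \frac{P}{f}
  = \frac{0.8\ \text{W}}{160 \times 10^{6}\ \text{Hz}}
  = 5 \times 10^{-9}\ \text{J}
  = 5\ \text{nJ per cycle}
\end{align*}
```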

SLIDE 20

Conclusions and Future Work

DIVA accelerates multimedia (streaming) and irregular computations (sparse, pointer-based)
– Average speedup of 3.3 using just 1 PIM

First prototype PIM chip is demonstrating encouraging results

Ongoing Work
– Demonstration system incorporating PIMs
– Future PIMs with WideWord floating-point capability and address translation