CS 5234 Spring 2013 Advanced Parallel Computing Architecture – Yong Cao (PowerPoint PPT Presentation)



SLIDE 1

Architecture

CS 5234 – Spring 2013: Advanced Parallel Computing Architecture

Yong Cao

SLIDE 2

Goals

Ø Seq equen ential Machine e and Von

  • n-N
  • Neu

eumann Mod

  • del

el Ø Pa Parallel el Ha Hardware e

Ø Distributed vs Shared Memory

Ø Architec ecture e Classes es

Ø Multiple-core Ø Many-core (massive parallel)

Ø NVIDIA GPU PU Architec ecture e

SLIDE 3

Von-Neumann Machine (VN)

Ø PC PC: Pr Prog

  • gram cou
  • unter

er Ø MAR: Mem emor

  • ry addres

ess reg egister er Ø MDR: Mem emor

  • ry data

reg egister er Ø IR: Instruction

  • n reg

egister er Ø ALU LU: Arithmet etic Log Logic Unit Unit Ø Acc Acc: Accumulator

  • r

PC MAR Acc MDR OP ADDRESS MEMORY A L U

Decoder

SLIDE 4

Sequential Execution and Instruction Cycle

Ø Th The e six phases es of

  • f the

e instruction

  • n cycle:

e:

Ø Fetch Ø Decode Ø Evaluate Address Ø Fetch Operands Ø Execute Ø Store Result


SLIDE 5

Sequential Execution and Instruction Cycle

Ø Fet Fetch

Ø MARçPC Ø MDRçMEM[MAR] Ø IRçMDR


SLIDE 6

Sequential Execution and Instruction Cycle

Ø Dec ecod

  • de

e

Ø DECODERçIR.OP


SLIDE 7

Sequential Execution and Instruction Cycle

Ø Evaluate e Addres ess

Ø MARçIR.ADDR Ø MDRçMEM[MAR]


SLIDE 8

Sequential Execution and Instruction Cycle

Ø Exec ecute e

Ø Acc çAcc + MDR


SLIDE 9

Sequential Execution and Instruction Cycle

Ø Stor

  • re

e Res esult

Ø MDRçAcc

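Taken together, the preceding slides trace one full pass through the instruction cycle. The micro-steps can be sketched as a toy accumulator-machine simulator; the (op, address) instruction encoding and the small memory contents below are illustrative assumptions, chosen only to make the phases executable, not part of the slides.

```python
# Toy Von-Neumann accumulator machine: one ADD instruction walked
# through the phases from the preceding slides.
# The (op, address) encoding and memory contents are made up for
# illustration.

MEM = {0: ("ADD", 10), 10: 7}   # address 0: instruction; address 10: operand

def run_one_instruction(pc, acc, mem):
    # Fetch: MAR <- PC, MDR <- MEM[MAR], IR <- MDR
    mar = pc
    mdr = mem[mar]
    ir = mdr
    # Decode: DECODER <- IR.OP
    op, addr = ir
    # Evaluate address / fetch operands: MAR <- IR.ADDR, MDR <- MEM[MAR]
    mar = addr
    mdr = mem[mar]
    # Execute: Acc <- Acc + MDR
    if op == "ADD":
        acc = acc + mdr
    # Store result: MDR <- Acc, then written back to memory
    mdr = acc
    mem[mar] = mdr
    return pc + 1, acc

pc, acc = run_one_instruction(0, 5, MEM)
print(acc)   # 5 + 7 = 12
```

Running the single ADD with Acc preloaded to 5 leaves 12 in the accumulator and writes it back to address 10, matching the phase-by-phase register transfers above.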

SLIDE 10

Sequential Execution and Instruction Cycle

Ø Reg egister er Fi File e

PC MAR

Register File

OP ADDRESS MEMORY A L U

Decoder

SLIDE 11

Sequential Execution and Instruction Cycle

[Diagram: simplified datapath with PC, IR, Register File, ALU, and Memory]

SLIDE 12

Parallel Hardware

Ø Shared ed vs vs Distributed ed Mem emor

  • ry

MEMORY

PC A L U IR Register File PC A L U IR Register File PC A L U IR Register File

……

Ø Multi-C

  • Cor
  • re

e and Many-C

  • Cor
  • re

e Architec ecture e

SLIDE 13

Parallel Hardware

Ø Shared ed vs vs Distributed ed Mem emor

  • ry

MEMORY

PC A L U IR Register File PC A L U IR Register File PC A L U IR Register File

……

MEMORY MEMORY Ø Cluster er Com

  • mputing, Grid Com
  • mputing, Clou
  • ud Com
  • mputing
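The shared vs distributed distinction can be mimicked in software: threads share one address space, while workers that communicate only by explicit messages model the distributed case. A minimal sketch follows, using Python threads and a queue standing in for the interconnect; the worker functions and values are illustrative only.

```python
import threading, queue

# Shared memory: all threads read/write the same list (one MEMORY box).
shared = [0] * 4
def shared_worker(i):
    shared[i] = i * i          # direct store into the common memory
threads = [threading.Thread(target=shared_worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(shared)                  # [0, 1, 4, 9]

# Distributed memory: each worker keeps private state and communicates
# only via explicit messages (the queue stands in for the network link).
link = queue.Queue()
def remote_worker(i):
    local = i * i              # lives only in this worker's own "memory"
    link.put((i, local))       # results must be sent, not shared
workers = [threading.Thread(target=remote_worker, args=(i,)) for i in range(4)]
for w in workers: w.start()
for w in workers: w.join()
gathered = dict(link.get() for _ in range(4))
print(gathered)                # {0: 0, 1: 1, 2: 4, 3: 9}, arrival order varies
```

The point of the contrast: in the first half any worker can touch any location directly; in the second, data crosses between workers only through messages, which is the programming cost of the distributed design.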
SLIDE 14

Multi-Core vs Many-Core

Ø Defi efinition

  • n of
  • f Cor
  • re

e – Indep epen enden ent ALU LU Ø How How abou

  • ut a vec

ector

  • r proc
  • ces

essor

  • r?

Ø SIMD: E.g. Intel’s SSE.

Ø How How many is “many”?

Ø What if there are too “many” cores in the Multi-core design?

Ø Shared control logic (PC, IR, Schedule)

SLIDE 15

Multi-Core

Ø Each core has its own control (PC and IR)

SLIDE 16

Many-Core

Ø A grou

  • up of
  • f cor
  • res

es shares es the e con

  • ntrol
  • l (PC

(PC, IR and Th Threa ead Sched eduling) )
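One way to picture shared control: a single program counter and instruction register drive a whole group of ALUs, each applying the same decoded instruction to its own data. Below is a minimal pure-Python sketch of that lockstep execution; the two-instruction "program" and its tiny instruction set are invented for illustration.

```python
# Shared control logic: ONE program counter / instruction register for a
# group of cores; each core holds only its own data value ("lane").
# The MUL/ADD instruction set is a made-up illustration.

program = [("MUL", 2), ("ADD", 1)]           # shared instruction stream
lanes = [1, 2, 3, 4]                          # per-core private data

pc = 0
while pc < len(program):                      # one shared PC for all cores
    op, operand = program[pc]                 # one shared IR / decoder
    if op == "MUL":
        lanes = [x * operand for x in lanes]  # every core executes the
    elif op == "ADD":                         # same instruction in lockstep
        lanes = [x + operand for x in lanes]
    pc += 1

print(lanes)   # [3, 5, 7, 9]
```

Fetch and decode happen once per cycle regardless of how many lanes there are, which is exactly the hardware saving the slide attributes to the many-core design.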

SLIDE 17

NVIDIA Fermi Architecture

16 Streaming Multiprocessors (SMs), 32 cores per SM

SLIDE 18

Fermi SM

SLIDE 19

Execution in a SM

A total of 32 instructions from one or two warps can be dispatched in each cycle to any two of the four execution blocks within a Fermi SM: two blocks of 16 cores each, one block of four Special Function Units, and one block of 16 load/store units. [Figure: how instructions are issued to the execution blocks.]

SLIDE 20

Data Parallel

Ø Data Pa Parallel el vs vs Ta Task Pa Parallel el

Ø What to partition? Data or Task?

Ø Massive e Data Pa Parallel el

Ø Millions (or more) of threads Ø Same instruction, different data elements

SLIDE 21

Computing on GPUs

Ø Stea eam proc

  • ces

essing and Vec ector

  • riza

zation

  • n (S

(SIMD) )

Input Stream Output Stream Instructions SIMD

SLIDE 22

GPU Programming Model: Stream

Ø Strea eam Pr Prog

  • gramming Mod
  • del

el Ø Strea eams:

Ø An array of data units

Ø Ker ernel els:

Ø Take streams as input, produce streams at output Ø Perform computation on streams Ø Kernels can be linked together

Stream Stream Kernel
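Kernels that consume one stream, produce another, and can be linked together map naturally onto chained generators. A minimal sketch of the model; the particular kernels (scale, then clamp) are invented for illustration.

```python
# Stream programming model: streams are sequences of data units; kernels
# consume an input stream and yield an output stream, so kernels can be
# linked into a pipeline.  The scale/clamp kernels are made-up examples.

def scale(stream, factor):
    for x in stream:           # kernel 1: same operation on every element
        yield x * factor

def clamp(stream, limit):
    for x in stream:           # kernel 2: linked to kernel 1's output
        yield min(x, limit)

input_stream = range(6)                        # 0..5
pipeline = clamp(scale(input_stream, 3), 10)   # Stream -> Kernel -> Kernel -> Stream
output_stream = list(pipeline)
print(output_stream)    # [0, 3, 6, 9, 10, 10]
```

Because each generator pulls elements one at a time, the pipeline also exhibits the producer-consumer locality that the next slide cites as a benefit of streams.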

SLIDE 23

Why Streams?

Ø Ample e com

  • mputation
  • n by ex

expos

  • sing parallel

elism

Ø Stream expose data parallelism

Ø Multiple stream elements can be processed in parallel

Ø Pipeline (task) parallelism

Ø Multiple tasks can be processed in parallel

Ø Effi fficien ent com

  • mmunication
  • n

Ø Producer-consumer locality Ø Predictable memory access pattern

Ø Optimize for throughput of all elements, not latency of

  • ne

Ø Processing many elements at once allows latency hiding