CS 5234 Spring 2013 Advanced Parallel Computing Architecture – Yong Cao (PowerPoint PPT Presentation)



SLIDE 1

Architecture

CS 5234 – Spring 2013: Advanced Parallel Computing Architecture

Yong Cao

SLIDE 2

Goals

Ø Seq equen ential Machine e and Von

  • n-N
  • Neu

eumann Mod

  • del

el Ø Pa Parallel el Ha Hardware e

Ø Distributed vs Shared Memory

Ø Architec ecture e Classes es

Ø Multiple-core Ø Many-core (massive parallel)

Ø NVIDIA GPU PU Architec ecture e

SLIDE 3

Von-Neumann Machine (VN)

Ø PC PC: Pr Prog

  • gram cou
  • unter

er Ø MAR: Mem emor

  • ry addres

ess reg egister er Ø MDR: Mem emor

  • ry data

reg egister er Ø IR: Instruction

  • n reg

egister er Ø ALU LU: Arithmet etic Log Logic Unit Unit Ø Acc Acc: Accumulator

  • r

PC MAR Acc MDR OP ADDRESS MEMORY A L U

Decoder

SLIDE 4

Sequential Execution and Instruction Cycle

Ø Th The e six phases es of

  • f the

e instruction

  • n cycle:

e:

Ø Fetch Ø Decode Ø Evaluate Address Ø Fetch Operands Ø Execute Ø Store Result


SLIDE 5

Sequential Execution and Instruction Cycle

Ø Fet Fetch

Ø MARçPC Ø MDRçMEM[MAR] Ø IRçMDR


SLIDE 6

Sequential Execution and Instruction Cycle

Ø Dec ecod

  • de

e

Ø DECODERçIR.OP


SLIDE 7

Sequential Execution and Instruction Cycle

Ø Evaluate e Addres ess

Ø MARçIR.ADDR Ø MDRçMEM[MAR]


SLIDE 8

Sequential Execution and Instruction Cycle

Ø Exec ecute e

Ø Acc çAcc + MDR


SLIDE 9

Sequential Execution and Instruction Cycle

Ø Stor

  • re

e Res esult

Ø MDRçAcc

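Taken together, the preceding slides trace one full pass through the instruction cycle. The micro-steps can be sketched as a toy accumulator-machine simulator; the (op, address) instruction encoding and the small memory contents below are illustrative assumptions, chosen only to make the phases executable, not part of the slides.

```python
# Toy Von-Neumann accumulator machine: one ADD instruction walked
# through the phases from the preceding slides.
# The (op, address) encoding and memory contents are made up for
# illustration.

MEM = {0: ("ADD", 10), 10: 7}   # address 0: instruction; address 10: operand

def run_one_instruction(pc, acc, mem):
    # Fetch: MAR <- PC, MDR <- MEM[MAR], IR <- MDR
    mar = pc
    mdr = mem[mar]
    ir = mdr
    # Decode: DECODER <- IR.OP
    op, addr = ir
    # Evaluate address / fetch operands: MAR <- IR.ADDR, MDR <- MEM[MAR]
    mar = addr
    mdr = mem[mar]
    # Execute: Acc <- Acc + MDR
    if op == "ADD":
        acc = acc + mdr
    # Store result: MDR <- Acc, then written back to memory
    mdr = acc
    mem[mar] = mdr
    return pc + 1, acc

pc, acc = run_one_instruction(0, 5, MEM)
print(acc)   # 5 + 7 = 12
```

Running the single ADD with Acc preloaded to 5 leaves 12 in the accumulator and writes it back to address 10, matching the phase-by-phase register transfers above.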

SLIDE 10

Sequential Execution and Instruction Cycle

Ø Reg egister er Fi File e

PC MAR

Register File

OP ADDRESS MEMORY A L U

Decoder

SLIDE 11

Sequential Execution and Instruction Cycle

[Diagram: simplified datapath with PC, IR, Register File, ALU, and Memory]

SLIDE 12

Parallel Hardware

Ø Shared ed vs vs Distributed ed Mem emor

  • ry

MEMORY

PC A L U IR Register File PC A L U IR Register File PC A L U IR Register File

……

Ø Multi-C

  • Cor
  • re

e and Many-C

  • Cor
  • re

e Architec ecture e

SLIDE 13

Parallel Hardware

Ø Shared ed vs vs Distributed ed Mem emor

  • ry

MEMORY

PC A L U IR Register File PC A L U IR Register File PC A L U IR Register File

……

MEMORY MEMORY Ø Cluster er Com

  • mputing, Grid Com
  • mputing, Clou
  • ud Com
  • mputing
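The shared vs distributed distinction can be mimicked in software: threads share one address space, while workers that communicate only by explicit messages model the distributed case. A minimal sketch follows, using Python threads and a queue standing in for the interconnect; the worker functions and values are illustrative only.

```python
import threading, queue

# Shared memory: all threads read/write the same list (one MEMORY box).
shared = [0] * 4
def shared_worker(i):
    shared[i] = i * i          # direct store into the common memory
threads = [threading.Thread(target=shared_worker, args=(i,)) for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(shared)                  # [0, 1, 4, 9]

# Distributed memory: each worker keeps private state and communicates
# only via explicit messages (the queue stands in for the network link).
link = queue.Queue()
def remote_worker(i):
    local = i * i              # lives only in this worker's own "memory"
    link.put((i, local))       # results must be sent, not shared
workers = [threading.Thread(target=remote_worker, args=(i,)) for i in range(4)]
for w in workers: w.start()
for w in workers: w.join()
gathered = dict(link.get() for _ in range(4))
print(gathered)                # {0: 0, 1: 1, 2: 4, 3: 9}, arrival order varies
```

The point of the contrast: in the first half any worker can touch any location directly; in the second, data crosses between workers only through messages, which is the programming cost of the distributed design.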
SLIDE 14

Multi-Core vs Many-Core

Ø Defi efinition

  • n of
  • f Cor
  • re

e – Indep epen enden ent ALU LU Ø How How abou

  • ut a vec

ector

  • r proc
  • ces

essor

  • r?

Ø SIMD: E.g. Intel’s SSE.

Ø How How many is “many”?

Ø What if there are too “many” cores in the Multi-core design?

Ø Shared control logic (PC, IR, Schedule)

SLIDE 15

Multi-Core

Ø Each core has its own control (PC and IR)

SLIDE 16

Many-Core

Ø A grou

  • up of
  • f cor
  • res

es shares es the e con

  • ntrol
  • l (PC

(PC, IR and Th Threa ead Sched eduling) )
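One way to picture shared control: a single program counter and instruction register drive a whole group of ALUs, each applying the same decoded instruction to its own data. Below is a minimal pure-Python sketch of that lockstep execution; the two-instruction "program" and its tiny instruction set are invented for illustration.

```python
# Shared control logic: ONE program counter / instruction register for a
# group of cores; each core holds only its own data value ("lane").
# The MUL/ADD instruction set is a made-up illustration.

program = [("MUL", 2), ("ADD", 1)]           # shared instruction stream
lanes = [1, 2, 3, 4]                          # per-core private data

pc = 0
while pc < len(program):                      # one shared PC for all cores
    op, operand = program[pc]                 # one shared IR / decoder
    if op == "MUL":
        lanes = [x * operand for x in lanes]  # every core executes the
    elif op == "ADD":                         # same instruction in lockstep
        lanes = [x + operand for x in lanes]
    pc += 1

print(lanes)   # [3, 5, 7, 9]
```

Fetch and decode happen once per cycle regardless of how many lanes there are, which is exactly the hardware saving the slide attributes to the many-core design.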

SLIDE 17

NVIDIA Fermi Architecture

16 Streaming Multiprocessors (SMs), 32 cores per SM

SLIDE 18

Fermi SM

SLIDE 19

Execution in a SM

A total of 32 instructions from one or two warps can be dispatched in each cycle to any two of the four execution blocks within a Fermi SM: two blocks of 16 cores each, one block of four Special Function Units, and one block of 16 load/store units. [Figure: how instructions are issued to the execution blocks.]

SLIDE 20

Data Parallel

Ø Data Pa Parallel el vs vs Ta Task Pa Parallel el

Ø What to partition? Data or Task?

Ø Massive e Data Pa Parallel el

Ø Millions (or more) of threads Ø Same instruction, different data elements

SLIDE 21

Computing on GPUs

Ø Stea eam proc

  • ces

essing and Vec ector

  • riza

zation

  • n (S

(SIMD) )

Input Stream Output Stream Instructions SIMD

SLIDE 22

GPU Programming Model: Stream

Ø Strea eam Pr Prog

  • gramming Mod
  • del

el Ø Strea eams:

Ø An array of data units

Ø Ker ernel els:

Ø Take streams as input, produce streams at output Ø Perform computation on streams Ø Kernels can be linked together

Stream Stream Kernel
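Kernels that consume one stream, produce another, and can be linked together map naturally onto chained generators. A minimal sketch of the model; the particular kernels (scale, then clamp) are invented for illustration.

```python
# Stream programming model: streams are sequences of data units; kernels
# consume an input stream and yield an output stream, so kernels can be
# linked into a pipeline.  The scale/clamp kernels are made-up examples.

def scale(stream, factor):
    for x in stream:           # kernel 1: same operation on every element
        yield x * factor

def clamp(stream, limit):
    for x in stream:           # kernel 2: linked to kernel 1's output
        yield min(x, limit)

input_stream = range(6)                        # 0..5
pipeline = clamp(scale(input_stream, 3), 10)   # Stream -> Kernel -> Kernel -> Stream
output_stream = list(pipeline)
print(output_stream)    # [0, 3, 6, 9, 10, 10]
```

Because each generator pulls elements one at a time, the pipeline also exhibits the producer-consumer locality that the next slide cites as a benefit of streams.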

SLIDE 23

Why Streams?

Ø Ample e com

  • mputation
  • n by ex

expos

  • sing parallel

elism

Ø Stream expose data parallelism

Ø Multiple stream elements can be processed in parallel

Ø Pipeline (task) parallelism

Ø Multiple tasks can be processed in parallel

Ø Effi fficien ent com

  • mmunication
  • n

Ø Producer-consumer locality Ø Predictable memory access pattern

Ø Optimize for throughput of all elements, not latency of

  • ne

Ø Processing many elements at once allows latency hiding