Architecture
Architecture
Goals
Ø Sequential Machine and the Von-Neumann Model
Ø Parallel Hardware
Ø Distributed vs Shared Memory
Ø Architecture Classes
  Ø Multiple-core
  Ø Many-core (massive parallel)
Ø NVIDIA GPU Architecture
Architecture
Von-Neumann Machine (VN)
Ø PC: Program counter
Ø MAR: Memory address register
Ø MDR: Memory data register
Ø IR: Instruction register
Ø ALU: Arithmetic Logic Unit
Ø Acc: Accumulator
[Figure: Von-Neumann machine — PC, MAR, MDR, IR (OP | ADDRESS), Acc, ALU, Decoder, and Memory]
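For concreteness, the machine state above can be modeled as a handful of registers plus a flat memory array. This is a minimal sketch; the struct name, field widths, and memory size are my own assumptions, not part of the lecture.

```cpp
#include <cstdint>

// Illustrative model of the Von-Neumann machine state (names are hypothetical).
struct VonNeumannMachine {
    uint16_t pc;          // PC:  program counter
    uint16_t mar;         // MAR: memory address register
    uint32_t mdr;         // MDR: memory data register
    uint32_t ir;          // IR:  instruction register (OP | ADDRESS)
    int32_t  acc;         // Acc: accumulator (ALU results land here)
    uint32_t mem[65536];  // main memory, addressed through MAR
};
```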
Architecture
Sequential Execution and Instruction Cycle
Ø The six phases of the instruction cycle:
  Ø Fetch
  Ø Decode
  Ø Evaluate Address
  Ø Fetch Operands
  Ø Execute
  Ø Store Result
Architecture
Sequential Execution and Instruction Cycle
Ø Fetch
  Ø MAR ← PC
  Ø MDR ← MEM[MAR]
  Ø IR ← MDR
Architecture
Sequential Execution and Instruction Cycle
Ø Decode
  Ø DECODER ← IR.OP
Architecture
Sequential Execution and Instruction Cycle
Ø Evaluate Address (and Fetch Operands)
  Ø MAR ← IR.ADDR
  Ø MDR ← MEM[MAR]
Architecture
Sequential Execution and Instruction Cycle
Ø Execute
  Ø Acc ← Acc + MDR
Architecture
Sequential Execution and Instruction Cycle
Ø Store Result
  Ø MDR ← Acc
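Putting the phases together, one pass through the cycle for an "add accumulator with memory operand" instruction can be sketched in code. This uses the hypothetical VonNeumannMachine struct above; the opcode value, instruction encoding, and the final write-back to memory are assumptions for illustration, not lecture material.

```cpp
// One instruction cycle for a hypothetical "ADD addr" instruction (illustrative).
void step(VonNeumannMachine &m) {
    // Fetch
    m.mar = m.pc;
    m.mdr = m.mem[m.mar];
    m.ir  = m.mdr;
    m.pc += 1;                              // advance to the next instruction

    // Decode (assumed encoding: high bits = OP, low bits = ADDRESS)
    uint32_t op   = m.ir >> 16;
    uint16_t addr = m.ir & 0xFFFF;

    // Evaluate Address + Fetch Operands
    m.mar = addr;
    m.mdr = m.mem[m.mar];

    // Execute (here: accumulate the operand)
    if (op == 0x0001) {                     // hypothetical ADD opcode
        m.acc = m.acc + (int32_t)m.mdr;
    }

    // Store Result
    m.mdr = (uint32_t)m.acc;
    m.mem[m.mar] = m.mdr;                   // write back (assumed: store to operand address)
}
```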
Architecture
Sequential Execution and Instruction Cycle
Ø Register File
[Figure: Von-Neumann machine with the Acc and MDR replaced by a Register File — PC, MAR, Register File, ALU, Decoder, and Memory]
Architecture
Sequential Execution and Instruction Cycle
[Figure: simplified processor — PC, IR, Register File, ALU, and Memory]
Architecture
Parallel Hardware
Ø Shared vs Distributed Memory
[Figure: shared memory — several processors (each with PC, IR, Register File, ALU) connected to one shared MEMORY]
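As a rough software analogy for the shared-memory picture (my own example, not from the slides), several host threads can work on the same array in a single address space:

```cpp
#include <cstdio>
#include <thread>
#include <vector>

// Illustrative shared-memory parallelism: all threads see the same array.
int main() {
    const int nthreads = 4;
    const int n = 1 << 20;
    std::vector<float> data(n, 1.0f);          // one array, shared by all threads
    std::vector<float> partial(nthreads, 0.0f);

    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            // Each thread sums its own contiguous chunk of the shared array.
            int chunk = n / nthreads;
            for (int i = t * chunk; i < (t + 1) * chunk; ++i)
                partial[t] += data[i];
        });
    }
    for (auto &w : workers) w.join();

    float sum = 0.0f;
    for (float p : partial) sum += p;
    std::printf("sum = %f\n", sum);
    return 0;
}
```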
Ø Multi-Core and Many-Core Architecture
Architecture
Parallel Hardware
Ø Shared vs Distributed Memory
[Figure: distributed memory — each processor (PC, IR, Register File, ALU) has its own private MEMORY]
Ø Cluster Computing, Grid Computing, Cloud Computing
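Distributed-memory machines exchange data by explicit messages instead of through a shared address space. The slides only name cluster, grid, and cloud computing; the MPI snippet below is my own illustration of the idea, not part of the lecture.

```cpp
#include <mpi.h>
#include <cstdio>

// Illustrative distributed-memory sum: each rank owns its data privately,
// and results are combined by explicit communication (MPI_Reduce).
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = (double)rank;      // data in this rank's private memory
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        std::printf("sum over %d ranks = %f\n", size, total);

    MPI_Finalize();
    return 0;
}
```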
Architecture
Multi-Core vs Many-Core
Ø Definition of Core – Independent ALU
Ø How about a vector processor?
  Ø SIMD: e.g., Intel's SSE (see the sketch after this list)
Ø How many is "many"?
Ø What if there are too "many" cores in the Multi-core design?
  Ø Shared control logic (PC, IR, Schedule)
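Since the slide cites Intel's SSE as the example of SIMD, here is a minimal sketch of what a vector instruction does: one instruction operates on four float lanes at once. It uses standard SSE intrinsics and is my own illustration, not lecture code.

```cpp
#include <xmmintrin.h>   // SSE intrinsics

// One SIMD instruction (_mm_add_ps) adds four float lanes at once.
void add4(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);      // load 4 floats from a
    __m128 vb = _mm_loadu_ps(b);      // load 4 floats from b
    __m128 vc = _mm_add_ps(va, vb);   // a[i] + b[i] for i = 0..3, in one instruction
    _mm_storeu_ps(out, vc);           // store 4 results
}
```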
Architecture
Multi-Core
Ø Each core has its own control (PC and IR)
Architecture
Many-Core
Ø A group of cores shares the control (PC, IR and Thread Scheduling)
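On NVIDIA GPUs this shared control shows up as warps: 32 threads share one instruction stream, so a branch taken differently within a warp is executed one path at a time. A small CUDA sketch of such a divergent branch (my own illustration):

```cuda
// CUDA kernel: threads in a warp share control logic, so the if/else
// paths below are serialized within a warp (warp divergence).
__global__ void divergent(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0)
            out[i] = in[i] * 2.0f;   // even-indexed threads take this path
        else
            out[i] = in[i] + 1.0f;   // odd-indexed threads take this path
    }
}
```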
Architecture
NVIDIA Fermi Architecture
Ø 16 Stream Multiprocessors (SMs)
Ø 32 Cores per SM (16 × 32 = 512 cores in total)
Architecture
Fermi SM
Architecture
Execution in a SM
A total of 32 instructions from one or two warps can be dispatched in each cycle to any two of the four execution blocks within a Fermi SM: two blocks of 16 cores each, one block of four Special Function Units, and one block of 16 load/store units. [Figure: how instructions are issued to the execution blocks]
Architecture
Data Parallel
Ø Data Parallel vs Task Parallel
Ø What to partition? Data or Task?
Ø Massive Data Parallel
  Ø Millions (or more) of threads
  Ø Same instruction, different data elements
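A minimal CUDA sketch of massive data parallelism: one kernel, launched with enough threads to cover millions of elements, where every thread runs the same instructions on a different data element. The kernel and variable names are my own.

```cuda
#include <cstdio>

// Same instruction stream, different data element per thread.
__global__ void scale(const float *in, float *out, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique element index
    if (i < n)
        out[i] = s * in[i];
}

int main() {
    const int n = 1 << 22;                           // ~4 million elements
    float *in, *out;
    cudaMalloc((void**)&in,  n * sizeof(float));
    cudaMalloc((void**)&out, n * sizeof(float));

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;       // millions of threads in total
    scale<<<blocks, threads>>>(in, out, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```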
Architecture
Computing on GPUs
Ø Stream processing and Vectorization (SIMD)
[Figure: a SIMD unit executes the same instructions over an input stream to produce an output stream]
Architecture
GPU Programming Model: Stream
Ø Stream Programming Model
Ø Streams:
  Ø An array of data units
Ø Kernels:
  Ø Take streams as input, produce streams as output
  Ø Perform computation on streams
  Ø Kernels can be linked together
[Figure: Stream → Kernel → Stream]
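A small CUDA sketch (my own) of linking kernels: the output stream of the first kernel becomes the input stream of the second.

```cuda
// Two kernels linked through a stream held in device memory (illustrative).
__global__ void square(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * in[i];
}

__global__ void addOne(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}

void pipeline(const float *d_in, float *d_tmp, float *d_out, int n) {
    int threads = 256, blocks = (n + threads - 1) / threads;
    square<<<blocks, threads>>>(d_in,  d_tmp, n);   // stream -> kernel -> stream
    addOne<<<blocks, threads>>>(d_tmp, d_out, n);   // consumes the previous output stream
}
```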
Architecture
Why Streams?
Ø Ample computation by exposing parallelism
  Ø Streams expose data parallelism
    Ø Multiple stream elements can be processed in parallel
  Ø Pipeline (task) parallelism
    Ø Multiple tasks can be processed in parallel
Ø Efficient communication
  Ø Producer-consumer locality (see the fusion sketch after this list)
  Ø Predictable memory access pattern
Ø Optimize for throughput of all elements, not latency of one
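One common way to exploit producer-consumer locality and raise throughput is to fuse adjacent kernels so the intermediate value stays in a register instead of being written out as a device-memory stream. This fused variant of the two-kernel pipeline sketched earlier is my own illustration:

```cuda
// Fused version of square + addOne: the producer's result is consumed
// immediately from a register, so the intermediate stream never touches
// device memory (illustrative).
__global__ void squareThenAddOne(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float s = in[i] * in[i];   // producer
        out[i]  = s + 1.0f;        // consumer reads the value from a register
    }
}
```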