Parallel Programming Models for Heterogeneous MPSoCs
Pieter van der Wolf Philips Research MPSoC’05 July 11-15, 2005
2 Philips Confidential MPSoC’05
Outline
Introduction
Task Transaction Level interface: TTL
– Abstract interface for streaming in MPSoCs
– Constraint-driven code transformations
– Sea-of-DSP – Smart Camera – Cake / Wasabi
MPSoC Design
– Implement advanced functionalities – Low cost – Power efficient – Flexible
– Increasing design efforts – SW effort overtaking HW effort – Increasing time-to-market
– Raise level of abstraction – Structured design – IP reuse – EDA support
[Figure: log-scale plot of gates/cm2 over process nodes 0.35µ, 0.25µ, 0.18µ, 0.15µ, 0.12µ, 0.1µ, comparing Moore’s Law (59% CAGR), Design Productivity (20-25% CAGR), and Software Productivity (8-10% CAGR)]
Example TV application
[Figure: task graph combining video pixel processing, video decoding, and audio decoding. Inputs: Video, Analog Video, MPEG bit stream, Audio in 1, Audio in 2. Stages: NR, MPEG; ME, MC, DEINT, UPC (picture rate up-conversion); VS, HS (spatial scaling); PEAK, LTI, CTI (sharpness improvement); PCOMP, DA; AC-3 (audio decoding). Outputs: Audio out 1, Audio out 2. Other labels: VCR.]
Many task graphs like this have to be supported
Example MPSoC Hardware
digital TV SoC (Viper2)
[Figure: block diagram of the Viper2 digital TV SoC. Labeled blocks: TM3260 (2x), MIPS3960, QVCP2L, QVCP5L, VIP, MSP, TDCS, MDCS, MBS, VMPG]
Example MPSoC Software Stack
[Figure: layered software stack. Layers: Applications; Middleware (JavaTV, TVPAK, OpenTV, MHP/Java, proprietary ...); Streaming Components; Streaming Infrastructure; Kernel (pSOS, WinCE, JavaOS); Nexperia Hardware]
MPSoC Integration
– Ad hoc approaches – Low-level interfaces
– Synchronization via low-level primitives
– Data access services partly in IP
– Part of IP is specific for underlying communication infrastructure
[Figure: IP module containing both computation and communication, attached to a low-level interface (DTL, AXI, …)]
MPSoC Integration
– Hardware / software IP designer must deal with low-level issues
– IP becomes specific for particular use
– IP integrator must deal with low-level issues
– Infrastructures cannot evolve
Interface Centric Design: TTL
– Parallel application models
– Platform interface
– Structured design & programming – Based on TTL
[Figure: tasks (Task, Task, Task) are mapped via the TTL interface onto the platform infrastructure]
TTL Requirements
– Focus: stream processing applications – Make concurrency and communication explicit
– Make high-level services available
– Easy to use for IP development – Facilitate reuse and integration of IP – Provide implementation freedom
– E.g. supporting fine grain synchronization for on-chip memory
[Figure: IP module split into computation and communication; a shell provides the communication part, and the IP module accesses it through TTL]
TTL in Example Architecture
– Enable communication in heterogeneous MPSoCs
[Figure: example architecture with a CPU and SW Shell (TTL SW-API), an ASP and HW Shell (TTL HW-interface), Tasks 1 to 3, and an Interconnect (DTL, AHB, AXI, OCP)]
TTL Inter-Task Communication
Logical model and terminology
[Figure legend: task, port, channel, empty token, full token, private variable with value, TTL interface]
Example: Message Passing Interface
Producer side
– Write data into channel connected to port
Consumer side
– Read data from channel connected to port
– Appropriate for modeling application?
– Appropriate for implementation on architecture?
TTL Interface Types
– Different applications
– Different implementation styles
– Each interface type offers narrow interface
– Each interface type supports particular communication style – Offer multiple interface types in one framework – Based on single model for interoperability
TTL Interface Types
[Figure: task graph with tasks T1 to T7]
TTL Interface Types
Acronym   Full name
CB        Combined Blocking
RB        Relative Blocking
RN        Relative Non-blocking
DBI       Direct Blocking In-order
DNI       Direct Non-blocking In-order
DBO       Direct Blocking Out-of-order
DNO       Direct Non-blocking Out-of-order
Interface Type CB
Producer side
– Write vector of size values into channel
Consumer side
– Read vector of size values from channel
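A minimal C++ sketch of the CB style described above, in which every call combines synchronization and data transfer. The class name CbChannel, its member signatures, and the use of std::mutex and std::condition_variable are assumptions made for illustration; this is not the TTL API, whose calls elsewhere in this deck take a port argument.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical stand-in for a channel with interface type CB.
// Names and signatures are illustrative, not the actual TTL API.
template <typename T>
class CbChannel {
public:
  explicit CbChannel(std::size_t capacity) : capacity_(capacity) {}

  // Blocks until `size` empty tokens are available, then copies the
  // caller's private variables into the channel in the same call.
  void write(const T* values, std::size_t size) {
    assert(size <= capacity_);  // otherwise the call could never complete
    std::unique_lock<std::mutex> lock(mutex_);
    changed_.wait(lock, [&] { return capacity_ - tokens_.size() >= size; });
    tokens_.insert(tokens_.end(), values, values + size);
    changed_.notify_all();
  }

  // Blocks until `size` full tokens are available, then copies them
  // into the caller's private variables and frees the tokens.
  void read(T* values, std::size_t size) {
    assert(size <= capacity_);
    std::unique_lock<std::mutex> lock(mutex_);
    changed_.wait(lock, [&] { return tokens_.size() >= size; });
    for (std::size_t i = 0; i < size; ++i) {
      values[i] = tokens_.front();
      tokens_.pop_front();
    }
    changed_.notify_all();
  }

private:
  const std::size_t capacity_;
  std::deque<T> tokens_;  // full tokens, oldest first
  std::mutex mutex_;
  std::condition_variable changed_;
};
```

In this sketch a producer thread would call write(values, n) and a consumer thread read(values, n); both calls copy between the caller's private variables and the channel, so every value is copied at least once on each side.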
Pros / Cons Interface Type CB
+ Easy to use
+ Reusable tasks
– Copying overhead if private variables are not in local buffers
– Smart compiler may help in some cases
– If local buffers:
  – Large tokens / vectors → large local buffers
  – Small tokens / vectors → large synchronization overhead
[Figure: CPU, SW Shell, HW Shell, ASP, Task 1, Task 2, TTL, Interconnect, and Mem, with numbered markers 1 to 4]
Separate Synchronization and Data Transfer
Producer: acquireRoom (2), store/dereference, releaseData (2)
Consumer: acquireData (2), load/dereference, releaseRoom (2)
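The acquire / store / release sequence above can be sketched as a small single-threaded model. The class RnChannel and its non-blocking try... functions are hypothetical and only mirror the call names on the slide; the real TTL functions take a port argument and, for the blocking variants, suspend the task instead of returning false.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-threaded model of a channel with separated
// synchronization and data transfer. Not the actual TTL API.
template <typename T>
class RnChannel {
public:
  explicit RnChannel(std::size_t capacity) : buf_(capacity) {}

  // Producer: try to acquire `count` more empty tokens (non-blocking).
  bool tryAcquireRoom(std::size_t count) {
    if (full_ + room_ + count > buf_.size()) return false;
    room_ += count;
    return true;
  }
  // Producer: store at `offset` relative to the oldest acquired empty token.
  void store(std::size_t offset, const T& value) {
    assert(offset < room_);
    buf_[(writePos_ + offset) % buf_.size()] = value;
  }
  // Producer: release the `count` oldest acquired tokens as full tokens.
  void releaseData(std::size_t count) {
    assert(count <= room_);
    writePos_ = (writePos_ + count) % buf_.size();
    room_ -= count;
    full_ += count;
  }

  // Consumer: try to acquire `count` more full tokens (non-blocking).
  bool tryAcquireData(std::size_t count) {
    if (data_ + count > full_) return false;
    data_ += count;
    return true;
  }
  // Consumer: load from `offset` relative to the oldest acquired full token.
  T load(std::size_t offset) const {
    assert(offset < data_);
    return buf_[(readPos_ + offset) % buf_.size()];
  }
  // Consumer: release the `count` oldest acquired tokens as empty tokens.
  void releaseRoom(std::size_t count) {
    assert(count <= data_);
    readPos_ = (readPos_ + count) % buf_.size();
    data_ -= count;
    full_ -= count;
  }

private:
  std::vector<T> buf_;
  std::size_t writePos_ = 0;  // oldest empty token acquired by the producer
  std::size_t readPos_ = 0;   // oldest full token in the channel
  std::size_t room_ = 0;      // empty tokens currently acquired by the producer
  std::size_t data_ = 0;      // full tokens currently acquired by the consumer
  std::size_t full_ = 0;      // full tokens in the channel, acquired or not
};
```

With this model a producer can acquire two tokens once, store into them in any order, and release them together, so the synchronization cost is paid per block rather than per value.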
Interface Types RB and RN
Producer side
– Acquire count empty tokens, blocking (RB) / non-blocking (RN)
– Store vector of size values into the tokens with offset..offset+size-1 to the oldest acquired token
– Release count oldest acquired tokens as full tokens
Pros / Cons Interface Types RB / RN
+ Coarse grain synchronization with fine grain data transfer
– Low synchronization overhead with small local buffers
+ Out-of-order data accesses
– Reduce cost of private variables
+ Load only subset of tokens from channel
– Reduce cost of data transfers
– Less abstract than CB
– Increases programming effort – Makes tasks less reusable
– Inefficiencies upon data transfers
– Function call, access to channel admin, address calculations
– Copying may still occur
Interface Types DBI and DNI
Producer side
– Acquire empty token, blocking (DBI) / non-blocking (DNI)
– Assign value to (part of) token
– Release oldest acquired token as full token
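A minimal sketch of direct, in-order access in the spirit of DNI: acquiring a token yields a pointer into channel memory, and the task assigns to (part of) the token through that pointer. DniChannel, Token, and all member names are hypothetical and not taken from the TTL API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical token type with two parts that can be assigned separately.
struct Token {
  int a = 0;
  int b = 0;
};

// Hypothetical single-threaded model of direct, in-order token access.
template <typename T>
class DniChannel {
public:
  explicit DniChannel(std::size_t capacity) : buf_(capacity) {}

  // Producer: pointer to the next empty token, or nullptr if none is free.
  T* tryAcquireRoom() {
    if (full_ + room_ == buf_.size()) return nullptr;
    T* token = &buf_[(writePos_ + room_) % buf_.size()];
    ++room_;
    return token;
  }
  // Producer: release the oldest acquired token as a full token.
  void releaseData() {
    assert(room_ > 0);
    writePos_ = (writePos_ + 1) % buf_.size();
    --room_;
    ++full_;
  }
  // Consumer: pointer to the next full token, or nullptr if none is available.
  const T* tryAcquireData() {
    if (data_ == full_) return nullptr;
    const T* token = &buf_[(readPos_ + data_) % buf_.size()];
    ++data_;
    return token;
  }
  // Consumer: release the oldest acquired token as an empty token.
  void releaseRoom() {
    assert(data_ > 0);
    readPos_ = (readPos_ + 1) % buf_.size();
    --data_;
    --full_;
  }

private:
  std::vector<T> buf_;
  std::size_t writePos_ = 0, readPos_ = 0;
  std::size_t room_ = 0;  // empty tokens held by the producer
  std::size_t data_ = 0;  // full tokens held by the consumer
  std::size_t full_ = 0;  // full tokens in the channel, held or not
};
```

Because the task works on channel memory directly, no copy into a private variable is needed, at the price of exposing addresses to the task code.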
Pros / Cons Interface Types DBI / DNI
+ Coarse grain synchronization with fine grain data transfer
+ Out-of-order data accesses for acquired token(s)
+ Load only part of token from channel
+ Direct data accesses
– Efficient data transfers
– Less abstract than CB / RB / RN
– Exposes memory addresses – Makes tasks less reusable
– No vector operations
– Would complicate interface / expose channel implementation
Interface Types DBO and DNO
Producer side
– Acquire empty token, blocking (DBO) / non-blocking (DNO)
– Assign value to (part of) token
– Release token as the next full token
+ Out-of-order release supports efficient use of memory
– More complex implementation of the channel
TTL Interface Types
TTL
  Combined
    CB: vector
  Separated
    Indirect
      RB / RN: in-order, vector
    Direct
      DBI / DNI: in-order, scalar
      DBO / DNO: out-of-order, scalar
Use of TTL Interface Types
targeted applications
– Based on platform architecture and characteristics of applications
– Allow designer to trade “ease of design” for “efficiency of implementation”
– Mapping technology can automate design optimization
– TTL → TTL transformations on task code
– Share TTL-based design technology – Reuse IP modules across platforms
TTL Multi-Tasking Interface
TTL offers three task types:
1. Process
2. Co-routine
3. Actor
TTL APIs and Implementations
– C++ API – C API – Hardware interface
– Functional modeling and verification of TTL application models in C++ / C
– Sea-of-DSP – Smart Camera – Cake / Wasabi
Outline
– Abstract interface for streaming in MPSoCs
– Constraint-driven code transformations
– Sea-of-DSP – Smart Camera – Cake / Wasabi
Problem
How to efficiently program applications on platforms using the TTL interface?
functions varies on different platforms
interface types varies on one platform
Example IQ→IZZ Using CB
IQ task:

    void IQ::main() {
      while (true) {
        for (int j = 0; j < vi; j++) {
          for (int k = 0; k < hi; k++) {
            VYApixel Cout[64];
            for (int l = 0; l < 64; l++) {
              VYApixel Cin;
              read(CinP, Cin);
              Cout[l] = QT[t][l] * Cin;
            }
            write(CoutP, Cout, 64);
          }
        }
      }
    }

IZZ task:

    void IZZ::main() {
      while (true) {
        VYApixel Cin[64];
        VYApixel Cout[64];
        read(CinP, Cin, 64);
        for (int i = 0; i < 64; i++) {
          Cout[zigzag[i]] = Cin[i];
        }
        write(CoutP, Cout, 64);
      }
    }

[Figure labels: Channel<VYApixel> between IQ and IZZ; Cin[64], Cout[64]]
1x write 1x read
Efficiency of IQ→IZZ Using CB (HW)
[Figure: CPU with SW Shell, HW Shell, Interconnect, and MEM; labels: Channel<VYApixel>, Cin[64], Cout[64]]
1x write, 1x read
Local memory is expensive in hardware
Transform IQ→IZZ Using RB (1)
[Figure: CPU with SW Shell, HW Shell, Interconnect, and MEM; labels: Channel<VYApixel>, Cout, Cin]
Producer: 1x acq/rel, 64x store
Consumer: 1x acq/rel, 64x load
Transform IQ→IZZ Using RB (2)

    void IQ::main() {
      while (true) {
        for (int j = 0; j < vi; j++) {
          for (int k = 0; k < hi; k++) {
            VYApixel Cout[64];
            for (int l = 0; l < 64; l++) {
              VYApixel Cin;
              read(CinP, Cin);
              Cout[l] = QT[t][l] * Cin;
            }
            write(CoutP, Cout, 64);
          }
        }
      }
    }

[Figure labels: Channel<VYApixel>; Cin[64], Cout[64]]
Transform IQ→IZZ Using RB (3)

    void IQ::main() {
      while (true) {
        for (int j = 0; j < vi; j++) {
          for (int k = 0; k < hi; k++) {
            reAcquireRoom(CoutP, 64);
            for (int l = 0; l < 64; l++) {
              VYApixel Cin;
              read(CinP, Cin);
              store(CoutP, l, QT[t][l] * Cin);
            }
            releaseData(CoutP, 64);
          }
        }
      }
    }

Producer: 1x acq/rel, 64x store
[Figure labels: Channel<VYApixel>; QT[t][l]*Cin; Cin[64]]
Transform IQ→IZZ Using RB (4)

    void IZZ::main() {
      while (true) {
        VYApixel Cin[64];
        VYApixel Cout[64];
        read(CinP, Cin, 64);
        for (int i = 0; i < 64; i++) {
          Cout[zigzag[i]] = Cin[i];
        }
        write(CoutP, Cout, 64);
      }
    }

[Figure labels: Channel<VYApixel>; QT[t][l]*Cin; Cin[64]]
Transform IQ→IZZ Using RB (5)

IQ task:

    void IQ::main() {
      while (true) {
        for (int j = 0; j < vi; j++) {
          for (int k = 0; k < hi; k++) {
            reAcquireRoom(CoutP, 64);
            for (int l = 0; l < 64; l++) {
              VYApixel Cin;
              read(CinP, Cin);
              store(CoutP, l, QT[t][l] * Cin);
            }
            releaseData(CoutP, 64);
          }
        }
      }
    }

IZZ task:

    void IZZ::main() {
      while (true) {
        VYApixel Cout[64];
        reAcquireData(CinP, 64);
        for (int i = 0; i < 64; i++) {
          VYApixel Cin;
          load(CinP, i, Cin);
          Cout[zigzag[i]] = Cin;
        }
        write(CoutP, Cout, 64);
        releaseRoom(CinP, 64);
      }
    }

Producer: 1x acq/rel, 64x store
Consumer: 1x acq/rel, 64x load
[Figure labels: Channel<VYApixel>; QT[t][l]*Cin; Cin]
Outline
– Abstract interface for streaming in MPSoCs
– Constraint-driven code transformations
– Sea-of-DSP – Smart Camera – Cake / Wasabi
Implementation of TTL
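[Figure: tasks on top of the TTL interface, which is implemented by the platform infrastructure. TTL comprises inter-task communication (ITC) functions (func = acquire, release, etc.) and multi-tasking (scheduler)]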
Sea of DSP Architecture
[Figure: array of 16 tiles with External Interfaces and a Microprocessor Interface]
Mapping on Sea of DSP
[Figure: applications (SRC, MP3, Radio, APP, APP, …) mapped onto tiles, each shown as a DSP with memory]
Results for Different Interface Types
[Figure: MP3 task with Input and Output]

TTL IF Type   #Cycles    Part in TTL   #Memory words
CB            45579603   2.9%          12493
RB            45551243   2.8%          12494
RN            45505950   2.2%          12365
DBI           45152454   1.1%          9162
DNI           45108086   0.5%          9041
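To put the table in relative terms, the following sketch computes the saving of one interface type against another. The figures are copied from the table; the names Result, kResults, and savingPercent are introduced here for illustration only.

```cpp
#include <cassert>

// One row of the results table for the MP3 example.
struct Result {
  const char* ifType;
  long cycles;
  int memoryWords;
};

// Values copied from the table above.
constexpr Result kResults[] = {
    {"CB", 45579603L, 12493},
    {"RB", 45551243L, 12494},
    {"RN", 45505950L, 12365},
    {"DBI", 45152454L, 9162},
    {"DNI", 45108086L, 9041},
};

// Relative saving of `value` with respect to `baseline`, in percent.
constexpr double savingPercent(double baseline, double value) {
  return 100.0 * (baseline - value) / baseline;
}
```

Comparing DNI (row 4) with CB (row 0) this way gives roughly 1.0% fewer cycles and roughly 27.6% fewer memory words, so for this example the choice of interface type affects memory use far more than cycle count.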
Results for Varying Channel Size (CB)
modified
buffer has been reduced in size
[Figure: MEM vs. #cycles. Axes: #cycles (x 10^7, ticks 4.54 to 4.63) and #words MEM (ticks 10000 to 13000). Data points: Full Frame, 1/2 Frame, 1/4 Frame, 1/8 Frame, 1/16 Frame, 1/32 Frame]
Results: Sub-frame Decoding (RN)
and private buffers are reduced in size
be modified
interface types
[Figure: MEM vs. #cycles. Axes: #cycles (x 10^7, ticks 4.4 to 4.8) and #words MEM (ticks 5000 to 13000). Data points: Full Frame, 1/2 Frame, 1/36 Frame, 1/4 Frame]
Smart Cameras Application Areas
Consumer Automotive Mobile Surveillance
EC funded CAMELLIA project (IST-34410)
Architecture of Smart Imaging Core
[Figure: ARM 9xx CPU with SW Shell (TTL API) for control; Smart Imaging Coprocessor (pixel processing, motion segmentation) with HW Shell (TTL interface); Motion Estimator Coprocessor (VLIW) with SW Shell (TTL API); all connected to Memory through the Interconnect]
TTL shell performance
– reAcquireRoom/Data: 5 cycles
– releaseRoom/Data: 7 cycles
– load: 5 + 2n cycles
– store: 5 + n cycles
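The cycle counts above can be combined into a simple cost model per block of data. This is a sketch under two assumptions that the slide does not state: n is the number of values moved by one load or store call, and the costs of successive calls simply add up. Function names are illustrative.

```cpp
#include <cassert>

// Cycle costs quoted on the slide for the TTL shell.
// Assumption: n is the number of values moved by one load/store call.
constexpr int kAcquireCycles = 5;  // reAcquireRoom / reAcquireData
constexpr int kReleaseCycles = 7;  // releaseRoom / releaseData
constexpr int loadCycles(int n) { return 5 + 2 * n; }
constexpr int storeCycles(int n) { return 5 + n; }

// Producer cost for one block: one acquire, `calls` stores of `n`
// values each, one release. Computation time is not included.
constexpr int producerBlockCycles(int calls, int n) {
  return kAcquireCycles + calls * storeCycles(n) + kReleaseCycles;
}

// Consumer cost for one block: one acquire, `calls` loads of `n`
// values each, one release. Computation time is not included.
constexpr int consumerBlockCycles(int calls, int n) {
  return kAcquireCycles + calls * loadCycles(n) + kReleaseCycles;
}
```

Under these assumptions, a producer that issues 64 scalar stores per block spends 396 cycles in the shell, while a single 64-value vector store would take 81 cycles; the fixed 5-cycle overhead per call dominates for scalar transfers.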
TTL Implementation for ME
[Figure: A|RT Designer flow. Function input: C-code of ME algorithm and C-code of TTL implementation (algorithm + TTL implementation). Architecture input: data path description (FUs + Device I/O). Output: VLIW microcode for an architecture with µ-code FSM, VLIW ctrl, ACU, ROM, RAM, ALU, ASU, I/O, distributed register files, and a communication bus/network]
Cake / Wasabi
with homogeneous bias
[Figure: Cake / Wasabi block diagram. Labels: MIPS; TM-video; 9x; L2 cache 2MB; DDR2 interface, 64-bit; MSVD (multi-standard video decoder); MBVS (memory based video scaler); CPIPE (HD-p output); XETAL (image vector processor); TIC65; CTL12; Tunnel; PCI-Express x4 (four instances)]
TTL Implementation on Cake / Wasabi
                                      Trimedia       MIPS
Lines of code TTL (CB + DBI)          773            773
Code size TTL (CB + DBI)              14 kB          5 kB
Lines of code TTL (all IF types)      1529           1529
Code size TTL (all IF types)          29 kB          12 kB
Cycles per sync operation             20 (TM - TM)   20 (MIPS - MIPS)
(TTL on top of TRT run-time system)
Task-Level Interface Standardization
Industry-wide standardization needed
– Enable eco-system of IP providers
– Support development of function-specific IP – Support integration of IP
Conclusion
TTL supports structured and efficient design and integration
– Decreases design effort for task programmer – Facilitates reuse and integration of IP – Provides implementation freedom for platform infrastructure
– Automated transformations support design optimizations – Closes gap between specification and implementation – Decreases design effort for system integrator
– Different architectures – In hardware and software