
Venezia: a Scalable Multicore Subsystem for Multimedia Applications
Takashi Miyamori
Toshiba Corporation
MPSoC 2008

Outline
• Background
• Venezia Hardware Architecture
• Venezia Software Architecture
• Evaluation Chip and Performance Results
• Summary

Background
• Today's mobile multimedia devices support many audio and video CODECs.
  H.264, MPEG-4, VC-1, AC-3, MP3, WMA, etc.
• The image size to be processed is increasing rapidly.
  QCIF, CIF, QVGA, VGA, 720p, 1080i, 1080p, etc.

Current SoC (T5V)
• HW solution: new designs are required for new CODECs.
• 90nm CMOS (16Mbit eDRAM x2)
• H.264 decode @CIF 30fps, MPEG-4 encode/decode @VGA 30fps
[Figure: T5V block diagram. Labels: CPU, DMAC, VCORE (Video/JPEG), VMCB, VHMCB, VDCT, VHDCT, VMEF, VLZP, VHLZP.]

Our Approach: Scalable Multicore Processor
[Figure: "Venezia" configurations of CPU + L1$ cores sharing an L2$. Axes: Performance (QVGA, VGA, 720p) versus Number of MPEs (1, 4, 8). Labels: Scalability; Codec FW; Binary Compatibility.]

Homogeneous vs. Heterogeneous
(from “MPSoC Architecture Trade-offs for Multimedia Applications,” MPSoC ’07)

Architecture types, from homogeneous to heterogeneous: Mx, MDx, MDA, MDHx
  M: Main CPU
  D: DSP/Media Processor
  A: Accelerator
  H: Hardware Engine

Ratings, listed from the homogeneous end to the heterogeneous end:
  Programmability / Scalability     Very good, Good, Fair
  Perf./Cost or Perf./Power         Fair, Good, Very Good

Examples:
• Core 2 (M2, M4), Xbox 360 CPU (M3), MPCore (M4), Niagara (M8)
• Cell (MD8), SB3000 (MD4), Philips Cake/Wasabi
• Uniphier
• Most of current SoCs

Venezia Processor Architecture

Venezia Architecture
• Multi RISC Cores & Multi MPEs
• Cache-Based System

Media Processing Engine
• 3-way VLIW Processor
• SIMD Instruction Support
• Small Size: about 1.3mm2 @65nm

[Figure: Headquarters and Media Processing Engines (MPEs), built from MeP Cores, VLIW Cop. units, and per-core I$/D$, sharing an L2 Cache.]
MeP: Media embedded Processor

MPE (Media Processing Engine)
• MeP (Media embedded Processor) Core
  – 32-bit RISC Processor
  – 5 Pipeline Stages
  – 3 Low-power Mechanisms
• IVC2 (Coprocessor)
  – Extension of MeP Core
  – 64-bit SIMD Operations
  – 3-way VLIW (Core + Cop.A + Cop.B)
[Figure: MPE block diagram. Labels: I$ + Ctrl., D$ + Ctrl., Dec., Reg., Cop. Reg., ALU, Mul., Acc.; pipes: Core Pipe, Cop. PipeA, Cop. PipeB.]

MPE Instruction Formats
• 16b/32b Instruction Mode (Core Mode)
  – core16
  – core32
  – cop.a/b32
• 64b Instruction Mode (VLIW Mode)
  – core16 + cop.a20 + cop.b28
  – core32 + cop.b28
  – cop.a28 + cop.b28
• Slot roles annotated in the figure: "Loop Control, Address Calcu., Load, and Store" and "Data Calculation"
• The instruction mode is switched by the special subroutine call instruction (bsrv).
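The "64-bit SIMD Operations" bullet can be illustrated with a small model of a packed add. This is only a sketch of the general technique, not the IVC2 instruction set: the function name `packed_add` is hypothetical, and wraparound (rather than saturating) lane arithmetic is an assumption.

```python
MASK64 = (1 << 64) - 1


def packed_add(a: int, b: int, lane_bits: int) -> int:
    """Add two 64-bit words lane by lane.

    Assumption: each lane wraps around on overflow and never carries
    into its neighbour. Real IVC2 semantics may differ.
    """
    if lane_bits <= 0 or 64 % lane_bits != 0:
        raise ValueError("lane width must divide 64")
    lane_mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 64, lane_bits):
        lane_a = (a >> shift) & lane_mask
        lane_b = (b >> shift) & lane_mask
        # Mask the sum so the carry out of this lane is discarded.
        result |= ((lane_a + lane_b) & lane_mask) << shift
    return result & MASK64


# Same operands, different lane widths: 8 lanes of 8 bits vs 4 lanes of 16 bits.
as_bytes = packed_add(0x00FF, 0x0001, lane_bits=8)
as_halfwords = packed_add(0x00FF, 0x0001, lane_bits=16)
```

With 8-bit lanes the low lane overflows and wraps to zero; with 16-bit lanes the same operands produce an ordinary carry into bit 8.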

IVC2 64-bit SIMD Datapath
[Figure: IVC2 datapath with two pipes (Pipe 0, Pipe 1) and stages X1, X2, X3. Labels: General-purpose Registers, 64 bits x 32, 5R3W; per-pipe ALU (ALU/Shift/Compare/Pack/Unpack) on 8 bits x 8, 16 bits x 4, 32 bits x 2, and 64 bits; Multiplier; per-pipe Add/Subtract/Shift and Accumulator on 32 bits x 8 and 64 bits x 4.]

Cache Memory System
• L1 Inst./Data Caches
  – 2-way Set Assoc.
  – 8/16KB
  – 64B Line Size
• L2 Cache
  – 64/128/256/512KB
  – 4-way Set Assoc.
  – 256B Line Size
• Prefetch Functions
  – L1 I$ Auto Prefetch
  – L1 D$ Prefetch Inst.
  – L2 Interconnect Buffer: L2 $ → Buffer
  – L2 $ Prefetch: Main Mem. → L2 $
[Figure: MPEs and Headquarters, each with L1 I$ and L1 D$, connect over 64b links to 512b Buffers in the Interconnect, which reaches the L2 Cache over a 256b Datapath.]
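The cache parameters above determine the number of sets and the address split. The following is a minimal arithmetic sketch using the sizes, associativities, and line sizes from the slide; the function name `cache_geometry` and the returned field names are hypothetical.

```python
def cache_geometry(size_bytes: int, ways: int, line_bytes: int) -> dict:
    """Derive line count, set count, and address field widths.

    Assumes power-of-two sizes, as in the configurations on the slide.
    """
    lines = size_bytes // line_bytes
    sets = lines // ways
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # log2(line size)
        "index_bits": sets.bit_length() - 1,         # log2(number of sets)
    }


KB = 1024
l1_small = cache_geometry(8 * KB, ways=2, line_bytes=64)
l1_large = cache_geometry(16 * KB, ways=2, line_bytes=64)
l2_large = cache_geometry(512 * KB, ways=4, line_bytes=256)
```

For example, the 8KB 2-way L1 has 128 lines in 64 sets, and the 512KB 4-way L2 has 2048 lines in 512 sets.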

Comparing Memory Systems for Chip Multiprocessors (Stanford Univ., ISCA’07)
• CC: Cache-Coherent Model
  – 32KB 2-way assoc. cache
• STR: Streaming Model
  – 24KB local memory and 8KB 2-way assoc. cache
• 512KB L2 cache
• Both models perform and scale equally well.
• “Non-allocate store to cache” can reduce memory traffic.
  – Ex. Prepare for Store (PFS) instruction of MIPS32
  – MPEG-2: Traffic due to write misses was reduced by 56%.
• Streaming programming techniques, such as blocking and locality-aware scheduling, are as efficient for the cache model as for the streaming model.

L2 Cache
• Throughput: 512b / 2 cycles
• Latency: 10 cycles
[Figure: MPEs connect over 64b links to 512b Buffers in the Interconnect; an Arbiter feeds the 256b Datapath to the L2 Cache, which contains an 8-entry 512b Queue, Tag Check, Tag SRAM, and Data SRAM at 1/2 CPU Freq., with Refill and Write Back paths to the Main Bus.]
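The L2 throughput figure can be turned into a peak bandwidth with simple arithmetic. This sketch assumes that the "2 cycles" are counted at the 333MHz MPE clock listed for the VeneziaEX evaluation chip; the function name `l2_peak_bandwidth` is hypothetical, and the result is an upper bound, not a measured number.

```python
def l2_peak_bandwidth(bits_per_transfer: int = 512,
                      cycles_per_transfer: int = 2,
                      clock_hz: int = 333_000_000) -> float:
    """Peak L2 bandwidth in bytes per second.

    Assumption: cycles are MPE clock cycles (333MHz on VeneziaEX).
    """
    bytes_per_cycle = bits_per_transfer / 8 / cycles_per_transfer
    return bytes_per_cycle * clock_hz


peak = l2_peak_bandwidth()
print(f"{peak / 1e9:.3f} GB/s")
```

Note that 512 bits equals one 64B L1 line, so under this assumption the L2 can supply at most one L1 line every two cycles.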

Software Hierarchy of Venezia
• V-Kernel: a simple, lightweight operating system kernel
[Figure: software stack, top to bottom: Media FW; V-Thread Execution Framework and V-Kernel Library; V-Kernel Base (Memory Management and one Task Mng. per processor); HW: Venezia Subsystem (Headquarters, MPEs, Memory).]

Multigrain Parallelism
• MPE Exploits Instruction / Data Level Parallelism
• Multicore Architecture Exploits Task and V-Thread Level Parallelism

Granularity  Unit      Level              Programming              Hardware             Example
Coarse       Task      Application level  Multi-task programming   Headquarters & MPEs  Audio Decode, Video Decode
             V-Thread  Function level     Multi-thread programming Headquarters & MPEs  MC, IQ/IT
             VLIW      Instruction level  Within MPE programming   MPE
Fine         SIMD      Data level         Within MPE programming   MPE

V-Thread Execution Model
• Scalability and Compatibility by V-Thread
  – Parallel Execution by MPEs
  – Abstraction of MPE Computing Resources
[Figure: Media FW Tasks on the Headquarters hand V-Threads to the V-Thread Scheduler, which runs them on the MPEs.]
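The idea that V-Threads abstract the MPE count can be sketched with a toy dispatcher: the same list of V-Threads runs unchanged on 1, 4, or 8 MPEs, and only the elapsed time changes. This is an illustrative model, not the Venezia scheduler; the function name `run_v_threads`, the greedy policy, and the per-thread costs are assumptions, and dependencies between V-Threads are ignored.

```python
import heapq


def run_v_threads(costs, num_mpes):
    """Greedy dispatch: each V-Thread goes to the MPE that becomes free first.

    costs    -- hypothetical execution time of each V-Thread, in issue order
    num_mpes -- number of MPEs available
    Returns (makespan, [(thread_id, mpe_id, start_time), ...]).
    """
    if num_mpes < 1:
        raise ValueError("need at least one MPE")
    free_at = [(0, mpe) for mpe in range(num_mpes)]  # (time free, MPE id)
    heapq.heapify(free_at)
    schedule = []
    for thread_id, cost in enumerate(costs):
        start, mpe = heapq.heappop(free_at)
        schedule.append((thread_id, mpe, start))
        heapq.heappush(free_at, (start + cost, mpe))
    makespan = max(time for time, _ in free_at)
    return makespan, schedule


v_threads = [1] * 8  # eight equal-cost V-Threads
for mpes in (1, 4, 8):
    total, _ = run_v_threads(v_threads, mpes)
    print(mpes, "MPEs:", total, "time units")
```

The caller never names an MPE, which is the property the slide attributes to V-Threads: the firmware is written once against the thread abstraction.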

Granularity of V-Threads in H.264 Decode
[Figure: macroblock processing between two Video Signal Processor Tasks is split into V-Thread 0 to V-Thread 4, covering MVP and BS, the luma stages IP/IQT (L), MC (L), DBF (L), and the chroma stages IP/IQT (C), MC (C), DBF (C).]
• VGA: 1000 V-Threads / Frame
• 720p: 3000 V-Threads / Frame

MVP: Motion Vector Prediction
BS: Boundary Strength
MC(L): Motion Compensation (and Weighted Prediction) for Luma
MC(C): Motion Compensation (and Weighted Prediction) for Chroma
IP/IQT(L): Intra Prediction (and Inverse Quantization, Inverse Transform) for Luma
IP/IQT(C): Intra Prediction (and Inverse Quantization, Inverse Transform) for Chroma
DBF(L): De-Blocking Filter for Luma
DBF(C): De-Blocking Filter for Chroma
EoM: The end of the macroblock process

Spatial Dependency of V-Threads
[Figure: MB Data Dependency among V-Thread 00, V-Thread 01, V-Thread 02, V-Thread 10, V-Thread 11, V-Thread 12.]
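Spatial dependencies between macroblocks limit how many V-Threads can run at once. The slide does not state the exact dependency pattern, so the sketch below assumes the neighbourhood commonly used in H.264 decoding (left, top-left, top, and top-right macroblocks) and computes the earliest step at which each macroblock could start. The function names are hypothetical.

```python
from collections import Counter


def wavefront_steps(width_mbs: int, height_mbs: int):
    """Earliest start step per macroblock, assuming each one waits for its
    left, top-left, top, and top-right neighbours (an assumption, not a
    statement of the Venezia firmware's actual dependencies)."""
    step = [[0] * width_mbs for _ in range(height_mbs)]
    for y in range(height_mbs):
        for x in range(width_mbs):
            neighbours = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
            ready = [
                step[ny][nx] + 1
                for nx, ny in neighbours
                if 0 <= nx < width_mbs and 0 <= ny < height_mbs
            ]
            step[y][x] = max(ready, default=0)
    return step


def max_parallel(step) -> int:
    """Largest number of macroblocks sharing the same start step."""
    counts = Counter(s for row in step for s in row)
    return max(counts.values())


grid = wavefront_steps(4, 3)
for row in grid:
    print(row)
```

Under this assumption, macroblock (x, y) can start at step x + 2y, so the available parallelism grows with frame width, which is one reason larger frames can keep more MPEs busy.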

V-Kernel Shared Memory Model
• Memory states: INVALID, PUBLIC, PRIVATE, PROTECTED
• State transitions:
  – allocate_public_memory() / free_public_memory(): between INVALID and PUBLIC
  – open_private_memory() / close_private_memory(): between PUBLIC and PRIVATE
  – open_protected_memory() / close_protected_memory(): between PUBLIC and PROTECTED
• Transition annotations in the figure: "L1 Cache Invalidate" and "L1 Cache Write Invalidate"
• Cache coherency among MPEs is maintained by software.

            L1 $ read   L1 $ write   L2 Direct read   L2 Direct write
PUBLIC      NG          NG           OK               OK
PRIVATE     OK          OK           NG               NG
PROTECTED   OK          NG           (OK)             NG

V-Kernel and V-Thread Execution Framework
• Media FW runs on the V-Kernel User API and the V-Thread Execution Framework API
[Figure: software stack. Media FW sits on the V-Thread Execution Framework (V-Thread Syntax & Execution, V-Thread Dispatch) and the V-Kernel Library; below them, V-Kernel Base (Memory Management and one Task Mng. per processor); HW: Venezia Subsystem (Headquarters, MPEs, Memory). Tasks shown include a V-Thread Task and, e.g., a Signal Processing Task.]
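The shared memory model can be sketched as a small state machine that enforces the access table. The API names and the OK/NG entries are taken from the slide; the transition directions are inferred from those names, the `Region` class and access labels are hypothetical, and the parenthesised "(OK)" entry is treated as allowed here. The L1 invalidate and write-back actions on close are not modelled.

```python
from enum import Enum


class State(Enum):
    INVALID = "INVALID"
    PUBLIC = "PUBLIC"
    PRIVATE = "PRIVATE"
    PROTECTED = "PROTECTED"


# (API call, state before) -> state after. Directions are inferred from the API names.
TRANSITIONS = {
    ("allocate_public_memory", State.INVALID): State.PUBLIC,
    ("free_public_memory", State.PUBLIC): State.INVALID,
    ("open_private_memory", State.PUBLIC): State.PRIVATE,
    ("close_private_memory", State.PRIVATE): State.PUBLIC,
    ("open_protected_memory", State.PUBLIC): State.PROTECTED,
    ("close_protected_memory", State.PROTECTED): State.PUBLIC,
}

# Accesses marked OK in the slide's table; "(OK)" is treated as OK (assumption).
ALLOWED = {
    State.INVALID: set(),
    State.PUBLIC: {"l2_direct_read", "l2_direct_write"},
    State.PRIVATE: {"l1_read", "l1_write"},
    State.PROTECTED: {"l1_read", "l2_direct_read"},
}


class Region:
    """Hypothetical model of one shared memory region."""

    def __init__(self):
        self.state = State.INVALID

    def call(self, api_name: str) -> None:
        key = (api_name, self.state)
        if key not in TRANSITIONS:
            raise RuntimeError(f"{api_name}() is not valid in state {self.state.value}")
        self.state = TRANSITIONS[key]

    def can(self, access: str) -> bool:
        return access in ALLOWED[self.state]


region = Region()
region.call("allocate_public_memory")
region.call("open_private_memory")
print(region.state.value, region.can("l1_write"), region.can("l2_direct_read"))
```

Because coherency is maintained by software, a model like this is mainly useful as a debugging aid: it flags an access that would bypass or go through the L1 cache in a state where the table marks it NG.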

VeneziaEX: Evaluation Chip of Venezia Architecture

Technology      65nm CMOS, 8LM
Die Size        5.06mm x 5.06mm
Frequency       333MHz (MPE, L2$ Logic)
                166MHz (L2$ SRAM, Bus I/F)
Supply Voltage  2.5V (I/O)
                1.2V (Core)
                1.2V/0.95V/0V (SVC Output)
L1 Cache        8KB (Instruction), 8KB (Data)
                2-way, FIFO, 64B Line
L2 Cache        512KB (unified), 4-way, LRU, 256B Line

[Figure: die photo. Labels: MPE x 8, L2$ SRAM 4Mbit, L2$ Controller, Bus I/F, PLL, D$, I$, 5R3W RegFile.]

Performance Scalability
• H.264 720p Decode
• Scalability Bottlenecks
  – I-Picture: VLD
  – P-Picture: L2 Cache Miss Penalty
[Figure: Frame Rate [fps] versus Number of MPEs, annotated with 4.1x.]
