SLIDE 1
Algorithmen für die Echtzeitgrafik

Daniel Scherzer

scherzer@cg.tuwien.ac.at

LBI Virtual Archeology

SLIDE 2

Why Parallel Programming?

Applications

SLIDE 3

Future Apps Reflect a Concurrent World

“Supercomputing applications”

New applications in “future” mass computing market

• Molecular dynamics simulation
• Video and audio coding and manipulation
• 3D imaging and visualization
• Consumer game physics
• Virtual reality products

“Super-apps” represent and model physical, concurrent world (huge amount of data, streaming, …)

Various granularities of parallelism exist, but…

• The programming model must not hinder parallel implementation
• Data delivery needs careful management

SLIDE 4

Stretching Traditional Architectures

Traditional apps:

• Sequential / hard to parallelize
• Covered by CPUs

New apps:

• Huge amounts of data
• Partly parallelizable

SLIDE 5

Stretching Traditional Architectures

• Traditional parallel architectures cover some super-apps: DSP, GPU, network apps, scientific computing
• But each uses hardware specifically designed for its problem
• Extension is hard or impossible

→ Grow mainstream architectures out (more cores, …)
→ Or grow domain-specific architectures in (CUDA, OpenCL)

SLIDE 6

“Special-purpose processors always choke off real algorithmic creativity.” – Jim Blinn
SLIDE 7

Example Applications (CUDA)

Application | Description | Source (lines) | Kernel (lines) | % time (in kernel)
H.264 | SPEC ’06 version, change in guess vector | 34,811 | 194 | 35%
LBM | SPEC ’06 version, fluid simulation; change to single precision and print fewer reports | 1,481 | 285 | >99%
RC5-72 | Code cracking; Distributed.net RC5-72 challenge client code | 1,979 | 218 | >99%
FEM | Finite element modeling, simulation of 3D graded materials | 1,874 | 146 | 99%
RPES | Rye Polynomial Equation Solver, quantum chem, 2-electron repulsion | 1,104 | 281 | 99%
PNS | Petri Net simulation of a distributed system | 322 | 160 | >99%
SAXPY | Single-precision implementation of saxpy, used in Linpack’s Gaussian elim. routine | 952 | 31 | >99%
TRACF | Two Point Angular Correlation Function | 536 | 98 | 96%
MRI-Q | Computing a matrix Q, a scanner’s configuration in MRI reconstruction | 490 | 33 | >99%
FDTD | Finite-Difference Time Domain analysis of 2D electromagnetic wave propagation | 1,365 | 93 | 16%
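The SAXPY row above is the smallest of these kernels; as an illustration of what such a kernel looks like, here is a minimal CUDA sketch (illustrative only, not the actual benchmark code):

    #include <cuda_runtime.h>

    // y = a*x + y, one thread per element (sketch, not the Linpack code).
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    // Launch with one thread per element, e.g.:
    // saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);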

SLIDE 8

Speedup of Applications

GeForce 8800 GTX vs. 2.2 GHz Opteron 248:

• 10× speedup in the kernel is typical, as long as the kernel can occupy enough parallel threads
• 25× to 400× speedup if the function’s data requirements and control flow suit the GPU


SLIDE 9

Why Parallel Programming?

Best Bang for the Buck – GPUs and Money

SLIDE 10

“The display is the computer.” – Jen-Hsun Huang, CEO of NVIDIA

SLIDE 11

CPU vs. GPU

[Figure: multi-core CPU vs. many-core GPU architectures. Courtesy: John Owens]

SLIDE 12

GPUs: Upstream Over Time

[Figure: pipeline stages moving onto the GPU over time: Display, Rasterization, Triangle Setup, Projection & Clipping, Transform & Lighting, and later Geometry Shader, Instancing, Stream Out, Tessellation, …]

The dark ages (early to mid 1990s): normal PCs had only frame buffers. Once, even high-end systems supported just triangle setup and fill; the CPU sent a triangle with color and depth per vertex, and it was rendered. Pipelines for PC commodity graphics start here, 1995–1998; the seminal event is 3dfx’s Voodoo in October 1996. This part of the pipeline reaches the consumer level with the introduction of the NVIDIA GeForce 256 in fall 1999. More and more moves to the GPU: what is the best division of labor? Should it even be a pipeline, or something more general? Some accelerators were no more than a simple chip that sped up linear interpolation along a single span, thereby increasing fill rate.

SLIDE 13

Wheel of Reincarnation

Coined by Myer and Sutherland, 1968. Will the wheel turn again?

CPU only → CPU & graphics accelerator (1995) → CPU & GPU (1999) → ?

SLIDE 14

Spending Transistors

CPU:

• Control logic (ILP)
• Memory

GPU:

• (Used to) spend transistors on algorithm logic

SLIDE 15

Spending Transistors

CPU and GPU are heading towards each other.

CPU:

• SSE through SSE5: 128-bit registers
• 256-bit data path with AVX

GPU:

• Unified shaders
• Large pools of registers
• Fewer fixed-function stages
• Multiple paths out of the GPU

SLIDE 16

Modern Processor Trends

Moore’s Law: ~1.6x transistors every year (10x every 5 years)

DRAM capacity (per year):

• 1.6x from 1980–1992
• 1.4x from 1996–2002

DRAM bandwidth (per year):

• 1.25x = 25% (10x every 10 years)

DRAM latency (per year):

• 1.05x = 5% (10x every 48 years)

Bandwidth improves by at least the square of the improvement in latency [Patterson2004].
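These “10x every N years” figures are simply the annual factor compounded: for an annual growth factor r, the years until a 10x improvement are

    % Years until a 10x improvement at annual growth factor r:
    N = \frac{\log 10}{\log r}
    % r = 1.6  (transistors): N ≈ 4.9,  i.e. ~10x every 5 years
    % r = 1.25 (bandwidth):   N ≈ 10.3, i.e. ~10x every 10 years
    % r = 1.05 (latency):     N ≈ 47.2, i.e. ~10x every 48 years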

SLIDE 17

Memory & Latency

“Cache is King.” Missing the L2 cache and going to main memory is death: 10-50x slower. This is why secondary rays (in ray tracing) usually perform so poorly. CPUs focus on very fast caches; GPUs (used to) hide latency via many threads instead.

SLIDE 18

Opportunity: Latency Hiding

The straightforward loop

    for i = 1 to N:
        a[i] = b[i] * c + d[i]

has a working set where a, b, and d together exceed the cache size, but a plus either b or d fits. Instead, to hide memory access, split it into streaming passes:

    for i = 1 to N: t[i] = b[i]
    for i = 1 to N: t[i] *= c
    for i = 1 to N: t[i] += d[i]
    for i = 1 to N: a[i] = t[i]

A speedup of 10-50x is possible.
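A compilable sketch of the two variants (the function names and array size are illustrative, not from the slide; plain host code):

    #include <stdlib.h>

    #define N (1 << 20)   // illustrative array length

    // Fused loop: streams a, b, and d together through the cache.
    void fused(float *a, const float *b, float c, const float *d) {
        for (int i = 0; i < N; ++i)
            a[i] = b[i] * c + d[i];
    }

    // Split passes: each pass touches t plus at most one other array,
    // so the per-pass working set is smaller.
    void split(float *a, const float *b, float c, const float *d, float *t) {
        for (int i = 0; i < N; ++i) t[i] = b[i];
        for (int i = 0; i < N; ++i) t[i] *= c;
        for (int i = 0; i < N; ++i) t[i] += d[i];
        for (int i = 0; i < N; ++i) a[i] = t[i];
    }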

SLIDE 19

Memory Wall: Bandwidth

Multi-core:

• Private caches allow some cores to continue while others suffer from memory latency
• This ability to tolerate memory latency comes with increased memory bandwidth demands (traffic to caches)
• Coherence/consistency maintains the illusion of a monolithic memory

SLIDE 20

The Three Walls

Instruction-level parallelism (ILP): mined out

• Branch prediction
• Out-of-order processing
• Control improvements

Memory (access latency):

• Loads and stores are slow

Power (the reason for multi-core):

• Clock rates peaked in 2005 at around 3.8 GHz: diminishing returns
• Increasing power does not linearly increase processing speed: 1.6x speed costs ~2-2.5x power and ~2-3x die area (see the model below)
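As background for the power wall (a standard first-order CMOS model, not from the slide): dynamic power grows with the square of supply voltage, and sustaining a higher clock requires a higher voltage, so power grows roughly cubically with frequency:

    % First-order CMOS dynamic-power model (background assumption):
    % activity factor \alpha, capacitance C, supply voltage V, frequency f
    P_{\mathrm{dyn}} \approx \alpha\, C\, V^2 f,
    \qquad V \propto f \;\Rightarrow\; P_{\mathrm{dyn}} \propto f^3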

SLIDE 21

The Future: Parallelism

Design must change!

• Intel: in 2006 announced a plan for 80 cores by 2011
• Berkeley: could have a thousand cores on a chip

Migration towards multi-processing:

• Provide other threads of execution while waiting for memory
• Big caches
• Increase memory bandwidth to compensate for long latency
• These do not solve the problem!

SLIDE 22

The Future: Parallelism

Tradeoff: large, fast core vs. many slower cores.

• All tasks, serial and parallel, need to run reasonably
• This implies a hybrid: some fast cores, many small ones
• The “HPU”: what is our goal? Solve for “H” (but Cell died!?)
• Intel Turbo Boost [Knight Rider 1982]

SLIDE 23

Tearing Down the Memory Wall

Traditional software model

  • Arbitrary data access
  • Flat monolithic memory

Identify private data and localize access (see the sketch below):

  • Eliminate unnecessary accesses and updates to main memory
  • Aim for a high compute-to-memory-access ratio
  • Key to programming massively parallel processors
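A minimal CUDA sketch of this idea (illustrative; the kernel, names, and tile size are hypothetical, not from the slides): each block stages its tile of the input, plus a one-element halo, in on-chip shared memory, so neighboring reads are served locally instead of from main memory.

    #include <cuda_runtime.h>

    #define TILE 256   // threads per block, illustrative

    // 1D 3-point stencil with localized access: each block loads
    // TILE + 2 inputs into shared memory once, then every thread
    // reads its neighbors from that private, on-chip copy.
    __global__ void stencil1d(const float *in, float *out, int n) {
        __shared__ float s[TILE + 2];
        int g = blockIdx.x * TILE + threadIdx.x;   // global index
        int l = threadIdx.x + 1;                   // local index (skip halo)

        if (g < n) s[l] = in[g];
        if (threadIdx.x == 0)                      // left halo element
            s[0] = (g > 0) ? in[g - 1] : 0.0f;
        if (threadIdx.x == TILE - 1 || g == n - 1) // right halo element
            s[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
        __syncthreads();

        if (g < n)
            out[g] = 0.25f * s[l - 1] + 0.5f * s[l] + 0.25f * s[l + 1];
    }

    // Launch: stencil1d<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);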