Compiler Optimization For (OpenMP) Accelerator Offloading
SLIDE 1

Compiler Optimization For (OpenMP) Accelerator Offloading

EuroLLVM — April 8, 2019 — Brussels, Belgium

Johannes Doerfert and Hal Finkel

Leadership Computing Facility, Argonne National Laboratory — https://www.alcf.anl.gov/

SLIDE 2

Acknowledgment

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation's exascale computing imperative.

SLIDE 3

Compiler Optimization — Original Program / After Optimizations

Original:

int y = 7;
for (i = 0; i < N; i++) {
  f(y, i);
}
g(y);

After:

for (i = 0; i < N; i++) {
  f(7, i);
}
g(7);

SLIDE 4

Compiler Optimization — Original Program / After Optimizations

int y = 7;
#pragma omp parallel for
for (i = 0; i < N; i++) {
  f(y, i);
}
g(y);

SLIDE 5

Compiler Optimization For Parallelism — Original Program / After Optimizations

Original:

int y = 7;
#pragma omp parallel for
for (i = 0; i < N; i++) {
  f(y, i);
}
g(y);

After (identical — nothing was optimized):

int y = 7;
#pragma omp parallel for
for (i = 0; i < N; i++) {
  f(y, i);
}
g(y);

SLIDE 6

Current Compiler Optimization For Parallelism

SLIDE 7

Current Compiler Optimization For Parallelism

None⋆†

⋆ At least for LLVM/Clang up to 8.0
† And not considering smart runtime libraries!

SLIDE 8

Performance Implications

Why is this important?

SLIDES 9-12

Performance Implications (performance figures not captured in this transcript)

SLIDE 13

Optimization Categories

Optimizations for sequential aspects
  • May reuse existing transformations (patches up for review!)
  ⇒ Introduce suitable abstractions to bridge the indirection (DONE!)

Optimizations for parallel aspects
  • New explicit parallelism-aware transformations (see IWOMP'18 [a])
  ⇒ Introduce a unifying abstraction layer (see EuroLLVM'18 [b])

[a] Compiler Optimizations For OpenMP, J. Doerfert, H. Finkel, IWOMP 2018
[b] A Parallel IR in Real Life: Optimizing OpenMP, H. Finkel, J. Doerfert, X. Tian, G. Stelle, EuroLLVM Meeting 2018

SLIDE 19

The Compiler Black Box

>> clang -O3 -fopenmp-targets=...

{
  #pragma omp target teams parallel
  work1();
}

“relatively” good performance :)

SLIDE 22

The Compiler Black Box

>> clang -O3 -fopenmp-targets=...

{
  #pragma omp target teams parallel
  work1();
  #pragma omp target teams
  #pragma omp parallel
  work2();
}

“relatively” good performance :)

SLIDE 24

The Compiler Black Box

>> clang -O3 -fopenmp-targets=...

#pragma omp target teams
{
  #pragma omp parallel
  work1();
  #pragma omp parallel
  work2();
}

probably poor performance :(

SLIDE 26

The Compiler Black Box — Behind the Curtain (of Clang)

{
  #pragma omp target teams
  foo();
  #pragma omp target teams parallel
  work(); // <- Hotspot
  #pragma omp target teams
  bar();
}

N teams, with M threads each, all executing work concurrently.

SLIDE 30

The Compiler Black Box — Behind the Curtain (of Clang)

#pragma omp target teams
{
  foo();
  #pragma omp parallel
  work(); // <- Hotspot
  bar();
}

1 master and N−1 worker teams, worker teams with M threads:
  • Masters execute foo concurrently, workers idle.
  • Masters delegate work for concurrent execution.
  • Masters execute bar concurrently, workers idle.

Problems:
  • a separate master team costs resources
  • synchronization has overhead
  • currently impossible to optimize
slide-40
SLIDE 40

OpenMP Offload — Overview

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code + few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-41
SLIDE 41

OpenMP Offload — Overview

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code + few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-42
SLIDE 42

OpenMP Offload — Overview

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code + few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-43
SLIDE 43

OpenMP Offload — Overview

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code + few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-44
SLIDE 44

OpenMP Offload — Overview

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+ few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-45
SLIDE 45

OpenMP Offload — Overview

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+ few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-46
SLIDE 46

OpenMP Offload — Overview

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+ few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-47
SLIDE 47

OpenMP Offload — Overview

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+ few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-48
SLIDE 48

OpenMP Offload — Overview

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+ few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-49
SLIDE 49

OpenMP Offload — Overview

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+ few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-50
SLIDE 50

OpenMP Offload — Overview & Directions

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+ few RT Calls + Logic Device Opt. Device RT +Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-51
SLIDE 51

OpenMP Offload — Overview & Directions

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+ few RT Calls + Logic Device Opt. Device RT+Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-52
SLIDE 52

OpenMP Offload — Overview & Directions

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+few RT Calls + Logic Device Opt. Device RT+Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-53
SLIDE 53

OpenMP Offload — Overview & Directions

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+few RT Calls + Logic Device Opt. Device RT+Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

Pending patches “fix” the motivating problem and allow for more to come! Reviewers are needed! Interested? Take a look and contact me :)

slide-54
SLIDE 54

OpenMP Offload — Overview & Directions

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+few RT Calls + Logic Device Opt. Device RT+Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

Pending patches “fix” the motivating problem and allow for more to come! Reviewers are needed! Interested? Take a look and contact me :)

slide-55
SLIDE 55

OpenMP Offload — Overview & Directions

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+few RT Calls + Logic Device Opt. Device RT+Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

Pending patches “fix” the motivating problem and allow for more to come! Reviewers are needed! Interested? Take a look and contact me :)

slide-56
SLIDE 56

OpenMP Offload — Overview & Directions

  • 1. Offmoad-Specific Optimizations on Device Code

Code Code Gen. Device Code+few RT Calls + Logic Device Opt. Device RT+Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-57
SLIDE 57

OpenMP Offload — Overview & Directions

  • 2. Optimize Device and Host Code Together

Code Code Gen. Device Code+few RT Calls + Logic Device Opt. Device RT+Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-58
SLIDE 58

OpenMP Offload — Overview & Directions

  • 2. Optimize Device and Host Code Together

Code Code Gen. Device Code+few RT Calls + Logic Device Opt. Device RT+Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

slide-59
SLIDE 59

OpenMP Offload — Overview & Directions

  • 2. Optimize Device and Host Code Together

Code Code Gen. Device Code+few RT Calls + Logic Device Opt. Host AND Device Opti- mization Device RT+Logic Host Code Host Opt. Fat Binary OpenMP Clang LLVM-IR LLVM Assembly Cross Module IPO

8/14

SLIDE 60

Pending Patches — Target Region Interface

  • A straightforward #pragma omp target front-end:
    ⋄ simplified implementation: CGOpenMPRuntimeNVPTX.cpp ~5.0k LoC vs. CGOpenMPRuntimeTRegion.cpp ~0.5k LoC
    ⋄ improved reusability (F18, ...)
  • Interface exposes information and implementation choices:
    ⋄ "smartness" is moved into the compiler middle-end
    ⋄ simplifies analyses and transformations in LLVM
  • Device RT interface & implementation are separated:
    ⋄ simplifies generated LLVM-IR
    ⋄ most LLVM & Clang parts become target agnostic

SLIDE 65

1. Offload-Specific Optimizations — "SPMD-zation"

  • use inter-procedural reasoning to place minimal guards/synchronization
  • if legal, switch all boolean UseSPMDMode flags to true
  • currently, no (unknown) global side-effects are allowed outside parallel regions

SLIDE 68

1. Offload-Specific Optimizations — Custom State Machines

  • use optimized state machines when unavoidable
  • reachability & post-dominance restrict the set of potential next parallel regions to work on
  • reuse already communicated/shared values if possible
  • currently, a simple state machine is generated with explicit conditionals for all known parallel regions in the module

SLIDE 72

2. Optimize Device and Host Together — Abstract Call Sites

[Diagram: CallInst / InvokeInst → CallSite → IPO passes; TransitiveCallSite → AbstractCallSite → IPO passes.]

Functional changes required for Inter-procedural Constant Propagation:

SLIDE 75

Abstract Call Sites — Performance Results

(Performance charts not captured in this transcript.)

SLIDE 79

Conclusion
SLIDE 85

OpenMP-Aware Optimizations

(see IWOMP'18)

I: Attribute Propagation — Bidirectional Information Transfer:
   read/write-only, restrict/noalias, ...
II: Variable Privatization — Limit Variable Lifetimes:
   shared(var) ⟶ firstprivate(var) ⟶ private(var)
III: Parallel Region Expansion — Maximize Parallel Contexts:
   ⟹ reduce start/stop overheads and expose barriers
IV: Barrier Elimination — Eliminate Redundant Barriers
V: Communication Optimization — Move Computations Around:
   seq. compute & result comm. vs. operand comm. & par. compute

SLIDE 92

OpenMP-Aware Optimizations

(see IWOMP'18)

I: Attribute Propagation — In LLVM: Attribute Deduction (IPO!)
   read/write-only, restrict/noalias, ...
II: Variable Privatization — In LLVM: Argument Promotion (IPO!)
   shared(var) ⟶ firstprivate(var) ⟶ private(var)
III: Parallel Region Expansion — Maximize Parallel Contexts:
   ⟹ reduce start/stop overheads and expose barriers
IV: Barrier Elimination — Eliminate Redundant Barriers
V: Communication Optimization — Move Computations Around:
   seq. compute & result comm. vs. operand comm. & par. compute
SLIDE 93

Early Outlining

OpenMP Input:

#pragma omp parallel for
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by a runtime call.
omp_rt_parallel_for(0, N, &body_fn,
                    &N, &In, &Out);

// Parallel region outlined in the front-end (clang)!
static void body_fn(int tid, int* N, float** In, float** Out) {
  int lb = omp_get_lb(tid), ub = omp_get_ub(tid);
  for (int i = lb; i < ub; i++)
    (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
}
SLIDE 97

An Abstract Parallel IR

OpenMP Input:

#pragma omp parallel for
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by an annotated loop.
for /* parallel */ (int i = 0; i < N; i++)
  body_fn(i, &N, &In, &Out);

// Parallel region outlined in the front-end (clang)!
static void body_fn(int i, int* N, float** In, float** Out) {
  (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
}
slide-99
SLIDE 99

Early Outlined + Transitive Calls — OpenMP Input:

#pragma omp parallel for
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by a runtime call.
omp_rt_parallel_for(0, N, &body_fn, &N, &In, &Out);
// Model transitive call: body_fn(?, &N, &In, &Out);

// Parallel region outlined in the front-end (clang)!
static void body_fn(int tid, int *N, float **In, float **Out) {
  int lb = omp_get_lb(tid), ub = omp_get_ub(tid);
  for (int i = lb; i < ub; i++)
    (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
}

+ valid and executable IR
+ no unintended interactions
− integration cost per IPO
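The payoff of modeling the transitive call can be shown with a tiny broker example. This is an illustrative sketch, not LLVM's `AbstractCallSite` API: `rt`, `twice`, and `call_through_broker` are made-up names, and the point is only that a known forwarding convention turns a constant broker argument into a constant callee parameter.

```c
/* Broker call: rt(fn, arg) always invokes fn(arg). If IPO knows this
   convention (the "abstract call site" idea), a constant at the broker
   call site becomes a constant parameter of the callee, exactly as it
   would for a direct call. */
static int rt(int (*fn)(int), int arg) { return fn(arg); }

static int twice(int x) { return 2 * x; }

int call_through_broker(void) {
  /* modeled transitive call: twice(21), foldable by constant
     propagation once the forwarding is visible */
  return rt(twice, 21);
}
```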

slide-100
SLIDE 100

IPO in LLVM

CallInst, InvokeInst ⟶ CallSite ⟶ Passes (IPOs)
TransitiveCallSite ⟶ AbstractCallSite ⟶ Passes (IPOs)

slide-102
SLIDE 102

Transitive Call Sites in LLVM

CallInst, InvokeInst ⟶ CallSite ⟶ Passes (IPOs)
TransitiveCallSite ⟶ AbstractCallSite ⟶ Passes (IPOs)

Functional changes required for Inter-procedural Constant Propagation:

slide-103
SLIDE 103

Evaluated Versions

Version | Description                                         | Opt.
base    | plain “-O3”, thus no parallel optimizations         | —
attr    | attribute propagation through attr. deduction (IPO) | I
argp    | variable privatization through arg. promotion (IPO) | II
—       | constant propagation (IPO)                          | n/a

slide-104
SLIDE 104
slide-105
SLIDE 105

Some Context

Examples — Examples are given in a C-like language with OpenMP annotations.
Transformations — Our transformations work on the LLVM intermediate representation (LLVM-IR); they take and produce LLVM-IR.
OpenMP Runtime Library — We encounter OpenMP annotations as OpenMP runtime library calls, and the situation is most often more complicated than presented here.

slide-107
SLIDE 107

Evaluation Environment

  • Run with 1 thread²
  • Median and variance of 51 runs are shown
  • Rodinia 3.1 benchmarks and LULESH v1.0 (OpenMP)
  • Only time in parallel constructs was measured

² Intel(R) Core(TM) i7-4800MQ CPU @ 2.70GHz

slide-108
SLIDE 108

Performance Results

slide-113
SLIDE 113

Action Item I

1) Run your OpenMP code sequentially (export OMP_NUM_THREADS=1), with and without OpenMP.
2) Email me the results! (jdoerfert@anl.gov)



slide-119
SLIDE 119

Action Item II

1) Always use default(none) and firstprivate(...) for scalars/pointers if you do not have explicit synchronization.
2) Revisit Action Item I.

NO need to “share” the variable A!

slide-120
SLIDE 120

Constant Propagation Example

double gamma[4][8];
gamma[0][0] = 1;
// ... and so on till ...
gamma[3][7] = -1;

Kokkos::parallel_for(
    "CalcFBHourglassForceForElems A", numElem,
    KOKKOS_LAMBDA(const int &i2) {
      // Use gamma[0][0] ... gamma[3][7]
    });
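The pattern can be reduced to a few lines of C. This is a sketch of the situation, not the LULESH code: `gamma_tab` and `hourglass_term` are illustrative names, and the table is tiny. The point is that once interprocedural analysis proves the table is filled with constants and never changes before the (outlined) body runs, the loads inside the body fold to literals.

```c
/* A table filled only with compile-time constants before the parallel
   call; with constant propagation across the (abstract) call site, the
   load below becomes a literal. */
static const double gamma_tab[2][2] = { {  1.0, -1.0 },
                                        { -1.0,  1.0 } };

double hourglass_term(int i2) {
  /* after propagation this is just a constant times 3.0 */
  return gamma_tab[i2 & 1][0] * 3.0;
}
```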

slide-121
SLIDE 121

Constant Propagation Performance

slide-122
SLIDE 122

Optimization I: Attribute Propagation OpenMP Input:

#pragma omp parallel for firstprivate(...)
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

slide-123
SLIDE 123

Optimization I: Attribute Propagation OpenMP Input:

#pragma omp parallel for firstprivate(...)
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by a runtime call.
omp_rt_parallel_for(0, N, &body_fn, N, In, Out);

// Parallel region outlined in the front-end (clang)!
void body_fn(int i, int N, float *In, float *Out) {
  Out[i] = In[i] + In[i + N];
}

slide-124
SLIDE 124

Optimization I: Attribute Propagation OpenMP Input:

#pragma omp parallel for firstprivate(...)
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by a runtime call.
omp_rt_parallel_for(0, N, &body_fn, N, In, Out);

// Parallel region outlined in the front-end (clang)!
void body_fn(int i, int N,
             float * /* read-only & no-escape */ In,
             float * /* write-only & no-escape */ Out) {
  Out[i] = In[i] + In[i + N];
}

slide-125
SLIDE 125

Optimization I: Attribute Propagation OpenMP Input:

#pragma omp parallel for firstprivate(...)
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by a runtime call.
omp_rt_parallel_for(0, N, &body_fn, N,
                    /* ro & no-esc */ In, /* wo & no-esc */ Out);

// Parallel region outlined in the front-end (clang)!
void body_fn(int i, int N,
             float * /* read-only & no-escape */ In,
             float * /* write-only & no-escape */ Out) {
  Out[i] = In[i] + In[i + N];
}
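At the C source level the deduced attributes can be approximated with qualifiers, which is a useful mental model (the actual transformation happens on LLVM-IR attributes, not C types). In this sketch, `const` stands in for read-only and `restrict` for no-alias/no-escape, so the store to Out provably cannot clobber In.

```c
/* Source-level stand-in for the deduced attributes: with const and
   restrict, the loads from In need not be reissued after the store to
   Out, since the two cannot alias. */
void body_fn_attr(int N, const float *restrict In, float *restrict Out) {
  for (int i = 0; i < N; i++)
    Out[i] = In[i] + In[i + N];
}
```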

slide-126
SLIDE 126

Optimization I: Attribute Propagation (cont)

int foo() {
  int a = 0;
  #pragma omp parallel
  {
    #pragma omp critical
    { a += 1; }
    bar();
    #pragma omp critical
    { a *= 2; }
  }
  return a;
}

slide-127
SLIDE 127

Optimization I: Attribute Propagation (cont)

int foo() {
  int a = 0;
  #pragma omp parallel
  {
    #pragma omp critical
    { a += 1; }
    bar();
    #pragma omp critical
    { a *= 2; }
  }
  return a;
}

⟶

int foo() {
  int a = 0;
  int *restrict p = &a;
  omp_rt_parallel_for(pwork, p);
  return a;
}

void pwork(int tid, int *p) {
  if (omp_critical(tid)) {
    *p = *p + 1;
    omp_critical_end(tid);
  }
  bar();
  if (omp_critical(tid)) {
    *p = *p * 2;
    omp_critical_end(tid);
  }
}

slide-128
SLIDE 128

Optimization I: Attribute Propagation (cont)

// After attribute propagation: p is restrict in pwork, so the two
// guarded updates can be combined into one.
void pwork(int tid, int *restrict p) {
  if (omp_critical(tid)) {
    omp_critical_end(tid);
  }
  bar();
  if (omp_critical(tid)) {
    *p = 2 * (*p + 1);
    omp_critical_end(tid);
  }
}
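The combined update can be checked single-threaded. This is a simplified sketch under the slide's assumption that `bar()` cannot reach `*p`; `pwork_combined` is an illustrative name and the critical-section plumbing is dropped, leaving only the arithmetic rewrite: "+1 then *2" collapses to 2 * (*p + 1).

```c
/* Stand-in for an external call that provably cannot access *p. */
static void bar(void) { }

/* Single-threaded sketch of the combined update enabled by restrict. */
int pwork_combined(int *restrict p) {
  bar();
  *p = 2 * (*p + 1); /* was: *p += 1; ... *p *= 2; */
  return *p;
}
```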

slide-129
SLIDE 129

Optimization I: Attribute Propagation (cont)

// Blocker: without further information, the call to bar() may "use" p.
void pwork(int tid, int *restrict p) {
  if (omp_critical(tid)) {
    *p = *p + 1;
    omp_critical_end(tid);
  }
  bar();  // May "use" p.
  if (omp_critical(tid)) {
    *p = *p * 2;
    omp_critical_end(tid);
  }
}

slide-130
SLIDE 130

Optimization II: Variable Privatization OpenMP Input:

#pragma omp parallel for shared(...)
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

slide-131
SLIDE 131

Optimization II: Variable Privatization OpenMP Input:

#pragma omp parallel for shared(...)
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by a runtime call.
omp_rt_parallel_for(0, N, &body_fn, &N, &In, &Out);

// Parallel region outlined in the front-end (clang)!
void body_fn(int i, int *N, float **In, float **Out) {
  (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
}

slide-132
SLIDE 132

Optimization II: Variable Privatization OpenMP Input:

#pragma omp parallel for shared(...)
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by a runtime call.
omp_rt_parallel_for(0, N, &body_fn, &N, &In, &Out);

// Parallel region outlined in the front-end (clang)!
void body_fn(int i, int * /* ro & ne */ N,
             float ** /* ro & ne */ In,
             float ** /* ro & ne */ Out) {
  (*Out)[i] = (*In)[i] + (*In)[i + (*N)];
}

slide-133
SLIDE 133

Optimization II: Variable Privatization OpenMP Input:

#pragma omp parallel for firstprivate(...)
for (int i = 0; i < N; i++)
  Out[i] = In[i] + In[i+N];

// Parallel region replaced by a runtime call.
omp_rt_parallel_for(0, N, &body_fn, N, In, Out);

// Parallel region outlined in the front-end (clang)!
void body_fn(int i, int N, float *In, float *Out) {
  Out[i] = In[i] + In[i + N];
}
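The lowering difference between shared and firstprivate can be written out in plain C. This is a sketch with illustrative names: `sum_shared` takes the addresses the shared(...) lowering would pass, `sum_firstprivate` takes the values the firstprivate(...) lowering would pass; the by-value form removes one level of indirection per access.

```c
/* shared(...) lowering: &var of type T* is passed, every access goes
   through memory. */
float sum_shared(const int *N, float **In, int i) {
  return (*In)[i] + (*In)[i + *N];
}

/* firstprivate(...) lowering: var of type T is passed, N and In can
   live in registers and are trivially thread-local. */
float sum_firstprivate(int N, const float *In, int i) {
  return In[i] + In[i + N];
}
```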

slide-134
SLIDE 134

Optimization III: Parallel Region Expansion

slide-135
SLIDE 135

Optimization III: Parallel Region Expansion

void copy(float *dst, float *src, int N) {
  #pragma omp parallel for
  for (int i = 0; i < N; i++) {
    dst[i] = src[i];
  } // implicit barrier!
}

void compute_step_factor(int nelr, float *vars, float *areas, float *sf) {
  #pragma omp parallel for
  for (int blk = 0; blk < nelr / block_length; ++blk) {
    ...
  } // implicit barrier!
}

slide-136
SLIDE 136

Optimization III: Parallel Region Expansion

#pragma omp parallel
for (int i = 0; i < iterations; i++) {
  copy(old_vars, vars, nelr * NVAR);
  compute_step_factor(nelr, vars, areas, sf);
  for (int j = 0; j < RK; j++) {
    compute_flux(nelr, ese, normals, vars, fluxes, ff_vars,
                 ff_m_x, ff_m_y, ff_m_z, ff_denergy);
    time_step(j, nelr, old_vars, vars, sf, fluxes);
    ...

slide-137
SLIDE 137

Optimization III: Parallel Region Expansion

#pragma omp parallel
for (int i = 0; i < iterations; i++) {
  #pragma omp parallel for // copy
  for (...) { /* write old_vars, read vars */ }
  // implicit barrier!
  compute_step_factor(nelr, vars, areas, sf);
  for (int j = 0; j < RK; j++) {
    compute_flux(nelr, ese, normals, vars, fluxes, ff_vars,
                 ff_m_x, ff_m_y, ff_m_z, ff_denergy);
    time_step(j, nelr, old_vars, vars, sf, fluxes);
    ...

slide-138
SLIDE 138

Optimization III: Parallel Region Expansion

#pragma omp parallel
for (int i = 0; i < iterations; i++) {
  #pragma omp parallel for // copy
  for (...) { /* write old_vars, read vars */ }
  // implicit barrier!
  #pragma omp parallel for // compute_step_factor
  for (...) { /* write sf, read vars & area */ }
  // implicit barrier!
  for (int j = 0; j < RK; j++) {
    #pragma omp parallel for // compute_flux
    for (...) { /* write fluxes, read vars & ... */ }
    // implicit barrier!
    ...

slide-139
SLIDE 139

Optimization III: Parallel Region Expansion

#pragma omp parallel
for (int i = 0; i < iterations; i++) {
  #pragma omp for // copy
  for (...) { /* write old_vars, read vars */ }
  // explicit barrier in LLVM-IR!
  #pragma omp for // compute_step_factor
  for (...) { /* write sf, read vars & area */ }
  // explicit barrier in LLVM-IR!
  for (int j = 0; j < RK; j++) {
    #pragma omp for // compute_flux
    for (...) { /* write fluxes, read vars & ... */ }
    // explicit barrier in LLVM-IR!
    ...
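A minimal, runnable form of the expanded region looks as follows. This is a sketch with simplified loop bodies, not the cfd code: one enclosing parallel region, two worksharing loops, and a barrier after the first loop that is genuinely needed here because the second loop reads what the first wrote. Without -fopenmp the pragmas are ignored and the function runs sequentially with the same result.

```c
#define N 8

/* One parallel region with two "omp for" loops: threads are started
   once, and the implicit barrier after the first loop keeps the
   second loop correct (it consumes the copy). */
void fused_steps(float *dst, const float *src, float *sf) {
  #pragma omp parallel
  {
    #pragma omp for
    for (int i = 0; i < N; i++)
      dst[i] = src[i];              /* copy */
    /* implicit barrier: required, next loop reads dst */
    #pragma omp for
    for (int i = 0; i < N; i++)
      sf[i] = 0.5f * dst[i];        /* consumes the copy */
  }
}
```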

slide-140
SLIDE 140

Optimization IV: Barrier Elimination

#pragma omp parallel
for (int i = 0; i < iterations; i++) {
  #pragma omp for // copy
  for (...) { /* write old_vars, read vars */ }
  // explicit barrier in LLVM-IR!
  #pragma omp for // compute_step_factor
  for (...) { /* write sf, read vars & area */ }
  // explicit barrier in LLVM-IR!
  for (int j = 0; j < RK; j++) {
    #pragma omp for // compute_flux
    for (...) { /* write fluxes, read vars & ... */ }
    // explicit barrier in LLVM-IR!
    ...

slide-142
SLIDE 142

Optimization IV: Barrier Elimination

#pragma omp parallel
for (int i = 0; i < iterations; i++) {
  #pragma omp for nowait // copy
  for (...) { /* write old_vars, read vars */ }
  #pragma omp for nowait // compute_step_factor
  for (...) { /* write sf, read vars & area */ }
  for (int j = 0; j < RK; j++) {
    #pragma omp for // compute_flux
    for (...) { /* write fluxes, read vars & ... */ }
    // explicit barrier in LLVM-IR!
    ...
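The nowait case can also be demonstrated concretely. This sketch uses the cfd variable names but simplified stand-in bodies: the two loops touch disjoint arrays and neither reads the other's output, so dropping the implicit barriers is legal; the function is sequentially equivalent when compiled without -fopenmp.

```c
/* Independent loops: old_vars and sf are disjoint, and neither loop
   reads the other's output, so both barriers can become "nowait". */
void independent_steps(float *old_vars, const float *vars,
                       float *sf, const float *areas, int n) {
  #pragma omp parallel
  {
    #pragma omp for nowait           /* copy */
    for (int i = 0; i < n; i++)
      old_vars[i] = vars[i];
    #pragma omp for nowait           /* compute_step_factor stand-in */
    for (int i = 0; i < n; i++)
      sf[i] = 2.0f * areas[i];
  }
}
```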

slide-143
SLIDE 143

Optimization V: Communication Optimization

slide-144
SLIDE 144

Optimization V: Communication Optimization

void f(int *X, int *restrict Y) {
  int L = *X;        // immovable
  int N = 512;       // movable
  int A = N + L;     // movable
  #pragma omp parallel for \
      firstprivate(X, Y, N, L, A)
  for (int i = 0; i < N; i++) {
    int K = *Y;      // movable
    int M = N * K;   // movable
    X[i] = M + A*L*i; // immovable
  }
}

slide-147
SLIDE 147

Optimization V: Communication Optimization

void f(int *X, int *restrict Y) {
  int L = *X;        // immovable
  int N = 512;       // movable
  int A = N + L;     // movable
  #pragma omp parallel for \
      firstprivate(X, Y, N, L, A)
  for (int i = 0; i < N; i++) {
    int K = *Y;      // movable
    int M = N * K;   // movable
    X[i] = M + A*L*i; // immovable
  }
}

⟶

void g(int *X, int *restrict Y) {
  int L = *X;        // immovable
  int K = *Y;        // hoisted: load cost > communication cost
  int M = 512 * K;   // hoisted: multiply + load cost > communication cost
  #pragma omp parallel \
      firstprivate(X, M, L)
  {
    int A = 512 + L; // kept inside: add cost < communication cost
    #pragma omp for \
        firstprivate(X, M, A, L)
    for (int i = 0; i < 512; i++) {
      X[i] = M + A*L*i; // immovable
    }
  }
}
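A runnable version of the hoisted form, in the spirit of g() above, can be sketched as follows. `g_hoisted` is an illustrative name and the cost annotations are informal: the load of *Y and the multiply are loop-invariant, so they are computed once before the region and communicated as firstprivate scalars, while the cheap add stays inside. Without -fopenmp the pragma is ignored and the result is identical.

```c
/* Communication optimization sketch: expensive invariant work is
   hoisted out of the parallel body and passed in as scalars. */
void g_hoisted(int *X, const int *restrict Y) {
  int L = X[0];                 /* immovable: X is written below */
  int M = 512 * Y[0];           /* hoisted: load + multiply */
  #pragma omp parallel for firstprivate(M, L)
  for (int i = 0; i < 512; i++) {
    int A = 512 + L;            /* cheap: recomputing beats communicating */
    X[i] = M + A * L * i;
  }
}
```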

slide-148
SLIDE 148

Early Outlining: Sequential Optimization Problems

NO information transfer: outlined function ⟺ runtime library call site

Value Transfer

Declaration | OpenMP Clause     | Communicated Value
T var;      | default (shared)  | &var of type T*
T var;      | shared(var)       | &var of type T*
T var;      | lastprivate(var)  | &var of type T*
T var;      | firstprivate(var) | var of type T
T var;      | private(var)      | none

slide-154
SLIDE 154
slide-155
SLIDE 155

Target Region — The Interface

void kernel(...) {
init:
  char ThreadKind = __kmpc_target_region_kernel_init(...);
  if (ThreadKind == -1) {         // actual worker thread
    if (!UsedLibraryStateMachine)
      user_code_state_machine();
    goto exit;
  } else if (ThreadKind == 0) {   // surplus worker thread
    goto exit;
  } else {                        // team master thread
    goto user_code;
  }
user_code:
  // User-defined kernel code; parallel regions are replaced
  // by __kmpc_target_region_kernel_parallel(...) calls.
  // Fallthrough to de-initialization.
deinit:
  __kmpc_target_region_kernel_deinit(...);
exit:
  /* exit the kernel */
}

slide-156
SLIDE 156

Target Region — The Interface

// Initialization
int8_t __kmpc_target_region_kernel_init(ident_t *Ident,
                                        bool UseSPMDMode,
                                        bool RequiresOMPRuntime,
                                        bool UseStateMachine,
                                        bool RequiresDataSharing);

// De-initialization
void __kmpc_target_region_kernel_deinit(ident_t *Ident,
                                        bool UseSPMDMode,
                                        bool RequiredOMPRuntime);

// Parallel execution
typedef void (*ParallelWorkFnTy)(void * /* SharedValues */,
                                 void * /* PrivateValues */);
CALLBACK(ParallelWorkFnTy, SharedValues, PrivateValues)
void __kmpc_target_region_kernel_parallel(ident_t *Ident,
                                          bool UseSPMDMode,
                                          bool RequiredOMPRuntime,
                                          ParallelWorkFnTy ParallelWorkFn,
                                          void *SharedValues,
                                          uint16_t SharedValuesBytes,
                                          void *PrivateValues,
                                          uint16_t PrivateValuesBytes,
                                          bool SharedMemPointers);

slide-157
SLIDE 157

Target Region — The Implementation

  • (almost) the same as with the current NVPTX backend, except for shared/firstprivate variables
  • implemented in CUDA as part of the library, not generated into the user code/module/TU by Clang
  • the boolean flags are commonly constant; after inlining, all target region abstraction is gone

slide-158
SLIDE 158

Action Item III

1) Review your OpenMP target code.
2) Email me (jdoerfert@anl.gov) if you use the “bad” pattern!


slide-161
SLIDE 161

Current Work — Reviews, Evaluation, Features, Hardening

  • started the review process
  • more test-cases needed to determine benefit
  • more developers needed to add missing features
  • more users/developers needed to improve test coverage
slide-165
SLIDE 165

Current Work — Reviews, Evaluation, Features, Hardening

  • started the review process
  • more test-cases needed to determine benefit
  • more developers needed to add missing features
  • more users/developers needed to improve test coverage

Interested? Please let me know!

slide-166
SLIDE 166

Future Work — Optimizations, Front-ends, Targets

  • improve and extend LLVM’s OpenMP optimizations: connection to abstract call sites, memory placement, …
  • use target regions in other “front-ends”: F18, Polly, Rust?, …
  • implement the interface for other targets: GPUs, FPGAs?, …
slide-169
SLIDE 169

Future Work — Optimizations, Front-ends, Targets

  • improve and extend LLVM’s OpenMP optimizations: connection to abstract call sites, memory placement, …
  • use target regions in other “front-ends”: F18, Polly, Rust?, …
  • implement the interface for other targets: GPUs, FPGAs?, …

Interested? Please let me know!