Adaptive Sampling-Based Profiling Techniques for Optimizing the - PowerPoint PPT Presentation

Adaptive Sampling-Based Profiling Techniques for Optimizing the Distributed JVM Runtime
King Tin Lam, Yang Luo, Cho-Li Wang
Speaker: King Tin Lam
Date: Apr 20, 2010
Systems Research Group, Department of Computer Science, The University of Hong Kong
IPDPS'10, Atlanta, Georgia, USA

Outline
1. Background
2. Challenges and Problems
3. Adaptive Object Sampling
4. Adaptive Stack Sampling
5. Performance Evaluation

Parallel Programming Paradigms
- For a single computer (multiprocessor, multicore):
  - Shared memory, e.g. OpenMP
    - Much easier
- For a multicomputer (distributed-memory system):
  - Message passing, e.g. MPI, PVM
    - Hard for programmers
  - Shared virtual memory (SVM), a.k.a. software DSM, e.g. TreadMarks, CVM, JIAJIA
    - Bound to a memory consistency model
    - Resembles the ease of shared memory
    - Less efficient

Parallel Programming Paradigms

System      | Developer                | Implementation Level       | Granularity     | Consistency Model
IVY         | Yale                     | Library + OS               | Page (1KB)      | SC
Munin       | Rice                     | Library + OS               | Variable        | ERC
TreadMarks  | Rice                     | Library                    | Page (4KB)      | LRC
CVM         | Maryland                 | Library                    | Page            | LRC, SC
Midway      | CMU                      | Library + Compiler         | Variable        | EC, PC, RC
NCP2        | UFRJ, Brazil             | Library + Hardware support | Page (4KB)      | EC, RC
Quarks      | Utah                     | Library                    | Region, Page    | RC, SC
softFLASH   | Stanford                 | OS                         | Page (16KB)     | RC, DIRC
Cashmere-2L | Rochester                | Library                    | Page (8KB)      | HLRC
Brazos      | Rice                     | Library                    | Page            | ScC
Shasta      | DEC WRL                  | Compiler                   | Variable        | SC
Mermaid     | Toronto                  | Library + OS               | Page (1KB, 8KB) | SC
Mirage      | UCLA                     | OS                         | 512 bytes       | SC
JIAJIA      | CAS, China               | Library                    | Page (4KB)      | ScC
Simple-COMA | SICS (Sweden) and SUN    | OS                         | Page            | SC
Blizzard-S  | Wisconsin                | Library                    | Cache line      | SC
Shrimp      | Princeton                | OS + Hardware support      | Page            | AURC, SC
Linda       | Yale                     | Language                   | Variable        | SC
Orca        | Vrije Univ., Netherlands | Language                   | Variable        | EC-like

Parallel Programming Paradigms
Memory consistency models:
- Strict Consistency
- Sequential Consistency (SC)
- Release Consistency (RC)
- Eager Release Consistency (ERC)
- Lazy Release Consistency (LRC)
- Scope Consistency (ScC)
- Entry Consistency (EC)

Parallel Programming Paradigms
Remote memory access is the scalability killer!
- Remote latency >> local latency (local assumed to be 50-60 ns)
  - InfiniBand cluster (1-2 μs): 20x slower!
  - Ethernet cluster (100 μs): 2,000x slower!!
  - Grid/Internet (avg. 500 ms): 10,000,000x slower!!!
- "To speed up" ≈ "reduce remote access as much as possible"
- The key is to improve locality
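The slowdown factors on this slide follow from dividing each remote latency by the local latency. The sketch below reproduces that arithmetic, assuming the lower bound of 50 ns for local access and 1 μs for InfiniBand. The helper `effective_latency_ns` is not from the slides; it is a simple weighted-average model added here only to illustrate why cutting the fraction of remote accesses matters.

```python
# Local memory latency assumed on the slide (50-60 ns); 50 ns is used here.
LOCAL_NS = 50.0

# Remote access latencies quoted on the slide, converted to nanoseconds.
REMOTE_NS = {
    "InfiniBand cluster": 1_000.0,     # lower end of the 1-2 us range
    "Ethernet cluster": 100_000.0,     # 100 us
    "Grid/Internet": 500_000_000.0,    # 500 ms on average
}


def slowdown(remote_ns, local_ns=LOCAL_NS):
    """Return how many times slower a remote access is than a local one."""
    return remote_ns / local_ns


def effective_latency_ns(remote_fraction, remote_ns, local_ns=LOCAL_NS):
    """Average access latency when a given fraction of accesses is remote.

    Illustrative model only (an assumption, not taken from the slides).
    """
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


for name, ns in REMOTE_NS.items():
    print(f"{name}: {slowdown(ns):,.0f}x slower than local")

# Even 1% remote accesses over Ethernet dominates the average latency.
print(f"1% remote over Ethernet: {effective_latency_ns(0.01, 100_000.0):.1f} ns")
```

Under this model, a small remote fraction already dominates the average, which is the slide's point about improving locality.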

The PGAS Model
- User hints
  - Add annotation
  - Use special API constructs for locality hint inputs (e.g. X10's places)
- PGAS (Partitioned Global Address Space)
  - "Hybrid" parallel paradigm
  - Essentially Distributed Shared Memory (DSM)
  - But incorporates some MPI-like constructs
  - Research languages: UPC, Co-Array Fortran (CAF), Titanium
  - HPCS languages: X10 (IBM), Chapel (Cray)
- A burden to programmers

Our Dream Model: PGPGAS or (PG)²AS
- Profile-Guided PGAS (PG²AS)
  - A built-in runtime profiler, instead of humans, for digging out the locality hints
  - Profile-guided adaptive locality management
    - Thread migration
    - Object home migration
    - Object prefetching
  - (Callout beside this list: something new in this paper)
- API-free shared virtual memory
  - Transparent clustering and scaling
  - Automatic thread distribution
  - Location-transparent access
  - System instruments cluster-wide logics
  - No modification to existing applications
  - (Callout: previous distributed JVM research, e.g. cJVM, JavaSplit, JESSICA, …)

Techniques to improve locality
- Runtime techniques
  - Migration
    - Thread
    - Object (home)
  - Prefetching
    - Spatial
    - Temporal
[Figure labels: threads T1 and T2, objects, node 1, node 2, remote access]
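To make object home migration concrete, here is a minimal sketch of one possible policy: count remote accesses per node and move the object's home to the most frequent remote accessor once it reaches a threshold. The class `SharedObject`, its methods, and the threshold are hypothetical names chosen for illustration; the slides do not describe the policy actually used in the distributed JVM.

```python
from collections import Counter


class SharedObject:
    """Hypothetical shared object with a home node (illustrative only)."""

    def __init__(self, home):
        self.home = home              # node currently holding the master copy
        self.remote_hits = Counter()  # remote accesses observed per node

    def access(self, node):
        """Record an access; return True if it had to cross the network."""
        if node == self.home:
            return False
        self.remote_hits[node] += 1
        return True

    def maybe_migrate_home(self, threshold=3):
        """Move the home to the top remote accessor once it reaches threshold."""
        if not self.remote_hits:
            return self.home
        node, hits = self.remote_hits.most_common(1)[0]
        if hits >= threshold:
            self.home = node
            self.remote_hits.clear()  # start counting afresh at the new home
        return self.home


# A thread on node 2 repeatedly touches an object whose home is node 1.
obj = SharedObject(home=1)
for _ in range(3):
    obj.access(2)
print("home after migration check:", obj.maybe_migrate_home())
print("access from node 2 is remote:", obj.access(2))
```

Thread migration is the mirror image of this idea: instead of moving the object's home toward the thread, the thread is moved toward the node that homes most of the objects it touches.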

JESSICA: Java Enabled Single System Image Computing Architecture
Distributed Java VM
- A cluster-wide JVM with:
  - Dynamic thread mobility in JIT mode
  - Global Object Space (GOS)
[Figure: JESSICA architecture. One master JVM and two worker JVMs, each on its own host manager, OS, and hardware, connected by a communication network. Labels inside each JVM: Thread Scheduler, Class Loader, Load Monitor Daemon, Thread 1-3, Java Method Area, PC, Registers, Stack Frames, Execution Engine, Local Heap, Host Manager. Other labels: Java Source Code, Java Compiler, Class Files, Portable Java Frames, Thread Migration, Remote Class Loading. A second build of the slide adds "object" markers in the local heaps and the label "(Global Object Space)".]

