Optimization Coaching for Fork/Join Applications on the Java Virtual Machine

Eduardo Rosales. Advisor: Prof. Walter Binder. Research area: Parallel applications, performance analysis


slide-1
SLIDE 1

EuroDW 2018 April 23, 2018 Porto, Portugal

Eduardo Rosales

Optimization Coaching for Fork/Join Applications on the Java Virtual Machine

Advisor: Prof. Walter Binder

Research area: Parallel applications, performance analysis

PhD stage: Planner

slide-2
SLIDE 2

Optimization Coaching for Fork/Join Applications on the Java Virtual Machine

§ The problem: despite the complexities associated with developing and tuning fork/join applications, there is little work focused on assisting developers in optimizing such applications on the JVM.

§ Relevance: fork/join parallelism is increasingly popular among developers targeting the JVM. It has been integrated to support parallel processing in the Java library, thread management in JVM languages, and a variety of parallel applications based on Actors, MapReduce, etc.

§ Our proposal: coaching developers towards optimizing fork/join applications by diagnosing performance issues in such applications and suggesting concrete code refactorings to solve them.

§ Expected outcome: in contrast to the manual experimentation often required to tune fork/join applications on the JVM, we devise a tool able to automatically assist developers in optimizing a fork/join application.

slide-3
SLIDE 3

Fork/join Application

§ What is a fork/join application?

[Figure: nested fork and join steps]

solve(Problem problem) {
  if (problem is small)
    directly solve problem sequentially
  else {
    recursively split problem into independent parts:
      fork new tasks to solve each part
      join all forked tasks
  }
}
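The pseudocode above can be sketched with the Java fork/join framework as follows. This is a minimal illustration, not code from the talk: the class name SumTask, the THRESHOLD value and the helper parallelSum are assumptions chosen for the example.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical example task: sums a range of an array with fork/join.
class SumTask extends RecursiveTask<Long> {
    // Assumed cutoff below which the problem counts as "small".
    static final int THRESHOLD = 1_000;

    private final long[] data;
    private final int from;
    private final int to; // exclusive

    SumTask(long[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            // Problem is small: solve it sequentially.
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        // Recursively split the problem into two independent parts.
        int mid = from + (to - from) / 2;
        SumTask left = new SumTask(data, from, mid);
        SumTask right = new SumTask(data, mid, to);
        left.fork();                     // fork a new task for the left part
        long rightSum = right.compute(); // solve the right part in this thread
        return left.join() + rightSum;   // join the forked task
    }

    // Hypothetical helper: runs the task on the common pool.
    static long parallelSum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
    }
}
```

Computing one half directly and forking only the other avoids creating a task whose only job is to wait.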

slide-4
SLIDE 4

The Java Fork/Join Framework

§ The Java fork/join framework [1] is the implementation enabling fork/join applications on the JVM.

§ It implements the work-stealing [2] scheduling strategy:

[Figure: work-stealing diagram. Labels: task submission, push, pop, take, steal, worker thread 1, worker thread 2, deque 1, deque 2, task, CPU with two cores.]

[1] D. Lea. A Java Fork/Join Framework. JAVA 2000.
[2] Burton et al. Executing Functional Programs on a Virtual Tree of Processors. FPCA 1981.
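Work stealing can be observed from application code, as in the sketch below. It is an illustration under assumptions: the class StealDemo, its Node task and the run helper are hypothetical names, and ForkJoinPool.getStealCount() only returns an estimate, so the value varies between runs.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo: builds a binary tree of tasks and reports the steal count.
class StealDemo {
    static final AtomicLong leaves = new AtomicLong();

    static class Node extends RecursiveAction {
        final int depth;

        Node(int depth) {
            this.depth = depth;
        }

        @Override
        protected void compute() {
            if (depth == 0) {
                leaves.incrementAndGet(); // stand-in for real leaf work
                return;
            }
            Node left = new Node(depth - 1);
            Node right = new Node(depth - 1);
            left.fork();     // push the task onto this worker's deque
            right.compute(); // keep working on the other half
            left.join();     // run it locally unless another worker stole it
        }
    }

    // Runs a task tree of the given depth and returns the pool's steal estimate.
    static long run(int parallelism, int depth) {
        leaves.set(0);
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            pool.invoke(new Node(depth));
            return pool.getStealCount();
        } finally {
            pool.shutdown();
        }
    }
}
```

For example, StealDemo.run(2, 10) executes 1024 leaf tasks on two workers; the returned steal estimate is typically small compared with the number of tasks, since most forked tasks are run by the worker that created them.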

slide-5
SLIDE 5

The Java Fork/Join Framework

§ The Java fork/join framework [1] is the implementation enabling fork/join applications on the JVM.

§ It implements the work-stealing [2] scheduling strategy:

[Figure: work-stealing diagram. Labels: task submission, push, pop, take, worker thread 1, worker thread 2, deque 1, deque 2, task, CPU with two cores.]

[1] D. Lea. A Java Fork/Join Framework. JAVA 2000.
[2] Burton et al. Executing Functional Programs on a Virtual Tree of Processors. FPCA 1981.

slide-6
SLIDE 6

The Java Fork/Join Framework

§ Supports parallel processing in the Java library:

  • java.util.Arrays
  • java.util.stream (package)
  • java.util.concurrent.CompletableFuture<T>

§ Supports thread management for other JVM languages:

  • Scala
  • Apache Groovy
  • Clojure

§ Supports diverse fork/join parallelism, including applications based on Actors and MapReduce
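The library entry points listed above can be exercised as in the following sketch. The class LibraryParallelism and its method names are assumptions for illustration. By default these calls are documented to run on the common ForkJoinPool, although small inputs may be processed sequentially.

```java
import java.util.Arrays;
import java.util.concurrent.CompletableFuture;
import java.util.stream.LongStream;

// Hypothetical examples of Java library features backed by the fork/join framework.
class LibraryParallelism {
    // java.util.Arrays: parallel sort of a copy of the input.
    static int[] sortedCopy(int[] input) {
        int[] copy = Arrays.copyOf(input, input.length);
        Arrays.parallelSort(copy);
        return copy;
    }

    // java.util.stream: parallel stream pipeline.
    static long sumOfSquares(long n) {
        return LongStream.rangeClosed(1, n).parallel().map(x -> x * x).sum();
    }

    // CompletableFuture: asynchronous computation without an explicit executor.
    static int asyncAnswer() {
        return CompletableFuture.supplyAsync(() -> 6 * 7).join();
    }
}
```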

slide-7
SLIDE 7

The Java Fork/Join Framework

[3] D. Lea. Concurrent Programming in Java. Second Edition: Design Principles and Patterns. Addison-Wesley Professional, 2nd edition, 1999.

§ Many of the design forces encountered when implementing fork/join designs surround task granularity at four levels [3]:

  • Minimizing overheads
  • Minimizing contention
  • Maximizing parallelism
  • Maximizing locality

slide-8
SLIDE 8

Example of a common performance issue 1/4

§ Suboptimal forking

§ Excessive forking: too fine-grained tasks

✗ Parallelization overheads due to excessive:

  • Deque accesses
  • Object creation/reclaiming

[Figure: a CPU with four cores whose workers repeatedly push, pop and take too fine-grained tasks]

slide-9
SLIDE 9

Example of a common performance issue 2/4

§ Suboptimal forking

§ Sparse forking: few coarse-grained tasks

Missed parallelization opportunities:

  • Low CPU utilization
  • Load imbalance

[Figure: a CPU with four cores where some workers push, pop, take and steal tasks while others stay idle]
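How the sequential threshold drives both forms of suboptimal forking can be sketched by counting the tasks a recursive split creates. The class GranularityDemo and its names are hypothetical. The shared counter exists only to make task creation visible and would itself be a contention point in real code.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo: counts the tasks created for a given sequential threshold.
class GranularityDemo {
    static class RangeTask extends RecursiveAction {
        final int from;
        final int to; // exclusive
        final int threshold;
        final AtomicLong created;

        RangeTask(int from, int to, int threshold, AtomicLong created) {
            this.from = from;
            this.to = to;
            this.threshold = threshold;
            this.created = created;
            created.incrementAndGet(); // one more task object allocated
        }

        @Override
        protected void compute() {
            if (to - from <= threshold) {
                return; // leaf: the sequential work would go here
            }
            int mid = from + (to - from) / 2;
            invokeAll(new RangeTask(from, mid, threshold, created),
                      new RangeTask(mid, to, threshold, created));
        }
    }

    // Returns the total number of tasks created for a range of the given size.
    static long tasksCreated(int size, int threshold) {
        AtomicLong created = new AtomicLong();
        ForkJoinPool.commonPool().invoke(new RangeTask(0, size, threshold, created));
        return created.get();
    }
}
```

For a range of 65536 elements, a threshold of 1 creates 131071 tasks (excessive forking), while a threshold of 16384 creates only 7, of which 4 are leaves (sparse forking on a machine with more than four cores).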

slide-10
SLIDE 10

The problem

Despite the complexities associated with developing and tuning fork/join applications, there is little work focused on assisting developers towards optimizing such applications on the JVM.

The scope:

  • Fork/join applications running in a single JVM
  • A single shared-memory multicore

[Figure: four CPUs, each with four cores, sharing one memory]

slide-11
SLIDE 11

Our Approach

In contrast to manual experimentation used to tune a fork/join application, we propose an approach based on:

  • Profiling techniques
  • Optimization coaching

slide-12
SLIDE 12

Our Approach

In contrast to manual experimentation often used to tune a fork/join application, we propose an approach based on:

  • Profiling techniques: static and dynamic analysis to automatically diagnose performance issues
  • Optimization coaching

slide-13
SLIDE 13

Our Approach

In contrast to manual experimentation often used to tune a fork/join application, we propose an approach based on:

  • Profiling techniques
  • Optimization coaching

§ Static analysis: to automatically inspect the source code to detect fork/join anti-patterns.

§ Dynamic analysis: to automatically diagnose performance issues noticeable at runtime (e.g., suboptimal forking, excessive garbage collection, low CPU usage, contention).

slide-14
SLIDE 14

Our Approach

In contrast to manual experimentation often used to tune a fork/join application, we propose an approach based on:

  • Profiling techniques
  • Optimization coaching

Optimization coaching [4]: processing the output generated by the compiler’s optimizer to suggest concrete code modifications that may enable the compiler to achieve missed optimizations.

[4] St-Amour et al. Optimization Coaching: Optimizers Learn to Communicate with Programmers. OOPSLA 2012.

slide-15
SLIDE 15

Our Approach

In contrast to manual experimentation often used to tune a fork/join application, we propose an approach based on:

  • Profiling techniques
  • Optimization coaching

Inspired by optimization coaching, the goal is automatically suggesting concrete code modifications to solve the detected issues.

slide-16
SLIDE 16

Future Work

§ Methodology for the automatic diagnosing of performance issues:

  • Define a model to characterize fork/join tasks
  • Characterize all tasks spawned by a fork/join application
  • Determine the metrics and entities worth considering to automatically diagnose performance issues

§ Methodology for the automatic suggestion of optimizations:

  • Automatic recognition of fork/join anti-patterns and matching to concrete suggestions to avoid them

§ Validation of the results:

  • Discover fork/join workloads suitable for validating both aforementioned methodologies

slide-17
SLIDE 17

BACKUP SLIDES

slide-18
SLIDE 18

Related Work

§ Analysis of parallel applications on the JVM

§ A number of parallelism profilers focus on the JVM [9][10]

  • Tools named on the slide: JProfiler, YourKit Java Profiler, Java Mission Control, Intel vTune

The goal: characterizing processes or threads over time.

Limitations:

  • None of the existing tools targets fork/join applications.

[9] Adhianto et al. HPCTOOLKIT: Tools for Performance Analysis of Optimized Parallel Programs. Concurr. Comput.: Pract. Exper., 22(6): pp. 685–701, 2010.
[10] Teng et al. THOR: a Performance Analysis Tool for Java Applications Running on Multicore Systems. IBM Journal of Research and Development, 54(5):4:1–4:17, 2010.

slide-19
SLIDE 19

Related Work

§ Assisted optimization of applications

§ “Optimization Coaching” was first coined to describe techniques to optimize Racket [4] and JavaScript [5][6] applications

The goal: report to the developer precise changes in the code that may enable the compiler’s optimizer to achieve missed optimizations.

Limitations:

  • The techniques were not designed for optimizing parallel applications.
  • The prototyped techniques target only specific compilers.

[4] St-Amour et al. Optimization Coaching: Optimizers Learn to Communicate with Programmers. OOPSLA 2012.
[5] St-Amour et al. Optimization Coaching for Javascript. ECOOP 2015.
[6] Gong et al. JITProf: Pinpointing JIT-unfriendly JavaScript Code. ESEC/FSE 2015.
slide-20
SLIDE 20

Related Work

§ Analyses on the use of concurrency on the JVM

§ Documenting fork/join anti-patterns on the JVM [7][8]

The goal: identification of common bad practices and bottlenecks in real fork/join applications.

Limitations:

  • Focus on detecting performance issues by using code inspection (manual code inspection and static analysis).
  • Do not consider the granularity of the tasks spawned by the fork/join application.
  • Do not mentor the developer towards optimizing a fork/join application.

[7] De Wael et al. Fork/Join Parallelism in the Wild: Documenting Patterns and Antipatterns in Java Programs Using the Fork/Join Framework. PPPJ 2014.
[8] Pinto et al. Understanding Parallelism Bottlenecks in ForkJoin Applications. ASE 2017.

slide-21
SLIDE 21

Challenges

§ Automatic detection of performance issues and suggestion of fixes

  • Combination of program-analysis and machine-learning techniques to automatically identify performance problems, to pinpoint them in the source code, and to recommend concrete optimizations.

§ Accurately measure granularity of each task

  • Recursion, fine-grained parallelism, task scheduling, exception handling, auxiliary tasks, etc.

§ Low perturbation in metric collection

  • Use of efficient and reduced instrumentation code.
  • Avoid any heap allocation in the target application.
slide-22
SLIDE 22

Ongoing Research

§ tgp: a Task-Granularity Profiler for multi-threaded, task-parallel applications executing on the JVM [11]

  • Operates as a vertical profiler [12], collecting metrics from the full system stack at runtime to characterize task granularity
  • Shows the impact of task granularity on application and system performance
  • Generates actionable profiles [13]
  • Developers can optimize code portions suggested by tgp

[11] Rosales et al. tgp: a Task-Granularity Profiler for the Java Virtual Machine. APSEC 2017.
[12] M. Hauswirth et al. Vertical Profiling: Understanding the Behavior of Object-oriented Applications. OOPSLA 2004.
[13] Mytkowicz et al. Evaluating the Accuracy of Java Profilers. PLDI 2010.

slide-23
SLIDE 23

Ongoing Research

§ We analyzed task granularity in DaCapo [14] and ScalaBench [15]

§ We revealed fine- and coarse-grained tasks, mainly in Java thread pools, causing performance drawbacks [11]

§ We optimized suboptimal task granularity in pmd and lusearch [16]

§ Speedups up to 1.53x (pmd) and 1.13x (lusearch)

[11] Rosales et al. tgp: a Task-Granularity Profiler for the Java Virtual Machine. APSEC 2017.
[14] Blackburn et al. The DaCapo Benchmarks: Java Benchmarking Development and Analysis. OOPSLA 2006.
[15] Sewe et al. DaCapo con Scala: Design and Analysis of a Scala Benchmark Suite for the JVM. OOPSLA 2011.
[16] Rosà, Rosales and Binder. Analyzing and Optimizing Task Granularity on the JVM. CGO 2018.

slide-24
SLIDE 24

Example of a common performance issue

§ Heavy copying on fork

✗ Parallelization overheads due to:

  • Excessive object creation
  • High memory load
  • Frequent garbage collection

[Figure: a shared data structure is fully copied for each forked task]
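The heavy-copy issue can be contrasted with an index-based alternative, as in the sketch below. Both task classes, the THRESHOLD value and the helper names are assumptions for illustration. The alternative relies on the input array being only read during the computation.

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical demo: copying on fork versus sharing a read-only array.
class CopyOnForkDemo {
    static final int THRESHOLD = 1_000; // assumed sequential cutoff

    static long sequentialSum(int[] a, int from, int to) {
        long sum = 0;
        for (int i = from; i < to; i++) {
            sum += a[i];
        }
        return sum;
    }

    // Anti-pattern sketch: every fork allocates and fills a copy of its half.
    static class CopyingSum extends RecursiveTask<Long> {
        final int[] part;

        CopyingSum(int[] part) {
            this.part = part;
        }

        @Override
        protected Long compute() {
            if (part.length <= THRESHOLD) {
                return sequentialSum(part, 0, part.length);
            }
            int mid = part.length / 2;
            CopyingSum left = new CopyingSum(Arrays.copyOfRange(part, 0, mid));
            CopyingSum right = new CopyingSum(Arrays.copyOfRange(part, mid, part.length));
            left.fork();
            return right.compute() + left.join();
        }
    }

    // Alternative sketch: tasks share the array and only carry index bounds.
    static class IndexedSum extends RecursiveTask<Long> {
        final int[] data;
        final int from;
        final int to; // exclusive

        IndexedSum(int[] data, int from, int to) {
            this.data = data;
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= THRESHOLD) {
                return sequentialSum(data, from, to);
            }
            int mid = from + (to - from) / 2;
            IndexedSum left = new IndexedSum(data, from, mid);
            IndexedSum right = new IndexedSum(data, mid, to);
            left.fork();
            return right.compute() + left.join();
        }
    }

    static long copyingSum(int[] data) {
        return ForkJoinPool.commonPool().invoke(new CopyingSum(data));
    }

    static long indexedSum(int[] data) {
        return ForkJoinPool.commonPool().invoke(new IndexedSum(data, 0, data.length));
    }
}
```

Both variants return the same result; the copying variant additionally allocates roughly one full copy of the input per level of the task tree.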

slide-25
SLIDE 25

Automatic detection of performance issues

§ Static analysis: analysis of source code for detecting fork/join anti-patterns, including:

  • Heavy copy on fork (e.g., detecting the use of methods such as System.arraycopy, subList).
  • Heavy merging on join (e.g., detecting the use of methods such as addAll, putAll).
  • Inappropriate synchronization (e.g., detecting the use of improper synchronization during task execution, for example to wait for the result of another computation).
  • The lack of a sequential threshold (i.e., a threshold which determines whether a task will execute a sequential computation rather than forking parallel child tasks).

slide-26
SLIDE 26

Automatic detection of performance issues

§ Dynamic analysis: analysis of the fork/join application at runtime to deal with polymorphism and reflection along with detecting:

  • Suboptimal forking (i.e., the presence of too fine-grained tasks or few, too coarse-grained tasks).
  • According to the Java fork/join framework documentation: “a task should perform more than 100 and less than 10000 basic computational steps”. [17]

§ Strategy: collection and automatic analysis of metrics from the full system stack at runtime:

  • Framework-level metrics (e.g., the number of already queued tasks via the getSurplusQueuedTaskCount method).
  • JVM-level metrics (e.g., garbage collections)
  • OS-level metrics (e.g., CPU usage, memory load)
  • Hardware Performance Counters (e.g., reference cycles, machine instructions)

[17] https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ForkJoinTask.html
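Two of the metrics named above are reachable through standard JDK APIs, as the sketch below illustrates. It is not the profiler described in this talk: the class AdaptiveSum, the MAX_SURPLUS and MIN_SIZE values and the helper names are assumptions. ForkJoinTask.getSurplusQueuedTaskCount() and GarbageCollectorMXBean.getCollectionCount() are existing JDK methods; the former returns an estimate.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical task that stops forking once enough tasks are already queued.
class AdaptiveSum extends RecursiveTask<Long> {
    static final int MAX_SURPLUS = 3; // assumed small constant
    static final int MIN_SIZE = 64;   // assumed minimum range worth splitting

    private final int[] data;
    private final int from;
    private final int to; // exclusive

    AdaptiveSum(int[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        boolean small = to - from <= MIN_SIZE;
        // Framework-level metric: surplus of locally queued tasks (an estimate).
        boolean enoughQueued = getSurplusQueuedTaskCount() > MAX_SURPLUS;
        if (small || enoughQueued) {
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = from + (to - from) / 2;
        AdaptiveSum left = new AdaptiveSum(data, from, mid);
        AdaptiveSum right = new AdaptiveSum(data, mid, to);
        left.fork();
        long rightSum = right.compute();
        return left.join() + rightSum;
    }

    // JVM-level metric: collections reported so far by all garbage collectors.
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if undefined for this collector
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    static long sum(int[] data) {
        return ForkJoinPool.commonPool().invoke(new AdaptiveSum(data, 0, data.length));
    }
}
```

Sampling totalGcCount() before and after a run gives a rough indication of how much garbage collection the run triggered.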

slide-27
SLIDE 27

Example of a common performance issue

§ Inappropriate sharing

✗ Parallelization overheads due to significant:

  • Thread synchronization
  • Inter-thread communication

[Figure: shared resource]

slide-28
SLIDE 28

Automatic detection of performance issues

§ Dynamic analysis: analysis of the fork/join application at runtime to detect:

  • Inappropriate sharing of resources (e.g., the use of shared objects, locks, files, databases and other resources by several parallel tasks).

§ Strategy: collection and automatic analysis of performance metrics:

  • VM-level metrics (e.g., allocations in Java Heap)
  • OS-level metrics (e.g., context switches, cache misses, page faults)
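Inappropriate sharing can be sketched by contrasting tasks that all update one lock-protected object with tasks that accumulate locally and combine results on join. All class and method names here, and the THRESHOLD value, are assumptions for illustration.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.RecursiveTask;

// Hypothetical demo: a shared, lock-protected total versus per-task results.
class SharingDemo {
    static final int THRESHOLD = 1_000; // assumed sequential cutoff

    static class SharedTotal {
        private long value;

        synchronized void add(long x) {
            value += x;
        }

        synchronized long get() {
            return value;
        }
    }

    // Anti-pattern sketch: every element update takes the same lock.
    static class SharedSum extends RecursiveAction {
        final int[] data;
        final int from;
        final int to; // exclusive
        final SharedTotal total;

        SharedSum(int[] data, int from, int to, SharedTotal total) {
            this.data = data;
            this.from = from;
            this.to = to;
            this.total = total;
        }

        @Override
        protected void compute() {
            if (to - from <= THRESHOLD) {
                for (int i = from; i < to; i++) {
                    total.add(data[i]);
                }
                return;
            }
            int mid = from + (to - from) / 2;
            invokeAll(new SharedSum(data, from, mid, total),
                      new SharedSum(data, mid, to, total));
        }
    }

    // Alternative sketch: accumulate locally, combine when joining.
    static class LocalSum extends RecursiveTask<Long> {
        final int[] data;
        final int from;
        final int to; // exclusive

        LocalSum(int[] data, int from, int to) {
            this.data = data;
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= THRESHOLD) {
                long sum = 0;
                for (int i = from; i < to; i++) {
                    sum += data[i];
                }
                return sum;
            }
            int mid = from + (to - from) / 2;
            LocalSum left = new LocalSum(data, from, mid);
            LocalSum right = new LocalSum(data, mid, to);
            left.fork();
            return right.compute() + left.join();
        }
    }

    static long sharedSum(int[] data) {
        SharedTotal total = new SharedTotal();
        ForkJoinPool.commonPool().invoke(new SharedSum(data, 0, data.length, total));
        return total.get();
    }

    static long localSum(int[] data) {
        return ForkJoinPool.commonPool().invoke(new LocalSum(data, 0, data.length));
    }
}
```

Both variants compute the same total, but in the first one all worker threads serialize on a single monitor for every element.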

slide-29
SLIDE 29

Validation of the results

AutoBench [18], a toolchain combining:

§ code repository crawling,
§ pluggable hybrid analyses, and
§ workload characterization techniques

to discover candidate workloads satisfying the needs of domain-specific benchmarking.

[18] Zheng et al. AutoBench: Finding Workloads that You Need Using Pluggable Hybrid Analyses. SANER 2016.

slide-30
SLIDE 30

Achievements

Publications

§ E. Rosales, A. Rosà, and W. Binder. tgp: a Task-Granularity Profiler for the Java Virtual Machine. 24th Asia-Pacific Software Engineering Conference (APSEC’17), Nanjing, China, December 2017. IEEE Press, ISBN 978-1-5386-3681-7, pp. 570-575.

§ A. Rosà, E. Rosales, and W. Binder. Accurate Reification of Complete Supertype Information for Dynamic Analysis on the JVM. 16th International Conference on Generative Programming: Concepts & Experiences (GPCE’17), Vancouver, Canada, October 2017. ACM Press, ISBN 978-1-4503-5524-7, pp. 104-116.

§ A. Rosà, E. Rosales, and W. Binder. Analyzing and Optimizing Task Granularity on the JVM. International Symposium on Code Generation and Optimization (CGO’18), Vienna, Austria, February 2018. ACM Press, ISBN 978-1-4503-5617-6, pp. 27-37.

§ A. Rosà, E. Rosales, F. Schiavio, and W. Binder. Understanding Task Granularity on the JVM: Profiling, Analysis, and Optimization. Accepted to be presented at the Workshop on Modern Language Runtimes, Ecosystems, and VMs (MoreVMs’18), Nice, France, April 2018.

§ E. Rosales and W. Binder. Optimization Coaching for Fork/Join Applications on the Java Virtual Machine. Accepted to be presented at the 12th EuroSys 2018 Doctoral Workshop (EuroDW’18), Porto, Portugal, April 2018.

slide-31
SLIDE 31

Concepts

§ Reference cycle: a reference cycle elapses at the nominal frequency of the CPU, even if the actual CPU frequency is scaled up or down.

§ Context switch: occurs when the kernel switches the processor from one thread to another, for example, when a thread with a higher priority than the running thread becomes ready.

§ Page fault: a type of exception raised by computer hardware when a running program accesses a memory page that is not currently mapped by the memory management unit (MMU) into the virtual address space of a process.