Optimization Through Recomputation in the Polyhedral Model


slide-1
SLIDE 1

Optimization Through Recomputation in the Polyhedral Model

By Mike Jongen, Luc Waeijen, Roel Jordans, Lech Jóźwiak, Henk Corporaal.


slide-2
SLIDE 2

Contents

  • Introduction
  • Related work
  • Optimizing Through Recompute
  • Polyhedral modeling
  • Experimental Results
  • Conclusion and future work


slide-3
SLIDE 3

Introduction

slide-4
SLIDE 4

Introduction

  • (Mobile) systems increasingly use artificial neural networks

      – Artificial vision
      – Image processing
      – Speech recognition

  • These networks perform a large number of data accesses
  • Data access behaviour can be improved by code transformations

slide-5
SLIDE 5

Current possibilities and extensions

  • Tiling
  • Fusion
  • Distribution
  • Recomputation/overlapped tiling

      – Allows for better parallelism
      – Reduces memory traffic

slide-6
SLIDE 6

This paper

  • An example CNN application which includes recompute
  • Extension of Polly
  • Demonstration of the effectiveness of recomputation


slide-7
SLIDE 7

Related Work

slide-8
SLIDE 8

Automated polyhedral optimization frameworks

  • Greatly reduce the effort of translating the original network description into an optimized form
  • Automatically verify the validity of the transformations
  • Different options: Polly, R-Stream-TF, and PPCG
  • None of these frameworks provides a method of including recomputation in the optimization space

slide-9
SLIDE 9

Why do we use Polly

  • Uses the Polyhedral model for optimizations
  • Direct integration with the LLVM compiler framework
  • Adjustable

      – Add extra functionality
      – User defined schedules
      – Automate the process

slide-10
SLIDE 10

Optimizing Through Recompute

slide-11
SLIDE 11

System Architecture

[Figure: system architecture with a processor, a local memory, and a global memory]

slide-12
SLIDE 12

Educational Example


slide-13
SLIDE 13

Inter Tile Reuse

[Figure: stored part of the intermediate image]

slide-14
SLIDE 14

Inter Tile Reuse

[Figure: stored part of the intermediate image]

slide-15
SLIDE 15

Inter Tile Reuse


slide-16
SLIDE 16

Inter Tile Reuse


slide-17
SLIDE 17

Other Dimensions


slide-18
SLIDE 18

Methods to handle overlap

  • Store the overlap globally
  • Store the overlap locally
  • Recompute the overlap


slide-19
SLIDE 19

Global Method

  • Pixels are stored externally
  • Small buffer size
  • Expensive memory accesses
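The global method can be illustrated with a small sketch. Everything in it is an assumption made for illustration and not taken from the slides: a 1D signal instead of an image, two fused 3-tap stages standing in for CNN layers, and a dictionary named global_memory modelling the external memory. Only the traffic caused by the overlap is counted.

```python
def stage(values):
    """3-tap sliding sum, a stand-in for one convolution layer."""
    return [values[i] + values[i + 1] + values[i + 2] for i in range(len(values) - 2)]


def tiled_with_global_overlap(image, tile):
    """Two fused stages; overlapping intermediate values go through global memory."""
    out_len = len(image) - 4          # two 3-tap stages shrink the signal by 4
    global_memory = {}                # hypothetical model of the external memory
    global_accesses = 0
    output = []
    for start in range(0, out_len, tile):
        stop = min(start + tile, out_len)
        mid = {}                      # intermediate values needed by this tile
        for i in range(start, stop + 2):
            if i in global_memory:    # overlap produced by the previous tile
                mid[i] = global_memory[i]
                global_accesses += 1  # one global read
            else:
                mid[i] = image[i] + image[i + 1] + image[i + 2]
        if stop < out_len:            # the next tile needs the last two values
            for i in (stop, stop + 1):
                global_memory[i] = mid[i]
                global_accesses += 1  # one global write
        output.extend(mid[i] + mid[i + 1] + mid[i + 2] for i in range(start, stop))
    return output, global_accesses


image = list(range(20))
result, accesses = tiled_with_global_overlap(image, tile=4)
assert result == stage(stage(image))  # same values as the untiled pipeline
```

In this toy setting the only local state is the current tile, which matches the small buffer size named above, while every tile boundary costs two global writes and two global reads.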


slide-20
SLIDE 20

Local Method

  • Pixels are stored locally
  • Larger buffers required
  • Cheaper accesses
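A minimal sketch of the local method, under assumptions that are not from the slides: a 1D signal, two fused 3-tap stages standing in for CNN layers, and a list named carried that models the extra local buffer holding the overlap between consecutive tiles.

```python
def stage(values):
    """3-tap sliding sum, a stand-in for one convolution layer."""
    return [values[i] + values[i + 1] + values[i + 2] for i in range(len(values) - 2)]


def tiled_with_local_buffer(image, tile):
    """Two fused stages; the overlap stays in a local buffer between tiles."""
    out_len = len(image) - 4      # two 3-tap stages shrink the signal by 4
    carried = []                  # hypothetical local buffer for the overlap
    intermediate_evals = 0
    output = []
    for start in range(0, out_len, tile):
        stop = min(start + tile, out_len)
        # The tile needs intermediates [start, stop + 2); the first
        # len(carried) of them are already available locally.
        first_new = start + len(carried)
        new_mid = [image[i] + image[i + 1] + image[i + 2]
                   for i in range(first_new, stop + 2)]
        intermediate_evals += len(new_mid)
        mid = carried + new_mid
        output.extend(mid[i] + mid[i + 1] + mid[i + 2] for i in range(stop - start))
        carried = mid[-2:]        # keep the overlap for the next tile
    return output, intermediate_evals


image = list(range(20))
result, evals = tiled_with_local_buffer(image, tile=4)
assert result == stage(stage(image))  # same values as the untiled pipeline
```

Each intermediate value is computed once and no overlap traffic reaches global memory, at the price of the carried buffer. In 2D the equivalent buffer would have to hold overlap rows or columns, which is where the larger buffer requirement comes from.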


slide-21
SLIDE 21

Recomputation Method

  • Recomputes the pixels
  • No extra memory required
  • No extra accesses required
  • More computations are required
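The recomputation method can be sketched in the same spirit. The setup is an assumption for illustration and not taken from the slides: a 1D signal and two fused 3-tap stages standing in for CNN layers. Each tile recomputes every intermediate value it needs, including the ones a neighbouring tile also computes.

```python
def stage(values):
    """3-tap sliding sum, a stand-in for one convolution layer."""
    return [values[i] + values[i + 1] + values[i + 2] for i in range(len(values) - 2)]


def tiled_with_recompute(image, tile):
    """Two fused stages; every tile recomputes the overlap it needs."""
    out_len = len(image) - 4      # two 3-tap stages shrink the signal by 4
    intermediate_evals = 0
    output = []
    for start in range(0, out_len, tile):
        stop = min(start + tile, out_len)
        # Outputs [start, stop) need intermediates [start, stop + 2),
        # which in turn need inputs [start, stop + 4).
        local_mid = stage(image[start:stop + 4])
        intermediate_evals += len(local_mid)
        output.extend(stage(local_mid))
    return output, intermediate_evals


image = list(range(20))
result, evals = tiled_with_recompute(image, tile=4)
assert result == stage(stage(image))  # same values as the untiled pipeline
```

For this 20-element input and a tile size of 4, the sketch performs 24 intermediate evaluations where the untiled pipeline needs 18: two extra evaluations per tile boundary, with no state carried between tiles and no overlap traffic.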


slide-22
SLIDE 22

Recomputation Tradeoffs


slide-23
SLIDE 23

Recomputation Tradeoffs


Storing the overlap

slide-24
SLIDE 24

Recomputation Tradeoffs


Storing the overlap

slide-25
SLIDE 25

Recomputation Tradeoffs


Storing the overlap

slide-26
SLIDE 26

Recomputation Tradeoffs


Storing the overlap

slide-27
SLIDE 27

Recomputation Tradeoffs


Storing the overlap

slide-28
SLIDE 28

Recomputation Tradeoffs


Recomputing the overlap

slide-29
SLIDE 29

Recomputation Tradeoffs


Recomputing the overlap

slide-30
SLIDE 30

Recomputation Tradeoffs


Recomputing the overlap

slide-31
SLIDE 31

Recomputation Tradeoffs


Recomputing the overlap

slide-32
SLIDE 32

Recomputation Tradeoffs


Recomputing the overlap

slide-33
SLIDE 33

Recomputation Tradeoffs

[Figure: recomputing the overlap compared with storing the overlap]

slide-34
SLIDE 34

Polyhedral Modeling

slide-35
SLIDE 35

The Polyhedral Model and Recomputation

  • Execution order is defined by the schedule
  • Schedule is singular valued

      – One execution time per statement instance
      – One statement instance per execution time

  • Recomputation:

      – Statement instances are executed multiple times
      – Non-singular valued schedules are required
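The two properties of a singular valued schedule can be checked on a schedule written as a plain list of (statement instance, execution time) pairs. This representation and the helper name is_singular_valued are assumptions for illustration; Polly and isl represent schedules as integer relations instead. The example pairs are one possible reading of a small three-statement schedule in which Stmt[1] is executed twice.

```python
from collections import defaultdict


def is_singular_valued(schedule):
    """Check both properties on a list of (statement, time) pairs."""
    times_of = defaultdict(set)   # statement instance -> execution times
    stmts_at = defaultdict(set)   # execution time -> statement instances
    for stmt, time in schedule:
        times_of[stmt].add(time)
        stmts_at[time].add(stmt)
    one_time_per_stmt = all(len(times) == 1 for times in times_of.values())
    one_stmt_per_time = all(len(stmts) == 1 for stmts in stmts_at.values())
    return one_time_per_stmt and one_stmt_per_time


# Hypothetical example data: Stmt[1] appears at two execution times.
plain = [("Stmt[0]", (0, 0)), ("Stmt[1]", (0, 1)), ("Stmt[2]", (1, 1))]
with_recompute = plain + [("Stmt[1]", (1, 0))]

assert is_singular_valued(plain)
assert not is_singular_valued(with_recompute)
```

The second schedule fails the check only because Stmt[1] has two execution times, which is exactly what recomputation asks for.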


slide-36
SLIDE 36

Including Recomputation

  • Support for non-singular valued schedules
  • Transforming non-singular valued schedules to singular valued schedules

slide-37
SLIDE 37

Example

[Figure: statements Stmt[0], Stmt[1], Stmt[2] and schedule times [0, 0], [0, 1], [1, 0], [1, 1]]

slide-38
SLIDE 38

Example

[Figure: old schedule relating Stmt[0], Stmt[1], Stmt[2] to the times [0, 0], [0, 1], [1, 0], [1, 1]]

slide-39
SLIDE 39

Example

[Figure: old schedule relating Stmt[0], Stmt[1], Stmt[2] to the times [0, 0], [0, 1], [1, 0], [1, 1]]

[Figure: lexicographical minimum relating Stmt[0], Stmt[1], Stmt[2] to the times [0, 0], [0, 1], [1, 1]]

slide-40
SLIDE 40

Example

[Figure: old schedule relating Stmt[0], Stmt[1], Stmt[2] to the times [0, 0], [0, 1], [1, 0], [1, 1]]

[Figure: lexicographical minimum relating Stmt[0], Stmt[1], Stmt[2] to the times [0, 0], [0, 1], [1, 1]]

slide-41
SLIDE 41

Example

[Figure: old schedule relating Stmt[0], Stmt[1], Stmt[2] to the times [0, 0], [0, 1], [1, 0], [1, 1]]

[Figure: lexicographical minimum relating Stmt[0], Stmt[1], Stmt[2] to the times [0, 0], [0, 1], [1, 1]]

slide-42
SLIDE 42

Example

[Figure: lexicographical minimum relating Stmt[0], Stmt[1], Stmt[2] to the times [0, 0], [0, 1], [1, 1]]

[Figure: rest of schedule relating Stmt[1] to the time [1, 0]]

slide-43
SLIDE 43

Example

[Figure: new schedule relating Stmt[0, 0], Stmt[1, 0], Stmt[2, 0] to the times [0, 0], [0, 1], [1, 1]]

[Figure: rest of schedule relating Stmt[1] to the time [1, 0]]

slide-44
SLIDE 44

Example

[Figure: new schedule relating Stmt[0, 0], Stmt[1, 0], Stmt[2, 0] to the times [0, 0], [0, 1], [1, 1]]

[Figure: lexicographical minimum relating Stmt[1] to the time [1, 0]]

slide-45
SLIDE 45

Example

[Figure: new schedule relating Stmt[0, 0], Stmt[1, 0], Stmt[2, 0] to the times [0, 0], [0, 1], [1, 1]]

[Figure: lexicographical minimum relating Stmt[1] to the time [1, 0]]

slide-46
SLIDE 46

Example

[Figure: new schedule relating Stmt[0, 0], Stmt[1, 0], Stmt[2, 0] to the times [0, 0], [0, 1], [1, 1]]

[Figure: lexicographical minimum relating Stmt[1] to the time [1, 0]]

slide-47
SLIDE 47

Example

[Figure: lexicographical minimum relating Stmt[1] to the time [1, 0]]

[Figure: new schedule relating Stmt[0, 0], Stmt[1, 0], Stmt[2, 0], Stmt[1, 1] to the times [0, 0], [0, 1], [1, 1], [1, 0]]
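The steps of this example can be sketched as a small procedure. It is a reading of the slide labels (old schedule, lexicographical minimum, rest of schedule, new schedule), not the Polly extension itself. The list-of-pairs representation, the function name split_into_singular_valued, and the assumed pairing of Stmt[1] with both [0, 1] and [1, 0] are illustrative assumptions. The idea: repeatedly peel off the lexicographically smallest execution time of every statement instance, and give each peeled layer an extra copy index so that every (statement, copy) pair has exactly one time.

```python
def split_into_singular_valued(schedule):
    """Turn (statement, time) pairs into ((statement, copy), time) pairs."""
    rest = set(schedule)
    new_schedule = []
    copy = 0
    while rest:
        # Lexicographical minimum: earliest remaining time per statement.
        # Python compares tuples lexicographically.
        lexmin = {}
        for stmt, time in rest:
            if stmt not in lexmin or time < lexmin[stmt]:
                lexmin[stmt] = time
        # Move this layer into the new schedule under the current copy index.
        for stmt, time in lexmin.items():
            new_schedule.append(((stmt, copy), time))
            rest.discard((stmt, time))
        copy += 1
    return sorted(new_schedule, key=lambda pair: pair[1])


# Assumed old schedule: statement 1 is executed at [0, 1] and again at [1, 0].
old_schedule = [(0, (0, 0)), (1, (0, 1)), (1, (1, 0)), (2, (1, 1))]
new_schedule = split_into_singular_valued(old_schedule)
for (stmt, copy), time in new_schedule:
    print(f"Stmt[{stmt}, {copy}] -> {list(time)}")
```

With the assumed input, the second execution of statement 1 becomes the separate instance Stmt[1, 1], while the first layer becomes Stmt[0, 0], Stmt[1, 0], and Stmt[2, 0], matching the instance names in the new schedule above.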

slide-48
SLIDE 48

Including Recomputation: location


slide-49
SLIDE 49

Jscop Implementation


Conv[i0,i1,i2,i3] → [i0,i1,i2,i3]

slide-50
SLIDE 50

Jscop Implementation

Conv[i0,i1,i2,i3] → [t0,i1,t1,i2,i3] : 0 <= t0 < no_tiles and 0 <= t1 < tilesize and i0 = tilesize * t0 + t1

slide-51
SLIDE 51

Jscop Implementation

Conv[i0,i1,i2,i3] → [t0,i1,t1,i2,i3] : 0 <= t0 < no_tiles and 0 <= t1 < tilesize + overlap and i0 = tilesize * t0 + t1
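The effect of the overlap term can be seen by enumerating the relation for small parameter values. The sketch below is an illustration under assumptions: it looks only at the tiled dimension i0, treats i1, i2, and i3 as passing through unchanged, assumes the extent of i0 is a multiple of tilesize, and uses the hypothetical helper names tiled_times and recomputed.

```python
from collections import Counter


def tiled_times(extent, tilesize, overlap):
    """Enumerate (i0, (t0, t1)) pairs with i0 = tilesize * t0 + t1."""
    no_tiles = extent // tilesize      # assumption: extent is a multiple of tilesize
    pairs = []
    for t0 in range(no_tiles):
        for t1 in range(tilesize + overlap):
            i0 = tilesize * t0 + t1
            if i0 < extent:            # stay inside the iteration domain of Conv
                pairs.append((i0, (t0, t1)))
    return pairs


def recomputed(extent, tilesize, overlap):
    """Return the i0 values that are scheduled more than once."""
    counts = Counter(i0 for i0, _ in tiled_times(extent, tilesize, overlap))
    return sorted(i0 for i0, n in counts.items() if n > 1)


assert recomputed(extent=8, tilesize=4, overlap=0) == []      # plain tiling
assert recomputed(extent=8, tilesize=4, overlap=2) == [4, 5]  # overlapped tiling
```

Without the overlap term every i0 gets exactly one (t0, t1) pair. With an overlap of 2, iterations 4 and 5 are scheduled both at the end of tile 0 and at the start of tile 1, so the schedule is no longer singular valued.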

slide-52
SLIDE 52

Dependencies

[Figure: dependencies, with panels labelled Before, After, and OR]

slide-53
SLIDE 53

Experimental Results

slide-54
SLIDE 54


Results for different tile sizes

slide-55
SLIDE 55


Results for different tile sizes

slide-56
SLIDE 56


Results for different tile sizes and several kernel sizes

slide-57
SLIDE 57

Conclusion and Future Work

slide-58
SLIDE 58

Conclusion

  • An example CNN application which includes recompute
  • Extension of Polly
  • Demonstration of the effectiveness of recomputation


slide-59
SLIDE 59

Future Work

  • Legality Checks
  • Model of the effects
  • More applications


slide-60
SLIDE 60

And Finally…

  • Questions?
  • Remarks?
  • Suggestions?
