SLIDE 1

Satisfying Dataflow Programs Constraints on Multicore Architectures

Citi lab PhD day 2014
Manuel Selva
Supervisor: Lionel Morel
Director: Stéphane Frénot
Bull: Frédéric Soinne
27th March 2014

1 / 14

SLIDE 2

More and more parallelism

Nehalem multicore die

(source: www.intel.com)

Multiprocessor motherboard

(source: www.bit-tech.net)

Apple/ARM dual core

(source: www.cultofmac.com)

How to program?

  • Threads (Java, C + Pthreads)
  • Annotations to sequential code (OpenMP)
  • Dataflow

2 / 14

SLIDE 7

What is dataflow programming, and why use it?

[Figure: (a) H264 decoding graph (Parser, Text.Y, Text.U, Text.V, Mot.Y, Mot.U, Mot.V, Merger, Display), with the Text.Y actor shown replicated three times; (b) LTE-Adv decoding graph]

  • Different kinds of parallelism: task, pipeline, and data
  • Actors exchanging data only through FIFO channels
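
The actor/FIFO model above can be illustrated with a minimal sketch. It is not the runtime used in this work: it assumes one Python thread per actor, `queue.Queue` as the FIFO channel, and a hypothetical `SENTINEL` token marking the end of the stream.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker (assumption of this sketch)


def actor(fn, inbox, outbox):
    """Run one actor: read a token from its input FIFO, apply fn, write the result."""
    while True:
        token = inbox.get()
        if token is SENTINEL:
            outbox.put(SENTINEL)  # forward end-of-stream to the next actor
            return
        outbox.put(fn(token))


def run_pipeline(tokens, stages):
    """Chain the stages with FIFO channels and run each stage in its own thread."""
    channels = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=actor, args=(fn, channels[i], channels[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for token in tokens:
        channels[0].put(token)
    channels[0].put(SENTINEL)
    results = []
    while True:
        out = channels[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results


# Two hypothetical actors: the second can process token n while the
# first already works on token n + 1 (pipeline parallelism).
result = run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 10])
```

Because actors share nothing but their channels, the same graph could be run with processes or pinned threads without changing the actor code.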

3 / 14

SLIDE 8

What is dataflow programming, and why use it?

[Figure: (a) H264 decoding graph (Parser, Text.Y, Text.U, Text.V, Mot.Y, Mot.U, Mot.V, Merger, Display), annotated with the values 256, 256, 256 and 768; (b) LTE-Adv decoding graph]

  • Different kinds of parallelism: task, pipeline, and data
  • Actors exchanging data only through FIFO channels

Many models

[Figure: range of dataflow models trading static analyses against expressiveness, from SDF to DDF]
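
One static analysis that SDF allows is solving the balance equations: for every channel, tokens produced per graph iteration must equal tokens consumed. The sketch below computes the smallest repetition vector for a connected graph. The function name, the edge format and the two-actor example are assumptions of this sketch; the numbers 256 and 768 are borrowed from the figure only as sample rates.

```python
from fractions import Fraction
from math import lcm  # multi-argument lcm needs Python 3.9+


def repetition_vector(actors, edges):
    """Solve q[src] * produced == q[dst] * consumed for every edge.

    edges is a list of (src, dst, produced, consumed) tuples.
    Returns the smallest positive integer firing counts per actor.
    """
    rate = {actors[0]: Fraction(1)}
    pending = list(edges)
    while pending:
        progressed = False
        for edge in list(pending):
            src, dst, produced, consumed = edge
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * produced / consumed
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * consumed / produced
            elif src in rate and dst in rate:
                if rate[src] * produced != rate[dst] * consumed:
                    raise ValueError("inconsistent rates: no bounded schedule")
            else:
                continue  # neither endpoint known yet, retry later
            pending.remove(edge)
            progressed = True
        if not progressed:
            raise ValueError("graph is not connected")
    if set(actors) - set(rate):
        raise ValueError("graph is not connected")
    scale = lcm(*(r.denominator for r in rate.values()))
    return {a: int(r * scale) for a, r in rate.items()}


# Hypothetical channel: A produces 256 tokens per firing, B consumes 768.
q = repetition_vector(["A", "B"], [("A", "B", 256, 768)])
```

Here `A` must fire three times for each firing of `B`. More expressive models such as DDF give up this kind of compile-time guarantee.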

3 / 14

SLIDE 10

How to execute dataflow programs?

  • Compilation to synchronized tasks respecting data dependencies
  • Mapping of tasks and channels to hardware

[Figure: the DF Compiler turns the actors into tasks (Task 1 Parser, Task 2 Text.Y, Task 3 Text.U, Task 4 Text.V, Task 5 Mot.Y, Task 6 Mot.U, Task 7 Mot.V, Task 8 Merger, Task 9 Display) connected by channels C1 to C10; the DF Mapper places tasks T1 to T9 and channels C1 to C10 on a dual-socket processor (Core1 to Core4 with RAM 1, Core5 to Core8 with RAM 2)]
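
What a mapping decides can be sketched as two tables: task to core, and channel to memory node. The policy below is deliberately naive and is not the DF Mapper of this work. It assumes round-robin task placement and puts each channel in the RAM closest to its consumer; the channel endpoints in the example are hypothetical, since the slides do not list them.

```python
CORES_PER_NODE = 4  # machine from the figure: Core1-4 near RAM 1, Core5-8 near RAM 2


def node_of(core):
    """Return the memory node (1 or 2) attached to a core numbered from 1."""
    return (core - 1) // CORES_PER_NODE + 1


def map_program(tasks, channels, n_cores=8):
    """Naive mapping: tasks round-robin over cores, channels next to their consumer.

    channels maps a channel name to a (producer_task, consumer_task) pair.
    """
    task_core = {task: i % n_cores + 1 for i, task in enumerate(tasks)}
    channel_ram = {
        name: node_of(task_core[consumer])
        for name, (producer, consumer) in channels.items()
    }
    return task_core, channel_ram


tasks = [f"T{i}" for i in range(1, 10)]
channels = {"C1": ("T1", "T2"), "C2": ("T5", "T8")}  # hypothetical endpoints
task_core, channel_ram = map_program(tasks, channels)
```

With nine tasks and eight cores, at least one core hosts two tasks, which is exactly where static choices may need revisiting at runtime.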

4 / 14

SLIDE 11

Proposition

Motivations

  • DF applications with throughput constraints
  • A mapping that satisfies the constraints requires knowing:
      • Actors' internal execution times
      • Concurrently running applications
      • DF actors' consumption/production rates

Goals

  • Compile time: extend the DF language
  • Runtime: monitor the application and resources, adapt static choices

5 / 14

SLIDE 13

Contrib I - Language/compiler extensions

  • Language extensions taken into account in the compilation flow [9, 10]

[Figure: H264 graph with a 25 f/s throughput constraint; each of the nine actors is annotated with a derived throughput, th1 to th9]

  • Propagate this value in SDF languages
  • Determine each actor's acceptable execution time
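
The propagation can be sketched as follows, under strong assumptions that are not stated on the slide: one graph iteration yields one frame, every actor runs alone on its core, and communication cost is ignored. The function names and the repetition counts in the example are hypothetical.

```python
from fractions import Fraction


def actor_throughputs(constraint, repetitions):
    """Firings per second each actor must sustain to meet the graph constraint.

    constraint is the required number of graph iterations per second
    (for example 25 for 25 f/s); repetitions is the SDF repetition vector.
    """
    return {actor: constraint * count for actor, count in repetitions.items()}


def firing_budgets(constraint, repetitions):
    """Upper bound on the duration of one firing, in seconds, for each actor."""
    return {
        actor: Fraction(1, th)
        for actor, th in actor_throughputs(constraint, repetitions).items()
    }


# Hypothetical repetition vector for part of the H264 graph.
repetitions = {"Parser": 1, "Text.Y": 4, "Display": 1}
budgets = firing_budgets(25, repetitions)
```

With these sample counts, `Parser` has 1/25 s (40 ms) per firing and `Text.Y` 1/100 s (10 ms). An actor that exceeds its budget is a candidate bottleneck.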

6 / 14

SLIDE 15

Contrib II - Fine grain monitoring

[Figure: H264 graph with the 25 f/s constraint, compiled by the DF Compiler into Tasks 1 to 9; Task 8 (Merger) is highlighted]

Merger instrumented to measure throughput
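
A minimal sketch of such instrumentation, assuming the actor calls a hypothetical `tick()` each time it emits a frame and that throughput is averaged over a sliding window. The class name, the window size and the injectable clock are choices of this sketch, not of the generated code.

```python
import time
from collections import deque


class ThroughputMonitor:
    """Estimate frames per second from the timestamps of the last frames."""

    def __init__(self, window=25, clock=time.perf_counter):
        # window + 1 timestamps delimit window inter-frame intervals
        self._stamps = deque(maxlen=window + 1)
        self._clock = clock

    def tick(self):
        """Record that one frame has just been produced."""
        self._stamps.append(self._clock())

    def rate(self):
        """Return frames per second over the window, or None if unknown."""
        if len(self._stamps) < 2:
            return None
        elapsed = self._stamps[-1] - self._stamps[0]
        if elapsed <= 0:
            return None
        return (len(self._stamps) - 1) / elapsed


monitor = ThroughputMonitor()
monitor.tick()  # would be called at the end of each Merger firing
```

Comparing `monitor.rate()` with the 25 f/s constraint tells the runtime whether an adaptation is needed.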

7 / 14

SLIDE 16

Contrib II - Fine grain monitoring

[Figure: H264 graph with the 25 f/s constraint, compiled by the DF Compiler into Tasks 1 to 9; all nine tasks are highlighted]

Tasks instrumented to measure actors' execution times
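
Per-actor timing could look like the sketch below, which wraps an actor's firing function and records each duration. The decorator, the `exec_times` table and the `merger_fire` body are hypothetical; a compiler would more likely inline equivalent probes in the generated task code.

```python
import time
from collections import defaultdict
from functools import wraps

exec_times = defaultdict(list)  # actor name -> list of firing durations (seconds)


def timed(actor_name):
    """Decorator recording the wall-clock duration of every firing of an actor."""
    def decorate(fire):
        @wraps(fire)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fire(*args, **kwargs)
            finally:
                exec_times[actor_name].append(time.perf_counter() - start)
        return wrapper
    return decorate


@timed("Merger")
def merger_fire(y, u, v):
    """Hypothetical Merger firing: combine the three planes into one frame."""
    return (y, u, v)


frame = merger_fire("Y", "U", "V")
```

The recorded durations can then be compared with each actor's acceptable execution time.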

7 / 14

SLIDE 18

Contrib II - Fine grain monitoring

[Figure: H264 graph with the 25 f/s constraint, compiled into Tasks 1 to 9 by the DF Compiler and placed by the DF Mapper on a dual-socket processor (Core1 to Core4 with RAM 1, Core5 to Core8 with RAM 2), with tasks T1 to T9 and channels C1 to C10]

Memory monitoring using the PMU:

  • RAM controllers load
  • QPI traffic

Conclusions

  • Are we facing core load imbalance?
  • Are the actors too slow because of memory latencies?

7 / 14

SLIDE 21

Contrib III - Dataflow adaptations

  • CPU load balancing
  • Memory load balancing

[Figure: H264 graph compiled into Tasks 1 to 9 by the DF Compiler and placed by the DF Mapper on a dual-socket processor (Core1 to Core4 with RAM 1, Core5 to Core8 with RAM 2), with tasks T1 to T9 and channels C1 to C10]
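
As an illustration of CPU load balancing, the sketch below redistributes tasks from measured loads using the classic greedy "longest processing time first" heuristic. This is a stand-in chosen for brevity, not the adaptation algorithm of this work, and the task loads in the example are invented.

```python
import heapq


def balance(task_load, n_cores):
    """Greedy rebalancing: heaviest task first, always onto the least loaded core.

    task_load maps a task name to its measured load (for example seconds of
    CPU time per graph iteration). Returns a task -> core placement.
    """
    heap = [(0.0, core) for core in range(1, n_cores + 1)]
    heapq.heapify(heap)
    placement = {}
    for task, load in sorted(task_load.items(), key=lambda kv: -kv[1]):
        total, core = heapq.heappop(heap)  # least loaded core so far
        placement[task] = core
        heapq.heappush(heap, (total + load, core))
    return placement


# Invented loads for four tasks on two cores.
placement = balance({"T1": 4, "T2": 3, "T3": 2, "T4": 1}, n_cores=2)
```

In this example both cores end up with a total load of 5. Memory load balancing would additionally have to decide in which RAM each channel lives.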

8 / 14

SLIDE 22

Conclusion

Dynamic framework for DF programs, taking advantage of DF information

  • Applicative monitoring
  • Hardware monitoring
  • Runtime adaptations

Current work

  • Finishing the implementation in StreamIt

State of the art

  • DF compilation [7, 13, 4]
  • DF theoretical throughput analysis [3, 11]
  • DF adaptation [12, 8, 5, 1]
  • Non-DF NUMA adaptations [6, 2]

9 / 14

SLIDE 23

[1] CHOI, Y., LI, C.-H., SILVA, D. D., BIVENS, A., AND SCHENFELD, E. Adaptive task duplication using on-line bottleneck detection for streaming applications. In Proceedings of the 9th Conference on Computing Frontiers (New York, NY, USA, 2012), CF ’12, ACM, pp. 163–172.

[2] DASHTI, M., FEDOROVA, A., FUNSTON, J., GAUD, F., LACHAIZE, R., LEPERS, B., QUÉMA, V., AND ROTH, M. Traffic management: A holistic approach to memory placement on NUMA systems. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems (New York, NY, USA, 2013), ASPLOS ’13, ACM, pp. 381–394.

[3] GHAMARIAN, A.-H., GEILEN, M. C. W., STUIJK, S., BASTEN, T., MOONEN, A. J. M., BEKOOIJ, M., THEELEN, B., AND MOUSAVI, M. Throughput analysis of synchronous data flow graphs. In Application of Concurrency to System Design, 2006. ACSD 2006. Sixth International Conference on (2006), pp. 25–36.

[4] GORDON, M. I. Compiler Techniques for Scalable Performance of Stream Programs. PhD thesis, MIT, 2010.

[5] HORMATI, A. H., CHOI, Y., KUDLUR, M., RABBAH, R., MUDGE, T., AND MAHLKE, S. Flextream: Adaptive compilation of streaming applications for heterogeneous architectures. In Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques (2009), pp. 214–223.

[6] LACHAIZE, R., LEPERS, B., AND QUÉMA, V. Memprof: A memory profiler for NUMA multicore systems. In Proceedings of the 2012 USENIX Conference on Annual Technical Conference (Berkeley, CA, USA, 2012), USENIX ATC’12, USENIX Association, pp. 5–5.

[7] LEE, E. A., AND MESSERSCHMITT, D. Synchronous data flow. Proceedings of the IEEE 75, 9 (Sept. 1987), 1235–1245.

[8] MIN, C., AND EOM, Y. I. Danbi: Dynamic scheduling of irregular stream programs for many-core systems. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques (Piscataway, NJ, USA, 2013), PACT ’13, IEEE Press, pp. 189–200.

[9] SELVA, M., MOREL, L., MARQUET, K., AND FRÉNOT, S. Extending dataflow programs with throughput properties. In Proceedings of the First International Workshop on Many-core Embedded Systems (New York, NY, USA, 2013), MES ’13, ACM, pp. 54–57.

[10] SELVA, M., MOREL, L., MARQUET, K., AND FRÉNOT, S. QoS monitoring system for dataflow programs. In Proceedings of the Conférence d’informatique en Parallélisme, Architecture et Système (ComPAS) (2013), CFSE track.

[11] STUIJK, S., BASTEN, T., GEILEN, M. C. W., AND CORPORAAL, H. Multiprocessor resource allocation for throughput-constrained synchronous dataflow graphs. In Design Automation Conference, 2007. DAC ’07. 44th ACM/IEEE (2007), pp. 777–782.

[12] TAN, C. A hybrid static/dynamic approach to scheduling stream programs, 2009.

[13] THIES, W. Language and compiler support for stream programs. PhD thesis, MIT, 2009.

14 / 14