SLIDE 1

Cut Power Consumption by 5x Without Losing Performance

A big.LITTLE Software Strategy

Klaas van Gend, FAE, Trainer & Consultant

SLIDE 2

2 | October 10, 2014 LINUXCON EUROPE 2014

The mandatory Klaas-in-a-Plane picture

SLIDE 3

Quad Core vs. Dual Core: Why Isn’t It Twice as Fast?
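One common way to reason about the question on this slide is Amdahl's law: only the parallel part of a workload benefits from extra cores. The slide itself does not name this model, so the sketch below is an illustration under that assumption, and the parallel fractions are made-up example values.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Illustrative parallel fractions (assumed, not measured).
for p in (0.5, 0.9, 1.0):
    dual = amdahl_speedup(p, 2)
    quad = amdahl_speedup(p, 4)
    print(f"parallel {p:.0%}: dual {dual:.2f}x, quad {quad:.2f}x, "
          f"quad vs. dual {quad / dual:.2f}x")
```

Under this model a quad core is twice as fast as a dual core only when the whole workload is parallel; at a 50% parallel fraction the gain from dual to quad is about 1.2x.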

SLIDE 4

The GHz race

SLIDE 5

Why GHz++ costs power^2
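A first-order way to see why higher clock rates are so expensive is the standard CMOS dynamic power model, P ≈ C · V² · f. The sketch below is not taken from the talk: the capacitance, voltages and frequencies are hypothetical values, and it assumes the supply voltage has to rise along with the clock.

```python
def dynamic_power(capacitance_f, voltage_v, frequency_hz):
    """First-order CMOS dynamic power in watts: P = C * V^2 * f."""
    return capacitance_f * voltage_v ** 2 * frequency_hz


# Hypothetical operating points (illustrative only, not measured values).
C_EFF = 1e-9  # effective switched capacitance in farads (assumed)
slow = dynamic_power(C_EFF, voltage_v=0.9, frequency_hz=1.0e9)
fast = dynamic_power(C_EFF, voltage_v=1.2, frequency_hz=1.6e9)

print(f"1.0 GHz at 0.9 V: {slow:.2f} W")
print(f"1.6 GHz at 1.2 V: {fast:.2f} W")
print(f"clock x1.6 costs power x{fast / slow:.2f}")
```

With these assumed numbers a 1.6x higher clock costs roughly 2.8x the power, because the voltage term enters squared.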

SLIDE 6

ARM big.LITTLE

“OK, heavy work costs power. Let’s not waste power on light work…”

SLIDE 7

ARM playing it cool: big.LITTLE

Source: http://community.arm.com/groups/processors/blog/2013/06/18/ten-things-to-know-about-biglittle/

SLIDE 8

A7 vs A15

Cortex A7:

  • Less silicon area
  • Less work done per cycle
  • Fewer cycles per second
  • More power efficient
SLIDE 9

How to use big.LITTLE today

Source: http://community.arm.com/groups/processors/blog/2013/06/18/ten-things-to-know-about-biglittle/

SLIDE 10

Some available big.LITTLE hardware

  • Allwinner A80
  • Renesas automotive silicon
  • Samsung Galaxy S4 for the South Korean market
  • Samsung Galaxy S5 for the South Korean market
  • Hardkernel ODROID-XU board

Exynos5

Built-in Power Measurement

SLIDE 11

Use Case:

Chromium

SLIDE 12

Chrome / Chromium / ChromeShell

  • Chromium: open source browser based on KHTML, then WebKit, now Blink
  • Google Chrome: closed-source browser based on Chromium
  • ChromeShell: open source Chromium “browser” for Android
  • Chrome for Android: closed-source browser for Android
SLIDE 13

Chromium workload, visualized

Stages: Loading, Parsing, Layout/Rendering, Painting, JavaScript, Canvas

SLIDE 14

HTML5 Canvas

“graphics device for JavaScript”

SLIDE 16

Parallelizing Canvas

“not as easy as it looks”

SLIDE 17

Canvas Parallelization: Performance Results

Benchmark (on quad-core)   Standard Blink   Parallelized Blink   Performance improvement
Flashcanvas perf           1.69 score       2.44 score           44%
Fc perf w/ alpha           1.04 score       1.52 score           50%
Guimark2 Vector            9.5 fps          13.3 fps             40%
Canvasmark ‘13             3475 score       4116 score           53%
Average improvement                                              47%

With parallelism you can improve performance of even the most complex applications!

SLIDE 18

Google Chrome

SLIDE 19

Google Chrome on Odroid-XU+E

Using Google’s Chrome (version 33 for Android):

  • 2 cores active: 54% and 84%
  • Power use A15+A7 cores: 2.374 Watts
  • Test average: 9.44 fps
SLIDE 20

Our ChromeShell on Odroid-XU+E

Using our optimized ChromeShell:

  • 3 A15 cores active: 59%, 63% and 38%
  • Power use A15+A7 cores: 3.116 Watts
  • Test average: around 14 fps
SLIDE 21

Canvas Parallelization: works even on ‘normal’ silicon like Qualcomm Snapdragon 800

  • LG’s Nexus 5 phone
  • Quad-core Qualcomm Snapdragon 800
  • Phone heating up similarly in both cases

Default Chrome: average 7.12 fps
“Our” ChromeShell: average 14.48 fps

SLIDE 22

Canvas Parallelization: Power Consumption on “Flashcanvas perf”

Benchmark            Standard Blink   Parallelized Blink   Difference
                     on A15+GPU       on quad-A7
No optimization      29 fps           17 fps               -40%
Performance          29 fps           26 fps               -10%
Power consumption    2.2 W            0.4 W                550%
Performance / Watt   13               65                   490%

With parallelism and the right chip choices you can get 5x power savings without losing performance!
SLIDE 23

Comparing performance / watt:

Using Google’s Chrome (version 33 for Android):

  • 2 cores active: 54% and 84%
  • Power use A15+A7 cores: 2.374 Watts
  • Test average: 9.44 fps

Using our optimized ChromeShell:

  • 3 A7 cores active: 73%, 80% and 44%
  • Power use A15+A7 cores: 0.472 Watts
  • Test average: 10.04 fps
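
The comparison above can be reduced to two ratios: how much less power the ChromeShell run draws, and how many more frames it delivers per watt. The sketch below only uses the fps and watt figures quoted on this slide; the function and variable names are illustrative.

```python
def fps_per_watt(fps, watts):
    """Frames per second delivered per watt of CPU power."""
    return fps / watts


# Figures quoted on the slide for the Odroid-XU+E.
chrome_fps, chrome_watts = 9.44, 2.374  # Google Chrome 33, 2 cores active
shell_fps, shell_watts = 10.04, 0.472   # optimized ChromeShell, 3 A7 cores active

chrome_eff = fps_per_watt(chrome_fps, chrome_watts)
shell_eff = fps_per_watt(shell_fps, shell_watts)
power_ratio = chrome_watts / shell_watts
efficiency_gain = shell_eff / chrome_eff

print(f"Chrome:      {chrome_eff:.2f} fps/W")
print(f"ChromeShell: {shell_eff:.2f} fps/W")
print(f"Power reduction: {power_ratio:.2f}x, efficiency gain: {efficiency_gain:.2f}x")
```

With these figures the ChromeShell run draws about 5x less power at a slightly higher frame rate, which is where the "5x" in the talk title comes from.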
SLIDE 24

1x A15 or 4x A7?

SLIDE 25

1x A15 < 4x A7!

20000 MIPS

More than twice the performance
Less power (W)

SLIDE 26

Back to big.LITTLE

Making these results work outside a lab

SLIDE 27

State of big.LITTLE in Linux - 1

What’s in the kernel today?

SLIDE 28

State of big.LITTLE in Linux - 2

What else is relevant?

  • IKS (In-Kernel Switcher)
      – First available in Linaro kernel trees
      – Merged in 3.11 kernel
  • Qualcomm / LG / etc. power daemons
      – Throttle performance if cores overheat
      – Usually “secret”
  • Not-in-mainline schedulers:
      – Linaro’s GTS (Global Task Scheduling), a.k.a. HMP (Heterogeneous Multi-Processing)
  • Kernel Summit 2014 “Energy-Aware Scheduling Workshop”
SLIDE 29

Feedback loop

  • We know when we want to have 4xA7 or 1xA15
  • If we can tell the kernel, it can anticipate
      – instead of noticing an increase in workload
      – and by accident turning on the A15s

Setpoint
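
The talk asks for an API through which a program can tell the kernel which cluster it wants. No such API is described here, so as a blunt stand-in the sketch below uses plain CPU affinity. The CPU numbering (CPUs 0-3 as A7 cores, CPUs 4-7 as A15 cores) and the `choose_cpus` / `apply_hint` helpers are assumptions for illustration only; real boards may number or expose their cores differently.

```python
import os

# Hypothetical CPU numbering: CPUs 0-3 are LITTLE (A7), CPUs 4-7 are big (A15).
LITTLE_CPUS = {0, 1, 2, 3}
BIG_CPUS = {4, 5, 6, 7}


def choose_cpus(workload, available):
    """Map a workload hint to a CPU set: 'burst' -> one big core, 'parallel' -> all LITTLEs."""
    if workload == "burst":
        preferred = set(sorted(BIG_CPUS & set(available))[:1])
    elif workload == "parallel":
        preferred = LITTLE_CPUS & set(available)
    else:
        raise ValueError(f"unknown workload hint: {workload!r}")
    # Fall back to every available CPU if the preferred cluster is absent.
    return preferred or set(available)


def apply_hint(workload):
    """Pin the calling process according to the hint (Linux only)."""
    cpus = choose_cpus(workload, os.sched_getaffinity(0))
    os.sched_setaffinity(0, cpus)
    return cpus
```

Calling `apply_hint("parallel")` before a heavy but parallelizable phase would keep that work on the LITTLE cluster under the assumed numbering, rather than waiting for the kernel to notice the load and wake a big core.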

SLIDE 30

Where to go?

  • Qualcomm MARE: “feedback loop” in user space
      – Research project
      – Framework to aid parallelization
      – Should assist kernel in scheduling/cpufreq
  • Deadline scheduler: “feedback loop” in kernel space
      – Merged in Linux 3.14
      – Application sets SCHED_DEADLINE
      – Application sets scheduling attributes:
          – Task repetition in microseconds
          – Task start within repetition
          – Task completion deadline within repetition
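
As a rough sketch of what "application sets scheduling attributes" involves, the block below mirrors the kernel's `struct sched_attr` with `ctypes` and fills it in for `SCHED_DEADLINE`. It only builds and validates the structure; actually applying it requires the `sched_setattr` system call and suitable privileges, which are left out. The `deadline_attr` helper and the 60 fps example values are assumptions, and the mapping of the slide's terms onto runtime, deadline and period is one reading of the slide. Note that the kernel fields are expressed in nanoseconds, so the helper converts from microseconds.

```python
import ctypes

SCHED_DEADLINE = 6  # policy number from the Linux UAPI headers


class SchedAttr(ctypes.Structure):
    """Mirror of the kernel's struct sched_attr (version 0 layout, 48 bytes)."""
    _fields_ = [
        ("size", ctypes.c_uint32),
        ("sched_policy", ctypes.c_uint32),
        ("sched_flags", ctypes.c_uint64),
        ("sched_nice", ctypes.c_int32),
        ("sched_priority", ctypes.c_uint32),
        ("sched_runtime", ctypes.c_uint64),   # CPU budget per period, in ns
        ("sched_deadline", ctypes.c_uint64),  # relative deadline, in ns
        ("sched_period", ctypes.c_uint64),    # task repetition, in ns
    ]


def deadline_attr(runtime_us, deadline_us, period_us):
    """Build a SCHED_DEADLINE attribute block from microsecond values (hypothetical helper)."""
    # The kernel rejects parameters that violate runtime <= deadline <= period.
    if not 0 < runtime_us <= deadline_us <= period_us:
        raise ValueError("need 0 < runtime <= deadline <= period")
    attr = SchedAttr()
    attr.size = ctypes.sizeof(SchedAttr)
    attr.sched_policy = SCHED_DEADLINE
    attr.sched_runtime = runtime_us * 1000
    attr.sched_deadline = deadline_us * 1000
    attr.sched_period = period_us * 1000
    return attr


# Assumed example: a 60 fps render task needing up to 5 ms of CPU per frame.
attr = deadline_attr(runtime_us=5_000, deadline_us=10_000, period_us=16_667)
print(attr.size, attr.sched_policy, attr.sched_period)
```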

SLIDE 31

Is parallelism going to stay?

Actually, is big.LITTLE going to stay???

  • The GHz race has come to an end
      – Now also for ARM
  • The speed of light limits “clock domain size”
  • Thus many clock islands on a die
      – Multicore is just an “easy” way to improve performance
      – At the cost of the programmer
      – Who needs extra training
  • ARM big.LITTLE
      – Is a mechanism to skip heavy power consumption
      – At the cost of more mm² of silicon
      – Is it worth it???

SLIDE 32

My ideal ARM-based design:

big: 1x A57
LITTLE: 4x A53

Why is no-one designing this chip?

SLIDE 33

Conclusions

SLIDE 34

Conclusion

  • big.LITTLE works
  • IFF
      – Short bursts can be handled by one ‘big’ core
      – Heavier workloads are parallelizable and run on clusters of LITTLEs
      – APIs become available: programs must indicate what the workload will be

BTW: Chromium is parallelizable: we did it.

SLIDE 36

Vector Fabrics – the Company

  • Founded February 2007
  • Founding team
      – Strong in SoC design and multi-core software
      – Currently 15 FTE: 6 PhD, 7 MSc
  • Protected technology
      – 3 patents filed in US & Europe
  • Recognition
      – “Hot Startup” in EE Times Silicon 60, since 2011
      – Selected by Gartner as “Cool vendor in Embedded Systems & Software” 2013
      – Global Semiconductors Alliance award, March 2013

SLIDE 37

Contact Information

  • Web: www.vectorfabrics.com
  • Email: klaas@vectorfabrics.com
  • Tel: +31 40 8200960
  • Address: Vector Fabrics B.V., Vonderweg 22, 5616RM Eindhoven, The Netherlands

SLIDE 38

Thank You!

(drop your business card if you want the slides and the to-be-released whitepaper)

Klaas van Gend, FAE, Trainer & Consultant
klaas@vectorfabrics.com