Cut Power Consumption by 5x Without Losing Performance
A big.LITTLE Software Strategy
Klaas van Gend
FAE, Trainer & Consultant
2 | October 10, 2014 LINUXCON EUROPE 2014
The mandatory Klaas-in-a-Plane picture
Quad Core vs. Dual Core –
Why isn’t it Twice as Fast?
The GHz race
Why GHz++ costs power^2
ARM big.LITTLE
“OK, heavy work costs power. Let’s not waste power on light work…”
ARM playing it cool: big.LITTLE
Source: http://community.arm.com/groups/processors/blog/2013/06/18/ten-things-to-know-about-biglittle/
A7 vs A15
Cortex A7:
- Less silicon area
- Less work per cycle
- Fewer cycles/second
- More power efficient
How to use big.LITTLE today
Source: http://community.arm.com/groups/processors/blog/2013/06/18/ten-things-to-know-about-biglittle/
Some available big.LITTLE hardware
- AllWinner A80
- Renesas automotive silicon
- Samsung Galaxy S4 for South-Korean market
- Samsung Galaxy S5 for South-Korean market
- Hardkernel ODROID-XU board
– Exynos5
– Built-in power measurement
Use Case:
Chromium
Chrome / Chromium / ChromeShell
- Chromium: open source browser based on Blink (formerly WebKit, originally KHTML)
- Google Chrome: closed-source browser based on Chromium
- ChromeShell: open source Chromium “browser” for Android
- Chrome for Android: closed-source browser for Android
Chromium workload
Visualized
Loading, Parsing, Layout/Rendering, Painting, JavaScript, Canvas
HTML5 Canvas
“graphics device for JavaScript”
Parallelizing Canvas
“not as easy as it looks”
Canvas Parallelization -
Performance Results
Benchmark (on quad-core)   Standard Blink   Parallelized Blink   Performance improvement
Flashcanvas perf           1.69 score       2.44 score           44%
Fc perf w/ alpha           1.04 score       1.52 score           50%
Guimark2 Vector            9.5 fps          13.3 fps             40%
Canvasmark ‘13             3475 score       4116 score           53%
Average improvement                                              47%
With parallelism you can improve performance of even the most complex applications!
Google Chrome
Google Chrome on Odroid-XU+E
Using Google’s Chrome (version 33 for Android):
- 2 cores active: 54% and 84%
- Power use A15+A7 cores: 2.374 Watts
- Test average: 9.44 fps
Our ChromeShell on Odroid-XU+E
Using our optimized ChromeShell:
- 3 A15 cores active: 59%, 63% and 38%
- Power use A15+A7 cores: 3.116 Watts
- Test average: around 14 fps
Canvas Parallelization -
works even on ‘normal’ silicon like Qualcomm Snapdragon 800
- LG’s NEXUS 5 phone
- Quad core Qualcomm Snapdragon 800
- Phone heating up similarly in both cases
Default Chrome: average 7.12 fps
“Our” ChromeShell: average 14.48 fps
Canvas Parallelization -
Power Consumption on “Flashcanvas perf”
Benchmark            Standard Blink on A15+GPU   Parallelized Blink on quad-A7   Difference
No optimization      29 fps                      17 fps                          -40%
Performance          29 fps                      26 fps                          -10%
Power consumption    2.2 W                       0.4 W                           550%
Performance / Watt   13                          65                              490%
With parallelism and the right chip choices
- you can get 5x power savings
- without losing performance!
Comparing performance / watt:
Using Google’s Chrome (version 33 for Android):
- 2 cores active: 54% and 84%
- Power use A15+A7 cores: 2.374 Watts
- Test average: 9.44 fps
Using our optimized ChromeShell:
- 3 A7 cores active: 73%, 80% and 44%
- Power use A15+A7 cores: 0.472 Watts
- Test average: 10.04 fps
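The two measurements above can be turned into a performance-per-watt comparison with simple arithmetic. The Python sketch below is only an illustration: the function and variable names are made up here, and the inputs are the fps and Watt figures quoted on this slide.

```python
def fps_per_watt(fps, watts):
    """Frames per second delivered per Watt consumed."""
    if watts <= 0:
        raise ValueError("power must be positive")
    return fps / watts


# Figures quoted above (Odroid-XU+E, power use of the A15+A7 cores).
chrome = fps_per_watt(9.44, 2.374)        # Google Chrome 33 for Android
chromeshell = fps_per_watt(10.04, 0.472)  # optimized ChromeShell on A7 cores

print(f"Chrome:          {chrome:.2f} fps/W")
print(f"ChromeShell:     {chromeshell:.2f} fps/W")
print(f"Efficiency gain: {chromeshell / chrome:.1f}x")
print(f"Power reduction: {2.374 / 0.472:.1f}x")
```

With these inputs the power draw drops by about a factor of five while the frame rate stays slightly higher, which is the "5x" claim of the talk.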
1x A15 or 4x A7?
1x A15 < 4x A7 !
20000 MIPS
More than twice the performance, at fewer Watts
Back to big.LITTLE
Making these results work outside a lab
State of big.LITTLE in Linux - 1
What’s in the kernel today?
State of big.LITTLE in Linux - 2
What else is relevant?
- IKS (In-Kernel Switcher)
– First available in Linaro kernel trees
– Merged in 3.11 kernel
- Qualcomm / LG / etc. power daemons
– Throttle performance if cores overheat
– Usually “secret”
- Not-in-mainline schedulers:
– Linaro’s GTS (Global Task Scheduling),
– a.k.a. HMP (Heterogeneous Multi-Processing)
- Kernel Summit 2014 “Energy-Aware Scheduling Workshop”
Feedback loop
- We know when we want to have 4xA7 or 1xA15
- If we can tell the kernel, it can anticipate
– instead of noticing an increase in workload
– and by accident turning on the A15s
Setpoint
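The slides do not name an API for passing this hint to the kernel. As a crude stand-in, user space on Linux can already restrict a process to a set of CPUs through `sched_setaffinity`, which Python exposes as `os.sched_setaffinity`. The sketch below assumes a hypothetical CPU numbering (CPUs 0-3 are A7s, 4-7 are A15s); the real numbering differs per board, and on cluster-switching setups only one cluster is visible at a time. `pick_cpus` and `pin_current_process` are illustrative names, not an existing interface.

```python
import os

# Assumption: hypothetical CPU numbering, check your own board.
LITTLE_CPUS = {0, 1, 2, 3}   # e.g. 4x Cortex-A7
BIG_CPUS = {4, 5, 6, 7}      # e.g. 4x Cortex-A15


def pick_cpus(parallel_tasks, online):
    """Prefer the LITTLE cluster for parallel work, a big core otherwise."""
    wanted = LITTLE_CPUS if parallel_tasks > 1 else BIG_CPUS
    usable = wanted & set(online)
    # Fall back to whatever is available if the preferred cluster is absent.
    return usable if usable else set(online)


def pin_current_process(parallel_tasks):
    """Restrict the calling process to the chosen cluster (Linux only)."""
    online = os.sched_getaffinity(0)
    cpus = pick_cpus(parallel_tasks, online)
    os.sched_setaffinity(0, cpus)
    return cpus
```

Pinning is static and bypasses the scheduler rather than informing it, so it only approximates the anticipating feedback loop described above.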
Where to go?
- Qualcomm MARE
– Research project
– Framework to aid parallelization
– Should assist kernel in scheduling/cpufreq
– “Feedback loop” (in user space)
- Deadline scheduler
– Merged in Linux 3.14
– Application sets SCHED_DEADLINE
– Application sets scheduling attributes:
  – Task repetition in microseconds
  – Task start within repetition
  – Task completion deadline within repetition
– “Feedback loop” (in kernel space)
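For reference, the `sched_setattr(2)` system call that carries these attributes takes three values, `sched_runtime`, `sched_deadline` and `sched_period`, expressed in nanoseconds, and requires runtime <= deadline <= period. Python's standard library has no wrapper for this call, so the sketch below only models and validates the parameters. The class name `DeadlineParams`, its microsecond fields and the example numbers are assumptions made for illustration.

```python
from dataclasses import dataclass

NS_PER_US = 1_000


@dataclass(frozen=True)
class DeadlineParams:
    runtime_us: int   # CPU time needed per repetition
    deadline_us: int  # completion deadline, relative to the start of a repetition
    period_us: int    # repetition interval

    def validate(self):
        # Ordering constraint documented for SCHED_DEADLINE tasks.
        if not (0 < self.runtime_us <= self.deadline_us <= self.period_us):
            raise ValueError("need 0 < runtime <= deadline <= period")

    def utilization(self):
        """Fraction of one CPU this task asks for."""
        return self.runtime_us / self.period_us

    def as_nanoseconds(self):
        """Values in the unit and field names used by struct sched_attr."""
        return {
            "sched_runtime": self.runtime_us * NS_PER_US,
            "sched_deadline": self.deadline_us * NS_PER_US,
            "sched_period": self.period_us * NS_PER_US,
        }


# Hypothetical example: a frame task needing 5 ms of CPU every 16.667 ms.
frame = DeadlineParams(runtime_us=5_000, deadline_us=10_000, period_us=16_667)
frame.validate()
print(frame.as_nanoseconds())
print(f"CPU share requested: {frame.utilization():.0%}")
```

Because the task declares its needs up front, the kernel knows the upcoming load instead of having to infer it from past behaviour, which is the point of the kernel-space "feedback loop".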
Is parallelism going to stay?
Actually, is big.LITTLE going to stay???
- The GHz race has come to an end
– Now also for ARM
- The speed of light limits “clock domain size”
- Thus many clock islands on a die
– Multicore is just an “easy” way to improve performance
– At the cost of the programmer
– Who needs extra training
- ARM big.LITTLE
– Is a mechanism to skip heavy power consumption
– At the cost of more mm² of silicon
– Is it worth it???
My ideal ARM-based design:
big: 1x A57
LITTLE: 4x A53
Why is no-one designing this chip?
Conclusions
Conclusion
- big.LITTLE works
- IFF
– Short bursts can be handled by one ‘big’ core
– Heavier workloads are parallelizable and run on clusters of LITTLEs
– APIs become available: programs must indicate what the workload will be
BTW: Chromium is parallelizable – we did it.
Vector Fabrics – the Company
- Founded February 2007
- Founding team
– Strong in SoC design and multi-core software
– Currently 15 FTE: 6 PhD, 7 MSc
- Protected technology
– 3 patents filed in US & Europe
- Recognition
– “Hot Startup” in EE Times Silicon 60, since 2011
– Selected by Gartner as “Cool vendor in Embedded Systems & Software” 2013
– Global Semiconductors Alliance award, March 2013
Contact Information
- Web: www.vectorfabrics.com
- Email: klaas@vectorfabrics.com
- Tel: +31 40 8200960
- Address: Vector Fabrics B.V., Vonderweg 22, 5616RM Eindhoven, The Netherlands
Thank You!
(drop your business card if you want the slides and the to-be-released whitepaper)