Profiling a warehouse-scale computer (Svilen Kanev, Harvard University)




SLIDE 1

Profiling a warehouse-scale computer

Svilen Kanev (Harvard University), Juan Pablo Darago (Universidad de Buenos Aires), Kim Hazelwood (Yahoo Labs), Parthasarathy Ranganathan and Tipp Moseley (Google Inc.), Gu-Yeon Wei and David Brooks (Harvard University)

SLIDE 2

The cloud is here to stay

[http://google.com/trends, 2015]


SLIDE 3

Warehouse-scale computers (of yore)

Datacenters built around a few “killer workloads”; problem sizes >> 1 machine



Distributed, but tightly interconnected services; communication through remote-procedure calls (RPCs)

SLIDE 4

Now “the datacenter is the computer”

(the WSC model has caught on)


“microservice architecture” thousands of services are “one RPC away”

“... about a hundred of services that comprise Siri’s backend...”

[Apple, Mesos meetup 2015]

Did you mean: #pldi15 — frequency[“#isca15”]++

SLIDE 5

How do modern WSC applications interact with hardware?

And what does that imply for future server processors?

SLIDE 6

Traditional profiling: load testing


Isolate a service
Find representative inputs
Find a representative operating point
Profile / optimize
Repeat
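The traditional per-service loop above can be sketched as follows (a toy outline: the callback names and stub values are illustrative, not a real load-testing API):

```python
def load_test(services, find_inputs, find_operating_point, profile):
    """Traditional load testing: isolate each service, drive it with
    representative inputs at a representative load, profile, repeat."""
    results = {}
    for service in services:                       # isolate one service
        inputs = find_inputs(service)              # representative inputs
        qps = find_operating_point(service)        # representative load level
        results[service] = profile(service, inputs, qps)  # profile / optimize
    return results                                 # then repeat per service

# Toy usage with stub callbacks standing in for real benchmarking machinery
stats = load_test(
    ["search", "ads"],
    find_inputs=lambda s: [f"{s}-query"],
    find_operating_point=lambda s: 1000,
    profile=lambda s, inputs, qps: {"qps": qps, "inputs": len(inputs)},
)
```

The point of the slide is the cost of this loop: each service needs its own isolation, input selection, and tuning, which is exactly what datacenter-scale profiling avoids.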

SLIDE 7

Live datacenter-scale profiling

(Google-wide profiling)

Select random production machines (~20,000 / day); results feed the GWP database

[Ren et al. Google-wide profiling, 2010]


Profile each one (for a while): without isolation, while running live traffic for billions of users
Aggregate days, weeks, years worth of execution
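In outline, the GWP-style pipeline looks like this (a toy sketch under stated assumptions: the fleet list, `profile_machine` callback, and sample counts are stand-ins, not GWP's actual interfaces):

```python
import random
from collections import Counter

def gwp_day(fleet, profile_machine, samples_per_day=20_000):
    """One day of datacenter-scale sampling: pick random production
    machines, profile each briefly under live traffic, and fold the
    samples into a fleet-wide cycle histogram (the 'GWP DB')."""
    daily = Counter()
    for machine in random.sample(fleet, min(samples_per_day, len(fleet))):
        daily.update(profile_machine(machine))  # {routine: cycle samples}
    return daily

# Toy fleet where every machine reports the same tiny profile
fleet = [f"machine-{i}" for i in range(100)]
agg = gwp_day(fleet, lambda m: {"memcpy": 2, "rpc_send": 1},
              samples_per_day=50)
```

Aggregating such daily histograms over weeks and years is what turns brief, low-overhead per-machine samples into statistically meaningful fleet-wide profiles.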

SLIDE 8

Live WSC profiling insights


Where are cycles spent in a datacenter?
Are there really no killer applications?
How do WSC applications interact with instruction caches?
How much ILP is there? Big / small cores?
DRAM latency vs. bandwidth? Hyperthreading?

SLIDE 9

Where are WSC cycles spent?

SLIDE 10

No “killer” application to optimize for


Instead: a long tail of many different services

[1 week of sampled WSC cycles]

SLIDE 11

Ongoing application diversification


[~3 years of sampled WSC cycles]

Optimizing hardware one-application-at-a-time has diminishing returns

SLIDE 12

Within applications: no hotspots

Corollary: hunting for per-application hotspots is not justified


[search leaf node; 1 week of cycles]

SLIDE 13

Hotspots across applications: “datacenter tax”

Shared low-level routines, typical for larger-than-1-server problems


SLIDE 14

Hotspots across applications: “datacenter tax”

Prime candidates for accelerators in server SoCs


Only 6 self-contained routines account for ~30% of WSC cycles
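The cross-application aggregation that surfaces such routines can be sketched as follows (a toy illustration: the function names and cycle counts below are invented, not the paper's actual six routines or their measured fractions):

```python
from collections import Counter

def tax_candidates(per_app_profiles, top_n=6):
    """Aggregate per-application cycle profiles and rank the routines that
    are hot *across* applications -- the "datacenter tax" candidates."""
    total = Counter()
    for profile in per_app_profiles.values():
        total.update(profile)                 # sum cycles per routine
    grand = sum(total.values())
    # Fraction of all sampled cycles attributed to each top routine
    return [(fn, cycles / grand) for fn, cycles in total.most_common(top_n)]

# Made-up per-application profiles (routine -> sampled cycles)
profiles = {
    "search": {"protobuf": 30, "memcpy": 20, "leaf_logic": 50},
    "ads":    {"protobuf": 25, "rpc": 25, "app_code": 50},
}
top = tax_candidates(profiles, top_n=2)
```

No single application makes `protobuf` a hotspot here, yet it dominates the fleet-wide ranking; that is the shape of the tax argument.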

SLIDE 15

Live WSC profiling insights


Where are cycles spent in a datacenter? Everywhere.
Are there really no killer applications? Datacenter tax.
How do WSC applications interact with instruction caches?
How much ILP is there? Big / small cores?
DRAM latency vs. bandwidth? Hyperthreading?

SLIDE 16

Microarchitecture: WSC i-cache pressure

SLIDE 17

Severe instruction cache bottlenecks

15-30% of core cycles wasted on instruction-supply stalls


20,000 Intel Ivy Bridge servers, 2 days, Top-Down analysis [Yasin 2014]
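The Top-Down level-1 breakdown [Yasin 2014] attributes each micro-op issue slot to one of four categories; "frontend bound" is the instruction-supply-stall fraction quoted above. A minimal sketch of the arithmetic (the counter values in the usage line are made up, not measured data):

```python
def topdown_level1(slots_width, clocks, fetch_bubbles,
                   uops_issued, uops_retired, recovery_cycles):
    """Top-Down level-1 slot accounting for an out-of-order core that can
    issue `slots_width` uops per cycle (4 on Ivy Bridge)."""
    slots = slots_width * clocks
    frontend = fetch_bubbles / slots            # slots starved of uops
    bad_spec = (uops_issued - uops_retired
                + slots_width * recovery_cycles) / slots  # wasted on misspeculation
    retiring = uops_retired / slots             # useful work
    backend = 1.0 - frontend - bad_spec - retiring  # remainder: backend stalls
    return {"frontend": frontend, "bad_spec": bad_spec,
            "retiring": retiring, "backend": backend}

# Invented counter values on a 4-wide core, 1000 cycles
b = topdown_level1(4, 1000, fetch_bubbles=800, uops_issued=2500,
                   uops_retired=2400, recovery_cycles=25)
```

With these toy numbers, 20% of issue slots are frontend bound, which is the kind of figure the slide reports for WSC workloads.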

SLIDE 18

Severe instruction cache bottlenecks

Fetching instructions from L3 caches
Very high i-cache miss rates: 10x the highest in SPEC, 50% higher than CloudSuite
15-30% of core cycles wasted on instruction-supply stalls
Lots of lukewarm code: 100s of MBs of instructions per binary; no hotspots


SLIDE 19

A problem in the making

I-cache working sets 4-5x larger than the largest in SPEC
Growing almost 30% / year, significantly faster than i-caches
One solution: L2 i/d partitioning
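A quick back-of-the-envelope on that growth rate (a sketch: the ~30%/year figure is from the slide; the starting size and horizon below are assumptions for illustration):

```python
import math

# At ~30% compound annual growth, working sets double roughly every
# log(2)/log(1.3) years -- much faster than L1 i-cache capacities have grown.
doubling_years = math.log(2) / math.log(1.3)

def working_set_kb(initial_kb, years, growth=0.30):
    """Project an i-cache working set under compound annual growth."""
    return initial_kb * (1 + growth) ** years

# Illustrative: a 128 KB working set today, projected 5 years out
projected = working_set_kb(128, 5)
```

The doubling time comes out around 2.6 years, which is the mechanism behind the slide's "problem in the making" framing.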


SLIDE 20

Live WSC profiling insights


Where are cycles spent in a datacenter? Everywhere.
Are there really no killer applications? Datacenter tax.
How do WSC applications interact with instruction caches? Poorly.
How much ILP is there? Big / small cores? Bimodal.
DRAM latency vs. bandwidth? Latency.
Hyperthreading? Yes.

SLIDE 21

To sum up

A growing number of programs cover “the world’s WSC cycles”. There is no “killer application”, and hand-optimizing each program is suboptimal. Low-level routines (the “datacenter tax”) are a surprisingly high fraction of cycles, and good candidates for accelerators in future server processors. Common microarchitectural footprint: working sets too large for i-caches; many d-cache stalls; generally low IPC; bimodal ILP; low memory bandwidth utilization.