Challenging the Intel Xeon: ARM and OpenPower



slide-1
SLIDE 1

Challenging the Intel Xeon: ARM and OpenPower

Now you really have to optimize

slide-2
SLIDE 2

Mighty Intel …

  • “Intel had a 99.2 percent market share in server chips” (IDC, 2015; quoted on InfoWorld)
  • “We started experimenting with SoCs two years ago. … didn't work well because the single-thread performance was too low, resulting in higher latency for our web platform” – Facebook Engineering

slide-3
SLIDE 3

…sits solid on the Throne

  • Best & most mature process technology in the world

– 14 nm FinFET tri-gate (2014)

  • Power management the competition can only dream of
  • Richest software ecosystem
slide-4
SLIDE 4

Sizing Servers

  • Established in 2006 at Howest*, funded by the Flemish government since 2007
  • 4 – 6 FTE (2007-2016)
  • 2 – 3 trainees
  • Specialized in independent performance optimization research

* Howest = Technical University in West-Flanders (Kortrijk, Belgium)

slide-5
SLIDE 5

March 2016

IWT VIS TR 135096

slide-6
SLIDE 6

March 2012

  • Java performance

– +60% for Xeon E5 v1
– +19% for Xeon E5 v4

  • OLTP

– +51% for Xeon E5 v1
– +19% for Xeon E5 v4

slide-7
SLIDE 7

Recognize this one?

  • Moore’s law
  • “… were shrinking so fast that every year twice as many could fit onto a chip.”
  • 1975: “adjusted the pace to a doubling every two years”
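
To make the two paces concrete, here is a minimal Python sketch (not from the slides; the function name `growth_factor` is made up for illustration) comparing a doubling every year with a doubling every two years over one decade.

```python
def growth_factor(years: float, doubling_period_years: float) -> float:
    """Multiplication factor after `years` if a quantity doubles
    every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)


if __name__ == "__main__":
    decade = 10
    print(growth_factor(decade, 1))  # doubling every year: 1024.0
    print(growth_factor(decade, 2))  # doubling every two years: 32.0
```

Over ten years the original pace gives a factor of 1024, the adjusted 1975 pace a factor of 32.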

slide-8
SLIDE 8

There is Moore

  • CPU processing power per dollar
  • DRAM & NAND: price per megabit

– a 35% per year reduction in price

  • Also drives the Cloud / Internet
  • “Google will do anything to beat Moore’s law”
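
As a rough illustration of what a 35% per year price reduction compounds to, the sketch below (not from the slides; function names are hypothetical, and the rate is assumed constant year over year) computes the implied halving time and the remaining price after a number of years.

```python
import math


def halving_time_years(annual_reduction: float) -> float:
    """Years until the price halves at a constant annual reduction rate."""
    return math.log(0.5) / math.log(1.0 - annual_reduction)


def price_after(years: float, annual_reduction: float, start: float = 1.0) -> float:
    """Price after `years` of compounding the same annual reduction."""
    return start * (1.0 - annual_reduction) ** years


if __name__ == "__main__":
    print(f"halving time: {halving_time_years(0.35):.2f} years")
    print(f"price after 10 years: {price_after(10, 0.35):.3%} of the start price")
```

Under that assumption the price halves roughly every 1.6 years and falls to a little over 1% of its starting value in a decade.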

slide-9
SLIDE 9

MOORE'S LAW IS “SILICON VALLEY'S BEATING HEART”

slide-10
SLIDE 10

The Thermal Wall: 2004

slide-11
SLIDE 11
slide-12
SLIDE 12

A few examples today

Product line    Cores  Clock (GHz)  Year  Name          Process  Min die size (mm²)  Power (W)  Power density (W/cm²)

Historical reference points
Pentium 4           1  3.8          2004  "Prescott"    90 nm    112                 115        103
Pentium 3           1  1            1999  "Coppermine"  180 nm   106                 29         27

Today
Core i7-6xxx        4  4            2016  "Skylake"     14 nm    122                 91         75
Xeon E5             8  3.4          2016  "Broadwell"   14 nm    246                 140        57
Core i7 4xxx        4  4            2014  "Haswell"     22 nm    177                 88         50

GPUs
GeForce 1000     3584  1.6          2016  "Pascal"      16 nm    520                 300        58
GeForce 800      2880  0.9          2014  "Kepler"      28 nm    571                 250        44
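
The power density column can be reproduced from the other two. The sketch below assumes die size is in mm² and power in W (the slide gives no units; the assumption is supported by the fact that power / die size × 100 matches every listed value as W/cm²). The function name is made up for illustration.

```python
# (name, die size in mm^2, power in W, power density as listed on the slide)
CHIPS = [
    ("Pentium 4 Prescott", 112, 115, 103),
    ("Pentium 3 Coppermine", 106, 29, 27),
    ("Core i7-6xxx", 122, 91, 75),
    ("Xeon E5", 246, 140, 57),
    ("Core i7 4xxx", 177, 88, 50),
    ("GeForce 1000", 520, 300, 58),
    ("GeForce 800", 571, 250, 44),
]


def power_density_w_per_cm2(power_w: float, die_mm2: float) -> float:
    """Power density in W/cm^2, given power in W and die area in mm^2."""
    return power_w / die_mm2 * 100.0  # 1 cm^2 = 100 mm^2


if __name__ == "__main__":
    for name, die_mm2, power_w, listed in CHIPS:
        computed = power_density_w_per_cm2(power_w, die_mm2)
        print(f"{name:22s} {computed:6.1f} W/cm^2 (slide: {listed})")
```

Read this way, the 2004 Pentium 4 still has the highest power density in the table, which fits the "thermal wall" framing.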

slide-13
SLIDE 13

A bumpy road

  • 90 nm (2004): strained silicon (35% faster switching)
  • 45 nm (2008): “high-k dielectric” – reduced leakage
  • 22 nm (2012): “Tri-gate” (reduces both switching and leakage power)

– Research started in 2002!!

  • THE WALL: the photolithography process uses light with a 193 nanometre wavelength

– EUV (13.5 nm)

slide-14
SLIDE 14

2013

  • Still optimistic
  • Intel, AMD, TSMC, GlobalFoundries, and IBM => “Moore’s Law Roadmap”

slide-15
SLIDE 15

2016

  • 10 nm: postponed to late 2017
  • 7 nm: big question mark!
  • NO more silicon, but Indium Gallium Arsenide (InGaAs) at 7 nm
  • Nanotubes? Graphene?

slide-16
SLIDE 16
  • 4% loss per generation!

slide-17
SLIDE 17

Problem: big data gets brains

  • Data gets too complex for humans to analyze

slide-18
SLIDE 18

And Now?

  • Field Programmable Gate Array (FPGA)
  • ASICs (Application-Specific ICs)
  • Graphics Processing Unit (GPU)
  • MIC (Many Integrated Core)


slide-19
SLIDE 19


slide-20
SLIDE 20

EVOLVING MARKET, NEW PLAYERS

The market has changed too

slide-21
SLIDE 21

Total Market: something has changed

slide-22
SLIDE 22
slide-23
SLIDE 23

Cavium Thunder-X

  • First 64-bit ARM server vs “mid range” Xeon E5
  • 48 “simple 2 IPC” cores @ 2 GHz @ 120 W

– Single-thread performance is 3-5x lower

  • 28 nm technology
  • Gigabyte servers
slide-24
SLIDE 24
slide-25
SLIDE 25
slide-26
SLIDE 26

Software ecosystem

  • No Java Native Access libraries
  • Spark crashes with a machine language message
  • MySQL, LAMP, most Java applications work

slide-27
SLIDE 27
slide-28
SLIDE 28
slide-29
SLIDE 29

Performance / watt

slide-30
SLIDE 30
slide-31
SLIDE 31

Conclusion ARMv8 (64)

  • Niche-oriented Cavium Thunder-X
  • Future chips from Qualcomm, Cavium (maybe Avago/Broadcom)
  • AMD & AppliedMicro not competitive (yet??)
  • A few big customers:

– PayPal (VPN, firewall, some web services)
– Already conquering the Chinese market (HiSilicon, Huawei)

  • Fragmented market
  • Still immature ecosystem:

– JNA & ElasticSearch, Spark

slide-32
SLIDE 32

OPENPOWER

slide-33
SLIDE 33

POWER8 disadvantages

  • Very power hungry: 10 cores @ 190 W TDP + memory buffers (60-80 W) vs 22 cores @ 145 W Xeon
  • JNA not supported
  • Some software still a bit unoptimized (MySQL)
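
The power comparison can be put on a per-core footing with simple arithmetic. This is a back-of-the-envelope sketch using only the figures quoted on this slide; the function name is hypothetical, and TDP per core says nothing about performance per watt (a POWER8 core runs far more threads than a Xeon core).

```python
def watts_per_core(cores: int, tdp_w: float, extra_w: float = 0.0) -> float:
    """Rated power (TDP plus any extra components) divided by core count."""
    return (tdp_w + extra_w) / cores


if __name__ == "__main__":
    # POWER8: 10 cores @ 190 W TDP plus 60-80 W for memory buffers
    power8_low = watts_per_core(10, 190, 60)
    power8_high = watts_per_core(10, 190, 80)
    # Xeon: 22 cores @ 145 W TDP
    xeon = watts_per_core(22, 145)
    print(f"POWER8: {power8_low:.1f}-{power8_high:.1f} W per core")
    print(f"Xeon:   {xeon:.1f} W per core")
```

With these inputs the POWER8 lands at 25-27 W per core against roughly 6.6 W per core for the Xeon.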
slide-34
SLIDE 34

When OpenPOWER makes sense

  • Based upon the most complex core on the market (8 threads, 8 IPC, 3.5+ GHz)
  • (Some) pricing competitive with HP/Dell
  • 32 DIMM slots per CPU (Intel: 12)
  • Open from firmware to software
  • Google & Rackspace have a new OpenPOWER server
  • Some software runs as fast as on the best Xeons (MongoDB, PostgreSQL)
  • Software ecosystem has grown fast…
slide-35
SLIDE 35

OpenPower Ecosystem

slide-36
SLIDE 36

IBM: first integrator of NVLink

slide-37
SLIDE 37

“Deep Learning” P100

slide-38
SLIDE 38

Page Migration Engine & POWER8 with NVLink

  • Far easier to create new applications on Tesla P100
  • NVIDIA Page Migration Engine ensures unified memory space
  • Unified memory: address space spans CPU and GPU, 1TB+
  • Hardware managed transfers: eliminates explicit data transfers
  • Testing program implementing these advantages

– POWER8 with NVLink ensures speedy data throughput

  • 1TB memory space requires faster CPU:GPU data movement
  • Bus masks transfer times

– Close code-base to parallel CPU code

Barriers to Entry Removed

  • Too large a memory space required
  • Too complicated to move data
  • Moves too much data
  • Too much custom coding for GPU data movement
  • Software UVM feature too limiting
  • Requires page faulting support

slide-39
SLIDE 39

Percona MySQL 5.7

slide-40
SLIDE 40

SPARK TESTING

Few Large or many small nodes?

slide-41
SLIDE 41

Our test

  • 300 GB GZIP “Common Crawl” Web archives
  • Body text extracted by “BoilerPipe”
  • Natural Language Processing (Stanford)
  • Aggregate: Group by & Sort entity counts
  • Generate recommendations with Alternating Least Squares
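
The "group by & sort entity counts" step can be illustrated on a single machine. The sketch below is plain Python, not the Spark job used in the test; the function name and the sample entity list are hypothetical.

```python
from collections import Counter


def top_entities(entities, n=3):
    """Group identical entity strings, count them, and return the n most frequent."""
    counts = Counter(entities)    # "group by" entity and count occurrences
    return counts.most_common(n)  # sorted by count, highest first


if __name__ == "__main__":
    # Hypothetical NLP output: one entry per entity mention
    mentions = ["Intel", "ARM", "Intel", "IBM", "Intel", "ARM"]
    print(top_entities(mentions))  # [('Intel', 3), ('ARM', 2), ('IBM', 1)]
```

In a distributed setting the same aggregation requires a shuffle across nodes, which is one reason the few-large versus many-small node question matters.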


slide-42
SLIDE 42

Realtime in-memory processing with Spark

slide-43
SLIDE 43

Spark Optimization

  • Number of virtual cores per executor (JVM):

– 1 per 2 logical cores (Intel: 1, IBM: 4)

  • Number of executors = number of physical cores - 1
  • spark.default.parallelism = approx. 1.5-2 tasks per executor
  • GC threads = 1 per virtual core per executor
  • Speed-up = 10-20%
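
One possible reading of these rules of thumb, expressed as a small Python helper. The sketch makes several assumptions that are not on the slide: "1 per 2 logical cores" is taken to mean half the hardware threads of one physical core (2 threads on Intel gives 1, 8 threads on POWER8 gives 4), "GCThreads" is mapped to the JVM flag `-XX:ParallelGCThreads`, and the helper name `spark_settings` is made up. The property names themselves are standard Spark configuration keys.

```python
def spark_settings(physical_cores: int, threads_per_core: int,
                   tasks_per_executor: float = 2.0) -> dict:
    """Derive Spark settings from the rules of thumb (one interpretation)."""
    # 1 virtual core per 2 logical cores of a physical core
    executor_cores = max(1, threads_per_core // 2)
    # number of executors = number of physical cores - 1
    executors = max(1, physical_cores - 1)
    # roughly 1.5-2 tasks per executor
    parallelism = round(tasks_per_executor * executors)
    return {
        "spark.executor.instances": str(executors),
        "spark.executor.cores": str(executor_cores),
        "spark.default.parallelism": str(parallelism),
        # assumption: one GC thread per virtual core of the executor
        "spark.executor.extraJavaOptions": f"-XX:ParallelGCThreads={executor_cores}",
    }


if __name__ == "__main__":
    # 22-core Xeon with 2 threads per core, 10-core POWER8 with 8 threads per core
    for label, cores, smt in [("Xeon", 22, 2), ("POWER8", 10, 8)]:
        print(label, spark_settings(cores, smt))
```

The resulting values would be passed as `--conf key=value` pairs to `spark-submit` or set in `spark-defaults.conf`; they are starting points for tuning, not measured optima.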
slide-44
SLIDE 44
  • 20% gain per generation

slide-45
SLIDE 45
slide-46
SLIDE 46

Conclusions so far

  • Moore’s law is dead: opportunity for niche players
  • OpenPower has some tangible advantages
  • Next generation of ARM servers should be watched
  • New innovations …

– Combining streaming, sensor data & static data
– Deep learning

  • … will require much more tuning & specialized chips
slide-47
SLIDE 47

Rate My Session!