Challenging the Intel Xeon: ARM and OpenPower

Mar 10, 2023

SLIDE 1

Challenging the Intel Xeon: ARM and OpenPower

Now you really have to optimize

SLIDE 2

Mighty Intel …

“Intel had a 99.2 percent market share in server chips” (IDC, 2015, quoted on InfoWorld)

“We started experimenting with SoCs two years ago. … didn't work well because the single-thread performance was too low, resulting in higher latency for our web platform” – Facebook Engineering

SLIDE 3

…sits solid on the Throne

Best & most mature process technology in the world
– 14 nm FinFET tri-gate (2014)
Power management the competition can only dream of
Richest software ecosystem

SLIDE 4

Sizing Servers

Established in 2006 at Howest*, funded by the Flemish government since 2007
4 – 6 FTE (2007-2016)
2 – 3 trainees
Specialized in independent performance optimization research
* Howest = Technical University in West-Flanders (Kortrijk, Belgium)

SLIDE 5

March 2016

IWT VIS TR 135096

SLIDE 6

March 2012

Java performance

– +60% for Xeon E5 v1
– +19% for Xeon E5 v4

OLTP

– +51% for Xeon E5 v1
– +19% for Xeon E5 v4

SLIDE 7

Recognize this one?

Moore’s law
“… were shrinking so fast that every year twice as many could fit onto a chip.”
1975: “adjusted the pace to a doubling every two years”

SLIDE 8

There is Moore

CPU processing power per dollar
DRAM & NAND: price per megabit
– a 35% per year reduction in price
Also drives the Cloud / Internet
“Google will do anything to beat Moore’s law”
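
The 35% yearly price reduction compounds from year to year. A minimal sketch of that arithmetic, assuming the rate stays constant over time; the function names are illustrative and not from the slides.

```python
import math

ANNUAL_REDUCTION = 0.35  # from the slide: 35% per year reduction in price


def relative_price(years, reduction=ANNUAL_REDUCTION):
    """Price after `years`, relative to a starting price of 1.0."""
    return (1.0 - reduction) ** years


def halving_time(reduction=ANNUAL_REDUCTION):
    """Years until the price halves, assuming a constant yearly reduction."""
    return math.log(0.5) / math.log(1.0 - reduction)


for y in (1, 2, 5, 10):
    print(f"after {y:2d} years: {relative_price(y):.4f} of the original price")
print(f"price halves roughly every {halving_time():.2f} years")
```

Under this assumption the price per megabit halves in well under two years, which is the pace the slide credits with driving the Cloud and the Internet.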

SLIDE 9

MOORE'S LAW IS “SILICON VALLEY'S BEATING HEART”

SLIDE 10

The Thermal Wall: 2004

SLIDE 11

SLIDE 12

A few examples today

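| Group | Product line | Cores | Clock (GHz) | Year | Name | Process | Min die size (mm²) | Power (W) | Power density (W/cm²) |
|---|---|---|---|---|---|---|---|---|---|
| Historical ref points | Pentium 4 | 1 | 3.8 | 2004 | "Prescott" | 90 nm | 112 | 115 | 103 |
| Historical ref points | Pentium 3 | 1 | 1 | 1999 | "Coppermine" | 180 nm | 106 | 29 | 27 |
| Today | Core i7-6xxx | 4 | 4 | 2016 | "Skylake" | 14 nm | 122 | 91 | 75 |
| Today | Xeon E5 | 8 | 3.4 | 2016 | "Broadwell" | 14 nm | 246 | 140 | 57 |
| Today | Core i7 4xxx | 4 | 4 | 2014 | "Haswell" | 22 nm | 177 | 88 | 50 |
| GPUs | GeForce 1000 | 3584 | 1.6 | 2016 | "Pascal" | 16 nm | 520 | 300 | 58 |
| GPUs | GeForce 800 | 2880 | 0.9 | 2014 | "Kepler" | 28 nm | 571 | 250 | 44 |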
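
The power density column in the table above appears to be power divided by die size. A small sketch that recomputes it, assuming die size is in mm², power is in watts, and density is in W/cm² (the slide does not state units; these are inferred from the numbers).

```python
def power_density_w_per_cm2(power_w, die_mm2):
    """Power in watts divided by die area, with the area converted from mm² to cm²."""
    return power_w / (die_mm2 / 100.0)


# (name, die size in mm², power in W, density listed on the slide)
CHIPS = [
    ("Pentium 4 Prescott", 112, 115, 103),
    ("Pentium 3 Coppermine", 106, 29, 27),
    ("Core i7-6xxx", 122, 91, 75),
    ("Xeon E5 Broadwell", 246, 140, 57),
    ("Core i7 4xxx", 177, 88, 50),
    ("GeForce 1000 Pascal", 520, 300, 58),
    ("GeForce 800 Kepler", 571, 250, 44),
]

for name, die_mm2, power_w, listed in CHIPS:
    computed = power_density_w_per_cm2(power_w, die_mm2)
    print(f"{name:22s} computed {computed:6.1f} W/cm², listed {listed}")
```

Every computed value rounds to the figure on the slide, which supports the assumed units.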

SLIDE 13

A bumpy road

90 nm (2004): strained silicon (35% faster switching)
45 nm (2008): “high-k dielectric”, reduced leakage
22 nm (2012): “Tri-gate” (reduces both switching and leakage power)
– Research started in 2002!!
THE WALL: photolithography process uses light with a 193 nanometre wavelength
– EUV (13.5 nm)

SLIDE 14

2013

Still optimistic
Intel, AMD, TSMC, GlobalFoundries, and IBM => “Moore’s Law Roadmap”

SLIDE 15

2016

10 nm postponed to late 2017
7 nm: big question mark!
No more silicon, but Indium Gallium Arsenide (InGaAs) at 7 nm
Nanotubes?
Graphene?

SLIDE 16

4% loss per generation!

SLIDE 17

Problem: big data gets brains

Data gets too complex for humans to analyze

SLIDE 18

And Now?

Field Programmable Gate Array (FPGA)
ASICs (Application-Specific ICs)
Graphics Processing Unit (GPU)
MIC (Many Integrated Core)


SLIDE 19


SLIDE 20

EVOLVING MARKET, NEW PLAYERS

The market has changed too

SLIDE 21

Total Market: something has changed

SLIDE 22

SLIDE 23

Cavium Thunder-X

First 64-bit ARM server vs “mid range” Xeon E5
48 “simple 2 IPC” cores @ 2 GHz @ 120W

– Single thread perf is 3-5x lower

28 nm technology
Gigabyte servers

SLIDE 24

SLIDE 25

SLIDE 26

Software ecosystem

No Java Native Access Libraries
Spark crashes with machine language message
MySQL, LAMP, most Java applications work

SLIDE 27

SLIDE 28

SLIDE 29

Performance / watt

SLIDE 30

SLIDE 31

Conclusion ARMv8 (64)

Niche oriented Cavium Thunder-X
Future chips of Qualcomm, Cavium (maybe Avago/Broadcom)
AMD & AppliedMicro not competitive (yet??)
A few big customers:
– PayPal (VPN, firewall, some webservices)
– Already conquering the Chinese market (HiSilicon, Huawei)
Fragmented market
Still immature ecosystem:
– JNA & Elasticsearch, Spark

SLIDE 32

OPENPOWER

SLIDE 33

POWER8 disadvantages

Very power hungry: 10 cores @ 190 W TDP + memory buffers (60-80 W) vs 22 cores @ 145 W Xeon

JNA not supported
Some software still a bit unoptimized (MySQL)
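
The power comparison above can be restated per core. A rough sketch, assuming the TDP and memory-buffer figures from the slide are simply added together; TDP is a design limit rather than measured draw, and a per-core figure ignores that a POWER8 core runs 8 threads, so this only indicates the order of magnitude.

```python
def watts_per_core(tdp_w, cores, extra_w=0.0):
    """Rated power per core: (TDP + extra components) divided by core count."""
    return (tdp_w + extra_w) / cores


# POWER8: 10 cores @ 190 W TDP plus 60-80 W for memory buffers (from the slide)
power8_low = watts_per_core(190, 10, extra_w=60)
power8_high = watts_per_core(190, 10, extra_w=80)

# Xeon: 22 cores @ 145 W (from the slide; no memory-buffer figure is given)
xeon = watts_per_core(145, 22)

print(f"POWER8: {power8_low:.1f}-{power8_high:.1f} W per core")
print(f"Xeon:   {xeon:.1f} W per core")
print(f"ratio:  {power8_low / xeon:.1f}x to {power8_high / xeon:.1f}x")
```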

SLIDE 34

When OpenPOWER makes sense

Based upon most complex core on the market (8 threads, 8 IPC, 3.5+ GHz)
(Some) Pricing competitive with HP/Dell
32 DIMM slots per CPU (Intel: 12)
Open from firmware to software
Google & Rackspace have a new OpenPOWER server
Some software runs as fast as best Xeons (MongoDB, PostgreSQL)
Software ecosystem has grown fast…

SLIDE 35

OpenPower Ecosystem

SLIDE 36

IBM: first integrator of NVLink

SLIDE 37

“Deep Learning” P100

SLIDE 38

Page Migration Engine & POWER8 with NVLink

Far easier to create new applications on Tesla P100
NVIDIA Page Migration Engine ensures unified memory space

Unified memory: address space spans CPU and GPU, 1TB+
Hardware managed transfers: eliminates explicit data transfers
Testing program implementing these advantages

– POWER8 with NVLink ensures speedy data throughput

1TB memory space requires faster CPU:GPU data movement
Bus masks transfer times

– Close code-base to parallel CPU code


Too Large a Memory Space Required

Too complicated to move data

Moves too much data

Too much custom coding for GPU data movement

Software UVM feature too limiting

Requires page faulting support

Barriers to Entry Removed

SLIDE 39

Percona MySQL 5.7

SLIDE 40

SPARK TESTING

Few Large or many small nodes?

SLIDE 41

Our test

300 GB GZIP “Common Crawl” Web archives
Body text extracted by “BoilerPipe”
Natural Language Processing (Stanford)
Aggregate: Group by & Sort entity counts
Generate recommendations with Alternating Least Squares


SLIDE 42

Realtime in-memory processing with Spark

SLIDE 43

Spark Optimization

Number of virtual cores per executor (JVM):

– 1 per 2 logical cores (Intel: 1, IBM: 4)

Number of executors = number of physical cores – 1
spark.default.parallelism = +/- 1.5-2 tasks per executor
GCThreads = 1 per virtual core per executor
Speed up = 10-20%
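
The sizing rules above can be turned into a small helper. This is a sketch of one possible reading of the slide, not the authors' tooling: it assumes "1 per 2 logical cores" means hardware threads per core divided by two (2 threads gives 1 for Intel, 8 threads gives 4 for POWER8), and it assumes "GCThreads" maps to the JVM flag `-XX:ParallelGCThreads`. The function name `spark_settings` is hypothetical; the Spark property names are standard Spark configuration keys.

```python
import math


def spark_settings(physical_cores, threads_per_core, tasks_per_executor=2.0):
    """Derive Spark settings from the slide's rules of thumb (assumed reading)."""
    # Virtual cores per executor: 1 per 2 logical cores of one physical core.
    vcores = max(1, threads_per_core // 2)
    # Number of executors = number of physical cores - 1.
    executors = max(1, physical_cores - 1)
    # Roughly 1.5-2 tasks per executor.
    parallelism = int(math.ceil(executors * tasks_per_executor))
    return {
        "spark.executor.instances": str(executors),
        "spark.executor.cores": str(vcores),
        "spark.default.parallelism": str(parallelism),
        # Assumption: one GC thread per virtual core of the executor.
        "spark.executor.extraJavaOptions": f"-XX:ParallelGCThreads={vcores}",
    }


# Example core counts taken from the POWER8 vs Xeon comparison in the deck.
for label, cores, threads in (("Xeon", 22, 2), ("POWER8", 10, 8)):
    print(label)
    for key, value in spark_settings(cores, threads).items():
        print(f"  --conf {key}={value}")
```

The printed `--conf` pairs are in the form accepted by `spark-submit`; the actual values would still need tuning against the workload, since the slide reports only a 10-20% gain from these settings.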

SLIDE 44

20% gain per generation

SLIDE 45

SLIDE 46

Conclusions so far

Moore’s law is dead: opportunity for niche players
OpenPower has some tangible advantages
Next generation of ARM servers should be watched
New innovations …

– Combining streaming, sensor data & static data
– Deep learning

… will require much more tuning & specialized chips

SLIDE 47

Rate My Session!