SLIDE 1

UltraSPARC T1: A 32-threaded CMP for Servers

James Laudon, Distinguished Engineer, Sun Microsystems, james.laudon@sun.com

SLIDE 2

Outline

  • Server design issues
> Application demands
> System requirements
  • Building a better server-oriented CMP
> Maximizing thread count
> Keeping the threads fed
> Keeping the threads cool
  • UltraSPARC T1 (Niagara)
> Micro-architecture
> Performance
> Power

SLIDE 3

Attributes of Commercial Workloads

  • Adapted from "A performance methodology for commercial servers," S. R. Kunkel et al., IBM J. Res. Develop., vol. 44, no. 6, Nov. 2000

Benchmark   | Application category | ILP  | Instruction/data working set | Data sharing | TLP
Web99       | Web server           | low  | large                        | low          | high
TPC-C       | OLTP                 | low  | large                        | high         | high
jBOB (JBB)  | Server Java          | low  | large                        | med          | high
TPC-H       | DSS                  | high | large                        | med          | high
SAP 2T      | ERP                  | med  | med                          | med          | high
SAP 3T DB   | ERP                  | low  | large                        | high         | high

SLIDE 4

Commercial Server Workloads

  • SpecWeb05, SpecJappserver04, SpecJBB05, SAP SD, TPC-C, TPC-E, TPC-H
  • High degree of thread-level parallelism (TLP)
  • Large working sets with poor locality, leading to high cache miss rates
  • Low instruction-level parallelism (ILP) due to high cache miss rates, load-load dependencies, and difficult-to-predict branches
  • Performance is bottlenecked by stalls on memory accesses
  • Superscalar and superpipelining will not help much
SLIDE 5

ILP Processor on Server Application

ILP reduces the compute time and overlaps computation with L2 cache hits, but memory stall time dominates overall performance

[Figure: execution timeline of a single thread alternating compute (C) and memory-latency (M) phases, shown for a scalar processor and for a processor optimized for ILP; the ILP processor shortens the compute phases slightly ("time saved"), but the memory-latency phases remain and dominate total time.]

SLIDE 6

Attacking the Memory Bottleneck

  • Exploit the TLP-rich nature of server applications
  • Replace each large, superscalar processor with multiple simpler, threaded processors
> Increases core count (C)
> Increases threads per core (T)
> Greatly increases total thread count (C*T)
  • Threads share a large, high-bandwidth L2 cache and memory system
  • Overlap the memory stalls of one thread with the computation of other threads (see the sketch below)
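As a rough illustration of this overlap (my own sketch, not from the slides; the cycle counts are hypothetical), a simple saturation model shows how per-core utilization grows with thread count:

```python
# Sketch: how many threads does it take to hide memory latency?
# Simple saturation model: a thread computes for `compute` cycles, then
# stalls for `mem_latency` cycles. With T threads interleaved on one core,
# utilization ~= min(1, T * compute / (compute + mem_latency)).
# Illustrative numbers only; not taken from the UltraSPARC T1 deck.

def core_utilization(threads: int, compute: float, mem_latency: float) -> float:
    return min(1.0, threads * compute / (compute + mem_latency))

if __name__ == "__main__":
    compute, mem_latency = 25.0, 100.0   # cycles (hypothetical workload)
    for t in (1, 2, 4, 8):
        u = core_utilization(t, compute, mem_latency)
        print(f"{t} threads -> ~{u:.0%} pipeline utilization")
    # 1 thread -> ~20%, 4 threads -> ~80%, 8 threads -> 100% (saturated)
```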

SLIDE 7

TLP Processor on Server Application

TLP focuses on overlapping memory references to improve throughput; needs sufficient memory bandwidth

[Figure: execution timelines for eight cores (Core0-Core7), each running four threads; while one thread on a core waits on memory latency, the compute phases of the other three threads keep that core's pipeline busy.]

SLIDE 8

Server System Requirements

  • Very large power demands
> Often run at high utilization and/or with large amounts of memory
> Deployed in dense rack-mounted datacenters
  • Power density affects both datacenter construction and ongoing costs
  • Current servers consume far more power than state-of-the-art datacenters can provide
> 500 W per 1U box possible
> Over 20 kW/rack, while most datacenters are built for 5 kW/rack
> Blades make this even worse...

SLIDE 9

Server System Requirements

  • Processor power is a significant portion of the total
> Database: 1/3 processor, 1/3 memory, 1/3 disk
> Web serving: 2/3 processor, 1/3 memory
  • Perf/Watt has been flat between processor generations
  • Acquisition cost of server hardware is declining
> Moore's Law: more performance at the same cost, or the same performance at lower cost
  • Total cost of ownership (TCO) will be dominated by power within five years
  • The "Power Wall"
SLIDE 10

Performance/Watt Trends

Source: L. Barroso, "The Price of Performance," ACM Queue, vol. 3, no. 7

SLIDE 11

Impact of Flat Perf/Watt on TCO

Source: L. Barroso, "The Price of Performance," ACM Queue, vol. 3, no. 7

SLIDE 12

Implications of the “Power Wall”

  • With TCO dominated by power usage, the metric that matters is performance/Watt
  • Performance/Watt has been mostly flat for several generations of ILP-focused designs
> Should have been improving as a result of voltage scaling (fCV² + TILCV)
> Increases in C, T, ILC, and f have offset voltage decreases
  • TLP-focused processors reduce f and C/T (per processor) and can greatly improve performance/Watt for server workloads
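A minimal sketch of why this works, using the standard dynamic-power relation P_dyn ~ C·V²·f (leakage ignored); the core counts, voltage, frequency, and throughput factors below are made-up illustrations, not UltraSPARC T1 figures:

```python
# Sketch: why lower frequency/voltage plus more threads can win on perf/Watt.
# Uses the standard dynamic-power relation P_dyn ~ C * V^2 * f.
# All numbers are hypothetical, for illustration only.

def dynamic_power(cap_rel: float, volt_rel: float, freq_rel: float) -> float:
    """Relative dynamic power, normalized to a baseline core (1.0, 1.0, 1.0)."""
    return cap_rel * volt_rel ** 2 * freq_rel

# Baseline: one complex core at full frequency/voltage, relative throughput 1.0.
base_power = dynamic_power(1.0, 1.0, 1.0)

# TLP alternative: 4 simple cores, each ~1/3 the capacitance, at 0.6x frequency
# and 0.85x voltage; assume each simple core delivers ~0.4x the throughput on a
# throughput-oriented (not single-thread) workload.
tlp_power = 4 * dynamic_power(1.0 / 3, 0.85, 0.6)
tlp_throughput = 4 * 0.4

print(f"baseline   perf/W = {1.0 / base_power:.2f}")
print(f"TLP design perf/W = {tlp_throughput / tlp_power:.2f}")
# The TLP design spends less power per unit of throughput despite slower threads.
```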

SLIDE 13

Outline

  • Server design issues
> Application demands
> System requirements
  • Building a better server-oriented CMP
> Maximizing thread count
> Keeping the threads fed
> Keeping the threads cool
  • UltraSPARC T1 (Niagara)
> Micro-architecture
> Performance
> Power

SLIDE 14

Building a TLP-focused processor

  • Maximizing the total number of threads
> Simple cores
> Sharing at many levels
  • Keeping the threads fed
> Bandwidth!
> Increased associativity
  • Keeping the threads cool
> Performance/Watt as a design goal
> Reasonable frequency
> Mechanisms for controlling the power envelope

SLIDE 15

Maximizing the thread count

  • A tradeoff exists between a large number of simple cores and a small number of complex cores
> Complex cores focus on ILP for higher single-thread performance
> ILP is scarce in commercial workloads
> Simple cores can deliver more TLP
  • Need to trade off area devoted to processor cores, L2 and L3 caches, and system-on-a-chip functions
  • Balance performance and power in all subsystems: processor, caches, memory, and I/O

SLIDE 16

Maximizing CMP Throughput with Mediocre¹ Cores

  • J. Davis, J. Laudon, K. Olukotun, PACT '05 paper
  • Examined several UltraSPARC II, III, IV, and T1 designs, accounting for differing technologies
  • Constructed an area model based on this exploration
  • Assumed a fixed-area large die (400 mm²) and accounted for pads, pins, and routing overhead
  • Looked at performance for a broad swath of scalar and in-order superscalar processor core designs (a toy version of the area budget is sketched after the footnote below)

¹ Mediocre: adj. ordinary; of moderate quality, value, ability, or performance
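A toy version of such a fixed-area budget is sketched below; the per-core and per-MB areas are placeholders of my own, not the paper's calibrated area model:

```python
# Sketch of a fixed-die-area budget in the spirit of the PACT '05 study.
# All areas below are made-up placeholders, not the paper's area model.

DIE_AREA_MM2 = 400.0          # fixed large die, as in the study
OVERHEAD_MM2 = 80.0           # pads, pins, routing (hypothetical)
L2_AREA_PER_MB_MM2 = 25.0     # hypothetical cache density
CORE_AREA_MM2 = {             # hypothetical per-core areas
    "scalar-1t": 8.0,
    "scalar-4t": 11.0,        # ~1.4x area for 4 threads (multithreading overhead)
    "2-way-ss":  15.0,
}

def cores_that_fit(core_kind: str, l2_mb: float) -> int:
    """How many cores of a given kind fit after cache and overhead are budgeted."""
    remaining = DIE_AREA_MM2 - OVERHEAD_MM2 - l2_mb * L2_AREA_PER_MB_MM2
    return max(0, int(remaining // CORE_AREA_MM2[core_kind]))

for kind in CORE_AREA_MM2:
    print(kind, "->", cores_that_fit(kind, l2_mb=3.0), "cores with a 3 MB L2")
```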

SLIDE 17

CMP Design Space

  • Large simulation space: ~13k runs per benchmark per technology (pruned)
  • Fixed die size: the number of cores in the CMP depends on the core size

[Diagram: two core templates attach through a crossbar to a shared L2 cache backed by four DRAM channels. Scalar processor: one or more integer pipelines (IDPs), each with one or more threads, plus I$ and D$. Superscalar processor: one superscalar pipeline with one or more threads. Legend: IDP = integer pipeline, I$ = instruction cache, D$ = data cache.]

SLIDE 18

Scalar vs. Superscalar Core Area

[Chart: relative core area (1.0 to 5.5) vs. threads per core (1 to 16) for scalar cores with 1-4 integer pipelines (IDPs) and for 2-way and 4-way superscalar cores; annotations mark area growth factors of 1.36x, 1.54x, 1.75x, and 1.84x.]

SLIDE 19

Trading complexity, cores and caches

Source: J. Davis, J. Laudon, K. Olukotun, "Maximizing CMP Throughput with Mediocre Cores," PACT '05

[Chart annotations: core counts of 14-17, 12-14, 7-9, 5-7, 7, and 4 for the design points shown.]

SLIDE 20

The Scalar CMP Design Space

[Chart: aggregate IPC (5 to 25) vs. total cores (2 to 16); design points fall into three regions: high thread count with small L1/L2 ("mediocre cores"), medium thread count with large L1/L2, and low thread count with small L1/L2.]

SLIDE 21

Limitations of Simple Cores

  • Lower SPEC CPU2000 ratio performance
> Not representative of most single-thread code
> Abstraction increases the frequency of branching and indirection
> Most applications wait on network, disk, and memory; rarely on execution units
  • Large number of threads per chip
> 32 for UltraSPARC T1, 100+ threads soon
> Is software ready for this many threads?
> Many commercial applications scale well
> Workload consolidation

SLIDE 22

Simple core comparison

[Die photos: UltraSPARC T1, 379 mm²; Pentium Extreme Edition, 206 mm²]

SLIDE 23

Comparison Disclaimers

  • Different design teams and design environments
  • Chips fabricated in 90 nm by TI and Intel
  • UltraSPARC T1: designed from the ground up as a CMP
  • Pentium Extreme Edition: two cores bolted together
  • Apples-to-watermelons comparison, but still interesting

SLIDE 24

Pentium EE vs. US-T1 Bandwidth Comparison

Feature                 | Pentium Extreme Edition               | UltraSPARC T1
Clock speed             | 3.2 GHz                               | 1.2 GHz
Pipeline depth          | 31 stages                             | 6 stages
Power                   | 130 W (@ 1.3 V)                       | 72 W (@ 1.3 V)
Die size                | 206 mm²                               | 379 mm²
Transistor count        | 230 million                           | 279 million
Number of cores         | 2                                     | 8
Number of threads       | 4                                     | 32
L1 caches               | 12 kuop instruction / 16 kB data      | 16 kB instruction / 8 kB data
Load-to-use latency     | 1.1 ns                                | 2.5 ns
L2 cache                | two copies of 1 MB, 8-way associative | 3 MB, 12-way associative
L2 unloaded latency     | 7.5 ns                                | 19 ns
L2 bandwidth            | ~180 GB/s                             | 76.8 GB/s
Memory unloaded latency | 80 ns                                 | 90 ns
Memory bandwidth        | 6.4 GB/s                              | 25.6 GB/s
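The memory-bandwidth rows follow directly from channel width and transfer rate. In the sketch below, the UltraSPARC T1 parameters (four DDR2 channels, 128 data bits each, 400 MT/s) come from this deck, while the single 64-bit, 800 MT/s front-side bus assumed for the Pentium EE is an illustrative assumption:

```python
# Quick arithmetic behind the memory-bandwidth rows of the comparison table.
# UltraSPARC T1: 4 DDR2 channels, 128 data bits (16 B) each, 400 MT/s (from the deck).
# Pentium EE: a single 800 MT/s, 64-bit front-side bus (assumed here for illustration).

def bandwidth_gb_s(channels: int, bytes_per_transfer: int, mt_per_s: float) -> float:
    return channels * bytes_per_transfer * mt_per_s * 1e6 / 1e9

t1  = bandwidth_gb_s(channels=4, bytes_per_transfer=16, mt_per_s=400)
pee = bandwidth_gb_s(channels=1, bytes_per_transfer=8,  mt_per_s=800)
print(f"UltraSPARC T1 memory bandwidth ~ {t1:.1f} GB/s")   # ~25.6 GB/s
print(f"Pentium EE memory bandwidth    ~ {pee:.1f} GB/s")  # ~6.4 GB/s
```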

SLIDE 25

Sharing Saves Area & Ups Utilization

  • Hardware threads within a processor core share:
> Pipeline and execution units
> L1 caches, TLBs, and load/store port
  • Processor cores within a CMP share:
> L2 and L3 caches
> Memory and I/O ports
  • Increases utilization
> Multiple threads fill the pipeline and overlap memory stalls with computation
> Multiple cores increase the load on the L2 and L3 caches and memory

SLIDE 26

Sharing to save area

  • UltraSPARC T1: four threads per core
  • Multithreading increases:
> Register file
> Trap unit
> Instruction buffers and fetch resources
> Store queues and miss buffers
  • 20% area increase in the core, excluding the cryptography unit

[Core floorplan labels: IFU, EXU, MUL, TRAP, MMU, LSU]

SLIDE 27

Sharing to increase utilization

[Chart: CPI breakdown for a single thread, one thread of four, and four threads; components include thread A-D active, pipeline busy, instruction cache miss, data cache miss, L2 miss, store buffer full, pipeline latency, and miscellaneous.]

  • Application run with both 8 and 32 threads (one and four threads per core)
  • With 32 threads, pipeline and memory contention slow each thread by 34%
  • However, increased utilization leads to a ~3x speedup with four threads per core

UltraSPARC T1 Database App Utilization
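The speedup figure is simple arithmetic on the slide's numbers:

```python
# The slide's arithmetic: each thread runs ~34% slower with four threads per
# core, yet four threads still raise per-core throughput by roughly 3x.

single_thread_rate = 1.0          # normalized instruction rate, one thread/core
slowdown = 1.34                   # per-thread slowdown with four threads/core
per_thread_rate = single_thread_rate / slowdown
core_throughput = 4 * per_thread_rate

print(f"per-thread rate: {per_thread_rate:.2f}x")
print(f"core throughput: {core_throughput:.2f}x")   # ~2.99x, i.e. ~3x speedup
```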

SLIDE 28

Keeping the threads fed

  • Dedicated resources for thread memory requests
> Private store buffers and miss buffers
  • Large, banked, and highly-associative L2 cache
> Multiple banks for sufficient bandwidth
> Increased size and associativity to hold the working sets of multiple threads
  • Direct connection to high-bandwidth memory
> Fallout from a shared L2 will be larger than from a private L2
> But the increase in L2 miss rate will be much smaller than the increase in the number of threads

SLIDE 29

Keeping the threads cool

  • Sharing of resources increases unit utilization and thus leads to an increase in power
  • Cores must be power efficient
> Minimal speculation (high-payoff only)
> Moderate pipeline depth and frequency
  • Extensive mechanisms for power management
> Voltage and frequency control
> Clock gating and unit shutdown
> Leakage power control
> Minimizing cache and memory power

SLIDE 30

Outline

  • Server design issues
> Application demands
> System requirements
  • Building a better server-oriented CMP
> Maximizing thread count
> Keeping the threads fed
> Keeping the threads cool
  • UltraSPARC T1 (Niagara)
> Micro-architecture
> Performance
> Power

SLIDE 31

UltraSPARC T1 Overview

  • TLP-focused CMP for servers
> 32 threads to hide memory and pipeline stalls
  • Extensive sharing
> Four threads share each processor core
> Eight processor cores share a single L2 cache
  • High-bandwidth cache and memory subsystem
> Banked and highly-associative L2 cache
> Direct connection to DDR2 memory
  • Performance/Watt as a design metric
SLIDE 32

UltraSPARC T1 Block Diagram

[Block diagram: eight SPARC cores and a shared floating point unit connect through a crossbar to four L2 cache banks; four DRAM control channels each drive a 144-bit DDR2 interface at 400 MT/s; a JBUS system interface (200 MHz), SSI ROM interface (50 MHz), clock & test unit, and control register interface/JTAG complete the chip.]

SLIDE 33

UltraSPARC T1 Micrograph

Features:

  • 8 64-bit multithreaded SPARC cores
  • Shared 3 MB, 12-way, 64 B line, writeback L2 cache
  • 16 KB, 4-way, 32 B line ICache per core
  • 8 KB, 4-way, 16 B line, write-through DCache per core
  • 4 144-bit DDR2 channels
  • 3.2 GB/s JBUS I/O

Technology:

  • TI's 90 nm CMOS process
  • 9LM Cu interconnect
  • 63 W @ 1.2 GHz / 1.2 V
  • Die size: 379 mm²
  • 279M transistors
  • Flip-chip ceramic LGA

[Die micrograph labels: SPARC cores 0-7, L2 data banks 0-3, L2 tag banks 0-3, L2 buffer banks 0-3, crossbar, FPU, DRAM controllers 0,2 and 1,3, DDR2 interfaces 0-3, JBUS I/O bridge, clock & test unit.]

SLIDE 34

UltraSPARC T1 Floorplanning

  • Modular design for "step and repeat"
  • Main issue: all cores want to be close to all the L2 cache banks
> Crossbar and L2 tags located in the center
> Processor cores on the top and bottom
> L2 data on the left and right
> Memory controllers and SOC units fill in the holes

SLIDE 35

Maximizing Thread Count on US-T1

  • Power-efficient, simple cores
> Six-stage pipeline, almost no speculation
> 1.2 GHz operation
> Four threads per core
>> Shared: pipeline, L1 caches, TLB, L2 interface
>> Dedicated: register and other architectural state, instruction buffers, 8-entry store buffers
> Pipeline switches between available threads every cycle (interleaved/vertical multithreading; see the sketch below)
> Cryptography acceleration unit per core
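The deck describes thread selection only at this level; the sketch below is a simplified model of interleaved multithreading (round-robin among ready threads, skipping stalled ones), not the actual UltraSPARC T1 select logic:

```python
# Simplified model of interleaved (vertical) multithreading: each cycle the
# thread-select stage issues one instruction from the next ready thread in
# round-robin order, skipping threads stalled on long-latency events.
# Illustrative sketch only, not UltraSPARC T1's actual selection logic.

from collections import deque

class ThreadSelect:
    def __init__(self, num_threads=4):
        self.ready = deque(range(num_threads))   # round-robin order
        self.stalled = set()

    def stall(self, tid):                        # e.g. on a load miss
        self.stalled.add(tid)

    def wakeup(self, tid):                       # e.g. when miss data returns
        self.stalled.discard(tid)

    def select(self):
        """Pick the next ready thread, rotating past stalled ones."""
        for _ in range(len(self.ready)):
            tid = self.ready[0]
            self.ready.rotate(-1)                # move to the back of the queue
            if tid not in self.stalled:
                return tid
        return None                              # all threads stalled: pipeline bubble

sel = ThreadSelect()
sel.stall(2)
print([sel.select() for _ in range(6)])          # [0, 1, 3, 0, 1, 3]
```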

SLIDE 36

UltraSPARC T1 Pipeline

Pipeline stages: Fetch, Thread Select, Decode, Execute, Memory, Writeback

[Pipeline diagram: per-thread PC logic (x4) and instruction buffers (x4) feed thread-select muxes after the ICache/ITLB; decode reads one of four register files; execute contains the ALU, multiplier, shifter, and divider; the memory stage holds the DCache/DTLB, per-thread store buffers (x4), and the crossbar interface; thread-select logic considers instruction type, misses, traps & interrupts, and resource conflicts.]

SLIDE 37

Thread Selection: All Threads Ready

[Pipeline diagram, instructions vs. cycles: with all four threads ready, the thread-select stage issues from a different thread each cycle (t0 load, t1 sub, t2 load, t3 add, then t0 add again), so adjacent pipeline stages (F, S, D, E, M, W) hold instructions from different threads.]

SLIDE 38

Thread Selection: Two Threads Ready

[Pipeline diagram, instructions vs. cycles: with only threads 0 and 1 ready, selection alternates between them (t0 load, t1 sub, t1 load, t0 add).]

Thread 0 is speculatively switched in before cache-hit information is available, in time for the load to bypass its data to the dependent add.

SLIDE 39

Feeding the UltraSPARC T1 Threads

  • Shared L2 cache
> 3 MB, writeback, 12-way associative, 64 B lines
> 4 banks, interleaved on cache-line boundaries
> Handles multiple outstanding misses per bank
> MESI coherence; the L2 cache orders all requests
> Maintains a directory and inclusion of the L1 caches
  • Direct connection to memory
> Four 144-bit-wide (128 data + 16 ECC) DDR2 interfaces
> Supports up to 128 GB of memory
> 25.6 GB/s memory bandwidth
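With four banks interleaved on 64-byte line boundaries, the bank index is just the low bits of the line number; the deck does not spell out the exact mapping, so the sketch below uses the simplest one, for illustration:

```python
# Sketch: mapping an address to an L2 bank when 4 banks are interleaved on
# 64-byte cache-line boundaries. Bit positions follow directly from the line
# size and bank count; this is the simplest possible mapping, not necessarily
# the hardware's exact hashing.

LINE_BYTES = 64
NUM_BANKS = 4

def l2_bank(addr: int) -> int:
    line_number = addr // LINE_BYTES      # drop the 6 byte-offset bits
    return line_number % NUM_BANKS        # low 2 bits of the line number

# Consecutive cache lines hit consecutive banks, spreading bandwidth demand:
for a in range(0, 6 * LINE_BYTES, LINE_BYTES):
    print(hex(a), "-> bank", l2_bank(a))   # banks 0, 1, 2, 3, 0, 1
```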

SLIDE 40

Keeping the US-T1 Threads Cool

  • Power-efficient cores
> 1.2 GHz, 6-stage, single-issue pipeline
  • Features to keep peak power close to average
> Ability to suspend issue from any thread
> Limit on the number of outstanding memory requests
  • Extensive clock gating
> Coarse-grained (unit shutdown, partial activation)
> Fine-grained (selective gating within datapaths)
  • Static design for most of the chip
  • 63 W typical power at 1.2 V and 1.2 GHz
SLIDE 41

UltraSPARC T1 Power Breakdown

  • Fully static design
  • Fine-granularity clock gating for datapaths (30% of flops disabled)
  • Lower (1.5) P/N width ratio for library cells
  • Interconnect wire classes optimized for power x delay
  • SRAM activation control

[Pie chart: power by unit (SPARC cores, leakage, L2 data, L2 tag, L2 buffer, crossbar, floating point, misc units, interconnect, global clock, I/Os)]

63 W @ 1.2 GHz / 1.2 V; < 2 W per thread
Cores 26%, leakage 25%, top-level routing 16%, L2 cache 12%, I/Os 11%, crossbar 4%

SLIDE 42

Advantages of CoolThreads™

  • No need for exotic cooling technologies
  • Improved reliability from lower and more uniform junction temperatures
  • Improved performance/reliability tradeoff in design

[Thermal maps: die temperatures of roughly 59-66 °C, contrasted with a 107 °C hot spot.]

SLIDE 43

UltraSPARC T1 System (T1000)

SLIDE 44

UltraSPARC T1 System (T2000)

SLIDE 45

T2000 Power Breakdown

[Pie chart: Sun Fire T2000 power by component (UltraSPARC T1, 16 GB memory, I/O, disks, service processor, fans, AC/DC conversion)]

  • 271 W running SPECjbb2000
  • Power breakdown:
> 25% processor
> 22% memory
> 22% I/O
> 4% disk
> 1% service processor
> 10% fans
> 15% AC/DC conversion

SLIDE 46

UltraSPARC T1 Performance

Sun Fire T2000 (1 UltraSPARC T1 CPU, 1 socket, 2U height):

Benchmark   | Performance | Power | Perf/Watt
SPECweb2005 | 14,001      | 330 W | 42.4
SPECjbb2005 | 63,378 BOPS | 298 W | 212.7

Sun E10K (1997) vs. Sun Fire T2000 (2005):

            | E10K               | T2000
Processors  | 32 x UltraSPARC II | 1 x UltraSPARC T1
Volume      | 77.4 ft³           | 0.85 ft³
Weight      | 2,000 lbs          | 37 lbs
Power       | 13,456 W           | ~300 W
Heat output | 52,000 BTU/hr      | 1,364 BTU/hr
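The Perf/Watt column is the benchmark score divided by the measured power, as a quick check shows:

```python
# The Perf/Watt column is simply the reported score divided by measured power.
results = {
    "SPECweb2005": (14001, 330),      # (score, watts) from the deck
    "SPECjbb2005": (63378, 298),      # (BOPS, watts)
}
for bench, (score, watts) in results.items():
    print(f"{bench}: {score / watts:.1f} per watt")
# SPECweb2005: 42.4 per watt, SPECjbb2005: 212.7 per watt
```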

SLIDE 47

Future Trends

  • Improved thread performance
> Deeper pipelines
> More high-payoff speculation
  • Increased number of threads per core
  • More of the system components will move on-chip
  • Continued focus on delivering high performance/Watt and performance/Watt/volume (SWaP)

SLIDE 48

Conclusions

  • Server TCO will soon be dominated by power
  • Server CMPs need to be designed from the ground up to improve performance/Watt
> Simple multithreaded cores => more threads => more performance
> Lower frequency and less speculation => less power
> Must provide enough bandwidth to keep the threads fed
  • UltraSPARC T1 employs these principles to deliver outstanding performance and performance/Watt on a broad range of commercial workloads

SLIDE 49

Legal Disclosures

  • SPECweb2005: Sun Fire T2000 (8 cores, 1 chip), 14,001 SPECweb2005
  • SPEC and SPECweb are registered trademarks of the Standard Performance Evaluation Corporation
  • Sun Fire T2000 results submitted to SPEC December 6, 2005
  • Sun Fire T2000 server power consumption taken from measurements made during the benchmark run
  • SPECjbb2005: Sun Fire T2000 Server (1 chip, 8 cores, 1-way), 63,378 bops
  • SPEC and SPECjbb are registered trademarks of the Standard Performance Evaluation Corporation
  • Sun Fire T2000 results submitted to SPEC December 6, 2005
  • Sun Fire T2000 server power consumption taken from measurements made during the benchmark run