Server Oriented Microprocessor Optimizations - Charles R. Moore - PowerPoint PPT Presentation


SLIDE 1

Server Oriented Microprocessor Optimizations

Charles R. Moore Senior Technical Staff Member crmoore@us.ibm.com IBM Corporation

SLIDE 2

11/08/99

Server Oriented Microprocessor Optimizations

IBM

What is a Server?

  • Many different types of servers in use today (many more tomorrow)
  • All have interesting technical challenges and business opportunities
  • The architecture of this collection of servers is a very interesting topic
  • Today, I am focusing mostly on the Enterprise Server

[Diagram: the server landscape, from phone/cable switches, routers & switches, and firewalls to Internet web servers, intranet servers, ISP servers, home servers, and small office servers, with an Enterprise Server (www.eCompany.com) behind the firewall holding mission-critical data, confidential info, product orders, inventory updates, production status, ERP, and BI]

SLIDE 3

Elements of Enterprise Server Performance

Large system parallelism and concurrent execution
  • Tightly-coupled SMP scaling
  • NUMA access ratios
  • Clustering topologies

Memory and I/O system design
  • Cache structure, coherency protocols, "smart" caching
  • Latency and bandwidth
  • Network and I/O "impedance matching"

Software optimization and path length
  • OS, database, application: algorithms and scaling
  • Compiler exploitation of hardware resources

Compatibility and upgradability
  • Hot-plug I/O, disks, memory, and processors
  • Compatibility and durability between generations of machines
  • Logical and physical partitioning (dynamic reconfiguration)

Reliability, Availability and Serviceability (RAS)

SLIDE 4

System Robustness and RAS

Observed Performance

Q: Which system has better performance?

[Chart: observed performance vs. time (measured in days/weeks) for two systems; one repeatedly loses time to crashes and maintenance, the other runs continuously]

For servers, this is proving to be more important than raw performance!
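The point of this slide can be made concrete with a toy model (all numbers below are hypothetical, not from the talk): a system with a slightly slower raw engine can deliver more observed work over a week than a faster one that loses time to crashes and maintenance.

```python
# Toy model: observed performance = raw throughput x fraction of time the
# system is actually up. All numbers are hypothetical, for illustration only.

def observed_throughput(raw_tps, hours_up, hours_total):
    """Average transactions/sec over the whole interval, counting downtime."""
    return raw_tps * (hours_up / hours_total)

week = 7 * 24  # hours in one week

# System A: faster raw engine, but loses 12 hours to crashes and maintenance.
a = observed_throughput(raw_tps=1000, hours_up=week - 12, hours_total=week)

# System B: 5% slower raw engine, but stays up the whole week.
b = observed_throughput(raw_tps=950, hours_up=week, hours_total=week)

print(f"A: {a:.0f} tps observed, B: {b:.0f} tps observed")  # B comes out ahead
```

Under these assumed numbers the "slower" system B wins, which is the slide's argument for weighting robustness over raw performance.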

SLIDE 5

Server Workload Characteristics

Commercial:
  • Large database footprints
  • Small record access
  • Random access patterns
  • Sharing/thread communication

Technical:
  • Structured data
  • Large data movement
  • Predictable strides
  • Minimal data reuse

e-Business applications include attributes from both Commercial and Technical workloads.

SLIDE 6

The Memory Hierarchy is Critical

Today, processors spend most of their time waiting for cache misses
  • This is true for most workloads, regardless of processor architecture or design
  • Feeding processors is the principal performance challenge

The memory hierarchy bottleneck will get worse over time
  • Processor speed will continue to improve faster than memory and cache speeds
  • Software design trends (object-oriented programming, just-in-time compilation, etc.) will place increased load on the memory hierarchy
  • SMP and NUMA designs expand the problem

Memory hierarchy bandwidth and latency are limiting factors around which server designs need to be optimized

[Chart: execution time split into processor busy time ("infinite L1 cache" time) and processor wait time (the "finite cache adder")]
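The split between "infinite L1 cache" time and the "finite cache adder" can be sketched as a simple CPI decomposition (the miss rate and penalty below are illustrative values I chose, not figures from the talk):

```python
# CPI decomposition sketch: total CPI = the CPI the core would achieve with a
# perfect (infinite) cache, plus a "finite cache adder" for miss stalls.
# All parameter values are hypothetical, chosen only to illustrate the point.

def total_cpi(infinite_cache_cpi, misses_per_instr, miss_penalty_cycles):
    finite_cache_adder = misses_per_instr * miss_penalty_cycles
    return infinite_cache_cpi + finite_cache_adder

# A commercial workload: modest-ILP core, frequent misses, long memory latency.
cpi = total_cpi(infinite_cache_cpi=1.0,
                misses_per_instr=0.02,      # 20 misses per 1000 instructions
                miss_penalty_cycles=150)    # memory latency in core cycles

print(f"Total CPI: {cpi:.1f}")  # the adder (3.0) dominates the core CPI (1.0)
```

With these assumed numbers the processor spends three of every four cycles waiting, which is why the slide calls feeding the processor the principal performance challenge.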

SLIDE 7

Examples of Cache / Memory System Optimizations

  • 1. Improve cache performance
    - on-chip cache hierarchy
    - exploitation of eDRAM technology for large caches
    - "smart caches" / adaptive cache coherency protocols
    - multiported caches and banking schemes
    - software controls for caches and TLBs (hints, prefetch, blocking, affinity, etc.)

  • 2. Manage overall latency
    - OOO (out-of-order) execution to accelerate storage access instructions
    - multiple outstanding cache misses
    - hardware-initiated prefetching (data and instructions)
    - allow speculation beyond synchronization boundaries
    - allow speculation beyond lock structures
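The payoff from "multiple outstanding cache misses" in item 2 can be sketched with a back-of-the-envelope model (an idealized model with hypothetical numbers; real overlap is rarely this perfect): misses in flight overlap their latencies instead of serializing them.

```python
# Sketch: total stall time for N misses when the memory system allows a
# limited number of misses in flight. Groups of up to `outstanding` misses
# share one latency window. Hypothetical, idealized numbers throughout.

import math

def stall_cycles(n_misses, latency, outstanding):
    # ceil(n/outstanding) serialized "rounds", each costing one full latency
    return math.ceil(n_misses / outstanding) * latency

serial  = stall_cycles(n_misses=8, latency=150, outstanding=1)  # blocking cache
overlap = stall_cycles(n_misses=8, latency=150, outstanding=4)  # 4 in flight

print(serial, overlap)  # 1200 vs 300 cycles
```

Even this crude model shows why non-blocking caches and deep miss queues matter so much for the miss-dominated workloads described earlier.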

SLIDE 8

Examples of Cache / Memory System Optimizations (continued)

  • 3. Maximize bandwidth
    - exploit extraordinary amount of available on-chip bandwidth
    - exploit large number of available module I/Os (cost trade-off)
    - fast I/O circuits and smart interface protocols

  • 4. Multiprocessor optimizations
    - shared caches
    - efficient cache invalidate (XI) and cache-to-cache transfers
    - minimize synchronization / barrier overhead (avoid broadcasts)
    - fast lock processing; dedicated lock fabric between processors
    - exploit weak storage consistency model (posted stores)
    - multiple threads per chip (CMP, HMT, SMT)
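The value of efficient cache-to-cache transfers in item 4 can be sketched as an average-latency calculation (all latencies and rates below are my own illustrative assumptions): in sharing-heavy commercial workloads, many misses find their line dirty in a peer cache, so serving them by direct intervention instead of a memory round trip cuts the average load latency.

```python
# Average load latency in an SMP, with slow vs. fast cache-to-cache
# (intervention) transfers. All latencies and fractions are hypothetical.

def avg_load_latency(hit_rate, shared_fraction, hit_lat, c2c_lat, mem_lat):
    miss_rate = 1.0 - hit_rate
    # A shared_fraction of misses are served cache-to-cache, the rest by memory.
    miss_lat = shared_fraction * c2c_lat + (1.0 - shared_fraction) * mem_lat
    return hit_rate * hit_lat + miss_rate * miss_lat

# Commercial workloads share heavily, so many misses could hit a peer cache.
slow = avg_load_latency(0.95, 0.4, hit_lat=10, c2c_lat=200, mem_lat=200)
fast = avg_load_latency(0.95, 0.4, hit_lat=10, c2c_lat=60,  mem_lat=200)

print(f"{slow:.1f} -> {fast:.1f} cycles average load latency")
```

The same structure explains why shared caches help: sharing misses become plain hits rather than coherence traffic.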

SLIDE 9

Technology Effects on SMP Performance

[Chart: performance vs. # processors (threads) for two deployment styles]

Synergistic Technology Deployment (higher bandwidth, parallelizing compilers, aggressive system packaging)
  • Better scaling ratios
  • More usable processors
  • Higher overall throughput

Scattered Technology Deployment (hardware and software scaling limitations)
  • Curve flattens out quickly
  • Inherent limitations work against you

SMP performance strongly benefits from synergistic technology deployment.
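The two scaling curves can be sketched with a simple serialization model (the overhead coefficients are hypothetical, and the model itself is a simplification I chose, not one from the talk): a single per-processor overhead term is enough to make the curve flatten quickly.

```python
# SMP scaling sketch: speedup(n) flattens as per-processor overhead grows.
# `alpha` lumps together hardware and software serialization overheads;
# both values below are hypothetical, chosen only to show the two curve shapes.

def speedup(n_procs, alpha):
    return n_procs / (1.0 + alpha * (n_procs - 1))

synergistic = [speedup(n, alpha=0.02) for n in (1, 8, 16, 32)]  # low overhead
scattered   = [speedup(n, alpha=0.20) for n in (1, 8, 16, 32)]  # high overhead

print([round(s, 1) for s in synergistic])  # keeps climbing toward 32-way
print([round(s, 1) for s in scattered])    # flattens out quickly
```

In this sketch the scattered deployment never exceeds a speedup of 5 even at 32 processors, while the synergistic one still scales, which is the slide's "more usable processors" point.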

SLIDE 10

Potential Architecture Optimizations for Servers

Synchronization, Locking, and Cache Controls
  • Special-purpose synchronization ops: only pay for what you need
  • Dedicated lock hardware
  • Cache policy hints

Special Purpose Accelerators
  • Move, copy, zero, compare pages
  • Pointer-chasing acceleration
  • Programmable stream prefetching engine

Error Recovery and RAS
  • Synchronous machine checks on memory / bus errors
  • Multiple interrupt tolerance

Support for NUMA and Clustering
  • Message passing optimizations; broadcast optimizations
  • Synchronous fencing of store errors

Support for Logical Partitioning

In Servers, the ISA is far less important than the system-level optimizations.
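As one illustration of the accelerators listed above, a "programmable stream prefetching engine" can be sketched as a stride detector (a toy model under my own assumptions, not a description of any actual hardware): watch the demand address stream, confirm a constant stride, then issue prefetches ahead of it.

```python
# Toy stride prefetcher: after two consecutive accesses with the same stride,
# prefetch `depth` addresses ahead of each new demand address.
# Illustrative only; real stream engines track many streams and filter noise.

class StridePrefetcher:
    def __init__(self, depth=2):
        self.depth = depth
        self.last_addr = None
        self.stride = None
        self.confirmed = False

    def access(self, addr):
        """Return the list of prefetch addresses triggered by this access."""
        prefetches = []
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.stride:
                self.confirmed = True   # same stride seen twice in a row
            self.stride = stride
        if self.confirmed and self.stride:
            prefetches = [addr + self.stride * i
                          for i in range(1, self.depth + 1)]
        self.last_addr = addr
        return prefetches

pf = StridePrefetcher(depth=2)
for a in (0x1000, 0x1080, 0x1100):   # constant 0x80 stride
    issued = pf.access(a)
print([hex(x) for x in issued])      # prefetches 0x1180 and 0x1200
```

This works well for the "predictable strides" of technical workloads described earlier; the random access patterns of commercial workloads are exactly where such an engine never confirms a stride, motivating the pointer-chasing acceleration listed alongside it.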

SLIDE 11

Attributes of Server Oriented Microprocessors

Workload and environment:
  • Choppy workloads; modest amounts of ILP
  • Workloads have large instruction and data footprints
  • Workloads demonstrate a high degree of data sharing
  • Workload partitioning ranges from trivial to very complex
  • Complex, multi-tiered SW and system environments
  • Systems demand non-stop operation (e-business)
  • Systems demand configuration flexibility

Resulting processor attributes:
  • High frequency operation
  • Optimized memory systems with large caches
  • Shared caches; optimized intervention
  • Optimized locking and synchronization
  • Support tight SMP, NUMA & clustering
  • Full system design and optimization
  • Strong focus on RAS
  • Binary compatibility across generations
  • Architecture extensions for partitioning

SLIDE 12

IBM's GigaProcessor (POWER4)

Cornerstone of significant new Enterprise System Architecture
  • RS/6000 and AS/400 systems
  • Binary compatibility with previous systems
  • Enhancements for synch, locking, partitioning, compiler controls

>1 GHz operating frequency (starting point)
  • Full custom design leveraging copper wiring and SOI

Dual processors, integrated L2 cache and L3 controller on CPU chip

Aggressive, SMP-optimized cache hierarchy
  • Low latency access, very high bandwidth
  • High bandwidth cache-to-cache interconnection fabric
  • Hardware-based prefetching for instructions and data

Enterprise-class RAS features

Development substantially far along

SLIDE 13

POWER4 - Chip Multiprocessing

[Diagram: two >1 GHz cores sharing an on-chip L2 cache, with >100 GB/s of bandwidth between the L2 and the cores]

SLIDE 14

POWER4 - High BW L3 and Memory

[Diagram: two >1 GHz cores with shared L2 and on-chip L3 directory, attached to off-chip L3 and memory with >10 GB/s of bandwidth]

SLIDE 15

POWER4 - Low-end Server Solution

[Diagram: two >1 GHz cores with shared L2 and L3 directory, attached to off-chip L3, memory, and an expansion bus]

SLIDE 16

Server Building Block

[Diagram: two >1 GHz cores with shared L2 and L3 directory, chip-to-chip communication links, off-chip L3, and memory]

  • >100 GB/sec L2 to core BW
  • >10 GB/sec L3 BW
  • >35 GB/sec chip interconnect BW

SLIDE 17

Server Multi-chip Module (8-way SMP)

[Diagram: four POWER4 chips (each with two >1 GHz cores, shared L2, and L3 directory) inside a multi-chip module boundary, linked by chip-to-chip communication, each chip with its own L3, memory, and expansion bus]

SLIDE 18

POWER4 Unit-level Floorplan

[Die floorplan: three L2 cache regions, L3 directory and control, and two cores, each containing LSU, ISU, FPU, IDU, IFU, BXU, and FXU units]

SLIDE 19

POWER4 C4 Footprint
  • ~2300 signal C4s
  • >500 MHz wave-pipelined I/O
  • >1 Terabit/sec bandwidth at the chip
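The bandwidth figure on this slide is easy to sanity-check, if we assume each signal C4 carries roughly one bit per I/O clock (that per-signal assumption is mine, not stated on the slide):

```python
# Sanity check of the slide's numbers: ~2300 signal C4s switching at >500 MHz.
# Assumes roughly one bit per signal per I/O clock (my assumption, not stated).
signals = 2300
clock_hz = 500e6
bits_per_sec = signals * clock_hz

print(f"{bits_per_sec / 1e12:.2f} Tbit/s")  # ~1.15 Tbit/s, consistent with >1 Terabit/sec
```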

SLIDE 20

POWER4 Multi-Chip Module

SLIDE 21

GigaProcessor Test Chip Die Photo

[Die photo: ISU, FXU, IFU, FPU, IDU, L1D, and L2 blocks; wire DUT; result checker; trace function; noise generators; COP; technology experiments]

SLIDE 22

Technology Leverage in POWER4

Process
  • IBM CMOS 8S2, 0.18 um copper and SOI with 7 layers of metal
  • 170 million transistors

Package
  • Uses a large number of I/Os at chip and MCM level
  • >2,300 I/Os with >5,500 pins
  • Multi-chip module (MCM) for dense integration

High bandwidth with fast busses
  • Elastic I/O provides >500 MHz chip-to-chip busses