TSUBAME---A Year Later

SLIDE 1

TSUBAME---A Year Later

Satoshi Matsuoka, Professor/Dr.Sci.
Global Scientific Information and Computing Center, Tokyo Inst. Technology
& NAREGI Project, National Inst. Informatics

EuroPVM/MPI, Paris, France, Oct. 2, 2007

SLIDE 2

Topics for Today

  • Intro
  • Upgrades and other New stuff
  • New Programs
  • The Top 500 and Acceleration
  • Towards TSUBAME 2.0
SLIDE 3

The TSUBAME Production "Supercomputing Grid Cluster", Spring 2006-2010

  • Sun Galaxy 4 compute nodes (Opteron dual-core, 8-socket): 10,480 cores / 655 nodes, 21.4 Terabytes memory, 50.4 TeraFlops; OS Linux (SuSE 9, 10); NAREGI Grid MW
  • ClearSpeed CSX600 SIMD accelerator: 360 boards, 35 TeraFlops (current)
  • Storage: 1.0 Petabyte (Sun "Thumper", 48 x 500 GB disks per unit) + 0.1 Petabyte (NEC iStore); Lustre FS, NFS, CIFS, WebDAV (over IP); 50 GB/s aggregate I/O BW (diagram callouts: 1.5 PB, 70 GB/s)
  • NEC SX-8i (for porting)
  • Unified IB network: Voltaire ISR9288 Infiniband, 10 Gbps x2 (DDR next ver.), ~1310+50 ports, ~13.5 Terabits/s (3 Tbits bisection)
  • 10 Gbps+ external network
  • "Fastest Supercomputer in Asia", 29th Top500 @ 48.88 TF

SLIDE 4

Titech TSUBAME: ~76 racks, 350 m2 floor area, 1.2 MW (peak)

SLIDE 5

Photo captions: ~500 TB out of 1.1 PB; node rear; local Infiniband switch (288 ports), currently 2 GB/s per node, easily scalable to 8 GB/s per node; cooling towers (~32 units)

SLIDE 6

TSUBAME assembled like iPod…

  • NEC: main integrator, storage, operations
  • Sun: Galaxy compute nodes, storage, Solaris
  • AMD: Opteron CPU (Fab36)
  • Voltaire: Infiniband network
  • ClearSpeed: CSX600 accelerator
  • CFS: parallel FS (Lustre)
  • Novell: SuSE 9/10
  • NAREGI: Grid MW
  • Titech GSIC: us

Components sourced from the UK, Germany, Israel, the USA, and Japan.

SLIDE 7

The racks were ready; nodes arrive en masse

SLIDE 8

Design Principles of TSUBAME (1)

  • Capability and Capacity: have the cake and eat it, too!
    – High-performance, low-power x86 multi-core CPU
      • High INT-FP, high cost-performance, highly reliable
      • Latest process technology – high performance and low power
      • Best applications & software availability: OS (Linux/Solaris/Windows), languages/compilers/tools, libraries, Grid tools, all ISV applications
    – FAT node architecture (later)
      • Multicore SMP – most flexible parallel programming
      • High memory capacity per node (32/64/128(new) GB)
      • Large total memory – 21.4 Terabytes
      • Low node count – improved fault tolerance, eases network design
    – High-bandwidth Infiniband network, IP-based (over RDMA)
      • (Restricted) two-staged fat tree
      • High bandwidth (10-20 Gbps/link), multi-lane, low latency (<10 microsec), reliable/redundant (dual-lane)
      • Very large switch (288 ports) => low switch count, low latency
      • Resilient to all types of communications: nearest neighbor, scatter/gather collectives, embedding multi-dimensional networks
      • IP-based for flexibility, robustness, synergy with Grid & Internet
SLIDE 9

Design Principles of TSUBAME (2)

  • PetaByte large-scale, high-performance, reliable storage
    – All-disk storage architecture (no tapes), 1.1 Petabytes
      • Ultra-reliable SAN/NFS storage for /home (NEC iStore), 100 TB
      • Fast NAS/Lustre PFS for /work (Sun Thumper), 1 PB
    – Low cost / high performance SATA2 (500 GB/unit)
    – High-density packaging (Sun Thumper), 24 TeraBytes/4U
    – Reliability thru RAID6, disk rotation, SAN redundancy (iStore)
      • Overall HW data loss: once / 1000 years
    – High-bandwidth NAS I/O: ~50 GBytes/s Livermore benchmark
    – Unified storage and cluster interconnect: low cost, high bandwidth, unified storage view from all nodes w/o special I/O nodes or SW
  • Hybrid architecture: general-purpose scalar + SIMD vector acceleration w/ ClearSpeed CSX600
    – 35 Teraflops peak @ 90 KW (~1 rack of TSUBAME)
    – General-purpose programmable SIMD vector architecture

SLIDE 10

TSUBAME Architecture =

Commodity PC Cluster + Traditional FAT node Supercomputer + The Internet & Grid + (Modern) Commodity SIMD-Vector Acceleration + iPod (HW integration & enabling services)

SLIDE 11

TSUBAME Physical Installation

  • 3 rooms (600 m2), 350 m2 service area
  • 76 racks incl. network & storage, 46.3 tons
    – 10 storage racks
  • 32 AC units, 12.2 tons
  • Total 58.5 tons (excl. rooftop AC heat exchangers)
  • Max 1.2 MWatts
  • ~3 weeks construction time

Floor plan labels: 1st Floor, 2nd Floor A, 2nd Floor B; Titech Grid Cluster; TSUBAME & storage

SLIDE 12

TSUBAME Network: (Restricted) Fat Tree, IB-RDMA & TCP-IP

  • X4600 x 120 nodes (240 ports) per switch => 600 + 55 nodes, 1310 ports, 13.5 Tbps
  • IB 4x 10 Gbps x 2 per node
  • Voltaire ISR9288 switches
  • IB 4x 10 Gbps x 24 between switches
  • Bisection BW = 2.88 Tbps x 2
  • X4500 x 42 nodes (42 ports, IB 4x 10 Gbps) => 420 Gbps
  • Single-mode fiber for cross-floor connections
  • External Ether
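As a rough sanity check, the aggregate and bisection figures above follow from the port counts and ~10 Gbps per IB 4x link; a minimal sketch in C, assuming (our reading) that the "x 2" denotes two rails / two core planes:

    /* Back-of-the-envelope check of the network figures on this slide
     * (port counts taken from the slide; the two-plane assumption is ours). */
    #include <stdio.h>

    int main(void) {
        const double link_gbps   = 10.0;   /* IB 4x SDR, ~10 Gbps per link            */
        const int    node_ports  = 1310;   /* 655 compute nodes x 2 rails             */
        const int    extra_ports = 50;     /* storage/service ports (slide: "+50")    */
        const int    core_ports  = 288;    /* Voltaire ISR9288 port count             */
        const int    core_planes = 2;      /* two rails / two core planes (assumed)   */

        double aggregate_tbps = (node_ports + extra_ports) * link_gbps / 1000.0;
        double bisection_tbps = core_ports * link_gbps * core_planes / 1000.0;

        printf("aggregate injection BW ~ %.1f Tbps (slide: ~13.5 Tbps)\n", aggregate_tbps);
        printf("restricted fat-tree bisection ~ %.2f Tbps (slide: 2.88 Tbps x 2)\n",
               bisection_tbps);
        return 0;
    }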

SLIDE 13

The Benefits of Being "Fat Node"

  • Many HPC apps favor large SMPs
  • Flexible programming models---MPI, OpenMP, Java, ...
  • Lower node count – higher reliability/manageability
  • Full interconnect possible --- less cabling & smaller switches, multi-link parallelism, no "mesh" topologies

Machine                                   CPUs/Node   Memory/Node      Peak/Node
Typical PC Cluster                        2~4         1~8 GB           10~40 GF
IBM BG/L                                  2           0.5~1 GB         5.6 GF
TSUBAME (Tokyo Tech)                      16          32~128(new) GB   76.8 GF + 96 GF
The Earth Simulator                       8           16 GB            128 GF
Fujitsu PrimePower (Kyoto-U, Nagoya-U)    64~128      512 GB           532.48~799 GF
Hitachi SR11000 (U-Tokyo, Hokkaido-U)     8, 16       32~64 GB         60.8~135 GF
IBM eServer (SDSC DataStar)               8, 32       16~128 GB        48~217.6 GF

SLIDE 14

TSUBAME Cooling Density Challenge

  • Room 2F-B
    – 480 nodes, 1330 W/node max, 42 racks
    – Rack area = 2.5m x 33.2m = 83 m2 = 922 ft2
      • Rack space only---excludes CRC units
    – Max power = X4600 nodes 1330 W x 480 nodes + IB switches 3000 W x 4 = 650 KW
    – Power density ~= 700 W/ft2 (!)
  • Well beyond state-of-the-art datacenters (500 W/ft2)
    – Entire floor area ~= 14m x 14m ~= 200 m2 = 2200 ft2
    – But if we assume 70% added cooling power, as in the Earth Simulator, the total is 1.1 MW – still ~500 W/ft2
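The density figures above are straightforward arithmetic; a minimal sketch, assuming (as the slide does) a 70% cooling overhead as for the Earth Simulator:

    /* Quick check of the power-density arithmetic on this slide
     * (input values copied from the slide). */
    #include <stdio.h>

    int main(void) {
        const double node_w      = 1330.0;    /* max W per X4600 node            */
        const int    nodes       = 480;
        const double ib_switch_w = 3000.0;
        const int    ib_switches = 4;
        const double rack_ft2    = 922.0;     /* 2.5 m x 33.2 m = 83 m2          */
        const double floor_ft2   = 2200.0;    /* ~14 m x 14 m = ~200 m2          */

        double it_kw = (node_w * nodes + ib_switch_w * ib_switches) / 1000.0;
        printf("IT load          : %.0f kW\n", it_kw);                        /* ~650 kW */
        printf("rack-area density: %.0f W/ft2\n", it_kw * 1000.0 / rack_ft2); /* ~700    */

        double with_cooling_kw = it_kw * 1.7; /* +70% cooling, Earth Simulator style */
        printf("incl. cooling    : %.2f MW -> %.0f W/ft2 over the whole floor\n",
               with_cooling_kw / 1000.0, with_cooling_kw * 1000.0 / floor_ft2); /* ~500 */
        return 0;
    }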

SLIDE 15

TSUBAME Physical Installation

  • 700 W/ft2 in the hatched area
  • 500 W/ft2 for the whole room
  • High-density cooling & power reduction

(2nd Floor B floor plan: TSUBAME)

SLIDE 16

Cooling and Cabling at 700 W/ft2
---hot/cold row separation and rapid airflow---

Diagram labels:
  • 46U racks, 11 SunFire X4600 units each
  • Low ceiling (3 m): smaller air volume
  • CRC units, 25-27 degrees
  • 45 cm raised floor, cabling only---no floor cooling, no turbulent airflow causing hotspots
  • Cold row / isolated hot row; narrow aisles
  • Pressurized cool air increases effective air volume, evens flow
  • Isolation plate prevents Venturi effect

SLIDE 17

Photo captions:
  • Very narrow hot-row aisle---hot air from the nodes on the right is immediately absorbed and cooled by the CRC units on the left
  • Narrow cold-row aisle---no floor cooling, just cables underneath
  • Duct openings on the ceiling, and transparent isolation plates to prevent hot-cold mixture
  • Pressurized cold air blowing down from the ceiling duct---very strong wind

SLIDE 18

TSUBAME as No.1 in Japan circa 2006

All university national centers combined: total 45 TeraFlops, 350 Terabytes (circa 2006)

TSUBAME: >85 TeraFlops, 1.1 Petabytes, 4-year procurement cycle
  • Has beaten the Earth Simulator in both peak and Top500
  • Has beaten all the other univ. centers combined

SLIDE 19

Isolated High-End vs. "Everybody's Supercomputer"

The isolated high-end ("Might as well use my laptop"):
  • Different usage env. from the client's PC
  • No HP sharing with the client's PC
  • Special HW/SW, lack of ISV support
  • Lack of common development env. (e.g. Visual Studio)
  • Simple batch-based use; no interactive usage or good UI

=> Massive usage environment gap

Service-oriented idealism of the Grid: seamless integration of supercomputer resources with end-user and enterprise environments

Seamless, ubiquitous access and usage => breakthrough science through commoditization of supercomputing and Grid technologies: "Everybody's Supercomputer" ("Hmm, it's like my personal machine")

SLIDE 20

Grid Portal-based WebMO

Computational chemistry web portal for a variety of apps (Gaussian, NWChem, GAMESS, MOPAC, Molpro) (Prof. Takeshi Nishikawa @ GSIC)

Workflow: 1. SSO  2. Job mgmt  3. Edit molecules  4. Set conditions (runs on TSUBAME and WinCCS)

HPC services in educational activities to over 10,000 users
  • High-end education using supercomputers in undergrad labs
    – High-end simulations to supplement "physical" lab courses
  • Seamless integration of lab resources to SCs w/ grid technologies
  • Portal-based application usage

"My desktop scaled to 1000 CPUs!" ☺

SLIDE 21

TSUBAME General-Purpose DataCenter Hosting
As a core of IT consolidation: all university members == users

  • Campus-wide AAA system (April 2006)
    – 50 TB (for email), 9 Galaxy1 nodes
  • Campus-wide storage service (NEST)
    – 10s of GBs for everyone on campus
    – PC-mountable, but accessible directly from TSUBAME
    – Research repository
  • CAI, on-line courses (OCW = Open CourseWare)
  • Administrative hosting (VEST)

"I can back up ALL my data" ☺

SLIDE 22

Tsubame Status

How it’s flying about… (And doing some research too)

SLIDE 23

TSUBAME Timeline

  • 2005, Oct. 31: TSUBAME contract
  • Nov. 14th: announce @ SC2005
  • 2006, Feb. 28: stopped services of the old SCs
    – SX-5, Origin2000, HP GS320
  • Mar 1~Mar 7: moved the old machines out
  • Mar 8~Mar 31: TSUBAME installation
  • Apr 3~May 31: Experimental production phase 1
    – 32 nodes (512 CPUs), 97 Terabytes storage, free usage
    – May 1~8: whole-system Linpack, achieving 38.18 Teraflops (May 8th), #7 on the 28th Top500
  • June 1~Sep. 30: Experimental production phase 2
    – 299 nodes (4,748 CPUs), still free usage
  • Sep. 25-29: Linpack w/ ClearSpeed, 47.38 TF
  • Oct. 1: Full production phase
    – ~10,000 CPUs, several hundred Terabytes for SC
    – Innovative accounting: Internet-like Best Effort & SLA

SLIDE 24

TSUBAME Scheduling and Accounting
---Synonymity w/ Existing Social Infrastructures

  • Three account/queue types (VO-based) (REAL MONEY!)
    – Small FREE usage: "promotion trial (catch-and-bait)"
    – Service Level Agreement: "cell phones"
      • Exclusivity and other high QoS guarantees
    – Best Effort (new): "Internet ISP"
      • Flat allocation fee per each "UNIT"
  • Investment model for allocation (e.g. "stocks & bonds")
    – Open & extensive information, fair policy guarantee
    – Users make their own investment decisions---collective societal optimization (Adam Smith), c.f. top-down planned allocation (planned economy)

Example (Nano VO, max CPU = 192): 64-CPU units allocated per month, Jan-Mar
Dynamic machine-level resource allocation, priority SLA > BES > Small
Over 1,300 SC users, 10,000 accounts
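The SLA > Best Effort > Small ordering above is a policy statement; purely as an illustration (the type and field names here are ours, not TSUBAME's scheduler), competing jobs could be ranked like this:

    /* Illustrative sketch of the SLA > BES > Small(free) queue ordering,
     * with submission time as the tie-breaker within a class. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef enum { QUEUE_SMALL = 0, QUEUE_BES = 1, QUEUE_SLA = 2 } queue_class;

    typedef struct {
        const char *name;
        queue_class cls;
        double      submit_time;   /* earlier submissions win ties within a class */
    } job;

    static int by_priority(const void *a, const void *b) {
        const job *x = a, *y = b;
        if (x->cls != y->cls) return (int)y->cls - (int)x->cls;   /* SLA first */
        return (x->submit_time > y->submit_time) - (x->submit_time < y->submit_time);
    }

    int main(void) {
        job q[] = { {"param-sweep",   QUEUE_SMALL, 10.0},
                    {"megabank-risk", QUEUE_SLA,   30.0},
                    {"mpi-cfd",       QUEUE_BES,   20.0} };
        qsort(q, 3, sizeof q[0], by_priority);
        for (int i = 0; i < 3; i++) printf("%d: %s\n", i, q[i].name);
        return 0;
    }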

SLIDE 25

Batch Queue Prediction on TSUBAME (work w/ Rich Wolski, UCSB)

  • Long wait times for small jobs due to massive parameter sweeps
  • Long wait times for large jobs due to long-running MPI jobs that are difficult to pre-empt and require app-specific QoS (e.g., memory)
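The batch-queue-prediction work referenced above estimates bounds on how long a newly submitted job will wait. A minimal sketch of the underlying idea only---an empirical quantile over historical waits, not the actual UCSB implementation:

    /* Sketch: predict "95% of similar jobs started within T seconds" from a
     * history of observed wait times (the history values are hypothetical). */
    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_dbl(const void *a, const void *b) {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    /* Return the q-quantile (0..1) of n observed wait times, in seconds. */
    static double wait_quantile(double *waits, size_t n, double q) {
        qsort(waits, n, sizeof *waits, cmp_dbl);
        size_t idx = (size_t)(q * (double)(n - 1));
        return waits[idx];
    }

    int main(void) {
        /* hypothetical wait-time history (seconds) for, say, <=8-CPU jobs */
        double history[] = {30, 45, 60, 90, 120, 150, 300, 600, 1200, 7200};
        size_t n = sizeof history / sizeof history[0];
        printf("predicted 95%%-quantile wait: %.0f s\n",
               wait_quantile(history, n, 0.95));
        return 0;
    }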

SLIDE 26

New School Year

SLIDE 27

Tsubame in Magazines (e.g., Unix Magazine, a 20 page special)

SLIDE 28

For Details…

  • A ~70-page document that describes the policy, the implementation, and every other little detail… (by M. Hamakawa @ Sun Services, Japan)

SLIDE 29

Titech Supercomputer Contest "The 12th SuperCon"

  • High-school students (~10 teams out of 50 team applications)
  • Since 1995: Cray => Origin => TSUBAME
  • 700 CPUs allocated for 1 week
  • Sponsors

Multiple testimonies: "TSUBAME was so easy to use, just like my PC, but much faster!"

SLIDE 30

TSUBAME Application Profile

  • Large-scale codes, e.g. ports from the Earth Simulator
    – Simple porting is easy
    – Turning tuned vector code into cache-friendly "normal code" takes more time
  • Large-scale (>1,000~10,000 instances) parameter survey, ensemble, optimization, …
  • Lots of ISV code---Gaussian, Amber, …
  • Storage-intensive codes---visualization
  • => Often limited by memory, not CPUs
  • Must give users both EASE and a COMPELLING REASON to use TSUBAME

SLIDE 31

TSUBAME Applications---Massively Complex Turbulent Flow and its Visualization (by Tanahashi Lab and Aoki Lab, Tokyo Tech.)

Turbulent flow from an airplane; Taylor-Couette flow

SLIDE 32

AMBER Example: 1UAO with Water Molecules

  • Smallest protein, chignolin, in a TIP3P water buffer (30 Å radius)
  • 37,376 atoms
  • Cutoff 20.0 Å
  • 2.0 fs timestep

The three conditions show good scalability in the 30 Å and 40 Å cases.

SLIDE 33

TSUBAME Job Statistics

  • Dec. 2006 - Aug. 2007 (#jobs)
  • 797,886 jobs (~3,270 daily)
  • 597,438 serial jobs (74.8%)
  • 121,108 <=8p jobs (15.2%)
  • 129,398 ISV application jobs (16.2%)
  • However, >32p jobs account for 2/3 of cumulative CPU usage

[Chart: number of TSUBAME jobs vs. #processors per job (=1p, <=8p, <=16p, <=32p, <=64p, <=128p, >128p)]

Coexistence of ease-of-use in both
  • short-duration parameter survey
  • large-scale MPI
(Both are hard for a physically large-scale distributed grid)

SLIDE 34

TSUBAME Job Statistics for ISV Apps (# Processes)

[Chart: # of jobs over 10 months (except PGI_CDK) for ABAQUS, AMBER, AVS_Express (Developer + PCE), EnSight, Gaussian, GaussView, GROMACS, Mathematica, MATLAB, Molpro, MOPAC, MSC_NASTRAN, MSC_PATRAN, NWChem, POV-Ray, SAS, Tinker, UTChem, GAMESS; Amber 8%, Gaussian 1%]

SLIDE 35

Reprisal: TSUBAME Job Statistics for ISV Apps (# CPU Timeshare)

[Chart: CPU timeshare from Apr. 2006 to Jan. 2007 (ISV apps only) for ABAQUS, AMBER, AVS_Express, Discovery Studio, EnSight, Gaussian, GaussView, GROMACS, MaterialsExplorer, Materials Studio, Mathematica, MATLAB, Molpro, MOPAC, MSC_NASTRAN, MSC_PATRAN, NWChem, PGI_CDK; Gaussian 6%, Amber 35%]

Multi-user and ensemble! (60,000-way Gaussian ensemble job recorded on TSUBAME) => Throughput(!)

SLIDE 36

TSUBAME Draws Research Grants

  • "Computationism" Global Center-of-Excellence (Global COE) Program
    – Incubating math/computer science/HPC experts
    – $2~2.5 mil x 5 years
  • "Center of (Industrial) Innovation Program"
    – Industrial collaboration w/ high-end facilities
    – ~$1 mil x 5 years
  • More coming…
SLIDE 37

Computationism Approach to Science

Non-traditional computational modeling: apply non-traditional mathematical approaches, making the impossible (infeasible) possible.

Example) Proteomic interactions: 1000 x 1000 mutual interactions of proteins (P1 … P1000 against P1 … P1000), complex & large scale
  • Infeasible with traditional ab-initio approaches: 100s of years on a Petascale supercomputer
  • Structural matching [Y. Akiyama] ⇒ non-traditional modeling and approach, possible in a few months
  • Complexity: 1000 ⇒ 1000 x 1000
  • Drug design: narrowing the candidates

SLIDE 38

Educating "Computationism Experts"
Incubating Computing Generalists

Target profile (collaborates with a domain-scientist counterpart):
  • Theory of computing & applied math: algorithms, optimization theory, probabilistic theory, …
  • HPC & CS expertise: modeling, programming, systems, …
  • Computationism ideology: work with domain scientists; willing to study and understand the science and the discipline

SLIDE 39

Building the COE on TSUBAME

COE TSUBAME @ GSIC, Titech (diagram):
  • 10 Gbps InfiniBand, 2,304 ports; Super Titanet
  • Sun Fire X4600 nodes + ClearSpeed Advance accelerator boards (360 boards, 35 TFlops peak); Sun Fire X4500 storage; NEC iStorage S1800AT
  • 657 nodes, 5,256 CPUs, 10,512 cores; 50.6 TFlops (peak) scalar, 21.7 Terabytes memory; 85 TFlops (peak) total, 47.38 TFlops (Linpack); #1 in Asia
  • 24 Gbps; 62 storage nodes, 1.5 Petabytes disk storage; 0.1 PB RAID6
  • Roles: production HPC service, COE education, COE research, TSUBAME acceleration, TSUBAME storage extensions

SLIDE 40

Ministry of Edu. "Center of Innovation Program"

Industrial collaboration w/ high-end facilities: provide industrial access to TSUBAME (via Grid)

  • (x86) PC & WS apps in industry directly execute at x10~x100 scale
    – Not just CPU power but memory/storage/network, etc.
  • HPC-enabling non-traditional industries---ICT, financials, security, retail, services, … (incl. Java, Perl codes)
  • E.g. ultra-large-scale portfolio risk analysis by a megabank (ongoing)

SLIDE 41

Why are Industries Interested in TSUBAME?

  • Standard corporate x86 cluster env. vs. TSUBAME:

                    Std. corporate cluster   TSUBAME
Per node
  CPU cores         2~4                      16
  RAM               2~8 GB                   32~128 GB
  Disk (cap, BW)    500 GB, 50 MB/s          120 TB, 1 GB/s
  Network           1 Gbps                   20 Gbps
Per job
  CPU cores         32~128                   1,920
  RAM               128 GB                   3,840 GB
  Disk (cap, BW)    10 TB (NAS), 100 MB/s    120 TB, 3 GB/s
  Network           32 Gbps                  2.5 Tbps

  => roughly x10~x60 advantage

SLIDE 42

The Industry Usage is Real(!!!) and will be Stellar(!!!)

  • Two calls since July: 8 real industry apps for TSUBAME (and 18 others for the Nat'l Univ. Centers coalition)
  • Example: a Japanese megabank has run a real financial analysis app on 1/3 of TSUBAME, and is EXTREMELY happy with the stellar results
    – Only runnable with >20 GB mem, IB-based I/O
    – Stay tuned for follow-on announcements…
  • Big booster for non-dedicated commercial usage
    – The overall grid must be as such

SLIDE 43

Research: Grid Resource Sharing with Virtual Clusters ([CCGrid2007] etc.)

  • Virtual Cluster
    – Virtual Machines (VMs) as computing nodes
      • Per-user customization of the exec environment
      • Hides software heterogeneity
      • Seamless integration with the user's own resources
    – Interconnected via overlay networks
      • Hides network asymmetry
      • Overcomes private networks and firewalls

[Diagram: physical resources host Virtual Cluster A (User A, 128 nodes, MPI, Java) and Virtual Cluster B (User B, 200 nodes, MPI, gcc), extended with the user's own resources]

SLIDE 44

Our VPC Installer Architecture

[Diagram: the user submits a virtual cluster requirement; an installation server holds VM images and packages and deploys VMs across Site A and Site B]

  • Easy specification of the installation request
  • Scalable image transfer
  • Fast environment construction on VMs
  • Autonomic scheduling of VM resources

SLIDE 45

Scalability w/ # of VPC Nodes: Optimistic Extrapolation to 1000 VMs

[Chart: construction time (sec) vs. number of nodes, broken down into whole / transfer / installation; outliers likely due to some unstable HDDs]

  • If we pruned unreasonably slow HDDs… a 1000-VM virtual cluster in less than 1 minute!

SLIDE 46

TSUBAME Siblings---The Domino Effect on Major Japanese SCs

  • Sep. 6th, 2006---U-Tokyo, Kyoto-U, and U-Tsukuba announced a "common procurement procedure" for the next-gen SCs in 1H2008
    – 100-150 TFlops
    – HW: x86 cluster-like SC architecture
    – NW: Myrinet 10G or IB + Ethernet
    – SW: Linux + SCore, common Grid MW
  • Previously, ALL centers ONLY had dedicated SCs
  • Other centers will likely follow…
    – No other choice to balance widespread usage, performance, and prices
    – Makes EVERY sense for university mgmt.
  • (VERY) standardized SW stack and HW configuration
    – Adverse architecture diversity has been an impediment to Japanese Grid infrastructure

SLIDE 47

Japan's 9 Major University Computer Centers (excl. National Labs), circa Spring 2006
(10 Gbps SuperSINET interconnecting the centers)

  • Hokkaido University, Information Initiative Center: HITACHI SR11000, 5.6 Teraflops
  • Tohoku University, Information Synergy Center: NEC SX-7, NEC TX7/AzusA
  • University of Tokyo, Information Technology Center: HITACHI SR8000, HITACHI SR11000, 6 Teraflops; others (in institutes)
  • University of Tsukuba: FUJITSU VPP5000, PACS-CS, 14.5 TFlops
  • Tokyo Inst. Technology, Global Scientific Information and Computing Center: 2006 NEC/SUN TSUBAME, 85 Teraflops
  • National Inst. of Informatics: SuperSINET/NAREGI Testbed, 17 Teraflops
  • Nagoya University, Information Technology Center: FUJITSU PrimePower2500, 11 Teraflops
  • Kyoto University, Academic Center for Computing and Media Studies: FUJITSU PrimePower2500, 8.9 Teraflops
  • Osaka University, CyberMedia Center: NEC SX-5/128M8, HP Exemplar V2500/N, 1.2 Teraflops
  • Kyushu University, Computing and Communications Center: FUJITSU VPP5000/64, IBM Power5 p595, 5 Teraflops

~60 SC centers in Japan incl. the Earth Simulator
  • 10 Petaflop center by 2012

SLIDE 48

Japan's 9 Major University Computer Centers (excl. National Labs), circa 2008
(>40 Gbps SuperSINET3 interconnecting the centers)

  • Hokkaido University, Information Initiative Center: HITACHI SR11000, 5.6 Teraflops
  • Tohoku University, Information Synergy Center: NEC SX-7, NEC TX7/AzusA
  • University of Tokyo, Information Technology Center: NextGen x86, 150 Teraflops; HITACHI SR11000, 18 Teraflops; others (in institutes)
  • University of Tsukuba: 2006 PACS-CS, 14.5 TFlops; NextGen x86, 100-150 Teraflops
  • Tokyo Inst. Technology, Global Scientific Information and Computing Center: NEC/SUN TSUBAME, 85 Teraflops => 250 TFlops?
  • National Inst. of Informatics: NAREGI Testbed, 4 Teraflops
  • Nagoya University, Information Technology Center: FUJITSU PrimePower2500, 11 Teraflops
  • Kyoto University, Academic Center for Computing and Media Studies: NextGen x86, 100-150 Teraflops
  • Osaka University, CyberMedia Center: NEC SX-8 or SX-9; 2008 x86 cluster, 35 Teraflops
  • Kyushu University, Computing and Communications Center: 2007 x86, 50 TeraFlops? Fujitsu Primequest?; IBM Power5 p595, 5 Teraflops

x86 TSUBAME sibling domination. Still: a 10 Petaflop center by 2012???

SLIDE 49

TSUBAME Upgrades

SLIDE 50

TSUBAME Upgrade Plan (diagram, translated from Japanese):
  • ClearSpeed CSX600 SIMD accelerator, currently 35 TeraFlops; (1) divert 15 Teraflops of the cluster to advanced education/research (+ operations) and compensate the lost supercomputer performance with ClearSpeed (100 TF total)
  • Storage: 1 Petabyte (Sun "Thumper", 48 x 500 GB disks per unit), 0.1 Petabyte (NEC iStore), Lustre file system
  • (2) NEC SX-8 vector computer (for legacy codes, porting tests, etc.)
  • High-speed network, bisection BW up to 13 Tbps (400 Gbps to storage)
  • Sun/AMD HPC cluster (Opteron dual-core, 8-way): 10,480 cores / 655 nodes, 50 TeraFlops; OS currently Linux (Solaris, Windows under consideration); NAREGI grid middleware
  • Voltaire ISR9288 Infiniband, 10 Gbps x 288 ports
  • 10 Gbps+ external network
  • SE staffing increase: 2 system + 2 application engineers with improved skill levels (previously 3 total); (3) HD visualization / image display equipment, etc.

Towards a Multi-Petabyte Data Grid Infrastructure based on TSUBAME
  • Various public research DBs and mirrors---astro, bio, chemical
  • Various observational & simulation data
  • All historical archives of research publications, documents, home pages
  • TSUBAME ~100 TeraFlops, Petabytes of storage
  • Archival & data grid middleware
  • Petabytes of stable storage, data provenance, "archiving domain knowledge"
  • NESTRE system: all user storage (documents, etc.)

SLIDE 51

TSUBAME Network: (Restricted) Fat Tree, IB-RDMA & TCP-IP

  • X4600 x 120 nodes (240 ports) per switch => 600 + 55 nodes, 1310 ports, 13.5 Tbps
  • IB 4x 10 Gbps x 2 per node
  • Voltaire ISR9288 switches
  • IB 4x 10 Gbps x 24 between switches
  • Bisection BW = 2.88 Tbps x 2
  • X4500 x 42 nodes (42 ports, IB 4x 10 Gbps) => 420 Gbps
  • Single-mode fiber for cross-floor connections
  • External Ether

SLIDE 52

NESTRE (and the old cluster nodes it replaced)

Photos: previous life / now… NESTRE

SLIDE 53

TSUBAME Linpack and Acceleration

Heterogeneity both intra- and inter-node

SLIDE 54

GSIC's Past Supercomputers and TSUBAME: Top500 Performance History and Prediction

[Chart: Top500 ranking over time (1995 through ~2010) for GSIC machines, including a Cray (1995), NEC SX-5, and TSUBAME (2006), with and without upgrades]

SLIDE 55

ClearSpeed Advance Accelerator Board

Hardware
  • 25 W max power
  • CSX600 processor x 2 (96 GFLOPS peak)
  • IEEE 754 64-bit double-precision floating point
  • 133 MHz PCI-X host interface
  • Onboard memory: 1 GB (max 4 GB)
  • Internal memory bandwidth: 2 GBytes/s
  • Onboard memory bandwidth: 6.4 GBytes/s

Software
  • Standard numerical libraries
  • ClearSpeed Software Development Kit (SDK)

Applications and Libraries
  • Linear algebra: BLAS, LAPACK
  • Bio simulations: AMBER, GROMACS
  • Signal processing: FFT (1D, 2D, 3D), FIR, Wavelet
  • Various simulations: CFD, FEA, N-body
  • Image processing: filtering, image recognition, DCTs
  • Oil & gas: Kirchhoff time/wave migration

SLIDE 56

ClearSpeed Mode-of-Use

  • 1. User application acceleration
    – Matlab, Mathematica, Amber, Gaussian…
    – Transparent, offloaded from the Opterons
  • 2. Acceleration of standard libraries (see the sketch below)
    – BLAS/DGEMM, LAPACK, FFTW…
    – Transparent to users (Fortran/C bindings)
  • 3. User applications
    – Arbitrary user applications
    – Need MPI-like programming with a C dialect

Note: acceleration is "narrow band" => hard to scale
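Mode 2 means the application keeps calling the standard BLAS interface and the accelerated library is substituted underneath. A minimal sketch using the standard CBLAS DGEMM call (the matrix sizes are arbitrary; which library actually serves the call depends on what is linked):

    /* The application is oblivious to whether this DGEMM is served by the
     * host BLAS or by an accelerated library linked in its place. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <cblas.h>

    int main(void) {
        const int M = 1024, N = 1024, K = 1024;
        double *A = malloc(sizeof(double) * M * K);
        double *B = malloc(sizeof(double) * K * N);
        double *C = calloc((size_t)M * N, sizeof(double));
        for (int i = 0; i < M * K; i++) A[i] = 1.0;
        for (int i = 0; i < K * N; i++) B[i] = 2.0;

        /* C = 1.0 * A * B + 0.0 * C */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K, 1.0, A, K, B, N, 0.0, C, N);

        printf("C[0] = %f (expect %f)\n", C[0], 2.0 * K);
        free(A); free(B); free(C);
        return 0;
    }

Whether the multiply runs on the Opterons or is offloaded to the board is then a link-time/runtime decision, which is what makes this mode transparent to users.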

SLIDE 57

ClearSpeed Matrix Library

[Diagram: the host passes input data to the board with a library call; the board computes and returns the output data]
[Chart: (MxB) x (BxM) multiplication speed (GFlops) vs. matrix size M, for B = 96, 384, 576, 768]

  • About 40 GFlops DGEMM w/ the old library
    – 70 GFlops with the new beta(!)
  • Performance heavily depends on matrix size
SLIDE 58

Issues in a (VERY) Heterogeneous HPL w/ Acceleration

  • How can we run HPL efficiently under the following conditions?
    – Need to use both the Opterons and ClearSpeed efficiently
      • About 70 GFlops by 16 Opteron cores
      • 30-40 GFlops by ClearSpeed (current)
    – Only 360 of the 655 TSUBAME nodes have ClearSpeed
    – Modification to the HPL code for heterogeneity
  • Our policy:
    – Introduce HPL processes (1) that compute with Opterons and (2) that compute with ClearSpeed
    – Make the workload of each HPL process (roughly) equal by oversubscription
SLIDE 59

Our Heterogeneous HPL Algorithm

Two types of HPL processes are introduced (see the sketch below):
  • Host processes use GOTO BLAS's DGEMM
  • SIMD processes throw DGEMM requests to the accelerator
    – An additional SIMD server process directly calls the CSXL DGEMM
    – mmap() is used for sharing matrix data between the SIMD process and the server
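A schematic of the two process types described above, with the accelerator path stubbed out (this is a sketch, not the actual TSUBAME HPL modification; the function and type names are ours):

    /* Every process calls the same dgemm_wrapper(); a "SIMD" process would
     * forward the request to the SIMD-server process (which calls the
     * accelerated CSXL DGEMM), while a host process calls the host BLAS. */
    #include <stdio.h>
    #include <cblas.h>

    typedef enum { HOST_PROCESS, SIMD_PROCESS } process_kind;

    static process_kind my_kind = HOST_PROCESS;  /* decided at startup, e.g. by rank */

    /* In the real setup the matrices would live in an mmap()ed shared region so
     * the SIMD server can read A/B and write C without copying. */
    static void dgemm_wrapper(int m, int n, int k,
                              const double *A, const double *B, double *C) {
        if (my_kind == SIMD_PROCESS) {
            /* stub: enqueue {m,n,k,A,B,C} to the SIMD server, wait for completion */
            fprintf(stderr, "would offload %dx%dx%d DGEMM to the accelerator\n", m, n, k);
        }
        /* host path (and fallback for the stub above) */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, k, 1.0, A, k, B, n, 1.0, C, n);
    }

    int main(void) {
        double A[4] = {1, 2, 3, 4}, B[4] = {1, 0, 0, 1}, C[4] = {0, 0, 0, 0};
        dgemm_wrapper(2, 2, 2, A, B, C);
        printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
        return 0;
    }

On TSUBAME the next slide reports 3 host processes (x4 threads) plus 3 SIMD processes per accelerated node, which is how the per-process workloads are roughly equalized.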
SLIDE 60

Linpack Details

  • SunFire X4600 nodes in TSUBAME
    – Each has 16 Opteron cores, 32 GB memory
  • Three measurements:
    – Full CS: ClearSpeed boards on all nodes are used
    – Half CS: the number of ClearSpeed boards is half the number of nodes
      • Heterogeneous both intra- and inter-node
    – No CS: only Opteron CPUs are used
  • Number of processes per node:
    – With CS: 3 host processes (x4 threads) + 3 SIMD processes
    – W/o CS: 4 host processes (x4 threads)

SLIDE 61

Results (2)

Peak speeds (on 60 nodes):
  • Full CS: 5.203 TFlops (N=391680)
  • Half CS: 4.366 TFlops (N=345600)
  • No CS: 3.802 TFlops (N=391680)
Note: Half CS doesn't work (is very slow) with N=391680 because of the memory limitation

Block size NB:
  • 960 in Full CS / Half CS
  • 240 in No CS

[Chart: speed (GFlops) vs. matrix size N on 60 nodes, for Full CS / Half CS / No CS]

SLIDE 62

Experimental Results

  • 47.38 TF with 648 nodes and 360 accelerators (Sep.)
    – +24% improvement over No Acc (38.18 TF)
    – +25.5 GFlops per accelerator
    – Matrix size N=1148160 (it was 1334160 in No Acc)
    – 5.9 hours
  • NEW(!): with the new DGEMM, 48.88 TFlops / 62% efficiency

[Charts: speed (TFlops) and relative speed (No Acc = 1) vs. number of nodes (60, 350, 648), for Full Acc / Half Acc / No Acc; 47.38 TF vs. 38.18 TF at 648 nodes]

SLIDE 63

Onto TSUBAME 2.0

Petascale and beyond---but how?

SLIDE 64

TSUBAME Upgrades Towards Petaflops

[Roadmap chart, 1 TF to 10 PF over 2002-2012:]
  • Titech Campus Grid 1.3 TF
  • Earth Simulator 40 TF (2002)
  • BlueGene/L 360 TF (2005)
  • Titech Supercomputing Campus Grid (incl. TSUBAME) ~90 TF (2006)
  • KEK 59 TF BG/L + SR11000
  • TSUBAME 110 TF, storage 1.6 PB, 128 GB nodes (2007)
  • US Petascales (peak) (2007~8); U-Tokyo, Kyoto-U, Tsukuba 100-150 TF (2008); others
  • TSUBAME upgrade >300 TF (2008-2H), quad-core Opteron + acceleration => interim 200 TeraFlops @ 2008
  • 2010 TSUBAME 2.0: 1 PF sustained, >10 PB (2010-11) => sustained Petaflop @ 2010, sustain leadership in Japan
  • US HPCS (2010); Japanese "Keisoku" >10 PF (2011-12); US 10P (2011~12?)

SLIDE 65

In the Supercomputing Landscape, Petaflops-class is Already Here…in Early 2008

  • 2008Q1 TACC/Sun "Ranger": ~52,600 "Barcelona" Opteron CPU cores, ~500 TFlops, ~100 racks, ~300 m2 floorspace, 2.4 MW power, 1.4 km IB CX4 copper cabling, 2 Petabytes HDD
  • 2008 LLNL/IBM "BlueGene/P": ~300,000 PPC cores, ~1 PFlops, ~72 racks, ~400 m2 floorspace, ~3 MW power, copper cabling
  • >10 Petaflops, >million cores, >10s of Petabytes planned for 2011-2012 in the US, Japan, (EU), (other APAC)

Other Petaflops 2008/2009:
  • LANL/IBM "Roadrunner"
  • JICS/Cray(?) (NSF Track 2)
  • ORNL/Cray
  • ANL/IBM BG/P
  • EU machines (Julich…) …

SLIDE 66

Scaling to a PetaFlop in 2010 is Easy, Given the Existing TSUBAME

Year                              2003      2006     2008     2010     2012     2014     2015
Microns                           0.09      0.065    0.045    0.032    0.022    0.016    0.011
Scalar cores                      1         2        4        8        16       32       64
GFLOPS/socket                     6         24       48       96       192      384      768
Total kW for 1 PF (200W/socket)   3.3E+05   83333    41667    20833    10417    5208     2604
SIMD/Vector GFLOPS/board          -         96       192      384      768      1536     3072
Total kW for 1 PF (25W/board)     -         260.4    130.2    65.1     32.6     16.3     8.14

2009, conservatively assuming 0.065-0.045 microns, 4 cores, 48 GFlops/socket => 200 Teraflops, plus an 800-Teraflop accelerator board upgrade: a "commodity" Petaflop is easily achievable in 2009-2010.
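The accelerator row of the table is just "boards needed for 1 PFlops times 25 W per board"; a small sketch reproducing it (per-board peaks taken from the table):

    /* Boards needed for 1 PFlops and the resulting power at 25 W per board. */
    #include <stdio.h>

    int main(void) {
        const int    years[]        = {2006, 2008, 2010, 2012, 2014, 2015};
        const double gflops_board[] = {96, 192, 384, 768, 1536, 3072};
        const double board_watts    = 25.0;
        const double target_gflops  = 1.0e6;   /* 1 PFlops */

        for (int i = 0; i < 6; i++) {
            double boards = target_gflops / gflops_board[i];
            double kw     = boards * board_watts / 1000.0;
            printf("%d: %8.0f boards, %7.1f kW for 1 PF\n", years[i], boards, kw);
        }
        return 0;                              /* 2006 -> ~10417 boards, ~260.4 kW */
    }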

SLIDE 67

In Fact We Can Build One Now (!)

  • @Tokyo---one of the largest IDCs in the world (in Tokyo...)
  • Can fit a 10 PF machine here easily (>20 Rangers)
  • On top of a 55 KV / 6 GW substation
  • 150 m diameter (a small baseball stadium)
  • 140,000 m2 IDC floorspace
  • 70+70 MW power
  • The size of the entire Google(?) (~a million LP nodes)

SLIDE 68

Commodity Scaling to 2~10 PFs Circa 2011 (Cont'd)

  • Loosely coupled apps scale well
  • Impractical to assume memory-intensive, large-message apps (such as spectral methods) will scale to Petaflops
    – Strong technological scaling limits in memory size, bandwidth, etc.
    – Physical limits, e.g. power/cooling, $$$
    – Impracticality in resolution (because of the chaotic nature of the physics, etc.)
  • Why ensemble methods and coupled methods (which are scalable) are good
    – => Apps that worked "well on grids" (at small scale)

SLIDE 69

Nano-Science: Coupled Simulations on the Grid as the Sole Future for True Scalability

The only way to achieve true scalability! …between continuum (10^-6 m) and quanta (10^-9 m):
  • Material physics (infinite systems): fluid dynamics, statistical physics, condensed matter theory, …
  • Molecular science: quantum chemistry, molecular orbital methods, molecular dynamics, …
  • Multi-physics
    – E.g. fragmented MO could use 100,000 loosely coupled CPUs in pseudo parameter-sweep fashion
    – E.g. advanced MD requires mid-sized, tightly coupled SMP (#CPUs not the limit, but memory and BW)

Old HPC environment: decoupled resources, hard to use, special software, too general-purpose(!)
(Slide taken from my NAREGI Grid slide stack) => A tightly coupled "Grid" as the future Petascale machine

SLIDE 70

Reprisal: TSUBAME Job Statistics for ISV Apps (# CPU Timeshare)

[Chart: CPU timeshare from Apr. 2006 to Jan. 2007 (ISV apps only) for ABAQUS, AMBER, AVS_Express, Discovery Studio, EnSight, Gaussian, GaussView, GROMACS, MaterialsExplorer, Materials Studio, Mathematica, MATLAB, Molpro, MOPAC, MSC_NASTRAN, MSC_PATRAN, NWChem, PGI_CDK; Gaussian 55%, Amber 35%]

Multi-user and ensemble! (20,000-way Gaussian ensemble job recorded on TSUBAME) => Throughput(!)

SLIDE 71

Stanford Folding@Home

  • (Ensemble) GROMACS, Amber etc. on a volunteer grid
  • PS3: 1/2 (effective) Petaflops and growing (in a standard OS(!))
  • Accelerator (GPGPU): most Flops/CPU/unit
  • Combined, 71% of effective FLOPS from 14% of the CPUs
  • 7 Petaflops peak (SFP), 10% efficiency
    – Feasible NOW to build a useful 10 PF machine

Folding@Home client statistics, 2007-03-25 18:18:07
OS Type     TFLOPS   Active CPUs   GFLOPS/CPU
Windows     154      161,586       0.95
Mac/PPC     7        8,880         0.79
Mac/Intel   9        3,028         2.97
Linux       43       25,389        1.69
GPGPU       44       749           58.74
PS3         482      30,294        15.91
Total       739      229,926       3.21

SLIDE 72

Future Multi-Petascale Designs

  • Assuming an upper bound on machine cost
  • A homogeneous machine entails compromises in all applications
  • Heterogeneous grids of large resources would allow multiple design points to coexist
  • And this also applies within a single machine as well

[Diagram: a classic design point vs. new design points (more FLOPS vs. more storage/BW, App1 vs. App2) within a single machine, or aggregated as a tightly coupled "Grid"]

SLIDE 73

Biggest Problem is Power…

Machine                     CPU Cores   Watts       Peak GFLOPS   Peak MFLOPS/Watt   Watts/CPU Core   Ratio c.f. TSUBAME
TSUBAME (Opteron)           10,480      800,000     50,400        63.00              76.34            -
TSUBAME (w/ClearSpeed)      11,200      810,000     85,000        104.94             72.32            1.00
Earth Simulator             5,120       6,000,000   40,000        6.67               1171.88          0.06
ASCI Purple (LLNL)          12,240      6,000,000   77,824        12.97              490.20           0.12
AIST Supercluster           3,188       522,240     14,400        27.57              163.81           0.26
LLNL BG/L (rack)            2,048       25,000      5,734.4       229.38             12.21            2.19
Next Gen BG/P (rack)        4,096       30,000      16,384        546.13             7.32             5.20
TSUBAME 2.0 (2010 Q3/4)     160,000     810,000     2,048,000     2,528.40           5.06             24.09

TSUBAME 2.0: a x24 improvement in 4.5 years…? ~x1000 over 10 years
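The derived columns (MFLOPS/Watt, Watts/core, ratio vs. TSUBAME) are simple ratios of the raw columns; a small sketch reproducing two rows of the table:

    /* Recompute the efficiency columns from the raw CPU-core/Watt/GFLOPS data. */
    #include <stdio.h>

    struct machine { const char *name; double cores, watts, peak_gflops; };

    int main(void) {
        struct machine m[] = {
            {"TSUBAME (w/ClearSpeed)", 11200,  810000,   85000},
            {"TSUBAME 2.0 (2010Q3/4)", 160000, 810000, 2048000},
        };
        double baseline = m[0].peak_gflops * 1000.0 / m[0].watts;  /* TSUBAME MFLOPS/W */
        for (int i = 0; i < 2; i++) {
            double mflops_per_w = m[i].peak_gflops * 1000.0 / m[i].watts;
            printf("%-24s %8.2f MFLOPS/W  %7.2f W/core  ratio %.2f\n",
                   m[i].name, mflops_per_w, m[i].watts / m[i].cores,
                   mflops_per_w / baseline);
        }
        return 0;
    }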

SLIDE 74

The New JST-CREST "Ultra Low Power HPC" Project, 2007-2012

  • x1000 Flops/W improvement @ 10 years
  • Research elements (diagram): MRAM/PRAM/Flash etc.; ultra multi-core, slow & parallel (& ULP); ULP-HPC SIMD-Vector (GPGPU, etc.); VM job migration and power optimization; ULP-HPC networks; new massive & dense cooling technologies; zero-emission power sources; modeling & power optimization; application-level low-power algorithms

SLIDE 75

TSUBAME in Retrospect and Future

  • Increasing commoditization of the HPC space
    – CPUs (since Beowulf, ASCI Red, …)
    – High-BW memory, large-memory SMP
    – Very fast I/O (PCI-E, HT3, …)
    – High-BW interconnect (10GbE, IB => 100Gb)
    – Now SIMD-Vector (ClearSpeed, GPGPU, Cell…)
    – Next: extreme many-core, optical chip-chip interconnect, 3-D chip packaging, …
  • Technology => software stack & the right apps & meta-application schema
    – The same software stack on your laptop + Grid
    – DON'T focus on a single app or user efficiency
    – Meta-application schema, multi-user, infrastructure design
    – Learn from the Grid (!)
  • Proprietary architectures make no sense
    – Ecosystems and economics are THE KEY of future HPC(!)

SLIDE 76

Beyond Petascale: "Grid" Scalability is the Key

[Diagram: #users and capacity---today ~1,000 users / ~100 TF; beyond Petascale ~1,000,000 users / ~100 PF (>x1000); per-user baseline ~1 GF (x10^6 to x10^7)]

SLIDE 77

2016 A.D.: Deskside Petascale

  • 2006 A.D. Titech Supercomputing Grid, #1 in Asia: 100 TeraFlops, >10,000 CPUs, 1.5 MegaWatts, 300 m2
  • 2016 deskside workstation: >100 TeraFlops, 1.5 KiloWatts, 300 cm2

1000-times scaling down of a SC: but how? Simple scaling will not work
  • No more aggressive clock increases
  • Multi-core works, but gives less than x100

Need R&D as "Petascale Informatics" in CS and applications to achieve the x1000 breakthrough + What can a scientist or an engineer achieve with daily, personal use of petascale simulation?

SLIDE 78

Seasonal Corporate Usage

(Application form, p. 3)