1
TSUBAME---A Year Later
Satoshi Matsuoka, Professor/Dr.Sci., Global Scientific Information and Computing Center
Tokyo Inst. Technology & NAREGI Project, National Inst. Informatics. EuroPVM/MPI, Paris, France, Oct. 2, 2007
2
3
ClearSpeed CSX600 SIMD accelerator: 360 boards, 35 TeraFlops (current). Storage: 1.0 Petabyte (Sun “Thumper”) + 0.1 Petabyte (NEC iStore); Lustre FS, NFS, CIFS, WebDAV (over IP); 50 GB/s aggregate I/O BW
500 GB x 48 disks (per Sun X4500 storage node)
NEC SX-8i (for porting)
Unified IB network
Sun Galaxy 4 (Opteron dual-core, 8-socket), 10,480 cores / 655 nodes, 21.4 Terabytes memory, 50.4 TeraFlops. OS: Linux (SuSE 9, 10). NAREGI Grid MW
Voltaire ISR9288 InfiniBand 10 Gbps x2 (DDR next ver.), ~1310+50 ports, ~13.5 Terabits/s (3 Tbits/s bisection)
10Gbps+External Network
“Fastest Supercomputer in Asia”, 29th Top500 @ 48.88 TF
70GB/s
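As a rough cross-check of the compute-node figures above (assuming 2.4 GHz dual-core Opterons at 2 flops/cycle per core, a clock rate not stated on this slide):
\[
655 \times 16 = 10{,}480\ \text{cores},\qquad 10{,}480 \times 4.8\,\mathrm{GFLOPS} \approx 50.3\,\mathrm{TFLOPS}
\]
which is close to the quoted 50.4 TeraFlops.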
4
5
~500 TB out of 1.1 PB; node rear; local InfiniBand switch (288 ports); currently 2 GB/s per node, easily scalable to 8 GB/s per node; cooling towers (~32 units)
6
AMD: Fab36
NEC: main integrator, storage, operations; SUN: Galaxy compute nodes, storage, Solaris; AMD: Opteron CPU; Voltaire: InfiniBand network; ClearSpeed: CSX600 accel.; CFS: parallel FS (Lustre); Novell: SuSE 9/10; NAREGI: Grid MW; Titech GSIC: us
7
8
– High-performance, low power x86 multi-core CPU
languages/compilers/tools, libraries, Grid tools, all ISV Applications
– FAT Node Architecture (later)
– High Bandwidth Infiniband Network, IP-based (over RDMA)
10 microsec), reliable/redundant (dual-lane)
scatter/gather collectives, embedding multi-dimensional networks
9
– All Disk Storage Architecture (no tapes), 1.1Petabyte
– Low cost / high performance SATA2 (500GB/unit) – High Density packaging (Sun Thumper), 24TeraBytes/4U – Reliability thru RAID6, disk rotation, SAN redundancy (iStore)
– High bandwidth NAS I/O: ~50GBytes/s Livermore Benchmark – Unified Storage and Cluster interconnect: low cost, high bandwidth, unified storage view from all nodes w/o special I/O nodes or SW
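A quick capacity check (using the 42 Sun X4500 nodes with 48 x 500 GB drives listed later in this deck):
\[
42 \times 48 \times 0.5\,\mathrm{TB} \approx 1{,}008\,\mathrm{TB} \approx 1.0\,\mathrm{PB},\qquad 1.0\,\mathrm{PB} + 0.1\,\mathrm{PB}\ (\mathrm{NEC\ iStore}) = 1.1\,\mathrm{PB}
\]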
– 35 Teraflops peak @ 90 KW (~ 1 rack of TSUBAME) – General purpose programmable SIMD Vector architecture
10
11
TSUBAME Physical Installation
service area
storage, 46.3 tons
– 10 storage racks
(AC heat exchangers)
1st Floor: Titech Grid Cluster; 2nd Floor A: TSUBAME; 2nd Floor B: TSUBAME & Storage
12
X4600 x 120 nodes (240 ports) per switch => 600 + 55 nodes, 1310 ports, 13.5 Tbps; IB 4x 10 Gbps x 2 per node
IB 4x 10 Gbps x 24
IB 4x 10 Gbps; X4500 x 42 nodes (42 ports) => 42 ports, 420 Gbps. Single-mode fiber for cross-floor connections
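A rough consistency check of the port and bandwidth figures (assuming 2 x 10 Gbps IB 4x links per compute node, as stated above):
\[
655 \times 2 = 1310\ \text{compute ports},\qquad (1310 + \sim\!50) \times 10\,\mathrm{Gbps} \approx 13.6\,\mathrm{Tbps}
\]
in line with the quoted ~13.5 Tbps.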
13
link parallelism, no “mesh” topologies
FAT Node comparison (CPUs/Node, Memory/Node, Peak/Node):
Typical PC Cluster                        2~4       1~8 GB            10~40 GF
IBM BG/L                                  2         0.5~1 GB          5.6 GF
TSUBAME (Tokyo Tech)                      16        32~128 (new) GB
The Earth Simulator                       16        16 GB             128 GF
Fujitsu PrimePower (Kyoto-U, Nagoya-U)    64~128    512 GB            532.48~799 GF
Hitachi SR11000 (U-Tokyo, Hokkaido-U)     8, 16     32~64 GB          60.8~135 GF
IBM eServer (SDSC DataStar)               8, 32     16~128 GB         48~217.6 GF
14
– Entire floor area ~= 14m x 14m ~= 200m2 = 2200 ft2 – But if we assume 70% cooling power as in the Earth Simulator then total is 1.1MW – still ~500W/ft2
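As a sanity check of the quoted density (taking the ~1.1 MW figure, cooling included, over the ~2200 ft2 floor):
\[
\frac{1{,}100{,}000\,\mathrm{W}}{2200\,\mathrm{ft}^2} = 500\,\mathrm{W/ft}^2
\]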
15
TSUBAME Physical Installation 700W/ft2
500W/ft2 for the whole room High density cooling & power reduction
2nd Floor B
TSUBAME TSUBAME
16
45 cm raised floor, cabling only
No turbulent airflow causing hotspots
Cold row / isolated hot row
Pressurized cool air increases effective air volume, evens the flow
Isolation plate prevents the Venturi effect
Narrow aisles
17
Very narrow hot-row aisle: heat is absorbed and cooled by the CRC units on the left. Narrow cold-row aisle---no floor cooling, just cables underneath. Duct openings on the ceiling, and transparent isolation plates to prevent hot-cold mixing.
Pressurized cold air blowing down from the ceiling duct---very strong wind
18
>85 TeraFlops, 1.1 Petabytes, 4-year procurement cycle. Has beaten the Earth Simulator in both peak and Top500. Has beaten all the other univ. centers combined.
All University National Centers
19
from the client’s PC
development env. (e.g. Visual Studio)
no interactive usage, good UI
Massive Usage Env. Gap
Seamless, ubiquitous access and usage => breakthrough science through commoditization of supercomputing and Grid technologies: “Everybody’s Supercomputer”
Hmm, it’s like my personal machine
Might as well use my Laptop
Isolated High-End
20
Grid Portal based WebMO
Computational Chemistry Web Portal for a variety of apps (Gaussian, NWChem, GAMESS, MOPAC, Molpro) (Prof. Takeshi Nishikawa @ GSIC)
1. SSO  2. Job Mgmt  3. Edit Molecules  4. Set Conditions
TSUBAME WinCCS
– High-end simulations to supplement “physical” lab courses
My desktop scaled to 1000 CPUs!☺
21
– 50TB (for email), 9 Galaxy1 nodes
– 10s of GBs for everyone on campus
PC mountable, but accessible directly from TSUBAME
– Research Repository
I can backup ALL my data☺
22
23
– SX-5, Origin2000, HP GS320
– 32 nodes (512 CPUs), 97 Terabytes storage, free usage – Linpack 38.18 Teraflops (May 8th), #7 on the 28th Top500 – May 1~8: whole-system Linpack, achieved 38.18 TF
– 299 nodes (4748 CPUs), still free usage
– ~10,000 CPUs, several hundred Terabytes for SC – Innovative accounting: Internet-like best effort & SLA
24
– Small FREE usage: “promotion trial (catch-and-bait)” – Service Level Agreement: “cell phones”
– Best effort (new): “Internet ISP”
– Open & extensive information, fair policy guarantee – Users make their own investment decisions---collective societal optimization (Adam Smith), c.f. top-down planned allocation (planned economy)
[Diagram: SLA reservations of 64-CPU blocks over Jan, Feb, Mar for the Nano-VO, max CPU = 192]
Dynamic machine-level resource allocation: SLA > BES > Small
Over 1,300 SC users, 10,000 accounts
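The three usage classes above imply a simple priority order a batch scheduler could apply: SLA ahead of best-effort ahead of small free-trial jobs, FIFO within a class. The sketch below is illustrative only; the job fields, class names, and comparator are assumptions, not TSUBAME's actual scheduler code.

#include <stdio.h>
#include <stdlib.h>

/* Illustrative service classes: SLA (cell-phone-like contract),
 * best-effort (ISP-like), and small free trial usage. */
typedef enum { CLASS_SMALL_FREE = 0, CLASS_BEST_EFFORT = 1, CLASS_SLA = 2 } job_class_t;

typedef struct {
    int         id;
    job_class_t cls;
    int         cpus;         /* requested CPUs           */
    long        submit_time;  /* FIFO order within a class */
} job_t;

/* Higher class first; within a class, earlier submission first. */
static int cmp_jobs(const void *a, const void *b)
{
    const job_t *ja = a, *jb = b;
    if (ja->cls != jb->cls)
        return (int)jb->cls - (int)ja->cls;
    return (ja->submit_time > jb->submit_time) - (ja->submit_time < jb->submit_time);
}

int main(void)
{
    job_t q[] = {
        { 1, CLASS_SMALL_FREE,    8, 100 },
        { 2, CLASS_BEST_EFFORT,  64, 101 },
        { 3, CLASS_SLA,         192, 102 },  /* e.g. a Nano-VO reservation */
    };
    qsort(q, 3, sizeof q[0], cmp_jobs);
    for (int i = 0; i < 3; i++)   /* prints SLA, then best-effort, then small */
        printf("job %d  class %d  cpus %d\n", q[i].id, q[i].cls, q[i].cpus);
    return 0;
}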
25
Long wait times for small jobs due to massive parameter sweeps; long wait times for large jobs due to long-running MPI jobs that are difficult to pre-empt and require application-specific QoS (e.g., memory)
26
27
28
29
High-school students (~10 out …)
Since 1995: Cray => Origin => TSUBAME
700 CPUs allocated for 1 week
sponsors
30
31
TSUBAME Applications---Massively Complex Turbulent Flow and its Visualization (by Tanahashi Lab and Aoki Lab, Tokyo Tech.)
Turbulent Flow from Airplane; Taylor-Couette Flow
32
33
[Chart: TSUBAME job statistics, number of processes per job and number of jobs, binned by parallelism (=1p, <=8p, <=16p, <=32p, <=64p, <=128p, >128p)]
Coexistence of ease-of-use in both
(Both are hard for a physically large-scale distributed grid)
34
[Chart: number of jobs per ISV application (except PGI_CDK): ABAQUS, AMBER, AVS_Express (Developer + PCE), EnSight, Gaussian, GaussView, GROMACS, Mathematica, MATLAB, Molpro, MOPAC, MSC_NASTRAN, MSC_PATRAN, NWChem, POV-Ray, SAS, Tinker, UTChem, GAMESS]
35
[Chart: CPU time share from Apr. 2006 to Jan. 2007 (ISV apps only): ABAQUS, AMBER, AVS_Express, Discovery Studio, EnSight, Gaussian, GaussView, GROMACS, Materials Explorer, Materials Studio, Mathematica, MATLAB, Molpro, MOPAC, MSC_NASTRAN, MSC_PATRAN, NWChem, PGI_CDK]
36
37
Structural Matching [Y. Akiyama]
⇒ Non-traditional modeling and approach
Complex & Large Scale
P1 P2 P3 P4 P5 … P1000  x  P1 P2 P3 P4 P5 … P1000
Drug Design
Narrowing the Candidate
38
Domain Scientist Counterpart:
・ Theory of Computing & Applied Math: algorithms, optimization theory, probabilistic theory, …
・ HPC & CS expertise: modeling, programming, systems, …
・ Computationism ideology: work with domain scientists; willing to study and understand the science and the discipline
39
10Gbps InfiniBand 2,304 ports
Super Titanet
Sun Fire X4600 ClearSpeed Advance Accelerator Board 360 boards, 35TFlops(peak) Sun Fire X4500
NEC iStorage S1800AT
657 nodes, 5,256 CPUs / 10,512 cores
50.6 TFlops (peak), 21.7 Terabytes memory; 85 TFlops (peak), 47.38 TFlops (Linpack)
#1 in Asia
24 Gbps; 62 nodes, 1.5 Petabytes disk storage; 0.1 PB RAID6
Production HPC Service COE Edu
COE Research TSUBAME Acceleration TSUBAME Storage Extensions
COE TSUBAME @ GSIC, Titech
40
Not just CPU power but memory/storage/network, etc.
41
Per-node and per-job capability, TSUBAME vs. a standard PC cluster:
                  TSUBAME (node / job)                      Std. (node / job)
Disk (cap, BW)    120 TB, 1 GB/s / 120 TB, 3 GB/s           500 GB, 50 MB/s / 10 TB (NAS), 100 MB/s
RAM               32~128 GB / 3840 GB                       2~8 GB / 128 GB
Network           20 Gbps / 2.5 Tbps                        1 Gbps / 32 Gbps
CPU cores         16 / 1920                                 2~4 / 32~128
42
– Virtual Machines (VM) as computing nodes
– Interconnected via overlay networks
User B
Physical Resources
Virtual Cluster A
User A
128 nodes: MPI, Java; 200 nodes: MPI, gcc
Virtual Cluster B
User’s own Resources
Site A
VM VM
VM Image VM Image
Pkg Pkg Pkg
Site B
VM
VM Image
Virtual Cluster Requirement
User Virtual Cluster
Installation Server VM
VM Image
Pkg Pkg Pkg
Easy specification of installation request; scalable image transfer; fast environment construction on VMs; autonomic scheduling of VM resources (a small sketch follows the chart below)
[Chart: virtual cluster construction time (sec) vs. number of nodes, split into whole-transfer and installation phases]
If we pruned unreasonably slow HDDs… (outliers likely due to some unstable HDDs): a 1000-VM virtual cluster in less than 1 minute!
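As an illustration of "easy specification" plus "scalable image transfer", here is a minimal sketch in C. The vc_request fields and the binomial-tree round count are assumptions for illustration, not the actual middleware interface used in this work.

#include <stdio.h>

/* Hypothetical virtual-cluster installation request (field names are
 * illustrative, not a real API). */
struct vc_request {
    const char *name;        /* e.g. "nano-vo-A"          */
    int         nodes;       /* number of VMs to create   */
    const char *base_image;  /* golden VM image           */
    const char *packages[8]; /* extra packages to install */
};

/* Scalable image transfer: with a binomial-tree push, every node that
 * already holds the image forwards it to one new node per round, so the
 * number of holders doubles each round and N nodes need ceil(log2 N) rounds. */
static int transfer_rounds(int nodes)
{
    int rounds = 0, holders = 1;
    while (holders < nodes) { holders *= 2; rounds++; }
    return rounds;
}

int main(void)
{
    struct vc_request req = { "nano-vo-A", 1000, "suse9-base.img",
                              { "mpich", "gcc", NULL } };
    printf("%s: %d VMs from %s, ~%d transfer rounds\n",
           req.name, req.nodes, req.base_image, transfer_rounds(req.nodes));
    return 0;
}

With 1000 VMs this gives 10 transfer rounds, which is one reason sub-minute construction is plausible once slow disks are excluded.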
46
announced “common procurement procedure” for the next gen SCs in 1H2008
– 100-150 TFlops – HW: x86 cluster-like SC architecture – NW: Myrinet10G or IB + Ethernet – SW: Linux+SCore, common Grid MW
– No other choices to balance widespread usage, performance, and prices – Makes EVERY sense for University Mgmt.
– Adverse architecture diversity has been an impediment to the Japanese Grid infrastructure
47 Hokkaido University
Information Initiative Center
HITACHI SR11000 5.6 Teraflops
Tohoku University
Information Synergy Center
NEC SX-7 NEC TX7/AzusA
University of Tokyo
Information Technology Center
HITACHI SR8000 HITACHI SR11000 6 Teraflops Others (in institutes)
Nagoya University
Information Technology Center
FUJITSU PrimePower2500 11 Teraflops
Osaka University
CyberMedia Center
NEC SX-5/128M8 HP Exemplar V2500/N 1.2 Teraflops
Kyoto University
Academic Center for Computing and Media Studies FUJITSU PrimePower2500
8.9 Teraflops
Kyushu University
Computing and Communications Center
FUJITSU VPP5000/64 IBM Power5 p595 5 Teraflops
Tokyo Inst. Technology
Global Scientific Information and Computing Center
2006 NEC/SUN TSUBAME 85 Teraflops
University of Tsukuba
FUJITSU VPP5000 PACS-CS 14.5 TFlops
National Inst. of Informatics
SuperSINET/NAREGI Testbed 17 Teraflops
~60 SC centers in Japan incl. the Earth Simulator
cent er by 2012
48 Hokkaido University
Information Initiative Center
HITACHI SR11000 5.6 Teraflops
Tohoku University
Information Synergy Center
NEC SX-7 NEC TX7/AzusA
University of Tokyo
Information Technology Center
NextGen x86 150 Teraflops
HITACHI SR11000 18 Teraflops Others (in institutes)
Nagoya University
Information Technology Center
FUJITSU PrimePower2500 11 Teraflops
Osaka University
CyberMedia Center
NEC SX-8 or SX-9
2008 x86 Cluster 35 Teraflops Kyoto University
Academic Center for Computing and Media Studies
NextGen x86 100-150 Teraflops Kyushu University
Computing and Communications Center
2007 x86 50 TeraFlops?
Fujitsu Primequest? IBM Power5 p595 5 Teraflops
Tokyo Inst. Technology
Global Scientific Information and Computing Center
NEC/SUN TSUBAME 85 Teraflops 250 TFlops?
University of Tsukuba
2006 PACS-CS 14.5 TFlops NextGen x86 100-150 Teraflops
National Inst. of Informatics
NAREGI Testbed 4 Teraflops
x86 TSUBAME sibling domination. Still - 10 Petaflop center by 2012
49
50
ClearSpeed CSX600 SIMD accelerator, currently 35 TeraFlops. (1) Redirect 15 TeraFlops of the cluster to advanced education and research (+ operations) use;
compensate for the supercomputer's lost performance with ClearSpeed (100 TF in total).
Storage: 1 Petabyte (Sun “Thumper”) + 0.1 Petabyte (NEC iStore), Lustre file system
500 GB x 48 disks per storage node. (2) NEC SX-8 vector computer
(for legacy codes, porting tests, etc.)
High-speed network, bisection BW up to 13 Tbps
(400 Gbps to storage). Sun/AMD high-performance compute cluster (Opteron dual-core, 8-way), 10,480 cores / 655 nodes, 50 TeraFlops. OS (current): Linux; (under consideration): Solaris, Windows. NAREGI Grid middleware.
Voltaire ISR9288 InfiniBand, 10 Gbps x 288 ports
10 Gbps+ external network
SE staffing increase: 2 system + 2 application engineers, raising the skill level (previously 3 staff). (3) HD visualization and image display equipment, etc.
Various public research DBs and Mirrors---Astro, Bio, Chemical
Various Observational & Simulation Data
All Historical Archive of Research Publications, Documents, Home Pages,
TSUBAME ~ 100 TeraFlops, Petabytes Storage
Archival & Data Grid Middleware
Petabytes, Stable Storage Data Provenance “Archiving Domain Knowledge”
NESTRE System
All User Storage
( Documents, etc)
51
X4600 x 120 nodes (240 ports) per switch => 600 + 55 nodes, 1310 ports, 13.5 Tbps; IB 4x 10 Gbps x 2 per node
IB 4x 10 Gbps x 24
IB 4x 10 Gbps; X4500 x 42 nodes (42 ports) => 42 ports, 420 Gbps. Single-mode fiber for cross-floor connections
52
Previous Life … Now…
53
54
[Chart: GSIC's past supercomputers and TSUBAME, Top500 performance history and prediction; time series from 11/1995 onward, showing ranking with and without upgrades, from the Cray C90 and NEC SX systems through TSUBAME (2006/4)]
55
Hardware
・ 25 W max power
・ CSX600 processors x 2 (96 GFLOPS peak)
・ IEEE 754 64-bit double precision floating point
・ 133 MHz PCI-X host interface
・ Onboard memory: 1 GB (max 4 GB)
・ Internal memory bandwidth: 2 Gbytes/s
・ Onboard memory bandwidth: 6.4 Gbytes/s
Software
・ Standard numerical libraries
・ ClearSpeed Software Development Kit (SDK)
Applications and Libraries
・ Linear algebra: BLAS, LAPACK
・ Simulations: AMBER, GROMACS
・ Signal processing: FFT (1D, 2D, 3D), FIR, wavelet
・ Various simulations: CFD, FEA, N-body
・ Image processing: filtering, image recognition, DCTs
・ Oil & gas: Kirchhoff time/wave migration
56
57
[Diagram: offload flow: input data sent to the board, library call, computation on the board, output data returned]
[Chart: (MxB) x (BxM) multiplication speed (GFlops) vs. matrix size M, for block sizes B = 96, 384, 576, 768]
– 70GFlops with new beta(!)
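The offload pattern above (send input, call the library, compute on the board, return output) can be sketched as a size-threshold DGEMM dispatcher. This is only an illustration: host_dgemm_, accel_dgemm_, and the threshold value are placeholder names and numbers, not the actual CSXL integration; link against real implementations to use it.

/* Hypothetical DGEMM dispatcher: route large multiplies to the accelerator,
 * small ones to the host BLAS (standard Fortran BLAS calling convention). */
extern void host_dgemm_(const char *ta, const char *tb,
                        const int *m, const int *n, const int *k,
                        const double *alpha, const double *a, const int *lda,
                        const double *b, const int *ldb,
                        const double *beta, double *c, const int *ldc);
extern void accel_dgemm_(const char *ta, const char *tb,
                         const int *m, const int *n, const int *k,
                         const double *alpha, const double *a, const int *lda,
                         const double *b, const int *ldb,
                         const double *beta, double *c, const int *ldc);

/* Below this size (assumed value) the PCI-X transfer cost outweighs the
 * board's DGEMM speed, so stay on the host. */
#define ACCEL_THRESHOLD 448

void dgemm_(const char *ta, const char *tb,
            const int *m, const int *n, const int *k,
            const double *alpha, const double *a, const int *lda,
            const double *b, const int *ldb,
            const double *beta, double *c, const int *ldc)
{
    if (*m >= ACCEL_THRESHOLD && *n >= ACCEL_THRESHOLD && *k >= ACCEL_THRESHOLD)
        accel_dgemm_(ta, tb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
    else
        host_dgemm_(ta, tb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
}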
58
59
Host process, SIMD process, SIMD server
Additional SIMD server directly calls CSXL DGEMM
60
61
Peak speeds are … (N=391680), (N=345600), (N=391680)
Note: Half CS doesn't work (very slow) with N=391680, because of the memory limitation
Block size NB is …
[Chart: Linpack speed (GFlops) vs. matrix size N for Full CS, Half CS, and No CS configurations]
62
– +24% improvement over No Acc (38.18 TF) – +25.5 GFlops per accelerator – Matrix size N=1148160 (it was 1334160 in No Acc) – 5.9 hours
[Charts: Linpack speed (TFlops) and relative speed (No Acc = 1) vs. number of nodes (60, 350, 648) for Full Acc, Half Acc, and No Acc]
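A quick consistency check of the accelerated Linpack numbers (assuming the +25.5 GFlops figure applies to each of the 360 boards):
\[
360 \times 25.5\,\mathrm{GFlops} \approx 9.2\,\mathrm{TFlops},\qquad 38.18 + 9.2 \approx 47.4\,\mathrm{TFlops} \approx 1.24 \times 38.18\,\mathrm{TFlops}
\]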
63
64
1TF 10TF 100TF 1PF 2002 2006 2008 2010 2012 2004
Earth Simulator 40TF (2002) TSUBAME2 1PF Sustained, >10PB (2010-11)
2010 TSUBAME 2.0
=> Interim 200TeraFlops @ 2008 => Sustained Petaflop @ 2010 Sustain leadership in Japan
10PF
Japanese “Keisoku” >10PF(2011-12)
Titech Supercomputing Campus Grid (incl TSUBAME )~90TF (2006)
US Petascales (peak) (2007~8); US HPCS (2010); BlueGene/L 360 TF (2005); TSUBAME upgrade >300 TF (2008-2H), quad-core Opteron + acceleration; US 10P (2011~12?); 1.3 TF
KEK 59 TF BG/L + SR11000
Titech Campus Grid
U-Tokyo, Kyoto-U, Tsukuba 100-150TF (2008) Others
TSUBAME 110 TF, Storage 1.6 PB, 128GB nodes(2007)
65
2008Q1 TACC/Sun “Ranger”: ~52,600 “Barcelona” Opteron CPU cores, ~500 TFlops, ~100 racks, ~300 m2 floorspace, 2.4 MW power, 1.4 km IB cx4 copper cabling, 2 Petabytes HDD
2008 LLNL/IBM “BlueGene/P”: ~300,000 PPC cores, ~1 PFlops, ~72 racks, ~400 m2 floorspace, ~3 MW power, copper cabling
Other Petaflops 2008/2009 …
66
Year                                2003      2006     2008     2010     2012     2014     2015
Microns                             0.09      0.065    0.045    0.032    0.022    0.016    0.011
Scalar cores                        1         2        4        8        16       32       64
GFLOPS/socket                       6         24       48       96       192      384      768
Total kW for 1 PF (200 W/socket)    3.3E+05   83333    41667    20833    10417    5208     2604
SIMD/Vector GFLOPS/board: 192, 384, 768, 1536, 3072
Total kW for 1 PF (25 W/board): 130.2, 65.1, 32.6, 16.3, 8.14
2009: conservatively assuming 0.065-0.045 microns, 4 cores, 48 GFlops/socket => 200 Teraflops, 800-Teraflop accelerator board. A “commodity” Petaflop is easily achievable in 2009-2010.
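As a check of the accelerator rows above (1 PF of peak from 192 GFLOPS boards at 25 W each):
\[
\frac{10^6\,\mathrm{GFLOPS}}{192\,\mathrm{GFLOPS/board}} \approx 5208\ \text{boards},\qquad 5208 \times 25\,\mathrm{W} \approx 130\,\mathrm{kW}
\]
matching the 130.2 kW table entry.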
67
68
69
Material-physics-like (infinite systems): fluid dynamics, statistical physics, condensed matter theory, …
Molecular science: quantum chemistry, molecular orbital methods, molecular dynamics, …
E.g. fragmented MO could use 100,000 loosely-coupled CPUs in pseudo parameter-sweep fashion
E.g. advanced MD needs tightly-coupled SMP (#CPUs not the limit, but memory and BW)
Old HPC environment: ・ decoupled resources, ・ hard to use, ・ special software, ...
70
[Chart: CPU time share from Apr. 2006 to Jan. 2007 (ISV apps only): ABAQUS, AMBER, AVS_Express, Discovery Studio, EnSight, Gaussian, GaussView, GROMACS, Materials Explorer, Materials Studio, Mathematica, MATLAB, Molpro, MOPAC, MSC_NASTRAN, MSC_PATRAN, NWChem, PGI_CDK]
71
– Feasible NOW to build a useful 10PF machine
Folding@Home, 2007-03-25 18:18:07:
OS Type      TFLOPS    Active CPUs    GFLOPS/CPU
Total        739       229,926        3.21
PS3          482       30,294
GPGPU        44        749
Linux        43        25,389         1.69
Mac/Intel    9         3,028          2.97
Mac/PPC      7         8,880          0.79
Windows      154       161,586        0.95
72
Classic Design Point New design points in a single machine
tightly-coupled “Grid”
73
Machine                    CPU Cores   Watts       Peak GFLOPS   Peak MFLOPS/Watt   Watts/CPU Core   Ratio c.f. TSUBAME
TSUBAME (Opteron)          10,480      800,000     50,400        63.00              76.34
TSUBAME (w/ClearSpeed)     11,200      810,000     85,000        104.94             72.32            1.00
Earth Simulator            5,120       6,000,000   40,000        6.67               1171.88          0.06
ASCI Purple (LLNL)         12,240      6,000,000   77,824        12.97              490.20           0.12
AIST Supercluster          3,188       522,240     14,400        27.57              163.81           0.26
LLNL BG/L (rack)           2,048       25,000      5,734.4       229.38             12.21            2.19
Next Gen BG/P (rack)       4,096       30,000      16,384        546.13             7.32             5.20
TSUBAME 2.0 (2010Q3/4)     160,000     810,000     2,048,000     2528.40            5.06             24.09
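The efficiency column is simply peak over power; for example, for TSUBAME with ClearSpeed:
\[
\frac{85{,}000{,}000\,\mathrm{MFLOPS}}{810{,}000\,\mathrm{W}} \approx 105\,\mathrm{MFLOPS/W}
\]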
TSUBAME 2.0 x24 improvement in 4.5 years…? ~ x1000 over 10 years
74
MRAM, PRAM, Flash, etc.
Ultra Multi-Core: slow & parallel (& ULP)
ULP-HPC SIMD-Vector (GPGPU, etc.); VM job migration power optimization; ULP-HPC networks; new massive & dense cooling technologies; zero-emission power sources
Application-level low-power algorithms
75
76
[Diagram: number of users vs. capacity, ~1000 users, from ~100 TF down to ~1 GF]
77
2006 A.D.: Titech Supercomputing Grid, #1 in Asia: 100 TeraFlops, >10,000 CPUs, 1.5 MegaWatts, 300 m2. 2016: deskside workstation, >100 TeraFlops, 1.5 KiloWatts, 300 cm2.
1000 times scaling down
but how?
No more aggressive clock increases; multi-core works but gives less than x100
Need R&D as “Petascale Informatics” in CS and applications to achieve the x1000 breakthrough + What can a scientist or an engineer achieve with daily, personal use of petascale simulation?
78
(From the proposal document, p. 3)