The Godson-3 Multi-Core Processor and its Application in High Performance Computers (PowerPoint PPT Presentation)


SLIDE 1

The Godson-3 Multi-Core Processor and its Application in High Performance Computers

Weiwu Hu, Institute of Computing Technology, CAS; Elio Guidetti, STMicroelectronics

SLIDE 2

Contents

 A brief introduction to ICT
 A brief introduction to Godson processors
 The Godson-3 multi-core processor
 PetaFLOPS and TeraFLOPS

Godson is the academic name of Loongson™.

SLIDE 3

ICT: History and Contributions

 Founded in 1956, the first organization in China for computer science and technology research
 All of China's first generation of computer researchers were trained at ICT
 Built many computers for national construction before 1980
 Spun off big companies such as Lenovo and Dawning after 1980
 Spun off other CAS institutes, such as the Institute of Software and the Institute of Microelectronics

SLIDE 4

ICT Is a Networked Institute

(Map: headquarters and regional branches.)

SLIDE 5

ICT Organization

 R&D Divisions and Centers

◆Computer Systems (HPC, CPU, etc) ◆Network and Pervasive Computing ◆Intelligent Information Processing ◆Advanced Studies

 8 regional branches

◆In cooperation with local governments to promote local markets

 Human Resource

◆1,500 people at headquarters, including 1,000 graduate students

SLIDE 6

Three Main Tasks of ICT

 Solving the Nation’s Big Problems
 Research and Development in computing
 Graduate Education for the nation and the world

SLIDE 7

Solving the Nation’s Big Problems

 Improve innovation capability
 Tackle pressing national challenges

◆Energy, healthcare, environment, education

 Benefit the masses (1.3 billion people)

SLIDE 8

China Computer Market Trends

Year  GDP ($Trillion)  Computer Market ($Billion)  Internet Users (Million)  Client Devices (Million)
1995  0.69             7.4                         -                         -
2000  1.08             25.9                        22.5                      8.9
2005  2.30             59.0                        111                       49.5
2010  3.00             115.6                       233                       106
2015  4.75             217.3                       411                       191
2020  7.07             403.9                       662                       308

Expert from the State e-Nation Office: China cannot copy the US route (cost > $10 trillion, time > 30 years).

Region         PC ($B)  Server ($B)  Storage ($B)  PC : Server : Storage
China          11       1.8          0.7           1 : 0.16 : 0.06
Korea          2        1.2          1             1 : 0.60 : 0.50
Japan          10       8            9             1 : 0.80 : 0.90
North America  66       19           20            1 : 0.29 : 0.30
World          175      45           46            1 : 0.26 : 0.26

China’s IT market is still very shallow. (Source: IDC 2004)

SLIDE 9

Computer Milestones

World milestones:
1941  1 flop/s
1945  100 flop/s
1949  1 Kflop/s
1951  10 Kflop/s
1961  100 Kflop/s
1964  1 Mflop/s
1968  10 Mflop/s
1975  100 Mflop/s
1987  1 Gflop/s
1992  10 Gflop/s
1993  100 Gflop/s
1997  1 Tflop/s
2000  10 Tflop/s
2004  100 Tflop/s
2008  1 Pflop/s
2013  10 Pflop/s
2016  100 Pflop/s
2020  1 Eflop/s
(1941-2000 data borrowed from Jack Dongarra, 2004)

ICT computers & gaps behind the world (years):
1958  Model 103     13
1959  Model 104     8
1967  Model 109B    6
1976  Model 013     12
1983  Model 757     15
1995  Dawning2000   6
2000  Dawning2000B  7
2003  Dawning4000L  6
2004  Dawning4000A  4
2008  Dawning5000   3
2010  Dawning5000A  2

High-performance computer brand at ICT: Dawning

SLIDE 10

Computers designed by ICT

 Model 103, 1958/8: the first computer in China
 Model 104; Model 109C: the first large-scale transistor computer in China
 Model 757, 1983/11: vector computer
 Model KJ8920, 1991/11: mainframe computer
 Dawning computers

SLIDE 11

Evolution of HPC in China: Dawning HPCs

 Dawning1000, 1995, Intel i860: first MPP computer in China, 2.5Gflops peak
 Dawning2000, 1999, Motorola PowerPC: first SMP cluster in China, 100Gflops peak
 Dawning3000, 2001, IBM Power3: SUMA cluster, 400Gflops peak
 Dawning4000-A, 2004, AMD Opteron: grid-enabling cluster, 11.2Tflops peak
 Dawning5000-A, 2008, 6400 4-core AMD Opterons: 220Tflops peak

SLIDE 12

Dawning5000A Configuration

 CPU: 6400 4-core AMD Opterons
 Blade: 1600 4-CPU SMP
 Node: 160 10-blade
 Cabinet: 40 4-node
 Interconnection: 10x12x24 DDR InfiniBand
 System: 200TFlops, 100TB memory, 20Gbps
 Storage: 500TB, 50GB/s
 Power: 800KW
 Cooling: air cooling in box + water cooling in cabinet

SLIDE 13

SLIDE 14

Evolution of Dawning HPC Systems

(Chart: Gflop/s, memory GB, storage GB, CPUs, and Linpack from 1993 to 2004, log scale 0.01 to 100000.)

Annual sales: tens → hundreds
1995/6: Top1 = 170, Top500 entry = 1.96, Dawning = 1.2
2004/6: Top1 = 35860, Top500 entry = 624, Dawning = 8061

SLIDE 15

Evolution of HPC in China: What Next?

 Dawning1000, 1995, Intel i860: first MPP computer in China, 2.5Gflops peak
 Dawning2000, 1999, Motorola PowerPC: first SMP cluster in China, 100Gflops peak
 Dawning3000, 2001, IBM Power3: SUMA cluster, 400Gflops peak
 Dawning4000-A, 2004, AMD Opteron: grid-enabling cluster, 11.2Tflops peak
 Dawning5000-A, 2008, 6400 4-core AMD Opterons: 220Tflops peak

PetaFLOPS Godson-3 2010

???

SLIDE 16

2001~2006 Graduate Student Enrolment

Year  PhD  Master  Total
2001  186  188     374
2002  258  270     528
2003  327  374     701
2004  379  433     812
2005  444  454     898
2006  533  443     976

(博士 = PhD, 硕士 = Master, 总数 = Total)

SLIDE 17

2000~2007 New Admissions

Year  PhD  Master
2001  86   93
2002  99   124
2003  143  140
2004  116  168
2005  109  183
2006  99   194
2007  108  178

SLIDE 18

Academic and Professional Activities

 International Journal Editorial Boards (>10)

◆IEEE Transactions on Computers ◆Parallel Computing ◆Journal of Systems and Software ◆Information and Management ◆……

 IEEE Computer Society Beijing Center  Journal of Computer Science & Technology

◆published by Springer

 International Conferences

SLIDE 19

Contents

 A brief introduction to ICT
 A brief introduction to Godson processors
 The Godson-3 multi-core processor
 PetaFLOPS and TeraFLOPS

SLIDE 20

National Project

 High-performance CPUs are a national strategic product

◆ The Chinese IT industry is big but not strong: 5.6 trillion RMB in 2007, only 22% from domestic companies, 3.75% profit margin

 Godson CPU is supported by

◆ National 863 project ◆ National 973 project ◆ National Science Foundation of China ◆ National key project ◆ Key project of Chinese Academy of Sciences

SLIDE 21

Godson CPU Briefs

 ICT started Godson CPU design in 2001.
 The 32-bit Godson-1 (2002) was the first general-purpose CPU in China.
 The 64-bit Godson-2B followed in 2003.10.
 The 64-bit Godson-2C in 2004.12.
 The 64-bit Godson-2E in 2006.03.
 Each generation triples the performance of its predecessor.

SLIDE 22

Godson Development

(Chart: SPEC CPU2000 rate, 1999-2006, log scale from 10 to 10000; the Godson rate triples (x3) with each generation, closing on the Intel/AMD/HP/IBM/SGI/Sparc curve.)

SLIDE 23

Godson-2E SPEC CPU2000

SPECint2000 = 503
Program      Ref time  Run time  Ratio
164.gzip     1400      403       347
175.vpr      1400      273       512
176.gcc      1100      221       497
181.mcf      1800      307       586
186.crafty   1000      167       598
197.parser   1800      472       382
252.eon      1300      188       690
253.perlbmk  1800      354       508
254.gap      1100      240       458
255.vortex   1900      263       722
256.bzip2    1500      365       411
300.twolf    3000      645       465

SPECfp2000 = 503
Program       Ref time  Run time  Ratio
168.wupwise   1600      238       672
171.swim      3100      660       469
172.mgrid     1800      579       311
173.applu     2100      549       382
177.mesa      1400      221       634
178.galgel    2900      412       704
179.art       2600      416       624
183.equake    1300      208       624
187.facerec   1900      300       632
188.ammp      2200      432       509
189.lucas     2000      396       506
191.fma3d     2100      531       395
200.sixtrack  1100      345       319
301.apsi      2600      528       493
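SPEC ratios follow a fixed convention (ratio = 100 × reference time / measured run time), and the suite score is the geometric mean of the per-benchmark ratios. A quick sketch recomputing the integer table above confirms that the headline 503 is just the geometric mean of the twelve ratios (the FP table's ratios average out near 503 as well):

```python
import math

# (reference time, measured run time) in seconds, from the SPECint2000 table above
int2000 = {
    "164.gzip": (1400, 403), "175.vpr": (1400, 273), "176.gcc": (1100, 221),
    "181.mcf": (1800, 307), "186.crafty": (1000, 167), "197.parser": (1800, 472),
    "252.eon": (1300, 188), "253.perlbmk": (1800, 354), "254.gap": (1100, 240),
    "255.vortex": (1900, 263), "256.bzip2": (1500, 365), "300.twolf": (3000, 645),
}

def spec_ratio(ref, run):
    return 100.0 * ref / run          # SPEC ratio convention

def spec_score(table):
    ratios = [spec_ratio(ref, run) for ref, run in table.values()]
    return math.exp(sum(map(math.log, ratios)) / len(ratios))  # geometric mean

print(round(spec_ratio(3000, 645)))   # → 465, matching the 300.twolf row
print(round(spec_score(int2000)))     # → 503, matching the headline SPECint2000
```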

SLIDE 24

Godson-2E and Godson-2F

Godson-2E:

 1.0GHz@90nm CMOS, 5-7W
 47M transistors, area 36mm^2
 Godson-2 CPU core

◆ 64-bit MIPS III compatible
◆ Four-issue, OOO
◆ 64KB+64KB L1 (four-way)
◆ 512KB L2 (four-way)

 On-chip DDR controller
 SysAD front-end bus

Godson-2F:

 1.0GHz@90nm CMOS, 3-5W
 51M transistors, area 43mm^2
 Same Godson-2 CPU core
 On-chip DDR2 controller
 PCI/PCI-X, local IO, GPIO, etc.
 Volume production

SLIDE 25

Some Applications

 With its high-performance features, the Loongson-2 CPU is welcomed by many customers

◆Low-cost PC & notebook ◆Network applications ◆Low-end servers & HPC ◆High-end embedded applications.

 Orders of a million units

SLIDE 26

Godson-2 Architecture Features

 64-bit out-of-order execution pipeline

◆ 9-stage pipeline, four-issue
◆ Dynamic scheduling: grouped reservation stations (16 fix + 16 float), 64-entry ROB
◆ Register renaming: 64-entry physical register file
◆ Branch prediction: gshare, BTB, RAS, 8-entry branch queue
◆ Five function units: two fixed-point, two floating-point (SSE2-like media), one memory

 Memory Hierarchy

◆ 64KB instruction cache and 64KB data cache, 4-way set-associative
◆ TLB: 64-entry fully associative, two 4KB-4MB pages per entry; separate 16-entry ITLB
◆ 24 non-blocking accesses & on-the-fly memory disambiguation
◆ Load speculation: returns values of previous pending stores
◆ 512KB-1MB L2 cache
◆ On-chip memory controller

 World-class CPU core

◆ In-Stat report: "The sophistication of the Godson-2 shows that the Chinese are poised to produce microprocessors as powerful as any in the world."
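The front end above names a gshare direction predictor alongside the BTB and RAS. Gshare XORs the branch PC with a global history register to index a table of 2-bit saturating counters; a toy model of the idea, with the table size, history length, and branch address all arbitrary choices for illustration (not Godson-2's actual parameters):

```python
class Gshare:
    """Toy gshare predictor: (PC XOR global history) indexes 2-bit counters."""
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.history = 0
        self.counters = [1] * (1 << index_bits)  # init weakly not-taken

    def predict(self, pc):
        return self.counters[(pc ^ self.history) & self.mask] >= 2  # True = taken

    def update(self, pc, taken):
        i = (pc ^ self.history) & self.mask
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask

# A loop branch taken 7 times then not taken once trains the predictor quickly:
bp = Gshare()
hits = 0
for trial in range(100):
    for i in range(8):
        taken = i < 7
        hits += bp.predict(0x400120) == taken
        bp.update(0x400120, taken)
print(hits)  # accuracy climbs once the history disambiguates the loop exit
```

After a few warm-up passes the 12-bit history uniquely identifies each position in the 8-iteration loop, so even the final not-taken branch is predicted correctly.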

SLIDE 27

Godson Roadmap: the Big Version

(Roadmap diagram, products over time: from early computers, walkman/CD players, and digital watches to low-cost PCs, high-end PCs, PC-TVs, and set-top boxes; network, grid, and transaction servers; digital TV/STB; and mobile/consumer devices such as PDAs, mobile media ends, MP3 players, digital and video cameras, game consoles, phones, and cell phones. Two directions: "High" = performance, "Wide" = 3C convergence, tied together by grid technology.)

SLIDE 28

The Parallel Computing Landscape: A View from Berkeley 2.0

Krste Asanovic, Ras Bodik, Jim Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Edward Lee, Nelson Morgan, George Necula, Dave Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Kathy Yelick

October 2007

A View from Berkeley 2.0

SLIDE 29

Re-inventing Client/Server

 Laptop/handheld as future client, datacenter as future server

 "The Datacenter is the Computer": building-sized computers at Google, MS, ...
 "The Laptop/Handheld is the Computer"
 2007: HP shipped more laptops than desktops
 1B cell phones/yr, increasing in function
 Otellini demoed a "Universal Communicator": a combination cell phone, PC, and video device
 Apple iPhone

SLIDE 30

Laptop/Handheld Reality

 Used with large displays at home or office
 What % of time disconnected? 10%? 30%? 50%?

◆Disconnectedness due to economics

Cell towers and support systems are expensive to maintain ⇒ operators charge for use to recover costs ⇒ communicating costs money. Policy varies, but most countries allow wireless investment where it makes the most money ⇒ cities are well covered ⇒ rural areas will never be covered.

◆Disconnectedness due to technology

No coverage deep inside buildings; satellite communication drains batteries.

⇒ Need computation & storage in the laptop/handheld

SLIDE 31

Low end roadmap: From CPU to SOC

(Diagram: integration stages from a discrete CPU with north bridge, south bridge, and graphics; to CPU+NB with separate SB and graphics; to CPU+NB plus GPU+SB; to a single CPU+GPU+NB+SB chip.)

2E (2006): PCI; 2F (2007): PCI; 2F and 2G (2008): PCI/HT/PCIE; 2H (2009): fully integrated SOC.

SLIDE 32

2G vs. 2F

 65nm vs. 90nm
 Slightly higher frequency: ~5%
 Similar die size: 40-50 mm^2
 Half the power consumption thanks to 65nm: 1-3W
 Improved CPU core

◆MIPS64 compatible
◆X86 binary translation support
◆ECC for reliability, EJTAG for debug
◆128-bit memory access
◆1MB L2 vs. 512KB

 Much more flexible IOs

◆HT, PCI/PCI-X, LPC, SPI, UART, GPIO

SLIDE 33

South Bridge of 2F and 2G

 For both 2F and 2G: PCI/PCI-X interface
 Tolerates the current flaws of the 2F PCI:

◆Out-of-order instruction fetch
◆Low efficiency with the SM502

 Only for low-cost PCs, kept as simple as possible:

◆Simple GPU + 16/32-bit DDR2/3 for separate video RAM
◆USB, MAC, SATA, etc.
◆2F+SB or 2G+SB: ~$25

 If the user needs a high-end SB, it can be connected via HT, PCIE, or PCI/PCI-X
 A low-end GS232 core is integrated so the part can serve as a standalone SOC for some applications

SLIDE 34

Loongson-2H: Single-chip Heterogeneous Multi-core

 Closing the gap between desktop performance and handheld low power consumption

 A single chip for mobile, desktop, and set-top box

◆ SP2: Scaling Performance by Scaling Power

 Heterogeneous multi-core with three operating modes:

◆ L-mode (0.1W)
◆ M-mode (1.0W)
◆ H-mode (5-10W)

SLIDE 35

Contents

 A brief introduction to ICT
 A brief introduction to Godson processors
 The Godson-3 multi-core processor
 PetaFLOPS and TeraFLOPS

SLIDE 36

Godson-3 Briefs

 Distributed scalable architecture
 Reconfigurable architecture
 X86 binary translation speedup
 Low power consumption
 >1.0GHz @ 65nm

SLIDE 37

Scalable Architecture

 Scalable interconnection network

◆ Crossbar + Mesh ◆ Single crossbar connects cores, L2s, and four directions

 Directory-based cache coherence protocol

◆ Distributed L2 caches are globally addressed
◆ Each cache block has a directory entry
◆ Both data cache and instruction cache contents are recorded in the directory

(Diagram: an 8x8 crossbar connects cores P0-P3, four L2 banks, and the E/S/W/N mesh ports.)
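The directory scheme described above can be pictured as one small record per cache block: a state plus a bitmap of the cores holding a copy, so invalidations go only to actual sharers instead of being broadcast. A simplified MSI-style model (states and method names are illustrative, not Godson-3's actual protocol tables):

```python
class DirectoryEntry:
    """Per-block directory: state + bitmap of sharing cores (simplified MSI)."""
    def __init__(self, num_cores=4):
        self.state = "I"            # I(nvalid), S(hared), M(odified)
        self.sharers = [False] * num_cores

    def read(self, core):
        # Read miss: downgrade any writer, then record the new sharer.
        downgraded = []
        if self.state == "M":
            downgraded = [c for c, s in enumerate(self.sharers) if s]
            self.sharers = [False] * len(self.sharers)
        self.sharers[core] = True
        self.state = "S"
        return downgraded           # cores that must write back / downgrade

    def write(self, core):
        # Write miss: invalidate every other copy, take exclusive ownership.
        invalidated = [c for c, s in enumerate(self.sharers) if s and c != core]
        self.sharers = [False] * len(self.sharers)
        self.sharers[core] = True
        self.state = "M"
        return invalidated

entry = DirectoryEntry()
entry.read(0); entry.read(2)       # cores 0 and 2 now share the block
print(entry.write(1))              # → [0, 2]: only the real sharers are invalidated
print(entry.state)                 # → M
```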

SLIDE 38

Reconfigurable Architecture

(Diagram: a 6*6 AXI switch connects P0-P3 and S0-S3; a 5*4 AXI switch connects memory controllers MC0/MC1, DMA controllers, HT/PCIE, and PCI.)

 Shared L2 can be configured as internal RAM; DMA can transfer to internal RAM directly
 General-purpose core: 64-bit, 4-issue, OOO, AXI interface
 Multiple-purpose core: LINPACK, biology, signal processing; AXI interface
 DMA engine supports prefetch and matrix transfers
 8 configurable address windows per master port allow page migration across L2 and memory
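The configurable address windows mentioned above can be thought of as base/mask/target triples on each master port: an address matching (addr & mask) == base is redirected into a target region, which is what lets pages migrate between L2-as-RAM and memory. A minimal sketch of that mechanism (the field layout and example addresses are illustrative, not the actual Godson-3 register format):

```python
class AddressWindow:
    """One base/mask/target window: hit if (addr & mask) == base."""
    def __init__(self, base, mask, target):
        self.base, self.mask, self.target = base, mask, target

    def translate(self, addr):
        if (addr & self.mask) == self.base:
            return self.target | (addr & ~self.mask)   # keep the offset bits
        return None

# Example: remap a 1MB page at 0x0010_0000 into an L2 bank at 0x9000_0000.
windows = [AddressWindow(0x00100000, 0xFFF00000, 0x90000000)]

def route(addr):
    for w in windows:                 # first matching window wins
        t = w.translate(addr)
        if t is not None:
            return t
    return addr                       # default: pass through to memory

print(hex(route(0x00100ABC)))         # → 0x90000abc (redirected into L2)
print(hex(route(0x00200ABC)))         # → 0x200abc (no window hit, pass-through)
```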

SLIDE 39

Generality vs. Efficiency

(Chart: power/area efficiency vs. generality; multicore, manycore, and stream processors sit at roughly 5%, 20%-30%, and 50~60% efficiency as generality decreases.)

Godson-3: a general-purpose core plus multiple-purpose cores for wide applications.

SLIDE 40

GS464 general purpose core

 MIPS64, 200+ extra instructions for X86 binary translation and media acceleration
 Four-issue superscalar OOO pipeline
 Two fixed-point, two FP, one memory unit
 Each FP unit supports fully pipelined double/paired-single MAC operations
 48-bit VA and PA, 128-bit memory access
 64KB icache and 64KB dcache, 4-way
 64-entry fully associative TLB, 16-entry ITLB, variable page size
 Non-blocking accesses, load speculation
 Directory-based cache coherence for CMP
 Parity check for icache, ECC for dcache
 EJTAG for debugging
 Standard 128-bit AXI interface

GS464 Architecture

(Block diagram: fetch stage with PC, BTB, BHT, ITLB, ICache, predecoder, decoder, and register mapper; fixed-point, floating-point, and CP0 issue queues; reorder queue (ROQ) and branch queue (BRQ); integer and floating-point register files (1024*64 4W4R arrays) feeding ALU1/ALU2, FPU1/FPU2, and the AGU; DTLB, DCache, tag compare, miss queue, write-back queue, and uncache queue; micro-controller with micro-code store; 128-bit AXI interface with DMA controller and crossbar; EJTAG TAP controller and test controller with JTAG and test interfaces.)

GStera multiple purpose core

 Targets LINPACK, biology computation, signal processing, etc.
 8-16 MACs per node
 Large multi-port register file
 Reconfigurable based on the application
 Standard 128-bit AXI interface

SLIDE 41

Matrix Transposing Performance

 15+ times faster

(Chart: matrix transpose time for 256x256, 512x512, and 1024x1024 matrices, DSP vs. CPU; y-axis from 50,000,000 to 250,000,000.)
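The matrix mode of the DMA engine sidesteps the strided accesses that make a naive transpose slow on a cached CPU. The software counterpart on a general-purpose core is a cache-blocked (tiled) transpose; a sketch, with the tile size an arbitrary choice for the example:

```python
def transpose_tiled(a, tile=32):
    """Cache-blocked transpose of a square matrix given as a list of lists."""
    n = len(a)
    out = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            # Each tile fits in cache, so both a and out are accessed locally.
            for i in range(ii, min(ii + tile, n)):
                for j in range(jj, min(jj + tile, n)):
                    out[j][i] = a[i][j]
    return out

a = [[i * 4 + j for j in range(4)] for i in range(4)]
t = transpose_tiled(a, tile=2)
print(t[0])   # → [0, 4, 8, 12]: the first column of a becomes the first row
```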

SLIDE 42

Hardware Support for X86 Binary Translation

 Defined new instructions

◆ X86 ISA semantics in MIPS ISA format
◆ Support for the binary translation mechanism
◆ >200 instructions defined, at ~5% additional silicon cost

 Speeds up X86-to-MIPS binary translation by ~10 times

(Bar chart, heavily garbled in extraction: cycle counts on IDCT, FFT(FX), FFT(FP), GP, and EFLAG kernels, comparing no hardware support, partial support, both, and ideal.)

Figure 2. The architecture of Godson-2 virtual machine

(Diagram: enhanced MIPS decode and enhanced Godson internal operations support Linux on MIPS; a process-level X86 VM runs Linux apps built for X86, and a system-level X86 VM runs MS Windows, alongside native MIPS Linux apps.)
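To see why flag handling dominates software x86-to-MIPS translation, consider a toy translator that lowers two-operand x86-style ops into three-operand MIPS-style ops and must materialize the zero flag in a spare register after every arithmetic op. All mnemonics and the register name $zf here are invented for illustration; the hardware support described above exists precisely to collapse this kind of flag sequence:

```python
# Toy x86-like input: (op, dst, src); dst is also a source, as in x86.
x86_block = [("mov", "eax", 7), ("add", "eax", -7), ("jz", "label_done", None)]

def translate(block):
    """Lower two-operand x86-like ops to three-operand MIPS-like ops,
    materializing the zero flag in a register ($zf) in software."""
    mips = []
    for op, dst, src in block:
        if op == "mov":
            mips.append(("li", dst, src))
        elif op == "add":
            mips.append(("addiu", dst, dst, src))
            mips.append(("sltiu", "$zf", dst, 1))   # $zf = 1 iff result == 0
        elif op == "jz":
            mips.append(("bnez", "$zf", dst))       # branch if zero flag set
    return mips

def run(mips):
    """Tiny evaluator for the MIPS-like ops, to check the flag logic."""
    regs = {"$zf": 0}
    taken = None
    for ins in mips:
        if ins[0] == "li":
            regs[ins[1]] = ins[2]
        elif ins[0] == "addiu":
            regs[ins[1]] = regs[ins[2]] + ins[3]
        elif ins[0] == "sltiu":
            regs[ins[1]] = 1 if (regs[ins[2]] & 0xFFFFFFFF) < ins[3] else 0
        elif ins[0] == "bnez":
            taken = ins[2] if regs[ins[1]] != 0 else None
    return regs, taken

regs, taken = run(translate(x86_block))
print(regs["eax"], taken)   # → 0 label_done: eax hit zero, so jz is taken
```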

SLIDE 43

Preliminary Virtual Machine Performance

 QEMU@LS2F 800MHz vs. PIII 850MHz

◆ Average efficiency 18% (except mcf and art) ◆ INT 15%, FP 21%

 QEMU@LS2F 800MHz vs. LS2F 800MHz

◆ Average efficiency 14% ◆ INT 17%, FP 11%

 QEMU @LS3 vs. LS2F

◆ Average efficiency 27% ◆ INT 34%, FP 19%

SLIDE 44

Low Power Design

 Low leakage process is selected

◆LP/GP mixed ◆HVT/SVT mixed

 Manual clock gating regarding architecture

◆Much more efficient at reducing power consumption than P&R tools

 Power management

◆Module level (CPU core, HT, PCIE, DDR2) clock gating ◆Frequency Scaling ◆Temperature Sensor
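Clock gating and frequency scaling both attack first-order dynamic power, P ≈ α·C·V²·f (activity factor, switched capacitance, supply voltage, frequency). A back-of-the-envelope model with made-up coefficients shows why gating idle modules and lowering V alongside f pay off:

```python
def dynamic_power(alpha, c_eff, vdd, freq):
    """First-order dynamic power: activity * switched capacitance * V^2 * f."""
    return alpha * c_eff * vdd ** 2 * freq

# Illustrative numbers only: 1nF effective switched capacitance at 1.2V, 1GHz.
base = dynamic_power(alpha=0.2, c_eff=1e-9, vdd=1.2, freq=1e9)

gated = dynamic_power(alpha=0.02, c_eff=1e-9, vdd=1.2, freq=1e9)    # 90% of clocks gated off
scaled = dynamic_power(alpha=0.2, c_eff=1e-9, vdd=1.0, freq=0.5e9)  # f halved, V lowered

print(round(base, 3))           # → 0.288 (watts)
print(round(gated / base, 2))   # → 0.1: gating scales directly with activity
print(round(scaled / base, 2))  # → 0.35: the f and V^2 reductions compound
```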

SLIDE 45

Physical implementation

 65nm CMOS LP/GP technology  Cell-based design methodology

◆DC + ICC ◆Manual P&R for critical cells

 2008: 4-core (4GP + 0MP) + 4MB L2

◆GP: general-purpose core
◆MP: multiple-purpose core
◆10W @ 1GHz

 2009: 8-core (4GP + 4MP) + 4MB L2

◆ 20w@1GHz

SLIDE 46

4-core and 8-core

(Diagrams: the 8-core part pairs a 6*6 X1 switch (P0-P3, S0-S3) with a 5*4 X2 switch serving two memory controllers (MC0, MC1), two DMA controllers, HT/PCIE, and PCI; the 4-core variants use the same two-switch structure with a single memory controller, one DMA controller, and HT/PCIE or PCI/LPC IO.)

SLIDE 47

Full-Custom Register File and CAM

 Physical register file: size 321um x 262um, power 50mW@1GHz, delay 470ps
 TLB CAM: size 224um x 235um, power 55mW@1GHz, delay 550ps

SLIDE 48

HyperTransport PHY

 HT1.0 driver & receiver: flip-chip compatible, 2-row design, 800mW @ 1.6Gbps
 Size: 250um x 300um; power: < 10mW; frequency: 1.6GHz; jitter: 20ps

SLIDE 49

Test Chip for Custom Blocks

 ST 65nm, 1206um x 1206um
 Functions: CAM1W1R (BIST, scan), RAM4W4R (BIST, scan), ICT_PLL (frequency test), HT1.0 (BIST, error-rate test)

SLIDE 50

Cell-based high-performance physical design

 Full hierarchical design methodology
 Manual placement & routing for critical paths
 Manual placement of all FFs and clock buffers; manual clock gating
 Architecture optimization with physical feedback

(Layout annotation: ictreg0/1, Mul, Div, regfile0, fxqueue, cam0, alu1/alu2, and associated issue/result/forward/map buses.)
SLIDE 51

Clock Tree

 H-tree + mesh
 Manual placement of FFs
 Manual clock gating

SLIDE 52

Layout of 4-core Godson-3

(Layout: four GS464 cores, four L2 banks, two crossbars, DDR2/3 controllers, and HT and PCIE interfaces.)

SLIDE 53

Contents

 A brief introduction to ICT
 A brief introduction to Godson processors
 The Godson-3 multi-core processor
 PetaFLOPS and TeraFLOPS

SLIDE 54

PetaFLOPS and TeraFLOPS

 PetaFLOPS for national HPC

◆Build a PetaFLOPS HPC with Godson-3 in 2010

 TeraFLOPS for personal HPC

◆Putting the desktop into pockets
◆Putting TeraFLOPS onto the desktop: computing for the masses

SLIDE 55

PetaFLOPS

 National 863 project

◆Based on Godson-3 CPU

 Hyper parallel processing

◆ Good scalability
◆ Good commodity
◆ Low cost
◆ Low power consumption: < 1MW

SLIDE 56

Dawning5000C Configuration

 8 cores (4GP + 4MP) per CPU
 4 chips per PE
 8 PEs per node
 400 nodes per system

SLIDE 57

Dawning5000C Configuration

 CPU: 12800 8-core Loongson-3, 80GF each
 Blade: 3200 4-CPU PEs
 Node: 400 8-blade (+ 1 OS module)
 Cabinet: 60 7-node
 Interconnection: 16x18x36 InfiniBand, global synchronization
 System: 1PF, 200TB memory, 40Gbps, 1us barrier
 Power: < 800KW
 Compatibility: X86/Linux applications
 Energy: 1000 MF/W (Linpack/Power)
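The configuration above is internally consistent: 12800 CPUs at 80 GFlops each is just over 1 PFlops, the blade and node counts follow from 4 CPUs per blade and 8 blades per node, and the 1000 MFlops/W target at the Linpack level implies roughly 800 TFlops sustained within the < 800 kW budget. A quick arithmetic check:

```python
cpus = 12800
gflops_per_cpu = 80.0

peak_tflops = cpus * gflops_per_cpu / 1000      # 1024 TFlops ≈ 1 PFlops peak
print(peak_tflops)                               # → 1024.0

blades = cpus // 4                               # 4 CPUs per blade (PE)
nodes = blades // 8                              # 8 blades per node
print(blades, nodes)                             # → 3200 400, matching the slide

# 1000 MFlops/W at the Linpack level under an 800 kW budget:
linpack_tflops_at_budget = 1000e6 * 800e3 / 1e12
print(linpack_tflops_at_budget)                  # → 800.0 TFlops sustained
```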

SLIDE 58

Personal High Performance Computers

 Computers become popular when they are personalized

◆IBM PC-XT

 HPC will be popular when it is personalized

◆Anti-cloud?

 Personal HPC features:

◆High performance: > 1TeraFLOPS
◆Low cost: < $10000/TeraFLOPS
◆Low power: < 1000W, plugs into an office wall outlet
◆Easy to use

SLIDE 59

Scaling down TeraFLOPS

Scaling down from the TeraFLOPS of 1997:

 $100K / 2007: "refrigerator"
 $50K / 2008: "washing machine"
 $10K / 2009: "microwave oven"

SLIDE 60

Refrigerator: TeraFLOPS HPC based on Loongson-2F

(Photo: 2F node, 1U 12P.)

SLIDE 61

Proposals

 Design the multiple-purpose core according to the requirements of CERN
 Integrate it into the 8-core version of Godson-3

SLIDE 62

Thanks