R. (lele) Tripiccione, Dipartimento di Fisica, Universita' di Ferrara - PowerPoint PPT Presentation



SLIDE 1

SLIDE 2

Spin glass simulations on Janus

  • R. (lele) Tripiccione

Dipartimento di Fisica, Universita' di Ferrara

raffaele.tripiccione@unife.it

UCHPC, Rodos (Greece) Aug. 27th, 2012

SLIDE 3

Warning / Disclaimer / Fineprints

I'm an outsider here ---> a physicist's view of an application-specific architecture. A flavor of physics-motivated, performance-paranoid, (hopefully) unconventional computer architecture. However, a few points of contact with main-stream CS may still exist ...

SLIDE 4

On the menu today

  • WHAT?: spin-glass simulations in short
  • WHY?: computational challenges
  • HOW?: the JANUS systems
  • DID IT WORK?: measured and expected performance (and comparison with “conventional” systems)
  • Take-away lessons / Conclusions

SLIDE 5

Our computational problem

Bring a spin-glass (*) system of e.g. 48³ grid points to thermal equilibrium:

  • a challenge never attempted so far --->
  • follow the system for 10¹² – 10¹³ Monte Carlo (*) steps
  • on ~100 independent system instances

Back-of-envelope estimate: 1 high-end CPU for 10,000 years (which is not the same as 10,000 CPUs for 1 year ...)

(*) to be defined in the next slides

SLIDE 6

Statistical mechanics in brief ....

Statistical mechanics tries to describe the macroscopic behaviour of matter in terms of average values of its microscopic structure.

A (hopefully familiar) example: explain why magnets have a transition temperature beyond which they lose their magnetic state.

SLIDE 7

The Ising model .....

The tiny little magnets are named spins; they take just two values. A “configuration” is a specific value assignment for all spins in the system. The “macro”-behavior is dictated by the energy function at the “micro” level. Each spin interacts only with its nearest neighbours in a discrete D-dim mesh:

U({S}) = −J ∑_⟨ij⟩ S_i S_j,   J > 0

Statistical physics bridges the gap from micro to macro ....

SLIDE 8

The spin-glass model .....

Spin-glasses are a generalization of Ising systems. They are the reference theoretical model of glassy behavior

  • Interesting per se
  • A model of complexity
  • Interesting for industrial applications

An apparently trivial change in the energy function makes spin-glasses much more complex than Ising systems. Studying these systems is a computational nightmare ...

SLIDE 9

Why are Spin Glasses so hard??

A very simple change in the energy function (defined on e.g. a discrete 3-D lattice) hides tremendously complex dynamics, due to the extremely irregular energy landscape in the configuration space (frustration):

U = −∑_⟨ij⟩ J_ij σ_i σ_j,   σ ∈ {1, −1},   J_ij ∈ {1, −1}

SLIDE 10

Monte Carlo algorithms

These beasts are best studied numerically by Monte Carlo algorithms. Monte Carlo algorithms navigate in configuration space in such a way that:

  • ---> any configuration will show up according to its probability to be realized in the real world (at a given temperature)

MC algorithms come in several versions … … most versions have remarkably similar requirements in terms of their algorithmic structure.
SLIDE 11

The Metropolis algorithm

An endless loop .....

  • Pick up one (or several) spin(s)
  • Compute the energy U
  • Flip it/them
  • Compute the new energy U'
  • Compute ΔU = U' − U
  • If ΔU ≤ 0 accept the change unconditionally
  • else accept the change only with probability exp(−ΔU/kT)
  • pick up new spin(s) and do it again

SLIDE 12

... just a few C lines

SLIDE 13

Monte Carlo algorithms

Common features:

  • bit-manipulation operations on spins (+ LUT access)
  • (good-quality/long) random numbers
  • a huge degree of available parallelism
  • regular program flow (orderly loops on the grid sites)
  • regular, predictable memory access pattern
  • information exchange (processor <-> memory) is huge
  • however the size of the data-base is tiny

---> many small (not too small) cores, hardwired control, on-chip memory

SLIDE 14

Compute intensive, you mean??

One Monte Carlo step is roughly the (real) time in which a (real) system flips one of its spins, roughly 1 pico-second. If you want to understand what happens in just the first seconds of a real experiment you need O(10¹²) time steps on ~ 100 replicas of a 100³ system ---> 10²⁰ updates. Clever programming on standard CPUs: 1 ns/spin-update ---> 3000 years.

SLIDE 15

Compute intensive, you mean??

The dynamics is dramatically slow (see picture). So even a simulated box whose size is a small multiple of the correlation length will give accurate physics results. Good news: we're in business even if we simulate a very small box .... However ....

SLIDE 16

Strong scaling vs weak scaling

Amdahl's law (strong scaling) vs... … Gustafson's law (weak scaling). In our case … enlarging the system size is meaningless, as we do not yet have the resources to study a “small” system ----> the ultimate quest for strong scaling ....

S_A = ((1−p) + p) / ((1−p) + p/N) = 1 / ((1−p) + p/N)

S_G = ((1−p) + Np) / ((1−p) + p) = (1−p) + Np

SLIDE 17

The JANUS project

An attempt at developing, building and operating an application-driven compute engine for Monte Carlo simulations of spin glass systems. A collaboration of:

  • Universities of Rome (La Sapienza) and Ferrara
  • Universities of Madrid, Zaragoza, Badajoz
  • BIFI (Zaragoza)
  • Eurotech

Partially supported by Microsoft, Xilinx

SLIDE 18

The nature of the available parallelism

Spin-glass simulations have two levels of available parallelism:

1) Embarrassingly trivial: need statistics on several replicas ---> farm it out to independent processors

2) Trivially identified: the sweep order for the Monte Carlo update is not specified ---> can update in parallel any set of non-mutually interacting spins; make it a black-white checkerboard: it opens the way to tens of thousands of independent threads...

1) & 2) do not commute

SLIDE 19

The ideal spin glass machine .....

A further question: what is the appropriate system-scale at which this parallelism is best exploited?

One update engine:

  • computes the local contribution to U
  • addresses a probability table
  • compares with a freshly generated random number
  • assigns the new spin value

U = −∑_⟨ij⟩ J_ij σ_i σ_j

SLIDE 20

The ideal spin glass machine .....

All this is just a bunch (~1000) of gates.

And in spite of that, a typical CPU core, with O(10⁷⁺) gates, can process perhaps 4 spins at each clock cycle. If you can arrange your stock of gates the way it best suits the algorithm, you can easily expect ~1000 update engines on one chip

---> The best structure is a massively-many-core organization (or perhaps an application-driven GPU??)

SLIDE 21

The ideal spin glass machine .....

  • an orderly structure (a 2D grid) of a large number of “update engines”
  • each update engine handles a subset of the physical mesh
  • its architectural structure is extremely simple
  • each data path processes one bit at a time
  • memory addressing is regular and predictable
  • SIMD processing is OK
  • however memory bandwidth requirements are huge (7 bits needed to process one bit..)
  • however memory can be “local to the processor”

Simple hardware structure ---> FPGAs are OK!

SLIDE 22

The JANUS machine

A parallel system of (themselves) massively parallel processor chips. The basic hardware element:

  • a 2-D grid of 4 x 4 (FPGA-based) processors (SPs)
  • data links among nearest neighbours on the grid
  • one control processor on each board (IOP) with 2 Gbit Ethernet links to the host

SLIDE 23

JANUS: a picture gallery

SLIDE 24

Our “large” machine

256 (16 x 16) processors, 8 host PCs --> ~ 90 TIPS for spin-glass simulation. A typical simulation wall-clock time on this nice little machine goes down to a more manageable ~ 100 days.

SLIDE 25

JANUS as a spin-glass engine

The 2008 implementation (Xilinx Virtex4-LX200): 1024 update cores on each processor, pipelineable to one spin update per clock cycle

  • --> 88% of available logic resources

system clock at 62.5 MHz

  • --> 16 ps average spin update time

using a bandwidth of ~ 12000 read bits + 1000 written bits per clock cycle

  • --> 47% of available on-chip memory

SLIDE 26

(Measured) Performances

Let's use “conventional” units, first. The data path of each Processing Element (PE) performs 11 + 2 sustained pipelined ops per clock cycle (62.5 MHz). We have 1024 PEs ----> ~ 830 GIPS. However 11 ops are on very short data words; more honestly, 7 ... 8 sustained “conventional” pipelined ops per clock cycle. We have 1024 PEs ----> ~ 300 GIPS ---> 10 GIPS/W. Sustained by ~ 1 Tbyte/sec combined memory bandwidth.

SLIDE 27

(Measured) Performances

Physicists like a different figure-of-merit ----> the spin-flip rate R, typically measured in psecs per flip.

For each processor in the system:

R = 1 / (N f) = 1 / (1024 × 62.5 MHz) ≃ 16 ps / flip

For one complete element of the JANUS core (16 procs):

R = 1 / (N_p N f) = 1 / (16 × 1024 × 62.5 MHz) ≃ 1 ps / flip

as fast as Nature ...

SLIDE 28

Physics results

SLIDE 29

Performance figures (2008-2009)

Spin-glass addicts like to quote the average spin-update time:

                        SUT        GUT
  Janus module          16 ps      1 ps
  PC (IntelCoreDuo)     3000 ps    700 ps
  IBM CBE (all cores)   65 ps

3x – 7x !!

SLIDE 30

Performance figures (2010-2011)

In the last couple of years, multi/many core processors and GPUs have entered the arena....

Still 1x – 2x !!

SLIDE 31

What next??

4+ years old, Janus still has an edge on state-of-the-art commercial HPC computing architectures. Reasonable to continue along the same line, surfing on technology developments. Expected performance increase?

  FPGA size         2.5 – 3.0 x
  Clock frequency   4.0 x
  SUT parallel      16 x
  Grand total       160 – 200 x

“Log(Grand Total)” ~ 7.5

SLIDE 32

What next?? Janus 2

Exactly the same architecture as JANUS but ....

  • Xilinx Virtex-7 FPGAs (Virtex7-485)
  • 2 DDR-3 memory banks on each SP
  • Improved local 4x4 interconnection
  • Tighter coupling with the HOST (on-box CPU + PCIe gen2)
  • Protos in fall 2012 – Physics in early 2013

----> Simulate a 128³ Ising-spin glass for 2⁴² time steps

SLIDE 33

Looking at the crystal ball....

How long is the (predicted) opportunity window for Janus2?? A graphical answer (and some speculations on Moore's law)
SLIDE 34

Take-away lessons

JANUS is an extremely rewarding example of (strongly application-driven) on-chip multiprocessing:

  • We designed a machine around an “unconventional problem”
  • No wonder the machine turned out to be “unconventional” enough
  • Results were rewarding... WHY????

SLIDE 35

Take-away lessons

Results were rewarding ..... WHY??

  • there is a lot of parallelism available that is actually exploited;
  • load is automatically balanced among the update engines;
  • memory access is heavy, but patterns are predictable;
  • processors (and their memories) are arranged on a regular grid;
  • inter-node traffic is modest and regular.

IN SHORT: our machine tried to exploit all these features at best.