SLIDE 1
SLIDE 2 Spin glass simulations on Janus
Dipartimento di Fisica, Università di Ferrara
raffaele.tripiccione@unife.it
UCHPC, Rodos (Greece) Aug. 27th, 2012
SLIDE 3
I'm an outsider here ---> a physicist's view of an application-specific architecture A flavor of physics-motivated, performance-paranoid, (hopefully) unconventional computer architecture However, a few points of contact with mainstream CS may still exist ...
Warning / Disclaimer / Fine print
SLIDE 4
WHAT?: spin-glass simulations in short
WHY?: computational challenges
HOW?: the JANUS systems
DID IT WORK?: measured and expected performance (and comparison with "conventional" systems)
Take-away lessons / Conclusions
On the menu today
SLIDE 5 Our computational problem
Bring a spin-glass (*) system of e.g. 48³ grid points to thermal equilibrium:
- a challenge never attempted so far --->
- follow the system for 10¹² – 10¹³ Monte Carlo (*) steps
- on ~100 independent system instances
Back-of-envelope estimate: 1 high-end CPU for 10,000 years (which is not the same as 10,000 CPUs for 1 year ...)
(*) to be defined in the next slides
SLIDE 6 Statistical mechanics in brief ....
Statistical mechanics tries to describe the macroscopic behaviour of matter in terms of average values of the microscopic structure.
A (hopefully familiar) example: explain why magnets have a transition temperature beyond which they lose their magnetic state.
SLIDE 7
The Ising model .....
The tiny little magnets are named spins; they take just two values. A "configuration" is a specific value assignment for all spins in the system. The "macro" behavior is dictated by the energy function at the "micro" level: each spin interacts only with its nearest neighbours in a discrete D-dim mesh:

U({S}) = −∑_{⟨ij⟩} J S_i S_j,   J > 0

Statistical physics bridges the gap from micro to macro ....
SLIDE 8
The spin-glass model .....
Spin-glasses are a generalization of Ising systems. They are the reference theoretical model of glassy behavior
Interesting per se A model of complexity Interesting for industrial applications
An apparently trivial change in the energy functions makes spin-glasses much more complex than Ising systems Studying these systems is a computational nightmare ...
SLIDE 9 Why are Spin Glasses so hard??
A very simple change in the energy function (defined on e.g. a discrete 3-D lattice) hides tremendously complex dynamics, due to the extremely irregular energy landscape in the configuration space (frustration):

U = −∑_{⟨ij⟩ ∈ NB} J_ij σ_i σ_j,   σ_i ∈ {1, −1},   J_ij ∈ {1, −1}
SLIDE 10 Monte Carlo algorithms
These beasts are best studied numerically by Monte Carlo algorithms. Monte Carlo algorithms navigate the configuration space in such a way that:
- ---> any configuration will show up according to its probability to be realized in the real world (at a given temperature)
MC algorithms come in several versions ...
... most versions have remarkably similar requirements in terms of their algorithmic structure.
SLIDE 11 The Metropolis algorithm
An endless loop .....
Pick up one (or several) spin(s)
Compute the energy U
Flip it/them
Compute the new energy U'
Compute ΔU = U' − U
If ΔU ≤ 0, accept the change unconditionally
else accept the change only with probability e^{−ΔU/kT}
Pick up new spin(s) and do it again
SLIDE 12
... just a few C lines
SLIDE 13 Monte Carlo algorithms
Common features:
- bit-manipulation operations on spins (+ LUT access)
- (good-quality / long-period) random numbers
- a huge degree of available parallelism
- regular program flow (orderly loops on the grid sites)
- regular, predictable memory access pattern
- information exchange (processor <-> memory) is huge, however the size of the data base is tiny
---> many small (not so small) cores, hardwired control, on-chip memory
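One standard way the bit-manipulation feature is exploited in spin-glass codes (sketched here with our own naming, not taken from the Janus sources) is multispin coding: bit k of a 64-bit word holds the spin of replica k, so a single bitwise operation acts on 64 replicas at once. XOR of a spin word with a neighbour word flags the replicas where the two spins disagree, and counting disagreements becomes bit-sliced addition:

```c
#include <stdint.h>

/* Spins stored one replica per bit: s = +1 -> bit 1, s = -1 -> bit 0.
   a, b, c are disagreement masks (spin XOR neighbour) for three bonds. */

/* Bit-sliced sum of three masks: the per-replica count (0..3) is
   returned in two bit planes, *lsb (low bit) and *msb (carry bit). */
static void bitsliced_add3(uint64_t a, uint64_t b, uint64_t c,
                           uint64_t *lsb, uint64_t *msb)
{
    uint64_t t = a ^ b;
    *lsb = t ^ c;               /* low bit of a + b + c */
    *msb = (a & b) | (t & c);   /* carry (high) bit     */
}
```

Chaining two such adders covers all six neighbours of a 3-D site; a coupling sign J_ij = ±1 is folded in by one more XOR with a coupling bit mask.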
SLIDE 14
Compute intensive, you mean??
One Monte Carlo step is roughly the (real) time in which a (real) system flips one of its spins, roughly 1 picosecond. If you want to understand what happens in just the first seconds of a real experiment, you need O(10¹²) time steps on ~100 replicas of a 100³ system ---> 10²⁰ updates. Clever programming on standard CPUs: 1 ns/spin-update ---> 3000 years.
SLIDE 15
Compute intensive, you mean??
The dynamics is dramatically slow (see picture). So even a simulated box whose size is a small multiple of the correlation length will give accurate physics results. Good news: we're in business even if we simulate a very small box .... However ....
SLIDE 16 Strong scaling vs weak scaling
Amdahl's law (strong scaling) vs... … Gustafson's law (weak scaling) In our case … enlarging system-size is meaningless, as we do not yet have the resources to study a “small” system ----> the ultimate quest for strong scaling ....
S_A = ((1−p) + p) / ((1−p) + p/N) = 1 / ((1−p) + p/N)

S_G = ((1−p) + Np) / ((1−p) + p) = (1−p) + Np
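The two laws are easy to compare numerically; the functions below are direct transcriptions, with p the parallel fraction and n the number of processors:

```c
/* Amdahl's law (strong scaling): fixed-size problem on n processors */
static double speedup_amdahl(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

/* Gustafson's law (weak scaling): problem size grows with n */
static double speedup_gustafson(double p, double n)
{
    return (1.0 - p) + n * p;
}
```

For example, with p = 0.99 and n = 1024, Amdahl caps the speedup near 91, while Gustafson still reports about 1014: this is why a strong-scaling problem like ours is so unforgiving.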
SLIDE 17
An attempt at developing, building and operating an application-driven compute engine for Monte Carlo simulations of spin-glass systems. A collaboration of:
- Universities of Rome (La Sapienza) and Ferrara
- Universities of Madrid, Zaragoza, Badajoz
- BIFI (Zaragoza)
- Eurotech
Partially supported by Microsoft, Xilinx
The JANUS project
SLIDE 18
The nature of the available parallelism
Spin-glass simulations have two levels of available parallelism:
1) Embarrassingly trivial: need statistics on several replicas ---> farm it out to independent processors
2) Trivially identified: the sweep order for the Monte Carlo update is not specified ---> can update in parallel any set of non-mutually-interacting spins; make it a black-white checkerboard: it opens the way to tens of thousands of independent threads ...
1) & 2) do not commute
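The checkerboard idea in 2) can be sketched in a few lines (illustrative code with our own helper names): on a cubic lattice, sites of equal coordinate parity do not interact, so within one colour every site update is independent and may run on a separate thread.

```c
#define L 8                      /* illustrative lattice side */

static int updates;              /* counts site visits; stands in for the
                                    real Metropolis update of one spin   */
static void update_site(int x, int y, int z)
{
    (void)x; (void)y; (void)z;
    updates++;
}

/* One full sweep as two half-sweeps: "black" sites, then "white" ones.
   All sites of one colour could be updated concurrently. */
static void checkerboard_sweep(void)
{
    for (int colour = 0; colour < 2; colour++)
        for (int x = 0; x < L; x++)
            for (int y = 0; y < L; y++)
                for (int z = 0; z < L; z++)
                    if (((x + y + z) & 1) == colour)
                        update_site(x, y, z);
}
```

Each half-sweep touches L³/2 mutually non-interacting sites, which is exactly the pool of independent threads the slide refers to.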
SLIDE 19
The ideal spin glass machine .....
A further question: what is the appropriate system scale at which this parallelism is best exploited?
One update engine:
- computes the local contribution to U
- addresses a probability table
- compares with a freshly generated random number
- assigns the new spin value
U = −∑_{⟨ij⟩ ∈ NB} J_ij σ_i σ_j
SLIDE 20 The ideal spin glass machine .....
All this is just a bunch (~1000) of gates
And in spite of that, a typical CPU core, with O(10⁷) or more gates, can process perhaps 4 spins at each clock cycle. If you can arrange your stock of gates the way it best suits the algorithm, you can easily expect ~1000 update engines on one chip.
The best structure is a massively-many-core organization ( or perhaps an application-driven GPU??)
SLIDE 21
The ideal spin glass machine .....
... is an orderly structure (a 2-D grid) of a large number of "update engines":
- each update engine handles a subset of the physical mesh
- its architectural structure is extremely simple
- each data path processes one bit at a time
- memory addressing is regular and predictable
- SIMD processing is OK
- however, memory bandwidth requirements are huge (need ~7 bits read to process one bit ...)
- however, memory can be "local to the processor"
Simple hardware structure ---> FPGAs are OK!
SLIDE 22
The JANUS machine
A parallel system of (themselves) massively parallel processor chips. The basic hardware element:
- a 2-D grid of 4 x 4 (FPGA-based) processors (SPs)
- data links among nearest neighbours on the grid
- one control processor on each board (IOP) with 2 Gbit Ethernet links to the host system
SLIDE 23
JANUS: a picture gallery
SLIDE 24 Our “large” machine
256 (16 x 16) processors, 8 host PCs ---> ~90 TIPS for spin-glass simulation. A typical simulation wall-clock time on this nice little machine goes down to a more manageable ~100 days.
SLIDE 25 The 2008 implementation (Xilinx Virtex4-LX200):
- 1024 update cores on each processor, pipelineable to one spin update per clock cycle ---> 88% of available logic resources
- system clock at 62.5 MHz ---> 16 ps average spin update time
- using a bandwidth of ~12000 read bits + 1000 written bits per clock cycle ---> 47% of available on-chip memory
JANUS as a spin-glass engine
SLIDE 26
Let's use "conventional" units first. The data path of each Processing Element (PE) performs 11 + 2 sustained pipelined ops per clock cycle (62.5 MHz). We have 1024 PEs ----> ~830 GIPS. However, 11 of those ops are on very short data words; more honestly: 7 ... 8 sustained "conventional" pipelined ops per clock cycle. We have 1024 PEs ----> ~300 GIPS ---> 10 GIPS/W, sustained by ~1 Tbyte/s combined memory bandwidth.
(Measured) Performances
SLIDE 27 Physicists like a different figure of merit ----> the spin-flip rate R, typically measured in picoseconds per flip. For each processor in the system, and for one complete element of the JANUS core (16 procs):
(Measured) Performances
R = 1/(N f) = 1/(1024 × 62.5 MHz) ≃ 16 ps/flip

R = 1/(N_p N f) = 1/(16 × 1024 × 62.5 MHz) ≃ 1 ps/flip
... as fast as Nature ...
SLIDE 28
Physics results
SLIDE 29 Spin-glass addicts like to quote the average spin-update time
Performance figures (2008-2009)
                      SUT       GUT
Janus module          16 ps     1 ps
PC (Intel Core Duo)   3000 ps   700 ps
IBM CBE (all cores)

3x – 7x !!
SLIDE 30
In the last couple of years, multi/many core processors and GPUs have entered the arena....
Performance figures (2010-2011)
Still 1x – 2x !!
SLIDE 31
The 4+-year-old Janus still has an edge on state-of-the-art commercial HPC computing architectures. Reasonable to continue along the same line, surfing on technology developments. Expected performance increase?
What next??
FPGA size         2.5 – 3.0x
Clock frequency   4.0x
SUT parallel      16x
Grand total       160 – 200x    ("log₂(Grand Total)" ≈ 7.5)
SLIDE 32
Exactly the same architecture as JANUS, but ....
- Xilinx Virtex-7 FPGAs (Virtex7-485)
- 2 DDR3 memory banks on each SP
- improved local 4x4 interconnection
- tighter coupling with the HOST (on-box CPU + PCIe gen2)
Protos in fall 2012 – Physics in early 2013 ----> simulate a 128³ Ising spin glass for 2⁴² time steps
What next?? Janus 2
SLIDE 33 Looking at the crystal ball....
How long is the (predicted) opportunity window for Janus2? A graphical answer (and some speculations)
SLIDE 34
Take-away lessons
JANUS is an extremely rewarding example of (strongly application-driven) on-chip multiprocessing:
- we designed a machine around an "unconventional problem"
- no wonder the machine turned out to be "unconventional" enough
- results were rewarding ... WHY?
SLIDE 35
Take-away lessons
Results were rewarding ..... WHY??
- there is a lot of parallelism available, and it is actually exploited;
- load is automatically balanced among the update engines;
- memory access is heavy, but patterns are predictable;
- processors (and their memories) are arranged on a regular grid;
- inter-node traffic is modest and regular.
IN SHORT: our machine tried to exploit all these features at best.