  1. Spin glass simulations on Janus R. (lele) Tripiccione Dipartimento di Fisica, Universita' di Ferrara raffaele.tripiccione@unife.it UCHPC, Rodos (Greece), Aug. 27th, 2012

  2. Warning / Disclaimer / Fineprints I'm an outsider here ---> a physicist's view on an application-specific architecture A flavor of physics-motivated, performance-paranoid, (hopefully) unconventional computer architecture However, a few points of contact with main-stream CS may still exist ...

  3. On the menu today WHAT?: spin-glass simulations in short WHY?: computational challenges HOW?: the JANUS systems DID IT WORK?: measured and expected performance (and comparison with “conventional” systems) Take-away lessons / Conclusions

  4. Our computational problem Bring a spin-glass (*) system of e.g. 48^3 grid points to thermal equilibrium - a challenge never attempted so far ---> follow the system for 10^12 – 10^13 Monte Carlo (*) steps - on ~100 independent system instances ((*) to be defined in the next slides) Back-of-envelope estimate: 1 high-end CPU for 10,000 years (which is not the same as 10,000 CPUs for 1 year ...)

  5. Statistical mechanics in brief .... Statistical mechanics tries to describe the macroscopic behaviour of matter in terms of average values of the microscopic structure A (hopefully familiar) example: explain why magnets have a transition temperature T beyond which they lose their magnetic state

  6. The Ising model ..... The tiny little magnets are named spins; they take just two values A "configuration" is a specific value assignment for all spins in the system The "macro"-behavior is dictated by the energy function at the "micro" level: each spin interacts only with its nearest neighbours in a discrete D-dim mesh: $U\{S\} = -\sum_{\langle ij \rangle} J\, S_i S_j$, with $J > 0$ Statistical physics bridges the gap from micro to macro ....

  7. The spin-glass model ..... Spin-glasses are a generalization of Ising systems. They are the reference theoretical model of glassy behavior: interesting per se, a model of complexity, interesting for industrial applications An apparently trivial change in the energy function makes spin-glasses much more complex than Ising systems Studying these systems is a computational nightmare ...

  8. Why are Spin Glasses so hard?? A very simple change in the energy function (defined on e.g. a discrete 3-D lattice) $U = -\sum_{NB(ij)} J_{ij}\, \sigma_i \sigma_j$, with $\sigma_i \in \{+1, -1\}$ and $J_{ij} \in \{+1, -1\}$, hides tremendously complex dynamics, due to the extremely irregular energy landscape in the configuration space (frustration)

  9. Monte Carlo algorithms These beasts are best studied numerically by Monte Carlo algorithms Monte Carlo algorithms navigate configuration space in such a way that ----> any configuration shows up according to its probability of being realized in the real world (at a given temperature) MC algorithms come in several versions ... most versions have remarkably similar requirements in terms of their algorithmic structure.

  10. The Metropolis algorithm An endless loop ..... Pick one (or several) spin(s) Compute the energy U Flip it/them Compute the new energy U' Compute $\Delta U = U' - U$ If $\Delta U \le 0$ accept the change unconditionally, else accept the change only with probability $e^{-\Delta U / kT}$ Pick up new spin(s) and do it again

  11. ... just a few C lines
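The original slide showed the actual listing, which did not survive the transcript. Below is a minimal sketch of a single-site Metropolis update for a 3-D +/-1 spin glass; names such as L, spin[], J[], nb[], beta and rng_uniform() are illustrative placeholders, not the Janus code.

    /* Minimal sketch of a single-site Metropolis update for a 3-D +/-1 spin
       glass.  All names here are illustrative placeholders.               */
    #include <math.h>
    #include <stdlib.h>

    #define L 48                      /* linear lattice size               */
    #define N (L * L * L)             /* number of sites                   */

    int    spin[N];                   /* spins, +1 or -1                   */
    int    J[N][6];                   /* couplings to the 6 neighbours     */
    int    nb[N][6];                  /* indices of the 6 neighbours       */
    double beta;                      /* inverse temperature 1/kT          */

    static double rng_uniform(void)   /* placeholder uniform RNG in [0,1)  */
    {
        return (double)rand() / ((double)RAND_MAX + 1.0);
    }

    void metropolis_site(int i)
    {
        int h = 0;
        for (int k = 0; k < 6; k++)            /* local field from the neighbours */
            h += J[i][k] * spin[nb[i][k]];

        int dU = 2 * spin[i] * h;              /* energy change if spin i flips   */
        if (dU <= 0 || rng_uniform() < exp(-beta * (double)dU))
            spin[i] = -spin[i];                /* accept the flip                 */
    }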

  12. Monte Carlo algorithms Common features: bit-manipulation operations on spins (+ LUT access) (good-quality/long) random numbers a huge degree of available parallelism regular program flow (orderly loops on the grid sites) regular, predictable memory access pattern information exchange (processor <-> memory) is huge - however the size of the data-base is tiny
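The bit-manipulation point deserves a concrete illustration. Here is a sketch of the encoding trick (my own illustration, not the Janus kernel): with a spin and a coupling each stored as a single bit (0 for +1, 1 for -1), the product J_ij * sigma_i * sigma_j equals +1 exactly when the XOR of the three bits is 0, so the local energy around a site reduces to a few XORs and a small count.

    /* Illustrative sketch of the bit-level encoding (not the Janus kernel).
       Bit s = 0 means sigma = +1, s = 1 means sigma = -1; the same
       convention holds for the coupling bit j.  Then
           J_ij * sigma_i * sigma_j = +1   exactly when   s_i ^ s_j ^ j_ij == 0. */
    #include <stdint.h>

    /* count the unsatisfied bonds around one site with 6 neighbours */
    static inline int unsatisfied_bonds(uint32_t s_i, const uint32_t s_nb[6],
                                        const uint32_t j_nb[6])
    {
        int n = 0;
        for (int k = 0; k < 6; k++)
            n += (int)((s_i ^ s_nb[k] ^ j_nb[k]) & 1u);   /* 1 = unsatisfied bond */
        return n;                                         /* n is in 0..6         */
    }

Since flipping the spin reverses all six bonds, the energy change is dU = 2*(6 - 2n), so the Metropolis acceptance test depends only on n; this is what makes a tiny precomputed acceptance table (the LUT) possible, as sketched again at slide 18 below.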

  13. Compute intensive, you mean?? One Monte Carlo step is roughly the (real) time in which a (real) system flips one of its spins, roughly 1 pico-second If you want to understand what happens in just the first seconds of a real experiment you need O(10^12) time steps on ~100 replicas of a 100^3 system ---> 10^20 updates Clever programming on standard CPUs: 1 ns/spin-update ---> 3000 years

  14. Compute intensive, you mean?? The dynamics is dramatically slow (see picture) So even a simulated box whose size is a small multiple of the correlation length will give accurate physics results Good news: we're in business even if we simulate a very small box .... However ....

  15. Hard scaling vs Weak Scaling Amdahl's law (strong scaling): $S_A = \dfrac{(1-p) + p}{(1-p) + p/N} = \dfrac{1}{(1-p) + p/N}$ ... vs Gustafson's law (weak scaling): $S_G = \dfrac{(1-p) + N p}{(1-p) + p} = (1-p) + N p$ In our case ... enlarging the system size is meaningless, as we do not yet have the resources to study a "small" system ----> the ultimate quest for strong scaling ....
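As a quick numerical illustration of the two laws (the numbers are mine, not from the talk): with a parallel fraction p = 0.99 and N = 256 processors, Amdahl caps the speedup near 72, while Gustafson's scaled speedup is about 253.

    /* Quick numerical illustration of the two scaling laws.
       p and N are made-up values, not figures from the talk. */
    #include <stdio.h>

    int main(void)
    {
        double p = 0.99;     /* parallel fraction (assumed) */
        double N = 256.0;    /* number of processors        */

        double S_A = 1.0 / ((1.0 - p) + p / N);   /* Amdahl    (strong scaling) */
        double S_G = (1.0 - p) + N * p;           /* Gustafson (weak scaling)   */

        printf("Amdahl    S_A = %.1f\n", S_A);    /* ~  72: capped by the serial part    */
        printf("Gustafson S_G = %.1f\n", S_G);    /* ~ 253: grows with the problem size  */
        return 0;
    }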

  16. The JANUS project An attempt at developing, building and operating an application-driven compute engine for Monte Carlo simulations of spin glass systems A collaboration of: Universities of Rome (La Sapienza) and Ferrara Universities of Madrid, Zaragoza, Badajoz BIFI (Zaragoza) Eurotech Partially supported by Microsoft, Xilinx

  17. The nature of the available parallelism Spin-glass simulations have two levels of available parallelism 1) Embarrassingly trivial: need statistics on several replicas ---> farm it out to independent processors 2) Trivially identified: the sweep order for the Monte Carlo update is not specified ---> can update in parallel any set of non-mutually-interacting spins Make it a black-white checkerboard: it opens the way to tens of thousands of independent threads (see the sketch below) ... 1) & 2) do not commute
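A minimal sketch of the black/white checkerboard sweep of point 2), reusing L and metropolis_site() from the earlier sketch (illustrative code, not the Janus implementation): sites of one colour share no bonds, so every update in a sweep of a given colour is independent and could run in parallel.

    /* Checkerboard ("black/white") sweep: sites whose coordinate parity
       matches the current colour do not interact with each other, so all
       their updates are mutually independent. */
    void sweep(int colour)                       /* colour = 0 (white) or 1 (black) */
    {
        for (int z = 0; z < L; z++)
            for (int y = 0; y < L; y++)
                for (int x = 0; x < L; x++)
                    if (((x + y + z) & 1) == colour)
                        metropolis_site(x + L * (y + L * z));
    }

    /* one full Monte Carlo step = one sweep with colour 0 + one with colour 1 */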

  18. The ideal spin glass machine ..... A further question: what is the appropriate system scale at which this parallelism is best exploited? One update engine: $U = -\sum_{NB(ij)} \sigma_i J_{ij} \sigma_j$ computes the local contribution to U addresses a probability table compares with a freshly generated random number assigns the new spin value
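To make the four steps concrete, here is a minimal software sketch of one update engine (my illustration; the real engines are hardware pipelines). It assumes the spin is stored as one bit, n is the number of unsatisfied bonds from the earlier XOR sketch, prob[] is a precomputed acceptance table and rng_next() a placeholder 32-bit random-number generator.

    /* Sketch of one update-engine step: local energy contribution ->
       probability table -> comparison with a fresh random number ->
       new spin value.  Illustrative names, not the Janus hardware.   */
    #include <stdint.h>

    extern uint32_t prob[3];          /* exp(-beta*dU(n)) for n = 0,1,2, scaled to 2^32 */
    extern uint32_t rng_next(void);   /* placeholder 32-bit uniform RNG                 */

    static inline uint32_t update_spin(uint32_t s_i, int n)   /* n = unsatisfied bonds */
    {
        /* flipping reverses all 6 bonds, so dU = 2*(6 - 2*n):
           for n >= 3 the move lowers (or keeps) the energy -> always accepted */
        if (n >= 3 || rng_next() < prob[n])
            return s_i ^ 1u;          /* flip the one-bit spin */
        return s_i;                   /* keep the old value    */
    }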

  19. The ideal spin glass machine ..... All this is just a bunch (~1000) of gates And in spite of that, a typical CPU core, with O(10^7+) gates, can process perhaps 4 spins at each clock cycle If you can arrange your stock of gates the way it best suits the algorithm, you can easily expect ~1000 update engines on one chip ----> The best structure is a massively-many-core organization (or perhaps an application-driven GPU??)

  20. The ideal spin glass machine ..... is an orderly structure (a 2-D grid) of a large number of "update engines" each update engine handles a subset of the physical mesh its architectural structure is extremely simple each data path processes one bit at a time memory addressing is regular and predictable SIMD processing is OK however memory bandwidth requirements are huge (need 7 bits to process one bit ..) however memory can be "local to the processor" Simple hardware structure ---> FPGAs are OK!

  21. The JANUS machine A parallel system of (themselves) massively parallel processor chips The basic hardware element: A 2-D grid of 4 x 4 (FPGA-based) processors (SPs) Data links among nearest neighbours on the grid One control processor on each board (IOP) with 2 Gbit Ethernet links to the host

  22. JANUS: a picture gallery

  23. Our “large” machine 256 (16 x 16) processors 8 host PCs --> ~ 90 TIPS for spin-glass simulation A typical simulation wall-clock time on this nice little machine goes down to a more manageable ~ 100 days.

  24. JANUS as a spin-glass engine The 2008 implementation (Xilinx Virtex-4 LX200): 1024 update cores on each processor, pipelineable to one spin update per clock cycle ---> 88% of available logic resources system clock at 62.5 MHz ---> 16 ps average spin update time using a bandwidth of ~12000 read bits + 1000 written bits per clock cycle ---> 47% of available on-chip memory

  25. (Measured) Performances Let's use "conventional" units first The data path of each Processing Element (PE) performs 11 + 2 sustained pipelined ops per clock cycle (62.5 MHz) We have 1024 PEs ----> ~830 GIPS However, 11 ops are on very short data words; more honestly: 7 ... 8 sustained "conventional" pipelined ops per clock cycle We have 1024 PEs ----> ~300 GIPS ---> 10 GIPS/W Sustained by ~1 Tbyte/sec combined memory bandwidth

  26. (Measured) Performances Physicists like a different figure of merit ----> the spin-flip rate R, typically measured in picoseconds per flip For each processor in the system: $R = \dfrac{1}{N f} = \dfrac{1}{1024 \times 62.5\ \mathrm{MHz}} \simeq 16\ \mathrm{ps/flip}$ For one complete element of the JANUS core (16 procs): $R = \dfrac{1}{N_p N f} = \dfrac{1}{16 \times 1024 \times 62.5\ \mathrm{MHz}} \simeq 1\ \mathrm{ps/flip}$ ---> as fast as Nature

  27. Physics results

  28. Performance figures (2008-2009) Spin-glass addicts like to quote the average spin-update time: Janus module: SUT 16 ps, GUT 1 ps PC (Intel Core Duo): SUT 3000 ps, GUT 700 ps IBM CBE (all cores): SUT -, GUT 65 ps ---> 300x – 700x !!

  29. Performance figures (2010-2011) In the last couple of years, multi/many-core processors and GPUs have entered the arena .... ---> Still 10x – 20x !!
