Self-Tuning Bio-Inspired Massively-Parallel Computing (Steve Furber, PowerPoint presentation)


SLIDE 1

EXADAPT Mar 2012 1

Self-Tuning Bio-Inspired Massively-Parallel Computing

Steve Furber The University of Manchester

steve.furber@manchester.ac.uk

SLIDE 2

Outline

  • 63 years of progress
  • Many cores make light work
  • Building brains
  • The SpiNNaker project
  • The networking challenge
  • A generic neural modelling platform
  • Plans & conclusions
SLIDE 3

Manchester Baby (1948)

SLIDE 4

SpiNNaker CPU (2011)

SLIDE 5

63 years of progress

  • Baby:

– filled a medium-sized room
– used 3.5 kW of electrical power
– executed 700 instructions per second

  • SpiNNaker ARM968 CPU node:

– fills ~3.5 mm² of silicon (130 nm)
– uses 40 mW of electrical power
– executes 200,000,000 instructions per second

SLIDE 6

Energy efficiency

  • Baby:

– 5 Joules per instruction

  • SpiNNaker ARM968:

– 0.000 000 000 2 Joules (0.2 nJ) per instruction

25,000,000,000 times better than Baby!

(James Prescott Joule born Salford, 1818)
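The 25-billion figure follows directly from the numbers on the previous slides; a quick worked check (Python, figures as quoted on the slides):

```python
# Worked check of the energy-per-instruction comparison (figures from the slides).
baby_joules_per_insn = 3.5e3 / 700          # 3.5 kW / 700 instructions/s = 5 J
spinnaker_joules_per_insn = 40e-3 / 200e6   # 40 mW / 200 MIPS = 0.2 nJ
ratio = baby_joules_per_insn / spinnaker_joules_per_insn

print(baby_joules_per_insn)   # 5.0 (Joules per instruction)
print(ratio)                  # 25000000000.0 (25 billion times better)
```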

SLIDE 7

Moore’s Law

Transistors per Intel chip

[Chart: millions of transistors per Intel chip, 1970 to 2000, rising exponentially from the 4004 and 8008 through the 286, 386, 486, Pentium, Pentium II and Pentium III to the Pentium 4]

SLIDE 8

…the Bad News

  • atomic scales
  • less predictable
  • less reliable
SLIDE 9

Outline

  • 63 years of progress
  • Many cores make light work
  • Building brains
  • The SpiNNaker project
  • The networking challenge
  • A generic neural modelling platform
  • Plans & conclusions
SLIDE 10

Multi-core CPUs

  • High-end uniprocessors

– diminishing returns from complexity
– wire vs transistor delays

  • Multi-core processors

– cut-and-paste
– simple way to deliver more MIPS

  • Moore’s Law

– more transistors
– more cores

… but what about the software?

SLIDE 11

Multi-core CPUS

  • General-purpose parallelization

– an unsolved problem
– the ‘Holy Grail’ of computer science for half a century?
– but imperative in the many-core world

  • Once solved

– few complex cores, or many simple cores?
– simple cores win hands-down on power-efficiency!

SLIDE 12

Back to the future

  • Imagine…

– a limitless supply of (free) processors
– load-balancing is irrelevant
– all that matters is:

  • the energy used to perform a computation
  • formulating the problem to avoid synchronisation
  • abandoning determinism
  • How might such systems work?
SLIDE 13

Outline

  • 63 years of progress
  • Many cores make light work
  • Building brains
  • The SpiNNaker project
  • The networking challenge
  • A generic neural modelling platform
  • Plans & conclusions
SLIDE 14

Building brains

  • Brains demonstrate

– massive parallelism (10¹¹ neurons)
– massive connectivity (10¹⁵ synapses)
– excellent power-efficiency

  • much better than today’s microchips

– low-performance components (~100 Hz)
– low-speed communication (~ metres/sec)
– adaptivity
– tolerant of component failure
– autonomous learning

SLIDE 15

Bio-inspiration

  • How can massively parallel computing resources accelerate our understanding of brain function?
  • How can our growing understanding of brain function point the way to more efficient parallel, fault-tolerant computation?

SLIDE 16

Building brains

  • Neurons
  • multiple inputs, single output (c.f. logic gate)
  • useful across multiple scales (10² to 10¹¹)
  • Brain structure
  • regularity
  • e.g. 6-layer cortical ‘microarchitecture’

SLIDE 17

Spike Timing Dependent Plasticity
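As a rough illustration of the rule named above, a minimal pair-based exponential STDP sketch (hypothetical constants, not the SpiNNaker implementation):

```python
import math

# Minimal pair-based STDP sketch (illustrative constants, not SpiNNaker's code).
# A pre-spike shortly before a post-spike strengthens the synapse (potentiation);
# the reverse order weakens it (depression), with exponential fall-off in time.
A_PLUS, A_MINUS = 0.01, 0.012   # learning rates (hypothetical values)
TAU = 20.0                      # time constant in ms (hypothetical value)

def stdp_dw(t_pre, t_post):
    """Weight change for one pre/post spike pair (times in ms)."""
    dt = t_post - t_pre
    if dt > 0:                  # pre fired first: causal pair, potentiate
        return A_PLUS * math.exp(-dt / TAU)
    return -A_MINUS * math.exp(dt / TAU)   # post fired first: depress

print(stdp_dw(10.0, 15.0) > 0)   # True: causal pair -> potentiation
print(stdp_dw(15.0, 10.0) < 0)   # True: anti-causal pair -> depression
```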

SLIDE 18

Learning patterns

  • Spot the pattern?

[Raster plot: Neuron ID vs simulation time (msec)]

SLIDE 19

Learning patterns

  • Now you see it!

[Raster plot: Neuron ID vs simulation time (msec)]

SLIDE 20

Learning patterns

[Plot: delay after pattern input (ms) vs simulation time]

SLIDE 21

Self-tuning: in brains

  • With STDP, and no other reinforcement, neurons learn the statistics of their inputs
  • and, with just a little mutual inhibition, populations distribute themselves across the range of presented inputs.
  • New inputs are interpreted against these learnt statistics.

  • Bayes would be very proud!

Masquelier & Thorpe, 2007

SLIDE 22

Outline

  • 63 years of progress
  • Many cores make light work
  • Building brains
  • The SpiNNaker project
  • The networking challenge
  • A generic neural modelling platform
  • Plans & conclusions
SLIDE 23

SpiNNaker project

  • Multi-core CPU node

– 18 ARM968 processors
– to model large-scale systems of spiking neurons

  • Scalable up to systems with 10,000s of nodes

– over a million processors
– >10⁸ MIPS total

SLIDE 24

Design principles

  • Virtualised topology

– physical and logical connectivity are decoupled

  • Bounded asynchrony

– time models itself

  • Energy frugality

– processors are free
– the real cost of computation is energy

SLIDE 25

SpiNNaker system

SLIDE 26

CMP node

SLIDE 27

SpiNNaker chip

Mobile DDR SDRAM interface

SLIDE 28

SpiNNaker SiP

Multi-chip packaging by UNISEM Europe

SLIDE 29

Self-tuning: fault-tolerance

  • Strategy: for all components consider:

– fault insertion – how do we test the FT feature?
– fault detection – we have a problem!
– fault isolation – contain the damage
– reconfiguration – repair the damage

  • Goal: minimize performance deficit x time

– real-time system, so checkpoint & restart inapplicable

SLIDE 30

Circuit-level fault-tolerance

  • Delay-insensitive comms

– 3-of-6 RTZ on chip
– 2-of-7 NRZ off chip

  • Deadlock resistance

– Tx & Rx circuits have high deadlock immunity
– Tx & Rx can be reset independently

  • each injects a token at reset
  • true transition detector filters surplus token

[Diagram: Tx and Rx link interface circuits; din (2-phase) in, dout (4-phase) out, with ¬reset, ¬ack, data and ack signals]

SLIDE 31

System-level fault-tolerance

  • Breaking symmetry

– any processor can be Monitor Processor

  • local ‘election’ on each chip, after self-test

– all nodes are identical at start-up

  • addresses are computed relative to the node with the host connection (0,0)

– system initialised using flood-fill

  • nearest-neighbour packet type
  • boot time (almost) independent of system scale
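The flood-fill initialisation can be sketched as a breadth-first spread over a toroidal grid (an assumed 4-neighbour 8x8 mesh for illustration; the real machine uses six links per node). The point of the last bullet is that boot time grows with the network radius (maximum hop count), not the node count:

```python
from collections import deque

# Sketch of flood-fill boot over a 2D torus of nodes (assumed 8x8, 4 neighbours
# per node for illustration). The node with the host connection is (0,0); each
# booted node forwards the image to its neighbours.
def flood_fill(width, height, start=(0, 0)):
    hops = {start: 0}               # node -> hops from the host node
    frontier = deque([start])
    while frontier:
        x, y = frontier.popleft()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = ((x + dx) % width, (y + dy) % height)   # torus wrap-around
            if nxt not in hops:
                hops[nxt] = hops[(x, y)] + 1
                frontier.append(nxt)
    return hops

hops = flood_fill(8, 8)
print(len(hops))           # 64: every node reached
print(max(hops.values()))  # 8: worst case is the network radius, not node count
```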
SLIDE 32

Application-level fault-tolerance

  • Cross-system delay << 1 ms

– hardware routing
– ‘emergency’ routing

  • failed links
  • congestion

– permanent fault

  • reroute (s/w)
SLIDE 33

Outline

  • 63 years of progress
  • Many cores make light work
  • Building brains
  • The SpiNNaker project
  • The networking challenge
  • A generic neural modelling platform
  • Plans & conclusions
SLIDE 34

The networking challenge

  • Emulate the very high connectivity of real neurons
  • A spike generated by a neuron firing must be conveyed efficiently to >1,000 inputs
  • On-chip and inter-chip spike communication should use the same delivery mechanism

SLIDE 35

Network – packets

  • Four packet types

– MC (multicast): source routed; carry events (spikes)
– P2P (point-to-point): used for bootstrap, debug, monitoring, etc.
– NN (nearest neighbour): build address map, flood-fill code
– FR (fixed route): carry 64-bit debug data to host

  • Timestamp mechanism removes errant packets

– which could otherwise circulate forever

[Packet formats: MC packet – 8-bit header (P, ER, TS, T=0), 32-bit event ID, optional 32-bit payload; P2P packet – 8-bit header (P, SQ, TS, T=1), 16+16-bit source and destination addresses, optional 32-bit payload]
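The timestamp mechanism can be sketched as follows (a hypothetical 2-bit phase counter; details of the real hardware may differ). A packet is stamped with the current phase when it enters the network; when the stamp trails the slowly-advancing global phase by two steps, the packet is deemed errant and dropped rather than allowed to circulate forever:

```python
# Sketch of timestamp-based packet ageing (assumed 2-bit phase counter).
PHASES = 4  # a 2-bit timestamp wraps modulo 4

def is_errant(packet_phase, current_phase):
    """True when the packet's stamp is at least two phases old (mod 4)."""
    return (current_phase - packet_phase) % PHASES >= 2

print(is_errant(0, 1))  # False: one phase old, still in flight legitimately
print(is_errant(0, 2))  # True: two phases old, dropped as errant
print(is_errant(3, 0))  # False: wrap-around handled by the modulo arithmetic
```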

SLIDE 36

Network – MC Router

  • All MC spike event packets are sent to a router
  • Ternary CAM keeps router size manageable at 1024 entries

(but careful network mapping also essential)

  • CAM ‘hit’ yields a set of destinations for this spike event

– automatic multicasting

  • CAM ‘miss’ routes event to a ‘default’ output link

[Diagram: event ID matched against a ternary CAM entry (e.g. 0010X11X); a hit selects a vector of inter-chip and on-chip outputs]
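A software sketch of the ternary-CAM lookup described above (illustrative data structures, not the hardware design). Each entry holds a key and a mask; masked-out bits act as don't-cares ('X'). A hit yields the destination set for automatic multicast; a miss falls through to a default link, which is why careful network mapping keeps 1,024 entries sufficient:

```python
# Ternary-CAM multicast routing sketch (illustrative, not the hardware netlist).
def route(event_id, table, default_link):
    """Return the set of destinations for a spike event ID."""
    for key, mask, dests in table:       # the hardware checks entries in parallel
        if event_id & mask == key & mask:
            return dests                 # hit: multicast to this destination set
    return {default_link}                # miss: pass straight through

# One hypothetical entry: matches IDs of the form 0b00x1x (X bits masked out).
table = [
    (0b0010, 0b1100, {"core3", "link0"}),   # matches any 0b00xx event ID
]
print(sorted(route(0b0001, table, "link2")))  # ['core3', 'link0'] (CAM hit)
print(sorted(route(0b1111, table, "link2")))  # ['link2'] (miss -> default link)
```

On a miss the packet keeps moving in a straight line, so only events that must turn or fan out consume table entries.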

SLIDE 37

Topology mapping

[Diagram: a problem graph (circuit) is mapped onto the machine topology of nodes, cores and synapses; a fragment of the resulting MC routing table shows the multicast entries for sources 23, 72 and 94]

SLIDE 38

Problem mapping

SpiNNaker:

The problem is represented as a network of nodes with a certain behaviour, and is split into two parts: the behaviour of each node is embodied as an interrupt handler in code, which is compiled, linked, and loaded as binary files into core instruction memory; the abstract problem topology is loaded into firmware routing tables. Our job is to make the model behaviour reflect reality.

The code says "send message" but has no control over where the output message goes
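The "behaviour embodied as interrupt handlers" model can be sketched with a hypothetical event-registration API (names invented for illustration; the real SpiNNaker API differs). Application code reacts to events and never chooses a destination, exactly as the slide says:

```python
# Sketch of the event-driven model (hypothetical API, not SpiNNaker's actual one):
# code registers handlers for events; "sending" just hands data to the router.
handlers = {}

def on(event):
    """Decorator registering a handler for a named event (invented helper)."""
    def register(fn):
        handlers[event] = fn
        return fn
    return register

sent = []   # stand-in for the router's outgoing packet queue

@on("packet")
def handle_spike(payload):
    sent.append(payload + 1)   # "send" a response; routing tables pick the target

@on("timer")
def handle_tick(t):
    sent.append(("tick", t))   # periodic state update

handlers["packet"](41)         # simulate a packet-arrival interrupt
handlers["timer"](10)          # simulate a timer interrupt
print(sent)                    # [42, ('tick', 10)]
```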

SLIDE 39

Bisection performance

  • 1,024 links – in each direction
  • ~10 billion packets/s
  • 10 Hz mean firing rate
  • 250 Gbps bisection bandwidth

SLIDE 40

Outline

  • 63 years of progress
  • Many cores make light work
  • Building brains
  • The SpiNNaker project
  • The networking challenge
  • A generic neural modelling platform
  • Plans & conclusions
SLIDE 41

Event-driven software model

SLIDE 42

Event-driven software model

SLIDE 43

PACMAN

  • Partitioning and Configuration Manager

SLIDE 44

Self-tuning: software

  • PACMAN: extrinsic configuration
  • good for small systems
  • 1000-processor system
  • move table creation into SpiNNaker
  • 10,000-100,000 processors
  • increasingly intrinsic configuration
  • Million processor system
  • application loaded in one place
  • relax configuration across machine
  • continue relaxation at run-time to disperse hot-spots

We don’t know how to do this!

SLIDE 45

PyNN integration

SLIDE 46

PyNN integration

  • LIF
  • Izhikevich
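A minimal sketch of the LIF model listed above (forward-Euler integration; parameter values are illustrative, not the PyNN defaults):

```python
# Minimal leaky integrate-and-fire neuron (forward-Euler sketch; illustrative
# parameter values, not PyNN's defaults).
def lif_run(current_nA, steps, dt=1.0):
    """Return spike times (in steps) for a constant input current."""
    v, v_rest, v_thresh, v_reset = -65.0, -65.0, -50.0, -65.0   # mV
    tau_m, r_m = 20.0, 10.0    # membrane time constant (ms), resistance (MOhm)
    spikes = []
    for t in range(steps):
        # leak toward rest plus input drive
        v += dt * (-(v - v_rest) + r_m * current_nA) / tau_m
        if v >= v_thresh:      # threshold crossing: emit spike, reset membrane
            spikes.append(t)
            v = v_reset
    return spikes

print(len(lif_run(2.0, 100)) > 0)   # True: 2 nA drives regular firing
print(lif_run(0.5, 100))            # []: subthreshold drive, no spikes
```

The Izhikevich model adds a recovery variable to the same event-at-threshold scheme, trading a second state variable for richer firing behaviour.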
SLIDE 47

PyNN integration

  • Vogels-Abbott benchmark

– 500 LIF neurons

SLIDE 48

SpiNNaker robot control

SLIDE 49

Outline

  • 63 years of progress
  • Many cores make light work
  • Building brains
  • The SpiNNaker project
  • The networking challenge
  • A generic neural modelling platform
  • Plans & conclusions
SLIDE 50

Hexagonal PCB structure

[Diagram: 3-board basic unit; the three PCBs are linked via FPGAs over 2x 3.1 Gbps SATA links]

SLIDE 51

Hexagonal PCB structure

SLIDE 52

48-node PCB

SLIDE 53

SpiNNaker machines

10³ machine: 864 cores, 1 PCB, 75 W
10⁴ machine: 10,368 cores, 1 rack, 900 W
(NB 12 PCBs for operation without aircon)
10⁵ machine: 103,680 cores, 1 cabinet, 9 kW
10⁶ machine: 1M cores, 10 cabinets, 90 kW
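The quoted figures are mutually consistent: power per core stays near 87 mW across three orders of magnitude (the 10⁶ machine is quoted at a round 90 mW/core), so each 10x step in cores is close to a 10x step in watts. A quick check:

```python
# Consistency check of the scaling figures quoted on the slide:
# power per core is roughly constant across machine sizes.
machines = {            # name: (cores, watts), figures from the slide
    "10^3": (864, 75),
    "10^4": (10_368, 900),
    "10^5": (103_680, 9_000),
    "10^6": (1_000_000, 90_000),
}
for name, (cores, watts) in machines.items():
    print(name, round(1000 * watts / cores, 1), "mW/core")
```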

SLIDE 54

Current status…

  • Full 18-core chip: arrived 20 May 2011
  • Test card: 4 chips, 72 processors

– Cards can be linked together

  • Neuron models: LIF, Izhikevich, MLP
  • Synapse models: STDP, NMDA
  • Networks: PyNN -> SpiNNaker, various small tools to build router tables, etc.

…and the next steps:

  • 48-chip 10³ machine (Q1 2012), 500-chip 10⁴ machine (Q2 2012), 5,000-chip 10⁵ machine (H2 2012), 50,000-chip 10⁶ machine (end H2 2012).

SLIDE 55

Conclusions

  • Brains represent a significant computational challenge
  • now coming within range?
  • SpiNNaker is driven by the brain-modelling objective
  • virtualised topology, bounded asynchrony, energy frugality
  • The major architectural innovation is the multicast communications infrastructure
  • Self-tuning at many levels
  • hardware (for fault-tolerance), software and, most effectively, in the neurons themselves!
  • We have prototype working hardware!
SLIDE 56

SpiNNaker team

Manchester Southampton