On-line Learning Architectures
March 2017
Tarek M. Taha
Electrical and Computer Engineering Department, University of Dayton
Device SPICE Model
C. Yakopcic, T. M. Taha, G. Subramanyam, and R. E. Pino, "Memristor SPICE Model and Crossbar Simulation Based on Devices with Nanosecond Switching Time," IEEE International Joint Conference on Neural Networks (IJCNN), August 2013. [BEST PAPER AWARD]
[Figure: measured I-V curves and SPICE model fits for memristor devices from Boise State, Univ of Michigan, and HP Labs; voltage (V) vs. current (mA or µA).]
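As a rough illustration of what such a behavioral device model captures, here is a heavily simplified Python sketch: a sinh-shaped I-V scaled by a state variable x in [0, 1], where x moves only when the applied voltage exceeds a programming threshold. Every parameter value below is an illustrative assumption, not a fit to the devices shown.

    import numpy as np

    def memristor_step(x, V, dt, a=1e-4, b=3.0, Ap=1.0, An=1.0, Vp=1.0, Vn=1.0):
        """Advance the device state by one timestep dt; return (x, I).
        All parameters are illustrative, not fitted to a real device."""
        I = a * x * np.sinh(b * V)              # conduction through the device
        if V > Vp:                              # positive programming pulse
            dx = Ap * (np.exp(V) - np.exp(Vp))
        elif V < -Vn:                           # negative programming pulse
            dx = -An * (np.exp(-V) - np.exp(Vn))
        else:
            dx = 0.0                            # below threshold: state holds
        x = min(max(x + dx * dt, 0.0), 1.0)     # hard window on the state
        return x, I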
Memristor Backpropagation Circuit
Online training via backpropagation circuits.
The same memristor crossbar implements both the forward and backward passes.
[Figure: two-layer memristor backpropagation circuit shown in its forward-pass and backward-pass configurations: crossbar inputs A, B, C; training units for layers L1 and L2; op-amp summing stages with control signals ctrl1/ctrl2; output-layer errors δL2,1 and δL2,2 formed from target_1 - output_1 and target_2 - output_2; and propagated error terms εM1,1...εM1,6 and εM2,1, εM2,2.]
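The computation this circuit carries out can be summarized in a short numpy sketch. The layer sizes, tanh activation, and learning rate below are illustrative assumptions; in the hardware, W1 and W2 are stored as memristor conductances, the crossbars compute the forward vector-matrix products, the same crossbars driven in reverse compute the transposed products for the δ terms, and the training units apply the outer-product weight updates as programming pulses.

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.uniform(-0.1, 0.1, (4, 6))    # layer L1 weights (first crossbar)
    W2 = rng.uniform(-0.1, 0.1, (2, 4))    # layer L2 weights (second crossbar)
    f = np.tanh                            # activation (illustrative choice)
    df = lambda a: 1.0 - a**2              # tanh derivative, from the output

    def train_step(x, target, lr=0.05):
        """One forward + backward pass with an immediate weight update."""
        global W1, W2
        a1 = f(W1 @ x)                     # forward pass, crossbar 1
        a2 = f(W2 @ a1)                    # forward pass, crossbar 2
        d2 = (target - a2) * df(a2)        # output error (target - output)
        d1 = (W2.T @ d2) * df(a1)          # backward pass through crossbar 2
        W2 += lr * np.outer(d2, a1)        # training-unit programming pulses
        W1 += lr * np.outer(d1, x)
        return a2

    x = rng.uniform(0, 1, 6)
    target = np.array([0.8, -0.8])
    for _ in range(200):
        out = train_step(x, target)        # out converges toward target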
Socrates: Online Training Architecture
[Figure: 4x4 grid of neural cores (NC) and routers (R) with 3D-stacked eDRAM.]
Multicore processing system.
Implements backpropagation-based online training.
Two versions: (1) memristor core and (2) fully digital core.
Has three 3D-stacked memories for storing training data.
[Figure: memristor learning core vs. digital learning core.]
FPGA Implementation
We verified the system's functionality by implementing the digital design on an FPGA.
[Figure: 4x4 mesh of neural cores (NC) and routers (R); router detail showing input and output ports and an 8-bit routing bus; actual images going into and out of the FPGA-based multicore neural processor.]
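As an illustration of how data might traverse such a mesh, here is a dimension-ordered (XY) routing sketch; the routing policy, coordinates, and packet handling are assumptions, since the slide specifies only the NC/R mesh and an 8-bit routing bus.

    def next_hop(cur, dst):
        """cur, dst: (x, y) router coordinates. Route X first, then Y."""
        (cx, cy), (dx, dy) = cur, dst
        if cx != dx:
            return (cx + (1 if dx > cx else -1), cy)
        if cy != dy:
            return (cx, cy + (1 if dy > cy else -1))
        return cur                      # arrived: deliver to the local core

    # Path from core (0, 0) to core (3, 2):
    hop = (0, 0)
    while hop != (3, 2):
        hop = next_hop(hop, (3, 2))
        print(hop)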
Training Efficiency
Accuracy (%)   GPU    Digital   Memristor
MNIST          99%    97%       97%
COIL-20        100%   98%       97%
COIL-100       99%    99%       99%

Training compared to a GTX 980 Ti GPU:
Energy efficiency: Digital 5,900x; Memristor 70,000x
Speedup: Digital 14x; Memristor 7x
(Batch processing was used on the GPU but not on the specialized processors.)
Inference Efficiency
Inference compared to a GTX 980 Ti GPU:
Energy efficiency: Digital 7,000x; Memristor 358,000x
Speedup: Digital 29x; Memristor 60x
The memristor learning core is about 35x smaller in area: Digital 0.52 mm²; Memristor 0.014 mm².
Large Crossbar Simulation
[Figure: potential across each memristor in a 300×300 crossbar, SPICE vs. MATLAB model (both near 0.9-1.0 V); crossbar read circuit with 1 V inputs V1...VM and op-amp summing outputs y1...yN/2 with feedback resistance Rf.]
Studying training on memristor crossbars needs to be done in SPICE to account for circuit parasitics.
Simulating one cycle of a large crossbar can take over 24 hours in SPICE; training requires thousands of iterations and hence is nearly impossible in SPICE.
We developed a fast (~30 ms) algorithm that approximates the SPICE output with over 99% accuracy.
This enables us to examine learning on memristor crossbars.
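For context, what SPICE solves here is a large resistive network: every crosspoint adds a row-wire node and a column-wire node, coupled by the device conductance and by wire-segment resistances. The sketch below builds that nodal-analysis system for a small crossbar and solves it directly with numpy. It illustrates the problem being approximated, not the authors' ~30 ms algorithm; the wire resistance Rw, the device conductances, and the 1 V drive scheme are assumptions.

    import numpy as np

    def crossbar_node_voltages(G, Vin, Rw=1.0):
        """G: (N, M) memristor conductances (S). Vin: (N,) row drive voltages.
        Returns (Vrow, Vcol): voltages at every crosspoint on the row and
        column wires. Columns end in a virtual ground (ideal op-amp input)."""
        N, M = G.shape
        gw = 1.0 / Rw                       # conductance of one wire segment
        n = 2 * N * M                       # unknowns: row nodes, column nodes
        A = np.zeros((n, n))
        b = np.zeros(n)
        r = lambda i, j: i * M + j          # index of row node (i, j)
        c = lambda i, j: N * M + i * M + j  # index of column node (i, j)
        for i in range(N):
            for j in range(M):
                g = G[i, j]
                # KCL at row node (i, j): memristor down + wire left/right
                A[r(i, j), r(i, j)] += g + gw   # device + left segment
                A[r(i, j), c(i, j)] -= g
                if j == 0:
                    b[r(i, j)] += gw * Vin[i]   # left segment to the driver
                else:
                    A[r(i, j), r(i, j - 1)] -= gw
                if j < M - 1:                   # right segment
                    A[r(i, j), r(i, j)] += gw
                    A[r(i, j), r(i, j + 1)] -= gw
                # KCL at column node (i, j): memristor + wire up/down
                A[c(i, j), c(i, j)] += g
                A[c(i, j), r(i, j)] -= g
                if i > 0:
                    A[c(i, j), c(i, j)] += gw
                    A[c(i, j), c(i - 1, j)] -= gw
                if i < N - 1:
                    A[c(i, j), c(i, j)] += gw
                    A[c(i, j), c(i + 1, j)] -= gw
                else:
                    A[c(i, j), c(i, j)] += gw   # bottom segment to virtual ground
        v = np.linalg.solve(A, b)
        return v[:N * M].reshape(N, M), v[N * M:].reshape(N, M)

    # Potential actually dropped across each device (cf. the figure): Vrow - Vcol.
    G = np.full((16, 16), 1e-4)                 # 10 kOhm devices
    Vr, Vc = crossbar_node_voltages(G, np.ones(16), Rw=2.0)
    print((Vr - Vc).min())                      # worst-case device voltage

For a 300×300 array the same formulation has 180,000 unknowns and calls for a sparse solver, but a single linear solve per cycle remains far cheaper than a full SPICE transient analysis.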
Memristor for On-Line Learning
On-line learning requires "highly tunable" or "slow" memristors. This does not affect inference speed.
[Figure: Ron/Roff switching of a memory memristor (~10 ns) vs. an on-line learning memristor (~10 ns to 500 ns).]
LiNbO3 Memristor for Online Learning
The device is quite slow: on the order of tens of milliseconds switching time.
This makes it useful for backpropagation training.
[Figure: measured I-V curve of the LiNbO3 memristor; voltage (V) vs. current (mA).]
Switching speed of devices and minimum programming pulse needed in neuromorphic training circuits:

Approx. Switching Time   Switching Voltage   Minimum Pulse Width Required
10 ns                    6 V                 1 ps
10 ns                    4.5 V               10 ps
100 ns                   4.5 V               100 ps
1 µs                     4.5 V               1 ns
10 µs                    4.5 V               10 ns
[Figure: device current (mA) vs. programming pulse number, showing gradual conductance change over tens to thousands of pulses.]
Unsupervised Clustering
[Figure: clustering before and after training on the Wisconsin breast cancer (Benign/Malignant), Iris (Setosa/Versicolor/Virginica), and Wine (Class 1-3) datasets.]
[Figure: 6-3-6 autoencoder: inputs x1...x6 plus a bias, hidden neurons n1...n3 plus a bias, reconstructions x1'...x6'; crossbar layers L1 and L2.]
We extended the backpropagation circuit to implement autoencoder-based unsupervised learning in memristor crossbars.
These graphs show the circuit learning and clustering data in an unsupervised manner.
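The change relative to the supervised backpropagation sketch earlier is small, as shown below: the target is the input itself, so no labels are needed. The 6-3-6 shape matches the diagram; the activation and learning rate are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    W1 = rng.uniform(-0.1, 0.1, (3, 6))    # encoder crossbar (layer L1)
    W2 = rng.uniform(-0.1, 0.1, (6, 3))    # decoder crossbar (layer L2)
    f = np.tanh
    df = lambda a: 1.0 - a**2

    def autoencoder_step(x, lr=0.05):
        """One unsupervised training step: reconstruct x from a 3-value code."""
        global W1, W2
        h = f(W1 @ x)                      # hidden code n1..n3
        xr = f(W2 @ h)                     # reconstruction x1'..x6'
        d2 = (x - xr) * df(xr)             # the target is the input itself
        d1 = (W2.T @ d2) * df(h)
        W2 += lr * np.outer(d2, h)
        W1 += lr * np.outer(d1, x)
        return h, xr                       # code for clustering, reconstruction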
Application: Cybersecurity
[Figure: distance between input and reconstruction for normal vs. anomalous packets, and detection rate (%) vs. false detection rate (%).]
The autoencoder-based unsupervised training circuits were used to learn normal network packets.
When the memristor circuits were presented with anomalous network packets, these were detected with 96.6% accuracy and a 4% false-positive rate.
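The detection rule itself is simple and is sketched below: run the packet's feature vector through the trained autoencoder and flag it when the input-to-reconstruction distance is large. The feature extraction and the threshold value are assumptions; the threshold would be tuned on held-out normal traffic to trade detection rate against false positives.

    import numpy as np

    def reconstruction_distance(x, W1, W2, f=np.tanh):
        xr = f(W2 @ f(W1 @ x))             # forward pass through the autoencoder
        return np.linalg.norm(x - xr)

    def is_anomalous(x, W1, W2, threshold=0.5):
        # threshold=0.5 is a hypothetical value, not from the slides
        return reconstruction_distance(x, W1, W2) > threshold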
We could also implement rule-based malicious packet detection (similar to SNORT) in a very compact circuit.
Example regex: user\s[^\n]{10}
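The slide's rule, taken verbatim, can be checked in software before mapping it to a circuit; the test string below is a made-up illustration.

    import re

    # "user", one whitespace, then exactly ten non-newline characters
    rule = re.compile(r"user\s[^\n]{10}")
    print(bool(rule.search("user AAAAAAAAAA trailing")))  # True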
Application: Autonomous Agent for UAVs
Cognitively Enhanced Complex Event Processing (CECEP) architecture, geared towards autonomous decision making in a variety of applications, including autonomous UAVs.
Exploring implementations on:
- IBM TrueNorth processor
- Memristor circuits
- High-performance clusters: CPUs and GPUs
[Figure: CECEP block diagram with event output streams and IO "adapters", mapped onto memristor systems.]
Parallel Cognitive Systems Laboratory
Research Engineers:
- Raqib Hasan, PhD
- Chris Yakopcic, PhD
- Wei Song, PhD
- Tanvir Atahary, PhD
Sponsors:
http://taha-lab.org
Doctoral Students:
- Zahangir Alom
- Rasitha Fernando
- Ted Josue
- Will Mitchell
- Yangjie Qi
- Nayim Rahman
- Stefan Westberg
- Ayesha Zaman
- Shuo Zhang
Master’s Students:
- Omar Faruq
- Tom Mealy