Feature extraction from deep models
Olgert Denas
Synopsis
Intro to deep models
- Neurons & Nets
- Learning & Depth
Feature extraction
- Theory
- 1 Layer
- Nets
Applications
- dimer
- G1E model
Neural computation
Inspired by organic neural systems
A system of simple computing units with learnable parameters
Intended also for conventional computing: efficient arithmetic and calculus
But von Neumann’s architecture “won”
Neural computation
Mainly used in machine learning
Declarative: can be stated unambiguously
- sort an array of integers
Procedural: can only be stated by examples
- find fraud in network logs
Artificial Neural Nets
Neurons
The artificial neuron is very different from the biological one; after all, it is a model
Neurons
Natural (organic):
- complicated transfer function
- mixed continuous/impulse communication
- state: chemical and physical changes
- synaptic delays, long axons
Artificial:
- parametric transfer function
- discrete or continuous
- no state: output is f(x; θ), fixed connections
- computational delays
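As a minimal sketch of the artificial side of this comparison: a stateless parametric unit computing f(x; θ) with θ = (w, b). The sigmoid nonlinearity and the toy numbers are illustrative assumptions; the slides do not fix a particular transfer function.

```python
import numpy as np

def neuron(x, w, b):
    """Artificial neuron: a parametric, stateless function f(x; theta).

    theta = (w, b); here f is a weighted sum passed through a sigmoid
    (the sigmoid is an illustrative choice, not fixed by the slides).
    """
    z = np.dot(w, x) + b             # weighted sum of the inputs
    return 1.0 / (1.0 + np.exp(-z))  # squashing nonlinearity

# Example: a 3-input neuron with made-up weights
x = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, -0.3, 0.8])
print(neuron(x, w, b=-0.2))
```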
Nets of neurons
Computers and brains
Brain vs. computer:
- Speed: ms / operation vs. ns / operation
- Size: tera nodes, peta connections vs. giga nodes
- Memory: content addressable, in connections vs. contiguous, random access
- Computing: distributed / fault tolerant vs. centralized / not fault tolerant
- Power: ~10 W vs. ~300 W (GPU)
Organic vs. artificial computer
ANN architectures
Feed-forward NNs (and CNNs)
Recurrent NNs
RBMs
Feed Forward
Directed Acyclic Graph
Input (first), hidden, and output (last) layers
Connections from one layer to the next
Transfer functions are nonlinearities
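A minimal numpy sketch of this forward pass; the tanh nonlinearity and the layer sizes are illustrative assumptions, not specified by the slides.

```python
import numpy as np

def forward(x, layers):
    """Feed-forward pass: each layer applies an affine map, then a nonlinearity."""
    a = x
    for W, b in layers:
        a = np.tanh(a @ W + b)   # tanh chosen as an example nonlinearity
    return a

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 2]             # input, two hidden layers, output (illustrative)
layers = [(rng.normal(size=(m, n)) * 0.1, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=4)
print(forward(x, layers))        # activation of the output layer
```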
Recurrent
Directed graph with cycles
Possibly with hidden layers
More complicated, realistic, and powerful
Well-suited to sequential input
Unroll the hidden state, just like DBNs
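A minimal sketch of unrolling the hidden state over a sequence; the tanh transition and the dimensions are illustrative assumptions.

```python
import numpy as np

def rnn_unroll(xs, W_xh, W_hh, b_h):
    """Unroll a recurrent net over a sequence: the hidden state carries context."""
    h = np.zeros(W_hh.shape[0])
    hidden_states = []
    for x in xs:                              # one step per sequence element
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)
        hidden_states.append(h)
    return hidden_states

rng = np.random.default_rng(1)
xs = rng.normal(size=(5, 3))                  # a length-5 sequence of 3-dim inputs
W_xh = rng.normal(size=(3, 4)) * 0.1
W_hh = rng.normal(size=(4, 4)) * 0.1
print(rnn_unroll(xs, W_xh, W_hh, np.zeros(4))[-1])   # final hidden state
```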
Restricted Boltzmann Machines
Probabilistic model (energy function)
A bipartite graph (visible <-> hidden)
Efficient inference
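A minimal sketch of the energy function and of why the bipartite structure makes inference efficient (hidden units are conditionally independent given the visible layer); binary units and the sizes are illustrative assumptions.

```python
import numpy as np

def rbm_energy(v, h, W, b_v, b_h):
    """Energy of a joint (visible, hidden) configuration:
    E(v, h) = -v.W.h - b_v.v - b_h.h;  P(v, h) is proportional to exp(-E)."""
    return -(v @ W @ h) - (b_v @ v) - (b_h @ h)

def p_hidden_given_visible(v, W, b_h):
    """Bipartite graph => hidden units are conditionally independent given v,
    so P(h_j = 1 | v) is a simple sigmoid of the visible activations."""
    return 1.0 / (1.0 + np.exp(-(v @ W + b_h)))

rng = np.random.default_rng(2)
v = rng.integers(0, 2, size=6).astype(float)   # 6 binary visible units
W = rng.normal(size=(6, 3)) * 0.1              # 3 hidden units
print(p_hidden_given_visible(v, W, np.zeros(3)))
```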
ANN: Learning
Learning: perceptron
Loop through labeled examples
- on incorrect output:
* case 0: w <- w + x
* case 1: w <- w - x
Guaranteed separating hyperplane
[Diagram: input units X1, X2 with weights W1, W2 feeding a single output unit]
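A minimal sketch of the update rule above, assuming case 0 means the output was 0 while the label was 1 and case 1 the reverse, and treating the bias as an extra always-on input.

```python
import numpy as np

def perceptron(X, y, epochs=20):
    """Perceptron rule from the slides: loop over labeled examples and, on an
    incorrect output, add x (predicted 0, wanted 1) or subtract x (predicted 1,
    wanted 0).  The bias is handled as an extra input fixed to 1."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for x, target in zip(Xb, y):
            pred = 1 if x @ w > 0 else 0
            if pred == 0 and target == 1:
                w = w + x
            elif pred == 1 and target == 0:
                w = w - x
    return w

# A linearly separable toy problem (OR): the perceptron finds a separating hyperplane.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1])
w = perceptron(X, y)
print(w, [(1 if np.append(x, 1) @ w > 0 else 0) for x in X])
```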
Learning: perceptron
Parity, or counting problem:
recognize binary strings of length 2 with exactly one 1
- red class: 01, 10
- green class: 00, 11
Many other problems
(Minsky & Papert 1969)
Learning: features
[Diagram: input units, one hidden unit, and an output unit with 0.5 thresholds; the slides step through the inputs 00, 11, 01, and 10]
00: no unit is activated, so the output is 0
11: the hidden unit cancels the inputs
01, 10: the inputs connect directly to the output
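A minimal sketch of such a network with hand-set weights on the length-2 parity problem; the particular weights and 0.5 thresholds are illustrative choices consistent with the slides, not taken from them.

```python
def step(z):
    """Threshold unit: fires (1) when its net input exceeds 0."""
    return 1 if z > 0 else 0

def parity_net(x1, x2):
    """One hidden unit solves the parity problem a single perceptron cannot:
    it fires only on input 11 and cancels the direct input-to-output connections."""
    h = step(x1 + x2 - 1.5)                  # hidden unit: an AND detector
    return step(x1 + x2 - 2.0 * h - 0.5)     # direct connections, cancelled by h

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", parity_net(x1, x2))   # 00 -> 0, 01 -> 1, 10 -> 1, 11 -> 0
```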
Learning: perceptron
The perceptron guarantees a separating hyperplane (SH) if one exists
Learning from raw input features requires a lot of “(big) data science”
Have the NN do the “(big) data science”!
Deep supervised learning paradigm
Map “raw” input into intermediate hidden layers
Deep means more layers: more efficient, but harder to train
Classify the hidden representation of the data
Learn the weights for both steps using backprop or pre-training
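A minimal numpy sketch of both steps trained jointly with backprop for a single hidden layer; the tanh hidden layer, softmax output, sizes, and learning rate are illustrative assumptions.

```python
import numpy as np

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def backprop_step(X, Y, W1, b1, W2, b2, lr=0.1):
    """One gradient step on both stages: raw input -> hidden representation -> class."""
    H = np.tanh(X @ W1 + b1)          # step 1: map raw input to a hidden representation
    P = softmax(H @ W2 + b2)          # step 2: classify the hidden representation
    G = (P - Y) / len(X)              # gradient of the mean cross-entropy wrt the logits
    dW2, db2 = H.T @ G, G.sum(axis=0)
    dH = (G @ W2.T) * (1.0 - H ** 2)  # backpropagate through the tanh layer
    dW1, db1 = X.T @ dH, dH.sum(axis=0)
    return W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2

# Toy usage: 10-dim "raw" inputs, 16 hidden units, 2 classes (all illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 10))
Y = np.eye(2)[rng.integers(0, 2, size=32)]        # one-hot labels
W1, b1 = rng.normal(size=(10, 16)) * 0.1, np.zeros(16)
W2, b2 = rng.normal(size=(16, 2)) * 0.1, np.zeros(2)
for _ in range(100):
    W1, b1, W2, b2 = backprop_step(X, Y, W1, b1, W2, b2)
```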
Feature extraction
Trained NNs can be used to predict, but they are black boxes
It is hard to relate large weights to input features
How do we map features from hidden layers back to the input space?
Learning W, b
Batch SGD
Early stopping, regularization, and a lot of tricks
Maximize the average of P(Y|X; θ) over the training data
I.e., find a θ with low cross-entropy (CE)
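A minimal sketch of batch SGD for the one-layer model of the next slide, minimizing the cross-entropy of P(Y|X; θ) on the training data with L2 regularization as one of the "tricks"; hyperparameters and the synthetic data are illustrative assumptions.

```python
import numpy as np

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy(W, b, X, y):
    """Mean negative log P(y | x; theta) for the one-layer model softmax(x W + b)."""
    P = softmax(X @ W + b)
    return -np.mean(np.log(P[np.arange(len(y)), y] + 1e-12))

def sgd_epoch(W, b, X, y, rng, lr=0.1, batch=32, l2=1e-3):
    """One epoch of minibatch SGD with weight decay (L2 regularization)."""
    idx = rng.permutation(len(X))
    for s in range(0, len(X), batch):
        i = idx[s:s + batch]
        P = softmax(X[i] @ W + b)
        G = P
        G[np.arange(len(i)), y[i]] -= 1.0          # gradient of CE wrt the logits
        G /= len(i)
        W = W - lr * (X[i].T @ G + l2 * W)
        b = b - lr * G.sum(axis=0)
    return W, b

# Toy usage on synthetic data (sizes and labels are made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
W, b = np.zeros((5, 2)), np.zeros(2)
for _ in range(20):        # early stopping on a validation set omitted for brevity
    W, b = sgd_epoch(W, b, X, y, rng)
print(cross_entropy(W, b, X, y))
```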
Feature extraction: 1 layer
P(Y | X; θ) = f(W Xᵀ + b)
Feature extraction: 1 layer
Given a trained model and a label, find an input:
* with that label
* that minimizes the gray area
[Plot: P(Y | E[X0]) with θ = {W, b}, c0 = fθ(E[X0]), and the class mean E[X0] marked]
Feature extraction: 1 layer
l: label
Xl: input features
E[Xl]: average input for that label
fθ(E[X]): decision boundary
cl = fθ(E[Xl]): constraint boundary
ε: slack (see below)
This is an LP!
Feature extraction on a stack
Feature extraction: ε
The slack variable ε controls the cross-entropy (CE) achieved by the extracted features
Useful if the average input achieves 0.01 CE but you are happy with 0.2
Linear programming (in 1 page)
Optimization problems that:
- minimize a linear cost function
- satisfy linear constraints
Very efficient for continuous variables (simplex)
Feature extraction: implementation
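The slides do not spell out the exact LP, so the following is a hedged sketch under stated assumptions: the constraint is placed on the linear score w_l·x + b_l (equivalent to constraining fθ when the one-layer f is monotone in that score), c_l is the score of the class mean E[Xl], ε is the slack, and the objective minimizes the L1 norm of x so that only features the model relies on stay away from zero. The function name extract_features and the toy model are hypothetical, for illustration only.

```python
import numpy as np
from scipy.optimize import linprog

def extract_features(W, b, label, x_mean, eps):
    """Hedged sketch of a 1-layer feature-extraction LP (assumptions in the text above)."""
    w_l = W[label]
    c_l = w_l @ x_mean + b[label]                 # constraint boundary from E[Xl]
    n = len(x_mean)
    # Split x into nonnegative parts, x = u - v, so the L1 objective is linear:
    # minimize sum(u + v) subject to w_l . (u - v) >= c_l - eps.
    cost = np.ones(2 * n)
    A_ub = np.concatenate([-w_l, w_l])[None, :]   # -w_l.(u - v) <= -(c_l - eps)
    b_ub = np.array([-(c_l - eps)])
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * n))
    u, v = res.x[:n], res.x[n:]
    return u - v                                  # extracted input-space features

# Toy usage with a made-up "trained" model (W, b) and class mean
rng = np.random.default_rng(5)
W, b = rng.normal(size=(2, 6)), np.zeros(2)
x_mean = rng.normal(size=6)
print(extract_features(W, b, label=0, x_mean=x_mean, eps=0.1))
```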
MNIST digits
28×28-pixel binarized handwritten digit images
Pick pairs of digits and extract differentiating features
Effect of ε on |Xl|
Effect of optimization
Features
Feature extraction: applications
Hematopoiesis & erythroid differentiation
Genes Dev. 8(10):1184-97, 1994
Genome Res. 21(10):1659-71, 2011