
SLIDE 1

Introduction to Artificial Intelligence (人工智能引论), Part 5

  • Intro. on Artificial Intelligence from the perspective of probability theory

罗智凌 (Luo Zhiling)
luozhiling@zju.edu.cn
College of Computer Science, Zhejiang University
http://www.bruceluo.net

SLIDE 2

Comparison of the two major branches

  Name      | Feed-Forward NN                        | (Stochastic) Recurrent NN
  Input     | Feature                                | Observation
  Output    | Ground truth                           | (Latent, Visible) variables
  Learning  | Supervised Learning                    | Unsupervised Learning
  Model     | Discriminative Model                   | Generative Model
  Strategy  | Loss on ground truth (diff or entropy) | Loss on observation (energy)
  Algorithm | Gradient Descent                       | (Variational) EM, Sampling
  Examples  | Perceptron, MLP, CNN                   | LSTM, Markov Field, RBM
  Hybrid    | DBN, GAN, pre-trained/two-phase learning, AutoEncoder (spans both branches)

SLIDE 3

[Figure: a taxonomy diagram relating the models covered: Feed-Forward NN (Perceptron, MLP, CNN, FRCNN), Recurrent NN (LSTM, bi-LSTM), Stochastic Model (RBM, Markov Net, Hopfield Net, LDA), and hybrids (AutoEncoder, DBN, GAN, word2vec)]

SLIDE 4

OUTLINE

  • Recurrent NN
    – Long Short-Term Memory
  • Stochastic Model in Neural Network
    – Hopfield Nets
    – Restricted Boltzmann Machine
    – Wake-Sleep Model
    – Echo-State Model
  • Hybrid Model
    – Deep Belief Network
    – AutoEncoder
    – Generative Adversarial Network


SLIDE 6

From multilayer perceptron (MLP) to Recurrent Neural Network to LSTM

  • Multi-Layer Perceptron (MLP) is by nature a feed-forward directed acyclic network.
  • An MLP consists of multiple layers and can map input data to output data via a set of nonlinear activation functions. MLP uses a supervised learning technique called backpropagation for training the network.
  • However, an MLP cannot learn mapping functions in which there are dependencies among the input data (i.e., sequential data).

[Figure: input → mapping → output]

SLIDE 7

From multilayer perceptron (MLP) to Recurrent Neural Network to LSTM

  • Recurrent Neural Network: an RNN has recurrent connections (connections to previous time steps of the same layer).
  • RNNs are powerful but can get extremely complicated. Computations derived from earlier input are fed back into the network, which gives an RNN a kind of memory.
  • Standard RNNs suffer from both exploding and vanishing gradients due to their iterative nature.

[Figure: sequence input (x0…xt) → mapping → embedding vector (ht)]
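For concreteness, a simple (Elman-style) form of this recurrence, written in notation assumed here since the slide shows only the diagram, is:

$$h_t = \tanh\left(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h\right)$$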

SLIDE 8

Recurrent Models of Visual Attention

Volodymyr Mnih, Nicolas Heess, Alex Graves, Koray Kavukcuoglu. Recurrent Models of Visual Attention. NIPS 2014.

SLIDE 9

  • Long Short-Term Memory (LSTM) Model:
    – LSTM is an RNN devised to deal with the exploding and vanishing gradient problems of standard RNNs.
    – An LSTM hidden layer consists of a set of recurrently connected blocks, known as memory cells.
    – Each memory cell is gated by three multiplicative units: the input, output, and forget gates.
    – The input to the cells is multiplied by the activation of the input gate, the output to the net is multiplied by the output gate, and the previous cell values are multiplied by the forget gate.

Sepp Hochreiter & Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, Vol. 9(8), pp. 1735-1780, MIT Press, 1997.

SLIDE 10

LSTM

[Figure: LSTM cell showing the cell state, hidden state, and input]

2 states, 3 gates, 4 layers:
  – states: cell state and hidden state
  – gates: forget, write (input), and read (output)
  – layers: 3 sigmoid and 1 tanh perceptron layers

SLIDE 11

Cell state and forget gate

[Figure: the cell state runs through time; a sigmoid function over the hidden state at t-1 and the input at t produces the forget signal]

Forget signal: 1 represents "completely keep this", 0 represents "completely forget this".
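In the now-standard notation (assumed here; the slide shows only the annotated diagram), the forget gate is:

$$f_t = \sigma\left(W_f\, [h_{t-1}, x_t] + b_f\right)$$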

SLIDE 12

Input (Write) gate

[Figure: a sigmoid over the hidden state at t-1 and the input at t produces the write signal; a tanh layer produces the content to write]

Write signal: 1 represents "completely write this", 0 represents "completely ignore this".
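In the same assumed notation, the write (input) gate and the candidate content to write are:

$$i_t = \sigma\left(W_i\, [h_{t-1}, x_t] + b_i\right), \qquad \tilde{C}_t = \tanh\left(W_C\, [h_{t-1}, x_t] + b_C\right)$$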

SLIDE 13

Update cell state

[Figure: the updated cell state combines the cell state at t-1, scaled by the forget signal, with the content to write, scaled by the write signal]
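In the same assumed notation, the cell-state update combines the two signals:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$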

SLIDE 14

Output (Read) gate

[Figure: a sigmoid over the hidden state at t-1 and the input at t produces the read signal, yielding the updated hidden state at t]

Read signal: 1 represents "completely read this", 0 represents "completely ignore this".
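In the same assumed notation, the read (output) gate and the updated hidden state are:

$$o_t = \sigma\left(W_o\, [h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t \odot \tanh(C_t)$$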

SLIDE 15

Language Translation

SLIDE 16

Stock Prediction

SLIDE 17

OUTLINE

  • Recurrent NN
    – Long Short-Term Memory
  • Stochastic Model in Neural Network
    – Hopfield Nets
    – Restricted Boltzmann Machine
    – Wake-Sleep Model
    – Echo-State Model
  • Hybrid Model
    – Deep Belief Network
    – AutoEncoder
    – Generative Adversarial Network

SLIDE 18

Stochastic NN

  • Energy-based probability distribution on (latent, visible) variables:

    $$p(x) = \frac{e^{-E(x)}}{Z}, \qquad Z = \sum_x e^{-E(x)}$$

  • where Z is called the partition function (配分函数).
  • Loss function (the negative log-likelihood over the training set D of size N, written with the free energy F):

    $$\mathcal{L}(\theta, \mathcal{D}) = -\frac{1}{N} \sum_{x^{(i)} \in \mathcal{D}} \log p(x^{(i)}) = \frac{1}{N} \sum_{x^{(i)} \in \mathcal{D}} F(x^{(i)}) + \log Z$$
SLIDE 19

Hopfield Nets

  • A Hopfield net is

composed of binary threshold units with recurrent connections between them.

SLIDE 20

The energy function

  • The global energy is the sum of many contributions. Each contribution depends on one connection weight and the binary states of two neurons:

    $$E = -\sum_i s_i b_i - \sum_{i<j} s_i s_j w_{ij}$$

  • This simple quadratic energy function makes it possible for each unit to compute locally how its state affects the global energy:

    $$\text{Energy gap} = \Delta E_i = E(s_i = 0) - E(s_i = 1) = b_i + \sum_j s_j w_{ij}$$
SLIDE 21

Settling to an energy minimum

  • To find an energy minimum in this net, start from a random state and then update units one at a time in random order.
    – Update each unit to whichever of its two states gives the lowest global energy.
    – i.e., use binary threshold units.

[Figure: example net with weights 3, 2, 3, 3 and -1, -4, -1; one unit marked "?" remains to be updated; E = goodness = 3]
SLIDE 23

Settling to an energy minimum

  • To find an energy minimum in this net, start from a random state and then update units one at a time in random order.
    – Update each unit to whichever of its two states gives the lowest global energy.
    – i.e., use binary threshold units.

[Figure: the same net after one more unit update; goodness improves from 3 to 4]
SLIDE 24

A deeper energy minimum

  • The net has two triangles in which the three units mostly support each other.
    – Each triangle mostly hates the other triangle.
  • The triangle on the left differs from the one on the right by having a weight of 2 where the other one has a weight of 3.
    – So turning on the units in the triangle on the right gives the deepest minimum.

[Figure: the same net with the right-hand triangle turned on; E = goodness = 5]
SLIDE 25

A neat way to make use of this type of computation

  • Hopfield (1982) proposed that memories could be energy minima of a neural net.
    – The binary threshold decision rule can then be used to "clean up" incomplete or corrupted memories.
  • The idea of memories as energy minima was proposed by I. A. Richards in 1924 in "Principles of Literary Criticism".
  • Using energy minima to represent memories gives a content-addressable memory:
    – An item can be accessed by just knowing part of its content.
    – It is robust against hardware damage.
    – It's like reconstructing a dinosaur from a few bones.

SLIDE 26

OUTLINE

  • Recurrent NN
    – Long Short-Term Memory
  • Stochastic Model in Neural Network
    – Hopfield Nets
    – Restricted Boltzmann Machine
    – Wake-Sleep Model
    – Echo-State Model
  • Hybrid Model
    – Deep Belief Network
    – AutoEncoder
    – Generative Adversarial Network

SLIDE 27

Boltzmann Machine

  • The figure shows a Boltzmann Machine (BM): the blue nodes form the hidden layer and the white nodes form the visible (input) layer.
  • Compared with a Hopfield Net, the parameters are not fixed, and the data enter as observations of v.
  • Compared with a recurrent neural network:
    – 1. An RNN in essence learns a function, so it has the notions of input and output layers; a BM instead learns an "internal representation" of a data set, so it has no notion of an output layer.
    – 2. The nodes of an RNN are linked in directed cycles, while the nodes of a BM are connected as an undirected complete graph.

SLIDE 28

Restricted Boltzmann Machine

  • Compared with a Boltzmann Machine, a Restricted Boltzmann Machine mainly adds a "restriction".
  • The restriction is that the complete graph becomes a bipartite graph. As shown in the figure, this RBM consists of 4 visible nodes and 3 hidden nodes.
  • An RBM can be used for dimensionality reduction (make the hidden layer smaller), feature learning (the hidden-layer outputs are the features), Deep Belief Networks (stacks of RBMs), and so on.
  • Energy function (see the reconstruction below)
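The formula itself was an image that did not survive extraction; the standard RBM energy (with visible biases a_i, hidden biases b_j, and weights w_ij; the notation is assumed, not from the slide) is:

$$E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i w_{ij} h_j$$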
SLIDE 29

Restricted Boltzmann Machines

  • About RBM:
    – Only one layer of hidden units.
    – No connections between hidden units.
  • In an RBM it only takes one step to reach thermal equilibrium when the visible units are clamped.
    – So we can quickly get the exact value of $\langle v_i h_j \rangle_v$:

    $$p(h_j = 1) = \frac{1}{1 + e^{-\left(b_j + \sum_{i \in \mathrm{vis}} v_i w_{ij}\right)}}$$

[Figure: bipartite graph with a hidden unit j above a visible unit i]

SLIDE 30

Restricted Boltzmann Machine

  • Joint Distribution
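The joint-distribution formula on this slide was an image; the standard form for an RBM (a model fact; Z is the partition function defined on slide 18) is:

$$p(v, h) = \frac{e^{-E(v, h)}}{Z}, \qquad Z = \sum_{v, h} e^{-E(v, h)}, \qquad p(v) = \frac{1}{Z} \sum_h e^{-E(v, h)}$$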
SLIDE 31

KL divergence

  • Kullback-Leibler divergence
  • A difference of entropies (cross-entropy minus entropy)
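The defining formula was lost with the slide image; the standard definition for discrete distributions P and Q is:

$$D_{\mathrm{KL}}(P \parallel Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = -\sum_x P(x) \log Q(x) - H(P)$$

i.e., the cross-entropy of P relative to Q minus the entropy H(P), which is the "difference of entropies" the slide refers to.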
SLIDE 32

Contrastive Divergence

  • Using Gibbs sampling during training: first map the visible vector to the hidden units, then reconstruct the visible vector from the hidden units, then map the visible vector to the hidden units again... and keep repeating these steps.

SLIDE 33

A picture of an inefficient version of the Boltzmann machine learning algorithm for an RBM

[Figure: an alternating Gibbs chain over visible units i and hidden units j, from t = 0 (the data) through t = 1, t = 2, ... to t = infinity (a "fantasy"), collecting the statistics <v_i h_j>^0 and <v_i h_j>^infinity]

    $$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty \right)$$

Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.

SLIDE 34

Contrastive divergence: a very surprising short-cut

[Figure: one step of the same chain over visible units i and hidden units j, from the data at t = 0 to the reconstruction at t = 1, collecting <v_i h_j>^0 and <v_i h_j>^1]

    $$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1 \right)$$

Start with a training vector on the visible units. Update all the hidden units in parallel. Update all the visible units in parallel to get a "reconstruction". Update the hidden units again. This is not following the gradient of the log likelihood, but it works well.

SLIDE 35

How to learn a set of features that are good for reconstructing images of the digit 2

[Figure: 50 binary neurons that learn features, connected to a 16 x 16 pixel image; the data (reality) on one side, the reconstruction (better than reality) on the other]

Increment the weights between an active pixel and an active feature when driven by the data; decrement the weights between an active pixel and an active feature when driven by the reconstruction.

SLIDE 36

The weights of the 50 feature detectors

We start with small random weights to break symmetry

SLIDE 37-44

[Figures: the weights of the 50 feature detectors, shown after successive sweeps of training]

SLIDE 45

The final 50 x 256 weights: Each neuron grabs a different feature

SLIDE 46

How well can we reconstruct digit images from the binary feature activations?

[Figure: Data vs. Reconstruction from activated binary features, for a new test image from the digit class that the model was trained on and for an image from an unfamiliar digit class; the network tries to see every image as a 2]

SLIDE 47

Wake-Sleep Algorithm

  • Wake phase:
    – Discriminative procedure
    – Positive phase
    – Decreasing the free energy on the observations.
  • Sleep phase:
    – Generative procedure
    – Negative phase
    – Decreasing the partition function (raising the energy of samples the model "dreams" up).

[Figure: a dream image]

SLIDE 48

Echo State Network

  • The main idea is:
    – (i) to drive a random, large, fixed recurrent neural network with the input signal, thereby inducing in each neuron within this "reservoir" network a nonlinear response signal, and
    – (ii) to combine these response signals into a desired output signal by a trainable linear combination.
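A compact NumPy sketch of these two steps (the reservoir size, spectral radius, ridge-regression readout, and all names are illustrative assumptions):

```python
import numpy as np

def train_esn(inputs, targets, n_reservoir=200, rho=0.9, ridge=1e-6,
              rng=np.random.default_rng(0)):
    """Echo State Network: fixed random reservoir + trained linear readout.

    inputs: (T, n_in) input sequence; targets: (T, n_out) desired outputs.
    """
    n_in = inputs.shape[1]
    W_in = rng.uniform(-0.5, 0.5, (n_reservoir, n_in))
    W = rng.uniform(-0.5, 0.5, (n_reservoir, n_reservoir))
    W *= rho / max(abs(np.linalg.eigvals(W)))    # set the spectral radius

    # (i) Drive the fixed reservoir with the input signal.
    states = np.zeros((len(inputs), n_reservoir))
    x = np.zeros(n_reservoir)
    for t, u in enumerate(inputs):
        x = np.tanh(W_in @ u + W @ x)            # nonlinear response signals
        states[t] = x

    # (ii) Train only the linear readout, here by ridge regression.
    A = states.T @ states + ridge * np.eye(n_reservoir)
    W_out = np.linalg.solve(A, states.T @ targets)
    return W_in, W, W_out                        # readout: y_t = W_out.T @ x_t
```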

SLIDE 49

OUTLINE

  • Recurrent NN
    – Long Short-Term Memory
  • Stochastic Model in Neural Network
    – Hopfield Nets
    – Restricted Boltzmann Machine
    – Wake-Sleep Model
    – Echo-State Model
  • Hybrid Model
    – Deep Belief Network
    – AutoEncoder
    – Generative Adversarial Network

SLIDE 50

Deep Belief Network

  • A DBN is a probabilistic generative model.
  • Its basic component is the RBM (Restricted Boltzmann Machine).
SLIDE 51

DBN on small data

SLIDE 52

Combining two RBMs to make a DBN

[Figure: train the bottom RBM (v to h1, weights W1) first; then copy the binary state of h1 for each v and train the second RBM (h1 to h2, weights W2) on those copies; finally compose the two RBM models to make a single DBN model. It's not a Boltzmann machine!]

SLIDE 53

The generative model after learning 3 layers

To generate data:
  1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling for a long time.
  2. Perform a top-down pass to get states for all the other layers.

The lower-level bottom-up connections are not part of the generative model. They are just used for inference.

[Figure: a stack of layers data, h1, h2, h3 connected by weights W1, W2, W3]
slide-54
SLIDE 54

人工智能引论 2018 罗智凌

The DBN used for modeling the joint distribution

  • f MNIST digits and their labels

2000 units 500 units 500 units

28 x 28 pixel image

10 labels

  • The first two hidden layers

are learned without using labels.

  • The top layer is learned as

an RBM for modeling the labels concatenated with the features in the second hidden layer.

  • The weights are then fine-

tuned to be a better generative model using contrastive wake-sleep.

SLIDE 55

OUTLINE

  • Recurrent NN
    – Long Short-Term Memory
  • Stochastic Model in Neural Network
    – Hopfield Nets
    – Restricted Boltzmann Machine
    – Wake-Sleep Model
    – Echo-State Model
  • Hybrid Model
    – Deep Belief Network
    – AutoEncoder
    – Generative Adversarial Network

SLIDE 56

Unsupervised Learning

[Figure: input feeds a predictor; an error signal compares the prediction against... what? (Ranzato)]

SLIDE 57

SPARSE AUTO-ENCODERS

[Figure: the input feeds an encoder producing a code; a decoder maps the code back to a prediction, compared with the input by an error signal (Ranzato)]

  • input: $X$; code: $h = W^{\top} X$
  • loss: $L(X; W) = \lVert W h - X \rVert^2 + \lambda \sum_j \lvert h_j \rvert$ (reconstruction error plus a sparsity penalty; the weight $\lambda$ is implicit on the slide)

Le et al., "ICA with reconstruction cost..", NIPS 2011

SLIDE 58

The first really successful deep autoencoders

  • Images in 28 x 28.
  • Train a stack of 4 RBMs and then "unroll" them.
  • Fine-tune with gentle backprop.
SLIDE 59

How to find documents that are similar to a query document

  • Convert each document into a "bag of words" (BOW).
    – This is a vector of word counts, ignoring order.
    – Ignore stop words (like "the" or "over").
  • We could compare the word counts of the query document and millions of other documents, but this is slow.
    – So we reduce each query vector to a much smaller vector that still contains most of the information about the content of the doc.
slide-60
SLIDE 60

人工智能引论 2018 罗智凌

How to compress the count vector

  • We train the neural network

to reproduce its input vector as its output

  • This forces it to compress as

much information as possible into the 10 numbers in the central bottleneck.

  • These 10 numbers are then a

good way to compare documents.
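A minimal Keras sketch of such a bottleneck autoencoder (the layer sizes around the 10-unit bottleneck, the log(1+count) preprocessing from the next slide, and all names are assumptions; this uses modern tensorflow.keras rather than the Python 2.7 setup suggested at the end of the deck):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_words = 2000  # size of the word-count vector (assumed)

# Encoder - bottleneck - decoder: reproduce the input at the output.
autoencoder = keras.Sequential([
    layers.Input(shape=(n_words,)),
    layers.Dense(500, activation="relu"),
    layers.Dense(250, activation="relu"),
    layers.Dense(10, activation="linear", name="bottleneck"),  # the 10 numbers
    layers.Dense(250, activation="relu"),
    layers.Dense(500, activation="relu"),
    layers.Dense(n_words, activation="linear"),
])
autoencoder.compile(optimizer="adam", loss="mse")

# counts: (n_docs, n_words) word counts; train on log(1 + count).
counts = np.random.poisson(0.1, size=(1000, n_words)).astype("float32")
x = np.log1p(counts)
autoencoder.fit(x, x, epochs=5, batch_size=64)

# Compare documents via their 10-dimensional codes.
encoder = keras.Model(autoencoder.input,
                      autoencoder.get_layer("bottleneck").output)
codes = encoder.predict(x)
```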

SLIDE 61

First compress all documents to 2 numbers using PCA on log(1+count). Then use different colors for different categories.

SLIDE 62

First compress all documents to 2 numbers using a deep autoencoder. Then use different colors for different document categories.

SLIDE 63

Using a deep autoencoder as a hash function (semantic hashing) for finding approximate matches

[Figure: a document passes through the autoencoder to produce a binary code (the hash function); nearby codes correspond to similar documents ("supermarket search")]

SLIDE 64

Binary codes for image retrieval

  • Image retrieval is typically done by using the captions.

Why not use the images too?

– Pixels are not like words: individual pixels do not tell us much about the content.

  • Maybe we should extract a real-valued vector that has

information about the content?

– Matching real-valued vectors in a big database is slow and requires a lot of storage.

  • Short binary codes are very easy to store and match.
SLIDE 65

A two-stage method

  • First, use semantic hashing with 28-bit binary codes to get a long "shortlist" of promising images.
  • Then use 256-bit binary codes to do a serial search for good matches.
    – This only requires a few words of storage per image, and the serial search can be done using fast bit-operations (see the sketch after this list).
  • But how good are the 256-bit binary codes?
    – Do they find images that we think are similar?
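A toy Python sketch of that serial search with bit operations (packing the 256 bits into 32 bytes; the names and sizes are illustrative assumptions):

```python
import numpy as np

def hamming_search(query_code, codes, k=10):
    """Serial search over packed 256-bit codes using fast bit operations.

    query_code: (32,) uint8 array (256 bits packed into 32 bytes).
    codes: (n_images, 32) uint8 array of database codes.
    Returns the indices of the k nearest codes by Hamming distance.
    """
    # XOR leaves a 1 wherever bits disagree; counting the ones (popcount)
    # gives the Hamming distance.
    diff = np.bitwise_xor(codes, query_code)
    dists = np.unpackbits(diff, axis=1).sum(axis=1)
    return np.argsort(dists)[:k]

# Usage: one query against a million random 256-bit codes.
codes = np.random.randint(0, 256, size=(1_000_000, 32), dtype=np.uint8)
query = np.random.randint(0, 256, size=32, dtype=np.uint8)
top10 = hamming_search(query, codes)
```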

SLIDE 66

Krizhevsky's deep autoencoder

[Figure: encoder layer sizes running 8192, 4096, 2048, 1024, 512 down to a 256-bit binary code, from a 32 x 32 color image input (three 1024-pixel channels)]

The encoder has about 67,000,000 parameters. There is no theory to justify this architecture. It takes a few days on a GTX 285 GPU to train on two million images.

SLIDE 67

Reconstructions of 32x32 color images from 256-bit codes

SLIDE 68

[Figure: retrievals using 256-bit codes vs. retrievals using Euclidean distance in pixel-intensity space]

SLIDE 69

[Figure: retrievals using 256-bit codes vs. retrievals using Euclidean distance in pixel-intensity space]

SLIDE 70

[Figure: the leftmost column is the search image; the other columns are the images that have the most similar feature activities in the last hidden layer]

SLIDE 71

OUTLINE

  • Recurrent NN
    – Long Short-Term Memory
  • Stochastic Model in Neural Network
    – Hopfield Nets
    – Restricted Boltzmann Machine
    – Wake-Sleep Model
    – Echo-State Model
  • Hybrid Model
    – Deep Belief Network
    – AutoEncoder
    – Generative Adversarial Network

SLIDE 72

To evaluate a generative model

  • Given the original observations, how do we decide that a generator is good enough?
  • By log likelihood
    – The probability of generating the observations
  • By KL-divergence
  • By contrastive divergence
    – The difference between the observations and the generated observations
  • By a discriminative model?
SLIDE 73

Generated Samples

SLIDE 74

Generated Samples

SLIDE 75

Min Max on expectation

  • Minimize over G the objective that D maximizes:

    $$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

    The first term is the log likelihood on real samples; the second is the log likelihood on fake samples drawn from the noise prior $p_z$.

SLIDE 76-77

[Figures]

SLIDE 78

Deep Learning

  • Welcome to the frontier of artificial intelligence. Before you can make your mark, you need the following:
  • 1. A server with a GPU
  • 2. A programming language you are fluent in
  • 3. The heart of an Engineer/Scientist
  • Suggested setup:
    – Desktop: i7 + enough RAM + SSD + NVidia TitanX
    – Software: Ubuntu/CentOS + Python 2.7 + Tensorflow / Keras
SLIDE 79

罗智凌 (Luo Zhiling)

luozhiling@zju.edu.cn
http://www.bruceluo.net