SLIDE 1

Introduction to Artificial Intelligence (Part 3)

  • Intro. on Artificial Intelligence from the perspective of probability theory

罗智凌

luozhiling@zju.edu.cn College of Computer Science Zhejiang University http://www.bruceluo.net

SLIDE 2

OUTLINE

  • Strategies
  • Algorithms
  • Applications
SLIDE 3

Strategies

  • Loss in the objective function (see the code sketch after this list)
  • 0-1 loss
  • Quadratic loss
  • Absolute loss
  • Logarithmic loss (log-likelihood loss)
    – MLE
    – MAP
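
The four losses above can be written as one-line functions; a minimal sketch (illustrative signatures, not code from the lecture):

```python
import numpy as np

def zero_one_loss(y, f):
    """0-1 loss: 1 if the prediction is wrong, else 0."""
    return float(y != f)

def quadratic_loss(y, f):
    """Quadratic (squared) loss."""
    return (y - f) ** 2

def absolute_loss(y, f):
    """Absolute loss."""
    return abs(y - f)

def log_loss(p_y):
    """Logarithmic (log-likelihood) loss: p_y is the probability
    the model assigns to the observed outcome y."""
    return -np.log(p_y)
```

MLE and MAP, developed on the following slides, both minimize a sum of logarithmic losses; MAP adds a prior term.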

SLIDE 4

Generative/Discriminative Model

  • Generating Procedure (see the code sketch below):
  • P5 ~ b(β)
  • P30 ~ b(γ)
  • θ ~ Multi(P5, P30, δ)
  • G ~ b(θ)

P5   P30   G    Prob
Y    Y     Y    0.173
Y    Y     N    0.075
Y    N     Y    0.116
Y    N     N    0.121
N    Y     Y    0.075
N    Y     N    0.127
N    N     Y    0.179
N    N     N    0.133

[Graphical model: β → P5, γ → P30; P5, P30, δ → θ; θ → G]

Joint distribution: $P(P_5, P_{30}, G, \beta, \gamma, \delta)$
Discriminative model: $P(G \mid P_5, P_{30}, \beta, \gamma, \delta)$
Generative model: $P(G, P_5, P_{30} \mid \beta, \gamma, \delta)$
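
A minimal sketch of the generating procedure above, assuming b(·) is Bernoulli; the δ table and parameter values are made up, and the Multi draw for θ is collapsed to a deterministic lookup for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

beta, gamma = 0.49, 0.45  # hypothetical Bernoulli parameters
# Hypothetical delta: maps each (P5, P30) outcome to a success
# probability for theta; the real delta is not given on the slide.
delta = {(1, 1): 0.7, (1, 0): 0.5, (0, 1): 0.4, (0, 0): 0.6}

def sample():
    p5 = int(rng.random() < beta)    # P5  ~ b(beta)
    p30 = int(rng.random() < gamma)  # P30 ~ b(gamma)
    theta = delta[(p5, p30)]         # stand-in for theta ~ Multi(P5, P30, delta)
    g = int(rng.random() < theta)    # G ~ b(theta)
    return p5, p30, g

draws = [sample() for _ in range(100_000)]
```

Counting the eight (P5, P30, G) outcomes in `draws` and dividing by the number of draws approximates a joint probability table like the one above.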

SLIDE 5

Maximum Likelihood Estimation

  • $\arg\max_{\beta,\gamma,\delta} P(G, P_5, P_{30} \mid \beta, \gamma, \delta)$
  • $P(G, P_5, P_{30} \mid \beta, \gamma, \delta) = \int P(P_5 \mid \beta)\, P(P_{30} \mid \gamma)\, P(G \mid \theta)\, P(\theta \mid P_5, P_{30}, \delta)\, d\theta$
  • $\Rightarrow \arg\min_{\beta,\gamma,\delta} -\log \int P(P_5 \mid \beta)\, P(P_{30} \mid \gamma)\, P(G \mid \theta)\, P(\theta \mid P_5, P_{30}, \delta)\, d\theta$ (see the sketch below)
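
A generic illustration of the final step, turning maximum likelihood into "minimize the negative log likelihood", on a single Bernoulli variable with hypothetical data (not the full P5/P30/G model):

```python
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # hypothetical observations

def neg_log_likelihood(beta):
    # -log P(x | beta) for x_i ~ Bernoulli(beta)
    return -np.sum(x * np.log(beta) + (1 - x) * np.log(1 - beta))

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)  # approx. 0.75, the sample mean, as the Bernoulli MLE predicts
```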
SLIDE 6

Maximum A Posteriori (MAP)

$$P(\beta, \gamma, \delta \mid P_5, P_{30}, G) = \frac{P(P_5, P_{30}, G, \beta, \gamma, \delta)}{P(P_5, P_{30}, G)} = \frac{\int P(P_5 \mid \beta)\, P(P_{30} \mid \gamma)\, P(G \mid \theta)\, P(\theta \mid P_5, P_{30}, \delta)\, d\theta \cdot Q(\beta)\, Q(\gamma)\, Q(\delta)}{P(P_5, P_{30}, G)}$$

The numerator is a function of $\beta, \gamma, \delta$ and can be written as $f(\beta, \gamma, \delta)$ or $f(\beta, \gamma, \delta; P_5, P_{30}, G)$; the denominator $P(P_5, P_{30}, G)$ does not depend on the parameters.

SLIDE 7

Maximum A Posteriori (MAP)

  • Solve as: $\beta^*, \gamma^*, \delta^* = \arg\min_{\beta, \gamma, \delta}\, l$
    – Solved with existing methods and tools (MATLAB, Python) such as stochastic gradient descent and hill climbing; see the sketch below.
  • Log-likelihood loss: $-\log f(\beta, \gamma, \delta)$
  • Regularization (optional): $\lambda\, (\|\beta\| + \|\gamma\| + \|\delta\|)$
  • Loss function (objective function):

$$l = -\log f(\beta, \gamma, \delta) + \lambda\, (\|\beta\| + \|\gamma\| + \|\delta\|)$$
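
A sketch of minimizing such a regularized negative log likelihood with an off-the-shelf optimizer, as the slide suggests. Logistic regression stands in for $-\log f$ here, and the data and λ are made up:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                       # hypothetical inputs
y = (X @ np.array([1.5, -2.0]) > 0).astype(float)   # hypothetical labels

lam = 0.1  # regularization weight (lambda in the slide)

def loss(w):
    # negative log likelihood of a logistic model + L2 penalty
    p = np.clip(1.0 / (1.0 + np.exp(-(X @ w))), 1e-12, 1 - 1e-12)
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return nll + lam * np.sum(w ** 2)

w_star = minimize(loss, x0=np.zeros(2)).x  # BFGS by default
print(w_star)
```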

SLIDE 8

MLE vs MAP

  • MLE: $\arg\max_{\beta,\gamma,\delta} \int P(P_5 \mid \beta)\, P(P_{30} \mid \gamma)\, P(G \mid \theta)\, P(\theta \mid P_5, P_{30}, \delta)\, d\theta$
  • MAP: $\arg\max_{\beta,\gamma,\delta} \int P(P_5 \mid \beta)\, P(P_{30} \mid \gamma)\, P(G \mid \theta)\, P(\theta \mid P_5, P_{30}, \delta)\, d\theta \cdot Q(\beta)\, Q(\gamma)\, Q(\delta)$

The extra factors $Q(\beta)\, Q(\gamma)\, Q(\delta)$ are the priors on the parameters.
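
Taking negative logarithms makes the comparison explicit: MAP is MLE plus negative log-prior terms. With the shorthand $\Theta = (\beta, \gamma, \delta)$ and $D = (P_5, P_{30}, G)$:

$$\hat{\Theta}_{\mathrm{MLE}} = \arg\min_{\Theta}\; -\log P(D \mid \Theta), \qquad \hat{\Theta}_{\mathrm{MAP}} = \arg\min_{\Theta}\; \big[ -\log P(D \mid \Theta) - \log Q(\beta) - \log Q(\gamma) - \log Q(\delta) \big]$$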

SLIDE 9

Understand LDA with MLE

SLIDE 10

Generative vs Discriminative Models

SLIDE 11

OUTLINE

  • Strategies
  • Algorithms
    – Gradient Descent (GD)
    – EM algorithm
    – Sampling algorithms
  • Applications
SLIDE 12

Gradient Descent
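
The slide's figures are not preserved here; as a minimal generic sketch of the update rule $w \leftarrow w - \eta\, \nabla f(w)$:

```python
# Gradient descent on f(w) = (w - 3)^2 (illustrative objective).
w, lr = 0.0, 0.1
for _ in range(100):
    grad = 2 * (w - 3)  # f'(w)
    w -= lr * grad
print(w)  # converges toward the minimizer w = 3
```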

SLIDE 13

Batch/Stochastic gradient
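
Again the slide's figures are lost; a sketch of the contrast on least-squares regression with made-up data. Batch gradient descent computes each step's gradient on the full data set, while stochastic gradient descent uses one randomly chosen example per step:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

def grad(w, Xb, yb):
    # gradient of the mean squared error on the (mini-)batch (Xb, yb)
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

w_batch = np.zeros(3)
for _ in range(500):                 # batch: full data each step
    w_batch -= 0.1 * grad(w_batch, X, y)

w_sgd = np.zeros(3)
for _ in range(5000):                # stochastic: one example per step
    i = rng.integers(len(y))
    w_sgd -= 0.01 * grad(w_sgd, X[i:i+1], y[i:i+1])
```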

SLIDE 14

Advanced Variants

  • Momentum SGD
  • Adagrad
    – Large learning rates for infrequently updated parameters, small ones for frequently updated parameters.
  • Adadelta
    – An improvement on Adagrad: replaces the global sum of squared gradients with a local (windowed) sum.
  • Adam
    – Similar to Adagrad, but adds moment estimates of the gradient, making it more stable.
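
Schematic update rules for the variants above, in single-parameter form; hyper-parameter values are illustrative:

```python
import numpy as np

lr, rho, eps = 0.01, 0.9, 1e-8  # illustrative hyper-parameters

def momentum_step(w, g, v):
    """Momentum SGD: accumulate a velocity and step along it."""
    v = rho * v - lr * g
    return w + v, v

def adagrad_step(w, g, acc):
    """Adagrad: the effective rate lr / sqrt(acc) shrinks for parameters
    with a large accumulated squared gradient (frequently updated ones)."""
    acc = acc + g ** 2
    return w - lr * g / (np.sqrt(acc) + eps), acc

def adam_step(w, g, m, s, t, b1=0.9, b2=0.999):
    """Adam: exponentially decayed first (m) and second (s) moment
    estimates of the gradient, with bias correction at step t."""
    m = b1 * m + (1 - b1) * g
    s = b2 * s + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    s_hat = s / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(s_hat) + eps), m, s
```

Adadelta replaces Adagrad's global accumulator `acc` with an exponentially decayed (local) average, which is the improvement the slide describes.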

SLIDE 15

Expectation–Maximization algorithm

Given the statistical model which generates a set X of observed data, a set of unobserved latent data or missing values Z, and a vector of unknown parameters θ, along with a likelihood function $L(\theta; X, Z) = p(X, Z \mid \theta)$, the maximum likelihood estimate (MLE) of the unknown parameters is determined by maximizing the marginal likelihood of the observed data, $L(\theta; X) = p(X \mid \theta) = \int p(X, Z \mid \theta)\, dZ$. The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying these two steps:

  • Expectation step (E step): Calculate the expected value of the log likelihood function with respect to the conditional distribution of Z given X under the current estimate of the parameters $\theta^{(t)}$: $Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}}\big[\log L(\theta; X, Z)\big]$
  • Maximization step (M step): Find the parameters that maximize this quantity: $\theta^{(t+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(t)})$
SLIDE 16

Sampling

  • Conjugate distribution based sampling
    – Choose the prior according to the conditional probability (likelihood) function so that the posterior keeps the same functional form as the prior.
  • 1. The observation is a stochastic variable x following a distribution φ with parameters μ.
  • 2. The parameter μ has a known prior distribution f with hyper-parameter ω.
  • 3. The pair of φ and f forms one of the existing conjugate pairs. For example, φ is a normal distribution and the prior f on its expectation is also a normal distribution.
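
A concrete instance of this recipe is the Beta-Bernoulli pair, where the posterior update is just two additions; the data and hyper-parameters below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

a, b = 2.0, 2.0                 # hypothetical Beta(a, b) prior hyper-parameters
x = rng.random(100) < 0.3       # hypothetical Bernoulli(0.3) observations

# Beta is conjugate to the Bernoulli likelihood: the posterior is Beta again.
a_post = a + x.sum()            # successes update a
b_post = b + (~x).sum()         # failures update b

# Sampling the parameter from its posterior is now a single call.
theta_samples = rng.beta(a_post, b_post, size=1000)
```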

SLIDE 17

Discrete distributions

SLIDE 18

Conjugate Priors

SLIDE 19

Conjugate priors

SLIDE 20

Gibbs Sampling
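
The slide's content is not preserved here; as a generic illustration of the algorithm, a minimal Gibbs sampler for a standard bivariate normal with correlation ρ, alternately drawing each coordinate from its full conditional:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                        # illustrative correlation
sd = np.sqrt(1 - rho ** 2)       # conditional standard deviation

x, y, samples = 0.0, 0.0, []
for _ in range(10_000):
    x = rng.normal(rho * y, sd)  # x | y ~ N(rho*y, 1 - rho^2)
    y = rng.normal(rho * x, sd)  # y | x ~ N(rho*x, 1 - rho^2)
    samples.append((x, y))
```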

SLIDE 21

OUTLINE

  • About AI
  • Preliminaries on Bayesian methods
  • Generative/Discriminative Model
  • Applications
    – Markov Model
    – Markov Network
    – Neural Network

SLIDE 22

Markov Rule

  • A discrete-time Markov chain is a sequence of random variables X1, X2, X3, ... with the Markov property: the probability of moving to the next state depends only on the present state, not on the previous states.
  • First-order Markov and p-order Markov
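
In symbols, the first-order Markov property and its p-order generalization:

$$P(X_{n+1} \mid X_1, X_2, \ldots, X_n) = P(X_{n+1} \mid X_n)$$

$$P(X_{n+1} \mid X_1, X_2, \ldots, X_n) = P(X_{n+1} \mid X_{n-p+1}, \ldots, X_n)$$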
SLIDE 23

Random Field

  • Markov Random Field
  • Gibbs Random Field
  • Conditional Random Field
  • Gaussian Random Field
SLIDE 24

Markov Network

  • Markov Chain

[Diagrams: the Bayesian network over P5, P30, G, θ with parameters β, γ, and a Markov-chain view of P5, P30, G, θ]

SLIDE 25

Hidden Markov Model

  • Markov chain rule: $P(X_1, \ldots, X_n) = P(X_1)\, \prod_{t=2}^{n} P(X_t \mid X_{t-1})$
  • Speech recognition
  • Gesture and handwriting recognition
  • Fault detection
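
For reference, the joint distribution that an HMM defines over hidden states $Z_{1:T}$ and observations $X_{1:T}$ factorizes as (standard result; the slide's own formula is not preserved):

$$P(X_{1:T}, Z_{1:T}) = P(Z_1)\, \prod_{t=2}^{T} P(Z_t \mid Z_{t-1})\, \prod_{t=1}^{T} P(X_t \mid Z_t)$$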
SLIDE 26

Markov Network

  • Hidden Markov Model

[Diagram: the P5, P30, G, θ example redrawn as a hidden Markov model]

SLIDE 27

Markov Random Field

  • Information coding
  • Population simulation models

[Diagram: the P5, P30, G example as an undirected Markov random field]

SLIDE 28

Neural Network

  • Latent variable -> hidden layer
  • Automatic feature mixing (non-linear mixing)
  • Classification / regression

[Diagram: the P5, P30, G network with latent θ nodes forming a hidden layer]

SLIDE 29

Mixtures of Gaussians

SLIDE 30

Gaussian mixture distribution

  • Definition: $p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$
  • Introduce a K-dimensional binary latent random variable $z = (z_1, z_2, \ldots, z_K)^T$ with exactly one $z_k = 1$ and $p(z_k = 1) = \pi_k$
  • If $z_k = 1$, then $p(x \mid z) = \mathcal{N}(x \mid \mu_k, \Sigma_k)$
  • Equivalent formulation of the Gaussian mixture: $p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$

Key terms: latent variable $z$; responsibility $\gamma(z_k)$ (next slide).

SLIDE 31

Gaussian mixture distribution

By Bayes' rule, the responsibility of component k for an observation x is

$$\gamma(z_k) \equiv p(z_k = 1 \mid x) = \frac{p(z_k = 1)\, p(x \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(x \mid z_j = 1)} = \frac{\pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j)}$$

SLIDE 32

Gaussian mixture distribution

SLIDE 33

The difficulty of estimating parameters in GMM by ML

  • The log of the likelihood function of a GMM: $\log p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \log \left( \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right)$
  • Issue #1: singularities
    – A component can collapse onto a specific data point, driving its variance to zero and the likelihood to infinity.
  • Issue #2: identifiability
    – A total of K! equivalent solutions, one per permutation of the component labels.
  • Issue #3: no closed-form solution
    – The derivatives of the log likelihood are complex, and setting them to zero yields no closed-form solution.

SLIDE 34

Expectation-Maximization algorithm for GMM

  • E step: evaluate the responsibilities with the current parameters: $\gamma(z_{nk}) = \dfrac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$
  • M step: re-estimate the parameters, with $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$ acting as a weighting factor:
    – Solve $\mu_k$: $\mu_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n$
    – Solve $\Sigma_k$: $\Sigma_k = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, (x_n - \mu_k)(x_n - \mu_k)^T$
    – Solve $\pi_k$: $\pi_k = \frac{N_k}{N}$

Each iteration will increase the log likelihood function.
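
A compact EM loop implementing exactly these E and M steps; the initialization scheme is illustrative, and `scipy` supplies the Gaussian density:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]      # init means at random data points
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])

    for _ in range(iters):
        # E step: responsibilities gamma[n, k] = p(z_k = 1 | x_n)
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], sigma[k])
                                for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M step: N_k is the weighting factor from the slide
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - mu[k]
            sigma[k] = (gamma[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return pi, mu, sigma, gamma
```

Each pass increases (or leaves unchanged) the log likelihood, which is the guarantee stated above.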

SLIDE 35

Expectation-Maximization algorithm for GMM

SLIDE 36

EM algorithm for GMM: experiment

  • The Old Faithful data set:
SLIDE 37

EM algorithm for GMM: experiment

  • The Old Faithful data set:

Illustration of the EM algorithm on the Old Faithful data set, the same set used to illustrate the K-means algorithm.

SLIDE 38

罗智凌

luozhiling@zju.edu.cn http://www.bruceluo.net