

slide-1
SLIDE 1 SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

Deep Learning: Methods and Applications

Chapter 3: Three Classes of Deep Learning Network

Presenters: 조성재, 최상우

slide-2

Contents

  • 3.1. A Three-Way Categorization
  • Classifying deep learning networks in three ways
  • 3.2. Deep Networks for Unsupervised or Generative Learning
  • The first category: unsupervised or generative learning
  • 3.3. Deep Networks for Supervised Learning
  • The second category: supervised learning
  • 3.4. Hybrid Deep Networks

slide-3

Machine Learning Basic Concept

slide-4

Contents Overview

  • Generative model vs. discriminative model
  • Joint distribution vs. conditional distribution
  • Denoising autoencoder
  • Mean squared reconstruction error and KL divergence
  • DBN, RBM, DBM
  • Sum-product network
  • Hessian-free optimization
  • Conditional random fields
  • Deep stacking network
  • Time-delay neural network
  • Convolutional neural network
  • Hybrid models

slide-5

3.1 A Three-Way Categorization(OVERVIEW)

  • Category 1: Deep networks for unsupervised or generative learning
  • Intended to capture high-order correlations of the observed or visible data for pattern analysis or synthesis purposes when no information about target class labels is available.
  • "Unsupervised feature or representation learning" in the literature refers to this category of deep networks.
  • Unsupervised learning in the generative mode may also be intended to characterize joint statistical distributions of the visible data and their associated classes.
  • Category 2: Deep networks for supervised learning
  • Intended to directly provide discriminative power for pattern-classification purposes, often by characterizing the posterior distributions of classes conditioned on the visible data.
  • Target label data are always available, in direct or indirect form, for such supervised learning.
  • They are also called discriminative deep networks.
  • Category 3: Hybrid deep networks
  • The goal is discrimination, which is assisted, often in a significant way, by the outcomes of generative or unsupervised deep networks. This can be accomplished by better optimization and/or regularization of the deep networks in category (2).

slide-6

3.1 Basic Deep Learning Terminologies - Seven Terms

1. Deep learning

  • A class of machine learning techniques.
  • The essence of deep learning is to compute hierarchical features or representations of the observational data, where the higher-level features or factors are defined from lower-level ones.

2. Deep belief network (DBN)

  • Probabilistic generative models composed of multiple layers of stochastic, hidden variables.

3. Boltzmann machine (BM)

  • A network of symmetrically connected, neuron-like units that make stochastic decisions about whether to be on or off.

slide-7

3.1 Basic Deep Learning Terminologies - Seven Terms

4. Restricted Boltzmann machine (RBM)

  • A special type of BM consisting of a layer of visible units and a layer of hidden units, with no visible-visible or hidden-hidden connections.

5. Deep neural network (DNN)

  • A multilayer perceptron with many hidden layers, whose weights are fully connected and are often initialized using either an unsupervised or a supervised pretraining technique.

6. Deep autoencoder

  • A "discriminative" DNN whose output targets are the data input itself rather than class labels; hence an unsupervised learning model.

7. Distributed representation

  • An internal representation of the observed data such that the data are modeled as being explained by the interactions of many hidden factors.

slide-8

3.2 Deep Networks for Unsupervised or Generative Learning

  • Intro
  • Many deep networks in this category can be used to meaningfully generate samples by sampling from the networks (examples: RBMs, DBNs, DBMs, and generalized denoising autoencoders) and are thus generative models.
  • Some networks in this category, however, cannot be easily sampled (examples: sparse coding networks and the original forms of deep autoencoders) and are thus not generative in nature.
  • Among the various subclasses of generative or unsupervised deep networks, the energy-based deep models are the most common.
  • Composing such energy-based models (e.g., stacking RBMs) leads to the deep belief network (DBN) (Chapter 5).

slide-9

3.2 Deep Networks for Unsupervised or Generative Learning

  • Introduced Deep Networks
  • Deep autoencoders
  • Transforming autoencoder
  • Predictive sparse coders
  • De-noising (stacked) autoencoders
  • Deep Boltzmann machines
  • Mean-covariance RBM
  • Deep Belief Networks
  • Sum-product networks
  • Recurrent neural networks

slide-10

slide-11

3.2 Deep Networks for Unsupervised or Generative Learning

  • Autoencoders
  • The original form of the deep autoencoder, which we will describe in more detail in Chapter 4, is a typical example of this unsupervised model category.
  • Most other forms of deep autoencoders are also unsupervised in nature, but with quite different properties and implementations.
  • Examples are transforming autoencoders, predictive sparse coders and their stacked version, and de-noising autoencoders and their stacked versions.

slide-12

3.2 Deep Networks for Unsupervised or Generative Learning

  • Autoencoders
  • Specifically, in de-noising autoencoders, the input vectors are first corrupted, for example by randomly selecting a percentage of the inputs and setting them to zero, or by adding Gaussian noise to them (like having to recognize objects through fog).
  • The encoded representations transformed from the uncorrupted data are used as the inputs to the next level of the stacked de-noising autoencoder.
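
The two corruption schemes mentioned above can be sketched in a few lines; this is a minimal sketch, in which the function names and the 30% corruption rate are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_masking(x, frac=0.3, rng=rng):
    """Randomly set a fraction of the inputs to zero (masking noise)."""
    mask = rng.random(x.shape) >= frac   # keep roughly (1 - frac) of the entries
    return x * mask

def corrupt_gaussian(x, sigma=0.1, rng=rng):
    """Add isotropic Gaussian noise to the inputs."""
    return x + sigma * rng.normal(size=x.shape)

# A denoising autoencoder is trained to reconstruct the clean x from the
# corrupted version; the encoder output computed from the uncorrupted data
# is then fed to the next level of the stack.
x = rng.random(8)              # a toy input vector
x_masked = corrupt_masking(x)  # some entries zeroed
x_noisy = corrupt_gaussian(x)  # all entries jittered
```

Either corruption can be applied per training example, so each epoch sees a different corrupted view of the same data.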

slide-13

3.2 Deep Networks for Unsupervised or Generative Learning

  • Deep Boltzmann Machines
  • A DBM contains many layers of hidden variables and has no connections between the variables within the same layer.
  • While having a simple learning algorithm, general BMs are very complex to study and very slow to train.
  • In a DBM, each layer captures complicated, higher-order correlations between the activities of the hidden features in the layer below.
  • DBMs have the potential of learning internal representations that become increasingly complex, which is highly desirable for solving object and speech recognition problems.
  • Further, the high-level representations can be built from a large supply of unlabeled sensory inputs, and very limited labeled data can then be used to only slightly fine-tune the model for a specific task at hand.

slide-14

3.2 Deep Networks for Unsupervised or Generative Learning

  • Deep Boltzmann Machines
  • When the number of hidden layers of a DBM is reduced to one, we have the restricted Boltzmann machine (RBM).
  • Like the DBM, the RBM has no hidden-to-hidden and no visible-to-visible connections.
  • The main virtue of the RBM is that, by composing many RBMs, many hidden layers can be learned efficiently, using the feature activations of one RBM as the training data for the next.
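
As a minimal illustration of how one RBM layer can be learned at a time, here is a sketch of a single contrastive-divergence (CD-1) update for a binary RBM; the factorized conditionals below rely exactly on the absence of visible-visible and hidden-hidden connections. All names, sizes, and the learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(W, a, b, v0, lr=0.1, rng=rng):
    """One contrastive-divergence (CD-1) update for a binary RBM.

    W: visible-to-hidden weights; a, b: visible/hidden biases;
    v0: a batch of visible vectors of shape (batch, n_visible).
    Both conditionals factorize because there are no within-layer links.
    """
    # Up: hidden probabilities and samples given the data.
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Down-up: one step of Gibbs sampling (the "reconstruction").
    pv1 = sigmoid(h0 @ W.T + a)
    ph1 = sigmoid(pv1 @ W + b)
    # Approximate log-likelihood gradient (positive minus negative phase).
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    a += lr * (v0 - pv1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b, ph0

# Stacking: after training one RBM, its hidden activations ph0 serve as
# the "data" on which the next RBM in the stack is trained.
```

The returned `ph0` is exactly the quantity the slide refers to: the feature activations handed to the next RBM.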

slide-15

3.2 Deep Networks for Unsupervised or Generative Learning

  • Boltzmann Machines

slide-16

3.2 Deep Networks for Unsupervised or Generative Learning

  • Restricted Boltzmann Machines

slide-17

3.2 Deep Networks for Unsupervised or Generative Learning

  • Restricted Boltzmann Machines

slide-18

3.2 Deep Networks for Unsupervised or Generative Learning

  • Deep Boltzmann Machines

slide-19

3.2 Deep Networks for Unsupervised or Generative Learning

  • Deep Belief Networks

slide-20

3.2 Deep Networks for Unsupervised or Generative Learning

  • DBM vs. DBN

slide-21

3.2 Deep Networks for Unsupervised or Generative Learning

  • Mean-Covariance RBM (mcRBM)
  • The standard DBN has been extended to a factored higher-order Boltzmann machine in its bottom layer, with strong results obtained for phone recognition (Dahl et al., 2010).
  • This model, called the mean-covariance RBM or mcRBM, addresses the limitation of the standard RBM in its ability to represent the covariance structure of the data.
  • However, it is difficult to train mcRBMs and to use them at the higher levels of the deep architecture.
  • The mcRBM parameters in the full DBN are not fine-tuned using the discriminative information (which is used for fine-tuning the higher layers of RBMs) due to the high computational cost.

slide-22

3.2 Deep Networks for Unsupervised or Generative Learning

  • Sum-Product Networks (SPN)
  • Another representative deep generative network that can be used for unsupervised (as well as supervised) learning is the sum-product network, or SPN.
  • An SPN is a directed acyclic graph with the observed variables as leaves, and with sum and product operations as internal nodes in the deep network.
  • The "sum" nodes give mixture models, and the "product" nodes build up the feature hierarchy.
  • Properties of "completeness" and "consistency" constrain the SPN in a desirable way. The learning of SPNs is carried out using the EM algorithm together with back-propagation.
  • The learning procedure starts with a dense SPN. It then finds an SPN structure by learning its weights, where zero weights indicate removed connections.
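
To make the roles of the sum and product nodes concrete, here is a hypothetical two-variable SPN evaluated bottom-up; the structure, leaf probabilities, and mixture weights are all invented for illustration:

```python
# Leaves: probability values for two binary variables X1, X2 under two
# illustrative components.  Product nodes multiply their children
# (feature conjunctions); sum nodes take weighted mixtures of children.
def spn_value(x1, x2):
    # Leaf distributions P(Xi = xi) for components "a" and "b".
    leaf_a1 = 0.8 if x1 == 1 else 0.2
    leaf_b1 = 0.3 if x1 == 1 else 0.7
    leaf_a2 = 0.6 if x2 == 1 else 0.4
    leaf_b2 = 0.1 if x2 == 1 else 0.9
    # Product nodes: independent combination within each component.
    prod_a = leaf_a1 * leaf_a2
    prod_b = leaf_b1 * leaf_b2
    # Root sum node: a mixture over the two product nodes.
    w_a, w_b = 0.5, 0.5
    return w_a * prod_a + w_b * prod_b

# Because this toy SPN is complete and consistent, the root value sums
# to 1 over all joint assignments of (X1, X2).
total = sum(spn_value(x1, x2) for x1 in (0, 1) for x2 in (0, 1))
```

Setting a mixture weight such as `w_b` to zero would correspond to removing that connection, which is how structure emerges from weight learning in the dense-SPN procedure described above.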

slide-23

3.2 Deep Networks for Unsupervised or Generative Learning

  • Sum Product Networks (SPN)

slide-24

3.2 Deep Networks for Unsupervised or Generative Learning

  • Sum-Product Networks (SPN)
  • The main difficulty in learning SPNs: the learning signal (i.e., the gradient) quickly dilutes as it propagates to the deep layers.
  • It was pointed out in the original paper that, despite the many desirable generative properties of the SPN, it is difficult to fine-tune the parameters using the discriminative information, limiting its effectiveness in classification tasks.
  • However, this difficulty has been overcome in subsequent work, where an efficient backpropagation-style discriminative training algorithm for the SPN was presented.

slide-25

3.2 Deep Networks for Unsupervised or Generative Learning

  • Sum-Product Networks (SPN)
  • Importantly, standard gradient descent, based on the derivative of the conditional likelihood, suffers from the same gradient diffusion problem well known in regular DNNs.
  • The trick to alleviate this problem when learning SPNs is to replace the marginal inference with the most probable state of the hidden variables, and to propagate gradients through this "hard" alignment only.
  • Excellent results on small-scale image recognition tasks were reported by Gens and Domingos (2012).

slide-26

3.2 Deep Networks for Unsupervised or Generative Learning

  • Recurrent Neural Networks (RNN)
  • Recurrent neural networks (RNNs) can be considered another class of deep networks for unsupervised (as well as supervised) learning, where the depth can be as large as the length of the input data sequence.
  • In the unsupervised learning mode, the RNN is used to predict the data sequence into the future using the previous data samples, with no additional class information used for learning.
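
A minimal sketch of this unsupervised mode: a vanilla RNN, unrolled over the sequence (so its effective depth equals the sequence length), emits a prediction of the next sample at every step. The sizes, initialization, and linear readout here are illustrative assumptions; training is only indicated in a comment:

```python
import numpy as np

rng = np.random.default_rng(0)

# A vanilla RNN cell; unrolled over the sequence, its depth equals the
# sequence length, as noted above.
n_in, n_hid = 3, 5
Wxh = rng.normal(0, 0.1, (n_in, n_hid))
Whh = rng.normal(0, 0.1, (n_hid, n_hid))
Why = rng.normal(0, 0.1, (n_hid, n_in))

def predict_next(seq):
    """Return the RNN's prediction of x[t+1] from x[0..t], for every t."""
    h = np.zeros(n_hid)
    preds = []
    for x in seq:
        h = np.tanh(x @ Wxh + h @ Whh)   # the state summarizes past samples
        preds.append(h @ Why)            # linear readout = next-sample guess
    return np.array(preds)

seq = rng.normal(size=(7, n_in))   # an unlabeled data sequence
preds = predict_next(seq)          # no class labels used anywhere
# Training would minimize e.g. ((preds[:-1] - seq[1:]) ** 2).sum()
# by backpropagation through time.
```

The point is that the loss is defined purely by the data sequence itself, which is what makes this usage unsupervised.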

slide-27

3.2 Deep Networks for Unsupervised or Generative Learning

  • Recurrent Neural Networks (RNN)
  • The RNN is very powerful for modeling sequence data (e.g., speech or text), but until recently RNNs had not been widely used, partly because they are difficult to train to capture long-term dependencies, giving rise to gradient vanishing or gradient explosion problems.
  • Recent advances in Hessian-free optimization have partially overcome this difficulty by using approximated second-order information or stochastic curvature estimates.
  • In more recent work, RNNs trained with Hessian-free optimization are used as generative deep networks in character-level language modeling tasks.

slide-28

3.2 Deep Networks for Unsupervised or Generative Learning

  • Recurrent Neural Networks (RNN)

slide-29

3.2 Deep Networks for Unsupervised or Generative Learning

  • Recurrent Neural Networks (RNN)
  • In these modeling tasks, gated connections are introduced to allow the current input characters to predict the transition from one latent state vector to the next.
  • Such generative RNN models are demonstrated to be well capable of generating sequential text characters.
  • More recently, Bengio et al. (2013) and Sutskever (2013) have explored variations of stochastic gradient descent optimization algorithms for training generative RNNs, and have shown that these algorithms can outperform Hessian-free optimization methods.
  • Mikolov et al. (2010) have reported excellent results on using RNNs for language modeling.
  • More recently, Mesnil et al. (2013) and Yao et al. (2013) reported the success of RNNs in spoken language understanding.

slide-30

3.2 Deep Networks for Unsupervised or Generative Learning

  • Dynamic and Deep Structures in Speech Recognition
  • There is a long history in speech recognition research of exploiting human speech production mechanisms to construct dynamic and deep structure in probabilistic generative models.
  • Specifically, the early work generalized and extended the conventional shallow and conditionally independent HMM structure by imposing dynamic constraints, in the form of polynomial trajectories, on the HMM parameters.
  • A variant of this approach has more recently been developed using different learning techniques for time-varying HMM parameters, with the applications extended to speech recognition robustness (Yu and Deng, 2009; Yu et al., 2009a).
  • Similar trajectory HMMs also form the basis for parametric speech synthesis.

slide-31

3.2 Deep Networks for Unsupervised or Generative Learning

  • Dynamic and Deep Structures in Speech Recognition
  • Subsequent work added a new hidden layer into the dynamic model to explicitly account for the target-directed, articulatory-like properties of human speech generation.
  • A more efficient implementation of this deep architecture with hidden dynamics is achieved with non-recursive or finite impulse response (FIR) filters in more recent studies.

slide-32

3.2 Deep Networks for Unsupervised or Generative Learning

  • Deep Structured Generative Models
  • The above deep structured generative models of speech can be shown to be special cases of the more general dynamic network model, and of even more general dynamic graphical models.
  • The graphical models can comprise many hidden layers to characterize the complex relationships between the variables in speech generation.
  • Armed with powerful graphical modeling tools, the deep architecture of speech has more recently been successfully applied to solve the very difficult problem of single-channel, multi-talker speech recognition, where the mixed speech is the visible variable while the un-mixed speech is represented in a new hidden layer in the deep generative architecture (Rennie et al., 2010; Wohlmayr et al., 2011).

slide-33

3.2 Deep Networks for Unsupervised or Generative Learning

  • Deep Structured Generative Models
  • Deep generative graphical models are indeed a powerful tool in many applications, due to their capability of embedding domain knowledge.
  • However, they are often used with inappropriate approximations in inference, learning, prediction, and topology design, all arising from the inherent intractability of these tasks for most real-world applications.
  • An even more drastic way to deal with this intractability was proposed recently by Bengio et al. (2013b), where the need to marginalize latent variables is avoided altogether.

slide-34

3.2 Deep Networks for Unsupervised or Generative Learning – to delete?

  • The standard statistical methods used for large-scale speech recognition and understanding combine (shallow) hidden Markov models for the speech acoustics with higher layers of structure representing different levels of the natural language hierarchy.
  • This combined hierarchical model can suitably be regarded as a deep generative architecture, whose motivation and some technical details may be found in Chapter 7 of the recent book on the "Hierarchical HMM," or HHMM.
  • These early deep models were formulated as directed graphical models, missing the key aspect of "distributed representation" embodied in the more recent deep generative networks of the DBN and DBM.
  • Filling in this missing aspect would help improve these generative models.

slide-35

3.2 Deep Networks for Unsupervised or Generative Learning – to delete?

  • Finally, dynamic or temporally recursive generative models based on neural network architectures can be found in (Taylor et al., 2007) for human motion modeling, and for natural language and natural scene parsing.
  • The latter model is particularly interesting because its learning algorithms are capable of automatically determining the optimal model structure. This contrasts with other deep architectures, such as the DBN, where only the parameters are learned while the architecture needs to be pre-defined.
  • Specifically, the recursive structure commonly found in natural scene images and in natural language sentences can be discovered using a max-margin structure prediction architecture.
  • It is shown that the units contained in the images or sentences are identified, and that the way in which these units interact with each other to form the whole is also identified.

slide-36

3.3 Deep Networks for Supervised Learning - CRF

  • Many of the discriminative techniques for supervised learning in signal and information processing are shallow architectures, such as conditional random fields (CRFs).
  • A CRF is intrinsically a shallow discriminative architecture, characterized by the linear relationship between the input features and the transition features.
slide-37
3.3 Deep Networks for Supervised Learning – deep-structured CRF

  • Recently, deep-structured CRFs have been developed by stacking the output of each lower CRF layer, together with the original input data, onto its higher layer.
  • Various versions of deep-structured CRFs have been successfully applied to phone recognition, spoken language identification, and natural language processing.
slide-38
SLIDE 38 SNU Spoken Language Processing Lab / 서울대학교 음성언어처리연구실

3.3 Deep Networks for Supervised Learning – Multilayer perceptron (MLP)

  • Morgan (2012) gives an excellent review of other major existing discriminative models in speech recognition, based mainly on the traditional neural network or MLP architecture using backpropagation learning with random initialization.
  • It argues for the importance of both the increased width of each layer of the neural networks and the increased depth.
  • In particular, a class of deep neural network models forms the basis of the popular "tandem" approach (Morgan et al., 2005), where the output of a discriminatively learned neural network is treated as part of the observation variable in HMMs.

slide-39

3.3 Deep Networks for Supervised Learning – Deep Stacking Network (DSN)

  • In the most recent work, a new deep learning architecture, sometimes called the Deep Stacking Network (DSN), has been developed, together with its tensor variant and its kernel version; all focus on discrimination, with scalable, parallelizable learning relying on little or no generative component.
  • We will describe this type of discriminative deep architecture in detail in Chapter 6.
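
The stacking idea can be sketched as follows, under simplifying assumptions: each module here is a one-hidden-layer network with a random (rather than RBM-initialized) hidden layer, its input is the raw data concatenated with the previous module's output, and its upper-layer weights are solved in closed form by least squares, which is one reason DSN-style learning is scalable and parallelizable. All names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_module(X, Y, n_hidden=32, rng=rng):
    """One DSN-style module: fixed random sigmoid hidden layer, then
    output weights U obtained in closed form by least squares."""
    W = rng.normal(0, 1.0, (X.shape[1], n_hidden))
    H = 1.0 / (1.0 + np.exp(-(X @ W)))          # hidden activations
    U, *_ = np.linalg.lstsq(H, Y, rcond=None)   # convex upper-layer fit
    return W, U

def module_predict(X, W, U):
    H = 1.0 / (1.0 + np.exp(-(X @ W)))
    return H @ U

def fit_dsn(X, Y, n_modules=3):
    """Stack modules; each sees [raw input, previous module's output]."""
    mods, Z = [], X
    for _ in range(n_modules):
        W, U = fit_module(Z, Y)
        mods.append((W, U))
        Z = np.hstack([X, module_predict(Z, W, U)])  # stack output onto input
    return mods

def dsn_predict(X, mods):
    Z, out = X, None
    for W, U in mods:
        out = module_predict(Z, W, U)
        Z = np.hstack([X, out])
    return out
```

The closed-form upper-layer solve is what distinguishes this from plain backpropagation through a deep net: each module's hardest subproblem is convex.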

slide-40

3.3 Deep Networks for Supervised Learning – Recurrent Neural Network (RNN)

  • RNNs can also be used as a discriminative model, in which the output is a label sequence associated with the input data sequence.

slide-41

3.3 Deep Networks for Supervised Learning – Recurrent Neural Network (RNN)

  • A set of new models and methods was proposed more recently that enables the RNNs themselves to perform sequence classification, embedding long short-term memory (LSTM) into the model.
  • Underlying this method is the idea of interpreting RNN outputs as conditional distributions over all possible label sequences given the input sequence:
  • q(z | y_1, y_2, ⋯, y_T), for a label sequence z given the input sequence y_1, …, y_T
  • Then, a differentiable objective function can be derived to optimize these conditional distributions over the correct label sequences.
  • The effectiveness of this method has been demonstrated in handwriting recognition tasks and in a small speech task (Graves et al., 2013, 2013a), to be discussed in more detail in Chapter 7 of this book.

slide-42

3.3 Deep Networks for Supervised Learning – convolutional neural network (CNN)

  • Another type of discriminative deep architecture is the convolutional neural network (CNN), in which each module consists of a convolutional layer and a pooling layer.
  • These modules are often stacked up one on top of another, or with a DNN on top, to form a deep model.
slide-43

3.3 Deep Networks for Supervised Learning – convolutional neural network (CNN)

  • The convolutional layer shares many weights, and the pooling layer subsamples the output of the convolutional layer, reducing the data rate from the layer below.
  • The weight sharing in the convolutional layer, together with appropriately chosen pooling schemes, endows the CNN with some "invariance" properties (e.g., translation invariance).
  • CNNs have been found highly effective and have been commonly used in computer vision and image recognition. Recently, the CNN has also been found effective for speech recognition.
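
A toy 1-D illustration of these two ingredients (a shared-weight convolution followed by max-pooling) showing the resulting tolerance to a small translation of the input; the filter and signals are made up for the demonstration:

```python
import numpy as np

def conv1d(x, w):
    """Valid 1-D convolution: the same weights w slide across all of x."""
    n = len(x) - len(w) + 1
    return np.array([x[i:i + len(w)] @ w for i in range(n)])

def max_pool(y, size=4):
    """Subsample: keep only the max in each non-overlapping window."""
    n = len(y) // size
    return np.array([y[i * size:(i + 1) * size].max() for i in range(n)])

w = np.array([1.0, -1.0])                  # a shared edge-detector filter
x = np.zeros(16); x[5] = 1.0               # a spike at position 5
x_shift = np.zeros(16); x_shift[6] = 1.0   # the same spike, shifted by one

out = max_pool(conv1d(x, w))
out_shift = max_pool(conv1d(x_shift, w))
# After pooling, the one-step shift no longer changes the representation:
# out and out_shift are identical.
```

Weight sharing makes the filter respond the same way wherever the pattern occurs, and pooling discards exactly where within each window it occurred; together they give the translation tolerance described above.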

slide-44

3.3 Deep Networks for Supervised Learning – time-delay neural network (TDNN)

  • It is useful to point out that the time-delay neural network (TDNN), developed for early speech recognition, is a special case and predecessor of the CNN, in which weight sharing is limited to one of the two dimensions (the time dimension) and there is no pooling layer.
  • It was not until recently that researchers discovered that time-dimension invariance is less important than frequency-dimension invariance for speech recognition.
  • A careful analysis of the underlying reasons is described in (Deng et al., 2013), together with a new strategy for designing the CNN's pooling layer, demonstrated to be more effective than all previous CNNs in phone recognition.

slide-45

3.4 Hybrid Deep Networks – two viewpoints

  • The term "hybrid" for this third category refers to deep architectures that either comprise or make use of both generative and discriminative model components.
  • In the existing hybrid architectures published in the literature, the generative component is mostly exploited to help with discrimination, which is the final goal of the hybrid architecture.
  • How and why generative modeling can help with discrimination can be examined from two viewpoints:
  • The optimization viewpoint, where generative models trained in an unsupervised fashion can provide excellent initialization points for highly nonlinear parameter estimation problems.
  • The regularization perspective, where the unsupervised-learning models can effectively provide a prior on the set of functions representable by the model.

slide-46

3.4 Hybrid Deep Networks – DBN-DNN model

  • The DBN can be converted to, and used as, the initial model of a DNN for supervised learning with the same network structure, which is then further discriminatively trained or fine-tuned using the target labels provided.
  • When the DBN is used in this way, we consider this DBN-DNN model a hybrid deep model, where the model trained using unsupervised data helps to make the discriminative model effective for supervised learning.
  • We will review the details of the discriminative DNN for supervised learning in the context of RBM/DBN generative, unsupervised pre-training in Chapter 5.
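
The conversion can be sketched as a simple weight hand-off, assuming greedy layer-wise pretraining has already produced one (weights, hidden-bias) pair per DBN layer: copy them into the DNN, add a new randomly initialized output layer, and then fine-tune everything with backpropagation on the labels. Shapes and names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these came from greedy layer-wise RBM pretraining (Chapter 5):
# one (W, hidden-bias) pair per DBN layer.
dbn_layers = [
    (rng.normal(0, 0.01, (784, 256)), np.zeros(256)),
    (rng.normal(0, 0.01, (256, 128)), np.zeros(128)),
]

def dbn_to_dnn(dbn_layers, n_classes, rng=rng):
    """Initialize a DNN from DBN weights; only the softmax layer is new."""
    dnn = [(W.copy(), b.copy()) for W, b in dbn_layers]
    n_top = dbn_layers[-1][0].shape[1]
    dnn.append((rng.normal(0, 0.01, (n_top, n_classes)), np.zeros(n_classes)))
    return dnn   # all of these are then fine-tuned with labeled data

dnn = dbn_to_dnn(dbn_layers, n_classes=10)
```

The unsupervised phase thus only supplies the starting point; the "hybrid" character comes from the discriminative fine-tuning that follows.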

slide-47

3.4 Hybrid Deep Networks – DNN-CRF and DNN-HMM

  • Another example of the hybrid deep network is developed in (Mohamed et al., 2010), where the DNN weights are also initialized from a generative DBN but are further fine-tuned with a sequence-level discriminative criterion (the conditional probability of the label sequence given the input feature sequence) instead of the frame-level cross-entropy criterion commonly used.
  • This can be viewed as a combination of the static DNN with the shallow discriminative architecture of the CRF. It can be shown that such a DNN-CRF is equivalent to a hybrid deep architecture of DNN and HMM whose parameters are learned jointly using the full-sequence maximum mutual information (MMI) criterion between the entire label sequence and the input feature sequence.

slide-48

3.4 Hybrid Deep Networks - RBM

  • In (Larochelle and Bengio, 2008), the generative model of the RBM is learned using the discriminative criterion of posterior class-label probabilities.
  • Here the label vector is concatenated with the input data vector to form the combined visible layer of the RBM.
  • In this way, the RBM can serve as a stand-alone solution to classification problems, and the authors derived a discriminative learning algorithm for the RBM as a shallow generative model.

slide-49

3.4 Hybrid Deep Networks – pretraining of CNNs

  • A further example of hybrid deep networks is the use of the generative models of DBNs to pre-train deep convolutional neural networks (Lee et al., 2009, 2010, 2011).
  • As with the fully connected DNN discussed earlier, pre-training also helps to improve the performance of deep CNNs over random initialization.
  • Pre-training DNNs or CNNs using a set of regularized deep autoencoders (Bengio et al., 2013a), including denoising autoencoders, contractive autoencoders, and sparse autoencoders, is a similar example of this category of hybrid deep networks.

slide-50

3.4 Hybrid Deep Networks – two-stage architecture

  • The final example given here for hybrid deep networks is based on the idea and work of (Ney, 1999; He and Deng, 2011), where one task of discrimination (e.g., speech recognition) produces the output (text) that serves as the input to a second task of discrimination (e.g., machine translation).
  • The overall system, providing the functionality of speech translation (translating speech in one language into text in another language), is a two-stage deep architecture consisting of both generative and discriminative elements.
  • Both the model of speech recognition (e.g., HMM) and the model of machine translation (e.g., phrasal mapping and non-monotonic alignment) are generative in nature, but their parameters are all learned for discrimination of the ultimate translated text given the speech data.

slide-51

Question

  • Regarding the sentence after "On the other hand" on p. 216: it says that unsupervised-learning models are easier to interpret, easier to embed domain knowledge into, and easier to handle uncertainty with. This part is not entirely clear to me, so please explain it in detail with examples.

"On the other hand, the deep unsupervised-learning models, especially the probabilistic generative ones, are easier to interpret, easier to embed domain knowledge, easier to compose, and easier to handle uncertainty, but they are typically intractable in inference and learning for complex systems."

slide-52

Answer

  • The answer is expected to be found in the papers below (in order of importance: 2, 1, 3).

1. G. Alain and Y. Bengio. What regularized autoencoders learn from the data generating distribution. In Proceedings of International Conference on Learning Representations (ICLR), 2013.

2. Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 35(8):1798–1828, 2013.

3. Y. Bengio, E. Thibodeau-Laufer, and J. Yosinski. Deep generative stochastic networks trainable by backprop. arXiv:1306.1091, 2013. Also accepted to appear in Proceedings of International Conference on Machine Learning (ICML), 2014.

slide-53

Answer

On the other hand, the deep unsupervised-learning models, especially the probabilistic generative ones, are easier to interpret, easier to embed domain knowledge, easier to compose, and easier to handle uncertainty, but they are typically intractable in inference and learning for complex systems.

  • The deep unsupervised-learning models: autoencoders, sparse coding networks, etc.
  • For comparison, the deep supervised-learning models: DNNs, etc.
  • "Easier to interpret": Paper 2 on the previous slide covers various interpretation examples.
  • "Easier to embed domain knowledge": for example, when learning representations of the digit images 0 through 9, an autoencoder layer can be given exactly 10 units.
  • "Easier to handle uncertainty": since the models in question are probabilistic generative ones, one can argue that they handle uncertainty (i.e., probability) naturally; the precise intended answer is unknown.

53

slide-54
SLIDE 54

Question

  • Please explain the SPN architecture on p. 220, with a figure.

54

slide-55
SLIDE 55

Answer

References

  • [Paper (first introduced)] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In Proceedings of Uncertainty in Artificial Intelligence (UAI), 2011.
  • [Paper] R. Gens and P. Domingos. Discriminative learning of sum-product networks. Neural Information Processing Systems (NIPS), 2012.
  • [Lecture by the author] Sum-Product Networks: The Next Generation of Deep Models by P. Domingos (YouTube) (PDF)
  • Comprehensive web page (http://spn.cs.washington.edu/)

55

slide-56
SLIDE 56

Answer: Basic Structure

56

[Figures: basic SPN structure, deep SPN structure, and the forward/backward passes]
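As a sketch of the structure in the figures: an SPN alternates sum nodes (weighted mixtures over the same variable scope) and product nodes (factorizations over disjoint scopes), with tractable distributions at the leaves, and any marginal is a single bottom-up pass. A minimal hand-built example in Python; the specific structure, weights, and variable names are hypothetical:

```python
import math

# Minimal sum-product network over two binary variables x1, x2.
# Leaves are Bernoulli distributions; a product node factorizes over
# disjoint variable scopes; a sum node is a weighted mixture.

def bernoulli(var, p):
    # Leaf node: P(X_var = x) under Bernoulli(p)
    return lambda a: p if a[var] == 1 else 1.0 - p

def product_node(children):
    # Product node: multiply children defined over disjoint scopes
    return lambda a: math.prod(c(a) for c in children)

def sum_node(weights, children):
    # Sum node: weighted mixture of children over the same scope
    return lambda a: sum(w * c(a) for w, c in zip(weights, children))

# P(x1, x2) = 0.6 * B(x1; 0.9) B(x2; 0.2) + 0.4 * B(x1; 0.1) B(x2; 0.7)
spn = sum_node([0.6, 0.4], [
    product_node([bernoulli("x1", 0.9), bernoulli("x2", 0.2)]),
    product_node([bernoulli("x1", 0.1), bernoulli("x2", 0.7)]),
])

# With normalized sum weights the network sums to 1 over all assignments,
# so evaluation (the forward pass) directly yields probabilities.
total = sum(spn({"x1": a, "x2": b}) for a in (0, 1) for b in (0, 1))
```

The backward pass in the figure corresponds to differentiating this evaluation with respect to the node weights, which is what discriminative SPN training (Gens and Domingos, 2012) exploits.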

slide-57
SLIDE 57

Answer: Architecture with an Example

57

slide-58
SLIDE 58

Question

  • I have a question about using RNNs for unsupervised learning. When inferring a future sequence from previous data samples, I think incorrect intermediate predictions can accumulate and eventually cause the entire sequence to be inferred incorrectly. Is there a mechanism that can address this problem?

58

slide-59
SLIDE 59

Answer

  • During training, this problem can be mitigated with a technique called teacher forcing (Goodfellow et al., 2016): the next state is predicted from the current ground-truth value rather than from the current predicted value.

59

[Figures: vanilla RNN vs. RNN with teacher forcing]

Goodfellow et al. Deep Learning. The MIT Press. 2016
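To make the contrast concrete, here is a toy scalar "RNN" run in both modes; the dynamics, weights, and sequence are hypothetical, chosen only to show how free-running errors compound while teacher forcing keeps each step anchored to the ground truth:

```python
import numpy as np

# Toy scalar "RNN" y_hat_t = tanh(w * y_prev): in free-running mode y_prev is
# the model's own previous prediction, so one-step errors compound; with
# teacher forcing y_prev is the ground truth at every step.

def free_running(y_true, w):
    preds, prev = [], y_true[0]
    for _ in range(1, len(y_true)):
        prev = np.tanh(w * prev)               # feed back own prediction
        preds.append(prev)
    return preds

def teacher_forced(y_true, w):
    # Feed the ground-truth previous value at every step
    return [np.tanh(w * y_true[t - 1]) for t in range(1, len(y_true))]

# Ground truth generated by the same dynamics with w_true = 1.5 ...
y = [0.5]
for _ in range(4):
    y.append(np.tanh(1.5 * y[-1]))

# ... predicted by a slightly mis-trained model with w = 1.4:
err_free = abs(free_running(y, 1.4)[-1] - y[-1])
err_forced = abs(teacher_forced(y, 1.4)[-1] - y[-1])
# err_free > err_forced: the free-running error has accumulated over steps
```

Note that teacher forcing applies only during training; at generation time the model must still run free, which is why scheduled mixing of the two modes is sometimes used.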

slide-60
SLIDE 60

Question & Answer

  • Models submitted for various audio-related tasks in 2017–18 can still be seen using classical machine-learning methods such as HMMs, and in some cases RNN-based variants fail to match the performance of HMM-based models. I am curious what the presenters think about this.

  • Since audio processing is not my specialty, I am not familiar with "the various models submitted for audio-related tasks in 2017–18".
  • If you had specified concrete cases where "RNN-based variants fail to match HMM-based models", a more substantive discussion would have been possible.
  • The biggest difference between an HMM and an RNN is that the HMM predicts the next state from the current state, whereas the RNN predicts it from the current state and several previous states. However, the HMM assumes the current state encapsulates all previous states (the Markov property). An RNN is weak at long-term dependencies and lacks the Markov property, so it must be made to capture long-term dependencies well; because that is hard to achieve, I think the RNN can fail to deliver better performance. By contrast, I think an HMM is easier than an RNN to design so that it carries long-term dependencies, which is why the HMM performs better in those cases.

60

slide-61
SLIDE 61

Question & Answer

  • I would like to know about the relatively recently developed method that uses time-varying HMM parameters.

  • I first encountered the time-varying HMM through this question.
  • You asked about "the method using time-varying HMM parameters", but I could not find what "time-varying HMM parameters" refers to or what it is used for, so it is hard to grasp the intent of the question. Do you mean the time-varying HMM introduced in Wang et al. (2009)?

61

slide-62
SLIDE 62

Question & Answer

  • This question comes from a limited grasp of the basics of CRFs. Given that a CRF is a shallow layer, the book says it can be stacked into a deep structure (p. 31). How does such a stacked structure differ from a DNN? As the simplest comparison, how does a CRF differ from a 1-layer DNN?

62

[Figures: example CRF structure, deep-structured CRF, and DNN structure]
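One way to see the difference numerically: a 1-layer DNN with a softmax scores each position of a sequence independently, while a linear-chain CRF adds transition scores between neighboring labels and normalizes over entire label sequences. A minimal sketch with hypothetical scores:

```python
import numpy as np

# Toy tagging problem: T = 3 positions, K = 2 labels.
emissions = np.array([[2.0, 0.5],
                      [0.4, 1.5],
                      [1.0, 1.2]])     # emissions[t, k]: score of label k at t
transitions = np.array([[1.0, -0.5],
                        [-0.5, 1.0]])  # transitions[j, k]: score of j -> k

def crf_log_Z(em, tr):
    # Forward algorithm: log partition function over all label sequences
    alpha = em[0]
    for t in range(1, len(em)):
        alpha = em[t] + np.logaddexp.reduce(alpha[:, None] + tr, axis=0)
    return np.logaddexp.reduce(alpha)

def crf_seq_score(em, tr, ys):
    # Unnormalized log-score of one whole label sequence
    return em[np.arange(len(ys)), ys].sum() + tr[ys[:-1], ys[1:]].sum()

ys = np.array([0, 0, 1])
log_p_crf = crf_seq_score(emissions, transitions, ys) \
            - crf_log_Z(emissions, transitions)

# The independent-softmax ("1-layer DNN") log-probability of the same labels:
log_p_dnn = (emissions[np.arange(3), ys]
             - np.log(np.exp(emissions).sum(axis=1))).sum()
```

The CRF's `log_p_crf` depends on the transition matrix and on the global normalizer, so labels at different positions are coupled; the DNN's `log_p_dnn` factorizes per position.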

slide-63
SLIDE 63

Question & Answer

  • Are all discriminative models learned by supervised learning methods?
  • Since a discriminative model aims to learn p(y|x), I believe it cannot be learned when y is not given. However, as in hybrid models, it is possible to apply unsupervised learning methods to the pretraining of a discriminative model.

63

slide-64
SLIDE 64

Question & Answer

  • In Section 3.4, an example of a hybrid deep neural network is given in which a pre-trained DNN or CNN uses a deep autoencoder. Does this mean that the output learned by the deep autoencoder is fed as input to the pre-trained DNN/CNN? I am also curious what roles the deep autoencoder and the pre-trained DNN/CNN each play.

  • It means that the encoder learned inside the deep autoencoder is used to obtain features from the input, and these features serve as the DNN/CNN's input. The deep autoencoder's encoder thus acts as a feature extractor, and the DNN/CNN acts as a classifier.
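The two roles can be sketched in a few lines; a single encoder layer and a single softmax layer stand in for the deep versions, and all weights, sizes, and data here are hypothetical (the encoder weights are assumed to have been trained beforehand with a reconstruction loss, the classifier with a supervised loss):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

W_enc = rng.normal(0.0, 0.5, (8, 3))   # "pretrained" encoder: 8-dim input -> 3-dim code

def encode(x):
    return relu(x @ W_enc)             # role 1: feature extractor

W_clf = rng.normal(0.0, 0.5, (3, 2))   # classifier on top of the codes

def classify(x):
    return softmax(encode(x) @ W_clf)  # role 2: classifier fed by the encoder

probs = classify(rng.normal(size=(4, 8)))   # 4 samples -> class probabilities
```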

64

slide-65
SLIDE 65

Question & Answer

  • The result of preprocessing speech data usually has both a time axis and a frequency axis, so it resembles both 1-D time-series data and 2-D images. How do applying an RNN, a 1-D CNN along the time axis (a TDNN), and a 2-D CNN to such data differ, and in which situations is each architecture the better choice?

  • First, since CNN models, unlike RNN models, parallelize easily and therefore run fast, I think they are a good choice when execution speed matters. I also expect a 2-D CNN to outperform a 1-D CNN because it considers several time frames simultaneously.

65

slide-66
SLIDE 66

Question & Answer

  • Hybrid models also work in the reverse direction (generative for discriminative); the book mentions regularization in which the former acts as a prior for the latter (p. 226). What process does "becoming a prior" refer to (how does the generative model's result feed into the discriminative model), and is there evidence that it is an appropriate prior?

  • It means the generative model is used as the initialization of the discriminative model. I have not read a paper proving that this is an appropriate prior; empirically, it is used because training converges better and yields higher performance than initializing the model's weights randomly.

66

slide-67
SLIDE 67

Question & Answer

  • I am curious about the "Hessian-free optimization" method mentioned in lines 1–9 on p. 221 of the textbook. What are its differences (pros and cons) compared with stochastic gradient descent optimization algorithms?

67

  • In theory it finds better optima than SGD, but it is harder to implement, can be quite unstable, and the conjugate-gradient step is computationally very expensive.
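The core trick can be sketched briefly: Hessian-free methods solve the Newton system H d = -g by conjugate gradient using only Hessian-vector products, never forming H. Below, the product is approximated by finite differences of the gradient on a quadratic test objective, so the exact Hessian is known and the result can be checked; the matrices and starting point are hypothetical:

```python
import numpy as np

# Quadratic test objective f(x) = 0.5 x'Ax - b'x, whose Hessian is exactly A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])

def grad(x):
    return A @ x - b

def hessian_vec(x, v, eps=1e-5):
    # H v ~= (grad(x + eps v) - grad(x)) / eps -- one extra gradient evaluation
    return (grad(x + eps * v) - grad(x)) / eps

def cg(hv, g, iters=10, tol=1e-12):
    # Conjugate gradient for H d = -g, touching H only through the matvec hv
    d = np.zeros_like(g)
    r = -g - hv(d)
    p = r.copy()
    for _ in range(iters):
        if r @ r < tol:
            break
        Hp = hv(p)
        alpha = (r @ r) / (p @ Hp)
        d = d + alpha * p
        r_new = r - alpha * Hp
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return d

x = np.array([2.0, -1.0])
step = cg(lambda v: hessian_vec(x, v), grad(x))
# For a quadratic, x + step lands (numerically) on the minimizer A^{-1} b.
```

The inner CG loop is where the cost mentioned above comes from: each CG iteration needs a fresh Hessian-vector product, i.e., roughly one extra gradient evaluation.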

slide-68
SLIDE 68

Question & Answer

  • Can hybrid deep networks solve the issues of sparsity and the existence of multiple conceptual interpretations of the same word in a speech processing context? If still not, then what do we need to achieve that?

  • Even without a hybrid deep network, given a sufficient amount of data, RNN-family models can take both short- and long-range context into account, so I believe the problem can be solved.

68

slide-69
SLIDE 69

Question & Answer

  • I understand that vanilla RNNs easily run into the gradient-vanishing problem, yet the textbook states: "Mesnil et al. (2013) and Yao et al. (2013) reported the success of RNNs in spoken language understanding." Did these experiments then use improved RNNs rather than vanilla RNNs?

  • The two studies applied Elman-type and Jordan-type RNNs, which are variants of the vanilla RNN. However, unlike long short-term memory (LSTM) or the gated recurrent unit (GRU), these models were not proposed to solve the gradient-vanishing problem.

69

slide-70
SLIDE 70

Question

  • Although it may not be widely used anymore, one method that always comes up when discussing the initialization of deep neural networks is weight initialization using an RBM, or its extension, a DBN. However, since an RBM samples each node's probability over a bipartite structure, I do not see how it can be used for weight initialization. The beginning of Chapter 4 explains the procedure, but what statistical significance do weights initialized this way have?

70

slide-71
SLIDE 71

Answer

  • When computing the probabilities of the hidden layer in an RBM, the visible-layer values are multiplied by the weights; it is these same weights that are used to initialize the DNN's weights. The statistical significance is that weights initialized this way provide the model with a good prior.
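As a sketch of the mechanism: the RBM weight matrix W is trained unsupervised (here with one-step contrastive divergence, biases omitted for brevity) and then copied into the first feed-forward layer, which is valid because both compute sigmoid(v @ W). The data, layer sizes, and hyperparameters below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, lr=0.1, epochs=50):
    # One-step contrastive divergence (CD-1) for a binary RBM, no biases
    n_visible = data.shape[1]
    W = rng.normal(0.0, 0.01, (n_visible, n_hidden))
    for _ in range(epochs):
        v0 = data
        ph0 = sigmoid(v0 @ W)                              # P(h = 1 | v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample hidden units
        v1 = sigmoid(h0 @ W.T)                             # mean-field reconstruction
        ph1 = sigmoid(v1 @ W)
        W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(data)    # CD-1 update
    return W

data = (rng.random((32, 6)) < 0.5).astype(float)           # toy binary data
W_rbm = train_rbm(data, n_hidden=4)

# The same W that computes P(h|v) = sigmoid(v @ W) in the RBM initializes the
# first sigmoid layer of the DNN before supervised fine-tuning:
dnn_layer1_W = W_rbm.copy()
```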

71

slide-72
SLIDE 72

Question & Answer

  • Page 25 says: "Unsupervised learning refers to no use of task specific supervision information." But if we regard this "task specific supervision information" as the input itself, the model becomes a generative model, and then the distinction between unsupervised and supervised learning disappears. That is, supervised learning learns the function f in y = f(x) given both x and y; if we may regard it as x = f(x), it becomes unsupervised learning. From this viewpoint there is no mathematical difference between supervised and unsupervised learning, so how can the difference between them be explained?

  • The difference lies in their purposes. Supervised learning aims at the classification or regression of data given labels, whereas unsupervised learning aims at uncovering the hidden structure of data when no labels are given.

72

slide-73
SLIDE 73

Question & Answer

  • Page 28 says a DBM can be used for the speech recognition problem. But as far as I know, a DBM performs inference through a fixed-length model; how can it perform inference when the length of the given utterance is unknown? I am curious how a DBM handles speech recognition.

  • First, divide the given speech into fixed-length segments and extract MFCC or mel-spectrogram features from them; then use the DBM once more to extract higher-level features from those. Finally, I think passing this sequence through an RNN makes inference possible even when the length is unknown.

73

slide-74
SLIDE 74

Thank you!

Q & A

74