Deep Learning: Methods and Applications
Chapter 3: Three Classes of Deep Learning Network
Presenters: 조성재, 최상우
SNU Spoken Language Processing Lab
Table of Contents
2
Machine Learning: Basic Concepts
3
Contents Overview
4
3.1 A Three-Way Categorization (Overview)
1. Deep networks for unsupervised or generative learning: intended to capture high-order correlation of the observed (visible) data for pattern analysis or synthesis purposes when no information about target class labels is available. When used in the generative mode, they may also characterize the joint statistical distributions of the visible data and their associated classes.
2. Deep networks for supervised learning: intended to directly provide discriminative power for pattern classification, often by characterizing the posterior distributions of classes conditioned on the visible data.
3. Hybrid deep networks: the goal is discrimination, assisted in a significant way by the outcomes of generative or unsupervised deep networks. This can be accomplished by better optimization and/or regularization of the deep networks in category (2).
5
3.1 Basic Deep Learning Terminology: Seven Terms
1. Deep learning: a class of machine learning techniques that exploit many layers of non-linear information processing for supervised or unsupervised feature extraction and transformation, and for pattern analysis and classification.
2. Deep belief network (DBN): a probabilistic generative model composed of multiple layers of stochastic, hidden variables; the top two layers have undirected, symmetric connections, while the lower layers receive top-down, directed connections from the layer above.
3. Boltzmann machine (BM): a network of symmetrically connected, neuron-like units that make stochastic decisions about whether to be on or off.
6
3.1 Basic Deep Learning Terminology: Seven Terms
4. Restricted Boltzmann machine (RBM): a special type of BM consisting of a layer of visible units and a layer of hidden units with no visible-visible or hidden-hidden connections (a minimal sampling sketch follows this list).
5. Deep neural network (DNN): a multilayer perceptron with many hidden layers, whose weights are fully connected and are often initialized using either an unsupervised or a supervised pretraining technique.
6. Deep autoencoder: a "discriminative" DNN whose output targets are the data input itself rather than class labels, hence an unsupervised learning model.
7. Distributed representation: an internal representation of the observed data, in which the data are modeled as being explained by the interactions of many hidden factors.
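To make the RBM definition above concrete, below is a minimal numpy sketch of block Gibbs sampling in an RBM: because there are no visible-visible or hidden-hidden connections, each layer is conditionally independent given the other and can be sampled in one step. The sizes, weights, and initialization are illustrative assumptions, not code from the book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
n_visible, n_hidden = 6, 3
W = rng.normal(scale=0.1, size=(n_visible, n_hidden))  # visible-hidden weights
b_v = np.zeros(n_visible)                               # visible biases
b_h = np.zeros(n_hidden)                                # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v):
    # No hidden-hidden connections: hidden units are conditionally
    # independent given the visible layer.
    p_h = sigmoid(v @ W + b_h)
    return (rng.random(p_h.shape) < p_h).astype(float), p_h

def sample_v_given_h(h):
    # No visible-visible connections: visible units are conditionally
    # independent given the hidden layer.
    p_v = sigmoid(h @ W.T + b_v)
    return (rng.random(p_v.shape) < p_v).astype(float), p_v

# One step of block Gibbs sampling from a random visible configuration.
v0 = (rng.random(n_visible) < 0.5).astype(float)
h0, _ = sample_h_given_v(v0)
v1, _ = sample_v_given_h(h0)
```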
7
3.2 Deep Networks for Unsupervised or Generative Learning
Many deep networks in this category can be used to meaningfully generate samples by sampling from the networks; examples include RBMs, DBNs, DBMs, and generalized denoising autoencoders, which are thus generative models.
Some networks in this category, however, cannot be easily sampled; examples include sparse coding networks and the original forms of deep autoencoders, which are thus not generative in nature.
Among the various subclasses of generative or unsupervised deep networks, energy-based deep models are the most common.
8
3.2 Deep Networks for Unsupervised or Generative Learning
9
10
3.2 Deep Networks for Unsupervised or Generative Learning
The deep autoencoder, to be discussed in detail in Chapter 4, is a typical example of this unsupervised model category.
Most other forms of deep autoencoders are also unsupervised in nature, but with quite different properties and implementations. Examples include transforming autoencoders, predictive sparse coders and their stacked version, and de-noising autoencoders and their stacked versions.
11
3.2 Deep Networks for Unsupervised or Generative Learning
In denoising autoencoders, the input vectors are first corrupted, for example by randomly selecting a percentage of the inputs and setting them to zero or by adding Gaussian noise to them (similar to having to recognize objects in fog).
The network is then trained to reconstruct the original, uncorrupted data, and the resulting hidden encodings serve as the inputs to the next level of the stacked de-noising autoencoder.
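A minimal sketch of the corruption step just described, assuming a simple zero-masking or additive-Gaussian scheme; the masking fraction, noise level, and toy data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, mask_fraction=0.3, gaussian_std=None):
    """Corrupt inputs either by zero-masking a fraction of the entries
    or by adding Gaussian noise (illustrative parameters)."""
    if gaussian_std is not None:
        return x + rng.normal(scale=gaussian_std, size=x.shape)
    mask = rng.random(x.shape) >= mask_fraction   # keep ~70% of the inputs
    return x * mask

x = rng.random((4, 8))          # a toy batch of "clean" inputs
x_noisy = corrupt(x)            # corrupted version fed to the encoder
# The training target stays the clean x: the autoencoder learns to denoise,
# and its hidden codes become the inputs of the next stacked layer.
```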
12
3.2 Deep Networks for Unsupervised or Generative Learning
A deep Boltzmann machine (DBM) contains many layers of hidden variables and has no connections between the variables within the same layer.
The DBM is a special case of the general Boltzmann machine (BM); while having a simple learning algorithm, general BMs are very complex to study and very slow to train.
In a DBM, each layer captures complicated, higher-order correlations between the activities of hidden features in the layer below.
DBMs can learn internal representations that become increasingly complex, which is highly desirable for solving object and speech recognition problems.
Moreover, high-level representations can be built from a large supply of unlabeled sensory inputs, and very limited labeled data can then be used to only slightly fine-tune the model for a specific task at hand.
13
3.2 Deep Networks for Unsupervised or Generative Learning
When the number of hidden layers of a DBM is reduced to one, we have the restricted Boltzmann machine (RBM).
Like the DBM, there are no hidden-to-hidden and no visible-to-visible connections in the RBM.
The main virtue of the RBM is that, by stacking many of them, a deep belief network (DBN) can be learned efficiently using the feature activations of one RBM as the training data for the next (see the layer-wise sketch below).
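The layer-wise stacking idea can be sketched as follows. The per-layer RBM training is replaced here by a random-projection stand-in, since the point is only how one RBM's feature activations become the next RBM's training data; the sizes and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden):
    """Placeholder for contrastive-divergence training of one RBM;
    here it just returns a weight matrix of the right shape (assumption)."""
    return rng.normal(scale=0.1, size=(data.shape[1], n_hidden))

layer_sizes = [64, 32, 16]             # illustrative layer widths
data = rng.random((100, 128))          # toy "visible" training data
weights = []
for n_hidden in layer_sizes:
    W = train_rbm(data, n_hidden)      # train one RBM on the current data
    data = sigmoid(data @ W)           # its activations train the next RBM
    weights.append(W)
```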
14
3.2 Deep Networks for Unsupervised or Generative Learning
15
3.2 Deep Networks for Unsupervised or Generative Learning
16
3.2 Deep Networks for Unsupervised or Generative Learning
17
3.2 Deep Networks for Unsupervised or Generative Learning
18
3.2 Deep Networks for Unsupervised or Generative Learning
19
3.2 Deep Networks for Unsupervised or Generative Learning
20
3.2 Deep Networks for Unsupervised or Generative Learning
One variant uses the mean-covariance RBM (mcRBM) as the bottom layer of the DBN, with strong results for phone recognition obtained (Dahl et al., 2010).
The mcRBM improves upon the standard RBM in its ability to represent the covariance structure of the data.
However, it is difficult to train mcRBMs and to use them at higher levels of the deep architecture.
In addition, the mcRBM parameters in the full DBN are not fine-tuned using the discriminative information, which is used for fine-tuning the higher layers of RBMs, due to the high computational cost.
21
3.2 Deep Networks for Unsupervised or Generative Learning
Another deep generative network that can be used for unsupervised (as well as supervised) learning is the sum-product network or SPN.
An SPN is a directed acyclic graph with the observed variables as leaves and with sum and product operations as internal nodes in the deep network.
The "sum" nodes give mixture models, and the "product" nodes build up the feature hierarchy.
Parameter learning of SPNs is carried out using the EM algorithm together with back-propagation.
The learning procedure starts with a dense SPN and then finds an SPN structure by learning its weights, where zero weights indicate removed connections (a toy evaluation sketch follows below).
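As a toy illustration of the sum/product structure described above, the following sketch hand-builds a tiny SPN over two binary variables; the particular weights and topology are made up for the example and are not from the text.

```python
# Indicator leaves, sum nodes (mixtures), and product nodes (feature
# combinations over disjoint scopes), evaluated bottom-up.

def leaf(x, value):
    """Indicator leaf: 1.0 if the observed variable equals `value`."""
    return 1.0 if x == value else 0.0

def spn(x1, x2):
    # Sum nodes over each variable's indicators act as small mixtures.
    s1 = 0.7 * leaf(x1, 1) + 0.3 * leaf(x1, 0)
    s2 = 0.4 * leaf(x2, 1) + 0.6 * leaf(x2, 0)
    s1b = 0.2 * leaf(x1, 1) + 0.8 * leaf(x1, 0)
    s2b = 0.9 * leaf(x2, 1) + 0.1 * leaf(x2, 0)
    # Product nodes combine disjoint scopes; the root sum mixes them.
    return 0.5 * (s1 * s2) + 0.5 * (s1b * s2b)

# Because the sum-node weights are normalized, the values sum to 1 over
# all complete assignments, so spn(x1, x2) behaves like a probability.
total = sum(spn(a, b) for a in (0, 1) for b in (0, 1))
print(total)  # 1.0 up to floating point
```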
22
3.2 Deep Networks for Unsupervised or Generative Learning
23
3.2 Deep Networks for Unsupervised or Generative Learning
Despite the many desirable generative properties of the SPN, it is difficult to fine-tune its parameters using discriminative information, limiting its effectiveness in classification tasks.
This difficulty was overcome in subsequent work (Gens and Domingos, 2012), where an efficient backpropagation-style discriminative training algorithm for SPNs was presented.
24
3.2 Deep Networks for Unsupervised or Generative Learning
The standard gradient descent, based on the derivative of the conditional likelihood, suffers from the same gradient diffusion problem well known in regular DNNs.
The trick used to alleviate this problem is to replace the marginal inference with the most probable state of the hidden variables and to propagate gradients through this "hard" alignment only.
Excellent results on small-scale image recognition tasks were reported by Gens and Domingos (2012).
25
3.2 Deep Networks for Unsupervised or Generative Learning
Recurrent neural networks (RNNs) can be considered another class of deep networks for unsupervised (as well as supervised) learning, where the depth can be as large as the length of the input data sequence.
In the unsupervised learning mode, the RNN is used to predict the data sequence in the future using the previous data samples, and no additional class information is used for learning.
26
3.2 Deep Networks for Unsupervised or Generative Learning
Although RNNs have been around for a long time, until recently they had not been widely used, partly because they are difficult to train to capture long-term dependencies, giving rise to gradient vanishing or gradient explosion problems.
These problems can now be dealt with more easily thanks to advances in Hessian-free optimization using approximated second-order information or stochastic curvature estimates.
In more recent work, RNNs trained with Hessian-free optimization have been used as a generative deep network in character-level language modeling tasks.
27
3.2 Deep Networks for Unsupervised or Generative Learning
28
3.2 Deep Networks for Unsupervised or Generative Learning
In these models, gated connections allow the current input characters to predict the transition from one latent state vector to the next.
Such generative RNN models have been demonstrated to be well capable of generating sequential text characters (a minimal generation sketch follows below).
Variants of stochastic gradient descent optimization algorithms have also been explored for training generative RNNs and shown to outperform Hessian-free optimization methods.
RNNs have also been applied successfully to language modeling.
Mesnil et al. (2013) and Yao et al. (2013) reported the success of RNNs in spoken language understanding.
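A minimal numpy sketch of the character-level generative use of an RNN described above: the previous character and the latent state produce a distribution over the next character, which can then be sampled to generate text. The vocabulary size, dimensions, and weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_size = 27, 16                  # e.g., 'a'-'z' plus space
Wxh = rng.normal(scale=0.1, size=(vocab_size, hidden_size))
Whh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
Why = rng.normal(scale=0.1, size=(hidden_size, vocab_size))

def step(x_onehot, h_prev):
    h = np.tanh(x_onehot @ Wxh + h_prev @ Whh)     # new latent state
    logits = h @ Why
    p = np.exp(logits - logits.max())
    return h, p / p.sum()                          # next-character distribution

# Generate a few characters by sampling from the model's own predictions.
h = np.zeros(hidden_size)
x = np.eye(vocab_size)[0]
for _ in range(5):
    h, p = step(x, h)
    idx = rng.choice(vocab_size, p=p)
    x = np.eye(vocab_size)[idx]
```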
29
3.2 Deep Networks for Unsupervised or Generative Learning
There is a long history in speech recognition research where human speech production mechanisms are exploited to construct dynamic and deep structure in probabilistic generative models.
Early work generalized and extended the conventional shallow and conditionally independent HMM structure by imposing dynamic constraints, in the form of polynomial trajectories, on the HMM parameters.
A variant of this approach has more recently been developed using different learning techniques for time-varying HMM parameters, with the applications extended to speech recognition robustness (Yu and Deng, 2009; Yu et al., 2009a).
30
Subsequent work added a hidden layer into the dynamic model to explicitly account for the target-directed, articulatory-like properties of human speech generation.
More efficient implementation of this deep architecture with hidden dynamics has been achieved with non-recursive or finite impulse response (FIR) filters in more recent studies.
31
3.2 Deep Networks for Unsupervised or Generative Learning
These deep-structured generative models of speech can be shown to be special cases of more general dynamic graphical models.
The graphical models can comprise many hidden layers to characterize the complex relationship between the variables in speech generation.
Such deep generative models have more recently been successfully applied to solve the very difficult problem of single-channel, multi-talker speech recognition, where the mixed speech is the visible variable while the un-mixed speech becomes represented in a new hidden layer in the deep generative architecture (Rennie et al., 2010; Wohlmayr et al., 2011).
32
3.2 Deep Networks for Unsupervised or Generative Learning
Deep generative graphical models are a powerful tool in many applications due to their capability of embedding domain knowledge.
However, they are often used with inappropriate approximations in inference, learning, prediction, and topology design, all arising from the inherent intractability of these tasks for most real-world applications.
A solution has been proposed in Bengio et al. (2013b), where the need to marginalize latent variables is avoided altogether.
33
3.2 Deep Networks for Unsupervised or Generative Learning – to delete?
The standard statistical methods used for large-scale speech recognition and understanding combine (shallow) hidden Markov models for speech acoustics with higher layers of structure representing different levels of natural language hierarchy.
This combined hierarchical model can be regarded as a deep generative architecture, whose motivation and some technical detail may be found in Chapter 7 of the recent book on the "Hierarchical HMM" or HHMM.
However, such early deep models lack the aspect of "distributed representation" embodied in the more recent deep generative networks of the DBN and DBM.
34
3.2 Deep Networks for Unsupervised or Generative Learning – to delete?
Temporally recursive generative models based on neural network architectures can be found in (Taylor et al., 2007) for human motion modeling, and related recursive architectures have been developed for natural language and natural scene parsing.
The latter model is interesting in that its learning algorithm is capable of automatically determining the optimal model structure. This contrasts with other deep architectures such as the DBN, where only the parameters are learned while the architecture needs to be pre-defined.
Specifically, the recursive structure commonly found in natural scene images and in natural language sentences can be discovered using a max-margin structure prediction architecture.
The units contained in the images or sentences are identified, and the way in which these units interact with each other to form the whole is also identified.
35
3.3 Deep Networks for Supervised Learning
Many of the discriminative techniques for supervised learning in signal and information processing are shallow architectures such as conditional random fields (CRFs).
A CRF is intrinsically a shallow discriminative architecture, characterized by the linear relationship between the input features and the transition features.
36
3.3 Deep Networks for Supervised Learning - CRF
Deep-structured CRFs have been developed by stacking the output of each lower layer of the CRF, together with the original input data, onto its higher layer (a minimal stacking sketch follows below).
Various versions of deep-structured CRFs have been successfully applied to phone recognition, spoken language identification, and natural language processing.
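A minimal sketch of the stacking principle just described: each higher layer receives the original input concatenated with the lower layer's output. The per-layer CRF is replaced by a generic softmax stand-in, since the point here is only the input augmentation.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(features, n_labels=4):
    """Hypothetical per-layer model: returns label posteriors per frame."""
    logits = features @ rng.normal(scale=0.1, size=(features.shape[1], n_labels))
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

x = rng.random((50, 13))                 # e.g., 50 frames of 13-dim features
layer_input = x
for _ in range(3):                       # three stacked layers
    posteriors = predict(layer_input)
    # The higher layer sees the original input plus the lower layer's output.
    layer_input = np.concatenate([x, posteriors], axis=1)
```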
37
3.3 Deep Networks for Supervised Learning – deep-structured CRF
3.3 Deep Networks for Supervised Learning – Multilayer perceptron (MLP)
Morgan (2012) gives an excellent review of other major discriminative models in speech recognition based mainly on the traditional neural network or MLP architecture, using backpropagation learning with random initialization.
The review argues for the importance of both the increased width of each layer of the neural networks and the increased depth.
In particular, a class of DNN models forms the basis of the popular "tandem" approach (Morgan et al., 2005), where the output of the discriminatively learned neural network is treated as part of the observation variable in HMMs.
38
3.3 Deep Networks for Supervised Learning – Deep Stacking Network (DSN)
In more recent work, a discriminative deep architecture sometimes called the Deep Stacking Network (DSN), together with its tensor variant and its kernel version, has been developed; all of these focus on discrimination with scalable, parallelizable learning and rely on little or no generative component.
We will describe this type of discriminative deep architecture in detail in Chapter 6.
39
3.3 Deep Networks for Supervised Learning – Recurrent Neural Network (RNN)
RNNs can also be used as a discriminative model, where the output is a label sequence associated with the input data sequence.
40
3.3 Deep Networks for Supervised Learning – Recurrent Neural Network (RNN)
More recent techniques enable the RNNs themselves to perform sequence classification while embedding long short-term memory (LSTM) cells into the model.
Underlying this method is the idea of interpreting the RNN outputs as conditional distributions over all possible label sequences given the input sequences.
A differentiable objective function can then be derived to optimize these conditional distributions over the correct label sequences.
The effectiveness of this method has been demonstrated in handwriting recognition tasks and in a small speech task (Graves et al., 2013, 2013a), to be discussed in more detail in Chapter 7 of this book.
41
Another popular type of discriminative deep architecture is the convolutional neural network (CNN), in which each module consists of a convolutional layer and a pooling layer.
42
3.3 Deep Networks for Supervised Learning – convolutional neural network (CNN)
The weight sharing in the convolutional layer, together with appropriately chosen pooling schemes, endows the CNN with some "invariance" properties (e.g., translation invariance); a minimal one-dimensional sketch follows below.
The CNN has been found highly effective and is commonly used in computer vision and image recognition, and more recently it has also been found effective for speech recognition.
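A minimal one-dimensional sketch of a single CNN module as described above (shared-weight convolution followed by max pooling); the kernel values, pooling size, and signal are illustrative only.

```python
import numpy as np

def conv1d(x, kernel):
    """Valid 1-D convolution with a single shared kernel (weight sharing)."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool(x, size=2):
    """Non-overlapping max pooling: subsamples and adds translation tolerance."""
    trimmed = x[: (len(x) // size) * size]
    return trimmed.reshape(-1, size).max(axis=1)

signal = np.array([0.0, 1.0, 0.0, 0.0, 2.0, 1.0, 0.0, 0.0])
kernel = np.array([1.0, -1.0, 0.5])
features = max_pool(np.maximum(conv1d(signal, kernel), 0.0))  # conv -> ReLU -> pool
```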
43
3.3 Deep Networks for Supervised Learning – time-delay neural network (TDNN)
The time-delay neural network (TDNN) developed for early speech recognition is a special case and predecessor of the CNN, in which weight sharing is limited to one of the two dimensions (the time dimension) and there is no pooling layer.
It was discovered only recently that the time-dimension invariance is less important than the frequency-dimension invariance for speech recognition.
A new strategy for designing the CNN's pooling layer has been demonstrated to be more effective than all previous CNNs in phone recognition.
44
3.4 Hybrid Deep Networks – two viewpoints
The term "hybrid" for this third category refers to deep architectures that either comprise or make use of both generative and discriminative model components.
In existing hybrid architectures, the generative component is mostly exploited to help with discrimination, which is the final goal of the hybrid architecture.
How and why generative modeling can help with discrimination can be examined from two viewpoints:
1. The optimization viewpoint: generative models trained in an unsupervised fashion can provide excellent initialization points in highly nonlinear parameter estimation problems (hence the commonly used term "pre-training").
2. The regularization viewpoint: the unsupervised-learning models can effectively provide a prior on the set of functions representable by the model.
45
3.4 Hybrid Deep Networks – DBN-DNN model
A prominent example is the DBN-DNN model: a generative DBN is used to initialize a DNN of the same network structure, which is then further discriminatively trained or fine-tuned using the target labels provided (a minimal fine-tuning sketch follows below).
This can be viewed as a hybrid generative-discriminative model, where the model trained using unsupervised data helps to make the discriminative model effective for supervised learning.
We will review details of the DNN in the context of RBM/DBN generative, unsupervised pre-training in Chapter 5.
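A minimal sketch of the DBN-DNN idea: weights assumed to come from unsupervised pre-training (stood in here by random matrices) initialize a small network, which is then fine-tuned discriminatively with labeled data by gradient descent on cross-entropy. All sizes and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Pretend these came from stacked-RBM pre-training.
W1 = rng.normal(scale=0.1, size=(20, 10))
W2 = rng.normal(scale=0.1, size=(10, 3))        # 3 target classes

X = rng.random((32, 20))                        # labeled mini-batch
y = rng.integers(0, 3, size=32)
Y = np.eye(3)[y]

lr = 0.1
for _ in range(50):                             # discriminative fine-tuning
    H = sigmoid(X @ W1)
    logits = H @ W2
    P = np.exp(logits - logits.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    dlogits = (P - Y) / len(X)                  # softmax cross-entropy gradient
    dW2 = H.T @ dlogits
    dH = dlogits @ W2.T
    dW1 = X.T @ (dH * H * (1 - H))
    W1 -= lr * dW1
    W2 -= lr * dW2
```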
46
3.4 Hybrid Deep Networks – DNN-CRF and DNN-HMM
Another example of a hybrid deep network is one where the DNN weights are also initialized from a generative DBN but are further fine-tuned with a sequence-level discriminative criterion, namely the conditional probability of the label sequence given the input feature sequence, instead of the frame-level cross-entropy criterion commonly used.
This can be viewed as a combination of the static DNN with the shallow discriminative architecture of the CRF. It can be shown that such a DNN-CRF is equivalent to a hybrid deep architecture of DNN and HMM whose parameters are learned jointly using the full-sequence maximum mutual information (MMI) criterion between the entire label sequence and the input feature sequence.
47
3.4 Hybrid Deep Networks - RBM
Another hybrid example is an RBM trained with the discriminative criterion of posterior class-label probabilities.
Here, the label vector is concatenated with the input data vector to form the combined visible layer in the RBM.
In this way, the authors derived a discriminative learning algorithm for the RBM as a shallow generative model.
48
3.4 Hybrid Deep Networks – pretraining of CNNs
A further example of hybrid deep networks is the use of generative models of DBNs to pre-train deep convolutional neural networks (Lee et al., 2009, 2010, 2011).
Like the fully connected DNN discussed earlier, pre-training helps to improve the performance of deep CNNs over random initialization.
Pre-training with regularized deep autoencoders (Bengio et al., 2013a), including denoising autoencoders, contractive autoencoders, and sparse autoencoders, is also a similar example of this category of hybrid deep networks.
49
3.4 Hybrid Deep Networks – two-stage architecture
The final example of a hybrid deep network is based on the idea and work of (Ney, 1999; He and Deng, 2011), where one task of discrimination (e.g., speech recognition) produces the output (text) that serves as the input to a second task of discrimination (e.g., machine translation).
The overall system, providing the functionality of speech translation, is a two-stage deep architecture consisting of both generative and discriminative elements.
Both models of speech recognition (e.g., the HMM) and of machine translation (e.g., phrasal mapping and non-monotonic alignment) are generative in nature, but their parameters are all learned for discrimination of the ultimate translated text given the speech data.
50
Question
The text says that these models are easier to embed domain knowledge into and easier to handle uncertainty with, but this part is not entirely clear to me. Could you explain it in more detail, with an example?
"On the other hand, the deep unsupervised-learning models, especially the probabilistic generative ones, are easier to interpret, easier to embed domain knowledge, easier to compose, and easier to handle uncertainty, but they are typically intractable in inference and learning for complex systems."
51
Answer
1. G. Alain and Y. Bengio. What regularized auto-encoders learn from the data generating distribution. In Proceedings of the International Conference on Learning Representations (ICLR), 2013.
2. Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
3. Y. Bengio, E. Thibodeau-Laufer, G. Alain, and J. Yosinski. Deep generative stochastic networks trainable by backprop. arXiv:1306.1091, 2013; also accepted to appear in Proceedings of the International Conference on Machine Learning (ICML), 2014.
52
Answer
On the other hand, the deep unsupervised-learning models, especially the probabilistic generative ones, are easier to interpret, easier to embed domain knowledge, easier to compose, and easier to handle uncertainty, but they are typically intractable in inference and learning for complex systems.
For example, when training, the number of layers of the autoencoder can be fixed to 10.
The quoted passage says the unsupervised "learning models, especially the probabilistic generative ones, are ... easier to handle uncertainty." Since they are probabilistic models, one may reason that uncertainty (i.e., probability) is easy for them to handle; we do not know the definitive answer.
53
Question
54
Answer
References
H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2011.
R. Gens and P. Domingos. Discriminative learning of sum-product networks. In Neural Information Processing Systems (NIPS), 2012.
P. Domingos, talk on sum-product networks (YouTube) (PDF)
55
Answer: Basic Structure
56
[Figure: basic structure and deep structure, with forward and backward passes]
Answer: Architecture with an Example
57
Question
When inferring a future sequence from samples, I think incorrectly predicted results can accumulate along the way and eventually cause the entire sequence to be inferred wrongly. Is there a mechanism that can address this problem?
58
Answer
Teacher forcing can be used (Goodfellow et al. 2016): the next state is predicted using the current ground-truth value rather than the current predicted value.
59
[Figure: a vanilla RNN vs. an RNN with teacher forcing]
Goodfellow et al. Deep Learning. The MIT Press. 2016
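A minimal numpy sketch contrasting free-running generation with teacher forcing as described in the answer above: with teacher forcing, the ground-truth sample drives the next state, so early prediction errors cannot accumulate during training. The scalar weights and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
Wx, Wh, Wy = 0.8, 0.5, 1.2           # scalar "RNN" parameters for readability
x = rng.random(10)                    # ground-truth sequence

def step(inp, h):
    h = np.tanh(Wx * inp + Wh * h)
    return h, Wy * h                  # new state and predicted next value

h_free, h_tf = 0.0, 0.0
pred_free, pred_tf = 0.0, 0.0
for t in range(len(x) - 1):
    # Free running: feed back the model's own (possibly wrong) prediction.
    h_free, pred_free = step(pred_free, h_free)
    # Teacher forcing: always feed the ground-truth sample x[t].
    h_tf, pred_tf = step(x[t], h_tf)
```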
Question & Answer
Question: … we can also see cases where methods such as HMMs are used instead. There are cases where RNN-based variants do not match the performance of HMM-based models; what do the presenters think about this?
Answer: We are not aware of the specific cases being referred to; if we knew them, a more concrete discussion would have been possible. A common point of the two models is that prediction makes use of previous states. However, the HMM assumes that the current state summarizes all previous states (the Markov property). The RNN is weak at long-term dependencies and has no Markov property, so it has to be made to capture long-term dependencies well; because that is difficult, we think the RNN sometimes fails to reach better performance. In contrast, an HMM is easier than an RNN to design so that it captures long-term dependencies, and we think this is why the HMM can perform better in such cases.
60
Question & Answer
Question: … I would like to know what this means.
Answer: We could not find what the "… parameters" mentioned in the question mean, and we do not know their intended use, so it is hard to grasp the intent of the question. We wonder whether the time-varying HMM introduced in Wang et al. 2009 is what is meant.
61
Question & Answer
Question: The text says that the CRF is stacked into a deep structure (p. 31). How does this stacked structure differ from a DNN structure? As the simplest comparison, how does a 1-layer DNN differ from a CRF?
62
[Figure: example structures of a CRF, a deep-structured CRF, and a DNN]
Question & Answer
Answer: … we think this cannot be learned in that way. However, as in the hybrid models, it is possible to apply unsupervised learning methods to the pretraining of a discriminative model.
63
Question & Answer
Question: … an example of using an autoencoder was presented; then, how are the representations learned by the deep autoencoder used? I am curious what the respective roles of the deep autoencoder and the pre-trained DNN/CNN would be.
Answer: It means that the features extracted by the deep autoencoder are used as the input to the DNN/CNN. In this case, the encoder of the deep autoencoder plays the role of a feature extractor, and the DNN/CNN plays the role of a classifier (a minimal pipeline sketch follows below).
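A minimal sketch of the pipeline described in this answer: a (pretend) trained encoder maps raw inputs to features, and a separate classifier consumes those features. Both components are illustrative stand-ins rather than trained models.

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(scale=0.1, size=(64, 16))     # "learned" encoder weights
W_clf = rng.normal(scale=0.1, size=(16, 5))      # classifier weights (5 classes)

def encode(x):
    """Deep autoencoder's encoder acting as a feature extractor."""
    return np.tanh(x @ W_enc)

def classify(features):
    """DNN/CNN stand-in acting as the classifier on encoded features."""
    logits = features @ W_clf
    return logits.argmax(axis=1)

x = rng.random((8, 64))           # a toy batch of raw inputs
labels = classify(encode(x))      # encoder features feed the classifier
```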
64
Question & Answer
Question: These preprocessed data seem to have properties similar both to one-dimensional time-series data and to two-dimensional images. How does applying an RNN, a one-dimensional CNN over the time axis (a TDNN), or a two-dimensional CNN to such data differ, and in which situations is each structure best used?
Answer: … we think it is good to use when execution speed matters. And because the two-dimensional CNN considers different time axes simultaneously, we expect its performance to be better than the one-dimensional CNN's.
65
Question & Answer
Question: The text mentions regularization that acts as a prior on the set of functions representable by the model (p. 226). What process does "acting as a prior" refer to (i.e., how does the result of the generative model feed into the discriminative model), and is there evidence that this is an appropriate prior?
Answer: We have not read a paper that proves this is an appropriate prior; empirically, training works better and performance is higher than when the model weights are initialized randomly, and we think that is the justification.
66
Question & Answer
Question: … What are the differences (advantages and disadvantages) compared with the "stochastic gradient descent optimization algorithms" mentioned in the text?
67
Answer: … it is said to find better update directions, but its implementation is harder than SGD's, it is quite unstable, and the amount of computation in the conjugate gradient step is too large.
Question & Answer
Question: … conceptual interpretations to the same word in a speech processing context? If still not, then what do we need to achieve that?
Answer: If … is used, both short-term and long-term context can be taken into account, so we think the problem can be solved.
68
Question & Answer
"Mesnil et al. (2013) and Yao et al. (2013) reported the success of RNNs in spoken language understanding." 라고 합니다. 그렇다면 이 실험에서는 vanilla RNN이 아닌 발전된 RNN이 적용된 것인가요?
을 적용하였습니다. 그렇지만 이 모델들은 long short-term memory (LSTM), gated recurrent unit (GRU)과 달리 gradient vanishing 문제를 해결하기 위해 제안된 모델은 아 닙니다.
69
Question
One method that always comes up when discussing network initialization is weight initialization using an RBM, or a DBN that extends it. However, given that an RBM has a bipartite structure and samples the probability of each node, I do not quite see how this leads to weight initialization. The beginning of Chapter 4 explains the procedure, but is there any statistical significance to weights initialized in this way?
70
Answer
… the weights are multiplied in during sampling, and the point is that the DNN's weights are initialized with these same weights. The statistical significance is that weights initialized in this way provide a good prior for the model.
71
Question & Answer
information"이라고 하고 있습니다. 하지만 여기서 표현하는 "task specific supervision information"을 input 그 자체라고 생각하면 곧 generative model이 되는 것이고, 이렇게 생각할 수 있다면 Unsupervised learning과 Supervised learning의 차이가 없어지는 것입니다. 즉, x와 y가 모두 주어지고 y=f(x)에서 함수 f를 학습시키는것이 감독학습인데, x=f(x)라고 볼 수 있다면 무감독학습이 되는것이죠. 이런 관점에서 본다면 감독학습과 무감독학습의 수학적 차이는 없는것인데, 서로 다른 차이점을 어떻게 설명할 수 있을까요?
learning의 경우에는 레이블이 주어진 상황에서 데이터의 분류나 회귀분석이 목적이라면 unsupervised learning의 목적은 데이터의 레이블이 없는 상황에서 데이터의 hidden structure를 파악하는 것이 목적입니다.
72
Question & Answer
Question: As far as I know, a DBM usually performs inference with a model of fixed size; for speech recognition, how can inference be done when the length of the given utterance is unknown? I am curious how a DBM handles speech recognition.
Answer: … features are extracted, and from these a DBM is used once more to extract higher-level features. If this sequence of features is finally passed through an RNN, we think inference is possible even when the length is unknown.
73
Thank you!
74