Deep Unsupervised Learning
Russ Salakhutdinov
Machine Learning Department Carnegie Mellon University Canadian Institute of Advanced Research
1
Mining for Structure
Massive increase in both computational power and the amount of data available from the web, video cameras, and laboratory measurements.
Goal: discover underlying structure and statistical correlations from data in an unsupervised or semi-supervised way.
Application domains: Images & Video, Relational Data / Social Networks, Speech & Audio, Text & Language, Product Recommendation, Gene Expression, fMRI, Tumor region detection.
2
4
Personal assistants, self-driving cars, etc.
5
6
[Figure: 2-D codes learned for newswire stories from bag-of-words input, with topic clusters labeled Legal/Judicial, Leading Economic Indicators, European Community Monetary/Economic, Accounts/Earnings, Interbank Markets, Government Borrowings, Disasters and Accidents, Energy Markets.]
(Hinton & Salakhutdinov, Science 2006)
7
Unsupervised Learning
Non-probabilistic Models
Ø Sparse Coding
Ø Autoencoders
Ø Others (e.g. k-means)
Probabilistic (Generative) Models
Explicit Density p(x)
Ø Tractable Models: Fully observed Belief Nets, NADE, PixelRNN
Ø Non-Tractable Models: Boltzmann Machines, Variational Autoencoders, Helmholtz Machines, many others…
Implicit Density
Ø Generative Adversarial Networks
Ø Moment Matching Networks
8
Ø Sparse Coding
Ø Autoencoders
Ø Restricted Boltzmann Machines
Ø Deep Boltzmann Machines
Ø Helmholtz Machines / Variational Autoencoders
9
10
Slide Credit: Honglak Lee
11
12
13
Evaluating sparse coding features for object classification (9K images, 101 classes). Pipeline: Input Image → Learned bases → Features (coefficients) → Classification Algorithm (SVM).
Algorithm                          Accuracy
Baseline (Fei-Fei et al., 2004)    16%
PCA                                37%
Sparse Coding                      47%
Lee, Battle, Raina, Ng, 2006
Slide Credit: Honglak Lee
14
Interpreting sparse coding: the input x is mapped to sparse features a by an implicit nonlinear encoding a = f(x) (the result of solving the sparse inference problem), and reconstructed as x' by an explicit linear decoding x' = g(a).
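To make this concrete, here is a minimal NumPy sketch (not the algorithm from the slides) of sparse coding inference with ISTA: given a fixed dictionary D, it finds sparse coefficients a that reconstruct x, i.e. it plays the role of the implicit nonlinear encoding f(x), while decoding is the explicit linear map g(a) = Da. The dictionary, step count, and sparsity penalty are illustrative placeholders.

import numpy as np

def soft_threshold(z, t):
    # Elementwise shrinkage operator used by ISTA.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_encode_ista(x, D, lam=0.1, n_steps=100):
    """Minimize 0.5*||x - D a||^2 + lam*||a||_1 over the code a (ISTA)."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the smooth term's gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_steps):
        grad = D.T @ (D @ a - x)           # gradient of the reconstruction term
        a = soft_threshold(a - grad / L, lam / L)
    return a

rng = np.random.default_rng(0)
D = rng.normal(size=(64, 256))             # illustrative dictionary: 64-d patches, 256 atoms
D /= np.linalg.norm(D, axis=0)             # unit-norm columns
x = rng.normal(size=64)                    # a "patch" to encode
a = sparse_encode_ista(x, D)               # implicit nonlinear encoding f(x)
x_hat = D @ a                              # explicit linear decoding g(a)
print("nonzero coefficients:", np.count_nonzero(a))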
15
16
17
Autoencoder: the input image x is mapped to binary features z = σ(Wx) by the encoder, and the decoder reconstructs the image as Dz.
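A minimal PyTorch sketch of this kind of autoencoder (sigmoid encoder z = σ(Wx), linear decoder Dz, squared-error reconstruction loss); the layer sizes, optimizer, and random data are placeholders, not settings from the slides.

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, n_in=784, n_hid=256):
        super().__init__()
        self.W = nn.Linear(n_in, n_hid)    # encoder weights W
        self.D = nn.Linear(n_hid, n_in)    # decoder weights D
    def forward(self, x):
        z = torch.sigmoid(self.W(x))       # (approximately binary) features z = sigma(Wx)
        return self.D(z), z                # linear reconstruction Dz

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(128, 784)                   # stand-in for a batch of image vectors
for _ in range(10):
    x_hat, z = model(x)
    loss = ((x_hat - x) ** 2).mean()       # squared reconstruction error
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())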
18
If the hidden and output layers are linear, the autoencoder will learn hidden units that are a linear function of the data and minimize the squared reconstruction error: the k hidden units will span the same space as the first k principal components, although the learned weight vectors may not be orthogonal. Here the input image x is mapped to linear features z = Wx and reconstructed by a linear decoder.
20
21
At training time
Kavukcuoglu, Ranzato, Fergus, LeCun, 2009
22
[Figure: stacked autoencoders — Input x → Features → Features → Class Labels; features are learned layer by layer from the input, with a classifier predicting class labels on top.]
24
[Figure: training a deep autoencoder (architecture 2000-1000-500-30). Pretraining: a stack of RBMs is learned one layer at a time. Unrolling: the stacked RBMs are unrolled into an encoder and a mirror-image decoder with tied weights (W and their transposes), with a 30-dimensional code layer in the middle. Finetuning: the whole encoder-decoder network is then fine-tuned with backpropagation.]
25
A deep autoencoder (2000-1000-500-30) was used to extract 30-D real-valued codes for Olivetti face patches.
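A sketch of the unrolled architecture in PyTorch with the layer sizes mentioned above. For simplicity it is trained end-to-end with backprop on random data, rather than with the layer-wise RBM pretraining described in the figure, so it illustrates the encoder-decoder shape rather than reproducing the experiment.

import torch
import torch.nn as nn

def mlp(sizes):
    # Stack of fully connected layers with sigmoid nonlinearities, linear output.
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.Sigmoid())
    return nn.Sequential(*layers)

encoder = mlp([625, 2000, 1000, 500, 30])       # 30-D real-valued code layer
decoder = mlp([30, 500, 1000, 2000, 625])       # mirror-image decoder
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.rand(64, 625)                         # stand-in for 25x25 face patches
for _ in range(5):
    code = encoder(x)
    x_hat = decoder(code)
    loss = ((x_hat - x) ** 2).mean()            # reconstruction error (fine-tuning objective)
    opt.zero_grad(); loss.backward(); opt.step()
print(code.shape, loss.item())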
26
The Reuters corpus of 804,414 newswire stories (randomly split into 402,207 training and 402,207 test) is used; each story is represented as a bag-of-words vector containing the counts of the 2000 most frequently used words in the training set.
[Figure: documents projected into a 2-D LSA space, compared with 2-D codes produced by the autoencoder; clusters labeled Legal/Judicial, Leading Economic Indicators, European Community Monetary/Economic, Accounts/Earnings, Interbank Markets, Government Borrowings, Disasters and Accidents, Energy Markets.]
(Hinton and Salakhutdinov, Science 2006)
27
Semantic hashing: learn a hashing function that maps each document to a point in a binary address space such that semantically similar documents are stored at nearby addresses. Similar documents can then be retrieved by examining the addresses that differ from the query address in only a few bits, with no search at all.
[Figure: semantic hashing function mapping documents into the document address space; nearby addresses hold semantically similar documents from clusters such as Accounts/Earnings, Government Borrowing, European Community Monetary/Economic, Disasters and Accidents, Energy Markets.]
(Salakhutdinov and Hinton, SIGIR 2007)
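A small NumPy sketch of the retrieval idea (not the SIGIR 2007 implementation): documents are stored in a hash table keyed by their binary code, and a query retrieves documents whose codes differ by at most one bit, with no search over the collection. The 20-bit codes here are random placeholders for codes produced by a trained autoencoder.

import numpy as np
from collections import defaultdict

n_bits = 20
rng = np.random.default_rng(0)
codes = rng.integers(0, 2, size=(100_000, n_bits))   # placeholder binary codes for 100k documents

# Build the document address space: address -> list of document ids.
table = defaultdict(list)
for doc_id, code in enumerate(codes):
    table[code.tobytes()].append(doc_id)

def retrieve(query_code, max_flips=1):
    """Return documents whose code is within Hamming distance max_flips of the query."""
    hits = list(table[query_code.tobytes()])
    if max_flips >= 1:
        for bit in range(n_bits):
            neighbor = query_code.copy()
            neighbor[bit] ^= 1                        # flip one bit -> a nearby address
            hits.extend(table[neighbor.tobytes()])
    return hits

print(len(retrieve(codes[0])))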
28
29
Ø Sparse Coding
Ø Autoencoders
Ø Restricted Boltzmann Machines
Ø Deep Boltzmann Machines
Ø Helmholtz Machines / Variational Autoencoders
30
Fully observed (autoregressive) models factorize the joint distribution with the chain rule:
p(x) = p(x_1) ∏_{i=2}^{n} p(x_i | x_1, …, x_{i−1})
Each conditional can be a complicated neural network (a minimal sketch follows the list below).
Ø NADE, RNADE (Larochelle et al., 2011)
Ø Pixel CNN (van den Oord et al., 2016)
Ø Pixel RNN (van den Oord et al., 2016)
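An illustrative NumPy sketch of the chain-rule factorization: each conditional p(x_i | x_<i) is modeled by a tiny logistic regression on the previous pixels. The random weights stand in for a trained NADE/PixelRNN-style network; this is a toy illustration of the factorization, not any of the models cited above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n = 28 * 28                                  # number of binary "pixels"
W = rng.normal(scale=0.01, size=(n, n))      # W[i, :i] parameterizes p(x_i | x_<i)
b = np.zeros(n)

def log_likelihood(x):
    """log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1}) for a binary vector x."""
    total = 0.0
    for i in range(n):
        p_i = sigmoid(W[i, :i] @ x[:i] + b[i])      # conditional Bernoulli probability
        total += np.log(p_i if x[i] == 1 else 1.0 - p_i)
    return total

x = rng.integers(0, 2, size=n)
print(log_likelihood(x))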
31
Restricted Boltzmann Machine: stochastic binary hidden variables are connected to stochastic binary visible variables (e.g. the pixels of an image). The joint distribution is defined by pair-wise visible-hidden terms and unary bias terms; RBMs are a special case of Markov random fields, Boltzmann machines, and log-linear models.
32
[Figure: observed data — a subset of 25,000 handwritten characters — and the learned weights W: a subset of 1000 features that look like “edges”. A new image is represented as a sparse combination of these learned features; the logistic function makes the model suitable for modeling binary images.]
33
Model learning: given a set of i.i.d. training examples (e.g. images over the visible variables), we want to learn the model parameters by maximizing the log-likelihood objective. The derivative of the log-likelihood involves an expectation under the model over all visible and hidden configurations, which is difficult to compute: there are exponentially many configurations.
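Because the exact gradient requires that intractable model expectation, it is usually approximated in practice. Below is a minimal NumPy sketch of one common approximation, CD-1 contrastive divergence, for a binary RBM; the sizes, learning rate, and random data are placeholders.

import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid, lr = 784, 500, 0.05
W = 0.01 * rng.normal(size=(n_vis, n_hid))
a, b = np.zeros(n_vis), np.zeros(n_hid)      # visible and hidden biases

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(v0):
    global W, a, b
    # Positive phase: hidden probabilities and samples given the data.
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one step of Gibbs sampling (reconstruction).
    pv1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # Approximate gradient: data-dependent statistics minus reconstruction statistics.
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / v0.shape[0]
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)

v_batch = (rng.random((64, n_vis)) < 0.5).astype(float)   # stand-in for binary images
for _ in range(10):
    cd1_update(v_batch)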
34
Learned features: ``topics''
Reuters dataset: 804,414 unlabeled newswire stories, bag-of-words representation. Most probable words for a few of the learned features:
  russian russia moscow yeltsin soviet
  clinton house president bill congress
  computer system product software develop
  trade country import world economy
  stock wall street point dow
[Figure: learned features (a subset out of 10,000) from a model trained on 4 million unlabelled images.]
35
36
Learned features: ``genre''
Netflix dataset: 480,189 users, 17,770 movies, over 100 million ratings. Multinomial visible units encode user ratings; binary hidden units encode user preferences (Salakhutdinov, Mnih, Hinton, ICML 2007). Movies most strongly associated with a few learned hidden features:
  Fahrenheit 9/11, Bowling for Columbine, The People vs. Larry Flynt, Canadian Bacon, La Dolce Vita
  Independence Day, The Day After Tomorrow, Con Air, Men in Black II, Men in Black
  Friday the 13th, The Texas Chainsaw Massacre, Children of the Corn, Child's Play, The Return of Michael Myers
  Scary Movie, Naked Gun, Hot Shots!, American Pie, Police Academy
37
Marginalizing over the hidden variables gives a Product of Experts: the joint distribution over words is a product of contributions from the learned topics. For example, the topics “government”, ”corruption” and ”mafia” can combine to give very high probability to the word “Silvio Berlusconi”. Most probable words for some learned topics:
  government authority power empire federation
  clinton house president bill congress
  bribery corruption dishonesty corrupt fraud
  mafia business gang mob insider
  stock wall street point dow …
38
[Figure: precision (%) versus recall (%) for document retrieval, comparing 50-D Replicated Softmax codes with 50-D LDA.]
39
Local vs. Distributed Representations
Ø Local methods (nearest neighbors, RBF SVMs, local density estimators): learned prototypes partition the input space into local regions, with separate parameters for each region, so the number of distinguishable regions grows only linearly with the number of parameters.
Ø Distributed methods (PCA, sparse coding, deep models): each parameter affects many regions of the input space, not just local ones, so the number of distinguishable regions can grow exponentially in the number of parameters.
[Figure: input space partitioned by binary features C1, C2, C3 into regions such as C1=1, C2=0, C3=0.]
Bengio, 2009, Foundations and Trends in Machine Learning
41
Ø Sparse Coding
Ø Autoencoders
Ø Restricted Boltzmann Machines
Ø Deep Boltzmann Machines
Ø Helmholtz Machines / Variational Autoencoders
42
Image model built from unlabeled inputs: Input: Pixels → Low-level features: edges → Higher-level features: combinations of edges. Learn simpler representations first, then compose more complex ones.
(Salakhutdinov 2008, Salakhutdinov & Hinton 2012)
44
Deep Boltzmann Machine: the input v and hidden layers h1, h2, h3 are coupled by model parameters W1, W2, W3. The bottom layer is the same as in an RBM, but inference for each hidden layer combines both bottom-up input from the layer below and top-down input from the layer above.
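A small NumPy sketch that makes the "bottom-up + top-down" point concrete: in a simplified mean-field inference loop for a three-hidden-layer model, the update for the middle layers combines input from the layer below and the layer above. The layer sizes and random weights are illustrative only, not the trained models from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_v, n_h1, n_h2, n_h3 = 784, 500, 500, 1000
W1 = 0.01 * rng.normal(size=(n_v, n_h1))
W2 = 0.01 * rng.normal(size=(n_h1, n_h2))
W3 = 0.01 * rng.normal(size=(n_h2, n_h3))

v = (rng.random(n_v) < 0.5).astype(float)       # observed input
mu1, mu2, mu3 = np.full(n_h1, 0.5), np.full(n_h2, 0.5), np.full(n_h3, 0.5)

for _ in range(25):                              # mean-field fixed-point iterations
    mu1 = sigmoid(v @ W1 + mu2 @ W2.T)           # bottom-up from v, top-down from h2
    mu2 = sigmoid(mu1 @ W2 + mu3 @ W3.T)         # bottom-up from h1, top-down from h3
    mu3 = sigmoid(mu2 @ W3)                      # top layer: bottom-up only

print(mu1[:5])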
45
46
[Figure: a deep model (v, h1, h2, h3 with weights W1, W2, W3) trained on face images; lower hidden layers learn object parts and higher layers learn groups of parts.]
(Lee, Grosse, Ranganath, Ng, ICML 2009)
47
(Lee, Grosse, Ranganath, Ng, ICML 2009)
48
Ø Sparse Coding
Ø Autoencoders
Ø Restricted Boltzmann Machines
Ø Deep Boltzmann Machines
Ø Helmholtz Machines / Variational Autoencoders
49
Helmholtz Machine: a generative process maps the top hidden layer down through h3 → h2 → h1 → v (the input data) using weights W3, W2, W1, while a separate recognition model runs in the opposite direction to perform approximate inference.
50
[Figure: side-by-side comparison of a Helmholtz Machine and a Deep Boltzmann Machine, both with layers v, h1, h2, h3 and weights W1, W2, W3; the Helmholtz machine is directed, with a separate approximate-inference (recognition) network, whereas the Deep Boltzmann Machine is undirected.]
51
Generative Process: the model generates the input data v by ancestral sampling through a cascade of stochastic layers h3 → h2 → h1 → v (parameters W3, W2, W1). Each conditional term may denote a complicated nonlinear relationship, and sampling and probability evaluation is tractable for each conditional.
52
Deterministic layers can be stacked between the stochastic layers; each such term then denotes a one-layer neural net, and sampling and probability evaluation remains tractable for each conditional of the stochastic layers.
53
A recognition network performs approximate inference, mapping the input data v up through h1, h2, h3 (weights W1, W2, W3). The difficulty: gradients of the variational bound are hard to estimate with respect to the recognition network (high-variance).
54
Each conditional of the recognition model is a Gaussian with mean and covariance computed from the state of the hidden units at the previous layer.
55
Reparameterization: the recognition distribution can be expressed as a deterministic encoder of the input plus noise, where the distribution of the noise does not depend on the model parameters.
56
Gradients can then be computed by backprop: for fixed noise, the mapping from the input to the hidden units h is a deterministic neural net, so the model trains like an autoencoder.
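A compact PyTorch sketch of the reparameterization idea for a single stochastic layer (a standard VAE, simplified from the multi-layer model in the slides): the recognition network outputs a Gaussian mean and variance, a sample is written as mean + std * eps, and gradients flow through this deterministic mapping by backprop. The layer sizes, Bernoulli likelihood, and random data are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, n_in=784, n_hid=200, n_z=20):
        super().__init__()
        self.enc = nn.Linear(n_in, n_hid)
        self.mu, self.logvar = nn.Linear(n_hid, n_z), nn.Linear(n_hid, n_z)
        self.dec = nn.Sequential(nn.Linear(n_z, n_hid), nn.Tanh(), nn.Linear(n_hid, n_in))

    def forward(self, x):
        h = torch.tanh(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)                    # noise whose distribution is parameter-free
        z = mu + torch.exp(0.5 * logvar) * eps        # reparameterized sample
        logits = self.dec(z)
        # Negative variational bound: reconstruction term + KL(q(z|x) || N(0, I)).
        rec = F.binary_cross_entropy_with_logits(logits, x, reduction="sum")
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return (rec + kl) / x.shape[0]

model = VAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(128, 784).bernoulli()                  # stand-in for binary images
for _ in range(10):
    loss = model(x)
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())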
57
Importance Weighted Autoencoders (same layered model: v, h1, h2, h3 with W1, W2, W3): draw several samples from the recognition network and average their unnormalized importance weights, giving a tighter lower bound on the log-likelihood.
Burda, Grosse, Salakhutdinov, 2015
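A short sketch of how such a k-sample bound can be computed from unnormalized importance weights, assuming a trained recognition and generative model are available to supply them: average the weights inside the log, implemented with a log-sum-exp for numerical stability. The random log-weights here are placeholders.

import math
import torch

def iwae_bound(log_w):
    """k-sample importance-weighted bound from log unnormalized importance weights.
    log_w[k, batch] holds log p(x, h_i) - log q(h_i | x) for samples h_i ~ q(h | x)."""
    k = log_w.shape[0]
    return torch.logsumexp(log_w, dim=0) - math.log(k)

# With k = 1 this is the usual single-sample variational bound; larger k tightens it.
log_w = torch.randn(5, 8)      # placeholder log-weights: k = 5 samples, batch of 8
print(iwae_bound(log_w))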
58
Stochastic Layer
Gregor et al., 2015 (Mansimov, Parisotto, Ba, Salakhutdinov, 2015)
59
A stop sign is flying in blue skies
A pale yellow school bus is flying in blue skies
A herd of elephants is flying in blue skies
A large commercial airplane is flying in blue skies
(Mansimov, Parisotto, Ba, Salakhutdinov, 2015)
60
A yellow school bus parked in the parking lot
A red school bus parked in the parking lot
A green school bus parked in the parking lot
A blue school bus parked in the parking lot
(Mansimov, Parisotto, Ba, Salakhutdinov, 2015)
61
62
A toilet seat sits open in the bathroom (Ask Google?)
A toilet seat sits open in the grass field
63
Ø Sparse Coding
Ø Autoencoders
Ø Restricted Boltzmann Machines
Ø Deep Boltzmann Machines
Ø Helmholtz Machines / Variational Autoencoders
64
Implicit density models: instead of defining an explicit density p(x), we only need to be able to sample from it.
65
A game between two networks:
Ø Discriminator D: receives a sample from the data distribution and a sample from the generator G, and tries to tell which is real.
Ø Generator G: trained to produce samples that are hard for D to distinguish from the real data.
“Generative Adversarial Networks”, Goodfellow et al., NIPS 2014
66
Slide Credit: Ian Goodfellow
67
Slide Credit: Ian Goodfellow
68
Slide Credit: Ian Goodfellow
69
min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]
Discriminator: classify real data as being real and generator samples as being fake (pushes V up). Generator: generate samples that D would classify as real (pushes V down).
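A minimal PyTorch sketch of this minimax game on toy 2-D data (not the DCGAN setup shown on the following slides): the discriminator maximizes log D(x) + log(1 − D(G(z))), and the generator here uses the common non-saturating variant, maximizing log D(G(z)). Network sizes, data distribution, and optimizers are illustrative assumptions.

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))          # generator: z -> x
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))          # discriminator: x -> logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(200):
    x_real = torch.randn(64, 2) + torch.tensor([2.0, 2.0])                # toy data distribution
    z = torch.randn(64, 8)                                                # prior samples p_z(z)

    # Discriminator step: classify real data as real, generator samples as fake.
    d_loss = bce(D(x_real), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: produce samples that D classifies as real (non-saturating loss).
    g_loss = bce(D(G(z)), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(d_loss.item(), g_loss.item())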
70
(Radford et al 2015)
71
(Radford et al 2015)
72
(Salimans et al., 2016)
Training Samples
73
(Salimans et al., 2016)
Training Samples
74
Slide Credit: Ian Goodfellow
75