  1. Deep learning 3.3. Linear separability and feature design. François Fleuret, https://fleuret.org/dlc/, Dec 20, 2020.

  2. The main weakness of linear predictors is their lack of capacity. For classification, the populations have to be linearly separable. The classical counter-example is "xor", where no line separates the two classes.

  8. The xor example can be solved by pre-processing the data to make the two populations linearly separable, with the feature map

Φ : (x_u, x_v) ↦ (x_u, x_v, x_u x_v).

The four points (0, 0), (0, 1), (1, 0), (1, 1) are mapped to (0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1), and in this three-dimensional space the two classes become linearly separable.
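A minimal numpy sketch (not from the slides) of this lifting: with Φ as above, the linear function x_u + x_v − 2 x_u x_v takes value 0 on one class and 1 on the other, so a plane separates the mapped points.

```python
import numpy as np

# The four xor inputs and their labels: xor(u, v) = u ^ v.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

# Feature map Phi : (x_u, x_v) -> (x_u, x_v, x_u * x_v).
Phi = np.column_stack([X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])

# In the lifted space the linear function x_u + x_v - 2 x_u x_v
# equals the xor label exactly, so w = (1, 1, -2) separates the classes.
w = np.array([1.0, 1.0, -2.0])
pred = (Phi @ w > 0.5).astype(int)
print(pred)  # [0 1 1 0], matching y
```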

  12. [Diagram: a perceptron with a feature map, y = σ(w · Φ(x) + b).]
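The slide's diagram can be sketched directly; this is an illustrative reading of it (the threshold non-linearity and the xor feature map are assumptions consistent with the earlier slides):

```python
import numpy as np

def sigma(z):
    # Heaviside-style threshold, the classical perceptron non-linearity.
    return int(z > 0)

def phi(x):
    # Feature map from the xor example: append the product term.
    return np.array([x[0], x[1], x[0] * x[1]])

def perceptron(x, w, b):
    # The diagram's pipeline: x -> Phi -> dot with w -> + b -> sigma -> y.
    return sigma(w @ phi(x) + b)

w = np.array([1.0, 1.0, -2.0])  # separating weights for xor in the lifted space
b = -0.5
print([perceptron(np.array(p, dtype=float), w, b)
       for p in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```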

  13. This is similar to polynomial regression. If we have Φ : x ↦ (1, x, x², ..., x^D) and α = (α_0, ..., α_D), then

sum_{d=0}^{D} α_d x^d = α · Φ(x).

By increasing D, we can approximate any continuous real function on a compact space (Stone-Weierstrass theorem). It means that we can make the capacity as high as we want.
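Since the model is linear in α even though it is nonlinear in x, it can be fitted by ordinary least squares on the mapped features. A small sketch under assumed data (noisy samples of cos, degree D = 3, both arbitrary choices):

```python
import numpy as np

# Fit a cubic to noisy samples of cos(x) by linear least squares on
# the feature map Phi : x -> (1, x, x^2, x^3).
rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=100)
y = np.cos(x) + 0.05 * rng.normal(size=100)

D = 3
Phi = np.stack([x**d for d in range(D + 1)], axis=1)  # design matrix, (100, D+1)
alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)       # alpha . Phi(x) ~ y

# The predictor is linear in alpha, nonlinear in x.
residual = np.mean((Phi @ alpha - y) ** 2)
print(residual)  # small: the cubic captures cos on [-2, 2] well
```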

  15. We can apply the same to a more realistic binary classification problem: MNIST's "8" vs. the other classes with a perceptron. The original 28 × 28 features are supplemented with the products of pairs of features taken at random.

[Plot: train and test error (%) against the number of features, from 10³ to 10⁴.]
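The feature augmentation can be sketched as follows. This is an assumed reading of "products of pairs of features taken at random"; synthetic data stands in for MNIST to keep the sketch self-contained:

```python
import numpy as np

def add_random_products(X, n_products, rng):
    # Supplement the original features with products of feature pairs
    # taken at random. X is (n_samples, n_features); the result is
    # (n_samples, n_features + n_products).
    n = X.shape[1]
    i = rng.integers(0, n, size=n_products)
    j = rng.integers(0, n, size=n_products)
    return np.concatenate([X, X[:, i] * X[:, j]], axis=1)

# Synthetic stand-in for the 28 x 28 = 784 MNIST pixel features.
rng = np.random.default_rng(0)
X = rng.random((5, 784))
X_aug = add_random_products(X, 1000, rng)
print(X_aug.shape)  # (5, 1784)
```

A linear model (e.g. the perceptron of the previous slides) is then trained on X_aug instead of X.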

  17. Remember the bias-variance tradeoff:

E((Y − y)²) = (E(Y) − y)² + V(Y),

where the first term on the right is the (squared) bias and the second is the variance. The right class of models reduces the bias more and increases the variance less. Besides increasing capacity to reduce the bias, "feature design" may also be a way of reducing capacity without hurting the bias, or while improving it. In particular, good features should be invariant to perturbations of the signal known to keep the value to predict unchanged.

  20. [Figures, K = 11: training points; votes; prediction. Then with a radial feature added: training points; votes, radial feature; prediction, radial feature.]
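The figures can be sketched as a k-NN vote with and without the extra feature. What "radial feature" means exactly is an assumption here (distance to the origin), chosen to be consistent with the captions; the toy data is invented:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=11):
    # Plain k-nearest-neighbours majority vote in the given feature space.
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    return int(np.bincount(y_train[nearest]).argmax())

def add_radial(X):
    # Assumed radial feature: append the distance to the origin.
    return np.column_stack([X, np.linalg.norm(X, axis=1)])

# Toy data with circular structure: class 1 inside a disc, class 0 outside.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))
y = (np.linalg.norm(X, axis=1) < 0.5).astype(int)

x = np.array([[0.1, 0.1]])
print(knn_predict(add_radial(X), y, add_radial(x)[0]))  # 1: well inside the disc
```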

  26. A classical example is the "Histogram of Oriented Gradients" (HOG) descriptor, initially designed for person detection. Roughly: divide the image into 8 × 8 blocks, and compute in each the distribution of edge orientations over 9 bins. Dalal and Triggs (2005) combined them with an SVM, and Dollár et al. (2009) extended them with other modalities into the "channel features".

  27. Many methods (perceptron, SVM, k-means, PCA, etc.) only require computing

κ(x, x′) = Φ(x) · Φ(x′)

for any (x, x′). So one needs to specify κ alone, and may keep Φ undefined. This is the kernel trick, which we will not talk about in this course.
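A standard instance, for concreteness: in 2D, the degree-2 polynomial kernel κ(x, x′) = (1 + x · x′)² equals the dot product of an explicit 6-dimensional feature map, so the kernel can be evaluated without ever materializing Φ. This example is not on the slides:

```python
import numpy as np

def phi(x):
    # Explicit feature map whose dot products realize the degree-2
    # polynomial kernel in 2D: kappa(x, x') = (1 + x . x')^2.
    u, v = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * u, s * v, u * u, s * u * v, v * v])

def kappa(x, xp):
    # The kernel itself: O(d) to evaluate, no 6-dim vectors needed.
    return (1.0 + x @ xp) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=2)
xp = rng.normal(size=2)
print(kappa(x, xp), phi(x) @ phi(xp))  # the two values agree
```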

  29. Training a model composed of manually engineered features and a parametric model such as logistic regression is now referred to as "shallow learning": the signal goes through a single processing step trained from data.

  30. The end

  31. References

N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 886–893, 2005.

P. Dollár, Z. Tu, P. Perona, and S. Belongie. Integral channel features. In British Machine Vision Conference (BMVC), pages 91.1–91.11, 2009.
