SLIDE 1


Classification

• Basic concepts
• Decision tree
• Naïve Bayesian classifier
• Model evaluation
• Support Vector Machines
• Regression
• Neural Networks and Deep Learning
• Lazy Learners (k Nearest Neighbors)
• Bayesian Belief Networks

slide-2
SLIDE 2

Classification: Discriminant Function

Decision Tree

Linear Functions: $g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$

Nonlinear Functions

slide-3
SLIDE 3

Linear Discriminant Function

• $g(\mathbf{x})$ is a linear function:

$$g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$$

[Figure: points in the $(x_1, x_2)$ plane separated into the half-spaces $\mathbf{w}^T \mathbf{x} + b < 0$ and $\mathbf{w}^T \mathbf{x} + b > 0$]

• A hyper-plane in the feature space

• Unit normal vector of the hyper-plane: $\mathbf{n} = \dfrac{\mathbf{w}}{\|\mathbf{w}\|}$

slide-4
SLIDE 4

Linear Discriminant Function

• How would you classify these points using a linear discriminant function in order to minimize the error rate?

[Figure: points labeled +1 and −1 in the $(x_1, x_2)$ plane with several candidate separating lines]

• Infinite number of answers!


slide-7
SLIDE 7

Linear Discriminant Function

• How would you classify these points using a linear discriminant function in order to minimize the error rate?

[Figure: points labeled +1 and −1 in the $(x_1, x_2)$ plane with several candidate separating lines]

• Infinite number of answers!
• Which one is the best?

slide-8
SLIDE 8

Large Margin Linear Classifier

• The linear discriminant function (classifier) with the maximum margin is the best.

• The margin is the width by which the boundary could be increased before hitting a data point (the "safe zone").

• Why is it the best?
  • Robust to outliers, and thus strong generalization ability.

[Figure: points labeled +1 and −1 in the $(x_1, x_2)$ plane, with the separating hyper-plane and its margin]

slide-9
SLIDE 9

Large Margin Linear Classifier

• Given a set of data points $\{(\mathbf{x}_i, y_i)\}, \; i = 1, 2, \ldots, n$, where

$$\text{For } y_i = +1: \; \mathbf{w}^T \mathbf{x}_i + b > 0, \qquad \text{For } y_i = -1: \; \mathbf{w}^T \mathbf{x}_i + b < 0$$

• With a scale transformation on both $\mathbf{w}$ and $b$, the above is equivalent to

$$\text{For } y_i = +1: \; \mathbf{w}^T \mathbf{x}_i + b \geq 1, \qquad \text{For } y_i = -1: \; \mathbf{w}^T \mathbf{x}_i + b \leq -1$$

[Figure: points labeled +1 and −1 in the $(x_1, x_2)$ plane]

slide-10
SLIDE 10

Large Margin Linear Classifier

• For the boundary points (the support vectors):

$$\mathbf{w}^T \mathbf{x}^+ + b = 1, \qquad \mathbf{w}^T \mathbf{x}^- + b = -1$$

• The margin width is:

$$M = (\mathbf{x}^+ - \mathbf{x}^-) \cdot \mathbf{n} = (\mathbf{x}^+ - \mathbf{x}^-) \cdot \frac{\mathbf{w}}{\|\mathbf{w}\|} = \frac{2}{\|\mathbf{w}\|}$$

[Figure: the margin between the support vectors $\mathbf{x}^+$ and $\mathbf{x}^-$ in the $(x_1, x_2)$ plane]

slide-11
SLIDE 11

Large Margin Linear Classifier

• Formulation:

$$\text{maximize } \frac{2}{\|\mathbf{w}\|}$$

such that

$$\text{For } y_i = +1: \; \mathbf{w}^T \mathbf{x}_i + b \geq 1, \qquad \text{For } y_i = -1: \; \mathbf{w}^T \mathbf{x}_i + b \leq -1$$

slide-12
SLIDE 12

Large Margin Linear Classifier

• Formulation:

$$\text{minimize } \frac{1}{2}\|\mathbf{w}\|^2$$

such that

$$\text{For } y_i = +1: \; \mathbf{w}^T \mathbf{x}_i + b \geq 1, \qquad \text{For } y_i = -1: \; \mathbf{w}^T \mathbf{x}_i + b \leq -1$$

slide-13
SLIDE 13

Large Margin Linear Classifier

• Formulation (constraints combined):

$$\text{minimize } \frac{1}{2}\|\mathbf{w}\|^2$$

such that

$$y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1$$

slide-14
SLIDE 14

Solving the Optimization Problem

• The solution has the form:

$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i y_i \mathbf{x}_i = \sum_{i \in SV} \alpha_i y_i \mathbf{x}_i$$

• Get $b$ from $y_i(\mathbf{w}^T \mathbf{x}_i + b) - 1 = 0$, where $\mathbf{x}_i$ is a support vector.

[Figure: support vectors $\mathbf{x}^+$ and $\mathbf{x}^-$ lying on the margin boundaries]

slide-15
SLIDE 15

Solving the Optimization Problem

• The linear discriminant function is:

$$g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b = \sum_{i \in SV} \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b$$

• Notice it relies on a dot product between the test point $\mathbf{x}$ and the support vectors $\mathbf{x}_i$.

• Also keep in mind that solving the optimization problem involved computing the dot products $\mathbf{x}_i^T \mathbf{x}_j$ between all pairs of training points.
slide-16
SLIDE 16

Classification: Discriminant Function

Decision Tree

Linear Functions: $g(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$

Nonlinear Functions

slide-17
SLIDE 17

Non-linear SVMs: Feature Space

 General idea: the original input space can be mapped to

some higher-dimensional feature space where the training set is separable:

Φ: x → φ(x)

This slide is courtesy of www.iro.umontreal.ca/~pift6080/documents/papers/svm_tutorial.ppt

slide-18
SLIDE 18

Nonlinear SVMs: The Kernel Trick

• With this mapping, our discriminant function is now:

$$g(\mathbf{x}) = \mathbf{w}^T \varphi(\mathbf{x}) + b = \sum_{i \in SV} \alpha_i y_i \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}) + b$$

• No need to know this mapping explicitly, because we only use the dot product of feature vectors in both training and testing.

• A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j)$$

slide-19
SLIDE 19

Nonlinear SVMs: The Kernel Trick

• Examples of commonly-used kernel functions:

  • Linear kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$

  • Polynomial kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = (1 + \mathbf{x}_i^T \mathbf{x}_j)^p$

  • Gaussian (Radial-Basis Function, RBF) kernel: $K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left(-\dfrac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$
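To make the kernel definitions concrete, here is a minimal sketch of the three kernels as plain NumPy functions; the function names and the default values of $p$ and $\sigma$ are illustrative assumptions, not from the slides:

```python
import numpy as np

def linear_kernel(xi, xj):
    # K(xi, xj) = xi . xj
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, p=3):
    # K(xi, xj) = (1 + xi . xj)^p
    return (1.0 + np.dot(xi, xj)) ** p

def rbf_kernel(xi, xj, sigma=1.0):
    # K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))
    diff = xi - xj
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))
```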

slide-20
SLIDE 20

Summary: Support Vector Machine

• 1. Large Margin Classifier

  • Better generalization ability & less over-fitting

• 2. The Kernel Trick

  • Map data points to a higher-dimensional space in order to make them linearly separable.
  • Since only the dot product is used, we do not need to represent the mapping explicitly.

slide-21
SLIDE 21

Classification

• Basic concepts
• Decision tree
• Naïve Bayesian classifier
• Model evaluation
• Support Vector Machines
• Regression
• Neural Networks and Deep Learning
• Lazy Learners (k Nearest Neighbors)
• Bayesian Belief Networks

slide-22
SLIDE 22
slide-23
SLIDE 23

Regression

• Given input/output samples (X, y), we learn a function f such that y = f(X), which can be used on new data.

• Classification: y is discrete (class labels).
• Regression: y is continuous, e.g. linear regression.

[Figure: scatter plot with x as the independent variable (input) and y as the dependent variable (output)]

slide-24
SLIDE 24

Linear Regression

• Linear regression: use a linear function to model the relationship between a dependent (target) variable y and explanatory variables X.

• Simple linear regression: one explanatory variable.

[Figure: scatter plot of Y versus X with a fitted line]

slide-25
SLIDE 25

Linear Regression

How do you determine which line 'fits best'?

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

[Figure: scatter plot of Y versus X with several candidate lines]

slide-26
SLIDE 26

Least Squares

• 'Best fit' can be defined by a cost function: the difference between actual Y values and predicted Y values (the residual).

• Least squares minimizes the sum of the squared differences (errors), the SSE:

$$Y = \beta_0 + \beta_1 X + \varepsilon, \qquad \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
slide-27
SLIDE 27

Linear Regression

Data: Inputs are continuous vectors of length K. Outputs are continuous scalars.

Prediction: Output is a linear function of the inputs (we assume $x_1$ is 1, so its weight acts as the intercept).

Learning: find the parameters that minimize some objective function.
slide-28
SLIDE 28

Least Squares

Learning: find the parameters that minimize some objective function.

We minimize the sum of the squares. Why? It reduces the distance between the true measurements and the predicted hyperplane (a line in 1D).

slide-29
SLIDE 29

Learning the parameters

• Closed form: set partial derivatives equal to zero and solve for the parameters
• Gradient descent (GD)
• Stochastic gradient descent (SGD)

slide-30
SLIDE 30

Derivation of Parameters (1)

• Least squares: find the $\beta$ that minimize the objective function, the squared error (SSE):

$$SSE = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

• Setting $\partial SSE / \partial \beta_0 = 0$ gives $\sum_i (y_i - \beta_0 - \beta_1 x_i) = 0$, i.e. $n\bar{y} - n\beta_0 - n\beta_1 \bar{x} = 0$, so

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

slide-31
SLIDE 31

Derivation of Parameters (2)

• Least squares: find the $\beta$ that minimize the objective function, the squared error (SSE).

• Setting $\partial SSE / \partial \beta_1 = 0$ gives $\sum_i x_i (y_i - \beta_0 - \beta_1 x_i) = 0$; substituting $\beta_0 = \bar{y} - \beta_1 \bar{x}$ and solving yields

$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{SS_{xy}}{SS_{xx}}$$
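A minimal sketch of this closed-form fit in Python/NumPy; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Closed-form least-squares fit of y = b0 + b1 * x."""
    x_bar, y_bar = x.mean(), y.mean()
    ss_xy = np.sum((x - x_bar) * (y - y_bar))   # SS_xy
    ss_xx = np.sum((x - x_bar) ** 2)            # SS_xx
    b1 = ss_xy / ss_xx                          # slope: beta_1 = SS_xy / SS_xx
    b0 = y_bar - b1 * x_bar                     # intercept: beta_0 = y_bar - beta_1 * x_bar
    return b0, b1

# Example usage
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
b0, b1 = fit_simple_linear_regression(x, y)
```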

slide-32
SLIDE 32

Learning the parameters

• Closed form: set partial derivatives equal to zero and solve for the parameters
• Gradient descent (GD)
  • Start with initial values and gradually move towards the minimal loss function value based on the gradient
• Stochastic gradient descent (SGD)

slide-33
SLIDE 33

Gradient Descent Illustration

slide-34
SLIDE 34

Gradient Descent

Gradient of the objective function (i.e., the vector of partial derivatives):

slide-35
SLIDE 35

Gradient Descent

• Convergence: we could check whether the L2 norm of the gradient is below some small tolerance. Alternatively, whether the reduction in the objective function from one iteration to the next is small.

slide-36
SLIDE 36

Partial Derivatives for Linear Reg.

42

slide-37
SLIDE 37

Partial Derivatives for Linear Reg.

43

slide-38
SLIDE 38

Gradient Descent for Least Squares

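The update equations on this slide are images that did not survive extraction; as a stand-in, here is a minimal gradient-descent sketch for least-squares linear regression. The learning rate, iteration count, and names are illustrative assumptions:

```python
import numpy as np

def gradient_descent_least_squares(X, y, lr=0.01, n_iters=1000):
    """X: (n, K) design matrix with a leading column of 1s; y: (n,) targets."""
    n, K = X.shape
    w = np.zeros(K)
    for _ in range(n_iters):
        residual = X @ w - y                 # predicted minus true values
        grad = (2.0 / n) * (X.T @ residual)  # gradient of the mean squared error
        w -= lr * grad                       # step against the gradient
    return w
```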

slide-39
SLIDE 39

Stochastic Gradient Descent (SGD)

Update the parameters based on the gradient of random data samples. Often preferred over (batch) gradient descent because it gets close to the minimum much faster.
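A minimal sketch of the stochastic variant, again with illustrative names and hyper-parameters:

```python
import numpy as np

def sgd_least_squares(X, y, lr=0.01, n_epochs=10, seed=0):
    """Stochastic gradient descent: update after each randomly chosen sample."""
    rng = np.random.default_rng(seed)
    n, K = X.shape
    w = np.zeros(K)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            residual = X[i] @ w - y[i]       # error on a single sample
            w -= lr * 2.0 * residual * X[i]  # gradient step for that sample only
    return w
```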

slide-40
SLIDE 40

Non-Linear basis function

• So far we only used the observed values $x_1, x_2, \ldots$. However, linear regression can be applied in the same way to functions of these values.

• E.g., add new variables $x_1^2$ and $x_1 x_2$, so each example becomes: $x_1, x_2, \ldots, x_1^2, x_1 x_2$.

• As long as these functions can be directly computed from the observed values, the parameters are still linear in the data and the problem remains a multivariate linear regression problem:

$$y = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_k x_k$$

slide-41
SLIDE 41

Non-linear basis functions

What type of functions can we use? A few common examples:

• Polynomial: $\varphi_j(x) = x^j$ for $j = 0 \ldots n$
• Gaussian: $\varphi_j(x) = \dfrac{(x - \mu_j)^2}{2\sigma_j^2}$
• Sigmoid: $\varphi_j(x) = \dfrac{1}{1 + \exp(-s_j x)}$
• Logs: $\varphi_j(x) = \log(x + 1)$

slide-42
SLIDE 42

General linear regression problem

• Using our new notation for the basis functions, linear regression can be written as

$$y = \sum_{j=0}^{n} w_j \varphi_j(\mathbf{x})$$

• where $\varphi_j(\mathbf{x})$ can be either $x_j$ for multivariate regression or one of the non-linear basis functions we defined,
• and $\varphi_0(\mathbf{x}) = 1$ for the intercept term.
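As a concrete, hedged illustration of this general form, the sketch below builds a polynomial design matrix $\Phi$ and reuses an ordinary least-squares solve; the helper names are mine, not the slides':

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Phi[i, j] = phi_j(x_i) = x_i**j, with phi_0(x) = 1 for the intercept."""
    return np.vander(x, N=degree + 1, increasing=True)

def fit_basis_regression(x, y, degree=3):
    """Least-squares fit of y = sum_j w_j * phi_j(x)."""
    Phi = polynomial_design_matrix(x, degree)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

# Example usage on noisy sinusoidal data
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(20)
w = fit_basis_regression(x, y, degree=3)
```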

slide-43
SLIDE 43

0th Order Polynomial

n=10

slide-44
SLIDE 44

1st Order Polynomial

slide-45
SLIDE 45

3rd Order Polynomial

slide-46
SLIDE 46

9th Order Polynomial

slide-47
SLIDE 47

Over-fitting

Root-Mean-Square (RMS) Error:

slide-48
SLIDE 48

Polynomial Coefficients

slide-49
SLIDE 49

Regularization

Penalize large coefficient values:

$$J_{X,y}(\mathbf{w}) = \frac{1}{2} \sum_i \left( y_i - \sum_j w_j \varphi_j(\mathbf{x}_i) \right)^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$$
slide-50
SLIDE 50

Regularization:


slide-51
SLIDE 51

Over Regularization:

slide-52
SLIDE 52

Polynomial Coefficients

[Table: polynomial coefficient values for no regularization vs. increasing amounts of regularization]

slide-53
SLIDE 53

LASSO

  • Adds an L1 regularizer to Linear Regression


slide-54
SLIDE 54

Interpretability

• Coefficients suggest importance/correlation with the output.
  • A large positive coefficient implies that the output will increase when this input is increased (positively correlated).
  • A large negative coefficient implies that the output will decrease when this input is increased (negatively correlated).
  • A small or 0 coefficient suggests that the input is uncorrelated with the output (at least to first order).
• Linear regression can be used to find the best "indicators".

slide-55
SLIDE 55

Regression for Classification

• Given input/output samples (X, y), we learn a function f such that y = f(X), which can be used on new data.

• Classification: y is discrete (class labels).
• Regression: y is continuous, e.g. linear regression.

[Figure: y versus x for regression, and a 0/1-labeled version of the same data for classification]

slide-56
SLIDE 56

From real value to discrete value


slide-57
SLIDE 57

From real value to discrete value


slide-58
SLIDE 58

From real value to discrete value


Non-differentiable

slide-59
SLIDE 59

Logistic Regression

Data: Inputs are continuous vectors of length K. Outputs are discrete valued.

Prediction: Output is a logistic function of the linear function of the inputs.

slide-60
SLIDE 60

Classification: Discriminant Function

Decision Tree Linear Functions Nonlinear Functions

slide-61
SLIDE 61

Logistic Regression

Data: Inputs are continuous vectors of length K. Outputs are discrete valued.

Prediction: Output is a logistic function of the linear function of the inputs.

Learning: find the parameters that minimize some objective function.
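As a hedged sketch of the prediction step described above (a sigmoid applied to a linear score); the names and the 0.5 decision threshold are illustrative choices:

```python
import numpy as np

def sigmoid(z):
    # logistic function: 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w):
    """P(y = 1 | x) = sigmoid(w . x) for each row of X."""
    return sigmoid(X @ w)

def predict_label(X, w, threshold=0.5):
    # discrete output: class 1 when the probability exceeds the threshold
    return (predict_proba(X, w) >= threshold).astype(int)
```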
slide-62
SLIDE 62

Recall: Least Squares for Linear Regression

Learning: find the parameters that minimize some objective function.

We minimize the sum of the squares. Why? It reduces the distance between the true measurements and the predicted values.

slide-63
SLIDE 63

Maximum Likelihood for Logistic Regression

Learning: find the parameters that maximize the log likelihood of observing the data.

slide-64
SLIDE 64

Review: Derivative Rules


slide-65
SLIDE 65

Stochastic Gradient Descent (Ascent)

• Partial derivative with one training example (x, y)
• Stochastic gradient descent update
• Gradient descent update
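The update equations on this slide are images that did not survive extraction; as a stand-in, here is a minimal sketch of the standard stochastic gradient-ascent update for the logistic-regression log likelihood. The learning rate, epoch count, and names are my assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_logistic_regression(X, y, lr=0.1, n_epochs=20, seed=0):
    """y contains 0/1 labels; maximize the log likelihood one sample at a time."""
    rng = np.random.default_rng(seed)
    n, K = X.shape
    w = np.zeros(K)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            error = y[i] - sigmoid(X[i] @ w)  # (y - p) factor of the per-sample gradient
            w += lr * error * X[i]            # gradient *ascent* on the log likelihood
    return w
```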

slide-66
SLIDE 66

Regression Summary

• Regression methods
  • Linear regression
  • Logistic regression
• Optimization
  • Gradient descent
  • Stochastic gradient descent
• Regularization

slide-67
SLIDE 67

Classification

• Basic concepts
• Decision tree
• Naïve Bayesian classifier
• Model evaluation
• Support Vector Machines
• Regression
• Neural Networks and Deep Learning
• Lazy Learners (k Nearest Neighbors)
• Bayesian Belief Networks

slide-68
SLIDE 68

Deep Learning: MIT Technology Review - 10 Breakthrough Technologies 2013


slide-69
SLIDE 69


slide-70
SLIDE 70

Applications

• Image recognition
• Speech recognition
• Natural language processing

slide-71
SLIDE 71

Classification: Discriminant Function

Decision Tree Linear Functions Nonlinear Functions

slide-72
SLIDE 72

Neural Network and Deep Learning

• A neural network is a multi-layer structure of connected input/output units (artificial neurons).

• Learning is done by adjusting the weights $w_{ij}$ so as to predict the correct class label of the input tuples.

• Deep learning uses more layers than shallow learning.

[Figure: input vector X feeding an input layer, a hidden layer, and an output layer that emits the output vector]

slide-73
SLIDE 73

Artificial Neuron – Perceptron


Perceptron

slide-74
SLIDE 74

Neuron: A Hidden/Output Layer Unit

• An n-dimensional input vector x is mapped into variable y by means of a scalar product and a nonlinear function mapping.

• The inputs to the unit are outputs from the previous layer. They are multiplied by their corresponding weights to form a weighted sum, which is added to the bias $\mu_k$ associated with the unit. Then a nonlinear activation function $f$ is applied to it.

$$\text{For example: } \; y = \operatorname{sign}\!\left( \sum_{i=0}^{n} w_i x_i + \mu_k \right)$$

[Figure: inputs $x_0, x_1, \ldots, x_n$ weighted by $w_0, w_1, \ldots, w_n$, summed with the bias, and passed through the activation function $f$ to produce the output y]
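A minimal sketch of this single unit in code, using the sign activation from the example above; the names are illustrative:

```python
import numpy as np

def neuron_output(x, w, bias, activation=np.sign):
    """One unit: weighted sum of the inputs plus bias, passed through an activation."""
    weighted_sum = np.dot(w, x) + bias
    return activation(weighted_sum)

# Example usage
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.2])
y = neuron_output(x, w, bias=0.1)   # returns sign(w.x + bias)
```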

slide-75
SLIDE 75

From Neuron to Neural Network

• The input layer corresponds to the attributes measured for each training tuple.

• They are then weighted and fed simultaneously to the hidden layers.

• The weighted outputs of the last hidden layer are input to the output layer, which emits the network's prediction.

• From a statistical point of view, networks perform nonlinear regression: given enough hidden units and enough training samples, they can closely approximate any function.

slide-76
SLIDE 76

Neural Networks

• A family of parametric, non-linear, and hierarchical representation-learning functions

slide-77
SLIDE 77

Learning Weights (and Bias) in the Network

• If a small change in a weight (or bias) causes only a small change in the output, we could modify the weights and biases gradually to train the network.

• Does the perceptron work?

slide-78
SLIDE 78

Artificial Neuron – Perceptron to sigmoid neuron

87

Perceptron sigmoid neuron

slide-79
SLIDE 79

Popular Activation Functions

• Tanh: $\tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}$

• ReLU (Rectified Linear Unit): $g(y) = \max(0, y)$

slide-80
SLIDE 80


slide-81
SLIDE 81

Learning by Backpropagation

Modifications are made in the "backwards" direction: from the output layer, through each hidden layer, down to the first hidden layer, hence "backpropagation".

Steps:
• Initialize weights to small random numbers, along with associated biases
• Propagate the inputs forward (by applying the activation function)
• Backpropagate the error (by updating weights and biases)
• Terminating condition (when the error is very small, etc.)

slide-82
SLIDE 82

Stochastic Gradient Descent

• Gradient Descent (batch GD)
  • The cost gradient is based on the complete training set; this can be costly and take longer to converge to the minimum.

• Stochastic Gradient Descent (SGD, iterative or online GD)
  • Update the weights after each training sample.
  • The gradient based on a single training sample is a stochastic approximation of the true cost gradient.
  • Converges faster, but the path towards the minimum may zig-zag.

• Mini-Batch Gradient Descent (MB-GD)
  • Update the weights based on a small group of training samples.

slide-83
SLIDE 83

Training the neural network

Training data (Fields -> class):
  1.4  2.7  1.9  ->  0
  3.8  3.4  3.2  ->  0
  6.4  2.8  1.7  ->  1
  4.1  0.1  0.2  ->  0
  etc. …

slide-84
SLIDE 84

Training data (as above). Initialise with random weights.

slide-85
SLIDE 85

Training data (as above). Feed an example through to get an output: inputs 1.4, 2.7, 1.9 produce output 0.8.

slide-86
SLIDE 86

Training data (as above). Compare with the target output: target 0, output 0.8, error 0.8.

slide-87
SLIDE 87

Training data (as above). Adjust the weights based on the error (output 0.8, error 0.8).

slide-88
SLIDE 88

Training data (as above). And so on: inputs 6.4, 2.8, 1.7 with target 1 produce output 0.9, error -0.1. Repeat this thousands, maybe millions of times, each time taking a random training instance and making slight weight adjustments (stochastic gradient descent).

slide-89
SLIDE 89

Learning for neural networks

• Shallow networks
• Deep networks with multiple layers (deep learning)

slide-90
SLIDE 90

Feature detectors

slide-91
SLIDE 91

Hidden layer units become self-organised feature detectors

[Figure: weights from 63 input pixels into one hidden unit, with a few strong weights and low/zero weights elsewhere]

slide-92
SLIDE 92

What does this unit detect?

[Figure: the same weight pattern] It will send a strong signal for a horizontal line in the top row, ignoring everywhere else.

slide-93
SLIDE 93

What features might you expect a good NN to learn, when trained with data like this?

slide-94
SLIDE 94


vertical lines

slide-95
SLIDE 95


Horizontal lines

slide-96
SLIDE 96


Small circles

slide-97
SLIDE 97


Small circles

slide-98
SLIDE 98

Hierarchical Feature Learning

[Figure: units in the first layer detect lines in specific positions; higher-level detectors combine them (horizontal line, "RHS vertical line", "upper loop", etc.)]

slide-99
SLIDE 99

Hierarchical Feature Learning

• Deep learning (a.k.a. representation learning) seeks to learn rich hierarchical representations (i.e. features) automatically through multiple stages of a feature learning process.

[Figure: pipeline from low-level features to mid-level features to high-level features to a trainable classifier producing the output; feature visualization of a convolutional net trained on ImageNet (Zeiler and Fergus, 2013)]

slide-100
SLIDE 100

Deep Learning Architectures

• Commonly used architectures:
  • Convolutional neural networks
  • Recurrent neural networks

slide-101
SLIDE 101

Convolutional Neural Network

• Input can have very high dimensions.
  • Using a fully-connected neural network would need a large number of parameters.

• CNNs are a special type of neural network using shared weights and local connections.
  • The number of parameters needed by CNNs is much smaller.

• Example: 200x200 image
  a) fully connected, 40,000 hidden units => 1.6 billion parameters
  b) CNN, 5x5 filters, 100 filters => 2,500 parameters

slide-102
SLIDE 102

Building-blocks for CNN’s

• Images are segmented into sub-regions.
• Each sub-region yields a feature map, representing its feature.
• Feature maps are trained with neurons (shared weights).
• Feature maps of a larger region are combined.

slide-103
SLIDE 103

CNN Architecture: Convolutional Layer

• The convolutional layer consists of a set of filters.
• Each filter covers a spatially small portion of the input data.
• Each filter is convolved across the dimensions of the input data (a dot product at each position), producing a multidimensional feature map.
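A minimal sketch of the "dot product at each position" idea for a single 2D filter (a naive valid-mode convolution); the names are illustrative:

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide one filter over the image; each output value is a dot product."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kH, j:j + kW]
            out[i, j] = np.sum(patch * kernel)   # dot product of patch and filter
    return out
```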

slide-104
SLIDE 104

Full CNN

118

pooling pooling

slide-105
SLIDE 105

Recurrent Neural Networks

• Standard neural networks (also convolutional networks):
  • Assume input examples are vectors of fixed length (e.g., an image) and produce a fixed-size vector as output (e.g., probabilities of different classes).
  • These models use a fixed number of computational steps (e.g., the number of layers in the model).

• Recurrent neural networks:
  • Model data with temporal or sequential structure
  • Allow varying lengths of inputs and outputs

slide-106
SLIDE 106

Recurrent Neural Networks

• Recurrent neural networks are networks with loops in them, allowing information to persist.

• At time t, the network takes some input x_t and outputs a value h_t. A loop allows information to be passed from one step of the network to the next.

slide-107
SLIDE 107

Neural Network as a Classifier

Weaknesses
• Long training time
• Requires a large amount of training data
• Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and the "hidden units" in the network

Strengths
• Successful on an array of real-world data, e.g., hand-written letters
• High tolerance to noisy data
• Well-suited for continuous-valued inputs and outputs
• Algorithms are inherently parallel

slide-108
SLIDE 108

Deep learning frameworks

Framework survey: https://www.microway.com/hpc-tech-tips/deep-learning-frameworks-survey-tensorflow-torch-theano-caffe-neon-ibm-machine-learning-stack/

TensorFlow playground: http://playground.tensorflow.org/

slide-109
SLIDE 109

Classification

• Basic concepts
• Decision tree
• Naïve Bayesian classifier
• Model evaluation
• Support Vector Machines
• Regression
• Neural Networks and Deep Learning
• Lazy Learners (k Nearest Neighbors)

slide-110
SLIDE 110

Lazy vs. Eager Learning

• Lazy vs. eager learning
  • Lazy learning (e.g., instance-based learning): simply stores the training data (or does only minor processing) and waits until it is given a test tuple
  • Eager learning (the previously discussed methods): given a set of training tuples, constructs a classification model before receiving new (e.g., test) data to classify
• Lazy: less time in training but more time in predicting
• Accuracy
  • Lazy methods effectively use a richer hypothesis space, since they use many local linear functions
  • Eager: must commit to a single hypothesis (global function) that covers the entire instance space

slide-111
SLIDE 111

K-nearest neighbor method

• Majority vote within the k nearest neighbors

[Figure: a new point whose single nearest neighbor is positive (K = 1: positive) but whose three nearest neighbors are mostly negative (K = 3: negative)]
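A minimal sketch of k-NN majority voting with Euclidean distance; the names and the tie-breaking behaviour of Counter.most_common are choices of this sketch, not of the slides:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                    # indices of the k closest points
    votes = Counter(y_train[nearest])                      # y_train assumed to be a NumPy array
    return votes.most_common(1)[0][0]
```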

slide-112
SLIDE 112

K-nearest neighbor method

• Distance-weighted nearest neighbor algorithm
  • Weight the contribution of each of the k neighbors according to their distance to the query $\mathbf{x}_q$:

$$w_i \equiv \frac{1}{d(\mathbf{x}_q, \mathbf{x}_i)^2}$$

slide-113
SLIDE 113


slide-114
SLIDE 114

Definition of Nearest Neighbor

[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor neighborhoods around a query point X]

The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.

slide-115
SLIDE 115


1 nearest-neighbor

Voronoi Diagram

slide-116
SLIDE 116

Nearest Neighbor Classification…

• Choosing the value of k:
  • If k is too small, the classifier is sensitive to noise points
  • If k is too large, the neighborhood may include points from other classes

slide-117
SLIDE 117

Similarity/distance between data objects

Data objects can be viewed:
• as points: distance between points
• as vectors: cosine between vectors
• as random variables: correlation
• as sets: Jaccard distance between sets
• as strings: Hamming distance

slide-118
SLIDE 118

Distance between two data points

• Euclidean distance:
$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$

• Manhattan distance:
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$

• Minkowski distance:
$$d(i, j) = \left( |x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q \right)^{1/q}$$

[Data matrix: row i is $(x_{i1}, \ldots, x_{if}, \ldots, x_{ip})$]
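A minimal sketch of these three distances in NumPy; Minkowski's q is a parameter:

```python
import numpy as np

def manhattan(xi, xj):
    return np.sum(np.abs(xi - xj))

def euclidean(xi, xj):
    return np.sqrt(np.sum((xi - xj) ** 2))

def minkowski(xi, xj, q=3):
    # q = 1 gives Manhattan, q = 2 gives Euclidean
    return np.sum(np.abs(xi - xj) ** q) ** (1.0 / q)
```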

slide-119
SLIDE 119

Distance between two attribute values

• To compute the per-attribute distance $|x_{if} - x_{jf}|$ for attribute f:

• f is numeric (interval or ratio scale)
  • Use $|x_{if} - x_{jf}|$; scaling issues -> normalization

• f is ordinal
  • Mapping by rank: $z_{if} = \dfrac{r_{if} - 1}{M_f - 1}$, then compare the mapped values

• f is nominal
  • Mapping function: distance = 0 if $x_{if} = x_{jf}$, or 1 otherwise

• Hamming distance (edit distance) for strings

slide-120
SLIDE 120

Normalization of attributes

Scale attribute values to fall within a small, specified range.

• Min-max normalization: from $[\min_A, \max_A]$ to $[new\_min_A, new\_max_A]$

$$v' = \frac{v - \min_A}{\max_A - \min_A} (new\_max_A - new\_min_A) + new\_min_A$$

  • Ex. Let income range [$12,000, $98,000] be normalized to [0.0, 1.0]. Then $73,600 is mapped to $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$

• Z-score normalization (μ: mean, σ: standard deviation):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

  • Ex. Let μ = 54,000, σ = 16,000. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
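A minimal sketch of both normalizations; the function names are illustrative:

```python
import numpy as np

def min_max_normalize(v, new_min=0.0, new_max=1.0):
    """Map values from [v.min(), v.max()] to [new_min, new_max]."""
    v_min, v_max = v.min(), v.max()
    return (v - v_min) / (v_max - v_min) * (new_max - new_min) + new_min

def z_score_normalize(v):
    """Map values to (v - mean) / std."""
    return (v - v.mean()) / v.std()

# Example usage
income = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])
print(min_max_normalize(income))   # 73,600 maps to ~0.716
print(z_score_normalize(income))
```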

slide-121
SLIDE 121

Weighted distance

• Assign weights to different attributes.
• If $w_i$ is the inverse variance, it's a form of Mahalanobis distance.
• Supervised metric learning: learn the weights $w_i$ using labeled data.

slide-122
SLIDE 122

Euclidean distance

• Euclidean distance may not be meaningful (counter-intuitive) for high-dimensional data, e.g. user movie ratings.

[Example: rating vectors such as (3, 3, 3, …) vs. (1, 1, 1, …) compared with (0, 3, 0, 3, …) vs. (1, 1, 1, …)]

slide-123
SLIDE 123

Similarity/distance between data objects

Data objects can be viewed:
• as points: distance between points
• as vectors: cosine between vectors
• as random variables: correlation
• as sets: Jaccard distance between sets
• as strings: Hamming distance

slide-124
SLIDE 124

Cosine similarity between two vectors

• Cosine measure, ranging from -1 to 1:

$$\cos(X_i, X_j) = \frac{X_i \cdot X_j}{\|X_i\| \, \|X_j\|}$$
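A minimal sketch of the cosine measure:

```python
import numpy as np

def cosine_similarity(xi, xj):
    """Cosine of the angle between two vectors, in [-1, 1]."""
    return np.dot(xi, xj) / (np.linalg.norm(xi) * np.linalg.norm(xj))
```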

slide-125
SLIDE 125

Cosine similarity

• Cosine similarity is:
  • Invariant to multiplicative scaling
  • Not invariant to additive scaling

[Example: pairs of rating vectors that differ only by a multiplicative factor have the same cosine similarity]

slide-126
SLIDE 126

Correlation between two random variables (numerical data)

• Correlation coefficient (also called Pearson's product-moment coefficient):

$$r_{A,B} = \frac{\sum (a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (a_i b_i) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$

where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum(a_i b_i)$ is the sum of the AB dot-product.

• $r_{A,B} > 0$: A and B are positively correlated (A's values increase as B's do)
• $r_{A,B} = 0$: independent
• $r_{A,B} < 0$: negatively correlated

slide-127
SLIDE 127

Visualization of Correlation

Scatter plots showing the Pearson correlation from –1 to 1.

slide-128
SLIDE 128

Data object as a set

• For transaction data, document data
• Shared items are more important to consider

[Example: two long binary item vectors that share only a few 1s]

slide-129
SLIDE 129

Jaccard distance between two sets

• The Jaccard similarity of two sets is the size of their intersection divided by the size of their union: sim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|

• Jaccard distance: d(C1, C2) = 1 − |C1 ∩ C2| / |C1 ∪ C2|

(J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org)

[Example: 3 items in the intersection and 8 in the union give Jaccard similarity = 3/8 and Jaccard distance = 5/8]
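A minimal sketch using Python sets:

```python
def jaccard_similarity(c1, c2):
    """|C1 intersect C2| / |C1 union C2| for two sets."""
    return len(c1 & c2) / len(c1 | c2)

def jaccard_distance(c1, c2):
    return 1.0 - jaccard_similarity(c1, c2)

# Example: 3 shared items out of 8 total -> similarity 3/8, distance 5/8
a = {1, 2, 3, 4, 5, 6}
b = {4, 5, 6, 7, 8}
print(jaccard_similarity(a, b), jaccard_distance(a, b))
```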

slide-130
SLIDE 130

Similarity/distance between data objects

Data objects can be viewed:
• as points: distance between points
• as vectors: cosine between vectors
• as random variables: correlation
• as sets: Jaccard distance between sets
• as strings: Hamming distance

slide-131
SLIDE 131

Supervised Learning

• Decision tree
• Naïve Bayesian classifier
• Support Vector Machines
• Regression
• Neural Networks and Deep Learning
• Lazy Learners (k Nearest Neighbors)

slide-132
SLIDE 132

Semi-Supervised Classification

• Supervised learning: learning from labeled data
• Labeled data can be rare or expensive; unlabeled data are much easier to get
• Semi-supervised learning: use both labeled and unlabeled data
  • Self-training
  • Co-training
• Active learning: iterative supervised learning

slide-133
SLIDE 133

Self-training

• Build a classifier using the labeled data
• Repeatedly use it to label the unlabeled data; the examples with the most confident label predictions are added to the set of labeled data
• May reinforce errors
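A hedged sketch of the self-training loop just described, assuming a scikit-learn-style classifier exposing fit, predict_proba, and classes_; the confidence threshold and round limit are illustrative choices:

```python
import numpy as np

def self_training(model, X_labeled, y_labeled, X_unlabeled,
                  confidence=0.95, max_rounds=10):
    """Repeatedly fit on labeled data, then absorb confidently labeled examples."""
    X_lab, y_lab = X_labeled.copy(), y_labeled.copy()
    X_unlab = X_unlabeled.copy()
    for _ in range(max_rounds):
        if len(X_unlab) == 0:
            break
        model.fit(X_lab, y_lab)
        proba = model.predict_proba(X_unlab)
        conf_mask = proba.max(axis=1) >= confidence          # most confident predictions
        if not conf_mask.any():
            break
        new_labels = model.classes_[proba[conf_mask].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[conf_mask]])       # grow the labeled set
        y_lab = np.concatenate([y_lab, new_labels])
        X_unlab = X_unlab[~conf_mask]                        # remove absorbed examples
    return model
```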

slide-134
SLIDE 134

Co-training

[Blum&Mitchell’98] Many problems have two different sources of info you can use to determine label. E.g., classifying webpages: can use words on page or words on links pointing to the page.

[Figure: a webpage "My Advisor: Prof. Avrim Blum" viewed through two feature sets: x1 = link info, x2 = text info, x = link info & text info]

slide-135
SLIDE 135

Co-training

• Learn a separate classifier for each view using the labeled data.
• Iteratively use each classifier on the unlabeled data to construct additional labeled training data for the other classifier.

slide-136
SLIDE 136

Active Learning

• Use a query function to select one or more tuples from the unlabeled data and request labels from an oracle (a human annotator).
• The newly labeled samples are added to the labeled data to train the model.
• Evaluated through learning curves: accuracy as a function of the number of instances queried.

slide-137
SLIDE 137

Summary

• Supervised learning: classification and regression
  • Decision tree
  • Naïve Bayesian classifier
  • Support Vector Machines
  • Linear regression and logistic regression
  • Neural Networks and Deep Learning
  • Lazy Learners (k Nearest Neighbors)
• Ensemble methods
• Model evaluation and selection
• Semi-supervised learning