LEARNING WITH NONTRIVIAL TEACHER: LEARNING USING PRIVILEGED INFORMATION

Vladimir Vapnik
Columbia University, NEC-labs

ROSENBLATT’S PERCEPTRON AND THE CLASSICAL MACHINE LEARNING PARADIGM

ROSENBLATT’S SCHEME:

1. Transform input vectors of space $X$ into space $Z$.
2. Using training data

$$(x_1, y_1), \ldots, (x_\ell, y_\ell) \qquad (1)$$

construct a separating hyperplane in space $Z$.

[Figure: mapping from input space X to space Z]

GENERAL MATHEMATICAL SCHEME:

1. From a given collection of functions $f(x, \alpha)$, $\alpha \in \Lambda$, choose one that minimizes the number of misclassifications on the training data (1).

MAIN RESULTS OF THE VC THEORY

1. There exist two and only two factors responsible for generalization:
   a) The percentage of training errors $\nu_{\mathrm{train}}$.
   b) The capacity of the set of functions from which one chooses the desired function (the VC dimension $\mathrm{VCdim}$).

2a. The following bound on the probability of test error ($P_{\mathrm{test}}$) is valid:

$$P_{\mathrm{test}} \le \nu_{\mathrm{train}} + O^*\left(\sqrt{\frac{\mathrm{VCdim}}{\ell}}\right),$$

where $\ell$ is the number of observations.

2b. When $\nu_{\mathrm{train}} = 0$, the following bound is valid:

$$P_{\mathrm{test}} \le O^*\left(\frac{\mathrm{VCdim}}{\ell}\right).$$

The bounds are achievable.
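As a rough numeric illustration of the two rates in 2a and 2b, the sketch below evaluates $\sqrt{\mathrm{VCdim}/\ell}$ and $\mathrm{VCdim}/\ell$ for a few sample sizes. The function names and the value of the VC dimension are assumptions made for the example, and the constants and logarithmic factors hidden in $O^*$ are ignored, so only the relative decay of the two terms is meaningful.

```python
import math


def slow_rate(vc_dim, n_obs):
    """Rate term sqrt(VCdim / l) from bound 2a (constants and log factors dropped)."""
    return math.sqrt(vc_dim / n_obs)


def fast_rate(vc_dim, n_obs):
    """Rate term VCdim / l from bound 2b (constants and log factors dropped)."""
    return vc_dim / n_obs


if __name__ == "__main__":
    vc_dim = 10  # assumed capacity, chosen only for illustration
    for n_obs in (100, 1_000, 10_000):
        print(n_obs, round(slow_rate(vc_dim, n_obs), 4), round(fast_rate(vc_dim, n_obs), 4))
```

For any $\ell > \mathrm{VCdim}$ the second term is the smaller one, which is why the zero-training-error case converges faster.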

NEW LEARNING MODEL: LEARNING WITH A NONTRIVIAL TEACHER

Let us include a teacher in the learning process.

During the learning process a teacher supplies each training example with additional information, which can include comments, comparisons, explanations, logical, emotional, or metaphorical reasoning, and so on.

This additional (privileged) information is available only for the training examples. It is not available for test examples.

Privileged information exists for almost any learning problem and can play a crucial role in the learning process: it can significantly increase the speed of learning.

THE BASIC MODELS

The classical learning model: given training pairs

$$(x_1, y_1), \ldots, (x_\ell, y_\ell), \quad x_i \in X, \; y_i \in \{-1, 1\}, \; i = 1, \ldots, \ell,$$

find among a given set of functions $f(x, \alpha)$, $\alpha \in \Lambda$, the function $y = f(x, \alpha^*)$ that minimizes the probability of incorrect classifications $P_{\mathrm{test}}$.

The LUPI learning model: given training triplets

$$(x_1, x_1^*, y_1), \ldots, (x_\ell, x_\ell^*, y_\ell), \quad x_i \in X, \; x_i^* \in X^*, \; y_i \in \{-1, 1\}, \; i = 1, \ldots, \ell,$$

find among a given set of functions $f(x, \alpha)$, $\alpha \in \Lambda$, the function $y = f(x, \alpha^*)$ that minimizes the probability of incorrect classifications $P_{\mathrm{test}}$.

GENERALIZATION OF PERCEPTRON: SVM

Generalization 1: Large margin.

Minimize the functional

$$R = (w, w)$$

subject to the constraints

$$y_i[(w, z_i) + b] \ge 1, \quad i = 1, \ldots, \ell.$$

The solution $(w_\ell, b_\ell)$ has the bound

$$P_{\mathrm{test}} \le O^*\left(\frac{\mathrm{VCdim}}{\ell}\right).$$

GENERALIZATION OF PERCEPTRON: SVM

Generalization 2: Nonseparable case.

Minimize the functional

$$R(w, b) = (w, w) + C \sum_{i=1}^{\ell} \xi_i$$

subject to the constraints

$$y_i[(w, z_i) + b] \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, \ell.$$

The solution $(w_\ell, b_\ell)$ has the bound

$$P_{\mathrm{test}} \le \nu_{\mathrm{train}} + O^*\left(\sqrt{\frac{\mathrm{VCdim}}{\ell}}\right).$$
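To make the nonseparable functional concrete, here is a minimal sketch that evaluates $R(w, b)$ for a fixed hyperplane. It assumes the slacks take their smallest feasible values, $\xi_i = \max(0, 1 - y_i[(w, z_i) + b])$, which is what the minimization implies for fixed $(w, b)$. The helper names and the toy data are hypothetical, and no optimization is performed.

```python
def dot(u, v):
    """Inner product (u, v) of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def slacks(w, b, points, labels):
    """Smallest xi_i >= 0 satisfying y_i[(w, z_i) + b] >= 1 - xi_i."""
    return [max(0.0, 1.0 - y * (dot(w, z) + b)) for z, y in zip(points, labels)]


def soft_margin_objective(w, b, points, labels, C):
    """R(w, b) = (w, w) + C * sum_i xi_i, using the minimal feasible slacks."""
    return dot(w, w) + C * sum(slacks(w, b, points, labels))


if __name__ == "__main__":
    # Toy data (assumed): the second point lies inside the margin,
    # the fourth is misclassified by the hyperplane w = (1, 0), b = 0.
    Z = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (0.5, 1.0)]
    y = [1, 1, -1, -1]
    print(slacks((1.0, 0.0), 0.0, Z, y))
    print(soft_margin_objective((1.0, 0.0), 0.0, Z, y, C=1.0))
```

Each of the $\ell$ training points contributes its own slack value, which is the source of the extra $\ell$ parameters in the nonseparable case.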

WHY IS THE DIFFERENCE SO BIG?

• In the separable case, using $\ell$ examples one estimates $n$ parameters of $w$.
• In the non-separable case one estimates $n + \ell$ parameters ($n$ parameters of the vector $w$ and $\ell$ slack parameters).

Suppose that we know a set of functions $\xi(x, \delta) \ge 0$, $\delta \in D$, such that

$$\xi = \xi(x) = \xi(x, \delta_0)$$

and that this set has finite VC dimension $\mathrm{VCdim}^*$ (let $\delta$ be an $m$-dimensional vector).

In this situation, to find the optimal hyperplane in the non-separable case one needs to estimate $n + m$ parameters using $\ell$ observations.

Can the rate of convergence in this case be faster?

THE KEY OBSERVATION: ORACLE SVM

Suppose we are given triplets

$$(x_1, \xi_1^0, y_1), \ldots, (x_\ell, \xi_\ell^0, y_\ell),$$

where $\xi_i^0 = \xi^0(x_i)$, $i = 1, \ldots, \ell$, are the slack values with respect to the best hyperplane. Then to find the approximation of $(w_{\mathrm{best}}, b_{\mathrm{best}})$ we minimize the functional

$$R(w, b) = (w, w)$$

subject to the constraints

$$y_i[(w, x_i) + b] \ge r_i, \quad r_i = 1 - \xi^0(x_i), \quad i = 1, \ldots, \ell.$$

Proposition 1. For Oracle SVM the following bound holds:

$$P_{\mathrm{test}} \le \nu_{\mathrm{train}} + O^*\left(\frac{\mathrm{VCdim}}{\ell}\right).$$
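The sketch below illustrates how the oracle's information turns into the right-hand sides $r_i$ of the constraints. It assumes that the oracle slack is $\xi^0(x_i) = \max(0, 1 - y_i[(w_{\mathrm{best}}, x_i) + b_{\mathrm{best}}])$; the slide does not spell out this formula, so treat it as an assumption. The function names and data are hypothetical, and the code only checks feasibility rather than solving the minimization.

```python
def dot(u, v):
    """Inner product (u, v) of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def oracle_rhs(w_best, b_best, points, labels):
    """r_i = 1 - xi0_i, with xi0_i the (assumed) hinge slack of the best hyperplane."""
    xi0 = [max(0.0, 1.0 - y * (dot(w_best, x) + b_best))
           for x, y in zip(points, labels)]
    return [1.0 - s for s in xi0]


def satisfies_constraints(w, b, points, labels, rhs, tol=1e-12):
    """Check y_i[(w, x_i) + b] >= r_i for every training point."""
    return all(y * (dot(w, x) + b) >= r - tol
               for x, y, r in zip(points, labels, rhs))


if __name__ == "__main__":
    X = [(2.0, 0.0), (0.5, 0.0), (-2.0, 0.0), (0.5, 1.0)]
    y = [1, 1, -1, -1]
    r = oracle_rhs((1.0, 0.0), 0.0, X, y)
    print(r)
    print(satisfies_constraints((1.0, 0.0), 0.0, X, y, r))
```

With the $r_i$ fixed in advance, only the $n$ parameters of $w$ (and $b$) remain to be estimated, as in the separable case.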

ILLUSTRATION I

[Figure: sample training data for class I and class II; horizontal axis $x_1$, vertical axis $x_2$.]

ILLUSTRATION II

[Figure: error rate (%) versus training data size for SVM and Oracle SVM with a linear kernel K; the Bayes error is shown for reference.]

WHAT CAN A REAL TEACHER DO? (I)

One cannot expect that a teacher knows the values of the slacks. However, the teacher can:

• Supply students with a correcting space $X^*$ and a set of functions $\xi(x^*, \delta)$, $\delta \in D$, in this space (with VC dimension $h^*$) which contains a function

$$\xi_i = \xi(x_i^*, \delta_{\mathrm{best}})$$

that approximates the oracle slack function $\xi^0 = \xi^0(x^*)$ well.

• During the training process, supply students with triplets

$$(x_1, x_1^*, y_1), \ldots, (x_\ell, x_\ell^*, y_\ell)$$

in order to estimate simultaneously both the correcting (slack) function $\xi = \xi(x^*, \delta_\ell)$ and the decision hyperplane (the pair $(w_\ell, b_\ell)$).

WHAT CAN A REAL TEACHER DO? (II)

The problem of learning with a teacher is to minimize the functional

$$R(w, b, \delta) = (w, w) + C \sum_{i=1}^{\ell} \xi(x_i^*, \delta)$$

subject to the constraints

$$\xi(x^*, \delta) \ge 0$$

and the constraints

$$y_i((w, x_i) + b) \ge 1 - \xi(x_i^*, \delta), \quad i = 1, \ldots, \ell.$$

Proposition 2. With probability $1 - \eta$ the following bound holds true:

$$P(y[(w_\ell, x) + b_\ell] < 0) \le P(1 - \xi(x^*, \delta_\ell) < 0) + A \, \frac{(n + h^*)\left(\ln \frac{2\ell}{n + h^*} + 1\right) - \ln \eta}{\ell}.$$

The problem is how good the teacher is: how fast the probability $P(1 - \xi(x^*, \delta_\ell) < 0)$ converges to the probability $P(1 - \xi(x^*, \delta_0) < 0)$.
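For a feel of the magnitude of the second term in Proposition 2, the sketch below evaluates it numerically. It reads the logarithm's argument as $2\ell / (n + h^*)$, the usual form in VC-type bounds, which is an assumption given how the slide text was flattened. The constant $A$ is not specified in the slides and is set to 1 purely for illustration; the function name is hypothetical.

```python
import math


def capacity_term(n, h_star, n_obs, eta, A=1.0):
    """A * ((n + h*) * (ln(2l / (n + h*)) + 1) - ln(eta)) / l.

    A = 1.0 is a placeholder; the slides do not give its value.
    """
    d = n + h_star
    return A * (d * (math.log(2 * n_obs / d) + 1) - math.log(eta)) / n_obs


if __name__ == "__main__":
    # Assumed values: n = 10 decision parameters, h* = 5, confidence 1 - eta = 0.95.
    for n_obs in (1_000, 10_000, 100_000):
        print(n_obs, capacity_term(10, 5, n_obs, 0.05))
```

The term shrinks roughly like $\ln \ell / \ell$, so the overall rate depends on how quickly the first term, the teacher's contribution, converges.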

THE BOTTOM LINE

The goal of a teacher is, by introducing both the space $X^*$ and the set of slack functions $\xi(x^*, \delta)$, $\delta \in D$, in this space, to try to speed up the rate of convergence of the learning process from $O\left(\frac{1}{\sqrt{\ell}}\right)$ to $O\left(\frac{1}{\ell}\right)$.

The difference between standard and fast methods is in the number of examples needed for training: $\ell$ for the standard methods and $\sqrt{\ell}$ for the fast methods (i.e. 100,000 and 320; or 1000 and 32).
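The arithmetic behind these numbers can be sketched as follows. If a method with rate $1/\sqrt{\ell}$ reaches accuracy $\varepsilon$ with $\ell = 1/\varepsilon^2$ examples, a method with rate $1/\ell$ reaches the same accuracy with $1/\varepsilon = \sqrt{\ell}$ examples. The function name is hypothetical and all constants in the rates are ignored.

```python
import math


def fast_sample_size(standard_size):
    """Examples a 1/l-rate method needs to match a 1/sqrt(l)-rate method
    trained on standard_size examples (constants ignored, rounded up)."""
    return math.ceil(math.sqrt(standard_size))


if __name__ == "__main__":
    for standard in (1_000, 100_000):
        print(standard, fast_sample_size(standard))
```

For 100,000 examples this gives 317, which the slide rounds to 320.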

IDEA OF SVM ALGORITHM

• Transform the training pairs

$$(x_1, y_1), \ldots, (x_\ell, y_\ell)$$

into the pairs

$$(z_1, y_1), \ldots, (z_\ell, y_\ell)$$

by mapping vectors $x \in X$ into $z \in Z$.

• Find in $Z$ the hyperplane that minimizes the functional

$$R(w, b) = (w, w) + C \sum_{i=1}^{\ell} \xi_i$$

subject to the constraints

$$y_i[(w, z_i) + b] \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, \ell.$$

• Use the inner product in $Z$ space in the form

$$(z_i, z_j) = K(x_i, x_j).$$

DUAL SPACE SOLUTION FOR SVM

The decision function has the form

$$f(x, \alpha) = \mathrm{sgn}\left[\sum_{i=1}^{\ell} \alpha_i y_i K(x_i, x) + b\right] \qquad (2)$$

where $\alpha_i \ge 0$, $i = 1, \ldots, \ell$, are values which maximize the functional

$$R(\alpha) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \qquad (3)$$

subject to the constraints

$$\sum_{i=1}^{\ell} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, \ldots, \ell.$$

Here the kernel $K(\cdot, \cdot)$ is used for two different purposes:

1. In (2) to define a set of expansion functions $K(x_i, x)$.
2. In (3) to define similarity between vectors $x_i$ and $x_j$.
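The two roles of the kernel can be seen in a minimal sketch that evaluates the decision function (2) and the dual functional (3) for given coefficients. The coefficients here are supplied by hand rather than found by a quadratic-programming solver, the linear kernel is chosen only to keep the numbers easy to check, and all function names are hypothetical.

```python
def linear_kernel(u, v):
    """K(u, v) = (u, v); any other kernel with this signature could be used."""
    return sum(a * b for a, b in zip(u, v))


def decision(x, alphas, labels, points, b, kernel):
    """Equation (2): sgn(sum_i alpha_i y_i K(x_i, x) + b); sgn(0) is taken as +1."""
    s = sum(a * y * kernel(xi, x) for a, y, xi in zip(alphas, labels, points))
    return 1 if s + b >= 0 else -1


def dual_objective(alphas, labels, points, kernel):
    """Equation (3): sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j K(x_i, x_j)."""
    n = len(alphas)
    quad = sum(alphas[i] * alphas[j] * labels[i] * labels[j]
               * kernel(points[i], points[j])
               for i in range(n) for j in range(n))
    return sum(alphas) - 0.5 * quad


if __name__ == "__main__":
    X = [(1.0, 0.0), (-1.0, 0.0)]   # toy training vectors (assumed)
    y = [1, -1]
    alphas = [0.5, 0.5]             # hand-picked, satisfies sum_i alpha_i y_i = 0
    print(dual_objective(alphas, y, X, linear_kernel))
    print(decision((2.0, 0.0), alphas, y, X, 0.0, linear_kernel))
```

Swapping `linear_kernel` for another kernel changes both the expansion functions in (2) and the similarity measure in (3) at once.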

IDEA OF SVM+ ALGORITHM

• Transform the training triplets

$$(x_1, x_1^*, y_1), \ldots, (x_\ell, x_\ell^*, y_\ell)$$

into the triplets

$$(z_1, z_1^*, y_1), \ldots, (z_\ell, z_\ell^*, y_\ell)$$

by mapping vectors $x \in X$ into vectors $z \in Z$ and $x^* \in X^*$ into $z^* \in Z^*$.

• Define the slack function in the form

$$\xi_i = (w^*, z_i^*) + b^*$$

and find in space $Z$ the hyperplane that minimizes the functional

$$R(w, b, w^*, b^*) = (w, w) + \gamma (w^*, w^*) + C \sum_{i=1}^{\ell} [(w^*, z_i^*) + b^*]_+$$

subject to the constraints

$$y_i[(w, z_i) + b] \ge 1 - [(w^*, z_i^*) + b^*], \quad i = 1, \ldots, \ell.$$

• Use inner products in $Z$ and $Z^*$ spaces in the kernel form

$$(z_i, z_j) = K(x_i, x_j), \quad (z_i^*, z_j^*) = K^*(x_i^*, x_j^*).$$

DUAL SPACE SOLUTION FOR SVM+

The decision function has the form

$$f(x, \alpha) = \mathrm{sgn}\left[\sum_{i=1}^{\ell} \alpha_i y_i K(x_i, x) + b\right]$$

where $\alpha_i$, $i = 1, \ldots, \ell$, are values that maximize the functional

$$R(\alpha, \beta) = \sum_{i=1}^{\ell} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{\ell} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \frac{1}{2\gamma} \sum_{i,j=1}^{\ell} (\alpha_i - \beta_i)(\alpha_j - \beta_j) K^*(x_i^*, x_j^*)$$

subject to the constraints

$$\sum_{i=1}^{\ell} \alpha_i y_i = 0, \quad \sum_{i=1}^{\ell} (\alpha_i - \beta_i) = 0$$

and the constraints

$$\alpha_i \ge 0, \quad 0 \le \beta_i \le C.$$
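The SVM+ dual differs from the SVM dual by one extra quadratic term in the correcting space and one extra equality constraint. The sketch below evaluates $R(\alpha, \beta)$ and checks the constraints for hand-picked coefficients; it does not solve the maximization. The function names, the toy data, and the choice of linear kernels for both $K$ and $K^*$ are assumptions made for illustration.

```python
def linear_kernel(u, v):
    """Inner product used here for both K and K* (an illustrative choice)."""
    return sum(a * b for a, b in zip(u, v))


def svm_plus_dual(alphas, betas, labels, X, X_star, K, K_star, gamma):
    """R(alpha, beta) with the decision-space and correcting-space terms."""
    n = len(alphas)
    quad = sum(alphas[i] * alphas[j] * labels[i] * labels[j] * K(X[i], X[j])
               for i in range(n) for j in range(n))
    d = [a - b for a, b in zip(alphas, betas)]
    corr = sum(d[i] * d[j] * K_star(X_star[i], X_star[j])
               for i in range(n) for j in range(n))
    return sum(alphas) - 0.5 * quad - corr / (2.0 * gamma)


def is_feasible(alphas, betas, labels, C, tol=1e-9):
    """Check sum alpha_i y_i = 0, sum (alpha_i - beta_i) = 0, alpha_i >= 0, 0 <= beta_i <= C."""
    return (abs(sum(a * y for a, y in zip(alphas, labels))) <= tol
            and abs(sum(a - b for a, b in zip(alphas, betas))) <= tol
            and all(a >= 0 for a in alphas)
            and all(0 <= b <= C for b in betas))


if __name__ == "__main__":
    X = [(1.0, 0.0), (-1.0, 0.0)]   # decision-space vectors (assumed)
    X_star = [(1.0,), (3.0,)]       # privileged vectors (assumed)
    y = [1, -1]
    alphas, betas = [0.5, 0.5], [0.25, 0.75]
    print(is_feasible(alphas, betas, y, C=1.0))
    print(svm_plus_dual(alphas, betas, y, X, X_star, linear_kernel, linear_kernel, 1.0))
```

When $\alpha_i = \beta_i$ for all $i$ the correcting term vanishes and the value coincides with the SVM dual functional.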

ADVANCED TECHNICAL MODEL AS PRIVILEGED INFORMATION

Classification of proteins into families

The problem: given the amino-acid sequences of proteins, construct a rule to classify proteins into families.

The decision space $X$ is the space of amino-acid sequences.

The privileged information space $X^*$ is the space of 3D structures of the proteins.
