SLIDE 1

accelerated kernel learning¹

purushottam kar
department of computer science and engineering
indian institute of technology kanpur

november 27, 2012

¹ joint work with harish c. karnick

SLIDE 10

menu del dia (menu of the day)

◮ learning (7 slides)
  ◮ introduction to machine learning
  ◮ issues in learning
◮ kernel learning (6 slides)
  ◮ introduction to kernel learning
  ◮ issues in kernel learning
◮ accelerated kernel learning (11 slides)
  ◮ random features
  ◮ other methods

SLIDE 19

learning 101

◮ why machine learning?
  ◮ automate tasks that are difficult for humans
◮ where is machine learning used?
  ◮ point out spam mails for a gmail user
  ◮ predict stock market prices
  ◮ predict new friends for a facebook user
◮ how does one do machine learning?
  ◮ discover patterns in data
  ◮ what sort of patterns?

SLIDE 27

ml task 1: classification

◮ goal: find a way to assign the "correct" label to a set of objects
◮ observe a gmail user as he tags his mails as spam or useful
◮ can we figure out a pattern?
◮ can we automatically detect spam mails for him?
◮ can we use his patterns to tag his girlfriend's emails?

figure: linear classification
figure: non-linear classification

SLIDE 35

ml task 2: regression

◮ goal: more like generalized curve fitting
◮ observe variables such as company performance, past trends etc. and the stock prices of a given company
◮ can we predict today's stock prices for the company?
◮ no "labels" here
◮ non-discrete pattern

figure: real valued regression
figure: dangers of overfitting

SLIDE 45

other ml tasks

◮ ranking
  ◮ find the top 10 facebook users with whom I am likely to make friends
◮ clustering
  ◮ given genome data, discover families, genera and species
◮ component analysis
  ◮ find principal or independent components in data
  ◮ useful in signal processing, dimensionality reduction

figure: clustering problems
figure: principal component analysis

SLIDE 55

a mathematical abstraction

◮ domain: a set X of objects we are interested in
  ◮ emails, stocks, facebook users, living organisms, analog signals
  ◮ the set may be discrete/continuous, finite/infinite
  ◮ may have a variety of structure (topological/geometric)
◮ label set: the property Y of the objects we are interested in predicting
  ◮ classification: discrete label set, Y = {±1} for spam classification
  ◮ regression: continuous label set, Y ⊂ R
  ◮ ranking, clustering, component analysis: more structured label sets
◮ true pattern: f* : X → Y
  ◮ mathematically captures the notion of "correct" labellings

SLIDE 69

the learning process

◮ supervised learning
  ◮ includes tasks such as classification, regression, ranking
  ◮ shall not discuss unsupervised or semi-supervised learning today
◮ learn from the teacher
  ◮ we are given access to lots of domain elements with their true labels
  ◮ training set: {(x_1, f*(x_1)), (x_2, f*(x_2)), ..., (x_n, f*(x_n))}
  ◮ hypothesis: a pattern h : X → Y we infer using the training data
◮ goal: learn a hypothesis that is close to the true pattern
◮ formalizing closeness of hypothesis to true pattern
  ◮ how often do we give out a wrong answer: P[h(x) ≠ f*(x)]
  ◮ more generally, utilize loss functions ℓ : Y × Y → R
  ◮ closeness defined as average loss: E ℓ(h(x), f*(x))
  ◮ zero-one loss: ℓ(y_1, y_2) = 1_{y_1 ≠ y_2} (for classification)
  ◮ quadratic loss: ℓ(y_1, y_2) = (y_1 − y_2)² (for regression; both losses are sketched in code below)
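
a small illustration (not from the slides): the two losses above, computed as empirical averages over a toy sample.

```python
import numpy as np

# zero-one loss: 1 if the labels disagree, 0 otherwise (classification)
def zero_one_loss(y_pred, y_true):
    return np.mean(y_pred != y_true)

# quadratic loss: squared difference (regression)
def quadratic_loss(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

y_true = np.array([1, -1, 1, 1])
y_pred = np.array([1, 1, 1, -1])
print(zero_one_loss(y_pred, y_true))   # 0.5 (two of four labels wrong)
print(quadratic_loss(y_pred, y_true))  # 2.0
```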

SLIDE 76

issues in the learning process

◮ how to learn a hypothesis from a training set?
◮ how do I select my training set?
◮ how many training points should I choose?
◮ how do I output my hypothesis to the end user?
◮ ...
◮ shall only address the first and the last issue in this talk
◮ shall find the nearest carpet for the rest of the issues

SLIDE 85

kernel learning 101

◮ take the example of spam classification
◮ assume that emails that look similar have the same label
  ◮ essentially saying that the true pattern is smooth
  ◮ can infer the label of a new email using labels of emails seen before
◮ how to quantify "similarity"?
  ◮ a bivariate function K : X × X → R
  ◮ e.g. the dot product in euclidean spaces:
    K(x_1, x_2) = ⟨x_1, x_2⟩ := ‖x_1‖_2 ‖x_2‖_2 cos(∠(x_1, x_2))
  ◮ e.g. the number of shared friends on facebook

SLIDE 94

learning using similarities

◮ a new email can be given the label of the most similar email in the training set
  ◮ not a good idea: would be slow and prone to noise
◮ take all training emails and ask them to vote
  ◮ training emails that are similar to the new email have more influence
  ◮ some training emails are more useful than others
  ◮ more resilient to noise but still can be slow
◮ kernel learning uses hypotheses of the form

  h(x) = Σ_{i=1}^{n} α_i y_i K(x, x_i)

  ◮ α_i denotes the usefulness of training email x_i
  ◮ for classification one uses sign(h(x)) (see the evaluation sketch below)
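
a minimal sketch of evaluating such a hypothesis, assuming a kernel K and weights α_i produced by some training procedure; all data below is illustrative.

```python
import numpy as np

def linear_kernel(x1, x2):
    return x1 @ x2

# h(x) = sum_i alpha_i y_i K(x, x_i): one kernel evaluation per training
# point, which is why prediction "still can be slow" on large training sets
def h(x, X_train, y_train, alpha, K=linear_kernel):
    return sum(a * y * K(x, xi) for a, y, xi in zip(alpha, y_train, X_train))

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10, 3))
y_train = rng.choice([-1, 1], size=10)
alpha = rng.uniform(size=10)          # illustrative; normally learned
x_new = rng.normal(size=3)
print(np.sign(h(x_new, X_train, y_train, alpha)))  # predicted label
```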

SLIDE 98

a toy example

◮ take X ⊂ R² and K(x_1, x_2) = ⟨x_1, x_2⟩ (linear kernel); the hypothesis then collapses (checked numerically below):

  h(x) = Σ_{i=1}^{n} α_i y_i ⟨x, x_i⟩ = ⟨x, Σ_{i=1}^{n} α_i y_i x_i⟩ = ⟨x, w⟩ (linear hypothesis)

◮ if the α_i were absent then w = Σ_{i : y_i = 1} x_i − Σ_{j : y_j = −1} x_j : a weaker model
◮ the α_i are found by solving an optimization problem: details out of scope

figure: linear classifier
figure: utility of weight variables α_i
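
the collapse above is easy to check numerically (toy data, made-up weights):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))            # training points in R^2
y = np.array([1, -1, 1, 1, -1])
alpha = rng.uniform(size=5)            # illustrative weights
x = rng.normal(size=2)

h_implicit = sum(a * yi * (x @ xi) for a, yi, xi in zip(alpha, y, X))
w = (alpha * y) @ X                    # w = sum_i alpha_i y_i x_i
print(np.isclose(h_implicit, x @ w))   # True
```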

SLIDE 105

enter mercer kernels

◮ linear hypotheses are too weak to detect complex patterns in data
  ◮ in practice more complex notions of similarity are used
  ◮ most often, mercer kernels are used
◮ mercer kernels satisfy the conditions of mercer's theorem
  ◮ loosely speaking, they correspond to measures of similarity that are actually inner products in some hilbert space
◮ more formally, a similarity function K is a mercer kernel if there exists a map Φ : X → H to some hilbert space H such that for all x_1, x_2 ∈ X, K(x_1, x_2) = ⟨Φ(x_1), Φ(x_2)⟩
◮ mercer kernels give us hypotheses that are linear in the hilbert space:

  h(x) = Σ_{i=1}^{n} α_i y_i ⟨Φ(x), Φ(x_i)⟩ = ⟨Φ(x), w⟩ for some w ∈ H

SLIDE 110

a toy example

◮ consider X ⊂ R² s.t. x = (p, q) and K(x_1, x_2) = (⟨x_1, x_2⟩ + 1)²
◮ one can show that the corresponding map is six dimensional (verified numerically below):

  Φ(x) = (p², q², √2 pq, √2 p, √2 q, 1) ∈ R⁶

◮ it is able to implement quadratic hypotheses
  ◮ e.g. h(x) = p² + q² − 1 for w = (1, 1, 0, 0, 0, −1)

figure: non linear problem
figure: kernel trick in action
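
the claimed map can be verified directly: ⟨Φ(x_1), Φ(x_2)⟩ reproduces K(x_1, x_2) on arbitrary points.

```python
import numpy as np

def K(x1, x2):
    return (x1 @ x2 + 1) ** 2

def Phi(x):                            # the six dimensional map above
    p, q = x
    s = np.sqrt(2)
    return np.array([p**2, q**2, s*p*q, s*p, s*q, 1.0])

x1, x2 = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(np.isclose(K(x1, x2), Phi(x1) @ Phi(x2)))  # True
```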

SLIDE 118

issues in kernel learning

◮ frequently one requires complex kernels having high dimensional maps
  ◮ e.g. the gaussian kernel K(x_1, x_2) = exp(−‖x_1 − x_2‖₂² / (2σ²)) has an infinite dimensional map
  ◮ cannot explicitly compute the map Φ
  ◮ the kernel trick: can compute K(x_1, x_2) without computing Φ (see the sketch below)
  ◮ have to use the implicit form h(x) = Σ_{i=1}^{n} α_i y_i K(x, x_i) : slow
◮ why only mercer kernels?
  ◮ for algorithmic convenience and a clean theory
  ◮ can use non-mercer indefinite kernels as well: out of scope
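
the kernel trick in miniature (a sketch, not code from the talk): the gaussian kernel value is computed from x_1 and x_2 alone, with no reference to the infinite dimensional map Φ.

```python
import numpy as np

def gaussian_kernel(x1, x2, sigma=1.0):
    # K(x1, x2) = exp(-||x1 - x2||^2 / (2 sigma^2)), no Phi needed
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2 * sigma ** 2))

x1, x2 = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(gaussian_kernel(x1, x2))
# prediction must still use h(x) = sum_i alpha_i y_i K(x, x_i),
# i.e. one kernel evaluation per training point -- hence "slow"
```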

SLIDE 127

fast kernel learning: the basic idea

◮ two ways of representing mercer kernel hypotheses
  ◮ h(x) = Σ_{i=1}^{n} α_i y_i K(x, x_i)
    requires up to n (and in practice Ω(n)) operations
  ◮ h(x) = ⟨Φ(x), w⟩ for some w ∈ H
    requires a single operation, but in a high dimensional space
◮ can we find an approximate map for the kernel in some low dimensional space?
  ◮ Z : X → R^D such that for all x_1, x_2 ∈ X, ⟨Z(x_1), Z(x_2)⟩ ≈ K(x_1, x_2)
  ◮ h(x) = ⟨Z(x), w⟩ for some w ∈ R^D
  ◮ would get the power of a kernel as well as the speed of a linear hypothesis

SLIDE 137

the underlying math

◮ why should such approximate maps exist?
◮ johnson-lindenstrauss flattening lemma [cont. math., 26:189–206, 1984]
  ◮ given n points x_1, ..., x_n ∈ H, there exists a map Ψ : H → R^D
  ◮ for all i, j, ⟨Ψ(x_i), Ψ(x_j)⟩ = ⟨x_i, x_j⟩ ± ε
  ◮ a D = O(log n / ε²) dimensional map suffices
◮ problem?
  ◮ all algorithmic implementations of the jl-lemma require explicit access to x_i ∈ H (a jl-style projection is sketched below)
  ◮ for us, calculating vectors in the hilbert space is prohibitive
  ◮ the number of dimensions depends upon the number of points
  ◮ not satisfactory
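
a standard jl-style construction, sketched under the assumption that the points are given explicitly in R^d (exactly the limitation noted above): project with a random gaussian matrix scaled by 1/√D.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, D = 50, 1000, 400
X = rng.normal(size=(n, d))                # points, explicitly in R^d
P = rng.normal(size=(d, D)) / np.sqrt(D)   # random gaussian projection
Z = X @ P                                  # n points, now in R^D

# pairwise distances (and inner products, up to an error scaled by the
# norms of the points) are approximately preserved:
i, j = 3, 7
print(np.linalg.norm(X[i] - X[j]), np.linalg.norm(Z[i] - Z[j]))
```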

SLIDE 138

the underlying math

figure: dimensionality reduction via jl transform

SLIDE 142

structure theorems

◮ characterization of certain kernel families

bochner's theorem [rudin, fourier analysis on groups, 1962]
every translation invariant mercer kernel on a locally compact abelian group is the fourier-stieltjes transform of some bounded positive measure on the pontryagin dual group: K(x_1, x_2) = ∫_Γ γ(x_1 − x_2) dµ(γ)

schoenberg's theorem [duke math. journ., 9(1):96–108, 1942]
every dot product mercer kernel arises from an analytic function having a maclaurin series with non-negative coefficients: K(x_1, x_2) = Σ_{n=0}^{∞} a_n ⟨x_1, x_2⟩ⁿ

◮ allows us to develop fast routines for radial basis, homogeneous and dot product kernels

SLIDE 146

random features: the basic idea

◮ a kernel whose map is one-dimensional is called a rank-one kernel
◮ one can interpret the structure theorems as telling us that every such kernel is a positive combination of rank-one kernels, i.e. for some measure µ ≥ 0,

  K(x_1, x_2) = ∫_Ω K_ω(x_1, x_2) dµ(ω) = E_{ω∼µ}[K_ω(x_1, x_2)]

  where for all ω ∈ Ω, K_ω : X × X → R is a rank-one kernel, i.e. for some Φ_ω : X → R and all x_1, x_2 ∈ X, K_ω(x_1, x_2) = Φ_ω(x_1) · Φ_ω(x_2)
◮ a random K_ω gives us an unbiased estimate of K on all pairs of points
◮ once we have an unbiased estimate for a quantity, independent repetitions can help reduce variance (see the sketch below)
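
a concrete instance (a sketch based on the rahimi-recht construction cited later in the talk, for the gaussian kernel with σ = 1): Φ_ω(x) = √2 cos(ωᵀx + b) with ω ~ N(0, I) and b ~ U[0, 2π] gives E[Φ_ω(x_1) Φ_ω(x_2)] = exp(−‖x_1 − x_2‖²/2), so the monte carlo average over random ω recovers the kernel value.

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 3, 200_000
x1 = np.array([0.3, -1.2, 0.5])
x2 = np.array([0.9, 0.1, -0.4])

W = rng.normal(size=(trials, d))               # one omega per row
b = rng.uniform(0, 2 * np.pi, size=trials)
# each row gives one unbiased rank-one estimate Phi_w(x1) * Phi_w(x2)
estimates = 2 * np.cos(W @ x1 + b) * np.cos(W @ x2 + b)

print(estimates.mean())                              # monte carlo average
print(np.exp(-np.linalg.norm(x1 - x2) ** 2 / 2))     # exact kernel value
```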

slide-147
SLIDE 147

random features : implementation

◮ select D values {ω1, ω2, . . . , ωD} randomly from distribution µ over Ω

purushottam kar (iit kanpur) accelerated kernel learning november 27, 2012 21 / 27

slide-148
SLIDE 148

random features : implementation

◮ select D values {ω1, ω2, . . . , ωD} randomly from distribution µ over Ω ◮ create the map

Z(x) = (Φω1(x), Φω2(x), . . . , ΦωD(x)) ∈ RD

purushottam kar (iit kanpur) accelerated kernel learning november 27, 2012 21 / 27

slide-149
SLIDE 149

random features : implementation

◮ select D values {ω1, ω2, . . . , ωD} randomly from distribution µ over Ω

◮ create the map

Z(x) = (Φω1(x), Φω2(x), . . . , ΦωD(x)) ∈ R^D

theorem (approximation guarantee for random features)

for a compact domain X ⊂ R^d and any ε, δ > 0, take D = O((d/ε²) log(1/εδ)) and construct the D-dimensional map; then with probability (1 − δ),

sup_{x1, x2 ∈ X} |K(x1, x2) − ⟨Z(x1), Z(x2)⟩| ≤ ε

purushottam kar (iit kanpur) accelerated kernel learning november 27, 2012 21 / 27
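a sketch of this construction for the gaussian kernel, in the spirit of rahimi and recht's random fourier features (the √(2/D) normalisation folded into Z below is an assumption the slide leaves implicit; all names are illustrative):

```python
import numpy as np

# Sketch of the random-features map Z for the Gaussian kernel
# K(x1, x2) = exp(-||x1 - x2||^2 / (2 * sigma^2)) in the spirit of
# Rahimi and Recht. The sqrt(2/D) normalisation folded into Z is an
# assumption the slide leaves implicit.

def fit_random_fourier_features(d, D, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    omega = rng.normal(scale=1.0 / sigma, size=(D, d))   # omega_1..omega_D ~ mu
    b = rng.uniform(0, 2 * np.pi, size=D)
    def Z(X):
        # X: (n, d) array of points -> (n, D) feature matrix.
        return np.sqrt(2.0 / D) * np.cos(X @ omega.T + b)
    return Z

# <Z(x1), Z(x2)> approximates K(x1, x2) uniformly over a compact set:
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))
Z = fit_random_fourier_features(d=10, D=2000)
ZX = Z(X)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
print("max |K - <Z,Z>| =", np.abs(np.exp(-sq_dists / 2) - ZX @ ZX.T).max())
```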

slide-155
SLIDE 155

random features : properties

◮ the guarantee is uniform unlike the jl-lemma guarantee

◮ holds for all (possibly infinite) pairs of points from X

◮ hypothesis is of the form h(x) = ⟨Z(x), w⟩, for some w ∈ R^D

◮ evaluating a hypothesis takes O (D) time

◮ procedure gives approximation to the kernel function directly

◮ the same random features can be used for different tasks : classification, regression, etc.

purushottam kar (iit kanpur) accelerated kernel learning november 27, 2012 22 / 27
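to make the last point concrete, a small sketch (an addition, assuming the fit_random_fourier_features helper from the earlier sketch) that reuses one feature map for both a regression and a classification hypothesis h(x) = ⟨Z(x), w⟩, with closed-form ridge regression standing in for liblinear:

```python
import numpy as np

# Sketch (illustrative): one random feature map reused for two tasks. The
# hypothesis is linear in feature space, h(x) = <Z(x), w>, so evaluating it
# costs O(D). Assumes fit_random_fourier_features from the earlier sketch;
# closed-form ridge regression stands in for liblinear.

def ridge(Phi, y, lam=1e-3):
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10))
Z = fit_random_fourier_features(d=10, D=500)
Phi = Z(X)                                    # computed once, shared by tasks

y_reg = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)   # regression targets
w_reg = ridge(Phi, y_reg)

y_cls = np.sign(X[:, 0] + X[:, 1])            # +/-1 labels for classification
w_cls = ridge(Phi, y_cls)                     # least-squares classifier

x_new = rng.normal(size=(1, 10))
print("regression prediction:", (Z(x_new) @ w_reg)[0])
print("predicted class:", int(np.sign(Z(x_new) @ w_cls)[0]))
```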

slide-156
SLIDE 156

random features : properties

figure: random features providing dimensionality reduction

purushottam kar (iit kanpur) accelerated kernel learning november 27, 2012 23 / 27

slide-161
SLIDE 161

random features : in action

◮ several constructions for various families

◮ translation invariant kernels [rahimi, recht, nips 2007]
◮ homogeneous kernels [vedaldi, zisserman, cvpr 2010]
◮ dot product kernels [k., karnick, aistats 2012]

figure: approximation error in reconstructing kernel values, plotted against the number of random features D (1000 to 5000); panels show the homogeneous polynomial kernel (d = 10 to 200), the polynomial kernel and the exponential kernel (d = 50 to 200, each with and without the H0/1 variant)

purushottam kar (iit kanpur) accelerated kernel learning november 27, 2012 24 / 27
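a sketch of the dot-product construction along the lines of the aistats 2012 paper (a simplified rendering added here, not the paper's exact pseudocode): sample a maclaurin degree N with probability p^−(N+1), then estimate ⟨x1, x2⟩^N with products of Rademacher projections:

```python
import numpy as np
from math import factorial

# Sketch of random Maclaurin features for a dot-product kernel
# K(x1, x2) = sum_n a_n <x1, x2>^n (the Schoenberg form), along the lines of
# the AISTATS 2012 construction -- a simplified rendering, not the paper's
# exact pseudocode. Example: K(x1, x2) = exp(<x1, x2>), so a_n = 1 / n!.

def maclaurin_feature_map(d, D, a=lambda n: 1.0 / factorial(n), p=2.0, seed=0):
    rng = np.random.default_rng(seed)
    # Sample a degree N with P[N = n] = p^-(n+1); for p = 2 this sums to 1.
    degrees = rng.geometric(1.0 - 1.0 / p, size=D) - 1
    # For each feature, N independent Rademacher vectors w_1..w_N in {-1,+1}^d.
    ws = [rng.choice([-1.0, 1.0], size=(n, d)) for n in degrees]
    scale = np.array([np.sqrt(a(n) * p ** (n + 1)) for n in degrees])
    def Z(X):
        # Phi(x) = sqrt(a_N * p^(N+1)) * prod_j <w_j, x>  (empty product = 1),
        # so that Phi(x1) * Phi(x2) is an unbiased estimate of K(x1, x2).
        feats = np.stack([(w @ X.T).prod(axis=0) for w in ws], axis=1)
        return scale * feats / np.sqrt(D)
    return Z

X = np.random.default_rng(1).normal(size=(50, 10)) / np.sqrt(10)
Z = maclaurin_feature_map(d=10, D=20000)
ZX = Z(X)
print("max error:", np.abs(np.exp(X @ X.T) - ZX @ ZX.T).max())
```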

slide-162
SLIDE 162

random features : in action

dataset                        | K + libsvm                              | RF + liblinear                                                | H0/1 + liblinear
nursery (N = 13000, d = 8)     | acc = 99.8%, trn = 10.8s, tst = 1.7s    | acc = 99.6%, trn = 2.52s (4.3×), tst = 0.6s (2.8×), D = 500   | acc = 97.96%, trn = 0.4s (27×), tst = 0.18s (9.4×), D = 100
cod-rna (N = 60000, d = 8)     | acc = 95.2%, trn = 91.5s, tst = 17.1s   | acc = 94.9%, trn = 11.5s (8×), tst = 2.8s (6.1×), D = 500     | acc = 93.8%, trn = 0.67s (136×), tst = 1.4s (12×), D = 50
adult (N = 49000, d = 123)     | acc = 83.7%, trn = 263.3s, tst = 33.4s  | acc = 82.9%, trn = 39.8s (6.6×), tst = 14.3s (2.3×), D = 500  | acc = 84.8%, trn = 7.18s (37×), tst = 9.4s (3.6×), D = 100
covertype (N = 581000, d = 54) | acc = 80.6%, trn = 194.1s, tst = 695.8s | acc = 76.2%, trn = 21.4s (9×), tst = 207s (3.6×), D = 1000    | acc = 75.5%, trn = 3.7s (52×), tst = 80.4s (8.7×), D = 100

figure: speedups for the exponential kernel K(x1, x2) = exp(⟨x1, x2⟩ / σ²)

purushottam kar (iit kanpur) accelerated kernel learning november 27, 2012 25 / 27

slide-170
SLIDE 170
other approaches

◮ alternative approaches exist that, given a set of training points x1, . . . , xn, approximate the gram matrix G = [gij], gij = K(xi, xj)

◮ cholesky decomposition : finds a rank-D approximation to G
◮ nyström method : chooses a subsample of training points x̂1, . . . , x̂D as anchor points and creates a D-dimensional map

◮ advantages

◮ data dependency helps in hard learning instances [yang et al, nips 2010]

◮ disadvantages

◮ slower than random features : the hypothesis takes Ω(D²) time to evaluate in the worst case, versus O(D) time using random features
◮ expensive preprocessing required : increases time taken to learn

purushottam kar (iit kanpur) accelerated kernel learning november 27, 2012 26 / 27
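a minimal numpy sketch of the nyström map (my rendering of the standard construction, not code from the talk): pick D anchor points, then push each x through the pseudo-inverse square root of the anchor gram matrix:

```python
import numpy as np

# Sketch of the Nystrom feature map (standard construction, not code from
# the talk). Anchors xhat_1..xhat_D are subsampled training points; the map
# is Z(x) = W^(-1/2) k(x), where W = [K(xhat_i, xhat_j)] and
# k(x) = (K(xhat_1, x), ..., K(xhat_D, x)).

def gaussian_kernel(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def nystroem_map(X_train, D, seed=0):
    rng = np.random.default_rng(seed)
    anchors = X_train[rng.choice(len(X_train), size=D, replace=False)]
    W = gaussian_kernel(anchors, anchors)
    # Pseudo-inverse square root of W via its eigendecomposition;
    # eigenvalues below the threshold are zeroed out.
    vals, vecs = np.linalg.eigh(W)
    inv_sqrt = vecs @ np.diag(np.where(vals > 1e-12, vals, np.inf) ** -0.5) @ vecs.T
    def Z(X):
        return gaussian_kernel(X, anchors) @ inv_sqrt   # (n, D) features
    return Z

X = np.random.default_rng(3).normal(size=(300, 10))
Z = nystroem_map(X, D=100)
ZX = Z(X)
print("gram error:", np.abs(gaussian_kernel(X, X) - ZX @ ZX.T).max())
```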

slide-175
SLIDE 175

conclusion

◮ which kernel families admit such random feature constructions?

◮ there do exist kernels that do not [balcan et al., mach. learn., 65(1): 79–94, 2006]

◮ introduce data awareness in these methods
◮ explore applications in other kernel learning tasks

◮ some work in clustering [chitta et al., icdm 2012]

purushottam kar (iit kanpur) accelerated kernel learning november 27, 2012 27 / 27