

SLIDE 1

Algorithms for NLP
CS 11-711 · Fall 2020
Lecture 3: Nonlinear text classification
Emma Strubell

SLIDE 2

Announcements

■ Project 1: Text classification will be available after class today, due Friday September 25.

■ Han will lead recitation this Friday to introduce Project 1 and cover environment setup (Python, Jupyter, NumPy, PyTorch).

SLIDE 3

Recap: Representing text as a bag of words

■ Given a text w = (w1, w2, …, wT) ∈ V*.
■ Choose a label y ∈ Y.
■ The bag-of-words x is a fixed-length vector of word counts.
■ The length of x is equal to the size of the vocabulary, V.

Example: w = "The drinks were strong but the fish tacos were bland".

[Figure: w mapped to its count vector x over the vocabulary ("aardvark" … "zyther"): "the" and "were" get count 2; "drinks", "strong", "but", "fish", "tacos", and "bland" get count 1; every other entry is 0.]
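To make the representation concrete, here is a minimal NumPy sketch (not from the slides; the toy vocabulary is just the running example, and in practice V comes from the training data):

    import numpy as np

    # Toy vocabulary; in practice V is the full training vocabulary.
    vocab = ["bland", "but", "drinks", "fish", "strong", "tacos", "the", "were"]
    word_to_idx = {w: j for j, w in enumerate(vocab)}

    def bag_of_words(tokens, word_to_idx):
        """Map a token sequence w to a fixed-length count vector x of length V."""
        x = np.zeros(len(word_to_idx))
        for w in tokens:
            if w in word_to_idx:           # out-of-vocabulary tokens are dropped
                x[word_to_idx[w]] += 1
        return x

    w = "the drinks were strong but the fish tacos were bland".split()
    x = bag_of_words(w, word_to_idx)       # x[word_to_idx["the"]] == 2, etc.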

SLIDE 4

Recap: Linear classification on bag-of-words

■ Let ψ(x, y) score the compatibility of bag-of-words x and label y. Then:

    ŷ = argmax_y ψ(x, y).

■ In a linear classifier this scoring function has the simple form:

    ψ(x, y) = θ · f(x, y) = Σ_j θ_j × f_j(x, y),

where θ is a vector of weights, and f is a feature function.
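A sketch of how prediction then works (shapes are illustrative; this is equivalent to θ · f(x, y) when each row of Θ holds the weights for one label):

    import numpy as np

    Theta = np.random.randn(3, 8)             # K = 3 labels, V = 8 vocabulary items
    x = np.array([1, 1, 1, 1, 1, 1, 2, 2.])   # counts from the bag-of-words example

    scores = Theta @ x                        # psi(x, y) for every label y at once
    y_hat = int(np.argmax(scores))            # y_hat = argmax_y psi(x, y)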

SLIDE 5

Recap: Feature functions

■ In classification, the feature function is usually a simple combination of x and y, such as:

    f_j(x, y) = { x_bland   if y = negative
                { 0         otherwise

■ If we have K labels, this corresponds to column vectors that look like:

    f(x, y = 1) = [x_0, x_1, …, x_|V|, 0, …, 0]^T             (trailing (K−1)×V zeros)
    f(x, y = 2) = [0, …, 0, x_0, x_1, …, x_|V|, 0, …, 0]^T    (V leading zeros, (K−2)×V trailing zeros)
    f(x, y = K) = [0, …, 0, x_0, x_1, …, x_|V|]^T             ((K−1)×V leading zeros)
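The block structure can also be built directly; a hedged sketch (names illustrative) that gives the same scores as the matrix view in the previous snippet:

    import numpy as np

    def f(x, y, K):
        """Copy x into the length-(K*V) block belonging to label y;
        the other (K-1)*V entries stay zero, as in the column vectors above."""
        V = len(x)
        out = np.zeros(K * V)
        out[y * V:(y + 1) * V] = x
        return out

    theta = np.random.randn(3 * 8)            # one weight per (label, word) pair
    x = np.array([1, 1, 1, 1, 1, 1, 2, 2.])
    scores = [theta @ f(x, y, K=3) for y in range(3)]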

SLIDES 6–12

How to obtain θ?

■ The learning problem is to find the right weights θ.

■ Naïve Bayes: set θ equal to the empirical frequencies:

    μ̂_y = p̂(y) = count(y) / Σ_{y'} count(y')
    φ̂_{y,j} = p̂(x_j | y) = count(y, j) / Σ_{j'=1}^{V} count(y, j')

■ Perceptron update:

    θ^(t+1) ← θ^(t) + f(x^(i), y^(i)) − f(x^(i), ŷ).

■ Large-margin update:

    θ^(t+1) ← θ^(t) + f(x^(i), y^(i)) − f(x^(i), ŷ),
    with ŷ = argmax_{y∈Y} θ · f(x^(i), y) + c(y^(i), y).

■ Logistic regression update:

    θ^(t+1) ← θ^(t) + f(x^(i), y^(i)) − E_{y|x}[ f(x^(i), y) ].

■ All these methods for supervised learning assume a labeled dataset of N examples: {(x^(i), y^(i))}_{i=1}^N.
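For instance, the perceptron update takes only a few lines; a sketch assuming the row-per-label weight matrix from earlier, so the update touches only two rows:

    import numpy as np

    def perceptron_epoch(Theta, X, Y):
        """One pass of perceptron updates over a labeled dataset.
        Theta: (K, V) weights; X: (N, V) bag-of-words counts; Y: (N,) gold labels."""
        for x, y in zip(X, Y):
            y_hat = int(np.argmax(Theta @ x))  # current prediction
            if y_hat != y:                     # update only on mistakes
                Theta[y] += x                  # + f(x^(i), y^(i))
                Theta[y_hat] -= x              # - f(x^(i), y_hat)
        return Theta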
SLIDES 13–16

Today

[Figure: timeline. Up to ~mid 2010s: engineered features with linear classification; since then: learned features with nonlinear classification.]

■ Nonlinear classification & evaluating classifiers.

SLIDES 17–22

A simple feed-forward architecture

■ Suppose we want to label stories as Y = {Good, Bad, Okay}.
■ What makes a good story?
■ Exciting plot, compelling characters, interesting setting…
■ Let's call this vector of features z.
■ If z is well-chosen, it will be easy to predict from x (the words), and it will make it easy to predict the label, y.

SLIDES 23–29

A simple feed-forward architecture

[Figure: network diagram x → z → y.]

■ Let's predict each z_k from x by binary logistic regression:

    Pr(z_k = 1 | x) = σ(θ_k^(x→z) · x) = 1 / (1 + exp(−θ_k^(x→z) · x)),

where σ is the logistic function, a.k.a. the sigmoid.

■ The weights can be collected into a matrix,

    Θ^(x→z) = [θ_1^(x→z), θ_2^(x→z), …, θ_{K_z}^(x→z)]^T,

■ so that E[z] = σ(Θ^(x→z) x), where σ is applied element-wise. This is a matrix-vector product; dims: [K_z, V] * [V, 1] = [K_z, 1].
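A minimal sketch of the hidden-layer computation, with the dimension check from the slide (sizes illustrative):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    V, K_z = 8, 4
    Theta_xz = np.random.randn(K_z, V)        # rows are the theta_k^(x->z)
    x = np.array([1, 1, 1, 1, 1, 1, 2, 2.])

    z = sigmoid(Theta_xz @ x)                 # E[z]; sigmoid applied element-wise
    assert z.shape == (K_z,)                  # dims: [K_z, V] @ [V] -> [K_z]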

SLIDES 30–33

A simple feed-forward architecture

■ Next we predict y from z, again using logistic regression (multiclass):

    Pr(y = j | z) = exp(θ_j^(z→y) · z + b_j) / Σ_{j'∈Y} exp(θ_{j'}^(z→y) · z + b_{j'}),

where b is an additive bias/offset vector.

■ The vector of probabilities over each possible y is denoted:

    p(y | z) = SoftMax(Θ^(z→y) z + b).

SLIDES 34–38

A simple feed-forward architecture

■ In reality, we never observe z; it is a hidden layer. We don't bother predicting 0/1 values for z; we compute it directly from x.

■ This makes p(y | x) a complex, nonlinear function of x.

■ We can have multiple hidden layers z^(1), z^(2), …, adding even more expressiveness.

■ To summarize:

    p(y | z) = SoftMax(Θ^(z→y) z + b), where z = σ(Θ^(x→z) x)
    and σ(Θ^(x→z) x) = [σ(θ_1^(x→z) · x), σ(θ_2^(x→z) · x), …, σ(θ_{K_z}^(x→z) · x)]^T.
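Putting the two layers together, a minimal NumPy sketch of the full forward pass (sizes illustrative):

    import numpy as np

    def softmax(a):
        e = np.exp(a - a.max())               # subtract max for numerical stability
        return e / e.sum()

    def forward(x, Theta_xz, Theta_zy, b):
        """p(y | x): z = sigma(Theta_xz x), then SoftMax(Theta_zy z + b)."""
        z = 1.0 / (1.0 + np.exp(-(Theta_xz @ x)))
        return softmax(Theta_zy @ z + b)

    V, K_z, K_y = 8, 4, 3                     # e.g. Y = {Good, Bad, Okay}
    p = forward(np.ones(V), np.random.randn(K_z, V),
                np.random.randn(K_y, K_z), np.zeros(K_y))
    assert np.isclose(p.sum(), 1.0)           # a proper distribution over labels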

SLIDES 39–45

Activation functions

■ The sigmoid in z = σ(Θ^(x→z) x) is called an activation function.
■ In general, we can write z = f(Θ^(x→z) x) to indicate an arbitrary activation function.
■ Other choices include:
■ Hyperbolic tangent: tanh, centered at 0, helps avoid saturation.
■ Rectified linear unit: ReLU(a) = max(0, a), which is fast to evaluate, easy to analyze, and even further avoids saturation.
■ Leaky ReLU:

    LeakyReLU(a) = { a         if a ≥ 0
                   { 0.0001a   otherwise.
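All of these are one-liners in NumPy (the 0.0001 slope is the value from the slide):

    import numpy as np

    def tanh(a):
        return np.tanh(a)                      # centered at 0

    def relu(a):
        return np.maximum(0.0, a)              # ReLU(a) = max(0, a)

    def leaky_relu(a, slope=0.0001):
        return np.where(a >= 0, a, slope * a)  # a if a >= 0, else slope * a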
SLIDES 46–50

Training neural networks: gradient descent

■ In general, neural networks are learned by gradient descent, using minibatches:

    θ_k^(z→y) ← θ_k^(z→y) − η^(t) ∇_{θ_k^(z→y)} ℓ^(i),

where
■ η^(t) is the learning rate at update t,
■ ℓ^(i) is the loss on instance (minibatch) i,
■ ∇_{θ_k^(z→y)} ℓ^(i) is the gradient of the loss with respect to the output weights θ_k^(z→y):

    ∇_{θ_k^(z→y)} ℓ^(i) = [∂ℓ^(i)/∂θ_{k,1}^(z→y), ∂ℓ^(i)/∂θ_{k,2}^(z→y), …, ∂ℓ^(i)/∂θ_{k,K_y}^(z→y)]^T.
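A sketch of the update loop; grad_fn and eta_schedule are hypothetical stand-ins for the backpropagated gradient and a learning-rate schedule:

    def sgd(theta, minibatches, grad_fn, eta_schedule):
        """Minibatch gradient descent: theta <- theta - eta^(t) * grad(loss_i)."""
        for t, batch in enumerate(minibatches):
            eta_t = eta_schedule(t)            # learning rate at update t
            theta = theta - eta_t * grad_fn(theta, batch)
        return theta

    # e.g. theta = sgd(theta0, batches, grad_fn, eta_schedule=lambda t: 0.1 / (1 + t))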

SLIDES 51–58

Training neural networks: backpropagation

■ If we don't observe z, how can we learn the weights Θ^(x→z)?

■ Backpropagation: compute a loss on y, and apply the chain rule from calculus to compute a gradient on all parameters.

■ Backpropagation as an algorithm: construct a directed acyclic computation graph with nodes for inputs, outputs, hidden layers, and parameters.

[Figure: computation graph x^(i) → z → ŷ → ℓ^(i), with y^(i) and the parameters Θ feeding in; values (v_x, v_z, v_ŷ, v_Θ, …) flow forward, gradients (g_z, g_ŷ, g_Θ, …) flow backward.]

■ Forward pass: values (e.g. v_x) go from parents to children.

■ Backward pass: gradients (e.g. g_z) go from children to parents, implementing the chain rule.

■ As long as the gradient is implemented for a layer/operation, you can add it to the graph and let automatic differentiation compute updates for every layer; see the sketch below.
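A minimal PyTorch sketch of this idea for the two-layer network above (shapes illustrative): the forward pass builds the graph, and loss.backward() runs the chain rule through it.

    import torch

    V, K_z, K_y = 8, 4, 3
    Theta_xz = torch.randn(K_z, V, requires_grad=True)    # hidden-layer weights
    Theta_zy = torch.randn(K_y, K_z, requires_grad=True)  # output weights
    b = torch.zeros(K_y, requires_grad=True)

    x = torch.ones(V)                          # one bag-of-words instance
    y = torch.tensor([2])                      # its gold label

    z = torch.sigmoid(Theta_xz @ x)            # forward pass builds the graph
    scores = Theta_zy @ z + b
    loss = torch.nn.functional.cross_entropy(scores.unsqueeze(0), y)

    loss.backward()                            # backward pass: chain rule
    assert Theta_xz.grad is not None           # every parameter gets a gradient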

SLIDES 59–62

How to represent text for classification? Another choice of R: word embeddings

■ Text is naturally viewed as a sequence of tokens w1, w2, …, wT.
■ Context is lost when this sequence is converted to a bag-of-words.
■ Instead, a lookup layer can compute embeddings (real-valued vectors) for each type, resulting in a matrix X^(0) ∈ ℝ^{K_e×M}.

[Figure: each token of "the drinks were strong but the fish tacos were bland" mapped to a column of real-valued embedding values; repeated tokens ("the", "were") share identical columns.]
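A sketch of the lookup layer (embedding values are random here; in practice they are learned or pretrained):

    import numpy as np

    K_e = 4                                    # embedding dimension
    vocab = ["bland", "but", "drinks", "fish", "strong", "tacos", "the", "were"]
    word_to_idx = {w: j for j, w in enumerate(vocab)}
    E = np.random.randn(len(vocab), K_e)       # one embedding per word type

    tokens = "the drinks were strong but the fish tacos were bland".split()
    X0 = E[[word_to_idx[w] for w in tokens]].T # X^(0) in R^{K_e x M}
    assert X0.shape == (K_e, len(tokens))      # repeated tokens share a column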

SLIDE 63

Evaluating classifiers

SLIDES 64–71

Evaluating your classifier

■ We want to predict future performance, on unseen data.
■ It's hard to predict the future. Do not evaluate on data that was already used for:
  ■ training
  ■ hyperparameter selection
  ■ selecting the classification model or model structure
  ■ preprocessing decisions, such as vocabulary selection
■ Even if you follow all these rules, you will probably still over-estimate your classifier's performance, because real future data will differ from your test set in ways that you cannot anticipate.

SLIDES 72–77

Accuracy

■ Most basic metric: accuracy. How often is the classifier right?

    acc(y, ŷ) = (1/N) Σ_{i=1}^{N} δ(y^(i) = ŷ^(i)).

■ The problem with accuracy is rare labels, also known as class imbalance.
■ Consider a system for detecting whether a tweet is written in Telugu.
■ 0.3% of tweets are written in Telugu [Bergsma et al. 2012].
■ A system that says ŷ = NotTelugu 100% of the time is 99.7% accurate.
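The metric and its failure mode in a few lines (the 0.3% base rate is from the slide; the exact counts are illustrative):

    import numpy as np

    def accuracy(y, y_hat):
        """acc = (1/N) * sum_i delta(y^(i) == y_hat^(i))."""
        return float(np.mean(np.asarray(y) == np.asarray(y_hat)))

    y = np.array([1] * 3 + [0] * 997)          # 3 Telugu tweets in 1000 (~0.3%)
    y_hat = np.zeros(1000)                     # "never Telugu" classifier
    print(accuracy(y, y_hat))                  # 0.997, yet it misses every positive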

SLIDES 78–84

Beyond “right” and “wrong”

[Figure: overlapping sets of correct labels and predicted labels.]

■ For any label, there are two ways to be “wrong”:
  ■ False positive: the system incorrectly predicts the label.
  ■ False negative: the system incorrectly fails to predict the label.
■ Similarly, there are two ways to be “right”:
  ■ True positive: the system correctly predicts the label.
  ■ True negative: the system correctly predicts that the label does not apply to this instance.

SLIDES 85–91

Precision and recall

    recall = TP / (TP + FN)
    precision = TP / (TP + FP)

■ Recall: the fraction of positive instances that were correctly classified.
  ■ The “never Telugu” classifier has 0 recall.
  ■ The “always Telugu” classifier has perfect recall.
■ Precision: the fraction of positive predictions that were correct.
  ■ The “never Telugu” classifier has 0 precision.
  ■ The “always Telugu” classifier has 0.003 precision.

slide-98
SLIDE 98

Combining precision and recall

22

■ Inherent tradeoff between precision and recall. The right choice is problem-specific.

■ For a preliminary medical diagnosis, we might prefer high recall: false positives
can be screened out later.

■ The “beyond reasonable doubt” standard of U.S. criminal law implies a
preference for high precision.

■ Most often, we weight them equally using the F1 measure, the harmonic mean of
precision and recall:

F1 = (2 · precision · recall) / (precision + recall)

min(precision, recall) ≤ F1 ≤ 2 · min(precision, recall)

■ Can generalize the F-measure to adjust the tradeoff, so that recall is β times as
important as precision (see the sketch below):

Fβ = ((1 + β²) · precision · recall) / (β² · precision + recall)
slide-99
SLIDE 99

Trading off precision and recall: ROC curve

23


slide-108
SLIDE 108

Evaluating multi-class classification

24

■ Recall and precision imply binary classification: each instance is either positive or
negative.

■ In multi-class classification, each instance is positive for one class and negative for
all other classes.

■ Two ways to combine performance across classes (sketched below):
■ Macro F1: compute F1 per class, then average across all classes.
→ Weights all classes equally, regardless of frequency.
■ Micro F1: pool the total numbers of true positives, false positives, and false
negatives across all classes, and compute a single F1.
→ Emphasizes performance on high-frequency classes.
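A minimal sketch of the two aggregation schemes (hypothetical counts, not from the slides), starting from per-class true positive, false positive, and false negative counts:

import numpy as np

def macro_micro_f1(tp, fp, fn):
    """tp, fp, fn: per-class count arrays. Uses F1 = 2TP / (2TP + FP + FN)."""
    tp, fp, fn = map(np.asarray, (tp, fp, fn))
    per_class = 2 * tp / (2 * tp + fp + fn)  # F1 for each class separately
    macro = per_class.mean()                 # then average over classes
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())  # pool counts first
    return macro, micro

# One frequent class, one rare class (hypothetical counts):
print(macro_micro_f1(tp=[90, 1], fp=[5, 9], fn=[10, 4]))
# macro ≈ 0.53 (rare class drags it down); micro ≈ 0.87 (dominated by frequent class)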


slide-115
SLIDE 115

Comparing classifiers

25

■ Suppose two teams build classifiers to solve a problem:
■ C1 gets 82% accuracy
■ C2 gets 73% accuracy

■ Will C1 be more accurate in the future?
■ What if the test set has 1000 examples?
■ What if the test set has 11 examples?


slide-122
SLIDE 122

Hypothesis testing

26

■ Consider two hypotheses that explain the observed data:
■ H1: C1 is more accurate than C2, and therefore can be expected to be more
accurate in the future (in the limit of an infinite number of independent evaluations).
■ H0: C1 is not more accurate than C2, and its superior performance on the test
set was due only to luck. This is the null hypothesis.

■ If the test set is small, H0 might be true.
■ If the test set is large, the probability of observing a 9% difference in accuracy
(73% → 82%) becomes vanishingly small unless C1 really is more accurate.

■ These probabilities are quantified by hypothesis testing.


slide-126
SLIDE 126

The binomial test

27

■ If the two classifiers are equally accurate, then each time they disagree, they are
equally likely to be correct.

■ Over 30 such disagreements, each classifier will “win” roughly half the time.

■ The total probability mass in the pink region (the tails of the binomial distribution
over such “wins”) is less than 5%. If the data fall in this region, we reject the null
hypothesis with p < 0.05.
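A minimal sketch of this test, assuming SciPy ≥ 1.7 (scipy.stats.binomtest). The counts are hypothetical: the classifiers disagree on 30 examples, as on this slide, and C1 wins 21 of them.

from scipy.stats import binomtest

# H0: each disagreement is a fair coin flip (p = 0.5).
# One-sided test: is C1 correct on significantly more than half of the 30?
result = binomtest(k=21, n=30, p=0.5, alternative="greater")
print(result.pvalue)  # ≈ 0.021 < 0.05, so we reject H0 at the 5% level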


slide-130
SLIDE 130

Other hypothesis tests

28

■ The binomial test compares two classifiers in terms of accuracy. It can be computed
in closed form, using e.g. SciPy or R.

■ Hypotheses about other metrics, such as F1, cannot be tested in this way.
■ For these hypotheses, the best approach is randomization: randomly sample many test
sets, and count how often each hypothesis holds (see the sketch below).
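One common instantiation of this randomization idea is a paired bootstrap test. The sketch below is a hypothetical helper (not from the slides, and using sklearn's f1_score for convenience): it resamples the test set with replacement and counts how often C1 fails to beat C2 on F1.

import numpy as np
from sklearn.metrics import f1_score

def bootstrap_pvalue(gold, pred1, pred2, n_samples=10_000, seed=0):
    """gold, pred1, pred2: NumPy arrays of test-set labels and predictions."""
    rng = np.random.default_rng(seed)
    n, losses = len(gold), 0
    for _ in range(n_samples):
        idx = rng.integers(0, n, size=n)  # resample the test set with replacement
        diff = (f1_score(gold[idx], pred1[idx], average="micro")
                - f1_score(gold[idx], pred2[idx], average="micro"))
        losses += (diff <= 0)             # count samples where C1 does not win
    return losses / n_samples             # small value -> reject H0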


slide-134
SLIDE 134

Comparing classifiers

29

Nils Reimers and Iryna Gurevych. Reporting score distributions makes a difference:
Performance study of LSTM-networks for sequence tagging. EMNLP 2017.

“Two recent, state-of-the-art systems for NER are proposed by Ma and Hovy (2016)
and by Lample et al. (2016). Lample et al. report an F1-score of 90.94% and Ma and
Hovy report an F1-score of 91.21%. Ma and Hovy draw the conclusion that their
system achieves a significant improvement over the system by Lample et al.”

[Figure: distributions of F1 scores over multiple training runs, with the
“Ma & Hovy (reported)” and “Lample et al. (reported)” scores marked on the F1 axis.]

slide-135
SLIDE 135

Announcements

30

■ Project 1: Text classification will be available after class today,

due Friday September 25

■ Han will lead recitation this Friday to introduce Project 1, and cover environment

setup (Python, Jupyter, NumPy, PyTorch).