SLIDE 1

Deconstructing Data Science

David Bamman, UC Berkeley
 
Info 290
Lecture 9: Logistic regression
Feb 22, 2016

SLIDE 2

Generative vs. Discriminative models

  • Generative models specify a joint distribution over the labels and the data. With this you could generate new data.

    P(x, y) = P(y) P(x | y)

  • Discriminative models specify the conditional distribution of the label y given the data x. These models focus on how to discriminate between the classes.

    P(y | x)

SLIDE 3

Generating

[Figure: two bar charts of word probabilities (the, a, love, sword, poison, hamlet, romeo, king, capulet, be, woe, him, most), one for P(x | y = Hamlet) and one for P(x | y = Romeo and Juliet).]

SLIDE 4

Generative models

  • With generative models (e.g., Naive Bayes), we ultimately also care about P(y | x), but we get there by modeling more:

    P(Y = y | x) = P(Y = y) P(x | Y = y) / Σ_{y′ ∈ 𝒴} P(Y = y′) P(x | Y = y′)

    Here P(Y = y) is the prior, P(x | Y = y) the likelihood, and P(Y = y | x) the posterior.

  • Discriminative models focus on modeling P(y | x) — and only P(y | x) — directly.

SLIDE 5

Remember

Σ_{i=1}^F x_i β_i = x_1β_1 + x_2β_2 + … + x_Fβ_F

Π_{i=1}^F x_i = x_1 × x_2 × … × x_F

exp(x) = e^x ≈ 2.7^x

log(x) = y → e^y = x

exp(x + y) = exp(x) exp(y)

log(xy) = log(x) + log(y)

SLIDE 6

Classification

A mapping h from input data x (drawn from instance space 𝒳) to a label (or labels) y from some enumerable output space 𝒴.

𝒳 = set of all skyscrapers
𝒴 = {art deco, neo-gothic, modern}

x = the Empire State Building
y = art deco

SLIDE 7

x = feature vector (binary values for each of the features below)

β = coefficients:

Feature                              β
follow clinton                      -3.1
follow trump                         6.8
“benghazi”                           1.4
negative sentiment + “benghazi”      3.2
“illegal immigrants”                 8.7
“republican” in profile              7.9
“democrat” in profile               -3.0
self-reported location = Berkeley   -1.7

SLIDE 8

Logistic regression

P(y = 1 | x, β) = exp(Σ_{i=1}^F x_i β_i) / (1 + exp(Σ_{i=1}^F x_i β_i))

Output space: 𝒴 = {0, 1}
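
A minimal sketch of this probability in Python (the helper name predict_prob and the dense list representation are illustrative choices, not from the slides):

```python
import math

def predict_prob(x, beta):
    """P(y = 1 | x, beta) = exp(a) / (1 + exp(a)), where a = x · beta."""
    a = sum(x_j * b_j for x_j, b_j in zip(x, beta))
    return math.exp(a) / (1.0 + math.exp(a))
```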
SLIDE 9

       benghazi   follows trump   follows clinton
β      0.7        1.2             -1.1

       benghazi   follows trump   follows clinton   a = Σ x_i β_i   exp(a)   exp(a) / (1 + exp(a))
x1     1          1               0                  1.9            6.69     87.0%
x2     0          0               1                 -1.1            0.33     25.0%
x3     1          0               1                 -0.4            0.67     40.1%
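
These rows can be checked numerically with the predict_prob sketch above, reusing the β values from this slide:

```python
beta = [0.7, 1.2, -1.1]            # benghazi, follows trump, follows clinton
for x in ([1, 1, 0], [0, 0, 1], [1, 0, 1]):
    print(predict_prob(x, beta))   # ≈ 0.870, 0.250, 0.401
```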
SLIDE 10

Feature                              β
follow clinton                      -3.1
follow trump                         6.8
“benghazi”                           1.4
negative sentiment + “benghazi”      3.2
“illegal immigrants”                 8.7
“republican” in profile              7.9
“democrat” in profile               -3.0
self-reported location = Berkeley   -1.7

β = coefficients

How do we get good values for β?

SLIDE 11

Likelihood

Remember: the likelihood of data is its probability under some parameter values. In maximum likelihood estimation, we pick the values of the parameters under which the data is most likely.

SLIDE 12

Likelihood

Observed rolls: 2, 6, 6

P(2, 6, 6 | fair die) = .17 × .17 × .17 = 0.004913

P(2, 6, 6 | not fair die) = .1 × .5 × .5 = 0.025

[Figure: bar charts of the probability of each face 1–6 under the fair die (uniform at 1/6) and under the not-fair die (where P(2) = .1 and P(6) = .5).]

SLIDE 13

Conditional likelihood

Π_{i=1}^N P(y_i | x_i, β)

For all training data, we want the probability of the true label y for each data point x to be high. This principle gives us a way to pick the values of the parameters β that maximize the probability of the training data ⟨x, y⟩.

SLIDE 14

The value of β that maximizes the likelihood also maximizes the log likelihood:

arg max_β Π_{i=1}^N P(y_i | x_i, β) = arg max_β log Π_{i=1}^N P(y_i | x_i, β)

The log likelihood is an easier form to work with:

log Π_{i=1}^N P(y_i | x_i, β) = Σ_{i=1}^N log P(y_i | x_i, β)
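
A direct transcription of the log likelihood as a helper (a sketch, assuming the predict_prob function from earlier):

```python
import math

def log_likelihood(X, Y, beta):
    """Σ_i log P(y_i | x_i, β) for binary labels y_i ∈ {0, 1}."""
    total = 0.0
    for x, y in zip(X, Y):
        p = predict_prob(x, beta)
        total += math.log(p if y == 1 else 1.0 - p)
    return total
```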

SLIDE 15
  • We want to find the value of β that leads to the highest value of the log likelihood:

    ℓ(β) = Σ_{i=1}^N log P(y_i | x_i, β)

  • Solution: derivatives!
SLIDE 16
[Figure: plot of f(x) = -x², whose maximum is at x = 0.]

We can get to the maximum value of this function by following the gradient:

d/dx (-x²) = -2x

Update rule: x ← x + α(-2x)   [α = 0.1]

x       α(-2x)
8.00    -1.60
6.40    -1.28
5.12    -1.02
4.10    -0.82
3.28    -0.66
2.62    -0.52
2.10    -0.42
1.68    -0.34
1.34    -0.27
1.07    -0.21
0.86    -0.17
0.69    -0.14
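
The same sequence in a few lines of Python (a sketch of the update rule above):

```python
alpha = 0.1
x = 8.0
for _ in range(11):
    x = x + alpha * (-2 * x)   # follow the gradient of -x^2 uphill
    print(round(x, 2))         # 6.4, 5.12, 4.1, 3.28, ...
```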

SLIDE 17

We want to find the values of β that make the value of this function the greatest:

ℓ(β) = Σ_{⟨x, y=1⟩} log P(1 | x, β) + Σ_{⟨x, y=0⟩} log P(0 | x, β)

∂ℓ(β)/∂β_i = Σ_{⟨x, y⟩} (y − p̂(x)) x_i
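
The partial derivative translates directly into code (a sketch reusing the predict_prob helper from earlier):

```python
def gradient(X, Y, beta):
    """∂ℓ/∂β_i = Σ (y − p̂(x)) x_i, accumulated over the whole dataset."""
    grad = [0.0] * len(beta)
    for x, y in zip(X, Y):
        err = y - predict_prob(x, beta)
        for i in range(len(beta)):
            grad[i] += err * x[i]
    return grad
```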

SLIDE 18

Gradient descent

If y is 1 and p̂(x) = 0.99, then this still pushes the weights just a little bit.

If y is 1 and p̂(x) = 0, then this pushes the weights a lot.

SLIDE 19

Stochastic g.d.

  • Batch gradient descent reasons over every training data point for each update of β. This can be slow to converge.

  • Stochastic gradient descent updates β after each data point (see the sketch below).
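
A minimal sketch of stochastic gradient ascent for logistic regression (the hyperparameters and helper names are illustrative; predict_prob is assumed from earlier):

```python
def sgd_train(X, Y, alpha=0.1, epochs=10):
    beta = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, y in zip(X, Y):
            err = y - predict_prob(x, beta)    # (y − p̂(x))
            for i in range(len(beta)):
                beta[i] += alpha * err * x[i]  # update after each data point
    return beta
```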

SLIDE 20

Perceptron


SLIDE 21


SLIDE 22

Stochastic g.d.

Logistic regression stochastic update (p̂ is between 0 and 1):

  β_i ← β_i + α (y − p̂(x)) x_i

Perceptron stochastic update (ŷ is exactly 0 or 1):

  β_i ← β_i + α (y − ŷ) x_i

The perceptron is an approximation to logistic regression.
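
For contrast with sgd_train above, a sketch of the perceptron's hard-threshold update (the 0 decision boundary is the usual convention, not stated on the slide):

```python
def perceptron_update(x, y, beta, alpha=0.1):
    a = sum(x_j * b_j for x_j, b_j in zip(x, beta))
    y_hat = 1 if a >= 0 else 0                 # ŷ is exactly 0 or 1
    for i in range(len(beta)):
        beta[i] += alpha * (y - y_hat) * x[i]  # no change when ŷ == y
```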

SLIDE 23

Practicalities

P(y = 1 | x, β) = exp(Σ_{i=1}^F x_i β_i) / (1 + exp(Σ_{i=1}^F x_i β_i))

∂ℓ(β)/∂β_i = Σ_{⟨x, y⟩} (y − p̂(x)) x_i

  • When calculating P(y | x) or in calculating the gradient, you don’t need to loop through all features — only those with nonzero values.

  • (Which makes sparse, binary values useful; see the sketch below.)
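
One way to exploit that sparsity (representing x as a set of active feature names and β as a dict is an implementation choice, not from the slides):

```python
import math

def sparse_prob(active_features, beta):
    """P(y = 1 | x, β), touching only the nonzero (active) features."""
    a = sum(beta.get(f, 0.0) for f in active_features)
    return math.exp(a) / (1.0 + math.exp(a))

beta = {"benghazi": 0.7, "follows trump": 1.2, "follows clinton": -1.1}
print(sparse_prob({"benghazi", "follows trump"}, beta))  # ≈ 0.87
```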
SLIDE 24
∂ℓ(β)/∂β_i = Σ_{⟨x, y⟩} (y − p̂(x)) x_i

If a feature x_i only shows up with one class (e.g., democrats), what are the possible values of its corresponding β_i?

∂ℓ(β)/∂β_i = Σ_{⟨x, y⟩} (1 − 0) × 1

∂ℓ(β)/∂β_i = Σ_{⟨x, y⟩} (1 − 0.9999999) × 1

The gradient for β_i is always positive, so the updates keep pushing it higher.

SLIDE 25

Feature                                      β
follow clinton                              -3.1
follow trump + follow NFL + follow bieber    7299302
“benghazi”                                   1.4
negative sentiment + “benghazi”              3.2
“illegal immigrants”                         8.7
“republican” in profile                      7.9
“democrat” in profile                       -3.0
self-reported location = Berkeley           -1.7

β = coefficients

Many features that show up rarely may (by chance) only appear with one label. More generally, they may appear so few times that the noise of randomness dominates.

SLIDE 26

Feature selection

  • We could threshold features by minimum count, but that also throws away information.

  • We can take a probabilistic approach and encode a prior belief that all β should be 0 unless we have strong evidence otherwise.

SLIDE 27

L2 regularization

  • We can do this by changing the function we’re trying to optimize, adding a penalty for having values of β that are high.

  • This is equivalent to saying that each β element is drawn from a Normal distribution centered on 0.

  • η controls how much of a penalty to pay for coefficients that are far from 0 (optimize on development data).

ℓ(β) = Σ_{i=1}^N log P(y_i | x_i, β) − η Σ_{j=1}^F β_j²

(we want the first term to be high, but we want the β_j² values to be small)
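
In code, the penalty is one extra term in the objective and the gradient (a sketch on top of the log_likelihood and gradient helpers assumed above; eta would be tuned on development data):

```python
def objective_l2(X, Y, beta, eta):
    return log_likelihood(X, Y, beta) - eta * sum(b * b for b in beta)

def gradient_l2(X, Y, beta, eta):
    # d/dβ_j of -η·β_j² is -2·η·β_j
    return [g - 2 * eta * b for g, b in zip(gradient(X, Y, beta), beta)]
```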

SLIDE 28

no L2 regularization:
  33.83  Won Bin
  29.91  Alexander Beyer
  24.78  Bloopers
  23.01  Daniel Brühl
  22.11  Ha Jeong-woo
  20.49  Supernatural
  18.91  Kristine DeBell
  18.61  Eddie Murphy
  18.33  Cher
  18.18  Michael Douglas

some L2 regularization:
  2.17  Eddie Murphy
  1.98  Tom Cruise
  1.70  Tyler Perry
  1.70  Michael Douglas
  1.66  Robert Redford
  1.66  Julia Roberts
  1.64  Dance
  1.63  Schwarzenegger
  1.63  Lee Tergesen
  1.62  Cher

high L2 regularization:
  0.41  Family Film
  0.41  Thriller
  0.36  Fantasy
  0.32  Action
  0.25  Buddy film
  0.24  Adventure
  0.20  Comp Animation
  0.19  Animation
  0.18  Science Fiction
  0.18  Bruce Willis

SLIDE 29

[Figure: graphical model with nodes μ, σ², β, x, and y.]

y ∼ Ber( exp(Σ_{i=1}^F x_i β_i) / (1 + exp(Σ_{i=1}^F x_i β_i)) )

β ∼ Norm(μ, σ²)
SLIDE 30

L1 regularization

  • L1 regularization encourages coefficients to be exactly 0.

  • η again controls how much of a penalty to pay for coefficients that are far from 0 (optimize on development data).

ℓ(β) = Σ_{i=1}^N log P(y_i | x_i, β) − η Σ_{j=1}^F |β_j|

(we want the first term to be high, but we want the |β_j| values to be small)
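
In practice, both penalties are usually a library switch; for example, with scikit-learn (an outside tool, not mentioned on the slides; its C parameter is the inverse of the penalty strength η):

```python
from sklearn.linear_model import LogisticRegression

l2_model = LogisticRegression(penalty="l2", C=1.0)
l1_model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")
```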
SLIDE 31

What do the coefficients mean?

P(y | x, β) = exp(x_0β_0 + x_1β_1) / (1 + exp(x_0β_0 + x_1β_1))

P(y | x, β) (1 + exp(x_0β_0 + x_1β_1)) = exp(x_0β_0 + x_1β_1)

P(y | x, β) + P(y | x, β) exp(x_0β_0 + x_1β_1) = exp(x_0β_0 + x_1β_1)

SLIDE 32

P(y | x, β) + P(y | x, β) exp(x_0β_0 + x_1β_1) = exp(x_0β_0 + x_1β_1)

P(y | x, β) = exp(x_0β_0 + x_1β_1) − P(y | x, β) exp(x_0β_0 + x_1β_1)

P(y | x, β) = exp(x_0β_0 + x_1β_1) (1 − P(y | x, β))

P(y | x, β) / (1 − P(y | x, β)) = exp(x_0β_0 + x_1β_1)

This is the odds of y occurring.
SLIDE 33

Odds

  • Ratio of an event occurring to its not taking place:

    P(x) / (1 − P(x))

Example: Green Bay Packers vs. SF 49ers. If the probability of GB winning is 0.75, the odds for GB winning are 0.75 / 0.25 = 3/1 = 3:1.

SLIDE 34

P(y | x, β) / (1 − P(y | x, β)) = exp(x_0β_0 + x_1β_1)
                                = exp(x_0β_0) exp(x_1β_1)

This is the odds of y occurring.
SLIDE 35

Let’s increase the value of x_1 by 1 (e.g., from 0 → 1):

exp(x_0β_0) exp((x_1 + 1)β_1)
  = exp(x_0β_0) exp(x_1β_1 + β_1)
  = exp(x_0β_0) exp(x_1β_1) exp(β_1)
  = [P(y | x, β) / (1 − P(y | x, β))] exp(β_1)

exp(β) represents the factor by which the odds change with a 1-unit increase in x.
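
A quick numerical check of that claim, reusing predict_prob and the β values from the worked example on slide 9:

```python
import math

beta = [0.7, 1.2]                  # x_0 = benghazi, x_1 = follows trump
odds = lambda p: p / (1 - p)
p0 = predict_prob([1, 0], beta)    # x_1 = 0
p1 = predict_prob([1, 1], beta)    # x_1 = 1
print(odds(p1) / odds(p0))         # ≈ 3.32
print(math.exp(1.2))               # ≈ 3.32, i.e. exp(β_1)
```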

SLIDE 36

Example

β       change in odds   feature name
2.17    8.76             Eddie Murphy
1.98    7.24             Tom Cruise
1.70    5.47             Tyler Perry
1.70    5.47             Michael Douglas
1.66    5.26             Robert Redford
…       …                …
-0.94   0.39             Kevin Conway
-1.00   0.37             Fisher Stevens
-1.05   0.35             B-movie
-1.14   0.32             Black-and-white
-1.23   0.29             Indie

How do we interpret this change of odds? Is it causal?

SLIDE 37

Significance of coefficients

  • A β_i value of 0 means that feature x_i has no effect on the prediction of y.

  • How great does a β_i value have to be for us to say that its effect probably doesn’t arise by chance?

  • People often use parametric tests (which assume coefficients are drawn from a normal distribution) to assess this for logistic regression, but we can use it to illustrate another, more robust test.

SLIDE 38

Hypothesis tests

[Figure: density of a statistic z under the null distribution, over roughly z ∈ [-4, 4].]

Hypothesis tests measure how (un)likely an observed statistic is under the null hypothesis.

SLIDE 39

Hypothesis tests

[Figure: the same null-distribution density over roughly z ∈ [-4, 4].]

SLIDE 40

Permutation test

  • Non-parametric way of creating a null distribution (parametric = normal, etc.) for testing the difference in two populations A and B.

  • For example, the median height of men (=A) and women (=B).

  • We shuffle the labels of the data under the null assumption that the labels don’t matter (the null is that A = B).

SLIDE 41

       height   true labels   perm 1   perm 2   perm 3   perm 4   perm 5
x1     62.8     woman         man      man      woman    man      man
x2     66.2     woman         man      man      man      woman    woman
x3     65.1     woman         man      man      woman    man      man
x4     68.0     woman         man      woman    man      woman    woman
x5     61.0     woman         woman    man      man      man      man
x6     73.1     man           woman    woman    man      woman    woman
x7     67.0     man           man      woman    man      woman    man
x8     71.2     man           woman    woman    woman    man      man
x9     68.4     man           woman    man      woman    man      woman
x10    70.9     man           woman    woman    woman    woman    woman

SLIDE 42

How many times is the difference in medians between the permuted groups greater than the observed difference?

       height   true labels   perm 1   perm 2   perm 3   perm 4   perm 5
x1     62.8     woman         man      man      woman    man      man
x2     66.2     woman         man      man      man      woman    woman
…      …        …             …        …        …        …        …
x9     68.4     man           woman    man      woman    man      woman
x10    70.9     man           woman    woman    woman    woman    woman

difference in medians:        4.7      5.8      1.4      2.9      3.3

Observed true difference in medians: -5.5
SLIDE 43

A = 100 samples from Norm(70, 4)
B = 100 samples from Norm(65, 3.5)

[Figure: density of the difference in medians among permuted datasets, over roughly -6 to 6.]

Observed real difference: -5.5
SLIDE 44

Permutation test

The p-value is the fraction of times the permuted test statistic t_p is more extreme than the observed test statistic t:

p̂ = (1/B) Σ_{i=1}^B I[abs(t) < abs(t_p)]
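
A compact sketch of the whole procedure for the difference-in-medians example (the function name and default B are illustrative):

```python
import random
from statistics import median

def permutation_test(a, b, B=10000):
    """p̂ = fraction of permutations with a more extreme test statistic."""
    t = median(a) - median(b)
    pooled = a + b
    extreme = 0
    for _ in range(B):
        random.shuffle(pooled)           # null: the labels don't matter
        t_p = median(pooled[:len(a)]) - median(pooled[len(a):])
        if abs(t) < abs(t_p):
            extreme += 1
    return extreme / B
```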

SLIDE 45

Permutation test

  • The permutation test is a robust test that can be used for many different kinds of test statistics, including coefficients in logistic regression.

  • How?

    • A = members of class 1
    • B = members of class 0
    • β are calculated as (e.g.) the values that maximize the conditional probability of the class labels we observe; their values are determined by the data points that belong to A or B.

SLIDE 46
Permutation test

  • To test whether the coefficients have a statistically significant effect (i.e., they’re not 0), we can conduct a permutation test where, for B trials, we:

    1. shuffle the class labels in the training data
    2. train logistic regression on the new permuted dataset
    3. tally whether the absolute value of β learned on the permuted data is greater than the absolute value of β learned on the true data
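
The same recipe as code (a sketch that reuses the sgd_train helper assumed above; B and the training settings are illustrative):

```python
import random

def coef_permutation_test(X, Y, B=1000):
    beta_true = sgd_train(X, Y)
    counts = [0] * len(beta_true)
    labels = list(Y)
    for _ in range(B):
        random.shuffle(labels)               # 1. shuffle the class labels
        beta_perm = sgd_train(X, labels)     # 2. train on the permuted data
        for i in range(len(beta_true)):      # 3. tally more-extreme coefficients
            if abs(beta_true[i]) < abs(beta_perm[i]):
                counts[i] += 1
    return [c / B for c in counts]           # per-coefficient p̂
```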

SLIDE 47

Permutation test

The p-value is the fraction of times the permuted β_p is more extreme than the observed β_t:

p̂ = (1/B) Σ_{i=1}^B I[abs(β_t) < abs(β_p)]

SLIDE 48

Rao et al. (2010)

SLIDE 49