Estimating uncertainty in real-world conditions using Bayesian inference - PowerPoint PPT Presentation


SLIDE 1

Reading into Scarce Data

Estimating uncertainty in real-world conditions using Bayesian inference


Max Sklar Foursquare Engineer @maxsklar

SLIDE 2

We're talking about this problem:

  • When do I have enough data to be confident in my answer?
  • What if I'm forced to give an estimate even if I don't have enough data?
  • How can we do this in a non-arbitrary way?

Rotten Tomatoes ratings: Is this fair?

SLIDE 3

Scarce data is inevitable!

Most instances won't have a lot of data.

  • Tradeoff: dividing data into more buckets means getting less data per bucket.
  • Power curves
    ○ A few buckets will have tons of examples, while most won't have many
    ○ Ex. Venues, Users, Words in documents
  • Every social network has this distribution

SLIDE 4

What can we say about the sparse buckets?

Look at the roulette wheel vs. the restaurant story.

Wheel: Red, Red, Red, Red
Restaurant: +, +, +, +

Takeaway: You're willing to make a judgement about a restaurant after going 4 times, but you're not willing to accept data points from the wheel. Why is this? Turns out the intuition is very logical!

SLIDE 5

What can we say about the sparse buckets?

Suppose we're pulling data out of a multinomial distribution. Each distribution is represented by a pie chart. We don't know which pie chart we're on - only the result of throwing darts at it.

We should have an idea before starting of which distributions are more likely than others. We are willing to change our mind given data using Bayes' rule:
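Written out, with θ standing for the unknown pie chart (distribution) and D for the observed dart throws, the standard form of Bayes' rule is:

```latex
P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}
```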

SLIDE 6

A clean way of doing this: Dirichlet Prior

Normal Distribution

  • has a mean
  • has a standard deviation, which could be interpreted as how confident we are in the mean.

Dirichlet Prior

  • has an expected multinomial distribution (think a pie chart)
  • has a weight that tells us how confident we are.
    ○ high weight = we're not so willing to change our mind with new data.
    ○ low weight = we place high value on new data.
    ○ Collecting more data increases the weight.
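A quick sketch of the weight's effect (the numbers here are illustrative, not from the talk): the same data moves a low-weight prior a lot and a high-weight prior barely at all.

```python
# Sketch: how the Dirichlet weight controls willingness to update.
def posterior_mean(prior_mean, weight, counts):
    """Posterior expected pie chart for a Dirichlet prior after `counts`."""
    alphas = [p * weight for p in prior_mean]  # prior as pseudo-counts
    total = sum(alphas) + sum(counts)
    return [(a + n) / total for a, n in zip(alphas, counts)]

data = [8, 2]  # 8 observations of category A, 2 of category B
print(posterior_mean([0.5, 0.5], 2, data))    # low weight: data dominates
print(posterior_mean([0.5, 0.5], 100, data))  # high weight: barely moves
```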

SLIDE 7

Estimating food preferences

You decide to poll 1000 random people in a given area on what they like better: pizza or hot dogs. You get the following result:

Pizza: 638 out of 1000 (638/1000 = 63.8%)
Hot dogs: 362 out of 1000 (362/1000 = 36.2%)

The familiar way to find probability is to divide the number of events for that state by the total number of events. This is the maximum likelihood estimation. In this case, because there's a lot of data, this is actually a very good probability distribution to guess for the next person coming out of this sample. But then there are other cases...
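The maximum likelihood computation from the poll, as a few lines of Python:

```python
# Maximum likelihood estimate: divide each count by the total responses.
counts = {"pizza": 638, "hot dogs": 362}
total = sum(counts.values())
mle = {food: n / total for food, n in counts.items()}
print(mle)  # {'pizza': 0.638, 'hot dogs': 0.362}
```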
SLIDE 8

Cases where maximum likelihood fails

No data: pizza 0, hot dogs 0. MLE = 0/0 (indeterminate).

Small amount of data, one category is a no-show: pizza 2, hot dogs 0. MLE for hot dogs is 0. If the next person likes hot dogs that's infinite loss, but it's very possible.

Small amount of data, both items show: pizza 1, hot dogs 4. MLE for pizza is 20%. But this could have easily been produced by a 50-50 split, and log loss tells us not to be too bold.

Large amount of data, one category is a no-show: 1000 vs. 0. Where would you rather be, 16 Handles, or 14th Street Post Office? We still don't want to guess 0.

SLIDE 9

Solution: Add prior data

  • When we count up our data, we initialize both counts to zero. Instead, initialize both counts to something > 0. It doesn't even have to be an integer!

No data:
  Pizza: count 0.5 + 0 = 0.5, probability = 50% (was: indeterminate)
  Hot dogs: count 0.5 + 0 = 0.5, probability = 50% (was: indeterminate)
  Total count = 1 (was: 0)

Small data, one no-show:
  Pizza: count 0.5 + 2 = 2.5, probability = 83% (was: 100%)
  Hot dogs: count 0.5 + 0 = 0.5, probability = 17% (was: 0%)
  Total count = 3 (was: 2)

Small data, both show:
  Pizza: count 0.5 + 1 = 1.5, probability = 25% (was: 20%)
  Hot dogs: count 0.5 + 4 = 4.5, probability = 75% (was: 80%)
  Total count = 6 (was: 5)

Large data:
  Pizza: count 638.5, probability = 63.8% (was: ~same)
  Hot dogs: count 362.5, probability = 36.2% (was: ~same)
  Total count = 1001 (was: 1000)

  • When sparse data isn't a problem (4th column), the numbers don't change much.
  • Now even the post office will be assigned a probability of one out of every 2001 people.
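The fix above, as code (the 0.5 pseudo-count per category is the one used on the slide; the helper name is ours):

```python
# Initialize each count to 0.5 instead of 0 before dividing.
def smoothed(pizza, hot_dogs, prior=0.5):
    total = pizza + hot_dogs + 2 * prior
    return (pizza + prior) / total, (hot_dogs + prior) / total

print(smoothed(0, 0))      # (0.5, 0.5): no data is no longer 0/0
print(smoothed(2, 0))      # ~(0.83, 0.17): the no-show still gets 17%
print(smoothed(1, 4))      # (0.25, 0.75)
print(smoothed(638, 362))  # ~(0.638, 0.362): lots of data barely moves
```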
SLIDE 10

The initial counts correspond to a Dirichlet distribution!

n1: number of observations for state 1
n2: number of observations for state 2
N = n1 + n2: total number of observations
a1: prior on state 1
a2: prior on state 2
W = a1 + a2: total prior (the weight of the Dirichlet distribution)

MLE estimate for state 1 = n1 / N
Dirichlet prior estimate for state 1 = (n1 + a1) / (N + W)
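The formula above, in code and generalized to any number of states:

```python
# Dirichlet prior estimate: (n_i + a_i) / (N + W) for each state i.
def dirichlet_estimate(n, a):
    N, W = sum(n), sum(a)
    return [(ni + ai) / (N + W) for ni, ai in zip(n, a)]

# With a 0.5 prior on each state this reproduces the earlier numbers:
print(dirichlet_estimate([638, 362], [0.5, 0.5]))
```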

SLIDE 11

Math for the Beta Distribution

  • The Beta prior is the Dirichlet prior with 2 categories.
  • When using Bayes' rule to find a posterior distribution after seeing data, the result is a new Beta prior!
  • The end result is the same as we saw before from adding additional data, so there's no need to understand this.
  • But trust me, the math works.
  • See for yourself! It's not as bad as it looks.
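A minimal sketch of that conjugacy, using the pizza-vs-hot-dogs counts from earlier (function names are ours):

```python
# A Beta(a1, a2) prior plus counts (n1, n2) yields a Beta(a1+n1, a2+n2)
# posterior, so "add prior counts" is exact Bayesian updating.
def beta_posterior(a1, a2, n1, n2):
    """Posterior parameters after observing n1 and n2 events."""
    return a1 + n1, a2 + n2

def beta_mean(a1, a2):
    """Expected probability of the first category."""
    return a1 / (a1 + a2)

post = beta_posterior(0.5, 0.5, 1, 4)  # the 1-pizza, 4-hot-dogs case
print(post)              # (1.5, 4.5)
print(beta_mean(*post))  # 0.25 -- the same 25% as adding prior counts
```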
SLIDE 12

Beta, Gamma, Dirichlet

  • Beta Prior
    ○ Use it when you're dividing into two groups, or deciding between true and false
  • Dirichlet Prior
    ○ This is when you have multiple categories to choose from. Add a little bit of prior data to each category.
  • Gamma Prior
    ○ This is when you have a rate of occurrence of events. Add some prior events, and some prior time.
    ○ Can be used to build Dirichlets
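The Gamma case can be sketched the same way as the others; the prior of 2 events per 1 hour below is a hypothetical choice, not one from the talk:

```python
# Gamma-prior rate estimate: add prior events and prior observation time.
def rate_estimate(events, hours, prior_events=2.0, prior_hours=1.0):
    """Posterior mean rate under a Gamma(prior_events, prior_hours) prior."""
    return (events + prior_events) / (hours + prior_hours)

print(rate_estimate(0, 0.5))   # no events observed, but the rate isn't 0
print(rate_estimate(100, 10))  # plenty of data: close to the raw 10/hour
```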

SLIDE 13

How do we find the prior?

  • Without any other data, it's subjective.
  • Same weight on all categories?
  • Roulette Wheel vs. Restaurant again
  • When would you want high weight? Roulette Wheel.
  • When would you want low weight? Team of observers example.
  • Cautionary tale: Dirichlet Jellybean example.

SLIDE 14

How do we find the prior?

With Big Data, we can get fancier!

  • We may not have a lot of data about this barbecue, but we've been to lots of barbecues
  • Use our experience - find the Dirichlet distribution which optimizes the likelihood of past data.
SLIDE 15

A short python script will do it

  • Available on GitHub: BayesPy/ConjugatePriorTools
    ○ The Dirichlet probability distribution over counts (below) is easy to implement in Python.
  • Uses gradient descent (below), and second-order methods.
  • Let's run it!
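A stripped-down sketch of the same idea: score candidate priors by the likelihood they assign to past buckets of data, and keep the best. The real BayesPy/ConjugatePriorTools code uses gradient descent and second-order methods; this stand-in uses a plain grid search, and the example buckets are hypothetical.

```python
# Find the two-category Dirichlet (Beta) prior that best explains past data.
from math import lgamma

def log_evidence(counts, alphas):
    """Log likelihood of one bucket's counts under the prior `alphas`
    (Polya-urn sequence probability, via the Dirichlet-multinomial)."""
    W, N = sum(alphas), sum(counts)
    ll = lgamma(W) - lgamma(W + N)
    for a, n in zip(alphas, counts):
        ll += lgamma(a + n) - lgamma(a)
    return ll

def fit_prior(buckets, grid):
    """Pick the (a1, a2) pair on the grid maximizing total likelihood."""
    return max(
        ((a1, a2) for a1 in grid for a2 in grid),
        key=lambda a: sum(log_evidence(c, a) for c in buckets),
    )

# Hypothetical past buckets, most leaning toward the first category:
buckets = [(9, 1), (8, 2), (10, 0), (7, 3), (1, 9)]
print(fit_prior(buckets, [0.25, 0.5, 1.0, 2.0, 4.0]))
```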
SLIDE 16

Finally, some applications in the Foursquare app:

Tip Sentiment: Using a modified version of the AFINN word list, each tip is classified as either positive, negative, or neutral. Here is the prior that we found:

  Positive: 1.4
  Neutral: 3.1
  Negative: 0.4

High sentiment: McNally Jackson Bookstore, Pepe's Pizza
Low sentiment: 14th Street Post Office
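Applying that sentiment prior is the same pseudo-count trick with three categories; the tip counts below are hypothetical:

```python
# Smoothed sentiment using the prior from this slide (weight ~4.9 total).
PRIOR = {"positive": 1.4, "neutral": 3.1, "negative": 0.4}

def sentiment_estimate(tip_counts):
    """Smoothed sentiment distribution for one venue."""
    total = sum(PRIOR.values()) + sum(tip_counts.values())
    return {k: (PRIOR[k] + tip_counts.get(k, 0)) / total for k in PRIOR}

# A venue with no tips falls back to the prior's pie chart:
print(sentiment_estimate({}))
# Many positive tips overwhelm the prior:
print(sentiment_estimate({"positive": 20, "neutral": 3, "negative": 1}))
```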

SLIDE 17

Four example venues: a venue without tips, 1 negative tip, 5 negative tips, and 30 tips (10 of each).

SLIDE 18

Prior for partitioned popularity with a 48-dimensional Dirichlet

  • Monday - Thursday compressed, 2 hour intervals
  • Total weight: about 31 (Why so high?)
  • What if the time buckets are sampled unevenly?
SLIDE 19

Likes, and Dislikes: #allnew4sq

Foursquare released an update on June 7, 2012 that allowed users to like and dislike venues. We now have a lot of data points, but most venues have very few likes or dislikes. We've done some initial analysis, but there are a few issues. Let's look at the list and try to diagnose...

Beta prior (likes, dislikes): 7.50, 0.90 - likes are much more common
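With that Beta prior, the smoothed like rate for any venue is one line; the example counts below are taken from the venue lists on the following slides:

```python
# Smoothed like rate under the slide's Beta prior: 7.50 prior likes,
# 0.90 prior dislikes (total weight 8.4).
PRIOR_LIKES, PRIOR_DISLIKES = 7.50, 0.90

def like_rate(likes, dislikes):
    total = likes + dislikes + PRIOR_LIKES + PRIOR_DISLIKES
    return (likes + PRIOR_LIKES) / total

print(round(100 * like_rate(38, 0)))  # at&t park, 38/38 -> 98
print(round(100 * like_rate(0, 1)))   # 14th Street Post Office, 0/1 -> 80
```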

SLIDE 20

Examples of like/dislike ratings:

Highly liked venues:

  at&t park: 38/38, 98%
  magic kingdom park: 31/31, 98%
  blue bottle coffee: 28/28, 98%
  navy pier (chicago): 26/26, 97%
  shake shack: 24/24, 97%
  musée du louvre: 24/24, 97%
  museum of natural history: 24/24, 97%
  MoMA: 44/45, 96%
  uefa euro 2012 russia / czech: 261/272, 95%

This list illustrates some of the work that still needs to be done

SLIDE 21

Examples of like/dislike ratings:

Highly disliked venues (not sure if people actually dislike):

  vh1 big morning buzz: 17/41, 50%
  cross country cookout (History Channel event in Nashville): 14/30, 56%
  Penny Farthing: 1/5, 63% (more confident, shows lack of data)
  Moscow Passport Office: 1/5, 63%
  14th Street Post Office: 0/1, 80%

This list illustrates some of the work that still needs to be done

SLIDE 22

These projects will make these usable:

1) We need more than a week's worth of data. The negative signal in particular is weak.
2) There's a signal in "not rating" even though you've been somewhere. We need to look at who is rating (what's worked in the past).
3) Combine likes with a better tip sentiment algorithm.
4) Personalization - Netflix-style matrix factorization.

SLIDE 23

Questions?

Thanks! Follow me on Twitter: @maxsklar