Estimating uncertainty in real-world conditions using Bayesian inference - PowerPoint PPT Presentation


SLIDE 1

Reading into Scarce Data

Estimating uncertainty in real-world conditions using Bayesian inference


Max Sklar Foursquare Engineer @maxsklar

SLIDE 2

We're talking about this problem:

  • When do I have enough data to be confident in my answer?
  • What if I'm forced to give an estimate even if I don't have enough data?
  • How can we do this in a non-arbitrary way?

Rotten Tomatoes ratings: Is this fair?

SLIDE 3

Scarce data is inevitable!

Most instances won't have a lot of data.

  • Tradeoff: dividing data into more buckets means getting less data per bucket.
  • Power curves
    ○ A few buckets will have tons of examples, while most won't have many
    ○ Ex. Venues, Users, Words in documents
  • Every social network has this distribution

SLIDE 4

What can we say about the sparse buckets?

Look at the roulette wheel vs. the restaurant story.

Wheel: Red, Red, Red, Red
Restaurant: +, +, +, +

Takeaway: You're willing to make a judgement about a restaurant after going 4 times, but you're not willing to accept data points from the wheel. Why is this? Turns out the intuition is very logical!

SLIDE 5

What can we say about the sparse buckets?

Suppose we're pulling data out of a multinomial distribution. Each distribution is represented by a pie chart. We don't know which pie chart we're on - only the result of throwing darts at it.

We should have an idea before starting of which distributions are more likely than others. We are willing to change our mind given data using Bayes' rule:
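Written out, with θ standing for the unknown pie chart (distribution) and D for the observed dart throws, the standard form of Bayes' rule is:

```latex
P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}
```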

SLIDE 6

A clean way of doing this: Dirichlet Prior

Normal Distribution

  • has a mean
  • has a standard deviation, which could be interpreted as how confident we are in the mean.

Dirichlet Prior

  • has an expected multinomial distribution (think a pie chart)
  • has a weight that tells us how confident we are.
    ○ high weight = we're not so willing to change our mind with new data.
    ○ low weight = we place high value on new data.
    ○ Collecting more data increases the weight.
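A quick sketch of the weight's effect (the numbers here are illustrative, not from the talk): the same data moves a low-weight prior a lot and a high-weight prior barely at all.

```python
# Sketch: how the Dirichlet weight controls willingness to update.
def posterior_mean(prior_mean, weight, counts):
    """Posterior expected pie chart for a Dirichlet prior after `counts`."""
    alphas = [p * weight for p in prior_mean]  # prior as pseudo-counts
    total = sum(alphas) + sum(counts)
    return [(a + n) / total for a, n in zip(alphas, counts)]

data = [8, 2]  # 8 observations of category A, 2 of category B
print(posterior_mean([0.5, 0.5], 2, data))    # low weight: data dominates
print(posterior_mean([0.5, 0.5], 100, data))  # high weight: barely moves
```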

SLIDE 7

Estimating food preferences

You decide to poll 1000 random people in a given area on what they like better: pizza or hot dogs. You get the following result:

Pizza: 638 out of 1000 (638/1000 = 63.8%)
Hot dogs: 362 out of 1000 (362/1000 = 36.2%)

The familiar way to find probability is to divide the number of events for that state by the total number of events. This is the maximum likelihood estimation. In this case, because there's a lot of data, this is actually a very good probability distribution to guess for the next person coming out of this sample. But then there are other cases...
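The maximum likelihood computation from the poll, as a few lines of Python:

```python
# Maximum likelihood estimate: divide each count by the total responses.
counts = {"pizza": 638, "hot dogs": 362}
total = sum(counts.values())
mle = {food: n / total for food, n in counts.items()}
print(mle)  # {'pizza': 0.638, 'hot dogs': 0.362}
```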
SLIDE 8

Cases where maximum likelihood fails

No data: pizza 0, hot dogs 0. MLE = 0/0 (indeterminate).

Small amount of data, one category is a no-show: pizza 2, hot dogs 0. MLE for hot dogs is 0. If the next person likes hot dogs that's infinite loss, but it's very possible.

Small amount of data, both items show: pizza 1, hot dogs 4. MLE for pizza is 20%. But this could have easily been produced by a 50-50 split, and log loss tells us not to be too bold.

Large amount of data, one category is a no-show: 1000 vs. 0. Where would you rather be, 16 Handles, or 14th Street Post Office? We still don't want to guess 0.

SLIDE 9

Solution: Add prior data

  • When we count up our data, we initialize both counts to zero. Instead, initialize both counts to something > 0. It doesn't even have to be an integer!

No data:
  Pizza: count 0.5 + 0 = 0.5, probability = 50% (was: indeterminate)
  Hot dogs: count 0.5 + 0 = 0.5, probability = 50% (was: indeterminate)
  Total count = 1 (was: 0)

Small data, one no-show:
  Pizza: count 0.5 + 2 = 2.5, probability = 83% (was: 100%)
  Hot dogs: count 0.5 + 0 = 0.5, probability = 17% (was: 0%)
  Total count = 3 (was: 2)

Small data, both show:
  Pizza: count 0.5 + 1 = 1.5, probability = 25% (was: 20%)
  Hot dogs: count 0.5 + 4 = 4.5, probability = 75% (was: 80%)
  Total count = 6 (was: 5)

Large data:
  Pizza: count 638.5, probability = 63.8% (was: ~same)
  Hot dogs: count 362.5, probability = 36.2% (was: ~same)
  Total count = 1001 (was: 1000)

  • When sparse data isn't a problem (4th column), the numbers don't change much.
  • Now even the post office will be assigned a probability of one out of every 2001 people.
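The fix above, as code (the 0.5 pseudo-count per category is the one used on the slide; the helper name is ours):

```python
# Initialize each count to 0.5 instead of 0 before dividing.
def smoothed(pizza, hot_dogs, prior=0.5):
    total = pizza + hot_dogs + 2 * prior
    return (pizza + prior) / total, (hot_dogs + prior) / total

print(smoothed(0, 0))      # (0.5, 0.5): no data is no longer 0/0
print(smoothed(2, 0))      # ~(0.83, 0.17): the no-show still gets 17%
print(smoothed(1, 4))      # (0.25, 0.75)
print(smoothed(638, 362))  # ~(0.638, 0.362): lots of data barely moves
```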
SLIDE 10

The initial counts correspond to a Dirichlet distribution!

n1: number of observations for state 1
n2: number of observations for state 2
N = n1 + n2: total number of observations
a1: prior on state 1
a2: prior on state 2
W = a1 + a2: total prior (the weight of the Dirichlet distribution)

MLE estimate for state 1 = n1 / N
Dirichlet prior estimate for state 1 = (n1 + a1) / (N + W)
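The formula above, in code and generalized to any number of states:

```python
# Dirichlet prior estimate: (n_i + a_i) / (N + W) for each state i.
def dirichlet_estimate(n, a):
    N, W = sum(n), sum(a)
    return [(ni + ai) / (N + W) for ni, ai in zip(n, a)]

# With a 0.5 prior on each state this reproduces the earlier numbers:
print(dirichlet_estimate([638, 362], [0.5, 0.5]))
```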

SLIDE 11

Math for the Beta Distribution

  • The Beta prior is the Dirichlet prior with 2 categories.
  • When using Bayes' rule to find a posterior distribution after seeing data, the result is a new Beta prior!
  • The end result is the same as we saw before from adding additional data, so there's no need to understand this.
  • But trust me, the math works.
  • See for yourself! It's not as bad as it looks.
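A minimal sketch of that conjugacy, using the pizza-vs-hot-dogs counts from earlier (function names are ours):

```python
# A Beta(a1, a2) prior plus counts (n1, n2) yields a Beta(a1+n1, a2+n2)
# posterior, so "add prior counts" is exact Bayesian updating.
def beta_posterior(a1, a2, n1, n2):
    """Posterior parameters after observing n1 and n2 events."""
    return a1 + n1, a2 + n2

def beta_mean(a1, a2):
    """Expected probability of the first category."""
    return a1 / (a1 + a2)

post = beta_posterior(0.5, 0.5, 1, 4)  # the 1-pizza, 4-hot-dogs case
print(post)              # (1.5, 4.5)
print(beta_mean(*post))  # 0.25 -- the same 25% as adding prior counts
```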
SLIDE 12

Beta, Gamma, Dirichlet

  • Beta Prior
    ○ Use it when you're dividing into two groups, or deciding between true and false
  • Dirichlet Prior
    ○ This is when you have multiple categories to choose from. Add a little bit of prior data to each category.
  • Gamma Prior
    ○ This is when you have a rate of occurrence of events. Add some prior events, and some prior time.
    ○ Can be used to build Dirichlets
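The Gamma case can be sketched the same way as the others; the prior of 2 events per 1 hour below is a hypothetical choice, not one from the talk:

```python
# Gamma-prior rate estimate: add prior events and prior observation time.
def rate_estimate(events, hours, prior_events=2.0, prior_hours=1.0):
    """Posterior mean rate under a Gamma(prior_events, prior_hours) prior."""
    return (events + prior_events) / (hours + prior_hours)

print(rate_estimate(0, 0.5))   # no events observed, but the rate isn't 0
print(rate_estimate(100, 10))  # plenty of data: close to the raw 10/hour
```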

SLIDE 13

How do we find the prior?

  • Without any other data, it's subjective.
  • Same weight on all categories?
  • Roulette Wheel vs. Restaurant again
  • When would you want high weight? Roulette Wheel.
  • When would you want low weight? Team of observers example.
  • Cautionary tale: Dirichlet Jellybean example.

SLIDE 14

How do we find the prior?

With Big Data, we can get fancier!

  • We may not have a lot of data about this barbecue, but we've been to lots of barbecues
  • Use our experience - find the Dirichlet distribution which optimizes the likelihood of past data.
SLIDE 15

A short python script will do it

  • Available on GitHub: BayesPy/ConjugatePriorTools
    ○ The Dirichlet probability distribution over counts (below) is easy to implement in Python.
  • Uses gradient descent (below), and second-order methods.
  • Let's run it!
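A stripped-down sketch of the same idea: score candidate priors by the likelihood they assign to past buckets of data, and keep the best. The real BayesPy/ConjugatePriorTools code uses gradient descent and second-order methods; this stand-in uses a plain grid search, and the example buckets are hypothetical.

```python
# Find the two-category Dirichlet (Beta) prior that best explains past data.
from math import lgamma

def log_evidence(counts, alphas):
    """Log likelihood of one bucket's counts under the prior `alphas`
    (Polya-urn sequence probability, via the Dirichlet-multinomial)."""
    W, N = sum(alphas), sum(counts)
    ll = lgamma(W) - lgamma(W + N)
    for a, n in zip(alphas, counts):
        ll += lgamma(a + n) - lgamma(a)
    return ll

def fit_prior(buckets, grid):
    """Pick the (a1, a2) pair on the grid maximizing total likelihood."""
    return max(
        ((a1, a2) for a1 in grid for a2 in grid),
        key=lambda a: sum(log_evidence(c, a) for c in buckets),
    )

# Hypothetical past buckets, most leaning toward the first category:
buckets = [(9, 1), (8, 2), (10, 0), (7, 3), (1, 9)]
print(fit_prior(buckets, [0.25, 0.5, 1.0, 2.0, 4.0]))
```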
SLIDE 16

Finally, some applications in the Foursquare app:

Tip Sentiment: Using a modified version of the AFINN word list, each tip is classified as either positive, negative, or neutral. Here is the prior that we found:

  Positive: 1.4
  Neutral: 3.1
  Negative: 0.4

High sentiment: McNally Jackson Bookstore, Pepe's Pizza
Low sentiment: 14th Street Post Office
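Applying that sentiment prior is the same pseudo-count trick with three categories; the tip counts below are hypothetical:

```python
# Smoothed sentiment using the prior from this slide (weight ~4.9 total).
PRIOR = {"positive": 1.4, "neutral": 3.1, "negative": 0.4}

def sentiment_estimate(tip_counts):
    """Smoothed sentiment distribution for one venue."""
    total = sum(PRIOR.values()) + sum(tip_counts.values())
    return {k: (PRIOR[k] + tip_counts.get(k, 0)) / total for k in PRIOR}

# A venue with no tips falls back to the prior's pie chart:
print(sentiment_estimate({}))
# Many positive tips overwhelm the prior:
print(sentiment_estimate({"positive": 20, "neutral": 3, "negative": 1}))
```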

SLIDE 17

Four example venues: a venue without tips, 1 negative tip, 5 negative tips, and 30 tips (10 of each).

SLIDE 18

Prior for partitioned popularity with a 48-dimensional Dirichlet

  • Monday - Thursday compressed, 2 hour intervals
  • Total weight: about 31 (Why so high?)
  • What if the time buckets are sampled unevenly?
SLIDE 19

Likes, and Dislikes: #allnew4sq

Foursquare released an update on June 7, 2012 that allowed users to like and dislike venues. We now have a lot of data points, but most venues have very few likes or dislikes. We've done some initial analysis, but there are a few issues. Let's look at the list and try to diagnose...

Beta prior (likes, dislikes): 7.50, 0.90 - likes are much more common
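With that Beta prior, the smoothed like rate for any venue is one line; the example counts below are taken from the venue lists on the following slides:

```python
# Smoothed like rate under the slide's Beta prior: 7.50 prior likes,
# 0.90 prior dislikes (total weight 8.4).
PRIOR_LIKES, PRIOR_DISLIKES = 7.50, 0.90

def like_rate(likes, dislikes):
    total = likes + dislikes + PRIOR_LIKES + PRIOR_DISLIKES
    return (likes + PRIOR_LIKES) / total

print(round(100 * like_rate(38, 0)))  # at&t park, 38/38 -> 98
print(round(100 * like_rate(0, 1)))   # 14th Street Post Office, 0/1 -> 80
```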

SLIDE 20

Examples of like/dislike ratings:

Highly liked venues:

  at&t park: 38/38, 98%
  magic kingdom park: 31/31, 98%
  blue bottle coffee: 28/28, 98%
  navy pier (chicago): 26/26, 97%
  shake shack: 24/24, 97%
  musée du louvre: 24/24, 97%
  museum of natural history: 24/24, 97%
  MoMA: 44/45, 96%
  uefa euro 2012 russia / czech: 261/272, 95%

This list illustrates some of the work that still needs to be done

SLIDE 21

Examples of like/dislike ratings:

Highly disliked venues (not sure if people actually dislike):

  vh1 big morning buzz: 17/41, 50%
  cross country cookout (History Channel event in Nashville): 14/30, 56%
  Penny Farthing: 1/5, 63% (more confident, shows lack of data)
  Moscow Passport Office: 1/5, 63%
  14th Street Post Office: 0/1, 80%

This list illustrates some of the work that still needs to be done

SLIDE 22

These projects will make these usable:

1) We need more than a week's worth of data. The negative signal in particular is weak.
2) There's a signal in "not rating" even though you've been somewhere. We need to look at who is rating (what's worked in the past).
3) Combine likes with a better tip sentiment algorithm.
4) Personalization - Netflix-style matrix factorization.

SLIDE 23

Questions?

Thanks! Follow me on Twitter: @maxsklar