Estimating Uncertainty of Categorical Web Data



slide-1
SLIDE 1

Estimating Uncertainty of Categorical Web Data

Davide Ceolin, Willem Robert van Hage, Wan Fokkink, Guus Schreiber VU University Amsterdam

slide-2
SLIDE 2

Suppose we know that this bag contains balls...

slide-3
SLIDE 3

Suppose we know that this bag contains balls... We draw some of them...

slide-9
SLIDE 9

Suppose we know that this bag contains balls... We draw some of them... What can we say about the bag content?

slide-10
SLIDE 10

Bag content

  • A binomial distribution can represent the sample.
  • But does it also represent the entire bag content?

slide-11
SLIDE 11

A few new observations can cause a dramatic change in the proportions!

slide-15
SLIDE 15

A few new observations can cause a dramatic change in the proportions (from 80/20 to 50/50)!
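The shift from 80/20 to 50/50 is easy to reproduce with a small sample. The sketch below uses hypothetical counts (4 red balls and 1 blue ball, then 3 more blue balls); the colours and numbers are illustrative assumptions, not taken from the slides.

```python
from fractions import Fraction


def proportions(counts):
    """Return the observed proportion of each category."""
    total = sum(counts.values())
    return {label: Fraction(n, total) for label, n in counts.items()}


# Hypothetical sample: 5 draws, split 80/20.
sample = {"red": 4, "blue": 1}
before = proportions(sample)

# Three new observations, all blue, are enough to reach 50/50.
sample["blue"] += 3
after = proportions(sample)

print(before)
print(after)
```

With such a small sample, the observed ratio alone says little about the bag as a whole.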

slide-16
SLIDE 16

Estimating the second-order probability

  • We should estimate the uncertainty about the ratio p.
  • The Beta distribution is a natural candidate to describe p, because it is conjugate to the binomial (the two-category case of the multinomial).
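Conjugacy means that a Beta prior updated with binomial observations is again a Beta distribution, with the observed counts added to its parameters. A minimal sketch, assuming a uniform Beta(1, 1) prior and hypothetical counts (4 draws of the first colour, 1 of the second); the function names are illustrative:

```python
def beta_update(alpha, beta, successes, failures):
    """Conjugate update: add the observed counts to the Beta parameters."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Expected value of the ratio p under Beta(alpha, beta)."""
    return alpha / (alpha + beta)


def beta_variance(alpha, beta):
    """Variance of p under Beta(alpha, beta): the uncertainty about the ratio."""
    total = alpha + beta
    return alpha * beta / (total ** 2 * (total + 1))


prior = (1.0, 1.0)  # assumed uniform prior over p
posterior = beta_update(*prior, successes=4, failures=1)

print(posterior)
print(beta_mean(*posterior), beta_variance(*posterior))
```

The posterior mean (5/7) stays close to the observed 80%, while the variance quantifies how far the true ratio may still be from it.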

slide-17
SLIDE 17

Of course, this was a metaphor...

slide-18
SLIDE 18

Of course, this was a metaphor...

[Figure: the Web]

slide-19
SLIDE 19

Of course, this was a metaphor...

[Figure: the Web, containing classes of URIs / Web pages / links / ...]

slide-20
SLIDE 20

Of course, this was a metaphor...

[Figure: the Web, containing classes of URIs / Web pages / links / ...]

Does this change anything?

slide-21
SLIDE 21

Dealing with Web samples

The Web makes the situation more complicated:

  • Samples can be biased;
  • The Web evolves over time;
  • Different domains imply distinct subpopulations;
  • Data are accessed incrementally, by crawling.
slide-22
SLIDE 22

Deploying second-order probabilities

  • Why? Possible bias, time variability and subpopulations increase uncertainty.
  • Rationale: Instead of trying to estimate the correct proportion among categories, we compute a set of candidate values.
  • Over time: More evidence makes the set smaller.
  • Soundness: Conjugacy guarantees a correct choice and update of the probability distributions.
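One way to read "a set of candidate values" is as a credible interval for the ratio p. The sketch below is an illustration under assumptions, not the authors' procedure: it keeps the observed ratio fixed at 80/20, assumes a uniform Beta(1, 1) prior, and approximates a 95% interval by sampling with the standard library's `random.betavariate`.

```python
import random


def credible_interval(alpha, beta, level=0.95, draws=10_000, seed=0):
    """Approximate a central credible interval of Beta(alpha, beta) by sampling."""
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    tail = (1.0 - level) / 2.0
    low = samples[int(tail * draws)]
    high = samples[int((1.0 - tail) * draws) - 1]
    return low, high


widths = []
for n in (5, 50, 500):  # hypothetical sample sizes, same 80/20 ratio
    successes = 4 * n // 5
    failures = n - successes
    low, high = credible_interval(1 + successes, 1 + failures)
    widths.append(high - low)
    print(n, round(low, 3), round(high, 3))
```

The interval narrows as n grows, which is the "more evidence makes the set smaller" behaviour.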

slide-23
SLIDE 23

A natural history example

[Figure: the Web, with categories Cat 1 and Cat 2, and a sample]

What is the right annotation for this specimen?

slide-24
SLIDE 24

The distribution may help us

  • The Beta-Binomial is a Binomial whose parameter p is randomly drawn from a Beta distribution.
  • The Beta-Binomial is smoother than the Binomial. As the data size grows, it tends to the Binomial.
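The smoothing can be made concrete by comparing variances. The sketch below computes both probability mass functions for n = 10 future draws with mean ratio 0.8, using Beta parameters (4s, s) for a growing evidence strength s. The values of n and s are illustrative assumptions.

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def binomial_pmf(k, n, p):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def beta_binomial_pmf(k, n, a, b):
    """Binomial pmf averaged over p ~ Beta(a, b)."""
    return math.comb(n, k) * math.exp(log_beta(k + a, n - k + b) - log_beta(a, b))


def variance(weights):
    """Variance of a distribution over 0..len(weights)-1."""
    mean = sum(k * w for k, w in enumerate(weights))
    return sum((k - mean) ** 2 * w for k, w in enumerate(weights))


n, p = 10, 0.8
binomial_var = variance([binomial_pmf(k, n, p) for k in range(n + 1)])

beta_binomial_vars = []
for strength in (1, 10, 1000):  # more evidence means larger Beta parameters
    a, b = 4 * strength, 1 * strength
    pmf = [beta_binomial_pmf(k, n, a, b) for k in range(n + 1)]
    beta_binomial_vars.append(variance(pmf))

print(binomial_var, beta_binomial_vars)
```

The Beta-Binomial variance is always larger than the Binomial one, and the gap closes as the Beta parameters grow.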

slide-25
SLIDE 25

Linked Open Piracy

Linked Open Piracy is a repository about piracy attacks. Time, place, attack type and ship type of each attack are recorded. The repository is known to be accurate, but incomplete. Let us see how to deal with this issue.

slide-26
SLIDE 26

Estimating attack type proportions

  • Unknown population size.
  • Our variables are not necessarily i.i.d.
  • Data are represented by a multinomial distribution.
  • Using a Dirichlet prior, we can estimate their uncertainty.
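With more than two categories, the Dirichlet plays the role the Beta plays for two: the posterior parameters are the prior parameters plus the observed counts. A minimal sketch, assuming a uniform Dirichlet prior; the attack-type labels and counts are hypothetical and not taken from Linked Open Piracy.

```python
def dirichlet_posterior(prior, counts):
    """Conjugate update: add the multinomial counts to the Dirichlet parameters."""
    return {label: prior[label] + counts.get(label, 0) for label in prior}


def dirichlet_summary(params):
    """Mean and variance of each category proportion under Dirichlet(params)."""
    total = sum(params.values())
    summary = {}
    for label, a in params.items():
        mean = a / total
        var = a * (total - a) / (total ** 2 * (total + 1))
        summary[label] = (mean, var)
    return summary


# Hypothetical attack types and counts, for illustration only.
prior = {"hijacked": 1.0, "boarded": 1.0, "attempted": 1.0}
counts = {"hijacked": 6, "boarded": 3, "attempted": 1}

posterior = dirichlet_posterior(prior, counts)
summary = dirichlet_summary(posterior)
for label, (mean, var) in summary.items():
    print(label, round(mean, 3), round(var, 4))
```

The per-category variance is the uncertainty estimate: it shrinks as the total count grows.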

slide-27
SLIDE 27

New attack type prediction

[Figure: the Web]

How is it possible to estimate future proportions in this situation? In many regions, new attack types show up over time.

slide-28
SLIDE 28

A Dirichlet process can help us!

  • First: mapping. Attack types → [0, 1].
  • A priori, U[0, 1] (events equally likely).
  • The class of a new observation can be:
      – drawn from U[0, 1] (names are mapped manually);
      – proportional to the already observed data.
  • The weight of the observations increases as more data are seen.
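The scheme above resembles the Blackwell-MacQueen urn construction of a Dirichlet process. The sketch below assumes that construction with a concentration parameter `alpha`; the slides do not state which sampler or parameter the authors used, and the mapping of attack types to points in [0, 1] is hypothetical.

```python
import random
from collections import Counter


def sample_dirichlet_process(n_draws, alpha, observed=(), seed=0):
    """Extend `observed` with n_draws values using an urn scheme.

    With probability alpha / (alpha + n) a new value is drawn from U[0, 1];
    otherwise a past value is repeated, chosen proportionally to its frequency.
    """
    rng = random.Random(seed)
    values = list(observed)
    for _ in range(n_draws):
        n = len(values)
        if rng.random() < alpha / (alpha + n):
            values.append(rng.random())        # possibly a new attack type
        else:
            values.append(rng.choice(values))  # an already observed type
    return values


# Hypothetical mapping: one attack type at 0.2 (seen 6 times), one at 0.6 (seen 3 times).
observed = [0.2] * 6 + [0.6] * 3
values = sample_dirichlet_process(100, alpha=1.0, observed=observed, seed=0)

simulated = Counter(values[len(observed):])
for point, count in simulated.most_common(3):
    print(round(point, 3), count / 100)
```

Because the probability of repeating a past value is n / (alpha + n), the observations weigh more as n grows, while new classes remain possible.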

slide-29
SLIDE 29

Dirichlet processes as generalized Dirichlet distributions

  • Uncertainty about the proportions and uncertainty about the classes.
  • Simulations driven by a Dirichlet process can provide good estimates.

slide-30
SLIDE 30

Results

                 Simulation   Projection
Average error    0.29         0.35
Variance         0.09         0.21

  • Per region, we predict year n+1 proportions, based on year n data.
  • The Dirichlet process performs better than a projection of the current proportions.

slide-31
SLIDE 31

Conclusions

  • Web data are characterized by additional layers of uncertainty.
  • Second-order probabilities help handle some of these layers.
  • A Dirichlet process helps to compensate when not all categories are known.
  • There is still much to do: consider concrete domain data, integrate with logics, etc.

slide-32
SLIDE 32

Thank you! Questions?

d.ceolin@vu.nl
http://www.few.vu.nl/~dceolin
http://www.cs.vu.nl/lop