SLIDE 1

Approximate Counting By Sampling

CompSci 590.02 Instructor: Ashwin Machanavajjhala

Lecture 3 : 590.02 Spring 13

SLIDE 2

Recap

So far, we have seen …

  • Efficient sampling techniques to get uniformly random samples

– Reservoir sampling
– Sampling using a tree index
– Sampling using a nearest neighbor index

Today’s class

  • Use sampling for approximate counting.

SLIDE 3

Counting Problems

  • Given a decision problem S, compute the number of feasible solutions to S (denoted by #S).

Example:

  • #DNF: Count the number of satisfying assignments of a Boolean formula in DNF

– E.g., a formula such as (x1 ∧ ¬x2) ∨ (x2 ∧ x3)
– Let n = number of variables
– Let m = number of disjuncts

  • Counting the number of triangles in a graph

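On small instances, #DNF can be computed by exhaustive enumeration, which is a useful sanity baseline for the sampling methods that follow. A minimal Python sketch; the list-of-dicts formula encoding and the example formula are hypothetical, not from the lecture:

```python
from itertools import product

# A DNF formula as a list of disjuncts; each disjunct maps a variable
# index to the value its literal requires (True = positive, False = negated).
# Hypothetical example: (x0 AND NOT x1) OR (x1 AND x2)
formula = [{0: True, 1: False}, {1: True, 2: True}]
n = 3  # number of variables

def satisfies(assignment, disjunct):
    """True iff the assignment makes every literal in the disjunct true."""
    return all(assignment[var] == val for var, val in disjunct.items())

# #DNF by brute force: enumerate all 2**n assignments, so this is
# feasible only for small n.
count = sum(
    1
    for assignment in product([False, True], repeat=n)
    if any(satisfies(assignment, d) for d in formula)
)
print(count)
```

For this toy formula the exhaustive count is 4; the approximate methods on the following slides aim to match such counts without touching all 2^n assignments.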
SLIDE 4

Applications of DNF counting

  • Advertising

– Contracts are of the following form: need 1 million impressions for [Males, 15-25, CA] OR [Males, 15-35, TX]
– Use historical data to estimate whether such a contract can be fulfilled.

  • Web Search

– Given a keyword query q = (k1, k2, …, km), find the number of documents that contain at least one keyword.

SLIDE 5

DNF Counting is Hard

  • Deciding whether a DNF formula is a tautology (i.e., unfalsifiable) is coNP-hard, so #DNF cannot be computed exactly in PTIME unless P = NP
  • #DNF ∈ #P
  • #P is the class of all problems for which there exists a non-deterministic polynomial time algorithm A such that, for any instance I, the number of accepting computations of A on I is #I.

– i.e., given a witness we can verify in polynomial time that #I ≥ 1 (the decision version is in NP).

SLIDE 6

FPRAS

  • Our goal is to design a fully polynomial randomized approximation scheme (FPRAS).

  • For every input DNF formula, error parameter ε > 0, and confidence parameter 0 < δ < 1, the algorithm must output a value C’ s.t.

P[(1-ε) C < C’ < (1+ε) C] > 1-δ

where C is the true number of satisfying assignments, in time polynomial in the size of the input DNF, 1/ε, and log(1/δ).

SLIDE 7

FPRAS

  • Sometimes, an FPRAS is defined without the δ …
  • For every input DNF formula and error parameter ε > 0, the algorithm must output a value C’ s.t.

P[(1-ε) C < C’ < (1+ε) C] > 3/4

where C is the true number of satisfying assignments, in time polynomial in the size of the input DNF and 1/ε.

  • Exercise: The two definitions are equivalent (hint: take the median of O(log(1/δ)) independent runs of the 3/4-version).

SLIDE 8

Monte Carlo Method

  • Suppose U is a universe of elements

– In DNF counting, U = {0,1}^n, the set of all assignments

  • Let G be a subset of interest in U

– In DNF counting, G = set of all satisfying assignments.


For i = 1 to N

  • Choose u ∈ U, uniformly at random
  • Check whether u ∈ G
  • Let Xi = 1 if u ∈ G, Xi = 0 otherwise

Return C’ = |U| ∙ (X1 + … + XN) / N
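Specialized to #DNF, the loop above can be sketched in Python as follows (the formula encoding and function names are hypothetical illustrations):

```python
import random

# Hypothetical encoding: each disjunct maps a variable index to the
# value its literal requires. Formula: (x0 AND NOT x1) OR (x1 AND x2).
formula = [{0: True, 1: False}, {1: True, 2: True}]
n = 3

def satisfies_formula(u, formula):
    """True iff assignment u satisfies at least one disjunct."""
    return any(all(u[v] == val for v, val in d.items()) for d in formula)

def naive_mc_count(formula, n, N, rng):
    """Monte Carlo estimate of #DNF: sample u uniformly from {0,1}^n,
    count hits, and scale by |U| = 2^n."""
    hits = 0
    for _ in range(N):
        u = [rng.random() < 0.5 for _ in range(n)]  # uniform over {0,1}^n
        if satisfies_formula(u, formula):
            hits += 1
    return (2 ** n) * hits / N

rng = random.Random(0)
print(naive_mc_count(formula, n, N=10000, rng=rng))  # ≈ 4, the true count
```

Here |G|/|U| = 1/2, so the estimator converges quickly; the next slides show why this fails when |G| is tiny relative to |U|.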

SLIDE 9

Monte Carlo Method

When should you use it?

  • Easy to uniformly sample from U
  • Easy to check whether sample is in G
  • N is polynomial in the size of the input.

SLIDE 10

Chernoff Bound

Theorem: Let X1, …, XN be independent 0/1 random variables with P[Xi = 1] = p, let X = X1 + … + XN and μ = E[X] = Np. Then for any δ > 0,

P[X > (1+δ)μ] < ( e^δ / (1+δ)^(1+δ) )^μ

SLIDE 11

Upper Chernoff Bound Proof

For any t > 0, Markov’s inequality applied to e^(tX) gives

P[X > (1+δ)μ] = P[e^(tX) > e^(t(1+δ)μ)] ≤ E[e^(tX)] / e^(t(1+δ)μ)

By independence, E[e^(tX)] = Πi E[e^(tXi)] = (1 − p + p∙e^t)^N ≤ e^(μ(e^t − 1)).

Choosing t = ln(1+δ) yields P[X > (1+δ)μ] ≤ ( e^δ / (1+δ)^(1+δ) )^μ.

SLIDE 12

Simpler Upper Tail Bound

For 0 < δ ≤ 1, the bound simplifies to

P[X > (1+δ)μ] ≤ e^(−μδ²/3)

SLIDE 13

Simpler Lower Tail Bound

For 0 < δ < 1,

P[X < (1−δ)μ] ≤ e^(−μδ²/2)

SLIDE 14

DNF Counting

  • |U| = 2^n
  • |G| can be exponentially smaller than |U|

Example:

  • Every satisfying assignment must contain x1 = 1
  • |G| = 2^n / 2
  • A large ratio |U|/|G| leads to an exponential number of samples for convergence.
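The blow-up can be made concrete: by the simpler tail bounds, on the order of N ≈ 3/(p ε²) ∙ ln(2/δ) samples are needed when p = |G|/|U|. A quick numeric sketch (the helper name is hypothetical, and the constant is order-of-magnitude):

```python
import math

# Rough sample-size requirement from the simpler Chernoff tail bounds:
# N >= 3/(p * eps^2) * ln(2/delta) samples give a (1 +/- eps)-estimate
# with probability 1 - delta, where p = |G|/|U| is the hit probability.
def samples_needed(p, eps=0.1, delta=0.05):
    return math.ceil(3.0 / (p * eps * eps) * math.log(2.0 / delta))

# As |G|/|U| shrinks exponentially (p = 2^-k), N grows exponentially.
for k in (1, 10, 20):
    print(k, samples_needed(2.0 ** -k))
```

With p = 2^-20 the required N is already in the trillions, which is why the next slide reshapes the sample space instead of sampling U directly.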

SLIDE 15

Importance Sampling

  • Set U’ = {(u, i) | u is an assignment that satisfies disjunct i }
  • Set G’ = {(u, i) | u is an assignment that satisfies disjunct i but does not satisfy any disjunct j < i }

  • |G’| = |G|

– Each assignment appears exactly once.

  • Easy to check if sample is in G’
  • |U’| / |G’| ≤ m

– Each assignment appears at most m times in U’

  • We are done if we can sample uniformly from U’

SLIDE 16

Importance Sampling

  • Given a DNF formula, it is easy to construct a random satisfying assignment for a chosen disjunct.

– Pick a disjunct (e.g., the 1st)
– Create a satisfying assignment for the variables in that disjunct (e.g., 1001)
– Randomly choose 0 or 1 for each of the remaining variables.

  • If a disjunct i has ki literals, there are 2^(n−ki) satisfying assignments (u, i)

  • |U’| = Σi 2^(n−ki)
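The construction above can be sketched as a uniform sampler for U’ (the dict-based disjunct encoding is a hypothetical convention, not from the slides):

```python
import random

def sample_from_U_prime(formula, n, rng):
    """Sample (u, i) uniformly from U' = {(u, i) : u satisfies disjunct i}.

    Choose disjunct i with probability 2^(n - k_i) / |U'|, fix the k_i
    literals of disjunct i, and fill the other n - k_i variables at random.
    """
    weights = [2 ** (n - len(d)) for d in formula]  # 2^(n - k_i) per disjunct
    i = rng.choices(range(len(formula)), weights=weights)[0]
    u = [rng.random() < 0.5 for _ in range(n)]      # random values everywhere
    for var, val in formula[i].items():             # then force disjunct i true
        u[var] = val
    return u, i

# Hypothetical 3-variable formula; every sample satisfies its chosen disjunct.
formula = [{0: True, 1: False}, {1: True, 2: True}]
rng = random.Random(0)
for _ in range(5):
    u, i = sample_from_U_prime(formula, 3, rng)
    assert all(u[v] == val for v, val in formula[i].items())
```

Each pair (u, i) receives probability (2^(n−ki)/|U’|) ∙ 2^(−(n−ki)) = 1/|U’|, so the sampler is exactly uniform over U’.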

SLIDE 17

Importance Sampling

Theorem: The above algorithm is an (ε,δ) FPRAS if N ≥ (3m/ε²) ln(2/δ). (Since |G’|/|U’| ≥ 1/m, the Chernoff tail bounds make each failure probability at most δ/2 for this N.)


For t = 1 to N

  • Choose a disjunct i, with probability 2^(n−ki) / |U’|
  • Generate a random assignment u satisfying disjunct i
  • Check whether (u, i) ∈ G’
  • Let Xt = 1 if (u, i) ∈ G’, Xt = 0 otherwise

Return C’ = |U’| ∙ (X1 + … + XN) / N
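Putting the pieces together, the Karp-Luby estimator above can be sketched in Python (the formula encoding is a hypothetical convention, and the example formula is for illustration only):

```python
import random

def karp_luby(formula, n, N, rng):
    """Importance-sampling estimate of #DNF: sample (u, i) uniformly from
    U', count it only if no earlier disjunct j < i is satisfied (i.e. the
    pair lies in G'), and scale the hit rate by |U'|."""
    weights = [2 ** (n - len(d)) for d in formula]  # 2^(n - k_i) per disjunct
    U_prime = sum(weights)                          # |U'| = sum_i 2^(n - k_i)
    hits = 0
    for _ in range(N):
        i = rng.choices(range(len(formula)), weights=weights)[0]
        u = [rng.random() < 0.5 for _ in range(n)]
        for var, val in formula[i].items():
            u[var] = val
        # (u, i) is in G' iff no disjunct j < i is also satisfied by u.
        in_G_prime = not any(
            all(u[v] == val for v, val in formula[j].items())
            for j in range(i)
        )
        hits += in_G_prime
    return U_prime * hits / N

# Hypothetical 2-variable formula x0 OR x1; the true count is 3
# (every assignment except 00 satisfies it).
formula = [{0: True}, {1: True}]
print(karp_luby(formula, n=2, N=20000, rng=random.Random(0)))  # ≈ 3
```

Because the hit probability |G’|/|U’| is at least 1/m, the estimate converges with polynomially many samples regardless of how small |G|/|U| is.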

SLIDE 18

Summary of DNF Counting

  • #DNF is a #P-hard problem
  • The Monte Carlo method yields an (ε,δ) FPRAS if

– We can sample from U in PTIME
– We can check membership in G in PTIME
– |G| is not too small compared to |U|

  • Monte Carlo on a modified domain gives an (ε,δ) FPRAS for #DNF

SLIDE 19

Applications of Triangle Counting

  • Measures of homophily

– If A-B and B-C are edges, what is the probability that A-C is also an edge?

  • Clustering Coefficient: 3 x # triangles / # connected triples
  • Transitivity Ratio: # triangles / # connected triples

SLIDE 20

Triangle Counting is “Easy”

  • Naïve method: O(n^3)
  • Well-known methods take O(dmax^2 ∙ n) and O(m^1.5)
  • Still not efficient for a very large graph

– Twitter in 2009
– 54,981,152 nodes
– 1,963,263,821 edges
– Max degree > 3 million
– Clustering coefficient ~ 0.1
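As a baseline for these running-time bounds, exact triangle counting by checking neighbor pairs, combined with the clustering-coefficient formula from the previous slide, can be sketched as follows (the adjacency-set graph encoding and the toy graph are hypothetical):

```python
from itertools import combinations

def count_triangles(adj):
    """Exact triangle count: for each node, test every pair of its
    neighbors for an edge. Each triangle is found once per corner,
    so divide by 3. Cost is O(sum_v deg(v)^2), i.e. O(dmax^2 * n)."""
    found = 0
    for v, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:
                found += 1
    return found // 3

def connected_triples(adj):
    """A connected triple centered at v is a choice of 2 of v's neighbors."""
    return sum(d * (d - 1) // 2 for d in (len(n) for n in adj.values()))

# Hypothetical toy graph: a triangle {0,1,2} plus a pendant edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
t = count_triangles(adj)
cc = 3 * t / connected_triples(adj)  # clustering coefficient, as on slide 19
print(t, cc)
```

On a graph of Twitter's scale this quadratic-in-degree scan is exactly what becomes infeasible, motivating the sampling-based approach asked about on the next slide.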

SLIDE 21

Is there an FPRAS?

  • Exercise

SLIDE 22

References

  • R. Karp, M. Luby, N. Madras, “Monte-Carlo Approximation Algorithms for Enumeration Problems”, Journal of Algorithms 10(3), 1989
