Random Sampling. Florian Schoppmann, August 24, 2010

SLIDE 1

Random Sampling

Florian Schoppmann August 24, 2010

Outline: Non-Sequential · Sequential · Sequential with Reservoir · Sequential with Reservoir and Replacement

SLIDE 2

Sampling Algorithms

Input:

  • List[1 . . . N]
  • Length of database N (if known)
  • Length of sample n

Output:

  • Sample[1 . . . n]

SLIDE 3

Considerations

  • Online vs. random-access
  • Sequential vs. non-sequential
  • Samples for independent categories

Desiderata:

  • Parallelizable
  • If random access, running time close to O(n)
  • Constant memory

SLIDE 4

Random Indices

    1: m ← 0
    2: while m < n do
    3:     R ← random({1 . . . N})
    4:     if List[R] ∉ Sample then
    5:         m ← m + 1
    6:         Sample[m] ← List[R]

  • ca. N ln(N / (N − n + 1)) iterations in expectation
  • Space/time trade-off in line 4
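The loop above can be sketched in Python as follows. This is a minimal sketch, not a reference implementation: the function name `sample_random_indices` and the `rng` parameter are made up for illustration, indices are 0-based, and the membership test tracks chosen positions in a set (rather than comparing values), which is one way to resolve the space/time trade-off of the membership test.

```python
import random


def sample_random_indices(items, n, rng=random):
    """Draw n distinct positions by retrying whenever a position repeats."""
    N = len(items)
    if not 0 <= n <= N:
        raise ValueError("need 0 <= n <= len(items)")
    chosen = set()  # O(n) extra space buys an O(1) expected membership test
    sample = []
    while len(sample) < n:
        r = rng.randrange(N)  # uniform over 0 .. N-1
        if r not in chosen:
            chosen.add(r)
            sample.append(items[r])
    return sample
```

Replacing the set with a linear scan of the positions chosen so far would save the auxiliary structure but make each test cost O(n).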

SLIDE 5

Random Remaining Indices

    for m ← 1, . . . , n do
        R ← random({1 . . . N − m + 1})
        j ← index of R-th non-null element in List
        Sample[m] ← List[j]
        List[j] ← null

  • Prohibitive running time Θ(nN)
  • Modifies List
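A possible Python rendering of this procedure is shown below. It is a sketch under assumptions: the name `sample_remaining_indices` and the sentinel `_REMOVED` are hypothetical, and the code works on a copy so the caller's list stays intact (at the cost of O(N) extra memory), which differs from the in-place version on the slide.

```python
import random

_REMOVED = object()  # sentinel marking positions that were already sampled


def sample_remaining_indices(items, n, rng=random):
    """Pick the r-th still-available element n times; Theta(n * N) overall."""
    work = list(items)  # copy: the slide's version nulls entries in place
    N = len(work)
    if not 0 <= n <= N:
        raise ValueError("need 0 <= n <= len(items)")
    sample = []
    for m in range(n):
        r = rng.randrange(N - m)  # rank among the N - m remaining elements
        live = -1
        for j, value in enumerate(work):  # linear scan for the r-th live slot
            if value is not _REMOVED:
                live += 1
                if live == r:
                    sample.append(value)
                    work[j] = _REMOVED
                    break
    return sample
```

The inner scan is what makes the running time prohibitive: every draw walks the list from the start.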

SLIDE 6

The Fisher-Yates Shuffle

    for m ← 1, . . . , n do
        R ← random({m . . . N})
        Swap List[m] and List[R]
    Sample[1 . . . n] ← List[1 . . . n]

  • Running time Θ(n)
  • Modifies List
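The partial shuffle can be sketched in Python like this. The function name `sample_fisher_yates` and the `rng` parameter are illustrative assumptions; indices are 0-based, and, as on the slide, the input list is rearranged in place.

```python
import random


def sample_fisher_yates(items, n, rng=random):
    """Shuffle only the first n positions of items (in place) and return them."""
    N = len(items)
    if not 0 <= n <= N:
        raise ValueError("need 0 <= n <= len(items)")
    for m in range(n):
        r = rng.randrange(m, N)  # uniform over m .. N-1
        items[m], items[r] = items[r], items[m]
    return items[:n]
```

Only n swaps are performed, so the cost does not depend on N as long as random access to the list is available.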

SLIDE 7

Probabilistic Sampling

    for t ← 1, . . . , N do
        with probability n/N do
            Append List[t] to Sample

  • Running time Θ(N)
  • Sample size is n only in expectation (mean of B(N, n/N))
  • Standard deviation √(n(1 − n/N))
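A small Python sketch of this scheme, together with the mean and standard deviation of the resulting sample size, might look as follows. The names `sample_bernoulli` and `expected_size_stats` are hypothetical helpers introduced only for this illustration.

```python
import math
import random


def sample_bernoulli(items, n, rng=random):
    """Keep each element independently with probability n / N."""
    N = len(items)
    if N == 0:
        return []
    p = n / N
    return [x for x in items if rng.random() < p]


def expected_size_stats(N, n):
    """Mean and standard deviation of the sample size, which is B(N, n/N)."""
    mean = n  # N * (n / N)
    sd = math.sqrt(n * (1 - n / N))  # sqrt(N * p * (1 - p)) with p = n / N
    return mean, sd
```

For example, with N = 1000 and n = 100 the sample size has mean 100 and standard deviation √90, roughly 9.5, so sizes between about 80 and 120 are unsurprising.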

SLIDE 8

Selection Sampling

    m ← 0
    for t ← 1, . . . , N do
        with probability (n − m)/(N − t + 1) do
            m ← m + 1
            Sample[m] ← List[t]

  • Running time Θ(N)
  • Completely unbiased!
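Selection sampling can be sketched in Python as below. The function name `sample_selection` is an assumption made for illustration. Each row is taken with probability (rows still needed) / (rows not yet examined); with the 0-based counter `t` used here, the number of rows not yet examined, including the current one, is N − t.

```python
import random


def sample_selection(items, n, rng=random):
    """One sequential pass that returns exactly n elements in list order."""
    N = len(items)
    if not 0 <= n <= N:
        raise ValueError("need 0 <= n <= len(items)")
    sample = []
    for t, x in enumerate(items):  # t = number of rows examined before x
        remaining = N - t  # rows not yet examined, including x
        needed = n - len(sample)  # rows still to be selected
        if rng.random() * remaining < needed:  # probability needed / remaining
            sample.append(x)
            if len(sample) == n:
                break  # nothing left to select
    return sample
```

When `needed == remaining` the test always succeeds, and when `needed == 0` it never does, which is why the sample size is exactly n rather than n in expectation.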

SLIDE 9

Random Number Generation

(Digression)

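Running time O(n) possible by skipping rows?

Idea 1:

  • Let S ∈ {0 . . . N − n} be the random variable for the number of rows to skip:

    $\Pr[S \le s] = 1 - \frac{(N-n)^{\underline{s+1}}}{N^{\underline{s+1}}}$

    where $x^{\underline{k}} = x(x-1)\cdots(x-k+1)$ denotes the falling factorial.

Idea 2 (Vitter, 1984):

  • von Neumann's rejection & "squeeze" method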
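Idea 1 can be illustrated with a short Python sketch that draws the skip length S by inverting its distribution function with a sequential search. All names (`skip_cdf`, `draw_skip`, `sample_by_skipping`) are hypothetical, and the ratio of falling factorials is evaluated as a running product, so Pr[S > s] is the product of (N − n − i)/(N − i) for i = 0, ..., s. This only shows the idea: the search still takes time proportional to the skip, so the sketch saves random numbers, not running time.

```python
import random


def skip_cdf(s, N, n):
    """Pr[S <= s] when n of N remaining rows must still be selected."""
    miss = 1.0  # probability that rows 0 .. s are all skipped
    for i in range(s + 1):
        miss *= (N - n - i) / (N - i)
    return 1.0 - miss


def draw_skip(N, n, rng=random):
    """Inverse-transform draw of S; requires 1 <= n <= N."""
    u = rng.random()
    s = 0
    miss = (N - n) / N  # Pr[S > 0]
    while miss > u:  # continue while Pr[S > s] > u
        s += 1
        miss *= (N - n - s) / (N - s)
    return s


def sample_by_skipping(items, n, rng=random):
    """Sequential sample that uses one random number per selected row."""
    N = len(items)
    if not 0 <= n <= N:
        raise ValueError("need 0 <= n <= len(items)")
    sample, pos, needed = [], 0, n
    while needed > 0:
        pos += draw_skip(N - pos, needed, rng)  # skip unselected rows
        sample.append(items[pos])
        pos += 1
        needed -= 1
    return sample
```

Making the draw itself cheap, without the sequential search, is what the rejection approach of Idea 2 is aimed at.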

SLIDE 10

Vitter (1984)

    $c = \frac{N}{N-n+1}, \quad g(x) = \frac{n}{N}\left(1 - \frac{x}{N}\right)^{n-1}, \quad h(x) = \frac{n}{N}\left(1 - \frac{x}{N-n+1}\right)^{n-1}$

[Figure: c · g(x), h(x), and Pr[S = x] plotted against x for N = 20, n = 5]
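The three curves for N = 20, n = 5 can be tabulated with a few lines of Python. This is a sketch under the assumption that Pr[S = x] is the skip distribution from the previous slide (x unselected rows followed by a selected one); the function names are made up for illustration.

```python
def skip_pmf(x, N, n):
    """Pr[S = x]: the first x rows are skipped, then row x is selected."""
    p = n / (N - x)  # row x is selected given x earlier misses
    for i in range(x):
        p *= (N - n - i) / (N - i)  # row i is skipped
    return p


def g(x, N, n):
    return (n / N) * (1 - x / N) ** (n - 1)


def h(x, N, n):
    return (n / N) * (1 - x / (N - n + 1)) ** (n - 1)


def envelope_table(N, n):
    """Rows (x, h(x), Pr[S = x], c * g(x)) for x = 0 .. N - n."""
    c = N / (N - n + 1)
    return [(x, h(x, N, n), skip_pmf(x, N, n), c * g(x, N, n))
            for x in range(N - n + 1)]


if __name__ == "__main__":
    for x, lower, pmf, upper in envelope_table(20, 5):
        print(f"{x:2d}  {lower:.4f}  {pmf:.4f}  {upper:.4f}")
```

In this table h(x) stays at or below Pr[S = x], which in turn stays below c · g(x). That ordering is what a rejection method needs: candidates are proposed via g, scaled by c, and the cheap lower bound h lets most of them be accepted without evaluating Pr[S = x] exactly.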

SLIDE 11

Reservoir Sampling

    Sample[1 . . . n] ← List[1 . . . n]
    for t ← n + 1, . . . , N do
        with probability n/t do
            R ← random({1 . . . n})
            Sample[R] ← List[t]

  • Completely unbiased!
  • O(n(1 + log(N/n))) by optimizing (Vitter, 1985)
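The basic reservoir scheme (without the optimization attributed to Vitter, 1985) can be sketched in Python as follows. The name `reservoir_sample` is an assumption; the input may be any iterable, so N does not have to be known in advance.

```python
import random


def reservoir_sample(stream, n, rng=random):
    """Keep a uniform sample of n elements from a stream of unknown length."""
    reservoir = []
    for t, x in enumerate(stream, start=1):  # t = number of rows seen so far
        if t <= n:
            reservoir.append(x)  # the first n rows fill the reservoir
        elif rng.random() * t < n:  # probability n / t
            reservoir[rng.randrange(n)] = x  # evict a uniformly chosen slot
    return reservoir
```

If the stream holds fewer than n rows, this sketch simply returns all of them. It touches every row, so the running time is Θ(N).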

SLIDE 12

Reservoir, with Replacement

    for t ← 1, . . . , N do
        for i ← 1, . . . , n do
            with probability 1/t do
                Sample[i] ← List[t]

  • Completely unbiased!
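A Python sketch of the with-replacement variant is given below; the function name `reservoir_sample_with_replacement` is hypothetical. Each of the n slots behaves like an independent reservoir of size one, so the same row may appear in several slots.

```python
import random


def reservoir_sample_with_replacement(stream, n, rng=random):
    """n independent size-1 reservoirs over the same stream."""
    sample = [None] * n  # stays [None] * n if the stream is empty
    for t, x in enumerate(stream, start=1):  # t = number of rows seen so far
        for i in range(n):
            if rng.random() * t < 1:  # probability 1 / t; certain when t == 1
                sample[i] = x
    return sample
```

This direct version draws n random numbers per row, for Θ(nN) work in total.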

SLIDE 13

Bibliography

  • Knuth (1997): The Art of Computer Programming, Vol. 2
  • Vitter (1984): Faster methods for random sampling
  • Vitter (1985): Random sampling with a reservoir
  • Park, Ostrouchov, Samatova, Geist (2004): Reservoir-Based Random Sampling with Replacement from Data Stream
