Reservoir Sampling

I have been working with Stream processing a lot lately. Here’s my usecase.

Problem:

I have a stream of pageviews. But I don’t want to process every single pageview, because then I will process one pageview over and over again. How do I sample the data so that I process just a subset of data but with equal probability of processing each pageview.

For example,

1
[pageview1, pageview2, pageview1, pageview3]

Now I want to choose 3 pageviews out of 4

Solution:

We choose [pageview1, pageview2, pageview1] as the initial reservior. Then I choose pageview3 with a probability of 3/4

Code

1
2
3
4
5
6
7
8
9
10
11
12
const samples = [];
const sampleSize = 10;
pageviews.forEach((record, index) => {
  if (samples.length < sampleSize) {
    samples.push(record);
  } else {
    let rand = Math.floor(Math.random() * index);
    if (rand < sampleSize) {
      samples[rand] = record;
    }
  }
});

I might not explain the math clearly, because I’m not very good at math. If you want to know more I recommend this and this and this and this. There’s also my question on SO about proof.

Feb 20th, 2017

Comments