Reservoir Sampling
I have been working with Stream processing a lot lately. Here’s my usecase.
Problem:
I have a stream of pageviews. But I don’t want to process every single pageview, because then I will process one pageview over and over again. How do I sample the data so that I process just a subset of data but with equal probability of processing each pageview.
For example,
1


Now I want to choose 3 pageviews out of 4
Solution:
We choose [pageview1, pageview2, pageview1]
as the initial reservior. Then I choose pageview3 with a probability of 3/4
Code
1 2 3 4 5 6 7 8 9 10 11 12 

I might not explain the math clearly, because I’m not very good at math. If you want to know more I recommend this and this and this and this. There’s also my question on SO about proof.