I have been working with Stream processing a lot lately. Here’s my usecase.
I have a stream of pageviews. But I don’t want to process every single pageview, because then I will process one pageview over and over again. How do I sample the data so that I process just a subset of data but with equal probability of processing each pageview.
Now I want to choose 3 pageviews out of 4
[pageview1, pageview2, pageview1] as the initial reservior. Then I choose pageview3 with a probability of 3/4
1 2 3 4 5 6 7 8 9 10 11 12