(no subject)
Mar. 2nd, 2012 01:10 pmDear practical probstats fans,
Apologies for any bad terminology, it's been a loooong time since I had the terms of art loaded in braincache.
I have a pool of items which is of unknown size. I want to take a sample of size k from the pool. The size of the pool might turn out to be less than k; if so I'll just take the whole pool. That's okay. This is an engineering task, and not a theory task, so it doesn't have to be strictly sound or easy to describe in equations, it just has to avoid any nasty gotchas or corner case effects, and be "more useful" than just doing a simple random sample over the whole pool.
Because this is an engineering task, we're also interested in performance: We want to take the sample in a single pass There are a bunch of different pools, and the distribution of pool sizes follows a power law -- mostly small, but a few truly enormous ones that won't fit in memory and cause problems for complexity analyses that just wave their hands at constants. Going through some of these lists twice is a very bad idea.
Anyhow. The pool contains three types of items: blue items, green items, and numbered green items. To the extent that it's possible, I'd like to have an equal number of blue, green, and numbered-green items in my sample... but I would rather have k items than have the item types be perfectly balanced.
This is basically a stratified sample, but I'm going to fuss with it. Assuming I have plenty of items of all types, the blue and the green strata samples will just be a simple random sample. But for the numbered green items, I want to favor those items with higher numbers. To do this I'm doing something called "probability proportional to size" sampling, and letting the item's number be the item's size. This means an item numbered 4 is twice as likely to be picked as an item numbered 2. It slightly abuses the official purpose of PPS sampling (where you're, say, looking for a sample made of students, and picking which schools -- which have different sizes -- to draw those students from) but it'll work for us.*
So with plenty of items of all types, I will draw a blue simple random sample of size k/3, a green simple random sample of size k/3, and a numbered green PPS sample of size k/3, and then just combine them to form my final sample.
We modify the strata sample size in order to cope with undersized pools, which are very frequent (remember the pool sizes are ~pareto). The item types are not necessarily evenly distributed, so instead of drawing strata samples of size k/3 I'm drawing samples of size k, so that if we have e.g. a whole pile of blue items and not very many green ones, the blue ones can cover for the green ones, and our final sample size k is maintained.
This causes a problem, because if I then have k blue, k green, and k numbered green items, I need to pick k/3 from each pile. I can just pick k/3 items with uniform probability from the blue and green piles because they're SRS, but here's my question: If I pick k/3 items with uniform probability from a PPS sample, does that change the sample distribution, and if so, how?
Here's another question: The numbered green items are, in fact, green; the plain green items are just items we don't have numbers for. Is it a really bad idea to toss any discarded numbered green items into the subpool under consideration for the plain green pile?
* incidentally, a simple random sample is a special case of a PPS sample where all the item sizes are 1, which is what gives me confidence in PPS as a method. In a simple random sample, the probability of item i being in a sample of size k drawn from a population of size N is k/N. In a PPS sample, if item i has size si, and the sum of all si in the population is S, the membership probability is k(si) / S. This obviously causes problems if k(si) > S, which happens either if k is too close to S or if you have some seriously off-size items in your pool. Normally if this happened you'd let item i be in your sample multiple times, but we're just waving our hands at it and ignoring this effect, because hey, it's engineering. Anyhow, you can see how if the size of all items is 1, si = 1 and S = N(1) = N, so you have k(1) / N = k/N. Yahtzee.
Apologies for any bad terminology, it's been a loooong time since I had the terms of art loaded in braincache.
I have a pool of items which is of unknown size. I want to take a sample of size k from the pool. The size of the pool might turn out to be less than k; if so I'll just take the whole pool. That's okay. This is an engineering task, and not a theory task, so it doesn't have to be strictly sound or easy to describe in equations, it just has to avoid any nasty gotchas or corner case effects, and be "more useful" than just doing a simple random sample over the whole pool.
Because this is an engineering task, we're also interested in performance: We want to take the sample in a single pass There are a bunch of different pools, and the distribution of pool sizes follows a power law -- mostly small, but a few truly enormous ones that won't fit in memory and cause problems for complexity analyses that just wave their hands at constants. Going through some of these lists twice is a very bad idea.
Anyhow. The pool contains three types of items: blue items, green items, and numbered green items. To the extent that it's possible, I'd like to have an equal number of blue, green, and numbered-green items in my sample... but I would rather have k items than have the item types be perfectly balanced.
This is basically a stratified sample, but I'm going to fuss with it. Assuming I have plenty of items of all types, the blue and the green strata samples will just be a simple random sample. But for the numbered green items, I want to favor those items with higher numbers. To do this I'm doing something called "probability proportional to size" sampling, and letting the item's number be the item's size. This means an item numbered 4 is twice as likely to be picked as an item numbered 2. It slightly abuses the official purpose of PPS sampling (where you're, say, looking for a sample made of students, and picking which schools -- which have different sizes -- to draw those students from) but it'll work for us.*
So with plenty of items of all types, I will draw a blue simple random sample of size k/3, a green simple random sample of size k/3, and a numbered green PPS sample of size k/3, and then just combine them to form my final sample.
We modify the strata sample size in order to cope with undersized pools, which are very frequent (remember the pool sizes are ~pareto). The item types are not necessarily evenly distributed, so instead of drawing strata samples of size k/3 I'm drawing samples of size k, so that if we have e.g. a whole pile of blue items and not very many green ones, the blue ones can cover for the green ones, and our final sample size k is maintained.
This causes a problem, because if I then have k blue, k green, and k numbered green items, I need to pick k/3 from each pile. I can just pick k/3 items with uniform probability from the blue and green piles because they're SRS, but here's my question: If I pick k/3 items with uniform probability from a PPS sample, does that change the sample distribution, and if so, how?
Here's another question: The numbered green items are, in fact, green; the plain green items are just items we don't have numbers for. Is it a really bad idea to toss any discarded numbered green items into the subpool under consideration for the plain green pile?
* incidentally, a simple random sample is a special case of a PPS sample where all the item sizes are 1, which is what gives me confidence in PPS as a method. In a simple random sample, the probability of item i being in a sample of size k drawn from a population of size N is k/N. In a PPS sample, if item i has size si, and the sum of all si in the population is S, the membership probability is k(si) / S. This obviously causes problems if k(si) > S, which happens either if k is too close to S or if you have some seriously off-size items in your pool. Normally if this happened you'd let item i be in your sample multiple times, but we're just waving our hands at it and ignoring this effect, because hey, it's engineering. Anyhow, you can see how if the size of all items is 1, si = 1 and S = N(1) = N, so you have k(1) / N = k/N. Yahtzee.
no subject
Date: 2012-03-02 08:02 pm (UTC)no subject
Date: 2012-03-02 08:22 pm (UTC)Did you mean k^2, or did you mean item size? Item size differs from item to item.
no subject
Date: 2012-03-05 05:26 am (UTC)I meant sample size * number_of_pools.
Treat each pool independently, then weight the final "extraction"based on whatever rules you want
Assuming you want to keep the actual "elements" in addition to the counts of blue,green and green numbered, you would keep all the samples in an ordered list and keep the counts.
When you're finished sampling, you will pull K samples from your collected pools based on your distribution rules
I think this works efficiently where the number Of pools is relatively small. It doesnt really care what e size of any pool is
no subject
Date: 2012-03-05 01:32 pm (UTC)(as far as relatively small: there are hundreds of pools and millions of items... we're juuust on the border of web scale)