infryq: Kitchen scene at dawn, post-processed to appear as if painted (Default)
[personal profile] infryq
Dear practical probstats fans,


Apologies for any bad terminology, it's been a loooong time since I had the terms of art loaded in braincache.

I have a pool of items which is of unknown size. I want to take a sample of size k from the pool. The size of the pool might turn out to be less than k; if so I'll just take the whole pool. That's okay. This is an engineering task, and not a theory task, so it doesn't have to be strictly sound or easy to describe in equations, it just has to avoid any nasty gotchas or corner case effects, and be "more useful" than just doing a simple random sample over the whole pool.

Because this is an engineering task, we're also interested in performance: We want to take the sample in a single pass There are a bunch of different pools, and the distribution of pool sizes follows a power law -- mostly small, but a few truly enormous ones that won't fit in memory and cause problems for complexity analyses that just wave their hands at constants. Going through some of these lists twice is a very bad idea.

Anyhow. The pool contains three types of items: blue items, green items, and numbered green items. To the extent that it's possible, I'd like to have an equal number of blue, green, and numbered-green items in my sample... but I would rather have k items than have the item types be perfectly balanced.

This is basically a stratified sample, but I'm going to fuss with it. Assuming I have plenty of items of all types, the blue and the green strata samples will just be a simple random sample. But for the numbered green items, I want to favor those items with higher numbers. To do this I'm doing something called "probability proportional to size" sampling, and letting the item's number be the item's size. This means an item numbered 4 is twice as likely to be picked as an item numbered 2. It slightly abuses the official purpose of PPS sampling (where you're, say, looking for a sample made of students, and picking which schools -- which have different sizes -- to draw those students from) but it'll work for us.*

So with plenty of items of all types, I will draw a blue simple random sample of size k/3, a green simple random sample of size k/3, and a numbered green PPS sample of size k/3, and then just combine them to form my final sample.

We modify the strata sample size in order to cope with undersized pools, which are very frequent (remember the pool sizes are ~pareto). The item types are not necessarily evenly distributed, so instead of drawing strata samples of size k/3 I'm drawing samples of size k, so that if we have e.g. a whole pile of blue items and not very many green ones, the blue ones can cover for the green ones, and our final sample size k is maintained.

This causes a problem, because if I then have k blue, k green, and k numbered green items, I need to pick k/3 from each pile. I can just pick k/3 items with uniform probability from the blue and green piles because they're SRS, but here's my question: If I pick k/3 items with uniform probability from a PPS sample, does that change the sample distribution, and if so, how?

Here's another question: The numbered green items are, in fact, green; the plain green items are just items we don't have numbers for. Is it a really bad idea to toss any discarded numbered green items into the subpool under consideration for the plain green pile?

* incidentally, a simple random sample is a special case of a PPS sample where all the item sizes are 1, which is what gives me confidence in PPS as a method. In a simple random sample, the probability of item i being in a sample of size k drawn from a population of size N is k/N. In a PPS sample, if item i has size si, and the sum of all si in the population is S, the membership probability is k(si) / S. This obviously causes problems if k(si) > S, which happens either if k is too close to S or if you have some seriously off-size items in your pool. Normally if this happened you'd let item i be in your sample multiple times, but we're just waving our hands at it and ignoring this effect, because hey, it's engineering. Anyhow, you can see how if the size of all items is 1, si = 1 and S = N(1) = N, so you have k(1) / N = k/N. Yahtzee.

Date: 2012-03-02 08:02 pm (UTC)
From: [identity profile] nanoplane.livejournal.com
So is it acceptable to take an initial sample of K* Sample size? then using the desired distribution rules, take the "real" sample out of that subset... This means you only traverse each pool once and can weight the results based on pool distribution or color or whatever..

Date: 2012-03-05 05:26 am (UTC)
From: (Anonymous)
Sorry not reading right
I meant sample size * number_of_pools.

Treat each pool independently, then weight the final "extraction"based on whatever rules you want

Assuming you want to keep the actual "elements" in addition to the counts of blue,green and green numbered, you would keep all the samples in an ordered list and keep the counts.

When you're finished sampling, you will pull K samples from your collected pools based on your distribution rules

I think this works efficiently where the number Of pools is relatively small. It doesnt really care what e size of any pool is

Profile

infryq: Kitchen scene at dawn, post-processed to appear as if painted (Default)
infryq

August 2022

S M T W T F S
 1 23456
78910111213
14151617181920
21222324252627
28293031   

Most Popular Tags

Style Credit

Expand Cut Tags

No cut tags
Page generated Mar. 16th, 2026 11:40 am
Powered by Dreamwidth Studios