infryq

Dear practical probstats fans,

Apologies for any bad terminology, it's been a loooong time since I had the terms of art loaded in braincache.

I have a pool of items which is of unknown size. I want to take a sample of size k from the pool. The size of the pool might turn out to be less than k; if so I'll just take the whole pool. That's okay. This is an engineering task, and not a theory task, so it doesn't have to be strictly sound or easy to describe in equations, it just has to avoid any nasty gotchas or corner case effects, and be "more useful" than just doing a simple random sample over the whole pool.

Because this is an engineering task, we're also interested in performance: We want to take the sample in a single pass There are a bunch of different pools, and the distribution of pool sizes follows a power law -- mostly small, but a few truly enormous ones that won't fit in memory and cause problems for complexity analyses that just wave their hands at constants. Going through some of these lists twice is a very bad idea.

Anyhow. The pool contains three types of items: blue items, green items, and numbered green items. To the extent that it's possible, I'd like to have an equal number of blue, green, and numbered-green items in my sample... but I would rather have k items than have the item types be perfectly balanced.

This is basically a stratified sample, but I'm going to fuss with it. Assuming I have plenty of items of all types, the blue and the green strata samples will just be a simple random sample. But for the numbered green items, I want to favor those items with higher numbers. To do this I'm doing something called "probability proportional to size" sampling, and letting the item's number be the item's size. This means an item numbered 4 is twice as likely to be picked as an item numbered 2. It slightly abuses the official purpose of PPS sampling (where you're, say, looking for a sample made of students, and picking which schools -- which have different sizes -- to draw those students from) but it'll work for us.*

So with plenty of items of all types, I will draw a blue simple random sample of size k/3, a green simple random sample of size k/3, and a numbered green PPS sample of size k/3, and then just combine them to form my final sample.

We modify the strata sample size in order to cope with undersized pools, which are very frequent (remember the pool sizes are ~pareto). The item types are not necessarily evenly distributed, so instead of drawing strata samples of size k/3 I'm drawing samples of size k, so that if we have e.g. a whole pile of blue items and not very many green ones, the blue ones can cover for the green ones, and our final sample size k is maintained.

This causes a problem, because if I then have k blue, k green, and k numbered green items, I need to pick k/3 from each pile. I can just pick k/3 items with uniform probability from the blue and green piles because they're SRS, but here's my question: If I pick k/3 items with uniform probability from a PPS sample, does that change the sample distribution, and if so, how?

Here's another question: The numbered green items are, in fact, green; the plain green items are just items we don't have numbers for. Is it a really bad idea to toss any discarded numbered green items into the subpool under consideration for the plain green pile?

* incidentally, a simple random sample is a special case of a PPS sample where all the item sizes are 1, which is what gives me confidence in PPS as a method. In a simple random sample, the probability of item i being in a sample of size k drawn from a population of size N is k/N. In a PPS sample, if item i has size si, and the sum of all si in the population is S, the membership probability is k(si) / S. This obviously causes problems if k(si) > S, which happens either if k is too close to S or if you have some seriously off-size items in your pool. Normally if this happened you'd let item i be in your sample multiple times, but we're just waving our hands at it and ignoring this effect, because hey, it's engineering. Anyhow, you can see how if the size of all items is 1, si = 1 and S = N(1) = N, so you have k(1) / N = k/N. Yahtzee.

S	M	T	W	T	F	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31

Most Popular Tags

adulting - 9 uses
adventures - 24 uses
baking - 116 uses
chores - 211 uses
cleaning - 22 uses
cooking - 110 uses
crochet - 13 uses
dreams - 388 uses
embroidery - 88 uses
family - 91 uses
family-health - 11 uses
fitness - 43 uses
food - 9 uses
food+ - 63 uses
food- - 51 uses
games - 77 uses
garden - 154 uses
health - 11 uses
health+ - 25 uses
health- - 47 uses
house - 76 uses
knitting - 88 uses
late-schedule - 21 uses
lll - 42 uses
market - 80 uses
mending - 20 uses
mentoring - 19 uses
mood- - 32 uses
p adventures - 16 uses
p-health - 10 uses
pottery - 55 uses
public-health - 13 uses
reading - 530 uses
repairs - 15 uses
research - 23 uses
self-care - 31 uses
sewing - 454 uses
shopping - 141 uses
social - 86 uses
spinning - 169 uses
therapy - 58 uses
travel - 12 uses
tv - 770 uses
vacation - 39 uses
weather - 25 uses
white-rotary-77 - 67 uses
work - 547 uses
work+ - 129 uses
work- - 12 uses
writing - 11 uses

Threaded | Top-Level Comments Only

From:

nanoplane.livejournal.com

So is it acceptable to take an initial sample of K* Sample size? then using the desired distribution rules, take the "real" sample out of that subset... This means you only traverse each pool once and can weight the results based on pool distribution or color or whatever..

infryq

Sample size = k.

Did you mean k^2, or did you mean item size? Item size differs from item to item.

From: (Anonymous)

Sorry not reading right
I meant sample size * number_of_pools.

Treat each pool independently, then weight the final "extraction"based on whatever rules you want

Assuming you want to keep the actual "elements" in addition to the counts of blue,green and green numbered, you would keep all the samples in an ordered list and keep the counts.

When you're finished sampling, you will pull K samples from your collected pools based on your distribution rules

I think this works efficiently where the number Of pools is relatively small. It doesnt really care what e size of any pool is

Each pool is treated separately. We don't want a sample that represents the entire corpus, we want a sample of items that represents a pool, so we can do something with that sample (i.e. use it to find other unclassified items that are similar to the sample, so we can add those items to the pool) and then move on to the next pool. Each item is unique within a pool; you can think of them as words in a topic model or photographs in an object recognition model. The blue/green counts are just to ensure the sample is balanced between items from different sources (and some of the green ones are timestamped, so we want to weight them for recency).

(as far as relatively small: there are hundreds of pools and millions of items... we're juuust on the border of web scale)

(no subject)

(no subject)

no subject

no subject

no subject

no subject

Profile

August 2022

Most Popular Tags

Page Summary

Active Entries

Style Credit

Expand Cut Tags