infryq

So this is a thing: http://news.bbc.co.uk/2/hi/science_and_environment/10538198.stm

The results from the inquiry into the "hidden climate data undermining global warming!!1!1omg!" are back. There was no malpractice. The data are available, and you can reproduce similar results with about 2 days of coding to do the analysis.

The only* problems seem to have been (1) the data were hard to collect until 2009 when everything was archived, (2) some of the data collection could have been cleaner had they consulted a statistician, and (3) in one of the experiments it was unclear how the data were combined.

As a researcher who deals in dozens of data sets and subsamples and combinations and weird experimental conditions for the work we do with information retrieval (basically Google), I can understand all of these problems. If I made every data set I've ever used for every one-off sanity test and statistic available online, I would first go nuts trying to keep it all straight, and second run out of disk space several times over. I have had to make specific efforts not to blow away the data sets and demo versions of the software for all our old papers, "just in case" someone wants to use them. It's a pain in the ass.

On the other hand, you betcha if we published something awesome I'd make sure it was accessible from the outside. One-offs are one thing. Big steps are another. The trouble is noticing, from the inside, which one you're doing. So, okay, wrist slapped on that one. It's a similar thing with the data combinations nonsense: if you're just writing a minor paper that's got a finicky interaction in it, firstly, hardly anyone in academia can write, and secondly, minor papers don't have a reason to bring in a pro tech writer to explain your hard-to-explain thing. Big News, again, is another story; unfortunately we don't really have infrastructure for that. Find me a department that keeps a tech writer on retainer for their doctoral community to use for editing, revision, and wordsmithing, and I'll... seriously consider moving there. Statisticians are a different story: they're easier to find.

Statistics and study design are not easy, and many researchers are dreadfully undereducated here. There are so many ways to set up an experiment, and so many axes to split on, and a different specific protocol to follow for each. I have yet to find a book that lays out what all the axes are, so you can look up what you're doing in a table, page to a section, and read off the proper statistical method like a script. That would be awesome. There's also a lot more art and craft to statistics than many people realize. Whether a sample size is "enough" to consider the sample mean to be normally distributed is not something that you can really take out of a book (in high school they told us ~20 was enough; this past summer I took a refresher, and they told us ~30. Go figure). Classic frequentist statistics is highly counterintuitive and it's a shame that most research depends on its jargon (p-values, F-tests) without always providing a foundational understanding of what's going on.
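One rough way to see why "~20 or ~30" can't be read out of a book is to simulate it. The sketch below is only an illustration: the three test distributions, the trial count, and the function names are assumptions picked for the example, not anything from the inquiry or a textbook. It draws many samples of size n, takes each sample's mean, and reports how lopsided (skewed) the resulting means are; a Gaussian would give a skewness near 0.

```python
import random
import statistics


def skewness(xs):
    """Sample skewness; close to 0 for symmetric (e.g. Gaussian) data."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def sample_mean_skewness(draw, n, trials=4000, seed=0):
    """Draw `trials` samples of size n and return the skewness of their means."""
    rng = random.Random(seed)
    means = [statistics.fmean([draw(rng) for _ in range(n)]) for _ in range(trials)]
    return skewness(means)


# Hypothetical test distributions, from well-behaved to heavy-tailed.
DISTRIBUTIONS = {
    "uniform": lambda rng: rng.random(),
    "exponential": lambda rng: rng.expovariate(1.0),
    "lognormal(sigma=2)": lambda rng: rng.lognormvariate(0.0, 2.0),
}

for name, draw in DISTRIBUTIONS.items():
    for n in (20, 30):
        result = sample_mean_skewness(draw, n)
        print(f"{name:20s} n={n:2d}  skewness of sample mean = {result:.2f}")
```

In theory the skewness of a sample mean shrinks as the underlying skewness divided by sqrt(n), so for the exponential case it is about 0.45 at n=20 and 0.37 at n=30: neither is zero, and going from 20 to 30 buys very little. How many samples count as "enough" depends on the shape of what you're measuring, not on a magic number.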

Bringing in a statistician for every study you do is probably impractical. However, increased demand for statisticians (and the following increase in supply) is probably not a bad thing. I'm for it.

What do other people think? How do you keep research accountable for the big stuff where it may be important for a broader and less-well-prepared community to understand and have access to the results, and still keep it sustainable for the incremental, day-to-day results that are only read by a handful of people in the world? What's the responsible thing to do here?

* There was some intimation that the university either denied access or turned away people asking about the data, but I wasn't able to find details on the accusation. It's not clear how they would have denied people access to the data anyhow, since the lab was putting them online.


From:

noequal.livejournal.com

This is about the best post on this subject possible. I'm not sure if we need more statisticians per se, but we definitely need to increase numeracy in the country.

I don't care if folks know a normal distribution from a Poisson, but I'd be happy if they knew what a p-value is. I met some very smart people recently who have never taken a stats course. This makes me sad.

Anyway, that's my two cents: Increased numeracy across the board.

gustavolacerda.livejournal.com

I agree with your post. It's simply impractical to document everything you do. But if you can manage it, it definitely tends to make your conclusions more objective (see e.g. Domingos - A Process-Oriented Heuristic for Model Selection (www.cs.washington.edu/homes/pedrod/papers/mlc98.pdf)).

The near-Gaussianity of the sample mean statistic is a consequence of the CLT.

It all depends on how close to Gaussian the variable is in the first place. It's easy to come up with distributions for which even 1000 samples wouldn't bring you close to Gaussian (e.g. very high kurtosis distributions).

I think that educating people in Bayesian Stats would generally provide them with a more consistent worldview.

integreillumine.livejournal.com

I... can't actually answer the question fully right now. But I really do think every academic researcher should be required to take truly advanced statistics courses. When I worked in psychology, it was truly painful to realize how many researchers lacked any real knowledge of what these advanced stats tests were about. And human psychology is a field that usually has *more* inherent variability and more factors, which are very difficult to isolate at best, and its results are the ones that the common populace (and the psychological and scientific community alike) are most likely to be casually interested in, invested in, and act on without knowing the fuller or most accurate picture.
