[personal profile] infryq
So this is a thing: http://news.bbc.co.uk/2/hi/science_and_environment/10538198.stm

The results from the inquiry into the "hidden climate data undermining global warming!!1!1omg!" are back. There was no malpractice. The data are available. You can get similar results with about 2 days of code to do the analysis.

The only* problems seem to have been (1) the data were hard to collect until 2009 when everything was archived, (2) some of the data collection could have been cleaner had they consulted a statistician, and (3) in one of the experiments it was unclear how the data were combined.

As a researcher who deals in dozens of data sets and subsamples and combinations and weird experimental conditions for the work we do with information retrieval (basically Google), I can understand all of these problems. If I made every data set I've ever used for every one-off sanity test and statistic available online, I would first go nuts trying to keep it all straight, and second run out of disk space several times over. I have had to make specific efforts not to blow away the data sets and demo versions of the software for all our old papers, "just in case" someone wants to use them. It's a pain in the ass.

On the other hand, you betcha if we published something awesome I'd make sure it was accessible from the outside. One-offs are one thing. Big steps are another. The trouble is noticing, from the inside, which one you're doing. So, okay, wrist slapped on that one. It's a similar thing with the data combinations nonsense -- if you're just writing a minor paper that's got a finicky interaction in it, firstly hardly anyone in academia can write, and secondly a "minor paper" doesn't give you a reason to bring in a pro tech writer to explain your hard-to-explain thing. Big News, again, is another story -- unfortunately we don't really have infrastructure for that. Find me a department that keeps a tech writer on retainer for their doctoral community to use for editing, revision, and wordsmithing, and I'll... seriously consider moving there. Not like statistics. Statistics professionals are easier to find.

Statistics and study design are not easy, and many researchers are dreadfully undereducated here. There are so many ways to set up an experiment, and so many axes to split on, and a different specific protocol to follow for each. I have yet to find a book that lays out what all the axes are, so you can look up what you're doing in a table, page to a section, and read off the proper statistical method like a script. That would be awesome. There's also a lot more art and craft to statistics than many people realize. Whether a sample size is "enough" to consider the sample mean to be normally distributed is not something that you can really take out of a book (in high school they told us ~20 was enough; this past summer I took a refresher, and they told us ~30. Go figure). Classic frequentist statistics is highly counterintuitive, and it's a shame that most research depends on its jargon (p-values, F-tests) without always providing a foundational understanding of what's going on.

Bringing in a statistician for every study you do is probably impractical. However, increased demand for statisticians (and the following increase in supply) is probably not a bad thing. I'm for it.


What do other people think? How do you keep research accountable for the big stuff where it may be important for a broader and less-well-prepared community to understand and have access to the results, and still keep it sustainable for the incremental, day-to-day results that are only read by a handful of people in the world? What's the responsible thing to do here?

* There was some intimation that the university either denied access or turned away people asking about the data, but I wasn't able to find details on the accusation. It's not clear how they would have denied people access to the data anyhow, since the lab was putting them online.

Date: 2010-07-07 10:24 pm (UTC)
From: [identity profile] noequal.livejournal.com
This is about the best post on this subject possible. I'm not sure if we need more statisticians per se, but we definitely need to increase numeracy in the country.

I don't care if folks know a normal distribution from a Poisson, but I'd be happy if they knew what a p-value is. I met some very smart people recently who have never taken a stats course. This makes me sad.

Anyway, that's my two cents: Increased numeracy across the board.

on 20 vs 30 for the sample mean to be Gaussian

Date: 2010-07-08 11:52 pm (UTC)
From: [identity profile] gustavolacerda.livejournal.com
I agree with your post. It's simply impractical to document everything you do. But if you can manage it, it definitely tends to make your conclusions more objective (see e.g. Domingos - A Process-Oriented Heuristic for Model Selection (www.cs.washington.edu/homes/pedrod/papers/mlc98.pdf)).

The near-Gaussianity of the sample mean statistic is a consequence of the CLT.

It all depends on how close to Gaussian the variable is in the first place. It's easy to come up with distributions for which even 1000 samples wouldn't bring you close to Gaussian (e.g. very high kurtosis distributions).
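To make the point above concrete, here is a rough simulation sketch, not anything from the linked paper. It uses only the Python standard library, and the specific choices (an exponential as the "mild" case, a lognormal with sigma = 2 as the "heavy-tailed" case, skewness as the yardstick, and the function names) are illustrative assumptions. Skewness near 0 is only one symptom of being close to Gaussian, but it is enough to see the difference.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


def skew_of_sample_mean(draw, n, reps=1000, seed=0):
    """Simulate `reps` sample means, each over n draws, and return their skewness.

    A Gaussian has skewness 0, so a value far from 0 suggests the sample
    mean is still noticeably non-Gaussian at this n.
    """
    rng = random.Random(seed)
    means = [statistics.fmean([draw(rng) for _ in range(n)]) for _ in range(reps)]
    return skewness(means)


def exponential(rng):
    """Mildly skewed parent distribution (illustrative choice): exponential, rate 1."""
    return rng.expovariate(1.0)


def heavy_lognormal(rng):
    """Heavy-tailed parent distribution (illustrative choice): lognormal, sigma = 2."""
    return rng.lognormvariate(0.0, 2.0)


if __name__ == "__main__":
    for n in (20, 30):
        s = skew_of_sample_mean(exponential, n)
        print(f"exponential, n={n}: skewness of sample mean = {s:.2f}")
    s = skew_of_sample_mean(heavy_lognormal, 1000, reps=500)
    print(f"lognormal(sigma=2), n=1000: skewness of sample mean = {s:.2f}")
```

For the exponential, theory gives the sample mean a skewness of 2/sqrt(n), so 20 versus 30 is a small difference in degree rather than a threshold. For the heavy-tailed case, a few huge draws dominate the average, so the sample mean should stay visibly lopsided even at n = 1000.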

I think that educating people in Bayesian Stats would generally provide them with a more consistent worldview.

Date: 2010-07-09 05:51 pm (UTC)
From: [identity profile] integreillumine.livejournal.com
I... can't actually answer the question fully right now. But I really do think every academic researcher should be required to take truly advanced statistics courses. When I worked in psychology, it was truly painful to realize how many researchers lacked knowledge of what these advanced stats tests were really about. Human psychology is a field that has *more* inherent variability and more factors, which are very difficult to isolate at best. Its results are also the ones that the general public (and the psychological and scientific community alike) are most likely to be casually interested in, invested in, and act on without knowing the fuller or most accurate picture.
