The Power — and Perils — of Statistics

A recent article in ScienceNews calls into question scientific results established by statistics. It was excerpted by WUWT at great length (I wonder whether Anthony Watts knows the difference between “fair use” and violation of copyright), apparently in an attempt to discredit global warming science because, after all, it uses some statistics. Of course Watts fails to consider the outrageous statistical follies coming from his side of the fence (many from himself and his contributors). Lubos gets in on the act, purportedly defending statistics but insisting on a ridiculously high confidence level — and of course, getting in some potshots at global warming science himself.

The ScienceNews article argues for caution in accepting published results based on statistical analysis, specifically mentioning results in medical research. It’s based on several papers, most notably Ioannidis (2005), whose results are disputed by some, but which definitely contains insights worth being aware of. The gist of it is that in many circumstances statistically-based results should be taken with not just a grain, but a whole block of salt. WUWT of course titles its article with a quote from the ScienceNews article: “science’s dirtiest secret: The ‘scientific method’ of testing hypotheses by statistical analysis stands on a flimsy foundation.”

First let’s be clear about one thing: the foundation is not flimsy, it’s solid as a rock. Statistics works, it does what it’s supposed to do. But it is susceptible to misinterpretation, to false results purely due to randomness, to bias, and of course to error. That’s what the ScienceNews article is really about, although it takes liberties (in my opinion) in order to sensationalize the issue. But hey, that’s what magazines (not peer-reviewed journals) do.

One of the dangers inherent in statistical results is over-reliance on the “p-value.” The p-value is the probability of getting a result at least as extreme as the one observed purely by random accident, when there’s no real effect and the “null hypothesis” (usually, that there’s nothing interesting happening at all) is true. The de facto standard p-value for “statistical significance” is 0.05, or 5%, and when p-values are smaller than this the result is often said to be established with “95% confidence.”

Problem is, a p-value of 5% doesn’t mean that the alternate hypothesis (the new drug reduces the chances of a heart attack) is 95% certain. For one thing, the p-value only addresses the null hypothesis, so a significant result is evidence against the null, but not necessarily for the alternate. Any deviation from the null hypothesis can produce a significant result, even deviations that have nothing to do with the alternate.

For another thing, a p-value of 5% means that even when the null hypothesis is true, we can still expect the result to happen 5% of the time. That’s a small but still substantial probability. The 5% level is a good choice, and has served science well for a century, but no choice is perfect. Anything less and it’s just too hard to get results — too many important truths will be missed. Anything more, and the fraction of falsehoods will be so great that truth will get lost in the noise. There’s a reason that 5% is the de facto standard: in most cases it works. Well. Even so, we need to be aware that “false alarms” not only can, but will happen (that too is the nature of statistics). That’s why replication (when possible) is so important.
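
To see those false alarms in action, here’s a minimal sketch (my own toy simulation, not from the ScienceNews article or any study it cites), using Python with numpy and scipy, of many experiments in which the null hypothesis is true:

```python
# Toy simulation: when the null hypothesis is true, about 5% of experiments
# still come out "significant" at the 0.05 level, purely by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_experiments, n_per_group = 10_000, 30

false_alarms = 0
for _ in range(n_experiments):
    # Both "treatment" and "control" come from the same distribution, so any
    # apparent difference is a statistical fluctuation.
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(0.0, 1.0, n_per_group)
    false_alarms += stats.ttest_ind(a, b).pvalue < 0.05

print(false_alarms / n_experiments)  # close to 0.05
```

Roughly 5% of these purely random “experiments” come out “significant,” which is exactly the false-alarm rate the threshold promises.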

Another important consideration is that often (particularly in medical research) we have prior information that the chance of the alternate hypothesis being true is very small. This gives us a useful prior probability with which we can apply a Bayesian analysis. There’s great value in Bayesian analysis, and it sometimes reveals that although a given observation is unlikely (even 5% unlikely) under the null hypothesis, it’s even more unlikely under the alternate hypothesis. In that case, in spite of the apparently significant p-value, the evidence is against the alternate. Of course, Bayesian analysis has its own hazards, including the fact that if the prior probability isn’t known precisely, then it becomes a choice which is susceptible to (and often accused of) bias.
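
As a hedged numerical sketch of that point (the prior, power, and threshold below are invented for illustration, not taken from any real trial), here’s the base-rate arithmetic in Python:

```python
# Invented numbers: if only 1 in 100 candidate drugs actually works, a single
# p < 0.05 result is nowhere near 95% certainty that the drug works.
prior = 0.01   # assumed prior probability that the alternate hypothesis is true
power = 0.80   # assumed probability of detecting a real effect when it exists
alpha = 0.05   # false-alarm rate under the null

p_significant = power * prior + alpha * (1 - prior)
posterior = power * prior / p_significant
print(round(posterior, 2))  # about 0.14
```

With those assumptions, roughly six out of seven “significant” findings would still be false positives, which is essentially the base-rate effect behind Ioannidis’s argument.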

Yet another danger is inherent in the way scientific results are selected by researchers for publication. A “significant” result is far more likely to be touted than one which is not, especially in fields like clinical trials which have immediate and important financial consequences. Negative results are in fact important, but they don’t get the press, publication, or visibility of positive results. This can lead to the phenomenon where 20 different research groups are investigating the same question, and only one of them gets a result which is significant at the 5% p-value level, but that’s the one which gets published. With 20 experiments, we expect one of them to give a p-value of 5% or less — but the others aren’t visible in the scientific literature, so the published evidence is one-sided.
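
The arithmetic behind that 20-groups scenario is easy to check (a back-of-the-envelope calculation of my own, assuming independent experiments and a true null in every case):

```python
# With 20 independent experiments and a true null hypothesis, how likely is it
# that at least one of them reaches p < 0.05 just by chance?
alpha, n_groups = 0.05, 20
expected_false_positives = alpha * n_groups
p_at_least_one = 1 - (1 - alpha) ** n_groups
print(expected_false_positives)  # 1.0 false positive expected on average
print(round(p_at_least_one, 2))  # about 0.64
```

So a “significant” finding somewhere among the 20 groups is more likely than not, even if the effect doesn’t exist at all.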

Another aspect of the same issue is that often an investigation will study a vast number of possible factors. One might, for instance, study the influence of 30,000 genes on the likelihood of developing heart disease. Using a 5% p-value, we expect 1,500 of the genes to show a “significant” result just by accident, even if none of them has any real effect. The vast number of “false positives” can be properly dealt with by statistics, but unfortunately it’s not always done properly. I myself first met this issue when studying variable stars to look for changes in their periods of variation. In a sample of some 400 such stars, about 20 showed “significant” period changes — which is just what you’d expect by chance, if none of them had really changed period.
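
Here’s a rough sketch of that situation (a toy version of the 400-star example, using Python with numpy and scipy; the simple Bonferroni correction shown is just one way to account for the number of tests):

```python
# Test 400 "stars" whose periods have NOT actually changed, and count how many
# look "significant" at the 5% level, before and after correcting for the
# number of tests performed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_stars, n_obs = 400, 50

p_values = np.array([
    stats.ttest_ind(rng.normal(0, 1, n_obs), rng.normal(0, 1, n_obs)).pvalue
    for _ in range(n_stars)
])

print((p_values < 0.05).sum())            # roughly 20 false "detections"
print((p_values < 0.05 / n_stars).sum())  # Bonferroni-corrected: usually 0
```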

Another factor which we all try to avoid but is still very real is bias. This is especially problematic when there’s anything subjective about the data. In clinical trials, for instance, it’s sometimes necessary to exclude certain subjects from the analysis for legitimate reasons. But there’s at least temptation to preferentially exclude those which contradict the preferred hypothesis, while preferentially retaining those which support it. And if any of the measures is anything less than perfectly precise, there’s an all-too-human tendency to rate the “treatment” subject’s health a notch better while rating the “placebo” subject’s response a notch worse. This emphasizes the importance of the classic “double blind” experiment, which is highly desirable, but frankly, just not always possible.

It also happens that, even with a single data set, researchers who are motivated to support a preferred hypothesis will keep trying new statistical tests until they find one that gives what they want. It’s possible to compensate statistically for such a “suite” of tests, but that is often not done; ideally the suite of tests should be decided on before the data are even collected, but that’s even rarer.
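
A quick sketch of how much that “keep trying tests” habit inflates the false-alarm rate (again a toy simulation of my own; the four tests below are arbitrary stand-ins for a suite):

```python
# On data where nothing is going on, apply several different two-sample tests
# and count how often the *best* of them dips below 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_trials, hits = 5_000, 0

for _ in range(n_trials):
    a, b = rng.normal(0, 1, 30), rng.normal(0, 1, 30)
    best_p = min(
        stats.ttest_ind(a, b).pvalue,     # t-test on means
        stats.mannwhitneyu(a, b).pvalue,  # rank-based test
        stats.ks_2samp(a, b).pvalue,      # test on the whole distributions
        stats.levene(a, b).pvalue,        # test on variances
    )
    hits += best_p < 0.05

print(hits / n_trials)  # noticeably more than 0.05
```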

As if that weren’t bad enough, accidents happen, and I’m not referring to statistical fluctuations but to plain old errors. These can be data errors or faulty analysis. After all, most scientists are not statisticians, and even standard statistical tests have subtleties which are often unappreciated. To err is human.

All told, there are many ways for the statistical analysis of experiments to give incorrect results. This may be especially true in medical research, for which the financial incentive is high, the prior probability is often very low, and the sample size may be severely limited by circumstances beyond anyone’s control. But that hardly means that the foundation of statistical analysis is flimsy; that’s just sensationalism. Nor do we need to adhere to an impractically high standard of statistical significance — as desirable as it is, effects are often small and gathering more data (as in clinical trials) can be very expensive and time-consuming, while delays in availability of new treatments can be devastating for a patient with serious disease and few treatment options.

It’s important for researchers to be aware of statistical pitfalls, valuable to consider all available results (including negative ones), and wise for the scientific community and the public to take results with at least a grain of salt. But distrusting statistics on principle is not just incorrect, it’s a foolish choice.

100 responses to “The Power — and Perils — of Statistics”

  1. My cynical side tells me Watts is just trying to warm his readership up for why he hasn’t published, and will never publish, any analysis of the USHCN. To analyze that data would mean statistics, and we already “know” (via ScienceNews) that statistics in science are fatally flawed. Ergo, he’s off the hook, and can stick with his pictures.

    He’s just rationalizing his stupidity.

    Sadly, the newest “The Economist” takes Watts seriously. They mention his surfacestation.org science fair project, and Menne’s work in response, but let Watts get away with his “my analysis is forthcoming” dodge. It’s too bad the article didn’t point out the smears, slurs, and defamation that Watts freely provides at and allows from WTFWT.

  2. Jonathan Gilligan

    I’ve found Ziliak and McCloskey’s polemic, “The Cult of Statistical Significance” (U. Michigan, 2008) very useful for thinking clearly about statistical significance, practical significance, and getting away from considering significance as a Boolean property.

    A nice, early, discussion of these matters, far removed from climate change, but very applicable nonetheless, is J.M. Brophy and L. Joseph, “Placing Trials in Context Using Bayesian Analysis: GUSTO Revisited by Reverend Bayes,” JAMA 273, 871-875 (1995), PMID 7869558.

  3. How about a post on power at some point?

  4. Daniel the Yooper

    Tamino:

    Thanks for the great commentary for laymen like me. I’ve taken the liberty of sharing your insights on statistics and analysis with my pharmaceutical counterparts (with proper source attribution of course).

    You continue to build on your foundation of being a great resource for both the scientific community and the public at large.

    Thanks again!

    Daniel the Yooper

  5. Great summary of the issues, especially on the medical side of things, which is my field. It’s critical to understand those constraints on what would otherwise be ideal conditions — double-blind RCTs with n=thousands. That’s not the way the world spins, unfortunately, and you describe the trade-off very well.

  6. Excellent analysis. It has been my experience that there is a tremendous variation in statistical sophistication among scientists. Usually, in a field, you’ll have at least a few statistically savvy folks who will keep the rest of us honest.

    The denialists…not so much. Watts’s analysis makes it clear that his target is not just climate science, but science itself.

  7. Well, Watts’s blog itself illustrates the problems of employing poorly designed and/or poorly understood statistical methods.

    But I don’t think we should generalize the incompetence of Watts and his guest posters to the whole field. Statistics can still be useful.

  8. I’m a lawyer by training, and the statistical analyses just don’t look jury-ready to me. We’ve seen, especially from the contrarians and denialists, that statistical reports can be misinterpreted, sometimes intentionally.

    So as a lawyer I want to point out that the effects of warming are not statistically derived in any way. No statistical argument can deny the dramatic worldwide decline in glaciers. No statistical argument can deny the dramatic worldwide shift in migratory bird nesting and breeding. No statistical argument can deny the earlier blooming of the cherry trees in Washington or Japan. The measures of increasing CO2 are not based on a statistical study.

    In fact, many of the claims of Watt and his colleagues in contrariness are based on attempts to minimize the real, measured data, with statistical sleight-of-hand.

    Watt may be right to headline the claim that some statistical claims are flimsily based. He’s got to be astoundingly inept not to recognize that the article is an indictment of his attempts to disprove warming by claiming hard measures of heat can be statistically manipulated to say what Watt wants them to say.

    • Gavin’s Pussycat

      Ed, especially as a lawyer you need to grok Bayesian. Perhaps you are already aware, but I would recommend Jaynes (2003).

      • Dikran Marsupial

        I would certainly second the recommendation of Jaynes’ excellent book. Bayes factors seem much more appealing than frequentist hypothesis testing, as they give at least a direct answer to the question we actually want answered, namely “how much greater is the evidence for the alternative hypothesis than that for the null hypothesis?”

        As for priors adding a bias (when not using uninformative priors), the Bayesian approach does have the advantage of making prior beliefs explicit; if you disagree with the reasoning behind the stated prior, then substitute your own and see if it makes a significant [sic] difference.

  9. Daniel J. Andrews

    So Watts et al run around claiming there’s no statistically significant level of warming at 0.05 (re: Jones comment), and now they turn and say how flimsy statistical analysis is? Does this mean if warming is statistically significant at 0.05, they’ll dismiss it as “flimsy”, and when it is not statistically significant, they’ll say “See, it’s not statistically significant.”? You’re going to get fat eating all that cake and having it too.

    Incidentally, I noticed Stuart Hurlbert commented in that post. Our biostatistics prof had us pretty much memorize Hurlbert’s 1984 paper on pseudoreplication and for good reason. It seems half the ecology papers I read reference Hurlbert (1984), and go out of their way to show they’re not committing the sin of pseudoreplication. :-)

  10. Adrian Cockcroft

    I see two sides to this issue. On the one hand, researchers should know how to use statistics correctly, and part of peer review is to make sure that their published results are following best practices in statistics. On the other hand, the general public is statistically illiterate and many people make public assertions and decisions that are based on statistical fallacies. I think the “Fooled by Randomness” and “Black Swan” books by Taleb are a useful popularization of these fallacies for people who aren’t climate science researchers but are trying to understand and evaluate the claims and counter-claims.

  11. Philippe Chantreau

    Funny how nobody is saying that insurance companies are built on a flimsy foundation…

  12. Whattzit’s copyright-violating parading of that Science News article yet again reveals his lack of background in how science is actually done–real science, in practice, as opposed to what students do in science classes. Oh, wait, Whattzit never got a science degree, and he’s not a working scientist. Well, he read a magazine article.

    Working scientists know that statistical tests are merely some of the tools in the really big toolbox labeled “The Scientific Process,” which is accurately defined by “science is what scientists do.”

    Real, working scientists do not abandon their alternate hypothesis merely because one statistical test fails to reject the null hypothesis. That happens only in classroom exercises. That’s because the “hypotheses” being tested by the statistical tests are not identical to the scientist’s hypotheses of the domain! The statistical test is specific to the particular experimental setup or other data collection method.

    For example, if the scientist samples a population other than the population that is relevant to the scientist’s real (substantive, domain) hypothesis, then the statistical test will be completely correct, but only about that irrelevant population, and so will be irrelevant to the scientist’s real hypothesis.

    The lay media usually incorrectly translate statistical test probabilities about one experiment into statements such as “That means there is only a 5% chance that this drug does anything.” In fact, that experiment’s statistical test means only that there is a 5% chance of seeing a difference that large between group A’s dependent measure and group B’s dependent measure if the drug did nothing, under a slew of assumptions and in the very specific circumstances of that study. The probability of the drug doing anything can be computed only on the basis of a much larger set of evidence than that one experiment.

    Scientists do not often use Bayesian statistics formally, but in fact all scientists are Bayesians, because all humans are Bayesians insofar as we all take into account multiple lines of evidence, weighting each piece of evidence based on our confidence in it. The giant toolbox that is The Scientific Process includes all that.

    Real, working scientists know all that.

  13. Peter Coles wrote an excellent commentary on the ScienceNews article over at his blog

    http://telescoper.wordpress.com/2010/03/19/sciences-dirtiest-secret/

  14. The fundamental problem with over-reliance on statistics is the ease with which one can overlook unknown dependencies.

    In my misbegotten youth, within our group of delinquents we had more than a few who made good money tossing coins.

    If one imparts the same angular momentum to a coin, the relationship between its initial condition and its ending condition is quite remarkable.

    The 50-50 statistical probability of a coin toss assumes 3 things.

    Coins are balanced (they are not; the heads side is almost always heavier than the tails side).
    A random initial starting point.
    Randomly applied angular momentum.

    When I toss a coin it lands face down in my hand. If you bet heads, I will open the palm of my hand to reveal tails; if you bet tails, I will flip it over onto the back of my other hand to reveal heads.

    Lesson 1 – statistics and probability – the coin toss AKA how to lose your lunch money gambling with harry and his delinquent friends by over reliance on statistics and probability.

    • harrywr2 (is the 2 silent?),

      Actually, your little missive does highlight what really is one of the dirty little secrets of statistics–the difficulty in defining a truly random event. It is something Kolmogorov never succeeded in doing to his own satisfaction, although his algorithmic approach probably comes the closest. To my knowledge, it remains a very fundamental and unsolved problem.

      Again, though, as your little applied probability experiments show, it’s only a problem if only one party knows the coin isn’t really random.

  15. http://www.stat.columbia.edu/~cook/movabletype/archives/2010/03/more_on_misunde.html

    (quoting only his postscript, which we are in the process of falsifying – I’ve put a couple of pointers there and recommend others add pointers as other discussions worth tracking are noticed):
    “P.S. If there were a stat blogosphere like there’s an econ blogosphere, Siegfried’s article would’ve spurred a ping-ponging discussion, bouncing from blog to blog. Unfortunately, there’s just me out here and a few others (see blogroll), not enough of a critical mass to keep discussion in the air.”

    Ha! refutation in progress.

  16. One reason why double-blind studies are so highly touted in biology and medicine is how easy it is to accidentally introduce bias. Let’s say that you do a drug trial, and there’s a subject who isn’t responding in the expected manner to treatment. So, curious, you take a look at the subject’s chart–and what do you know? Somebody made a mistake! This guy wasn’t even eligible for the study. Obviously, he should be excluded from the analysis, right? Wrong, because you didn’t take a second look at the guys who did respond in the expected manner, so any mistakes there remain uncaught. So when you subsequently apply statistics, you find that the treated and control groups are significantly different. Because of course they are–different in the degree of scrutiny for possible errors.

    In the absence of double blind, you have to follow a stringent rule–nothing that you’ve learned from looking at the results can influence how you analyze the data. All decisions about how you will analyze the data, what statistical tests will be used, and the criteria for excluding bad data have to be made before you look at the data. Otherwise, you are cherry-picking and your p values are going to be wrong even if you do the math right. This is a hard prescription to follow–and one that is routinely violated by global warming critics.

  17. Agreeing with AndyB that power is an important consideration … I thought that 15 years ago, with a suite of freeware for power analysis and a few papers in ecology journals touting the importance of power consideration, power analyses would be much more common by now. They still seem rare. I have suggested to the steering committee of my agency’s funding branch that all proposals should include some sort of power analysis in the section of the proposal that details deliverables. That suggestion was never adopted. (Perhaps it’s because the reviewers wouldn’t know what to make of it?)
    Anyway, it seems that funding agencies are more often requiring a social sciences component in interdisciplinary research in natural sciences. I think an important contribution of this component might be to balance Type I and Type II errors on the basis of what is available (time and money) for the research. If power can be estimated and is reported, it should reduce the frequency of unreported non-significant results (less of a file drawer problem).

    [Response: It’s certainly valuable to report IF it can be estimated — but that’s the rub. The null hypothesis is generally specifically chosen to make it possible to calculate the probability of a type I error, but the alternate often doesn’t permit such easy computation of the probability of a type II error.]
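
    A minimal numeric sketch of the point in that response (an editorial illustration, not part of the original exchange; Python with scipy): a type II error rate can only be computed once a specific alternate is assumed, here an effect size d in standard-deviation units, which is exactly the rub.

    ```python
    # Normal approximation to the power of a two-sided, two-sample test.
    # The effect sizes d below are assumed values, not from any real study.
    from scipy.stats import norm

    def approx_power(d, n_per_group, alpha=0.05):
        z_crit = norm.ppf(1 - alpha / 2)
        shift = d * (n_per_group / 2) ** 0.5
        return norm.cdf(shift - z_crit)

    for d in (0.2, 0.5, 0.8):  # "small", "medium", "large" assumed effects
        print(d, round(approx_power(d, 30), 2))  # roughly 0.1, 0.5, 0.9
    ```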

  18. Dear Tamino,

    five sigma is not a “ludicrously high” confidence level. In more serious disciplines of hard science, it’s the standard threshold for claiming a discovery of an effect. Search for “five sigma discovery” – you will find nearly 2,000 pages on Google:

    http://www.google.com/search?q=“five+sigma+discovery”

    It’s never ludicrous because the amount of data/events that you typically need for a 5-sigma discovery is just roughly 3 times bigger than the data for a 3-sigma discovery: the number of required events often increases just as the square of the number of standard deviations because the relative error often goes like 1/sqrt(N).

    So if a hypothesis is correct, it’s very likely that scientists will be able to do just a little bit of extra work to increase their 3-sigma claims to 5-sigma claims. If they can’t do it, it is really mostly because their original conjecture was wrong and the 3-sigma evidence was fake or a false positive (sometimes a cherry-picked, deliberately looked for one).

    Cheers
    Lubos

    [Response: There’s nothing wrong with 5-sigma, we’d all love to have it all the time, but *requiring* it is indeed ludicrous. The purpose of statistics is to get as much information as possible from the available data; a 5-sigma standard ignores *most* of what can be learned. How foolish.

    It takes money, effort, and time — often, lots of each — to do research, and the suggestion that results which don’t reach 5 sigma should be discounted is nothing but a recipe for missing out on a lot of good science. In some fields, particularly medical research, getting more data involves a helluva lot more than just running the experimental apparatus 3 times as often. When testing a treatment for dying AIDS patients, it can be prohibitively expensive or time-consuming to just expand the scope of the trials by three. Frankly, it’s unethical to impose unnecessary delays on, or simply eliminate, a potential breakthrough, just because someone has a statistical stick up his ass. Sometimes it’s downright impossible — how many times should we repeat the global-warming experiment?

    As for your implications of evidence being “fake” or “cherry-picked, deliberately looked for” — is that how you do science? Not me. Subconscious bias is hard to eliminate, but implying deliberate fraud without proof is a cowardly way to discredit science you don’t like.]

  19. Well said, Tamino! I knew Anthony was engaging in misdirection, but my first year stats were too long forgotten to explain it well.

  20. Dear Tamino,

    once again, it’s not “ridiculous” to require a 5-sigma confidence as a condition for a serious scientist’s claim of a discovery of an effect. It’s what particle physicists, cosmologists, and many others require all the time. They require it from themselves and others, too. You would never see a serious team’s paper claiming a 3-sigma discovery of a new quark. Everyone knows it would be preposterous to make such far-reaching statements. In a serious discipline, 3 sigma is just a vague hint, not real evidence.

    The purpose of statistics as a tool to fool the people may be to get “as much information as possible”. [edit]

    [Response: How pathetic that you again focus on vague implications of fraud to justify your idiotic standard for statistics. Maybe that *is* the way you do science. I find your attitude disgusting. I think we’ve heard all from you we need to.]

  21. Hi Tamino, thanks for your reply. I think it’s important, in light of Lubos’s use of particle physics examples, to note that different disciplines have different acceptable/conventional Type I errors. He’s a weenie for suggesting that everybody has to adopt the same one as somebody else without considering other factors (like power or the implications of failing to reject a false null hypothesis). In some fields, it’s difficult to control lots of variables, so it’s more important to replicate studies under different environments to learn how broadly some finding applies.
    As a fisheries guy, I would much rather have some hypothesis test replicated independently several times and then compare Fisher’s combined p-value in a meta-analysis to some less stringent Type I error than to have one larger study testing the hypothesis in a much narrower context and comparing to a much smaller Type I error.
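
    A small sketch of that meta-analysis idea (editorial, with hypothetical p-values; Fisher’s method is available in scipy):

    ```python
    # Combine p-values from several independent replications of the same test.
    # The five p-values are invented: none is individually significant at 0.05,
    # but all point the same way.
    from scipy.stats import combine_pvalues

    p_values = [0.09, 0.12, 0.07, 0.11, 0.08]
    stat, p_combined = combine_pvalues(p_values, method="fisher")
    print(round(p_combined, 3))  # well below 0.05 when the studies agree
    ```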

  22. You would never see a serious team’s paper claiming a 3-sigma discovery of a new quark.

    This is a far more extraordinary claim (“new quark discovered”) than something like “does recent warming appear to be a significant trend?”.

    Doesn’t surprise me at all that the bar’s raised higher.

  23. Horatio Algeranon

    Wonder what Motl thinks of the Gravity Probe B Frame dragging result

    “Our latest data analysis indicates clear observation of frame-dragging. The statistical uncertainty is 14% (~5 marcs/yr).”

    Or of this earlier result “Confirming the Frame-Dragging Effect with Satellite Laser Ranging”

    the Satellite laser tracking to LAGEOS-1 and -2 appears to confirm General Relativity’s prediction of the Lense-Thirring precession at the 8-12% level (1-sigma)

    But hey, it’s always a piece of cake for scientists to confirm this stuff to 5-sigma (or better) — and anything less is just useless fluff.

    Or, as Motl puts it

    If they can’t do it, it is really mostly because their original conjecture was wrong and the 3-sigma evidence was fake or a false positive

  24. Lubos, I think you are being disingenuous. In fields like particle physics, where “bump hunting” is common, the five-sigma rule makes sense so that researchers do not fool themselves. Indeed, I have seen beautiful six-sigma signals–in the wrong-sign particles.

    This is a far cry from what is done in much of physics and the rest of science where theory dictates where to look and what to look for. A 90-95% CL result is often sufficient to “take to the bank.”

    There is also the question of what a confidence interval means. Suppose we applied your 99.99% CI rule to CO2 sensitivity. True, it would probably extend the confidence interval down to about 1 degree per doubling–but on the high side, given the rightward skew of probability, it would extend the upper bound to 7 or 8 degrees per doubling. This upper bound is so high that it would utterly dominate the risk calculus of climate change. That is, forcing us to consider the upper range of sensitivity, which is most likely nonphysical, would require extreme, draconian action. Careful what you wish for, Lubos.

  25. Jonathan Gilligan

    My previous post had a couple of typos. Please kill it and post this instead:

    Lubos misses the important matter of distinguishing the practical significance (e.g., clinical significance) from statistical significance. One way of looking at this is to consider the cost of a Type-I error and the cost of a Type-II error. In quark-hunting, the cost of a Type-I error is high (incorrectly reporting a new discovery), while a Type-II error is much less costly (delaying the correct report of a new discovery). I am treating delays as potential Type-II errors because delaying action is a decision to act, at least temporarily, as though the null hypothesis is true; thus, delaying action while waiting for further tests can, in practice, be a type-II error if knowing the true state of things would compel prompt action.

    Thus, it pays quark hunters to wait until they’re very certain.

    In testing AIDS drugs, these sorts of Type-II errors (delayed release of effective drugs) are costly, as Tamino points out (dead patients). Similarly, demanding five-sigma evidence for a heart attack or stroke before starting treatment would be very foolish. Thus, it’s often a bad choice in medicine to wait for too much certainty before making a decision. Consider, for instance, the costs of excessive diagnostic testing or the well-known ACT-UP criticisms of the FDA for delaying approval of AZT in the late 1980s.

    Sheila Jasanoff makes a very useful distinction, in The Fifth Branch, between “research science” in which the cost of delaying reporting is small enough that one can wait for arbitrary degrees of certainty and “regulatory science,” where the cost of delaying action can be large enough that regulators do not have the luxury of long delays and must thus accept making important and costly decisions under great uncertainty.

    Steven Pacala et al. developed this line of thought further in a pithy paper, “False Alarms over Environmental False Alarms,” Science 301, 1187-88 (2003), in which they conclude, using cost-benefit analysis, that current regulatory policy is inefficient because it worries too much about false-alarms (Type-I errors) and too little about missed alarms (Type-II errors).

  26. Lubos, most people have to pay a high price for each datum, it comes slowly, and you can’t get more than a few points at a time. In my research area (neuropsychology) we’ve produced some findings that are ridiculously powerful (R>.8, predicting psychomotor speed on the basis of age, sex and education), some that are middling (effect size of 0.4, the superiority of women on tests of psychomotor speed), and some that are small (p = .032, the significance level for the effect of exercise on executive functioning in our study of 40 people). It depends on the study, the ease of collecting and expanding sample sizes, and the link to previous findings. Context is everything, and follow-up is required, whether by us or by other groups. Your standards may be fine for particle physics, I wouldn’t presume to tell you otherwise. But, you must know that they don’t have to translate into other fields. Smaller effect sizes, when they are consistent across multiple measures, data sets, and research groups, are more convincing than a massive finding reported by one group in a single paper.
    I’d also raise the faint possibility that scientists in a particular field are likely to be better judges of what is unexpected or poorly supported than those who live their professional lives in another field.

  27. Tamino writes: Maybe that *is* the way you do science.

    Does Motl actually still do science? What exactly is he up to now that he’s no longer at Harvard? Wikipedia is not particularly helpful and I’m disinclined to wade through the swamps of his blog.

  28. Reading Stewart’s post, I gather that Lubos won’t grant credence to anything that shows human causation or warming until we burn up three or four more planets to be sure we understand the effect, right?

    What dimension does Lubos live in? We only have one Earth.

  29. Horatio Algeranon

    It’s funny that Motl mentions particle physics as opposed to his own illustrious field (string theory).

    Wonder why that might be.

  30. If someone wants to be fraudulent, it’s just as easy to fake a 5 sigma result as a 2 or 3 sigma.

  31. The proportion of terrorists flying on U.S. passenger jets is statistically insignificant on any day — every day. At some point you have to make a decision about proximate causation, and whether you will do anything about it, if the effects are enormous.

    An iceberg ripped a statistically insignificant part of the Titanic’s hull . . .

    Wasn’t it Disraeli who said Twain said there are “lies, damned lies, and statistics?” A lie can convey useful information. Statistics can, too.

    That entire discussion, with regard to warming, is irrelevant.

  32. Andrew Hobbs

    Well I am not that familiar with the physical sciences but clearly a 5 sigma limit is not used exclusively.

    A simple search on Google Scholar of particle physics papers turns up a 95% confidence limit or a p-value of 0.05 as very commonly cited values in the abstracts; although 90% confidence levels seem to feature very frequently! In addition, P. Sinervo has an interesting article on “Signal Significance in Particle Physics”. He gives several examples where a stringent limit wasn’t used. Clearly many physicists are just as interested in not missing their moment of fame or fortune by making a type II error.

    “… the discovery of the Y meson in 1977 by L. Lederman and colleagues. …. The experimenters estimated that they observed a signal of approximately 770 events on top of a non-resonant background of 350 candidates. They characterized the signal as “significant” but made no attempt to quantify or explain exactly what they meant.”

    “The discovery of the W- boson at CERN in 1983 by the UA1 collaboration [2] was made by observing 6 events produced in proton-antiproton collisions where a high energy electron or antielectron was observed in coincidence with a signature for a recoiling energetic neutrino. The collaboration estimated the background to these 6 events as being “negligible” and claimed discovery of the expected charged weak intermediate vector boson.” again without any statistical testing.

    “The discovery of B mesons in 1983 by the CLEO collaboration [3] … observed a total event rate of 17 events on a background of between 4 and 7 events. They … claimed definitive observation of a new particle, but made no statement that quantified the statistical power of the observation.”

    Finally, some real statistics: “… the discovery of B0 meson flavour mixing in 1987 by the ARGUS collaboration …”, where the experimenters characterized their result as a “3 sigma” observation.

  33. What dimension does Lubos live in? We only have one Earth.

    And thankfully, only one Lubos :)

    He is rather … unique.

  34. Daniel J. Andrews

    Five-sigma simply isn’t going to happen in the wildlife biology field either. We barely get enough money to run even the minimum experiment or observations (actually, we sometimes don’t get the money at all).

    People working in fields where the variables are strictly controlled have difficulty understanding how we can do science in fields where we have little control over many of the environmental variables.

    I suspect it is similar to the way people who work with engineering models can’t accept the many uncertainties in climate models so conclude those models can’t be any good.

  35. This is insane.

    I’m a physicist, and it’s definitely true that at LEAST three sigma confidence is ABSOLUTELY required for claiming anything in physics. Go to arxiv.org and look at any experimental paper claiming to have discovered anything. Go to a physics journal and look for anything claiming to have discovered anything. You will NEVER EVER EVER see less than three sigma! Five sigma is the gold-standard. You can’t claim to be sure unless you have five sigma.

    No one’s saying that it’s EASY to get that much, but you just can’t be sure you’re right if you don’t. It’s not easy, but it is NEEDED.

    Even in physics, with claims everyone believes are probably true, you need ~five sigma to claim proof. For example, look at dark matter. Everyone believes it’s there, we’ve indirectly detected it, but no claims of direct detection have been made DESPITE SEEING THINGS TO A FEW SIGMA. We haven’t seen signals to five sigma, no discovery, the end.

    Even though no physicist questions the existence of dark matter, we still do not claim to have directly detected it! This is the standard of proof in physics!!! It’s very high!!!!

    Similarly, engineering has a “5-sigma” standard, too. Do you want your buildings to withstand only three-sigma deviations from average? No, millions of people would die if that were the best that was done (cf. earthquakes in countries without proper building codes). There are hundreds of millions of buildings; even at a one-in-a-million failure rate, that would be hundreds of buildings collapsing a year. Could you imagine that!

    The thing is, with theories, it’s no different! One can easily construct thousands (or more!) of different theories a year. If you accept proof at a 1% level, and you randomly pick theories, you’d expect 1000s/100 = tens of theories to appear confirmed just by chance! That’s insane!

    It’s even worse when you realize that theories aren’t chosen randomly, they’re as close to the line of being possible as possible! This is a large systematic bias towards false-positives. So in reality, you could expect on the order of hundreds of falsely correct theories with a percent level confidence. That’s totally unacceptable for any field!

    And, really, this is why you can’t say each field can choose its own confidence levels. Statistics has to choose for them; the number of false positives has to be low enough that you can be sure of their correctness if you want to make positive claims about their predictions.

    It’s obviously not always the case that you can do that well. Sometimes, you really do have to look through dozens–or hundreds–of potentially correct theories to try to carefully determine if they’re correct because your experiments were not good enough. This happens all the time in good science.

    What does not happen in good science, however, is making a claim of prediction/detection based off of something like that.

    What happens in good science is you understand where your sources of error and limitations on sensitivity come from, you look at your 100 inconclusive results, and you devise new experiments and observations that can distinguish between cases, and devise new tricks that can get you just a little bit more, and eventually, before you know it, you’re up to three or four sigma! Then it’s often just a matter of collecting more data, or making small improvements to get up to five or more sigma.

    In fact, if you look at the history of astronomy over the past 100 or so years, this is *exactly* what happened! A hundred years ago, you could often expect order 100% or more errors! A very very careful observation was order 10%. Nothing was better than that.

    But following exactly the story I just told, they improved themselves into much more of a precision science! Even though many observations may not be great, there are many that are. Deviations in the very uniform CMB are on the level of 10^-5 and have easily been detectable for decades.

    And, to compare with physics, most physicists still see astronomy as overly sloppy and imprecise!

    So, in this case, Lubos is absolutely correct. You really do need this level of significance to make certain claims. It should not come as a surprise that significant results should require statistically significant proof!

    [Response: You and Lubos continue to ignore the truth. Requiring 5-sigma in all cases in all fields ignores the *bulk* of significant results. That’s idiocy.

    The two of you also seem to have this impression that physics is somehow “better” than other fields of science — as though physicists know how to define the proper use of statistics better than statisticians! That’s arrogance.]

    • Calabi, you have a very parochial view of science. You ought to learn a little about what scientists actually do in some very different fields (say, evolutionary biology or geochemistry or hydrology).

      In other fields where people are confronted by very different kinds of problems, they use different analytical methods.

      Your lack of awareness of science outside your own very narrow discipline is a bit embarrassing.

    • I’m a physicist, and it’s definitely true that at LEAST three sigma confidence is ABSOLUTELY required for claiming anything in physics.

      No, it isn't.

      Similarly, engineering has a “5-sigma” standard, too.

      No, it doesn’t.

    • Calabi,
      You need to get out more. Most of the fields that require 5 sigma require it because their own analyses and methods would not be reliable if they required less. In other words, it is because they apply multiple requirements to their data to separate signal from noise and thereby may wind up manufacturing a signal if they require a less stringent significance. Alternatively, it is because they require a very pure sample on which to conduct analysis–you don’t want to determine decay fractions based on background events. Try looking at Phys Rev E sometime, or even Phys. Rev. B.

      • Horatio Algeranon

        Incidentally, what level of confidence do string theorists (like Motl) require for experimental confirmation of their own predictions?

        Oh, wait….

  36. I think what people need to realize is that fields like particle physics have been driven to require higher significance because their methodology tends to produce spurious results if lower significance is allowed.

    In particle physics, one is looking for extremely rare events–events that happen only a few times out of millions or billions of events. One must reconstruct these events with the tools available–e.g. energy and momentum measurements of decay particles at various places along the trajectories of those particles. Usually the end result is a histogram–# of events per mass. Since there is usually a degree of uncertainty as to the proper mass of a new particle, one has to pare away at the data making various “cuts” on the original data.

    Any particle physics grad student knows that this is a recipe for generating false positives. So, ultimately, the true significance of a particle physics discovery is a whole lot less than the claimed discovery. As Tamino says, it is possible to compensate for the potential of the method to generate false positives and then estimate the true significance of the result. However, it is easier to simply require a higher significance and have the tacit understanding that even a five-sigma result can go away in particle physics.

    Lubos SHOULD know this.

  37. May I make a point regarding the purpose and use of statistical probability, commonly called “statistics”? Statistics can only define the probability that samples drawn from a population are different from one another. If the samples are found to be statistically different, a cause for the difference is often inferred. This is a mistake. Statistics per se has nothing to do with causality. It is how the samples from populations are produced that defines whether causation is involved. Only if the samples are produced by an interventional experiment with appropriate randomization and “controls” can causation be studied. Therefore, observation alone can never define causation. Not recognizing this limitation of statistics leads to a great deal of acrimony among debaters.

  38. Gavin’s Pussycat

    Congratulations are due to the American people.

    I’m sentimental, if you know what I mean
    I love the country but I can’t stand the scene.
    And I’m neither left or right
    I’m just staying home tonight,
    getting lost in that hopeless little screen.
    But I’m stubborn as those garbage bags
    that Time cannot decay,
    I’m junk but I’m still holding up
    this little wild bouquet:
    Democracy is coming to the U.S.A.

    Next.

  39. Gavin’s Pussycat

    Oops, that should be on Open Thread.

  40. Guys, it is pretty silly to argue against at least a 3 sigma standard as being the minimum baseline for the claim of a detection of a signal.

    As others pointed out, you assuredly won’t get a Nobel prize for anything less than 5 sigma certainty. That is absolutely standard, across all disciplines of physics and hard sciences, whether it’s astrophysics, condensed matter, particle physics or atomic physics, etc., and is more or less what is taught in any experimental class in university.

    You are of course free to assume say a 2 sigma effect is real and base your research on possibilities like that (indeed, that is what model builders like yours truly make our living off of), however that is living life on the edge and the sad fact is most such signals disappear.

    You might object that requiring a 5 sigma effect is arbitrary (why 5 and not 6?), and indeed it is. But it comes from centuries of practice and convention in data analysis, statistics and experimental physics.

    [Response: I’m getting sick of this crap. The claim that 5-sigma comes from “centuries” of practice and convention is pure bull. The suggestion that we ignore all 2- or even 3-sigma events just because some physicists (not all!) think their science is superior to others is offensive.

    It’s the 2-sigma convention that comes from a century of the practice of statistics. Next time your doctor tells you that you probably have cancer but it’s only a 2-sigma detection, tell him you’ll defer any treatment until his confidence gets to 5-sigma.]

    • As others pointed out, you assuredly won’t get a Nobel prize for anything less than 5 sigma certainty.

      Lubos seems to think otherwise:

      “I think that Svensmark and a few others must feel somewhat unpleasantly because they have found something that may be a spectacular discovery in their discipline, and possibly the first discovery of this discipline that could deserve a Nobel prize. ”

      and their results were

      “almost always significant at the 95% level” ;)

      • Yeah, exactly, similar to what I posted before. They’re hypocrites, which totally surprises me to less than 95% significance level :)

  41. Calabi, I am ALSO a physicist, and I’ve SEEN physics papers claiming significant results with less than 3 sigma confidence. Some of them are listed above on this very thread! Don’t you read before responding to what you’re reading?

  42. I’m trying to not stereotype physicists (thank you Ray, BPL, and others). But Motl, Calabi, and Haelfix’s myopia and outrageous overconfidence in the universality of their personal knowledge are similar to Roger Penrose’s bizarre insistence that consciousness can be caused only by quantum phenomena, and some physicists’ adamant belief in psychic phenomena (to the bemusement of conjurers).

    I think physics education needs to take a lesson from the medical profession’s broadening of doctors’ educations.

    Nobody on this thread has mentioned it yet, which is disappointing but not surprising (sigh…): There is an entire field devoted to judgment and decision making. It provides a framework for choosing combinations of narrower tools such as statistics, and for decades has covered all the rudimentary topics that Motl et al. are oblivious to.

    [Response: In spite of the possibility that some of his beliefs may be foibles (maybe), Penrose is one of my heroes.]

  43. Yet Lubos praises VS’s baloney and doesn’t demand 5-sigma significance for his “proof” that temperatures have been indistinguishable from a random walk 1880-present …

    Hmmm, strange, that.

    • Well, given that the null hypothesis of the ADF test is that the underlying process does contain a unit root, 5-sigma significance would only make it harder to reject VS’s claims. Which is a pretty strong clue as to what precisely is wrong there, of course.
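
      A brief sketch of that point (an editorial illustration on toy data, not the actual temperature record, using Python with numpy and statsmodels):

      ```python
      # In the ADF test the *null* is "a unit root is present", so demanding an
      # ever-stricter significance level makes it harder, not easier, to reject
      # the random-walk claim.
      import numpy as np
      from statsmodels.tsa.stattools import adfuller

      rng = np.random.default_rng(0)
      n = 130  # roughly the 1880-present record length, in years

      trend_plus_noise = 0.01 * np.arange(n) + rng.normal(0, 0.1, n)  # trend-stationary
      random_walk = np.cumsum(rng.normal(0, 0.1, n))                  # genuine unit root

      for name, series in [("trend + noise", trend_plus_noise), ("random walk", random_walk)]:
          stat, p_value = adfuller(series, regression="ct")[:2]
          print(name, round(p_value, 3))  # a small p-value rejects the unit root
      ```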

  44. It seems clear that Luboš Motl (and a few other physicists who speak without thinking) are completely unable to see the difference between hard sciences where you are looking for one thing that is either there or not, and systems science, where controlling for every single variable is not only impossible, but undesirable. Often the very goal is to see how a number of factors interrelate. Knowing the climate forcing of CO2 in isolation just isn’t a very useful result, for example.

    Medicine and climatology fall firmly into the latter category, even if there are cases for both where lab experiments can give the kind of results that physicists are familiar with.

    Oh, and physicists? Get over yourselves. Your field is not more “serious” than any other, and so many of your results have yet to find a practical application. Just because so many of your theories are doomed to remain hypothetical forever is no reason to take it out on fields facing completely different challenges.

  45. Simple test…
    Google scholar “neutrino” “mass”
    Find first three abstracts that mention statistical tests…

    The Heidelberg-Moscow experiment gives the most stringent limit on the Majorana neutrino mass. After 24 kg yr of data with pulse shape measurements, we set a lower limit on the half-life of the 0νββ decay in ^76Ge of T_1/2(0ν) ≥ 5.7×10^25 yr at 90% C.L. (after PDG98 [C. Caso et al., Eur. Phys. J. C3, 1 (1998)]), the sensitivity of the experiment being T_1/2(0ν) ≥ 1.6×10^25 yr at 90% C.L. We thus exclude an effective Majorana neutrino mass greater than 0.2 eV (0.39 eV sensitivity), using the matrix elements of A. Staudt, K. Muto, and H. V. Klapdor-Kleingrothaus, Europhys. Lett. 13, 31 (1990). This limit sets strong constraints on degenerate neutrino mass models.

    Baudis, L. et al. (1999), Limits on the Majorana Neutrino Mass in the 0.1 eV Range, Physical Review Letters, 83(1), 41, doi:10.1103/PhysRevLett.83.41.

    We combine the constraints from the recent Lyα forest analysis of the Sloan Digital Sky Survey (SDSS) and the SDSS galaxy bias analysis with previous constraints from SDSS galaxy clustering, the latest supernovae, and 1st year WMAP cosmic microwave background anisotropies. We find significant improvements on all of the cosmological parameters compared to previous constraints, which highlights the importance of combining Lyα forest constraints with other probes. Combining WMAP and the Lyα forest we find for the primordial slope n_s = 0.98 ± 0.02. We see no evidence of running, dn/d ln k = -0.003 ± 0.010, a factor of 3 improvement over previous constraints. We also find no evidence of tensors, r < 0.36 (95% c.l.). Inflationary models predict the absence of running and many among them satisfy these constraints, particularly negative curvature models such as those based on spontaneous symmetry breaking. A positive correlation between tensors and primordial slope disfavors chaotic inflation-type models with steep slopes: while the V ∝ ϕ^2 model is within the 2-sigma contour, V ∝ ϕ^4 is outside the 3-sigma contour. For the amplitude we find σ_8 = 0.90 ± 0.03 from the Lyα forest and WMAP alone. We find no evidence of neutrino mass: for the case of 3 massive neutrino families with an inflationary prior, Σm_ν < 0.42 eV and the mass of the lightest neutrino is m_1 < 0.13 eV at 95% c.l. For the 3 massless + 1 massive neutrino case we find m_ν < 0.79 eV for the massive neutrino, excluding at 95% c.l. all neutrino mass solutions compatible with the LSND results. We explore dark energy constraints in models with a fairly general time dependence of the dark energy equation of state, finding Ω_Λ = 0.72 ± 0.02, w(z=0.3) = -0.98 (+0.10, -0.12), the latter changing to w(z=0.3) = -0.92 (+0.09, -0.10) if tensors are allowed. We find no evidence for variation of the equation of state with redshift, w(z=1) = -1.03 (+0.21, -0.28). These results rely on the current understanding of the Lyα forest and other probes, which need to be explored further both observationally and theoretically, but extensive tests reveal no evidence of inconsistency among different data sets used here.

    Seljak, U. et al. (2005), Cosmological parameter analysis including SDSS Ly alpha forest and galaxy bias: Constraints on the primordial spectrum of fluctuations, neutrino mass, and dark energy, Physical Review D, 71(10), 103515, doi:10.1103/PhysRevD.71.103515.

    We constrain f_ν ≡ Ω_ν/Ω_m, the fractional contribution of neutrinos to the total mass density in the Universe, by comparing the power spectrum of fluctuations derived from the 2 Degree Field Galaxy Redshift Survey with power spectra for models with four components: baryons, cold dark matter, massive neutrinos, and a cosmological constant. Adding constraints from independent cosmological probes we find f_ν < 0.13 (at 95% confidence) for a prior of 0.1 < Ω_m < 0.5, and assuming the scalar spectral index n=1. This translates to an upper limit on the total neutrino mass m_ν,tot < 1.8 eV for “concordance” values of Ω_m and the Hubble constant.

    Elgaroy, O. et al. (2002), New Upper Limit on the Total Neutrino Mass from the 2 Degree Field Galaxy Redshift Survey, Physical Review Letters, 89(6), 061301, doi:10.1103/PhysRevLett.89.061301.

    Some commentators are ignoring good points about differences between fields (replication, data mining, and the practical impacts of Type II errors) raised by Steve L, Ray Ladbury and Jonathan Gilligan.

  46. Guys, not to apologize for Lubos et al., but the pitfalls of a physics education are the same as any other discipline that demands focused specialization and promises strong predictive power.

    When you have seen the techniques you have learned work and you haven’t ventured far outside your own specialized domain, it is easy to develop the impression that your techniques are the only ones that work. This is particularly the case when your understanding of statistical reasoning is neither broad nor deep.

    An experimental particle physicist has probably seen dozens of three sigma signals appear and vanish on deeper examination before he gets his PhD. It’s easy to conclude that a spurious 3 sigma signal is just bad luck–rather than to realize that the fact that you are generating hundreds of plots in a day means your 3 sigma signals are not 3 sigma after all. Indeed, your 5 sigma signals may not even be 3-sigma signals.

    You see the same overconfidence in geologists, engineers, economists, programmers and doctors. You rarely see it in domains where they study REALLY complicated things like ecology, biology, etc. Although climate science has its roots in the physical sciences, you rarely see climate scientists telling particle physicists how to do their job. Some domains of knowledge lend themselves better to combatting hubris.

  47. > Roger Penrose’s bizarre insistence that
    > consciousness can be caused only by
    > quantum phenomena

    It did strike me as funny to see this:
    http://www.lbl.gov/Science-Articles/Archive/PBD-quantum-secrets.html

    Nature is indeed full of marvels.

  48. Philippe Chantreau

    Once again, I think Ray hit the nail on the head: “fields like particle physics have been driven to require higher significance because their methodology tends to produce spurious results if lower significance is allowed.”

    What more is really there to say?

  49. At the risk of repeating somebody else due to not reading fully (apologies), this from calabi is really something:
    “Even in physics, with claims everyone believes are probably true, you need ~five sigma to claim proof. For example, look at dark matter. Everyone believes it’s there, we’ve indirectly detected it, but no claims of direct detection have been made DESPITE SEEING THINGS TO A FEW SIGMA. We haven’t seen signals to five sigma, no discovery, the end. Even though no physicist questions the existence of dark mater, we still do not claim to have directly detected it!”

    So everybody in a particular field knows something is true, but not officially, because a very stringent alpha has been adopted in lieu of ‘proof’. So do researchers in this field act (plan their research, interpret their data, etc) under the null assumption that dark matter doesn’t exist, or do they act as though it does? If the former, that seems stupid; if the latter, then what’s the use of statistics for them?

    Thank you Tamino — I would never get to explore this thought space if not for this excellent blog that attracts people from diverse fields. Hopefully somebody can answer the questions I’ve just posed. It’s wonderful to learn about how other disciplines operate. (Oops, a quick skimming of subsequent comments suggests that maybe other disciplines don’t operate as advertised. I’m not sure if I’m disappointed or relieved.)

  50. I am genuinely surprised people are actually debating this, much less getting offended by it. This is not intended to be a quip at climate science, or a statement of superiority (which is truly absurd), but rather a clarification of what is frankly low-level textbook material and/or common practice.

    This has nothing to do with particle physics (where the sheer number count of experiments implies lots of bumps). There, we would be wrong tens of thousands of times per month if we admitted 2 sigma events and the time wasted would be astronomical.

    [edit]

    [Response: This is absurd. It’s idiotic that you still believe physicists have a better understanding of statistics than statisticians.

    You complain that “the sheer number count of experiments implies lots of bumps). There, we would be wrong tens of thousands of times per month if we admitted 2 sigma events and the time wasted would be astronomical.”

    DUH! Apparently you and a lot of other physicists don’t know how to compensate for the “sheer number of experiments” — something I specifically discuss in this article. If you *really* understood statistics, you’d know that what you think is a 2-sigma event is just your naivete. Are you really so woefully ignorant of statistics that the only way you can get valid results is to set a ludicrously high significance level?

    The offensive part is the sheer arrogance, that rather than accept your own ignorance you convince yourself you know statistics better than the experts who invented the discipline.

    Quit implying that “all the guys in the physics department agree on this” (which isn’t even true). Go talk to the guys in the department of statistics — see how many of them you can convince that 5-sigma is mandatory.]

    • “There, we would be wrong tens of thousands of times per month if we admitted 2 sigma events and the time wasted would be astronomical.”

      In many fields it’s not possible to be wrong “tens of thousands of times per month” because individual experiments take months to complete.

      You really, really need to broaden your understanding of science beyond the very particular work that you yourself are familiar with. This kind of naive, parochial outlook is curable but only if you’re willing to learn from other people rather than assuming that everyone who does things differently from you is wrong.

  51. “Guys, not to apologize for Lubos et al., but the pitfalls of a physics education are the same as in any other discipline that demands focused specialization and promises strong predictive power.”

    The reason for hammering on Lubos has to do as much with his inconsistency regarding his treasured 5-sigmas as his hubris on insisting on it (but only when it suits him!)

  52. Pete Dunkelberg

    Frank Wilczek got his Nobel Prize in physics for calculating the mass of the proton to better than ten percent.

  53. Good points, Ray. And now I will try to be fair to physicists, by picking on my own field–experimental psychology (no, that is not the same as counseling; instead think “cognitive science”).

    Back when I was teaching classes on how to use computers to do data management and statistical tests, I was surprised to discover that many of the PhD students in the class lacked a firm grasp of why they were doing those particular tests for their dissertation experiments, and exactly how to interpret the details of the results.

    I was surprised, because they had been required to take a lot of prior, sophisticated, stat classes from professors in one of the best quantitative methods departments in the US. These students could do all sorts of stats by computing the matrix algebra by hand. They knew a multitude of esoteric stat approaches and details.

    Turns out that was the problem. The classes they took were designed by professors who were stat specialists, were excited about detailed and esoteric stats, and who designed all the stat classes as part of a sequence for students whose PhD would be in quantitative methods. Missing from those classes was sufficient emphasis on the big picture, for students who quit taking the sequence after the first couple years, because they were specializing in something else. There were no classes tuned to students like that.

    That seems to be a problem in a lot of fields, not just stat. The big picture generally is provided by overview classes intended to be the only class a student will ever take in that field. But many of the introductory classes that are part of a sequence intended for students who will major in that field provide only narrow building blocks. Students are expected to get the big picture later, after they’ve got the vocabulary and basic concepts. But students who do not finish the sequence (e.g., do not major in that field, or do not go on to a graduate program) never get the big picture sufficiently.

    But those people often do come away with an incorrect impression of how much they know–an example of the Dunning-Kruger effect. An example is the misconception that all of science boils down to the p value of a statistical test.

    • Tom,
      In the dark and distant past, I almost switched my major from physics to experimental psych. Your story resonates with me–the only formal stats training I had was in statistical mechanics. I had to teach myself the basics and pretty much everything beyond that.

      Probability and statistics probably remains one of the least understood and most essential aspects of science education in just about any field. I think that you are right, though. A lot of people assume that because they know a bunch of advanced tests they understand stats, when they really don’t understand the basics. VS comes to mind.

  54. Philippe Chantreau

    Hank, that Research News article was mind boggling. Awesome stuff.

  55. I am reminded a story of one of the old timers in the Department.

    He is told by the Physics department that he needs to come talk to them because “Math is wrong”.

    So he heads over to the Physics department and they show him this series of functions that they have.

    Apparently they want to integrate it over some interval. So first they integrate each function separately and then they add up the result.

    And then they do it in the other order, adding up the functions first and then integrating.

    The results are different, so MATH IS WRONG.

    They were expecting a mathematician’s apology, but instead they got an ass kicking.

    “Does the series converge uniformly?” he asks.

    “What does uniform convergence mean?” they reply. At which point he unloads on them.

    YOU ARE APPLYING A THEOREM, AND NOT ONLY DON’T YOU KNOW THE HYPOTHESIS OF THE THEOREM, YOU DON’T EVEN KNOW YOU ARE USING A THEOREM.

    Too bad old Harvey isn’t around to deal with the present group of moron physicist trolls.

    I will do my best to fill in for the fallen:

    YOU MORONIC TWITS.
    THE REASON YOU THINK YOU NEED 5-SIGMA IS BECAUSE YOU ARE CALCULATING THE CONFIDENCE INTERVALS WRONG.

    YOU ARE JUST HOPING THAT IF THE BULLSHIT CALCULATION GIVES YOU 5-SIGMA THEN THE REAL CALCULATION WILL GIVE YOU 2-SIGMA.

    IT DOESN’T OF COURSE, SO YOUR RESPONSE TO ONCE AGAIN STEPPING ON YOUR DICK ISN’T “HEH MAYBE I SHOULD TRY TO UNDERSTAND THIS STATISTICS STUFF” (BECAUSE OF COURSE YOU ARE A BRILLIANT PHYSICIST )

    NO, YOUR RESPONSE IS “5-SIGMA IS THE ABSOLUTE MINIMUM BECAUSE LOOK AT HOW MANY SPURIOUS RESULTS COME IN EVEN WITH 5-SIGMA”.

  56. “Missing from those classes was sufficient emphasis on the big picture”

    I’d like to emphasize that only through rigorous quantitative mathematics/statistics can you see the big picture. Although you can see the trees without seeing the forest, you cannot see the forest without seeing the trees as well.

  57. Lazar,

    Nice internet research! Thanks! You’ve pretty much demolished Lubos’s idiotic proposition.

  58. If you are in the business of searching for new particles etc. and you are doing a huge number of experiments with multiple possible hypotheses, false positives are a serious problem. Simply demanding a higher standard of evidence is a naïve approach that can cause you to miss significant effects, which is why false discovery rate (FDR) was invented. If physicists don’t know about this and are still in effect doing Bonferroni correction they’re a bit behind.

    On the other hand, where you are testing ONE hypothesis a lower confidence level is certainly justifiable. As Tamino explained, a 5% significance level means that, when the null hypothesis is true, you have a 1 in 20 chance of falsely declaring a result significant; those are reasonable odds, especially if multiple independent experiments testing the same hypothesis result in the same finding. Finally, unlike in particle physics, the use of stats here aims to confirm a real-world effect based on known physics, not to arrive at a new theory.
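
    For anyone curious what the FDR approach looks like in practice, here is a minimal, purely illustrative sketch (simulated data; assumes numpy and scipy are available) contrasting Bonferroni with Benjamini-Hochberg:

        # Simulate 1000 tests with no real effect plus 50 with a genuine effect,
        # then compare Bonferroni (familywise error) with Benjamini-Hochberg (FDR).
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(42)
        n_null, n_real = 1000, 50
        z = np.concatenate([rng.normal(0.0, 1.0, n_null),   # pure noise
                            rng.normal(3.5, 1.0, n_real)])  # genuine effects
        p = stats.norm.sf(z)                                 # one-sided p-values
        m, alpha = p.size, 0.05
        real = np.zeros(m, dtype=bool)
        real[n_null:] = True

        # Bonferroni: guard against even one false positive anywhere.
        bonf_hits = p < alpha / m

        # Benjamini-Hochberg: allow at most ~5% of *discoveries* to be false.
        order = np.argsort(p)
        passed = p[order] <= alpha * np.arange(1, m + 1) / m
        k = passed.nonzero()[0].max() + 1 if passed.any() else 0
        bh_hits = np.zeros(m, dtype=bool)
        bh_hits[order[:k]] = True

        for name, hits in (("Bonferroni", bonf_hits), ("Benjamini-Hochberg", bh_hits)):
            print(f"{name}: {np.sum(hits & real)} of {n_real} real effects found, "
                  f"{np.sum(hits & ~real)} false positives")

    With numbers like these, the usual outcome is that Bonferroni keeps false positives near zero but misses many of the real effects, while Benjamini-Hochberg recovers far more of them at the cost of perhaps a few false discoveries.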

  59. [edit]

    Whether the 3-sigma level is enough for new sciences like climate science, environmental science, forestry, and others remains to be seen. Climate science is a new enough science that there is not enough time yet to decide this issue. Time will tell. I do not know enough about climate science to say one way or the other.

    [Response: It’s time to put a stop to this.

    If you’re a physicist and your experience with statistics leads you to believe that 5-sigma is mandatory, it’s because you’re doin’ it wrong.

    Many sciences encounter circumstances in which the naive application of statistical tests is wrong. If you keep getting false results, then instead of allowing yourself the arrogance of believing that your discipline is “special” (let alone that you know better than other scientists), you need to get over yourself and accept the fact that your statistics is naive. Consult with a statistician. This is what we do. The 5-sigma limit in physics is not a sign of some inherent difference in their science, but of a weakness of their understanding of statistics.

    The statement that “Whether the 3-sigma level is enough for new sciences like climate … remains to be seen” is just plain wrong. Worse, it plays into the hands of those who, like Lubos, will do anything no matter how idiotic to discredit climate science.

    What climate science, and physics, and all sciences need is for the statistics to be done right.]

    [edit]

    Mathematics (which includes statistics) and Physics are actually good friends. Please don’t make enemies out of them.

    [Response: Go tell Motl; he’s the one suggesting that physicists know statistics better than statisticians.]

    • Hellcat, you do realize that the prediction of anthropogenic climate change is about a decade older than the Special Theory of Relativity, right? Older than quantum mechanics, plate tectonics…

      Look, I’ve done particle physics. I know why you need 5 sigma. I had a buddy who decided he wanted to do the stats right–took him 13 years to graduate!

      Look, 5 sigma is six 9’s. There isn’t much knowledge in human experience that is that certain–nor does there need to be.
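
      For reference, a quick conversion of sigma levels into Gaussian tail probabilities (one-sided convention assumed; two-sided roughly doubles the p-values) shows where the “six 9’s” comes from:

          from math import erfc, sqrt

          for sigma in (2, 3, 5):
              p = 0.5 * erfc(sigma / sqrt(2))   # one-sided Gaussian tail
              print(f"{sigma} sigma: p = {p:.2e}, confidence = {1 - p:.7%}")
          # 5 sigma comes out at p ~ 2.9e-7, i.e. 99.99997% -- the "six 9's".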

  60. Motl made the same asinine 5σ “demand” back in Jan 2009:

    http://motls.blogspot.com/2009/01/record-cold-temperatures-in-2009.html

    Apparently he uses 5σ as a hammer, and sees everything climate-related as a nail. Never mind his 5σ is wrong. He just wants to deny climate change. Yet another denialist paper tiger.

    “Similarly, engineering has a “5-sigma” standard, too. Do you want your buildings to withstand three sigma deviations from average? No, millions of people would die if that was the best that was done (c.f., earthquakes in countries without proper building codes). There are hundreds of millions of buildings in those countries; even at a one-in-a-million level, that would be hundreds of buildings collapsing a year. Could you imagine that!”

    You are confusing the liberal use of safety factors in design codes with a 5 sigma standard in research. Two sigma is all over the place in engineering research.

  62. “It’s never ludicrous because the amount of data/events that you typically need for a 5-sigma discovery is just roughly 3 times bigger than the data for a 3-sigma discovery”

    Stupidest comment ever. I’ll tell my colleague that she’ll ‘just’ need three times as many patients. That’s three times the testing personnel she’ll need to hire, three times as much money she needs to pay the patients, three times as much MRI time she needs to reserve (not a trivial expense), and three times as many weekends she’ll need to give up to run her experiments.

    But I’m sure Motl will easily cough up the tens of thousands of dollars extra that will be needed to run the experiments until they are up to his standard and sit in on her weekend duties as a test leader. “Just a bit of extra work.”

    • Yeah, Motl’s doing an excellent job of ruining what shreds of credibility he still had left.

      Repeat after me, Lubos: the number of events needed depends in detail on the statistics of the problem; not every problem follows Poisson statistics.
      To demonstrate that at least 90% of a lot of parts are good with 5-sigma confidence, I’d have to test 132 parts! Lubos has spent way too long with his head in warm, dark places.
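
      Roughly where a number like 132 comes from, as a sketch (assuming an all-pass sampling plan and taking “5 sigma” as a one-in-a-million tail; the exact one-sided Gaussian tail pushes the count a bit higher):

          # If the lot were only 90% good, the chance that n sampled parts all pass
          # is 0.9**n, so demonstrating ">= 90% good" needs 0.9**n <= p_required.
          from math import ceil, log, erfc, sqrt

          def parts_needed(p_required, good_fraction=0.9):
              return ceil(log(p_required) / log(good_fraction))

          print(parts_needed(1e-6))                      # "one in a million": 132 parts
          print(parts_needed(0.5 * erfc(5 / sqrt(2))))   # exact one-sided 5-sigma: 143
          print(parts_needed(0.5 * erfc(2 / sqrt(2))))   # 2-sigma: 36 -- a much smaller job

      A Poisson-counting experiment scales quite differently, which is the point: the cost of extra sigmas depends entirely on the statistics of the problem.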

  63. HellCat: Climate science is a new enough science that there is not enough time yet to decide this issue.

    BPL: Where do you set the beginning of climate science? Here are some possible choices:

    1. When Aristotle divided the world into torrid, temperature, and frigid zones, c. 300 BC.

    2. When Galileo invented the thermometer and Torricelli the barometer, c. 1600 AD.

    3. When Hadley worked out the first model for the general circulation of the atmosphere c. 1740 AD.

    4. When Fourier demonstrated the existence of the greenhouse effect in 1824.

    5. When Agassiz demonstrated the existence of Ice Ages in 1837.

    6. When Arrhenius proposed anthropogenic global warming theory in 1896.

    Which starting point would make climatology “a new science”?

  64. Sorry, that should read “torrid, temperate, and frigid” above, not “temperature”

  65. And our friend Lubos already responded on his blog: http://motls.blogspot.com/2010/03/proliferation-of-wrong-papers-at-95.html

    :)

  66. May I propose that physicists and climate scientists switch jobs for a few select experiments? Live in New York for a while, leave before you become self-obsessed and hard. Live in North Carolina for a while, but leave before you get soft… as it were.

    You guys have a lot to learn from each other and could probably (with a 5-sigma CL) benefit society by exploring the varied world of science a bit.

    • Simon,
      I got my PhD in experimental particle physics. I’ve since worked in education, international development, science journalism and space physics.

      Motl’s whole premise is absurd. The 95% CL does not imply that 5% of results are wrong–it is simply a threshold, the level to which significance is demonstrated.

  67. I just posted this at Lubos’ blog:

    Having worked in industrial quality assurance for many years, I am surprised at the demand for 5-sigma assurance.

    Traditionally, QA uses “warning” at 2-sigma shifts of a mean value (e.g. mean diameter of 10 samples of a circular widget taken from a widget-producing process). 3-sigma is taken as an “action” limit – stop the process and seek a cause of the shift in the mean.

    These limits have been used since Walter Shewhart first introduced Statistical Quality Control at Western Electric in the 1920s. See http://www.amazon.co.uk/Introduction-Statistical-Quality-Control-Montgomery/dp/0471656313/ref=sr_1_1?ie=UTF8&s=books&qid=1269460503&sr=8-1

    If 5-sigma limits had been used in industrial processes, then the level of quality and reliability we see in modern mechanical and electronic equipment would be considerably less.
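
    For anyone unfamiliar with the jargon, here is a bare-bones sketch of Shewhart-style limits (simulated widget diameters, numpy assumed; real SQC charts estimate the limits from the data and apply more run rules than this):

        # Minimal x-bar control chart: flag sample means beyond the 2-sigma
        # "warning" and 3-sigma "action" limits.  Simulated data only.
        import numpy as np

        rng = np.random.default_rng(0)
        target, sd, n = 10.0, 0.05, 10        # nominal diameter, process sd, sample size
        se = sd / np.sqrt(n)                  # standard error of the sample mean

        means = rng.normal(target, se, size=50)
        means[40:] += 0.03                    # pretend the process drifts near the end

        for i, m in enumerate(means):
            dev = abs(m - target) / se
            if dev > 3:
                print(f"sample {i}: mean {m:.3f} -> ACTION (beyond 3-sigma limit)")
            elif dev > 2:
                print(f"sample {i}: mean {m:.3f} -> warning (beyond 2-sigma limit)")

    The early samples mostly stay inside the limits; the deliberate late drift typically shows up as a cluster of warning (and occasionally action) flags, which is exactly the behaviour the 2-sigma/3-sigma scheme is designed to catch.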

    • John Whitman

      toby,

      Although there might be some 5 sigma industries. Or even 4?

      Industries at 6 sigma also occur, doing quite well.

      John

      • Not wishing to derail the conversation any… but ordinarily, in the long term, those 6-sigma processes are probably only really operating at 4 to 5 sigma, aren’t they, because of movement of the process’s mean… he says, recalling a dim and distant TQM/Taguchi exercise that a company I once worked for decided to try to implement, and that I tried to get my head around.

      • In “6-sigma industries”, the process is designed such that the process parameters (e.g. widget diameters) lie as close to the mean as possible so as to be well within limits and have few defects.
        To achieve 6-sigma status, a process may have to endure many years of 2-sigma and 3-sigma behaviour.

        3-sigma and 2-sigma limits on sample measurements would still be used to indicate unfavourable shifts in the process mean.

  68. Just to be sure, you are right that 5 sigma is physics’ exception to the 1.96-sigma rule.

    See cdf8128_mclimit_csm_v2.pdf.

  69. Motl, Calabi, and Haelfix.

    How many sigmas does one require in order to accept that an apple is not an orange?

    Bernard, if your method of determining the characteristic of the fruit being observed is a temperature sensor, then I think you need a whole lot of sigmas.

    • Actually, I think that a year’s worth of temperature readings would suffice in many cases to preclude an orange being grown in a temperate region, sight unseen.

  71. a particle physicist

    It’s worth noting that the convention in particle physics is that 2-sigma confidence is enough to rule out a theory of new physics, but 5-sigma confidence is required to discover something new. Two or three sigma bumps (even after correcting for all the places a bump could have appeared in a given distribution) come and go all the time, just because there are thousands and thousands of different distributions being investigated.

    Climate science obviously differs in that we have only one Earth to work with (a “cosmic variance” problem that should be familiar to many physicists in other contexts). Not to mention that global warming from the greenhouse effect is a prediction of well-understood physics; the extraordinary claim that would require huge statistical significance to be convincing would be that it isn’t happening. Seeing at 90% confidence that the global warming that really should be happening, as a result of basic physics, is happening should be enough to convince anyone. (Whereas to be convinced of Lubos’s favored strong negative feedbacks, in the absence of any compelling argument that they exist, I would want both highly statistically significant data and a plausible mechanism.)

  72. @ a particle physicist: Beautifully said.

  73. Think of it this way: we’re trying to get enough data to get a convincing answer about what’s happening, while sacrificing the minimum number of research subjects needed.

    In climate studies, the research subjects are life on the planet, or humans present and future.

    In convincing people, just eyeballing the pretty picture is usually what actually works.

    Good blog post here on the problem of using eyeballing to convince people there’s an effect:

    http://scienceblogs.com/drugmonkey/2010/03/iacuc_101_satisfying_the_erron.php