Error Statistics Philosophy: Getting Credit (or blame) for Something You Don't Deserve (and first honorable mention)

Ruler at the Bottom of Ocean

It was three months ago that I began this blog with “overheard at the comedy hour at the Bayesian retreat” …and we’re near the end of the repertoire of jokes (unless I hear any new ones). This last, in effect, accuses the frequentist error-statistical account of licensing the following (make-believe) argument after the oil spill in the Gulf of Mexico in 2010:

Oil Exec: We had highly reliable evidence that H: the pressure was at normal levels on April 20, 2010!

Senator: But you conceded that whenever your measuring tool showed dangerous or ambiguous readings, you continually lowered the pressure, and that the stringent “cement bond log” test was entirely skipped.

Oil Exec: Granted, we omitted reliable checks on April 20, 2010, but usually we do a better job—I am reporting the average! You see, we use a randomizer that most of the time directs us to run the gold-standard check on pressure. But, but April 20 just happened to be one of those times we did the nonstringent test; but on average we do ok.

Senator: But you don’t know that your system would have passed the more stringent test you didn’t perform!

Oil Exec: That's the beauty of the frequentist test!

Even if we grant (for the sake of the joke) that overall, this “test” rarely errs in the report it outputs (pass or fail), that is irrelevant to appraising the inference from the data on April 20, 2010 (which would have differed had the more stringent test been run).[i] That interpretation violates the severity criterion: the observed passing result was altogether common if generated from a source where the pressure level was unacceptably high. Therefore it misinterprets the actual data. The question is why anyone would saddle the frequentist with such shenanigans on averages? To be continued when I return from the Secret Comedy Club (this time in Covent Garden).

To continue:

Lest anyone think I am inventing a criticism, here is a familiar statistical instantiation, where the probability of choosing each experiment is given to be .5 (Cox 1958).

Two Measuring Instruments with Different Precisions:

A single observation X is to be made on a normally distributed random variable with unknown mean µ, but the measurement instrument is chosen by a coin flip: with heads we use instrument E' with a known small variance, say 10⁻⁴, while with tails we use E", with a known large variance, say 10⁴. The full data indicate which of E' or E" was performed, and the particular value observed, which we can write as x' and x", respectively.

In applying our test T+ (see November 11 blog post) to a null hypothesis, say, µ = 0, the “same” value of X would correspond to a much smaller p-value were it to have come from E' than if it had come from E". Denote the two p-values as p' and p", respectively. However, or so the criticism proceeds, the error statistician would report the average p-value: .5(p' + p").

But this would give a misleading assessment of the precision and corresponding severity with either measurement! Instead you should report the p-value of the result in the experiment actually run (this is Cox’s Weak Conditionality Principle, WCP).
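To make the contrast concrete, here is a minimal Python sketch of the two-instrument example. It assumes a one-sided test of µ = 0 against larger values, so that the p-value is the upper-tail normal probability; the observed value 0.03 and all of the names below are hypothetical, chosen only for illustration.

```python
import math


def upper_tail_p_value(x, sigma, mu0=0.0):
    """One-sided p-value P(X >= x; mu = mu0) for one normal observation
    with known standard deviation sigma."""
    z = (x - mu0) / sigma
    # 1 - Phi(z), written with the complementary error function
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Standard deviations implied by the variances in the example above
SIGMA_PRECISE = math.sqrt(1e-4)    # instrument E', variance 10^-4
SIGMA_IMPRECISE = math.sqrt(1e4)   # instrument E", variance 10^4

x_observed = 0.03  # hypothetical observed value, not from the original post

p_precise = upper_tail_p_value(x_observed, SIGMA_PRECISE)      # p'
p_imprecise = upper_tail_p_value(x_observed, SIGMA_IMPRECISE)  # p"

# The report the critic attributes to the error statistician
p_averaged = 0.5 * (p_precise + p_imprecise)

print("p' (E' was run):", p_precise)
print('p" (E" was run):', p_imprecise)
print("averaged p-value:", p_averaged)
```

With this hypothetical value, the precise instrument gives a p-value of roughly 0.001 and the imprecise one a p-value of roughly 0.5, so the average lands near 0.25. That number describes neither experiment: it overstates the p-value if E' was run and understates it if E" was run, which is why the WCP directs us to report the p-value for the instrument actually used.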

But what could lead the critic to suppose the error statistician must average over experiments not even performed? Rule #2 for legitimate criticism is to give the position being criticized the most generous construal one can think of. Perhaps the critic supposes what is actually a distortion of even the most radical behavioristic construal:

· If you consider outcomes that could have occurred in hypothetical repetitions of this experiment, you must also consider other experiments you did not run (but could have run) in reasoning from the data observed (from the test you actually ran), and report some kind of frequentist average!

The severity requirement makes explicit that such a construal is to be rejected---I would have thought it obvious, and not in need of identifying a special principle. Since it wasn’t, I articulated this special notion for interpreting tests and the corresponding severity criterion.

Let me now give a special (the first!) honorable mention to Christian Robert on this point, as raised in Cox and Mayo (2010). He writes (p. 9, http://arxiv.org/abs/1111.5827):

A compelling section is the one about the weak conditionality principle (pp. 294-298), as it objects to the usual statement that a frequency approach breaks this principle. In a mixture experiment about the same parameter θ, inferences made conditional on the experiment “are appropriately drawn in terms of the sampling behaviour in the experiment known to have been performed” (p. 296). This seems hardly objectionable, as stated. And I must confess the sin of stating the opposite as The Bayesian Choice has this remark (Robert (2007), Example 1.3.7, p. 18) that the classical confidence interval averages over the experiments. The term experiment validates the above conditioning in that several experiments could be used to measure θ, each with a different p-value. I will not argue with this.

He would want me to mention that he does go on with some caveats:

I could, however, [argue] about ‘conditioning is warranted to achieve objective frequentist goals’ (p. 298) in that the choice of the conditioning, among other things, weakens the objectivity of the analysis. In a sense the above pirouette out of the conditioning principle paradox suffers from the same weakness, namely that when two distributions characterise the same data (the mixture and the conditional distributions), there is a choice to be made between “good” and “bad”.

But there is nothing arbitrary about regarding as “good” the only experiment actually run and from which the actual data arose. The severity criterion only makes explicit what is/should be already obvious. Objectivity, for us, is directed by the goal of making correct and warranted inferences, not freedom from thinking. After all, any time an experiment E is performed, the critic could insist that the decision to perform E is the result of some chance circumstances and with some probability we might have felt differently that day and have run some other test, perhaps a highly imprecise test or a much more precise test or anything in between, and demand that we report whatever average properties they come up with. The error statistician can only shake her head in wonder that this gambit is at the heart of criticisms of frequentist tests.

Still, we exiled ones can’t be too fussy, and Robert still gets the mention for conceding that we have a solid leg on which to pirouette.

[i] Of course, no one really argued this way. Moreover, deep-water drillers who had nothing to do with the spill (e.g., DO) were exiled from the GOM. Note DO is over 60 today.

Friday, December 2, 2011
