Tuesday, November 29, 2011

If you try sometime, you find you get what you need!

picking up the pieces

Thanks to Nancy Cartwright, a little ad hoc discussion group has formed: “PhilErrorStat: LSE: Three weeks in (Nov-Dec) 2011.”  I’ll be posting related items on this blog, in the column to your left, over its short lifetime. We’re taking a look at some articles and issues leading up to a paper I’m putting together to give in Madrid next month on the Birnbaum-likelihood principle business (“Breaking Through the Breakthrough”) at a conference (“The Controversy about Hypothesis Testing,” Madrid, December 15-16, 2011).  I hope also to get this group’s feedback as I follow through on responses I’ve been promising to some of the comments and queries I’ve received these past few months.  

Our very first meeting already reminded me of an issue Christian Robert raised in his blog about Error and Inference: Is the frequentist (error-statistical) interest in probing discrepancies, and the ways in which statistical hypotheses and models can be false, akin to a Bayesian call for setting out rival hypotheses with prior probability assignments?

Sunday, November 27, 2011

The UN Charter: double-counting and data snooping

John Worrall, 26 Nov. 2011
Last night we went to a 65th birthday party for John Worrall, philosopher of science and guitarist in his band Critique of Pure Rhythm. For the past 20 or more of these years, Worrall and I have been periodically debating one of the most contested principles in philosophy of science: whether evidence in support of a hypothesis or theory should in some sense be “novel.”

A novel fact for a hypothesis H may be: (1) one not already known, (2) one not already predicted (or counter-predicted) by available hypotheses, or (3) one not already used in arriving at or constructing H. The first corresponds to temporal novelty (Popper), the second, to theoretical novelty (Popper, Lakatos), the third, to heuristic or use-novelty. It is the third, use-novelty (UN), best articulated by John Worrall, that seems to be the most promising at capturing a common intuition against the “double use” of evidence:

If data x have been used to construct a hypothesis H(x), then x should not be used again as evidence in support of H(x).

(Note: Writing H(x) in this way emphasizes that, one way or another, the inferred hypothesis was selected or constructed to fit or agree with data x. The particular instantiation can be written as H(x0).)

The UN requirement, or, as Worrall playfully puts it, the “UN Charter,” is this:

Use-novelty requirement (UN Charter): for data x to support hypothesis H (or for x to be a good test of H), H should not only agree with or “fit” the evidence x, but x itself must not have been used in H's construction.

The UN requirement has surfaced as a general prohibition against data mining, hunting for significance, tuning on the signal, ad hoc hypotheses, and data peeking, and as a preference for predesignated hypotheses and novel predictions.

Wednesday, November 23, 2011

Elbar Grease: Return to the Comedy Hour at the Bayesian Retreat

I lost a bet last night with my criminologist colleague Katrin H. It turns out that you can order a drink called “Elbar Grease” in London, in a “secret” comedy club in a distant suburb (see Sept. 30 post).[i] The trouble is that it’s not nearly as sour as the authentic drink (not sour enough, in any case, for those of us who lack that aforementioned gene). But I did get to hear some great comedy, which hasn’t happened since the early days of exile, and it reminded me of my promise to revisit the “comedy hour at the Bayesian retreat” (see Sept. 3 post). Few things have been the butt of more jokes than examples of so-called “trivial intervals”.

Sunday, November 20, 2011

RMM-5: Special Volume on Stat Sci Meets Phil Sci

The article "Low Assumptions, High Dimensions" by Larry Wasserman has now been published in our special volume of the on-line journal, Rationality, Markets, and Morals (Special Topic: "Statistical Science and Philosophy of Science: Where Do/Should They Meet?").

These days, statisticians often deal with complex, high dimensional datasets. Researchers in statistics and machine learning have responded by creating many new methods for analyzing high dimensional data. However, many of these new methods depend on strong assumptions. The challenge of bringing low assumption inference to high dimensional settings requires new ways to think about the foundations of statistics. Traditional foundational concerns, such as the Bayesian versus frequentist debate, have become less important.

Friday, November 18, 2011

Neyman's Nursery (NN5): Final Post

I want to complete the Neyman’s Nursery (NN) meanderings while we have some numbers before us, and while there is a particular example, test T+, on the table.  Despite my warm and affectionate welcoming of the “power analytic” reasoning I unearthed in those “hidden Neyman” papers (see post from Oct. 22), reasoning admittedly largely lost in the standard decision-behavior model of tests, it still retains an unacceptable coarseness: power is always calculated relative to the cut-off point cα for rejecting H0.  But rather than throw out the baby with the bathwater, we should keep the logic and take account of the actual value of the statistically insignificant result.
(For those just tuning in, power analytic reasoning aims to avoid the age-old fallacy of taking a statistically insignificant result as evidence of 0 discrepancy from the null hypothesis, by identifying discrepancies that can and cannot be ruled out.  For our test T+, we reason from insignificant results to inferences of the form:  μ < μ0 + g.)
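The T+ reasoning sketched above can be put numerically. Here is a minimal Python sketch (the function names and default numbers are mine, chosen only for illustration; this is not code from any statistics package) of computing POW(T+, μ = μ1) for the one-sided Normal test with known σ:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def power(mu1, mu0=0.0, sigma=1.0, n=25, z_alpha=1.96):
    """POW(T+, mu = mu1): probability that test T+ rejects H0 when mu = mu1.

    T+ tests H0: mu <= mu0 against mu > mu0, rejecting when the
    sample mean exceeds the cut-off mu0 + z_alpha * sigma/sqrt(n).
    """
    se = sigma / sqrt(n)
    cutoff = mu0 + z_alpha * se
    return 1 - norm_cdf((cutoff - mu1) / se)
```

On this sketch, when a result is statistically insignificant and power(mu1) is high, one avoids the fallacy of inferring zero discrepancy by inferring instead that μ < μ1.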

Tuesday, November 15, 2011

Logic Takes a Bit of a Hit! (NN 4) Continuing: Shpower ("observed" power) vs Power

Logic takes a bit of a hit---student driver behind me.  Anyway, managed to get to JFK, and meant to explain a bit more clearly the first "shpower" post.
I'm not saying shpower is illegitimate in its own right, or that it could not have uses, only that finding that the logic for power analytic reasoning does not hold for shpower is no skin off the nose of power analytic reasoning.
Consider our one-sided test T+, with μ0= 0 and α=.025.  Suppose σ = 1, n = 25, so x̄ is statistically significant only if it exceeds .392. Suppose x̄ just misses significance, say
x̄ = .39.

Power-analytic reasoning says (in relation to our test T+):

If x̄ is statistically insignificant and POW(T+, μ = μ1) is high, then x̄ indicates, or warrants inferring (or whatever phrase you like), that μ < μ1.
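With the post's own numbers, the contrast between shpower and power can be checked directly. This is just an illustrative sketch under the stated setup (the variable names are mine):

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Test T+ setup from the post: mu0 = 0, alpha = .025, sigma = 1, n = 25
se = 1.0 / sqrt(25)        # standard error = 0.2
cutoff = 1.96 * se         # ~0.392: x-bar is significant iff it exceeds this
xbar = 0.39                # just misses significance

def pow_at(mu1):
    """POW(T+, mu = mu1) = P(X-bar > cutoff | mu = mu1)."""
    return 1 - norm_cdf((cutoff - mu1) / se)

# "Shpower" (observed power) plugs the observed x-bar in as mu1.
shpower = pow_at(xbar)     # ~0.5, since x-bar sits right at the cutoff

# Power-analytic reasoning instead evaluates power at a mu1 of interest,
# which can be high, licensing the inference mu < mu1 (here mu1 = 0.8):
high_power = pow_at(0.8)
```

The sketch makes the asymmetry vivid: shpower hovers near one half whenever x̄ just misses the cut-off, whatever the data, so it cannot play the role that POW(T+, μ1) plays in the reasoning above.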

How do they do it?

Outside My Window (NYC)
We all know there's scarcely any privacy these days...but how do they take a pic (I found it on Flickr, by chance) right outside the 8th floor? (zoom perhaps?)  It was in the past 4-5 weeks because that faux marble coral moulding wasn't even up before that, and it can be seen in the window.

Friday, November 11, 2011

Neyman's Nursery (NN) 3: SHPOWER vs POWER

EGEK weighs 1 pound

Before leaving base again, I have a rule to check on weight gain since the start of my last trip.  I put this off until the last minute, especially when, like this time, I know I've overeaten while traveling.  The most accurate of the 4 scales I generally use (one is at my doctor's) is actually in Neyman's Nursery upstairs.  To my surprise, none of these scales showed any discernible increase over when I left.  At least one of the 4 scales would surely have registered a weight gain of 1 pound or more, had I gained it, and yet none of them did; that is an indication I’ve not gained a pound or more.  I check that each scale reliably indicates 1 pound, because I know that is the weight of the book EGEK (you can even see this on the scale shown), and each shows exactly one pound when EGEK is weighed. Having evidence I've gained less than 1 pound, there are even fewer grounds for supposing I’ve gained as much as 5 pounds, right?

This kind of measure of a method's capability to detect a change or discrepancy is very much a Power-type notion (whether formal or informal).  Analogously, if an experimental test very probably would have rejected the null hypothesis were the correct value of μ as large as μ’, i.e., if the Power of the test against μ’ were very high, then a non-rejection is an indication that μ is not as large as μ’. This is a general type of Power Analytic reasoning.

Tuesday, November 8, 2011

Neyman's Nursery 2: Power and Severity [Continuation of Oct. 22 Post]:

Let me pick up where I left off in “Neyman’s Nursery,” [built to house Giere's statistical papers-in-exile]--please see Oct. 22 post.  The main goal of the discussion is to get us to exercise correctly our "will to understand power", if only little by little.  One of the two surprising papers I came across the night our house was hit by lightning has the tantalizing title “The Problem of Inductive Inference” (Neyman 1955).  It reveals a use of statistical tests strikingly different from the long-run behavior construal most associated with Neyman.  Surprisingly too, Neyman is talking to none other than the logical positivist philosopher of confirmation, Rudolf Carnap:

I am concerned with the term “degree of confirmation” introduced by Carnap.  …We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true].  The question is: does this result “confirm” the hypothesis that H0 is true of the particular data set? (Neyman 1955, pp. 40-41).
Neyman continues:
The answer … depends very much on the exact meaning given to the words “confirmation,” “confidence,” etc.  If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H0, then…. the attitude described is dangerous.… [T]he chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present].  Therefore, the failure of the test to reject H0 cannot be reasonably considered as anything like a confirmation of H0.  The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95.
The general conclusion is that it is a little rash to base one’s intuitive confidence in a given hypothesis on the fact that a test failed to reject this hypothesis. A more cautious attitude would be to form one’s intuitive opinion only after studying the power function of the test applied.
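Neyman's point, that non-rejection speaks for H0 only when the power against the discrepancy of interest is high (he suggests 0.95), is easy to see in a quick sketch. The sample sizes and the discrepancy δ = 0.5 below are hypothetical, chosen only to illustrate the contrast:

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def power(n, delta, sigma=1.0, z_alpha=1.96):
    """Power of the one-sided Normal test against a discrepancy delta
    from the null, with n observations and known sigma."""
    se = sigma / sqrt(n)
    return 1 - norm_cdf(z_alpha - delta / se)

# With only a few observations, the chance of detecting delta = 0.5 is slim,
# so a non-rejection is no confirmation of H0; with enough observations the
# power clears Neyman's 0.95 mark and non-rejection carries real weight.
low = power(n=4, delta=0.5)
high = power(n=60, delta=0.5)
```

Studying this power function before forming one's opinion is exactly the "more cautious attitude" Neyman recommends.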

Monday, November 7, 2011


In the Clutches of the TSA
I don’t know if this is true, but I was told yesterday by some TSA inspectors that there would no longer be an “opt-out” option from full-body scanners in Europe.  (Anyone know about this?)  About to pass through security at Heathrow (British Air) I began the usual strip, including knee brace, which invariably triggers bells.  I was told they didn’t want me removing the knee brace “in public”, so I went through the machine, it went off, and I was given a pat down and told I also had to go through the full-body scanner which I always opt out of (not that it has often arisen). They usually grab a bullhorn and yell out loudly “female opt out!” in order to signal the need for a non-male TSA rep to do the pat down.  This time, however, they told me there had just a few days ago been a change of rules in Europe, and there was no opting out (if selected).  After I argued for several minutes that neither the safety nor the effectiveness of the full-body scan had passed severe tests, I suddenly found myself surrounded by 4 male TSA officials who said I either go through the full-body scanner or not fly.  I received a form in which to write my complaint to the authorities.  After I submitted to their invasion of privacy, they still demanded I take the brace off---I guess it was ok to perform in public now.  Any females with similar experiences?

If the following is true, I hope she succeeds in suing: (updated Dec. 3, 2011)

Saturday, November 5, 2011

Skeleton Key and Skeletal Points for (Esteemed) Ghost Guest

Secret Key
Why attend presentations of interesting papers or go to smashing London sites when you can spend better than an hour racing from here to there because the skeleton key to your rented flat won’t turn the lock (after working fine for days)? [3 other neighbors tried, by the way, it wasn't just me.] And what are the chances of two keys failing, including the porter’s key, and then a third key succeeding--a spare I’d never used but had placed in a hollowed-out volume of Error and Inference, and kept in an office at the London School of Economics?  (Yes, that is what the photo is!  An anonymous e-mailer guessed it right, so they must have spies!)  As I ran back and forth one step ahead of the locksmith, trying to ignore my still-bum knee (I left the knee brace in the flat) and trying not to get run over—not easy, in London, for me—I mulled over the perplexing query from one of my Ghost Guests (who asked for my positive account).

Thursday, November 3, 2011

Who is Really Doing the Work?*

Note Figure Lurking in Background

A common assertion (of which I was reminded in Leiden*) is that in scientific practice, by and large, the frequentist sampling theorist (error statistician) ends up in essentially the "same place" as Bayesians, as if to downplay the importance of disagreements within the Bayesian family, let alone between the Bayesian and frequentist.   Such an utterance, in my experience, is indicative of a frequentist in exile (as described on this blog). [1]  Perhaps the claim makes the frequentist feel less in exile; but it also renders any subsequent claims to prefer the frequentist philosophy as just that---a matter of preference, without a pressing foundational imperative. Yet, even if one were to grant an agreement in numbers, it is altogether crucial to ascertain who or what is really doing the work.  If we don’t understand what is really responsible for success stories in statistical inference, we cannot hope to improve those methods, adjudicate rival assessments when they do arise, or get ideas for extending and developing tools when entering brand new arenas.  Clearly, understanding the underlying foundations of one or another approach is crucial for a philosopher of statistics, but practitioners too should care, at least some of the time.

Tuesday, November 1, 2011

RMM-4: Special Volume on Stat Sci Meets Phil Sci

The article "Foundational Issues in Statistical Modeling: Statistical Model Specification and Validation*" by Aris Spanos has now been published in our special volume of the on-line journal, Rationality, Markets, and Morals (Special Topic: "Statistical Science and Philosophy of Science: Where Do/Should They Meet?").

Statistical model specification and validation raise crucial foundational problems whose pertinent resolution holds the key to learning from data by securing the reliability of frequentist inference. The paper questions the judiciousness of several current practices, including the theory-driven approach and the Akaike-type model selection procedures, arguing that they often lead to unreliable inferences. This is primarily because goodness-of-fit/prediction measures and other substantive and pragmatic criteria are of questionable value when the estimated model is statistically misspecified. Foisting one’s favorite model on the data often yields estimated models which are both statistically and substantively misspecified, but one has no way to delineate between the two sources of error and apportion blame. The paper argues that the error statistical approach can address this Duhemian ambiguity by distinguishing between statistical and substantive premises and viewing empirical modeling in a piecemeal way, with a view to delineating the various issues more effectively. It is also argued that Hendry’s general-to-specific procedure does a much better job in model selection than the theory-driven and the Akaike-type procedures, primarily because of its error statistical underpinnings.