Wednesday, December 2, 2009

Simpson's Paradox on WSJ

Interesting article. Copied Andrew's comments here:

Simpson's Paradox not always such a paradox

By Andrew Gelman on December 3, 2009 9:10 AM

I'm on an email list of media experts for the American Statistical Association: from time to time a reporter contacts the ASA, and their questions are forwarded to us. Last week we got a question from Cari Tuna about the following pattern she had noticed:

Measured by unemployment, the answer appears to be no, or at least not yet. The jobless rate was 10.2% in October, compared with a peak of 10.8% in November and December of 1982.

But viewed another way, the current recession looks worse, not better. The unemployment rate among college graduates is higher than during the 1980s recession. Ditto for workers with some college, high-school graduates and high-school dropouts.

So how can the overall unemployment rate be lower today but higher among each group?

Several of us sent in answers. Call us media chasers or educators of the populace; whatever. Luckily I wasn't the only one to respond: I sent in a pretty lame example that I'd recalled from an old statistics textbook, whereas Xiao-Li Meng, Jeff Witmer, and others sent in more up-to-date items that Ms. Tuna had the good sense to use in her article.

There's something about this whole story that bothers me, though, and that is the implication that the within-group comparisons are real and the aggregate is misleading. As Tuna puts it:

The Simpson's Paradox in unemployment rates by education level is but the latest example. At a glance, the unemployment rate suggests that U.S. workers are faring better in this recession than during the recession of the early 1980s. But workers at each education level are worse off . . .

This discussion follows several examples where, as the experts put it, "The aggregate number really is meaningless. . . . You can't just look at the overall rate. . . ."

Here's the problem. Education categories now do not represent the same slices of the population that they did in 1976. A larger proportion of the population are college graduates (as is noted in the linked news article), and thus the comparison of college grads (or any other education category) from 1982 to the college grads today is not quite an apples-to-apples comparison. Being a college grad today is less exclusive than it was back then.
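To make the arithmetic concrete, here is a minimal Python sketch of how this can happen. The shares and rates are made-up numbers chosen for illustration, not the actual unemployment figures, and the names `then`, `now`, and `overall_rate` are hypothetical. The overall rate is just the share-weighted average of the group rates, so it can fall even when every group rate rises, as long as the labor force shifts toward the lower-unemployment groups.

```python
# Hypothetical numbers for illustration only; these are NOT the real
# unemployment figures discussed in the article.
# Each entry: education group -> (share of labor force, unemployment rate).
then = {
    "no diploma":   (0.30, 0.15),
    "high school":  (0.40, 0.10),
    "some college": (0.15, 0.07),
    "college grad": (0.15, 0.03),
}
now = {
    "no diploma":   (0.10, 0.16),
    "high school":  (0.30, 0.11),
    "some college": (0.30, 0.08),
    "college grad": (0.30, 0.04),
}


def overall_rate(groups):
    """Overall rate as the share-weighted average of the group rates."""
    return sum(share * rate for share, rate in groups.values())


if __name__ == "__main__":
    # Every group's rate is higher "now" than "then" ...
    for name in then:
        print(f"{name:>12}: {then[name][1]:.1%} -> {now[name][1]:.1%}")
    # ... yet the overall rate is lower, because the shares have shifted
    # toward the groups with lower unemployment.
    print(f"{'overall':>12}: {overall_rate(then):.1%} -> {overall_rate(now):.1%}")
```

With these invented numbers, each group's rate goes up by one percentage point while the overall rate drops from about 10.0% to about 8.5%. The sketch only shows that the two comparisons can disagree; it says nothing about which one is the more meaningful.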

In this sense, the unemployment example is different in a key way from the other Simpson's paradox examples in the news article. In those other examples, the within-group comparison is clean, while the aggregate comparison is misleading. In the unemployment example, it's the aggregate that has a cleaner interpretation, while the within-group comparisons are a bit of a mess.

As a statistician and statistical educator, I think we have to be very careful about implying that the complicated analysis is always better. In this example, the complicated analysis can mislead! It's still good to know about Simpson's paradox, to understand how the within-group and aggregate comparisons can differ--but I think it's highly misleading in this case to imply that the aggregate comparison is wrong in some way. It's more of a problem of groups changing their meaning over time.

in reference to:

"When Combined Data Reveal the Flaw of Averages: In a Statistical Anomaly Dubbed Simpson's Paradox, Aggregated Numbers Obscure Trends in Job Market, Medicine and Baseball"
- http://online.wsj.com/article/SB125970744553071829.html
