How To Lie With Statistics

I ripped off the title of this column from an amusing and timeless little 1954 book written by Darrell Huff, which I highly recommend to anyone who has ever heard an "expert" spout off some numerical "fact" that sounded a little fishy. I thought it would be a particularly relevant topic today of all days -- May 15. This is of course the magical cutoff date by which farmers must have their corn planted or else it will instantly die and wither in the fields. Or so it may have sounded during the past month of cold, wet, late planting across the U.S. Corn Belt, as I and every other new-crop corn market bull constantly mentioned late planting.
This chart shows the correlation between planting early and percent of trendline yield achieved. (Chart by Elaine Kub)
Well, the corn won't instantly wither, but the general idea (as highlighted by DTN Contributing Agronomist Dan Davidson in two recent DTN stories) is that the later a corn or soybean crop gets planted, the less ultimate yield potential it may have. As corn planting gets delayed into late May, depending on where it's being planted and what its original yield expectation was, corn can lose up to 2.5% of its yield potential per extra day of delay. Soybeans can experience 0.5% of yield loss per day of delay in late May. Dan quoted University of Illinois agronomist Emerson Nafziger: "The total losses if planting is delayed to May 31 would be about 25% for corn and 15% for soybeans."
Those kinds of losses, if they come to pass and are widespread in 2013, would have a profound influence on the new-crop markets. Yield losses would trim the overall size of expected production and potentially lead to tighter ending stocks. USDA even included this note in its most recent Supply and Demand report: "The 2013/14 corn yield is projected at 158.0 bushels per acre, 5.6 bushels below the weather-adjusted trend presented at USDA's Agricultural Outlook Forum in February. The slow start to this year's planting and the likelihood that progress by mid-May will remain well behind the 10-year average reduce prospects for yields."
But not everyone is convinced. As willing as I have been to say some additional risk premium is justified in new-crop prices, there have been other market commenters vehemently opposed to the idea that late planting will matter to yield. "Planting date just does not matter that much," they say. "It's summer weather that makes the crop." The market at large apparently agrees with them; December corn futures and November soybean futures have been trading in a downwards or sideways trend all through the month of May.
I want to show you that those bearish commenters have been using statistics to lie to themselves. For instance, it was pointed out to me that in 1984, Iowa and Illinois experienced a late April snowstorm that severely delayed planting (nationwide progress was 24 percentage points behind the five-year average in mid-May) and yet nationwide average yield ultimately came in at trendline. That's cherry-picking one anecdote and ignoring the vast bulk of the data. I could as easily choose to emphasize 1991, when planting was only 20 points behind the five-year average in mid-May and nationwide yield turned out 7% below trend, or 1983, when planting was only 13 points behind and yield came in 26 points below trend.
OK, so to escape my allegations of cherry-picking, the bears will select a slightly larger sample. They may look at the last 12 instances when less than 50% of the nation's corn was planted by May 15 and note that in 8 of those 12 years, nationwide yield came in at trendline or above. Now that sounds like a pretty convincing statistical story. But what they're saying is we can plant corn in late May and have a 67% chance of improving our yields. Whew, I bet Illinois farmers are glad they waited this long.
The statistical crime being committed with that "fact," as Darrell Huff would have pointed out, is poor sampling. If you flip a coin 12 times, knowing there is a 50/50 chance of heads or tails, you still probably won't get exactly 6 heads and 6 tails. If you flip the same coin 1,000 times, or any sufficiently large number of times (i.e., once the sample is large enough to support a statistically meaningful conclusion), you will very likely get close to a half-and-half split between heads and tails. It's important to use an unbiased sample that is large enough to permit a real conclusion.
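For readers who want to check the coin-flip arithmetic, here is a minimal sketch in Python. It assumes a perfectly fair coin and independent flips, and it treats "yield at or above trendline" as a coin flip purely for illustration, which is my simplifying assumption and not a claim about how crops behave. The function names are made up for this example.

```python
from math import comb, sqrt

def prob_exactly(k, n, p=0.5):
    """Binomial probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def prob_at_least(k, n, p=0.5):
    """Binomial probability of k or more successes in n independent trials."""
    return sum(prob_exactly(i, n, p) for i in range(k, n + 1))

def share_std_error(n, p=0.5):
    """Standard deviation of the observed share of successes after n trials."""
    return sqrt(p * (1 - p) / n)

# Chance of a perfect 6-and-6 split in 12 fair flips.
p_six_of_twelve = prob_exactly(6, 12)

# Chance of 8 or more heads in 12 fair flips (ASSUMPTION: the "8 of 12 years"
# tally is modeled as fair coin flips, only to show how easily luck produces it).
p_eight_plus_of_twelve = prob_at_least(8, 12)

print(f"P(exactly 6 heads in 12 flips) = {p_six_of_twelve:.3f}")
print(f"P(8 or more heads in 12 flips) = {p_eight_plus_of_twelve:.3f}")

# Typical wobble in the share of heads shrinks as the number of flips grows.
for flips in (12, 1000):
    print(f"{flips} flips: share of heads varies by about +/- {share_std_error(flips):.3f}")
```

Under those assumptions, an even split in 12 flips happens less than a quarter of the time, and a result as lopsided as 8 of 12 turns up by luck alone roughly one time in five. That is why a 12-year tally cannot carry much weight by itself.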
And really, it's not that hard to get a large sample of data in this instance. The corn planting progress as of the 19th week of the year is freely available from the National Agricultural Statistics Service going all the way back to 1980, as is the nationwide average corn yield (going back even farther than that). So I did the quick-and-dirty correlation and there is definitely a noticeable, positive relationship between having planting progress ahead of the five-year average and the ultimate percentage of trendline yield that is achieved. See the chart accompanying this story on DTN online or at….
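A quick-and-dirty correlation of this kind can be sketched in a few lines of Python. This is only an illustration of the general technique (a Pearson correlation coefficient), not necessarily the exact calculation behind the chart. The numbers in the example are made-up placeholders, not NASS data, and the variable names are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points each")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one series is constant")
    return cov / sqrt(var_x * var_y)

# PLACEHOLDER values for illustration only (NOT real NASS figures):
# planting progress in week 19 minus the five-year average, in percentage points,
# paired with the final yield as a percent of trendline for the same year.
progress_vs_average = [-20, -10, 0, 10, 20]
pct_of_trend_yield = [94, 99, 98, 103, 104]

r = pearson_r(progress_vs_average, pct_of_trend_yield)
print(f"correlation (placeholder data): {r:.2f}")
```

To repeat the exercise with real figures, swap the two placeholder lists for one pair of values per year. A coefficient above zero means that years with planting ahead of average tended to finish with a higher percentage of trendline yield.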
Of course there are outliers, both above and below what this one variable would suggest for ultimate yield, because many other influences come into play besides the planting date. Someone might also accuse me of statistically "lying" through spurious correlation. The positive relationship alone does not imply causation. As Huff points out, there can be a close relationship between the salaries of Presbyterian ministers in Massachusetts and the price of rum in Havana, although one does not cause the other, and they may both be simultaneously influenced by a third factor, like inflation.
But fortunately, you don't have to rely on my word to believe there really is causation between late planting and yield loss. In real life, it took us 33 years to get this statistically significant sample size and even at that, it's impossible to control for the influences of weather or any other factor from one year to the next. But in the scientific world, there is a way to control experiments and to determine just which variables are responsible for yield differences. Land-grant universities all over this great nation of ours (and in other nations) have been doing it for decades. They can grow identical plants alongside each other at several sites and accumulate dozens of "site years" in just one growing year, and they can isolate the effects of yield loss due to late planting. Invariably, they conclude that grain yield is maximized by planting early (although the definitions of "early" will depend on which state or region is under consideration). For an example of a good, recent study that was well controlled and used a significant sample size over several years, visit:…
And that's why every agronomist you'll ever talk to will urge a farmer to plant early, and why most farmers believe late planting will damage yields. It's also why most farmers in 2013 are indeed quite anxious about summer weather and ultimate yield, and why many are frustrated by the lack of response in new-crop futures prices so far this spring.
To close, I will concede that we have no way of knowing what July and August weather will be like and, yes, it is possible the summer of 2013 will be abnormal and U.S. crop yields will respond abnormally to late planting. But at this point, I and the rest of the new-crop market bulls have some statistical reasons to believe some extra risk premium is justified in prices. Huff includes a series of quotes about statistics in the front of his book. I liked this one, from Artemus Ward: "It ain't so much the things we don't know that get us in trouble. It's the things we know that ain't so."
This link shows a relationship between planting dates and yield:…
Elaine Kub is the author of Mastering the Grain Markets
