The myth of data mining
Why men don’t buy beer and diapers at the same time, and what we can still learn from urban legends.
“It’s in there. The discovery, the fact, the one piece of the puzzle that will blow away the competition, propel your company to the top, and stick a ‘VP’ after your name. It’s right there, in your database.” (1) So there it is, the wonderful new world of data mining: lying in our databases is knowledge that will not only scare the competition stiff, but at the same time bring us glittering careers.
And in fact, you do read of such successes. For example, Wal-Mart, the world’s largest retailer, supposedly found out that there are certain times at which beer and diapers sell particularly well together – when on Friday evenings young men make a last dash to the supermarket to get beer and their wives call after them, “Pick up some diapers, too, honey!” (2)
“Some of the ways Wal-Mart managers found to exploit their findings are legendary. One such legend is the story, “diapers and beer”. Wal-Mart discovered through data mining that the sales of diapers and beer were correlated on Friday nights. It determined that the correlation was based on working men who had been asked to pick up diapers on their way home from work. On Fridays the men figured they deserved a six-pack of beer for their trouble; hence the connection between beer and diapers. By moving these two items closer together, Wal-Mart reportedly saw the sales of both items increase geometrically.” (3)
A version with a slightly different view of the roles involved suggests that the men are sent to the supermarket for the diapers and, because there’s no time left to go to a bar, take beer home with them.
In all versions of the story, Wal-Mart then puts the diapers closer to the beer and makes a fortune. (4)
It never happened like that, though, and the story belongs in the category of urban legends. Nevertheless, the tale is a good one and we can learn something from it (“never let the truth get in the way of a good story”). I myself have often been tempted to invent stories like this in order to express something in a way that everyone can understand. When we went hunting for treasure in the data at Gühring, Metabo or Sandoz using the data mining system that we built in our university days, we discovered all kinds of conspicuous features that we couldn’t interpret because we lacked the background knowledge. We showed our results to the people at those companies, and they confirmed that we had come up with valuable indicators. Making such results comprehensible to third parties through concrete examples, however, always proved at least as complicated as the data treasure hunt itself.
What the diapers-and-beer example should tell us is this: there are algorithms that we can use to recognize associations in data automatically. Whether those associations yield insights that immediately make the competition go pale with fear, on the other hand, is another question entirely.
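The kind of association such algorithms look for can be made concrete with three measures commonly used in association rule mining: support, confidence and lift. What follows is a minimal sketch in Python, assuming a handful of invented shopping baskets. The data, the item labels and the function names `support` and `rule_metrics` are illustrative assumptions, not Wal-Mart figures and not the system we built at university.

```python
from fractions import Fraction

# Hypothetical toy baskets, invented purely for illustration (not real sales data).
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
    {"bread", "chips"},
    {"diapers", "bread"},
]


def support(itemset, baskets):
    """Share of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in baskets if itemset <= basket)
    return Fraction(hits, len(baskets))


def rule_metrics(antecedent, consequent, baskets):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    both = support(set(antecedent) | set(consequent), baskets)
    ante = support(antecedent, baskets)
    cons = support(consequent, baskets)
    if ante == 0 or cons == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    confidence = both / ante   # how often the consequent appears given the antecedent
    lift = confidence / cons   # above 1: the items co-occur more often than chance suggests
    return both, confidence, lift


s, c, l = rule_metrics({"diapers"}, {"beer"}, baskets)
print(f"support={float(s):.3f} confidence={float(c):.3f} lift={float(l):.3f}")
```

In this toy data the rule "diapers, therefore beer" has a lift slightly above 1, which is exactly the sort of conspicuous feature an algorithm will flag. Whether it means anything, and why, is something the numbers alone cannot tell us.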
(1) Reese Hedberg, S., The Data Gold Rush, Byte 20 (1995) 10, p. 83.
(2) Just how widespread this legend is has been documented by, among others, Fisk, D., Beer and Nappies – A Data Mining Urban Legend, accessed on January 25, 2006.
(3) Hospel, H., Down the Rabbit Hole, Executive Update Online No. 3/2001, accessed on January 25, 2006.
(4) A persuasive version of how the legend arose can be found in Fawcett, T., Origin of “diapers and beer”, accessed on January 25, 2006.