In Stunning Upset, Real World Defeats Data


Between 8pm and midnight ET last night, The New York Times’ presidential prediction swung from 81% Clinton to 95% Trump. Slate’s VoteCastr forecast of voter turnout flailed on all seven of their key states. Mike Murphy, long-time political pundit, said on MSNBC, “Tonight, data died.” And Sam Wang at the Princeton Election Consortium, who called the election for Clinton on October 18th, now has to eat a bug.

There are only three ways to criticize analysis you disagree with. One, go after the analyst. Two, the methodology. Three, and most common, both. All of these have already happened.

Last week, Ryan Grim at The Huffington Post accused Nate Silver of abandoning polling for punditry. This morning, HBR bemoaned the sorry state of poll data in an age of cell phones and caller ID. Also today, The Chicago Tribune published a eulogy for the entire “profession of prognostication.”

But take a cleansing breath. This has happened before. On the way from natural philosophy to modern science, empiricism took a detour through alchemy.

In his excellent book Extraordinary Popular Delusions and the Madness of Crowds, historian Charles Mackay recounts tales of collective convictions gone wrong. His chapter on alchemy includes scores of mystics, from Geber in the 700s to Count Cagliostro a thousand years later, who claimed to possess the legendary philosopher’s stone.

In all that time, no one succeeded in turning base metals into gold. Some went mad in their pursuits. Princes lost fortunes financing endless experiments. But the thing that put an end to alchemy wasn’t a ban. It was an evolution into something better. The answer to pseudoscience wasn’t less science, but more.

It seems that data science is in its awkward alchemy phase. And the answer now, as it was then, isn’t less science, but more.

If changes in the real world cause selection bias and sampling problems for traditional data gathering tactics, what new ways would work better? If data-driven model-tuning produces more useful results but obscures the model itself, what new auditing methods would fix this?

Among today’s pollsters and data-modelers there may be Nicolas Flamels, willfully leading the credulous astray with techno-wizardry. But there are modern-day Roger Bacons, too, who fix their errors, advocate for the scientific method and advance the practice of actual data science so we can all understand the real world better tomorrow than we do today.

How Much Data Did Uber Buy for $1.2 Billion?

Uber lost at least (at least!) $1.2 billion in the first six months of 2016, according to Bloomberg News. The majority of this loss was due to driver subsidies. When startups lose money on purpose it’s called an investment. But an investment in what, exactly?

This question came up in email correspondence between Justin Fox of Bloomberg and Melissa Schilling of NYU’s Stern School of Business. Prof Schilling asserted that “[t]here are two main reasons for tech companies to lose money early to make money later, and neither of them apply to Uber.”

The first is “[u]pfront investments in fixed costs that are going to pay off with scale.” But, Schilling says, Uber’s fixed costs are low. Most of its losses are due to subsidizing drivers, a variable cost that won’t go down as Uber gains more drivers.

The second is “[s]ubsidizing a large installed base to “win” the market.” But the switching costs for both riders and drivers to use another service are low. In fact, many Uber drivers in the US already also drive for direct competitor Lyft.

But there is a long-term asset Uber gets each time a driver ferries a rider from point A to point B — the data.

Uber is like a card-counter at a blackjack table. It gains information to improve its future bets from every hand it plays. But unlike blackjack, where every player sees the other players’ cards, in the ride share game whoever gets the fare first shuts everyone else out of the hand. Only that firm gets the data from that ride, building a unique stock of data capital.

So, imagine for a moment that Uber is buying data to improve its future ability to compete. What does that data cost? And is it worth it?

Uber is a private company, so information about its financials and ridership is a little thin on the ground. But we can use rough numbers from Bloomberg’s reporting and a few heroic assumptions to make some educated guesses.

We know that Uber lost at least $1.2 billion in the first six months of 2016. We also know that it provided one billion rides in roughly the same time frame (actually between Dec 24, 2015, when it delivered its billionth ride, and June 18, 2016, when it delivered its 2 billionth). If we treat the whole loss as the price of the data, that’s $1.20 per ride record.
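The back-of-the-envelope math above can be sketched in a few lines of Python. The loss and ride counts are the rough figures quoted in this post. The `subsidy_share` values are purely illustrative assumptions (the post only says "the majority" of the loss went to driver subsidies), and the function name is made up for this example.

```python
# Rough figures quoted in the post (from Bloomberg's reporting).
LOSS_USD = 1.2e9        # reported minimum loss, first six months of 2016
RIDES = 1_000_000_000   # rides between the 1 billionth and the 2 billionth


def cost_per_record(loss_usd: float, rides: int, subsidy_share: float = 1.0) -> float:
    """Dollars of loss attributed to each ride record.

    subsidy_share is the fraction of the loss we treat as "buying" data.
    The default of 1.0 (the whole loss) matches the post's heroic assumption.
    """
    return loss_usd * subsidy_share / rides


per_record = cost_per_record(LOSS_USD, RIDES)
print(f"Whole loss counted: ${per_record:.2f} per ride record")

# Hypothetical shares, only to show how sensitive the estimate is.
for share in (0.5, 0.75):
    estimate = cost_per_record(LOSS_USD, RIDES, share)
    print(f"{share:.0%} of loss counted: ${estimate:.2f} per ride record")
```

Counting the whole loss gives the $1.20 figure. Counting only part of it lowers the price per record proportionally, so $1.20 is best read as an upper bound under these assumptions.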

Considering that personal transportation is a 10 trillion dollar market worldwide, about $1.2 billion for a billion unique, detailed records is probably a pretty good deal.



The Law As Data Capital

[Image: screenshot of search results]

Harvard Law School and Ravel Law, a startup, are digitizing Harvard’s complete collection of US case law going back to 1647. That’s 43,000 volumes, totaling about 40 million pages.

This new trove of data capital costs millions to create. It will be free to search online, but analysis of the data can only be had for a fee. There’s a big lesson in data capital here: different uses of the same data can command different prices.

Try a search for “privacy” (pictured above). You get all the rulings that mention privacy and a visual guide to their relationships over time and by citation. This use is free.

Now imagine that you’re a lawyer about to argue a case pertaining to privacy in front of a specific judge. You’d like to know which of these rulings the judge has cited, how they figure into the judge’s own rulings, and how he or she compares to other judges on this issue. That analysis is a different end-product from simple search results and you’ll have to pay for it.

Because every piece of data is non-rivalrous (it can be used in many searches and analyses simultaneously) it can be freely available in one way and available for a fee in another.

Ravel’s business plan reveals yet another aspect of data capital. After eight years, the entire database will be available to anyone for any analysis. How will the company stay in business?

By that point, Ravel should have been able to create new data by observing how the case law data was searched and analyzed. If this unique stock of data capital belongs to Ravel alone, it can create new digital services that attorneys can only get from them, maintaining a competitive advantage.


Data Is Capital, Not Money

Capital and money might seem like the same thing, but they’re not. A lot of executives I talk to about data capital confuse the two — even MBAs! So, let’s clarify the difference between capital and money, and why it matters when it comes to data.

Take capital first. Capital, along with labor and land, is an economic factor of production in a good or service. If you don’t have enough of these basic inputs, you can’t make the thing or deliver the service you have in mind.

Greg Mankiw, professor of economics at Harvard, uses an apple-producing firm to illustrate these factors in his Principles of Economics, the gold standard for Econ 101 textbooks.


Land is pretty easy to picture. It’s the apple orchard. The same for labor. It’s all the work that goes into tending the orchard, picking the apples, packaging them for sale, and so on. But capital is a bit harder to see. The capital of an apple farm includes ladders, tractors, and warehouses used in growing, harvesting, and packaging apples for sale.

In other words, capital is any produced good which is a necessary input for creating another good or service.

Financial capital is also a produced good. It’s not a natural resource. It has to be made somehow. And you make it by selling your apples at a price above your costs. You can also increase your financial capital beyond what you can make yourself by borrowing it from someone who already has a whole lot of it, like a bank.

So, yes, a firm’s capital can include money. Money is a necessary input into most production processes. But money is different from all other kinds of capital, including capital equipment or data.

In order for something to be money, it must be both a store of value and a means of exchange. The Jackson in your wallet (soon Tubman) is good at being money because 1) its value tends to stay pretty stable (a twenty will buy tomorrow pretty much what it buys today), and 2) you can exchange it for things you want more than twenty bucks in your pocket.

Anything with these characteristics can be money. There’s a story, made famous among economists by Milton Friedman, about the islands of Yap whose inhabitants used limestone discs for money. Some of the discs were huge, as big as 12 feet in diameter, and they were cut from the limestone on a nearby island. This is difficult to do, so the number of discs in circulation grew slowly which helped existing discs keep their value.


The discs were the recognized form of currency in the community, so you could buy things with them. But since they were so big, when you paid someone, the community simply recognized the change in ownership, and the disc stayed in your front yard or wherever you dropped it when you brought it home. The discs may not have been convenient, but they were money.

Data is different. To see how, consider a specific data set. Let’s say you have web browsing data on everyone in the richest zip code in the US (which is 10104, according to Experian) for the last year. What’s the value of this data? What’s it worth?

The fact that you immediately want to define its worth in terms of dollars, euros, or renminbi is the first tell. While the data may be valuable, it is not in itself a store of value. Its worth is what the market is willing to bear. It goes up or down depending on what potential buyers are willing to pay, like a house or a Van Gogh.

In addition to being a poor store of value, the data itself is not a unit of exchange. You can’t walk into a Starbucks and buy a latte with a megabyte of your one-percenter browsing data. You can’t pay for things with it.

One objection to this last point is that online we do, in fact, pay with data. We use Google and Facebook without making traditional payments. We get these services in exchange for our data. True, but that’s barter. Which is how you trade when you don’t have money.

This distinction between data-as-capital and data-as-money matters because of the harsh competitive reality of data.

Contrary to popular belief, data is not abundant. Data consists of countless scarce, even unique observations. If the competition digitizes and datafies interactions with your customers before you do, they get that data and you don’t. They can then create algorithms and analytical services you can’t. To fix this, you’d have to go back in time. And no amount of money can make that happen.

Not 30 Posts In 30 Days — Not Even Close

[Image: GoRuck T-shirt]

Here’s to experiments: the winners (electricity), the losers (alchemy), and writing. Instead of the 30 posts I said I’d write in June, I wrote 10. These garnered 419 views from 255 visitors, or about 1.6 views per visitor.

To all of you who came and read, thank you.

They say you should learn from your mistakes. But they neglect to mention that so many people have made so many mistakes so far that the likelihood you’ve learned something new is zero. So, here are the things that are already known which I needlessly demonstrated for myself:

  1. Under promise, over deliver. Not the other way around. See T-shirt above.
  2. Writing because you have something to say is fun. Blurting junk because you have to write is not.
  3. The more you say, the more likely you are to say something stupid.

The worst part is that these aren’t just things known by someone somewhere sometime before this experiment. I knew these things. The decision research on overconfidence is clear. Anything motivated by a quota — even sex — loses its enjoyment. Pick your favorite wisdom literature and it’ll tell you to keep your mouth shut.

But some things have to be learned personally, and often re-learned, for them to stick. It’s worth remembering this as we move into the age of big data.

[Image: 1933 World’s Fair poster]

In the early twentieth century, the pace of technological advancement in science and industry was so great that the perfectibility of humankind seemed at hand. The motto of the 1933 World’s Fair summed it up:

Science Finds — Industry Applies — Man Conforms

The overall theme of the fair was “A Century Of Progress”, mainly through technology. Taking place in the heart of the Great Depression and a mere fifteen years after the Great War, this was clearly the triumph of hope over experience. As we now know, progress suffered serious setbacks in the decades that followed. Believing technology can eliminate human weakness is a human weakness itself.

So, I’ll continue to write, but at a more measured pace. In the meantime, let’s all agree to avoid a 1933 World’s Fair for big data.

Data Can’t Prove Happiness

On a recent trip, I stand in line at an airport Starbucks to get a hit. In front of me is an older woman, fussily put together and a bit anxious. She turns around and asks, “Do you come to this airport often?”

This is either the worst pick-up line ever or a precursor to a question that will reveal I don’t come here often enough.

“Occasionally,” I say.

“Is there a Dunkin’ Donuts here?”

“This is Boston. There has to be.”

But, I tell her, I don’t know for sure. Sighing, she turns around and says it’s probably better to just stay here.

In this day and age there’s no reason not to have the overpriced coffee of your choice, so I get out my phone and look it up. There’s an app for that. Heck, there’s a hundred apps for that.

“Excuse me,” I say. “There is a Dunkin’ Donuts in this terminal, but it’s a bit of a hike from here.”

She looks at the phone, looks at me, and says, “Oh. You’re one of those people.” And turns back around.

She’s right. There’s a certain kind of thinking that comes along with being a data person. The data exists. If you don’t know where, there’s probably data about that. The amount of effort to get the data and use it is probably lower than the penalty you’ll pay for not doing so.

But there is a risk. Thinking that life is a long series of optimizations can turn you into a social idiot. Sometimes people don’t want to know their options. Sometimes they don’t want the best solution. They just want comfort that what they’re doing is ok.

The key is to know one case from the other, and optimize accordingly.