When it comes to people—and policy—numbers are both powerful and perilous.
Tony Blair was usually relaxed and charismatic in front of a crowd. But an encounter with a woman in the audience of a London television studio in April, 2005, left him visibly flustered. Blair, eight years into his tenure as Britain’s Prime Minister, had been on a mission to improve the National Health Service. The N.H.S. is a much loved, much mocked, and much neglected British institution, with all kinds of quirks and inefficiencies. At the time, it was notoriously difficult to get a doctor’s appointment within a reasonable period; ailing people were often told they’d have to wait weeks for the next available opening. Blair’s government, bustling with bright technocrats, decided to address this issue by setting a target: doctors would be given a financial incentive to see patients within forty-eight hours.
It seemed like a sensible plan. But audience members knew of a problem that Blair and his government did not. Live on national television, Diana Church calmly explained to the Prime Minister that her son’s doctor had asked to see him in a week’s time, and yet the clinic had refused to take any appointments more than forty-eight hours in advance. Otherwise, physicians would lose out on bonuses. If Church wanted her son to see the doctor in a week, she would have to wait until the day before, then call at 8 a.m. and stick it out on hold. Before the incentives had been established, doctors couldn’t give appointments soon enough; afterward, they wouldn’t give appointments late enough.
“Is this news to you?” the presenter asked.
“That is news to me,” Blair replied.
“Anybody else had this experience?” the presenter asked, turning to the audience.
Chaos descended. People started shouting, Blair started stammering, and a nation watched its leader come undone over a classic case of counting gone wrong.
Blair and his advisers are far from the first people to run afoul of their own well-intentioned targets. Whenever you try to force the real world to do something that can be counted, unintended consequences abound. That’s the subject of two new books about data and statistics: “Counting: How We Use Numbers to Decide What Matters” (Liveright), by Deborah Stone, which warns of the risks of relying too heavily on numbers, and “The Data Detective” (Riverhead), by Tim Harford, which shows ways of avoiding the pitfalls of a world driven by data.
Both books come at a time when the phenomenal power of data has never been more evident. The covid-19 pandemic demonstrated just how vulnerable the world can be when you don’t have good statistics, and the Presidential election filled our newspapers with polls and projections, all meant to slake our thirst for insight. In a year of uncertainty, numbers have even come to serve as a source of comfort. Seduced by their seeming precision and objectivity, we can feel betrayed when the numbers fail to capture the unruliness of reality.
The particular mistake that Tony Blair and his policy mavens made is common enough to warrant its own adage: once a useful number becomes a measure of success, it ceases to be a useful number. This is known as Goodhart’s law, and it reminds us that the human world can move once you start to measure it. Deborah Stone writes about Soviet factories and farms that were given production quotas, on which jobs and livelihoods depended. Textile factories were required to produce quantities of fabric that were specified by length, and so looms were adjusted to make long, narrow strips. Uzbek cotton pickers, judged on the weight of their harvest, would soak their cotton in water to make it heavier. Similarly, when America’s first transcontinental railroad was built, in the eighteen-sixties, companies were paid per mile of track. So a section outside Omaha, Nebraska, was laid down in a wide arc, rather than a straight line, adding several unnecessary (yet profitable) miles to the rails. The trouble arises whenever we use numerical proxies for the thing we care about. Stone quotes the environmental economist James Gustave Speth: “We tend to get what we measure, so we should measure what we want.”
The problem isn’t easily resolved, though. The issues around Goodhart’s law have come to haunt artificial-intelligence design: just how do you communicate an objective to your algorithm when the only language you have in common is numbers? The computer scientist Robert Feldt once created an algorithm charged with landing a plane on an aircraft carrier. The objective was to bring a simulated plane to a gentle stop, thus registering as little force as possible on the body of the aircraft. Unfortunately, during the training, the algorithm spotted a loophole. If, instead of bringing the simulated plane down smoothly, it deliberately slammed the aircraft to a halt, the force would overwhelm the system and register as a perfect zero. Feldt realized that, in his virtual trial, the algorithm was repeatedly destroying plane after plane after plane, but earning top marks every time.
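The mechanics of that loophole are easy to sketch. What follows is a toy illustration in Python, not Feldt’s actual simulator: a hypothetical “force sensor” stores its reading in a fixed-width register, so a sufficiently violent impact wraps around to zero, and the scoring function mistakes a crash for a perfect landing.

```python
# A hypothetical sketch of the loophole, not Feldt's actual code: the
# simulated force sensor keeps its reading in a fixed-width register,
# so a huge impact wraps around and reads as a tiny number.

SENSOR_BITS = 16                      # assume a 16-bit force register
SENSOR_MAX = 2 ** SENSOR_BITS         # readings wrap modulo this value

def recorded_force(true_force: int) -> int:
    """What the scoring system sees: the true force, modulo the register size."""
    return true_force % SENSOR_MAX

def landing_score(true_force: int) -> float:
    """Reward gentle landings: the lower the recorded force, the higher the score."""
    return 1.0 / (1.0 + recorded_force(true_force))

gentle = 120                 # a soft touchdown
violent = 3 * SENSOR_MAX     # slams the deck; the register wraps to zero

print(landing_score(gentle))   # about 0.008
print(landing_score(violent))  # 1.0, a "perfect" score for a destroyed plane
```

An optimizer that can see only the score has every incentive to find the violent landing; it is maximizing the proxy, not the thing the proxy was meant to stand for.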
Numbers can be at their most dangerous when they are used to control things rather than to understand them. Yet Goodhart’s law is really just hinting at a much more basic limitation of a data-driven view of the world. As Tim Harford writes, data “may be a pretty decent proxy for something that really matters,” but there’s a critical gap between even the best proxies and the real thing—between what we’re able to measure and what we actually care about.
Harford quotes the great psychologist Daniel Kahneman, who, in his book “Thinking, Fast and Slow,” explained that, when faced with a difficult question, we have a habit of swapping it for an easy one, often without noticing that we’ve done so. There are echoes of this in the questions that society aims to answer using data; a well-known example concerns schools. We might be interested in whether our children are getting a good education, but it’s very hard to pin down exactly what we mean by “good.” Instead, we tend to ask a related and easier question: How well do students perform when examined on some corpus of fact? And so we get the much lamented “teach to the test” syndrome. For that matter, think about our use of G.D.P. to indicate a country’s economic well-being. By that metric, a schoolteacher could contribute more to a nation’s economic success by assaulting a student and being sent to a high-security prison than by educating the student, owing to all the labor that the teacher’s incarceration would create.
One of the most controversial uses of algorithms in recent years, as it happens, involves recommendations for the release of incarcerated people awaiting trial. In courts across America, when defendants stand accused of a crime, an algorithm crunches through their criminal history and spits out a risk score, meant to help judges decide whether or not they should be kept behind bars until they can be tried. Using data about previous defendants, the algorithm tries to calculate the probability that an individual will re-offend. But, once again, there’s an insidious Kahnemanian swap between what we care about and what we can count. The algorithm cannot predict who will re-offend. It can predict only who will be rearrested.
Arrest rates, of course, are not the same for everyone. For example, Black and white Americans use marijuana at around the same levels, but the former are almost four times as likely to be arrested for possession. When an algorithm is built out of bias-inflected data, it will perpetuate bias-inflected practices. (Brian Christian’s latest book, “The Alignment Problem,” offers a superb overview of such quandaries.) That doesn’t mean a human judge will do better, but the bias-in, bias-out problem can sharply limit the value of these gleaming, data-driven recommendations.
Shift a question on a survey, even subtly, and everything can change. Around twenty-five years ago in Uganda, the active labor force appeared to surge by more than ten per cent, from 6.5 million individuals to 7.2 million. The increase, as Harford explains, arose from the wording of the labor-force survey. In previous years, people had been asked to list their primary activity or job, but a new version of the survey asked individuals to include their secondary roles, too. Suddenly, hundreds of thousands of Ugandan women, who were primarily housewives but also put in long hours at additional jobs, counted toward the total.
To simplify the world enough that it can be captured with numbers means throwing away a lot of detail. The inevitable omissions can bias the data against certain groups. Stone describes an attempt by the United Nations to develop guidelines for measuring levels of violence against women. Representatives from Europe, North America, Australia, and New Zealand put forward ideas about types of violence to be included, based on victim surveys in their own countries. These included hitting, kicking, biting, slapping, shoving, beating, and choking. Meanwhile, some Bangladeshi women proposed counting other forms of violence—acts that are not uncommon on the Indian subcontinent—such as burning women, throwing acid on them, dropping them from high places, and forcing them to sleep in animal pens. None of these acts were included in the final list. When surveys based on the U.N. guidelines are conducted, they’ll reveal little about the women who have experienced these forms of violence. As Stone observes, in order to count, one must first decide what should be counted.
Those who do the counting have power. Our perspectives are hard-coded into what we deem worth considering. As a result, omissions can arise in even the best-intentioned data-gathering exercises. And, alas, there are times when bias slips under the radar by deliberate design. In 2020, a paper appeared in the journal Psychological Science that examined how I.Q. was related to a range of socioeconomic measures for countries around the world. Unfortunately, the paper was based on a data set of national I.Q. estimates co-published by the English psychologist Richard Lynn, an outspoken white supremacist. Although we should be able to assess Lynn’s scientific contributions independently of his personal views, his data set of I.Q. estimates contains some suspiciously unrepresentative samples for non-European populations. For instance, the estimate for Somalia is based on one sample of child refugees living in a camp in Kenya. The estimate for Haiti is based on a sample of a hundred and thirty-three rural six-year-olds. And the estimate for Botswana is based on a sample of high-school students tested in South Africa in a language that was not their own. Indeed, the psychologist Jelte Wicherts demonstrated that the best predictor of whether an I.Q. sample for an African country would be included in Lynn’s data set was, in fact, whether that sample’s average was below the global average. Psychological Science has since retracted the paper, but numerous other papers and books have used Lynn’s data set.
And, of course, I.Q. poses the familiar problems of the statistical proxy; it’s a number that hopelessly fails to offer anything like a definitive, absolute, immutable measure of “intelligence.” Such limitations don’t mean that it’s without value, though. It has enormous predictive power for many things: income, longevity, and professional success. Our proxies can still serve as a metric of something, even if we find it hard to define what that something is.
It’s impossible to count everything; we have to draw the line somewhere. But, when we’re dealing with fuzzier concepts than the timing of medical appointments and the length of railroad tracks, line-drawing itself can create trouble. Harford gives the example of two sheep in a field: “Except that one of the sheep isn’t a sheep, it’s a lamb. And the other sheep is heavily pregnant—in fact, she’s in labor, about to give birth at any moment. How many sheep again?” Questions like this aren’t just the stuff of thought experiments. A friend of mine, the author and psychologist Suzi Gage, married her husband during the covid-19 pandemic, when she was thirty-nine weeks pregnant. Owing to the restrictions in place at the time, the number of people who could attend her wedding was limited to ten. Newborn babies would count as people for such purposes. Had she gone into labor before the big day, she and the groom would have had to disinvite a member of their immediate families or leave the newborn at home.