Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy
Authors: Cathy O'Neil
But data studies that track employees’ behavior can also be used to cull a workforce. As the 2008 recession ripped through the economy, HR officials in the tech sector started to look at those Cataphora charts with a new purpose. They saw that some workers were represented as big dark circles, while others were smaller and dimmer. If they had to lay off workers, and most companies did, it made sense to start with the small and dim ones on the chart.
Were those workers really expendable? Again we come to digital phrenology. If a system designates a worker as a low idea generator or weak connector, that verdict becomes its own truth. That’s her score.
Perhaps someone can come in with countervailing evidence. The worker with the dim circle might generate fabulous ideas but not share them on the network. Or perhaps she proffers priceless advice over lunch or breaks up the tension in the office with a joke. Maybe everybody likes her. That has great value in the workplace. But computing systems have trouble finding digital proxies for these kinds of soft skills. The relevant data simply isn’t collected, and anyway it’s hard to put a value on them. They’re usually easier to leave out of a model.
So the system identifies apparent losers. And a good number of them lost their jobs during the recession. That alone is unjust. But what’s worse is that systems like Cataphora’s receive minimal feedback data. Someone identified as a loser, and subsequently fired, may have found another job and generated a fistful of patents. That data usually isn’t collected. The system has no inkling that it got one person, or even a thousand people, entirely wrong.
That’s a problem, because scientists need this error feedback—in this case the presence of false negatives—to delve into forensic analysis and figure out what went wrong, what was misread, what data was ignored. It’s how systems learn and get smarter. Yet as we’ve seen, loads of WMDs, from recidivism models to teacher scores, blithely generate their own reality. Managers assume that the scores are true enough to be useful, and the algorithm makes tough decisions easy. They can fire employees and cut costs and blame their decisions on an objective number, whether it’s accurate or not.
Cataphora remained small, and its worker evaluation model was a sideline—much more of its work was in identifying patterns of fraud or insider trading within companies. The company went out of business in 2012, and its software was sold to a start-up, Chenope. But systems like Cataphora’s have the potential to become true WMDs. They can misinterpret people, and punish them, without any proof that their scores correlate to the quality of their work.
This type of software signals the rise of WMDs in a new realm.
For a few decades, it may have seemed that industrial workers and service workers were the only ones who could be modeled and optimized, while those who trafficked in ideas, from lawyers to chemical engineers, could steer clear of WMDs, at least at work. Cataphora was an early warning that this will not be the case. Indeed, throughout the tech industry, many companies are busy trying to optimize their white-collar workers by looking at the patterns of their communications. The tech giants, including Google, Facebook, Amazon, IBM, and many others, are hot on this trail.
For now, at least, this diversity is welcome. It holds out the hope that workers rejected by one model might be appreciated by another. But eventually, an industry standard will emerge, and then we’ll all be in trouble.
In 1983, the Reagan administration issued a lurid alarm about the state of America’s schools. In a report called A Nation at Risk, a presidential panel warned that a “rising tide of mediocrity” in the schools threatened “our very future as a Nation and a people.” The report added that if “an unfriendly foreign power” had attempted to impose these bad schools on us, “we might well have viewed it as an act of war.”
The most noteworthy signal of failure was what appeared to be plummeting scores on the SATs. Between 1963 and 1980, verbal scores had fallen by 50 points, and math scores were down 40 points. Our ability to compete in a global economy hinged on our skills, and they seemed to be worsening.
Who was to blame for this sorry state of affairs? The report left no doubt about that. Teachers. The Nation at Risk report called for action, which meant testing the students and using the results to zero in on the underperforming teachers. As we saw in the Introduction, this practice can cost teachers their jobs. Sarah Wysocki, the teacher in Washington who was fired after her class posted surprisingly low scores, was the victim of such a test. My point in telling that story was to show a WMD in action, how it can be arbitrary, unfair, and deaf to appeals.
But along with being educators and caretakers of children, teachers are obviously workers, and here I want to delve a bit deeper into the models that score their performance, because they might spread to other parts of the workforce. Consider the case of Tim Clifford. He’s a middle school English teacher in New York City, with twenty-six years of experience. A few years ago, Clifford learned that he had bombed on a teacher evaluation, a so-called value-added model, similar to the one that led to Sarah Wysocki’s firing. Clifford’s score was an abysmal 6 out of 100.
He was devastated. “I didn’t see how it was possible that I could have worked so hard and gotten such poor results,” he later told me. “To be honest, when I first learned my low score, I felt ashamed and didn’t tell anyone for a day or so. However, I learned that there were actually two other teachers who scored below me in my school. That emboldened me to share my results, because I wanted those teachers to know it wasn’t only them.”
If Clifford hadn’t had tenure, he could have been dismissed that year, he said. “Even with tenure,” he said, “scoring low in consecutive years is bound to put a target on a teacher’s back to some degree.” What’s more, when tenured teachers register low scores, it emboldens school reformers, who make the case that job security protects incompetent educators. Clifford approached the following year with trepidation.
The value-added model had given him a failing grade but no advice on how to improve it. So Clifford went on teaching the way he always had and hoped for the best. The following year, his score was a 96.
“You’d think I’d have been elated, but I wasn’t,” he said. “I knew that my low score was bogus, so I could hardly rejoice at getting a high score using the same flawed formula. The 90 percent difference in scores only made me realize how ridiculous the entire value-added model is when it comes to education.”
Bogus is the word for it. In fact, misinterpreted statistics run through the history of teacher evaluation. The problem started with a momentous statistical boo-boo in the analysis of the original Nation at Risk report. It turned out that the very researchers who were decrying a national catastrophe were basing their judgment on a fundamental error, something an undergrad should have caught. In fact, if they wanted to serve up an example of America’s educational shortcomings, their own misreading of statistics could serve as exhibit A.
Seven years after A Nation at Risk was published with such fanfare, researchers at Sandia National Laboratories took a second look at the data gathered for the report. These people were no amateurs when it came to statistics (they build and maintain nuclear weapons), and they quickly found the error. Yes, it was true that SAT scores had gone down on average. However, the number of students taking the test had ballooned over the course of those seventeen years. Universities were opening their doors to more poor students and minorities. Opportunities were expanding. This signaled social success. But naturally, this influx of newcomers dragged down the average scores. However, when statisticians broke down the population into income groups, scores for every single group were rising, from the poor to the rich.
In statistics, this phenomenon is known as Simpson’s Paradox: when a whole body of data displays one trend, yet when broken into subgroups, the opposite trend comes into view for each of those subgroups. The damning conclusion in the Nation at Risk report, the one that spurred the entire teacher evaluation movement, was drawn from a grievous misinterpretation of the data.
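A small worked example may make the paradox easier to see. The sketch below uses invented numbers (they are not the SAT data from the report or from the Sandia review) and hypothetical group names, chosen only so that every group's average rises while the overall average falls because the mix of test takers shifts.

```python
# Hypothetical figures, invented purely to illustrate Simpson's Paradox.
# Each entry maps a group to (number of test takers, average score).
earlier = {"lower_income": (100, 400), "higher_income": (900, 500)}
later = {"lower_income": (600, 420), "higher_income": (900, 510)}


def overall_average(groups):
    """Average score across all test takers, weighted by group size."""
    total_students = sum(count for count, _ in groups.values())
    total_points = sum(count * average for count, average in groups.values())
    return total_points / total_students


# Every group improves from the earlier period to the later one...
for name in earlier:
    print(name, earlier[name][1], "->", later[name][1])

# ...yet the overall average drops, because the lower-scoring group
# makes up a much larger share of test takers in the later period.
print(overall_average(earlier))  # 490.0
print(overall_average(later))    # 474.0
```

The drop from 490 to 474 here comes entirely from the change in who is taking the test, not from any group doing worse.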
Tim Clifford’s diverging scores are the result of yet another case of botched statistics, this one all too common. The teacher scores derived from the tests measured nothing. This may sound like hyperbole. After all, kids took tests, and those scores contributed to Clifford’s. That much is true. But Clifford’s scores, both his humiliating 6 and his chest-thumping 96, were based almost entirely on approximations that were so weak they were essentially random.
The problem was that the administrators lost track of accuracy in their quest to be fair. They understood that it wasn’t right for teachers in rich schools to get too much credit when the sons and daughters of doctors and lawyers marched off toward elite universities. Nor should teachers in poor districts be held to the same standards of achievement. We cannot expect them to perform miracles.
So instead of measuring teachers on an absolute scale, they tried to adjust for social inequalities in the model. Instead of comparing Tim Clifford’s students to others in different neighborhoods, they would compare them with forecast models of themselves. The students each had a predicted score. If they surpassed this prediction, the teacher got the credit. If they came up short, the teacher got the blame. If that sounds primitive to you, believe me, it is.
Statistically speaking, in these attempts to free the tests from class and color, the administrators moved from a primary to a secondary model. Instead of basing scores on direct measurement of the students, they based them on the so-called error term—the gap between results and expectations. Mathematically, this is a much sketchier proposition. Since the expectations themselves are derived from statistics, these amount to guesses on top of guesses. The result is a model with loads of random results, what statisticians call “noise.”
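To make the idea of scoring on the error term concrete, here is a minimal sketch. It is not the formula any school district actually used, which the text does not give. The function names, the five-student class, and the noise figures are all hypothetical, and the second function assumes the test noise and the forecast error are independent.

```python
import math
from statistics import mean


def value_added(actual_scores, predicted_scores):
    """Hypothetical teacher score: the average gap between each
    student's actual result and that student's forecast result."""
    residuals = [actual - predicted
                 for actual, predicted in zip(actual_scores, predicted_scores)]
    return mean(residuals)


def gap_noise_sd(test_noise_sd, forecast_error_sd):
    """Standard deviation of a single student's gap, assuming the two
    error sources are independent. Independent variances add, so the
    gap is noisier than either the test or the forecast alone."""
    return math.hypot(test_noise_sd, forecast_error_sd)


# Invented numbers for a five-student class.
predicted = [70, 55, 90, 62, 80]
actual = [74, 50, 91, 60, 85]

print(value_added(actual, predicted))  # 0.6
print(gap_noise_sd(3.0, 4.0))          # 5.0
```

The second result shows the "guesses on top of guesses" effect under these assumptions: a test with 3 points of noise, compared against a forecast with 4 points of error, yields a gap with 5 points of noise.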
Now, you might think that large numbers would bring the scores into focus. After all, New York City, with its 1.1 million public school students, should provide a big enough data set to create meaningful predictions. If eighty thousand eighth graders take the test, wouldn’t it be feasible to establish reliable averages for struggling, middling, and thriving schools?
Yes. And if Tim Clifford were teaching a large sampling of students, say ten thousand, then it might be reasonable to measure that cohort against the previous year’s average and draw some conclusions from it. Large numbers balance out the exceptions and outliers. Trends, theoretically, would come into focus. But it’s almost impossible for a class of twenty-five or thirty students to match up with the larger population. So if a class has certain types of students, they will tend to rise faster than the average. Others will rise more slowly. Clifford was given virtually no information about the opaque WMD that gave him such wildly divergent scores, but he assumed this variation in his classes had something to do with it. The year he scored poorly, Clifford said, “I taught many special education students as well as many top performers. And I think serving either the neediest or the top students—or both—creates problems. Needy students’ scores are hard to move because they have learning problems, and top students’ scores are hard to move because they have already scored high so there’s little room for improvement.”
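The effect of class size can be sketched with the standard error of an average, which shrinks with the square root of the number of students. The 10-point spread per student below is an assumed figure for illustration, not a number from the New York data, and the formula treats students' gaps as independent.

```python
import math


def standard_error(per_student_sd, class_size):
    """Typical random wobble in a class average when each student's
    gap from the forecast varies independently by per_student_sd."""
    return per_student_sd / math.sqrt(class_size)


ASSUMED_SD = 10.0  # hypothetical spread of one student's gap, in points

print(standard_error(ASSUMED_SD, 25))      # 2.0
print(standard_error(ASSUMED_SD, 10_000))  # 0.1
```

Under these assumptions, the average for a class of twenty-five wobbles twenty times as much from chance alone as the average for a cohort of ten thousand.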
The following year, he had a different mix of students, with more of them falling between the extremes. And the results made it look as though Clifford had progressed from being a failing teacher to being a spectacular one. Such results were all too common. An analysis by a blogger and educator named Gary Rubinstein found that of teachers who taught the same subject in consecutive years, one in four registered a 40-point difference. That suggests that the evaluation data is practically random. It wasn’t the teachers’ performance that was bouncing all over the place. It was the scoring generated by a bogus WMD.
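For a rough sense of scale, one can ask how often a 40-point swing would occur if the scores were pure noise. The sketch below assumes each year's score is an independent draw spread evenly between 0 and 100, which is a simplification and not a description of Rubinstein's method or of the real score distribution. It computes the answer exactly and checks it with a simulation.

```python
import random


def chance_of_swing(min_gap):
    """Probability that two independent scores, each uniform on 0-100,
    differ by at least min_gap points."""
    return (1 - min_gap / 100) ** 2


def simulate_swings(min_gap, teachers=100_000, seed=0):
    """Estimate the same probability by drawing two random scores
    for each of many imaginary teachers."""
    rng = random.Random(seed)
    big_swings = 0
    for _ in range(teachers):
        year_one = rng.uniform(0, 100)
        year_two = rng.uniform(0, 100)
        if abs(year_one - year_two) >= min_gap:
            big_swings += 1
    return big_swings / teachers


print(round(chance_of_swing(40), 2))  # 0.36
print(simulate_swings(40))            # close to 0.36
```

Under this assumption, a bit more than one teacher in three would swing by 40 points or more through chance alone, so an observed rate of one in four sits uncomfortably close to what pure noise would produce.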
While its scores are meaningless, the impact of value-added modeling is pervasive and nefarious. “I’ve seen some great teachers convince themselves that they were mediocre at best based on those scores,” Clifford said. “It moved them away from the great lessons they used to teach, toward increasing test prep. To a young teacher, a poor value-added score is punishing, and a good one may lead to a false sense of accomplishment that has not been earned.”
As in the case of so many WMDs, the existence of value-added modeling stems from good intentions. The Obama administration realized early on that school districts punished under the 2001 No Child Left Behind reforms, which mandated high-stakes standardized testing, tended to be poor and disadvantaged. So it offered waivers to districts that could demonstrate the effectiveness of their teachers, ensuring that these schools would not be punished even if their students were lagging.
The use of value-added models stems in large part from this regulatory change. But in late 2015 the teacher testing craze took what may be an even more dramatic turn. First, Congress and the White House agreed to revoke No Child Left Behind and replace it with a law that gives states more latitude to develop their own approaches for turning around underperforming school districts. It also gives them a broader range of criteria to consider, including student and teacher engagement, access to advanced coursework, school climate, and safety. In other words, education officials can attempt to study what’s happening at each individual school, and pay less attention to WMDs like value-added models. Or better yet, jettison them entirely.