Are aptitude tests an accurate measure of human potential?

For the American psychologist Lewis Terman, who devised the Stanford-Binet IQ test, the birth of a gifted child was like the birth of a star. In the name of illuminating what he saw as a void, Terman sought out the stars that burned brightest. ‘The school has no more important task than to foster the development of the mentally gifted,’ the researcher at Stanford University in California told attendees at a meeting in San Francisco. ‘Our chances of survival … may well depend upon the discovery and utilisation of highly superior abilities of every kind.’ To further this mission, Terman urged thousands of children all over California to sit for his newly developed intelligence test in the 1920s. Intending to prove his theory that high IQs predicted life success, he sorted the highest scorers into a group of 1,521 children (called his ‘Termites’) that he would follow and document throughout their lives.

Among the cardinal stars in Terman’s firmament was a girl who, at age 10, posted a 192 on the Stanford-Binet, placing her among the top four scorers in his entire study. The journalist Joel Shurkin found her story in Terman’s Stanford archives. In his book Terman’s Kids (1992), Shurkin calls her Beatrice Carter (a pseudonym he chose to protect her confidentiality) and says she read nearly 1,500 books before finishing elementary school, including the likes of William Shakespeare and Robert Burns. She also had writing chops that rivalled those of the éminences grises she admired. ‘Several of Beatrice’s poems completely fooled an English class at Stanford,’ reported a San Francisco newspaper, ‘where they were presented anonymously with some of the little-known work of Tennyson, Longfellow, and other masters.’

Precocious talents such as Beatrice have lived among us throughout history. It was Terman, Beatrice’s beloved mentor, who essentially created the way we measure, rank and assess the broad spectrum of human abilities. In the century since Terman’s grand experiment began, his ideas have helped to spawn an aptitude-testing gauntlet that children enter as early as kindergarten, and that we continue to navigate as we enter high school, compete for college admission and apply for jobs.

Whatever their stated purpose, what these tests attempt to do is create a working index of who is worthy: for academic advancement, for career success, for opportunities of every kind. They are all about making the broad, ragged cut, bestowing opportunity on some while filtering out others – an enterprise that has historically teemed with racial and social discrimination. Those of us who grew up immersed in the aptitude-testing hierarchy can testify not just to the lopsided rewards that accrue to those who test well, but to the way that our test-centric culture shapes, and often constricts, our sense of what defines human value.

In recent years, the drive to rank the masses through aptitude tests has met with backlash at college admissions offices, in human resources departments and at public institutions. That backlash has educators and policymakers asking once-unthinkable questions: what if the foundation that Terman built was rotten? What if the way we evaluate potential actually has the power to squelch it?

Intelligence and aptitude tests – binary stars circling in the same orbit – are now so ubiquitous that it’s easy to forget that they barely existed as the 20th century began. Throughout the westward expansion of the United States, the Jeffersonian frontier ethos reigned in education just as it did elsewhere. Schools gave students exams to assess their progress, but no national board dictated what those exams should look like, and few people floated gauzy notions of measuring student potential. If such notions had arisen, they would have been regarded with disdain. Conventional wisdom of the era pegged brainy kids – the kinds of youngsters today called gifted – as ‘addlepates’ in the making, sickly and ill-adjusted to society. Sweat-of-the-brow exertion, not ethereal aptitude, reigned supreme in the public imagination.

That all began to change as the 20th century dawned, and Terman was a primary architect of this transformation. He was cresting a wave that had already begun to rise: by 1900, the country’s rising population had fuelled interest in how to decide who should receive limited educational opportunities. Into the gulf stepped Terman, who – along with the French psychologist Alfred Binet – perfected a test that claimed to measure a person’s inherent intellectual capacity in just an hour or so. They dubbed this capacity the ‘intelligence quotient’ or IQ, and measured it by asking subjects to repeat long sentences, complete verbal analogies, and pick out pattern inconsistencies, among other things.

It wasn’t long before large institutions started showing interest in tests such as Terman and Binet’s, hoping to identify top performers. In 1917, at the height of the First World War, the US government recruited Terman to help develop the Army Alpha, a glorified IQ test administered to nearly 2 million drafted men. The results of this test determined whether a recruit made the elite officer class or was shipped to the front lines to become cannon fodder. Later on, the Scholastic Aptitude Test (now simply called the SAT) gave colleges a seemingly foolproof way to choose applicants with the most promise. (Psychometricians widely consider the SAT another glorified IQ test – so much so that a high score on older versions of the test qualifies you for acceptance into Mensa, the high-IQ society.)

Despite initial resistance, the public accepted the notion of a test-driven meritocracy because it twined together two established strands of thought: first, that the spoils should go to the declared winner, and second, that high-performers’ abilities should be harnessed for the good of the nation. ‘To each according to their ability’ became the tacit watchword, a neat variant of the Marxist injunction ‘to each according to their need’.

The first aptitude-testers promoted the idea that each person had an innate, more-or-less fixed intellectual capacity. In the context of the early 20th century’s growing eugenics movement, the tests were often deployed to justify widespread racial discrimination. Terman claimed that what he called borderline deficient scores on the Stanford-Binet were ‘very, very common among Spanish-Indian and Mexican families of the Southwest and also among Negroes’. ‘Children of this group should be segregated into separate classes,’ he wrote in 1916. ‘They cannot master abstractions but they can often be made into efficient workers … From a eugenic point of view they constitute a grave problem because of their unusually prolific breeding.’ In Terman’s mind, then, low IQ scores were simply and unarguably the result of objective deficiency.

We now understand just how wrong that notion was. Today, many psychologists understand IQ and aptitude tests to be ‘culture-bound’ to one degree or another – that is, they evaluate abilities prized in the dominant Western culture, such as sorting items into categories, and can privilege those raised in that milieu. Such inequities have persisted despite attempts to make the tests fairer to those from non-dominant cultures.

As the US marinated in social Darwinism after the First World War, the government began devising its own sinister solution to the ‘grave problem’ of which Terman had warned. In Buck v Bell (1927), the US Supreme Court upheld the compulsory sterilisation of the ‘feeble-minded’ in the name of public welfare. For more than four decades thereafter, US states sterilised thousands of people with low IQ scores; a disproportionate number of victims were nonwhite. In later years, though aptitude tests’ eugenic roots would fade from view, the ranking of test-takers according to perceived social value would continue unabated.

Though I didn’t fully grasp it until well into adulthood, I was a direct beneficiary of the test-driven hierarchy that Terman worked so hard to establish. As smarts go, I was never in Beatrice Carter’s league, but my 1980s aptitude test scores were good enough to earn me the label ‘gifted’. With this distinction came the message, from a hundred different directions, that my essence was somehow superior. I was fawned over, head-patted, even loved, in a way that many of my peers were not. I was given certificates to commemorate scores I’d gotten on tests for which I’d never studied.

Starting in fifth grade, I reaped the most tangible benefit of my test-taking knack – I was invited to attend academic summer programmes run by the Johns Hopkins Center for Talented Youth in Baltimore. For the first time, I got to explore subjects such as logic and creative writing that were rarely taught in school, and I delighted in the company of talented classmates. But I wondered about the winnowing process that had gotten me there. Sure, I was good at taking tests, but my work ethic was almost nonexistent – my teachers often hounded me about missed assignments. Why hadn’t my equally curious but harder-working peers, the ants to my grasshopper, gotten this opportunity as well?

For many years, I kept my qualms to myself. But thanks to a proposed admissions overhaul at New York City’s specialised high schools, including Stuyvesant, Bronx High School of Science and Brooklyn Tech, questions about test-taking aptitude versus merit are now being hotly debated in the open. For decades, admission to these schools – among the city’s most venerated public institutions – has been based on a single factor: a candidate’s score on the Specialized High Schools Admissions Test (SHSAT), a modified version of the SAT for younger students. To some, that admissions standard established the specialised high schools as a model of meritocracy. ‘You don’t get in because you told the saddest sob story. You don’t get in because the school feels it needs a certain number of people of your skin colour,’ wrote Kyle Smith in National Review in 2018. ‘You get in by acing the insanely challenging SHSAT. That’s it.’

The problem, as the New York City mayor Bill de Blasio realised, was that this strategy produced a student body that was racially and economically skewed. Affluent families, largely Asian and white, were sending their kids to test-prep cram schools – sometimes for years – to boost their chances of admission. In recent years, only about one in 10 admitted students has been black or Hispanic.

So this June, de Blasio attempted to intervene. Instead of making admission contingent on an SHSAT cut-off score, he argued, why not admit a group of the highest-ranking kids from each middle school in the city, who had also scored well on achievement tests? That would reward students who’d been working hard to ace their classes, and it would help to eliminate the bias toward candidates of financial means. New research seemed to support his approach. Jonathan Taylor, an analyst at Hunter College in New York, surveyed 28,000 students who took the SHSAT, and found that grades in middle school predicted admitted students’ high-school grades far better than their SHSAT scores did. ‘The SHSAT represents a two-and-a-half hour sample of a limited range of skills and knowledge,’ Taylor writes. ‘In contrast, middle-school grades reflect a full year of student performance across the full range of academic subjects.’

Nevertheless, de Blasio’s proposal led to howls of protest from the kinds of families who’d been paying mightily for test prep (and thus arguably gaming the system). The tepid pushback here from state regulators means that an SHSAT-based admissions policy will likely stay in place for now. Pian Rockfeld, a teacher at the public High School of American Studies in the Bronx, considers that a missed opportunity. Having invigilated the SHSAT, she’s intimately familiar with the kinds of questions that the test includes, and she teaches kids who have been admitted to her school on the strength of their SHSAT scores.

That firsthand knowledge has convinced her that New York’s specialised high schools are hewing to an outmoded admissions standard. ‘There is a lack of correlation between what kids learn in middle school, and what is being tested,’ she says. Instead, like Terman’s intelligence tests, the SHSAT attempts to gauge raw intellectual potential as the test-makers define it – a venture that Rockfeld sees as incomplete at the very least. Year in, year out, her most remarkable students aren’t necessarily the ones who score highest on the test. They are ‘the students who light up the room, eager to articulate their ideas,’ she says. ‘The intelligence we value in the real world is creativity, independent thinking, innovation. I just wish how we assess children would match that.’

Rockfeld’s perspective has been embraced by school reformers such as Alfie Kohn and Susan Ohanian. But to researchers such as Jonathan Wai, this is a line of critique that has been debunked. Wai, a psychologist at the University of Arkansas who used to advise Duke University’s Talent Identification Program, has heard all the arguments: aptitude tests are elitist, they’re reductive, they distil people down to numbers.

But the bottom line, to him, is that aptitude and intelligence tests meet a societal need no other available tools can. ‘Selection tools like the SAT, or tests to [enter] the specialised high schools – they’re all essentially the same kinds of tests,’ says Wai. ‘The reason they’re used is that they’re a way to capture many people who are talented. The test is really just a way to measure things that are going on in the mind.’

In a 2013 paper, Wai studied more than 2,000 members of the American elite, from federal judges to Fortune 500 CEOs, and concluded that almost all of them were good at taking aptitude tests. ‘High average test scores required for admission to these institutions indicated those who rise to or are selected for these positions are highly filtered for ability,’ he wrote. ‘America’s elite are largely drawn from the intellectually gifted, with many in the top one per cent of ability.’ Those kinds of results, in Wai’s view, validate our collective acceptance of aptitude testing as shorthand for overall potential.

Aptitude testing’s eugenic roots aside, Wai points out that there are progressive arguments in its favour – chief among them that the tests identify talented people who might not be recognised any other way. In a recent policy paper, he advocated for the use of spatial-abilities testing in school admissions, since the results are relatively untethered to socioeconomic status. ‘If we were able to do that,’ he says, ‘we’d pick up a lot of students from disadvantaged and poor backgrounds.’

The argument that aptitude tests transcend human bias has also lent moral weight to those who use them to screen job candidates. The big kahuna is the Wonderlic test, a 50-question gauge of cognitive skill developed in the 1930s. Most famous as the test administered to all National Football League draftees (the quarterback Ryan Fitzpatrick aced it; Terry Bradshaw tanked), the Wonderlic is now a required part of the interview process at dozens of companies. While reports vary as to how much test results affect hiring decisions, it’s clear that many employers depend on the test to quickly sort the intellectual wheat from the chaff. ‘Almost immediately after we started using Wonderlic,’ reports Cindi Gilmore, a company president in Dallas, in her online testimonial, ‘we noticed the calibre of people increased.’

As Terman tracked his Termites’ journeys through college, graduate school and the job market, he viewed their achievements as proof that intellectual aptitude tests predict later success. By midcentury, seven in 10 of the Termites were college graduates, a figure 10 times higher than California’s overall graduation rate. In 1952, the journalists Milton and Margaret Silverman described Terman’s study as a triumph, one that validated his early decision to lean hard on testing. ‘The typical precocious child had not become stupid, committed suicide or, as some vulgarians observed, flipped his lid,’ the Silvermans wrote in The Saturday Evening Post. ‘The successful, healthy young Termite had grown up into a successful, healthy, well-adapted, versatile, happily married adult with a good job and a good income, many friends, a respected position in his community, and a respectable record of contributing to his nation’s welfare.’

To be fair, Terman’s broad-brush midcentury observation – that aptitude testing identifies a defined group of high performers – wasn’t so far off the mark, at least on the surface. Almost no one today, including detractors, questions the tests’ ability to detect a certain kind of smarts. In 2006, the psychologist Camilla Benbow at Vanderbilt University in Tennessee and colleagues reprised Terman’s study, tracking 5,000 gifted 12- and 13-year-olds, many of whom scored in the top one per cent of their age group on the mathematics SAT. These crack test-takers, like the Termites, went on to clear conventional benchmarks of success with room to spare. In adulthood, Benbow’s 5,000 study subjects racked up 681 patents, wrote 85 books, and earned 560 PhDs – an achievement pace surpassing that of their average-scoring peers.

The question, then, isn’t so much whether the tests measure anything significant; they do, at least to a point. The meatier questions are these: whom do the tests overlook? And how does our culture’s dogged focus on test-measured potential shape what those in the system become?

The top-line distillation of Terman’s research was that aptitude testing reliably foretold success. But a close look at Terman’s longer-range data reveals that his Termites’ successes tended to be of a certain predictable variety. They excelled, mostly, at following paths laid out by others who had come before, rather than charting their own. None of them became Nobel Prize winners, although one – Ancel Keys – is known for his now-contested research on the benefits of a low-fat diet. There were no once-in-a-generation creative luminaries in the cohort, no Sylvia Plaths or I M Peis or Toni Morrisons. ‘Rebels were scarce among the Termites, and Henry David Thoreau’s different drummer would have found few followers,’ wrote Shurkin in Terman’s Kids. ‘They did not change life; they accepted it as it came and conquered it.’

This is unsurprising given that the kinds of people who ace aptitude tests are, almost by definition, those who specialise in jumping through the hoops that society has set up. If you believe that your entire purpose on Earth is to finish the course, chances are you’ll remain within its boundaries at all costs.

While the error baked into Terman’s study design – his choice to screen only for sheer mental firepower – might not have been apparent at first, it came into starker relief as time passed (and as one Terman study reject, Luis Alvarez, won the Nobel Prize for physics). ‘Whatever it was the IQ test was measuring, it was not creativity,’ Shurkin says. Neither did Terman’s test measure work ethic or grit, qualities that the psychologist Angela Duckworth has since shown to correlate with success. Terman himself noted that the Termites who went to college averaged only Bs; only about one in 10 were straight-A students. He chalked this up to ‘idleness, unwillingness to do routine assigned tasks’.

Such flaws persist in aptitude tests today, says the psychologist Scott Barry Kaufman, the author of Ungifted: Intelligence Redefined (2013). High scorers on tests such as the Wonderlic, the Stanford-Binet and the SHSAT enjoy a near-instant degree of privilege, and the simplicity and ease of sorting is part of the appeal for administrators.

But Kaufman doesn’t feel the privilege granted to test-acers is justified given the constellation of factors that drive success and innovation, many of them difficult to quantify. ‘I don’t think we should get so hung up on the test being the arbiter of the ultimate truth. You have a lot of people whose actuality far surpasses what’s predicted,’ Kaufman says. ‘We all privilege potential over achievement. Our hierarchy of values is set up in a way that anyone that deviates from the kind of mind that does well on those metrics feels like a loser.’ Some psychologists have tried to upend this values hierarchy by creating new tests to assess ‘multiple intelligences’ such as emotional, musical and linguistic intelligence. But while the metrics might differ, the broad-based impulse to sort is the same.

The test-based drive to essentialise regularly frustrates Rockfeld, the New York City public school teacher and SHSAT proctor: ‘I’m not saying this test doesn’t pick out some very bright students. But I don’t think they’re the only ones in my city who are. There are other students who are equally likely to be successful.’ In an ideal world, she thinks, the SHSAT would be just one component of a more holistic admissions strategy. It’s not that the test ought to be scrapped altogether; it’s more that its relative importance should be dialled back in favour of more personalised assessments, such as essays, grades and teachers’ impressions over the course of a year.

Rockfeld stresses that real-world high-performers are slipping through the aptitude-testing net in droves, missing out on opportunities as a result. But what’s less obvious is that even the beneficiaries of the test-driven system suffer in insidious ways. When you get used to being rewarded for your test-measured potential, rather than for what you shape through force of will, a kind of existential atrophy sets in. Your notions of true mastery – of pride in self-directed accomplishment – fail to progress. I didn’t build this, you think to yourself. I don’t own this. Yet your quantified promise is the shell you display, and you are the crab trapped within its confines.

During my earliest years, I felt the squeeze only rarely. I saw it all so clearly – what the testing gods said I could be – as if it had already ruptured into aliveness, as if ‘you can’ and ‘you will’ were one and the same.

Still, something in me sensed what was ahead. At 11 or 12, I started watching a lot of women’s gymnastics. The sport obsessed me for largely unconscious reasons: here was a procession of exceptional children, all hand-selected for their potential. Back then, every routine was scored out of a perfect 10, an unquestioned, unchanging standard for which to strive.

One day, my father asked me something unexpected. ‘Do you feel like a scholastic gymnast?’ he said.

I can’t remember what I said in response. I only remember my urge to cry.

Years after that, I dreamed that I’d qualified for the Olympic team. But it wasn’t a wish-fulfilment dream; it was a horror show. I’d fooled the selection committee into choosing me, and I was scrambling for a way to hide the fact that I couldn’t do gymnastics at all.

To this day, when I have to work hard to understand or finish something – when it doesn’t click for me, as the SAT did – it’s hard to shake the sense that something is wrong, that my struggle is a marker of stagnation rather than a necessary show of effort. And when others surpass me in various dimensions of life, I castigate myself: You were supposed to be there already. Didn’t the tests show you had a head start? When I do achieve something, it feels less like a triumph than like a box checked off with relief. Finally, I’m tracking the growth curve they pegged me to follow all along.

What’s perilous is when test-takers themselves, following their mentors’ and gatekeepers’ lead, begin to define themselves using the tests’ stilted lexicon. As brilliant as she was, Terman’s young supernova Beatrice Carter ultimately fell into this trap. Egged on by adult admirers, she grew so enamoured of her intellect that her high-school peers found her insufferable. ‘She is not socially mature enough to hold her own with students of college age,’ her headmistress remarked. And she was so used to seeing her dazzling potential as an end in itself that she had no idea how to cope after the publisher Dutton rejected her debut adult novel. Before long, her writing output dwindled to almost nothing. In middle age, she dabbled in a variety of disciplines, including sculpture, without committing herself to any. She died unknown of breast cancer in the mid-1980s. The profession listed on her death certificate, Shurkin notes, was ‘landlady’.

Every aptitude test is unavoidably a shorthand, a way to render potential in thumbnail form. And there’s nothing wrong with that per se. Like every other form of assessment, such tests have a practical role to play – even a necessary one, if you accept Wai’s reasoning. Our collective error is one of emphasis, of overvaluing what is instant, quantifiable and culturally bound. It’s the seductive but crippling assumption, among both testers and testees, that the thumbnail can stand in for the human whole – and that it should dictate the contours of a life before the whole emerges in full.

Some experts think the test-mania that Terman inspired might be starting to recede. There’s the backlash against the SHSAT in New York City, and there’s also the fact that more than 200 colleges have dropped their SAT admissions requirement in recent years, including top institutions such as the University of Chicago. Testable aptitude, Kaufman says, is just one among many factors that today’s administrators consider in evaluating candidates.

This is likely true in some places. But in the California school district where I live, kids qualify for the gifted programme based on a single metric: a nonverbal intelligence test every student takes in the second grade. I can’t help thinking that this is just the way that Terman would have wanted it.

When it’s time for my oldest son to take this test, I plan to take a hands-off approach. I won’t drill him for weeks beforehand, nor will I lecture him about how important the test is to his future. Sticking to that commitment will mean rejecting decades-worth of relentless social programming: programming that says those who score high on these tests are more worthy than those who do not, programming that prizes sheer aptitude, whatever its origins, over the force of human will.

Every so often, I feel my resolve weakening in the face of such messages. When I’m supposed to be working, I’ll open a new browser window and load up an online IQ test. I know on some level that this is a primal attempt to shore up my own worth – to prove that the adult version of me is still on par with the whizzkid of my childhood. Partway through the test, out of disgust at my own insecurity, I usually close the browser window. In my mind, though, it might as well still be open.

3 December 2018
