Why big data is actually small, personal and very human

We live in what is sometimes called the ‘petabyte era’, and this pronouncement has provoked much discussion of the sheer size of data stores being created, as well as their rapid growth. Claims circulate along the lines of: ‘Every day, we create 2.5 quintillion bytes of data – so much that 90 per cent of the data in the world today has been created in the last two years alone.’ This particular statistic comes from IBM’s website under the topic: ‘What is Big Data?’ but similar ones appear regularly in the popular media. The idea has impact. Among other things, it is used to initiate a conversation in which an IBM representative, via a pop-up entreaty, offers big-data services. Merely defining big data, it seems, generates more opportunities for big data.

And the process continues. Ever more urgently in the press, in business and in scholarly journals the question arises of what is unique about big data. Often the definitions are strangely circular. In 2013, a writer for the Columbia Journalism Review described big data as ‘a catchall label that describes the new way of understanding the world through the analysis of vast amounts of data’ – a statement that amounts to: big data is big… and it’s made of data. Others talk about its transformational properties. In Wired magazine, the tech evangelist Chris Anderson claimed the ‘end of theory’ had been reached. So much data now exists that it is unnecessary to build a hypothesis to test scientifically. The data can, if properly handled and analysed, ‘speak for themselves’. Many resort to definitions that stress the ‘three Vs’: a data set is ‘big data’ if it qualifies as huge in volume, high in velocity, and diverse in variety. The three Vs occasionally pick up a fourth, veracity, which can be interpreted in a number of ways. At the least, it evokes the striving to capture entire populations, which opens up new frontiers of possibility.

What is often forgotten, or temporarily put aside, in such excited discussions is how much of this newly created stuff is made of and out of personal data, the almost literal mining of subjectivity. In fact, the now common ‘three Vs’ were coined in 2001 by the industry analyst Doug Laney to describe key problems in data management, but they’ve become reinterpreted as the very definition of big data’s nearly infinite sense of applicability and precision.

When introducing the topic of big data in a class I teach at Harvard, I often mention the Charlton Heston movie Soylent Green, set in a sci-fi dystopian future of 2022, in which pollution, overpopulation and assisted suicide are the norm. Rations take the form of the eponymous soylent-green tablets, purportedly made of high-energy plankton, spewed from an assembly line and destined to feed the have-nots. Heston’s investigation inevitably reveals the foodstuff’s true ingredients, and such is the ubiquity of the film’s famous tagline marking his discovery that I don’t think spoiler alert applies: Soylent green is people!

Likewise, I like to argue, if in a different register: ‘Big data is people.’

Most definitions of big data don’t take account of its inherent humanness, nor do they grapple meaningfully with its implications for the relationship between technology and changing ways of defining ourselves. What makes new collections of data different, and therefore significant, is their quality of being generated continuously from people’s mundane, scarcely thought-through, seemingly tiny actions such as Tweets, Facebook likes, Twitches, Google searches, online comments, one-click purchases, even viewing-but-skipping-over a photograph in your feed – along with the intimacy of these actions. They are ‘faint images of me’ (to borrow a phrase from William Gibson’s description of massed data traces), lending ghostly new life to the fruits of algorithmic processing.

Examples of the production sites of such data, as the geographer Rob Kitchin recently catalogued them, include the recording of retail purchases; digital devices that save and communicate the history of their own use (such as mobile phones); the logging of transactions and interactions across digital networks (eg email or online banking); clickstream data that record navigation through a website or app; measurements from sensors embedded into objects or environments; the scanning of machine-readable objects such as travel passes or barcodes; ‘automotive telematics’ produced by drivers; and social-media postings. These sources are producing massive, dynamic flows of diverse, fine-grained, relational data.

In 2012, Wal-Mart was generating 2.5 petabytes of data relating to more than 1 million customer transactions every hour. The same year, Facebook reported that it was processing 2.5 billion pieces of content (links, comments), 2.7 billion likes, and 300 million photo uploads per day. Meanwhile, opportunities for granular data-gathering keep evolving. This February, Facebook rolled out a diversified array of six emoji-like buttons to add range and affective specificity to the responsive clicks possible on the site. Another new feature adds more than 50 additional customised gender descriptors to choose from on Facebook, rather than the binary ‘male’ or ‘female’.

Continuously assembled trails of data derived from all those inputs are quickly being put to use. Data streams can feed maps that tell you not just where you are but also where you want to go; they can, as well, fuel preemptive police work – that is, programs that focus investigations based on patterns discerned in data before a subject has committed a crime. Big data is people, then, in two senses. It is made up of our clickstreams and navigational choices; and it in turn makes up many socially significant policies and even self-definitions, allegiances, relationships, choices, categories.

Some cultural critics call what is emerging a ‘new mind control’ capable of flipping major elections. Others describe a form of rapacious human engineering. Shoshana Zuboff of Harvard Business School argues that the harnessing of behavioural data is having massively disruptive results on freedom, privacy, moral reasoning and autonomy – results that will be playing out for decades to come. In her view, it is nothing less than a virulent new form of capitalism.

The momentum of big-data definitions tends to reinforce the impression that big data is devoid of subjectivity, or of any human point of view at all. A set of social-science scholars working in the field of technology studies recently urged researchers to turn from ‘data-centred’ to ‘people-centred’ methods, arguing that too much focus on a data-driven approach neglects the human being who is at the core of sociological studies. This reminder, however useful, neglects the central fact that data traces are made up of people.

Contrary to the novelty with which big data is frequently presented, important parts of this information-gathering process are not quite new – not at all new, in fact. Platforms such as social media are of recent design, but the goal of automated access, the concept of human-as-data, and the fantasy of total information long pre-exist the recent developments. This realisation punctures claims that we are grossly transformed as human beings by big data. The circulation of pervasive inaccuracies about big data is a problem because it has a quelling effect. Misconceptions about big data, tautological repetition, and confusion about its very meaning stifle needed conversations about data privacy and data use.

Even as we pay lip service to diminishing domains of privacy and increasing incursions into this beleaguered space – legal incursions, illegal ones, and the varieties in between – and even as we are reminded by whistleblowers that there is abundant cause for concern, we resist connecting the public-sphere discourse with our own circulating intimacies. Likewise, a feeling that big data is inhuman reinforces the sense that it cannot be modified or regulated; it is too often regarded as a raw force of nature that simply must be harnessed. These beliefs foster intrusions of government and private capital forces that people would probably resist much more strenuously if they clearly understood what is happening. The situation boils down, really, to this: to unwittingly accept big data’s hype is to be passive in the face of big data’s mantle of inevitability. Awareness is the only hope.

For all their futuristic trappings, big data and data-driven science resonate strongly with the history of social-scientific techniques, which during the course of the 20th century reached ever more exactingly into the realm of the subjective, the self, the intimate and the personal. As the social sciences differentiated themselves – sociology from anthropology from social psychology from economics, each in its own department, each with its own areas of interest and special tools – experts battened down authority and built firewalls against enthusiast amateurs, quasi-professionals, and interloping women. Mainstream, professionalising social science abounded in techniques for data-extraction, setting scenes in which subjects would be inclined and accustomed to share their memories, their lives, the seemingly banal details of their first steps or marital first nights.

In Muncie, Indiana, the vast ‘Middletown’ study conducted by the sociologists Robert and Helen Lynd between 1924 and 1926 employed a new grab-bag method (adapted in part from anthropology, in part from sociology) that combined information from interviews, participant-observation, newspaper research, questionnaires and other sources. As the historian Sarah E Igo wrote in The Averaged American (2007): ‘No fact or observation seemed too trivial to include in their purview, from the contents of seventh-grade school curricula to that of popular movies, from the number of hours spent on household washing to the size of Middletowners’ backyards.’

The mining of intimacy has a largely untold history. From early 20th-century observational networks to social surveys and polling efforts, to later-century focus groups, techniques evolved to become ever more targeted. The once out of bounds came in bounds in a seemingly relentless process. The ephemeral was materialised, the fleeting anchored. No subject, no state of subjectivity, was to be ignored. As the psychologist James Sully wrote in 1881: ‘The tiny occupant of the cradle has had to bear the piercing glance of the scientific eye.’ Likewise, by mid-century, everything from hallucinations to idle memories of the most pedestrian variety was targeted as data – with, in some cases, experimental data banks built to hold them.

In 1947, the psychologist Roger Barker created the ‘Midwest Psychological Field Station’, a social-science laboratory stationed in the small town of Oskaloosa, Kansas, in the process of which the town emerged as a kind of de facto laboratory. Revolutionising observation opportunities, Barker and his colleagues pioneered the regular capture of data concerning ‘everyday life’ – the unremarkable yet vexingly hard-to-capture details of boy scouts at play in sandlots, schoolyards and other spaces throughout the town. What appears as trivial detail – seven-year-old Raymond at 7:01 am on Tuesday, 26 April 1949, picks up a sock and begins pulling it on his left foot, slow to wake up and groggy, while his mother jokes: ‘Can’t you get your peepers open?’ – pooled with more such data, mounted and massed together, makes a unique resource for sociological study to access the ‘ordinary’ cadence of life during a now-bygone time in a much-changed place. The unremarkable, researchers sensed, would inexorably cease to be so.

Meanwhile, other techniques emerged in research environments designed to further intimate revelations. As the researchers Terry Bristol and Edward Fern have shown, focus-group participants – beginning in the late 1950s – entered a situation in which they experienced a unique mix of ‘anonymity and arousal’ that facilitated ‘expression of shared experiences’. These developments formed part of an American science for objectifying the realm of the subjective in the modern social sciences. A Midwestern flair ran through several of these projects, which wound their way from polling in Indiana to child study in Kansas to the beauty parlours and kitchens of Middletown.

Another area of growing focus during the golden age of behavioural techniques was the use of anthropological subjects to pursue experiments in total access. Scientists looked at these relationships as opportunities to publish and penetrate new domains; research subjects from groups around the globe such as the Cree, the Navajo and Bikini Islanders pursued a range of goals including payment, self-knowledge, participation, feedback and the chance to make one’s voice heard in a not-yet-entirely-imagined scientific record.

By many calculations, a Hopi Indian man named Don Talayesva counts as the most intensively documented such subject in history, in a life stretching from 1890 to 1976. Talayesva participated in 350 hours of formal interviews with the anthropologist Leo Simmons alone between 1938 and 1940, during which he used his life experiences as a Hopi to fill the taxonomic pigeonholes for ‘Hopi’ within an encyclopedic knowledge bank, the Human Relations Area Files, hosted at Yale. There were also 8,000 diary pages Talayesva contributed to ethnographers; 341 dreams written down in wire-bound notebooks; a set of wide-ranging interviews; a full Rorschach protocol and other projective tests; and, as the result of all this, a thriving correspondence with the French surrealist André Breton.

Talayesva’s usual rate of pay was seven cents per page of diary-writing and 35 cents per hour of interviewing, adding some expense for the Rorschach test, all of which made him a relatively wealthy man by Hopi standards. Whether or not he remains today the most-documented native person in history, he was the fount of an ‘enormous body of data’, wrote the author of a psychosexual re-study of the Talayesva corpus. Likewise, for another eminent anthropologist, he provided ‘a storehouse of substantive data’. The man himself became a kind of data pipeline.

The pioneering sociological studies targeted not only individuals but also large groups. The anthropologist Melford Spiro psychologically tested all inhabitants of an entire island in the Western Pacific (Ifaluk) during the same post-Second World War years as neighbouring atolls in the area (Bikini, among others) were sites of intensive nuclear tests. For his academic research, Spiro data-mined whole populations. For American Indians, this ongoing process constituted what the historian Thomas Biolsi calls ‘internal pacification’. In a study of Sioux history between the 1880s and 1940s, Biolsi shows how investigations of Sioux life delved increasingly into psychological domains. Such research probed further and further into the remote psyche, which was treated as a kind of territory to be mapped. But the mapping also helped change the territory. Not evenly or regularly, but painstakingly, a transformation of the Sioux ‘self’ was underway, as Biolsi describes it, and the process of being measured, counted, quantified and (eventually) tested served to aid and abet the subjective changes taking place. In effect, such research subjects were canaries in coalmines.

Experiments in the range of ways to get at what could be considered internal data – or what specialists called ‘subjective materials’ – extended from Indian reservations to occupied areas to reformatories, factories and armies. Large punch-card-driven statistical enquiries opened up new possibilities, as in the US Army’s landmark The American Soldier project. Starting on 8 December 1941, the day after Pearl Harbor, and continuing until the last days of the war, the Army’s Research Branch administered more than 200 questionnaires, each usually given to a sample of around 2,500 soldiers, some in overseas theatres of battle and remote outposts.

The result was ‘a mine of data, perhaps unparalleled in magnitude in the history of any single research enterprise in social psychology or sociology,’ in the words of the project’s director, Samuel Stouffer. The American Soldier project provided unique access to the inner states of soldiers – as a resulting publication put it, an unbiased look at ‘what the soldier thinks’.

Early audiences for the Lumière brothers’ films, especially crowds viewing footage of the oncoming train that seemed about to penetrate their cinema screen, ran out of the theatre in panic because they had not yet become trained in the illusionary calculus involved in making the experience of watching films enjoyable – at least according to the myths surrounding Arrival of a Train at La Ciotat Station. Made in 1895, with its first public showing in 1896, the 50-second film comprised one continuous shot of an everyday occurrence – a train steaming into station – yet the camera was positioned on the platform to produce the feeling of a locomotive bearing down on the seated viewer. These ‘naive’ audiences confused signals that indicated one scenario (danger) with another (watching a film about a dangerous situation).

One sees replications of this process of acclimatisation to new techniques in the arena of penetrating social-science instruments, and in their modern incarnation, big data. Early on, citizens seemed to have little resistance to being asked questions by phone pollsters, whereas today as few as 3 per cent will answer questions by phone – if they even have a landline. Technology and resistance arose hand in hand.

As the ‘man on the street’ interview debuted in the 1950s and ’60s, members of the public initially watched in bemusement or alarm as strangers posed random questions accompanied by recording devices. A wonderful depiction of this process appears in the classic 1961 cinema-vérité documentary Chronique d’un Été by the anthropologist Jean Rouch and the sociologist Edgar Morin, in which work-worn Parisians exiting the Métro encounter two snappily dressed young women pointing microphones and pressing into their personal space, asking: ‘Monsieur, are you happy?’ The query occasions a range of responses from blank to flirtatious to heart-rending. There is as yet, however, no sense of this as a normal activity, in contrast to the standardised ease with which college students or city commuters answer pointed questions today.

By the second half of the 20th century, citizens (particularly urban dwellers) became increasingly accustomed to the possibility that intrusive questions might be asked at any time and answers expected; evasions also became normalised. The famous Kinsey Report research, built on thousands of interviews, stimulated a wave of prank social surveyors asking women intimate questions about their sex lives. Pretending to be working for the Kinsey Report, these caddish pretenders often received fulsome answers until the surprisingly trusting public was warned of predatory practices. At other times, prospective participants queued in eagerness to take part in Kinsey interviews on sexual behaviour, many reporting exhilarating effects that came from feeling oneself to be ‘an infinitesimal cog in one of the greatest fact-finding projects ever undertaken… the first great mass of facts, and figures drawn from a cross-section of all social and educational groups, from which charts, curves and finally conclusions may be drawn,’ as one interviewee reported.

Also around the mid-20th century, a Harvard Business School team under the industrial psychologist Elton Mayo pioneered the gathering of intimate interviews with employees of the Hawthorne Works factory in Cicero, Illinois, carrying out some 20,000 interviews. They aimed to capture what another eminent social scientist famously called the ‘elusive phenomena’. The employees’ answers remain on file at Harvard’s Baker Library, a curious archive of the mundane details of the lives of factory girls circa 1938 or 1941. Jennie, for example, provided the interviewer with details of her evolving hairstyle, hoped-for Christmas gifts, and proclivity to wear her stockings rolled when working on hot days. Assembly-line girls joked about going out drinking the night before and slowing production during the day. As with anthropologists’ American Indian subjects who often spoke back via purportedly neutral measuring instruments (in one case in 1885, Sioux respondents answering a census survey supplied names such as ‘Shit Head’ and other obscenities), the Hawthorne subjects turned researchers’ techniques to their own purposes, at times asking snarky questions, fomenting rebellion, or teasing visiting sociologists.

One day, perhaps not long from now, people will look back at our current decade amazed at the ease and naiveté with which we, enchanted users of new tech, failed to see the value of our own behavioural data resources, and therefore gave them away for little more than ease of use, entertainment value and dubious accretions of status. That is one possibility. On the other hand, the more we can see the process at work, the less the average user falls prey to the hype of ‘never before’. It becomes possible to disentangle what is actually new about data-gathering capabilities – arguably, scale and granularity – from those tendencies that existed before, sometimes long in the past.

A recent White House report on ‘big data’ concluded: ‘The technological trajectory, however, is clear: more and more data will be generated about individuals and will persist under the control of others.’ When trying to understand the ramifications of this big-data trajectory, I argue, it is necessary again to bear in mind that the data is not only generated about individuals but also made out of individuals. It is human data.

In parallel with researchers’ increasingly aggressive collection of personal data, modern research subjects were trained in how to participate, how to answer, how to obligingly offer up the precincts of the self to scrutiny – our own and others’. This training has prepared us for a new level of intrusiveness. We are all now primed to give ourselves up to big data.

To look at the history of the quest to scoop up the totality of human behaviour in a scientific net is to illuminate the present obsessions. In the end, we see that the attempt to capture all the parts of human experience – mostly boiling it down to its everyday-ness – reveals many elements that are familiar, but also some that are distinctly and wildly different. Big data is not a project suddenly spawned by our just-now-invented digital technologies, although it is transformed by them. Instead, we can see that it is a project at the driving core of all of modern life. In many ways, it crowns long-held ambitions to build a transparent machinic self, one capable of optimisation as well as automation.

The behavioural sciences in the 20th century, particularly as practised by Americans spanning the globe, engaged in an ambitious push to capture ever-more-intimate parts of human experience and to turn them into materials amenable to manipulation by clever machines. This was a prelude to the Rubicon now known as big data. These historical projects, sometimes more and sometimes less closely aligned with government and military sources of support, ran on complex hodgepodge combinations of old and new technology, and paved the way for our own moment in which corporate-cum-research entities feed government data mills rather than the other way around.

This is why the erstwhile goal to gather large amounts of what specialists called ‘human materials’ resonates so strongly today. It speaks to the tension between humans and materials, and the desire to turn one into the other. What the Swiss biological historian Bruno Strasser calls the ‘supposedly unprecedented data-driven sciences’ are not so unprecedented. For that reason, it is necessary to understand what came before in order to grasp what is actually new.

Preceding examples of innovative data collection already targeted inner provinces, and already engaged in subjective data-mining. They were unable to do so on anything resembling the scale today possible by use of digitally derived data streams. Nonetheless, the old imperative to mine inner worlds finds a place at the heart of today’s practices. By being arrayed in new tech, and by being incorporated in new ways into our human experiences, it is transformed. As are we. But if we really want to understand that transformation and to speak up about it – if we want to see what is truly new rather than what is bumptiously paraded as new – we will need to be anchored in the historical particulars. We need to see the human in the data machine.

16 June 2016
