About 25 minutes into the action film Iron Man 2 (2010), there is an explosive sequence in the middle of an auto race through the streets of Monaco. The scene is a technical tour de force, with explosions, cars flipping and fire everywhere, all in front of thousands of panicked race spectators. At a 2014 event at the Academy of Motion Picture Arts and Sciences, the film’s director Jon Favreau got to see the eye movements of audience members who watched the clip. He told us he was thrilled – and relieved – to see that everyone was watching the actors Robert Downey Jr and Mickey Rourke, particularly their faces and hands, and that nobody was looking at the crowd – because the crowd was all computer-generated, and if you look closely they don’t look all that real. As long as you don’t look closely, Favreau (who was also an executive producer) could go a little cheap on these effects and save the money for where it would really count.
This phenomenon – the audience’s eyes moving in unison – is characteristic of film viewing. It is not typical of real-world vision. Rather, filmmakers use editing, framing and other techniques to tightly control where we look. Over 125 years, the global filmmaking community has been engaged in an informal science of vision, conducting a large number of trial-and-error experiments on human perception. The results are not to be found in any neuroscience or psychology textbook, though you can find some in books on cinematography and film editing, and in academic papers analysing individual films. Other insights are there in the films themselves, waiting to be described. In recent years, professional scientists have started to mine this rich, informal database, and some of what we have learned is startling.
To understand how the eyes are affected by movies, you need to know a bit about how they work outside the theatre. When we are just living our lives, our eyes jump from one location to another two or three times per second, taking in some things and skipping over others. Those jumps are called saccades. (Our eyes also make smooth tracking movements, say when we are following a bird in the sky or a car on the road, but those are somewhat rare.) Why do we do this? Because our brains are trying to build a reasonably complete representation of what is happening using a camera – the eye – that has high resolution only in a narrow window. If any visual detail is important for our understanding of the scene, we need to point our eyes at it to encode it.
The way people use eye movements to explore a scene has a consistent rhythm that involves switching between a rapid exploratory mode and a slower information-extraction mode. Suppose you check into a resort, open a window, and look out on a gorgeous beach. First, your eyes will rapidly scan the scene, making large movements to fixate on objects throughout your field of view. Your brain is building up a representation of what is there in the scene – establishing the major axes of the environment, localising landmarks within that space, categorising the objects. Then, you will transition to a slower, more deliberate mode of seeing. In this mode, your eyes will linger on each object for longer, and your eye movements will be smaller. Now, your brain is filling in details about each object. Given enough time, this phase will peter out. At this point, you might turn to another window and start all over again, or engage in a completely different activity – writing a postcard or unpacking.
The fact that our eyes transition from an exploratory mode to an information-extraction mode, without anything new happening in the scene, tells us that visual behaviour is guided by powerful internal control mechanisms.
The first efforts to understand these mechanisms took place in the 1970s, when the psychologists Julian Hochberg and Virginia Brooks at Columbia University studied the transition from rapid exploration to slower information-extraction. To do that work, they created simple movies: slideshows made of still shots of simple abstract patterns, natural scenes and line drawings that told little stories. By getting rid of the motion within each shot, they could carefully quantify which features drove the eyes. They then asked people to watch the slideshows while their eyes were tracked with a special infrared camera setup.
In these initial studies, Hochberg and Brooks documented the switch from an exploratory phase lasting a few seconds at most to an information-extraction phase. The duration of each exploratory phase depended on the complexity of the slide being viewed. When presented with more complex pictures, people explored for longer before they settled down. And when they were given a choice between looking at more complex and less complex images, they spent more time looking at the more complex images. Later researchers investigated which visual features specifically draw the eyes, finding that viewers tend to look at parts of an image with edges, with a lot of contrast between light and dark, with a lot of texture, and with junctures such as corners.
A surfer might be drawn to a surfboard leaning against a tree, while a sailor’s eye is captured by a boat on the horizon
In natural scenes, there are usually several locations with features that draw the eyes. Your beach scene might have an umbrella or two with high contrast, a few palm trees with distinctive texture, and a few chairs with lots of edges and corners. During the exploratory phase, each person viewing the scene will tend to land on most of these focal points, but exactly when and for how long they hit each focal point will vary from person to person depending on differences in their internal control systems. For example, a surfer might be drawn to a surfboard leaning against a tree, while a sailor’s eye is captured by a boat on the horizon.
In my work with colleagues at the Dynamic Cognition Lab at Washington University in St Louis, our studies of viewers watching unedited movies of everyday action show that gaze patterns correspond from one viewer to the next. But the effect is far more powerful in studies of commercial films such as Iron Man 2, where professionals manipulate visual cues such as corners and brightness contrast. Skilled filmmakers can push your eyes around the screen with impressive control.
One difference between real-world scenes and film is that movies move. How does this change what people look at? In a recent experiment, Parag Mital, Tim Smith, Robin Hill and John Henderson from the University of Edinburgh recorded eye movements from a few dozen people while they watched a grab-bag of videos, including ads, documentaries, trailers, news and music videos. A number of effects carried over from looking at still pictures. People still look at places with a lot of contrast, and at corners. However, with moving pictures, new effects dominate: viewers look at things that are moving, and at things that are going from light to dark or from dark to light. This makes good ecological sense: things that are changing are more likely to be relevant for guiding your actions than things that are just sitting there. In particular, the eyes follow new motion that could reveal something that you need to deal with in a hurry – an object falling or an animal on the move.
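The finding that motion and luminance change dominate gaze in moving pictures can be illustrated with a toy salience computation. This is a sketch of the general idea, not the model used in the Edinburgh study: it simply treats the absolute luminance change between consecutive frames as a salience map and predicts that gaze lands at its peak. All function and variable names here are my own.

```python
import numpy as np

def motion_salience(prev_frame, next_frame):
    """Crude salience map: absolute luminance change between two frames.

    Regions that move, or go from light to dark (or dark to light),
    light up. Frames are 2-D arrays of luminance values in [0, 1].
    """
    return np.abs(next_frame.astype(float) - prev_frame.astype(float))

def predicted_gaze(salience):
    """Predict the gaze target as the most salient location (row, col)."""
    return np.unravel_index(np.argmax(salience), salience.shape)

# Toy example: a static scene in which one bright object shifts position.
prev = np.zeros((60, 80))
nxt = np.zeros((60, 80))
prev[10:15, 10:15] = 1.0   # object at its old position
nxt[10:15, 20:25] = 1.0    # object after moving to the right
sal = motion_salience(prev, nxt)
print(predicted_gaze(sal))  # a location inside the changed region
```

A real model would also weight static features such as contrast and corners, but even this frame-differencing baseline captures the headline result: in moving pictures, change wins.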
Motion onsets are known to powerfully capture attention, even more quickly than the eyes can move. For example, when we first see Edward Scissorhands in the 1990 Tim Burton film of the same name, he is attempting to hide in shadow in a complex scene. It is the involuntary movement of his scissors that gives him away, attracting the viewer’s eye at the same moment it attracts the eye of Peg the Avon Lady.
Does this mean that we’re slaves to motion? Whether in film or in real life, will we stop shifting between exploration and information-extraction, and become captives of the movement in the field? Michelle Eisenberg, a PhD student working with me at Washington University in Saint Louis, and I had a hunch that our visual systems would continue the shifts – but that in a sequence with constant motion, the shift from exploratory mode to information-extraction mode would have to be driven by changes in action instead of corners and lights.
Consider the opening shot of Orson Welles’s Touch of Evil (1958). It follows two main characters through a complicated scene in a border town at night for three and a half minutes. The camera and the actors are in constant motion. How do we shift modes under those circumstances? We had a hypothesis, based on other work in the lab showing that people spontaneously parse an ongoing activity into a series of discrete events. This parsing seems to be going on all the time when we are watching others perform actions, and presumably when we are performing actions ourselves. For example, if I show you a movie of a man changing a tyre, you will spontaneously segment that activity into events such as taking out the jack and jacking up the car. At the boundaries between these events, large parts of your cortex show brief increases in metabolic activity that we can see with functional MRI. Perhaps each time one of these boundaries occurs, our visual system treats it as if it were seeing a new picture, with an initial exploration phase and a subsequent information-extraction phase.
In a skilfully made film, the director and editor use several tools to shape visual transitions, whether they are aware of doing so or not
To test this, we tracked viewers’ eyes using a modern version of the infrared camera system that Hochberg and Brooks used, while the viewers watched movies of people doing everyday chores. The movies were a few minutes long, and showed things such as changing a tyre, planting a flower bed, or setting up for a party. We asked people first to just watch the movies, and then to watch them again while pressing a button to indicate the boundaries between events – whenever one event ended and another began. Then, we analysed what their eyes were doing around the time of the event boundaries. We found evidence for the hypothesised shift: toward the ends of events, eye movements tended to be small and the time between them tended to be long. Then, as a new event boundary approached, the time between eye movements became shorter, and then the movements became larger. After the event boundaries, things settled back down. This sequence looks like the system shifting from information-extraction mode to exploration mode, just as Hochberg and Brooks originally observed at the onset of new pictures in their slideshows.
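The core of this kind of analysis can be sketched in a few lines: take each saccade’s time relative to each button-pressed event boundary, bin those relative times, and average saccade amplitude within each bin. This is an illustrative reconstruction, not the lab’s actual pipeline; the function name, the two-second window and the half-second bins are all my own assumptions.

```python
import numpy as np

def align_to_boundaries(saccade_times, saccade_amps, boundaries,
                        window=2.0, bin_width=0.5):
    """Mean saccade amplitude in time bins around event boundaries.

    saccade_times: onset of each saccade (seconds into the film)
    saccade_amps:  amplitude of each saccade (e.g. degrees of visual angle)
    boundaries:    times viewers marked as event boundaries
    Returns (bin_centres, mean_amplitude_per_bin); empty bins come out NaN.
    """
    edges = np.arange(-window, window + bin_width, bin_width)
    sums = np.zeros(len(edges) - 1)
    counts = np.zeros(len(edges) - 1)
    for b in boundaries:
        rel = np.asarray(saccade_times) - b       # time relative to this boundary
        idx = np.digitize(rel, edges) - 1         # which bin each saccade falls in
        for i, a in zip(idx, saccade_amps):
            if 0 <= i < len(sums):                # drop saccades outside the window
                sums[i] += a
                counts[i] += 1
    centres = edges[:-1] + bin_width / 2
    with np.errstate(invalid="ignore"):           # tolerate 0/0 for empty bins
        return centres, sums / counts

# Synthetic demo: big saccades just after a boundary at t = 0 s,
# small ones later in the event.
centres, means = align_to_boundaries([0.1, 0.2, 1.8, 1.9],
                                     [5.0, 5.0, 1.0, 1.0], [0.0])
```

If the exploration-then-extraction pattern holds, the curve of mean amplitude should peak just after the boundary and fall off as the event unfolds, as it does in this toy demo.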
These are not huge effects; they depend on the size of the event, and they aren’t always statistically reliable. In fact, other things drive the eyes as well. In edited movies, for instance, the beginning of a new shot appears crucial for initiating a shift to exploration mode.
In a skilfully made film, the director and editor use several tools to shape visual transitions, whether they are aware of doing so or not. First, the action has been staged. Second, it has been filmed by cameras placed deliberately at particular locations, possibly moving and aided by artificial lighting. Finally, and most importantly, the footage has been edited together, usually with the goal of telling a coherent narrative. These techniques manipulate both the sequence of events and the sequence of visual transitions to help our brains break down the continuous stream of action into a series of episodes, each of which we process like a picture: we first explore, working out the framework of what is happening, and then we turn to filling in the details of that framework. By fiddling with shot layout and editing, filmmakers are continually manipulating the visual process in ways often experimental and new.
For eye movements, probably the most important manipulation here is the last one – editing. Editing is in part the assembly of a collection of individual shots into a larger work. These days, most edits are the simplest possible juncture: a cut, in which the two segments are simply abutted with no transition. (More elaborate edits, such as fades, wipes and iris effects, used to be more popular but were always in the minority.) Editing has major effects on eye movements. Eye movements are frequent after a cut – particularly after the many cuts that happen within an ongoing scene. Often, these eye movements bring the eyes to the centre of the screen. One possibility is that this is a default response to a visual disruption.
Another possibility, which I find very plausible, is that this tendency to look to the centre of the screen after a cut reflects a kind of implicit contract between filmmakers and audiences. In mainstream narrative cinema, camera operators usually try to keep the most important thing in the middle of the frame, especially at the beginning of a shot. As people grow up watching TV and movies, they come to expect this, and so after a cut they look to the middle of the screen because there is likely to be something important there. When a shot starts with the important stuff off to the side (a composition favoured by the Nouvelle Vague director Jean-Luc Godard), it is disorienting.
People also tend to blink at cuts. This might be a response to the sudden change in brightness that can occur at a cut. Or it might reflect that our brains take the visual change as a sign to take a very quick break, using the opportunity to wet our eyes. Last but not least, when we watch edited movies we make more eye movements. Whereas eye movements happen two or three times a second when we look at the real world, they happen at a rate of about four per second when we watch unedited videos; add editing, and the rate increases further, to about five per second.
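Because a cut is an abrupt, global change in the image – the same sudden change that may trigger blinks and re-centring – it can even be detected automatically with a crude frame-differencing heuristic. The sketch below is not a production shot-boundary detector (real ones also use colour histograms and guard against fast camera motion), and the threshold is an arbitrary assumption:

```python
import numpy as np

def find_cuts(frames, threshold=0.3):
    """Flag likely cuts: indices of frames whose mean luminance change
    from the previous frame exceeds a threshold. Frames are 2-D arrays
    of luminance values in [0, 1].
    """
    cuts = []
    for i in range(1, len(frames)):
        change = np.mean(np.abs(frames[i].astype(float) -
                                frames[i - 1].astype(float)))
        if change > threshold:
            cuts.append(i)
    return cuts

# Toy sequence: three identical dark frames, then a bright new shot.
dark = np.zeros((40, 40))
bright = np.ones((40, 40))
print(find_cuts([dark, dark, dark, bright, bright]))  # [3]
```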
Filmmakers have figured out things that I need to know as a scientist studying perception
In short, film editing alters pretty much everything about how we control our eyes: when they move, where they move, and when they blink. Thus, watching film is a dance between the filmmakers – especially the editor – and your visual system. If the filmmakers lead well, your eyes will follow effortlessly, smoothly coordinating the ideas presented by the filmmaker with the perceptual machinery you bring to bear. If not, the dance might be awkward, comic, or even anxiety-provoking. Filmmakers sometimes do this on purpose: if I am trying to make you laugh or scare you, I might try to lead your eyes in a way that makes it harder to figure out what’s going on; I might give your eyes nothing to move to, or too many things at once. Film has the option of playing against our perceptual systems rather than playing nice with them.
Controlling where and when viewers look is important from a practical point of view: filmmakers can save money on the parts of the frame we don’t look at, just as home builders can save money by not painting the attic. This control can also serve artistic goals. In Woody Allen’s Zelig (1983), the title character repeatedly shows up in the frame in the middle of a scene. You realise that he has been there for some seconds, but you never seem to see when or where he entered. As the director, Allen achieves this in part by drawing your eyes to something else and then easing his character into the frame.
When I see such sequences in the films I love, I am struck that accomplished filmmakers have figured out things that I need to know as a scientist studying perception. At the same time, I think that my field might actually have something useful to contribute to the practice of filmmaking. Artists and craftspeople conduct countless informal experiments – you try out something, see how it looks to you, show it to a few friends, maybe even run a test screening and ask a few questions. Over time, you develop informal intuitions and rules of thumb that work. By measuring viewers’ behaviours and physiology in real time while they watch, perhaps my field can help accelerate this process and bring it more into the light.