Basic facts
Object recognition in cortex is thought to be mediated by the ventral visual pathway (Ungerleider and Haxby, 1994), which runs from primary visual cortex, V1, through extrastriate visual areas V2 and V4, to inferotemporal cortex, IT. Based on physiological experiments in monkeys, IT has been postulated to play a central role in object recognition. IT, in turn, is a major source of input to prefrontal cortex (PFC), “the center of cognitive control” (Miller, 2000), which is involved in linking perception to memory.
Over the past decade, several physiological studies in nonhuman primates have established a core of basic facts about cortical mechanisms of recognition that seem to be widely accepted and that confirm and refine older data from neuropsychology. A brief summary of this consensus knowledge begins with the groundbreaking work of Hubel and Wiesel, first in the cat (1962, 1965) and then in the macaque (1968). Starting from simple cells in primary visual cortex, V1, with small receptive fields that respond preferentially to oriented bars, neurons along the ventral stream (Logothetis and Sheinberg, 1996; Perrett and Oram, 1993; Tanaka, 1996) show an increase in receptive field size as well as in the complexity of their preferred stimuli (Kobatake and Tanaka, 1994). At the top of the ventral stream, in anterior inferotemporal cortex (AIT), cells are tuned to complex stimuli such as faces (Desimone, 1991; Desimone et al., 1984; Gross et al., 1972; Perrett et al., 1992). The tuning of the view-tuned and object-tuned cells in AIT depends on visual experience, as shown by Logothetis et al. (1995) and supported by Booth and Rolls (1998), DiCarlo and Maunsell (2000), and Kobatake et al. (1998). A hallmark of these IT cells is the robustness of their firing to stimulus transformations such as scale and position changes (Logothetis et al., 1995; Logothetis and Sheinberg, 1996; Perrett and Oram, 1993; Tanaka, 1996). In addition, as other studies have shown (Booth and Rolls, 1998; Hietanen et al., 1992; Logothetis et al., 1995; Perrett and Oram, 1993), most neurons show specificity for a certain object view or lighting condition.
In particular, Logothetis et al. (1995) trained monkeys to perform an object recognition task with isolated views of novel three-dimensional objects (“paper clips”; see the objects at the top of Fig. 111.1). When recording from the animals' IT, they found that the great majority of neurons selectively tuned to the training objects were view-tuned (with a half-width of about 20 degrees for rotation in depth) to one of the training objects [about one-tenth of the tuned neurons were view-invariant, in agreement with earlier predictions (Poggio and Edelman, 1990)] but showed an average translation invariance of 4 degrees (for typical stimulus sizes of 2 degrees) and an average scale invariance of 2 octaves (Riesenhuber and Poggio, 1999b). Thus, whereas view-invariant recognition requires visual experience of the specific novel object, position and scale invariance seems to be immediately present in the view-tuned neurons (Logothetis et al., 1995), without the need for visual experience of views of the specific object at different positions and scales. A very recent study (DiCarlo and Maunsell, 2003), using different stimuli and a different training paradigm, reports translation invariance from one view of less than 3 degrees, pointing to a possible influence of training history and object shape on invariance ranges. Recent functional magnetic resonance imaging (fMRI) data have shown a similar pattern for the lateral occipital complex (LOC), a brain region in human visual cortex central to object recognition and believed to be the homolog of monkey area IT (Grill-Spector et al., 2001; Malach et al., 1995; Tanaka, 1997). Optical recordings in monkeys have confirmed the view dependence of several face-tuned neurons (Wang et al., 1996).
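To make the reported tuning width concrete, the following is a minimal illustrative sketch (not from the original study): it models a view-tuned unit as a Gaussian over rotation in depth with a half-width at half-maximum of about 20 degrees, as reported by Logothetis et al. (1995). The Gaussian shape and the normalization to a peak response of 1 are assumptions made for illustration.

```python
import numpy as np

# Illustrative assumption: Gaussian tuning over rotation in depth, with a
# half-width at half-maximum (HWHM) of ~20 degrees (Logothetis et al., 1995).
HWHM = 20.0                                # degrees
sigma = HWHM / np.sqrt(2.0 * np.log(2.0))  # convert HWHM to a standard deviation

def view_tuned_response(view_angle, preferred_angle=0.0):
    """Normalized firing rate: 1 at the preferred view, 0.5 at +/- HWHM."""
    return np.exp(-(view_angle - preferred_angle) ** 2 / (2.0 * sigma ** 2))

for angle in (0, 10, 20, 40, 90):
    print(f"rotation {angle:3d} deg -> response {view_tuned_response(angle):.2f}")
```

As expected, the response falls to 0.5 at a 20-degree rotation away from the preferred view and is nearly zero beyond about 60 degrees.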
Figure 111.1.
The figure shows the first part of the Standard Model. It extends several recent models (see especially Fukushima, 1980; see also Perrett and Oram, 1993; Poggio and Edelman, 1990; Riesenhuber and Poggio, 1999b; Wallis and Rolls, 1997). The view-based module, HMAX, shown here is a hierarchical extension of the classical paradigm (Hubel and Wiesel, 1962) of building complex cells from simple cells. The circuitry consists of a hierarchy of layers leading to greater specificity and greater invariance by using two different types of pooling mechanisms. The first layer (S1) in V1 consists of linear oriented filters followed by input normalization, similar to simple cells (Carandini et al., 1997); each unit in the next layer (C1) pools the outputs of simple cells with the same orientation but at slightly different positions (scales) by using a maximum operation (see text and Riesenhuber and Poggio, 1999b). Each of these units is still orientation selective but more invariant to position (scale), similar to some complex cells. In the next stage (S2), signals from complex cells with different orientations but similar positions are combined (in a weighted sum) to create neurons tuned to a dictionary of more complex features. The next layer (C2) is similar to the C1 cells: by pooling together signals from S2 cells of the same type but at slightly different positions (and scales), the C2 units become more invariant to position (and scale) but preserve feature selectivity. They may correspond roughly to V4 cells. In the model, the C2 cells feed into view-tuned cells, with connection weights that are learned from exposure to a view of an object. There may be additional levels in the hierarchy after the C2 layer that are not shown in the figure. The output of the view-based module is represented by view-tuned model units that exhibit tight tuning to rotation in depth (as well as to illumination and other object-dependent transformations such as facial expression) but are tolerant to scaling and translation of their preferred object view. Notice that the cells labeled here as view-tuned units encompass, between PIT and AIT, a spectrum of tuning from full views to components or complex features: depending on the synaptic weights determined during learning, each view-tuned cell becomes effectively connected to all or only a few of the units activated by the object view (Riesenhuber and Poggio, 1999a).
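The alternation of the two pooling mechanisms in the caption can be summarized in a few lines of code. The following is a simplified sketch in the spirit of HMAX (Riesenhuber and Poggio, 1999b), not the published implementation: the random filter values, layer sizes, pooling range, and the Gaussian template-matching nonlinearity at the S stages are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def s1_responses(image, filters):
    """S1: apply each linear oriented filter at every valid position."""
    fh, fw = filters.shape[1:]
    out = np.empty((len(filters),
                    image.shape[0] - fh + 1,
                    image.shape[1] - fw + 1))
    for k, f in enumerate(filters):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[k, i, j] = np.sum(image[i:i + fh, j:j + fw] * f)
    return out

def c_pool(resp, pool=4):
    """C1/C2: MAX over nearby positions, separately per feature type.
    This is the pooling step that builds invariance."""
    k, h, w = resp.shape
    r = resp[:, :h - h % pool, :w - w % pool]
    r = r.reshape(k, h // pool, pool, w // pool, pool)
    return r.max(axis=(2, 4))

def s_tune(activity, templates, sigma=10.0):
    """S2 / view-tuned stage: Gaussian tuning around stored templates.
    This is the combination step that builds selectivity."""
    x = activity.ravel()
    d2 = np.array([np.sum((x - t) ** 2) for t in templates])
    return np.exp(-d2 / (2.0 * sigma ** 2))

image = rng.standard_normal((32, 32))
filters = rng.standard_normal((4, 5, 5))   # 4 "orientations" (random stand-ins)
c1 = c_pool(s1_responses(image, filters))

# One template "learned" from this very input, plus two unrelated templates:
templates = np.vstack([c1.ravel(), rng.standard_normal((2, c1.size))])
print(np.round(s_tune(c1, templates), 3))  # matching template responds ~1
```

The essential design point is the alternation: the S stages gain selectivity by combining different feature types, while the C stages gain invariance by taking a MAX over positions and scales within one feature type.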
A comment about the architecture is important: in its basic initial operation, akin to immediate recognition, the hierarchy is likely to be mainly feedforward (though local feedback loops almost certainly play key roles, e.g., possibly in performing a maximum-like pooling; see later) (Perrett and Oram, 1993). Event-related potential data (Thorpe et al., 1996) have shown that the process of object recognition takes remarkably little time, on the order of the latency of the ventral visual stream (Perrett et al., 1992), adding to earlier psychophysical studies using a rapid serial visual presentation (RSVP) paradigm (Intraub, 1981; Potter, 1975), which found that subjects were still able to process images presented as rapidly as eight per second.
In summary, the accumulated evidence points to six broadly accepted properties of the ventral stream architecture:
1. A hierarchical buildup of invariances, first to position and scale and then to viewpoint and more complex transformations that require interpolation between several different object views
2. In parallel, an increasing size of the receptive fields
3. An increasing complexity of the optimal stimuli for the neurons
4. A basic feedforward processing of information (for “immediate” recognition tasks)
5. Plasticity and learning probably at all stages and certainly at the level of IT
6. Learning specific to an individual object is not required for scale and position invariance (over a restricted range), as the toy sketch following this list illustrates
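To make property 6 concrete, here is a toy sketch (our illustration, not the chapter's): a unit that takes a MAX over the positions of a fixed local filter responds identically to a pattern wherever the pattern appears, with no training on shifted examples.

```python
import numpy as np

# Toy illustration of property 6: MAX pooling over positions yields
# translation invariance without any object-specific learning.
feature = np.array([1.0, -1.0, 1.0])  # a fixed local filter

def max_pooled_response(signal):
    """Slide the filter across all positions, then take the MAX."""
    scores = [signal[i:i + 3] @ feature for i in range(len(signal) - 2)]
    return max(scores)

pattern = [1.0, -1.0, 1.0]
at_left, at_right = np.zeros(12), np.zeros(12)
at_left[2:5] = pattern    # pattern at one position
at_right[7:10] = pattern  # same pattern, translated

print(max_pooled_response(at_left), max_pooled_response(at_right))  # 3.0 3.0
```

The pooled response is the same for both placements because the MAX picks out the best-matching position, which is exactly the role the maximum operation plays at the C stages of the model.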