The Visual Neurosciences: Shape Dimensions and Object Primitives
Structural representation: theory

According to many theories (Milner, 1974; Selfridge, 1959; Sutherland, 1968), this transformation is based on an alphabet of simple shape elements or primitives that correspond to common real-world object components. Each neuron would represent one type of primitive, responding whenever that primitive was present within its receptive field. A given object would be represented by the combined activity of a number of such neurons, each signaling one of the primitives constituting the object. A complete representation would also require information about the relative position and connectivity between primitives.

This is a structural representation in the sense that neurons explicitly encode the geometric composition of the object. The idea is also referred to as representation by parts or representation by components (Biederman, 1987), since the object is described in terms of its parts or primitives. Parts-based representation satisfies the requirement for consistency, since the list of parts making up an object does not change when the retinal image changes. The particular parts that are visible may change when the object rotates (due to self-occlusion), but a familiar object is recognizable from a subset of its parts. Structural or parts-based representation also satisfies the requirement for similarity or second-order isomorphism: Explicitly encoding the geometrical structure of objects ensures that similar objects will have similar neural representations. Finally, parts-based coding has the efficiency and capacity required to represent the infinite space of object shape. A finite number of neurons encoding basic shape elements can represent any combination of those elements in the same way that letters of the alphabet can represent any word.

The discrete form of the theory described above is convenient for conveying the basic coding principle, and it analogizes to letters encoding words and DNA triplets encoding proteins. But the notion of stereotyped shape primitives signaled by all-or-nothing neural responses is a simplification. The shapes of real-world object components vary continuously. Correspondingly, visual neurons respond in a graded fashion across a range of shapes. Thus, the shape alphabet is really a set of shape dimensions suitable for describing object components—a multidimensional feature space (Edelman, 1999; Edelman and Intrator, 2000). [Some authors would reserve the label “structural” for the discrete form of the theory (Edelman and Intrator, 2000); I use it here to encompass all schemes in which part identity and position are explicitly represented.] Neurons with graded tuning in those dimensions would provide an analog signal (in spikes per second) related to how closely shape elements in the current image match their tuning peaks. Neural tuning peaks would be distributed across shape dimensions, so that any value could be represented. A given object would be represented by a constellation of population activity peaks corresponding to its constituent parts.
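The graded tuning described above can be illustrated with a minimal sketch. Gaussian tuning is an assumption here (the text specifies only that responses are graded), and the parameter values are hypothetical; the point is that response rate falls off continuously as a stimulus departs from the neuron's tuning peak along a shape dimension.

```python
import math

def tuning_response(stimulus, preferred, sigma=0.2, peak_rate=30.0):
    """Graded (Gaussian) response of a model neuron, in spikes/s, as a
    function of distance between a stimulus value and the neuron's tuning
    peak along one shape dimension.  Gaussian shape and parameter values
    are illustrative assumptions."""
    d = stimulus - preferred
    return peak_rate * math.exp(-(d * d) / (2.0 * sigma * sigma))

# A neuron tuned for curvature 0.5 responds maximally to that value
# and in a graded fashion to nearby curvatures.
print(tuning_response(0.5, 0.5))   # peak response
print(tuning_response(0.3, 0.5))   # weaker response to a nearby shape
```

The analog signal (spikes per second) thus encodes closeness of match, rather than an all-or-nothing detection of a stereotyped primitive.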

Example: Contour Fragments

What would a structural representation look like? Figure 71.1 illustrates a structural scheme for encoding two-dimensional (2-D) outline and silhouette-like shapes such as alphanumeric characters. The shape to be represented is a bold numeral 2 (Fig. 71.1A). There are a number of ways in which this shape could be decomposed into parts. The decomposition shown here is based on curved contour fragments. (The theoretical and empirical reasons for proposing contour fragments as parts are discussed below.) The lowercase letters label contour fragments with different curvature values. These fragments can be represented in four dimensions, two describing shape (Fig. 71.1B) and two describing relative position (Fig. 71.1C).

Figure 71.1.

A structural (parts-based) shape-coding scheme based on contour fragments. A, The example shape, a bold numeral 2, can be decomposed into contour fragments (a–g) with different curvatures, orientations, and positions. B, The curvature and orientation of each contour fragment are plotted on a 2-D domain. C, The positions of the contour fragments (relative to the object center) are plotted on a 2-D domain. Together, plots B and C represent a 4-D domain for describing contour fragments.


The two shape dimensions shown here are curvature and orientation. Curvature (radial axis in Fig. 71.1B) can be either positive (convex, projecting outward) or negative (concave, indented inward). Curvature is defined mathematically as the rate of change in tangent angle per unit contour length. For a circle, curvature is inversely related to radius. Thus, larger values signify tighter, sharper, more acute curvature. Extremely large values correspond to curvatures so tight that we perceive them as tangent discontinuities, that is, angles or corners. In Figure 71.1B, the curvature scale is squashed so that very sharp curves or angles have a value of 1.0 (convex) or −1.0 (concave). Thus, the sharp convex angle labeled b has a curvature value of 1.0, and the sharp concave angle g has a value of −1.0. The broader convexity a has a value near 0.5, and the broader concavity c has a value near −0.5. The straight contour segments (not labeled) would have curvature values of 0.
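The curvature definitions above can be made concrete in a few lines. The reciprocal relation for circles is standard; the particular squashing function used to compress the curvature axis into [−1, 1] is not specified in the text, so the hyperbolic tangent below is an assumption chosen only because it saturates smoothly.

```python
import math

def circle_curvature(radius):
    """Curvature of a circle is the reciprocal of its radius: the rate
    of change of tangent angle per unit contour length."""
    return 1.0 / radius

def squashed(curvature, scale=1.0):
    """Compress the unbounded curvature axis into [-1, 1] so that very
    sharp curves and angles saturate at +/-1.  tanh is a hypothetical
    choice; the chapter does not specify the squashing function."""
    return math.tanh(curvature / scale)

# Tighter circles have larger curvature...
print(circle_curvature(0.5) > circle_curvature(2.0))  # True
# ...and a near-angle (very large curvature) saturates near 1.0,
# while the mirror-image concavity saturates near -1.0.
print(squashed(100.0) > 0.99, squashed(-100.0) < -0.99)
```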

The other shape dimension is orientation (angular axis in Fig. 71.1B), which in this context means the direction in which the curved fragment “points.” More precisely, this is the direction of the surface normal—the vector pointing away from the object, perpendicular to the surface tangent—at the center of the curved contour fragment. Thus, the sharp convexity b points toward the lower left (225 degrees), the broad convexity a points toward the upper right (45 degrees), and so on. Note that this definition of orientation differs from the standard definition for straight lines or edges. The standard definition is orientation of the surface tangent rather than the surface normal. The orientation of the normal is more useful because it also indicates figure/ground direction. Under the convention used here, in which the surface normal points away from the figure interior, 45 degrees specifies a contour with the figure side on the lower left (e.g., a), while 225 degrees specifies a contour with the figure side on the upper right (e.g., b). The tangent in these two cases would be the same.
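The tangent/normal distinction can be sketched directly: a tangent at angle t is compatible with two candidate normals, t − 90 and t + 90 degrees, and choosing the one that points away from the figure interior is what disambiguates figure side. The function below is illustrative only.

```python
def normal_orientations(tangent_deg):
    """For a contour tangent at angle t (degrees), return the two
    candidate surface normals, t-90 and t+90 (mod 360).  The normal
    pointing away from the figure interior is the one used for the
    orientation dimension; the other corresponds to the opposite
    figure side."""
    return ((tangent_deg - 90) % 360, (tangent_deg + 90) % 360)

# A tangent at 135 degrees yields candidate normals at 45 and 225
# degrees: figure on the lower left gives normal 45 (as for fragment a),
# figure on the upper right gives normal 225 (as for fragment b).
print(normal_orientations(135))
```

This makes explicit why fragments a and b, with identical tangents, occupy opposite positions on the orientation axis.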

The relative position dimensions are shown in Figure 71.1C. The coordinate system used here is polar; the two dimensions are angular position and radial position with respect to the object center. (The radial position scale is relative to object height.) Polar coordinates are convenient for representing shape because changes in object size do not affect angular position and produce a uniform scaling of radial position. Contour section a is at the upper right with respect to the object center, so it is plotted near 45 degrees; b is at the upper left and is plotted at 135 degrees. There are many other ways in which the necessary position information could be parameterized.
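The convenience of polar coordinates claimed above is easy to verify numerically: with radial position normalized by object height (the figure's convention), a uniform size change leaves both position dimensions untouched. Coordinate conventions below are assumptions for illustration.

```python
import math

def polar_position(x, y, cx, cy, height):
    """Fragment position relative to the object center (cx, cy), in
    polar coordinates.  Radial distance is expressed relative to object
    height, following the convention in Figure 71.1C."""
    dx, dy = x - cx, y - cy
    angle = math.degrees(math.atan2(dy, dx)) % 360
    radius = math.hypot(dx, dy) / height
    return angle, radius

# Doubling object size leaves angular position unchanged and, because
# radius is normalized by object height, radial position as well.
a1 = polar_position(3.0, 3.0, 0.0, 0.0, 10.0)   # original object
a2 = polar_position(6.0, 6.0, 0.0, 0.0, 20.0)   # same object, doubled
print(a1, a2)
```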

These four dimensions capture much of the information needed to specify a simple shape like the bold numeral 2. A few important shape dimensions, like contour fragment length and connectivity, have been left out for simplicity. Also, the bold 2 exemplifies just one class of 2-D objects. A much higher dimensionality would be needed to represent three-dimensional (3-D) objects, objects with internal structure, objects of greater shape complexity, and objects defined by color and texture variations.

In a neural representation, the four dimensions in Figure 71.1 would constitute the tuning space for a large population of cells. Figures 71.1B and 71.1C can be thought of as 2-D projections of a single four-dimensional (4-D) domain. Each cell would have a tuning peak somewhere in the 4-D space, and tuning peaks would be distributed across the entire space. Each contour fragment would be represented by an activity peak in the population response. In other words, if all the neurons' responses were plotted, using a color scale, at their tuning peak locations in Figures 71.1B and 71.1C, there would be hot spots at the points corresponding to the object's contour fragments. Fragment a, for example, would be represented by strong activity in the tuning range labeled a in Figures 71.1B and 71.1C, that is, strong activity in neurons tuned for broad convexity oriented near 45 degrees and positioned near the upper right of the object. The bold 2 as a whole would be represented by the constellation of peaks indicated by all the lowercase letters.
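A schematic version of this population code can be written down directly. The cells, tuning peaks, and bandwidths below are hypothetical, and the Gaussian tuning ignores the circularity of the angular dimensions; the sketch shows only that a contour fragment produces a hot spot among cells whose 4-D tuning peaks lie near it.

```python
import math

def response(cell_peak, fragment, sigmas):
    """Gaussian tuning in the 4-D (curvature, orientation, angular
    position, radial position) domain.  A schematic model: circularity
    of the angular dimensions is ignored for simplicity."""
    z = sum(((f - p) / s) ** 2 for f, p, s in zip(fragment, cell_peak, sigmas))
    return 30.0 * math.exp(-0.5 * z)

sigmas = (0.2, 45.0, 45.0, 0.15)   # assumed tuning bandwidths
# Tuning peaks for three hypothetical cells:
# (curvature, orientation, angular position, radial position)
cells = [(0.5, 45.0, 45.0, 0.5),    # broad convexity, upper right (like a)
         (1.0, 225.0, 135.0, 0.4),  # sharp convexity, upper left (like b)
         (-0.5, 90.0, 270.0, 0.3)]  # broad concavity, lower region
fragment_a = (0.5, 45.0, 45.0, 0.5)
rates = [response(c, fragment_a, sigmas) for c in cells]
hottest = rates.index(max(rates))
print(hottest)  # the cell tuned like fragment a responds most strongly
```

Plotting every cell's rate at its tuning-peak location would reproduce the hot-spot picture described in the text: one activity peak per constituent fragment.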

The population response pattern would not consist only of punctate peaks. Regions of constant or gradually changing curvature would be represented by continuous ridges in curvature space. For example, the broad convex region labeled a in Figure 71.1A would be represented by an arc-shaped ridge running clockwise from 135 degrees to 315 degrees in Figure 71.1B, because it would stimulate cells sensitive to broad convex curvature at all those orientations. The sharp angle at b, on the other hand, would be represented by a punctate peak. The entire pattern of ridges and peaks would characterize the sequence of gradual and abrupt curvature changes in the shape. Neural representations are often thought of as single population activity peaks, but one study of motion coding has shown that the visual system can be sensitive to aspects of the population response pattern other than peak position (Treue et al., 2000).
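The ridge/peak distinction can be illustrated with toy samples. A constant-curvature arc contains contour points whose surface normals sweep through a range of orientations, so it drives curvature-tuned cells at many orientations (a ridge); a sharp angle is localized at one orientation (a punctate peak). The sample values below are illustrative, not measured.

```python
# (curvature, orientation) samples along the broad convexity a: same
# curvature at every point, but normals sweeping from 135 to 315 degrees.
arc_samples = [(0.5, o) for o in range(135, 316, 15)]
# The sharp angle b occupies a single point in the same space.
corner_samples = [(1.0, 225)]

arc_orientations = {o for _, o in arc_samples}
corner_orientations = {o for _, o in corner_samples}
# The arc activates many orientation bins (a ridge); the corner, one (a peak).
print(len(arc_orientations), len(corner_orientations))
```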

Neural representation in terms of contour fragments would have some of the important characteristics required for object perception. It would be relatively robust to variations in an object's retinal image such as size and position changes. The population pattern would be stable in the orientation and angular position dimensions, and it would scale uniformly in the curvature and radial position dimensions. If curvature and radial position were represented relative to object size, the pattern would be stable in those dimensions as well.

Contour fragment coding would also meet the requirement of similar representations for similar objects (second-order isomorphism). The numeral 2 rendered in other fonts would retain key features in the curvature representation, such as the broad convexity near the upper right and the sharp convexity near the lower left. In other words, all 2s would evoke a ridge somewhere near a and a peak somewhere near b in the 4-D population response space (Fig. 71.1). In fact, it is that kind of curvature pattern that defines the numeral 2 and allows us to generalize across the entire category of printed and handwritten 2s. Learning a shape category is a process of finding the characteristic features that define that category. It is critical that the neural representations of those features be consistent or at least grouped in neural tuning space.

Finally, because of its combinatorial, alphabet-like coding power, the scheme shown in Figure 71.1 would have the capacity and versatility to represent a virtual infinity of shapes composed of standard contour fragments. This could be accomplished by a reasonable number of neurons with tuning functions spanning the 4-D contour curvature space. As noted above, however, a higher-dimensional space would be required to represent more complex objects.
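The combinatorial capacity argument amounts to simple arithmetic: with a finite alphabet of primitives, the number of representable part sequences grows exponentially with the number of parts, just as words grow out of letters. The numbers below are illustrative only.

```python
# Illustrative capacity calculation for a discrete alphabet of
# shape primitives (numbers are hypothetical, not from the text).
n_primitive_types = 26   # size of the primitive "alphabet"
n_parts = 5              # parts per object
print(n_primitive_types ** n_parts)  # 11,881,376 distinct 5-part objects
```

In the continuous version of the scheme, the same point holds with tuning functions tiling each dimension in place of discrete letters.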

Shape Recognition Models

The coding scheme in Figure 71.1 is just one way to parameterize shape. There are a number of theoretical ideas about shape primitives or shape dimensions (both of which can be grouped under the general heading of shape descriptors). Most theories posit a hierarchical progression of parts complexity, with each stage in the processing pathway receiving input signals for simpler parts and synthesizing them into output signals for more complex parts (Barlow, 1972; Hubel and Wiesel, 1959, 1968). In almost all models, the first level of shape description is local linear orientation (i.e., orientation of straight edges and lines). This choice is dictated by the overwhelming evidence that linear orientation is accurately and explicitly represented by cells at early stages in the ventral pathway (V1 and V2) (Baizer et al., 1977; Burkhalter and Van Essen, 1986; Hubel and Livingstone, 1987; Hubel and Wiesel, 1959, 1965, 1968).

Theories diverge concerning higher-level shape descriptors. Marr distinguished two general possibilities: boundary or surface-based descriptors and axial or volumetric descriptors (Marr and Nishihara, 1978). The contour curvature scheme illustrated in Figure 71.1 is a surface-based description; it encodes the shape boundary, specifically the 2-D outline. This is a complete description of a flat silhouette shape like the numeral 2. It would also capture much of the important information about a 3-D shape and could even be used to infer 3-D surface shape (Koenderink, 1984). The potential importance of contour curvature was recognized by Attneave (1954), who pointed out that shape information is concentrated in contours at regions of high curvature, including angles. Angles may be particularly significant, because they are invariant to transformations in scale and can be easily derived by summing inputs from cells tuned for edge orientation (Milner, 1974). Contour curvature could serve as a final description level (Hoffman and Richards, 1984) or it could be used to infer the structure of more complex parts (Biederman, 1987; Dickinson et al., 1992; Hummel and Biederman, 1992).

Axial or volumetric descriptors constitute the ultimate level of representation in many theories. A volumetric primitive is a solid part, defined by the shape (straight or curved) and orientation (2-D or 3-D) of its medial axis. A complete object description in terms of medial axes is like a stick-figure drawing. That description can be refined with other parameters to represent how object mass is disposed about the axes—what the cross-sectional shape is and how width varies along the axis. A volumetric description of the bold numeral 2 would involve three medial axes, one curved and two straight, with corresponding width functions to specify the thick/thin structure of the font. (Cross-section would not be an issue for a flat 2-D shape.)

Volumetric primitives are also known as generalized cones (Marr and Nishihara, 1978; a generalized cone is constructed by sweeping a cross-section of constant shape but smoothly varying size along an axis) or geons (Biederman, 1987). Marr argued that a volumetric description would be more compact and stable than a surface-based description. For most alphanumeric symbols, the axis-defining dimensions would capture the stable, category-defining characteristics. More variable (e.g., font-specific) contour information would be segregated into the width and cross-sectional dimensions. However, medial axes must initially be inferred from surface or boundary information. This requires first segmenting the surface contour into parts (Marr and Nishihara, 1978), probably at regions of high concave curvature, because these represent joints between interpenetrating volumes (Hoffman and Richards, 1984). The medial axis for each part would then be derived from its contours. Some authors have proposed mechanisms for inferring 3-D volumetric structure from certain characteristic 2-D contour configurations (Biederman, 1987; Dickinson et al., 1992; Lowe, 1985).
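The generalized-cone construction (sweeping a cross-section of constant shape but smoothly varying size along an axis) can be sketched as a sampling procedure. The circular cross-section, straight axis, and function names below are assumptions for illustration.

```python
def generalized_cone(axis_length, width, n=5):
    """Sketch of a generalized cone: sweep a cross-section of constant
    shape but smoothly varying size along a straight medial axis.
    `width` is a hypothetical width function of normalized position
    t in [0, 1] along the axis; returns (position, radius) samples.
    A circular cross-section is assumed."""
    return [(axis_length * i / (n - 1), width(i / (n - 1)))
            for i in range(n)]

# A cone that tapers linearly from radius 1.0 to radius 0.2:
print(generalized_cone(10.0, lambda t: 1.0 - 0.8 * t))
```

In this parameterization, the axis dimensions carry the stable, category-defining structure, while the width function carries the more variable (e.g., font-specific) information, as Marr argued.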

As discussed above, a complete structural representation requires not just a list of primitives but also a description of their spatial arrangement (as in Fig. 71.1C). Spatial information is initially available to the visual system in retinotopic coordinates. Because the retinal image of an object is so variable, it would be useful to transform spatial information into an object-centered reference frame (but see Edelman and Intrator, 2000, who argue that coarse retinotopy would suffice). At the least, the object could define the center of the reference frame, so that changes in position on the retina would not alter the neural representation. In other words, neural shape responses would be position invariant at the final level of representation. The object might also define the scale of the reference frame, meaning that shape responses would be size invariant at the final level. This would make the neural representation stable across changes in viewing distance.
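The limited object-centered transformation described here (origin and scale defined by the object, orientation left in retinal coordinates) can be demonstrated directly: normalizing by the object's centroid and height makes the representation invariant to retinal translation and scaling. The normalization choices below are assumptions.

```python
def object_centered(points):
    """Transform retinotopic (x, y) coordinates into a hypothetical
    object-centered frame: the object's centroid defines the origin
    (position invariance) and its height defines the scale (size
    invariance).  Orientation remains in retinal coordinates, as in
    the limited transformation discussed in the text."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    height = max(y for _, y in points) - min(y for _, y in points)
    return [((x - cx) / height, (y - cy) / height) for x, y in points]

shape = [(0.0, 0.0), (2.0, 0.0), (2.0, 4.0), (0.0, 4.0)]
# The same shape translated by (10, 5) and doubled in size:
shifted_scaled = [(10 + 2 * x, 5 + 2 * y) for x, y in shape]
print(object_centered(shape) == object_centered(shifted_scaled))  # True
```

Rotation, by contrast, would change this representation, which is exactly the residual viewpoint problem taken up next.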

Some theories would limit the spatial transformation to these two changes—position and scale (e.g., Fig. 71.1C). The orientation of the reference frame would still be defined by the retina or by the head, body, or world (which are usually aligned with the retina). If so, the neural representation would change when the object rotated in either 2-D space (around an axis pointing toward the viewer, e.g., turning upside down) or 3-D space (around any other axis, e.g., rotating around a vertical axis). Dealing with viewpoint changes of this kind is one of the most difficult aspects of shape recognition. One idea is that neural shape representations are viewpoint-dependent—different views of the same object are represented differently—and the visual system learns to recognize an object by storing a limited set of canonical views (Poggio and Edelman, 1990; Tarr and Pinker, 1989; Vetter et al., 1995). Intermediate views would be handled by neural mechanisms for interpolating between the canonical views (Poggio and Edelman, 1990). A more absolute solution to the viewpoint problem is to transform the structural description into a 3-D reference frame defined completely (in position, scale, and orientation) with respect to the object (Biederman, 1987; Dickinson et al., 1992). This would yield a more stable neural representation, but it would require complex mechanisms for inferring and synthesizing 3-D structure.

Mel and Fiser (2000) have described an alternative to encoding each part position in a single spatial reference frame. Units sensitive to part (or feature) conjunctions can represent not only identity but also local connectivity between parts. In the example in Figure 71.1, some neurons might be tuned for the conjunction of fragments a and b, others for b-c, others for c-d, and so on. The response pattern across a population of such units would constitute a unique representation for the bold numeral 2. In effect, the local conjunctions would be concatenated to specify the entire sequence of contour fragments. As discussed below, recent neurophysiological results provide support for this idea (Pasupathy and Connor, 2001).
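The conjunction idea can be sketched as a set of adjacent-fragment pairs. This is a toy rendering of the scheme as described in the text, using the fragment labels a-g from Figure 71.1; the function and data structure are illustrative assumptions.

```python
def conjunction_code(fragments):
    """Represent a shape by the set of its adjacent-fragment
    conjunctions: each model unit signals the co-occurrence of two
    neighboring parts, and together the conjunctions specify the
    whole contour sequence (a toy version of the Mel and Fiser
    scheme as described in the text)."""
    return {(fragments[i], fragments[i + 1])
            for i in range(len(fragments) - 1)}

numeral_2 = ["a", "b", "c", "d", "e", "f", "g"]
code = conjunction_code(numeral_2)
print(("a", "b") in code, ("b", "c") in code)   # True True
# Reordering the fragments yields a different conjunction code, so
# local connectivity is captured without any global reference frame:
print(code == conjunction_code(["b", "a", "c", "d", "e", "f", "g"]))  # False
```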

Figure 71.2.

Tuning of a single macaque area V4 neuron in the curvature × orientation domain. The gray circular backgrounds indicate the neuron's average responses to each of the stimuli (white icons). The scale bar shows that light backgrounds correspond to response rates near zero, dark backgrounds to response rates near 30 spikes per second.


The general alternative to parts-based or structural representation is holistic representation—coding schemes in which each signal carries information about an entire object or an entire scene rather than just one part. Fourier-like decomposition, with basis functions extending across the entire domain, is an example. Edelman proposes that shapes are represented in a multidimensional space defined by pointwise information across whole objects. The high dimensionality of a point-based representation can be reduced by describing a novel shape in terms of its distances from a limited number of learned reference shapes (Edelman, 1999). A recent extension of this theory posits that the reference shapes could be object fragments and that the positions of those fragments could be represented in retinotopic or object-centered space (Edelman and Intrator, 2000). This would constitute a flexible, learning-based (and explicitly continuous) version of structural representation. Ullman (1996) proposes that an object is recognized by virtual (neural) alignment of its complete retinal image (through appropriate translation, scaling, and rotation) and pointwise matching with a shape template stored in memory. He notes that this approach could also be integrated with structural decomposition mechanisms.
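Edelman's dimensionality-reduction step (describing a novel shape by its distances from a limited number of learned reference shapes) reduces to a short computation. The stand-in feature vectors and the Euclidean metric below are assumptions for illustration.

```python
import math

def distance_code(shape, references):
    """Describe a novel shape by its vector of distances to a few
    learned reference shapes (Edelman-style dimensionality reduction,
    as described in the text).  Shapes here are stand-in 2-D feature
    vectors, and Euclidean distance is an assumed metric."""
    return [math.dist(shape, r) for r in references]

references = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # learned prototypes
novel = (0.5, 0.5)
# However high-dimensional the underlying shape space, the novel shape
# is now summarized by just len(references) numbers.
print(distance_code(novel, references))
```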

© 2010 The MIT Press