The Visual Neurosciences: Visual Perception of Texture, Section 1

Texture segregation

Texture Features

Much of the work on texture perception concerns the ability of observers to discriminate certain texture pairs effortlessly. For example, Figure 73.2 shows rectangular regions of Xs and Ts on a background of Ls. Observers can perceive effortlessly that there is a region of Xs different from the background, that this region has smooth, continuous borders, and that these borders form a rectangular shape. This is referred to as the segregation of figure from ground, or the segmentation of the image into multiple homogeneous regions. At the same time, none of these observations can be made about the region of Ts without effortful, item-by-item scrutiny of the individual texture elements.

Figure 73.2. Texture segregation. Note that the region of Xs on the left is easily segregated from the background of Ls. One immediately perceives the borders between the two regions and the shape of the region containing the Xs. By contrast, the border between the Ts and Ls is difficult to see, and the shape of the region of Ts can only be discerned slowly, effortfully, and with item-by-item scrutiny.


This sort of observation led a number of investigators to consider which aspects of image structure lead to preattentive segregation of textures. Beck, Attneave, and their colleagues (Beck, 1972, 1973; Olson and Attneave, 1970) hypothesized that textural segmentation is based on the distribution of simple properties of texture elements, where the simple properties include brightness, color, size, the slopes of contours, and other elemental descriptors of a texture. Marr (1976) added contour terminations as an important feature.

Julesz's early efforts centered on image statistics. He first suggested (Julesz et al., 1973) that differences in dipole statistics determined whether texture pairs would segregate. (These are the joint image statistics of the gray levels found at the opposite ends of a line segment of a particular length and orientation, as it is placed at all possible image locations, gathered for all possible pairs of gray levels, dipole lengths, and orientations.) But counterexamples to this were found (e.g., Caelli and Julesz, 1978). It was then suggested that textures with identical third-order statistics would prove indiscriminable. (Analogous to dipole statistics, these are joint image statistics of the gray levels found at the three corners of a triangle with a particular size, shape, and orientation as it is placed at all possible image locations, gathered for all possible triplets of gray levels, triangle shapes, sizes, and orientations.) Again, counterexamples to this hypothesis were found (Julesz et al., 1978).
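
To make the dipole definition concrete, here is a minimal sketch (ours, not from the chapter) that gathers the joint gray-level histogram for a single dipole offset; the full dipole statistics would collect such histograms over all dipole lengths and orientations.

```python
import numpy as np

def dipole_histogram(img, dx, dy, n_levels=256):
    """Joint gray-level histogram for the dipole with offset (dx, dy):
    counts, over all placements keeping both endpoints inside the image,
    how often gray level a occurs at one end and b at the other.
    Full dipole (second-order) statistics gather this for every offset."""
    h, w = img.shape
    a = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    b = img[max(0,  dy):h - max(0, -dy), max(0,  dx):w - max(0, -dx)]
    hist, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=n_levels,
                                range=[[0, n_levels], [0, n_levels]])
    return hist / hist.sum()  # joint probability over gray-level pairs
```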

Julesz noted that the counterexamples were suggestive of an alternative explanation for texture segregation similar to those of Beck and Marr. Julesz found that texture pairs that segregated easily but had identical third-order statistics also differed in the amount of an easily discernible image feature (e.g., Caelli et al., 1978). The task then became one of identifying the list of image features, which Julesz (1981) dubbed textons, that were sufficient to explain segregation performance. The initial list of textons included such features as size, orientation, line terminations, and line crossings.

It has been noted that the third-order statistics used by Julesz were population statistics. That is, the counterexamples to Julesz's various conjectures never had identical second- or third-order statistics within the actual finite images observed. Rather, the identity was over all possible images that could have been generated by the process that generated the particular instantiation of texture currently in view. In fact, for continuous images, image pairs with identical third-order statistics must be identical images, rendering that version of the conjecture trivial (Yellott, 1993), and finite, discrete images are determined by their dipole statistics (Chubb and Yellott, 2000). On the other hand, Victor (1994) makes the case for the appropriateness of the use of population statistics for theorizing about texture segregation.

The feature-based theories were echoed in research in the visual search field (Treisman, 1985). A target pattern in a field of distracter patterns was easily found whenever the target and distracters differed in a feature (e.g., size, orientation) similar to the texton features that led to effortless texture segregation. For example, a target X was effortlessly and immediately located in a field of distracter Ls. However, when the target was a T, the task became effortful and required serial scrutiny of the texture elements, requiring more time with every additional distracter added to the stimulus (Bergen and Julesz, 1983). When the choice of target and distracters requires the observer to attend to a specific combination of two features, the search becomes difficult and observers often perceive illusory conjunctions between features of neighboring objects (Treisman and Schmidt, 1982). Somewhat analogous effects using texture elements having combinations of two features have been noted in texture segregation as well (Papathomas et al., 1999). However, Wolfe (1992) suggests that texture segregation and parallel visual search do not always follow the same rules.

A number of other observations have been made concerning when texture element stimuli do or do not segregate. Beck (1982) has pointed out that textures segregate based not only on the particular texture elements used but also on their arrangement, reminiscent of the Gestalt laws of figural goodness. As in the search literature (Treisman and Gormican, 1988), texture segregation may show asymmetries (Beck, 1973; Gurnsey and Browse, 1989). For example, a patch of incomplete circles will easily segregate from a background of circles, whereas the reverse pattern results in poor segregation. It has been suggested that this is due to a difference in the variability of responses of underlying visual mechanisms to the two possible texture elements (Rubenstein and Sagi, 1990).

Nothdurft (1985) suggested that finding an edge between two textures is analogous to finding a luminance-defined edge. Locating a luminance boundary involves finding large values of the derivative of luminance (the luminance gradient) across the image. Finding texture boundaries might likewise involve measuring other aspects of image structure (local scale, local orientation, etc.), with segregation resulting from large values of the structure gradient.
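
Nothdurft's analogy is easy to state computationally. As a hedged sketch (our illustration): compute some map of local image structure and mark boundaries wherever its gradient is large.

```python
import numpy as np

def boundary_strength(feature_map):
    """Gradient magnitude of a local structure map. With luminance as
    the map, this marks luminance-defined edges; with a map of local
    orientation, scale, or contrast, it marks texture-defined edges in
    the spirit of Nothdurft (1985)."""
    gy, gx = np.gradient(feature_map.astype(float))
    return np.hypot(gx, gy)
```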

Finally, much of the literature assumes that effortless texture segregation and parallel visual search are truly effortless. That is, they require no selective attention to operate (demonstrated by, e.g., Braun and Sagi, 1990). However, Joseph et al. (1997) had observers perform an effortful secondary task and noted a large decrement in search performance in a search task that typically yields performance independent of the number of distracters. Thus, it is possible that even parallel search and, by extension, effortless texture segregation still require selective visual attention. Alternatively, texture segregation may not require focal visual attention, but attention may be used to alter the characteristics of visual mechanisms responsible for texture segregation (e.g., Yeshurun and Carrasco, 2000). Early literature also assumed that texture segregation was effortless in the sense of being immediate. However, at least some textures take substantial time to process (e.g., Sutter and Graham, 1995), thus undermining the notion that preattentive texture segregation is always immediate and effortless.

We have treated texture as if it is somehow an isolated cue that can signal the presence, location, and shape of an edge. However, texture can co-occur in a stimulus with other cues to edge presence such as luminance, color, depth, or motion. Rivest and Cavanagh (1996) showed that perceived edge location was a compromise between the position signaled by texture and by other cues (motion, luminance, color). In addition, localization accuracy was better for two-cue than for single-cue stimuli. Landy and Kojima (2001) found that different textural cues to edge location were combined using a weighted average, with greater weight given to the more reliable cues. This is analogous to the cue combination scheme that has been seen with multiple cues to depth (including depth from texture) by Landy et al. (1995), among others.
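
This weighted-average rule has a standard formalization (notation ours, in the spirit of Landy et al., 1995): if cue $i$ signals edge location $x_i$ with variance $\sigma_i^2$, a reliability-weighted combination is

$$\hat{x} = \sum_i w_i x_i, \qquad w_i = \frac{1/\sigma_i^2}{\sum_j 1/\sigma_j^2}.$$

This also accounts for the two-cue advantage in localization accuracy, since the variance of $\hat{x}$, namely $\bigl(\sum_i 1/\sigma_i^2\bigr)^{-1}$, is smaller than that of any single cue.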

Current Models of Texture Segregation

How might one model the aspects of texture segregation performance we have just surveyed? If an edge is defined by a difference in luminance (a typical light/dark edge), then a bandpass linear spatial filter similar to a cortical simple cell can detect it by producing a peak response at the location of the edge. But a typical texture-defined edge (e.g., Figs. 73.2 and 73.4A) has the same average luminance on either side and thus will not be detected by any purely linear mechanism.

Several early investigators (e.g., Beck, 1972; Julesz, 1981) suggested that observers calculate the local density of various image features, and that differences in these texton or feature statistics on either side of a texture-defined edge result in effortless texture segregation. However, it was never clearly described exactly what an image feature was and how it would be computed from the retinal image. The image features discussed (e.g., lines of different slopes, line terminations and crossings) were clearly tied to the kinds of stimuli employed in most texture studies of the period (basically, pen-and-ink drawings) and would not be applied easily to natural gray-scale images.

An alternative line of modeling suggests that we need look no further than the orientation- and spatial frequency–tuned channels already discovered in the spatial vision literature through summation, identification, adaptation, and masking experiments using sine wave grating stimuli (De Valois and De Valois, 1988; Graham, 1989, 1992). For example, Knutsson and Granlund (1983) suggested that the distribution of power in different spatial frequency bands might be used to segregate natural textures, and ran such a computational model on patchworks of textures drawn from the Brodatz (1966) collection (a standard collection of texture images often used in the computational literature).

Bergen and Adelson (1988) pointed out that even the example of Xs, Ls, and Ts (Fig. 73.2) could be accounted for by the distribution of power in isotropic channels similar in form to cells found in the lateral geniculate nucleus (LGN) and layer 4 of primary visual cortex. Further, they showed that if the size of the Xs was increased to effectively equate the dominant spatial frequency or scale of the different texture elements, the segregation of Xs from a background of Ls could be made difficult. This was strong evidence against the texton or feature theories.

A plethora of similar models based on filters selective for spatial frequency and orientation have been investigated (Bovik et al., 1990; Caelli, 1985; Fogel and Sagi, 1989; Graham, 1991; Landy and Bergen, 1991; Malik and Perona, 1990; Sutter et al., 1989; Turner, 1986; for an alternative view, see Victor, 1988). These models are so similar in basic design that Chubb and Landy (1991) referred to this class as the back pocket model of texture segregation, as texture perception researchers pull this model from their back pocket to explain new phenomena of texture segregation.

The basic back pocket model consists of three stages (Fig. 73.3). First, a set of linear spatial filters, akin to the simple cells of primary visual cortex, is applied to the retinal image. Second, the outputs of the first-stage linear filters are transformed in a nonlinear manner (by half- or full-wave rectification, squaring, and/or gain control). Finally, another stage of linear filtering is used to enhance texture-defined contours. If this third stage consisted only of spatial pooling, the resulting outputs would resemble those of cortical complex cells. But often this linear filter is modeled as bandpass and orientation-tuned, so that it enhances texture-defined edges much as an orientation-tuned linear spatial filter enhances luminance-defined edges.

Figure 73.3. The back pocket model of texture segregation. The retinal image is first processed by a bank of linear spatial filters. Then some form of nonlinearity is applied; here, a pointwise full-wave rectification is indicated. Next, a second stage of linear spatial filtering is applied to enhance the texture-defined edge. Subsequent decision processes are dependent on the particular psychophysical task under study.


This process is illustrated in Figure 73.4. Figure 73.4A shows an orientation-defined texture border (Wolfson and Landy, 1995). In Figure 73.4B a vertically oriented spatial filter has been applied. The responses are larger to the vertically oriented portion of the image, but these responses are both strongly positive (when the filter is centered on a texture element) and negative (when the filter is positioned off to the side of a texture element). As a result, the average value of the output is identical on either side of the texture border, but on the left the response variability is greater. In Figure 73.4C the responses of Figure 73.4B have been rectified, resulting in larger responses in the area of vertically oriented texture. Finally, in Figure 73.4D, a second-order, larger-scale, vertically oriented spatial filter has been applied, resulting in a peak response at the location of the texture-defined edge. For a detection experiment (“Was there a texture-defined edge in this briefly flashed stimulus?” or “Were there two different texture regions or only one?”), a model would predict human performance from the strength of the peak response in Figure 73.4D compared with peaks in responses to background noise in stimuli not containing texture-defined edges. For further examples, see Bergen (1991) and Bergen and Landy (1991).

Figure 73.4. Back pocket model. A, An orientation-defined edge. B, The result of the application of a linear, vertically oriented spatial filter. C, The result of a pointwise nonlinearity (squaring). D, A second, large-scale, vertically oriented spatial filter yields a peak response at the location of the texture-defined border in A.
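
The steps of Figure 73.4 translate almost line for line into code. Below is a minimal, hedged sketch (our illustration, not the authors' implementation): a small oriented Gabor as the first-stage filter, squaring as the pointwise nonlinearity, and large-scale pooling followed by an x-derivative standing in for the second-stage, orientation-tuned edge filter.

```python
import numpy as np
from scipy.ndimage import convolve, gaussian_filter

def gabor(size, freq, theta, sigma):
    """Cosine-phase Gabor: a stand-in for a V1 simple-cell weighting function."""
    ax = np.arange(size) - size // 2
    x, y = np.meshgrid(ax, ax)
    xr = x * np.cos(theta) + y * np.sin(theta)
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * freq * xr)

def frf_edge_map(img, freq=0.2, theta=0.0, pool_sigma=16.0):
    """Filter-rectify-filter sketch of Figure 73.4.
    B: linear filtering with a small, vertically tuned Gabor.
    C: pointwise squaring (a full-wave rectification).
    D: large-scale pooling plus an x-derivative, so the output
       peaks at a vertical texture-defined edge."""
    first = convolve(img.astype(float), gabor(15, freq, theta, sigma=4.0))
    energy = first ** 2                           # pointwise nonlinearity
    pooled = gaussian_filter(energy, pool_sigma)  # second-stage pooling
    return np.abs(np.gradient(pooled, axis=1))    # crude oriented second filter
```

A decision stage would then compare the peak of this map against the peaks obtained from edge-free stimuli.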


A wide variety of terminology has been used to describe the basic model outlined in Figure 73.3, making the literature difficult for the neophyte. The basic sequence of a spatial filter, a nonlinearity, and a second spatial filter has been called the back pocket model (Chubb and Landy, 1991), an LNL (linear, nonlinear, linear) model, an FRF (filter, rectify, filter) model (e.g., Dakin et al., 1999), second-order processing (e.g., Chubb et al., 2001), or a simple or linear channel (the first L in LNL) followed by a comparison-and-decision stage (e.g., Graham et al., 1992).

About the term “second-order”

The term second-order can be particularly troublesome. In some hands, and as we will use it here, it merely refers to the second stage of linear filtering following the nonlinearity in a model like that of Figure 73.3. As such, it has been applied to models in a wide variety of visual tasks (Chubb et al., 2001). But second-order has another technical definition that has also been used in similar contexts. If the nonlinearity in Figure 73.3 is a squaring operation, then the pixels in the output image (after the second stage of linear filtering) are all computed as second-order (i.e., quadratic) polynomials of the pixels in the model input.
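
In symbols (notation ours): writing the first-stage filter weights as $v$, the nonlinearity as squaring, and the second-stage weights as $w$, each output pixel is

$$y_k = \sum_j w_{kj}\Bigl(\sum_i v_{ji}\, I_i\Bigr)^{2},$$

a quadratic, and hence second-order, polynomial in the input pixels $I_i$.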

In this chapter, we will refer to the model of Figure 73.3 as a second-order model, meaning that it contains a second-order linear spatial filter. Of necessity, this second-order linear filter must follow an intervening nonlinearity. Otherwise, there would simply be two sequential linear filters, which are indistinguishable from a single, lumped linear spatial filter. We will use this term regardless of the polynomial order of the intervening nonlinearity.

There is also a more general use of second-order. In this usage, a second-order entity (e.g., a neuron) pools, after some intervening nonlinearity, the responses from a number of other entities (called first-order) but, in this more general usage, the first-order entities do not form a linear filter characterized by a single spatial weighting function, as they do in Figure 73.3. Rather, the first-order entities can be an assortment of neurons sensitive to various things (e.g., different orientations or different spatial frequencies). See the introduction to Graham and Sutter (1998) for a brief review of such general suggestions.

Third-order models

Second-order models are not the end of the story. For example, Graham et al. (1993) used an element-arrangement texture stimulus consisting of two types of elements, arranged in stripes in one region and in a checkerboard in another region. Consider the case where each texture element is a high-frequency Gabor pattern (a windowed sine wave grating) and the two types of elements differ only in spatial frequency. Consider a second-order model like that just described, with the first linear filter tuned to one of the two types of Gabor patches and the second linear filter tuned to the width and orientation of stripes of elements. This second-order model would yield a response with the same average level in both regions, but with high contrast in the striped region and low contrast in the checked region. Revealing the texture-defined edge between the checkerboard and striped regions therefore requires another stage of processing, which could be a pointwise nonlinearity followed by an even larger-scale linear spatial filter (another NL), producing the sequence LNLNL. For an illustration of such a model's responses, see Graham et al. (1993), Figure 4.
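
Reusing the gabor helper from the back pocket sketch above, a hedged illustration of the extra NL stage might look like this (ours; parameter values are arbitrary). The stripe-scale second filter gives a high-contrast but zero-mean response in the striped region and a weak response in the checked region; a further rectification and an even larger-scale filter convert that contrast difference into a mean difference.

```python
from scipy.ndimage import convolve, gaussian_filter

def lnlnl_response(img, final_sigma=32.0):
    """Third-order (LNLNL) sketch for element-arrangement textures;
    gabor() is the helper defined in the back pocket sketch above."""
    energy = convolve(img.astype(float), gabor(15, 0.2, 0.0, 4.0)) ** 2  # L, N
    stripes = convolve(energy, gabor(45, 0.05, 0.0, 12.0))               # second L
    return gaussian_filter(stripes ** 2, final_sigma)                    # N, third L
```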

Here we will call this LNLNL sequence a third-order model. But, to avoid confusion, let us note that Graham and her colleagues refer to the first LNL as a complex channel or second-order channel and the final NL is an instance of what they call the comparison-and-decision stage.

About the terms “Fourier” and “non-Fourier”

There is also possible confusion about the terms Fourier and non-Fourier. A stimulus like that in Figure 73.4A, in which the edge can be found by the model in Figure 73.3, has been referred to as non-Fourier (first applied to motion stimuli by Chubb and Sperling, 1988). The term was used because the Fourier spectrum of this stimulus does not contain components that correspond directly to the texture-defined edge. But some others (e.g., Graham and Sutter, 2000) have used the term Fourier channels for the first linear filters (the simple channels) in Figure 73.3 and reserved the term non-Fourier for the complex channels (the initial LNL) in what we called third-order models above (LNLNL).

This confusing terminology is the result of a difference in emphasis. In this chapter, we concentrate on models that localize (i.e., produce a peak response at) edges between two abutting textures. But others (e.g., Graham and Sutter, 2000; Lin and Wilson, 1996) have emphasized response measures that can be used to discriminate between pairs of textures (whether simultaneously present and abutting or not) by any later, nonlinear decision process. Thus, finding the edge in an orientation-defined texture like that of Figure 73.4A is, in Graham and Sutter's terms, Fourier-based, as the power spectra of the two constituent textures differ, whereas finding the edge in a Gabor-patch element-arrangement texture like that of Graham et al. (1993) is non-Fourier-based, as the power spectra of the two constituent textures do not differ.

Model Specification

The models of texture segregation just described are complicated, with many details that require elucidation. Are the initial linear filters of a second-order pathway the same spatial filters as the spatial frequency channels that have been described using grating experiments? What is the nature of the following nonlinearity? Are there fixed, second-order linear filters, and what is their form? This is an area of current active research, and most of these issues have not been convincingly decided.

Graham et al. (1993) and Dakin and Mareschal (2000) provide evidence that the initial spatial filters in a second-order pathway used to detect contrast modulations of texture are themselves tuned for spatial frequency and orientation. In the same article, Graham et al. (1993) also demonstrated that the initial spatial filters in a third-order pathway (their complex channels) are orientation- and spatial-frequency-tuned as well.

The back pocket model includes a nonlinearity between the two stages of linear spatial filtering that is required to demodulate the input stimuli. For small first-order spatial filters, Chubb et al. (1994) provided a technique called histogram contrast analysis that allowed them to measure aspects of the static nonlinearity, showing that it included components of higher order than merely squaring the input luminances. Graham and Sutter (1998) found that this nonlinearity must be expansive. They also (Graham and Sutter, 2000) suggested that a gain control mechanism acts as an inhibitory influence among multiple pathways of the types called second-order and third-order here.

First-order spatial frequency channels were first measured using sine wave grating stimuli and various experimental paradigms, including adaptation, masking, and summation experiments (reviewed in Graham, 1989). More recently, researchers have used analogous experiments to examine the second-order linear filters. To do so, researchers hope to deliver to the second-order filter something like the sine wave grating stimuli of classical spatial frequency channel studies. The usual ploy is to use a stimulus in which a sine wave (or Gabor) pattern modulates some aspect of textural content across the stimulus. The assumed first-order filter and the subsequent nonlinearity demodulate this stimulus, providing as input to the second-order linear filter a noisy version of the intended grating or Gabor pattern.

Studies of texture modulation detection have revealed a very broadband second-order texture contrast sensitivity function (CSF) using a variety of texture modulations including contrast (Schofield and Georgeson, 1999, 2000; Sutter et al., 1995), local orientation content (Kingdom et al., 1995), and modulation between vertically and horizontally oriented, filtered noise (Landy and Oruç, 2002). This function is far more broadband than the corresponding luminance CSF. A demonstration of this effect is shown in Figure 73.5A. A modulator pattern is used to combine additively a vertical and a horizontal noise texture. The modulator increases in spatial frequency from left to right and in contrast from bottom to top. As you can see, the texture modulation becomes impossible to discern at approximately the same level for all spatial frequencies. The sample data in Figure 73.5B confirm this observation.

Figure 73.5. The second-order contrast sensitivity function. A, This figure is constructed using a modulator image to additively combine vertical and horizontal noise images (Landy and Oruç, 2002). The modulator, shown as a function above the texture, has a spatial frequency that increases from left to right, and its contrast increases from bottom to top. Large modulator values result in a local texture dominated by vertically oriented noise, and small values by horizontally oriented noise. Note that threshold modulation contrast is nearly independent of spatial frequency. B, Example data from a forced-choice modulation contrast detection experiment using sine wave modulators of noise patterns.
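
A stimulus of the kind in Figure 73.5A is straightforward to synthesize. Here is a hedged sketch (our construction, following the description in the text): a sinusoidal modulator additively trades off a vertically elongated against a horizontally elongated noise carrier; the second-order "grating" becomes available only after oriented filtering and rectification.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def modulated_noise(size=256, mod_freq=4.0, mod_contrast=0.5, seed=0):
    """Orientation-modulated noise in the style of Landy and Oruc (2002):
    a sine-wave modulator mixes vertically and horizontally filtered
    noise carriers across the image, left to right."""
    rng = np.random.default_rng(seed)
    vert  = gaussian_filter(rng.standard_normal((size, size)), (4.0, 0.5))
    horiz = gaussian_filter(rng.standard_normal((size, size)), (0.5, 4.0))
    x = np.arange(size) / size
    m = 0.5 * (1.0 + mod_contrast * np.sin(2 * np.pi * mod_freq * x))
    return m * vert + (1.0 - m) * horiz  # m broadcasts across rows
```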


Evidence for multiple second-order filters underlying this broad second-order CSF has been equivocal, with evidence both pro (Arsenault et al., 1999; Landy and Oruç, 2002; Schofield and Georgeson, 1999) and con (Kingdom and Keeble, 1996). Many studies have found texture discrimination to be scale-invariant, suggesting the existence of a link between the scale of the corresponding first- and second-order spatial filters (Kingdom and Keeble, 1999; Landy and Bergen, 1991; Sutter et al., 1995). It has also been suggested that the orientation preferences of the first- and second-order filters tend to be aligned (Dakin and Mareschal, 2000; Wolfson and Landy, 1995). This alignment of first- and second-order filters has also been supported for element-arrangement stimuli that require a third-order model to detect the texture-defined edges (Graham and Wolfson, 2001).

If there is an obligatory link between the scales of the first- and second-order filters, then the preferred second-order scale should depend on eccentricity. This was first demonstrated by Kehrer (1989), who noted that performance on an orientation-defined texture-segregation task at first improves as the target texture moves into the periphery and then worsens as eccentricity increases further. The poor foveal performance was dubbed the central performance drop (CPD). Yeshurun and Carrasco (2000) argued that the CPD is due to the relation between the scale of the second-order pattern and the local scale of the second-order filters, and suggested in addition that the second-order spatial filters are narrowed as a consequence of the allocation of selective attention.

The temporal properties of the first- and second-order filters are not well understood, although some information is available (Lin and Wilson, 1996; Motoyoshi and Nishida, 2001; Schofield and Georgeson, 2000; Sutter and Graham, 1995; Sutter and Hwang, 1999).

The possibility that the wiring between first- and second-order filters is more complicated than that shown in Figure 73.3 remains open as well (see, e.g., the appendix in Graham and Sutter, 1998; Mussap, 2001), with particular interest in possible lateral excitatory and inhibitory interactions among different positions within the same filter (Motoyoshi, 1999; Wolfson and Landy, 1999).

Early filters are not the only visual processes that play an important role in determining the conscious perception of textured stimuli. Consider He and Nakayama (1994), who constructed a series of binocular demonstration stimuli involving both texture and disparity. The foreground surface consisted of a set of textured squares. The background stimuli consisted of a region of I shapes surrounded by L shapes that, monocularly, segregated quite easily. However, when seen in depth with the squares (that abutted the Ls and Is) in front, both the Ls and Is were perceived as occluded by the squares. They underwent surface completion; that is, they were both perceived as larger rectangles occluded by the squares, and texture segregation became effortful. This suggests that higher-level, surface-based representations are involved in judgments about the objects perceived on the basis of textured regions in the stimulus.
