Modeling feature-based attention as an active top-down inference process
Introduction
Object recognition is generally implemented as a hierarchical bottom-up process (Fukushima, 1980; Perrett and Oram, 1993; Wallis and Rolls, 1997; Riesenhuber and Poggio, 1999) in which the complexity of the representation increases along with the receptive field size. This leads to a strong overlap of the populations encoding features that belong to different objects. Such ambiguities among cell populations encoding features within the same receptive field limit the use of these approaches for non-segmented scenes such as natural images.
The closely linked paradigms of active vision, purposive vision and animate vision (Aloimonos, 1993; Ballard, 1991) have proposed that purely bottom-up vision is an ill-posed problem and suggested that each task requires its own specific algorithm. In this regard, a universal, general vision is not possible. According to these paradigms, the fundamental problem of vision is the selection of the relevant information within the scene and the computation of an appropriate representation. An “active” vision system – in the sense of a visually selective device – is able to acquire the necessary information on demand by focusing on the relevant areas within the visual scene and taking different views of the same object.
The approach of “Deictic Codes for the Embodiment of Cognition” aims to provide a framework for describing the phenomena that appear at about one-third of a second into the perception–action process (Ballard et al., 1997). Deictic primitives dynamically refer to points in the world by means of their crucial describing features (e.g., color or shape). On this time scale, the natural sequentiality of body movements can be matched to the natural computational economies of sequential decision systems through a system of implicit reference (called deictic) in which pointing movements are used to bind objects in the world to cognitive programs. Ballard et al. (1997) suggested visual routines (Kosslyn, 1994; Ullman, 1984; Just and Carpenter, 1976) to divide one complex task into subtasks, such as selection and identification.
Selective perception has been addressed in attention-related experimental frameworks such as visual search. The basic idea is that once an object is selected by a focus of attention, it can be connected to an internal pointer and processed in high-level areas. This view has its origin in the classical approach to perception that separates a pre-attentive from an attentive stage (Treisman and Gelade, 1980). Computer implementations of these types of models use a saliency map to indicate a location of interest (Koch and Ullman, 1985; Wolfe, 1994; Itti and Koch, 2000) and compute a focus of attention that selects an object (Olshausen et al., 1993). This focus could be guided by some rough knowledge about an object, such as its color. Feature-based attention is left to merely guide the selection process by weighting the input into the saliency map (Wolfe, 1994; Milanese et al., 1995; Navalpakkam and Itti, 2005).
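The saliency-map account sketched above can be made concrete in a few lines. The following Python sketch is our own illustration, not code from any of the cited models; the map contents, weights and function names are assumptions. Feature maps are combined into one saliency map, with feature-based attention entering only as top-down weights on that combination, and a winner-take-all step picks the focus of attention:

```python
import numpy as np

def saliency(feature_maps, weights):
    """Weighted sum of normalized feature maps -> one saliency map."""
    s = np.zeros_like(next(iter(feature_maps.values())), dtype=float)
    for name, fmap in feature_maps.items():
        m = fmap.max()
        s += weights.get(name, 1.0) * (fmap / m if m > 0 else fmap)
    return s

def focus_of_attention(sal):
    """Winner-take-all: return the location of maximum saliency."""
    return np.unravel_index(np.argmax(sal), sal.shape)

# Two 4x4 feature maps: a red item at (1, 1), a vertical item at (2, 3).
red = np.zeros((4, 4)); red[1, 1] = 1.0
vertical = np.zeros((4, 4)); vertical[2, 3] = 1.0
maps = {"red": red, "vertical": vertical}

# Searching for a red target: up-weight the color map.
sal = saliency(maps, {"red": 2.0, "vertical": 0.5})
print(focus_of_attention(sal))  # -> (1, 1): the red item wins
```

Note that in this scheme feature-based knowledge never touches the object representations themselves; it only biases which location is selected.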
We have developed an alternative approach in which feature-based attention acts on the object representations themselves, whereas spatially selective attentive binding occurs through reentrant oculomotor loops. The search for an object, or just for parts of it, produces top-down expectations that meet the bottom-up processed stimulus features in the ventral pathway. This initiates a dynamic and distributed recognition process at different levels of the hierarchy by enhancing the features of interest. At higher areas these are typically complex patterns; at lower levels these complex patterns have to be decomposed into simpler ones. Thus, top-down inference has to rely on reverse weights to decompose a pattern into its parts. By competitive interactions, such a mechanism allows the system to flexibly filter out the information that is inconsistent with the high-level goal description. However, the sensory evidence of the encoded items does not always suffice to rule out all objects but one. This top-down inference only strengthens the expected features, which are not necessarily those to be reported, and guides goal-directed behavior. Thus, in parallel, areas responsible for oculomotor selection start to plan appropriate responses. Specifically, the target location of the planned eye movement is used for a location-specific inference operation, which in turn filters out objects at irrelevant locations. This spatial attention effect can be interpreted as a shortcut of the actually planned eye movement: it facilitates planning processes that evaluate the consequences of the planned action. As a result of both inference operations, the high-level goal description is bound to an object in the visual world.
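The two inference operations described here can be illustrated with a toy Python sketch. This is not the published model; the population layout (rows as locations, columns as feature channels), the gain value and the suppression factor are all hypothetical. A multiplicative feature-based gain driven by the target template is followed by a location-specific filter around the planned eye-movement target:

```python
import numpy as np

def feature_gain(responses, template, gain=0.5):
    """Enhance responses in proportion to their match with the target template."""
    match = responses * template  # elementwise feature match
    return responses * (1.0 + gain * match / (match.max() + 1e-9))

def spatial_filter(responses, target_loc, suppression=0.2):
    """Attenuate populations at locations other than the planned saccade goal."""
    out = responses * suppression
    out[target_loc] = responses[target_loc]
    return out

# Rows = 3 locations, columns = 4 feature channels (illustrative values).
rng = np.random.default_rng(0)
v4 = rng.uniform(0.2, 1.0, size=(3, 4))
template = np.array([1.0, 0.0, 1.0, 0.0])  # goal: feature channels 0 and 2

attended = feature_gain(v4, template)          # feature-based inference
selected = spatial_filter(attended, target_loc=1)  # oculomotor "shortcut"
```

In this sketch the feature-based gain acts everywhere in parallel, while the spatial filter suppresses everything outside the planned target location, mirroring the sequence of feature-based before spatial selection argued for in the text.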
In this approach vision is an active, dynamic and constructive process. It allows a closer look at the processes that bind objects in the world to cognitive programs acting within one-third of a second. Our proposed concept relies on top-down connections in vision, whose usefulness has been discussed and demonstrated several times (Grossberg, 1980; Mumford, 1992; Ullman, 1995; Tononi et al., 1992; Tsotsos et al., 1995; Rao and Ballard, 1999; Rao, 1999; Hamker, 1999; Engel et al., 2001; Hamker and Worcester, 2002; Corchs and Deco, 2002; Hochstein and Ahissar, 2002; Rao, 2004; Hamker, 2004b). However, top-down connections have not been used in an unequivocal fashion. The generative approach (Mumford, 1992; Olshausen and Field, 1997; Rao, 1999) predicts that the top-down signal is subtracted from the bottom-up signal; such models predict a reduction of activity when the predicted input matches the actual input. Our model predicts an enhancement, as previously suggested by ART (Grossberg, 1980). We have shown that this is consistent with cell recordings in IT, V4 and FEF in visual search (Hamker, 2005a) and in other attentional experiments (Hamker, 2004a, 2004b). Since these simulations were done with artificial inputs, we have recently scaled up this model to simulate object detection (Hamker, 2005c, 2005d) and change detection (Hamker, 2005b) tasks in natural scenes. Here, we will focus on feature-based attention in area V4/TEO with respect to the search task.
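The difference between the two uses of top-down connections can be made concrete with a toy computation (numbers are illustrative only). In a generative/predictive-coding scheme the prediction is subtracted from the input, so a perfect match drives the residual toward zero; in a gain-enhancement scheme, as in ART and our model, a match amplifies the response:

```python
import numpy as np

bottom_up = np.array([0.8, 0.1, 0.6])
prediction = np.array([0.8, 0.1, 0.6])  # top-down signal matches the input

residual = bottom_up - prediction          # generative scheme: activity drops
enhanced = bottom_up * (1.0 + prediction)  # gain scheme: activity rises

print(residual)  # -> [0. 0. 0.]
print(np.all(enhanced > bottom_up))  # -> True
```

The two schemes thus make opposite predictions for cell recordings when top-down expectation and stimulus agree, which is the distinction the cited V4, IT and FEF data speak to.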
Section snippets
Anatomical, physiological and behavioral evidence
The brain has developed specific functional areas in the visual cortex, which can be divided into two major streams. Form and color information travels from V1 to V2 and V4 of the occipital lobe into TEO and TE of the inferior temporal lobe (Zeki, 1978; Livingstone and Hubel, 1988). This ventral pathway is known to encode object identity. It is generally accepted that the complexity of the encoded features increases along the ventral pathway. V1 neurons can be driven by simple properties of a stimulus, such as the
Results
We now demonstrate the predictions of the model on the early response of cells in extrastriate areas (specifically V4) in a visual search task using natural scenes.
An object is presented to the model for 100 ms and the model memorizes some of its features as a target template. We do not give the model any hint as to which features to memorize. The model’s task is to make an eye movement towards the target (Fig. 2(A and B)). When presenting the search scene, TE cells that match the target template
Discussion
We predict that goal-directed, feature-based search first selectively modulates feature-sensitive cells prior to any spatial selection.
This prediction is consistent with cell recordings in visual search (Bichot et al., 2005; Ogawa and Komatsu, 2004) and with recent findings in which the learning of degraded natural scenes resulted in a selective enhancement of V4 cells (Rainer et al., 2004). According to this study, V4 plays a crucial role in resolving an indeterminate level of visual processing by a
References (65)
- Adaptive coincidence detection and dynamic gain control in visual cortical neurons in vivo. Neuron (2003)
- Animate vision. Artif. Intell. (1991)
- Predictions of a model of spatial attention using sum- and max-pooling functions. Neurocomputing (2004)
- A dynamic model of how feature cues guide spatial attention. Vis. Res. (2004)
- The emergence of attention by population-based inference and its role in distributed processing and cognitive control of vision. J. Comput. Vis. Image Understand. (2005)
- View from the top: hierarchies and reverse hierarchies in the visual system. Neuron (2002)
- A saliency-based search mechanism for overt and covert shifts of visual attention. Vis. Res. (2000)
- Eye fixations and cognitive processes. Cogn. Psychol. (1976)
- Goal-related activity in V4 during free viewing visual search. Evidence for a ventral stream visual salience map. Neuron (2003)
- Modeling the influence of task on attention. Vis. Res. (2005)
- Sparse coding with an overcomplete basis set: a strategy employed by V1? Vis. Res.
- The neurophysiology of shape processing. Image Vis. Comp.
- An optimal estimation approach to visual perception and learning. Vis. Res.
- How parallel is visual processing in the ventral pathway? Trends Cogn. Sci.
- A feature integration theory of attention. Cogn. Psychol.
- Modeling visual attention via selective tuning. Artif. Intell.
- Invariant face and object recognition in the visual system. Prog. Neurobiol.
- Introduction: active vision revisited.
- Deictic codes for the embodiment of cognition. Behav. Brain Sci.
- Parallel and serial neural mechanisms for visual search in macaque area V4. Science
- Primate frontal eye fields. I. Single neurons discharging before saccades. J. Neurophysiol.
- Responses of neurons in inferior temporal cortex during memory-guided visual search. J. Neurophysiol.
- Large-scale neural model for visual attention: integration of experimental single-cell and fMRI data. Cereb. Cortex
- Neural mechanisms of selective attention. Annu. Rev. Neurosci.
- An adaptive coding model of neural function in prefrontal cortex. Nat. Rev. Neurosci.
- Dynamic predictions: oscillations and synchrony in top-down processing. Nat. Rev. Neurosci.
- Modulation of oscillatory neuronal synchronization by selective visual attention. Science
- Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern.
- How does the brain build a cognitive code? Psychol. Rev.
- The role of feedback connections in task-driven visual search.
- The reentry hypothesis: linking eye movements to visual perception. J. Vis.