Modeling feature-based attention as an active top-down inference process
Introduction
Object recognition is generally implemented as a hierarchical bottom-up process (Fukushima, 1980; Perrett and Oram, 1993; Wallis and Rolls, 1997; Riesenhuber and Poggio, 1999) in which the complexity of the representation increases along with the receptive field size. This leads to a strong overlap of the populations encoding features that belong to different objects. Such ambiguities among cell populations encoding features within the same receptive field limit the use of these approaches for non-segmented scenes such as natural images.
The closely linked paradigms of active vision, purposive vision and animate vision (Aloimonos, 1993; Ballard, 1991) have proposed that purely bottom-up vision is an ill-posed problem and suggested that each task requires its own specific algorithm. In this regard, a universal, general vision is not possible. According to these paradigms, the fundamental problem of vision is the selection of the relevant information within the scene and the computation of an appropriate representation. An “active” vision system – in the sense of a visually selective device – is able to acquire the necessary information on demand by focusing on the relevant areas within the visual scene and taking different views of the same object.
The approach of “Deictic Codes for the Embodiment of Cognition” aims to provide a framework for describing the phenomena that appear at about one-third of a second into the perception–action process (Ballard et al., 1997). Deictic primitives dynamically refer to points in the world by means of their crucial describing features (e.g., color or shape). On this time scale, the natural sequentiality of body movements can be matched to the natural computational economies of sequential decision systems through a system of implicit reference (called deictic) in which pointing movements are used to bind objects in the world to cognitive programs. Ballard et al. (1997) suggested visual routines (Kosslyn, 1994; Ullman, 1984; Just and Carpenter, 1976) to divide one complex task into subtasks, such as selection and identification.
Selective perception has been addressed in attention-related experimental frameworks such as visual search. The basic idea is that once an object is selected by a focus of attention, it can be connected to an internal pointer and processed in high-level areas. This view has its origin in the classical approach to perception that separates a pre-attentive from an attentive stage (Treisman and Gelade, 1980). Computer implementations of these types of models use a saliency map to indicate a location of interest (Koch and Ullman, 1985; Wolfe, 1994; Itti and Koch, 2000) and compute a focus of attention that selects an object (Olshausen et al., 1993). This focus could be guided by some rough knowledge about an object, such as its color. Feature-based attention is left to merely guide the selection process by weighting the input into the saliency map (Wolfe, 1994; Milanese et al., 1995; Navalpakkam and Itti, 2005).
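The saliency-map account sketched above can be made concrete in a few lines. The following Python sketch is our own illustration, not code from any of the cited models; the map contents, weights and function names are assumptions. Feature maps are combined into one saliency map, with feature-based attention entering only as top-down weights on that combination, and a winner-take-all step picks the focus of attention:

```python
import numpy as np

def saliency(feature_maps, weights):
    """Weighted sum of normalized feature maps -> one saliency map."""
    s = np.zeros_like(next(iter(feature_maps.values())), dtype=float)
    for name, fmap in feature_maps.items():
        m = fmap.max()
        s += weights.get(name, 1.0) * (fmap / m if m > 0 else fmap)
    return s

def focus_of_attention(sal):
    """Winner-take-all: return the location of maximum saliency."""
    return np.unravel_index(np.argmax(sal), sal.shape)

# Two 4x4 feature maps: a red item at (1, 1), a vertical item at (2, 3).
red = np.zeros((4, 4)); red[1, 1] = 1.0
vertical = np.zeros((4, 4)); vertical[2, 3] = 1.0
maps = {"red": red, "vertical": vertical}

# Searching for a red target: up-weight the color map.
sal = saliency(maps, {"red": 2.0, "vertical": 0.5})
print(focus_of_attention(sal))  # -> (1, 1): the red item wins
```

Note that in this scheme feature-based knowledge never touches the object representations themselves; it only biases which location is selected.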
We have developed an alternative approach in which feature-based attention acts on the object representations themselves, whereas spatially selective attentive binding occurs through reentrant oculomotor loops. The search for an object, or just for parts of it, produces top-down expectations that meet the bottom-up processed stimulus features in the ventral pathway. This initiates a dynamic and distributed recognition process at different levels of the hierarchy by enhancing the features of interest. At higher areas these are typically complex patterns; at lower levels these complex patterns have to be decomposed into simpler ones. Thus, top-down inference has to rely on reverse weights to decompose a pattern into its parts. By competitive interactions, such a mechanism allows the system to flexibly filter out the information that is inconsistent with the high-level goal description. However, the sensory evidence of the encoded items does not always suffice to rule out all objects but one. This top-down inference only strengthens the expected features, which are not necessarily those to be reported, and guides goal-directed behavior. Thus, in parallel, areas responsible for oculomotor selection start to plan appropriate responses. Specifically, the target location of the planned eye movement is used for a location-specific inference operation, which in turn filters out objects at irrelevant locations. This spatial attention effect can be interpreted as a shortcut of the actually planned eye movement: it facilitates planning processes that evaluate the consequences of the planned action. As a result of both inference operations, the high-level goal description is bound to an object in the visual world.
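The two inference operations described here can be illustrated with a toy Python sketch. This is not the published model; the population layout (rows as locations, columns as feature channels), the gain value and the suppression factor are all hypothetical. A multiplicative feature-based gain driven by the target template is followed by a location-specific filter around the planned eye-movement target:

```python
import numpy as np

def feature_gain(responses, template, gain=0.5):
    """Enhance responses in proportion to their match with the target template."""
    match = responses * template  # elementwise feature match
    return responses * (1.0 + gain * match / (match.max() + 1e-9))

def spatial_filter(responses, target_loc, suppression=0.2):
    """Attenuate populations at locations other than the planned saccade goal."""
    out = responses * suppression
    out[target_loc] = responses[target_loc]
    return out

# Rows = 3 locations, columns = 4 feature channels (illustrative values).
rng = np.random.default_rng(0)
v4 = rng.uniform(0.2, 1.0, size=(3, 4))
template = np.array([1.0, 0.0, 1.0, 0.0])  # goal: feature channels 0 and 2

attended = feature_gain(v4, template)          # feature-based inference
selected = spatial_filter(attended, target_loc=1)  # oculomotor "shortcut"
```

In this sketch the feature-based gain acts everywhere in parallel, while the spatial filter suppresses everything outside the planned target location, mirroring the sequence of feature-based before spatial selection argued for in the text.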
In this approach vision is an active, dynamic and constructive process. It allows a closer look at the processes that bind objects in the world to cognitive programs acting within one-third of a second. Our proposed concept relies on top-down connections in vision, whose usefulness has been discussed and demonstrated several times (Grossberg, 1980; Mumford, 1992; Ullman, 1995; Tononi et al., 1992; Tsotsos et al., 1995; Rao and Ballard, 1999; Rao, 1999; Hamker, 1999; Engel et al., 2001; Hamker and Worcester, 2002; Corchs and Deco, 2002; Hochstein and Ahissar, 2002; Rao, 2004; Hamker, 2004b). However, top-down connections have not been used in an unequivocal fashion. The generative approach (Mumford, 1992; Olshausen and Field, 1997; Rao, 1999) predicts that the top-down signal is subtracted from the bottom-up signal; such models predict a reduction of activity when the predicted input matches the actual input. Our model predicts an enhancement, as previously suggested by ART (Grossberg, 1980). We have shown that this is consistent with cell recordings in IT, V4 and FEF in visual search (Hamker, 2005a) and in other attentional experiments (Hamker, 2004a, 2004b). Since these simulations were done with artificial inputs, we have recently scaled up this model to simulate object detection (Hamker, 2005c, 2005d) and change detection (Hamker, 2005b) tasks in natural scenes. Here, we will focus on feature-based attention in area V4/TEO with respect to the search task.
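The difference between the two uses of top-down connections can be made concrete with a toy computation (numbers are illustrative only). In a generative/predictive-coding scheme the prediction is subtracted from the input, so a perfect match drives the residual toward zero; in a gain-enhancement scheme, as in ART and our model, a match amplifies the response:

```python
import numpy as np

bottom_up = np.array([0.8, 0.1, 0.6])
prediction = np.array([0.8, 0.1, 0.6])  # top-down signal matches the input

residual = bottom_up - prediction          # generative scheme: activity drops
enhanced = bottom_up * (1.0 + prediction)  # gain scheme: activity rises

print(residual)  # -> [0. 0. 0.]
print(np.all(enhanced > bottom_up))  # -> True
```

The two schemes thus make opposite predictions for cell recordings when top-down expectation and stimulus agree, which is the distinction the cited V4, IT and FEF data speak to.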
Section snippets
Anatomical, physiological and behavioral evidence
The brain has developed specific functional areas in the visual cortex, which can be divided into two major streams. Form and color information travels from V1 to V2 and V4 of the occipital lobe into TEO and TE of the inferior temporal lobe (Zeki, 1978; Livingstone and Hubel, 1988). This ventral pathway is known to encode object identity. It is generally accepted that the complexity of the encoded features increases along the ventral pathway. V1 neurons can be driven by simple properties of a stimulus, such as the
Results
We now demonstrate the predictions of the model on the early response of cells in extrastriate areas (specifically V4) in a visual search task using natural scenes.
An object is presented to the model for 100 ms and the model memorizes some of its features as a target template. We do not give the model any hint as to which features to memorize. The model’s task is to make an eye movement towards the target (Fig. 2(A and B)). When presenting the search scene, TE cells that match the target template
Discussion
We predict that goal-directed, feature-based search first selectively modulates feature-sensitive cells prior to any spatial selection.
This prediction is consistent with cell recordings in visual search (Bichot et al., 2005; Ogawa and Komatsu, 2004) and with recent findings in which the learning of degraded natural scenes resulted in a selective enhancement of V4 cells (Rainer et al., 2004). According to this study, V4 plays a crucial role in resolving an indeterminate level of visual processing by a
References (65)
- Adaptive coincidence detection and dynamic gain control in visual cortical neurons in vivo. Neuron (2003)
- Animate vision. Artif. Intell. (1991)
- Predictions of a model of spatial attention using sum- and max-pooling functions. Neurocomputing (2004)
- A dynamic model of how feature cues guide spatial attention. Vis. Res. (2004)
- The emergence of attention by population-based inference and its role in distributed processing and cognitive control of vision. J. Comput. Vis. Image Understand. (2005)
- View from the top: hierarchies and reverse hierarchies in the visual system. Neuron (2002)
- A saliency-based search mechanism for overt and covert shifts of visual attention. Vis. Res. (2000)
- Eye fixations and cognitive processes. Cogn. Psychol. (1976)
- Goal-related activity in V4 during free viewing visual search. Evidence for a ventral stream visual salience map. Neuron (2003)
- Modeling the influence of task on attention. Vis. Res. (2005)
- Sparse coding with an overcomplete basis set: a strategy employed by V1? Vis. Res.
- The neurophysiology of shape processing. Image Vis. Comp.
- An optimal estimation approach to visual perception and learning. Vis. Res.
- How parallel is visual processing in the ventral pathway? Trends Cogn. Sci.
- A feature integration theory of attention. Cogn. Psychol.
- Modeling visual attention via selective tuning. Artif. Intell.
- Invariant face and object recognition in the visual system. Prog. Neurobiol.
- Introduction: active vision revisited.
- Deictic codes for the embodiment of cognition. Behav. Brain Sci.
- Parallel and serial neural mechanisms for visual search in macaque area V4. Science
- Primate frontal eye fields. I. Single neurons discharging before saccades. J. Neurophysiol.
- Responses of neurons in inferior temporal cortex during memory-guided visual search. J. Neurophysiol.
- Large-scale neural model for visual attention: integration of experimental single-cell and fMRI data. Cereb. Cortex
- Neural mechanisms of selective attention. Annu. Rev. Neurosci.
- An adaptive coding model of neural function in prefrontal cortex. Nat. Rev. Neurosci.
- Dynamic predictions: oscillations and synchrony in top-down processing. Nat. Rev. Neurosci.
- Modulation of oscillatory neuronal synchronization by selective visual attention. Science
- Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern.
- How does the brain build a cognitive code? Psychol. Rev.
- The role of feedback connections in task-driven visual search.
- The reentry hypothesis: linking eye movements to visual perception. J. Vis.