Top-down sharpening of hierarchical visual feature representations

The robustness of the visual system lies in its ability to perceive degraded images. This is achieved through interacting bottom-up and top-down pathways that process the visual input in concordance with stored prior information. The interaction mechanism by which they integrate visual input and prior information is still enigmatic. We present a new approach using deep neural network (DNN) representation to reveal the effects of such integration on degraded visual inputs. We transformed measured human brain activity resulting from viewing blurred images to the hierarchical representation space derived from a feedforward DNN. Transformed representations were found to veer towards the original non-blurred image and away from the blurred stimulus image. This indicated deblurring or sharpening in the neural representation, and possibly in our perception, from top-down modulation. We anticipate these results will help unravel the interplay mechanism between bottom-up and top-down pathways, leading to more comprehensive models of vision.


Introduction
Perception is the process by which humans and other animals make sense of the environment around them. It involves integrating different sensory cues with prior knowledge to arrive at a meaningful interpretation of the surroundings. This integration is achieved by means of two neuronal pathways: a bottom-up pathway, which processes sensory information hierarchically, and a top-down pathway, which projects prior information down the hierarchy (Arnal & Giraud, 2012; Clark, 2013; Summerfield & de Lange, 2014). In the visual cortex, top-down signals are believed to propagate from higher visual and cognitive areas to the lower visual areas (Friston, 2005; Heeger, 2017). They cause modulation of the neural response of the sensory signal, which in turn is processed up the hierarchy again, to the higher visual areas; the process is repeated until convergence on a perceptual result. The interplay mechanism between these two pathways is still an open question. It is believed that the imbalance between the bottom-up and top-down pathways results in the hallucinatory symptoms associated with psychological disorders such as schizophrenia (for review, see Friston, Brown, Siemerkus, & Stephan, 2016; Jardri et al., 2016).

[...] Mumford, 2003). Conversely, the prediction error hypothesis (which originates from computer science ideas; Shi, Sun, & Huifang, 2008) states that top-down signals provide expected signal information that is uninteresting, and therefore gets subtracted (Mumford, 1992; Rao & Ballard, 1999). This results in an error signal that is further processed to update the prediction signal. This process repeats until the error signal reaches zero, which corresponds with achieving a perceptual result. This idea was triggered by the finding of an associated reduction in brain activity in the early visual [...] Bayesian-like perception model (Friston, 2005; Heeger, 2017).
To transform brain data into the DNN feature space, we trained multivoxel decoders to predict DNN features from brain activity data. A separate training stimulus dataset consisting of 1000 natural non-blurred images was used to train those decoders, with one image for each classification category of the DNN. We selected 1000 features from each layer of the DNN and constructed a decoder for each feature, totaling 8000 decoders. Using a sparse linear regression algorithm (SLR; Bishop, 2006), each decoder was trained to use the fMRI responses to these images to predict the feature values (normalized to a mean value of zero in each feature). Note that the decoders were trained only on independent non-blurred images.
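The decoder training step can be illustrated with a minimal sketch. It is not the authors' implementation: it swaps in plain ridge regression (solved in dual form with NumPy) for the sparse linear regression named above, and the function names `fit_linear_decoders` and `decode_features`, the `alpha` parameter, and the toy array sizes are all assumptions made for illustration. Each column of the weight matrix plays the role of one decoder for one DNN feature.

```python
import numpy as np

def fit_linear_decoders(X, Y, alpha=1.0):
    """Fit one linear decoder per feature (ridge regression, NOT sparse regression).

    X: (n_samples, n_voxels) fMRI response patterns.
    Y: (n_samples, n_features) DNN feature values for the same images.
    Returns the weights (n_voxels, n_features) and the training means.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x_mean = X.mean(axis=0)
    y_mean = Y.mean(axis=0)          # features are centred to zero mean for fitting
    Xc = X - x_mean
    Yc = Y - y_mean
    # Dual-form ridge solution: W = Xc^T (Xc Xc^T + alpha I)^{-1} Yc
    gram = Xc @ Xc.T
    dual = np.linalg.solve(gram + alpha * np.eye(gram.shape[0]), Yc)
    W = Xc.T @ dual
    return W, x_mean, y_mean

def decode_features(X, W, x_mean, y_mean):
    """Transform brain activity patterns into the (hypothetical) feature space."""
    return (np.asarray(X, dtype=float) - x_mean) @ W + y_mean

# Toy data standing in for fMRI patterns and DNN features (sizes are arbitrary).
rng = np.random.default_rng(0)
n_train, n_test, n_voxels, n_features = 200, 50, 40, 5
true_map = rng.normal(size=(n_voxels, n_features))
X_train = rng.normal(size=(n_train, n_voxels))
X_test = rng.normal(size=(n_test, n_voxels))
Y_train = X_train @ true_map + 0.1 * rng.normal(size=(n_train, n_features))
Y_test = X_test @ true_map

W, x_mean, y_mean = fit_linear_decoders(X_train, Y_train)
Y_pred = decode_features(X_test, W, x_mean, y_mean)
```

In this sketch the decoders are fitted on one set of samples and applied to another, mirroring the use of an independent non-blurred training set described in the text.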

Using the trained decoders, the brain activity pattern induced by each stimulus in the blurred-to-original sequences was decoded (transformed) into the DNN feature space. For each stimulus image, the Pearson correlation coefficient between its decoded feature vector and the true features of the same stimulus image (r_s) at each layer was computed. In addition, the correlation between the decoded feature vector and the true features of the corresponding non-blurred original image (r_o) was computed (Figure 1B). For non-blurred stimuli, r_s and r_o are identical.

The correlation with stimulus image features (r_s) reflects the degree to which image features resulting from feedforward processing are faithfully decoded from brain activity, while the correlation with original images (r_o) reflects the degree to which the decoded features are "sharpened" by top-down processing, to be similar to those of the non-blurred images. Figure 2A [...] no-prior conditions, and disregarding behavioral data. Figure 2B shows the mean of the results in Figure 2A [...] blur levels (17 out of 24 DNN layer/blur level combinations; t-test across subjects, p < 0.05). This suggests that top-down processing modulates neural representations to bias them towards the original images. We also noticed that the fully-connected layers DNN6-8 had more pronounced positive feature gains than the convolutional layers. Another notable observation is that the 12% blur level shows higher feature gain than both the 6% and 25% blur levels in higher visual areas. One possible explanation is that the 12% blur level occupies a middle ground: the 6% blur level is quite similar to the original image, while the 25% blur level is too degraded to provide any useful information.
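As an illustration of the two correlation measures r_s and r_o defined above, the following sketch computes both for a single decoded feature vector. The function names and the toy vectors are assumptions introduced here, not part of the original analysis code, and the feature gain computation is deliberately left out because its definition is not given in this passage.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation coefficient between two feature vectors."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def stimulus_and_original_correlations(decoded, true_stimulus, true_original):
    """Return (r_s, r_o) for one stimulus at one DNN layer.

    decoded:       feature vector decoded from brain activity
    true_stimulus: true DNN features of the (possibly blurred) stimulus image
    true_original: true DNN features of the non-blurred original image
    """
    r_s = pearson_r(decoded, true_stimulus)
    r_o = pearson_r(decoded, true_original)
    return r_s, r_o
```

For a non-blurred stimulus the same vector is passed as both `true_stimulus` and `true_original`, so the two returned values coincide, as stated in the text.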

In the previous analyses, the data from different experimental conditions were pooled together. We then further investigated the difference between the category-prior and no-prior conditions. We compared the feature gain means grouped according to the experimental condition (category-prior vs. no-prior) while pooling all the behavioral responses (Figure 5). We performed two-way ANOVA on the feature gain data using the ROI and the experimental conditions as the independent variables. The addition of a prior caused significant enhancement to the feature gain in layers DNN4-8 (p < 0.05). The difference was most pronounced in DNN8 (p = 0.000014). This result indicates that addition of prior information enhances top-down modulation, thereby causing an increase in feature gain. This implies augmented sharpening of neural representations.

This result, however, pooled both correctly and incorrectly reported results. When considering behavioral data, there are considerable differences between category-prior and no-prior conditions. The category-prior condition was characterized by a higher number of correct responses (235 out of 300 total instances for five subjects) compared with the no-prior condition (92 out of 300 total instances for five subjects). However, in the category-prior condition, the task was to choose one of five categories. This could lead to false positives: a subject responding at random would be expected to get about 20% of the responses correct. In some cases when the stimulus was highly degraded, the best guess response by the subjects could be random. To attempt to curb this problem, we could use the certainty level as an indicator of correctness, especially for the category-prior condition. We found from the behavioral results that nearly all the trials labelled as certain were also correctly recognized (category-prior: 138 out of 139 certain trials were correct; no-prior: 57 out of 70 certain trials were correct). This further supports the observation that adding priors aids recognition.
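The behavioral counts quoted above can be turned into proportions with a few lines of arithmetic. The snippet below is only a worked example using the numbers reported in the text; the variable names are hypothetical, and the 20% chance level applies to the five-alternative category-prior task only.

```python
# Counts reported in the text: (correct, total)
correct_counts = {"category-prior": (235, 300), "no-prior": (92, 300)}
certain_counts = {"category-prior": (138, 139), "no-prior": (57, 70)}

# Chance level for guessing one of five categories (category-prior task only)
CHANCE_LEVEL = 1 / 5

def proportion(correct, total):
    """Fraction of trials answered correctly."""
    return correct / total

accuracy = {cond: proportion(*c) for cond, c in correct_counts.items()}
certain_accuracy = {cond: proportion(*c) for cond, c in certain_counts.items()}

for cond in correct_counts:
    print(f"{cond}: overall {accuracy[cond]:.1%}, "
          f"certain trials {certain_accuracy[cond]:.1%}")
```

On these counts, overall accuracy is about 78% with the category prior and about 31% without it, while accuracy on trials reported as certain is about 99% and 81%, respectively.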

We further analyzed our data by grouping it according to both experimental condition (category-prior and no-prior) and recognition performance (correct and incorrect). We show the results of the mean feature gain over subject means for each DNN layer in [...]

We also analyzed our data by grouping it according to certainty level (certain and uncertain). We show the results of mean feature gain over subject means for each DNN layer of this analysis in Figure 7. For the category-prior condition, when an image was recognized with certainty we found significant enhancement in feature gain in DNN2, 4, 5, and 8, while for the no-prior condition significant enhancement was found in DNN1.

From the results of Figures 6 and 7, we can observe that in some layers and conditions, recognition has a significant boosting effect on feature gain. However, we also found a considerable feature gain even without recognition, which indicates a sharpening effect that is not guided by subjective recognition. This could be caused by a lower-level sharpening associated with local similarity or object component sharpening, which could be common across different objects (like body parts in animals).

[...] to that in the visual cortex. We found that this sharpening was content-specific, and not just due to a natural image bias. It was also shown to be boosted by giving category information to the subject prior to stimulus viewing. This indicated that adding a more specific prior does not lead to the expected signal being explained away, as proposed by the prediction error hypothesis. However, we did not find that recognition had a strong role in boosting the enhancement process.

In our experimental protocol, the subjects viewed blurred stimuli in randomly organized sequences. In each sequence, different levels of blur of the same image were shown, ordered from the most blurred to the non-blurred stimulus (Figure 1A). This ensured that subjects did not have pixel-level information. Nonetheless, the results show a tendency for the blurred images' neural representations to correlate with the original images (Figure 2A and B). Conversely, the feedforward behavior demonstrated by the noisy DNN output showed an opposite tendency (Figure 2C). We computed the feature gain to investigate how the predicted DNN features deviated from pure feedforward behavior. Feature gain analysis showed that the predicted features are rather correlated with the original image features (Figure 2D). This indicates that a sharpening effect happens across the visual cortex, leading to a more natural-image-like neural representation. This representation was also confirmed as not being due to a natural- [...]

When we added a category-prior to the task, the number of competing categories for recognition decreased, thus the subjects tended to have a more directed top-down effect, due to the smaller number of competing stimuli (Bar & Aminoff, 2003). This led to a higher feature gain, which was especially noted in the higher layers (Figure 5). This further supports the idea of neural representation sharpening when given a prior describing the stimulus content, as the top-down signal would be more correlated with the correct recognition results, thus leading to a stronger feature gain.

We also found that when subjects successfully recognized the image content, the feature gain in some layers predicted from lower visual areas was significantly improved. However, this was not salient as a general trend (Figures 6, 7).

From these results, we can deduce that top-down modulation is in operation when visual input is degraded, even in the absence of a memory or expectation prior. Previous studies have proposed that the brain makes an initial processing step using low spatial frequency information, which generates predictions of the content of the [...]

The test image runs consisted of two conditions. In the first condition, the subjects did not have any prior information about the stimuli presented (no-prior condition). In the second condition, the subjects were provided a semantic prior in the form of category choices (category-prior condition). The stimuli in the category-prior condition consisted of images from one of five object categories (airplane, bird, car, cat, or dog). The subject was informed of these categories prior to the experiment, but not the order in which they were to be presented.

MRI data preprocessing
After rejection of the first 8 seconds of each acquisition to avoid scanner instability effects, the fMRI scans were preprocessed using SPM8 (http://www.fil.ion.ucl.ac.uk/spm), including 3D motion correction, slice-timing correction, and co-registration to the appropriate high resolution anatomical images. Both scans were then also co-registered to the T1 anatomical image. The EPI data were then interpolated to 2 × 2 × 2 mm voxels and further processed using Brain Decoder Toolbox 2 (https://github.com/KamitaniLab/BrainDecoderToolbox2). Volumes were shifted by 2 s (1 volume) to compensate for hemodynamic delays, then the linear trend was removed from each run and the data were normalized. As each image was presented for 8 s, it was represented by four fMRI volumes. These four volumes were then averaged to provide a single image with increased signal to noise ratio for each stimulus image. The averaged voxel values for each stimulus block were used as an input feature vector for the decoding analysis.

The subject acquisition response was recorded manually from the voice recordings.
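Returning to the fMRI preprocessing described in this section, the run-level steps (one-volume shift, linear detrending, normalization, and averaging of four volumes per stimulus block) can be sketched as follows. This is a NumPy illustration under stated assumptions, not the Brain Decoder Toolbox 2 code: the function name `preprocess_run` and its parameters are hypothetical, normalization is assumed to be a per-voxel z-score, stimulus blocks are assumed to be contiguous from the start of the run, and detrending is applied to the whole run before the shifted volumes are selected.

```python
import numpy as np

def preprocess_run(run, n_blocks, vols_per_block=4, shift_vols=1):
    """Turn one run of fMRI volumes into one averaged pattern per stimulus block.

    run: (n_volumes, n_voxels) array for a single run.
    Returns an (n_blocks, n_voxels) array of block-averaged voxel values.
    """
    run = np.asarray(run, dtype=float)
    n_vols = run.shape[0]
    needed = shift_vols + n_blocks * vols_per_block
    if n_vols < needed:
        raise ValueError(f"run has {n_vols} volumes, need at least {needed}")

    # Remove the linear trend (and mean) of each voxel by least squares.
    t = np.arange(n_vols, dtype=float)
    design = np.column_stack([t, np.ones(n_vols)])
    coef, *_ = np.linalg.lstsq(design, run, rcond=None)
    detrended = run - design @ coef

    # Assumed normalization: z-score each voxel within the run.
    sd = detrended.std(axis=0)
    normalized = detrended / np.where(sd > 0, sd, 1.0)

    # Shift by one volume for the hemodynamic delay, then average each block.
    shifted = normalized[shift_vols:needed]
    return shifted.reshape(n_blocks, vols_per_block, -1).mean(axis=1)
```

With a run of 17 volumes and `n_blocks=4`, for example, volumes 1 to 16 are grouped into four blocks of four volumes, and each block is reduced to a single averaged pattern that would serve as the decoder input.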

Region of interest construction
The written record was then revised with each subject to ensure accuracy. The response was recorded as incorrect in cases where the subject failed to give a voice response. In cases where the subject failed to give a button response, the previous button response from the same sequence was used, except when the stimulus was the last [...]

Error bars indicate 95% CI across five subjects.