Watching Movies Unfold: A Frame-by-Frame Analysis of the Associated Neural Dynamics

Abstract

Our lives unfold as sequences of events. We experience these events as seamless, although they are composed of individual images captured in between the interruptions imposed by eye blinks and saccades. Events typically involve visual imagery from the real world (scenes), and the hippocampus is frequently engaged in this context. It is unclear, however, whether the hippocampus would be similarly responsive to unfolding events that involve abstract imagery. Addressing this issue could provide insights into the nature of its contribution to event processing, with relevance for theories of hippocampal function. Consequently, during magnetoencephalography (MEG), we had female and male humans watch highly matched unfolding movie events composed of either scene image frames that reflected the real world, or frames depicting abstract patterns. We examined the evoked neuronal responses to each image frame along the time course of the movie events. Only one difference between the two conditions was evident, and that was during the viewing of the first image frame of events, detectable across frontotemporal sensors. Further probing of this difference using source reconstruction revealed greater engagement of a set of brain regions across parietal, frontal, premotor, and cerebellar cortices, with the largest change in broadband (1-30 Hz) power in the hippocampus during scene-based movie events. Hippocampal engagement during the first image frame of scene-based events could reflect its role in registering a recognizable context, perhaps based on templates or schemas. The hippocampus, therefore, may help to set the scene for events very early on.

Introduction

We generally perceive the world as a series of visual snapshots punctuated by eye blinks and saccades. With parallels in terms of how the individual frames of a movie appear to be continuous (Tan, 2018), somehow these separate images become linked together, such that we have a sense of the seamless unfolding of life and events (Cutting, 2005; Magliano and Zacks, 2011). These dynamic events are central to our lived experience, be that during 'online' perception, or when we recall the past or imagine the future.

Here we defined an event as a dynamic, unfolding sequence of actions that could be described in a story-like narrative. Functional MRI (fMRI) has helped delineate the brain areas involved in […] engage several of the same brain regions, including the hippocampus. Aligning with these fMRI findings, impairments in recalling past events (Scoville and Milner, 1957; Rosenbaum et al., 2005; Kurczek et al., 2015), imagining single scene images (Hassabis et al., 2007b), and processing sequences […]

In each condition every image frame was presented for 700 ms, a duration identified by piloting as being long enough to comprehend the scene or pattern being viewed, and brief enough to minimize saccades and limit fixations to the center of frames. Between each image, 'gap' frames of the same duration were inserted, during which no image was displayed, only a pixelated gray background (see Figure 1). The pixelation served to mask the visual persistence of the preceding image. Since images were presented in a sequence, the primary function of the gaps was to act as temporal separators so that individual images could be subjected to analysis independently. Gaps also ensured that images in Unlinked movies were clearly perceived as independent, and the inclusion of gaps in the Linked movies ensured close matching.
The 16 gaps matched the number of images in each movie clip, and each movie ended with a gap.

Pilot testing of a larger number of stimuli ensured that we only included in the main experiment those Patterns movies that were not interpreted as depicting real objects, scenes, or social events. We also confirmed that the gaps between images did not interrupt the naturalistic comprehension of Linked movies or their sense of unfolding. During piloting, each individual movie was also rated on: (1) perceived linking, that is, how linked (or disconnected) images appeared to be; and (2) […] linked together so that the clip tells a story.' For Patterns-Linked movies it was explained: '…for this type of clip, the patterns are all linked together, so one pattern leads to the next one in the clip. In this example clip the pattern moved outwards at first, then the crosses became larger, and then the circles increased in size, then the pattern changed again. The shape changed a bit step-by-step so that the clip portrays an evolving pattern.' Participants were instructed not to link the images in Unlinked movies and to treat each image frame separately when viewing them.

Movies were preceded by one of four visual cues, Pictures-Linked, Patterns-Linked or, for the control conditions, Pictures-Unlinked and Patterns-Unlinked (Figure 1A), in order to advise a participant of the upcoming condition. Cues were provided in advance of each movie so that participants would not be surprised to discover the nature of the clip. Without a cue, the experiment would be poorly controlled, since there would most likely be differences across participants in terms of when they registered the clip type during its viewing. This would make it impossible to time-lock processing of the clip to neural activity in a consistent manner across participants.
Instead, by using an informative cue, we could be sure that from the very first image frame a participant understood whether the movie was to be composed of linked images or not, and whether these images would depict pictures or patterns.

Task and procedure

Scripts run in Matlab R2018a were used to present stimuli and record responses in the MEG scanner. Each trial was preceded by a cue advising of the upcoming condition (e.g. Pictures-Linked), which was shown for 3000 ms. Each movie was 22400 ms in duration from the appearance of the first image frame to the end of the final gap frame (Figure 1A). Individual image and gap frames were each 700 ms in duration. Participants then saw a fixation cross for 3000 ms before the next cue. To ensure participants attended to the movies throughout the scanning session, an occasional probe question was included (two trials per condition, Figure 1B). Following the final gap frame of a movie, a novel image was presented (either a Picture or a Pattern) and participants were asked whether this image fitted well with the movie clip they had just seen. Of the two probe trials per condition, one was a 'yes' trial (the image was congruent with the movie), and one was a 'no' trial (the image was incongruent with the movie).

Given the rate at which frames were presented, we sought to minimize any systematic relationship between spontaneous blinking and stimulus onset. Furthermore, fatigue is known to increase blink duration, which could result in participants missing individual frames, and to increase the risk of significant head movement. Consequently, to ensure participants remained alert, the scanning session was split into five blocks, each lasting approximately 6 minutes. During breaks between recordings participants were instructed to blink and rest.
Each recording block contained eight movie trials, with conditions presented in a randomized order for each participant. Participants were instructed to maintain fixation in the center of frames during the entire trial and to restrict eye movements to between-trial periods.

In summary, movies were visually similar, with one-to-one matching between the two Linked and also the two Unlinked conditions. Common to all movies were the use of a central item per image, the inclusion of interleaved gap frames, and the use of simple line illustrations of pictures or patterns in grayscale, all presented at the same frame rate of 1.43 frames per second.
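The trial timing described above implies a simple consistency check; the following is a minimal sketch in which all variable names are ours and the numbers are taken from the text:

```python
# Consistency check of the stated trial timing (names ours, values from the text).
FRAME_MS = 700           # each image frame and each gap frame lasts 700 ms
N_IMAGES = 16            # image frames per movie
N_GAPS = 16              # one gap after every image; each movie ends on a gap

movie_ms = (N_IMAGES + N_GAPS) * FRAME_MS
frames_per_second = 1000 / FRAME_MS

print(movie_ms)                      # 22400, matching the stated movie duration
print(round(frames_per_second, 2))   # 1.43, the stated frame rate
```

The 22400 ms movie duration is thus fully accounted for by 16 image frames and 16 gap frames of 700 ms each.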

In-scanner eye tracking and analysis
An Eyelink 1000 Plus (SR Research) eye tracking system with a sampling rate of 2000 Hz was used during MEG scanning to monitor task compliance and record data (x and y coordinates of all fixations) across the full screen. The right eye was used for a 9-point grid calibration, recording, and analyses. For some participants the calibration was insufficiently accurate, leaving 16 data sets for eye tracking analyses. The Eyelink Data Viewer (SR Research) was used to examine fixation locations and durations. We used the built-in online data parser of the Eyelink software, whereby fixations were parsed automatically, retaining those exceeding 100 ms. Eye tracking comparisons involving all four conditions were performed to examine where (using group eye fixation heat maps) and for how long (using a two-way repeated measures ANOVA) participants fixated during a 700 ms time window. Our primary focus was on comparing the neural activity evoked during the Pictures-Linked and Patterns-Linked conditions. Consequently, the outcome of this comparison directed our subsequent examination of the eye tracking data, meaning that we focused the eye tracking analysis on the specific time windows where differences in the neural data were identified. This allowed us to ascertain whether the neural differences between conditions could have been influenced by oculomotor disparities.
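The fixation screening and heat-map logic described above can be illustrated as follows. This is a simplified sketch, not the Eyelink parser itself; the screen size and the sample fixations are assumed purely for the example:

```python
import numpy as np

# Illustrative sketch: keep fixations longer than 100 ms, as in the analysis
# described above, then bin their locations into a group heat map.
# Screen dimensions and fixation data are hypothetical.
SCREEN_W, SCREEN_H = 1024, 768

# (x, y, duration_ms) per fixation; toy values
fixations = np.array([
    [512.0, 384.0, 250.0],
    [500.0, 380.0,  80.0],   # too short: discarded by the 100 ms criterion
    [530.0, 390.0, 410.0],
])

kept = fixations[fixations[:, 2] > 100.0]

# 2D histogram over the screen as a simple fixation heat map,
# weighted by fixation duration
heatmap, _, _ = np.histogram2d(
    kept[:, 0], kept[:, 1],
    bins=(32, 24), range=[[0, SCREEN_W], [0, SCREEN_H]],
    weights=kept[:, 2],
)

print(kept.shape[0])   # 2 fixations survive the duration threshold
print(heatmap.sum())   # 660.0 ms of fixation time represented in the map
```

A per-condition mean of the kept durations would then feed the repeated measures ANOVA, while the duration-weighted histogram corresponds to the group heat maps.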

Post-scan surprise memory test
Following the experiment, participants completed a surprise free recall test for the event movies, since the principal aim was to examine the neural differences between Pictures-Linked and Patterns-Linked movies. Participants were asked to recall everything they could about what happened in each of these clips, unprompted. If they correctly recalled the simple story, they scored '1' for that clip; otherwise they scored '0'. Specifically, a score of 1 was awarded if all of the following information was provided: a description of the main figure (be it a stick-figure or abstract pattern) and context, and a narrative containing all of the sub-events that unfolded. The maximum score per participant and event condition was therefore 10 (as there were 10 movies per condition). Performance for Pictures-Linked and Patterns-Linked was compared using a paired-samples t-test with a statistical threshold of p < 0.05.

As noted above, we were particularly interested in hippocampal neural activity. The ability of MEG to detect deep sources, including the hippocampus, has been previously debated (Mikuni et al., […]). Epochs corresponding to each movie cue were defined as -100 to 1000 ms relative to cue onset. Image frames were defined as -100 to 700 ms relative to image onset. Gap periods were defined as -100 to 700 ms relative to gap onset. Epochs were concatenated across trials for each condition, and across scanning sessions. Prior to the calculation of event-related fields (ERFs), data were first low-pass filtered using a two-pass 6th order Butterworth filter with a frequency cut-off of 30 Hz. We implemented a broadband approach (1 to 30 Hz), since the focus of this experiment was evoked activity. Although activity within the theta band (4-8 Hz) is often associated with the hippocampus (Colgin, 2013, 2016), there is also evidence for the role of alpha (9-12 Hz) and beta (13-30 Hz) power in event processing ([…]).
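The filtering and epoching steps above can be sketched with SciPy. This is an illustration of a two-pass (zero-phase) 6th-order Butterworth low-pass at 30 Hz, not the actual analysis pipeline; the sampling rate, the sensor data, and the onset time are assumptions:

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 600.0  # Hz; assumed MEG sampling rate for illustration
# 6th-order Butterworth low-pass at 30 Hz
b, a = butter(N=6, Wn=30.0, btype="low", fs=fs)

rng = np.random.default_rng(0)
sensor = rng.standard_normal(int(fs * 10))   # 10 s of one simulated channel
filtered = filtfilt(b, a, sensor)            # forward-backward -> two-pass, zero phase

# Epoch from -100 to 700 ms relative to a hypothetical image onset at t = 2.0 s
onset = int(2.0 * fs)
epoch = filtered[onset - int(0.1 * fs) : onset + int(0.7 * fs)]
print(epoch.shape)  # (480,) samples, i.e. 800 ms at 600 Hz
```

Applying the filter in both directions (`filtfilt`) doubles the effective order and cancels the phase shift, which is why a "two-pass" filter is the norm before computing ERFs.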

MEG data analyses
Our principal aim was to assess differences between the Pictures-Linked and Patterns-Linked conditions, since our main interest was in comparing the processing of events built from scenes with those built from non-scenes. In order to make this key comparison, our focus was on the individual image frames that composed the movies. As previously mentioned, gaps were included in the design to provide temporal separation between images, so that brain activity associated with each movie image could be examined separately without interference or leakage from the previous image. Consequently, we explored both the evoked responses to particular images along the time course of the movies and then, as a second step, the likely sources of these responses. These steps are described below.

We used a non-parametric cluster-based permutation approach for our ERF analyses, a commonly adopted approach that deals with the multidimensional nature of MEG (and EEG) data (see […]). […] single-shell head model (Nolte, 2003). This resulted in one weight-normalized image per participant within the interval of interest for each condition; these were then smoothed using a 12 mm Gaussian kernel, and a t-contrast was performed at the second level. […] diverged (see Figure 3A).
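The logic of a non-parametric cluster-based permutation test can be illustrated in simplified form. The sketch below operates on a single time axis and builds the null distribution by sign-flipping paired differences; the actual analysis was multidimensional across sensors and time, and all names, thresholds, and the toy data here are ours:

```python
import numpy as np

def cluster_perm_test(a, b, t_thresh=2.0, n_perm=500, seed=0):
    """Paired cluster-based permutation test over one time axis.

    a, b: (n_subjects, n_times) condition averages per participant.
    Returns the largest observed cluster mass and its permutation p-value.
    """
    rng = np.random.default_rng(seed)
    diff = a - b

    def max_cluster_mass(d):
        # Paired t-statistic at each time point
        t = d.mean(0) / (d.std(0, ddof=1) / np.sqrt(d.shape[0]))
        supra = np.abs(t) > t_thresh
        # Largest run of supra-threshold points, summing |t| within the run
        best, run = 0.0, 0.0
        for above, tv in zip(supra, np.abs(t)):
            run = run + tv if above else 0.0
            best = max(best, run)
        return best

    observed = max_cluster_mass(diff)
    null = np.empty(n_perm)
    for i in range(n_perm):
        # Randomly swap condition labels within each participant
        signs = rng.choice([-1.0, 1.0], size=(diff.shape[0], 1))
        null[i] = max_cluster_mass(diff * signs)
    p = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p

# Toy usage: 16 participants, 100 time points, effect injected at points 40-60
rng = np.random.default_rng(1)
a = rng.standard_normal((16, 100))
b = rng.standard_normal((16, 100))
a[:, 40:60] += 2.0
mass, p = cluster_perm_test(a, b)  # p is small here, as the cluster is large
```

Because significance is assigned to whole clusters rather than individual time points, the procedure controls the family-wise error rate without a Bonferroni-style correction over every sensor-time sample.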

Source reconstruction

We subsequently performed a beamformer analysis on the image 1 Pictures-Linked versus Patterns-Linked contrast, restricted to the same time window (178-447 ms) and frequency band (1-30 Hz) within which the significant difference in evoked responses was found. This analysis served to give a better indication of where in the brain this difference originated. The primary peak difference was found in the right hippocampus for Pictures-Linked relative to Patterns-Linked (peak x, y, z = 32, -20, -16; Figure 3B). […]

Discussion

Unfolding events are central to how we experience the world. In this study we had participants watch dynamic, movie-like events, and compared those built from successively linked scenes (Pictures-Linked) to those composed of successively linked non-scene patterns (Patterns-Linked). By using an ERF sliding […] two conditions at the point of the first image. This suggests that it may be more than the registration of the real-world context that is the influential factor, as contexts were present in both conditions. In the Pictures-Linked condition, a participant knew that the context depicted in the first image was going to endure for the entire clip, because the cue preceding each clip advised of the upcoming condition. Similarly, they knew that each image in the Pictures-Unlinked condition related to that image alone, and would not endure across the clip. Consequently, it may be that for the first image in a Pictures-Linked movie, the context is registered, perhaps a relevant scene template or schema is activated fully (Gilboa and Marlatte, 2017), and then used to help link each image across the sequence. In contrast, the first image in a Pictures-Unlinked clip may be limited to just registering the context.
Our finding of a very early hippocampal response to unfolding scene-based events differs from fMRI studies of movie viewing that found the hippocampus responded later, towards the offset of events, with the speculation that this reflected event replay, aiding memory consolidation (Ben-Yakov and Dudai, 2011; Baldassano et al., 2017; see Griffiths and Fuentemilla, 2020 for a review). There are several differences between our study and this previous work. For instance, the latter typically involved explicit memory encoding: participants knew they would be tested afterwards, and this may have influenced hippocampal engagement towards the end of events if memory rehearsal occurred. By contrast, our task had no memory demands, even though excellent incidental encoding took place. In addition, our study was not designed to assess event boundaries; indeed, our two conditions were very highly matched in terms of event structure, which may have precluded boundary-related findings. Prior studies also used fMRI, which is blind to rapid, phasic neuronal activity, given the slow nature of the hemodynamic response. The few EEG studies that have examined memory encoding using movies were conducted at the sensor level (e.g. Silva et al., 2019), and did not source localize responses to specific brain structures. Further MEG studies would be particularly helpful in extending event, and event boundary, research to characterize more precisely the temporal dynamics of hippocampal activity.

Beyond the hippocampus, our results also revealed the involvement of a broader set of brain regions associated more with Pictures-Linked than with Patterns-Linked movies, namely the posterior parietal, inferior frontal, premotor, and cerebellar cortices. Consideration of these areas may shed further light on differences between the two conditions. These brain areas have been identified in numerous studies as part of a network that processes biological motion and the anticipation of incoming intentional movement (Battelli et al., 2003; Rizzolatti and Craighero, 2004; Saygin et al., 2004; Fraiman et al., 2014). In particular, this has been observed in the context of point-light displays, in which a small number of moving lights (e.g. at the joints of a moving person) are sufficient for the display to be interpreted as behavior (e.g. dancing). The Pictures-Linked events were highly simplified portrayals of activities, depicted by stick-figures, lines, and circles to create simple scenes. Although 2D drawings, they evoked 3D unfolding events of real-world activities that were easily grasped by participants. Scene- and pattern-based evolving stimuli may have been processed differently because abstract patterns were not perceived as intentional, biological stimuli, whereas participants could automatically infer the actions performed in scene-based events, even as early as the first image frame. Indeed, through piloting we sought to exclude patterns that consistently evoked a sense of biological motion. The success of our efforts was reflected in the descriptions provided by participants in the post-scan memory test. For example, elements of a Patterns-Linked movie showing three overlapping diamond shapes were described as 'diamond shapes gradually expanded outwards, then rotated clockwise', while Pictures-Linked movies were typically described in terms of the intentionality of the stick-figure.
Biological motion is often related to theory of mind. Could theory of mind explain the ERF differences between the Pictures-Linked and Patterns-Linked conditions? We feel this is unlikely, given that brain areas typically engaged by theory of mind did not emerge in the analyses. Moreover, while biological motion perception appears to relate to some aspects of theory of mind, they are not equivalent constructs (e.g. Rice et al., 2016; Meinhardt-Injac et al., 2018). For example, people with theory of mind deficits (e.g. in the context of autism) may demonstrate deficits in the perception of biological motion relative to controls, but this may depend on whether emotional state information is required (Todorova et al., 2019). Whether there is a common neural circuitry underlying biological motion and theory of mind remains unclear. It is likely that the ability to perceive biological motion is required in order to make social judgements, but it is not the sole component of theory of mind processing (Fitzpatrick et al., 2018). We suggest that our simple, emotionally neutral event movies did not necessarily induce theory of mind processes and, consequently, engagement of brain areas associated with theory of mind was not increased for Pictures-Linked stimuli.

What other alternative explanations might there be for the hippocampal difference between Pictures-Linked and Patterns-Linked movies? It could be argued that the effect of Pictures-Linked was simply the result of scene processing per se. If this were the case, then a difference ought to have been observed between the Pictures-Unlinked and Patterns-Unlinked conditions, since the hippocampus is known to respond strongly to scenes relative to non-scene stimuli ([…]), but no such difference was evident. Another possibility is that linking or sequencing accounts for the finding.
However, linking and unfolding sequences were features of both Pictures-Linked and Patterns-Linked, and so this factor cannot easily explain the change in hippocampal power. In addition, no significant differences between any other pairs of conditions, including between Patterns-Linked and Patterns-Unlinked, and between Patterns-Linked and Pictures-Unlinked, were evident, suggesting the effect was not solely explained by the linking of images. It seems that the hippocampus responded to the first scene image only when the expectation was that this picture was the start of a linked, unfolding event, as reflected in the image type by linking interaction that we observed.

Despite the measures taken to closely match event stimuli in terms of their sense of unfolding, scenes could simply be more engaging or predictable than pattern-based events. If so, then one might have expected event memory to differ in the surprise post-scan test, but it did not, and both types of movie clips were easily recollected as clear narratives. We might also have expected to observe differences in oculomotor behavior, but none were evident, also an indication of similar attentional processes for the two conditions. Consequently, we can conclude that the neural difference identified between the two conditions was not due to a large divergence in encoding success. However, we acknowledge that memory differences might emerge with more complex stimuli. Furthermore, events were very well matched in terms of the number of sub-events, and their evolving nature, as reflected in the highly similar ratings for 'linking' and 'thinking ahead' measures during piloting. It also seems unlikely that the difference between the two event types can be explained by working memory load.
If Pictures-Linked movies were easier to hold in mind, while Patterns-Linked movies were more effortful to process, we would have expected this to be reflected at later points in the movie clips, as memory load increased, but no such effect was apparent.

In summary, this MEG study revealed very early hippocampal engagement associated with the viewing of events built from scenes, over and above highly matched evolving sequences built from non-scene imagery. Together with the hippocampus, the involvement of other brain regions, including posterior parietal, inferior frontal, premotor, and cerebellar cortices, may reflect the processing of biologically relevant information, which typifies the scene-rich episodes we encounter in our daily lives.