Abstract
Research on social perception in monkeys may benefit from standardized, controllable, and ethologically valid renditions of conspecifics offered by monkey avatars. However, previous work has cautioned that monkeys, like humans, show an adverse reaction toward realistic synthetic stimuli, known as the “uncanny valley” effect. We developed an improved naturalistic rhesus monkey face avatar capable of producing facial expressions (fear grin, lip smack, and threat), animated by motion capture data of real monkeys. For validation, we additionally created decreasingly naturalistic avatar variants. Eight rhesus macaques were tested with the various videos and avoided looking at the less naturalistic avatar variants, but not at the most naturalistic or the most unnaturalistic avatar, indicating an uncanny valley effect for the less naturalistic avatar versions. The avoidance was deepened by motion and accompanied by physiological arousal. Only the most naturalistic avatar evoked facial expressions comparable to those toward the real monkey videos. Hence, our findings demonstrate that the uncanny valley reaction in monkeys can be overcome by a highly naturalistic avatar.
Significance Statement
We introduce a new, naturalistic monkey avatar and validate it as an appropriate stimulus for studying primate social cognition by demonstrating that it elicits natural patterns of looking and facial reactions in macaque monkeys rather than evoking an “uncanny” avoidance reaction. The fact that a degraded version of the avatar is able to evoke an uncanniness reaction confirms its existence in monkeys, supporting an evolutionarily old behavioral commonality shared by monkeys and man. However, as this reaction can be overcome by a very naturalistic avatar, the uncanny valley is clearly not an inevitable consequence of high degrees of realism.
Introduction
Faces and facial expressions provide crucial social information for humans and for monkeys. Experimental work investigating the neuronal underpinnings of social cognition and facial processing has so far been hampered by several challenges. First, the experimental subjects for invasive studies are usually monkeys, particularly rhesus macaques, while typical visual stimuli are images of humans (Leopold et al., 2006; Freiwald and Tsao, 2010; Chang and Tsao, 2017), disregarding considerable species differences. Second, the dynamic component of faces is often neglected, as many studies in monkeys have deployed static stimuli (Gothard et al., 2007; Hadj-Bouziane et al., 2008; Ramezanpour and Thier, 2020). Third, the stimuli largely lack standardization, compromising the collection of reliable data. In particular, standardized videos of monkeys, e.g., clips showing a certain facial expression with a specific gaze direction while all other variables are held constant, are practically impossible to capture (Furl et al., 2012; Mosher et al., 2014; Shepherd and Freiwald, 2018). Fortunately, modern computer animation technology offers a solution: virtual, animated monkeys, i.e., monkey head avatars, providing full control over facial expression, eye and head movements. In our attempt to create a highly naturalistic monkey avatar, we built on a computer graphics (CG) model of a monkey head based on MRI scans, furnished with a physically naturalistic model of skin and fur, and controlled by ribbon-like muscle structures linked to motion capture-driven control points.
The usage of such stimuli, however naturalistic they may appear to us humans, requires stimulus validation: we cannot simply extrapolate from our human perception to the perception of an animal, whose differing anatomy, physiology, and cognitive capacities might create a different percept (Chouinard-Thuly et al., 2017). Catarrhine and human vision share many low-level characteristics (Weinstein and Grether, 1940; Shumake et al., 1968; De Valois et al., 1974), but this does not guarantee similar cognitive apprehension of the stimuli. How is the avatar experienced by monkeys? Do they find it strange, maybe even frightening? This is especially relevant in light of previous work showing that macaque monkeys are susceptible to the “uncanny valley” phenomenon (Steckenfinger and Ghazanfar, 2009). This hypothesis by roboticist Masahiro Mori states that human affinity for robots increases with the degree of human-likeness, but only up to a certain level. Beyond it, i.e., for very lifelike synthetic agents, likeability drops suddenly into a deep uncanny valley, before rising again for real humans (Mori, 1970/2012). Anecdotal support for this hypothesis from computer games and animated movies (Butler and Joschko, 2009; Kaba, 2013; Kätsyri et al., 2017) has led many roboticists and graphic designers to deliberately aim for a “safe” non-human appearance in order to steer clear of the uncanny valley (Fong et al., 2003; Fabri et al., 2004).
Careful evaluation of our monkey avatar is especially important as it is animated (Chouinard-Thuly et al., 2017). Mori postulated that movement would deepen the uncanny valley and that unnaturalistic movement would even cause it (Mori, 1970/2012). Experimental studies show that the acceptability of a CG character (Piwek et al., 2014) and the recognition of a facial emotion (Tinwell et al., 2011) depend on the animation quality. These findings emphasize the importance of providing accurate, naturalistic facial animations for CG avatars, which is why we sought to avoid this pitfall by resorting to natural motion-capture driven facial animation.
The aim of this study was to test whether the uncanny valley reaction of monkeys can be overcome by an avatar with highly naturalistic motion and appearance, and whether the avatar’s facial expressions elicit natural reactions. To this end, we generated monkey face videos of incrementally naturalistic render types: unnaturalistic wireframe avatar, grayscale avatar, furless avatar, naturalistic avatar, and real monkey face. The faces displayed different expressions: neutral, fear grin, lip smack, threat, and an artificial “blowing” expression to control whether the monkeys’ reactions are influenced by facial motion per se or by the emotional meaning of the facial expression. Two different video types (dynamic and static) of each render type/expression combination were produced. We showed all videos to eight rhesus macaques, using the time spent looking at a stimulus as a measure of preference, a method widely practiced for nonverbal subjects such as monkeys (Gothard et al., 2004; Steckenfinger and Ghazanfar, 2009) or infants (Lewkowicz and Ghazanfar, 2012; Tafreshi et al., 2014). As the uncanny valley in humans is characterized by negative emotional valence (Mori, 1970/2012; Wang et al., 2015), we analyzed the monkeys’ physiological reactions (heart rate and pupil response) and their reactive facial expressions to elucidate whether they too might experience aversion.
Materials and Methods
Subjects
Data were collected from eight male rhesus macaques (Macaca mulatta; ages 7–16 years), born in captivity and pair-housed. All monkeys had previously been implanted with individually adapted titanium head posts to allow head immobilization in unrelated neurophysiological experiments, and they had been trained to climb into a primate chair and to accept head fixation. All surgical procedures were conducted under aseptic conditions and full anesthesia (induced with ketamine and maintained by inhalation of isoflurane and nitrous oxide, supplemented by intravenous remifentanil) with control of vital parameters (body temperature, CO2, O2, blood pressure, electrocardiogram). After surgery, monkeys were supplied with analgesics until full recovery. All animal procedures were approved by the local animal care committee (Regierungspräsidium Tübingen, Abteilung Tierschutz) and fully complied with German law and the National Institutes of Health’s Guide for the Care and Use of Laboratory Animals.
Monkey head avatar
The basis of the avatar was a CG model of a monkey head based on MRI scans (3T head scanner, Siemens Magnetom Prisma). The surface mesh model derived from the scan was regularized, resulting in a mesh with 1.834 million polygons, which was then linked to a set of embedded ribbons modeled after the muscle anatomy of the macaque face (Parr et al., 2010). These elastic ribbons were linked to 43 control points, which correspond to the motion-captured markers and control the deformation of the mesh. The textures and skin material for the model were first painted by hand based on photographic reference, and additional texture maps were derived to mimic all relevant layers of the skin using Adobe Photoshop. The monkey’s fur was created using Autodesk Maya’s XGen Interactive Grooming feature (https://knowledge.autodesk.com/support/maya/downloads/caas/CloudHelp/cloudhelp/2018/ENU/Maya-CharEffEnvBuild/files/GUID-496603B0-F929-45CD-B607-1CFCD3283DBE-htm.html), controlling the appearance and behavior of the simulated hair via density, length, and direction maps. In order to generate the less naturalistic avatars, the highly naturalistic model was simplified in the following ways: (1) instead of modeling the fur structure in detail, the face was modeled by a smooth surface with the same average color (furless); (2) the color information was additionally discarded, modeling the face by a gray-shaded smooth surface (grayscale); and (3) the details of the surface structure were also discarded by subsampling the mesh to 30,940 polygons and connecting their vertices by smooth curves that follow the surface of the face, resulting in a wireframe picture with gray lines on a white background (wireframe). All movie frames were generated from the monkey head model using the Autodesk Arnold Renderer software.
Dynamic expression modeling
The facial movement of the avatar was based on motion capture data of real monkeys producing facial expressions. The monkeys sat in a primate chair with their heads restrained. In order to attach the infrared-reflecting markers to the skin, the monkeys’ faces had to be shaved. The movement was recorded with a Vicon 1.8 Motion Capture System. In order to evoke facial expressions, interactions were initiated with the motion-captured monkey: (1) presenting a mirror to elicit “lip smacking”; (2) showing a tool to elicit “fear grin”; and (3) staring at the monkey in a prolonged manner to elicit “threat.” The motion capture data were first preprocessed using Vicon NEXUS software to fill in missing marker trajectories and then segmented, selecting subsequences containing clear facial expressions (fear grin, lip smacking, open mouth threat, and neutral expressions). Motion capture data were recorded from two monkeys; the facial expressions used in this experiment were each based on one distinct expression by a single monkey. The resulting set of facial movements was time-normalized and further smoothed by approximating it with a Bayesian nonlinear dimension reduction method that combines Gaussian process latent variable models and Gaussian process dynamical models (Taubert et al., 2013). This algorithm is also suitable for online morphing between the dynamic expressions, a feature that was not used for the experiments in this paper, but for ongoing electrophysiological studies. Control experiments in humans verified that the algorithm outputs highly naturalistic facial motion (N. Taubert, M. Stettler, R. Siebert, S. Spadacenta, L. Sting, P.W. Dicke, P. Thier, M.A. Giese, unpublished observations). Figure 1A provides a schematic overview of the main steps of the avatar generation process.
Visual stimuli
The visual stimuli consisted of 2-s-long video clips featuring the face of a monkey displaying either a dynamic or a static facial expression, the latter corresponding to the frame of maximal expression from the dynamic videos. The portrayed monkey was either a computer-generated avatar or a filmed real monkey (monkey Ja). The real monkey video was recorded with a Canon Legria HF S30 Camcorder (8.6 megapixels, 25 frames/s) while the monkey was seated in a primate chair with his head immobilized in front of a uniform green background. The background was later removed using Adobe After Effects (https://www.adobe.com/products/aftereffects.html), showing only the head on a gray background to match the avatar videos. Four render types of avatars with varying degrees of realism were used (for details of generation, see above, Monkey head avatar): a very unnaturally appearing wireframe face; a textured, but grayscale, still quite unnaturalistic avatar; a slightly more naturalistic, colored, but still furless avatar; and the highly naturalistic monkey head avatar including fur and facial details such as wrinkles. All these monkey faces of different render type displayed one of four distinct species-specific facial expressions or (in the case of the avatars) an additional artificial expression, in which the depicted monkey blew up its cheeks, a behavior never shown by real rhesus monkeys. The species-specific expressions consisted of fear grin (a fearful or submissive reaction), lip smacking (an affiliative, peaceful gesture), open mouth threat (an intimidating, aggressive display), and a neutral face. This yielded 48 different videos (four avatar render types × five expressions plus one real monkey × four expressions = 24 faces, each produced in a dynamic and a static version). See Figure 1B for an overview of the stimulus matrix of the 24 static views.
Setup and paradigm
The monkey subjects sat in a primate chair with their head restrained inside a booth at a distance of 60 cm in front of a 24-inch monitor (1920 × 1080 screen resolution, 144-Hz refresh rate). Each trial started with the presentation of a central fixation dot (2° diameter) for 1 s to draw the monkeys’ attention to the screen, followed by a 2-s video, with a 1-s intertrial interval. The stimuli were presented via the NREC open source control system (https://nrec.neurologie.uni-tuebingen.de). The monkeys’ eye position and pupil size (area) were monitored with an eye tracker (EyeLink 1000, sampling rate 1000 Hz). They could freely move their eyes and were rewarded with a drop of water after each trial, as long as they kept their gaze direction within the boundaries of the monitor (fixation window 47° by 28° visual angle). The videos spanned 22° horizontally and 15° vertically. We conducted two different experiments. In experiment 1, a single video was presented centrally in the middle of the monitor (= 48 different trials), and in experiment 2, two videos of the same facial expression and motion type (dynamic or static), but of different render type, were presented side by side, centered at −11° and 11° horizontally from the middle of the screen, respectively (= 184 different trials: all ordered pairings of different render types, i.e., 4 expressions × 2 video types × 20 pairings of the five render types = 160 trials, plus 2 video types × 12 pairings of the four avatar render types for the blowing expression = 24 trials). Videos were played at 60 frames/s on a gray background, and trials were presented in pseudo-random order. The electrocardiogram was measured as the electrical potential difference between an electrode attached to the monkeys’ head post and a second electrode attached to the metal grid on the bottom of the primate chair on which the monkeys were sitting, and was recorded using the Open Ephys recording system (http://www.open-ephys.org/, sampling rate 5000 Hz). Finally, the monkeys’ own reactive facial expressions to the videos were filmed with a Canon Legria HF S30 Camcorder (8.6 megapixels, 25 frames/s). Each monkey completed between 6 and 11 sessions (n = 62 sessions total).
Data analysis
All data were analyzed using MATLAB R2018a (MathWorks).
Looking behavior
Eye movement data were smoothed, and eye velocity was determined by calculating the first derivative of the eye movement signal, using a second-order Savitzky–Golay filter (window size 10 samples). Fixations were defined as time periods of at least 100 ms duration during which eye velocity did not exceed 20°/s. The mean coordinates during a particular fixation period served as the eye position during this fixation, and the duration of each fixation was calculated. Regions of interest (ROIs) constituting the face (as opposed to the rest of the screen), the eyes, mouth, and nose were determined in all stimulus videos by manually outlining those areas closely along their borders on the frame with the maximum expression, using the same ROI for the static and dynamic video (Fig. 1C). For each trial, all eye fixations that fell within the respective ROI coordinates were identified. The fixations within each ROI were tallied, yielding the number of fixations on the entire face and on the individual face parts. Fixations on the nose were only included in the feature index (see below) but not analyzed individually, due to their limited information content on emotional states. The first fixation was discarded when calculating the number of fixations on face parts, as it usually fell close to the position of the fixation dot initiating the trial, in the vicinity of the nose. Additionally, the cumulative and mean fixation durations on the face and on face parts per trial were calculated, as well as the total looking time, which represents the total amount of time the eye position stayed on the face (or eyes/mouth) in one trial, including fixations as well as saccades. Two measures of exploration were computed: (1) the exploration distance within the face, which is the sum of the distances of all face fixation points from their geometrical center of gravity, providing information about whether the monkeys scrutinize the entire face thoroughly or rather make several fixations on one or few (relevant or irrelevant) facial features; (2) a feature index, which is the difference between the number of fixations on relevant face parts (eyes, mouth, and nose) and the number of fixations on irrelevant face parts (remaining facial regions), divided by the total fixation number:

$$\text{feature index} = \frac{N_{\mathrm{relevant}} - N_{\mathrm{irrelevant}}}{N_{\mathrm{total}}}$$

where $N_{\mathrm{relevant}}$ is the number of fixations on the eyes, mouth, and nose, $N_{\mathrm{irrelevant}}$ the number of fixations on the remaining facial regions, and $N_{\mathrm{total}}$ the total number of fixations on the face.
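For concreteness, the following MATLAB sketch illustrates the fixation-detection step under stated assumptions: eyeX and eyeY are hypothetical variable names for the horizontal and vertical eye position traces in degrees, sampled at 1000 Hz, and an odd Savitzky–Golay frame length of 11 samples is used because MATLAB’s sgolayfilt requires one (the text specifies a window size of 10).

```matlab
% Minimal sketch of the fixation detection (hypothetical variable names;
% parameters follow Materials and Methods). Eye position is smoothed with
% a second-order Savitzky-Golay filter, velocity is computed from the
% first derivative, and runs of >=100 ms below 20 deg/s count as fixations.
fs     = 1000;                         % EyeLink sampling rate (Hz)
velThr = 20;                           % velocity threshold (deg/s)
minDur = 0.100;                        % minimum fixation duration (s)

xs  = sgolayfilt(eyeX(:), 2, 11);      % odd frame length required by
ys  = sgolayfilt(eyeY(:), 2, 11);      %   sgolayfilt (text: window 10)
vel = hypot(diff(xs), diff(ys)) * fs;  % eye velocity (deg/s)

slow    = vel < velThr;                % candidate fixation samples
edges   = diff([0; slow; 0]);
onsets  = find(edges == 1);
offsets = find(edges == -1) - 1;
keep    = (offsets - onsets + 1) >= minDur * fs;   % keep runs >= 100 ms
onsets  = onsets(keep);  offsets = offsets(keep);

% Mean position during each fixation serves as its location
fixX   = arrayfun(@(a, b) mean(xs(a:b)), onsets, offsets);
fixY   = arrayfun(@(a, b) mean(ys(a:b)), onsets, offsets);
fixDur = (offsets - onsets + 1) / fs;  % fixation durations (s)

% ROI assignment could then use the manually outlined polygons, e.g.
% (roiEyesX/roiEyesY are hypothetical vertex lists):
% inEyes = inpolygon(fixX, fixY, roiEyesX, roiEyesY);
```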
Pupil size
Pupil size is linked to arousal and has been shown to respond to the social relevance of stimuli in rhesus macaques (Ebitz et al., 2014). To gauge pupil size, noise was first removed from the raw pupil size signal with a second-order Savitzky–Golay filter (window size 20 samples). Eye-blink artifacts were eliminated by detecting values smaller than −4 (arbitrary eye tracker area units), discarding all samples from 100 ms before to 100 ms after the signal dropped below this threshold, and linearly interpolating the signal across the resulting gap. The pupil size signal was normalized for a given session by translating it into z scores. Subsequently, the signal was divided into bins of 250 ms and averaged per bin. The resulting averages were the basis of comparison between dynamic and static videos and between expressions. Comparison between render types was not possible because of luminance differences between the avatar versions.
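A minimal sketch of this preprocessing chain, assuming a hypothetical column vector rawPupil of raw pupil area samples at 1000 Hz; as above, an odd frame length (21 instead of the stated 20) is used because sgolayfilt requires one:

```matlab
% Sketch of the pupil preprocessing: smooth, excise blinks with a 100-ms
% margin, interpolate across the gaps, z-score per session, bin in 250 ms.
fs  = 1000;                                  % sampling rate (Hz)
pad = round(0.100 * fs);                     % 100-ms margin around blinks
pup = sgolayfilt(rawPupil(:), 2, 21);        % odd frame length (text: 20)

bad = pup < -4;                              % blink threshold (a.u.)
bad = conv(double(bad), ones(2*pad + 1, 1), 'same') > 0;  % widen +/-100 ms
t   = (1:numel(pup))';
pup(bad) = interp1(t(~bad), pup(~bad), t(bad), 'linear');

pupZ = zscore(pup);                          % normalize within session

binLen = round(0.250 * fs);                  % 250-ms bins
nBins  = floor(numel(pupZ) / binLen);
binned = mean(reshape(pupZ(1:nBins*binLen), binLen, nBins), 1);
```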
Heart rate
Heart rate variability (HRV) is controlled by the sympathetic and parasympathetic nervous systems: high HRV at rest is an indicator of good health, and increased HRV is associated with decreased arousal, whereas lowered HRV is associated with heightened arousal. Studies in humans have reported that measurable changes in HRV can be induced by visual emotional stimulation (Choi et al., 2017), and monkey cardiac physiology has likewise been shown to be responsive to affective video content (Bliss-Moreau et al., 2013). Hence, the electrocardiogram was recorded continuously throughout each session. Heart data from monkeys C, K, and L could not be recorded due to technical problems. Offline, the signal was first bandpass filtered between 1 and 40 Hz using a rectangular window, then down-sampled to 1000 Hz, and subsequently smoothed using a second-order Savitzky–Golay filter (window size 20 samples). Artifacts, e.g., due to movements of the monkey, were cut out manually from the signal, and trials containing artifacts were subsequently discarded. QRS complexes were identified using the MATLAB findpeaks function and verified by visual inspection. Then, R-R intervals were calculated, and the heart rate in beats per minute (bpm), as well as the root mean square of successive differences (RMSSD) as a measure of HRV (Shaffer and Ginsberg, 2017), were computed for each trial:

$$\mathrm{RMSSD} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N-1}\left(RR_{i+1} - RR_i\right)^2}$$

where $RR_i$ denotes the $i$-th R-R interval and $N$ the number of R-R intervals within the trial.
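In MATLAB, the processing chain might look as follows; rawEcg is a hypothetical variable name, MATLAB’s generic bandpass function stands in for the rectangular-window filter described above, and the peak-detection thresholds are purely illustrative (in the actual analysis, detected QRS complexes were verified by visual inspection):

```matlab
% Sketch of the ECG processing chain: bandpass filter, downsample, smooth,
% detect R peaks, then compute per-trial heart rate and RMSSD.
fsRaw = 5000;  fs = 1000;                    % recording rate -> 1 kHz
ecg = bandpass(rawEcg(:), [1 40], fsRaw);    % 1-40 Hz bandpass (stand-in)
ecg = resample(ecg, fs, fsRaw);              % downsample to 1 kHz
ecg = sgolayfilt(ecg, 2, 21);                % odd frame length (text: 20)

% R-peak detection; thresholds illustrative, to be tuned per recording
[~, rLocs] = findpeaks(ecg, 'MinPeakHeight', 0.5 * max(ecg), ...
                            'MinPeakDistance', 0.25 * fs);

rr    = diff(rLocs) / fs;                    % R-R intervals (s)
bpm   = 60 / mean(rr);                       % heart rate (beats/min)
rmssd = sqrt(mean(diff(rr).^2)) * 1000;      % RMSSD (ms), per trial
```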
Reactive facial expressions
Video recordings of the monkeys’ reactions were inspected visually. Only monkeys C, E, and P showed clear facial reactions toward the videos and were included in the analysis. Monkeys C and E reacted in the first block only, whereas monkey P reacted throughout the first four sessions (n = 6). Video recordings of the monkeys’ reactive facial expressions were scored manually, blind to the experimental condition, by judging whether the monkey lip smacked, fear grinned, behaved agitatedly (tension yawns, teeth grinding) or showed no reaction during the timeframe of each trial (other expressions, like open mouth threat, were never shown). Afterwards, the probability for each of the three monkeys to show each of these four types of reactions was calculated for all 48 different stimuli.
Statistical analysis
First, the within-subject mean was calculated for each condition and each variable. All subsequent comparisons were based on these means. As the dependent variables were not normally distributed, as shown by Kolmogorov–Smirnov tests for each variable (p < 0.05), we could not use a parametric multi-factorial repeated-measures ANOVA. Instead, Friedman’s nonparametric ANOVAs for related samples were deployed to test for the effects of render type, expression, and video type (dynamic vs static) on all parameters individually. Blowing avatars had to be excluded from the statistical analysis of render type effects because of the absence of a natural blowing expression in videos of real monkeys. Likewise, to ensure an equal number of render types per expression category for statistical testing, blowing expressions were omitted from the expression effects analysis. In order to determine pairwise differences involving the blowing expression, additional Friedman’s ANOVAs for expression effects were conducted omitting the real videos. First, the effects of render type, expression, and video type were examined within the entire dataset. Then, the data were divided into dynamic and static conditions and analyzed for the effects of render type and expression separately. Finally, the data were also split by expression to test for the render type effect within each expression individually, and split by render type to test for the expression effect within each render type individually. Post hoc pairwise multiple comparisons were performed using Dunn and Sidák’s approach. We chose a significance level of p < 0.05 for all comparisons.
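As a minimal sketch of this procedure, assuming a hypothetical matrix M whose rows hold the eight monkeys’ within-subject condition means and whose columns are the levels of one factor (e.g., the five render types), the Friedman test and Dunn–Sidák post hoc comparisons can be run as:

```matlab
% Friedman's nonparametric ANOVA for related samples, followed by
% Dunn-Sidak-corrected pairwise comparisons (Statistics Toolbox).
[p, ~, stats] = friedman(M, 1, 'off');   % 1 observation per row/column cell
if p < 0.05
    c = multcompare(stats, 'CType', 'dunn-sidak', 'Display', 'off');
    % columns of c: group 1, group 2, CI lower, estimate, CI upper, p-value
end
```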
Results
Three different outcomes were conceivable regarding the render type. (1) If the naturalistic avatar is perceived as a real monkey and the uncanny valley does not exist, all reactions toward the naturalistic avatar and the real monkey should be the same, and the reactions toward the other avatars should differ. (2) If the uncanny valley exists in the form predicted by Mori, monkeys should avoid looking at the synthetic face with the highest realism, our naturalistic avatar. (3) If an uncanny valley exists, but high realism is not the critical factor eliciting it and eerie stimulus features in general are decisive instead, monkeys should avoid the strange-looking, less naturalistic avatars.
Assuming that the uncanny valley effects in humans and monkeys are equivalent phenomena, the least preferred stimuli should elicit feelings of aversion, which in turn should induce physiological arousal and reactive facial expressions of fear or agitation, whereas we expected higher affinity for a character to evoke affiliative facial reactions, i.e., lip smacking.
With respect to the video type (dynamic vs static), the uncanny valley hypothesis makes two predictions: movement would deepen the aversion toward an uncanny stimulus, and unnaturalistic movement would cause an otherwise acceptable artificial character to descend into the uncanny valley. If this holds, the static avatar at the bottom of the valley should be avoided even more when moving. If no valley is apparent in the static condition, but one emerges in the dynamic condition, this would mean that the animation of the avatar is flawed.
Regarding the facial expression category of the stimulus, monkeys should prefer the expressions with the highest ethological importance, i.e., threat (negative) and lip smacking (positive), especially in the dynamic condition, which should also induce physiological arousal. We expected the monkeys’ facial reaction to be predominantly governed by the viewed render type and less contingent on the viewed facial expression.
Looking behavior (experiment 1)
The monkeys showed a general interest in looking at the faces of the centrally presented stimulus videos and largely ignored the surrounding background, as evident from Figure 1D. The primary target of fixation was the eyes, which received 18.85% of all fixations on the face; the mouth received 14.60%.
Influence of render type
The render type had a significant effect on all parameters used to characterize the looking behavior, except for the feature index (a measure of focus on relevant vs irrelevant face parts, see Materials and Methods), in the entire dataset from eight monkeys comprising dynamic and static as well as all expression conditions. The monkeys looked most at the videos showing the real monkey and the unnaturalistic wireframe head, followed by the naturalistic avatar, whereas they avoided looking at the grayscale and furless avatars. The naturalistic avatar, wireframe avatar, and real monkey were significantly preferred over the grayscale and furless avatars. This pattern was seen in the number of fixations (Fig. 2A) and the total looking time (although pairwise differences were not always significant for looking time); the cumulative fixation duration results were less clear. Interestingly, the lower number and total duration of looks at the gray and furless faces were accompanied by a longer mean fixation duration, a greater focus on the eyes at the expense of the mouth, and a reduced exploration distance. Real and naturalistic avatar faces were explored the most. A differently weighted focus on relevant facial features compared with irrelevant areas, as represented in the feature index, was not observed. See Table 1 for an overview of all investigated parameters.
Analyzing reactions to dynamic and static stimuli separately yielded the same pattern of effects of render type as the analysis of the pooled data (fixation number dynamic: χ2(4) = 51.11, p < 0.001; static: χ2(4) = 24.23, p < 0.001). As evident in Figure 2B, there were interindividual differences in the monkeys’ performance and preferences, but a clear avoidance of the gray and furless avatars was shown by six out of eight monkeys. The render type effect on fixation number was consistent within all expressions individually; however, within the artificial blowing expression, it was only marginally significant (Fig. 2C).
We would argue that fixation number, the parameter that exhibited the most robust effect of render type, is indeed the most appropriate measure of preference, as it is not influenced by the limitations that compromise the informative value of the others: cumulative fixation duration can only change markedly in monkeys that generally look less at the face and more at the surrounding screen. For monkeys with a high baseline of looking at the face, a further increase in fixation number must necessarily lead to a decrease in total fixation time, as the saccades between subsequent fixations also require time and the duration of the trial is fixed, which is what we observed in our data. Total looking time, comprising fixations and saccades on the face, avoids this problem but introduces another, because it includes times during which no processing of the visual input takes place, i.e., during re-fixation saccades. Like cumulative fixation duration, mean fixation duration suffers from the fact that every video has the same duration, which monkeys most likely realize very quickly. Hence, a monkey wishing to scrutinize a given video more intensively must necessarily decrease the duration of individual fixations. Although mean fixation duration might not carry much information about preference, it can still reflect the saliency or importance of the respective fixation target (e.g., eyes or mouth).
Influence of expression
The type of facial expression shown in the videos influenced the looking behavior differently depending on whether the video was static or dynamic, as documented by Figure 3A. Table 2 summarizes the results of all parameters within dynamic, static, and pooled data. For static faces, comprising all render types, the monkeys looked most at the threatening expression, followed by fear grin, and significantly less at the lip smacking and neutral faces. When the video content was dynamic, the most fixations were counted on threatening displays, this time followed by lip-smacking faces, whereas fear grinning and neutral faces were looked at the least. The artificial expression with the blown-up cheeks received an intermediate amount of looks both when static and when moving. This was measured in fixation number as well as total looking time, and less clearly in cumulative fixation duration. Mean fixation duration was not distinctly modulated. The patterns were more variable between monkeys than the render type effect (Fig. 3B), but appeared rather consistent over render types (Fig. 3C). However, the expression effect did not reach significance for individual categories other than furless static faces (grayscale static marginally significant), most probably due to lack of statistical power resulting from the small sample size when the data were split up both by render type and by video type.
Attention to the eyes did not differ between expressions, but the mouth was looked at most for fear grins, followed by open mouth threats, then lip smacks, and least for neutral and blowing expressions, which was observed for both dynamic and static expressions, probably a reflection of feature saliency. The amount of exploration was significantly increased for threatening displays. The feature index reflected the aforementioned effects of looks on the mouth, in that it was greatest for fear and threat, then lip smacking, and smallest for neutral and blowing faces (Table 2).
Influence of video type
When examining the effect of movement within a face throughout the entire dataset comprising all render types and expressions, presenting dynamic faces rather than static ones did not lead to more fixations overall, but to on average longer ones. Movement caused a slightly stronger focus on the mouth, the part of the faces exhibiting the most pronounced movements, whereas static faces drew more attention to the eyes. This is shown by an increased mean fixation duration and a marginally increased cumulative fixation duration on dynamic faces compared with static ones. The eyes of static faces received a higher fixation number, cumulative fixation duration, and total looking time, whereas the mean fixation duration on the mouth was longer for dynamic videos (for an overview of all parameters, see Table 3). When looking at the video type effect within each expression separately, fixation number did not differ significantly, but it was revealed that movement particularly increased the mean fixation duration on fear grinning (χ2(1) = 4.90, p = 0.027), threatening (χ2(1) = 19.60, p < 0.001), and blowing (χ2(1) = 4.50, p = 0.034) faces, whereas it increased the total looking time at lip smacking displays (χ2(1) = 8.10, p = 0.0044). Dynamic videos especially drew attention away from the eyes of lip-smacking faces, leading to shorter fixations (mean fixation duration: χ2(1) = 5.16, p = 0.023), with the gaze instead fixated longer on the mouth (mean fixation duration: χ2(1) = 10.53, p = 0.0012). Also, fixations were deflected faster from the eyes of blowing faces (mean fixation duration: χ2(1) = 7.26, p = 0.0071), possibly in favor of the salient movement at the cheeks. Separate analysis within each render type showed that dynamic content prolonged the mean fixation duration especially on the naturalistic avatar (χ2(1) = 12.10, p < 0.001) and on the grayscale avatar (χ2(1) = 10.00, p = 0.0016). Notably, the number of fixations on moving faces was lower for grayscale avatars only (χ2(1) = 5.77, p = 0.016; Fig. 2D), with a decreased looking time (χ2(1) = 6.40, p = 0.011) and decreased mean fixation duration (χ2(1) = 5.16, p = 0.023) on the eyes.
Preferential looking (experiment 2)
In experiment 2, instead of one central stimulus, two video clips were presented side by side. The two videos had the same expression category and video type (dynamic or static), but differed in render type. Analysis of this experiment quickly revealed that the monkeys’ looking behavior, when confronted with a choice, was driven less by the stimuli than by strong side biases: Friedman’s ANOVAs for the effect of side, comparing how much the monkeys looked at the face on the left side versus the face on the right side, showed strong side biases in fixation number (χ2(1) = 37.70, p < 0.001), cumulative fixation duration (χ2(1) = 36.78, p < 0.001), total looking time (χ2(1) = 38.78, p < 0.001), and mean fixation duration (χ2(1) = 27.38, p < 0.001). Figure 4 shows that every monkey except one (monkey P) exhibited a clear bias toward one side of the screen. This very likely reflects prior overtraining on other tasks, as anecdotally supported by inquiries into the monkeys’ training histories, and/or idiosyncratic biases. Hence, any further analysis of experiment 2 would not have been meaningful.
Physiological measures
We recorded the monkeys’ electrocardiogram throughout experiment 1. When the viewed expression was threatening, it tended to have a suppressive effect on the HRV, measured as the RMSSD (see Materials and Methods), for dynamic expressions (χ2(3) = 6.94, p = 0.074; illustrated in Fig. 5A) and for static expressions (χ2(3) = 6.98, p = 0.072). When specifically looking at the video type effect within each expression group, the only significant effect was observed for the threatening expression, with a decreased RMSSD in the dynamic condition (χ2(1) = 4.84, p = 0.028; Fig. 5B). This indicates elevated arousal when viewing a moving threatening face. The effect of dynamic expressions was also investigated in each render type group separately, and it was revealed that the effect of the dynamic threatening expression was most strongly driven by the threatening grayscale avatar, which was the only render type group where the RMSSD in the threat condition was decreased significantly (χ2(4) = 14.72, p = 0.0053), depicted in Figure 5C.
Pupil size analysis of our data did not yield any meaningful results, most probably because the signal was too corrupted due to tracking angle changes introduced by the exploratory gaze shifts. Undisturbed pupil size tracking requires steady fixation and carefully controlled luminance conditions, requirements that were precluded by our interest in unrestrained looking behavior.
Reactive facial expressions
Monkeys showed differential facial reactions on their initial encounter with the avatars. The render type significantly influenced the probability for lip smacking (χ2(4) = 23.44, p < 0.001), fear grinning (χ2(4) = 18.87, p < 0.001), and showing no reaction (χ2(4) = 41.91, p < 0.001). Signs of agitation were only shown by monkey E and thus were not significantly different between the conditions. Monkeys lip smacked most toward the real video and the naturalistic avatar, whereas the furless avatar was the render type toward which they fear grinned the most, while seeing the wireframe avatar most frequently elicited no reaction (Fig. 6; Movie 1). The same effect was present when looking at dynamic and static videos separately. Neither the video type nor the expression alone significantly changed the probability for a certain facial reaction.
Discussion
The results show that the uncanny valley effect in monkeys can be overcome by using sufficiently naturalistic avatar stimuli. Consistently over all facial expressions, the monkeys avoided looking at the strange furless and gray avatar heads. The naturalistic avatar, the most unnaturalistic avatar, and the real monkey were looked at significantly more, whereby the difference between the real monkey and the naturalistic avatar is possibly due to familiarity of the observer monkeys with the depicted real monkey. This indicates an uncanny avoidance reaction for the less naturalistic but not the most naturalistic synthetic face, placing the naturalistic avatar on the other side of the uncanny valley. The monkeys’ facial expression reactions reflect this, as they tended to lip smack toward the real monkey and the naturalistic avatar. The furless avatar, on the other hand, was the most likely to elicit fear grinning, and the very unnaturalistic wire head almost never gave rise to any kind of reaction. This supports the assumption that while the very unnatural wireframe avatar was not perceived as a monkey at all, both the real video and the naturalistic avatar were regarded as a conspecific warranting a positive approaching behavior, whereas the uncanny furless avatar elicited fear due to its eerie appearance. We further showed that the avoidance of uncanny faces is associated with physiological arousal, as the clearest increase in arousal in terms of HRV was measured for moving grayscale threatening faces. This likely reflects negative emotions such as fear, similar to the uncanny aversion elicited in humans. Movement selectively amplified the avoidance of uncanny grayscale faces only, which had been predicted (Mori, 1970/2012), but so far not confirmed (Piwek et al., 2014). Indications of improper animation eliciting an aversion (Mori, 1970/2012; Tinwell et al., 2011; Piwek et al., 2014) were not obtained, as the uncanny valley emerged both for static and for dynamic faces.
Moreover, the visual exploration patterns of avatar faces were in accordance with reports in the literature on how monkeys look at photographs of conspecifics (Keating and Keating, 1982; Nahm et al., 1997; Guo et al., 2003; Gothard et al., 2004; Ghazanfar et al., 2006), with the avatars’ facial expressions modulating looking patterns such that the face parts characterizing the expression were looked at most, as was the case for monkey face pictures (Gothard et al., 2004). Static expressions with the most salient features, i.e., the open mouth of the threat and the bared teeth of the fear displays, were looked at most. When the expression was dynamic, threatening faces caused a significant increase in arousal and were also scrutinized most, followed by the socially highly relevant affiliative lip smacking and then the submissive fear grin, whereas uninformative neutral and unnatural blowing faces received the least attention. The increased preference for dynamic lip smacking and the physiological response to dynamic threat, along with the small interest in blowing faces, indicate that the looking behavior toward moving faces was possibly driven more by the social meaning of the expressions than by salient features or movement alone. This conclusion is also supported by the fact that fixations on dynamic faces were longer, particularly on the areas exhibiting the strongest movements. The differential exploration of dynamic and static expressions underlines the importance of naturalistically animated avatar stimuli, which we implemented by resorting to motion capture technology.
However, the physiological results obtained from the heart rate data should be regarded cautiously, as the design of our study was suboptimal for the detection of significant heart rate reactions. The duration of one trial was only 3 s, making it difficult to detect changes in the oscillatory HR, which resides in a frequency range of 2–3 Hz at rest for rhesus monkeys. Because of the sluggishness of HR reactions, attempts to identify changes are usually based on recordings of 5 min, or in experimental settings rarely as short as 10 s, but not less (Shaffer and Ginsberg, 2017). A block design repeating the same expression would be conceivable, but this would entail the downside of habituation.
The reactive facial expressions of our monkeys represent an interesting proof of principle. It has been demonstrated before that macaque monkeys lip smack toward videos of conspecifics under experimental conditions (Mosher et al., 2011; Shepherd and Freiwald, 2018) and show contagious yawning (Paukner and Anderson, 2006). Social reactions toward computer animations have been recorded for chimpanzees (Campbell et al., 2009). However, to the best of our knowledge, our study is the first demonstration of non-ape primates reacting toward a virtual avatar. We can only speculate about why just three monkeys (C, E, and P) showed reactions toward the videos. All three were dominant monkeys and had never been exposed to face stimuli under experimental conditions before. Monkeys Ja, Jo, K, and L did not show any behavioral reaction toward any of the videos. A possible explanation could be the low rank in the dominance hierarchy in the case of monkey K, as dominant monkeys are more likely to lip smack toward video monkeys (Mosher et al., 2011) and to initiate contact. Monkeys Ja, Jo, and L had a long history of participation in experiments involving images and videos of conspecifics, which is why their lack of reaction could reflect habituation. Monkey F, a rather young and submissive monkey who had not been involved in any experiment before, exhibited general signs of agitation like fidgeting in his chair and fear grinning toward all videos indiscriminately.
Implications for the use of avatars in social cognition research
The absence of an uncanny valley effect for our naturalistic avatar, the affiliative behavioral responses elicited by this avatar, and the differential reactions toward the various dynamic facial expressions validate our avatar as a suitable stimulus for additional experiments involving social cognition in monkeys, providing us with a powerful tool to study social perception and social interactions in a standardized, dynamic, fully controllable setting.
The very fact that computer avatars can elicit an uncanny avoidance reaction in monkeys, shown by us and by Steckenfinger and Ghazanfar (2009), demonstrates that it is crucial to validate an artificial social stimulus before use. Virtual avatars have been employed as stimuli in behavioral experiments with monkeys before (Paukner et al., 2014, 2018), and lately, neurophysiological investigations showed that face-selective neurons respond to monkey avatar faces and are modulated by changes in gaze direction or facial expression (Murphy and Leopold, 2019). However, the face avatar stimuli used were not tested for an uncanny valley response, at least not to our knowledge.
Wilson et al. (2019) recently developed a head avatar of a long-tailed macaque (Macaca fascicularis) and investigated the looking behavior of rhesus and long-tailed macaques toward it. The study failed to reveal any difference in viewing times between static images of real faces, naturalistic avatars, and unnaturalistic avatars. This was interpreted as support for the use of the avatar and against the presence of an uncanny aversion in macaque monkeys. It stands in contrast to Steckenfinger and Ghazanfar (2009) and to our study, which clearly show that an uncanny valley exists in monkeys. The lack of agreement could arise from the species incongruence of avatar and observers, the small number of subjects, the deviating experimental design, and the confinement of the analysis to one dependent variable in the study of Wilson et al. (2019). In their experiment, only three rhesus monkeys were tested, who were repeatedly presented with the same stimuli during a single session, and only total (cumulative) fixation time on the faces was measured, which proved to be the least informative parameter in our study. Moreover, as the observer monkeys were rhesus macaques while the avatar stimulus was modeled after long-tailed macaques, the ethological validity of the synthetic stimulus is decreased, possibly introducing unpredictable unfamiliarity and irritation effects. The experiments on the 10 long-tailed macaques cannot be compared, as these animals were free-ranging, could view each image for up to 60 s, and could change the image earlier by touching a target. As only static avatars were tested, the important role of facial movements (Mori, 1970/2012; Tinwell et al., 2011; Piwek et al., 2014; Chouinard-Thuly et al., 2017) was not addressed.
Implications for the uncanny valley hypothesis
Our results corroborate the existence of an uncanny avoidance reaction in macaque monkeys, first shown by Steckenfinger and Ghazanfar (2009). Moreover, we confirm the second prediction of the original uncanny valley hypothesis, namely that movement deepens the uncanny avoidance. The emergence of the uncanniness reaction in our non-human primate relatives has ramifications for possible explanations of the phenomenon. Currently, several lines of explanation for the uncanny valley effect exist. One hypothesis assumes pathogen avoidance as the critical mechanism, proposing that facial aberrations are a sign of disease, triggering disgust as an evolved mechanism for avoiding a contagious disease. The more human-like and thus genetically related a character seems, the more sensitive we may be to such facial defects, as the perceived chance of contracting the disease in question increases (MacDorman and Ishiguro, 2006; MacDorman et al., 2009). This is only partly supported by Green et al. (2008), who observed higher interrater agreement on ideal facial proportions for more human-like faces; however, the tolerance regarding the acceptable range of facial features was not affected by human likeness. Other perceptual hypotheses suggest that uncanny faces remind humans of their own mortality, or that they fail to meet evolved aesthetic standards shaped by the specialized face processing system. Yet other lines of explanation invoke more cognitive mechanisms, including the violation of expectations, i.e., eliciting expectations for a human being but failing to fulfill them, or category uncertainty about whether or not a given entity is human/real (for review, see Wang et al., 2015).
Our findings provide support for an evolutionary origin of the phenomenon, such as threat avoidance driven by disgust or fear, or evolved aesthetic standards arising from the highly specialized face processing system. Brink et al. (2019) assumed that an evolutionary origin of the uncanny valley would require the phenomenon to be present already in very young children. As they failed to observe an uncanny valley reaction in children younger than nine years exposed to human-like and machine-like robots, they discarded a phylogenetic basis. However, this argument is flawed: although a functional trait appearing early during development is most probably innate, the reverse does not hold true. The ability to walk is evolutionary in origin and nonetheless not present in young infants, and the same is true of various cognitive capabilities. Although a preference for looking at faces (Johnson et al., 1991) and a proto-organization of face perception are present from birth (Deen et al., 2017; Livingstone et al., 2017), the face processing system undergoes perceptual learning (Nelson, 2001) and refinement of selectivity throughout childhood in humans (Behrmann et al., 2016) and monkeys (Livingstone et al., 2017). It is likely that these changes are also associated with a refinement of sensitivity for facial deviations through experience, as indicated by Lewkowicz and Ghazanfar (2012), who found avoidance of an uncanny avatar by 12-month-old children, but not by six-month-olds. The findings of Brink et al. (2019), however, seem to be less related to refinement of the face processing system than to the learning of cognitive associations regarding robots.
Explanations centering on category uncertainty about whether the uncanny character is a real human/conspecific monkey, or on the violation of expectations about how a presumed fellow human or monkey is supposed to behave, seem unlikely according to our data, as the “valley” we found in our study was not located at the place on the realism axis predicted by Mori. It was not the most realistic artificial stimuli that were avoided, but those of intermediate realism. We thus show that monkeys evade uncanny faces, but also that high realism is not the factor evoking the uncanny quality in a synthetic face. Instead, abnormal features in the stimulus (in our case, lack of fur with abnormal smoothness of skin and lack of natural coloring) seem to elicit the uncanny avoidance, whereas sufficiently naturalistic stimuli without eerie features are able to eliminate it. Several studies reporting an uncanny response in humans used stimuli with abnormal features, such as unnaturally large eyes, a mismatched degree of realism of different face parts (Seyama and Nagayama, 2007; MacDorman et al., 2009; Lischetzke et al., 2017), or alteration by plastic surgery (Rosenthal-von der Pütten et al., 2019). Among the studies that failed to detect an uncanny valley were notably those deploying controlled, morphed stimulus sets varying only realism (MacDorman and Chattopadhyay, 2017; Kätsyri et al., 2019). As Seyama and Nagayama (2007) pointed out, some robots, dolls, or computer animations seem very pleasant although they are unrealistic, and conversely, humans differ in perceived pleasantness although they are all real. Following Ockham’s razor, it is more parsimonious to assume that abnormal visual features in the stimulus, experienced as off-putting, elicit the avoidance reaction, rather than invoking the elusive concept of realism. In the same vein, one might argue that the uncanny response found in monkeys by Steckenfinger and Ghazanfar (2009) may also have been a consequence of the lack of lifelike proportions, skin texture, and fur in the avatar termed realistic, whereas the unrealistic avatar was not perceived as a monkey at all, like the wireframe head in our study.
The uncanny valley literature is full of methodological shortcomings and conceptual fallacies (for review, see Wang et al., 2015), in part because the hypothesis was initially ill defined, with the hypothetical curve lacking a mathematical formulation as well as clearly defined dependent and independent variables, leaving researchers too many degrees of freedom. One could argue that what is currently investigated under the term uncanny valley is actually a collection of sometimes interacting psychological phenomena, ranging from a simple fear of the unknown (Jentsch, 1997), in particular being startled by something believed to be animate actually being inanimate or vice versa, through evolutionarily developed threat avoidance (MacDorman et al., 2009) with an aversion toward facial aberrations (Seyama and Nagayama, 2007), to a fear of technology takeover. The latter is possibly inspired by science fiction media: we tend to ascribe higher abilities to more naturalistic robots (Walters et al., 2008) and spontaneously apply social expectations to computers (Nass and Moon, 2000), yet we cannot anticipate what robots are capable of, as they might lack human morals while possessing superhuman capabilities. Even if robots looked exactly like humans, as long as we still know or, more accurately, believe that they are robots, we would probably experience an uncanny feeling, as happens in HBO’s Westworld [see also Giger et al. (2019) for an overview of the benefits and drawbacks of humanizing robots].
Acknowledgments
We thank Friedemann Bunjes for valuable technical assistance and Michael Stettler for his contribution to ongoing refinement of the avatar.
Footnotes
The authors declare no competing financial interests.
This work was supported by the Deutsche Forschungsgemeinschaft Grant TH 425/12-2, the Human Frontier Science Program Grant RGP0036/2016, the Bundesministerium für Bildung und Forschung Grant FKZ 01GQ1704, and the Baden-Württemberg Stiftung Grant NEU007/1 KONSENS-NHE.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license, which permits unrestricted use, distribution and reproduction in any medium provided that the original work is properly attributed.