VoiceID Conference 2022

Homa Asadi

Department of Linguistics, University of Isfahan, Iran.

https://collegium.ethz.ch/en/fellows/ph-d-homa-asadi-university-of-isfahan/

Acoustic variation within and between bilingual speakers

An important part of human social interaction is the ability to hear and identify voices on a daily basis. Our voice not only conveys information about the message being spoken but also provides clues about the identity and emotional attributes of an individual. Nevertheless, voices are often more variable within the same speaker than between different speakers. One source of within-individual vocal variability arises when speakers communicate in different languages and switch from one language to another. This adds an intriguing dimension of variability to speech, both in perception and production. But do bilinguals change their voice while switching from one language to another? From the speech production perspective, it is suggested that while some aspects of the speech signal vary for linguistic reasons, some indexical features remain intact across different languages (Johnson et al., 2020). Nevertheless, little is known about the influence of language on within- and between-speaker vocal variability. In our talk, we will discuss how the acoustic parameters of voice quality vary or remain stable across the languages a speaker uses. We assume that phonological differences and different sound patterns underlying Persian and English are likely to influence the acoustic parameters of voice quality, and thus it is plausible that acoustic voice structure varies accordingly between the languages of Persian-English bilinguals. Following a psychoacoustic model proposed by Kreiman (2014) and using a series of principal component analyses, we will discuss how acoustic voice quality spaces are structured across the languages of Persian-English bilingual speakers.

Pascal Belin

Institut de Neurosciences de la Timone, UMR7289, Centre National de la Recherche Scientifique & Aix Marseille Université, France

Département de Psychologie, Université de Montréal, Canada

https://neuralbasesofcommunication.eu/

How do you say “Hello”? Acoustic-based modulation of voice personality impressions

Research in face perception shows that robust personality impressions, stable in time and consistent across observers (although not necessarily accurate), emerge within less than a second of exposure to novel faces, and that these impressions are well summarized by a 2-D Trustworthiness-Dominance ‘Social Face Space’. Here I present studies showing that a similar phenomenon exists in the voice domain. Our results indicate that exposure to a single ‘Hello’ is sufficient to elicit robust personality impressions in listeners, and that these impressions are accurately summarized, for both male and female voices, by the same Trustworthiness-Dominance ‘Social Voice Space’ as for faces. Acoustical manipulations based on voice morphing effectively modulate these impressions, while reverse-correlation techniques successfully predict the optimal acoustical pattern for each impression, opening the door to a principle-based ‘vocal make-up’: modulations of perceived voice personality in real or synthetic voices.

Judith M. Burkart

Department of Anthropology, University of Zurich

https://www.aim.uzh.ch/de/members/professors/judithburkart.html

VoiceID in marmoset monkeys: Flexibility and trade-offs in vocal accommodation

Marmosets are highly voluble monkeys, renowned for their vocal flexibility. Even though their vocal repertoires are fixed, they engage in some vocal learning in the form of vocal accommodation, i.e. changes in the acoustic structure of their calls. We find that different captive colonies of marmosets have different “dialects”, and translocation experiments where animals are moved from one colony to the other show that these dialects are socially learned. Intriguingly, not all call types accommodate in the same way: long-distance contact calls, for which signaling identity is crucial because the animals cannot see each other, tend to accommodate less and without compromising individual recognizability of the calls. In contrast, for short-distance contact calls, signaling identity is less important because individual recognition is warranted by visual and olfactory cues. These short-distance contact calls accommodate more, which leads to a decrease in individual recognizability. These results suggest a trade-off in vocal accommodation, between the need to signal social closeness by becoming more similar to each other and the need to maintain individual recognizability. To further scrutinize these trade-offs, we have, on the one hand, developed more sensitive ML-based approaches to analyze and classify marmoset vocalizations and are currently testing their generalizability to other primates and mammals. On the other hand, we follow vocal changes in wild marmosets in Brazil during migrations between groups, which allows us to better estimate the ultimate function of vocal accommodation and signaling individuality under natural conditions.

Volker Dellwo

Department of Computational Linguistics, University of Zurich, Switzerland

https://www.cl.uzh.ch/de/people/team/phonetics/vdellw.html

Vocal identity dynamics: Can speakers control their vocal recognizability?

Identity recognition through voice has so far typically been studied in terms of the performance of a human listener or a machine in identifying a person by their voice. In our line of research, we investigate (a) variable characteristics of human voices and their impact on how well voices can be recognised and (b) whether humans control their vocal identity features to be more or less well recognisable in communication situations. We start from the assumption of a mental acoustic voice space in which voices vary around an average voice (norm-based coding) and hypothesise that speakers can adjust their voices to be closer to the average, and thus less distinguishable from others, or further away from the average to find a more unique place that makes them more distinguishable. We show evidence for such adjustments from within-speaker variability of speaking style: styles that require strong social bonding (e.g. infant-directed speech) lead to better recognition results than styles in which the speaker typically has no interest in being identified and thus forms a more average voice (e.g. deception). Styles that are targeted at intelligibility (clear speech) were found to be less speaker-specific and resulted in lower voice recognition performance. We conclude that speakers have control over their recognisability by applying different speaking styles and we will show how more refined control of identity markers may play a role in dialogue processing.

Peter French

University of York, UK

https://www.york.ac.uk/language/people/academic-research/peter-french/

Voicetype, Phenotype, Genotype

In the criminal justice system, various categories of individual identification evidence are frequently treated as being independent of one another. For example, if evidence against a defendant includes both facial identification and voice identification, the assumption is that the evidence overall is particularly strong as two independent modalities point in the same direction. This assumption of independence is called into question by links between voices and cranial/facial anatomy emerging from research studies, which are reviewed in the presentation. Similarly, DNA identification evidence is often presented as an independent biometric. Emerging knowledge of the links between DNA and face/head features is already sufficient to question whether facial identification evidence can legitimately be regarded as independent of DNA evidence. If the three modalities genuinely were independent of one another, the procedure for estimating their combined strength would be one of simple multiplication: Voice × Face × DNA. However, while we now know that this procedure would result in an overestimation, we have no basis for ‘factoring in’ interdependencies. In order to establish such a basis, and thereby improve the quality of justice, further collaborative research is needed by those working across the modalities. In respect of criminal investigations, the emerging patterns of correlation hold great potential, in that they could allow one to make informed predictions about, say, an offender’s facial appearance from a recording of his voice, or vice versa; or, at some future point, to make inferences about both from a DNA sample. The advantages of this are both obvious and inestimable. Those working in forensic speech science have amassed a body of research findings (again reviewed in the presentation) concerning which aspects of voice are most likely to be biologically determined rather than the products of linguistic socialisation. These provide a starting point for further work. We now call for input from other disciplines to take a ‘big vision’ research initiative forward.
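
As a rough illustration of the "simple multiplication" mentioned in the abstract, the sketch below assumes that the strength of each modality is expressed as a likelihood ratio (LR). That framing, the symbol names, and the numbers are assumptions made here for illustration; they are not taken from the talk.

```latex
% Assumption: each modality's strength is expressed as a likelihood ratio (LR).
% Under full independence the combined strength is the plain product:
\[
  \mathrm{LR}_{\text{combined}}
  = \mathrm{LR}_{\text{voice}} \times \mathrm{LR}_{\text{face}} \times \mathrm{LR}_{\text{DNA}}
\]
% Hypothetical values, for illustration only:
\[
  10 \times 5 \times 1000 = 50\,000
\]
```

If the modalities are correlated, as the abstract argues, this product overstates the combined strength; by how much is the open question the abstract raises.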

Sascha Frühholz

Department of Psychology, University of Oslo, Norway

Department of Psychology, University of Zurich, Switzerland

https://www.sv.uio.no/psi/english/people/aca/saschaf/

“May I activate your amygdala please” – Realtime modulation of the limbic brain system by live affective speech

Affect signaling in communication involves cortico-limbic brain systems for decoding affect information, such as that expressed in a speaker’s vocal tone. To more realistically address the socio-dyadic and neural context of affective communication, we used a novel real-time neuroimaging setup that adaptively linked live speakers’ affective voice production with limbic brain signals in listeners as a proxy for affect recognition. We show that affective communication is acoustically more distinctive, adaptive, and individualized in dyadic than in non-dyadic settings and capitalizes more efficiently on neural affect decoding mechanisms in limbic and associated networks. Only vocal affect produced in adaptation to listeners’ limbic signals was linked to emotion recognition in listeners. While live vocal aggression directly modulated limbic activity in listeners, live vocal joy modulated limbic activity in connection with neural pleasure nodes in the ventral striatum. This suggests that evolved neural systems for affect recognition are largely optimized for dyadic communicative contexts.

Vincent Hughes

Department of Language and Linguistic Science, University of York, UK

https://www.york.ac.uk/language/people/academic-research/vincent-hughes/

Forensic voice comparison at the intersection of linguistics and automatic speaker recognition

Within the field of forensic speech science, there has been growing interest in integrating traditional linguistic methods with automatic speaker recognition (ASR) systems. This work has two aims. The first is to better understand what linguistic information is captured by increasingly ‘black boxy’ ASR systems. The second is to empirically combine the results of linguistic analysis with ASR output, to reduce overall error rates. Some studies have shown promising results. For example, Gonzalez-Rodriguez et al. (2014) and Hughes et al. (2017) found that ASR misclassifications can be resolved by trained phoneticians primarily using laryngeal voice quality analysis. However, key questions remain: As systems continue to produce marked improvements in overall performance with each new paradigm (usually every 3-5 years), will we reach a stage where forensic voice comparison is conducted entirely using ASR? If so, what role will linguistic methods have in forensic casework in the future? I will argue that to answer these questions we must recognise that forensics is a unique application of ASR. As such, we have different concerns and priorities from developers of ASR systems for other commercial applications. Specifically, this means that: (i) Features for analysis should be determined on a case-by-case basis; (ii) The context in which forensic recordings are made is unique, making replication difficult; (iii) Our focus should be on reducing uncertainty rather than maximising potential discriminability. This involves identifying, reporting, and attempting to mitigate sources of variability in system performance (e.g. sample size); (iv) The state-of-the-art ASR system isn’t necessarily the best choice for every forensic case. In this talk, I will review the current state of knowledge at the intersection of linguistics and ASR, and make proposals for ways forward in the quest to find the best ways of analysing voices in the specific context of forensic comparison.

Nadine Lavan

School of Biological and Behavioural Sciences, Queen Mary University of London, UK

https://www.qmul.ac.uk/sbbs/staff/nadine-lavan.html

The time course of person perception from voices

Listeners readily form impressions about a person based on the voice: Is the person old or young? Trustworthy or not? While some studies suggest that these impressions can be formed rapidly (e.g., by 400ms of exposure for traits), it is unclear just how quickly these impressions are formed across a number of person-related impressions. In a gating experiment, we collected ratings of 3 physical characteristics (age, sex, health), 3 trait characteristics (trustworthiness, dominance, attractiveness) and 3 social characteristics (level of education, poshness, professionalism) from recordings of the sustained vowel /a/ of 100 unfamiliar voices. Voice recordings were presented to listeners for 25ms, 50ms, 100ms, 200ms, 400ms, and 800ms. We observe that even from 25ms of exposure, interrater agreement for impressions of physical characteristics is high, with these impressions already being similar to the impressions formed after 800ms. This suggests that impressions of physical characteristics can be established rapidly. In contrast, agreement for trait and social characteristics is low to moderate for short exposure durations and gradually increases over time. These findings thus show that different person-related impressions arise at different points in time, suggesting that, in principle, person impressions that arise earlier may influence impressions that arise later on in time.

Marta Manser

Department of Evolutionary Biology and Environmental Studies, University of Zurich, Switzerland

https://www.ieu.uzh.ch/en/staff/member/manser_marta.html

Selection levels on vocal individuality: strategic use or byproduct

In animals, large variation in vocal individuality exists between and within call types, yet we know little about the level at which selection takes place. Identifying the selection pressures causing this variation in individuality will provide insight into the evolutionary relationships between cognitive and behavioral processes and communication systems, particularly in group-living species where repeated interactions are common. Analyzing a species’ full, large vocal repertoire for individual signatures, their biological function, and the respective selection pressures is challenging. Here, we emphasize that comparing the acoustic individual distinctiveness between life-history stages and different subjects within a call type will allow the identification of selection pressures and enhance the understanding of variation in individuality and its potential strategic use by senders.

Carolyn McGettigan

UCL Speech Hearing and Phonetic Sciences, UK
www.carolynmcgettigan.com

https://www.ucl.ac.uk/pals/people/carolyn-mcgettigan

Perceiving familiar voice identities

Identity perception from voices can be error-prone, and is generally thought to be inferior to face identity processing. While performance can improve with greater exposure to vocal identities, the continued popularity of "mystery voice" contests in radio broadcasts keenly demonstrates the fragility of our mental representations of even very well-known voices. In this talk, I will present some of our behavioural work investigating how different types of familiarity affect the accuracy of voice identity perception, particularly in the presence of perceptual challenges such as natural within-person variation, reduced verbal cues, and artificial modulations of vocal acoustics. Our findings indicate that familiarity is not a binary state but more likely reflects a continual process of developing a perceptual representation via greater experience with a voice. At its best - for example when hearing the voice of a close relative or partner - the identification of a person from the voice can in fact be highly accurate and robust. However, familiarity benefits do not generalise to all tasks – in a speech-in-noise recognition task, we found equivalent performance for personally familiar and unfamiliar targets.

Elisa Pellegrino

Department of Computational Linguistics, University of Zurich, Switzerland

https://www.cl.uzh.ch/de/people/team/phonetics/epelle.html

Individualization versus cooperation: The effect of group size on voice individuality

Compared to other species, humans have an unparalleled ability to cooperate with unrelated individuals. Cooperation, acoustically signalled by convergent accommodation, is facilitated when group members are more similar. Nevertheless, convergence constraints may arise when interlocutors need to mark their vocal individuality. Inspired by findings in animal communication showing higher vocal individuality in larger groups, the focus of this presentation will be on the effect of group size on vocal individualization in human interactions. We will illustrate the novel data collection method designed to investigate the trade-off between acoustic convergence and voice individualization in cooperative situations wherein voice recognition is at stake. We will describe the computational approaches used to quantify between- and within-speaker acoustic similarity across group sizes (i-Vector PLDA; Principal Component Analysis) with the results based on automatic features (MFCC) and more traditional acoustic features relevant for identity processing (e.g., F0, harmonicity, formant dispersion, duration). We will also show the effect of individualization vs. cooperative accommodation on automatic voice discriminability in terms of Equal Error Rate. Results pointing to vocal convergence in larger groups will be discussed in comparison with the opposite trend observed in animal species. Alternative interpretations will be offered that are based on the role of feedback, the effect of first exposure, and familiarization between speakers’ voices.
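
For readers unfamiliar with the Equal Error Rate mentioned in the abstract, here is a minimal sketch of how it can be approximated from comparison scores. The function name, the threshold sweep, and the example scores are hypothetical and chosen for illustration; this is not the authors' pipeline, which the abstract describes only as i-Vector PLDA based.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER: the operating point where the false-acceptance
    rate and the false-rejection rate are (as nearly as possible) equal."""
    # Use every observed score as a candidate decision threshold.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical similarity scores (higher = more likely the same speaker).
same_speaker = [0.9, 0.8, 0.7, 0.4]
different_speaker = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(same_speaker, different_speaker))
```

A lower value means the voices are easier to tell apart automatically, so stronger convergence between speakers would be expected to raise it.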

Tyler Perrachione

Department of Speech, Language, and Hearing Sciences, Boston University

https://www.bu.edu/sargent/profile/tyler-k-perrachione-ph-d/

The source and the signal: An integrated framework for talker identification and speech processing

There are bidirectional dependencies between talker identification (knowing who is speaking) and speech processing (recognizing what is being said). While classically studied separately, decades of research in psycholinguistics and cognitive psychology now convincingly show that human listeners process these two types of information simultaneously and integrally. Such integral processing of voice and speech is often mutually advantageous to both understanding what was said and recognizing who said it. However, accommodating the cognitive demands of a system that evolved to decode these two signals simultaneously is also sometimes detrimental to fast and accurate talker identification or speech perception. Investigating speech processing and talker identification through an integrated framework provides more parsimonious answers to key questions in both of these areas: Why is listening to several talkers, even one at a time, more effortful than listening to a single talker? How do listeners learn to identify voices when most vocal interactions prioritize speech comprehension? And why are listeners so much better at identifying talkers in their native language than in an unfamiliar foreign language? Ultimately, theoretical advances in both speech perception and talker identification should inform one another: Speech must be recognized in the context of phonetic variability across talkers. And talker identification is enhanced when listeners’ linguistic knowledge lets them include talkers’ phonetic idiosyncrasies in their representation of talkers’ identities.

Katarzyna Pisanski

CNRS, French National Centre for Scientific Research

Dynamics of Language Lab (DDL), Lyon, France

Sensory Neuroethology Lab (ENES), Saint-Etienne, France

http://www.ddl.cnrs.fr/Annuaires/Index.asp?Langue=FR&Page=Katarzyna%20PISANSKI

www.ENESlab.com

Individual differences in human voice pitch are highly stable

Voice pitch is arguably the most intensively studied and salient nonverbal parameter of the human voice. As the perceptual correlate of fundamental frequency (fo), determined by vocal fold size and tension, voice pitch is lower in adults than in children and in men than in women. However, fo also varies considerably within these age-sex classes. Hundreds of studies have linked these individual differences in fo to biologically and socially relevant speaker characteristics, from hormone levels and reproductive fitness to perceived dominance and trustworthiness. Given the dynamic nature of fo, both as people age and as they speak, how stable are between-individual differences in this critical vocal parameter? In a series of within-subject and longitudinal studies, we show that individual differences in human fo remain remarkably conserved across the lifespan and across utterances. The pitch of babies’ cries predicts their voice pitch as children, and the pitch of pre-pubertal children’s voices predicts their voice pitch throughout adulthood. Individual differences in voice pitch also covary among neutral speech, emotional speech, and nonverbal vocalisations such as cries and screams. Taken together, these results suggest that voice pitch, known to play an important role in social and mating success, is largely determined in early human ontogeny and has predictive power as a robust individual and biosocial marker across disparate communication contexts, with relevance to both human listeners and voice recognition technologies.

Claudia Roswandowitz

Department of Psychology & Department of Computational Linguistics, University of Zurich, Zurich, Switzerland

https://www.suz.uzh.ch/cl/de/people/team/phonetics/roswandowitz.html

Do humans distinguish deepfake from real vocal identity? Insights from the perceptual and neurocognitive system

Deepfakes artificially re-create and manipulate original human data, with the main purpose of spreading social and political misinformation. High-quality deepfakes are viral ingredients of digital environments, and they can trick human cognition into misperceiving the fake as real. However, experimental research on how the human neurocognitive system processes deepfake information has been greatly neglected so far. In this talk, I will present perceptual and neuroimaging data on the sensitivity of the human brain to detect or be deceived by instances of deepfake voice identities. By using advanced deepfake technologies, we created voice identity clones that are acoustically similar to the natural human voices. During an identity recognition task, humans were mainly deceived by deepfake voice identities, but showed some remaining resources for deepfake detection. On the brain level, we identified a potential “deepfake sensor” including the subcortical ventral striatum, which assigns social reward to natural but not to deepfake identities, and sensory auditory cortex, which evaluates the acoustic degree of artificiality in human utterances. With our study, we present neurocognitive findings on the potential but also the limitations of emerging deepfakes as artificial social signals for humans. Our findings highlight the relevance of the reward level of social cues for successful and effective human-computer interactions.

Stefan R. Schweinberger

Department for General Psychology and Cognitive Neuroscience, Friedrich Schiller University Jena, Germany
Voice Research Unit, Friedrich Schiller University Jena, Germany
http://www.allgpsy.uni-jena.de/stefan-r-schweinberger/

New Tools for Assessing Individual Differences in Voice Perception

Auditory morphing can be used to control sensory information in voices (e.g., by interpolating between an average and a specific identity or expression, or by caricaturing). I first introduce current concepts and research using parameter-specific morphing (PSM) technology, by which we can selectively manipulate acoustic parameters (e.g., fundamental frequency (F0), or timbre), thus permitting more objective assessments of their relative roles for perceiving specific signals. I then present selected examples of how PSM can be used to assess voice perception with cochlear implants (CIs), which tend to be optimized for speech perception, with less attention to socio-emotional signals. Although CI users’ voice gender perception seems exclusively based on F0, they make more efficient use of timbre in the context of age or emotion perception. Importantly, subjective quality of life with a CI is related to nonverbal voice perception skills. Overall, PSM is a promising new approach to objectively assess profiles of abilities to perceive socio-emotional vocal signals, and can inform perceptual training interventions, which have generated promising initial results. Finally, I briefly discuss the Jena Voice Learning and Memory Test (JVLMT) as a new and freely available standardized tool for assessing voice learning and recognition skills using pseudospeech utterances with speech-like phonetic variability.

Sarah V. Stevenage

University of Southampton, UK

https://www.southampton.ac.uk/psychology/about/staff/svs1.page

Voice identity processing under challenging conditions: responding to singers and impersonators

Effective vocal identity processing requires that we can tell together different instances from a single speaker, and can tell apart similar instances from two different speakers. Here, two experiments tested the limits of these capabilities by introducing extreme natural challenges. Experiment 1 challenged listeners by maximising variability within a target voice - listeners were asked to match speaking with singing clips. Performance significantly declined in this challenging condition, relative to the baseline condition when matching two speaking clips. Moreover, a lack of target familiarity magnified the impact of this challenge. However, performance remained above chance even in the hardest experimental condition. Taking a different approach, Experiment 2 challenged listeners by minimising the variability across different target voices - listeners were asked to distinguish celebrity targets from impersonators. Across three tasks, performance declined when telling apart a target from an impersonator, relative to the baseline condition when telling apart two quite different speakers. Again, however, performance remained above chance even in the hardest conditions. Taken together, these results indicated the resilience of vocal identity processing even under challenging natural listening conditions, and suggested a level of sensitivity to vocal cues that had not previously been demonstrated.

Junichi Yamagishi

National Institute of Informatics, Japan

https://doi.org/10.1016/j.csl.2020.101114

Differences between human- and machine-based audio deepfake detection – analysis on the ASVspoof 2019 database

To automatically detect audio deepfakes and prevent spoofing attacks, we have built a large corpus, ASVspoof 2019, which pairs natural human speech with speech waveforms generated by several types of synthesis algorithms. The speech synthesis methods are diverse and include text-to-speech synthesis and voice conversion. In this talk, we will first present the results of large-scale listening tests conducted on this database to discriminate between natural and synthetic human speech. In the test, the subjects were asked to conduct two role-playing tasks. In one task, they were asked to judge whether the utterance was produced by a human or machine, given an imagined scenario where they must detect abnormal telephone calls in the customer service center of a commercial bank. In the other task, the subjects listened to two utterances and were asked to judge whether they sounded like the same person's voice. Next, the results of several automatic detection algorithms for similar tasks on the same database are presented. Finally, the differences between human- and machine-based audio deepfake detection are discussed.
