Ehret, Jonathan, Boensch, Andrea, Aspoeck, Lukas, Roehr, Christine T., Baumann, Stefan ORCID: 0000-0001-5963-6079, Grice, Martine ORCID: 0000-0003-4973-4059, Fels, Janina and Kuhlen, Torsten W. (2021). Do Prosody and Embodiment Influence the Perceived Naturalness of Conversational Agents' Speech? ACM Trans. Appl. Percept., 18 (4). NEW YORK: ASSOC COMPUTING MACHINERY. ISSN 1544-3965

Full text not available from this repository.

Abstract

For conversational agents' speech, either all possible sentences have to be prerecorded by voice actors or the required utterances can be synthesized. While synthesizing speech is more flexible and economical in production, it also potentially reduces the perceived naturalness of the agents, among other things due to mistakes at various linguistic levels. In our article, we are interested in the impact of adequate and inadequate prosody, here particularly in terms of accent placement, on the perceived naturalness and aliveness of the agents. We compare (1) inadequate prosody, as generated by off-the-shelf text-to-speech (TTS) engines with synthetic output; (2) the same inadequate prosody imitated by trained human speakers; and (3) adequate prosody produced by those speakers. The speech was presented either as audio-only or by embodied, anthropomorphic agents, to investigate the potential masking effect by a simultaneous visual representation of those virtual agents. To this end, we conducted an online study with 40 participants listening to four different dialogues, each presented in the three Speech levels and the two Embodiment levels. Results confirmed that adequate prosody in human speech is perceived as more natural (and the agents are perceived as more alive) than inadequate prosody in both human (2) and synthetic speech (1). Thus, it is not sufficient to just use a human voice for an agent's speech to be perceived as natural; it is decisive whether the prosodic realisation is adequate or not. Furthermore, and surprisingly, we found no masking effect by speaker embodiment, since neither a human voice with inadequate prosody nor a synthetic voice was judged as more natural when a virtual agent was visible compared to the audio-only condition. On the contrary, the human voice was even judged as less alive when accompanied by a virtual agent.
In sum, our results emphasize, on the one hand, the importance of adequate prosody for perceived naturalness, especially in terms of accents being placed on important words in the phrase, while showing, on the other hand, that the embodiment of virtual agents plays a minor role in the naturalness ratings of voices.

Item Type: Journal Article
Creators:
Creators | Email | ORCID | ORCID Put Code
Ehret, Jonathan | UNSPECIFIED | UNSPECIFIED | UNSPECIFIED
Boensch, Andrea | UNSPECIFIED | UNSPECIFIED | UNSPECIFIED
Aspoeck, Lukas | UNSPECIFIED | UNSPECIFIED | UNSPECIFIED
Roehr, Christine T. | UNSPECIFIED | UNSPECIFIED | UNSPECIFIED
Baumann, Stefan | stefan.baumann@uni-koeln.de | orcid.org/0000-0001-5963-6079 | UNSPECIFIED
Grice, Martine | martine.grice@uni-koeln.de | orcid.org/0000-0003-4973-4059 | UNSPECIFIED
Fels, Janina | UNSPECIFIED | UNSPECIFIED | UNSPECIFIED
Kuhlen, Torsten W. | UNSPECIFIED | UNSPECIFIED | UNSPECIFIED
URN: urn:nbn:de:hbz:38-574070
DOI: 10.1145/3486580
Journal or Publication Title: ACM Trans. Appl. Percept.
Volume: 18
Number: 4
Date: 2021
Publisher: ASSOC COMPUTING MACHINERY
Place of Publication: NEW YORK
ISSN: 1544-3965
Language: English
Faculty: Faculty of Arts and Humanities
Divisions: Faculty of Arts and Humanities > Fächergruppe 1: Kunstgeschichte, Musikwissenschaft, Medienkultur und Theater, Linguistik, IDH > Institut für Linguistik > Phonetik
Subjects: Language, Linguistics
Uncontrolled Keywords:
Keywords | Language
Computer Science, Software Engineering | Multiple languages
Refereed: Yes
URI: http://kups.ub.uni-koeln.de/id/eprint/57407
