Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/98168
Title: Towards context-aware voice interaction via acoustic sensing
Authors: Yang, Qiang
Degree: Ph.D.
Issue Date: 2023
Abstract: Voice interaction has become a fundamental way of connecting humans and smart devices. Such an interface enables users to complete daily tasks with voice commands, which not only carry the user's explicit semantic meaning but also imply the user's physical context, such as location and speaking direction. Although current speech recognition technology allows devices to accurately understand voice content and act on it, these contextual clues can help smart devices respond even more intelligently. For example, knowing that a user is in the kitchen helps narrow down the set of likely voice commands and enables customized services.
Acoustic sensing has been studied for a long time. However, unlike conventional systems that actively transmit handcrafted sensing signals, we can only observe the voice at the receiver side, which makes sensing voice context challenging. In this thesis, we use voice signals as a sensing modality and propose new passive acoustic sensing techniques to extract the physical context of the voice and the user: location, speaking direction, and liveness. Specifically, (1) inspired by the human auditory system, we investigate the effect of human ears on binaural sound localization and design a bionic machine-hearing framework to locate multiple sounds with binaural microphones. (2) We exploit the energy and frequency radiation patterns of voice to estimate the user's head orientation. By modeling the anisotropic propagation of voice, we can measure the user's speaking direction, which serves as a valuable context for smart voice assistants. (3) Attackers may use a loudspeaker to replay pre-recorded voice commands and deceive voice assistants. We examine how sound generation differs between humans and loudspeakers and find that the rapidly changing human mouth produces a more dynamic sound field. We can therefore detect voice liveness and defend against such replay attacks by examining sound-field dynamics.
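To make the binaural-localization idea in (1) concrete, a minimal passive direction-of-arrival estimate can be sketched with GCC-PHAT cross-correlation between two microphones. This is a generic textbook illustration, not the thesis's actual framework; the microphone spacing, sample rate, and function names below are assumptions:

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the time delay of `sig` relative to `ref` via GCC-PHAT."""
    n = sig.size + ref.size
    spec = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    spec /= np.abs(spec) + 1e-12            # phase transform: keep phase only
    cc = np.fft.irfft(spec, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs   # delay in seconds

def doa_from_delay(tau, mic_distance, c=343.0):
    """Map an inter-microphone delay to a far-field arrival angle (degrees)."""
    return np.degrees(np.arcsin(np.clip(c * tau / mic_distance, -1.0, 1.0)))

# Synthetic check: broadband noise reaching the right mic 5 samples late.
fs, d = 16000, 0.2                          # sample rate (Hz), 20 cm spacing
left = np.random.default_rng(0).standard_normal(4096)
right = np.roll(left, 5)                    # right channel delayed by 5 samples
tau = gcc_phat(right, left, fs)
angle = doa_from_delay(tau, d)              # roughly 32 degrees off broadside
print(tau, angle)
```

Real binaural localization, as studied in the thesis, must additionally handle the filtering effect of the ears (head-related transfer functions) and multiple simultaneous sources, which is where the free-field two-microphone model above breaks down.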
To achieve such context-aware voice interaction, we study the physical properties of voice, work across hardware and software, and introduce new algorithms drawing on principles from acoustic sensing, signal processing, and machine learning. We implement these systems and evaluate them in extensive experiments, demonstrating that they enable new real-world applications, including multiple-sound localization, speaking-direction estimation, and replay-attack defense.
Subjects: Speech processing systems
Hong Kong Polytechnic University -- Dissertations
Pages: xv, 139 pages : color illustrations
Appears in Collections:Thesis


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.