Photo-realistic talking face generation under latent space manipulation

Salahudeen, R; Siu, WC; Chan, HA

doi:10.1109/TCE.2024.3516387

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/112799

DC Field	Value	Language
dc.contributor	Department of Electrical and Electronic Engineering	en_US
dc.creator	Salahudeen, R	en_US
dc.creator	Siu, WC	en_US
dc.creator	Chan, HA	en_US
dc.date.accessioned	2025-05-09T00:55:02Z	-
dc.date.available	2025-05-09T00:55:02Z	-
dc.identifier.issn	0098-3063	en_US
dc.identifier.uri	http://hdl.handle.net/10397/112799	-
dc.language.iso	en	en_US
dc.publisher	Institute of Electrical and Electronics Engineers	en_US
dc.rights	© 2024 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/	en_US
dc.rights	The following publication R. Salahudeen, W.-C. Siu and H. Anthony Chan, "Photo-Realistic Talking Face Generation Under Latent Space Manipulation," in IEEE Transactions on Consumer Electronics, vol. 71, no. 1, pp. 379-387, Feb. 2025 is available at https://doi.org/10.1109/TCE.2024.3516387.	en_US
dc.subject	Deep Learning	en_US
dc.subject	Latent Space	en_US
dc.subject	Multimedia Applications	en_US
dc.subject	Talking Face Generation	en_US
dc.title	Photo-realistic talking face generation under latent space manipulation	en_US
dc.type	Journal/Magazine Article	en_US
dc.identifier.spage	379	en_US
dc.identifier.epage	387	en_US
dc.identifier.volume	71	en_US
dc.identifier.issue	1	en_US
dc.identifier.doi	10.1109/TCE.2024.3516387	en_US
dcterms.abstract	This paper focuses on generating photo-realistic talking face videos by leveraging semantic facial attributes in a latent space and capturing the talking style from an old video of a speaker. We formulate a process to manipulate facial attributes in the latent space by identifying semantic facial directions. We develop a deep learning pipeline to learn the correlation between the audio and the corresponding video frames from a reference video of a speaker in an aligned latent space. This correlation is used to navigate a static face image into frames of a talking face video, a process moderated by three carefully constructed loss functions for accurate lip synchronization and photo-realistic video reconstruction. By combining these techniques, we aim to generate high-quality talking face videos that are visually realistic and synchronized with the provided audio input. Our results were evaluated against several state-of-the-art talking face generation techniques, and we recorded significant improvements in the image quality of the generated talking face video.	en_US
dcterms.accessRights	open access	en_US
dcterms.bibliographicCitation	IEEE transactions on consumer electronics, Feb. 2025, v. 71, no. 1, p. 379-387	en_US
dcterms.isPartOf	IEEE transactions on consumer electronics	en_US
dcterms.issued	2025-02	-
dc.identifier.scopus	2-s2.0-85212111923	-
dc.identifier.eissn	1558-4127	en_US
dc.description.validate	202505 bcch	en_US
dc.description.oa	Version of Record	en_US
dc.identifier.FolderNumber	OA_Scopus/WOS	-
dc.description.fundingSource	RGC	en_US
dc.description.fundingSource	Others	en_US
dc.description.fundingText	Saint Francis University, Hong Kong (Grant Number: ISG200206)	en_US
dc.description.pubStatus	Published	en_US
dc.description.oaCategory	CC	en_US
Appears in Collections:	Journal/Magazine Article

Files in This Item:

File	Description	Size	Format
Salahudeen_Photo_Realistic_Talking.pdf		2.93 MB	Adobe PDF

Open Access Information

Status	open access
File Version	Version of Record
