Continuous autoregressive modeling with stochastic monotonic alignment for speech synthesis

Lin, W; He, C

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/119655

DC Field	Value	Language
dc.contributor	Department of Electrical and Electronic Engineering	-
dc.contributor	Department of Computing	-
dc.creator	Lin, W	-
dc.creator	He, C	-
dc.date.accessioned	2026-07-03T07:14:01Z	-
dc.date.available	2026-07-03T07:14:01Z	-
dc.identifier.uri	http://hdl.handle.net/10397/119655	-
dc.language.iso	en	en_US
dc.publisher	OpenReview.net	en_US
dc.rights	CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)	en_US
dc.rights	The following publication Lin, W., & He, C. (2025). Continuous autoregressive modeling with stochastic monotonic alignment for speech synthesis. In The Thirteenth International Conference on Learning Representations (ICLR) is available at https://openreview.net/forum?id=cuFzE8Jlvb.	en_US
dc.title	Continuous autoregressive modeling with stochastic monotonic alignment for speech synthesis	en_US
dc.type	Conference Paper	en_US
dcterms.abstract	We propose a novel autoregressive modeling approach for speech synthesis, combining a variational autoencoder (VAE) with a multi-modal latent space and an autoregressive model that uses Gaussian Mixture Models (GMM) as the conditional probability distribution. Unlike previous methods that rely on residual vector quantization, our model leverages continuous speech representations from the VAE's latent space, greatly simplifying the training and inference pipelines. We also introduce a stochastic monotonic alignment mechanism to enforce strict monotonic alignments. Our approach significantly outperforms the state-of-the-art autoregressive model VALL-E in both subjective and objective evaluations, achieving these results with only 10.3% of VALL-E's parameters. This demonstrates the potential of continuous speech language models as a more efficient alternative to existing quantization-based speech language models. Sample audio can be found at https://tinyurl.com/gmm-lm-tts.	-
dcterms.accessRights	open access	en_US
dcterms.bibliographicCitation	The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, Apr 24 2025	-
dcterms.issued	2025	-
dc.relation.conference	International Conference on Learning Representations [ICLR]	-
dc.description.validate	202606 bcjz	-
dc.description.oa	Version of Record	en_US
dc.identifier.FolderNumber	OA_Others	en_US
dc.description.fundingSource	RGC	en_US
dc.description.fundingText	This work was supported by the RGC of Hong Kong SAR, Grant No. PolyU 15228223.	en_US
dc.description.pubStatus	Published	en_US
dc.description.oaCategory	CC	en_US
Appears in Collections:	Conference Paper

Open Access Information

Status	open access
File Version	Version of Record
