DirMoE : Dirichlet-Routed Mixture of Experts

Vahidi, A; Moullet, M; Asadollahzadeh, H; Ly, K; Yang, X; Attar, NA; Lotfollahi, M

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/119293

DC Field	Value	Language
dc.contributor	Department of Data Science and Artificial Intelligence	en_US
dc.creator	Vahidi, A	en_US
dc.creator	Moullet, M	en_US
dc.creator	Asadollahzadeh, H	en_US
dc.creator	Ly, K	en_US
dc.creator	Yang, X	en_US
dc.creator	Attar, NA	en_US
dc.creator	Lotfollahi, M	en_US
dc.date.accessioned	2026-06-12T07:15:25Z	-
dc.date.available	2026-06-12T07:15:25Z	-
dc.identifier.uri	http://hdl.handle.net/10397/119293	-
dc.description	The Fourteenth International Conference on Learning Representations, ICLR 2026, Rio de Janeiro, Brazil, Apr 23-27 2026	en_US
dc.language.iso	en	en_US
dc.publisher	OpenReview.net	en_US
dc.rights	CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)	en_US
dc.rights	The following publication Vahidi, A., Asadollahzadeh, H., Attar, N. A., Moullet, M., Ly, K., Yang, X., & Lotfollahi, M. (2026). DirMoE: Dirichlet-routed Mixture of Experts. In The Fourteenth International Conference on Learning Representations is available at https://openreview.net/forum?id=a15cDnzr6r.	en_US
dc.title	DirMoE : Dirichlet-Routed Mixture of Experts	en_US
dc.type	Conference Paper	en_US
dcterms.abstract	Mixture-of-Experts (MoE) models have demonstrated exceptional performance in large-scale language models. Existing routers typically rely on non-differentiable Top-k+Softmax, limiting their performance and scalability. We argue that two distinct decisions, which experts to activate and how to distribute expert contributions among them, are conflated in standard Top-k+Softmax. We introduce Dirichlet-Routed MoE (DirMoE), a novel end-to-end differentiable routing mechanism built on a Dirichlet variational autoencoder framework. This design fundamentally disentangles the core routing problems: expert selection, modeled by a Bernoulli component, and expert contribution among chosen experts, handled by a Dirichlet component. The entire forward pass remains fully differentiable through the use of Gumbel-Sigmoid relaxation for the expert selection and implicit reparameterization for the Dirichlet distribution. Our training objective, a variational ELBO, includes a direct sparsity penalty that precisely controls the number of active experts in expectation, alongside a schedule for key hyperparameters that guides the model from an exploratory to a definitive routing state. Moreover, our DirMoE router matches or exceeds other methods while improving expert specialization.	en_US
dcterms.accessRights	open access	en_US
dcterms.bibliographicCitation	The Fourteenth International Conference on Learning Representations, ICLR 2026, Rio de Janeiro, Brazil, Apr 23-27 2026, https://openreview.net/forum?id=a15cDnzr6r	en_US
dcterms.issued	2026	-
dc.relation.conference	International Conference on Learning Representations [ICLR]	en_US
dc.description.validate	202606 bcch	en_US
dc.description.oa	Version of Record	en_US
dc.identifier.FolderNumber	a4508	-
dc.identifier.SubFormID	52995	-
dc.description.fundingSource	Self-funded	en_US
dc.description.pubStatus	Published	en_US
dc.description.oaCategory	CC	en_US
Appears in Collections:	Conference Paper
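
The abstract separates two routing decisions: which experts are active (a Bernoulli component, relaxed with Gumbel-Sigmoid) and how contributions are shared among them (a Dirichlet component). A minimal sketch of that split follows. It is not the authors' implementation: the names `gumbel_sigmoid` and `route`, the temperature `tau`, and the rule of multiplying gates by Dirichlet shares and then renormalizing are assumptions made for illustration only. The sketch samples the Dirichlet by normalizing gamma variates and computes no gradients, so the implicit reparameterization step mentioned in the abstract is not shown.

```python
import math
import random


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gumbel_sigmoid(logit, tau, rng):
    """Relaxed Bernoulli sample: sigmoid((logit + logistic noise) / tau)."""
    # Clamp the uniform draw away from 0 and 1 so the logs stay finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    # The difference of two Gumbel(0, 1) draws follows a Logistic(0, 1) law.
    noise = math.log(u) - math.log(1.0 - u)
    return _sigmoid((logit + noise) / tau)


def route(select_logits, concentrations, tau=0.5, seed=0):
    """Return (soft selection gates, mixing weights) for one token.

    Assumption: gates and Dirichlet shares are combined by elementwise
    product and renormalized; the paper may combine them differently.
    """
    rng = random.Random(seed)
    # Selection: one relaxed Bernoulli gate per expert.
    gates = [gumbel_sigmoid(l, tau, rng) for l in select_logits]
    # Contribution: a Dirichlet draw obtained by normalizing gamma variates.
    draws = [rng.gammavariate(a, 1.0) for a in concentrations]
    total = sum(draws)
    shares = [d / total for d in draws]
    # Combine and renormalize so the final weights sum to one.
    mixed = [g * s for g, s in zip(gates, shares)]
    norm = sum(mixed)
    weights = [m / norm for m in mixed]
    return gates, weights


if __name__ == "__main__":
    gates, weights = route([2.0, -1.0, 0.5, -3.0], [1.0, 2.0, 1.0, 0.5])
    print("gates:  ", [round(g, 3) for g in gates])
    print("weights:", [round(w, 3) for w in weights])
```

Lowering `tau` pushes each gate toward 0 or 1, which loosely mirrors the abstract's schedule from an exploratory to a definitive routing state; the exact schedule used in the paper is not given in this record.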

Files in This Item:

File	Description	Size	Format
Vahidi_DirMoE_Dirichlet_Routed.pdf		2.21 MB	Adobe PDF

Open Access Information

Status	open access
File Version	Version of Record
