Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/119293
DC Field | Value | Language
dc.contributor | Department of Data Science and Artificial Intelligence | en_US
dc.creator | Vahidi, A | en_US
dc.creator | Moullet, M | en_US
dc.creator | Asadollahzadeh, H | en_US
dc.creator | Ly, K | en_US
dc.creator | Yang, X | en_US
dc.creator | Attar, NA | en_US
dc.creator | Lotfollahi, M | en_US
dc.date.accessioned | 2026-06-12T07:15:25Z | -
dc.date.available | 2026-06-12T07:15:25Z | -
dc.identifier.uri | http://hdl.handle.net/10397/119293 | -
dc.description | The Fourteenth International Conference on Learning Representations, ICLR 2026, Rio de Janeiro, Brazil, Apr 23-27 2026 | en_US
dc.language.iso | en | en_US
dc.publisher | OpenReview.net | en_US
dc.rights | CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/) | en_US
dc.rights | The following publication Vahidi, A., Asadollahzadeh, H., Attar, N. A., Moullet, M., Ly, K., Yang, X., & Lotfollahi, M. (2026). DirMoE: Dirichlet-routed Mixture of Experts. In The Fourteenth International Conference on Learning Representations is available at https://openreview.net/forum?id=a15cDnzr6r. | en_US
dc.title | DirMoE: Dirichlet-Routed Mixture of Experts | en_US
dc.type | Conference Paper | en_US
dcterms.abstract | Mixture-of-Experts (MoE) models have demonstrated exceptional performance in large-scale language models. Existing routers typically rely on non-differentiable Top-k+Softmax, limiting their performance and scalability. We argue that two distinct decisions, which experts to activate and how to distribute expert contributions among them, are conflated in standard Top-k+Softmax. We introduce Dirichlet-Routed MoE (DirMoE), a novel end-to-end differentiable routing mechanism built on a Dirichlet variational autoencoder framework. This design fundamentally disentangles the core routing problems: expert selection, modeled by a Bernoulli component, and expert contribution among chosen experts, handled by a Dirichlet component. The entire forward pass remains fully differentiable through the use of Gumbel-Sigmoid relaxation for the expert selection and implicit reparameterization for the Dirichlet distribution. Our training objective, a variational ELBO, includes a direct sparsity penalty that precisely controls the number of active experts in expectation, alongside a schedule for key hyperparameters that guides the model from an exploratory to a definitive routing state. Moreover, our DirMoE router matches or exceeds other methods while improving expert specialization. | en_US
dcterms.accessRights | open access | en_US
dcterms.bibliographicCitation | The Fourteenth International Conference on Learning Representations, ICLR 2026, Rio de Janeiro, Brazil, Apr 23-27 2026, https://openreview.net/forum?id=a15cDnzr6r | en_US
dcterms.issued | 2026 | -
dc.relation.conference | International Conference on Learning Representations [ICLR] | en_US
dc.description.validate | 202606 bcch | en_US
dc.description.oa | Version of Record | en_US
dc.identifier.FolderNumber | a4508 | -
dc.identifier.SubFormID | 52995 | -
dc.description.fundingSource | Self-funded | en_US
dc.description.pubStatus | Published | en_US
dc.description.oaCategory | CC | en_US
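
Illustrative note: the two routing decisions named in the abstract (which experts to activate, and how to share contribution among them) can be sketched in a few lines of NumPy. This is a minimal forward-pass sketch under stated assumptions, not the authors' implementation. The function names, the multiply-then-renormalize combination of gates and shares, and the squared-error form of the sparsity penalty are all hypothetical choices made for illustration. The sketch only samples; it does not implement the implicit reparameterization gradients or the variational ELBO mentioned in the abstract.

```python
import numpy as np


def gumbel_sigmoid(logits, temperature, rng):
    """Relaxed Bernoulli sample in (0, 1) for each expert.

    Logistic noise (the difference of two Gumbel variables) is added to
    the logits before a temperature-scaled sigmoid.
    """
    u = rng.uniform(1e-6, 1.0 - 1e-6, size=logits.shape)
    logistic_noise = np.log(u) - np.log1p(-u)
    return 1.0 / (1.0 + np.exp(-(logits + logistic_noise) / temperature))


def route(select_logits, concentration, temperature, rng):
    """Combine soft selection gates with Dirichlet contribution shares.

    Assumption: gates and shares are multiplied and renormalized; the
    paper may combine the two components differently.
    """
    gates = gumbel_sigmoid(select_logits, temperature, rng)  # which experts
    shares = rng.dirichlet(concentration)                    # how much each
    weights = gates * shares
    return weights / weights.sum()


def sparsity_penalty(select_logits, target_k):
    """Hypothetical penalty on the expected number of active experts.

    The expected count is the sum of the Bernoulli probabilities; the
    squared distance to target_k is an illustrative choice only.
    """
    expected_active = (1.0 / (1.0 + np.exp(-select_logits))).sum()
    return (expected_active - target_k) ** 2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = np.array([2.0, -1.0, 0.5, -3.0])
    weights = route(logits, np.ones(4), temperature=0.5, rng=rng)
    print("routing weights:", weights)
    print("sparsity penalty:", sparsity_penalty(logits, target_k=2))
```

Lowering `temperature` pushes the gates toward hard 0/1 decisions, which is one plausible reading of the "exploratory to definitive" schedule described in the abstract.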
Appears in Collections: Conference Paper
Files in This Item:
File | Description | Size | Format
Vahidi_DirMoE_Dirichlet_Routed.pdf | | 2.21 MB | Adobe PDF
Open Access Information
Status: open access
File Version: Version of Record