Please use this identifier to cite or link to this item:
http://hdl.handle.net/10397/119293
| DC Field | Value | Language |
|---|---|---|
| dc.contributor | Department of Data Science and Artificial Intelligence | en_US |
| dc.creator | Vahidi, A | en_US |
| dc.creator | Moullet, M | en_US |
| dc.creator | Asadollahzadeh, H | en_US |
| dc.creator | Ly, K | en_US |
| dc.creator | Yang, X | en_US |
| dc.creator | Attar, NA | en_US |
| dc.creator | Lotfollahi, M | en_US |
| dc.date.accessioned | 2026-06-12T07:15:25Z | - |
| dc.date.available | 2026-06-12T07:15:25Z | - |
| dc.identifier.uri | http://hdl.handle.net/10397/119293 | - |
| dc.description | The Fourteenth International Conference on Learning Representations, ICLR 2026, Rio de Janeiro, Brazil, Apr 23-27 2026 | en_US |
| dc.language.iso | en | en_US |
| dc.publisher | OpenReview.net | en_US |
| dc.rights | CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/) | en_US |
| dc.rights | The following publication Vahidi, A., Asadollahzadeh, H., Attar, N. A., Moullet, M., Ly, K., Yang, X., & Lotfollahi, M. (2026). DirMoE: Dirichlet-routed Mixture of Experts. In The Fourteenth International Conference on Learning Representations is available at https://openreview.net/forum?id=a15cDnzr6r. | en_US |
| dc.title | DirMoE : Dirichlet-Routed Mixture of Experts | en_US |
| dc.type | Conference Paper | en_US |
| dcterms.abstract | Mixture-of-Experts (MoE) models have demonstrated exceptional performance in large-scale language models. Existing routers typically rely on non-differentiable Top-k+Softmax, limiting their performance and scalability. We argue that two distinct decisions, which experts to activate and how to distribute expert contributions among them, are conflated in standard Top-k+Softmax. We introduce Dirichlet-Routed MoE (DirMoE), a novel end-to-end differentiable routing mechanism built on a Dirichlet variational autoencoder framework. This design fundamentally disentangles the core routing problems: expert selection, modeled by a Bernoulli component, and expert contribution among chosen experts, handled by a Dirichlet component. The entire forward pass remains fully differentiable through the use of Gumbel-Sigmoid relaxation for the expert selection and implicit reparameterization for the Dirichlet distribution. Our training objective, a variational ELBO, includes a direct sparsity penalty that precisely controls the number of active experts in expectation, alongside a schedule for key hyperparameters that guides the model from an exploratory to a definitive routing state. Moreover, our DirMoE router matches or exceeds other methods while improving expert specialization. | en_US |
| dcterms.accessRights | open access | en_US |
| dcterms.bibliographicCitation | The Fourteenth International Conference on Learning Representations, ICLR 2026, Rio de Janeiro, Brazil, Apr 23-27 2026, https://openreview.net/forum?id=a15cDnzr6r | en_US |
| dcterms.issued | 2026 | - |
| dc.relation.conference | International Conference on Learning Representations [ICLR] | en_US |
| dc.description.validate | 202606 bcch | en_US |
| dc.description.oa | Version of Record | en_US |
| dc.identifier.FolderNumber | a4508 | - |
| dc.identifier.SubFormID | 52995 | - |
| dc.description.fundingSource | Self-funded | en_US |
| dc.description.pubStatus | Published | en_US |
| dc.description.oaCategory | CC | en_US |

Appears in Collections: Conference Paper

Files in This Item:
| File | Description | Size | Format | |
|---|---|---|---|---|
| Vahidi_DirMoE_Dirichlet_Routed.pdf | | 2.21 MB | Adobe PDF | View/Open |