Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/99802
DC Field | Value | Language
dc.contributor | Department of Computing | en_US
dc.creator | Yan, H | en_US
dc.creator | Gui, L | en_US
dc.creator | Li, W | en_US
dc.creator | He, Y | en_US
dc.date.accessioned | 2023-07-21T01:07:31Z | -
dc.date.available | 2023-07-21T01:07:31Z | -
dc.identifier.issn | 2640-3498 | en_US
dc.identifier.uri | http://hdl.handle.net/10397/99802 | -
dc.description | The 38th Conference on Uncertainty in Artificial Intelligence (UAI 2022), 1-5 August 2022, Eindhoven, The Netherlands | en_US
dc.language.iso | en | en_US
dc.publisher | PMLR web site | en_US
dc.rights | Posted with permission of the author. | en_US
dc.rights | The following publication Hanqi Yan, Lin Gui, Wenjie Li, Yulan He, "Addressing token uniformity in transformers via singular value transformation", Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, PMLR 180:2181-2191, 2022 is available at http://proceedings.mlr.press/v180/yan22b.html. | en_US
dc.title | Addressing token uniformity in transformers via singular value transformation | en_US
dc.type | Conference Paper | en_US
dc.identifier.spage | 2181 | en_US
dc.identifier.epage | 2191 | en_US
dc.identifier.volume | 180 | en_US
dcterms.abstract | Token uniformity is commonly observed in transformer-based models, in which different tokens share a large proportion of similar information after passing through multiple stacked self-attention layers in a transformer. In this paper, we propose to use the distribution of singular values of the outputs of each transformer layer to characterise the phenomenon of token uniformity, and we empirically illustrate that a less skewed singular value distribution can alleviate the token uniformity problem. Based on our observations, we define several desirable properties of singular value distributions and propose a novel transformation function for updating the singular values. We show that, apart from alleviating token uniformity, the transformation function should preserve the local neighbourhood structure in the original embedding space. Our proposed singular value transformation function is applied to a range of transformer-based language models such as BERT, ALBERT, RoBERTa and DistilBERT, and improved performance is observed in semantic textual similarity evaluation and on a range of GLUE tasks. | en_US
dcterms.accessRights | open access | en_US
dcterms.bibliographicCitation | Proceedings of Machine Learning Research, 2022, v. 180, p. 2181-2191 | en_US
dcterms.isPartOf | Proceedings of Machine Learning Research | en_US
dcterms.issued | 2022 | -
dc.relation.conference | Conference on Uncertainty in Artificial Intelligence [UAI] | en_US
dc.description.validate | 202307 bcww | en_US
dc.description.oa | Version of Record | en_US
dc.identifier.FolderNumber | a2311 | -
dc.identifier.SubFormID | 47465 | -
dc.description.fundingSource | Self-funded | en_US
dc.description.pubStatus | Published | en_US
dc.description.oaCategory | Copyright retained by author | en_US
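The abstract in the record above rests on two operations: measuring token uniformity through the singular value spectrum of a layer's output, and reshaping that spectrum with a transformation function before reconstructing the embeddings. The Python sketch below illustrates the general SVD mechanics only; it is not the authors' released code. The energy-ratio skewness proxy, the power-law transform and the parameter alpha are all illustrative assumptions, not the transformation function or the desirable properties defined in the paper.

```python
# Illustrative sketch only (not the authors' implementation): quantify token
# uniformity via the singular value spectrum of a transformer layer's output,
# then flatten that spectrum and rebuild the embeddings via SVD.
import torch


def spectral_skew(hidden: torch.Tensor) -> float:
    """Fraction of spectral energy carried by the top singular value.

    hidden: (num_tokens, dim) layer output. Values near 1.0 indicate one
    dominant direction, i.e. strong token uniformity.
    """
    s = torch.linalg.svdvals(hidden)
    return (s[0] ** 2 / (s ** 2).sum()).item()


def transform_spectrum(hidden: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Hypothetical transform: raise each singular value to alpha in (0, 1),
    compressing large values and lifting small ones, then reconstruct.
    A stand-in for the paper's transformation function, not the function itself.
    """
    u, s, vh = torch.linalg.svd(hidden, full_matrices=False)
    s_new = s.pow(alpha)
    s_new = s_new * (s.norm() / s_new.norm())  # keep overall scale comparable
    return u @ torch.diag(s_new) @ vh


# Toy check: a near rank-1 matrix mimics token uniformity.
tokens = torch.randn(32, 1) @ torch.randn(1, 64) + 0.01 * torch.randn(32, 64)
print(f"skew before: {spectral_skew(tokens):.3f}")                      # near 1.0
print(f"skew after : {spectral_skew(transform_spectrum(tokens)):.3f}")  # lower
```

In the paper, this kind of transformation is applied to the outputs of pretrained models such as BERT, ALBERT, RoBERTa and DistilBERT before the semantic textual similarity and GLUE evaluations; the sketch only demonstrates the spectrum-flattening idea on a toy matrix.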
Appears in Collections: Conference Paper

Files in This Item:
File | Description | Size | Format
yan22b.pdf | - | 1.06 MB | Adobe PDF
Open Access Information
Status: open access
File Version: Version of Record

Page views: 186 (last week: 14), as of Nov 10, 2025

Downloads: 55, as of Nov 10, 2025

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.