Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/115224
Title: Compliance vulnerabilities and test-time governance in transformers
Authors: Dong, Peiran
Degree: Ph.D.
Issue Date: 2025
Abstract: Transformer models have greatly advanced AI applications in areas such as natural language processing and image generation, serving both discriminative and generative tasks through their sophisticated architectures. For example, Transformer models trained on large text corpora excel at tasks such as semantic analysis and language translation, and when integrated into visual models they also enable text-conditioned image generation.
However, the increasing deployment of these models has introduced new security risks, particularly compliance vulnerabilities: cases in which model outputs fail to meet ethical and regulatory standards, especially under malicious attack. To prevent an AI race that compromises safety and ethical values, the risks and benefits of deploying AI models must be carefully balanced.
This thesis addresses these concerns by focusing on the compliance vulnerabilities of Transformer architectures, particularly backdoor attacks and unsafe content generation. First, we investigate the security risks of backdoor attacks in discriminative models. We introduce a novel backdoor attack that uses encoding-specific perturbations to trigger malicious behaviors in pre-trained language models. Our research shows that Transformer-based language models can be manipulated to pass off harmful text as benign, allowing it to spread undetected on public platforms.

Traditional defenses against backdoor attacks, such as data preprocessing or model fine-tuning, are often computationally expensive. To avoid this cost, we propose a test-time defense for Vision Transformers (ViTs). By examining output distributions across different ViT blocks, we develop a Directed Term Frequency-Inverse Document Frequency (TF-IDF) based method that detects and classifies poisoned inputs effectively, significantly improving the security and reliability of ViTs against backdoor attacks.
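To make the attack surface concrete: the thesis does not publish its exact perturbation here, but a Unicode homoglyph swap is one well-known instance of an encoding-specific trigger. In the hypothetical snippet below, the string stays visually unchanged while its byte encoding, and hence its tokenization, differs.

    # Hypothetical illustration: swapping Latin "a" (U+0061) for the visually
    # identical Cyrillic "a" (U+0430) changes the byte sequence a tokenizer
    # sees without changing what a human reader sees.
    benign = "this post is harmless"
    triggered = benign.replace("a", "\u0430")
    print(benign == triggered)                            # False: strings differ
    print(len(benign.encode()), len(triggered.encode()))  # 21 22: encodings differ

The detection idea can likewise be sketched. The toy below is a minimal sketch under assumed interfaces, not the thesis's implementation (it also omits whatever the "Directed" component adds); all names are hypothetical. Each input is summarized as a histogram over K bins derived from the per-block ViT output distributions, and a TF-IDF-style weight flags bins that dominate the input (high term frequency) yet are rarely active across a clean calibration set (high inverse document frequency), the signature a backdoor trigger tends to leave.

    import numpy as np

    def tfidf_anomaly_score(input_hist, clean_corpus, eps=1e-8):
        """TF-IDF-style anomaly score for one input.

        input_hist:   length-K histogram summarizing the input's per-block
                      output distributions (hypothetical featurization).
        clean_corpus: (N, K) histograms from known-clean calibration inputs.
        """
        tf = input_hist / (input_hist.sum() + eps)             # term frequency
        active = clean_corpus > (1.0 / clean_corpus.shape[1])  # above-uniform bins
        df = active.sum(axis=0)                                # document frequency
        idf = np.log((1 + clean_corpus.shape[0]) / (1 + df)) + 1.0
        return float((tf * idf).max())

    # Toy demo: clean inputs favor bins 0-4; a "triggered" input piles onto bin 9.
    rng = np.random.default_rng(0)
    corpus = rng.dirichlet([4, 4, 4, 4, 4, 1, 1, 1, 1, 1], size=100)
    clean_x = rng.dirichlet([4, 4, 4, 4, 4, 1, 1, 1, 1, 1])
    poisoned_x = np.full(10, 0.02)
    poisoned_x[9] = 0.82
    print(tfidf_anomaly_score(clean_x, corpus))     # comparatively low
    print(tfidf_anomaly_score(poisoned_x, corpus))  # markedly higher

A score threshold calibrated on the clean corpus would then turn this statistic into an accept/reject decision at test time, with no retraining of the model.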
Generative models built on Transformer architectures also face severe compliance risks: through text prompts, users can elicit harmful content such as violent, infringing, or pornographic material, with damaging social consequences. To address this, we introduce the PROTORE framework, which enforces safe content generation at test time through a "Prototype, Retrieve, and Refine" pipeline that identifies and mitigates unsafe concepts in generative models. Comprehensive evaluations on various benchmarks demonstrate the effectiveness and scalability of PROTORE in refining generated content.
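The three stages can be illustrated with a small sketch. This is a simplified geometry with hypothetical names, not the actual PROTORE framework, which operates on the generative model's learned feature spaces: build a prototype embedding for an unsafe concept from example phrases, retrieve the prompt tokens most similar to it, and refine them by projecting the unsafe direction out.

    import numpy as np

    def refine_prompt_embeddings(token_embs, unsafe_examples, sim_threshold=0.35):
        """Prototype -> Retrieve -> Refine, in miniature (hypothetical sketch)."""
        # Prototype: average the embeddings of example phrases for the concept.
        prototype = unsafe_examples.mean(axis=0)
        prototype /= np.linalg.norm(prototype) + 1e-8
        # Retrieve: tokens whose cosine similarity to the prototype is high.
        norms = np.linalg.norm(token_embs, axis=1, keepdims=True) + 1e-8
        sims = (token_embs / norms) @ prototype
        # Refine: remove the prototype component from the retrieved tokens.
        refined = token_embs.copy()
        for i in np.flatnonzero(sims > sim_threshold):
            refined[i] -= refined[i].dot(prototype) * prototype
        return refined

    # Toy demo: 8 random tokens, one deliberately aligned with the concept.
    rng = np.random.default_rng(1)
    concept = rng.normal(size=64)
    unsafe_examples = concept + 0.1 * rng.normal(size=(5, 64))
    tokens = rng.normal(size=(8, 64))
    tokens[3] = concept + 0.1 * rng.normal(size=64)   # the "unsafe" token
    refined = refine_prompt_embeddings(tokens, unsafe_examples)
    cos = lambda v: v.dot(concept) / (np.linalg.norm(v) * np.linalg.norm(concept))
    print(round(cos(tokens[3]), 3), round(cos(refined[3]), 3))  # high -> near zero

Because the edit happens on the conditioning embeddings at inference time, the generator itself is untouched, which is what makes this style of governance cheap to apply and to extend to newly identified concepts.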
In summary, this thesis provides a thorough examination of compliance vulnerabilities in Transformer-based models. Our proposed methodologies and frameworks tackle critical issues in model compliance, laying the groundwork for future research in secure and responsible AI deployment.
Subjects: Artificial intelligence
Computer security
Cyber intelligence (Computer security)
Hong Kong Polytechnic University -- Dissertations
Pages: xiv, 131 pages : color illustrations
Appears in Collections: Thesis