Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/117282
dc.contributor: Department of Computing
dc.creator: Chen, Jinyu
dc.identifier.uri: https://theses.lib.polyu.edu.hk/handle/200/14150
dc.language.iso: English
dc.title: Towards elastic, robust and privacy-preserving AI model serving
dc.type: Thesis
dcterms.abstract: AI model serving has become a cornerstone of intelligent applications, transforming industries and enhancing daily life through AI-driven services. The emergence of foundation models such as GPT and Vision Transformers has revolutionized AI services across diverse domains. These models, with billions of parameters, exhibit remarkable generalization capabilities but introduce substantial computational and deployment challenges, underscoring the need for efficient serving strategies to enable real-world adoption. Modern AI model serving systems face several critical challenges. First, the rapid growth in model size and complexity results in significant inference overhead, demanding extensive computational resources and memory bandwidth. Second, the dynamic and unpredictable query loads of AI services lead to severe latency fluctuations and resource contention. Third, user requirements vary widely in accuracy and response time, calling for flexible serving solutions that adaptively balance efficiency and quality. Finally, privacy concerns arise when deploying AI models in edge environments, where user data cannot be transmitted directly to centralized servers. To address these challenges, this thesis investigates techniques that enhance elasticity, robustness, and privacy preservation in AI model serving.
dcterms.abstract: First, we develop the first elastic serving system specifically designed for Transformer models. Conventional approaches pre-train multiple model variants of different sizes to accommodate diverse service requirements, incurring prohibitive I/O delays and excessive training costs. Instead, we propose a lightweight token adaptation mechanism for elastic Transformer serving: it dynamically adds prompting tokens to improve accuracy and removes redundant tokens to accelerate inference, thereby enhancing system elasticity. To further improve serving throughput, our framework integrates an application-aware selective batching strategy with an online token adaptation algorithm that adjusts the token allocation scheme in real time. Experimental results demonstrate that our method significantly improves serving throughput while maintaining high accuracy.
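To make the token adaptation idea concrete, below is a minimal sketch, not the thesis's actual implementation: it scores tokens with a simple L2-norm proxy for importance, keeps the top fraction, and optionally prepends learned prompt tokens. The function name adapt_tokens, the norm-based score, and the prompt shape are all illustrative assumptions.

    import torch

    def adapt_tokens(hidden, keep_ratio, prompt=None):
        # hidden: (B, N, D) token embeddings; keep_ratio in (0, 1].
        # Score tokens by L2 norm -- an illustrative stand-in for the
        # importance metric used in the thesis, which is not given here.
        B, N, D = hidden.shape
        keep = max(1, int(N * keep_ratio))
        scores = hidden.norm(dim=-1)                         # (B, N)
        idx = scores.topk(keep, dim=1).indices.sort(dim=1).values
        kept = hidden.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))
        if prompt is not None:                               # prompt: (P, D)
            kept = torch.cat([prompt.expand(B, -1, -1), kept], dim=1)
        return kept                                          # (B, P+keep, D)

Raising keep_ratio or adding prompt tokens trades latency for accuracy; that trade-off is the elasticity knob a serving system can tune per request.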
dcterms.abstract: Second, while token reduction techniques effectively accelerate inference by dynamically removing redundant tokens, they often introduce unpredictable accuracy degradation under varying reduction ratios, compromising service robustness. To address this challenge, we introduce Prodigy, an elastic and robust Transformer serving system built on token-reduction warm-up. The core idea is to pre-train multiple warmed-up models at different token reduction levels, leveraging the insight that fine-tuning with token reduction significantly improves inference accuracy. Rather than fine-tuning a model for every possible reduction setting, we develop a strategic fine-tuning planner and a model ensemble method that together enable robust, efficient inference across a wide range of reduction ratios. These approaches substantially improve service quality while reducing the computational and storage costs of fine-tuning.
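As a rough illustration of serving from a small set of warmed-up models, the sketch below routes a requested reduction ratio to its two nearest warm-up anchors and returns a blend weight for a logit-level ensemble. The anchor values and the linear blending rule are assumptions made for illustration, not Prodigy's actual planner.

    import bisect

    ANCHORS = [0.25, 0.5, 0.75, 1.0]  # hypothetical warmed-up reduction ratios

    def route(ratio, anchors=ANCHORS):
        # Find the two anchors bracketing the requested ratio and a
        # linear blend weight; out-of-range ratios clamp to the ends.
        i = bisect.bisect_left(anchors, ratio)
        if i == 0:
            return anchors[0], anchors[0], 1.0
        if i == len(anchors):
            return anchors[-1], anchors[-1], 1.0
        lo, hi = anchors[i - 1], anchors[i]
        w = (hi - ratio) / (hi - lo)  # weight on the lower anchor
        return lo, hi, w

    # e.g. route(0.6) -> (0.5, 0.75, 0.6); a request could then be served as
    # logits = w * model[lo](x) + (1 - w) * model[hi](x)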
dcterms.abstract: Third, to enable privacy-preserving optimization at the edge, we propose a fast multimodal edge inference framework with a selective feature distillation method. Our method selectively distills knowledge from a pre-trained cloud model by uploading only feature representations used to select public data, effectively preventing user data leakage. In addition, we introduce a privacy-preserving feature clustering mechanism that transmits only prototype-based representations of local features, further strengthening privacy. To accommodate varying communication bandwidths, we design an adaptive feature compression module that reduces transmission costs. Experimental results demonstrate that the proposed framework ensures strong privacy protection, optimizes resource utilization, and maintains high inference accuracy.
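The prototype idea can be sketched as follows: cluster the device's local features and upload only the centroids, so no per-sample feature leaves the device. The function name, cluster count, and use of plain k-means are illustrative assumptions, not the framework's actual mechanism.

    import numpy as np
    from sklearn.cluster import KMeans

    def make_prototypes(features, k=16, seed=0):
        # features: (num_samples, D) local feature matrix kept on-device.
        # Only the k cluster centroids (prototypes) are transmitted.
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(features)
        return km.cluster_centers_  # (k, D)

    # e.g. protos = make_prototypes(np.random.randn(1000, 512))

Because each centroid averages many samples, the upload reveals only aggregate structure of the local features rather than any individual user's data.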
dcterms.abstract: In summary, this thesis presents a set of techniques to improve the elasticity, robustness, and privacy preservation of AI model serving. Through extensive experiments and evaluations, we demonstrate that the proposed methods significantly enhance serving system performance across diverse real-world scenarios. These contributions pave the way for future advances in scalable AI model deployment, ultimately fostering more intelligent, efficient, and trustworthy AI services for society.
dcterms.accessRights: open access
dcterms.educationLevel: Ph.D.
dcterms.extent: xvi, 130 pages : color illustrations
dcterms.issued: 2025
dcterms.LCSH: Artificial intelligence -- Data processing
dcterms.LCSH: Machine learning
dcterms.LCSH: Computer security
dcterms.LCSH: Hong Kong Polytechnic University -- Dissertations
Appears in Collections: Thesis
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.