Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/114359
DC Field | Value | Language
dc.contributor | Department of Electrical and Electronic Engineering | -
dc.creator | Duan, Jiawei | -
dc.identifier.uri | https://theses.lib.polyu.edu.hk/handle/200/13699 | -
dc.language.iso | English | -
dc.title | Benchmarking and enhancing the utility of differential privacy for data mining applications | -
dc.type | Thesis | -
dcterms.abstract | Differential privacy (DP) offers robust guarantees for protecting individual data against malicious attacks and has been deployed in both industry (e.g., Apple and Google) and government (e.g., the U.S. Census Bureau). Because DP permits efficient statistical analysis while safeguarding privacy, it is widely adopted in data mining tasks such as frequency/mean estimation, private data publication, and private learning. However, utility and privacy trade off against each other: enhancing one typically compromises the other. Despite significant efforts to mitigate this trade-off, critical limitations persist. Some solutions achieve high utility but are tailored to specific DP mechanisms or data mining tasks and therefore lack generality. Conversely, more general solutions often fail to deliver superior utility. Achieving generality and effectiveness simultaneously thus remains challenging. | -
dcterms.abstract | Together, the works in this thesis form a platform that benchmarks the utility of various DP mechanisms theoretically and enhances it in a general way, in two prevalent data mining scenarios: statistical analysis and model training. The main contributions are divided into three chapters, organized in a top-down order: high-dimensional statistics estimation [7, 51, 84, 88, 104], centralized learning [6, 82, 85, 160], and federated learning [150]. | -
dcterms.abstract | In the first chapter, we present LDPTube, an analytical toolbox that generalizes and enhances DP mechanisms for high-dimensional mean estimation. Specifically, we leverage the Central Limit Theorem (CLT) [43, 115], one of the most recognized theorems in statistics, to describe the mean squared errors (MSEs) of various DP mechanisms. To optimize their MSEs, HDR4ME* applies regularization to eliminate excessively noisy data, thereby achieving better utility in high-dimensional mean estimation. The second chapter focuses on the utility of private centralized learning. Here, we introduce GeoDP, a framework that first theoretically derives the impact of DP noise on model efficiency. Our analysis reveals that existing perturbation methods introduce biased noise to the gradient direction, resulting in a sub-optimal training process. GeoDP addresses this issue by adding unbiased noise to the gradient direction, thereby improving model utility. In the final chapter, we propose LDPVec, which theoretically analyzes and enhances model utility in federated learning under various DP mechanisms. As in mean estimation, the global aggregation step in federated learning averages noisy gradients from each local party, allowing the CLT to describe model utility effectively. We observe that preserving the gradient direction is crucial, while the perturbed gradient magnitude can be adjusted by fine-tuning the learning rate or clipping. Consequently, LDPVec optimizes model efficiency by allocating (d-1)/d of the privacy budget to the gradient direction and 1/d to the gradient magnitude. | -
dcterms.accessRights | open access | -
dcterms.educationLevel | Ph.D. | -
dcterms.extent | xiv, 157 pages : color illustrations | -
dcterms.issued | 2025 | -
dcterms.LCSH | Data protection | -
dcterms.LCSH | Privacy -- Mathematical models | -
dcterms.LCSH | Data mining -- Security measures | -
dcterms.LCSH | Machine learning | -
dcterms.LCSH | Hong Kong Polytechnic University -- Dissertations | -
Appears in Collections: Thesis
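
The budget allocation stated at the end of the abstract for LDPVec, (d-1)/d of the privacy budget to the gradient direction and 1/d to the gradient magnitude, can be illustrated with a small arithmetic sketch. It is not the thesis implementation. The function name `split_privacy_budget` is hypothetical, and the sketch assumes that d denotes the gradient dimensionality (the abstract does not define d) and that the total budget is a single positive number epsilon that the two parts sum to.

```python
from typing import Tuple


def split_privacy_budget(epsilon: float, d: int) -> Tuple[float, float]:
    """Split a total budget into (direction_budget, magnitude_budget).

    Hypothetical helper: follows the (d-1)/d vs. 1/d ratio quoted in the
    abstract, assuming d is the gradient dimensionality.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if d < 2:
        raise ValueError("d must be at least 2 so the direction gets a share")
    direction_budget = epsilon * (d - 1) / d  # share for the gradient direction
    magnitude_budget = epsilon / d            # share for the gradient magnitude
    return direction_budget, magnitude_budget


if __name__ == "__main__":
    direction, magnitude = split_privacy_budget(epsilon=1.0, d=4)
    print(direction, magnitude)
```

Under these assumptions, the direction's share approaches the full budget as d grows, which matches the abstract's observation that preserving the gradient direction matters most.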
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.