Generalizing oversampling methods from euclidean space data to graph structured data for the class imbalance problem

Liu, Yongxu

Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/97857

Title:	Generalizing oversampling methods from euclidean space data to graph structured data for the class imbalance problem
Authors:	Liu, Yongxu
Degree:	Ph.D.
Issue Date:	2022
Abstract:	Over the last decade, classification, one of the most important pattern recognition tasks, has promoted a wide range of real-world applications. Although many classic classification algorithms have been proposed, some of them may not perform well when data follows the unequal distribution, where some minority classes have much fewer samples than the majority classes. By generating new samples for the minority class, oversampling methods propose a promising solution to alleviate the class imbalance problem. There are still two concerns: the noise generated by oversampling methods and the narrow scope that oversampling methods focus on. First, the oversampling algorithms utilize the interpolation method to generate samples between the chosen source samples. If a minority sample is far from its minority neighbors, the interpolated samples would easily reside in the majority region and cause additional classification difficulties. Second, most of the existing oversampling methods focus on the numerical variables and binary classification. Nevertheless, categorical variables and multi-class classification are ubiquitous in the real world. Extending existing methods to the complex scenario would meet several non-trivial challenges. Furthermore, we delve deeper into oversampling algorithm designs and observe that most of them assume the data falls into the Euclidean space. It severely limits the scope of existing oversampling methods. Non-Euclidean space data, such as graph-structured data, arise in numerous applications. Thus, this thesis progressively proposes three steps to alleviate the above issues, named Generalizing Oversampling Methods from Euclidean Space Data to Graph Structured Data for the Class Imbalance Problem. First, to reduce the noises of generated samples, we propose PABIO, an efficient position-aware safe boundary interpolation oversampling algorithm on Euclidean space data. We utilize a combined clustering algorithm, which would not cluster two dense clusters into one. Then we can safely generate new samples within the discovered clusters. Moreover, we leverage the majority class information to learn a safe boundary for generating samples. We force the synthetic samples closer to their minority neighbors than to any other majority instances in the embedding space. It avoids the generation of noisy samples, especially for the minorities far away from their minority neighbors. Second, we propose NROMM, a noise-robust oversampling algorithm, for the categorical variables and multi-class data with an underlying Euclidean space. NROMM uses a heterogeneous distance metric to calculate the difference between samples with numerical and categorical variables. One-versus-ensemble decomposition method is proposed to binarize the multi-class problem. Except for the class with the maximum number of samples, we oversample one class at a time. We ensemble the original and synthetic samples from larger classes than the class being oversampling as the majority class. We propose a cleaning strategy to alleviate the aggregated class overlapping problem in the multi-class scenario. After all the classes have been oversampled, we discover the overlapping sets and then remove the inland samples in these sets to enrich the class boundaries. Third, we propose GATSMOTE to enable the oversampling algorithm to work on one of the non-Euclidean spaces, graph-structured data. We focus on adding new edges between synthetic samples and the original graph, while preserving the locality graph structure and shortening the path length. We utilize the attention mechanism to calculate the weighted edge connections adaptively. The attention coefficients are only calculated within a given neighborhood, and thus more local graph structures are preserved. Moreover, we follow the homophily in graph theory to add edges between similar nodes. We propose to add edges between nodes sharing similar feature vectors. In addition, we propose to add edges between nodes within each class. Adding edges between nodes with identical labels shortens the path length to facilitate message passing via edges.
Subjects:	Computer algorithms Pattern recognition systems Hong Kong Polytechnic University -- Dissertations
Pages:	xv, 110 pages : color illustrations
Appears in Collections:	Thesis

Access

View full-text via https://theses.lib.polyu.edu.hk/handle/200/12239

Show full item record

Page views

306

Last Week
0

Last month

Citations as of Apr 12, 2026

Google Scholar^TM

Check

Access

Page views

Google ScholarTM

Google Scholar^TM