Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/91715
Title: Mining multi-label data : feature engineering and label learning approaches
Authors: Guo, Yumeng
Degree: Ph.D.
Issue Date: 2021
Abstract: In recent years, modeling and prediction techniques for multi-label data have emerged as a new research field in machine learning. Numerous studies have found that multi-label data, in which each instance is associated with multiple class labels simultaneously, is widespread in real-world applications. Examples include text categorization, where each document may belong to several topics; gene function determination, where each gene may be associated with a set of functional classes; and scene recognition, where each image may depict several semantic classes. As an important machine learning task, multi-label learning handles modeling and prediction for this type of data. Redundant and irrelevant information in the feature space of multi-label data greatly degrades the performance of multi-label learning, and this research focuses on how to remove the interference of such information effectively. With a focus on multi-label feature engineering, this dissertation explores how to improve the performance of multi-label learning by manipulating the feature space of multi-label data from several aspects, including multi-label feature selection and multi-label feature reconstruction. Furthermore, we explore the application of deep learning techniques to multi-label learning. The corresponding research works are summarized as follows.
(1) Filter-based multi-label feature selection methods are independent of the classifier, simple to design, and highly efficient, which makes them suited to providing a quick, basic understanding of feature importance in a multi-label dataset before more complex multi-label feature selection methods are designed. In this dissertation, we propose a filter framework for multi-label feature selection that uses the chi-square test as the evaluation measure in the experiments. Based on three methods, namely max, avg and min, we explore the characteristics of feature importance rankings on multi-label datasets from various domains. Unlike previous researchers, who focused on max and avg, we emphasize the exploration of min. By analyzing the experimental results, we find that min can effectively generate feature importance rankings for multi-label datasets from some domains, which also reflects the significance of our proposed framework.
(2) Some multi-label data analyses in bioinformatics require reliable features or feature subsets that have a physically meaningful interpretation. In this case, multi-label feature selection is preferred. In this dissertation, a novel algorithm named EEFS (multi-label Ensemble Embedded Feature Selection) is proposed to improve the performance of multi-label classifiers on multi-label bioinformatics data. EEFS randomly selects a portion of the examples from the training set and trains models on them by exploiting an ensemble method and multi-label learners. It then uses evaluation measures and the averaged training data corresponding to each feature to test these trained models iteratively. Finally, it uses prediction risk and a forward search strategy to obtain the final feature importance ranking. Experimental results show that our algorithm is significantly superior to other algorithms for generating feature importance rankings.
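Illustrative sketch of the filter framework in (1), not the dissertation's implementation. It assumes that a chi-square score is computed for each feature against each label separately and that max, avg and min combine those per-label scores into a single score per feature; the abstract does not spell this out. The function names `chi2_scores` and `rank_features` are hypothetical, and features are assumed to be non-negative (for example counts or binary indicators).

```python
import numpy as np

def chi2_scores(X, y):
    """Chi-square statistic of each non-negative feature against one binary label."""
    Y = np.column_stack([1 - y, y]).astype(float)        # (n, 2): label absent / present
    observed = Y.T @ X                                   # (2, d): feature mass per class
    expected = np.outer(Y.mean(axis=0), X.sum(axis=0))   # (2, d): mass under independence
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = (observed - expected) ** 2 / expected
    # Cells with zero expected mass also have zero observed mass; count them as 0.
    return np.nan_to_num(terms).sum(axis=0)

def rank_features(X, Y, how="min"):
    """Combine per-label chi-square scores (assumed reading of max / avg / min)."""
    per_label = np.stack([chi2_scores(X, Y[:, j]) for j in range(Y.shape[1])])  # (q, d)
    combine = {"max": np.max, "avg": np.mean, "min": np.min}[how]
    scores = combine(per_label, axis=0)
    order = np.argsort(-scores, kind="stable")           # best feature first
    return order, scores
```

Under this reading, min rewards only features that are informative for every label, whereas max rewards features that are informative for at least one label.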
(3) In multi-label learning, one common strategy adopted by existing approaches is to manipulate the label space: they exploit correlations between labels and use an identical feature representation of each instance for all labels to complete the learning task. Because algorithms designed under this strategy cannot use the feature space to distinguish the specificity of each label, we propose a novel algorithm named LSDM (multi-label learning with Label-Specific Discriminant Mapping features), which improves the performance of multi-label learning by using label specificity to reconstruct the feature space. LSDM sets several different values of a ratio parameter to control the number of cluster centers for each label, and reconstructs several feature spaces specific to each label by conducting cluster analysis on the positive and negative instance sets of that label. The label-specific features of LSDM include distance mapping features and linear embedding features. The distance mapping features mainly capture how far or near instances are from the cluster centers, while the linear embedding features describe the spatial topological information between them. Empirical tests show that combining the two types of features better exploits the clustering results. To address the problem of diverse combinations for the same label, we also employ simplified linear discriminant analysis (sLDA) to efficiently find the optimal one for each label. Comparison with other algorithms clearly demonstrates the competitiveness of LSDM.
(4) In multi-label learning, some approaches learn each label individually by using label specificity to reconstruct the feature space, which ignores the effect of correlations among labels on the feature space. Hence, we propose a novel algorithm named ATOM (multi-label learning with globAl densiTy fusiOn Mapping features), which reconstructs the feature space based on global information from all the class labels (the label set). In ATOM, cluster analysis is performed on the training instances of each class label, and all the resulting cluster centers are combined into a union. To efficiently find cluster centers while reducing irrelevant or redundant information, a density fusion technique is employed to update the cluster center union. The reconstructed feature space, based on distance mapping and linear embedding, is then built by querying the final cluster center union. Finally, a classifier is induced by a multi-label learner from the reconstructed feature space instead of the original one. Comparison with other algorithms clearly shows that a feature space reconstructed from global information can improve the performance of multi-label learning.
(5) As most deep learning models focus on single-label or multi-class learning problems, we propose a calibrated deep neural network (CDML) for multi-label learning. CDML employs Softmax regression, a generalization of logistic regression to multi-class classification. To solve the multi-label classification problem, the cross-entropy loss of the objective function is first modified for multi-label learning, and a calibrated label is then added to the datasets to distinguish which labels are relevant or irrelevant to each instance. Finally, Softmax regression is applied to predict the posterior probabilities of multiple labels simultaneously and to determine the final classification results. Compared with other algorithms, the experimental results clearly demonstrate the competitiveness of CDML.
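Illustrative sketch of how a calibrated label could separate relevant from irrelevant labels at prediction time, as described in (5). This is an assumed decision rule, not the dissertation's code: the network is taken to output q + 1 scores per instance, with the last column belonging to the added calibrated label, and a real label is predicted as relevant when its Softmax probability exceeds that of the calibrated label. The modified cross-entropy loss is not specified in the abstract and is therefore omitted. The function name `predict_with_calibrated_label` is hypothetical.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over a 2-D array of scores."""
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def predict_with_calibrated_label(logits):
    """logits: (n, q + 1) array; the last column is the calibrated label (assumption)."""
    probs = softmax(logits)
    threshold = probs[:, -1:]              # per-instance probability of the calibrated label
    relevant = (probs[:, :-1] > threshold).astype(int)
    return relevant, probs
```

Because the threshold is the calibrated label's own probability, it adapts to each instance instead of being a fixed global cut-off.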
Subjects: Machine learning
Data mining
Hong Kong Polytechnic University -- Dissertations
Pages: xviii, 142 pages : color illustrations
Appears in Collections: Thesis

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.