Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/91725
Title: Feature representation for large-scale data set
Authors: Hu, Yanxing
Degree: Ph.D.
Issue Date: 2021
Abstract: Feature representation is one of the most important research topics in machine learning (ML). In machine learning, feature representation means mapping raw data into a new feature space that can be exploited effectively in learning tasks. Many supervised and unsupervised approaches, including supervised dictionary learning, fuzzy and rough logic, Principal Component Analysis (PCA), and locally linear embedding, have been employed for feature representation on different types of data sets. The arrival of the big data era brings both opportunities and challenges to the study of feature representation. In real applications, the scale and complexity of the data far exceed those of earlier scenarios. On the one hand, the large volume of data enables more complicated models to be employed for feature representation; on the other hand, multiple data sources, complicated data structures, and high computational requirements bring new difficulties to feature representation for huge data sets. This study concentrates on the feature representation problem for large-scale data sets and related applications, and proposes new algorithms so that the resulting feature mappings yield better results in machine learning tasks. Our study starts with feature representation for data sets with discrete values, whose features often carry categorical information about the data points. We address this kind of data with a novel rough set-based feature reduction approach that efficiently and reliably extracts the necessary information in the features while removing redundant information from the data set.
Our second work provides a matrix decomposition-based unsupervised pre-training approach for feature representation. One important family of unsupervised feature representation approaches is based on clustering models. However, clustering approaches are time-consuming, especially for large-scale data sets. An eigenvector-based unsupervised pre-training approach is therefore proposed for feature representation and used as the first layer of a Radial Basis Function Neural Network (RBFNN). Our third work concentrates on feature representation for data from multiple sources/views. A canonical correlation-based autoencoder model is proposed for feature fusion representation on multi-domain data sets. The proposed model is then applied to wind speed forecasting to improve forecasting accuracy. Finally, we propose a localized generalization error-based data reduction approach. This approach can reliably reduce the training set for some large-scale data sets, which offers an idea for large-scale learning tasks. It is closely related to the distribution of the values of each feature, and this work shows that the representation of the features can affect the number of training samples needed. In summary, we make the following contributions: (i) algorithms and applications for feature representation on different types of large-scale data sets; (ii) a multi-domain feature fusion approach and its applications; (iii) algorithms for computing the safe regions for the sum-optimal point notification problem.
Subjects: Machine learning
Algorithms -- Data processing
Data mining
Hong Kong Polytechnic University -- Dissertations
Pages: xviii, 175 pages : color illustrations
Appears in Collections: Thesis

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.