Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/88391
Title: Statistical learning with empirical features and data of different types
Authors: Qin, Huihui
Degree: Ph.D.
Issue Date: 2020
Abstract: The thesis consists of three parts that cover different aspects of statistical learning for data mining. In the first part, we propose a new algorithm, LESS (Learning with Empirical feature-based Summary statistics from Semi-supervised data), which uses only summary statistics instead of raw data for regression learning. The extensive collection and analysis of data is raising widespread privacy concerns and increasing tensions between potential data sources and researchers. A privacy-friendly learning framework can help ease these tensions and free up more data for research. In LESS, the selection of empirical features serves as a trade-off between prediction precision and the protection of privacy. We show that LESS achieves the minimax optimal rate of convergence in terms of the size of the labeled sample. LESS extends naturally to applications where data are held separately by different sources. Compared with the existing literature on distributed learning, LESS removes the restriction on the minimum sample size of individual data sources. In the second part of the thesis, we study different approaches for analyzing topics in text data. Topic modeling has been an important field in natural language processing (NLP) and has recently witnessed great methodological advances. Yet the development of topic modeling is still, if not increasingly, challenged by two critical issues. First, despite intense efforts toward nonparametric/post-training methods, the search for the optimal number of topics K remains a fundamental question in topic modeling and warrants input from domain experts. Second, with the development of more sophisticated models, topic modeling is now, ironically, treated as a black box, and it is becoming increasingly difficult to tell how research findings are informed by data, model specifications, or inference algorithms.
Based on about 120,000 newspaper articles retrieved from three major Canadian newspapers (Globe and Mail, Toronto Star, and National Post) since 1977, we employ five methods with different model specifications and inference algorithms (Latent Semantic Analysis, Latent Dirichlet Allocation, Principal Component Analysis, Factor Analysis, Non-negative Matrix Factorization) to identify discussion topics. The optimal topics are then assessed using three measures: coherence statistics, held-out likelihood (loss), and graph-based dimensionality selection. Mixed findings from this research complement advances in topic modeling and provide insights into the choice of optimal topics in social science research. In the third part, we consider the generalized linear hurdle model with grouped and right-censored count data. Data of this type are widely used in demography, epidemiology, sociology, criminology, psychology, and many other branches of the social sciences. The corresponding generalized linear model and the zero-inflated model have recently drawn much attention. In this part, we study the hurdle model, which covers not only zero inflation but also zero deflation. We provide sufficient conditions for the consistency and asymptotic normality of the maximum likelihood estimator. We represent the Fisher information matrix of the hurdle model in terms of that of the vanilla grouped and right-censored model. We provide an elegant necessary and sufficient condition for the Fisher information matrix of the hurdle model to be strictly positive definite. The research complements recent developments in statistical inference with grouped and right-censored count data.
Subjects: Mathematical statistics
Machine learning
Data mining
Hong Kong Polytechnic University -- Dissertations
Pages: x, 67 pages : color illustrations
Appears in Collections: Thesis

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.