Please use this identifier to cite or link to this item: http://hdl.handle.net/10397/88046
Title: Supervised statistical inference for data of versatile dimensionality with application to GWAS studies
Authors: Xu, Sheng
Degree: Ph.D.
Issue Date: 2020
Abstract: Genome-wide association studies (GWAS) have, over the past two decades, been a successful strategy for gaining biological insight into diseases in epigenetics and epigenomics by linking diseases or their traits with genomic variants, environmental confounders, and clinically relevant information. The accompanying data are typically of versatile dimensionality and complex structure, posing exciting challenges and opportunities for new statistical methodology and inference, coupled with new modeling and effective computing implementation. The thesis consists of three parts and addresses several important regression problems of estimation, hypothesis testing, and classification arising from the prevailing GWAS data pool, to meet the increasing need for statistical analytic toolsets. Part I focuses on regression with censored survival outcomes and is motivated by data on diffuse large B-cell lymphoma (DLBCL), which combine a large number of gene expression variants with the censored survival times of patients at a low sample size. This calls for efficient algorithms for feature screening and careful statistical inference for the selected subset of influential variables after dimensionality reduction. In Chapter 2, we present the non-monotone proximal gradient (NPG) algorithm to speed up sure joint screening for the ultrahigh-dimensional Cox proportional hazards model and prove its convergence with a LASSO initializer. The accompanying R package, coxnpgsjs, selects a designated number of influential gene variants from the DLBCL data quickly and efficiently. In Chapter 3, we investigate the impact of such a subset of genetic factors on survival time through the single-index hazard (SIH) semiparametric regression model. The SIH model is robust, but efficient statistical inference for it is challenging owing to the nested single-index structure.
We propose a censored version of multiple local linear regression to obtain a uniformly consistent estimator of the nonparametric component and to attain the semiparametric efficiency bound for the profile likelihood estimator of the parametric component. Two classes of estimating equations are derived as practical alternatives to the score equation from the perspective of double robustness. The proposed methods and results are applied to estimate the gene effects on the aforementioned lymphoma and to test their significance. Part II focuses on regression with sparse longitudinal responses and is motivated by large-scale longitudinal GWAS for Alzheimer's disease, which aim to detect single-nucleotide polymorphism (SNP) level genotype effects on the phenotype response. The wide community of researchers in this area urgently needs powerful test procedures that detect significance at the GWAS P-value significance threshold. To compare multiple treatments, Chapters 4 and 5 present practical strategies for bootstrap procedures and apply them successfully to models with Gaussian and non-Gaussian phenotype responses and massive SNP-level genotype data. This unveils some interesting associations, significant at the GWAS level, between genetic effects and the disease in the well-known Alzheimer's Disease Neuroimaging Initiative (ADNI) cohort. Part III focuses on regression with binary outcomes and is motivated by the need to label multiple sclerosis precisely in a population whose projection scores are skewed. In Chapter 6, we define a general distance that incorporates existing optimal functional classifiers and explain why our proposed quantile classifier is robust. The optimality property of near-perfect classification is derived. The accompanying classification procedure is fast and accurate. A Shiny app is built for the convenience of clinical practitioners.
Subjects: Genomics -- Statistical methods
Genomics -- Data processing
Hong Kong Polytechnic University -- Dissertations
Pages: xiii, 204 pages : color illustrations
Appears in Collections: Thesis