Speaker: Yujia Li (李宇家), University of Pittsburgh, USA
Time: April 19, 09:00 AM
Venue: Tencent Meeting, ID 5799224321 (no password)
Abstract:
With the rapid advancement of high-throughput technologies, a large amount of high-dimensional omics data has been generated in the public domain, which gives rise to various statistical and computational challenges in the cluster and association analysis of omics data. This talk focuses on estimation of tuning parameters in cluster analysis (Part I) and disease subtyping issues (Part II) in high-dimensional omics studies.
In Part I, we propose a resampling framework called S4 for selecting tuning parameters in cluster analysis. Estimating the number of clusters (K) is a critical and often difficult task. Many methods have been proposed to estimate K; among them, S4 is the best performer. S4 measures the similarity (i.e., stability) between the clustering result on the whole data and the results on subsampled data, and selects the K with the highest stability score, based on the belief that the true underlying K yields stable clustering results when the data structure is perturbed by subsampling. When clustering high-dimensional omics data, many irrelevant features exist and may interfere with detection of the true cluster structure, so feature selection is often needed to improve performance and interpretability. Witten and Tibshirani (2010) proposed a sparse K-means approach with lasso regularization on feature-specific weights to tackle this problem, in which both the number of clusters K and the sparsity parameter lambda must be estimated in advance. To the best of our knowledge, little work has addressed the simultaneous estimation of these two parameters. We extend S4 to bridge this gap, and it shows superior performance in extensive simulations and nine real applications.
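The stability idea described above can be illustrated with a minimal sketch. This is not the S4 procedure itself: it assumes a plain K-means with farthest-first initialisation, the Rand index as the similarity measure, and 80% subsampling, and every function name and parameter below (`kmeans`, `rand_index`, `stability_scores`, `make_data`, `n_sub`, `frac`) is hypothetical. It also ignores feature selection and the sparsity parameter lambda.

```python
import random
from itertools import combinations


def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=50):
    """Plain Lloyd's K-means with deterministic farthest-first initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from all centres chosen so far.
        centers.append(max(points, key=lambda p: min(dist2(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def rand_index(a, b):
    """Fraction of point pairs on which two labelings agree (together vs apart)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def stability_scores(points, candidates, n_sub=20, frac=0.8, seed=0):
    """Average agreement between full-data and subsample clusterings for each K."""
    rng = random.Random(seed)
    n, m = len(points), int(frac * len(points))
    scores = {}
    for k in candidates:
        full = kmeans(points, k)
        total = 0.0
        for _ in range(n_sub):
            idx = rng.sample(range(n), m)  # perturb the data by subsampling
            sub = kmeans([points[i] for i in idx], k)
            total += rand_index(sub, [full[i] for i in idx])
        scores[k] = total / n_sub
    return scores


def make_data(seed=1, per_cluster=20):
    """Toy data (an assumption for illustration): three well-separated 2-D clusters."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
    return [(cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
            for cx, cy in centers for _ in range(per_cluster)]


points = make_data()
scores = stability_scores(points, candidates=[2, 3, 4, 5])
best_k = max(scores, key=scores.get)  # ties go to the smallest candidate K
print(scores, best_k)
```

On this toy data the true K = 3 is recovered identically in every subsample, so its stability score is perfect. A naive score like this can also rate a coarser but consistent partition as equally stable, which is one reason a practical method needs more than this sketch provides.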
In Part II, we propose a novel outcome-guided disease subtyping framework based on a weighted joint likelihood approach (named ogClust_WJL). Traditionally, conventional cluster analysis (e.g., K-means) is used to identify subgroups of patients with similar expression patterns, without considering outcome information. As a result, the identified subgroups can be irrelevant to the clinical outcome of interest. Liu et al. (2020) proposed incorporating outcome information into cluster analysis through a unified generative model (named ogClust_GM). However, ogClust_GM lacks the flexibility to tune the relative contributions of outcome association and gene clustering separation. In practice, the identified clusters are often dominated by outcome association, so the resulting disease subtyping model overfits and does not perform well on independent validation data. Our proposed ogClust_WJL takes a user-defined weight as input to control the contributions of outcome association and gene clustering separation; by finely tuning this weight, potential overfitting can be avoided.
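To make the role of the weight concrete, here is one generic way a weighted joint objective could be written. This is only an illustrative assumption, not the objective actually used in ogClust_WJL, which the abstract does not specify. The symbols $w$, $\theta$, $\ell_{\text{outcome}}$ and $\ell_{\text{cluster}}$ are all hypothetical.

```latex
% Illustrative assumption only: a convex combination of two log-likelihood terms.
\ell_w(\theta) \;=\; w \,\ell_{\text{outcome}}(\theta) \;+\; (1 - w)\,\ell_{\text{cluster}}(\theta),
\qquad w \in [0, 1].
```

Under this form, $w = 0$ reduces to clustering driven purely by gene expression, while $w$ close to 1 lets outcome association dominate, which is the regime the abstract associates with overfitting.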
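Biography:
Yujia Li (李宇家) holds a PhD in biostatistics. From 2013 to 2017, Li studied at 伟德国际1946源自英国 and received a bachelor's degree; from 2017 to 2022, Li pursued a PhD at the University of Pittsburgh, USA, with a research focus on biostatistics. During the PhD, Li published multiple papers in journals including Biometrics and Biostatistics, and received awards including the 2019 ENAR Distinguished Student Paper Award, the 2021 ASA Pittsburgh Chapter Student of the Year (honorable mention), and the Delta Omega Dissertation Award.
Contact: Da Zhou (周达)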