Advances in similarity-based prediction modeling

Kim, Minzee

dc.contributor.author	Kim, Minzee
dc.date.accessioned	2026-04-28T13:03:01Z
dc.date.available	2026-04-28T13:03:01Z
dc.date.issued	2026-04-28
dc.date.submitted	2026-04-21
dc.description.abstract	Personalized predictive modeling has grown rapidly in recent years, especially with the availability of Electronic Health Records (EHRs). This approach aims to improve predictive performance by fitting a unique model to each individual. The model is trained on a subset of the training data consisting of individuals who are similar to the individual we are predicting for, identified through a patient similarity metric. Studies have shown that a personalized model trained on such a customized subset leads to better prediction than a global model trained on all available training data. In this thesis, we discuss advancements in similarity-based prediction modeling through extensive simulation studies and data analyses. Longitudinal and time-to-event data are often analyzed in biomarker research to study the association between longitudinal biomarker measurements and an event-time outcome, where the longitudinal information contributes to the probability of the outcome of interest. An attractive feature of fitting a joint model to this type of data is that the survival probability can be predicted dynamically as additional longitudinal information becomes available. In Chapter 2, we propose a new similarity-based method for the dynamic prediction of joint models, in which the model is trained on only a targeted subset of the data to obtain an improved outcome prediction. Through a comprehensive simulation study and an application to intensive care unit data on patients diagnosed with sepsis, we demonstrate that the predictive performance of the dynamic prediction of joint models can be improved with our proposed similarity-based approach. Next, we develop a new patient similarity metric designed to improve the predictive performance of a personalized model for binary response data.
Specifically, we introduce a weighted cosine similarity metric in Chapter 3 that extends the standard cosine similarity metric by assigning predictor-specific weights when computing similarity between participants. These weights are estimated using the relaxed adaptive group lasso. Results from our simulation study and an analysis of intensive care unit data involving patients with circulatory system disease show that although the proposed similarity metric leads to a slight deterioration in calibration, it produces substantial gains in discrimination. Overall predictive performance, as measured by the Brier score, improves because the gain in discrimination outweighs the loss in calibration; the proposed similarity metric therefore identifies clinically similar patients more effectively, resulting in improved predictive accuracy. In Chapter 4, we conduct a comprehensive comparison of several similarity metrics to investigate how the choice of similarity metric influences predictive performance in personalized modeling, again in the context of binary response data. By fitting models using only the subset of training participants who are most similar to the individual of interest, prediction accuracy for that individual can be improved. Consequently, selecting a similarity metric that identifies the most relevant subset of data is critical. We compare a range of distance-based and cosine similarity measures alongside clustering-based approaches, an area that is not well explored in the existing literature. In addition, we perform an extensive simulation study to examine how different data-generating mechanisms and underlying dataset characteristics affect the relative effectiveness of each similarity metric. We conclude with a discussion chapter that summarizes the key contributions of the thesis and highlights key areas for future work.
dc.identifier.uri	https://hdl.handle.net/10012/23072
dc.language.iso	en
dc.pending	false
dc.publisher	University of Waterloo	en
dc.subject	personalized prediction
dc.subject	precision medicine
dc.subject	joint modeling
dc.subject	similarity metric
dc.subject	cosine similarity
dc.subject	clustering
dc.subject	similarity-based modeling
dc.title	Advances in similarity-based prediction modeling
dc.type	Doctoral Thesis
uws-etd.degree	Doctor of Philosophy
uws-etd.degree.department	Statistics and Actuarial Science
uws-etd.degree.discipline	Statistics
uws-etd.degree.grantor	University of Waterloo	en
uws-etd.embargo.terms	4 months
uws.contributor.advisor	Dubin, Joel
uws.contributor.affiliation1	Faculty of Mathematics
uws.peerReviewStatus	Unreviewed	en
uws.published.city	Waterloo	en
uws.published.country	Canada	en
uws.published.province	Ontario	en
uws.scholarLevel	Graduate	en
uws.typeOfResource	Text	en
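
The abstract describes a cosine similarity that applies predictor-specific weights and then uses the most similar training participants to fit a personalized model. The following is a minimal sketch of that idea, not the thesis's implementation. It assumes one common form of weighted cosine similarity (each predictor j scaled by the square root of a non-negative weight w[j]) and takes the weights as given rather than estimating them with the relaxed adaptive group lasso. The function names `weighted_cosine_similarity` and `most_similar_subset`, and the toy data, are hypothetical.

```python
import numpy as np


def weighted_cosine_similarity(x, y, w):
    """Cosine similarity with non-negative per-predictor weights w.

    Assumed form (not taken from the thesis):
        sum_j w_j x_j y_j / (sqrt(sum_j w_j x_j^2) * sqrt(sum_j w_j y_j^2))
    With all weights equal to 1 this is the standard cosine similarity.
    """
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    num = np.sum(w * x * y)
    den = np.sqrt(np.sum(w * x * x)) * np.sqrt(np.sum(w * y * y))
    # Return 0 when either weighted vector has zero length.
    return 0.0 if den == 0.0 else float(num / den)


def most_similar_subset(target, train_X, w, k):
    """Indices of the k training rows most similar to the target row."""
    sims = np.array(
        [weighted_cosine_similarity(target, row, w) for row in train_X]
    )
    # Sort by descending similarity; a stable sort keeps ties in row order.
    return np.argsort(-sims, kind="stable")[:k]


# Toy data: three training participants, two predictors (hypothetical values).
train_X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
target = np.array([1.0, 0.2])

equal_weights = np.array([1.0, 1.0])   # standard cosine similarity
second_only = np.array([0.0, 1.0])     # ignore the first predictor

subset_equal = most_similar_subset(target, train_X, equal_weights, k=2)
subset_second = most_similar_subset(target, train_X, second_only, k=2)
print(subset_equal, subset_second)
```

Changing the weights changes which participants are selected for the training subset, which is the mechanism through which a weighted metric could alter the personalized model's predictions.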

Files

Original bundle

Name: Kim_Minzee.pdf
Size: 8.79 MB
Format: Adobe Portable Document Format
