Statistics and Actuarial Science
Permanent URI for this collection: https://uwspace.uwaterloo.ca/handle/10012/9934
This is the collection for the University of Waterloo's Department of Statistics and Actuarial Science.
Research outputs are organized by type (e.g., Master's Thesis, Article, Conference Paper).
Waterloo faculty, students, and staff can contact us or visit the UWSpace guide to learn more about depositing their research.
Browsing Statistics and Actuarial Science by Author "Diao, Liqun"
Now showing 1 - 4 of 4
Item: Correlated Data Analysis with Copula Models or Bayesian Nonparametric Methods (University of Waterloo, 2021-01-20)
Zhuang, Haoxin; Diao, Liqun; Yi, Grace Y.

Different types of correlated data arise commonly in many studies and present considerable challenges in modeling and characterizing complex dependence structures. This thesis considers statistical issues in analyzing such data. Chapters 2-4 develop models that account for complex dependence structures and propose new statistical inference methods; in particular, attention focuses on copula models and their variants for delineating association structures in dependent data. As "big data" finds increasingly versatile applications in many fields, more and more data with irregular distributions emerge, calling for more flexible and robust nonparametric statistical methods. Chapters 5 and 6 develop novel Bayesian nonparametric sampling algorithms and regression models.

In Chapter 2, we consider longitudinal data observed over a time span; common examples include temperature and precipitation data. We utilize a vine copula model to account for the dependence among longitudinal responses: the joint distribution of the responses is factorized as a product of marginal distributions and bivariate conditional copulas. To relieve the computational burden and concentrate on the structure of interest, we propose composite likelihood methods that divide the responses into time blocks and leave the connecting structure between time blocks unspecified. We explore the efficiency, robustness, model selection, and prediction properties of the proposed methods through simulation studies, and apply the model to an Ontario temperature dataset.

In Chapter 3, we consider dependent data with a hierarchical structure. Analysis of such data is often challenging because of the complexity of modeling different dependence structures and the demand for intensive computational resources. To alleviate these issues, we propose a Bayesian hierarchical copula model (BHCM) to accommodate the hierarchical structure of the data, in which subject-level dependence is captured by a copula-based model and the hierarchical structure is described using random dependence parameters. We introduce a layer-by-layer sampling scheme for conducting inference. The BHCM enjoys the flexibility to model various complex association structures while retaining manageable computation. Extensive simulation studies show that the proposed estimators outperform conventional likelihood-based estimators in finite-sample settings. We apply the BHCM to the Vertebral Column dataset from the UCI Machine Learning Repository.

In Chapter 4, we consider dependent data coming from multiple sources, where we aim to group similar dependence structures together and then conduct model selection and parameter estimation based on copula models. We propose a mixture of Dirichlet process mixture copula model (M-DPM-CM) to identify similar dependence structures and select copula models, in which the model selection parameters and copula parameters are assigned a Dirichlet process prior. Simulation studies and data analysis are conducted to compare the M-DPM-CM with the conventional copula selection method based on the AIC criterion.
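As general background for the copula-based developments in Chapters 2-4 (standard notation, not a formula quoted from the thesis), Sklar's theorem factorizes a bivariate joint distribution into its margins and a copula, and vine constructions extend this pairwise to higher dimensions:

\[
F(y_1, y_2) = C\{F_1(y_1), F_2(y_2)\}, \qquad
f(y_1, y_2) = c\{F_1(y_1), F_2(y_2)\}\, f_1(y_1)\, f_2(y_2),
\]

where \(C\) is the copula, \(c\) its density, and \(F_1, F_2\) (\(f_1, f_2\)) the marginal distribution (density) functions.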
The results show that the M-DPM-CM can accurately recover the true grouping structure with a moderate sample size and achieves more accurate model selection than the conventional AIC-based method. The M-DPM-CM is also applied to the Vertebral Column dataset used in Chapter 3 to obtain further insight into its dependence structures.

In Chapter 5, we focus on developing algorithms for sampling from a complex distribution. To remedy limitations of Markov chain Monte Carlo (MCMC) algorithms, we propose a novel sampling method called Polya tree Monte Carlo (PTMC). PTMC feasibly approximates the posterior Polya tree by Monte Carlo, and we justify theoretically that the approximate Polya tree posterior converges to the target distribution under regularity conditions. We further propose a series of simple and efficient sampling algorithms useful in different scenarios. Extensive numerical studies demonstrate the appealing performance of the proposed method under various settings, including its superiority to the usual MCMC algorithms; the evaluation and comparison are carried out in terms of sampling efficiency, computational speed, and the capacity to identify distribution modes.

In Chapter 6, we consider nonparametric regression models. The Polya tree (PT) based nearest-neighbor regression model is introduced as a fully nonparametric regression method. To approximate the true conditional probability measure of the response given a covariate value, we construct a PT-distributed probability measure of the response in the nearest neighborhood of the covariate value of interest. The proposed method gives consistent and robust estimators and has a faster convergence rate than kernel density estimation. We conduct extensive simulation studies and analyze the Combined Cycle Power Plant dataset to compare the performance of our method with other nonparametric and semiparametric methods. The studies suggest that the proposed method is superior to the kernel and PT density estimation methods in terms of estimation accuracy and convergence rate, and to LDTFP in terms of robustness.

Summary remarks and a discussion of future research topics are presented in Chapter 7.

Item: Marginal Causal Sub-Group Analysis with Incomplete Covariate Data (University of Waterloo, 2019-01-11)
Cuerden, Meaghan; Cook, Richard; Cotton, Cecilia; Diao, Liqun

Incomplete data arise frequently in health research studies designed to investigate the causal relationship between a treatment or exposure and a response of interest. Statistical methods for conditional causal effect parameters in the setting of incomplete data have been developed, and we expand upon these methods for estimating marginal causal effect parameters. This thesis focuses on the estimation of marginal causal odds ratios, which are distinct from the conditional causal odds ratios in logistic regression models; marginal causal odds ratios are frequently of interest in population studies. We introduce three methods for estimating the marginal causal odds ratio of a binary response at different levels of a subgroup variable when the subgroup variable is incomplete. In each chapter, the subgroup variable, exposure variable, and response variable are binary, and the subgroup variable is missing at random.

In Chapter 2, we begin with an overview of inverse probability weighted methods for confounding in an observational setting where data are complete.
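For orientation (standard notation, not the estimating equations developed in the thesis), with complete data the confounding weight is built from the propensity score \(e(X) = P(A = 1 \mid X)\), and weighted response proportions identify the marginal causal odds ratio for a binary treatment \(A\) and binary response \(Y\):

\[
w_i = \frac{A_i}{e(X_i)} + \frac{1 - A_i}{1 - e(X_i)}, \qquad
\hat{p}_a = \frac{\sum_i w_i\, I(A_i = a)\, Y_i}{\sum_i w_i\, I(A_i = a)}, \qquad
\widehat{\mathrm{OR}} = \frac{\hat{p}_1 / (1 - \hat{p}_1)}{\hat{p}_0 / (1 - \hat{p}_0)}.
\]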
We also briefly review methods for handling incomplete data in a randomized setting. We then introduce a doubly inverse probability weighted estimating equation approach to estimate marginal causal odds ratios in an observational setting where an important subgroup variable is incomplete. One inverse probability weight accounts for the incomplete data and the other for treatment selection; only complete cases are included in the response model. Consistency results are derived, and a method for estimating the asymptotic standard error is introduced that incorporates the extra variability introduced by estimating the two weights. We give methods for hypothesis testing and for calculating confidence intervals. Simulation studies show that the doubly weighted estimating equation approach is effective in a non-ignorable missingness setting with confounding and is straightforward to implement; it also performs well when the missing data process is ignorable and/or when confounding is not present.

In Chapter 3, we begin with an overview of an EM algorithm approach for estimating conditional causal effect parameters with incomplete covariate data, in both randomized and observational settings. We then propose a doubly weighted EM-type algorithm to estimate the marginal causal odds ratio in the setting of missing subgroup data. In this method, instead of a complete case analysis in the response model, all available data are used and the incomplete subgroup variable is "filled in" using a maximum likelihood approach. Two inverse probability weights are again used, to account for confounding and incomplete data; the weight accounting for the incomplete data is needed, even though an EM approach is being used, because the marginal causal odds ratio is of interest. A method for obtaining asymptotic standard error estimates is given that incorporates the extra variability from estimating the two inverse probability weights as well as from estimating the conditional expectation of the incomplete subgroup variable. Simulation studies show that this method yields consistent estimates of the parameters of interest; however, it is difficult to implement, and in certain settings there is a loss of efficiency compared with the methods introduced in Chapter 2.

In Chapter 4, we begin by reviewing multiple imputation methods in randomized and observational settings where estimation of the conditional causal odds ratio is of interest. We then propose multiple imputation combined with one inverse probability weight to account for confounding in an observational setting where the subgroup variable is incomplete. We discuss how to correctly specify the imputation model when the conditional causal odds ratio is of interest, as well as when the marginal causal odds ratio is of interest. We use standard methods for combining the estimates of the marginal log odds ratios from the imputed datasets, and we propose an estimator of the asymptotic standard error that incorporates both the estimation of the parameters in the confounding weight and the variation across the multiply imputed datasets. We give methods for hypothesis testing and for calculating confidence intervals.
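The standard combining step referred to above is usually carried out with Rubin's rules (stated here for completeness; the thesis's variance estimator additionally accounts for the estimated confounding weight). With \(M\) imputed datasets yielding estimates \(\hat{\beta}_m\) of the marginal log odds ratio and within-imputation variances \(U_m\),

\[
\bar{\beta} = \frac{1}{M} \sum_{m=1}^{M} \hat{\beta}_m, \qquad
T = \bar{U} + \Bigl(1 + \frac{1}{M}\Bigr) B, \qquad
\bar{U} = \frac{1}{M} \sum_{m=1}^{M} U_m, \quad
B = \frac{1}{M-1} \sum_{m=1}^{M} (\hat{\beta}_m - \bar{\beta})^2 .
\]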
Simulation studies show that this method is efficient and straightforward to implement, but correct specification of the imputation model is necessary.

In Chapter 5, the three proposed methods are applied to an observational cohort study of 418 colorectal cancer patients. We compare patients who received an experimental chemotherapy with patients who received standard chemotherapy; of interest is the marginal causal odds ratio of a thrombotic event during the course of treatment or within 30 days after treatment is discontinued. The important subgroups are (i) patients receiving first-line treatment and (ii) patients receiving second-line treatment.

In Chapter 6, we compare and contrast the three proposed methods. We also discuss extensions to different response models, models for missing response data, and weighted models in the longitudinal data setting.

Item: Mortality Prediction using Statistical Learning Approaches (University of Waterloo, 2022-11-21)
Meng, Yechao; Weng, Chengguo; Diao, Liqun

Longevity risk, one of the major risks faced by insurers, has triggered an active stream of research on mortality modeling among actuaries, aimed at the effective design, pricing, and risk management of insurance products. Borrowing a "proper" amount of information from populations with similar structures is widely acknowledged as a useful strategy for improving the accuracy of mortality prediction for a target population, and it has been explored and exploited by the actuarial community. Determining the "proper" amount of information, however, amounts to a trade-off between the gains from including relevant signals and the adverse impact of introducing irrelevant noise. Conventional solutions resort to multiple sources of exogenous data and involve substantial manual feature engineering without guaranteeing an improvement in prediction accuracy. In this thesis, we therefore design fully data-driven frameworks that effectively screen out useful hidden information from different aspects to enhance the accuracy of mortality prediction, with the assistance of various statistical learning approaches.

First and foremost, Chapter 2 sheds light on how to select a "proper" group of populations from a given pool so that a multi-population mortality model yields improved prediction accuracy. We design a fully data-driven framework, based on a Deletion-Substitution-Addition algorithm, that automatically recommends a group of populations for joint modeling through a multi-population model. The procedure avoids excessive reliance on subjective decisions in the group selection, and extensive numerical studies demonstrate its superior mortality prediction performance compared with several conventional strategies for the population selection problem.

Chapter 3 also focuses on borrowing information effectively from a given pool of populations to enhance mortality prediction accuracy, in a computationally efficient manner. We propose a bivariate-model-based ensemble framework that aggregates predictions using the joint information from each pair of populations in the pool. In addition, we introduce a time-shift parameter into the base-learner mortality model for extra flexibility.
This additional parameter characterizes the time by which one population is ahead of or behind another in its mortality development and allows information to be borrowed from populations at disparate mortality development stages. Empirical studies confirm the effectiveness of the proposed framework.

In Chapter 4, we extend the idea of borrowing information by shifting the focus from populations to ages, providing insight into detecting similarities in age-specific mortality patterns and borrowing the information hidden in those similarities. We propose a prediction framework in which the overall prediction goal is decomposed into multiple individual tasks, each searching for an age-specific age band so that the mortality prediction for each target age benefits from borrowing information across ages to the largest extent. Extensive empirical studies with the Human Mortality Database reveal noticeable differences among target ages in how they borrow information from other ages, and confirm an overall improvement in prediction accuracy for most ages, especially the adult and retiree groups.

In Chapter 5, information across different ages and different populations is considered simultaneously. We extend the idea of borrowing information among ages to the multi-population case and propose three approaches: a distance-based approach, an ensemble-based approach, and an ACF model-based approach. Empirical studies with real mortality data compare their predictive performance and the significance of their improvements in prediction accuracy relative to several benchmark models, and we report several stylized facts about how ages from multiple populations are borrowed by the distance-based method.

Finally, Chapter 6 briefly outlines directions worth further exploration that follow from each chapter, along with some research ideas less closely related to the preceding chapters.

Item: Topics in Study Design and Analysis Involving Incomplete Data (University of Waterloo, 2021-07-27)
Yang, Ce; Cook, Richard; Diao, Liqun

Incomplete data are a common occurrence in statistics, arising in various types and through various mechanisms, each of which can have a significant effect on statistical analysis and inference. This thesis tackles several statistical issues in study design and analysis involving incomplete data.

The first half of the thesis deals with incomplete observation of the responses. In medical studies, events of interest are typically under intermittent observation schemes, for example detected through periodic clinical examinations. As a result, an event is only known to have occurred within an interval, and the resulting interval-censored data hinder the application of numerous analysis tools. Although one may presume the event time to lie at the endpoint or the midpoint of the interval, such ad hoc imputations are known to lead to invalid inferences. In Chapter 2, we propose appropriate imputations via censoring unbiased transformations and pseudo-observations of the incomplete responses to facilitate straightforward use of prevalent machine learning algorithms. The former technique helps preserve the conditional mean structure in the presence of censoring, and the latter originates from bias-corrected jackknife estimates.
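In their usual form (generic notation; the thesis adapts the idea to interval censoring), jackknife pseudo-observations replace an incompletely observed response with

\[
\hat{\theta}_i = n\, \hat{\theta} - (n - 1)\, \hat{\theta}^{(-i)}, \qquad i = 1, \dots, n,
\]

where \(\hat{\theta}\) is an estimator of the marginal quantity of interest computed from all \(n\) subjects and \(\hat{\theta}^{(-i)}\) is the same estimator with subject \(i\) removed; the \(\hat{\theta}_i\) can then be fed to standard regression machinery as if they were complete responses.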
For a continuous response, both proposed imputations lead to regression tree models with the same expected L2 loss as those fitted from complete observations, so prediction and variable selection follow naturally. Unlike most survival trees in the literature, the proposed models do not rely on the widely made proportional hazards assumption, and they reduce to ordinary regression trees in the absence of censoring. Survivor function estimates for the interval-censored data are required to employ the imputations; various semiparametric and nonparametric approaches are considered and compared, and the case of current status data is scrutinized in a separate section.

The second half of the thesis addresses incomplete covariate data that are missing by design. Controlled by the investigators, the missingness arises from budgetary constraints on measuring an "expensive exposure variable" in real-life scenarios. We focus on the well-known two-phase studies, which exploit the response and inexpensive auxiliary information on the population to select a phase II sub-sample for collection of the expensive covariate. In Chapter 3, we examine an adaptive two-phase design that avoids the need for external pilot data. Dividing the phase II sub-sampling into multiple interim stages, we use conventional sampling to select a fraction of the individuals of the phase II sub-sample, which then provides the information required to construct an optimal sub-sample from those remaining so as to achieve maximum statistical efficiency subject to sampling constraints. Such adaptive two-phase designs extend naturally to multiple stages in phase II and are applicable when a surrogate of the exposure variable is available. Efficiency and robustness are investigated under various frameworks of analysis. As expected, the maximum likelihood approach, which models the nuisance covariate distribution, tends to be more efficient, whereas inverse probability weighted estimating equations, which avoid this modeling, tend to be more robust to misspecification of the nuisance covariate models; the conditional maximum likelihood approach, to our delight, strikes a balance between the two. The desire to gain efficiency while maintaining a degree of robustness further drives us to explore semiparametric methods in all the analyses and designs.

Chapter 4 onward turns to more complicated settings in which covariates are missing in a sequence of two-phase studies, with multiple responses and sampling constraints, conducted on a common platform. For a given two-phase study, we exploit not only the responses and auxiliary covariates at hand but also information passed on from earlier studies. We consider joint response models and perform secondary analyses of a new response using previously studied exposure variables; moreover, the exposure variables acquired from earlier studies serve as pilot data to help construct an optimal selection model for an upcoming two-phase study. As we assess the balance between efficiency and robustness of the analysis methods, potential misspecification of the joint response model warrants attention. Finally, we note that the work can be extended in Chapter 5 to deal with two-phase response-dependent sampling with longitudinal data.
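To make the two-phase sampling and weighting concrete, the following is a minimal, illustrative sketch (hypothetical variable names and data-generating values, not code from the thesis): phase I records a binary response and a cheap auxiliary covariate for everyone, phase II measures the expensive exposure on a subsample drawn with known, response-dependent probabilities, and the analysis reweights the phase II cases by the inverse of those probabilities.

    # Hypothetical two-phase design with inverse-probability-weighted logistic regression.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 5000
    z = rng.binomial(1, 0.4, n)                       # cheap auxiliary covariate (phase I)
    x = rng.binomial(1, 0.2 + 0.3 * z, n)             # expensive exposure (seen only in phase II)
    y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + 1.0 * x + 0.5 * z))))  # binary response (phase I)

    # Phase II selection: oversample cases, with known inclusion probabilities.
    pi = np.where(y == 1, 0.80, 0.15)
    in_phase2 = rng.binomial(1, pi) == 1

    # Weighted logistic regression on the phase II subsample (weight = 1 / inclusion probability).
    w = 1.0 / pi[in_phase2]
    design = sm.add_constant(np.column_stack([x[in_phase2], z[in_phase2]]))
    fit = sm.GLM(y[in_phase2], design, family=sm.families.Binomial(), freq_weights=w).fit()
    print(fit.params)  # point estimates are consistent; the printed standard errors are naive,
                       # and a robust (sandwich) variance would be used in practice.

The sketch only illustrates the weighting idea; the designs studied in Chapters 3-5 additionally optimize the phase II selection probabilities, proceed over multiple interim stages, and consider likelihood-based and conditional-likelihood analyses alongside weighted estimating equations.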