Statistics and Actuarial Science
Permanent URI for this collection: https://uwspace.uwaterloo.ca/handle/10012/9934
This is the collection for the University of Waterloo's Department of Statistics and Actuarial Science.
Research outputs are organized by type (e.g. Master's Thesis, Article, Conference Paper).
Waterloo faculty, students, and staff can contact us or visit the UWSpace guide to learn more about depositing their research.
Browsing Statistics and Actuarial Science by Author "Cook, Richard"
Now showing 1 - 10 of 10
Item Design and Analysis of Life History Studies Involving Incomplete Data (University of Waterloo, 2022-04-26) Mao, Fangya; Cook, Richard
Incomplete life history data can arise from study design features, coarsened observations, missing covariates, and unobserved latent processes. This thesis consists of three projects developing statistical models and methods to address problems involving such features. Statistical models which facilitate the exploration of spatial dependence can advance scientific understanding of chronic disease processes affecting several organ systems or body sites. Motivated by the need to investigate the spatial nature of joint damage in patients with psoriatic arthritis, in Chapter 2 we develop a multivariate mixture model to characterize latent susceptibility and the progression of joint damage in different locations. In addition to the large number of joints under consideration and the heterogeneity in risk, the times to joint damage are subject to interval censoring because damage status is only observed at intermittent radiological examination times. We address the computational and inferential challenges through the use of composite likelihood and two-stage estimation procedures. The key contribution of this chapter is the development of a convenient and general framework for regression modeling to study risk factors for susceptibility to joint damage and the time to damage, as well as the spatial dependence of these features. The design and analysis of two-phase studies have been investigated for biomarker studies involving lifetime data. Two-phase designs aim to guide the efficient selection of a sub-sample of individuals from a phase I cohort in which to measure some "expensive" markers under budgetary constraints. In the phase I sample, information on the response and inexpensive covariates is available for a large cohort; in phase II, a subsample is selected in which to assay the marker of interest through examination of a biospecimen.
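The interval censoring described here enters a likelihood through differences of the survivor function at consecutive assessment times. A minimal sketch in Python, assuming a simple one-parameter exponential failure model and a shared annual visit schedule (far simpler than the multivariate mixture model of the thesis; all names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def censor_to_intervals(times, visits):
    """Reduce exact failure times to interval-censored pairs (L, R]:
    each event is only known to fall between the last visit before it
    and the first visit at or after it; events after the final visit
    are right-censored (R = inf)."""
    k = np.searchsorted(visits, times)          # first visit at/after each time
    L = np.concatenate(([0.0], visits))[k]
    R = np.where(k == len(visits), np.inf, visits[np.minimum(k, len(visits) - 1)])
    return L, R

def interval_loglik(rate, L, R):
    """Each subject contributes log{S(L) - S(R)} under the exponential
    survivor function S(t) = exp(-rate * t); exp(-inf) = 0 covers the
    right-censored contributions, which reduce to S(L)."""
    return np.sum(np.log(np.exp(-rate * L) - np.exp(-rate * R)))

T = rng.exponential(scale=1 / 0.5, size=2000)   # true rate 0.5
visits = np.arange(1.0, 11.0)                    # ten annual assessments
L, R = censor_to_intervals(T, visits)
grid = np.linspace(0.05, 1.5, 300)
est = grid[int(np.argmax([interval_loglik(r, L, R) for r in grid]))]
print(round(est, 2))
```

A grid search stands in for the numerical maximization; the point is only that intermittent observation still identifies the rate through the bracketing intervals.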
The design efficiency is measured in terms of the precision in estimating the effect of the biomarker on some event process of interest (e.g. disease progression). Chapter 3 considers two-phase designs involving current status observation of the failure process; here individuals are monitored at a single assessment time to determine whether or not they have experienced a failure event of interest. This kind of observation scheme is sometimes desirable in practice as it is more efficient and cost-effective than carrying out multiple assessments. We examine efficient two-phase designs under two analysis methods, namely maximum likelihood and inverse probability weighting. The former tends to be more efficient but requires additional model assumptions involving the nuisance covariate model, while the latter is more robust but yields less efficient estimators since it only analyzes data from the phase II subsample. The optimal designs are derived by minimizing the asymptotic variance of the coefficient estimators for the expensive marker. To circumvent the computational challenge of evaluating asymptotic variances at the design stage, we consider designs involving sub-sampling based on extreme score statistics, extreme observations, or stratified sub-sampling schemes. The role of the assessment time is highlighted. Research involving progressive chronic disease processes can be conducted by synthesizing data from different disease registries using different enrolment conditions. In inception cohorts, for example, individuals may be required not to have entered an advanced stage of the disease, while disease registries may focus on individuals who have progressed to a more advanced stage. The former yields left-truncated progression times while the latter yields right-truncated progression times.
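The inverse probability weighting principle behind the phase II analysis can be shown in a toy stratified two-phase scheme. This sketch assumes known selection probabilities and looks only at recovering a marginal feature of a scalar "expensive" marker; it is a stand-in for, not a reproduction of, the estimators or optimal designs studied in the chapter:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
x = rng.normal(size=n)                       # "expensive" marker (phase II only)
y = rng.binomial(1, 1 / (1 + np.exp(-x)))    # cheap phase I outcome

# Phase II: oversample cases (y = 1) when choosing who gets the assay,
# a common stratified two-phase scheme with known selection probabilities.
pi = np.where(y == 1, 0.8, 0.1)
sel = rng.binomial(1, pi).astype(bool)

naive = x[sel].mean()                              # biased: cases overrepresented
ipw = np.sum(x[sel] / pi[sel]) / np.sum(1 / pi[sel])   # Hajek-type IPW estimate
print(round(naive, 2), round(ipw, 2))
```

The true marker mean is zero; weighting each sampled subject by the inverse of its selection probability undoes the outcome-dependent sampling, which is the same mechanism that makes the IPW analysis of the chapter robust at a cost in efficiency.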
Chapter 4 considers the development of two-phase designs when the phase I sample contains data pooled from different registries launched to recruit individuals from a common population under different disease-dependent selection criteria. We frame the complex data structure using multistate models and construct partial likelihoods restricted to parameters of interest using intensity-based models under some model assumptions. Both recruitment (phase I) and sub-selection (phase II) biases are accounted for to ensure valid inference. An inverse probability weighting method is also developed to relax the assumptions needed for the likelihood approach. We investigate and compare the performance of various two-phase sampling schemes under each analysis method and provide practical guidance for phase II selection given budgetary constraints. The contributions of this thesis are reviewed in Chapter 5, where we also mention topics of future research.
Item Failure Time Analysis with Discrete Marker Processes under Intermittent Observation (University of Waterloo, 2021-07-28) Xie, Bing Feng; Cook, Richard
Regression analysis for failure time data is often directed at studying the relationship between a time-dependent biomarker and failure. The Cox regression model and the associated partial likelihood on which inference is based are well-suited for this kind of investigation since the values of time-dependent biomarkers are only required at the observed failure times in the sample. It is common, however, for marker values to be obtained only at periodic clinic visits when biospecimens are acquired for testing. The convention is then to take these values as the working value of the biomarker until the next visit, failure, or censoring. In such settings the assumed biomarker value is typically out-of-date and therefore misrepresents the true value.
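The last-observation-carried-forward convention just described can be sketched as a small helper; the function and its arguments are illustrative rather than anything prescribed by the thesis:

```python
import numpy as np

def locf(visit_times, visit_values, t):
    """Return the last-observation-carried-forward marker value at
    analysis time t: the value from the most recent clinic visit at or
    before t (np.nan before the first visit)."""
    visit_times = np.asarray(visit_times, dtype=float)
    idx = np.searchsorted(visit_times, t, side="right") - 1
    return np.nan if idx < 0 else visit_values[idx]

# marker measured at visits 0, 1, 2; its working value at an observed
# failure time of 2.7 is the (possibly stale) value from the visit at t = 2
print(locf([0.0, 1.0, 2.0], [0, 1, 0], 2.7))
```

If the marker flipped again between t = 2 and t = 2.7, the carried-forward value is wrong, which is exactly the misspecification studied in the following chapters.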
Joint modeling of the marker and failure processes can address this misspecification and mitigate the bias of a naive analysis based on the last observation carried forward approach. In Chapter 2 of this thesis an expectation-maximization algorithm is developed for fitting a joint (i.e. multistate) model for an intermittently-observed binary time-dependent biomarker and a failure time. This is implemented, assessed empirically through simulation studies, and applied to a dataset from a cancer clinical trial studying the relation between a biomarker and the occurrence of a composite endpoint defined as the time of a skeletal complication or death. Chapter 3 involves a careful study of the asymptotic bias of regression coefficients from a Cox regression model using the conventional approach of carrying biomarker values forward in time from clinic visits until the next measurement occasion, failure, or censoring. Using counting process notation and large sample theory for misspecified models, we gain insights into the determinants of the limiting bias. We consider a true underlying Cox model in which the current marker value and a baseline covariate act multiplicatively on a baseline hazard so that the bias in the effects of the biomarker and the baseline covariate can be examined. The determinants of the limiting bias include the proportion of time spent in the two marker states, the relation between the baseline covariates and the intensities governing transitions between the marker states, and the frequency of the measurements. We also define a marker-dependent visit process as one in which the visit intensity depends on the latent marker value; the strength of this association is found to affect the magnitude of the asymptotic bias as well. An expanded joint model is described in Chapter 4 which incorporates the marker process, failure process, visit process and right-censoring process.
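The two-state marker dynamics and intermittent observation scheme underlying this analysis can be sketched as follows, assuming exponential sojourn times and made-up transition intensities; only the visit-time snapshots would be available to the analyst:

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_marker_path(lam01, lam10, tau):
    """Simulate a binary marker as a two-state Markov process on [0, tau]:
    exponential sojourns with intensity lam01 (0 -> 1) and lam10 (1 -> 0).
    Returns the jump times and the state entered at each jump."""
    t, state = 0.0, 0
    times, states = [0.0], [0]
    while True:
        rate = lam01 if state == 0 else lam10
        t += rng.exponential(1 / rate)
        if t > tau:
            return np.array(times), np.array(states)
        state = 1 - state
        times.append(t)
        states.append(state)

def observe_at(times, states, visits):
    """Intermittent observation: the marker state is only recorded at
    the visit times, so transitions between visits are never seen."""
    idx = np.searchsorted(times, visits, side="right") - 1
    return states[idx]

times, states = simulate_marker_path(lam01=0.5, lam10=1.0, tau=10.0)
print(observe_at(times, states, visits=np.arange(0.0, 10.0)))
```

Everything between visits is latent, which is why the marker transition intensities in the joint model must be estimated from the visit-time snapshots alone.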
This framework is quite general, accommodating both marker-dependent censoring and marker-dependent visit intensities. It offers a basis for joint modeling of all four processes in order to mitigate the biases of both the conventional last observation carried forward approach and the simpler joint model of Chapter 2. Note that visit and failure times are observed exactly but are subject to right censoring, so the baseline intensities of these events can be well-estimated. The transitions between marker states are unobserved, however, so these intensities must be modelled parsimoniously. The focus of the investigation is primarily on the ability to obtain good estimation of the failure process intensity under a marker-dependent visit process, and so this is the setting of the simulation studies. We fit the model to data on the relation between an inflammatory blood marker (the erythrocyte sedimentation rate), a baseline genetic marker, and the time to joint damage involving patients from the University of Toronto Psoriatic Arthritis Clinic. A summary is given in Chapter 5 along with some discussion of topics for future research.
Item Joint modeling, variable selection and multiply robust estimation in mediation analysis with multiple mediators (University of Waterloo, 2024-01-10) Wang, Lijia; Zhu, Yeying; Cook, Richard
This thesis explores topics in causal mediation analysis with multiple, possibly related, mediators. The goal of this thesis is to propose innovative methodologies for joint modeling of multiple uncausally related mediators, selecting mediators from high-dimensional candidates while simplifying their dependency structures, and performing multiply robust estimation to uncover causal effects of interest. Causal mediation analysis aims to enhance understanding of the effects of an exposure on an outcome by examining direct and indirect effects.
In settings where multiple mediators are involved, the relations among these mediators play an important role. Traditional studies focus on scenarios in which the multiple mediators are either related under specified causal structures or independent given baseline covariates. Our studies focus on multiple uncausally related mediators, where the mediators are associated with each other conditional on pre-treatment covariates and treatment but there is no causal ordering among them. In Chapter 2, we begin by reviewing and expanding upon the concept of uncausally related mediators, followed by the introduction of causal effects defined under such settings and the associated identification assumptions. We propose to jointly model the uncausally related mediators using copula functions. An important advantage of employing copula functions in joint modeling is the significant flexibility they offer, as this approach allows the mediators to have different distributions and to be correlated in various ways. Subsequently, we propose methods for estimating causal effects within this framework. In Chapter 3, we center our attention on the sparse mediation phenomenon, where only a handful of true mediators, from a pool of possibly high-dimensional candidates, exhibit nonzero indirect effects. We propose a LASSO-based penalization technique that selects the true mediators by considering their indirect effects. Acknowledging that the selected mediators often still exhibit complex dependency structures even after selection, our method also simplifies these structures by selecting the non-zero entries of the correlation matrix using a similar penalized estimation technique. To facilitate the correlation structure selection, we transform the correlation matrix selection problem into a standard variable selection problem within the framework of a linear model.
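The copula idea of separating margins from dependence can be sketched with a Gaussian copula; the particular margins and correlation below are illustrative (the thesis does not prescribe them), and SciPy is assumed to be available:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, rho = 50000, 0.6

# Step 1: correlated uniforms from a Gaussian copula
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
u = stats.norm.cdf(z)

# Step 2: different margins for the two mediators, joined by one copula
m1 = stats.gamma.ppf(u[:, 0], a=2.0)     # right-skewed continuous mediator
m2 = stats.norm.ppf(u[:, 1], loc=1.0)    # Gaussian mediator

# dependence lives on the uniform scale, independent of the margins
r = np.corrcoef(u[:, 0], u[:, 1])[0, 1]
print(round(r, 2))
```

Swapping either marginal distribution leaves the dependence structure untouched, which is the flexibility the abstract refers to.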
Moreover, our proposed method allows the mediator selection and dependency structure selection processes to be conducted via either a parallel or a sequential approach. The grouped and individual causal effects are defined under such settings, with estimation approaches discussed. In Chapter 4, we discuss the issue of model misspecification within the context of causal mediation analysis and propose two ways of constructing multiply robust estimators. In causal mediation analysis, three working models must typically be specified: the treatment model, the mediator model, and the response model. Both of our multiply robust estimation methods yield consistent estimation of the causal quantities of interest provided that any two of the three models are correctly specified. For each method introduced in Chapters 2, 3 and 4, we provide theoretical results with proofs of consistency and other properties. We also derive large sample properties and investigate finite sample properties via simulations. Each chapter includes an application of the proposed method to a genetic study in psychiatry investigating DNA methylation loci as mediators on the causal path between childhood trauma and stress reactivity. In Chapter 2, the proposed method estimates the mediation effects of three DNA loci on the Kit ligand gene. Chapter 3 extends this analysis and applies the proposed mediator selection method to the entire DNA methylation dataset, revealing 12 mediating loci, 10 of which show a strong association. We estimate the grouped indirect effect of these 10 loci and the individual effects of the remaining two.
In Chapter 4, we employ our multiply robust estimation methods to re-evaluate the mediation effects of these 12 loci, demonstrating enhanced robustness relative to previous findings.
Item Marginal Causal Sub-Group Analysis with Incomplete Covariate Data (University of Waterloo, 2019-01-11) Cuerden, Meaghan; Cook, Richard; Cotton, Cecilia; Diao, Liqun
Incomplete data arise frequently in health research studies designed to investigate the causal relationship between a treatment or exposure and a response of interest. Statistical methods for conditional causal effect parameters in the setting of incomplete data have been developed, and we expand upon these methods for estimating marginal causal effect parameters. This thesis focuses on the estimation of marginal causal odds ratios, which are distinct from conditional causal odds ratios in logistic regression models; marginal causal odds ratios are frequently of interest in population studies. We introduce three methods for estimating the marginal causal odds ratio of a binary response at different levels of a subgroup variable, where the subgroup variable is incomplete. In each chapter, the subgroup variable, exposure variable and response variable are binary, and the subgroup variable is missing at random. In Chapter 2, we begin with an overview of inverse probability weighted methods for confounding in an observational setting where data are complete. We also briefly review methods to deal with incomplete data in a randomized setting. We then introduce a doubly inverse probability weighted estimating equation approach to estimate marginal causal odds ratios in an observational setting where an important subgroup variable is incomplete. One inverse probability weight accounts for the incomplete data, and the other accounts for treatment selection. Only complete cases are included in the response model.
Consistency results are derived, and a method to obtain estimates of the asymptotic standard error is introduced; the extra variability introduced by estimating the two weights is incorporated in the estimation of the asymptotic standard error. We give methods for hypothesis testing and the calculation of confidence intervals. Simulation studies show that the doubly weighted estimating equation approach is effective in a non-ignorable missingness setting with confounding, and it is straightforward to implement. It also performs well when the missing data process is ignorable and/or when confounding is not present. In Chapter 3, we begin with an overview of an EM algorithm approach for estimating conditional causal effect parameters in the setting of incomplete covariate data, in both randomized and observational settings. We then propose a doubly weighted EM-type algorithm to estimate the marginal causal odds ratio in the setting of missing subgroup data. In this method, instead of using a complete case analysis in the response model, all available data are used and the incomplete subgroup variable is “filled in” using a maximum likelihood approach. Two inverse probability weights are used here as well, to account for confounding and incomplete data. The weight accounting for the incomplete data is needed, even though an EM approach is being used, because the marginal causal odds ratio is of interest. A method to obtain asymptotic standard error estimates is given which incorporates the extra variability introduced by estimating the two inverse probability weights, as well as the variability introduced by estimating the conditional expectation of the incomplete subgroup variable.
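A stylized version of the doubly weighted idea can be sketched as follows. Both weights are treated as known, the subgroup stratification is suppressed for brevity, and the marginal causal odds ratio of a binary treatment is targeted; this is a toy instance of the principle, not the estimating equations of the chapter:

```python
import numpy as np

rng = np.random.default_rng(5)
expit = lambda x: 1 / (1 + np.exp(-x))
n = 200000

c = rng.normal(size=n)                        # confounder
a = rng.binomial(1, expit(0.8 * c))           # treatment depends on c
y = rng.binomial(1, expit(-0.5 + a + 0.8 * c))
r = rng.binomial(1, expit(1.0 - c))           # record complete? (MAR given c)

# two weights: one for treatment selection, one for missingness;
# the probabilities are taken as known purely for illustration
w_trt = np.where(a == 1, 1 / expit(0.8 * c), 1 / (1 - expit(0.8 * c)))
w_mis = 1 / expit(1.0 - c)
w = w_trt * w_mis * r                         # complete cases only

mu1 = np.sum(w * y * (a == 1)) / np.sum(w * (a == 1))
mu0 = np.sum(w * y * (a == 0)) / np.sum(w * (a == 0))
or_dw = (mu1 / (1 - mu1)) / (mu0 / (1 - mu0))

# oracle marginal odds ratio from simulated potential outcomes
p1 = rng.binomial(1, expit(-0.5 + 1 + 0.8 * c)).mean()
p0 = rng.binomial(1, expit(-0.5 + 0.8 * c)).mean()
oracle = (p1 / (1 - p1)) / (p0 / (1 - p0))
print(round(or_dw, 2), round(oracle, 2))
```

One weight undoes confounded treatment assignment and the other undoes covariate-dependent missingness; together they recover the marginal contrast from complete cases alone.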
Simulation studies show that this method is effective in obtaining consistent estimates of the parameters of interest; however, it is difficult to implement, and in certain settings there is a loss of efficiency in comparison with the methods introduced in Chapter 2. In Chapter 4, we begin by reviewing multiple imputation methods in randomized and observational settings where estimation of the conditional causal odds ratio is of interest. We then propose the use of multiple imputation with one inverse probability weight to account for confounding in an observational setting where the subgroup variable is incomplete. We discuss how to correctly specify the imputation model both when the conditional causal odds ratio is of interest and when the marginal causal odds ratio is of interest. We use standard methods for combining the estimates of the marginal log odds ratios from each imputed dataset. We propose a method for estimating the asymptotic standard error of the estimates which incorporates both the estimation of the parameters in the weight for confounding and the multiply imputed datasets. We give methods for hypothesis testing and the calculation of confidence intervals. Simulation studies show that this method is efficient and straightforward to implement, but correct specification of the imputation model is necessary. In Chapter 5, the three methods that have been introduced are applied to an observational cohort study of 418 colorectal cancer patients. We compare patients who received an experimental chemotherapy with patients who received standard chemotherapy; of interest is estimation of the marginal causal odds ratio of a thrombotic event during the course of treatment or within 30 days after treatment is discontinued. The important subgroups are (i) patients receiving a first line of treatment, and (ii) patients receiving a second line of treatment.
In Chapter 6, we compare and contrast the three proposed methods. We also discuss extensions to different response models, models for missing response data, and weighted models in the longitudinal data setting.
Item Mixture Models for Coarsened Multivariate Failure Time Data (University of Waterloo, 2018-08-13) Jiang, Shu; Cook, Richard
The aim of this thesis is to develop statistical methodology for the analysis of life history data under incomplete observation schemes and with latent features which must be accommodated to ensure models provide a reasonable representation of the processes of interest and advance scientific understanding. Life history data frequently arise in health studies of disease processes in which individuals pass through a series of stages of disease. Multistate models offer an appealing approach to modelling processes in settings where the disease course can be meaningfully characterized by a finite number of disjoint stages, and we adopt such models for much of the research in this thesis. In many instances, because processes are only observed intermittently, the precise number, types and times of transitions between assessments are not available. For failure time processes, at most a single transition can occur between assessments and the resulting data are called interval-censored failure time data; for more general multistate processes this is called a panel data observation scheme. We investigate problems related to interval-censored data throughout this thesis, and consider a more extreme form of incomplete data due to aggregation. The term coarsened data is used to unify these settings. Despite careful attempts to collect and exploit available information to characterize the dynamic features of life history processes, substantial unexplained variability often exists between individuals or groups of individuals. Heterogeneity can be accommodated in various ways.
Finite mixture models can be specified to accommodate distinct classes, or sub-populations, in which different disease processes govern progression; latent class models are often used when class membership is fixed. When there are two classes and no disease progression occurs in one class, so-called cure rate models are often used. Classical mixture models with continuous random effects are also often used to account for heterogeneity of a more finely distinguished nature; this approach underlies frailty models for survival data and, more generally, models accommodating between-cluster variation in clustered data. In this thesis, the focus is on methods for statistical modeling and inference for multivariate failure time and multistate processes subject to intermittent observation; the resulting data are interval-censored multivariate failure time data and panel data, respectively. Finite mixture models offer a powerful approach for accommodating heterogeneity when there are distinct types of processes present in a population, with latent sub-populations each following one of these processes. Methods for fitting finite mixture models and conducting score tests for genetic markers are developed in Chapter 2 for a problem involving heterogeneous multistate processes under intermittent observation. When there are multiple marginal processes of interest, the correlation between such processes must be taken into account. In Chapter 3 we develop multivariate models for the joint analysis of marginal processes. Copula models are popular for modeling the correlation between marginal failure time processes, while odds ratios are commonly used to capture the association between binary variables. Through the use of multivariate mixture models, the dependence structure can be decomposed into one for susceptibility and one for the failure times given joint susceptibility.
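A minimal instance of the finite mixture idea is the two-class cure-rate model mentioned above, which can be fit by an EM algorithm. This sketch assumes an exponential failure time in the susceptible class and a common administrative censoring time, which is far simpler than the multistate mixtures developed in the thesis:

```python
import numpy as np

rng = np.random.default_rng(6)
n, cure_p, rate_true, tau = 5000, 0.3, 1.0, 3.0

cured = rng.binomial(1, cure_p, n).astype(bool)   # class with no progression
t_lat = rng.exponential(1 / rate_true, n)
time = np.where(cured, tau, np.minimum(t_lat, tau))
delta = (~cured) & (t_lat <= tau)                  # event indicator

# EM for the two-component (cure-rate) mixture
p, rate = 0.5, 0.5                                 # crude starting values
for _ in range(200):
    # E-step: posterior probability each censored subject is susceptible
    s_tau = np.exp(-rate * tau)
    w = np.where(delta, 1.0, (1 - p) * s_tau / (p + (1 - p) * s_tau))
    # M-step: closed-form updates given the weights
    p = 1 - w.mean()
    rate = delta.sum() / np.sum(w * time)
print(round(p, 2), round(rate, 2))
```

Subjects with an observed event are known to be susceptible; the E-step only has to apportion the censored subjects between the cured and still-at-risk classes.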
Mixed multistate processes involving aggregate data are developed in Chapters 4 and 5. The computational challenges are addressed through the use of composite likelihood. We deal with between-cluster variation and within-cluster correlation in both chapters and propose two approaches to deal with such data. Specifically, we propose a marginal approach in which dependence is modeled via copulas, construct a composite likelihood, and derive procedures for inference. A random effect model is also formulated in which a cluster-level latent variable accommodates heterogeneity between clusters. An optimal cost-effective design is also proposed which gives insight into the efficiency of studies involving aggregation and tracking. In Chapter 5, sample size criteria are developed to meet design objectives, and cost-effective optimal allocations of clusters to the tracking and aggregate observation schemes are derived.
Item Modeling and Prediction of Disease Processes Subject to Intermittent Observation (University of Waterloo, 2016-07-21) Wu, Ying; Cook, Richard
This thesis is concerned with statistical modeling and prediction of disease processes subject to intermittent observation. Times of disease progression are interval-censored when progression status is only known at a series of assessment times. This situation arises routinely in clinical trials and cohort studies when events of interest are only detectable upon imaging, based on blood tests, or upon careful clinical examination. The work that follows is motivated by the study of demographic, genetic and clinical data available from the University of Toronto Psoriasis Registry and the University of Toronto Psoriatic Arthritis Registry, each involving cohorts of several hundred patients with the respective diseases. Chapter 2 deals with the problem of selecting important prognostic biomarkers from a large set of candidate biomarkers when the status with respect to an event of interest (e.g.
disease progression) is only known at irregularly spaced, individual-specific assessment times. Penalized regression techniques (e.g. LASSO, adaptive LASSO and SCAD) are adapted to deal with the interval-censored event times arising from this observation scheme. An expectation-maximization algorithm is developed and demonstrated to perform well in extensive simulation studies involving independent and correlated continuous and binary covariates. An application to the motivating study of the development of arthritis mutilans in patients with psoriatic arthritis is given, and several important human leukocyte antigen (HLA) variables are identified for further investigation. Extensions of this algorithm are developed for settings in which data from different sources with distinct disease-related entry conditions are to be synthesized. The extended Turnbull-type expectation-maximization algorithm is based on a complete data likelihood which incorporates missing information from individuals not meeting the entry criteria of the respective registries. Simulation studies demonstrate good empirical performance, and an application to the motivating study identifies HLA markers associated with the onset of psoriatic arthritis among individuals with psoriasis. This analysis is carried out using data from a psoriasis registry in which the times to psoriatic arthritis are left-truncated, and a psoriatic arthritis registry in which the onset times are right-truncated. Chapter 3 deals with the challenge of assessing the accuracy of a predictive model when response times are interval-censored. Inverse probability weighted (IPW) and augmented inverse probability weighted (AIPW) estimators of predictive accuracy are developed and evaluated based on the mean prediction error and the area under the receiver operating characteristic curve.
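Returning to the penalized selection step of Chapter 2: the core of LASSO-type methods is the soft-thresholding update used in coordinate descent. The sketch below applies that update to a plain linear model rather than the interval-censored likelihood of the thesis, purely to show the mechanism (names and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)

def soft_threshold(z, lam):
    """LASSO soft-thresholding: shrink toward zero, setting small
    coefficients exactly to zero; this is what performs selection."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Cyclic coordinate descent for the LASSO on a linear model; in an
    interval-censored setting the analogous update would act on a
    working objective inside each EM iteration."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]     # partial residual
            beta[j] = soft_threshold(X[:, j] @ r / n, lam) / (X[:, j] @ X[:, j] / n)
    return beta

n, p = 500, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[0], beta_true[1] = 2.0, -1.5               # only two true signals
y = X @ beta_true + rng.normal(size=n)
beta = lasso_cd(X, y, lam=0.1)
print(np.round(beta[:3], 1))
```

The eight null coefficients are thresholded exactly to zero while the two true signals survive (with the usual slight shrinkage), mirroring the selection of prognostic biomarkers described above.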
The weights are estimated from a multistate model which jointly considers the event process, the inspection process, and the right-censoring process. We investigate the performance of the proposed methods by simulation and illustrate their application in the context of a motivating rheumatology study in which HLA markers are used for predicting disease progression in psoriatic arthritis. A two-phase model is developed in Chapter 4 for chronic diseases which feature an indolent phase followed by a phase of more active disease resulting in progression and damage. The time-scales for the intensity functions in the active phase are more naturally based on the time since the start of that phase, corresponding to a semi-Markov formulation. In cohort studies in which the disease status is only known at a series of clinical assessment times, transition times are interval-censored, which means the time origin for phase II is interval-censored as well. Weakly parametric models with piecewise-constant baseline hazard and rate functions are specified, and an expectation-maximization algorithm is described for model fitting. A computationally faster two-stage estimation procedure is also developed and the asymptotic variances of the resulting estimators are derived. Simulation studies examining the performance of the proposed model show good performance under both maximum likelihood and two-stage estimation. An application to data from the motivating study of disease progression in psoriatic arthritis illustrates the procedure, and identifies new human leukocyte antigens associated with the duration of the indolent phase, and others associated with disease progression in the active phase.
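The weakly parametric piecewise-constant specification is convenient partly because the survivor function is available in closed form. A small helper, with illustrative cut-points and rates:

```python
import numpy as np

def pc_survival(t, cuts, rates):
    """Survivor function under a piecewise-constant hazard: rates[k]
    applies on [cuts[k], cuts[k+1]) with cuts[0] = 0 and the last piece
    extending to infinity; S(t) = exp(-cumulative hazard at t)."""
    edges = np.append(np.asarray(cuts, float), np.inf)
    exposure = np.clip(t - edges[:-1], 0.0, np.diff(edges))  # time in each piece
    return np.exp(-np.sum(np.asarray(rates, float) * exposure))

# hazard 0.5 on [0,1), 1.0 on [1,2), 2.0 on [2,inf):
# cumulative hazard at t = 1.5 is 0.5*1 + 1.0*0.5 = 1
print(round(pc_survival(1.5, [0.0, 1.0, 2.0], [0.5, 1.0, 2.0]), 4))  # exp(-1)
```

Likelihood contributions of the interval-censored transition times are then simple differences of this function at consecutive assessment times.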
Open problems and topics for ongoing and future research are discussed in Chapter 5.
Item Statistical Methods for Joint Modeling of Disease Processes under Intermittent Observation (University of Waterloo, 2024-09-20) Chen, Jianchu; Cook, Richard
In studies of life history data, individuals often experience multiple events of interest that may be associated with one another. In such settings, joint models of the event processes are essential for valid inference. Data used for statistical inference are typically obtained from various sources, including observational data from registries or clinics and administrative records. These observation processes frequently result in incomplete histories of the event processes of interest. In settings where interest lies in the development of conditions or complications that are not self-evident, data become available only at periodic clinic visits. This thesis focuses on developing statistical methods for the joint analysis of disease processes involving incomplete data due to intermittent observation. Many disease processes involve recurrent adverse events and an event which terminates the process. Death, for example, terminates the event process of interest and precludes the occurrence of further events. In Chapter 2, we present a joint model for such processes which has appealing properties due to its construction using copula functions. Covariates have a multiplicative effect on the recurrent event intensity function given a random effect, which is in turn associated with the failure time through a copula function. This permits dependence modeling while retaining a marginal Cox model for the terminal event process.
When these processes are subject to right-censoring, simultaneous and two-stage estimation strategies are developed based on the observed data likelihood, which can be implemented by direct maximization or via an expectation-maximization algorithm; the latter facilitates semi-parametric modeling of the terminal event process. Variance estimates are derived based on the missing information principle. Simulation studies demonstrate good finite sample performance of the proposed methods and high efficiency of the two-stage procedure. An application to a study of the effect of pamidronate on reducing skeletal complications in patients with skeletal metastases illustrates the use of this model. Interval-censored recurrent event data can arise when the events of interest are only evident through intermittent clinical examination. Chapter 3 addresses such scenarios and extends the copula-based joint model for recurrent and terminal events proposed in Chapter 2 to accommodate interval-censored recurrent event data resulting from intermittent observation. Conditional on a random effect, the intensity for the recurrent event process has a multiplicative form with a weakly parametric piecewise-constant baseline rate, and a Cox model is formulated for the terminal event process. The two processes are then linked via a copula function, which defines a joint model for the random effect and the terminal event time. The observed data likelihood can be maximized directly or via an EM algorithm; the latter facilitates a semi-parametric model for the terminal event process. A computationally convenient two-stage estimation procedure is also investigated. Variance estimates are derived and validated by simulation studies. We apply this method to investigate the association between a biomarker (HLA-B27) and joint damage in patients with psoriatic arthritis. Databases of electronic medical records offer an unprecedented opportunity to study chronic disease processes.
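The copula linkage between a random effect and a terminal event time can be illustrated by conditional sampling from a Clayton copula. The dependence parameter and margins below are illustrative only (the thesis does not prescribe them), and SciPy is assumed:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
theta = 2.0                     # Clayton dependence; Kendall's tau = theta/(theta+2)
n = 5000

# conditional-inversion sampling from the Clayton copula
u = rng.uniform(size=n)
w = rng.uniform(size=n)
v = (u ** -theta * (w ** (-theta / (1 + theta)) - 1) + 1) ** (-1 / theta)

# u drives the subject-level random effect (e.g. a gamma frailty for the
# recurrent events); v drives the terminal event time: the copula links
# them while each margin stays exactly as specified
frailty = stats.gamma.ppf(u, a=2.0, scale=0.5)    # mean-one gamma random effect
term_time = -np.log(1 - v) / 0.1                  # exponential terminal time

tau_hat, _ = stats.kendalltau(u, v)
print(round(tau_hat, 2))
```

With theta = 2 the implied Kendall's tau is 0.5, so subjects with larger frailties tend (in this illustrative parameterization) to have systematically different terminal times while the gamma and exponential margins are untouched.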
In survival analysis, interest may lie in studying the effects of time-dependent biomarkers on a failure time through Cox regression models. Often, however, it is too labour-intensive to collect and clean data on all covariates at all times, and in such settings it is common to select a single clinic visit at which variables are measured. In Chapter 4, we consider several cost-effective ad hoc strategies for inference, consisting of: 1) selecting either the last or the first visit for a measurement of the marker value, and 2) using the measured value with or without left-truncation. The asymptotic bias of estimators based on these strategies under misspecified Cox models is investigated via a multistate model constructed for the joint modeling of the marker and failure processes. An alternative method for efficient selection of individuals under a budgetary constraint is discussed, and the corresponding observed data likelihood is derived. The asymptotic relative efficiency of regression coefficients obtained from the Fisher information is explored and an optimal design is provided under this selection scheme.
Item Statistical Models and Methods for Dependent Life History Processes (University of Waterloo, 2018-11-28) Lee, Jooyoung; Cook, Richard
This thesis deals with statistical issues in the analysis of complex life history processes which exhibit heterogeneity and dependence. We are motivated in this thesis by three specific types of processes: i) processes featuring recurrent episodic conditions, ii) multi-type recurrent events, and iii) clustered multistate processes as arise in family studies. In chronic diseases featuring recurrent episodic conditions, symptom onset is followed by a period during which symptoms are present until recovery. In the analysis of data from such processes, attention is often restricted to the recurrent onset of disease, ignoring the duration of symptoms.
This loss of information may lead to incorrect conclusions. In Chapter 2, we propose a novel model for an alternating two-state process, comprising a symptom-free state and a symptomatic state, to recognize the duration of symptoms. This approach reflects the dynamics of an individual's disease process and helps in understanding the course of disease. Intensity-based models with multiplicative random effects are considered, in which the disease onset time is governed by a conditionally Markov intensity and the time of recovery is governed by a conditionally semi-Markov intensity. A bivariate random effect, with one multiplicative component for each intensity, is introduced to accommodate between-individual heterogeneity, and dependence between its components offers a natural and more general framework for modeling the two-state process. A copula function is used for the joint distribution of the random effects, which retains the marginal features and allows flexible choices of dependence structure. The proposed model is semiparametric, and estimation is carried out using an expectation-maximization algorithm.
The aforementioned problem leads us to investigate the impact of ignoring symptom duration in a randomized trial setting. In Chapter 3, we define two risk sets for recurrent event analyses: one includes individuals during their symptomatic periods, and the other excludes individuals from the risk set during these periods. In a clinical trial, the balance between treatment groups in unmeasured confounders present at the time of randomization can be lost following randomization as the risk set changes; thus, retaining individuals in the risk set is a common approach. Here we examine the asymptotic and empirical biases of estimators from rate-based models when the two different risk sets are applied.
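The two risk-set conventions can be made concrete with a small helper that computes a crude onset rate (onsets per unit of at-risk time) under each convention for a single individual; the data layout and function name here are hypothetical.

```python
def crude_rates(episodes, tau):
    """Crude onset rate for one individual followed on [0, tau] under the two
    risk-set conventions: retaining the individual in the risk set during
    symptomatic periods versus excluding those periods from time at risk.

    episodes: list of (onset, recovery) pairs with 0 <= onset <= recovery <= tau.
    Returns (rate_retain, rate_exclude).
    """
    n_onsets = len(episodes)
    symptomatic = sum(rec - onset for onset, rec in episodes)
    return n_onsets / tau, n_onsets / (tau - symptomatic)
```

With tau = 10 and episodes (2, 4) and (6, 7), the retained-risk-set rate is 2/10 while the exclusion rate is 2/7: the choice of convention alone changes the quantity being estimated, which is the source of the biases examined above.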
We assume that the true underlying process is an alternating two-state process in which the true risk set excludes individuals while they are experiencing an exacerbation. Two scenarios are considered for the true model. In the first, there is no between-individual variation in either process and no dependence between the two processes; the second uses the dependent alternating two-state model proposed in Chapter 2. Issues of model misspecification and causal inference are considered. When focus is on clinical trials, the power implications of risk set misspecification are of interest.
In Chapter 4, attention is directed at multiple types of recurrent events where each endpoint is of interest. The use of a composite endpoint, defined as the time of the first event of any type, is a simple way to analyse such data. However, when multiple events are of comparable importance, a composite endpoint analysis may not be suitable. We propose a copula-based model for multi-type recurrent events in which each type of recurrent event process arises from a mixed Poisson model and the random effects link the event types through a copula function. When more than two types of events are considered, composite likelihood is adopted to ease the computational burden, and simultaneous and two-stage estimation are explored.
An aim of family studies is typically to gain knowledge about factors governing the inheritance of diseases. One may be interested in examining the dependence of disease onset between family members, and in identifying genetic markers associated with heritable disease. A common procedure is to collect families through probands, whereby affected individuals are selected from a disease registry and their family members (non-probands) are then recruited for examination. This approach to sampling families motivates us to consider the disease onset process along with survival, since the proband must be diseased and alive to be recruited, and family members may need to be alive.
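Stepping back to the alternating two-state construction referenced above, a minimal simulation might look as follows; for concreteness it uses a constant (conditionally Markov) onset intensity, a Weibull (conditionally semi-Markov) recovery intensity on the sojourn-time scale, and a Gaussian copula with Exp(1) margins for the bivariate random effect. These distributional choices and all parameter values are assumptions for illustration only.

```python
import math
import random

def correlated_frailties(rho, rng):
    """Bivariate random effect: Gaussian copula with Exp(1) margins."""
    x1 = rng.gauss(0.0, 1.0)
    x2 = rho * x1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    phi = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))  # normal CDF
    to_exp = lambda u: max(-math.log(max(1.0 - u, 1e-12)), 1e-12)
    return to_exp(phi(x1)), to_exp(phi(x2))

def simulate_path(tau=10.0, lam_onset=0.8, shape=1.5, scale=1.0, rho=0.5, rng=random):
    """Alternating symptom-free / symptomatic sojourns on [0, tau]; returns the
    frailties and the list of (onset, recovery) episode intervals."""
    z1, z2 = correlated_frailties(rho, rng)
    t, episodes = 0.0, []
    while True:
        t += rng.expovariate(z1 * lam_onset)      # conditionally Markov onset
        if t >= tau:
            break
        # conditionally semi-Markov recovery: Weibull in the sojourn-time scale,
        # with frailty z2 acting multiplicatively on the hazard
        sojourn = scale * (rng.expovariate(z2)) ** (1.0 / shape)
        episodes.append((t, min(t + sojourn, tau)))
        t += sojourn
    return z1, z2, episodes
```

The copula step is what induces the within-individual dependence between onset and recovery dynamics while leaving each frailty's marginal distribution intact.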
In Chapter 5, we propose a model for a clustered illness-death process for family studies which accounts for the semi-competing risks problem for disease onset as well as the biased sampling. Within-family association in the age of disease onset is modeled via a copula function applied to the possibly latent disease onset times, and survival is incorporated through a marginal illness-death model. The ascertainment condition is reflected in the likelihood or composite likelihood construction. Two study designs regarding the recruitment of family members are considered. One involves the collection of disease history from family members via the proband or medical records. The other requires family members to undergo a medical examination, in which case they must be alive at the time of the family study. Family data alone are insufficient to estimate all of the parameters of the illness-death processes. We therefore make use of auxiliary data, including population mortality data and additional registry data, to address this estimability issue. Another source of auxiliary data is a current status survey. The issue of missing genetic markers is also addressed under each study design.
Item Topics in Study Design and Analysis Involving Incomplete Data (University of Waterloo, 2021-07-27) Yang, Ce; Cook, Richard; Diao, Liqun
Incomplete data are a common occurrence in statistics, arising in various forms and through various mechanisms, each of which can have a significant effect on statistical analysis and inference. This thesis tackles several statistical issues in study design and analysis involving incomplete data. The first half of the thesis deals with incomplete observation of the responses. In medical studies, events of interest are often under intermittent observation schemes, for example, detected via periodic clinical examinations.
As a result, the event of interest is only known to have occurred within an interval, and the resulting interval-censored data hinder the application of numerous analysis tools. Although it is possible to presume that the event occurred at the endpoint or the midpoint of the interval, such ad hoc imputations are known to lead to invalid inferences. In Chapter 2, we propose appropriate imputations, via censoring unbiased transformations and pseudo-observations of the incomplete responses, to facilitate straightforward use of prevalent machine learning algorithms. The former technique helps preserve the conditional mean structure in the presence of censoring, and the latter originates from bias-corrected jackknife estimates. For a continuous response, both proposed imputations lead to regression tree models with the same expected L2 loss as those fitted from complete observations. Prediction and variable selection therefore follow naturally. Unlike most survival trees in the literature, our proposed models do not rely on the widely made proportional hazards assumption. Furthermore, such models reduce to ordinary regression trees in the absence of censoring. Survivor function estimates for interval-censored data are required to employ the imputations; various semiparametric and nonparametric approaches are considered and compared. In particular, we scrutinize the case of current status data in a separate section.
The second half of the thesis addresses incomplete covariate data missing by design. Controlled by the investigators, this missingness is attributable to budgetary constraints on measuring an "expensive" exposure variable. We focus on the well-known two-phase studies, which exploit the response and inexpensive auxiliary information on the population to select a phase II sub-sample for the collection of the expensive covariate. In Chapter 3, we look into an adaptive two-phase design that avoids the need for external pilot data.
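The bias from the ad hoc interval imputations discussed above can be seen in a small simulation: exponential event times observed only on a periodic examination grid, with the midpoint of the bracketing interval taken as the imputed event time. The setup (Exp(1) event times, unit-spaced examinations) is purely illustrative.

```python
import math
import random

def midpoint_impute(t, delta):
    """Observe an event time t only on an examination grid 0, delta, 2*delta, ...
    and return the midpoint of the bracketing interval (an ad hoc imputation)."""
    k = math.floor(t / delta)
    return (k + 0.5) * delta

def imputation_bias(n=20000, delta=1.0, seed=11):
    """Compare the mean of true Exp(1) event times with the mean of their
    midpoint imputations under intermittent observation."""
    rng = random.Random(seed)
    times = [rng.expovariate(1.0) for _ in range(n)]
    imputed = [midpoint_impute(t, delta) for t in times]
    return sum(times) / n, sum(imputed) / n
```

With unit-spaced examinations the midpoint imputations are systematically larger on average than the true event times (the theoretical imputed mean is about 1.08 versus 1), one simple instance of the invalid inferences such imputations can produce.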
Dividing the phase II sub-sampling into multiple interim stages, we employ conventional sampling to select a fraction of the phase II sub-sample; the resulting data provide the information required to construct an optimal sub-sample from the remaining individuals so as to achieve maximum statistical efficiency subject to sampling constraints. Such adaptive two-phase designs extend naturally to multiple stages in phase II and are applicable when a surrogate of the exposure variable is available. Efficiency and robustness are investigated under various frameworks of analysis. As expected, the maximum likelihood approach, which models the nuisance covariate distribution, tends to be more efficient, whereas inverse probability weighted estimating equations, which avoid this modeling, tend to be more robust to misspecification of the nuisance covariate models. The conditional maximum likelihood approach strikes an appealing balance between the two. Moreover, the desire to gain efficiency while maintaining a certain level of robustness drives us to explore semiparametric methods in all of the analyses and designs.
Chapter 4 onward turns attention to more complicated settings in which covariates are missing in a sequence of two-phase studies, with multiple responses and sampling constraints, conducted on a common platform. For a given two-phase study, we aim to exploit not only the information on the responses and auxiliary covariates at hand but also that passed on from earlier studies. We consider joint response models and perform secondary analyses of a new response using previously studied exposure variables. Moreover, the exposure variables acquired from earlier studies serve as pilot data to help construct an optimal selection model for an upcoming two-phase study. As we assess the balance between efficiency and robustness of the analysis methods, the potential misspecification of the joint response model warrants attention.
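A stripped-down version of the phase II selection and inverse probability weighting contrasted above: phase I records a cheap auxiliary variable A for everyone, phase II measures the expensive covariate X in a stratified Bernoulli sub-sample, and a Horvitz-Thompson estimator reweights the sampled X values by the known inclusion probabilities. The stratification on A, the sampling probabilities, and the data-generating numbers are all hypothetical.

```python
import random

def phase2_sample(auxiliary, probs, rng):
    """Stratified Bernoulli selection: individual i enters phase II with
    probability probs[auxiliary[i]]."""
    return [i for i, a in enumerate(auxiliary) if rng.random() < probs[a]]

def ht_mean(x, auxiliary, probs, sampled):
    """Horvitz-Thompson estimate of the population mean of the expensive
    covariate x, using only phase II individuals and known inclusion probs."""
    n = len(x)
    return sum(x[i] / probs[auxiliary[i]] for i in sampled) / n

# hypothetical phase I cohort: auxiliary stratum A in {0, 1}, X correlated with A
rng = random.Random(3)
aux = [rng.randint(0, 1) for _ in range(5000)]
x = [a + rng.gauss(0.0, 1.0) for a in aux]
probs = {0: 0.1, 1: 0.5}          # deliberately oversample the A = 1 stratum
sampled = phase2_sample(aux, probs, rng)
estimate = ht_mean(x, aux, probs, sampled)
```

The unweighted sub-sample mean is pulled toward the oversampled stratum, while the weighted estimator recovers the population mean; this is the robustness property of the weighted estimating equations, bought at some cost in efficiency relative to likelihood methods.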
Finally, we note in Chapter 5 that the work can be extended to deal with two-phase response-dependent sampling with longitudinal data.
Item Topics in the Design of Life History Studies (University of Waterloo, 2018-08-20) Moon, Nathalie C.; Zeng, Leilei; Cook, Richard
Substantial investments are being made in health research to support the conduct of large cohort studies with the objective of improving understanding of the relationships between diverse features (e.g. exposure to toxins, genetic biomarkers, demographic variables) and disease incidence, progression, and mortality. Longitudinal cohort studies are commonly used to study life history processes, that is, patterns of disease onset, progression, and death in a population. While primary interest often lies in estimating the effect of some factor on a simple time-to-event outcome, multistate modelling offers a convenient and powerful framework for the joint consideration of disease onset, progression, and mortality, as well as the effect of one or more covariates on these transitions. Longitudinal studies are typically very costly, and the complexity of the follow-up scheme is often not fully considered at the design stage, which may lead to inefficient allocation of study resources and/or underpowered studies. In this thesis, several aspects of study design are considered to guide the design of complex longitudinal studies, with the general aim being to obtain efficient estimates of parameters of interest subject to cost constraints. Attention is focused on a general $K$-state model where states $1, \ldots, K-1$ represent different stages of a chronic disease and state $K$ is an absorbing state representing death. In Chapter 2, we propose an approach to the design of efficient tracing studies to mitigate the loss of information stemming from attrition, a common feature of prospective cohort studies.
Our approach exploits observed information on state occupancy prior to loss to follow-up, covariates, and the time of loss to follow-up to inform the selection of individuals to be traced, leading to a more judicious allocation of resources. Two settings are considered: in the first, constraints are placed on the expected number of individuals to be traced; in the second, constraints are imposed on the expected cost of tracing. The latter setting accommodates the fact that some types of data may be more costly to obtain via tracing than others.
In Chapter 3, we focus on two key aspects of longitudinal cohort studies with intermittent assessments: the sample size and the frequency of assessments. We derive the Fisher information as the basis for studying the interplay between these factors and for identifying features of minimum-cost designs that achieve desired power. Extensions which accommodate possible misclassification of disease status at the intermittent assessment times are developed; these are useful for assessing the impact of imperfect screening or diagnostic tests in the longitudinal setting.
In Chapter 4, attention turns to state-dependent sampling designs for prevalent cohort studies. While incident cohorts involve recruiting individuals before they experience some event of interest (e.g. onset of a particular disease) and prospectively following them to observe this event, prevalent cohorts are obtained by recruiting individuals who have already experienced the event at some point in the past. Prevalent cohort sampling yields length-biased data, which have been studied extensively in the survival setting; we demonstrate the impact of this in the multistate setting. We start with observation schemes in which data are subject to left- or right-truncation in the failure-time setting. We then generalize these findings to more complex multistate models.
While the distribution of state occupancy at recruitment in a prevalent cohort sample may be driven by the prevalences in the population, we propose approaches for state-dependent sampling at the design stage to improve efficiency and/or minimize expected study cost. Finally, Chapter 5 features an overview of the key contributions of this research and outlines directions for future work.
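The general $K$-state structure running through this thesis (disease stages $1, \ldots, K-1$ and an absorbing death state $K$) can be sketched as a continuous-time Markov process in which each stage competes between progression and death; the constant intensities and parameter values below are hypothetical.

```python
import random

def simulate_k_state(k=4, prog=0.5, death=0.2, tau=20.0, rng=random):
    """Progressive k-state path: disease stages 1..k-1, absorbing death state k.
    From stage s, the next stage competes at intensity prog (when s < k - 1)
    with death at intensity death; follow-up is censored at tau."""
    state, t, path = 1, 0.0, [(0.0, 1)]
    while state < k:
        lam_prog = prog if state < k - 1 else 0.0
        total = lam_prog + death
        w = rng.expovariate(total)       # time to the next transition
        if t + w > tau:
            break                        # administratively censored at tau
        t += w
        state = state + 1 if rng.random() < lam_prog / total else k
        path.append((t, state))
    return path
```

Simulated paths like these are the raw material on which design questions such as tracing, assessment frequency, and state-dependent sampling can be explored numerically.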