First, we generate a small simulated data.
set.seed(1)
n <- 100
p <- 6
X <- MASS::mvrnorm(n, rep(0, p), cor.mat(p, rho=0.5))
beta0 <- rep(c(1,-1), times=3)
Y <- -2+ X %*% beta0 + rnorm(n, sd=1)
Then, we fit the linear regression model without missing values based on ILSE.
We can also create a (data.frame) object as input for ILSE.
dat <- data.frame(Y=Y, X=X)
ilse1 <- ilse(Y~., data=dat)
print(ilse1)
Coef(ilse1) # access the coefficients
Fitted.values(ilse1)[1:5]
Residuals(ilse1)[1:5]
Check the significant variables by bootstratp.
First, we randomly remove some entries in X.
mis_rate <- 0.3
set.seed(1)
na_id <- sample(1:(n*p), n*p*mis_rate)
Xmis <- X
Xmis[na_id] <- NA
ncomp <- sum(complete.cases(Xmis))
message("Number of complete cases is ", ncomp, '\n')
Second, we use lm to fit linear regression model based on complete cases, i.e., CC analysis. We can not detect any siginificant covariates.
Third, we use ILSE to fit the linear regression model based on all data. We can fit a linear regression model without intercept by setting formula:
Then, we fit a linear regression model with intercept by following command
Fourth, Bootstrap is applied to evaluate the standard error and p-values of each coefficients estimated by ILSE. We observe four significant coefficients.
In ILSE package, we also provide Full Information Maximum Likelihood for Linear Regression fimlreg. We show how to use it to handle the above missing data.
We also use bootstrap to evaluate the standard error and p-values of each coefficients estimated by ILSE. We observe only one significant coefficients.
We visualize the p-vaules of each methods , where red line denotes 0.05 in y-axis and blue line 0.1 in y-axis.
pMat <- cbind(CC=s_cc$coefficients[,4], ILSE=s2[,4], FIML=s_fiml[,4])
library(ggplot2)
df1 <- data.frame(Pval= as.vector(pMat[-1,]),
Method =factor(rep(c('CC', "ILSE", "FIML"),each=p)),
covariate= factor(rep(paste0("X", 1:p), times=3)))
ggplot(data=df1, aes(x=covariate, y=Pval, fill=Method)) + geom_bar(position = "dodge", stat="identity",width = 0.5) + geom_hline(yintercept = 0.05, color='red') + geom_hline(yintercept = 0.1, color='blue')
Base on the above data, we add a new column, a categorical variable (Sex), into the data.frame. This variable is not associated with the outcome variable.
dat <- data.frame(Y=Y, X=Xmis)
dat$Sex <- factor(rep(c('male', 'female'), times=n/2))
dat$Sex[sample(1:n, n*mis_rate)] <- NA
ilse1 <- ilse(Y~., data=dat, verbose = T)
We can change the bootstrap times in calculate the standard errors, Z value and p-values of coefficients.
First, we generate data from a linear regression model with three inportant variables(1,3,5) and three unimportant variables(2,4,6).
set.seed(10)
n <- 100
p <- 6
X <- MASS::mvrnorm(n, rep(0, p), cor.mat(p, rho=0.5))
beta0 <- rep(c(1,0), times=3)
Y <- -2+ X %*% beta0 + rnorm(n, sd=1)
message("The true regression coefficients are: ", paste0(beta0, ' '))
We randomly assign missing values in the design matrix.
Next, we use ILSE to fit model.
Fit model by using lm and FIML, finally compare ILSE with these two methods.
We visualize the p-vaules of each methods , where red line denotes 0.05 in y-axis. Under significance level 0.05, we found both ILSE and FIML can identify all important variables (X1, X3 and X5), while CC method only identified X1 and X5.
library(ggthemes)
pMat <- cbind(CC=s_cc$coefficients[,4], ILSE=s3[,4], FIML=s_fiml[,4])
df1 <- data.frame(Pval= as.vector(pMat[-1,]),
Method =factor(rep(c('CC', "ILSE", "FIML"),each=p)),
covariate= factor(rep(paste0("X", 1:p), times=3)))
ggplot(data=df1, aes(x=covariate, y=Pval, fill=Method)) + geom_bar(position = "dodge", stat="identity",width = 0.5) + geom_hline(yintercept = 0.05, color='red') + scale_fill_economist()
Here, we generate a data with 80% missing values, then use ILSE to fit model.
# generate data from linear model
set.seed(10)
n <- 100
p <- 6
X <- MASS::mvrnorm(n, rep(0, p), cor.mat(p, rho=0.5))
beta0 <- rep(c(1,-1), times=3)
Y <- -2+ X %*% beta0 + rnorm(n, sd=1)
# generate missing values
mis_rate <- 0.8
set.seed(1)
na_id <- sample(1:(n*p), n*p*mis_rate)
Xmis <- X
Xmis[na_id] <- NA
# retain 4 complete cases.
Xmis[1:4,] <- X[1:4, ]
sum(complete.cases(Xmis))
CC method will failed.
However, ILSE can still work.
We generate a large-scale data with n=1000 and p = 50
n <- 1000
p <- 50
X <- MASS::mvrnorm(n, rep(0, p), cor.mat(p, rho=0.5))
beta0 <- rep(c(1,-1), length=p)
Y <- -2+ X %*% beta0 + rnorm(n, sd=1)
mis_rate <- 0.3
set.seed(1)
na_id <- sample(1:(n*p), n*p*mis_rate)
Xmis <- X
Xmis[na_id] <- NA
Xmis[1:10,] <- X[1:10,]
lm1 <- lm(Y~Xmis)
lm1
system.time(ilse2 <- ilse(Y~Xmis, data=NULL, verbose=T))
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ILSE_1.1.7 rmarkdown_2.28
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.37 R6_2.5.1 fastmap_1.2.0 xfun_0.49
#> [5] maketools_1.3.1 cachem_1.1.0 parallel_4.4.1 knitr_1.48
#> [9] htmltools_0.5.8.1 buildtools_1.0.0 lifecycle_1.0.4 cli_3.6.3
#> [13] sass_0.4.9 jquerylib_0.1.4 compiler_4.4.1 sys_3.4.3
#> [17] tools_4.4.1 pbapply_1.7-2 evaluate_1.0.1 bslib_0.8.0
#> [21] Rcpp_1.0.13-1 yaml_2.3.10 jsonlite_1.8.9 rlang_1.1.4