Title: | Interpretable Multi-Omics Representation Learning via Covariate-Augumented Generalized Factor Model |
---|---|
Description: | The covariate-augmented generalized factor model is designed to account for cross-modal heterogeneity, capture nonlinear dependencies among the data, incorporate additional information, and provide excellent interpretability while maintaining high computational efficiency. |
Authors: | Wei Liu [aut, cre], Jiakun Jiang [aut], Dewei Xiang [aut], Xuancheng Zhou [aut] |
Maintainer: | Wei Liu <[email protected]> |
License: | GPL-3 |
Version: | 1.1 |
Built: | 2024-10-24 04:28:53 UTC |
Source: | https://github.com/feiyoung/cmgfm |
Fit the covariate-augmented generalized factor model
CMGFM( XList, Z, types, numvarmat, q = 15, Alist = NULL, init = c("LFM", "GFM", "random"), maxIter = 30, epsELBO = 1e-08, verbose = TRUE, add_IC_iter = FALSE, seed = 1 )
XList |
a list consisting of multiple matrices in which each matrix has the same type of values, i.e., continuous, or count, or binomial/binary values. |
Z |
a matrix, the fixed-dimensional covariate matrix with control variables. |
types |
a string vector, specifying the variable type of each matrix in XList. |
numvarmat |
a length(types)-by-d matrix, the number of variables in modalities that belong to the same type. |
q |
an optional integer, specifying the number of factors; the default is 15. |
Alist |
an optional vector, the offset for each unit; default as full-zero vector. |
init |
an optional character string, specifying the initialization method: one of "LFM", "GFM", or "random". |
maxIter |
the maximum iteration of the VEM algorithm. The default is 30. |
epsELBO |
an optional positive value, tolerance of relative variation rate of the evidence lower bound value, default as '1e-8'. |
verbose |
a logical value, whether to print progress information during the iterations. |
add_IC_iter |
a logical value, add the identifiability condition in iterative algorithm or add it after algorithm converges; default as FALSE. |
seed |
an integer, the random seed used in initialization; the default is 1. |
return a list including the following components:
betaf
- the estimated regression coefficient vector for each modality;
Bf
- the estimated loading matrix for each modality;
M
- the estimated modality-shared factor matrix;
Xif
- the estimated modality-specific factor vector;
S
- the estimated covariance matrix of modality-shared latent factors;
Om
- the posterior variance of modality-specific latent factors;
muf
- the estimated intercept vector for each modality;
Sigmam
- the variance of modality-specific factors;
invLambdaf
- the inverse of the estimated variances of error for each modality.
ELBO
- the ELBO value when algorithm stops;
ELBO_seq
- the sequence of ELBO values.
time_use
- the running time in model fitting;
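The `maxIter` and `epsELBO` arguments together decide when the iterations stop, and `ELBO_seq` records the objective along the way. A minimal base-R sketch of such a stopping rule follows. It is not the package's code: `run_until_converged` and `toy_update` are hypothetical names, the toy update stands in for one VEM sweep, and the exact form of the relative-change check used by `CMGFM` is an assumption.

```r
# Hypothetical helper (not part of the package): repeat `update_fn` until the
# relative change in the objective drops below `eps_elbo` or `max_iter` sweeps
# have been run. `update_fn` must return a list with an `elbo` element.
run_until_converged <- function(state, update_fn, max_iter = 30, eps_elbo = 1e-8) {
  elbo_seq <- numeric(0)
  for (iter in seq_len(max_iter)) {
    state <- update_fn(state)
    elbo_seq <- c(elbo_seq, state$elbo)
    if (iter > 1) {
      prev <- elbo_seq[iter - 1]
      rel_change <- abs(state$elbo - prev) / abs(prev)
      if (rel_change < eps_elbo) break
    }
  }
  list(state = state, ELBO = state$elbo, ELBO_seq = elbo_seq)
}

# Toy stand-in for one update: the objective climbs towards -100,
# with the remaining gap shrinking tenfold per sweep.
toy_update <- function(state) {
  state$gap <- state$gap * 0.1
  state$elbo <- -100 - state$gap
  state
}

fit <- run_until_converged(list(gap = 50, elbo = NA), toy_update,
                           max_iter = 30, eps_elbo = 1e-8)
print(length(fit$ELBO_seq))  # number of sweeps actually run
print(fit$ELBO)              # objective value at the stopping point
```

With a real fit, plotting `ELBO_seq` against the iteration index is a quick way to see whether the run stopped because of the tolerance or because it hit `maxIter`.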
pveclist <- list('gaussian'=c(50, 150), 'poisson'=c(50, 150), 'binomial'=c(100, 60))
q <- 6
sigmavec <- rep(1, 3)
pvec <- unlist(pveclist)
datlist <- gendata_cmgfm(pveclist = pveclist, seed = 1, n = 300, d = 3, q = q,
                         rho = rep(1, length(pveclist)), rho_z = 0.2,
                         sigmavec = sigmavec, sigma_eps = 1)
XList <- datlist$XList
Z <- datlist$Z
numvarmat <- datlist$numvarmat
types <- datlist$types
rlist <- CMGFM(XList, Z, types = types, numvarmat, q = q)
str(rlist)
Generate simulated data from the covariate-augmented generalized factor model
gendata_cmgfm( seed = 1, n = 300, pveclist = list(gaussian = c(50, 150), poisson = c(50), binomial = c(100, 60)), q = 6, d = 3, rho = rep(1, length(pveclist)), rho_z = 1, sigmavec = rep(0.5, length(pveclist)), n_bin = 1, sigma_eps = 1, seed.para = 1 )
seed |
a positive integer, the random seed for reproducibility of data generation process. |
n |
a positive integer, specify the sample size. |
pveclist |
a named list, specify the number of modalities for each variable type and dimension of variables in each modality. |
q |
a positive integer, specify the number of modality-shared factors. |
d |
a positive integer, specify the dimension of covariate matrix. |
rho |
a numeric vector with length length(pveclist). |
rho_z |
a positive real, specify the signal strength of covariates. |
sigmavec |
a positive vector with length length(pveclist). |
n_bin |
a positive integer, specifying the number of trials in the binomial distribution. |
sigma_eps |
a positive real, the variance of overdispersion error. |
seed.para |
a positive integer, the random seed for reproducibility of data generation process by fixing the regression coefficient vector and loading matrices. |
return a list including the following components:
XList
- a list consisting of multiple matrices in which each matrix has the same type of values, i.e., continuous, or count, or binomial/binary values.
Z
- a matrix, the fixed-dimensional covariate matrix with control variables;
Alist
- the offset vector for each modality;
B0list
- the true loading matrix for each modality;
mu0
- the true intercept vector for each modality;
U0
- the modality-specific factor vector;
F0
- the modality-shared factor matrix;
Uplist
- the true intercept-loading matrix for each modality;
beta
- the true regression coefficient vector for each modality;
sigma_eps
- the standard deviation of error term;
numvarmat
- a length(types)-by-d matrix, the number of variables in modalities that belong to the same type.
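To make the roles of the returned pieces concrete, the following base-R sketch simulates three small modalities (Gaussian, Poisson, binomial) that share one factor matrix and one covariate matrix. It is a simplified illustration, not the generator used by `gendata_cmgfm`: the helper `linear_predictor`, the link functions (identity, log, logit), and all dimensions are assumptions, and modality-specific factors, offsets, and overdispersion are left out.

```r
set.seed(1)
n <- 100   # sample size
q <- 2     # number of modality-shared factors
d <- 3     # number of covariates

F0 <- matrix(rnorm(n * q), n, q)   # shared factor matrix
Z  <- matrix(rnorm(n * d), n, d)   # covariate matrix

# Hypothetical helper: n-by-p linear predictor for one modality with p variables.
linear_predictor <- function(p) {
  B    <- matrix(rnorm(p * q, sd = 0.5), p, q)  # loading matrix
  beta <- rnorm(d, sd = 0.2)                    # regression coefficient vector
  mu   <- rnorm(p, sd = 0.5)                    # intercept vector
  eta  <- F0 %*% t(B) + as.vector(Z %*% beta)   # covariate effect recycled down each column
  sweep(eta, 2, mu, "+")                        # add one intercept per variable
}

p_gauss <- 5; p_pois <- 4; p_bin <- 6
XList <- list(
  gaussian = linear_predictor(p_gauss) + matrix(rnorm(n * p_gauss), n, p_gauss),
  poisson  = matrix(rpois(n * p_pois, lambda = exp(linear_predictor(p_pois))),
                    n, p_pois),
  binomial = matrix(rbinom(n * p_bin, size = 1,
                           prob = plogis(linear_predictor(p_bin))),
                    n, p_bin)
)
types <- names(XList)
str(XList)
```

Each matrix in `XList` holds a single type of value, which is the property the `XList` and `types` arguments of `CMGFM` rely on.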
n <- 300
pveclist <- list('gaussian'=c(50, 150), 'poisson'=c(50), 'binomial'=c(100, 60))
d <- 20
q <- 6
datlist <- gendata_cmgfm(n = n, pveclist = pveclist, q = q, d = d)
str(datlist)
Select the number of factors using maximum singular value ratio based method
MSVR( XList, Z, types, numvarmat, Alist = NULL, q_max = 20, threshold = 1e-05, ... )
XList |
a list consisting of multiple matrices in which each matrix has the same type of values, i.e., continuous, or count, or binomial/binary values. |
Z |
a matrix, the fixed-dimensional covariate matrix with control variables. |
types |
a string vector, specifying the variable type of each matrix in XList. |
numvarmat |
a length(types)-by-d matrix, the number of variables in modalities that belong to the same type. |
Alist |
an optional vector, the offset for each unit; default as full-zero vector. |
q_max |
an optional integer, specifying the maximum number of factors; the default is 20. |
threshold |
an optional positive value, a cutoff to filter the singular values that are smaller than it. |
... |
other arguments passed to CMGFM |
return the estimated number of factors.
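The idea of a singular value ratio criterion can be sketched in base R on a generic matrix: discard singular values below `threshold`, keep at most `q_max` of them, and choose the position where the ratio of consecutive singular values is largest. This is an illustration under assumptions, not the package's implementation: `select_q_by_sv_ratio` is a hypothetical helper, and which matrix `MSVR` applies the criterion to is not stated here.

```r
# Hypothetical helper (not part of the package): estimate the number of factors
# as the index k that maximises sv[k] / sv[k + 1] among the retained singular values.
select_q_by_sv_ratio <- function(B, q_max = 20, threshold = 1e-5) {
  sv <- svd(B, nu = 0, nv = 0)$d                # singular values, in decreasing order
  sv <- sv[seq_len(min(q_max, length(sv)))]     # keep at most q_max of them
  sv <- sv[sv > threshold]                      # filter out negligible values
  if (length(sv) < 2) return(length(sv))        # no ratio can be formed
  ratios <- sv[-length(sv)] / sv[-1]            # consecutive ratios
  which.max(ratios)
}

# Toy check on a noisy matrix with three strong factors.
set.seed(1)
n <- 200; p <- 50; q_true <- 3
F0 <- matrix(rnorm(n * q_true), n, q_true)
B0 <- matrix(rnorm(p * q_true), p, q_true)
X  <- F0 %*% t(B0) + matrix(rnorm(n * p, sd = 0.1), n, p)

q_hat <- select_q_by_sv_ratio(X, q_max = 20)
print(c(q_true = q_true, q_est = q_hat))
```

The ratio is largest at the gap between the last signal singular value and the first noise one, which is why a large `q_max` relative to the true number of factors does little harm in this toy setting.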
pveclist <- list('gaussian'=c(50, 150), 'poisson'=c(50, 150), 'binomial'=c(100, 60))
q <- 6
sigmavec <- rep(1, 3)
pvec <- unlist(pveclist)
datlist <- gendata_cmgfm(pveclist = pveclist, seed = 1, n = 300, d = 3, q = q,
                         rho = rep(1, length(pveclist)), rho_z = 0.2,
                         sigmavec = sigmavec, sigma_eps = 1)
XList <- datlist$XList
Z <- datlist$Z
numvarmat <- datlist$numvarmat
types <- datlist$types
hq <- MSVR(XList, Z, types = types, numvarmat, q_max = 20)
print(c(q_true = q, q_est = hq))