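This vignette introduces the CoFAST workflow for analyzing the PBMC3k single-cell RNA sequencing dataset. In this vignette, the workflow consists of three steps: preprocessing (normalization and variable gene selection), coembedding of cells and genes, and downstream analysis (signature gene identification and visualization).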
We demonstrate CoFAST on the PBMC3k data, which are provided by the SeuratData package and can be loaded with the following commands:
set.seed(2024) # set a random seed for reproducibility.
library(Seurat)
pbmc3k <- SeuratData::LoadData("pbmc3k")
## keep only the cells whose seurat_annotations is not NA
idx <- which(!is.na(pbmc3k$seurat_annotations))
pbmc3k <- pbmc3k[,idx]
pbmc3k
The package can be loaded with the command:
First, we normalize the data.
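As a rough illustration of what this step computes, the base R sketch below applies log-normalization to a toy count matrix: each cell's counts are divided by the cell's total, multiplied by a scale factor of 10,000, and transformed with log1p. This mirrors the default "LogNormalize" method of Seurat's NormalizeData. The toy matrix and variable names are ours, and in the vignette session the step would typically be a call such as NormalizeData(pbmc3k).

```r
# Toy genes-by-cells count matrix (made-up values, for illustration only).
counts <- matrix(c(1, 3, 0, 6,
                   9, 7, 10, 4),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB"), paste0("cell", 1:4)))

scale_factor <- 1e4              # Seurat's default scale.factor
lib_size <- colSums(counts)      # total counts per cell

# Divide each column by its library size, rescale, then apply log(1 + x).
norm <- log1p(sweep(counts, 2, lib_size, "/") * scale_factor)

# In the vignette session (assumption: Seurat is attached and pbmc3k exists):
# pbmc3k <- NormalizeData(pbmc3k)
```

After this transformation, every cell has the same total on the original scale, so differences in sequencing depth no longer dominate the comparison between cells.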
Then, we select the variable genes.
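The idea behind variable gene selection can be sketched in base R by ranking genes by their variance across cells and keeping the top ones. This is a deliberate simplification: Seurat's FindVariableFeatures uses a variance-stabilizing method ("vst") by default, which also accounts for the mean-variance trend. The toy matrix and the cutoff of two genes are assumptions made for illustration.

```r
# Toy genes-by-cells expression matrix (made-up values).
expr <- rbind(geneA = c(1, 1, 1, 1),
              geneB = c(0, 4, 0, 4),
              geneC = c(1, 2, 1, 2),
              geneD = c(0, 0, 0, 9))

# Per-gene variance across cells.
gene_var <- apply(expr, 1, var)

# Keep the two most variable genes.
top_genes <- head(names(sort(gene_var, decreasing = TRUE)), 2)
top_genes

# In the vignette session (assumption: Seurat is attached and pbmc3k exists):
# pbmc3k <- FindVariableFeatures(pbmc3k)
```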
We now show how to use the non-centered factor model (NCFM) to compute coembeddings for these scRNA-seq data. First, we determine the dimension of the coembeddings. Here, we use the parallel analysis method to select the dimension.
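For readers unfamiliar with parallel analysis, the following base R sketch shows the generic idea on simulated data: the eigenvalues of the observed correlation matrix are compared with those obtained after permuting each variable independently, and the leading eigenvalues that exceed the permutation threshold are counted. This is a textbook version written for illustration, not the package's own routine; the simulated data, the 95% quantile, and the number of permutations are all assumptions.

```r
set.seed(1)
n <- 200; p <- 10; q_true <- 2

# Simulate data with two latent factors plus noise.
fac  <- matrix(rnorm(n * q_true), n, q_true)
load <- matrix(rnorm(p * q_true), p, q_true)
X <- fac %*% t(load) + matrix(rnorm(n * p), n, p)

# Eigenvalues of the observed correlation matrix.
eig_obs <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values

# Null eigenvalues: permute each column independently to destroy correlation.
n_perm <- 50
eig_null <- replicate(n_perm, {
  X_perm <- apply(X, 2, sample)
  eigen(cor(X_perm), symmetric = TRUE, only.values = TRUE)$values
})

# Threshold per eigenvalue rank: 95% quantile over the permutations.
threshold <- apply(eig_null, 1, quantile, probs = 0.95)

# Number of leading eigenvalues that exceed their threshold.
q_hat <- sum(cumprod(eig_obs > threshold))
q_hat
```

Counting only the leading run of eigenvalues above the threshold (via cumprod) avoids selecting a later component that happens to exceed its null quantile by chance.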
Subsequently, we calculate coembeddings by utilizing NCFM, and observe that the reductions field acquires an additional component named ncfm.
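To give some intuition for what a coembedding is, the sketch below uses a truncated SVD of an uncentered toy matrix as a rough stand-in: it yields one q-dimensional vector per gene and one per cell, all living in the same space. This is not the NCFM estimator itself, only a simplified analogue under our own assumptions (simulated data, q = 3, and a symmetric split of the singular values between the two sets of embeddings).

```r
set.seed(1)
n_genes <- 30; n_cells <- 50; q <- 3

# Toy non-negative genes-by-cells matrix (simulated, not real expression data).
X <- matrix(rexp(n_genes * n_cells), n_genes, n_cells,
            dimnames = list(paste0("gene", 1:n_genes), paste0("cell", 1:n_cells)))

# Truncated SVD without centering the data.
sv <- svd(X, nu = q, nv = q)
scale_q <- diag(sqrt(sv$d[1:q]))

# Gene and cell embeddings share the same q-dimensional space.
gene_emb <- sv$u %*% scale_q
cell_emb <- sv$v %*% scale_q
rownames(gene_emb) <- rownames(X)
rownames(cell_emb) <- colnames(X)

# Rank-q reconstruction of the original matrix.
X_hat <- gene_emb %*% t(cell_emb)
```

Because genes and cells are represented in one space, distances between a gene vector and a cell vector become meaningful, which is what the downstream signature gene analysis relies on.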
In the following, we show how to find the signature genes based on the coembeddings. First, we calculate the distance matrix between genes and cells in the coembedding space.
Next, we find the signature genes for each cell type.
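The structure of such a distance matrix can be sketched in base R: given gene embeddings and cell embeddings in the same space, each entry is the distance between one gene and one cell. The sketch uses Euclidean distance and made-up two-dimensional embeddings; the metric actually used by the package may differ.

```r
# Toy embeddings in a shared 2-dimensional space (made-up values).
gene_emb <- matrix(c(0, 0,
                     3, 4),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(c("geneA", "geneB"), NULL))
cell_emb <- matrix(c(0, 0,
                     3, 4,
                     6, 8),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(c("cell1", "cell2", "cell3"), NULL))

# Squared distances via |g|^2 + |c|^2 - 2 * g.c, then the square root.
sq <- outer(rowSums(gene_emb^2), rowSums(cell_emb^2), "+") -
  2 * gene_emb %*% t(cell_emb)
dist_mat <- sqrt(pmax(sq, 0))   # pmax guards against tiny negative values
dist_mat
```

Rows correspond to genes and columns to cells, so the genes closest to the cells of one type can be read off by comparing the entries within the columns belonging to that type.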
print(table(pbmc3k$seurat_annotations))
Idents(pbmc3k) <- pbmc3k$seurat_annotations
df_sig_list <- find.signature.genes(pbmc3k)
str(df_sig_list)
Then, we obtain the top five signature genes for each cell type and organize them into a data.frame. The column distance is the distance between a gene (e.g., VPREB3) and the cells of a specific cell type (e.g., B cell), calculated from the coembeddings of genes and cells in the coembedding space. The smaller the distance, the stronger the association between the gene and the cell type. The column expr.prop is the expression proportion of the gene (e.g., VPREB3) within the cell type (e.g., B cell). The column label gives the cell type and the column gene gives the gene name. From this data.frame, we see that VPREB3 is one of the top signature genes of B cells.
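The selection described above can be sketched in base R on toy data: for each cell type, genes with a low expression proportion are dropped, the rest are sorted by distance, and the closest ones are kept. The helper top_signature, its argument names, the cutoff of 0.1, and all values in the toy list are hypothetical; in the vignette session, the object dat used in the following steps is presumably built from df_sig_list by a package helper whose call is not shown here.

```r
# Toy stand-in for df_sig_list: one data.frame per cell type (made-up values).
toy_sig_list <- list(
  B = data.frame(distance  = c(0.9, 0.2, 0.5, 0.1),
                 expr.prop = c(0.8, 0.6, 0.05, 0.7),
                 label     = "B",
                 gene      = c("g1", "g2", "g3", "g4")),
  NK = data.frame(distance  = c(0.3, 0.4, 0.6),
                  expr.prop = c(0.9, 0.2, 0.5),
                  label     = "NK",
                  gene      = c("g5", "g6", "g7"))
)

# Hypothetical helper: top genes per cell type, ranked by smallest distance.
top_signature <- function(sig_list, ntop = 5, expr_prop_cutoff = 0.1) {
  picked <- lapply(sig_list, function(df) {
    df <- df[df$expr.prop > expr_prop_cutoff, , drop = FALSE]  # expressed genes only
    df <- df[order(df$distance), , drop = FALSE]               # closest first
    head(df, ntop)
  })
  out <- do.call(rbind, picked)
  rownames(out) <- NULL
  out
}

toy_dat <- top_signature(toy_sig_list, ntop = 2)
toy_dat
```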
Next, we calculate the UMAP projections of coembeddings of cells and the selected signature genes.
pbmc3k <- coembedding_umap(
  pbmc3k, reduction = "ncfm", reduction.name = "UMAP",
  gene.set = unique(dat$gene))
Furthermore, we visualize the cells and the top five signature genes of B cells in the UMAP space of the coembeddings. We observe that the UMAP projections of the five signature genes lie close to the B cells, which indicates that these genes are enriched in B cells.
## choose colors for the clusters
cols_cluster <- c("black", PRECAST::chooseColors(palettes_name = "Light 13", n_colors = 9, plot_colors = TRUE))
p1 <- coembed_plot(
  pbmc3k, reduction = "UMAP",
  gene_txtdata = subset(dat, label == 'B'),
  cols = cols_cluster, pt_text_size = 3)
p1
Then, we visualize the cells and the top five signature genes of all cell types in the UMAP space of the coembeddings. We observe that the UMAP projections of the five signature genes lie close to the corresponding cell type, which indicates that these genes are enriched in the corresponding cells.
p2 <- coembed_plot(
  pbmc3k, reduction = "UMAP",
  gene_txtdata = dat, cols = cols_cluster,
  pt_text_size = 3)
p2
In addition, we can take full advantage of the visualization functions in the Seurat package. The following example visualizes the cell types in the UMAP space.
cols_type <- cols_cluster[-1]
names(cols_type) <- sort(levels(Idents(pbmc3k)))
DimPlot(pbmc3k, reduction = 'UMAP', cols = cols_type)
As another example, we plot two signature genes of B cells in the UMAP space and observe high expression in B cells in contrast to other cell types.
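One way to produce such a plot is sketched below, using Seurat's FeaturePlot on the UMAP reduction. The helper pick_genes is hypothetical and simply takes the first two B-cell genes from dat, since the two genes actually plotted are not named here. The plotting call only runs in a session where pbmc3k and dat already exist and Seurat is installed.

```r
# Hypothetical helper: first n signature genes of one cell type from a
# dat-like data.frame with columns label and gene.
pick_genes <- function(dat, cell_type, n = 2) {
  head(as.character(dat$gene[dat$label == cell_type]), n)
}

# Runs only inside the vignette session (assumes pbmc3k and dat from above).
if (exists("pbmc3k") && exists("dat") &&
    requireNamespace("Seurat", quietly = TRUE)) {
  b_genes <- pick_genes(dat, "B", n = 2)
  print(Seurat::FeaturePlot(pbmc3k, reduction = "UMAP", features = b_genes))
}
```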
sessionInfo()
#> R version 4.4.2 (2024-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.1 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] rmarkdown_2.29
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.37 R6_2.5.1 fastmap_1.2.0 xfun_0.49
#> [5] maketools_1.3.1 cachem_1.1.0 knitr_1.49 htmltools_0.5.8.1
#> [9] buildtools_1.0.0 lifecycle_1.0.4 cli_3.6.3 sass_0.4.9
#> [13] jquerylib_0.1.4 compiler_4.4.2 sys_3.4.3 tools_4.4.2
#> [17] evaluate_1.0.1 bslib_0.8.0 yaml_2.3.10 jsonlite_1.8.9
#> [21] rlang_1.1.4