Statistics Functions

scCustomize contains a couple simple helper functions to return metrics of interest.

For this vignette I will be utilizing pbmc3k dataset from the SeuratData package.

# Load Packages
library(ggplot2)
library(patchwork)
library(dplyr)
library(magrittr)
library(Seurat)
library(scCustomize)
library(qs)

# Load example dataset for tutorial
pbmc <- pbmc3k.SeuratData::pbmc3k.final

Now let’s add some extra meta data for use with tutorial

# Add mito and ribo data
pbmc <- Add_Mito_Ribo(object = pbmc, species = "human")

# Add random sample and group variables
pbmc$orig.ident <- sample(c("sample1", "sample2", "sample3", "sample4"), size = ncol(pbmc), replace = TRUE)
pbmc@meta.data$group[pbmc@meta.data$orig.ident == "sample1" | pbmc@meta.data$orig.ident == "sample3"] <- "Group 1"
pbmc@meta.data$group[pbmc@meta.data$orig.ident == "sample2" | pbmc@meta.data$orig.ident == "sample4"] <- "Group 2"


pbmc@meta.data$treatment[pbmc@meta.data$orig.ident == "sample1"] <- "Control"
pbmc@meta.data$treatment[pbmc@meta.data$orig.ident == "sample2" | pbmc@meta.data$orig.ident == "sample4" |
    pbmc@meta.data$orig.ident == "sample3"] <- "Treated"

# Add dummy module score
pbmc <- AddModuleScore(object = pbmc, features = list(c("CD3E", "CD4", "THY1", "TCRA")), name = "module_score")

Cells Per Identity

It can be really helpful to know the number and percentage of cells per identity/cluster during analysis.
scCustomize provides the Cluster_Stats_All_Samples function to make this easy.

cluster_stats <- Cluster_Stats_All_Samples(seurat_object = pbmc)

cluster_stats

Cluster	Number	Freq	sample1	sample2	sample3	sample4	sample1_%	sample2_%	sample3_%	sample4_%
Naive CD4 T	697	26.4215315	182	176	179	160	26.9629630	26.3079223	27.2036474	25.1572327
Memory CD4 T	483	18.3093252	141	127	112	103	20.8888889	18.9835575	17.0212766	16.1949686
CD14+ Mono	480	18.1956027	118	126	117	119	17.4814815	18.8340807	17.7811550	18.7106918
B	344	13.0401820	79	90	78	97	11.7037037	13.4529148	11.8541033	15.2515723
CD8 T	271	10.2729340	60	65	78	68	8.8888889	9.7159940	11.8541033	10.6918239
FCGR3A+ Mono	162	6.1410159	42	34	40	46	6.2222222	5.0822123	6.0790274	7.2327044
NK	155	5.8756634	43	37	41	34	6.3703704	5.5306428	6.2310030	5.3459119
DC	32	1.2130402	7	10	9	6	1.0370370	1.4947683	1.3677812	0.9433962
Platelet	14	0.5307051	3	4	4	3	0.4444444	0.5979073	0.6079027	0.4716981
Total	2638	100.0000000	675	669	658	636	100.0000000	100.0000000	100.0000000	100.0000000

By default Cluster_Stats_All_Samples uses “orig.ident” as the group variable but user can specify any @meta.data slot variable.

cluster_stats <- Cluster_Stats_All_Samples(seurat_object = pbmc, group_by_var = "group")

cluster_stats

Cluster	Number	Freq	Group 1	Group 2	Group 1_%	Group 2_%
Naive CD4 T	697	26.4215315	361	336	27.0817704	25.7471264
Memory CD4 T	483	18.3093252	253	230	18.9797449	17.6245211
CD14+ Mono	480	18.1956027	235	245	17.6294074	18.7739464
B	344	13.0401820	157	187	11.7779445	14.3295019
CD8 T	271	10.2729340	138	133	10.3525881	10.1915709
FCGR3A+ Mono	162	6.1410159	82	80	6.1515379	6.1302682
NK	155	5.8756634	84	71	6.3015754	5.4406130
DC	32	1.2130402	16	16	1.2003001	1.2260536
Platelet	14	0.5307051	7	7	0.5251313	0.5363985
Total	2638	100.0000000	1333	1305	100.0000000	100.0000000

Plotting Cluster/Identity Proportions

Understanding the proportion of cells per cluster and how that may change across a group can be important aspect of single cell analysis. scCustomize contains function Proportion_Plot to provide some visualization of cell proportions.

Plotting proportionality can be tricky especially when dealing with clusters of very disparate sizes (especially small clusters) so plese be aware to examine raw data as well (see Cluster_Stats_All_Samples()).

Also note that when splitting plot by group there are now error bars on these plots for interpretation of variance and other plotting methods should be used to convery such information.

Basic Proportion Plots

Proportion_Plot can either return a bar graph or pie graph depending on parameter value for

Proportion_Plot(seurat_object = pbmc, plot_type = "bar")

Proportion_Plot(seurat_object = pbmc, plot_type = "pie")

Plot across groups

Like other plots you can also use Proportion_Plot to split plot across any categorical meta.data variable.

Proportion_Plot(seurat_object = pbmc, plot_type = "bar", split.by = "treatment")

Proportion_Plot(seurat_object = pbmc, plot_type = "pie", split.by = "treatment")

Plot raw cell numbers

The default for Proportion_Plot is to plot the proportion of cells and not the raw number of cells per identity because that can be deceptive. However, in some cases it can be useful and so users can change the plot_scale parameter to plot raw numbers of cells instead of proportions.

Proportion_Plot(seurat_object = pbmc, plot_type = "bar", split.by = "treatment", plot_scale = "count")

Percent of Cells Expressing Feature(s)

It can also be informative to understand the percent of cells/nuclei that express a given feature or set of features.
scCustomize provides the Percent_Expressing function to return these results.

percent_express <- Percent_Expressing(seurat_object = pbmc, features = c("CD4", "CD8A"))

	Naive.CD4.T	Memory.CD4.T	CD14..Mono	B	CD8.T	FCGR3A..Mono	NK	DC	Platelet
CD4	6.743185	15.942029	24.375	1.453488	2.583026	27.160494	0.6451613	34.375	0
CD8A	13.342898	8.902691	1.875	2.616279	50.553505	3.703704	8.3870968	3.125	0

Change grouping variable

By default the function groups expression across @active.ident (see above) but user can specify difference variable if desired.

percent_express <- Percent_Expressing(seurat_object = pbmc, features = c("CD4", "CD8A"), group_by = "orig.ident")

	sample3	sample1	sample4	sample2
CD4	11.39818	11.55556	11.79245	12.10762
CD8A	12.91793	12.74074	11.00629	10.46338

Split within groups

User can also supply a split_by variable to quantify expression within group split by meta data variable

percent_express <- Percent_Expressing(seurat_object = pbmc, features = c("CD4", "CD8A"), split_by = "group")

	Naive.CD4.T_Group.1	Naive.CD4.T_Group.2	Memory.CD4.T_Group.1	Memory.CD4.T_Group.2	CD14..Mono_Group.1	CD14..Mono_Group.2	B_Group.1	B_Group.2	CD8.T_Group.1	CD8.T_Group.2	FCGR3A..Mono_Group.2	FCGR3A..Mono_Group.1	NK_Group.1	NK_Group.2	DC_Group.2	DC_Group.1	Platelet_Group.1	Platelet_Group.2
CD4	6.371191	7.142857	17.39130	14.347826	24.255319	24.48980	0.000000	2.673797	3.623188	1.503759	31.25	23.170732	0.000000	1.408451	37.5	31.25	0	0
CD8A	13.573407	13.095238	10.67194	6.956522	2.553192	1.22449	1.273885	3.743316	55.797101	45.112782	3.75	3.658537	7.142857	9.859155	0.0	6.25	0	0

Set threshold of expression

Users can also supply a threshold value to quantify percent of cells expressing feature(s) above a certain threshold (be sure to note which slot is being used to quantify expression when setting thresholds).
NOTE: Percent_Expressing currently only supports single threshold across all features

percent_express <- Percent_Expressing(seurat_object = pbmc, features = c("CD4", "CD8A"), threshold = 2)

	Naive.CD4.T	Memory.CD4.T	CD14..Mono	B	CD8.T	FCGR3A..Mono	NK	DC	Platelet
CD4	1.147776	0.621118	5.8333333	0.5813953	0.3690037	1.851852	0.000000	3.125	0
CD8A	6.456241	2.070393	0.2083333	0.8720930	24.7232472	0.000000	3.225807	0.000	0

Calculate Summary Median Values

scCustomize contains function Median_Stats to quickly calculate the medians for basic QC stats (Genes/, UMIs/, % Mito/Cell).

Basic Use

By default Median_Stats will calculate medians for the following meta data columns (if present): “nCount_RNA”, “nFeature_RNA”, “percent_mito”, “percent_ribo”, “percent_mito_ribo”.

median_stats <- Median_Stats(seurat_object = pbmc, group_by_var = "orig.ident")

orig.ident	Median_nCount_RNA	Median_nFeature_RNA	Median_percent_mito	Median_percent_ribo	Median_percent_mito_ribo
sample1	2193	813.0	2.005731	36.91964	39.08969
sample2	2178	812.0	2.016383	37.03217	39.31227
sample3	2241	820.0	1.982099	37.40017	39.24930
sample4	2251	836.5	2.057405	36.16109	38.38636
Totals (All Cells)	2213	819.0	2.010702	36.92524	39.03276

Additional Variables

In addition to default variables, users can supply their own additional meta data columns to calculate medians using the median_var parameter.
NOTE: The meta data column must be in numeric format (e.g., integer, numeric, dbl)

median_stats <- Median_Stats(seurat_object = pbmc, group_by_var = "orig.ident", median_var = "module_score1")

orig.ident	Median_nCount_RNA	Median_nFeature_RNA	Median_percent_mito	Median_percent_ribo	Median_percent_mito_ribo	Median_module_score1
sample1	2193	813.0	2.005731	36.91964	39.08969	-0.0545099
sample2	2178	812.0	2.016383	37.03217	39.31227	-0.0678532
sample3	2241	820.0	1.982099	37.40017	39.24930	-0.0793297
sample4	2251	836.5	2.057405	36.16109	38.38636	-0.1935054
Totals (All Cells)	2213	819.0	2.010702	36.92524	39.03276	-0.0922378

Calculate Median Absolute Deviations

In addition to calculating the median values scCustomize includes function to calculate the median absolute deviation for each of those features with function MAD_Stats. By setting the parameter mad_num the function will retrun the MAD*mad_num.

mad <- MAD_Stats(seurat_object = pbmc, group_by_var = "orig.ident", mad_num = 2)

orig.ident	MAD x 2 nCount_RNA	MAD x 2 nFeature_RNA	MAD x 2 percent_mito	MAD x 2 percent_ribo	MAD x 2 percent_mito_ribo
sample1	1464.809	397.3368	1.640639	23.92318	23.93974
sample2	1319.514	361.7544	1.579981	23.47474	23.15831
sample3	1500.391	392.8890	1.557364	23.57180	23.52481
sample4	1478.152	381.0282	1.451580	24.44613	24.08378
Totals (All Cells)	1442.570	379.5456	1.544386	23.73656	23.38683

Plotting Median Data

scCustomize also contains series of functions for plotting the results of these median calculations:

Plot_Median_Genes(seurat_object = pbmc)

Plot_Median_Genes(seurat_object = pbmc, group_by = "group")

Plot_Median_Other(seurat_object = pbmc, median_var = "module_score1", group_by = "group")

Statistics Functions

Compiled: December 19, 2024