Statistics Functions

scCustomize contains a couple simple helper functions to return metrics of interest.

For this vignette I will be utilizing pbmc3k dataset from the SeuratData package.

# Load Packages
library(ggplot2)
library(dplyr)
library(magrittr)
library(Seurat)
library(scCustomize)
library(qs)

# Load example dataset for tutorial
pbmc <- pbmc3k.SeuratData::pbmc3k.final

Now let’s add some extra meta data for use with tutorial

# Add mito and ribo data
pbmc <- Add_Mito_Ribo(object = pbmc, species = "human")

# Add random sample and group variables
pbmc$orig.ident <- sample(c("sample1", "sample2", "sample3", "sample4"), size = ncol(pbmc), replace = TRUE)
pbmc@meta.data$group[pbmc@meta.data$orig.ident == "sample1" | pbmc@meta.data$orig.ident == "sample3"] <- "Group 1"
pbmc@meta.data$group[pbmc@meta.data$orig.ident == "sample2" | pbmc@meta.data$orig.ident == "sample4"] <- "Group 2"

# Add dummy module score
pbmc <- AddModuleScore(object = pbmc, features = list(c("CD3E", "CD4", "THY1", "TCRA")), name = "module_score")

Cells Per Identity

It can be really helpful to know the number and percentage of cells per identity/cluster during analysis.
scCustomize provides the Cluster_Stats_All_Samples function to make this easy.

cluster_stats <- Cluster_Stats_All_Samples(seurat_object = pbmc)

cluster_stats
Cluster Number Freq sample1 sample2 sample3 sample4 sample1_% sample2_% sample3_% sample4_%
Naive CD4 T 697 26.4215315 178 189 171 159 25.7971014 29.0322581 25.9878419 24.8826291
Memory CD4 T 483 18.3093252 124 122 117 120 17.9710145 18.7403994 17.7811550 18.7793427
CD14+ Mono 480 18.1956027 134 114 115 117 19.4202899 17.5115207 17.4772036 18.3098592
B 344 13.0401820 106 67 83 88 15.3623188 10.2918587 12.6139818 13.7715180
CD8 T 271 10.2729340 62 65 82 62 8.9855072 9.9846390 12.4620061 9.7026604
FCGR3A+ Mono 162 6.1410159 31 45 47 39 4.4927536 6.9124424 7.1428571 6.1032864
NK 155 5.8756634 44 42 28 41 6.3768116 6.4516129 4.2553191 6.4162754
DC 32 1.2130402 7 6 10 9 1.0144928 0.9216590 1.5197568 1.4084507
Platelet 14 0.5307051 4 1 5 4 0.5797101 0.1536098 0.7598784 0.6259781
Total 2638 100.0000000 690 651 658 639 100.0000000 100.0000000 100.0000000 100.0000000

By default Cluster_Stats_All_Samples uses “orig.ident” as the group variable but user can specify any @meta.data slot variable.

cluster_stats <- Cluster_Stats_All_Samples(seurat_object = pbmc, group_by_var = "group")

cluster_stats
Cluster Number Freq Group 1 Group 2 Group 1_% Group 2_%
Naive CD4 T 697 26.4215315 349 348 25.8902077 26.9767442
Memory CD4 T 483 18.3093252 241 242 17.8783383 18.7596899
CD14+ Mono 480 18.1956027 249 231 18.4718101 17.9069767
B 344 13.0401820 189 155 14.0207715 12.0155039
CD8 T 271 10.2729340 144 127 10.6824926 9.8449612
FCGR3A+ Mono 162 6.1410159 78 84 5.7863501 6.5116279
NK 155 5.8756634 72 83 5.3412463 6.4341085
DC 32 1.2130402 17 15 1.2611276 1.1627907
Platelet 14 0.5307051 9 5 0.6676558 0.3875969
Total 2638 100.0000000 1348 1290 100.0000000 100.0000000

Percent of Cells Expressing Feature(s)

It can also be informative to understand the percent of cells/nuclei that express a given feature or set of features.
scCustomize provides the Percent_Expressing function to return these results.

percent_express <- Percent_Expressing(seurat_object = pbmc, features = c("CD4", "CD8A"))
Naive.CD4.T Memory.CD4.T CD14..Mono B CD8.T FCGR3A..Mono NK DC Platelet
CD4 6.743185 15.942029 24.375 1.453488 2.583026 27.160494 0.6451613 34.375 0
CD8A 13.342898 8.902691 1.875 2.616279 50.553505 3.703704 8.3870968 3.125 0

Change grouping variable

By default the function groups expression across @active.ident (see above) but user can specify difference variable if desired.

percent_express <- Percent_Expressing(seurat_object = pbmc, features = c("CD4", "CD8A"), group_by = "orig.ident")
sample2 sample4 sample1 sample3
CD4 14.28571 9.076682 10.72464 12.76596
CD8A 11.98157 10.172144 12.02899 12.91793

Split within groups

User can also supply a split_by variable to quantify expression within group split by meta data variable

percent_express <- Percent_Expressing(seurat_object = pbmc, features = c("CD4", "CD8A"), split_by = "group")
Naive.CD4.T_Group.2 Naive.CD4.T_Group.1 Memory.CD4.T_Group.1 Memory.CD4.T_Group.2 CD14..Mono_Group.1 CD14..Mono_Group.2 B_Group.2 B_Group.1 CD8.T_Group.1 CD8.T_Group.2 FCGR3A..Mono_Group.1 FCGR3A..Mono_Group.2 NK_Group.1 NK_Group.2 DC_Group.2 DC_Group.1 Platelet_Group.1 Platelet_Group.2
CD4 7.758621 5.730659 15.767635 16.11570 25.702811 22.943723 1.290323 1.587302 3.472222 1.574803 25.641026 28.571429 1.388889 0.00000 26.666667 41.17647 0 0
CD8A 12.643678 14.040115 9.958506 7.85124 2.008032 1.731602 2.580645 2.645503 56.250000 44.094488 1.282051 5.952381 4.166667 12.04819 6.666667 0.00000 0 0

Set threshold of expression

Users can also supply a threshold value to quantify percent of cells expressing feature(s) above a certain threshold (be sure to note which slot is being used to quantify expression when setting thresholds).
NOTE: Percent_Expressing currently only supports single threshold across all features

percent_express <- Percent_Expressing(seurat_object = pbmc, features = c("CD4", "CD8A"), threshold = 2)
Naive.CD4.T Memory.CD4.T CD14..Mono B CD8.T FCGR3A..Mono NK DC Platelet
CD4 1.147776 0.621118 5.8333333 0.5813953 0.3690037 1.851852 0.000000 3.125 0
CD8A 6.456241 2.070393 0.2083333 0.8720930 24.7232472 0.000000 3.225807 0.000 0

Calculate Summary Median Values

scCustomize contains function Median_Stats to quickly calculate the medians for basic QC stats (Genes/, UMIs/, % Mito/Cell).

Basic Use

By default Median_Stats will calculate medians for the following meta data columns (if present): “nCount_RNA”, “nFeature_RNA”, “percent_mito”, “percent_ribo”, “percent_mito_ribo”.

median_stats <- Median_Stats(seurat_object = pbmc, group_by_var = "orig.ident")
orig.ident Median_nCount_RNA Median_nFeature_RNA Median_percent_mito Median_percent_ribo Median_percent_mito_ribo
sample1 2189.0 810.0 2.045561 37.30009 39.50258
sample2 2193.0 817.0 1.937208 36.68319 39.02439
sample3 2204.5 821.5 2.006391 36.89451 38.77124
sample4 2250.0 829.0 2.076271 36.90902 38.92387
Totals (All Cells) 2213.0 819.0 2.010702 36.92524 39.03276

Additional Variables

In addition to default variables, users can supply their own additional meta data columns to calculate medians using the median_var parameter.
NOTE: The meta data column must be in numeric format (e.g., integer, numeric, dbl)

median_stats <- Median_Stats(seurat_object = pbmc, group_by_var = "orig.ident", median_var = "module_score1")
orig.ident Median_nCount_RNA Median_nFeature_RNA Median_percent_mito Median_percent_ribo Median_percent_mito_ribo Median_module_score1
sample1 2189.0 810.0 2.045561 37.30009 39.50258 -0.1551981
sample2 2193.0 817.0 1.937208 36.68319 39.02439 0.0076579
sample3 2204.5 821.5 2.006391 36.89451 38.77124 -0.0939919
sample4 2250.0 829.0 2.076271 36.90902 38.92387 -0.1788730
Totals (All Cells) 2213.0 819.0 2.010702 36.92524 39.03276 -0.0922378

Calculate Median Absolute Deviations

In addition to calculating the median values scCustomize includes function to calculate the median absolute deviation for each of those features with function MAD_Stats. By setting the parameter mad_num the function will retrun the MAD*mad_num.

mad <- MAD_Stats(seurat_object = pbmc, group_by_var = "orig.ident", mad_num = 2)
orig.ident MAD x 2 nCount_RNA MAD x 2 nFeature_RNA MAD x 2 percent_mito MAD x 2 percent_ribo MAD x 2 percent_mito_ribo
sample1 736.8522 197.1858 0.8075960 11.35120 11.25176
sample2 650.8614 177.9120 0.7286244 12.60023 12.21798
sample3 742.0413 194.2206 0.7812998 11.74065 11.54864
sample4 736.8522 194.2206 0.7626530 11.75971 11.87102
Totals (All Cells) 721.2849 189.7728 0.7721931 11.86828 11.69341

Plotting Median Data

scCustomize also contains series of functions for plotting the results of these median calculations:

Plot_Median_Genes(seurat_object = pbmc)

Plot_Median_Genes(seurat_object = pbmc, group_by = "group")

Plot_Median_Other(seurat_object = pbmc, median_var = "module_score1", group_by = "group")