vignettes/articles/Statistics.Rmd
Statistics.Rmd
scCustomize contains a couple simple helper functions to return metrics of interest.
For this vignette I will be utilizing pbmc3k dataset from the SeuratData package.
# Load Packages
library(ggplot2)
library(patchwork)
library(dplyr)
library(magrittr)
library(Seurat)
library(scCustomize)
library(qs)
# Load example dataset for tutorial
pbmc <- pbmc3k.SeuratData::pbmc3k.final
Now let’s add some extra meta data for use with tutorial
# Add mito and ribo data
pbmc <- Add_Mito_Ribo(object = pbmc, species = "human")
# Add random sample and group variables
pbmc$orig.ident <- sample(c("sample1", "sample2", "sample3", "sample4"), size = ncol(pbmc), replace = TRUE)
pbmc@meta.data$group[pbmc@meta.data$orig.ident == "sample1" | pbmc@meta.data$orig.ident == "sample3"] <- "Group 1"
pbmc@meta.data$group[pbmc@meta.data$orig.ident == "sample2" | pbmc@meta.data$orig.ident == "sample4"] <- "Group 2"
pbmc@meta.data$treatment[pbmc@meta.data$orig.ident == "sample1"] <- "Control"
pbmc@meta.data$treatment[pbmc@meta.data$orig.ident == "sample2" | pbmc@meta.data$orig.ident == "sample4" |
pbmc@meta.data$orig.ident == "sample3"] <- "Treated"
# Add dummy module score
pbmc <- AddModuleScore(object = pbmc, features = list(c("CD3E", "CD4", "THY1", "TCRA")), name = "module_score")
It can be really helpful to know the number and percentage of cells
per identity/cluster during analysis.
scCustomize provides the Cluster_Stats_All_Samples
function
to make this easy.
cluster_stats <- Cluster_Stats_All_Samples(seurat_object = pbmc)
cluster_stats
Cluster | Number | Freq | sample1 | sample2 | sample3 | sample4 | sample1_% | sample2_% | sample3_% | sample4_% |
---|---|---|---|---|---|---|---|---|---|---|
Naive CD4 T | 697 | 26.4215315 | 182 | 176 | 179 | 160 | 26.9629630 | 26.3079223 | 27.2036474 | 25.1572327 |
Memory CD4 T | 483 | 18.3093252 | 141 | 127 | 112 | 103 | 20.8888889 | 18.9835575 | 17.0212766 | 16.1949686 |
CD14+ Mono | 480 | 18.1956027 | 118 | 126 | 117 | 119 | 17.4814815 | 18.8340807 | 17.7811550 | 18.7106918 |
B | 344 | 13.0401820 | 79 | 90 | 78 | 97 | 11.7037037 | 13.4529148 | 11.8541033 | 15.2515723 |
CD8 T | 271 | 10.2729340 | 60 | 65 | 78 | 68 | 8.8888889 | 9.7159940 | 11.8541033 | 10.6918239 |
FCGR3A+ Mono | 162 | 6.1410159 | 42 | 34 | 40 | 46 | 6.2222222 | 5.0822123 | 6.0790274 | 7.2327044 |
NK | 155 | 5.8756634 | 43 | 37 | 41 | 34 | 6.3703704 | 5.5306428 | 6.2310030 | 5.3459119 |
DC | 32 | 1.2130402 | 7 | 10 | 9 | 6 | 1.0370370 | 1.4947683 | 1.3677812 | 0.9433962 |
Platelet | 14 | 0.5307051 | 3 | 4 | 4 | 3 | 0.4444444 | 0.5979073 | 0.6079027 | 0.4716981 |
Total | 2638 | 100.0000000 | 675 | 669 | 658 | 636 | 100.0000000 | 100.0000000 | 100.0000000 | 100.0000000 |
By default Cluster_Stats_All_Samples
uses “orig.ident”
as the group variable but user can specify any @meta.data
slot variable.
cluster_stats <- Cluster_Stats_All_Samples(seurat_object = pbmc, group_by_var = "group")
cluster_stats
Cluster | Number | Freq | Group 1 | Group 2 | Group 1_% | Group 2_% |
---|---|---|---|---|---|---|
Naive CD4 T | 697 | 26.4215315 | 361 | 336 | 27.0817704 | 25.7471264 |
Memory CD4 T | 483 | 18.3093252 | 253 | 230 | 18.9797449 | 17.6245211 |
CD14+ Mono | 480 | 18.1956027 | 235 | 245 | 17.6294074 | 18.7739464 |
B | 344 | 13.0401820 | 157 | 187 | 11.7779445 | 14.3295019 |
CD8 T | 271 | 10.2729340 | 138 | 133 | 10.3525881 | 10.1915709 |
FCGR3A+ Mono | 162 | 6.1410159 | 82 | 80 | 6.1515379 | 6.1302682 |
NK | 155 | 5.8756634 | 84 | 71 | 6.3015754 | 5.4406130 |
DC | 32 | 1.2130402 | 16 | 16 | 1.2003001 | 1.2260536 |
Platelet | 14 | 0.5307051 | 7 | 7 | 0.5251313 | 0.5363985 |
Total | 2638 | 100.0000000 | 1333 | 1305 | 100.0000000 | 100.0000000 |
Understanding the proportion of cells per cluster and how that may
change across a group can be important aspect of single cell analysis.
scCustomize contains function Proportion_Plot
to provide
some visualization of cell proportions.
Plotting proportionality can be tricky especially when dealing with
clusters of very disparate sizes (especially small clusters) so plese be
aware to examine raw data as well (see
Cluster_Stats_All_Samples()
).
Also note that when splitting plot by group there are now error bars on these plots for interpretation of variance and other plotting methods should be used to convery such information.
Proportion_Plot
can either return a bar graph or pie
graph depending on parameter value for
Proportion_Plot(seurat_object = pbmc, plot_type = "bar")
Proportion_Plot(seurat_object = pbmc, plot_type = "pie")
Like other plots you can also use Proportion_Plot
to
split plot across any categorical meta.data variable.
Proportion_Plot(seurat_object = pbmc, plot_type = "bar", split.by = "treatment")
Proportion_Plot(seurat_object = pbmc, plot_type = "pie", split.by = "treatment")
The default for Proportion_Plot
is to plot the
proportion of cells and not the raw number of cells per identity because
that can be deceptive. However, in some cases it can be useful and so
users can change the plot_scale
parameter to plot raw
numbers of cells instead of proportions.
Proportion_Plot(seurat_object = pbmc, plot_type = "bar", split.by = "treatment", plot_scale = "count")
It can also be informative to understand the percent of cells/nuclei
that express a given feature or set of features.
scCustomize provides the Percent_Expressing
function to
return these results.
percent_express <- Percent_Expressing(seurat_object = pbmc, features = c("CD4", "CD8A"))
Naive.CD4.T | Memory.CD4.T | CD14..Mono | B | CD8.T | FCGR3A..Mono | NK | DC | Platelet | |
---|---|---|---|---|---|---|---|---|---|
CD4 | 6.743185 | 15.942029 | 24.375 | 1.453488 | 2.583026 | 27.160494 | 0.6451613 | 34.375 | 0 |
CD8A | 13.342898 | 8.902691 | 1.875 | 2.616279 | 50.553505 | 3.703704 | 8.3870968 | 3.125 | 0 |
By default the function groups expression across
@active.ident
(see above) but user can specify difference
variable if desired.
percent_express <- Percent_Expressing(seurat_object = pbmc, features = c("CD4", "CD8A"), group_by = "orig.ident")
sample3 | sample1 | sample4 | sample2 | |
---|---|---|---|---|
CD4 | 11.39818 | 11.55556 | 11.79245 | 12.10762 |
CD8A | 12.91793 | 12.74074 | 11.00629 | 10.46338 |
User can also supply a split_by
variable to quantify
expression within group split by meta data variable
percent_express <- Percent_Expressing(seurat_object = pbmc, features = c("CD4", "CD8A"), split_by = "group")
Naive.CD4.T_Group.1 | Naive.CD4.T_Group.2 | Memory.CD4.T_Group.1 | Memory.CD4.T_Group.2 | CD14..Mono_Group.1 | CD14..Mono_Group.2 | B_Group.1 | B_Group.2 | CD8.T_Group.1 | CD8.T_Group.2 | FCGR3A..Mono_Group.2 | FCGR3A..Mono_Group.1 | NK_Group.1 | NK_Group.2 | DC_Group.2 | DC_Group.1 | Platelet_Group.1 | Platelet_Group.2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CD4 | 6.371191 | 7.142857 | 17.39130 | 14.347826 | 24.255319 | 24.48980 | 0.000000 | 2.673797 | 3.623188 | 1.503759 | 31.25 | 23.170732 | 0.000000 | 1.408451 | 37.5 | 31.25 | 0 | 0 |
CD8A | 13.573407 | 13.095238 | 10.67194 | 6.956522 | 2.553192 | 1.22449 | 1.273885 | 3.743316 | 55.797101 | 45.112782 | 3.75 | 3.658537 | 7.142857 | 9.859155 | 0.0 | 6.25 | 0 | 0 |
Users can also supply a threshold value to quantify percent of cells
expressing feature(s) above a certain threshold (be sure to note which
slot is being used to quantify expression when setting
thresholds).
NOTE: Percent_Expressing
currently only supports single
threshold across all features
percent_express <- Percent_Expressing(seurat_object = pbmc, features = c("CD4", "CD8A"), threshold = 2)
Naive.CD4.T | Memory.CD4.T | CD14..Mono | B | CD8.T | FCGR3A..Mono | NK | DC | Platelet | |
---|---|---|---|---|---|---|---|---|---|
CD4 | 1.147776 | 0.621118 | 5.8333333 | 0.5813953 | 0.3690037 | 1.851852 | 0.000000 | 3.125 | 0 |
CD8A | 6.456241 | 2.070393 | 0.2083333 | 0.8720930 | 24.7232472 | 0.000000 | 3.225807 | 0.000 | 0 |
scCustomize contains function Median_Stats
to quickly
calculate the medians for basic QC stats (Genes/, UMIs/, %
Mito/Cell).
By default Median_Stats
will calculate medians for the
following meta data columns (if present): “nCount_RNA”, “nFeature_RNA”,
“percent_mito”, “percent_ribo”, “percent_mito_ribo”.
median_stats <- Median_Stats(seurat_object = pbmc, group_by_var = "orig.ident")
orig.ident | Median_nCount_RNA | Median_nFeature_RNA | Median_percent_mito | Median_percent_ribo | Median_percent_mito_ribo |
---|---|---|---|---|---|
sample1 | 2193 | 813.0 | 2.005731 | 36.91964 | 39.08969 |
sample2 | 2178 | 812.0 | 2.016383 | 37.03217 | 39.31227 |
sample3 | 2241 | 820.0 | 1.982099 | 37.40017 | 39.24930 |
sample4 | 2251 | 836.5 | 2.057405 | 36.16109 | 38.38636 |
Totals (All Cells) | 2213 | 819.0 | 2.010702 | 36.92524 | 39.03276 |
In addition to default variables, users can supply their own
additional meta data columns to calculate medians using the
median_var
parameter.
NOTE: The meta data column must be in numeric format (e.g., integer,
numeric, dbl)
median_stats <- Median_Stats(seurat_object = pbmc, group_by_var = "orig.ident", median_var = "module_score1")
orig.ident | Median_nCount_RNA | Median_nFeature_RNA | Median_percent_mito | Median_percent_ribo | Median_percent_mito_ribo | Median_module_score1 |
---|---|---|---|---|---|---|
sample1 | 2193 | 813.0 | 2.005731 | 36.91964 | 39.08969 | -0.0545099 |
sample2 | 2178 | 812.0 | 2.016383 | 37.03217 | 39.31227 | -0.0678532 |
sample3 | 2241 | 820.0 | 1.982099 | 37.40017 | 39.24930 | -0.0793297 |
sample4 | 2251 | 836.5 | 2.057405 | 36.16109 | 38.38636 | -0.1935054 |
Totals (All Cells) | 2213 | 819.0 | 2.010702 | 36.92524 | 39.03276 | -0.0922378 |
In addition to calculating the median values scCustomize includes
function to calculate the median absolute deviation for each of those
features with function MAD_Stats
. By setting the parameter
mad_num
the function will retrun the MAD*mad_num.
mad <- MAD_Stats(seurat_object = pbmc, group_by_var = "orig.ident", mad_num = 2)
orig.ident | MAD x 2 nCount_RNA | MAD x 2 nFeature_RNA | MAD x 2 percent_mito | MAD x 2 percent_ribo | MAD x 2 percent_mito_ribo |
---|---|---|---|---|---|
sample1 | 1464.809 | 397.3368 | 1.640639 | 23.92318 | 23.93974 |
sample2 | 1319.514 | 361.7544 | 1.579981 | 23.47474 | 23.15831 |
sample3 | 1500.391 | 392.8890 | 1.557364 | 23.57180 | 23.52481 |
sample4 | 1478.152 | 381.0282 | 1.451580 | 24.44613 | 24.08378 |
Totals (All Cells) | 1442.570 | 379.5456 | 1.544386 | 23.73656 | 23.38683 |
scCustomize also contains series of functions for plotting the results of these median calculations:
Plot_Median_Genes(seurat_object = pbmc)
Plot_Median_Genes(seurat_object = pbmc, group_by = "group")
Plot_Median_Other(seurat_object = pbmc, median_var = "module_score1", group_by = "group")