vignettes/articles/Statistics.Rmd
Statistics.Rmd
scCustomize contains a couple simple helper functions to return metrics of interest.
For this vignette I will be utilizing pbmc3k dataset from the SeuratData package.
# Load Packages
library(ggplot2)
library(dplyr)
library(magrittr)
library(Seurat)
library(scCustomize)
library(qs)
# Load example dataset for tutorial
pbmc <- pbmc3k.SeuratData::pbmc3k.final
Now let’s add some extra meta data for use with tutorial
# Add mito and ribo data
pbmc <- Add_Mito_Ribo(object = pbmc, species = "human")
# Add random sample and group variables
pbmc$orig.ident <- sample(c("sample1", "sample2", "sample3", "sample4"), size = ncol(pbmc), replace = TRUE)
pbmc@meta.data$group[pbmc@meta.data$orig.ident == "sample1" | pbmc@meta.data$orig.ident == "sample3"] <- "Group 1"
pbmc@meta.data$group[pbmc@meta.data$orig.ident == "sample2" | pbmc@meta.data$orig.ident == "sample4"] <- "Group 2"
# Add dummy module score
pbmc <- AddModuleScore(object = pbmc, features = list(c("CD3E", "CD4", "THY1", "TCRA")), name = "module_score")
It can be really helpful to know the number and percentage of cells
per identity/cluster during analysis.
scCustomize provides the Cluster_Stats_All_Samples
function
to make this easy.
cluster_stats <- Cluster_Stats_All_Samples(seurat_object = pbmc)
cluster_stats
Cluster | Number | Freq | sample1 | sample2 | sample3 | sample4 | sample1_% | sample2_% | sample3_% | sample4_% |
---|---|---|---|---|---|---|---|---|---|---|
Naive CD4 T | 697 | 26.4215315 | 178 | 189 | 171 | 159 | 25.7971014 | 29.0322581 | 25.9878419 | 24.8826291 |
Memory CD4 T | 483 | 18.3093252 | 124 | 122 | 117 | 120 | 17.9710145 | 18.7403994 | 17.7811550 | 18.7793427 |
CD14+ Mono | 480 | 18.1956027 | 134 | 114 | 115 | 117 | 19.4202899 | 17.5115207 | 17.4772036 | 18.3098592 |
B | 344 | 13.0401820 | 106 | 67 | 83 | 88 | 15.3623188 | 10.2918587 | 12.6139818 | 13.7715180 |
CD8 T | 271 | 10.2729340 | 62 | 65 | 82 | 62 | 8.9855072 | 9.9846390 | 12.4620061 | 9.7026604 |
FCGR3A+ Mono | 162 | 6.1410159 | 31 | 45 | 47 | 39 | 4.4927536 | 6.9124424 | 7.1428571 | 6.1032864 |
NK | 155 | 5.8756634 | 44 | 42 | 28 | 41 | 6.3768116 | 6.4516129 | 4.2553191 | 6.4162754 |
DC | 32 | 1.2130402 | 7 | 6 | 10 | 9 | 1.0144928 | 0.9216590 | 1.5197568 | 1.4084507 |
Platelet | 14 | 0.5307051 | 4 | 1 | 5 | 4 | 0.5797101 | 0.1536098 | 0.7598784 | 0.6259781 |
Total | 2638 | 100.0000000 | 690 | 651 | 658 | 639 | 100.0000000 | 100.0000000 | 100.0000000 | 100.0000000 |
By default Cluster_Stats_All_Samples
uses “orig.ident”
as the group variable but user can specify any @meta.data
slot variable.
cluster_stats <- Cluster_Stats_All_Samples(seurat_object = pbmc, group_by_var = "group")
cluster_stats
Cluster | Number | Freq | Group 1 | Group 2 | Group 1_% | Group 2_% |
---|---|---|---|---|---|---|
Naive CD4 T | 697 | 26.4215315 | 349 | 348 | 25.8902077 | 26.9767442 |
Memory CD4 T | 483 | 18.3093252 | 241 | 242 | 17.8783383 | 18.7596899 |
CD14+ Mono | 480 | 18.1956027 | 249 | 231 | 18.4718101 | 17.9069767 |
B | 344 | 13.0401820 | 189 | 155 | 14.0207715 | 12.0155039 |
CD8 T | 271 | 10.2729340 | 144 | 127 | 10.6824926 | 9.8449612 |
FCGR3A+ Mono | 162 | 6.1410159 | 78 | 84 | 5.7863501 | 6.5116279 |
NK | 155 | 5.8756634 | 72 | 83 | 5.3412463 | 6.4341085 |
DC | 32 | 1.2130402 | 17 | 15 | 1.2611276 | 1.1627907 |
Platelet | 14 | 0.5307051 | 9 | 5 | 0.6676558 | 0.3875969 |
Total | 2638 | 100.0000000 | 1348 | 1290 | 100.0000000 | 100.0000000 |
It can also be informative to understand the percent of cells/nuclei
that express a given feature or set of features.
scCustomize provides the Percent_Expressing
function to
return these results.
percent_express <- Percent_Expressing(seurat_object = pbmc, features = c("CD4", "CD8A"))
Naive.CD4.T | Memory.CD4.T | CD14..Mono | B | CD8.T | FCGR3A..Mono | NK | DC | Platelet | |
---|---|---|---|---|---|---|---|---|---|
CD4 | 6.743185 | 15.942029 | 24.375 | 1.453488 | 2.583026 | 27.160494 | 0.6451613 | 34.375 | 0 |
CD8A | 13.342898 | 8.902691 | 1.875 | 2.616279 | 50.553505 | 3.703704 | 8.3870968 | 3.125 | 0 |
By default the function groups expression across
@active.ident
(see above) but user can specify difference
variable if desired.
percent_express <- Percent_Expressing(seurat_object = pbmc, features = c("CD4", "CD8A"), group_by = "orig.ident")
sample2 | sample4 | sample1 | sample3 | |
---|---|---|---|---|
CD4 | 14.28571 | 9.076682 | 10.72464 | 12.76596 |
CD8A | 11.98157 | 10.172144 | 12.02899 | 12.91793 |
User can also supply a split_by
variable to quantify
expression within group split by meta data variable
percent_express <- Percent_Expressing(seurat_object = pbmc, features = c("CD4", "CD8A"), split_by = "group")
Naive.CD4.T_Group.2 | Naive.CD4.T_Group.1 | Memory.CD4.T_Group.1 | Memory.CD4.T_Group.2 | CD14..Mono_Group.1 | CD14..Mono_Group.2 | B_Group.2 | B_Group.1 | CD8.T_Group.1 | CD8.T_Group.2 | FCGR3A..Mono_Group.1 | FCGR3A..Mono_Group.2 | NK_Group.1 | NK_Group.2 | DC_Group.2 | DC_Group.1 | Platelet_Group.1 | Platelet_Group.2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CD4 | 7.758621 | 5.730659 | 15.767635 | 16.11570 | 25.702811 | 22.943723 | 1.290323 | 1.587302 | 3.472222 | 1.574803 | 25.641026 | 28.571429 | 1.388889 | 0.00000 | 26.666667 | 41.17647 | 0 | 0 |
CD8A | 12.643678 | 14.040115 | 9.958506 | 7.85124 | 2.008032 | 1.731602 | 2.580645 | 2.645503 | 56.250000 | 44.094488 | 1.282051 | 5.952381 | 4.166667 | 12.04819 | 6.666667 | 0.00000 | 0 | 0 |
Users can also supply a threshold value to quantify percent of cells
expressing feature(s) above a certain threshold (be sure to note which
slot is being used to quantify expression when setting
thresholds).
NOTE: Percent_Expressing
currently only supports single
threshold across all features
percent_express <- Percent_Expressing(seurat_object = pbmc, features = c("CD4", "CD8A"), threshold = 2)
Naive.CD4.T | Memory.CD4.T | CD14..Mono | B | CD8.T | FCGR3A..Mono | NK | DC | Platelet | |
---|---|---|---|---|---|---|---|---|---|
CD4 | 1.147776 | 0.621118 | 5.8333333 | 0.5813953 | 0.3690037 | 1.851852 | 0.000000 | 3.125 | 0 |
CD8A | 6.456241 | 2.070393 | 0.2083333 | 0.8720930 | 24.7232472 | 0.000000 | 3.225807 | 0.000 | 0 |
scCustomize contains function Median_Stats
to quickly
calculate the medians for basic QC stats (Genes/, UMIs/, %
Mito/Cell).
By default Median_Stats
will calculate medians for the
following meta data columns (if present): “nCount_RNA”, “nFeature_RNA”,
“percent_mito”, “percent_ribo”, “percent_mito_ribo”.
median_stats <- Median_Stats(seurat_object = pbmc, group_by_var = "orig.ident")
orig.ident | Median_nCount_RNA | Median_nFeature_RNA | Median_percent_mito | Median_percent_ribo | Median_percent_mito_ribo |
---|---|---|---|---|---|
sample1 | 2189.0 | 810.0 | 2.045561 | 37.30009 | 39.50258 |
sample2 | 2193.0 | 817.0 | 1.937208 | 36.68319 | 39.02439 |
sample3 | 2204.5 | 821.5 | 2.006391 | 36.89451 | 38.77124 |
sample4 | 2250.0 | 829.0 | 2.076271 | 36.90902 | 38.92387 |
Totals (All Cells) | 2213.0 | 819.0 | 2.010702 | 36.92524 | 39.03276 |
In addition to default variables, users can supply their own
additional meta data columns to calculate medians using the
median_var
parameter.
NOTE: The meta data column must be in numeric format (e.g., integer,
numeric, dbl)
median_stats <- Median_Stats(seurat_object = pbmc, group_by_var = "orig.ident", median_var = "module_score1")
orig.ident | Median_nCount_RNA | Median_nFeature_RNA | Median_percent_mito | Median_percent_ribo | Median_percent_mito_ribo | Median_module_score1 |
---|---|---|---|---|---|---|
sample1 | 2189.0 | 810.0 | 2.045561 | 37.30009 | 39.50258 | -0.1551981 |
sample2 | 2193.0 | 817.0 | 1.937208 | 36.68319 | 39.02439 | 0.0076579 |
sample3 | 2204.5 | 821.5 | 2.006391 | 36.89451 | 38.77124 | -0.0939919 |
sample4 | 2250.0 | 829.0 | 2.076271 | 36.90902 | 38.92387 | -0.1788730 |
Totals (All Cells) | 2213.0 | 819.0 | 2.010702 | 36.92524 | 39.03276 | -0.0922378 |
In addition to calculating the median values scCustomize includes
function to calculate the median absolute deviation for each of those
features with function MAD_Stats
. By setting the parameter
mad_num
the function will retrun the MAD*mad_num.
mad <- MAD_Stats(seurat_object = pbmc, group_by_var = "orig.ident", mad_num = 2)
orig.ident | MAD x 2 nCount_RNA | MAD x 2 nFeature_RNA | MAD x 2 percent_mito | MAD x 2 percent_ribo | MAD x 2 percent_mito_ribo |
---|---|---|---|---|---|
sample1 | 736.8522 | 197.1858 | 0.8075960 | 11.35120 | 11.25176 |
sample2 | 650.8614 | 177.9120 | 0.7286244 | 12.60023 | 12.21798 |
sample3 | 742.0413 | 194.2206 | 0.7812998 | 11.74065 | 11.54864 |
sample4 | 736.8522 | 194.2206 | 0.7626530 | 11.75971 | 11.87102 |
Totals (All Cells) | 721.2849 | 189.7728 | 0.7721931 | 11.86828 | 11.69341 |
scCustomize also contains series of functions for plotting the results of these median calculations:
Plot_Median_Genes(seurat_object = pbmc)
Plot_Median_Genes(seurat_object = pbmc, group_by = "group")
Plot_Median_Other(seurat_object = pbmc, median_var = "module_score1", group_by = "group")