vignettes/articles/QC_Plots.Rmd
QC_Plots.Rmd
One of the first steps in all scRNA-seq analyses is performing a number of QC checks and plots so that data can be appropriately filtered. scCustomize contains a number of functions that can be used to quickly and easily generate some of the most relevant QC plots.
For this tutorial, I will be utilizing HCA bone marrow cell data from the SeuratData package.
library(ggplot2)
library(dplyr)
library(magrittr)
library(patchwork)
library(Seurat)
library(scCustomize)
library(qs)
# Load Example Dataset
hca_bm <- hcabm40k.SeuratData::hcabm40k
# Add pseudo group variable just for this vignette
hca_bm@meta.data$group[hca_bm@meta.data$orig.ident == "MantonBM1" | hca_bm@meta.data$orig.ident ==
"MantonBM2" | hca_bm@meta.data$orig.ident == "MantonBM3" | hca_bm@meta.data$orig.ident == "MantonBM4"] <- "Group 1"
hca_bm@meta.data$group[hca_bm@meta.data$orig.ident == "MantonBM5" | hca_bm@meta.data$orig.ident ==
"MantonBM6" | hca_bm@meta.data$orig.ident == "MantonBM7" | hca_bm@meta.data$orig.ident == "MantonBM8"] <- "Group 2"
This is the starting point of nearly every single cell analysis and scCustomize contains a number of helper functions to make the process fast and easy.
scCustomize contains easy wrapper function to automatically add both
Mitochondrial and Ribosomal count percentages to meta.data slot. If you
are using mouse, human, rat, zebrafish, drosophila, marmoset, or macaque
data all you need to do is specify the species
parameter.
# These defaults can be run just by providing accepted species name
hca_bm <- Add_Mito_Ribo(object = hca_bm, species = "Human")
NOTE: This function works seemlessly for both Seurat and LIGER objects.
To view list of accepted values for default species names simply setlist_species_names = TRUE
.
Mouse_Options | Human_Options | Marmoset_Options | Zebrafish_Options | Rat_Options | Drosophila_Options | Macaque_Options |
---|---|---|---|---|---|---|
Mouse | Human | Marmoset | Zebrafish | Rat | Drosophila | Macaque |
mouse | human | marmoset | zebrafish | rat | drosophila | macaque |
Ms | Hu | CJ | DR | RN | DM | Rhesus |
ms | hu | Cj | Dr | Rn | Dm | macaca |
Mm | Hs | cj | dr | rn | dm | mmulatta |
mm | hs | NA | NA | NA | NA | NA |
However custom prefixes can be used for non-human/mouse/marmoset
species with different annotations. Simply specify
species = other
and supply feature lists or regex patterns
for your species of interest.
NOTE: If desired please submit issue on GitHub for additional
default species. Please include regex pattern or list of genes for both
mitochondrial and ribosomal genes and I will add additional built-in
defaults to the function.
# Using gene name patterns
hca_bm <- Add_Mito_Ribo(object = hca_bm, species = "other", mito_pattern = "regexp_pattern", ribo_pattern = "regexp_pattern")
# Using feature name lists
mito_gene_list <- c("gene1", "gene2", "etc")
ribo_gene_list <- c("gene1", "gene2", "etc")
hca_bm <- Add_Mito_Ribo(object = hca_bm, species = "other", mito_features = mito_gene_list, ribo_features = ribo_gene_list)
# Using combination of gene lists and gene name patterns
hca_bm <- Add_Mito_Ribo(object = hca_bm, species = "Human", mito_features = mito_gene_list, ribo_pattern = "regexp_pattern")
scCustomize contains built in list of ensembl IDs that correspond to
mitochondrial and ribosomal genes for all default species. If your
object using ensembl IDs as features names then simply add
ensembl_ids
parameter.
# Using gene name patterns
hca_bm <- Add_Mito_Ribo(object = hca_bm, species = "Human", ensembl_ids = TRUE)
In addition to metrics like number of features and UMIs it can often be helpful to analyze the complexity of expression within a single cell. scCustomize provides functions to add two of these metrics to meta data.
scCustomize contains easy shortcut function to add a measure of cell complexity/novelty that can sometimes be useful to filter low quality cells. The metric is calculated by calculating the result of log10(nFeature) / log10(nCount).
# These defaults can be run just by providing accepted species name
hca_bm <- Add_Cell_Complexity(object = hca_bm)
NOTE: This function works seemlessly for both Seurat and LIGER objects.
Additionally, (or alternatively), scCustomize contains another metric
of complexity which is the top percent expression. The user supplies an
integer value for num_top_genes
(default is 50) which
species the number of genes and the function returns percentage of
counts occupied by top XX genes in each cell.
# These defaults can be run just by providing accepted species name
hca_bm <- Add_Top_Gene_Pct_Seurat(seurat_object = hca_bm, num_top_genes = 50)
In addition to those standard QC metrics it can be helpful when using network based QC analysis to add the percent of expression of genes related to common pathways. This function and the network-based analysis is further extension of the analysis/QC from our recent publication: Gazestani & Kamath et al., 2023 (Cell).
In scCustomize the percent of gene expression from the following gene
lists can be added as part of the Add_Cell_QC_Metrics
(see
next section for more details):
To simplify the process of adding cell QC metrics scCustomize
contains a wrapper function which can be customized to add all or some
of the available QC metrics. The default parameters of the function
Add_Cell_QC_Metrics
will add:
hca_bm <- Add_Cell_QC_Metrics(seurat_object = hca_bm, species = "human")
By changing the function parameters you can add a subset of available QC metrics. For instance:
hca_bm <- Add_Cell_QC_Metrics(seurat_object = hca_bm, species = "human", add_mito_ribo = TRUE, add_complexity = FALSE,
add_top_pct = TRUE, add_IEG = TRUE, add_MSigDB = FALSE, add_cell_cycle = FALSE)
scCustomize has a number of quick QC plotting options for ease of
use.
NOTE: Most scCustomize plotting functions contain ...
parameter to allow user to supply any of the parameters for the original
Seurat function that is being used under the hood.
VlnPlot
-Based QC Plots
scCustomize contains a number of shortcut/wrapper functions for QC
plotting, which are wrappers around
Seurat::VlnPlot()
/VlnPlot_scCustom
.
QC_Plots_Genes()
Plots genes/features per
cell/nucleus.QC_Plots_UMIs()
Plots UMIs per cell/nucleus.QC_Plots_Mito()
Plots mito% (named “percent_mito”) per
cell/nucleus.QC_Plots_Complexity()
Plots cell complexity metric
(log10GenesPerUMI) per cell/nucleus.QC_Plots_Feature()
Plots “feature” per cell/nucleus.
Using parameter feature
to allow plotting of any applicable
named feature in object@meta.data slot.QC_Plots_Combined_Vln()
Returns patchwork plot of
QC_Plots_Genes()
, QC_Plots_UMIs()
, &
QC_Plots_Mito()
.scCustomize functions have the added benefit of:
QC_Plots_Feature
).
# All functions contain
p1 <- QC_Plots_Genes(seurat_object = hca_bm, low_cutoff = 600, high_cutoff = 5500)
p2 <- QC_Plots_UMIs(seurat_object = hca_bm, low_cutoff = 1200, high_cutoff = 45000)
p3 <- QC_Plots_Mito(seurat_object = hca_bm, high_cutoff = 20)
p4 <- QC_Plots_Complexity(seurat_object = hca_bm, high_cutoff = 0.8)
wrap_plots(p1, p2, p3, p4, ncol = 4)
In addition to being able to supply Seurat parameters with
...
these plots like many others in scCustomize contain
other additional parameters to customize plot output without need for
post-plot ggplot2 modifications
plot_title
: Change plot titlex_axis_label
/y_axis_label
: Change axis
labels.x_lab_rotate
: Should x-axis label be rotated 45
degrees?y_axis_log
: Should y-axis in linear or log10
scale.plot_median
& median_size
: Plot a line
at the median of each x-axis identity.plot_boxplot
: Plot boxplot on top of the violin.
p1 <- QC_Plots_UMIs(seurat_object = hca_bm, low_cutoff = 1200, high_cutoff = 45000, pt.size = 0.1)
p2 <- QC_Plots_UMIs(seurat_object = hca_bm, low_cutoff = 1200, high_cutoff = 45000, pt.size = 0.1,
y_axis_log = TRUE)
wrap_plots(p1, p2, ncol = 2)
p1 <- QC_Plots_UMIs(seurat_object = hca_bm, low_cutoff = 1200, high_cutoff = 45000, pt.size = 0,
plot_median = TRUE)
p2 <- QC_Plots_UMIs(seurat_object = hca_bm, low_cutoff = 1200, high_cutoff = 45000, pt.size = 0,
y_axis_log = TRUE, plot_median = TRUE)
wrap_plots(p1, p2, ncol = 2)
p1 <- QC_Plots_UMIs(seurat_object = hca_bm, low_cutoff = 1200, high_cutoff = 45000, pt.size = 0,
plot_boxplot = TRUE)
p2 <- QC_Plots_UMIs(seurat_object = hca_bm, low_cutoff = 1200, high_cutoff = 45000, pt.size = 0,
y_axis_log = TRUE, plot_boxplot = TRUE)
wrap_plots(p1, p2, ncol = 2)
As a shortcut you can return single patchwork plot of the 3 main QC
Plots (Genes, UMIs, %Mito) by using single function,
QC_Plots_Combined_Vln()
.
QC_Plots_Combined_Vln(seurat_object = hca_bm, feature_cutoffs = c(600, 5500), UMI_cutoffs = c(1200,
45000), mito_cutoffs = 20, pt.size = 0.1)
FeatureScatter
-Based QC Plots
scCustomize contains 3 functions which wrap
Seurat::FeatureScatter()
with added visualization of
potential cutoff thresholds and some additional functionality:
QC_Plot_UMIvsGene()
Plots genes vs UMIs per
cell/nucleusQC_Plot_GenevsFeature()
Plots Genes vs. “feature” per
cell/nucleus. Using parameter feature1
to allow plotting of
any applicable named feature in object@meta.data slot.QC_Plot_UMIvsFeature()
Plots UMIs vs. “feature” per
cell/nucleus. Using parameter feature1
to allow plotting of
any applicable named feature in object@meta.data slot.shuffle = TRUE
by default to prevent hiding of
datasetsQC_Plot_UMIvsGene
(based on
values provided to high and low cutoff parameters)
# All functions contain
QC_Plot_UMIvsGene(seurat_object = hca_bm, low_cutoff_gene = 600, high_cutoff_gene = 5500, low_cutoff_UMI = 500,
high_cutoff_UMI = 50000)
QC_Plot_GenevsFeature(seurat_object = hca_bm, feature1 = "percent_mito", low_cutoff_gene = 600,
high_cutoff_gene = 5500, high_cutoff_feature = 20)
QC_Plot_UMIvsGene
contains the ability to color points
by continuous meta data variables.
This can be used to plot % of mito reads in addition to UMI vs. Gene
comparisons
QC_Plot_UMIvsGene(seurat_object = hca_bm, meta_gradient_name = "percent_mito", low_cutoff_gene = 600,
high_cutoff_gene = 5500, high_cutoff_UMI = 45000)
QC_Plot_UMIvsGene(seurat_object = hca_bm, meta_gradient_name = "percent_mito", low_cutoff_gene = 600,
high_cutoff_gene = 5500, high_cutoff_UMI = 45000, meta_gradient_low_cutoff = 20)
If you are interested in viewing QC_Plot_UMIvsGene
both
by discrete grouping variable and by continuous variable without writing
function twice you can use combination = TRUE
and plot
output will contain both plots.
QC_Plot_UMIvsGene(seurat_object = hca_bm, meta_gradient_name = "percent_mito", low_cutoff_gene = 600,
high_cutoff_gene = 5500, high_cutoff_UMI = 45000, meta_gradient_low_cutoff = 20, combination = TRUE)
Finally, scCustomize contains a function to plot QC metrics as histogram instead of violin for visualization of the distribution of feature.
QC_Histogram(seurat_object = hca_bm, features = "percent_mito", low_cutoff = 15)
QC_Histogram(seurat_object = hca_bm, features = "percent_mito", low_cutoff = 15, split.by = "group")
scCustomize also contains a few helpful functions for returning and plotting the median values for these metrics on per sample/library basis.
scCustomize contains function Median_Stats
to quickly
calculate the medians for basic QC stats (Genes/, UMIs/, %Mito/Cell,
etc) and return a data.frame.
median_stats <- Median_Stats(seurat_object = hca_bm, group_by_var = "orig.ident")
orig.ident | Median_nCount_RNA | Median_nFeature_RNA | Median_percent_mito | Median_percent_ribo | Median_percent_mito_ribo | Median_log10GenesPerUMI |
---|---|---|---|---|---|---|
MantonBM1 | 2634.5 | 807 | 3.928074 | 42.04921 | 46.43790 | 0.8506316 |
MantonBM2 | 2462.5 | 783 | 3.556994 | 42.96184 | 46.87269 | 0.8505270 |
MantonBM3 | 2617.0 | 769 | 3.301410 | 40.65558 | 44.10722 | 0.8483425 |
MantonBM4 | 2264.0 | 780 | 3.384891 | 33.73245 | 37.96850 | 0.8645139 |
MantonBM5 | 2677.0 | 741 | 3.265677 | 41.55203 | 45.33713 | 0.8360196 |
MantonBM6 | 2677.0 | 835 | 3.202053 | 43.07874 | 46.70119 | 0.8522162 |
MantonBM7 | 2429.0 | 732 | 3.216184 | 44.68831 | 48.20765 | 0.8477172 |
MantonBM8 | 2453.0 | 701 | 3.697331 | 44.97911 | 48.96420 | 0.8406128 |
Totals (All Cells) | 2524.0 | 769 | 3.447040 | 41.53183 | 45.40989 | 0.8494067 |
The Median_Stats
function has some column names stored
by default but will also calculate medians for additional meta.data
columns using the optional median_var
parameter
median_stats <- Median_Stats(seurat_object = hca_bm, group_by_var = "orig.ident", median_var = "meta_data_column_name")
scCustomize also contains a few functions to plot some of these median value calculations, which can be used on their own without need to return data.frame first.
Plot_Median_Genes()
Plot_Median_UMIs()
Plot_Median_Mito()
Plot_Median_Other()
Plot_Median_Genes(seurat_object = hca_bm, group_by = "group")
Plot_Median_UMIs(seurat_object = hca_bm, group_by = "group")
Plot_Median_Mito(seurat_object = hca_bm, group_by = "group")
Plot_Median_Other(seurat_object = hca_bm, median_var = "percent_ribo", group_by = "group")