Object_QC_Functions.Rmd
scCustomize has several helper functions to simplify and streamline what are nearly always the first and most critical choices when starting an analysis: performing quality control and filtering.
Let’s load the packages and raw data object for this tutorial.
# Load Packages
library(ggplot2)
library(dplyr)
library(magrittr)
library(patchwork)
library(Seurat)
library(scCustomize)
pbmc <- pbmc3k.SeuratData::pbmc3k
We’ll add some random meta data variables to the pbmc data for use in this vignette.
pbmc$sample_id <- sample(c("sample1", "sample2", "sample3", "sample4", "sample5", "sample6"), size = ncol(pbmc),
replace = TRUE)
pbmc$batch <- sample(c("Batch1", "Batch2"), size = ncol(pbmc), replace = TRUE)
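Because `sample()` draws at random, the labels assigned above differ from run to run. A minimal sketch of making such draws reproducible, assuming that is desired; the seed value and the toy `n_cells` are arbitrary placeholders, not values from the vignette:

```r
# Sketch only: the seed (42) and n_cells are arbitrary placeholders.
n_cells <- 10
samples <- c("sample1", "sample2", "sample3")

# Setting the same seed before each draw yields the same labels.
set.seed(42)
first_draw <- sample(samples, size = n_cells, replace = TRUE)

set.seed(42)
second_draw <- sample(samples, size = n_cells, replace = TRUE)

identical(first_draw, second_draw)
```

Calling `set.seed()` once before the `sample()` calls in the vignette would have the same effect on `sample_id` and `batch`.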
All of scCustomize’s functions to add quality control metrics are 100% cross compatible across Seurat and LIGER objects using the same function calls. For more details on QC specific plotting functions see QC Plotting & Analysis Vignette.
Additionally, all of the QC functions support objects that use either gene symbols or Ensembl IDs. Ensembl IDs for default species (see below) are from Ensembl version 112 (updated in scCustomize on 4/29/2024).
If your object uses Ensembl IDs as feature names, simply set the `ensembl_ids` parameter, which is present in all QC functions, to `TRUE`.
# Using Ensembl IDs as feature names
obj <- Add_Mito_Ribo(object = obj, species = "Human", ensembl_ids = TRUE)
Many of the commonly performed QC functions depend on genes within a particular family that have similar naming patterns (e.g., mitochondrial genes) or are species specific (see the MSigDB-dependent parts of `Add_Cell_QC_Metrics()`).
To remove the need to remember species-specific patterns (or find Ensembl ID gene lists), scCustomize provides built-in defaults. If you are using mouse, human, marmoset, zebrafish, rat, drosophila, rhesus macaque, or chicken data, all you need to do is specify the `species` parameter in the functions described below using one of the following accepted names.
| | Mouse_Options | Human_Options | Marmoset_Options | Zebrafish_Options | Rat_Options | Drosophila_Options | Macaque_Options | Chicken_Options |
|---|---|---|---|---|---|---|---|---|
| 1 | Mouse | Human | Marmoset | Zebrafish | Rat | Drosophila | Macaque | Chicken |
| 2 | mouse | human | marmoset | zebrafish | rat | drosophila | macaque | chicken |
| 3 | Ms | Hu | CJ | DR | RN | DM | Rhesus | Gallus |
| 4 | ms | hu | Cj | Dr | Rn | Dm | macaca | gallus |
| 5 | Mm | Hs | cj | dr | rn | dm | mmulatta | Gg |
| 6 | mm | hs | NA | NA | NA | NA | NA | gg |
However, custom prefixes can be used for species with different annotations. Simply specify `species = "other"` and supply feature lists or regex patterns for your species of interest. NOTE: If desired, please submit an issue on GitHub to request additional default species. Please include a regex pattern or list of genes for both mitochondrial and ribosomal genes and I will add additional built-in defaults to the function.
# Using gene name patterns
pbmc <- Add_Mito_Ribo(object = pbmc, species = "other", mito_pattern = "regexp_pattern_mito", ribo_pattern = "regexp_pattern_ribo")
# Using feature name lists
mito_gene_list <- c("gene1", "gene2", "etc")
ribo_gene_list <- c("gene1", "gene2", "etc")
pbmc <- Add_Mito_Ribo(object = pbmc, species = "other", mito_features = mito_gene_list, ribo_features = ribo_gene_list)
# Using combination of gene lists and gene name patterns
pbmc <- Add_Mito_Ribo(object = pbmc, species = "Human", mito_features = mito_gene_list, ribo_pattern = "regexp_pattern_ribo")
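Before passing a custom pattern, it can help to check which feature names it actually matches. The following is a minimal base R sketch on a toy vector of gene symbols (the vector is made up for illustration); with a Seurat object the names would typically come from `rownames()` of the object. The two patterns are the human defaults reported in the messages of this vignette.

```r
# Toy feature names (illustrative only).
features <- c("MT-ND1", "MT-CO1", "RPS6", "RPL13", "CD3E", "MRPL12")

# Human default patterns as reported in the vignette's console messages.
mito_pattern <- "^MT-"
ribo_pattern <- "^RP[SL]"

# value = TRUE returns the matching names rather than their positions.
mito_hits <- grep(pattern = mito_pattern, x = features, value = TRUE)
ribo_hits <- grep(pattern = ribo_pattern, x = features, value = TRUE)

mito_hits
ribo_hits

# An empty result suggests the pattern does not fit the annotation.
if (length(mito_hits) == 0) warning("No features matched the mito pattern.")
```

Note that `MRPL12` is not matched by `^RP[SL]` because the pattern is anchored to the start of the name.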
To simplify the process of adding cell QC metrics, scCustomize contains a wrapper function which can be customized to add all or some of the available QC metrics. This vignette will describe each of these in more detail below, but with its default parameters `Add_Cell_QC_Metrics()` will add the metrics reported in the function's messages:
pbmc <- Add_Cell_QC_Metrics(object = pbmc, species = "human")
## • Adding Mito/Ribo Percentages to meta.data.
## Adding Percent Mitochondrial genes for human using gene symbol pattern: "^MT-".
## Adding Percent Ribosomal genes for human using gene symbol pattern: "^RP[SL]".
## Adding Percent Mito+Ribo by adding Mito & Ribo percentages.
## • Adding Cell Complexity #1 (log10GenesPerUMI) to meta.data.
## • Adding Cell Complexity #2 (Top 50 Percentages) to meta.data.
## Calculating percent expressing top 50 for layer: counts
## • Adding MSigDB Oxidative Phosphorylation, Apoptosis, and DNA Repair
## Percentages to meta.data.
## • Adding IEG Percentages to meta.data.
## • Adding Hemoglobin Percentages to meta.data.
## Adding Percent Hemoglobin for Human using gene symbol pattern: "^HB[^(P)]".
## • Adding Cell Cycle Scoring to meta.data.
## Calculating Cell Cycle Scores.
If you only want to add some but not all metrics, you can either customize `Add_Cell_QC_Metrics` or use the individual functions.
If you just want to calculate and add mitochondrial and ribosomal count percentages per cell/nucleus, you can use `Add_Mito_Ribo`.
Add_Mito_Ribo()
scCustomize contains an easy wrapper function to automatically add both mitochondrial and ribosomal percentages to the meta.data slot. If you are using mouse, human, marmoset, zebrafish, rat, drosophila, rhesus macaque, or chicken data, all you need to do is specify the `species` parameter.
# These defaults can be run just by providing accepted species name
pbmc <- Add_Mito_Ribo(object = pbmc, species = "human")
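To make the meaning of these percentages concrete, here is a minimal base R sketch on a toy count matrix. It is not scCustomize's implementation; the helper `pct_of_counts()` and the toy matrix are hypothetical, and the patterns are the human defaults shown in the messages above.

```r
# Toy genes x cells count matrix (illustrative values only).
counts <- matrix(
  c(10,  0,  5,
     5,  5,  0,
    20, 10,  5,
    65, 85, 90),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("MT-ND1", "RPS6", "RPL13", "CD3E"),
                  c("cell1", "cell2", "cell3"))
)

# Hypothetical helper: percent of each cell's counts coming from
# genes whose names match `pattern`.
pct_of_counts <- function(counts, pattern) {
  hits <- grepl(pattern, rownames(counts))
  100 * colSums(counts[hits, , drop = FALSE]) / colSums(counts)
}

percent_mito <- pct_of_counts(counts, "^MT-")
percent_ribo <- pct_of_counts(counts, "^RP[SL]")

# Combined value obtained by adding the two percentages.
percent_mito_ribo <- percent_mito + percent_ribo

percent_mito_ribo
```

Each toy cell has 100 total counts, so the percentages can be read directly off the matrix rows.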
Some analyses are performed with cells aligned to a genome that contains multiple species (see Cell Ranger/10X documentation for more info). scCustomize now supports adding mitochondrial and ribosomal percentages for these datasets using optional parameters. Here we will use example data provided by 10X Genomics.
pbmc_dual_species <- Read10X_h5(filename = "~/Downloads/10k_hgmm_3p_gemx_Multiplex_count_raw_feature_bc_matrix.h5")
pbmc_dual_species <- CreateSeuratObject(counts = pbmc_dual_species, min.cells = 5, min.features = 500)
For dual species analyses, the only additional information you need to provide is the prefixes used in front of the gene IDs. In this case the prefixes are “GRCh38-” and “GRCm39-”.
pbmc_dual_species <- Add_Mito_Ribo(object = pbmc_dual_species, species = c("human", "mouse"), species_prefix = c("GRCh38-",
"GRCm39-"))
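As a rough illustration of why the prefixes matter, the sketch below shows how a species prefix can be combined with a per-species symbol pattern to select features in a mixed-genome feature list. This is an assumption about the general idea, not a description of how `Add_Mito_Ribo` works internally; the feature names are toy examples, and the mouse pattern `mt-` reflects the lowercase convention of mouse mitochondrial gene symbols.

```r
# Toy feature names from a hypothetical human + mouse reference.
features <- c("GRCh38-MT-ND1", "GRCh38-RPS6", "GRCh38-CD3E",
              "GRCm39-mt-Nd1", "GRCm39-Rps6", "GRCm39-Cd3e")

species_prefix <- c(human = "GRCh38-", mouse = "GRCm39-")

# Per-species mitochondrial symbol patterns (without anchors).
mito_symbol_pattern <- c(human = "MT-", mouse = "mt-")

# Anchor each pattern behind its species prefix, e.g. "^GRCh38-MT-".
prefixed_pattern <- paste0("^", species_prefix, mito_symbol_pattern)
names(prefixed_pattern) <- names(species_prefix)

# Matching is case sensitive, so each pattern only hits its own species.
mito_by_species <- lapply(prefixed_pattern, grep, x = features, value = TRUE)

mito_by_species
```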
An added benefit of `Add_Mito_Ribo` is that it will return informative warnings if no mitochondrial or ribosomal features are found using the current species, features, or pattern specification.
# For demonstration purposes we can set `species = mouse` for this object of human cells
pbmc <- Add_Mito_Ribo(object = pbmc, species = "mouse")
## Error in `Add_Mito_Ribo()`:
## ! Columns with "percent_mito" and/or "percent_ribo" already present in
## meta.data slot.
## ℹ *To run function and overwrite columns set parameter `overwrite = TRUE` or
## change respective `mito_name`, `ribo_name`, and/or `mito_ribo_name`*
# Or if providing custom patterns/lists and features not found
pbmc <- Add_Mito_Ribo(object = pbmc, species = "other", mito_pattern = "^MT-", ribo_pattern = "BAD_PATTERN")
## Warning: No Ribo features found in object using pattern/feature list provided.
## ℹ No column will be added to meta.data.
## Adding Percent Mitochondrial genes for other using gene symbol
## pattern: "^MT-".
`Add_Mito_Ribo` will also stop with an error if the columns are already present in the `@meta.data` slot and prompt you to set `overwrite = TRUE` (or change the column names) if you want to run the function.
pbmc <- Add_Mito_Ribo(object = pbmc, species = "human")
## Error in `Add_Mito_Ribo()`:
## ! Columns with "percent_mito" and/or "percent_ribo" already present in
## meta.data slot.
## ℹ *To run function and overwrite columns set parameter `overwrite = TRUE` or
## change respective `mito_name`, `ribo_name`, and/or `mito_ribo_name`*
In addition to metrics like the number of features and UMIs, it can often be helpful to analyze the complexity of expression within a single cell. scCustomize provides functions to add two of these metrics to meta data.
scCustomize contains an easy shortcut function to add a measure of cell complexity/novelty that can sometimes be useful for filtering low quality cells. The metric is calculated as log10(nFeature) / log10(nCount).
# Add cell complexity (log10GenesPerUMI) to meta.data
pbmc <- Add_Cell_Complexity(object = pbmc)
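The ratio described above can be worked through by hand on a toy matrix. This is a minimal base R sketch of the formula log10(nFeature) / log10(nCount), not the package's code; the matrix and variable names are illustrative.

```r
# Toy genes x cells count matrix (illustrative values only).
counts <- matrix(
  c(90, 1,
     5, 1,
     5, 1,
     0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("cell1", "cell2"))
)

n_count   <- colSums(counts)      # total UMIs per cell
n_feature <- colSums(counts > 0)  # number of detected genes per cell

# Cell complexity: log10(nFeature) / log10(nCount)
complexity <- log10(n_feature) / log10(n_count)

complexity
```

Here `cell1` has most of its counts in a single gene and gets a lower value than `cell2`, whose counts are spread evenly over every gene (one count each, so the ratio is 1).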
Additionally (or alternatively), scCustomize contains another metric of complexity: the top percent expression. The user supplies an integer value for `num_top_genes` (default is 50), which specifies the number of genes, and the function returns the percentage of counts occupied by the top XX genes in each cell.
# Add percentage of counts from the top 50 genes in each cell
pbmc <- Add_Top_Gene_Pct(object = pbmc, num_top_genes = 50)
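As a rough illustration of what this metric measures, the base R sketch below computes the share of counts taken up by the top N genes in each cell of a toy matrix. The helper `top_gene_pct()` is hypothetical and is not the package's implementation; the top genes are chosen separately for each cell.

```r
# Toy genes x cells count matrix (illustrative values only).
counts <- cbind(
  cell1 = c(50, 30, 10, 5, 5),
  cell2 = c(20, 20, 20, 20, 20)
)
rownames(counts) <- paste0("gene", 1:5)

# Hypothetical helper: percent of each cell's counts held by its
# num_top_genes most highly expressed genes.
top_gene_pct <- function(counts, num_top_genes) {
  apply(counts, 2, function(cell) {
    n_top <- min(num_top_genes, length(cell))
    top <- sort(cell, decreasing = TRUE)[seq_len(n_top)]
    100 * sum(top) / sum(cell)
  })
}

pct_top2 <- top_gene_pct(counts, num_top_genes = 2)

pct_top2
```

A cell dominated by a few genes (`cell1`) scores higher than one with evenly distributed counts (`cell2`), which is why high values can flag low complexity cells.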
scCustomize also contains a function to add the percentage of counts from hemoglobin genes. Use of this metric is much more situational. It can be helpful if your experiment has the potential for red blood cell contamination that you want to detect and avoid. A high percentage of hemoglobin counts may indicate that your sample has a high amount of ambient RNA present or RBCs among the cells captured.
pbmc <- Add_Hemo(object = pbmc, species = "human")
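The human hemoglobin pattern reported in the messages earlier in this vignette is `^HB[^(P)]`. The short base R sketch below shows which of a few toy gene symbols that pattern matches; the symbol vector is chosen purely for illustration.

```r
# Toy gene symbols (illustrative only).
features <- c("HBA1", "HBB", "HBG2", "HBP1", "CD3E")

# Pattern reported in the vignette's console output for human data:
# starts with "HB" and the next character is not "(", "P", or ")".
hemo_pattern <- "^HB[^(P)]"

hemo_hits <- features[grepl(hemo_pattern, features)]

hemo_hits
```

In this toy vector, `HBP1` is excluded because its third character is `P`.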
In addition to those standard QC metrics, it can be helpful when using network-based QC analysis to add the percent of expression of genes related to common pathways. This function and the network-based analysis are a further extension of the analysis/QC from our recent publication: Gazestani & Kamath et al., 2023 (Cell).
In scCustomize, the percent of gene expression from the following gene lists can be added as part of `Add_Cell_QC_Metrics`: