Updating Human Gene Symbols

The official gene symbols used in a dataset can change depending on the reference version used to align that dataset. For human genes, the official symbols are set by the HUGO Gene Nomenclature Committee (HGNC).

In the absence of a more stable identifier (Ensembl ID or Entrez ID), the only way to update gene symbols is to check each symbol against the current and previous symbols for all genes in the HGNC database. However, many of the functions that perform this task come with caveats, ranging from difficulty updating to the newest HGNC data to, at worst, renaming symbols incorrectly.

Load Seurat Object & Add QC Data

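library(Seurat)
library(scCustomize)

# Read the example object and update it to the current Seurat object structure
pbmc <- pbmc3k.SeuratData::pbmc3k.final
pbmc <- UpdateSeuratObject(pbmc)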

Issues with other functions

To understand how scCustomize’s Updated_HGNC_Symbols() improves this process, it is important to be aware of the caveats of some other tools.

Seurat’s UpdateSymbolList()

The first is Seurat’s UpdateSymbolList(), which takes an input vector of symbols and uses an active connection to HGNC to query for updated symbols. However, there are two caveats with this function: 1) it requires the user to have an internet connection every time the function is used, and 2) it can potentially rename symbols incorrectly.

To illustrate the second issue I will use 3 gene symbols that have been current for some time: MCM2, MCM7, CCNL1. However, let’s take a look at some of the previous symbols for each of these genes:
- Previous symbols for MCM2 are: CCNL1 & CDCL1
- Previous symbols for MCM7 are: MCM2
- Previous symbols for CCNL1 are: None

Now see what happens when we use UpdateSymbolList.

test_symbols <- c("MCM2", "MCM7", "CCNL1")

UpdateSymbolList(symbols = test_symbols)
## [1] "MCM7" "MCM7" "MCM2"

As you can see, the function does the following:
- Renames MCM2 > MCM7 because MCM2 is a previous symbol of MCM7.
- Leaves MCM7 the same because no other gene has MCM7 as a previous symbol.
- Renames CCNL1 > MCM2 because CCNL1 is a previous symbol of MCM2.

This happens because UpdateSymbolList queries each symbol in isolation rather than in the context of all of the genes being queried.
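The effect of looking up each symbol in isolation can be reproduced with a toy lookup table in base R. This is only a sketch of the behavior, not Seurat’s implementation (which queries HGNC over the network); `prev_to_current` and `naive_update` are hypothetical names, and the table contains only the previous symbols listed above.

```r
# Toy table: previous symbol -> current symbol (taken from the list above)
prev_to_current <- c(CCNL1 = "MCM2", CDCL1 = "MCM2", MCM2 = "MCM7")

# Hypothetical helper: replace any symbol that appears as a previous symbol,
# without checking whether it is itself a current, approved symbol
naive_update <- function(symbols, lookup) {
  hit <- symbols %in% names(lookup)
  out <- symbols
  out[hit] <- unname(lookup[symbols[hit]])
  out
}

naive_update(c("MCM2", "MCM7", "CCNL1"), prev_to_current)
## [1] "MCM7" "MCM7" "MCM2"
```

Because the helper never asks whether MCM2 or CCNL1 are already approved symbols, both are overwritten.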

HGNChelper Package

After developing this function, I was made aware of the HGNChelper package, which also aims to provide symbol updates. It solves the renaming issue in a similar fashion to scCustomize (see below). It also provides a solution for the requirement of internet access.

It does this by storing the HGNC dataset as package data so that it comes bundled with the package. However, there is an issue with the way this is implemented. The bundled data is from 2019, so it is approaching 5 years old. Updated data can be downloaded interactively using a package function, but this must be done in every R session where the data is needed, which again requires internet access to use current data. The authors do provide a solution to this, but it involves cloning the GitHub repo and running source scripts, which may be beyond many R users.

Solving the Issue with scCustomize’s Updated_HGNC_Symbols

scCustomize now provides the function Updated_HGNC_Symbols to attempt to solve both of these caveats.

Requirement of internet access

Updated_HGNC_Symbols does require internet access the first time the function is used, in order to download the most recent data from HGNC. However, it then stores the downloaded data using the BiocFileCache package, meaning subsequent uses don’t require any internet access. This also significantly improves the speed of the function.
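The general "download once, reuse afterwards" pattern can be sketched in base R. This is not how scCustomize or BiocFileCache implement caching; `get_cached_file`, `demo_fetch`, the cache directory, and the example.org URL are all hypothetical, and a stand-in fetcher is used so the sketch runs offline.

```r
# Hypothetical helper: fetch a file only if no local copy exists yet.
# `fetch_fun` is any function(url, destfile) that writes the file;
# utils::download.file accepts its first two arguments in that order.
get_cached_file <- function(url, cache_dir, fetch_fun = utils::download.file) {
  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(cache_dir, basename(url))
  if (!file.exists(dest)) {
    fetch_fun(url, dest)  # only reached when the file is not cached yet
  }
  dest
}

# Stand-in fetcher for illustration: writes a placeholder instead of downloading
demo_fetch <- function(url, destfile) writeLines("placeholder HGNC data", destfile)

cache_dir <- file.path(tempdir(), "hgnc_cache_demo")
placeholder_url <- "https://example.org/hgnc_complete_set.txt"  # not a real HGNC URL

first  <- get_cached_file(placeholder_url, cache_dir, demo_fetch)  # fetches if not cached
second <- get_cached_file(placeholder_url, cache_dir, demo_fetch)  # reuses the local file
identical(first, second)
```

Only the first call needs a connection; later calls read the local copy, which is also why repeated lookups are fast.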

Inappropriate renaming

Second, Updated_HGNC_Symbols considers the full input list and first automatically approves any symbol that is already an approved gene symbol, so there is no chance of improperly updating those symbols. It then checks only the remaining symbols for updates.

Let’s run our test symbol set:

results <- Updated_HGNC_Symbols(input_data = test_symbols)
## Input features contained 3 gene symbols
##  3 were already approved symbols.
##  0 were updated to approved symbol.
##  0 were not found in HGNC dataset and remain unchanged.
input_features  Approved_Symbol  Not_Found_Symbol  Updated_Symbol  Output_Features
MCM2            MCM2             NA                NA              MCM2
MCM7            MCM7             NA                NA              MCM7
CCNL1           CCNL1            NA                NA              CCNL1
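The approve-first logic can be sketched in base R with a toy table. This is a simplified illustration under stated assumptions, not the scCustomize source: `approved`, `prev_to_current`, and `approve_first_update` are hypothetical names, and the toy data contains only the symbols discussed in this section.

```r
# Toy data: approved symbols and previous -> current symbol mappings
approved        <- c("MCM2", "MCM7", "CCNL1")
prev_to_current <- c(CCNL1 = "MCM2", CDCL1 = "MCM2", MCM2 = "MCM7")

# Hypothetical helper: symbols that are already approved are never touched;
# only the remaining symbols are checked against the previous-symbol table
approve_first_update <- function(symbols, approved, lookup) {
  out <- symbols
  needs_update <- !(symbols %in% approved) & symbols %in% names(lookup)
  out[needs_update] <- unname(lookup[symbols[needs_update]])
  out
}

approve_first_update(c("MCM2", "MCM7", "CCNL1", "CDCL1"), approved, prev_to_current)
## [1] "MCM2"  "MCM7"  "CCNL1" "MCM2"
```

The three approved symbols pass through unchanged, while CDCL1, which is only a previous symbol, is updated to MCM2.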

As mentioned before, the function is also very quick, returning updated symbols for ~36,000 genes in ~1 second.

# Read in full 10X reference genome feature list
features <- Read10X_h5("assets/Barcode_Rank_Example/sample1/outs/raw_feature_bc_matrix.h5")

features <- rownames(features)

# Load tictoc to give timing
library(tictoc)

# Get updated symbols
tic()
results <- Updated_HGNC_Symbols(input_data = features)
## Input features contained 36,601 gene symbols
##  23,360 were already approved symbols.
##  654 were updated to approved symbol.
##  12,587 were not found in HGNC dataset and remain unchanged.
toc()
## 0.654 sec elapsed

Examining the Results

Now let’s take a look at the output from Updated_HGNC_Symbols, which also has some advantages in its level of detail compared with other methods.

For this example, I have picked a section of the results that contains all 3 potential outcomes.

results[168:177, ]
     input_features  Approved_Symbol  Not_Found_Symbol  Updated_Symbol  Output_Features
168  NPHP4           NPHP4            NA                NA              NPHP4
169  KCNAB2          KCNAB2           NA                NA              KCNAB2
170  CHD5            CHD5             NA                NA              CHD5
171  RPL22           RPL22            NA                NA              RPL22
172  AL031847.1      NA               AL031847.1        NA              AL031847.1
173  RNF207          RNF207           NA                NA              RNF207
174  ICMT            ICMT             NA                NA              ICMT
175  LINC00337       NA               NA                ICMT-DT         ICMT-DT
176  HES3            HES3             NA                NA              HES3
177  GPR153          GPR153           NA                NA              GPR153

As you can see, the majority of these symbols are already approved, so the input symbol matches the output symbol.

In the case of “AL031847.1”, that annotation was not found in HGNC and therefore the symbol was left unchanged.

Finally, in the case of “LINC00337”, there was an updated symbol of “ICMT-DT”, so the output symbol was updated to that current symbol.
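Because each outcome has its own column, the three cases can be pulled apart with ordinary subsetting. The sketch below assumes the result is a data.frame with the column names shown in the printed output and that missing entries are true NA values; `results_example` is a hypothetical stand-in built from four of the rows above so that the code runs without the full dataset.

```r
# Stand-in data copied from rows 171, 172, 174 and 175 of the output above
results_example <- data.frame(
  input_features   = c("RPL22", "AL031847.1", "ICMT", "LINC00337"),
  Approved_Symbol  = c("RPL22", NA, "ICMT", NA),
  Not_Found_Symbol = c(NA, "AL031847.1", NA, NA),
  Updated_Symbol   = c(NA, NA, NA, "ICMT-DT"),
  Output_Features  = c("RPL22", "AL031847.1", "ICMT", "ICMT-DT"),
  stringsAsFactors = FALSE
)

# Symbols that were renamed, with their old and new names
updated <- results_example[!is.na(results_example$Updated_Symbol),
                           c("input_features", "Output_Features")]

# Symbols that were not found in HGNC and were left unchanged
not_found <- results_example$input_features[!is.na(results_example$Not_Found_Symbol)]

# Tally of the three possible outcomes
status <- ifelse(!is.na(results_example$Approved_Symbol), "approved",
                 ifelse(!is.na(results_example$Updated_Symbol), "updated", "not_found"))
table(status)
```

The same subsetting should carry over to the full results object, provided the two assumptions above hold for it.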