scanpy demo: preprocessing, clustering, and cell-type annotation

1 Introduction

For convenience, see HTML or PDF.

PDF YAML specifications:

---
title: "scanpy demo: preprocessing, clustering, and cell-type annotation"
author: "Rui Yang"
date: "2025-12-08"
date-modified: last-modified
categories: [scanpy, demo]

format:
  pdf:
    pdf-engine: xelatex
    include-in-header:
      - text: |
          % 1. 中文支持与字体映射
          \usepackage{xeCJK}
          % 正文：宋体；粗体：黑体；斜体：楷体
          \setCJKmainfont[BoldFont=SimHei, ItalicFont=KaiTi]{SimSun}
          % 标题：微软雅黑
          \setCJKsansfont[BoldFont=Microsoft YaHei]{Microsoft YaHei}
          % 中文代码注释：仿宋
          \setCJKmonofont{FangSong}

          % 2. 长代码自动换行
          \usepackage{fvextra}
          \DefineVerbatimEnvironment{Highlighting}{Verbatim}{breaklines=true, commandchars=\\\{\}}

          % 3. 图片位置
          \usepackage{float}
          \let\origfigure\figure
          \let\endorigfigure\endfigure
          \renewenvironment{figure}[1][2] {
              \expandafter\origfigure\expandafter[H]
          } {
              \endorigfigure
          }

    toc: true
    toc-depth: 3

    colorlinks: true
    linkcolor: "blue"
    urlcolor: "blue"

    mainfont: "Times New Roman"
    sansfont: "Arial"
    monofont: "Consolas"
    fontsize: 11pt

    number-depth: 3
    number-sections: true

    fig-align: center
    fig-cap-location: bottom
    fig-format: png
    tbl-cap-location: top

    highlight-style: github
    code-line-numbers: true
    code-block-bg: "#f9f9f9"

    geometry:
      - top=25mm
      - left=25mm
      - right=25mm
      - bottom=25mm

jupyter: python3
execute:
  warning: false
  eval: false
---

2 Load packages and data

import anndata as ad
import scanpy as sc

# set global parameters for scanpy plots
sc.settings.set_figure_params(dpi=150, facecolor="white")

samples = {
  "s1d1": "s1d1_filtered_feature_bc_matrix.h5",
  "s1d3": "s1d3_filtered_feature_bc_matrix.h5"
}
adatas = {}

for sample_id, filename in samples.items():
  # read in 10x h5 file
  sample_adata = sc.read_10x_h5(filename)
  # make gene names unique
  sample_adata.var_names_make_unique()
  adatas[sample_id] = sample_adata

# `axis="obs"` by default
# `join="inner"` by default
# `merge="None"` by default
# add a column named 'sample' to store sample information, with values taken from the keys of `adatas`
adata = ad.concat(adatas, label="sample")
# make cell names unique
adata.obs_names_make_unique()

Introduction to concat() from anndata:

With concat, AnnData objects can be combined via a composition of two operations: concatenation, and merging.

Concatenation is when we keep all sub elements of each object, and stack these elements in an ordered way along one axis, such as along the observations by default.

即选择一个 axis，然后将 elements aligned to this axis among objects 简单地堆叠在一起。

Merging is combining a set of collections not aligned to the axis of concatenation into one resulting collection which contains elements from the objects.

这一步可以分为两类：

Inner and outer joins：对于 variables (such as genes and cells metadata columns) aligned to the axis of concatenation，在进行 stacking 时，通过 join 参数控制这些 variables 应该如何被 concatenated，特别是当这些 variables 在 objects 间不同时，如一个对象中有的基因不一定在另一个对象中。‘inner’ 表示 intersection，‘outer’ 表示 union。
Merging：combining elements not aligned to the axis of concatenation。如基因的 metadata columns。可选项如下：

None：全部丢弃。
same：保留相同的。
unique：elements for which there is only one possible value。
first：the first element seen。
only：保留仅在一个对象中出现的。

另外值得注意的参数：在合并的对象中表明数据来源。如果对象存储在一个 dict 中传给 concat()，可以直接指定 label="sample"，这表明在 obs 中另起一列 ‘sample’ 用来存储数据来源信息，值即为 dict keys。如果存储在一个 list 中，可以用 keys 指定每一个对象的值。为了避免合并的对象存在 shared names，可以用 index_unique 指定分隔符，将 names 重新命名为 <name><index_unique><key>。

For more info, see here.

3 Quality control

3.1 Filtering based on ordinary statistics

Identify which genes are mitochondrial genes, ribosomal genes, or hemoglobin genes:

# add three logical columns in `var`, indicating which genes are motichondrial genes, ribosomal genes, or hemoglobin genes
# mitochondrial genes
# "MT-" for human, "Mt-" for mouse
adata.var["mt"] = adata.var_names.str.startswith("MT-")
# ribosomal genes
adata.var["ribo"] = adata.var_names.str.startswith(("RPS", "RPL"))
# hemoglobin genes
adata.var["hb"] = adata.var_names.str.contains("^HB[^(P)]")

Calculate QC metrics:

# pp module means preprocessing,
# which provides functions for basic preprocessing, recipes, batch effect correction, doublet detection, and neighbors.
# `qc_vars` accepts keys for boolean columns of `var`
# `inplace`: whether to place calculated metrics in `obs` and `var`
# `log1p`: whether to compute `log1p` transformed annotations
sc.pp.calculate_qc_metrics(
    adata, qc_vars=["mt", "ribo", "hb"], inplace=True, log1p=True
)

Plot violins over QC metrics:

# pl module means plotting
sc.pl.violin(adata, ["n_genes_by_counts", "total_counts", "pct_counts_mt"], jitter=0.4, multi_panel=True)

Check correlations among QC metrics:

sc.pl.scatter(adata, "total_counts", "n_genes_by_counts", color="pct_counts_mt")

Filter cells/genes:

sc.pp.filter_cells(adata, min_genes=100)
sc.pp.filter_genes(adata, min_cells=3)

3.2 Doublet detection

Using Scrublet to detect doublets.

R packages scDblFinder and DoubletFinder are two good alternatives.

因为它们并不依赖于 Seurat 对象，所以可以很方便的调用，然后读入 metadata 即可。

# Works best if the input is a raw (unnormalized) counts matrix from a single sample or a collection of similar samples from the same experiment.
# Important parameters: expected_doublet_rate (0.05), stdev_doublet_rate (0.02).
sc.pp.scrublet(adata, batch_key="sample")

4 Normalization

将 raw counts matrix 保存到 layers["counts"]，X 后续会被原位更新。

# Saving count data
# `X` is the active data matrix
# raw counts matrix is originally stored in `X`
adata.layers["counts"] = adata.X.copy()

主要涉及两个步骤：

Normalization：每个细胞的测序深度是不一样的，原始的 raw counts 在细胞间并不具有可比性。为了使基因表达在细胞间具有可比性，需要把测序深度调整到一样大。

\[ G_{norm} = \frac{G_{raw}}{Total\_Counts\_in\_Cell} \times Size\_Factor \]

为了消除测序深度的差异，我们将每个基因在细胞中的 raw counts 转为其在细胞中的表达占比（假设每个基因的 counts 与其表达量成正比，即建库和测序步骤完全遵循随机抽样）。

\(Size\_Factor\) 常设为 \(1e4\) 或 \(1e6\)（即 CPM），为了使基因表达量不至于太小（计算误差太大）。也可使用所有细胞总 counts 数的中位数，这样使得归一化后的数值范围和原始数据比较接近，物理意义更加直观，不会强行把数据拉的太大或太小。

下述函数中的 target_sum 参数就是 \(Size\_Factor\)。

log1p：使方差稳定：基因表达数据的分布通常偏态十分严重（大部分基因的表达量低，少数极高，且方差随均值的增大而增大——理论上这是不对的，technical effects 和 biological effects 对高/低表达基因的影响是一致的，即不管基因表达高/低，其方差应是相同的）。对数变换可以减弱这种“异方差性”；采用 \(\ln(G_{norm}+1)\) 避免 0 值。

# Notable parameters: exclude_highly_expressed, max_fraction.
# Normalizing to median total counts by default if  `target_sum` not set
sc.pp.normalize_total(adata, target_sum=10000)
# Logarithmize the data:
sc.pp.log1p(adata)

4.1 关于 `sc.pp.scale(adata)` 的讨论

To scale or not to scale.

前提是 normalize_total 和 log1p（raw counts 存在极强的 mean-variance relationship，即均值越高的基因，方差越大。log1p 可以扮演 variance stablizing transformation 的角色）已使基因表达在细胞间具有可比性且将基因表达的量级极大地拉近。

放弃 scale 等于相信基因间的“变异幅度（magnitude of variance）”本身包含生物学信息，即表达量的绝对改变幅度是有意义的。

z-score 会使“强信号”和“弱信号”在 PCA 面前完全平等，方差都变成 1，即使强信号的 SD=100，而弱信号的 SD=10。

另外 scale 本身容易导致噪声放大。因为在 scRNA-seq 中，低表达基因通常伴随着极高的技术噪声，即 scale 会赋予低表达、高噪音的基因过高的权重。

当然，在绘制热图（为了看清每个基因在不同细胞间的表达模式），多组学整合（存在系统性偏差？），寻找变化微弱但重要的基因（变异幅度极小）等情景时需要 scale。

Maybe, try sc.pp.scale(adata) and see what will be changed.

5 Feature selection

Reduce the dimensionality of the dataset and only include the most informative genes.

Keep a watchful eye on flavor when there is a better choice!

# batch_key: If specified, highly-variable genes are selected within each batch separately and merged. This simple process avoids the selection of batch-specific genes and acts as a lightweight batch correction method.
# flavor: seurat (mean.var.plot - expect logarithmized data), seurat_v3 (vst - expect counts).
# n_top_genes: mandatory for seurat_v3.
# `zero_center=True`.
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3", layer="counts", batch_key="sample")

# seurat: log-normalized data
# seurat_v3: raw counts
sc.pl.highly_variable_genes(adata, log=True)

6 Dimensionality reduction

Reveals the main axes of variation and denoises the data.

# tl module means tools
# mask_var: use which set of genes (`var['highly_variable']` by default)
# n_comps: the number of principal components to compute
sc.tl.pca(adata, n_comps=100)

sc.set_figure_params(figsize=(20, 8))
sc.pl.pca_variance_ratio(adata, n_pcs=100, log=True)
sc.set_figure_params(figsize=(4, 4))

7 Visualization

Compute the nearest neighbors distance matrix and a neighborhood graph of observations.

# n_neighbors: The size of local neighborhood (in terms of number of neighboring data points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general, values should be in the range 2 to 100.
# 可以通过调整 n_neighbors 来控制 UMAP 是更多地保留 global structure 还是 local structure
# 对于 local structure，或许还是用 t-SNE
sc.pp.neighbors(adata, n_pcs=50)

# UMAP 是基于 neighborhood graph (k-NN graph) 进行投影的
# `n_components=2` by default
sc.tl.umap(adata)

sc.pl.umap(adata, color="sample")

8 Clustering

resolution = 1

sc.tl.leiden(adata, resolution=resolution, key_added=f"leiden_res{str(resolution).replace(".", "_")}", flavor="igraph")

sc.pl.umap(adata, color=[f"leiden_res{str(resolution).replace(".", "_")}"])

sc.pl.umap(adata, color=[f"leiden_res{str(resolution).replace(".", "_")}"], legend_loc="on data")

9 Re-assess quality control and cell filtering

adata.obs["predicted_doublet"] = adata.obs["predicted_doublet"].astype("category")
sc.pl.umap(
    adata,
    color=[f"leiden_res{str(resolution).replace(".", "_")}", "predicted_doublet", "doublet_score"],
    # increase horizontal space between panels
    wspace=0.5,
)

Remove predicted doublets:

# ~ means not
adata = adata[~adata.obs["predicted_doublet"].to_numpy()].copy()

由于我们并没有严格过滤掉低质量的细胞，如 cells with low number of total counts and low number of genes detected, as well as with high exopression levels of mitochondrial genes，所以从 UMAP 图上可以看出来，确实存在一部分低质量的细胞。

sc.pl.umap(
    adata, color=[f"leiden_res{str(resolution).replace(".", "_")}", "log1p_total_counts", "pct_counts_mt", "log1p_n_genes_by_counts"], wspace=0.5, ncols=2
)

10 Cell-type annotation

Try more resolutions:

sc.tl.leiden(adata, flavor="igraph", key_added="leiden_res0_5", resolution=0.5)
sc.tl.leiden(adata, flavor="igraph", key_added="leiden_res1_5", resolution=1.5)
sc.tl.leiden(adata, flavor="igraph", key_added="leiden_res2", resolution=2)

sc.pl.umap(
    adata,
    color=["leiden_res0_5", "leiden_res1", "leiden_res1_5", "leiden_res2"],
    legend_loc="on data",
    ncols=2,
)

10.1 Use canonical markers

marker_genes = {
    "CD14+ Mono": ["FCN1", "CD14"],
    "CD16+ Mono": ["TCF7L2", "FCGR3A", "LYN"],
    # Note: DMXL2 should be negative
    "cDC2": ["CST3", "COTL1", "LYZ", "DMXL2", "CLEC10A", "FCER1A"],
    "Erythroblast": ["MKI67", "HBA1", "HBB"],
    # Note: HBM and GYPA are negative markers
    "Proerythroblast": ["CDK6", "SYNGR1", "HBM", "GYPA"],
    "NK": ["GNLY", "NKG7", "CD247", "FCER1G", "TYROBP", "KLRG1", "FCGR3A"],
    "ILC": ["ID2", "PLCG2", "GNLY", "SYNE1"],
    "Naive CD20+ B": ["MS4A1", "IL4R", "IGHD", "FCRL1", "IGHM"],
    # Note: IGHD and IGHM are negative markers
    "B cells": ["MS4A1", "ITGB1", "COL4A4", "PRDM1", "IRF4", "PAX5", "BCL11A", "BLK", "IGHD", "IGHM"],
    "Plasma cells": ["MZB1", "HSP90B1", "FNDC3B", "PRDM1", "IGKC", "JCHAIN"],
    # Note: PAX5 is a negative marker
    "Plasmablast": ["XBP1", "PRDM1", "PAX5"],
    "CD4+ T": ["CD4", "IL7R", "TRBC2"],
    "CD8+ T": ["CD8A", "CD8B", "GZMK", "GZMA", "CCL5", "GZMB", "GZMH", "GZMA"],
    "T naive": ["LEF1", "CCR7", "TCF7"],
    "pDC": ["GZMB", "IL3RA", "COBLL1", "TCF4"],
}

resolution = "leiden_res1"

sc.tl.dendrogram(adata, groupby=resolution, n_pcs=50, optimal_ordering=True)
# `var_group_genes` means keys in `marker_genes`
sc.pl.dotplot(adata, marker_genes, groupby=resolution, dendrogram=True)

10.2 Automatic label prediction

Using CellTypist package.

import requests
import urllib3
import functools

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
# 保存原始的 Session.request 方法，防止无限递归
_original_session_request = requests.Session.request


# 定义新的劫持函数
@functools.wraps(_original_session_request)
def patched_request(self, method, url, *args, **kwargs):
    kwargs["verify"] = False
    return _original_session_request(self, method, url, *args, **kwargs)


requests.Session.request = patched_request

# for automatic label prediction
import celltypist as ct

# check available models
# ct.models.models_description()

# download all models
# models.download_models(force_update = True)
# download specified models
# ct.models.download_models(model=["Immune_All_Low.pkl"], force_update=True)
# ct.models.models_path

model = ct.models.Model.load(model="Immune_All_Low.pkl")

# 默认情况下，逐细胞注释。
# `majority_voting=True`：依据少数服从多数原理注释 clusters，可以结合参数 min_prop。
# min_prop: For the dominant cell type within a subcluster, the minimum proportion of cells required to support naming of the subcluster by this cell type.
# `over_clustering="leiden_res2"`：指定 over-clustering result。
# `mode="best match"` by default.
# p_thres: Probability threshold for the multi-label classification, for `mode="prob match"`.
predictions = ct.annotate(adata, model="Immune_All_Low.pkl", majority_voting=True, over_clustering="leiden_res2")
# convert back to anndata
adata = predictions.to_adata()

sc.pl.umap(adata, color="majority_voting", ncols=1)

10.3 Annotation with enrichment analysis

Using decoupler package.

Essentially, this will test if any collection of genes are enriched in any of the cells. Ultimately, this approach is similar to many other marker-based classifiers.

import decoupler as dc

Download and keep canonical cell type markers only from PanlaoDB:

# Query Omnipath and get PanglaoDB
markers = dc.op.resource(name="PanglaoDB", organism="human")
# Keep canonical cell type markers alone
markers = markers[markers["canonical_marker"]]

# Remove duplicated entries
markers = markers[~markers.duplicated(["cell_type", "genesymbol"])]
markers.head()

markers = markers.rename(columns=dict(cell_type="source", genesymbol="target"))
dc.mt.mlm(adata, net=markers, verbose=True)

Obtain the results:

# coefficient t-values
adata.obsm["score_mlm"].head()

# adjusted p-values
adata.obsm["padj_mlm"].head()

Visualize the obtained scores:

# Extracts values stored in `.obsm` as a new AnnData object.
# This allows to reuse scanpy functions to visualise enrichment scores.
acts = dc.pp.get_obsm(adata, "score_mlm")
sc.pl.umap(
    acts,
    color=[
        "majority_voting",
        "B cells",
        "T cells",
        "Monocytes",
        "Erythroid-like and erythroid precursor cells",
        "NK cells",
    ],
    ncols=1,
)

Transfer the max over-representation score estimates to assign a label to each cluster:

# Calculate the differential activity of each cluster against all others using a predefined gene set for each cell type.
enr = dc.tl.rankby_group(acts, groupby="leiden_res2")
annotation_dict = enr[enr["stat"] > 0].groupby("group").head(1).set_index("group")["name"].to_dict()
adata.obs["dc_anno"] = [annotation_dict[clust] for clust in adata.obs["leiden_res2"]]

sc.pl.umap(adata, color=["majority_voting", "dc_anno"], ncols=1)

10.4 Differentially-expressed genes as markers

# Obtain cluster-specific differentially expressed genes
sc.tl.rank_genes_groups(adata, groupby="leiden_res1")
# Filter those
sc.tl.filter_rank_genes_groups(adata, min_fold_change=1.5)

sc.pl.rank_genes_groups_dotplot(adata, groupby="leiden_res1", standard_scale="var", n_genes=5)

cluster_genes = ["LYZ", "ACTB", "S100A6", "S100A4", "CST3"]

sc.pl.umap(
    adata,
    color=[*cluster_genes, "leiden_res1"],
    legend_loc="on data",
    frameon=False,
    ncols=3,
)

sc.set_figure_params(figsize=(8, 4))
sc.pl.violin(adata, keys=cluster_genes[0:3], groupby="leiden_res1", multi_panel=True)