Learning Single-Cell Transcriptome Analysis from Cell (Part 7): Differential Cell Proportion Analysis and Statistical Visualization

📅 Published: 2026/6/30 10:23:34

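1. Core Logic of Differential Cell Proportion Analysis

In single-cell transcriptome analysis, comparing cell-type proportions between conditions is a key step toward biological interpretation. Suppose you have two groups of samples, healthy and diseased. After the clustering and annotation steps covered earlier, you know which cell types each sample contains. The most direct question is then: which cell types change significantly in proportion under disease?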

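In practice, we first build a data frame containing the following information:

Sample ID (e.g. Sample1, Sample2)
Group label (e.g. Control, Disease)
Proportion of each cell type (e.g. T cells make up 15% of the sample)

In R, this data structure typically looks like this: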

    > head(cell_ratio_df)
      sample   group    cell_type proportion
    1   S001 Control       T_cell     0.1523
    2   S002 Control       T_cell     0.1687
    3   S003 Disease       T_cell     0.2541
    4   S001 Control Myeloid_cell     0.3215
    ...
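A table of this shape can be derived from per-cell metadata. The following is a minimal base-R sketch under stated assumptions: `cell_meta` and its contents are made up for illustration, and in a real analysis the per-cell table would come from your annotated object rather than being typed in by hand.

```r
# Hypothetical per-cell metadata: one row per cell (all values are illustrative).
cell_meta <- data.frame(
  sample    = rep(c("S001", "S002", "S003", "S004"), each = 10),
  group     = rep(c("Control", "Control", "Disease", "Disease"), each = 10),
  cell_type = c(rep(c("T_cell", "Myeloid_cell"), times = c(2, 8)),
                rep(c("T_cell", "Myeloid_cell"), times = c(3, 7)),
                rep(c("T_cell", "Myeloid_cell"), times = c(5, 5)),
                rep(c("T_cell", "Myeloid_cell"), times = c(6, 4))),
  stringsAsFactors = FALSE
)

# Count cells per sample and cell type, then normalise within each sample.
counts <- table(cell_meta$sample, cell_meta$cell_type)
props  <- prop.table(counts, margin = 1)   # each row (sample) sums to 1

# Long format: one row per sample x cell type combination.
cell_ratio_df <- as.data.frame(props, stringsAsFactors = FALSE)
colnames(cell_ratio_df) <- c("sample", "cell_type", "proportion")

# Attach the group label of each sample through a lookup table.
sample_group <- unique(cell_meta[, c("sample", "group")])
cell_ratio_df$group <- sample_group$group[match(cell_ratio_df$sample,
                                                sample_group$sample)]
cell_ratio_df <- cell_ratio_df[, c("sample", "group", "cell_type", "proportion")]
```

Normalising within each sample (rather than within each group) keeps one observation per sample and cell type, which is what the tests below operate on.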

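The choice of statistical test deserves care:

t-test: suitable when the data are approximately normally distributed (check with shapiro.test())
Wilcoxon test: nonparametric; does not assume a particular distribution
ANOVA: for comparing more than two groups
Kruskal-Wallis: the nonparametric counterpart for multi-group comparisons

Tip: in practice, consider running both a parametric and a nonparametric test; the result is more reliable when the two reach the same conclusion.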
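As a rough illustration of the tip above, the sketch below runs a normality check, a t-test and a Wilcoxon test on the same two groups and compares their conclusions. The proportions are made up, and the 0.05 threshold is only a common convention, not something prescribed here.

```r
# Made-up T-cell proportions for two groups (illustration only).
control <- c(0.15, 0.17, 0.14, 0.16, 0.18, 0.13)
disease <- c(0.25, 0.22, 0.27, 0.24, 0.29, 0.23)

# Shapiro-Wilk per group: a small p-value suggests departure from normality.
normal_enough <- shapiro.test(control)$p.value > 0.05 &&
                 shapiro.test(disease)$p.value > 0.05

t_res <- t.test(control, disease)        # Welch two-sample t-test
w_res <- wilcox.test(control, disease)   # Wilcoxon rank-sum test

# Do the two tests agree at the 0.05 level?
agree <- (t_res$p.value < 0.05) == (w_res$p.value < 0.05)
```

With very few samples per group, shapiro.test() has little power, so a non-significant result is weak evidence of normality; that is one reason to look at the nonparametric result as well.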

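2. Group Comparisons with ggpubr

The ggpubr package extends ggplot2 and is designed for publication-style statistical plots. Install rstatix along with it, since the examples below use both:

    install.packages(c("ggpubr", "rstatix"))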

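A complete visualization workflow for the differential analysis looks like this:

2.1 Basic box plot

    library(ggpubr)   # also attaches ggplot2

    ggplot(cell_ratio_df, aes(x = cell_type, y = proportion, fill = group)) +
      geom_boxplot(outlier.shape = NA) +
      # jitter the points within each group's box rather than around the category centre
      geom_point(position = position_jitterdodge(jitter.width = 0.1), alpha = 0.5) +
      theme_minimal()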

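2.2 Adding statistical significance

    # Shared base plot
    p_base <- ggboxplot(cell_ratio_df, x = "cell_type", y = "proportion",
                        fill = "group", palette = "npg")

    # Method 1: add significance stars automatically
    p <- p_base +
      stat_compare_means(aes(group = group), method = "wilcox.test",
                         label = "p.signif",
                         symnum.args = list(
                           cutpoints = c(0, 0.001, 0.01, 0.05, 1),
                           symbols = c("***", "**", "*", "ns")))

    # Method 2: show the actual p-values
    # (built on the base plot so the two label layers do not overlap)
    p_values <- p_base +
      stat_compare_means(aes(group = group), method = "t.test",
                         label = "p.format", label.y = 0.9)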

2.3 Adjusting visual details

    p +
      scale_y_continuous(labels = scales::percent) +        # show the y-axis as percentages
      labs(x = "Cell Type", y = "Population Proportion") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
      scale_fill_manual(values = c("#1f77b4", "#ff7f0e"))   # custom colors; replaces the "npg" palette

3. Advanced Visualization Techniques

3.1 Faceting across cell types

When there are many cell types, a faceted layout is easier to read:

    ggplot(cell_ratio_df, aes(x = group, y = proportion)) +
      geom_boxplot(aes(fill = group)) +
      geom_point() +
      facet_wrap(~cell_type, scales = "free_y") +
      stat_compare_means(method = "wilcox.test", label = "p.format",
                         label.y.npc = "top") +
      theme_bw(base_size = 12)

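3.2 Heatmap of proportion changes

    library(pheatmap)
    library(reshape2)   # provides acast()

    # Build a cell type x sample matrix
    heatmap_data <- acast(cell_ratio_df, cell_type ~ sample, value.var = "proportion")

    # annotation_df is not defined in the original; one way to build it is a
    # one-column data frame of group labels with sample IDs as row names
    annotation_df <- unique(cell_ratio_df[, c("sample", "group")])
    rownames(annotation_df) <- annotation_df$sample
    annotation_df <- annotation_df[, "group", drop = FALSE]

    pheatmap(heatmap_data, scale = "row", clustering_method = "complete",
             annotation_col = annotation_df, show_colnames = TRUE)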

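3.3 Interactive visualization

plotly can make the plot interactive:

    library(ggplot2)
    library(plotly)

    p_box <- ggplot(cell_ratio_df, aes(x = cell_type, y = proportion)) +
      geom_boxplot(aes(fill = group)) +
      geom_jitter(width = 0.1)

    # boxmode = "group" keeps the per-group boxes side by side after conversion
    layout(ggplotly(p_box), boxmode = "group")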

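4. Common Problems in Practice

4.1 Handling zeros

When a cell type is absent from a sample, its proportion is zero. Suggestions:

Add a pseudocount (e.g. 0.0001)
Filter out cell types that appear in fewer than 20% of samples

    cell_ratio_df$proportion <- cell_ratio_df$proportion + 0.0001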
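The filtering suggestion can be sketched in base R as follows. The toy table and the names `detection`, `keep` and `filtered` are illustrative assumptions; the 20% cut-off is the one mentioned above.

```r
# Toy long-format table: 10 samples, 3 cell types (values are made up).
n_samples <- 10
cell_ratio_df <- data.frame(
  sample     = rep(sprintf("S%02d", 1:n_samples), times = 3),
  cell_type  = rep(c("T_cell", "B_cell", "Rare_cell"), each = n_samples),
  proportion = c(rep(0.60, n_samples),          # T_cell: present in every sample
                 c(0.35, rep(0.40, 9)),         # B_cell: present in every sample
                 c(0.05, rep(0.00, 9)))         # Rare_cell: present in 1 of 10
)

# Fraction of samples in which each cell type is detected (proportion > 0).
detection <- tapply(cell_ratio_df$proportion > 0, cell_ratio_df$cell_type, mean)

# Keep cell types detected in at least 20% of samples.
keep     <- names(detection)[detection >= 0.20]
filtered <- cell_ratio_df[cell_ratio_df$cell_type %in% keep, ]

# Pseudocount stored in a separate column so the raw proportions are preserved.
filtered$proportion_pc <- filtered$proportion + 1e-4
```

Filtering before adding the pseudocount avoids testing cell types whose values are almost entirely the pseudocount itself.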

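4.2 Multiple testing correction

When several cell types are compared, the p-values need to be corrected:

    library(ggpubr)

    test_results <- compare_means(proportion ~ group, data = cell_ratio_df,
                                  group.by = "cell_type", method = "wilcox.test")
    # Recompute the adjusted p-values with Benjamini-Hochberg across all cell types
    test_results$p.adj <- p.adjust(test_results$p, method = "BH")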
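To make the correction step concrete without any extra packages, here is a base-R sketch that runs one Wilcoxon test per cell type and then applies Benjamini-Hochberg. The data are invented, and this is an equivalent hand-rolled version for illustration, not what compare_means() does internally.

```r
# Invented proportions: 3 cell types, 4 Control and 4 Disease samples each.
cell_ratio_df <- data.frame(
  cell_type  = rep(c("T_cell", "B_cell", "Myeloid_cell"), each = 8),
  group      = rep(rep(c("Control", "Disease"), each = 4), times = 3),
  proportion = c(0.10, 0.12, 0.11, 0.13,  0.20, 0.22, 0.21, 0.23,   # T_cell
                 0.30, 0.35, 0.32, 0.36,  0.31, 0.34, 0.33, 0.37,   # B_cell
                 0.50, 0.47, 0.52, 0.46,  0.45, 0.44, 0.43, 0.40)   # Myeloid_cell
)

# One Wilcoxon rank-sum test per cell type.
raw_p <- sapply(split(cell_ratio_df, cell_ratio_df$cell_type), function(d) {
  wilcox.test(proportion ~ group, data = d)$p.value
})

# Benjamini-Hochberg adjustment across all cell types.
adj_p <- p.adjust(raw_p, method = "BH")

results <- data.frame(cell_type = names(raw_p), p = raw_p, p.adj = adj_p,
                      row.names = NULL)
```

The adjusted values are never smaller than the raw ones, so a cell type that is borderline before correction may no longer pass the threshold afterwards.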

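4.3 Unbalanced sample sizes

When group sizes differ substantially (e.g. 10 vs 3):

Prefer the Wilcoxon test
Consider a permutation test

    library(coin)

    # Test one cell type at a time; coin expects the grouping variable to be a factor
    t_cell <- subset(cell_ratio_df, cell_type == "T_cell")
    coin::wilcox_test(proportion ~ factor(group), data = t_cell,
                      distribution = "exact")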
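For small groups, a permutation test can be carried out exactly by enumerating every relabelling. The sketch below does this in base R for a 10 vs 3 design using the difference in means as the test statistic; the values and the choice of statistic are assumptions made for illustration.

```r
# Made-up proportions for one cell type: 10 controls, 3 disease samples.
control <- c(0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.10, 0.17, 0.18, 0.09)
disease <- c(0.24, 0.27, 0.21)

values   <- c(control, disease)
n_dis    <- length(disease)
observed <- mean(disease) - mean(control)

# Enumerate every way of labelling 3 of the 13 samples as "Disease".
idx <- combn(length(values), n_dis)
perm_diff <- apply(idx, 2, function(i) mean(values[i]) - mean(values[-i]))

# Two-sided p-value: share of relabellings at least as extreme as the observed one
# (the small tolerance guards against floating-point rounding).
p_perm <- mean(abs(perm_diff) >= abs(observed) - 1e-12)
```

With 10 vs 3 samples there are only 286 possible relabellings, so the smallest attainable p-value is 1/286; this floor is worth keeping in mind when interpreting results from very small groups.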

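5. Automating the Workflow

For large batches of data, it helps to wrap the analysis in a function:

    library(Seurat)
    library(dplyr)
    library(rstatix)
    library(ggpubr)

    # sample_var is an added argument (the original omitted it): the metadata
    # column holding sample IDs, needed so proportions are computed per sample.
    analyze_cell_ratio <- function(seurat_obj, group_var = "group",
                                   sample_var = "sample") {
      meta <- seurat_obj@meta.data

      # Proportion of each cell type within each sample (columns sum to 1)
      prop_data <- prop.table(table(Idents(seurat_obj), meta[[sample_var]]),
                              margin = 2)

      # Convert to long format
      df_long <- as.data.frame(prop_data, stringsAsFactors = FALSE)
      colnames(df_long) <- c("cell_type", "sample", "proportion")

      # Add group labels through a sample-to-group lookup
      lookup <- unique(meta[, c(sample_var, group_var)])
      df_long$group <- lookup[[group_var]][match(df_long$sample,
                                                 lookup[[sample_var]])]

      # Per-cell-type Wilcoxon tests with BH correction
      stat_test <- df_long %>%
        group_by(cell_type) %>%
        rstatix::wilcox_test(proportion ~ group) %>%
        adjust_pvalue(method = "BH") %>%
        add_significance("p.adj")

      # One box plot per cell type
      plot_list <- lapply(unique(df_long$cell_type), function(ct) {
        df_ct <- df_long[df_long$cell_type == ct, ]
        ggboxplot(df_ct, x = "group", y = "proportion", add = "jitter") +
          stat_pvalue_manual(stat_test[stat_test$cell_type == ct, ],
                             label = "p.adj.signif",
                             y.position = max(df_ct$proportion) * 1.1) +
          labs(title = ct)
      })

      list(data = df_long, stats = stat_test, plots = plot_list)
    }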

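6. Interpreting and Reporting Results

When presenting results in a paper:

Always state which statistical test was used
Report the adjusted p-values
Effect sizes matter too (e.g. Cohen's d):

    library(effsize)

    # Compute the effect size for one cell type at a time
    t_cell <- subset(cell_ratio_df, cell_type == "T_cell")
    t_cell$group <- factor(t_cell$group)
    cohen.d(proportion ~ group, data = t_cell)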
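For reference, Cohen's d is the difference in group means divided by the pooled standard deviation. The helper below, `cohens_d()`, is a hypothetical base-R function written for illustration (it is not part of effsize), and the input values are made up.

```r
# Cohen's d with the pooled standard deviation:
#   d = (mean(x) - mean(y)) / sqrt(((nx-1)*var(x) + (ny-1)*var(y)) / (nx+ny-2))
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Made-up proportions: means differ by 0.1, both groups have SD 0.2.
disease <- c(0.20, 0.40, 0.60)
control <- c(0.10, 0.30, 0.50)

d <- cohens_d(disease, control)
```

Because d is expressed in units of standard deviation, it does not depend on whether proportions are written as fractions or percentages.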

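A typical combination of figures and tables:

Main figure: UMAP highlighting the cell populations with significant differences
Supplementary figure: box plots of cell proportions with significance annotations
Supplementary table: complete statistical test results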

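7. Extended Use Cases

7.1 Time series analysis

For data with multiple time points, a trend can be tested with a mixed-effects model:

    library(nlme)

    # Assumes time_series_data holds one cell type and that `sample` identifies
    # a subject measured at several time points
    lme(proportion ~ time * group, random = ~ 1 | sample, data = time_series_data)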

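7.2 Adjusting for covariates

When factors such as age and sex need to be taken into account:

    # Assumes age and gender columns have been added to the table;
    # fit one cell type at a time
    lm(proportion ~ group + age + gender,
       data = subset(cell_ratio_df, cell_type == "T_cell"))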
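A small self-contained sketch of the covariate-adjusted model follows. The data frame `df` and all values in it are invented; the point is only to show where the adjusted group effect appears in the fitted model.

```r
# Invented data for one cell type: 6 Control and 6 Disease samples.
df <- data.frame(
  group      = factor(rep(c("Control", "Disease"), each = 6),
                      levels = c("Control", "Disease")),
  age        = c(34, 45, 52, 61, 38, 49,  36, 47, 55, 63, 40, 51),
  gender     = factor(rep(c("F", "M"), times = 6)),
  proportion = c(0.14, 0.16, 0.17, 0.19, 0.15, 0.16,
                 0.24, 0.26, 0.27, 0.30, 0.25, 0.27)
)

# Linear model: group effect adjusted for age and gender.
fit <- lm(proportion ~ group + age + gender, data = df)

# The "groupDisease" row is the Disease-vs-Control difference in proportion
# after accounting for the covariates.
group_effect <- coef(summary(fit))["groupDisease", ]
```

Setting the factor levels explicitly makes Control the reference level, so the coefficient reads as Disease minus Control.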

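7.3 Association with gene expression

To relate cell proportions to bulk RNA-seq data:

    # Assumes bulk_data has a `sample` column matching cell_ratio_df (not stated
    # in the original); the two vectors must be aligned by sample before testing
    t_cell <- subset(cell_ratio_df, cell_type == "T_cell")
    merged <- merge(bulk_data, t_cell, by = "sample")
    cor.test(merged$gene_exp, merged$proportion, method = "spearman")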
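The sketch below illustrates why aligning by sample matters before computing the correlation. Both tables are hypothetical, with sample IDs deliberately stored in different orders.

```r
# Hypothetical bulk expression table; note the shuffled sample order.
bulk_data <- data.frame(
  sample   = c("S3", "S1", "S6", "S2", "S5", "S4"),
  gene_exp = c(7.1, 5.2, 9.8, 6.0, 9.0, 8.3)
)

# Hypothetical T-cell proportions per sample.
t_cell_prop <- data.frame(
  sample     = paste0("S", 1:6),
  proportion = c(0.10, 0.14, 0.18, 0.21, 0.26, 0.30)
)

# Align the two tables on sample ID, then test the rank correlation.
merged <- merge(bulk_data, t_cell_prop, by = "sample")
res    <- cor.test(merged$gene_exp, merged$proportion, method = "spearman")

# Correlating the unaligned columns directly would pair the wrong samples.
rho_unaligned <- cor(bulk_data$gene_exp, t_cell_prop$proportion,
                     method = "spearman")
```

In this toy data the aligned values are perfectly monotone, while the unaligned pairing gives a weaker correlation, which is the kind of silent error the merge step prevents.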