
orthogene: one package for gene conversion across 760 species

news 2026/6/12 13:43:44

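1. Introduction

We previously covered ortholog conversion with biomart, but it keeps running into network errors (see the earlier post "A fix for the 'HTTP 404 Not found' error in biomart ortholog conversion"). Recently we came across a new tool published as a preprint, orthogene [1], an R package that simplifies gene mapping across 760 species. Cross-species ortholog conversion is far from a simple one-to-one correspondence. orthogene combines automated standardization of species names and gene identifiers, ortholog inference across multiple databases (HomoloGene, gProfiler, babelgene), flexible strategies for handling ambiguous ortholog relationships, and conversion of gene lists, tables, and high-dimensional matrices into analysis-ready formats. You also no longer need to learn each database's identifier scheme (Ensembl transcripts, Ensembl genes, HGNC symbols, Entrez, UniProt, and so on). In other words, with orthogene it becomes easy to convert gene names or whole matrices between species, so analyses whose tools do not support non-model species, such as in silico gene knockout, drug sensitivity estimation, cell-cell communication, and transcription factor analysis, can to some extent be made workable through ortholog conversion.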


2. Hands-on workflow

The tutorial on GitHub is also straightforward [2].

# orthogene is installed through BiocManager:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
# orthogene is only available in Bioconductor >= 3.14; update if your version is older:
if (BiocManager::version() < "3.14") {
  BiocManager::install(update = TRUE, ask = FALSE)
}
# Install orthogene:
if (!requireNamespace("orthogene", quietly = TRUE)) {
  BiocManager::install("orthogene")
}
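The version check above works because, to our understanding, BiocManager::version() returns a package_version object rather than a plain string. A small base-R sketch of why that matters; the version numbers below are only illustrative:

```r
# Plain strings are compared character by character, so "3.9" sorts after "3.14".
as_text <- "3.9" < "3.14"

# package_version objects are compared component by component (9 versus 14).
as_version <- package_version("3.9") < "3.14"

print(as_text)     # FALSE
print(as_version)  # TRUE
```

If you ever store a Bioconductor version as text, wrap it in package_version() before comparing.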

Test data:

# Load the package:
library(orthogene)
# Load the built-in test data:
data("exp_mouse")
# Inspect the built-in test data:
exp_mouse[1:4, 1:4]

## 4 x 4 sparse Matrix of class "dgCMatrix"
##          astrocytes_ependymal endothelial-mural interneurons microglia
## Tspan12           0.330357100        0.58723400    0.6413793 0.1428571
## Tshz1             0.428571430        0.44680851    1.1551724 0.4387755
## Fnbp1l            0.397321400        0.71914890    2.3758621 0.3367347
## Adamts15          0.008928571        0.09787234    0.2206897 .

# Use "homologene" as the ortholog database; the default is "gprofiler":

method <- "homologene"

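The three methods differ in supported species, gene mappings, update frequency, underlying orthology databases, data location, connection mode, and speed. The species available (here for homologene) can be listed as follows: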

orthogene::map_species()

## Retrieving all organisms available in homologene.

## Returning table with all species.

##                  scientific_name taxonomy_id     source            id
##  1                  Mus musculus       10090 homologene     mmusculus
##  2             Rattus norvegicus       10116 homologene   rnorvegicus
##  3          Kluyveromyces lactis       28985 homologene       klactis
##  4            Magnaporthe oryzae      318829 homologene       moryzae
##  5         Eremothecium gossypii       33169 homologene     egossypii
##  6          Arabidopsis thaliana        3702 homologene     athaliana
##  7                  Oryza sativa        4530 homologene       osativa
##  8     Schizosaccharomyces pombe        4896 homologene        spombe
##  9      Saccharomyces cerevisiae        4932 homologene   scerevisiae
## 10             Neurospora crassa        5141 homologene       ncrassa
## 11        Caenorhabditis elegans        6239 homologene      celegans
## 12             Anopheles gambiae        7165 homologene      agambiae
## 13       Drosophila melanogaster        7227 homologene dmelanogaster
## 14                   Danio rerio        7955 homologene        drerio
## 15 Xenopus (Silurana) tropicalis        8364 homologene   xtropicalis
## 16                 Gallus gallus        9031 homologene       ggallus
## 17                Macaca mulatta        9544 homologene      mmulatta
## 18               Pan troglodytes        9598 homologene  ptroglodytes
## 19                  Homo sapiens        9606 homologene      hsapiens
## 20        Canis lupus familiaris        9615 homologene  clfamiliaris
## 21                    Bos taurus        9913 homologene       btaurus
##    scientific_name_formatted
##  1              mus musculus
##  2         rattus norvegicus
##  3      kluyveromyces lactis
##  4        magnaporthe oryzae
##  5     eremothecium gossypii
##  6      arabidopsis thaliana
##  7              oryza sativa
##  8 schizosaccharomyces pombe
##  9  saccharomyces cerevisiae
## 10         neurospora crassa
## 11    caenorhabditis elegans
## 12         anopheles gambiae
## 13   drosophila melanogaster
## 14               danio rerio
## 15        xenopus tropicalis
## 16             gallus gallus
## 17            macaca mulatta
## 18           pan troglodytes
## 19              homo sapiens
## 20    canis lupus familiaris
## 21                bos taurus

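The main function of orthogene is convert_orthologs, which supports many input formats, including data frames, tables, tibbles, sparse matrices, lists, and vectors:

gene_df <- orthogene::convert_orthologs(
  gene_df = exp_mouse,
  gene_input = "rownames",     # input gene names are taken from the row names
  gene_output = "rownames",    # output gene names are also set as row names
  input_species = "mouse",     # species of the input data
  output_species = "human",    # species of the output data
  non121_strategy = "drop_both_species",
  method = method
)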

## Preparing gene_df.

## sparseMatrix format detected.

## Extracting genes from rownames.

## 15,259 genes extracted.

## Converting mouse ==> human orthologs using: homologene

## Retrieving all organisms available in homologene.

## Mapping species name: mouse

## Common name mapping found for mouse

## 1 organism identified from search: 10090

## Retrieving all organisms available in homologene.

## Mapping species name: human

## Common name mapping found for human

## 1 organism identified from search: 9606

## Checking for genes without orthologs in human.

## Extracting genes from input_gene.

## 13,416 genes extracted.

## Extracting genes from ortholog_gene.

## 13,416 genes extracted.

## Checking for genes without 1:1 orthologs.

## Dropping 46 genes that have multiple input_gene per ortholog_gene (many:1).

## Dropping 56 genes that have multiple ortholog_gene per input_gene (1:many).

## Filtering gene_df with gene_map

## Setting ortholog_gene to rownames.

## =========== REPORT SUMMARY ===========

## Total genes dropped after convert_orthologs :
##    2,016 / 15,259 (13%)

## Total genes remaining after convert_orthologs :
##    13,243 / 15,259 (87%)

As shown below, the row names of the matrix have been converted from mouse gene symbols to human gene symbols:

gene_df[1:4, 1:4]

## 4 x 4 sparse Matrix of class "dgCMatrix"
##          astrocytes_ependymal endothelial-mural interneurons  microglia
## TSPAN12           0.330357100        0.58723400    0.6413793 0.14285710
## TSHZ1             0.428571430        0.44680851    1.1551724 0.43877551
## ADAMTS15          0.008928571        0.09787234    0.2206897 .
## CLDN12            0.223214290        0.11489362    0.5517241 0.05102041
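Conceptually, converting row names comes down to filtering the matrix by a gene map and then renaming the surviving rows. The following is a minimal base-R sketch of that idea, not orthogene's actual implementation. The gene_map table is hypothetical, and Fnbp1l is left out of it on purpose to mimic a gene that gets dropped:

```r
# Toy expression matrix with mouse gene symbols as row names
# (values taken from the first two columns of the test data shown earlier).
expr <- matrix(
  c(0.3303571, 0.58723400,
    0.4285714, 0.44680851,
    0.3973214, 0.71914890),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Tspan12", "Tshz1", "Fnbp1l"),
                  c("astrocytes_ependymal", "endothelial-mural"))
)

# Hypothetical 1:1 gene map (assumption for illustration only).
gene_map <- data.frame(
  input_gene    = c("Tspan12", "Tshz1"),
  ortholog_gene = c("TSPAN12", "TSHZ1"),
  stringsAsFactors = FALSE
)

# Keep only rows that have an entry in the map.
keep <- rownames(expr) %in% gene_map$input_gene
converted <- expr[keep, , drop = FALSE]

# Replace each remaining row name with its ortholog symbol.
idx <- match(rownames(converted), gene_map$input_gene)
rownames(converted) <- gene_map$ortholog_gene[idx]

print(converted)
```

The expression values are left unchanged; only the set of rows and their names change.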

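Pay particular attention to the non121_strategy option, which controls how orthologs that are not 1:1 are handled. The available strategies are:

- "drop_both_species", "dbs", or 1: drop genes with duplicate mappings in either the input species or the output species.
- "drop_input_species", "dis", or 2: drop only genes with duplicate mappings in the input species.
- "drop_output_species", "dos", or 3: drop only genes with duplicate mappings in the output species.
- "keep_both_species", "kbs", or 4: keep all genes from both species, whether or not they have duplicate mappings.
- "keep_popular", "kp", or 5: keep only the most "popular" ortholog mappings between the two species. This usually returns more genes, at the cost that many of them are not true biological 1:1 orthologs.
- "sum", "mean", "median", "min", or "max": when the input is an object that carries expression values, such as a matrix or data frame, many:1 genes are aggregated accordingly. With "sum", for example, the expression values of the many:1 genes are added together.

Give it a try; the experience is much smoother than with biomart!

3. References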
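To make the non-1:1 cases concrete, here is a minimal base-R sketch on a toy gene map. It is not orthogene's implementation, and all gene names and values are hypothetical. It flags many:1 and 1:many rows, keeps only the strict 1:1 pairs (in the spirit of "drop_both_species"), and sums expression over many:1 genes (in the spirit of "sum"):

```r
# Hypothetical input -> output gene map containing every kind of relationship.
gene_map <- data.frame(
  input_gene    = c("A1", "B1", "B2", "C1", "C1"),
  ortholog_gene = c("A",  "B",  "B",  "C",  "D"),
  stringsAsFactors = FALSE
)
# A1 -> A      is 1:1
# B1, B2 -> B  is many:1 (several input genes share one ortholog)
# C1 -> C, D   is 1:many (one input gene has several orthologs)

# ave() returns, for every row, the size of the group that row belongs to.
n_per_ortholog <- ave(seq_along(gene_map$ortholog_gene),
                      gene_map$ortholog_gene, FUN = length)
n_per_input <- ave(seq_along(gene_map$input_gene),
                   gene_map$input_gene, FUN = length)

many_to_one <- n_per_ortholog > 1
one_to_many <- n_per_input > 1

# Strict filtering: keep only rows that are unambiguous in both directions.
strict <- gene_map[!many_to_one & !one_to_many, ]
print(strict)

# Aggregation: sum the expression of input genes that share an ortholog.
expr <- matrix(
  c(1, 2,
    3, 4,
    5, 6),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A1", "B1", "B2"), c("cell_type_1", "cell_type_2"))
)
map_no_1many <- gene_map[!one_to_many, ]
groups <- map_no_1many$ortholog_gene[match(rownames(expr), map_no_1many$input_gene)]
summed <- rowsum(expr, group = groups)
print(summed)
```

In this toy case the strict filter keeps only A1 -> A, while the aggregation yields one row per ortholog, with B holding the sum of B1 and B2.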

[1]Brian M. Schilder, Alan E. et al. (2026). orthogene: a Bioconductor package to easily map genes within and across hundreds of species. bioRxiv, https://doi.org/10.64898/2026.01.17.700094

[2]https://github.com/neurogenomics/orthogene
