orthogene: one package for gene conversion across 760 species
I. Foreword
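We previously covered ortholog conversion with biomart, but it keeps running into network bugs (see the earlier post on fixing the "HTTP 404 Not found" error in biomart ortholog conversion). Recently I came across a new tool published on a preprint server: orthogene [1], an R package that simplifies gene mapping across 760 species. Cross-species ortholog conversion is far from a simple one-to-one correspondence. orthogene combines automated standardization of species names and gene identifiers, ortholog inference across multiple databases (HomoloGene, gProfiler, babelgene), flexible strategies for handling ambiguous ortholog relationships, and conversion of gene lists, tables, and high-dimensional matrices into analysis-ready formats. You also no longer need to learn each database's identifier scheme (Ensembl transcripts, Ensembl genes, HGNC gene symbols, Entrez, UniProt, and so on). In other words, orthogene makes it easy to convert gene names or whole matrices between species. Analyses whose tools do not support non-model species, such as virtual gene knockout, drug sensitivity prediction, cell-cell communication, and transcription factor analysis, can therefore be carried out, to some extent, after mapping genes to their orthologs in a supported species.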
II. Hands-on workflow
The tutorial on GitHub is also straightforward [2].
# orthogene is installed through BiocManager:
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
# orthogene is only available for Bioconductor >= 3.14; if your Bioconductor is older, update it:
if (BiocManager::version() < "3.14") BiocManager::install(update = TRUE, ask = FALSE)
# Install orthogene
if (!requireNamespace("orthogene", quietly = TRUE)) BiocManager::install("orthogene")

Test data:
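# Load the package:
library(orthogene)
# Load the built-in test data:
data("exp_mouse")
# Inspect the built-in test data:
exp_mouse[1:4, 1:4]
## 4 x 4 sparse Matrix of class "dgCMatrix"
##          astrocytes_ependymal endothelial-mural interneurons microglia
## Tspan12           0.330357100        0.58723400    0.6413793 0.1428571
## Tshz1             0.428571430        0.44680851    1.1551724 0.4387755
## Fnbp1l            0.397321400        0.71914890    2.3758621 0.3367347
## Adamts15          0.008928571        0.09787234    0.2206897 .
# Use "homologene" as the conversion database (the default is "gprofiler"):
method <- "homologene"

The three methods differ in supported species, gene coverage, update frequency, underlying orthology databases, where the data are stored, how they are accessed, and speed. The species available through homologene can be listed as follows: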
orthogene::map_species()
## Retrieving all organisms available in homologene.
## Returning table with all species.
##                  scientific_name taxonomy_id     source            id
## 1                   Mus musculus       10090 homologene     mmusculus
## 2              Rattus norvegicus       10116 homologene   rnorvegicus
## 3           Kluyveromyces lactis       28985 homologene       klactis
## 4             Magnaporthe oryzae      318829 homologene       moryzae
## 5          Eremothecium gossypii       33169 homologene     egossypii
## 6           Arabidopsis thaliana        3702 homologene     athaliana
## 7                   Oryza sativa        4530 homologene       osativa
## 8      Schizosaccharomyces pombe        4896 homologene        spombe
## 9       Saccharomyces cerevisiae        4932 homologene   scerevisiae
## 10             Neurospora crassa        5141 homologene       ncrassa
## 11        Caenorhabditis elegans        6239 homologene      celegans
## 12             Anopheles gambiae        7165 homologene      agambiae
## 13       Drosophila melanogaster        7227 homologene dmelanogaster
## 14                   Danio rerio        7955 homologene        drerio
## 15 Xenopus (Silurana) tropicalis        8364 homologene   xtropicalis
## 16                 Gallus gallus        9031 homologene       ggallus
## 17                Macaca mulatta        9544 homologene      mmulatta
## 18               Pan troglodytes        9598 homologene  ptroglodytes
## 19                  Homo sapiens        9606 homologene      hsapiens
## 20        Canis lupus familiaris        9615 homologene  clfamiliaris
## 21                    Bos taurus        9913 homologene       btaurus
##    scientific_name_formatted
## 1               mus musculus
## 2          rattus norvegicus
## 3       kluyveromyces lactis
## 4         magnaporthe oryzae
## 5      eremothecium gossypii
## 6       arabidopsis thaliana
## 7               oryza sativa
## 8  schizosaccharomyces pombe
## 9   saccharomyces cerevisiae
## 10         neurospora crassa
## 11    caenorhabditis elegans
## 12         anopheles gambiae
## 13   drosophila melanogaster
## 14               danio rerio
## 15        xenopus tropicalis
## 16             gallus gallus
## 17            macaca mulatta
## 18           pan troglodytes
## 19              homo sapiens
## 20    canis lupus familiaris
## 21                bos taurus

The main function of orthogene is convert_orthologs, which supports data frames, tables, tibbles, sparse matrices, lists, vectors, and other formats:
gene_df <- orthogene::convert_orthologs(
  gene_df = exp_mouse,
  gene_input = "rownames",    # the input gene names are the row names
  gene_output = "rownames",   # the output gene names are also set as row names
  input_species = "mouse",    # species of the input data
  output_species = "human",   # species of the output data
  non121_strategy = "drop_both_species",
  method = method
)
## Preparing gene_df.
## sparseMatrix format detected.
## Extracting genes from rownames.
## 15,259 genes extracted.
## Converting mouse ==> human orthologs using: homologene
## Retrieving all organisms available in homologene.
## Mapping species name: mouse
## Common name mapping found for mouse
## 1 organism identified from search: 10090
## Retrieving all organisms available in homologene.
## Mapping species name: human
## Common name mapping found for human
## 1 organism identified from search: 9606
## Checking for genes without orthologs in human.
## Extracting genes from input_gene.
## 13,416 genes extracted.
## Extracting genes from ortholog_gene.
## 13,416 genes extracted.
## Checking for genes without 1:1 orthologs.
## Dropping 46 genes that have multiple input_gene per ortholog_gene (many:1).
## Dropping 56 genes that have multiple ortholog_gene per input_gene (1:many).
## Filtering gene_df with gene_map
## Setting ortholog_gene to rownames.
##
## =========== REPORT SUMMARY ===========
## Total genes dropped after convert_orthologs :
##    2,016 / 15,259 (13%)
## Total genes remaining after convert_orthologs :
##    13,243 / 15,259 (87%)

As the output below shows, the row names of the matrix have been converted from mouse gene symbols to human gene symbols:
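gene_df[1:4, 1:4]
## 4 x 4 sparse Matrix of class "dgCMatrix"
##          astrocytes_ependymal endothelial-mural interneurons  microglia
## TSPAN12           0.330357100        0.58723400    0.6413793 0.14285710
## TSHZ1             0.428571430        0.44680851    1.1551724 0.43877551
## ADAMTS15          0.008928571        0.09787234    0.2206897 .
## CLDN12            0.223214290        0.11489362    0.5517241 0.05102041

One option that deserves attention is non121_strategy, which controls how orthologs that are not 1:1 (non-one-to-one) are handled. The available strategies are:
"drop_both_species", "dbs", or 1: drop genes with duplicate mappings in either the input species or the output species.
"drop_input_species", "dis", or 2: drop only genes with duplicate mappings in the input species.
"drop_output_species", "dos", or 3: drop only genes with duplicate mappings in the output species.
"keep_both_species", "kbs", or 4: keep all genes in both species, whether or not they have duplicate mappings.
"keep_popular", "kp", or 5: keep only the most common ortholog mapping between the two species. This usually returns more genes, at the cost that many of them are not true biological one-to-one orthologs.
"sum", "mean", "median", "min", or "max": when the input is an object that carries expression values, such as a matrix or data frame, genes in a many-to-one relationship are aggregated accordingly. With "sum", for example, the expression values of the many-to-one genes are added together.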
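To make the difference between dropping and aggregating more concrete, here is a small base R sketch on a toy ortholog table. It is not orthogene's internal code: the gene names, the expression values, and the helper functions is_dup, drop_non121_toy, and sum_many_to_one_toy are all made up for illustration. It only mimics, conceptually, what "drop_both_species" and "sum" are described as doing above.

```r
# Toy ortholog map (hypothetical gene names, not real homologene data).
# A1 -> A and D1 -> D are 1:1; B1 and B2 both map to B (many:1);
# C1 maps to both C and CX (1:many).
gene_map <- data.frame(
  input_gene    = c("A1", "B1", "B2", "C1", "C1", "D1"),
  ortholog_gene = c("A",  "B",  "B",  "C",  "CX", "D"),
  stringsAsFactors = FALSE
)

# Toy expression matrix with the input genes as row names.
expr <- matrix(
  c(1, 2, 3, 4, 7,
    10, 20, 30, 40, 70),
  ncol = 2,
  dimnames = list(c("A1", "B1", "B2", "C1", "D1"), c("cell_x", "cell_y"))
)

# TRUE for every value that occurs more than once in x.
is_dup <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)

# Conceptual "drop_both_species": keep only rows that are 1:1 in both directions.
drop_non121_toy <- function(map) {
  keep <- !is_dup(map$input_gene) & !is_dup(map$ortholog_gene)
  map[keep, , drop = FALSE]
}

# Conceptual "sum": drop 1:many input genes, then add up the rows of
# input genes that share one ortholog (many:1).
sum_many_to_one_toy <- function(expr, map) {
  map <- map[!is_dup(map$input_gene), , drop = FALSE]
  map <- map[map$input_gene %in% rownames(expr), , drop = FALSE]
  rowsum(expr[map$input_gene, , drop = FALSE], group = map$ortholog_gene)
}

strict <- drop_non121_toy(gene_map)         # only A1 -> A and D1 -> D survive
agg    <- sum_many_to_one_toy(expr, gene_map)  # rows A, B (= B1 + B2), D
print(strict)
print(agg)
```

On real data none of these helpers are needed: passing the chosen option, for example non121_strategy = "sum", to convert_orthologs is enough.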
Give it a try; the experience is much smoother than with biomart!
III. References
[1] Brian M. Schilder, Alan E. et al. (2026). orthogene: a Bioconductor package to easily map genes within and across hundreds of species. bioRxiv. https://doi.org/10.64898/2026.01.17.700094
[2] https://github.com/neurogenomics/orthogene
