QuIBL方法检验基因渐渗和不完全谱系分选
QuIBL是2019年science蝴蝶辐射演化分析中检测渐渗的新方法,其用法为python脚本QuIBL.py的使用
1、安装
在github下载软件包:https://github.com/miriammiyagi/QuIBL
1 | wget https://github.com/miriammiyagi/QuIBL/archive/refs/heads/master.zip |
该脚本依赖python2.7,且依赖以下包joblib, ete3, itertools, sys, numpy, math, ConfigParser, csv, and multiprocessing
#创建python2的环境
1 | conda create -n python2.7 python=2.7 |
#通过运行示例文件来检查不存在的依赖包
1 | python QuIBL.py ./Small_Test_Example/sampleInputFile.txt |
#通过多次运行实例文件的报错,发现te3和joblib是存在问题的,按照作者的建议安装特定版本的ete3和joblib,ete3==3.0.0b35和joblib==0.11,利用conda安装它们
1 | conda install ete3==3.0.0b35 |
#再次运行示例文件,可以运行成功
1 | python QuIBL.py ./Small_Test_Example/sampleInputFile.txt |
2、输入文件
QuIBL运行较为简单,只需要准备两个文件
①sampleInputFile.txt,该参数配置文件要根据软件自带的配置文件进行修改,以下是参数的具体意义:
1 | treefile: The path to the trees to be analyzed. |
②树文件smallTestTrees.txt,该文件包括4个物种的拓扑关系及其枝长信息。
利用ete3从包含多个基因树的文件(每行包括一个基因树)中批量提取子树,可以在python3环境下运行以下脚本,然后将结果写到空文件smallTestTrees.txt,注意去除中间树节点在不同基因树的编号
1 | from ete3 import Tree |
3、运行QuIBL.py脚本
1 | python QuIBL.py my_analysis/sampleInputFile.txt |
4、结果解读
我的运行结果如下
triplet | outgroup | C1 | C2 | mixprop1 | mixprop2 | lambda2Dist | lambda1Dist | BIC2Dist | BIC1Dist | count |
---|---|---|---|---|---|---|---|---|---|---|
aint_aawi_aflo | aint | 0 | 0.760561129 | 0.801436519 | 0.198563481 | 0.003752992 | 0.004153213 | -33700.04628 | -33728.42803 | 3762 |
aint_aawi_aflo | aawi | 0 | 10.96069501 | 0.975929055 | 0.024070945 | 0.00378819 | 0.004690472 | -20079.01796 | -19744.4163 | 2264 |
aint_aawi_aflo | aflo | 0 | 10.88296285 | 0.975452203 | 0.024547797 | 0.0035074 | 0.004364335 | -15592.90708 | -15335.1854 | 1730 |
假如物种树的拓扑是((aawi, aflo), aint),查看aint与aflo的渐渗要看第二行,其中mixprop2表示渐渗程度,若BIC2Dist与BIC1Dist之差小于-10,则支持存在渐渗的模型。同理,查看aint与aawi的渐渗要看第三行。
文章附件的解释: |
---|
triplet: The three-taxon subset considered. Species abbreviations separated by underscores. Outgroup: Species inferred to be the outgroup in the triplet gene tree topology tested. |
Cx: Inferred species tree branch length for (1) the ILS case and (2) the non-ILS case. The ILS case is forced to be 0, as all lineages must be in the same population. |
Topology proportions: Inferred mixture proportion for the ILS and non-ILS distributions. These values sum to 1. |
numTrees: Frequency of the topology in the sample. BICx: Raw BIC values for each model dBIC: difference in BIC value between the models. dBIC < -10 implies that the ILS+introgression model is a better fit for the data. |
total non-ILS: topology non-ILS proportion * (numTrees/total trees in sample). This value represents the genome-wide introgression fraction |
5、穷尽运行物种树的所有三物种组合,批量运行QuIBL.py
①在物种树上提取所有的三物种组合,运行以下脚本01_comb_4spes.py得到4物种组合的文件out_four_species_array.txt
1 | 1 # coding=utf-8 |
②按照out_four_species_array.txt文件中的4物种组合,在定根后的基因树集合文件里提取子树,包含枝长信息。运行以下脚本02_extract_subtree_for_4spes.py,得到不同物种组合的subtree文件,命名格式为”out_subtree_”+str(b)+”.txt”,其中变量b依次取自然数
1 | 1 # coding=utf-8 |
产生多个4物种组合的基因树文件,对其进行处理,将祖先节点的序号去掉
1 | sed -i 's/)\([0-9]*\):/):/g' out_subtree_* |
③产生sampleInputFile.txt文件,在以下shell命令中修改路径
1 | for file in `ls out_subtree*`; do echo -e "[Input]\ntreefile: /project/tianzhenWu/acropora_06four_cluster_gene/03_building_tree/4species/7756trees/$file\nnumdistributions: 2\nlikelihoodthresh: 0.01\nnumsteps: 50\ngradascentscalar: 0.5\ntotaloutgroup: mcap\nmultiproc: True\nmaxcores:70\n\n[Output]\nOutputPath: /project/tianzhenWu/acropora_06four_cluster_gene/03_building_tree/4species/QuIBL-master/my_455_annalysis/out$file.csv\n" > /project/tianzhenWu/acropora_06four_cluster_gene/03_building_tree/4species/QuIBL-master/my_455_annalysis/run$file.txt;done |
④批量运行脚本
1 | #构建批量运行的sh文件03_run.sh |
⑤结果整合
1 | for file in *.csv; do sed -n '2,4p' "$file" >> results_qulbl.txt; done |
参考:miriammiyagi/QuIBL: Quantifying Introgression via Branch Lengths (github.com)