QuIBL方法检验基因渐渗和不完全谱系分选

QuIBL是2019年science蝴蝶辐射演化分析中检测渐渗的新方法，其用法为python脚本QuIBL.py的使用

1、安装

在github下载软件包：https://github.com/miriammiyagi/QuIBL

1
2
3

wget https://github.com/miriammiyagi/QuIBL/archive/refs/heads/master.zip

unzip master.zip

该脚本依赖python2.7，且依赖以下包joblib, ete3, itertools, sys, numpy, math, ConfigParser, csv, and multiprocessing

#创建python2的环境

1 2	conda create -n python2.7 python=2.7 conda activate python2.7

#通过运行示例文件来检查不存在的依赖包

1	python QuIBL.py ./Small_Test_Example/sampleInputFile.txt

#通过多次运行实例文件的报错，发现te3和joblib是存在问题的，按照作者的建议安装特定版本的ete3和joblib，ete3==3.0.0b35和joblib==0.11，利用conda安装它们

1 2	conda install ete3==3.0.0b35 conda install joblib==0.11

#再次运行示例文件，可以运行成功

1	python QuIBL.py ./Small_Test_Example/sampleInputFile.txt

2、输入文件

QuIBL运行较为简单，只需要准备两个文件

①sampleInputFile.txt，该参数配置文件要根据软件自带的配置文件进行修改，以下是参数的具体意义：

treefile: The path to the trees to be analyzed.

numdistributions: The number of branch length distributions in the mixture to test. For now, only two is supported (this corresponds to one ILS and one non-ILS distribution).

likelihoodthresh: The maximum change in likelihood allowed for the gradient ascent search for theta to stop.

numsteps: The number of total EM steps. For thousands of trees, we reccomend trying around 50.

gradascentscalar: The factor to shrink the stepsize when a gradient ascent step fails.

totaloutgroup: The name of the ultimate outrgroup of your sample. All trees are assumed to be rooted using this taxon.

multiproc: Accepts `True` or `False` and either turns multiprocessing on or off.

OutputPath: Where the output gets written.

maxcores: The maximum number of cores QuIBL is allowed to use.

②树文件smallTestTrees.txt，该文件包括4个物种的拓扑关系及其枝长信息。

利用ete3从包含多个基因树的文件（每行包括一个基因树）中批量提取子树，可以在python3环境下运行以下脚本，然后将结果写到空文件smallTestTrees.txt，注意去除中间树节点在不同基因树的编号

from ete3 import Tree

# 定义subtree_taxa
subtree_taxa = ["amil", "aflo", "aint", "aawi"]

# 打开文件并读取每一行
with open('alltree.rooted.txt', 'r') as file:
    for line in file:
        t = Tree(line)
        t.prune(subtree_taxa, preserve_branch_length=True)
        t.write()

3、运行QuIBL.py脚本

1	python QuIBL.py my_analysis/sampleInputFile.txt

4、结果解读

我的运行结果如下

triplet	outgroup	C2	mixprop1	mixprop2	lambda2Dist	lambda1Dist	BIC2Dist	BIC1Dist	count
aint_aawi_aflo	aint	0.760561129	0.801436519	0.198563481	0.003752992	0.004153213	-33700.04628	-33728.42803	3762
aint_aawi_aflo	aawi	10.96069501	0.975929055	0.024070945	0.00378819	0.004690472	-20079.01796	-19744.4163	2264
aint_aawi_aflo	aflo	10.88296285	0.975452203	0.024547797	0.0035074	0.004364335	-15592.90708	-15335.1854	1730

假如物种树的拓扑是((aawi, aflo), aint)，查看aint与aflo的渐渗要看第二行，其中mixprop2表示渐渗程度，若BIC2Dist与BIC1Dist之差小于-10，则支持存在渐渗的模型。同理，查看aint与aawi的渐渗要看第三行。

文章附件的解释：
triplet: The three-taxon subset considered. Species abbreviations separated by underscores. Outgroup: Species inferred to be the outgroup in the triplet gene tree topology tested.
Cx: Inferred species tree branch length for (1) the ILS case and (2) the non-ILS case. The ILS case is forced to be 0, as all lineages must be in the same population.
Topology proportions: Inferred mixture proportion for the ILS and non-ILS distributions. These values sum to 1.
numTrees: Frequency of the topology in the sample. BICx: Raw BIC values for each model dBIC: difference in BIC value between the models. dBIC < -10 implies that the ILS+introgression model is a better fit for the data.
total non-ILS: topology non-ILS proportion * (numTrees/total trees in sample). This value represents the genome-wide introgression fraction

5、穷尽运行物种树的所有三物种组合，批量运行QuIBL.py

①在物种树上提取所有的三物种组合，运行以下脚本01_comb_4spes.py得到4物种组合的文件out_four_species_array.txt

 1 # coding=utf-8
 2 from itertools import combinations
 3 from ete3 import Tree
 4 # 给定的元素
 5 elements = ['aawi', 'acer', 'acyt', 'adig', 'aflo', 'ahya', 'aint', 'amic', 'amil', 'amur', 'anas', 'apal', 'asel', 'aten', 'ayon']
 6
 7 # 提取三个元素为一组
 8 combs_3 = list(combinations(elements, 3))
 9
10 # 每组加上'mcap'形成四个元素
11 combs_4 = [(comb[0], comb[1], comb[2], 'mcap') for comb in combs_3]
12
13 # 将每个组合写入文件中，每行一个组合
14 with open('out_four_species_array.txt', 'w') as file:
15         for comb in combs_4:
16                 file.write(' '.join(comb) + '\n')

②按照out_four_species_array.txt文件中的4物种组合，在定根后的基因树集合文件里提取子树，包含枝长信息。运行以下脚本02_extract_subtree_for_4spes.py，得到不同物种组合的subtree文件，命名格式为”out_subtree_”+str(b)+”.txt”，其中变量b依次取自然数

 1 # coding=utf-8
 2 from ete3 import Tree
 3
 4 file_path = 'out_four_species_array.txt'
 5 b = 1
 6 with open(file_path, 'r') as file:
 7     # 逐行读取文件内容
 8         for line in file:
 9                 # 以空格为分隔符划分多个字符串
10                 subtree_taxa = line.strip().split(' ')
11                 # 打印每行划分后的字符串数组
12                 #print(strings_array)
13
14 # 定义subtree_taxa
15 #subtree_taxa = ["amil", "aflo", "aint", "aawi"]
16
17                 result_array = []
18                 # 打开文件并读取每一行
19                 with open('alltree.rooted.txt', 'r') as file:
20                         for line in file:
21                                 t = Tree(line)
22                                 t.prune(subtree_taxa, preserve_branch_length=True)
23                                 a = t.write()
24                                 result_array.append(a)
25
26                         filename = "out_subtree_"+str(b)+".txt"
27                         with open(filename, 'w') as file:
28                                 for element in result_array:
29                                         file.write(element + '\n')
30                 b+=1

产生多个4物种组合的基因树文件，对其进行处理，将祖先节点的序号去掉

1	sed -i 's/)$[0-9]$:/):/g' out_subtree_

③产生sampleInputFile.txt文件，在以下shell命令中修改路径

for file in `ls out_subtree*`; do echo -e "[Input]\ntreefile: /project/tianzhenWu/acropora_06four_cluster_gene/03_building_tree/4species/7756trees/$file\nnumdistributions: 2\nlikelihoodthresh: 0.01\nnumsteps: 50\ngradascentscalar: 0.5\ntotaloutgroup: mcap\nmultiproc: True\nmaxcores:70\n\n[Output]\nOutputPath: /project/tianzhenWu/acropora_06four_cluster_gene/03_building_tree/4species/QuIBL-master/my_455_annalysis/out$file.csv\n" > /project/tianzhenWu/acropora_06four_cluster_gene/03_building_tree/4species/QuIBL-master/my_455_annalysis/run$file.txt;done

④批量运行脚本

#构建批量运行的sh文件03_run.sh
for file in `ls runout_subtree*`;do echo -e "python QuIBL.py /project/tianzhenWu/acropora_06four_cluster_gene/03_building_tree/4species/QuIBL-master/my_455_annalysis/$file";done > 03_run.sh

#利用Parafly并行跑程序，该软件在python2和python3环境下均可以利用conda安装
ParaFly -c ./my_455_annalysis/03_run.sh -CPU 40