QuIBL是2019年science蝴蝶辐射演化分析中检测渐渗的新方法,其用法为python脚本QuIBL.py的使用

1、安装

在github下载软件包:https://github.com/miriammiyagi/QuIBL

1
2
3
wget https://github.com/miriammiyagi/QuIBL/archive/refs/heads/master.zip

unzip master.zip

该脚本依赖python2.7,且依赖以下包joblib, ete3, itertools, sys, numpy, math, ConfigParser, csv, and multiprocessing

#创建python2的环境

1
2
conda create -n python2.7 python=2.7
conda activate python2.7

#通过运行示例文件来检查不存在的依赖包

1
python QuIBL.py ./Small_Test_Example/sampleInputFile.txt

#通过多次运行实例文件的报错,发现te3和joblib是存在问题的,按照作者的建议安装特定版本的ete3和joblib,ete3==3.0.0b35和joblib==0.11,利用conda安装它们

1
2
conda install ete3==3.0.0b35
conda install joblib==0.11

#再次运行示例文件,可以运行成功

1
python QuIBL.py ./Small_Test_Example/sampleInputFile.txt

2、输入文件

QuIBL运行较为简单,只需要准备两个文件

①sampleInputFile.txt,该参数配置文件要根据软件自带的配置文件进行修改,以下是参数的具体意义:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
treefile: The path to the trees to be analyzed.

numdistributions: The number of branch length distributions in the mixture to test. For now, only two is supported (this corresponds to one ILS and one non-ILS distribution).

likelihoodthresh: The maximum change in likelihood allowed for the gradient ascent search for theta to stop.

numsteps: The number of total EM steps. For thousands of trees, we reccomend trying around 50.

gradascentscalar: The factor to shrink the stepsize when a gradient ascent step fails.

totaloutgroup: The name of the ultimate outrgroup of your sample. All trees are assumed to be rooted using this taxon.

multiproc: Accepts `True` or `False` and either turns multiprocessing on or off.

OutputPath: Where the output gets written.

maxcores: The maximum number of cores QuIBL is allowed to use.

②树文件smallTestTrees.txt,该文件包括4个物种的拓扑关系及其枝长信息。

利用ete3从包含多个基因树的文件(每行包括一个基因树)中批量提取子树,可以在python3环境下运行以下脚本,然后将结果写到空文件smallTestTrees.txt,注意去除中间树节点在不同基因树的编号

1
2
3
4
5
6
7
8
9
10
11
from ete3 import Tree

# 定义subtree_taxa
subtree_taxa = ["amil", "aflo", "aint", "aawi"]

# 打开文件并读取每一行
with open('alltree.rooted.txt', 'r') as file:
for line in file:
t = Tree(line)
t.prune(subtree_taxa, preserve_branch_length=True)
t.write()

3、运行QuIBL.py脚本

1
python QuIBL.py my_analysis/sampleInputFile.txt

4、结果解读

我的运行结果如下

triplet outgroup C1 C2 mixprop1 mixprop2 lambda2Dist lambda1Dist BIC2Dist BIC1Dist count
aint_aawi_aflo aint 0 0.760561129 0.801436519 0.198563481 0.003752992 0.004153213 -33700.04628 -33728.42803 3762
aint_aawi_aflo aawi 0 10.96069501 0.975929055 0.024070945 0.00378819 0.004690472 -20079.01796 -19744.4163 2264
aint_aawi_aflo aflo 0 10.88296285 0.975452203 0.024547797 0.0035074 0.004364335 -15592.90708 -15335.1854 1730

假如物种树的拓扑是((aawi, aflo), aint),查看aint与aflo的渐渗要看第二行,其中mixprop2表示渐渗程度,若BIC2Dist与BIC1Dist之差小于-10,则支持存在渐渗的模型。同理,查看aint与aawi的渐渗要看第三行。

文章附件的解释:
triplet: The three-taxon subset considered. Species abbreviations separated by underscores. Outgroup: Species inferred to be the outgroup in the triplet gene tree topology tested.
Cx: Inferred species tree branch length for (1) the ILS case and (2) the non-ILS case. The ILS case is forced to be 0, as all lineages must be in the same population.
Topology proportions: Inferred mixture proportion for the ILS and non-ILS distributions. These values sum to 1.
numTrees: Frequency of the topology in the sample. BICx: Raw BIC values for each model dBIC: difference in BIC value between the models. dBIC < -10 implies that the ILS+introgression model is a better fit for the data.
total non-ILS: topology non-ILS proportion * (numTrees/total trees in sample). This value represents the genome-wide introgression fraction

5、穷尽运行物种树的所有三物种组合,批量运行QuIBL.py

①在物种树上提取所有的三物种组合,运行以下脚本01_comb_4spes.py得到4物种组合的文件out_four_species_array.txt

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
 1 # coding=utf-8
2 from itertools import combinations
3 from ete3 import Tree
4 # 给定的元素
5 elements = ['aawi', 'acer', 'acyt', 'adig', 'aflo', 'ahya', 'aint', 'amic', 'amil', 'amur', 'anas', 'apal', 'asel', 'aten', 'ayon']
6
7 # 提取三个元素为一组
8 combs_3 = list(combinations(elements, 3))
9
10 # 每组加上'mcap'形成四个元素
11 combs_4 = [(comb[0], comb[1], comb[2], 'mcap') for comb in combs_3]
12
13 # 将每个组合写入文件中,每行一个组合
14 with open('out_four_species_array.txt', 'w') as file:
15 for comb in combs_4:
16 file.write(' '.join(comb) + '\n')

②按照out_four_species_array.txt文件中的4物种组合,在定根后的基因树集合文件里提取子树,包含枝长信息。运行以下脚本02_extract_subtree_for_4spes.py,得到不同物种组合的subtree文件,命名格式为”out_subtree_”+str(b)+”.txt”,其中变量b依次取自然数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
 1 # coding=utf-8
2 from ete3 import Tree
3
4 file_path = 'out_four_species_array.txt'
5 b = 1
6 with open(file_path, 'r') as file:
7 # 逐行读取文件内容
8 for line in file:
9 # 以空格为分隔符划分多个字符串
10 subtree_taxa = line.strip().split(' ')
11 # 打印每行划分后的字符串数组
12 #print(strings_array)
13
14 # 定义subtree_taxa
15 #subtree_taxa = ["amil", "aflo", "aint", "aawi"]
16
17 result_array = []
18 # 打开文件并读取每一行
19 with open('alltree.rooted.txt', 'r') as file:
20 for line in file:
21 t = Tree(line)
22 t.prune(subtree_taxa, preserve_branch_length=True)
23 a = t.write()
24 result_array.append(a)
25
26 filename = "out_subtree_"+str(b)+".txt"
27 with open(filename, 'w') as file:
28 for element in result_array:
29 file.write(element + '\n')
30 b+=1

产生多个4物种组合的基因树文件,对其进行处理,将祖先节点的序号去掉

1
sed -i 's/)\([0-9]*\):/):/g' out_subtree_*

③产生sampleInputFile.txt文件,在以下shell命令中修改路径

1
for file in `ls out_subtree*`; do echo -e "[Input]\ntreefile: /project/tianzhenWu/acropora_06four_cluster_gene/03_building_tree/4species/7756trees/$file\nnumdistributions: 2\nlikelihoodthresh: 0.01\nnumsteps: 50\ngradascentscalar: 0.5\ntotaloutgroup: mcap\nmultiproc: True\nmaxcores:70\n\n[Output]\nOutputPath: /project/tianzhenWu/acropora_06four_cluster_gene/03_building_tree/4species/QuIBL-master/my_455_annalysis/out$file.csv\n" > /project/tianzhenWu/acropora_06four_cluster_gene/03_building_tree/4species/QuIBL-master/my_455_annalysis/run$file.txt;done

④批量运行脚本

1
2
3
4
5
#构建批量运行的sh文件03_run.sh
for file in `ls runout_subtree*`;do echo -e "python QuIBL.py /project/tianzhenWu/acropora_06four_cluster_gene/03_building_tree/4species/QuIBL-master/my_455_annalysis/$file";done > 03_run.sh

#利用Parafly并行跑程序,该软件在python2和python3环境下均可以利用conda安装
ParaFly -c ./my_455_annalysis/03_run.sh -CPU 40

⑤结果整合

1
for file in *.csv; do sed -n '2,4p' "$file" >> results_qulbl.txt; done

参考:miriammiyagi/QuIBL: Quantifying Introgression via Branch Lengths (github.com)