test_pipeline-of-phylogeny

以2019年发表在MER期刊上的“Transcriptome-based target-enrichment baits for stony corals (Cnidaria: Anthozoa: Scleractinia)”文章中数据为测试数据，跑一遍构建物种分化时间树的探索流程如下：

1、数据获取

在文献中获取452个直系同源基因序列文件，其中每个文件包括不同数量的物种，未进行多序列比对；分析所需软件和脚本：Mafft、Trimal、catfasta2phyml.pl、fasta2relaxedPhylip.pl、PartitionFinder、Iqtree、Raxml、astral.5.7.8.jar、Mcmctree等，多数可用conda安装，最好单独创建python2和python3环境用于以下软件的使用。

2、利用mafft软件进行多序列比对；

1
2
3

ls>temp.txt;sed -i 's/temp.txt//g' temp.txt;for i in `cat temp.txt`;do echo "mafft --maxiterate 1000 --localpair $i > mafft-$i" >> multiple_mafft.sh;done;rm temp.txt #比对DNA

ParaFly -c multiple.sh -CPU 50  #运行mafft比对

3、利用trimal剪切；

1	ls mafft-OG00* >temp.txt;sed -i 's/temp.txt//g' temp.txt;for i in `cat temp.txt`;do trimal -in $i -out out-$i -automated1;done #剪切，去除非保守区域

4、过滤；

1	ls>temp;for i in `cat temp`;do number=`grep -c '>' $i`;if [ $number -lt 6 ]; then mv $i ./remove_file; fi;done #将物种少于6个的文件移到remove_file文件中

5、利用串联法建树；

perl ../../../scripts/catfasta2phyml.pl -c -f *.fasta > super_sequences.fasta     #利用perl脚本将基因进行串联

perl ../../../scripts/fasta2relaxedPhylip.pl -f super_sequences.fasta -o super_sequences.phylip #利用perl脚本将fasta格式转换为phylip格式

python PartitionFinder.py task20211220/ #利用PartitionFinder计算分区方案，cfg文件参数为branchlengths=linked;models=GTR,GTR+G,GTR+I+G;model_selection=aicc;search=greedy。此软件所依赖包是python2环境下的，因此要在python2环境下运行此步骤

iqtree -s super_sequences.fasta -redo -pre outtree -p best_scheme_foriqtree.txt -b 1000 -nt AUTO #利用iqtree构树,-p best_scheme_foriqtree.txt指定上述方法中的最佳分区方案，但是在运行时发现14个物种的gap过多，这些数据导致未能通过卡方检验，运行失败，猜测是数据质量问题，于是使用不分区方案直接对串联序列进行构树

iqtree -s super_sequences.fasta -redo -pre outtree -m MFP -b 1000 -nt AUTO #利用-m MFP参数自动检测最佳替换模型

6、溯祖法建树

ls out-mafft*>temp;for i in `cat temp`;do echo "modeltest-ng -i ../$i -d nt" >>multiple_modeltest.sh;done;rm temp  #为每个基因检测最佳碱基替换模型

ls *.out>temp;for i in `cat temp`;do grep 'raxml-ng --msa' $i |tail -n 1 >> multiple_raxml.sh;done
sed -i -e 's/raxml-ng/raxml-ng --all/g' -e 's/>//' -e 's/$/ --bs-trees 1000/g' multiple_raxml.sh #提取每个基因的最佳替换模型，并整理为批量运行raxml-ng软件的sh文件，其中--bs-trees参数为1000

sh multiple_raxml.sh #利用raxml-ng建树

for i in *.bestTree;do cat $i >>in.tree;done  #将raxml-ng建树结果整理为astral输入格式

java -jar ~/softwares/Astral/astral.5.7.8.jar -i in.tree -o out.tre 2>out.log #将多个基因树利用astral合并为一棵树。注：此处得到的tree只包含了内部节点的枝长，而不能得到末端枝枝长，某些美化树的软件不能显示（Figtree可以），因此我们也可以通过添加虚拟末端枝长以利用某些软件进行可视化（添加末端枝长的脚本链接为https://github.com/smirarab/global/blob/master/src/mirphyl/utils/add-bl.py）

方案优化点：
①最初使用DNA直接比对后剪切会导致单个基因变不完整（非3整倍数），在检测分区时不能具体到每个密码子中的3个位置（第三位碱基往往比前两位碱基具有更高的突变率），因此，在比对时应采取利用氨基酸比对，再回译为DNA，从而保证基因是3的整倍数，进而得到更为精细的分区方案。
②未进行评估序列异质性、饱和程度步骤。