batch sequences alignment using MAFFT

利用Mafft软件分别对蛋白和DNA进行批量比对

首先利用conda安装相关软件和脚本文件：

conda install mafft 
conda install Gblocks
conda install trimal
conda install pal2nal.pl

1、将DNA翻译成氨基酸进行多序列比对，再剪切，最后回译为DNA。

将序列文件置于当前文件下 ./

1	ls>temp.txt;sed -i 's/temp.txt//g' temp.txt;for i in `cat temp.txt`;do faTrans $i aa-$i;done;rm temp.txt;mkdir pepfile;mv aa-* pepfile #将DNA翻译成蛋白并移入pepfile文件夹；

1	ls>temp.txt;sed -i 's/temp.txt//g' temp.txt;for i in `cat temp.txt`;do echo "mafft --maxiterate 1000 --localpair $i > mafft-$i" >> multiple.sh;done;rm temp.txt #创建mafft批量运行sh文件；

1	ParaFly -c multiple.sh -CPU 50 #运行mafft；

1	ls mafft*>temp.txt;for i in `cat temp.txt`;do Gblocks $i -b5=h;done #剪切氨基酸;

1	rename .fasta-gb .fasta *-gb #修改后缀;

1	ls *.fasta>temp.txt;for i in `cat temp.txt`;do pal2nal.pl ./pepfile/mafft-aa-$i $i -output fasta > out-$i;done #氨基酸回译为dna

此时遇到的问题是：报错inconsistency between the following pep and nuc seqs，导致回译之后的序列文件部分为空。并且此时得到的结果可能包括全长为gap的序列，这会影响后续系统发育结构的构建，应将序列中的此类物种去掉。

两款剪切软件的区别：

Gblocks：剪切后不会丢弃数据缺失物种；

trimal：剪切后会将数据缺失物种丢弃。

2、将DNA直接比对后直接剪切，用于后续分析。

1	ls>temp.txt;sed -i 's/temp.txt//g' temp.txt;for i in `cat temp.txt`;do echo "mafft --maxiterate 1000 --localpair $i > mafft-$i" >> multiple_mafft.sh;done;rm temp.txt #比对DNA

1	ParaFly -c multiple.sh -CPU 50 #运行mafft；

接下来直接用Trimal软件剪切序列，去除非保守区域。

1	ls mafft-OG00* >temp.txt;sed -i 's/temp.txt//g' temp.txt;for i in `cat temp.txt`;do trimal -in $i -out out-$i -automated1;done #剪切

参考：

Quek, R. Z., Jain, S. S., Neo, M. L., Rouse, G. W., & Huang, D. (2020). Transcriptome‐based target‐enrichment baits for stony corals (Cnidaria: Anthozoa: Scleractinia). Molecular ecology resources, 20(3), 807-818.