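A transcriptome (RNA-seq) analysis pipeline that I got running end to end, starting from the raw data delivered off the sequencer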

I. Installing the software

Install trimmomatic, fastqc, hisat2, samtools and similar tools with conda.
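After installing, it can help to confirm that every tool is actually on PATH before starting the pipeline. The sketch below is one way to do that; the function name `check_tools`, the environment name `rnaseq`, and the channel options in the commented conda command are assumptions, not part of the original setup.

```shell
# Assumed install command (environment name and channels are illustrative):
#   conda create -n rnaseq -c conda-forge -c bioconda trimmomatic fastqc hisat2 samtools
#   conda activate rnaseq

# Report which of the given commands can be found on PATH.
# Returns 0 if all of them are found, 1 if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_tools trimmomatic fastqc hisat2 samtools
```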

This version of HTSeq (0.6.1p1) has to be installed in a Python 2.7 environment, as follows:

wget https://pypi.python.org/packages/source/H/HTSeq/HTSeq-0.6.1p1.tar.gz
tar -zxvf HTSeq-0.6.1p1.tar.gz
cd HTSeq-0.6.1p1/
python setup.py build
python setup.py install
vi ~/.bashrc   # add the export line below; PATH takes the directory, not the htseq-count file itself
export PATH="$PATH:/(omitted)/software/HTSeq-0.6.1p1/build/scripts-2.7"
source ~/.bashrc

Note: the installed htseq-count script is located in the HTSeq-0.6.1p1/build/scripts-2.7/ directory.


Alternatively, install featureCounts:

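wget https://nchc.dl.sourceforge.net/project/subread/subread-1.6.3/subread-1.6.3-source.tar.gz
tar -zxvf subread-1.6.3-source.tar.gz
cd subread-1.6.3-source/src
make -f Makefile.Linux   # builds the executables, including featureCounts, into ../bin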

Once the directory containing featureCounts has been added to PATH, it is ready to use:

featureCounts -T 20 -t exon -g gene_id -a Danio_rerio.GRCz11.104.gtf -o count.txt align.bam
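The count.txt written by featureCounts starts with a comment line and carries annotation columns (Chr, Start, End, Strand, Length) before the counts. A minimal sketch for reducing it to a two-column gene/count table is shown here; the function name `featurecounts_to_matrix` and the output file name are hypothetical, and it assumes a single BAM file was counted, so that the counts are in column 7.

```shell
# Reduce a featureCounts table to "gene_id<TAB>count".
# Assumption: one BAM file was counted, so the counts sit in column 7.
featurecounts_to_matrix() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { next }               # skip the leading "# Program:..." comment line
         $1 == "Geneid" { next }     # skip the column header
         { print $1, $7 }' "$1"
}

# Usage:
#   featurecounts_to_matrix count.txt > count_matrix.txt
```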


Install DESeq2 in RStudio:

install.packages("BiocManager")
BiocManager::install("DESeq2")
library(DESeq2)

II. Data processing workflow

Remove adapter sequences, low-quality reads, and reads left without a mate [1]

trimmomatic PE -threads 20 clocka-clocka-mut_combined_R1.fastq.gz clocka-clocka-mut_combined_R2.fastq.gz -baseout clocka-clocka-mut_combined ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:8:true SLIDINGWINDOW:5:20 LEADING:3 TRAILING:3 MINLEN:36
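With -baseout, Trimmomatic writes the surviving pairs to files ending in _1P and _2P (and the orphaned reads to _1U and _2U). One quick sanity check is that the two paired files contain the same number of reads. The helper below is a sketch of that check; the function name `count_reads` is hypothetical, and it assumes standard four-line FASTQ records.

```shell
# Count the reads in a FASTQ file, assuming 4 lines per record.
# Files ending in .gz are decompressed on the fly.
count_reads() {
    if [ "${1%.gz}" != "$1" ]; then
        lines=$(gzip -cd "$1" | wc -l | tr -d ' ')
    else
        lines=$(wc -l < "$1" | tr -d ' ')
    fi
    echo $((lines / 4))
}

# Usage: the two numbers should be identical for properly paired output.
#   count_reads clocka-clocka-mut_combined_1P
#   count_reads clocka-clocka-mut_combined_2P
```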

Quality control [2]

fastqc clocka-clocka-mut_combined_1P clocka-clocka-mut_combined_2P -o ./ -t 20   

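After filtering, the FastQC report marks two modules, Per base sequence content and Sequence Duplication Levels [3], as failed (red cross). According to the references I consulted, neither has a negative effect on the downstream analysis.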
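The failed modules can also be listed without opening the HTML report, because each FastQC result archive contains a tab-separated summary.txt (status, module name, input file). The sketch below reads such a file; the function name `fastqc_modules` and the file name `summary_1P.txt` are hypothetical, and the zip name in the usage comment assumes FastQC's default output naming.

```shell
# Print the FastQC modules that have a given status (default: FAIL).
# Input is a FastQC summary.txt: status <TAB> module name <TAB> file name.
fastqc_modules() {
    awk -F '\t' -v status="${2:-FAIL}" '$1 == status { print $2 }' "$1"
}

# Usage (assumes unzip is installed and the default FastQC output name):
#   unzip -p clocka-clocka-mut_combined_1P_fastqc.zip '*/summary.txt' > summary_1P.txt
#   fastqc_modules summary_1P.txt         # modules marked FAIL
#   fastqc_modules summary_1P.txt WARN    # modules marked WARN
```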

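# Note: sequence names must match between the FASTA and the GTF; an NCBI RefSeq FASTA (NC_... accessions) and an Ensembl GTF (1, 2, ...) name chromosomes differently unless one of them is renamed
extract_exons.py Danio_rerio.GRCz11.104.gtf > genome.exon
extract_splice_sites.py Danio_rerio.GRCz11.104.gtf > genome.ss
hisat2-build -p 20 GCF_000002035.6_GRCz11_genomic.fna --ss genome.ss --exon genome.exon genome_tran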

hisat2 -p 30 --dta -x genome_tran -1 clocka-clocka-mut_combined_1P -2 clocka-clocka-mut_combined_2P -S align.sam # align
# Message printed at this step: Warning: Unsupported file format

samtools view -S align.sam -b > align.bam # convert SAM to BAM, see https://blog.csdn.net/weixin_39790504/article/details/111376943
samtools sort -l 4 -o align_sort.bam align.bam # sort (by coordinate, the samtools default)
samtools index align_sort.bam align_sort.bam.bai # build the index
htseq-count -f bam -r pos -i gene_id -s yes -t gene -m intersection-nonempty align_sort.bam Danio_rerio.GRCz11.104.gtf > count.txt # count; -r pos because the BAM is coordinate-sorted
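The table written by htseq-count ends with special counters whose names start with a double underscore (for example __no_feature and __ambiguous). A short summary of assigned versus unassigned reads is a useful check at this point: if almost everything lands in __no_feature, possible causes include sequence names that differ between the BAM and the GTF, or a strand setting (-s) that does not fit the library. The sketch below is one way to get that summary; the function name `htseq_summary` and the output file name are hypothetical.

```shell
# Summarise an htseq-count table: print the special "__" counters as they
# appear, then the total number of reads assigned to features.
htseq_summary() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^__/ { print $1, $2; next }   # special counters, e.g. __no_feature
         { assigned += $2 }             # ordinary per-gene rows
         END { print "assigned", assigned + 0 }' "$1"
}

# Usage:
#   htseq_summary count.txt
# Keep only the per-gene rows, e.g. as input for DESeq2:
#   grep -v '^__' count.txt > count_genes.txt
```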

To be continued.


III. Other references

[1] https://blog.csdn.net/sinat_32872729/article/details/93487342
[2] https://www.jianshu.com/p/fe6af418a8bc
[3] https://www.biostars.org/p/307361/#307372

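The detailed transcriptome workflow follows the posts by the user Dawn_WangTP: https://www.jianshu.com/u/a64003068454
For an explanation of the file formats involved in transcriptome analysis, see https://www.jianshu.com/p/03bc06c1e84a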