Old Tag-seq analysis for PJB
Porites bleaching TagSeq Bioinformatics
Scripts modified from those of my awesome lab mates S. Gurr and E. Chille.
Initial diagnostics upon sequence upload to HPC
HP uploaded data to Bluewaves and we are waiting for the md5 checksum from UT.
HP ran initial multiQC
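For reference, a minimal sketch of the checks for this stage once the checksums arrive (the checksum filename is a placeholder for whatever UT provides; HP's actual multiQC commands may have differed):
# verify file integrity against the checksums provided by UT (filename is a placeholder)
md5sum -c UT_checksums.md5
# quick QC of the raw reads, using the same modules as the post-trim QC below
module load FastQC/0.11.8-Java-1.8
module load MultiQC/1.7-foss-2018b-Python-2.7.15
mkdir -p raw_qc
fastqc *.fastq.gz -o raw_qc/
multiqc raw_qc/ -o raw_qc/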
Trimming and post-trim quality check of ‘clean’ reads
shell script: fastp_multiqc.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=kevin_wong1@uri.edu
#SBATCH --output=../../../../kevin_wong1/20210315_Past_Tagseq/output/clean/"%x_out.%j"
#SBATCH --error=../../../../kevin_wong1/20210315_Past_Tagseq/output/clean/"%x_err.%j"
#SBATCH -D /data/putnamlab/KITT/hputnam/20210312_Pastreoides_TagSeq/Pastreoides_TagSeq_JA21015
#SBATCH --cpus-per-task=3
# load modules needed
module load fastp/0.19.7-foss-2018b
module load FastQC/0.11.8-Java-1.8
module load MultiQC/1.7-foss-2018b-Python-2.7.15
# Make an array of sequences to trim
array1=($(ls *.fastq.gz))
# fastp loop; trim the Read 1 TruSeq adapter sequence; trim poly x default 10 (to trim polyA)
for i in ${array1[@]}; do
fastp --in1 ${i} --out1 ../../../../kevin_wong1/20210315_Past_Tagseq/output/clean/clean.${i} --adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA --trim_poly_x 6 -q 20 -y -Y 50
fastqc ../../../../kevin_wong1/20210315_Past_Tagseq/output/clean/clean.${i}
done
echo "Read trimming of adapters complete." $(date)
# Quality Assessment of Trimmed Reads
cd ../../../../kevin_wong1/20210315_Past_Tagseq/output/clean/ #The following command will be run in the /clean directory
multiqc ./ #Compile MultiQC report from FastQC files
echo "Cleaned MultiQC report generated." $(date)
20210318 job #1883852
EXPORT MULTIQC REPORT
Exit Bluewaves and run the following from the local terminal.
- save to the git repo as multiqc_clean.html
scp kevin_wong1@bluewaves.uri.edu:/data/putnamlab/kevin_wong1/20210315_Past_Tagseq/output/clean/multiqc_report.html MyProjects/Porites_Rim_Bleaching_2019/output/TagSeq
HPC Job: HISAT2 Index Reference and Alignment
This step was not strictly necessary, as I am not using these outputs downstream; however, it did let me determine the mapping percentages against the Kenkel et al. 2013 reference transcriptome.
- create directory output/hisat2
mkdir hisat2
Mapping #1 - Porites astreoides transcriptome from Kenkel et al. 2013
- index reference and alignment
Uploading the reference transcriptome from Kenkel et al. 2013:
scp Desktop/URI_PHD/pastreoides_2014_ref/pastreoides_may2014/past.fasta kevin_wong1@bluewaves.uri.edu:/data/putnamlab/kevin_wong1/REFS/Past/Kenkel2013_past_transcriptome.fasta
Input:
- Kenkel2013_past_transcriptome.fasta = reference transcriptome
- clean*.fastq.gz = all cleaned TagSeq reads

Output:
- Past_transcriptome_ref = reference indexed by hisat2-build; stored in the output/hisat2 folder as Past_transcriptome_ref.1.ht2 through Past_transcriptome_ref.8.ht2
- <sample>.sam = HISAT2 output, a readable text file; removed at the end of the script
- <sample>.bam = converted binary file complementary to the HISAT2 SAM file
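After the job finishes, a quick check that hisat2-build produced all eight index files:
# the index should consist of Past_transcriptome_ref.1.ht2 through .8.ht2
ls -lh /data/putnamlab/kevin_wong1/20210315_Past_Tagseq/output/hisat2/Kenkel2013_REF/Past_transcriptome_ref.*.ht2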
shell script: HISAT2.sh
#!/bin/bash
#SBATCH -t 120:00:00
#SBATCH --nodes=1 --ntasks-per-node=5
#SBATCH --account=putnamlab
#SBATCH --export=NONE
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=kevin_wong1@uri.edu
#SBATCH --output="%x_out.%j"
#SBATCH --error="%x_err.%j"
#SBATCH -D /data/putnamlab/kevin_wong1/20210315_Past_Tagseq/output/hisat2/Kenkel2013_REF
#SBATCH --cpus-per-task=3
#load packages
module load HISAT2/2.1.0-foss-2018b #Alignment to reference genome: HISAT2
module load SAMtools/1.9-foss-2018b #Preparation of alignment for assembly: SAMtools
# symbolically link 'clean' reads to hisat2 dir
ln -s ../../../clean/clean*.fastq.gz ./
# index the reference transcriptome for Porites astreoides output index to working directory
hisat2-build -f ../../../../REFS/Past/Kenkel2013_past_transcriptome.fasta ./Past_transcriptome_ref # called the reference transcriptome (scaffolds)
echo "Referece transcriptome indexed. Starting alingment" $(date)
# This script exports alignments as bam files
# sorts the bam file because Stringtie takes a sorted file for input (--dta)
# removes the sam file because it is no longer needed
array=($(ls *.fastq.gz)) # call the symbolically linked sequences - make an array to align
for i in ${array[@]}; do
sample_name=`echo $i| awk -F [.] '{print $2}'` # strip the "clean." prefix and ".fastq.gz" suffix to get the sample name
hisat2 -p 8 --dta -x Past_transcriptome_ref -U ${i} -S ${sample_name}.sam
samtools sort -@ 8 -o ${sample_name}.bam ${sample_name}.sam
echo "${i} bam-ified!"
rm ${sample_name}.sam
done
20210319 Submitted job 1883908
Took ~2 days
This job “failed” but I think it was because I had an extra “clean*.fastq.gz” file. All of the samples had mapping percentages, and the SAM files were converted to BAM and then deleted.
[bam_sort_core] merging from 0 files and 8 in-memory blocks...
Extra parameter(s) specified: "clean.A10-KW10_S233_L002_R1_001.fastq.gz", "clean.A11-KW11_S236_L001_R1_001.fastq.gz", "clean.A11-KW11_S236_L002_R1_001.fastq.gz", "clean.A12-KW12_S239_L001_R1_001.fastq.gz", "clean.A12-KW12_S239_L002_R1_001.fastq.gz", "clean.A$
Error: Encountered internal HISAT2 exception (#1)
Command: /opt/software/HISAT2/2.1.0-foss-2018b/bin/hisat2-align-s --wrapper basic-0 -p 8 --dta -x Past_transcriptome_ref -S A10-KW10_S233_L001_R1_001.sam -U /tmp/31840.unp clean.A10-KW10_S233_L002_R1_001.fastq.gz clean.A11-KW11_S236_L001_R1_001.fastq.gz clea$
(ERR): hisat2-align exited with value 1
[E::hts_open_format] Failed to open file A10-KW10_S233_L001_R1_001.sam
samtools sort: can't open "A10-KW10_S233_L001_R1_001.sam": No such file or directory
rm: cannot remove `A10-KW10_S233_L001_R1_001.sam': No such file or directory
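Before re-running, a sanity check along these lines (a sketch; the expected count of 90 assumes 45 samples x 2 lanes) would confirm whether a stray or broken symlink caused the extra-parameter error:
cd /data/putnamlab/kevin_wong1/20210315_Past_Tagseq/output/hisat2/Kenkel2013_REF
# count the symlinked clean reads; 45 samples x 2 lanes should give 90 files
ls clean*.fastq.gz | wc -l
# list any broken symlinks left behind by an earlier ln -s
find . -maxdepth 1 -xtype l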
Exporting the error file to report the mapping percentages:
scp kevin_wong1@bluewaves.uri.edu:/data/putnamlab/kevin_wong1/20210315_Past_Tagseq/output/hisat2/Kenkel2013_REF/HISAT2.sh_err.1883908 MyProjects/Porites_Rim_Bleaching_2019/output/TagSeq
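HISAT2 writes its alignment summary to stderr, so the per-sample mapping percentages can also be pulled straight from this error file on Bluewaves, e.g.:
grep "overall alignment rate" /data/putnamlab/kevin_wong1/20210315_Past_Tagseq/output/hisat2/Kenkel2013_REF/HISAT2.sh_err.1883908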
Salmon
Salmon replaces HISAT2 and StringTie because I have no reference genome with a GFF3 file.
- Need to also try Salmon with the SAM/BAM files (see the sketch below)
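A sketch of Salmon's alignment-based mode for one sample, assuming the HISAT2 BAMs from above and the same Kenkel et al. 2013 transcriptome (not run yet):
module load Salmon/1.1.0-gompi-2019b
module load SAMtools/1.9-foss-2018b
# Salmon wants all alignments for a read grouped together, so the coordinate-sorted
# HISAT2 BAM may first need to be re-sorted by read name
samtools sort -n -@ 8 -o A1-KW1_S197_L001_R1_001.nsort.bam A1-KW1_S197_L001_R1_001.bam
# quantify directly from the alignments instead of re-mapping with the index
salmon quant -t /data/putnamlab/kevin_wong1/REFS/Past/Kenkel2013_past_transcriptome.fasta \
-l A \
-a A1-KW1_S197_L001_R1_001.nsort.bam \
-o quants_aln/A1-KW1_S197_L001_R1_001_quant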
Merging lanes of cleaned reads for Salmon
#!/bin/bash
#SBATCH --job-name="cat.clean"
#SBATCH -t 100:00:00
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=20
#SBATCH --exclusive
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=kevin_wong1@uri.edu
#SBATCH -D /data/putnamlab/kevin_wong1/20210315_Past_Tagseq/output/clean
#SBATCH --mem=100GB
module load SAMtools/1.9-foss-2018b
echo "Concatenate files" $(date)
ls *R1_001.fastq.gz | awk -F '[_]' '{print $1"_"$2}' | sort | uniq > ID # build a list of unique sample IDs (one per sample, collapsing the two lanes)
for i in `cat ./ID`;
do cat $i\_L001_R1_001.fastq.gz $i\_L002_R1_001.fastq.gz > $i\_ALL.fastq.gz;
done
echo "Mission complete." $(date)
Salmon shell script
https://combine-lab.github.io/salmon/getting_started/
#!/bin/bash
#SBATCH --job-name="Salmon"
#SBATCH -t 100:00:00
#SBATCH --export=NONE
#SBATCH --nodes=1 --ntasks-per-node=20
#SBATCH --exclusive
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=kevin_wong1@uri.edu
#SBATCH -D /data/putnamlab/kevin_wong1/20210315_Past_Tagseq/output/salmon
#SBATCH --mem=100GB
module load Salmon/1.1.0-gompi-2019b
#echo "Building index..." $(date)
#salmon index -t ../../../REFS/Past/Kenkel2013_past_transcriptome.fasta -i PAST_Kenkel_ref_index -k 31
# symbolically link 'clean' reads to salmon dir
ln -s ../clean/*_ALL.fastq.gz ./
echo "Running the alignment and abundance estimation." $(date)
array1=($(ls *_ALL.fastq.gz))
for i in ${array1[@]}; do
salmon quant -i PAST_Kenkel_ref_index -l A \
-r ${i} \
--validateMappings -p 8 -o quants/${i}_quant
done
echo "Mission complete." $(date)
Quant merge
This takes all of the quant.sf files in each individual sample folder and combines them into a single gene count matrix.
cd /data/putnamlab/kevin_wong1/20210315_Past_Tagseq/output/salmon/quants
salmon quantmerge --quants {clean.A10-KW10_S233_ALL.fastq.gz_quant,clean.A11-KW11_S236_ALL.fastq.gz_quant,clean.A12-KW12_S239_ALL.fastq.gz_quant,clean.A1-KW1_S197_ALL.fastq.gz_quant,clean.A2-KW2_S201_ALL.fastq.gz_quant,clean.A3-KW3_S205_ALL.fastq.gz_quant,clean.A4-KW4_S209_ALL.fastq.gz_quant,clean.A5-KW5_S213_ALL.fastq.gz_quant,clean.A6-KW6_S217_ALL.fastq.gz_quant,clean.A7-KW7_S221_ALL.fastq.gz_quant,clean.A8-KW8_S225_ALL.fastq.gz_quant,clean.A9-KW9_S229_ALL.fastq.gz_quant,clean.B10-KW22_S234_ALL.fastq.gz_quant,clean.B11-KW23_S237_ALL.fastq.gz_quant,clean.B12-KW24_S240_ALL.fastq.gz_quant,clean.B1-KW13_S198_ALL.fastq.gz_quant,clean.B2-KW14_S202_ALL.fastq.gz_quant,clean.B3-KW15_S206_ALL.fastq.gz_quant,clean.B4-KW16_S210_ALL.fastq.gz_quant,clean.B5-KW17_S214_ALL.fastq.gz_quant,clean.B6-KW18_S218_ALL.fastq.gz_quant,clean.B7-KW19_S222_ALL.fastq.gz_quant,clean.B8-KW20_S226_ALL.fastq.gz_quant,clean.B9-KW21_S230_ALL.fastq.gz_quant,clean.C10-KW34_S235_ALL.fastq.gz_quant,clean.C11-KW35_S238_ALL.fastq.gz_quant,clean.C12-KW36_S241_ALL.fastq.gz_quant,clean.C1-KW25_S199_ALL.fastq.gz_quant,clean.C2-KW26_S203_ALL.fastq.gz_quant,clean.C3-KW27_S207_ALL.fastq.gz_quant,clean.C4-KW28_S211_ALL.fastq.gz_quant,clean.C5-KW29_S215_ALL.fastq.gz_quant,clean.C6-KW30_S219_ALL.fastq.gz_quant,clean.C7-KW31_S223_ALL.fastq.gz_quant,clean.C8-KW32_S227_ALL.fastq.gz_quant,clean.C9-KW33_S231_ALL.fastq.gz_quant,clean.D1-KW37_S200_ALL.fastq.gz_quant,clean.D2-KW38_S204_ALL.fastq.gz_quant,clean.D3-KW39_S208_ALL.fastq.gz_quant,clean.D4-KW40_S212_ALL.fastq.gz_quant,clean.D5-KW41_S216_ALL.fastq.gz_quant,clean.D6-KW42_S220_ALL.fastq.gz_quant,clean.D7-KW43_S224_ALL.fastq.gz_quant,clean.D8-KW44_S228_ALL.fastq.gz_quant,clean.D9-KW45_S232_ALL.fastq.gz_quant} --column numreads --output /data/putnamlab/kevin_wong1/20210315_Past_Tagseq/output/salmon/full_count_matrix.txt
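For reference, since every quant directory matches the same naming pattern, the same merge could be written with a shell glob instead of the explicit list (worth double-checking that the glob picks up exactly the 45 directories):
cd /data/putnamlab/kevin_wong1/20210315_Past_Tagseq/output/salmon/quants
salmon quantmerge --quants clean.*_ALL.fastq.gz_quant --column numreads --output /data/putnamlab/kevin_wong1/20210315_Past_Tagseq/output/salmon/full_count_matrix.txt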
scp kevin_wong1@bluewaves.uri.edu:/data/putnamlab/kevin_wong1/20210315_Past_Tagseq/output/salmon/full_count_matrix.txt MyProjects/Porites_Rim_Bleaching_2019/output/TagSeq