Obtaining sequences for HCR in Nvec
Obtaining sequences for HCR
Objective
I am interested in finding the sequences for these two genes in Nematostella:
- NV2g011441000.1 (FOX1A)
- NV2g019682000.1 (FRIS-like-8)
Genomic information
- I am using the NV2 genome: https://simrbase.stowers.org/starletseaanemone
- I am using the transcript file from here: https://simrbase.stowers.org/files/pub/nematostella/Nvec/genomes/Nvec200/aligned/tcs_v2/20240221/NV2g.20240221.transcripts.fa
- Whole genome can be found here: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_033964005.1/
Determine if sequences are in transcript file
[kxw755@pegasus NV2_Nvec]$ grep -n "NV2g011441000\.1" NV2g.20240221.transcripts.fa | head
1305742:>NV2t011441001.1 gene=NV2g011441000.1 CDS=473-1333
1305772:>NV2t011441002.1 gene=NV2g011441000.1 CDS=1215-2075
1305813:>NV2t011441003.1 gene=NV2g011441000.1 CDS=283-1143
[kxw755@pegasus NV2_Nvec]$ grep -n "NV2g019682000\.1" NV2g.20240221.transcripts.fa | head
1059715:>NV2t019682001.1 gene=NV2g019682000.1 CDS=109-627
Sweet, both genes are in there, but NV2g011441000.1 has three transcripts (NV2g019682000.1 has only one).
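The same check can be done as a per-gene tally instead of reading the grep output by eye. This is a minimal sketch, assuming every FASTA header carries a whitespace-separated `gene=<id>` field as in the output above; `count_isoforms` is a hypothetical helper name, not part of any tool.

```shell
# count_isoforms FASTA GENE_ID...
# Prints "gene_id<TAB>number_of_transcripts" for each requested gene.
# Assumes headers look like: >NV2t... gene=NV2g... CDS=a-b
count_isoforms() {
  fasta=$1
  shift
  for gene in "$@"; do
    # Compare the whole field, so a longer ID with the same prefix is not counted
    n=$(awk -v g="$gene" '
      /^>/ { for (i = 2; i <= NF; i++) if ($i == ("gene=" g)) n++ }
      END  { print n + 0 }
    ' "$fasta")
    printf '%s\t%s\n' "$gene" "$n"
  done
}
```

Usage: `count_isoforms NV2g.20240221.transcripts.fa NV2g011441000.1 NV2g019682000.1`. A count of 0 would mean the gene ID is not in the transcript file.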
Extract sequences
awk '
BEGIN{RS=">"; ORS=""}
NR>1 {
if ($0 ~ /gene=NV2g011441000\.1([[:space:]]|$)/ ||
$0 ~ /gene=NV2g019682000\.1([[:space:]]|$)/)
print ">" $0
}
' NV2g.20240221.transcripts.fa \
> NV2g011441000.1_NV2g019682000.1.transcripts.fa
output
less NV2g011441000.1_NV2g019682000.1.transcripts.fa
>NV2t019682001.1 gene=NV2g019682000.1 CDS=109-627
AGATCCCCGTTTTCTGCTTCAACAGTGATTGAACCGAAACAGGACTCACACGCGAATTTCCAAAACTTTT
CAGTTTTACTTTTATCCTTGCCATAAAACTCGTTCAAGATGTCGCTCTCAGTTTGCCGTCAGAACTATCA
CGAGGAGTCCGAGGCGGGCGTCAACAAGCAGATCAACCTCGAGTTGTACGCCAGCTACGTCTACATGTCG
ATGGCCTACCATTTCGACCGTGATGATGTAGCTTTGCCTGGATTCCACAAGTACTTTATGAAGGCCTCGC
ATGAAGAGCGCGAGCATGCCGAGAAGCTTGCCAAGTTCCAGCTGCAACGTGGAGGCCGCATTGTGCTTCA
AGACATCAAGCGCCCTGAGCGCGACGACTGGGGTTGTGGACAGGATGCCATTCAGGCAGCTCTTGACCTG
GAAAAACATGTCAACCAGGCCTTACTTGATCTGCACAAGGTCGCCGAGAAGCACGGTGACTCTCAGATGC
AAGACTGGCTCGAGTCGCATTACCTGACTGAGCAAGTGGAGGCCATCAAGGAGCTTGCTGGTCACTTGAC
CAACCTGAAGCGTGTTGGCCCTGGCTTGGGAGAATTCCAGTTCGACAAGCTCACCCTCGACGACTAGAGG
GGTGCAGGCTGGACTGAGTCTTGAAACCATGGATTGACCTTTAAACCGAAGTAGATCTTATCAACCCTGA
TGTGACACACAGCGCCGCCCTTGTTCACATAAATCGGAATTCATGGCAAGCCTTTGGATAATTCTATCTT
TCCACTCCGAGGCACACTTTGCCGTGCCCTGGTCCCTATTTCATCTAATGGTAAAGGGATCAGTGGAGCC
GTTTTTTGTTAACAAGGGCATCTTTTTTTCTTTTGTTAGACTTGTTTTCTGTAGTGACTAAATAAAAGCA
TTATAAAATCAATGTGATGACTGCTTTATTTTTGTTGAGAAATAAAAACCAGATAACATCCCAACATCCT
GCCCCAAGTTTTCAATATAAAGAAGTCTTTCTTCAGAGCATTGATAATCCCAGGGGGATCGGGATAATAG
GAAATGGAATAAATAGTACTGAGATTAGAAACCTTGGAGATTGCCAAACAACAAGTGGCCACAGGAAAGG
AGGGCACCTACTTTTTTTTTTCTTAATGTGAAGTTGGCATGTACATTAATATGTATTTAACGTCATAGGT
ATAAATTAATTGCTTGATAAAAGACGAATCACACATTT
>NV2t011441001.1 gene=NV2g011441000.1 CDS=473-1333
GAACAAAACAAAATGGCCGTGGCTGTGTCAAAAACAGAATCATTCACAAGTGATGAGATTTATGGATAGT
GACATCAAAAGAGCGGGGCCGCATTTCTGCTCTGAATGCAATTTTCCTCTTGGAAATTTCGCACCATCCT
TGCAAAATCTCGCACAAATTGTCATAGAAAAATACCGCTGTAAACAAGGGTCAATAACTGTAATTATTGA
CGGCCGTGTAGTTTTTCGGAGCATTCAAAATCCAGTACCAATCCACTTGAGTGCAATAAATATTCCTGTA
AGAAAGCACCAGTGGTAGAAAGACCGTAAAACACAAAGAAAGAGTATCCAGAAACTGTGCACGCGACGTG
CAAGGCTTGAAGAGGAGGTAATCTGCACAAGTCGCTAGCGGAGACACAGTGAACACCAACACCAGCAGCG
CGCAACGTCAACCGGCAAACTGGATCCTCAGTACAGAAACCTCGAGATAGAGATGATGGAGCACACGGGG
GTGCCTCCAGCGGCCATGCAAGACCCATCACAAAACCCGCACGAGCTCAAGAAATCCAAGGACAAGGAGA
AAGCGTATCGCCGGAGCTACACGCACGCCAAGCCGCCATATTCATATATCTCACTCATCACGATGGCCAT
TCAACAGAGCCCAAACAAGATGCTCACACTGAGCGAGATCTACCAATTCATCATGGACTTGTTTCCCTAC
TACAGGCAAAACCAACAGCGCTGGCAGAACTCTATCCGGCACAGTTTATCATTCAATGATTGCTTCGTGA
AGGTGCCGCGCTCTCCTGACCGCCCCGGGAAAGGCAGTTACTGGACTCTCCACCCGGACTGCGGTAATAT
GTTCGAGAACGGGTGCTACCTTCGCAGGCAGAAGCGCTTCAAAGCCGAGAAAAAACCGGACCTGAGTCAC
CTTAGCAAGGTGAGCAGTATGACACACAACCCGGTCACAGTACAAAGCATGGCGAAGAGCATGGCGGCAC
AACCTCGCTCAATGGGAACCCCTAGCTTCCTTGCACCGTCTCCGTACGGGACGGCTATGGGCATGGGACA
CGTTGGGAGCATGACCGCCATGGGTATGGCAGGTATGCCCATGAATAAGTCGTTTAATCACCCATTCGCT
ATCAAGAATATCATCGCGCAAGATCACGAAGCTGAGCTTCGAGGCTACGACCCCATGCACTTCAGTCCTT
ATCATCCATCACTCCAATCCATGGGTTCCCTAGGACTCCCTAAATCCGCCTACGAATCGCAACCTATCAC
GACGGATACGAGTCCGTACTACCAGGGTTGCGTCTTCACGCCGTCGAGTTGTGGTATATCAAATCTTTCG
TGAACTTAGCGCAGCACTTAGAACACTAGAAGTAGAAATTATTCTAAGAAAAGCTGAATATTATGATAAA
CCTGTATATATAAACATTGAGGCACTGAAAATTGAGCTATCCGACGAATATCTCGTCTATGTACAATGTG
CTTGAGCTTTCGTATGATTATATTGATCATTTTTTTACTACAAGCACAACTAGATTTCGTAAAGAGCTAG
AAAATTTAATAATTTTATGAACAATCAAAAATGTAAAATAGAAGATGAAAATGTTAGGCTAGGAAAAACA
AAAATGATTAGAGTATTTTGAAGAAAGCAAATCTTCCGAAATCAAAATATTTTGAAATTGAATTAACAGA
CACAAGTGATTATTAGCTAAGAAAATGTTTGTAAAGATTCCTATATACACATATATATTGTTTCGCAAAC
TTGTTCTGTTGAAGCTGACCATTAGATGAAAGAGCGTGCAGTTAATATTTGGCTTTTCGACAGTCGCTTT
CTGTATTCGGGGCATAACATTGACTGTTGCTCAGTTTGTTCTCGTGTCAACTCTGTATTTATCATTGCTT
CACTTTTATTTATACTAGTCAGGCGAGGGAACGGTTTTTATTGTAAATGTACTCTCATGGTCAACGTTTG
TGGATACTTGAAGAAAATAAATTTTTAACGAATC
>NV2t011441002.1 gene=NV2g011441000.1 CDS=1215-2075
GAACAAAACAAAATGGCCGTGGCTGTGTCAAAAACAGAATCATTCACAAGTGATGAGATTTATGGATAGT
GACATCAAAAGAGCGGGGCCGCATTTCTGCTCTGAATGCAATTTTCCTCTTGGAAATTTCGCACCATCCT
TGCAAAATCTCGCACAAATTGTCATAGAAAAATACCGCTGTAAACAAGGGTCAATAACTGTAATTATTGA
CGGCCGTGTAGTTTTTCGGAGCATTCAAAATCCAGTACCAATCCACTTGAGTGCAATAAATATTCCTGTA
AGGTAAGAACTAGCCCTTAAAATACATCGCCGTGATTAAACGTAATTCGGGGGAAAAGTTCTATTTTTTT
TAATTTTGTTTTATTCAAAATTAAGATATTGTTTAAAAAAAAGAGTAGTTGTAGATACATTGGGATACAA
TGATAATTACGTAGGAATTTAAGTACTTGTCTTCCTGTTCCAAACTAAGTTTTAAGATATACACTCACAG
ATATAATTTTATCAAAACTGCTAGGAATTCGACAAGCTTCTAGCATTCGAACGGTATTGCTCGGTTTAAC
ACTTAAACTCGTTAAAAGTAATAACTAAAGTGTTTCTCTTTAAACACAATCTCGATGAAGCGCAAATCTG
CAAATATTTATGTCGCACCATGCGTCAGTTTGTTTACAAGAACAAATGGCGGGGTTTGAAGCACGCCGTG
TGTTATTGCGAAGAGCGAGTGTTTGTCCTTCGCTTATCGTGTGTTTATATGATGGGTCAATATTTATCTG
AGCAAAGGCTATCATATCTGAACGGCATAACTGATCGGGGTCAGTATTCATCGCCCTTCCACGGAAGCAC
AGCTCTAGCAAGTGTGGCCAAAAACTTTTAGGGGATTATAGAACATGTCTTCTTCCTGGAAAAATGGCGC
ACAGTACCTCGAAAAGAGGAAGAATTGTTGGGTACACAGTTGGAATGTCATTCGAACACTGCGAACAATA
CTTTTTTTTATATCAAGATGATCAAGAGGTAGAAAGCAAAATTGAAAGCACCAGTGGTAGAAAGACCGTA
AAACACAAAGAAAGAGTATCCAGAAACTGTGCACGCGACGTGCAAGGCTTGAAGAGGAGGTAATCTGCAC
AAGTCGCTAGCGGAGACACAGTGAACACCAACACCAGCAGCGCGCAACGTCAACCGGCAAACTGGATCCT
CAGTACAGAAACCTCGAGATAGAGATGATGGAGCACACGGGGGTGCCTCCAGCGGCCATGCAAGACCCAT
CACAAAACCCGCACGAGCTCAAGAAATCCAAGGACAAGGAGAAAGCGTATCGCCGGAGCTACACGCACGC
CAAGCCGCCATATTCATATATCTCACTCATCACGATGGCCATTCAACAGAGCCCAAACAAGATGCTCACA
CTGAGCGAGATCTACCAATTCATCATGGACTTGTTTCCCTACTACAGGCAAAACCAACAGCGCTGGCAGA
ACTCTATCCGGCACAGTTTATCATTCAATGATTGCTTCGTGAAGGTGCCGCGCTCTCCTGACCGCCCCGG
GAAAGGCAGTTACTGGACTCTCCACCCGGACTGCGGTAATATGTTCGAGAACGGGTGCTACCTTCGCAGG
CAGAAGCGCTTCAAAGCCGAGAAAAAACCGGACCTGAGTCACCTTAGCAAGGTGAGCAGTATGACACACA
ACCCGGTCACAGTACAAAGCATGGCGAAGAGCATGGCGGCACAACCTCGCTCAATGGGAACCCCTAGCTT
CCTTGCACCGTCTCCGTACGGGACGGCTATGGGCATGGGACACGTTGGGAGCATGACCGCCATGGGTATG
GCAGGTATGCCCATGAATAAGTCGTTTAATCACCCATTCGCTATCAAGAATATCATCGCGCAAGATCACG
AAGCTGAGCTTCGAGGCTACGACCCCATGCACTTCAGTCCTTATCATCCATCACTCCAATCCATGGGTTC
CCTAGGACTCCCTAAATCCGCCTACGAATCGCAACCTATCACGACGGATACGAGTCCGTACTACCAGGGT
TGCGTCTTCACGCCGTCGAGTTGTGGTATATCAAATCTTTCGTGAACTTAGCGCAGCACTTAGAACACTA
GAAGTAGAAATTATTCTAAGAAAAGCTGAATATTATGATAAACCTGTATATATAAACATTGAGGCACTGA
AAATTGAGCTATCCGACGAATATCTCGTCTATGTACAATGTGCTTGAGCTTTCGTATGATTATATTGATC
ATTTTTTTACTACAAGCACAACTAGATTTCGTAAAGAGCTAGAAAATTTAATAATTTTATGAACAATCAA
AAATGTAAAATAGAAGATGAAAATGTTAGGCTAGGAAAAACAAAAATGATTAGAGTATTTTGAAGAAAGC
AAATCTTCCGAAATCAAAATATTTTGAAATTGAATTAACAGACACAAGTGATTATTAGCTAAGAAAATGT
TTGTAAAGATTCCTATATACACATATATATTGTTTCGCAAACTTGTTCTGTTGAAGCTGACCATTAGATG
AAAGAGCGTGCAGTTAATATTTGGCTTTTCGACAGTCGCTTTCTGTATTCGGGGCATAACATTGACTGTT
GCTCAGTTTGTTCTCGTGTCAACTCTGTATTTATCATTGCTTCACTTTTATTTATACTAGTCAGGCGAGG
GAACGGTTTTTATTGTAAATGTACTCTCATGGTCAACGTTTGTGGATACTTGAAGAAAATAAATTTTTAA
CGAATC
>NV2t011441003.1 gene=NV2g011441000.1 CDS=283-1143
GGACTGCGAGCAAGCATGTGTTCTATATAAACGAGGTAGTGATCTATCTTAGTTTATGTAGGGAATAGAA
ACCTACTACATTACCCTTCCAGCACAACGTGCCTATTTGTTCAACGCTGACTCCGTCATCTAGGCAATAA
CAGTAACAAGGTCATTATCTTCCTTCACGAGAAGGAGGTAATCTGCACAAGTCGCTAGCGGAGACACAGT
GAACACCAACACCAGCAGCGCGCAACGTCAACCGGCAAACTGGATCCTCAGTACAGAAACCTCGAGATAG
AGATGATGGAGCACACGGGGGTGCCTCCAGCGGCCATGCAAGACCCATCACAAAACCCGCACGAGCTCAA
GAAATCCAAGGACAAGGAGAAAGCGTATCGCCGGAGCTACACGCACGCCAAGCCGCCATATTCATATATC
TCACTCATCACGATGGCCATTCAACAGAGCCCAAACAAGATGCTCACACTGAGCGAGATCTACCAATTCA
TCATGGACTTGTTTCCCTACTACAGGCAAAACCAACAGCGCTGGCAGAACTCTATCCGGCACAGTTTATC
ATTCAATGATTGCTTCGTGAAGGTGCCGCGCTCTCCTGACCGCCCCGGGAAAGGCAGTTACTGGACTCTC
CACCCGGACTGCGGTAATATGTTCGAGAACGGGTGCTACCTTCGCAGGCAGAAGCGCTTCAAAGCCGAGA
AAAAACCGGACCTGAGTCACCTTAGCAAGGTGAGCAGTATGACACACAACCCGGTCACAGTACAAAGCAT
GGCGAAGAGCATGGCGGCACAACCTCGCTCAATGGGAACCCCTAGCTTCCTTGCACCGTCTCCGTACGGG
ACGGCTATGGGCATGGGACACGTTGGGAGCATGACCGCCATGGGTATGGCAGGTATGCCCATGAATAAGT
CGTTTAATCACCCATTCGCTATCAAGAATATCATCGCGCAAGATCACGAAGCTGAGCTTCGAGGCTACGA
CCCCATGCACTTCAGTCCTTATCATCCATCACTCCAATCCATGGGTTCCCTAGGACTCCCTAAATCCGCC
TACGAATCGCAACCTATCACGACGGATACGAGTCCGTACTACCAGGGTTGCGTCTTCACGCCGTCGAGTT
GTGGTATATCAAATCTTTCGTGAACTTAGCGCAGCACTTAGAACACTAGAAGTAGAAATTATTCTAAGAA
AAGCTGAATATTATGATAAACCTGTATATATAAACATTGAGGCACTGAAAATTGAGCTATCCGACGAATA
TCTCGTCTATGTACAATGTGCTTGAGCTTTCGTATGATTATATTGATCATTTTTTTACTACAAGCACAAC
TAGATTTCGTAAAGAGCTAGAAAATTTAATAATTTTATGAACAATCAAAAATGTAAAATAGAAGATGAAA
ATGTTAGGCTAGGAAAAACAAAAATGATTAGAGTATTTTGAAGAAAGCAAATCTTCCGAAATCAAAATAT
TTTGAAATTGAATTAACAGACACAAGTGATTATTAGCTAAGAAAATGTTTGTAAAGATTCCTATATACAC
ATATATATTGTTTCGCAAACTTGTTCTGTTGAAGCTGACCATTAGATGAAAGAGCGTGCAGTTAATATTT
GGCTTTTCGACAGTCGCTTTCTGTATTCGGGGCATAACATTGACTGTTGCTCAGTTTGTTCTCGTGTCAA
CTCTGTATTTATCATTGCTTCACTTTTATTTATACTAGTCAGGCGAGGGAACGGTTTTTATTGTAAATGT
ACTCTCATGGTCAACGTTTGTGGATACTTGAAGAAAATAAATTTTTAACGAATC
Since NV2g011441000.1 has multiple transcripts, we need the region shared by all isoforms, so that the probes detect the gene regardless of which isoform is expressed (i.e. gene-level rather than isoform-level probes).
Find what differs between the transcripts (from the GTF)
gene="NV2g011441000.1"
# Note: the 3-argument match() used below is a gawk extension
awk -v g="$gene" '
$3=="exon" && $0 ~ "gene_id \"" g "\"" {
  if (match($0, /transcript_id "([^"]+)"/, m)) tx=m[1]; else next
  # columns: transcript, chromosome, exon start, exon end, strand
  print tx "\t" $1 "\t" $4 "\t" $5 "\t" $7
}
' NV2g.20240221.gtf | sort -k1,1 -k3,3n > "${gene}.exons.tsv"
column -t "${gene}.exons.tsv" | head -n 50
output
NV2t011441001.1 chr2 7110167 7110448 +
NV2t011441001.1 chr2 7114469 7114550 +
NV2t011441001.1 chr2 7131875 7133504 +
NV2t011441002.1 chr2 7110167 7111190 +
NV2t011441002.1 chr2 7114469 7114550 +
NV2t011441002.1 chr2 7131875 7133504 +
NV2t011441003.1 chr2 7127953 7128126 +
NV2t011441003.1 chr2 7131875 7133504 +
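The exon table also gives exon and transcript lengths, which are handy for cross-checking against the FASTA records. GTF coordinates are 1-based and inclusive, so an exon's length is end - start + 1. A minimal sketch, assuming the tab-separated five-column layout written to the `.exons.tsv` file above; `exon_lengths` is a hypothetical helper name.

```shell
# exon_lengths EXONS_TSV
# Input columns: transcript, chrom, start, end, strand (1-based, inclusive).
# Prints every exon with its length, then one TOTAL line per transcript.
exon_lengths() {
  awk -F'\t' '
    {
      len = $4 - $3 + 1                      # inclusive coordinates
      total[$1] += len
      if (!($1 in seen)) { seen[$1] = 1; order[++n] = $1 }
      print $1 "\t" $3 "-" $4 "\t" len
    }
    END {
      for (i = 1; i <= n; i++) print order[i] "\tTOTAL\t" total[order[i]]
    }
  ' "$1"
}
```

For the table above this gives totals of 1994 nt (NV2t011441001.1), 2736 nt (NV2t011441002.1) and 1804 nt (NV2t011441003.1); the exon present in all three rows, chr2:7131875-7133504, is 1630 nt long.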
Identify exons shared by all isoforms vs unique
gene="NV2g011441000.1"
{
  # write the header first so that sort cannot move it to the bottom
  printf "EXON_INTERVAL\tN_TX\n"
  awk -v g="$gene" '
  $3=="exon" && $0 ~ "gene_id \"" g "\"" {
    # 3-argument match() is a gawk extension
    if (match($0, /transcript_id "([^"]+)"/, m)) tx=m[1]; else next
    key=$1":"$4"-"$5":"$7
    seen[tx,key]=1
    txs[tx]=1
  }
  END{
    # count transcripts
    for (t in txs) ntx++
    # count how many transcripts contain each exon interval
    for (k in seen) {
      split(k, a, SUBSEP); exon=a[2]
      count[exon]++
    }
    print "Total transcripts:", ntx > "/dev/stderr"
    for (e in count) print e "\t" count[e]
  }
  ' NV2g.20240221.gtf | sort -k2,2nr
} > "${gene}.exon_sharedness.tsv"
head "${gene}.exon_sharedness.tsv"
output
less NV2g011441000.1.exon_sharedness.tsv
EXON_INTERVAL N_TX
chr2:7131875-7133504:+ 3
chr2:7114469-7114550:+ 2
chr2:7110167-7110448:+ 1
chr2:7110167-7111190:+ 1
chr2:7127953-7128126:+ 1
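Interpretation:
- N_TX = 3 → exon interval appears in all 3 transcripts (great for gene-level probe)
- N_TX = 2 → exon interval is shared by two of the three transcripts (here NV2t011441001.1 and NV2t011441002.1)
- N_TX = 1 → exon interval is isoform-unique (candidate for isoform-specific probe)
- Caveat: intervals are compared as exact start-end pairs, so chr2:7110167-7110448 is counted as unique even though it lies entirely inside chr2:7110167-7111190. Only the part of the longer exon beyond 7110448 is truly specific to NV2t011441002.1.
Therefore, the chr2:7131875-7133504:+ exon (1630 nt) is the only region common to all three transcripts and should be used for probe design.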
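To see where that shared exon sits inside each transcript, the genomic interval can be converted to transcript coordinates by summing the lengths of the exons upstream of it. This is a minimal sketch, assuming a plus-strand gene and the five-column `.exons.tsv` layout from the earlier step, already sorted by transcript and exon start; `exon_tx_coords` is a hypothetical helper name.

```shell
# exon_tx_coords EXONS_TSV START END
# For a + strand gene, prints: transcript, first and last transcript
# position (1-based) covered by the exon with genomic coordinates START-END.
# Rows must be sorted by transcript, then by exon start.
exon_tx_coords() {
  awk -F'\t' -v s="$2" -v e="$3" '
    $5 != "+" { next }                 # this sketch handles + strand only
    {
      len = $4 - $3 + 1
      if ($3 == s && $4 == e) {
        first = offset[$1] + 1
        last  = offset[$1] + len
        printf "%s\t%d\t%d\n", $1, first, last
      }
      offset[$1] += len                # nucleotides of transcript used so far
    }
  ' "$1"
}
```

For the exon table above, `exon_tx_coords NV2g011441000.1.exons.tsv 7131875 7133504` gives 365-1994 for NV2t011441001.1, 1107-2736 for NV2t011441002.1 and 175-1804 for NV2t011441003.1. Comparing with the CDS ranges in the FASTA headers (473-1333, 1215-2075, 283-1143), and assuming those are 1-based transcript coordinates, the coding sequence of every isoform would fall entirely inside the shared exon.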
Extract the common sequence directly from the transcript FASTA
Isolate just the transcripts for NV2g011441000.1:
awk '
BEGIN{RS=">"; ORS=""}
# same boundary check as in the first extraction, so longer gene IDs cannot match
NR>1 && $0 ~ /gene=NV2g011441000\.1([[:space:]]|$)/ { print ">" $0 }
' NV2g.20240221.transcripts.fa \
> NV2g011441000.1.isoforms.fa
grep "^>" NV2g011441000.1.isoforms.fa
>NV2t011441001.1 gene=NV2g011441000.1 CDS=473-1333
>NV2t011441002.1 gene=NV2g011441000.1 CDS=1215-2075
>NV2t011441003.1 gene=NV2g011441000.1 CDS=283-1143
Convert FASTA → one-line sequences
awk '
/^>/ {if (seq) print seq; print; seq=""; next}
{seq=seq$0}
END{print seq}
' NV2g011441000.1.isoforms.fa \
> NV2g011441000.1.isoforms.oneline.fa
less NV2g011441000.1.isoforms.oneline.fa
>NV2t011441001.1 gene=NV2g011441000.1 CDS=473-1333
GAACAAAACAAAATGGCCGTGGCTGTGTCAAAAACAGAATCATTCACAAGTGATGAGATTTATGGATAGTGACATCAAAAGAGCGGGGCCGCATTTCTGCTCTGAATGCAATTTTCCTCTTGGAAATTTCGCACCATCCTTGCAAAATCTCGCACAAATTGTCATAGAAAAATACCGCTGTAAACAAGGGTCAATAACTGTAATTATTGACGGCCGTGTAGTTTTTCGGAGCATTCAAAATCCAGTACCAATCCACTTGAGTGCAATAAATATTCCTGTAAGAAAGCACCAGTGGTAGAAAGACCGTAAAACACAAAGAAAGAGTATCCAGAAACTGTGCACGCGACGTGCAAGGCTTGAAGAGGAGGTAATCTGCACAAGTCGCTAGCGGAGACACAGTGAACACCAACACCAGCAGCGCGCAACGTCAACCGGCAAACTGGATCCTCAGTACAGAAACCTCGAGATAGAGATGATGGAGCACACGGGGGTGCCTCCAGCGGCCATGCAAGACCCATCACAAAACCCGCACGAGCTCAAGAAATCCAAGGACAAGGAGAAAGCGTATCGCCGGAGCTACACGCACGCCAAGCCGCCATATTCATATATCTCACTCATCACGATGGCCATTCAACAGAGCCCAAACAAGATGCTCACACTGAGCGAGATCTACCAATTCATCATGGACTTGTTTCCCTACTACAGGCAAAACCAACAGCGCTGGCAGAACTCTATCCGGCACAGTTTATCATTCAATGATTGCTTCGTGAAGGTGCCGCGCTCTCCTGACCGCCCCGGGAAAGGCAGTTACTGGACTCTCCACCCGGACTGCGGTAATATGTTCGAGAACGGGTGCTACCTTCGCAGGCAGAAGCGCTTCAAAGCCGAGAAAAAACCGGACCTGAGTCACCTTAGCAAGGTGAGCAGTATGACACACAACCCGGTCACAGTACAAAGCATGGCGAAGAGCATGGCGGCACAACCTCGCTCAATGGGAACCCCTAGCTTCCTTGCACCGTCTCCGTACGGGACGGCTATGGGCATGGGACACGTTGGGAGCATGACCGCCATGGGTATGGCAGGTATGCCCATGAATAAGTCGTTTAATCACCCATTCGCTATCAAGAATATCATCGCGCAAGATCACGAAGCTGAGCTTCGAGGCTACGACCCCATGCACTTCAGTCCTTATCATCCATCACTCCAATCCATGGGTTCCCTAGGACTCCCTAAATCCGCCTACGAATCGCAACCTATCACGACGGATACGAGTCCGTACTACCAGGGTTGCGTCTTCACGCCGTCGAGTTGTGGTATATCAAATCTTTCGTGAACTTAGCGCAGCACTTAGAACACTAGAAGTAGAAATTATTCTAAGAAAAGCTGAATATTATGATAAACCTGTATATATAAACATTGAGGCACTGAAAATTGAGCTATCCGACGAATATCTCGTCTATGTACAATGTGCTTGAGCTTTCGTATGATTATATTGATCATTTTTTTACTACAAGCACAACTAGATTTCGTAAAGAGCTAGAAAATTTAATAATTTTATGAACAATCAAAAATGTAAAATAGAAGATGAAAATGTTAGGCTAGGAAAAACAAAAATGATTAGAGTATTTTGAAGAAAGCAAATCTTCCGAAATCAAAATATTTTGAAATTGAATTAACAGACACAAGTGATTATTAGCTAAGAAAATGTTTGTAAAGATTCCTATATACACATATATATTGTTTCGCAAACTTGTTCTGTTGAAGCTGACCATTAGATGAAAGAGCGTGCAGTTAATATTTGGCTTTTCGACAGTCGCTTTCTGTATTCGGGGCATAACATTGACTGTTGCTCAGTTTGTTCTCGTGTCAACTCTGTATTTATCATTGCTTCACTTTTATTTATACTAGTCAGGCGAGGGAACGGTTTTTATTGTAAATGTACTCTCATGGTCAACGTTTGTGGATACTTGAAGAAAATAAATTTTTAACGAATC
>NV2t011441002.1 gene=NV2g011441000.1 CDS=1215-2075
GAACAAAACAAAATGGCCGTGGCTGTGTCAAAAACAGAATCATTCACAAGTGATGAGATTTATGGATAGTGACATCAAAAGAGCGGGGCCGCATTTCTGCTCTGAATGCAATTTTCCTCTTGGAAATTTCGCACCATCCTTGCAAAATCTCGCACAAATTGTCATAGAAAAATACCGCTGTAAACAAGGGTCAATAACTGTAATTATTGACGGCCGTGTAGTTTTTCGGAGCATTCAAAATCCAGTACCAATCCACTTGAGTGCAATAAATATTCCTGTAAGGTAAGAACTAGCCCTTAAAATACATCGCCGTGATTAAACGTAATTCGGGGGAAAAGTTCTATTTTTTTTAATTTTGTTTTATTCAAAATTAAGATATTGTTTAAAAAAAAGAGTAGTTGTAGATACATTGGGATACAATGATAATTACGTAGGAATTTAAGTACTTGTCTTCCTGTTCCAAACTAAGTTTTAAGATATACACTCACAGATATAATTTTATCAAAACTGCTAGGAATTCGACAAGCTTCTAGCATTCGAACGGTATTGCTCGGTTTAACACTTAAACTCGTTAAAAGTAATAACTAAAGTGTTTCTCTTTAAACACAATCTCGATGAAGCGCAAATCTGCAAATATTTATGTCGCACCATGCGTCAGTTTGTTTACAAGAACAAATGGCGGGGTTTGAAGCACGCCGTGTGTTATTGCGAAGAGCGAGTGTTTGTCCTTCGCTTATCGTGTGTTTATATGATGGGTCAATATTTATCTGAGCAAAGGCTATCATATCTGAACGGCATAACTGATCGGGGTCAGTATTCATCGCCCTTCCACGGAAGCACAGCTCTAGCAAGTGTGGCCAAAAACTTTTAGGGGATTATAGAACATGTCTTCTTCCTGGAAAAATGGCGCACAGTACCTCGAAAAGAGGAAGAATTGTTGGGTACACAGTTGGAATGTCATTCGAACACTGCGAACAATACTTTTTTTTATATCAAGATGATCAAGAGGTAGAAAGCAAAATTGAAAGCACCAGTGGTAGAAAGACCGTAAAACACAAAGAAAGAGTATCCAGAAACTGTGCACGCGACGTGCAAGGCTTGAAGAGGAGGTAATCTGCACAAGTCGCTAGCGGAGACACAGTGAACACCAACACCAGCAGCGCGCAACGTCAACCGGCAAACTGGATCCTCAGTACAGAAACCTCGAGATAGAGATGATGGAGCACACGGGGGTGCCTCCAGCGGCCATGCAAGACCCATCACAAAACCCGCACGAGCTCAAGAAATCCAAGGACAAGGAGAAAGCGTATCGCCGGAGCTACACGCACGCCAAGCCGCCATATTCATATATCTCACTCATCACGATGGCCATTCAACAGAGCCCAAACAAGATGCTCACACTGAGCGAGATCTACCAATTCATCATGGACTTGTTTCCCTACTACAGGCAAAACCAACAGCGCTGGCAGAACTCTATCCGGCACAGTTTATCATTCAATGATTGCTTCGTGAAGGTGCCGCGCTCTCCTGACCGCCCCGGGAAAGGCAGTTACTGGACTCTCCACCCGGACTGCGGTAATATGTTCGAGAACGGGTGCTACCTTCGCAGGCAGAAGCGCTTCAAAGCCGAGAAAAAACCGGACCTGAGTCACCTTAGCAAGGTGAGCAGTATGACACACAACCCGGTCACAGTACAAAGCATGGCGAAGAGCATGGCGGCACAACCTCGCTCAATGGGAACCCCTAGCTTCCTTGCACCGTCTCCGTACGGGACGGCTATGGGCATGGGACACGTTGGGAGCATGACCGCCATGGGTATGGCAGGTATGCCCATGAATAAGTCGTTTAATCACCCATTCGCTATCAAGAATATCATCGCGCAAGATCACGAAGCTGAGCTTCGAGGCTACGACCCCATGCACTTCAGTCCTTATCATCCATCACTCCAATCCATGGGTTCCCTAGGACTCCCTAAATCCGCCTACGAATCGCAACCTATC
ACGACGGATACGAGTCCGTACTACCAGGGTTGCGTCTTCACGCCGTCGAGTTGTGGTATATCAAATCTTTCGTGAACTTAGCGCAGCACTTAGAACACTAGAAGTAGAAATTATTCTAAGAAAAGCTGAATATTATGATAAACCTGTATATATAAACATTGAGGCACTGAAAATTGAGCTATCCGACGAATATCTCGTCTATGTACAATGTGCTTGAGCTTTCGTATGATTATATTGATCATTTTTTTACTACAAGCACAACTAGATTTCGTAAAGAGCTAGAAAATTTAATAATTTTATGAACAATCAAAAATGTAAAATAGAAGATGAAAATGTTAGGCTAGGAAAAACAAAAATGATTAGAGTATTTTGAAGAAAGCAAATCTTCCGAAATCAAAATATTTTGAAATTGAATTAACAGACACAAGTGATTATTAGCTAAGAAAATGTTTGTAAAGATTCCTATATACACATATATATTGTTTCGCAAACTTGTTCTGTTGAAGCTGACCATTAGATGAAAGAGCGTGCAGTTAATATTTGGCTTTTCGACAGTCGCTTTCTGTATTCGGGGCATAACATTGACTGTTGCTCAGTTTGTTCTCGTGTCAACTCTGTATTTATCATTGCTTCACTTTTATTTATACTAGTCAGGCGAGGGAACGGTTTTTATTGTAAATGTACTCTCATGGTCAACGTTTGTGGATACTTGAAGAAAATAAATTTTTAACGAATC
>NV2t011441003.1 gene=NV2g011441000.1 CDS=283-1143
GGACTGCGAGCAAGCATGTGTTCTATATAAACGAGGTAGTGATCTATCTTAGTTTATGTAGGGAATAGAAACCTACTACATTACCCTTCCAGCACAACGTGCCTATTTGTTCAACGCTGACTCCGTCATCTAGGCAATAACAGTAACAAGGTCATTATCTTCCTTCACGAGAAGGAGGTAATCTGCACAAGTCGCTAGCGGAGACACAGTGAACACCAACACCAGCAGCGCGCAACGTCAACCGGCAAACTGGATCCTCAGTACAGAAACCTCGAGATAGAGATGATGGAGCACACGGGGGTGCCTCCAGCGGCCATGCAAGACCCATCACAAAACCCGCACGAGCTCAAGAAATCCAAGGACAAGGAGAAAGCGTATCGCCGGAGCTACACGCACGCCAAGCCGCCATATTCATATATCTCACTCATCACGATGGCCATTCAACAGAGCCCAAACAAGATGCTCACACTGAGCGAGATCTACCAATTCATCATGGACTTGTTTCCCTACTACAGGCAAAACCAACAGCGCTGGCAGAACTCTATCCGGCACAGTTTATCATTCAATGATTGCTTCGTGAAGGTGCCGCGCTCTCCTGACCGCCCCGGGAAAGGCAGTTACTGGACTCTCCACCCGGACTGCGGTAATATGTTCGAGAACGGGTGCTACCTTCGCAGGCAGAAGCGCTTCAAAGCCGAGAAAAAACCGGACCTGAGTCACCTTAGCAAGGTGAGCAGTATGACACACAACCCGGTCACAGTACAAAGCATGGCGAAGAGCATGGCGGCACAACCTCGCTCAATGGGAACCCCTAGCTTCCTTGCACCGTCTCCGTACGGGACGGCTATGGGCATGGGACACGTTGGGAGCATGACCGCCATGGGTATGGCAGGTATGCCCATGAATAAGTCGTTTAATCACCCATTCGCTATCAAGAATATCATCGCGCAAGATCACGAAGCTGAGCTTCGAGGCTACGACCCCATGCACTTCAGTCCTTATCATCCATCACTCCAATCCATGGGTTCCCTAGGACTCCCTAAATCCGCCTACGAATCGCAACCTATCACGACGGATACGAGTCCGTACTACCAGGGTTGCGTCTTCACGCCGTCGAGTTGTGGTATATCAAATCTTTCGTGAACTTAGCGCAGCACTTAGAACACTAGAAGTAGAAATTATTCTAAGAAAAGCTGAATATTATGATAAACCTGTATATATAAACATTGAGGCACTGAAAATTGAGCTATCCGACGAATATCTCGTCTATGTACAATGTGCTTGAGCTTTCGTATGATTATATTGATCATTTTTTTACTACAAGCACAACTAGATTTCGTAAAGAGCTAGAAAATTTAATAATTTTATGAACAATCAAAAATGTAAAATAGAAGATGAAAATGTTAGGCTAGGAAAAACAAAAATGATTAGAGTATTTTGAAGAAAGCAAATCTTCCGAAATCAAAATATTTTGAAATTGAATTAACAGACACAAGTGATTATTAGCTAAGAAAATGTTTGTAAAGATTCCTATATACACATATATATTGTTTCGCAAACTTGTTCTGTTGAAGCTGACCATTAGATGAAAGAGCGTGCAGTTAATATTTGGCTTTTCGACAGTCGCTTTCTGTATTCGGGGCATAACATTGACTGTTGCTCAGTTTGTTCTCGTGTCAACTCTGTATTTATCATTGCTTCACTTTTATTTATACTAGTCAGGCGAGGGAACGGTTTTTATTGTAAATGTACTCTCATGGTCAACGTTTGTGGATACTTGAAGAAAATAAATTTTTAACGAATC
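Before comparing the isoforms, it is worth confirming that each record has the length expected from its exons. A minimal sketch that reports the sequence length per FASTA record; it works on wrapped or one-line FASTA, and `seq_lengths` is a hypothetical helper name.

```shell
# seq_lengths FASTA
# Prints "record_id<TAB>sequence_length" for every record.
# The record ID is the first word of the header, without the ">".
seq_lengths() {
  awk '
    /^>/ {
      if (id != "") print id "\t" len   # flush the previous record
      id = substr($1, 2)
      len = 0
      next
    }
    { len += length($0) }               # sum all sequence lines of the record
    END { if (id != "") print id "\t" len }
  ' "$1"
}
```

Usage: `seq_lengths NV2g011441000.1.isoforms.oneline.fa`. If the FASTA and the GTF agree, each length should equal the sum of that transcript's exon lengths.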
Compute the longest common shared sequence
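- This finds the longest contiguous stretch of nucleotides present in all three isoforms (the longest common substring), using a brute-force search over the shortest isoform.
awk '
/^>/ {
  if (seq != "") seqs[++n] = seq
  seq = ""
  next
}
{ seq = seq $0 }   # append, so a wrapped sequence is not truncated to its last line
END {
  if (seq != "") seqs[++n] = seq
  # use the shortest sequence as the reference
  min = 1
  for (i = 2; i <= n; i++)
    if (length(seqs[i]) < length(seqs[min])) min = i
  ref = seqs[min]
  best = ""
  for (i = 1; i <= length(ref); i++) {
    # only test fragments longer than the current best
    for (j = i + length(best); j <= length(ref); j++) {
      frag = substr(ref, i, j - i + 1)
      ok = 1
      for (k = 1; k <= n; k++) {
        if (index(seqs[k], frag) == 0) { ok = 0; break }
      }
      if (!ok) break   # a longer fragment from this start cannot match either
      best = frag
    }
  }
  print ">NV2g011441000.1_COMMON_SEQUENCE"
  print best
}
' NV2g011441000.1.isoforms.oneline.fa \
> NV2g011441000.1.common.fa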
Sanity check
grep "^>" NV2g011441000.1.common.fa
>NV2g011441000.1_COMMON_SEQUENCE
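grep -v "^>" NV2g011441000.1.common.fa | wc -c
1633
Note: wc -c also counts the trailing newline, so the sequence itself is 1632 nt. That is the 1630-nt shared exon (chr2:7131875-7133504) plus the 2 nt (AG) immediately upstream of it, which happen to be identical at the end of the preceding exon in every isoform.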
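A stricter sanity check is to confirm that the common sequence really occurs in every isoform, and where. A minimal sketch, assuming the common sequence is the only record in its FASTA file; `locate_common` is a hypothetical helper name.

```shell
# locate_common COMMON_FA ISOFORMS_FA
# Prints the length of the common sequence (newlines excluded), then for
# each isoform the 1-based position where the common sequence starts
# (0 means it was not found in that isoform).
locate_common() {
  awk '
    FNR == 1 { file++ }                          # 1 = common file, 2 = isoforms
    /^>/ { if (file == 2) id = substr($1, 2); next }
    file == 1 { common = common $0; next }
    {
      if (!(id in seq)) order[++n] = id          # remember input order
      seq[id] = seq[id] $0
    }
    END {
      print "common_length\t" length(common)
      for (i = 1; i <= n; i++)
        print order[i] "\t" index(seq[order[i]], common)
    }
  ' "$1" "$2"
}
```

Usage: `locate_common NV2g011441000.1.common.fa NV2g011441000.1.isoforms.fa`. Unlike `wc -c`, the reported length does not include the newline.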
less NV2g011441000.1.common.fa
>NV2g011441000.1_COMMON_SEQUENCE
AGGAGGTAATCTGCACAAGTCGCTAGCGGAGACACAGTGAACACCAACACCAGCAGCGCGCAACGTCAACCGGCAAACTGGATCCTCAGTACAGAAACCTCGAGATAGAGATGATGGAGCACACGGGGGTGCCTCCAGCGGCCATGCAAGACCCATCACAAAACCCGCACGAGCTCAAGAAATCCAAGGACAAGGAGAAAGCGTATCGCCGGAGCTACACGCACGCCAAGCCGCCATATTCATATATCTCACTCATCACGATGGCCATTCAACAGAGCCCAAACAAGATGCTCACACTGAGCGAGATCTACCAATTCATCATGGACTTGTTTCCCTACTACAGGCAAAACCAACAGCGCTGGCAGAACTCTATCCGGCACAGTTTATCATTCAATGATTGCTTCGTGAAGGTGCCGCGCTCTCCTGACCGCCCCGGGAAAGGCAGTTACTGGACTCTCCACCCGGACTGCGGTAATATGTTCGAGAACGGGTGCTACCTTCGCAGGCAGAAGCGCTTCAAAGCCGAGAAAAAACCGGACCTGAGTCACCTTAGCAAGGTGAGCAGTATGACACACAACCCGGTCACAGTACAAAGCATGGCGAAGAGCATGGCGGCACAACCTCGCTCAATGGGAACCCCTAGCTTCCTTGCACCGTCTCCGTACGGGACGGCTATGGGCATGGGACACGTTGGGAGCATGACCGCCATGGGTATGGCAGGTATGCCCATGAATAAGTCGTTTAATCACCCATTCGCTATCAAGAATATCATCGCGCAAGATCACGAAGCTGAGCTTCGAGGCTACGACCCCATGCACTTCAGTCCTTATCATCCATCACTCCAATCCATGGGTTCCCTAGGACTCCCTAAATCCGCCTACGAATCGCAACCTATCACGACGGATACGAGTCCGTACTACCAGGGTTGCGTCTTCACGCCGTCGAGTTGTGGTATATCAAATCTTTCGTGAACTTAGCGCAGCACTTAGAACACTAGAAGTAGAAATTATTCTAAGAAAAGCTGAATATTATGATAAACCTGTATATATAAACATTGAGGCACTGAAAATTGAGCTATCCGACGAATATCTCGTCTATGTACAATGTGCTTGAGCTTTCGTATGATTATATTGATCATTTTTTTACTACAAGCACAACTAGATTTCGTAAAGAGCTAGAAAATTTAATAATTTTATGAACAATCAAAAATGTAAAATAGAAGATGAAAATGTTAGGCTAGGAAAAACAAAAATGATTAGAGTATTTTGAAGAAAGCAAATCTTCCGAAATCAAAATATTTTGAAATTGAATTAACAGACACAAGTGATTATTAGCTAAGAAAATGTTTGTAAAGATTCCTATATACACATATATATTGTTTCGCAAACTTGTTCTGTTGAAGCTGACCATTAGATGAAAGAGCGTGCAGTTAATATTTGGCTTTTCGACAGTCGCTTTCTGTATTCGGGGCATAACATTGACTGTTGCTCAGTTTGTTCTCGTGTCAACTCTGTATTTATCATTGCTTCACTTTTATTTATACTAGTCAGGCGAGGGAACGGTTTTTATTGTAAATGTACTCTCATGGTCAACGTTTGTGGATACTTGAAGAAAATAAATTTTTAACGAATC
In summary, here are the two sequences we will use for HCRs:
- NV2g011441000.1 (FOX1A)
NV2g011441000.1_COMMON_SEQUENCE AGGAGGTAATCTGCACAAGTCGCTAGCGGAGACACAGTGAACACCAACACCAGCAGCGCGCAACGTCAACCGGCAAACTGGATCCTCAGTACAGAAACCTCGAGATAGAGATGATGGAGCACACGGGGGTGCCTCCAGCGGCCATGCAAGACCCATCACAAAACCCGCACGAGCTCAAGAAATCCAAGGACAAGGAGAAAGCGTATCGCCGGAGCTACACGCACGCCAAGCCGCCATATTCATATATCTCACTCATCACGATGGCCATTCAACAGAGCCCAAACAAGATGCTCACACTGAGCGAGATCTACCAATTCATCATGGACTTGTTTCCCTACTACAGGCAAAACCAACAGCGCTGGCAGAACTCTATCCGGCACAGTTTATCATTCAATGATTGCTTCGTGAAGGTGCCGCGCTCTCCTGACCGCCCCGGGAAAGGCAGTTACTGGACTCTCCACCCGGACTGCGGTAATATGTTCGAGAACGGGTGCTACCTTCGCAGGCAGAAGCGCTTCAAAGCCGAGAAAAAACCGGACCTGAGTCACCTTAGCAAGGTGAGCAGTATGACACACAACCCGGTCACAGTACAAAGCATGGCGAAGAGCATGGCGGCACAACCTCGCTCAATGGGAACCCCTAGCTTCCTTGCACCGTCTCCGTACGGGACGGCTATGGGCATGGGACACGTTGGGAGCATGACCGCCATGGGTATGGCAGGTATGCCCATGAATAAGTCGTTTAATCACCCATTCGCTATCAAGAATATCATCGCGCAAGATCACGAAGCTGAGCTTCGAGGCTACGACCCCATGCACTTCAGTCCTTATCATCCATCACTCCAATCCATGGGTTCCCTAGGACTCCCTAAATCCGCCTACGAATCGCAACCTATCACGACGGATACGAGTCCGTACTACCAGGGTTGCGTCTTCACGCCGTCGAGTTGTGGTATATCAAATCTTTCGTGAACTTAGCGCAGCACTTAGAACACTAGAAGTAGAAATTATTCTAAGAAAAGCTGAATATTATGATAAACCTGTATATATAAACATTGAGGCACTGAAAATTGAGCTATCCGACGAATATCTCGTCTATGTACAATGTGCTTGAGCTTTCGTATGATTATATTGATCATTTTTTTACTACAAGCACAACTAGATTTCGTAAAGAGCTAGAAAATTTAATAATTTTATGAACAATCAAAAATGTAAAATAGAAGATGAAAATGTTAGGCTAGGAAAAACAAAAATGATTAGAGTATTTTGAAGAAAGCAAATCTTCCGAAATCAAAATATTTTGAAATTGAATTAACAGACACAAGTGATTATTAGCTAAGAAAATGTTTGTAAAGATTCCTATATACACATATATATTGTTTCGCAAACTTGTTCTGTTGAAGCTGACCATTAGATGAAAGAGCGTGCAGTTAATATTTGGCTTTTCGACAGTCGCTTTCTGTATTCGGGGCATAACATTGACTGTTGCTCAGTTTGTTCTCGTGTCAACTCTGTATTTATCATTGCTTCACTTTTATTTATACTAGTCAGGCGAGGGAACGGTTTTTATTGTAAATGTACTCTCATGGTCAACGTTTGTGGATACTTGAAGAAAATAAATTTTTAACGAATC
- NV2g019682000.1 (FRIS-like-8)
NV2t019682001.1 gene=NV2g019682000.1 CDS=109-627 AGATCCCCGTTTTCTGCTTCAACAGTGATTGAACCGAAACAGGACTCACACGCGAATTTCCAAAACTTTT CAGTTTTACTTTTATCCTTGCCATAAAACTCGTTCAAGATGTCGCTCTCAGTTTGCCGTCAGAACTATCA CGAGGAGTCCGAGGCGGGCGTCAACAAGCAGATCAACCTCGAGTTGTACGCCAGCTACGTCTACATGTCG ATGGCCTACCATTTCGACCGTGATGATGTAGCTTTGCCTGGATTCCACAAGTACTTTATGAAGGCCTCGC ATGAAGAGCGCGAGCATGCCGAGAAGCTTGCCAAGTTCCAGCTGCAACGTGGAGGCCGCATTGTGCTTCA AGACATCAAGCGCCCTGAGCGCGACGACTGGGGTTGTGGACAGGATGCCATTCAGGCAGCTCTTGACCTG GAAAAACATGTCAACCAGGCCTTACTTGATCTGCACAAGGTCGCCGAGAAGCACGGTGACTCTCAGATGC AAGACTGGCTCGAGTCGCATTACCTGACTGAGCAAGTGGAGGCCATCAAGGAGCTTGCTGGTCACTTGAC CAACCTGAAGCGTGTTGGCCCTGGCTTGGGAGAATTCCAGTTCGACAAGCTCACCCTCGACGACTAGAGG GGTGCAGGCTGGACTGAGTCTTGAAACCATGGATTGACCTTTAAACCGAAGTAGATCTTATCAACCCTGA TGTGACACACAGCGCCGCCCTTGTTCACATAAATCGGAATTCATGGCAAGCCTTTGGATAATTCTATCTT TCCACTCCGAGGCACACTTTGCCGTGCCCTGGTCCCTATTTCATCTAATGGTAAAGGGATCAGTGGAGCC GTTTTTTGTTAACAAGGGCATCTTTTTTTCTTTTGTTAGACTTGTTTTCTGTAGTGACTAAATAAAAGCA TTATAAAATCAATGTGATGACTGCTTTATTTTTGTTGAGAAATAAAAACCAGATAACATCCCAACATCCT GCCCCAAGTTTTCAATATAAAGAAGTCTTTCTTCAGAGCATTGATAATCCCAGGGGGATCGGGATAATAG GAAATGGAATAAATAGTACTGAGATTAGAAACCTTGGAGATTGCCAAACAACAAGTGGCCACAGGAAAGG AGGGCACCTACTTTTTTTTTTCTTAATGTGAAGTTGGCATGTACATTAATATGTATTTAACGTCATAGGT ATAAATTAATTGCTTGATAAAAGACGAATCACACATTT
