Obtaining sequences for HCR in Nvec

Objective

I am interested in finding the sequences for these two genes in Nematostella:

  • NV2g011441000.1 (FOX1A)
  • NV2g019682000.1 (FRIS-like-8)

Genomic information

  • I am using the NV2 genome: https://simrbase.stowers.org/starletseaanemone
  • I am using the transcript file from here: https://simrbase.stowers.org/files/pub/nematostella/Nvec/genomes/Nvec200/aligned/tcs_v2/20240221/NV2g.20240221.transcripts.fa
  • Whole genome can be found here: https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_033964005.1/

Determine if sequences are in transcript file

[kxw755@pegasus NV2_Nvec]$ grep -n "NV2g011441000\.1" NV2g.20240221.transcripts.fa | head 
1305742:>NV2t011441001.1 gene=NV2g011441000.1 CDS=473-1333 
1305772:>NV2t011441002.1 gene=NV2g011441000.1 CDS=1215-2075 
1305813:>NV2t011441003.1 gene=NV2g011441000.1 CDS=283-1143 
[kxw755@pegasus NV2_Nvec]$ grep -n "NV2g019682000\.1" NV2g.20240221.transcripts.fa | head 
1059715:>NV2t019682001.1 gene=NV2g019682000.1 CDS=109-627

Sweet, both genes are in there, but NV2g011441000.1 has three transcripts while NV2g019682000.1 has only one.
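The "how many transcripts per gene" check can be made explicit by counting FASTA headers per gene ID. The sketch below is only an illustration: it runs on a small toy file (`toy.transcripts.fa`, a hypothetical name) whose headers copy the layout seen in the grep output above, with placeholder sequences instead of real data.

```shell
# Toy FASTA with the same header layout as the real transcript file
# (the sequences are placeholders, not real data).
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.transcripts.fa" <<'EOF'
>NV2t011441001.1 gene=NV2g011441000.1 CDS=473-1333
ACGT
>NV2t011441002.1 gene=NV2g011441000.1 CDS=1215-2075
ACGT
>NV2t011441003.1 gene=NV2g011441000.1 CDS=283-1143
ACGT
>NV2t019682001.1 gene=NV2g019682000.1 CDS=109-627
ACGT
EOF

# Count header lines per gene ID; sort for a stable order.
tx_counts=$(awk '
  /^>/ {
    for (i = 2; i <= NF; i++)
      if ($i ~ /^gene=/) { sub(/^gene=/, "", $i); n[$i]++ }
  }
  END { for (g in n) print g "\t" n[g] }
' "$tmpdir/toy.transcripts.fa" | sort)

printf '%s\n' "$tx_counts"
```

Pointing the awk command at NV2g.20240221.transcripts.fa instead of the toy file would list the transcript count for every gene in the annotation.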

Extract sequences

awk '
  BEGIN{RS=">"; ORS=""}
  NR>1 {
    if ($0 ~ /gene=NV2g011441000\.1([[:space:]]|$)/ ||
        $0 ~ /gene=NV2g019682000\.1([[:space:]]|$)/)
      print ">" $0
  }
' NV2g.20240221.transcripts.fa \
> NV2g011441000.1_NV2g019682000.1.transcripts.fa

Output

less NV2g011441000.1_NV2g019682000.1.transcripts.fa

>NV2t019682001.1 gene=NV2g019682000.1 CDS=109-627
AGATCCCCGTTTTCTGCTTCAACAGTGATTGAACCGAAACAGGACTCACACGCGAATTTCCAAAACTTTT
CAGTTTTACTTTTATCCTTGCCATAAAACTCGTTCAAGATGTCGCTCTCAGTTTGCCGTCAGAACTATCA
CGAGGAGTCCGAGGCGGGCGTCAACAAGCAGATCAACCTCGAGTTGTACGCCAGCTACGTCTACATGTCG
ATGGCCTACCATTTCGACCGTGATGATGTAGCTTTGCCTGGATTCCACAAGTACTTTATGAAGGCCTCGC
ATGAAGAGCGCGAGCATGCCGAGAAGCTTGCCAAGTTCCAGCTGCAACGTGGAGGCCGCATTGTGCTTCA
AGACATCAAGCGCCCTGAGCGCGACGACTGGGGTTGTGGACAGGATGCCATTCAGGCAGCTCTTGACCTG
GAAAAACATGTCAACCAGGCCTTACTTGATCTGCACAAGGTCGCCGAGAAGCACGGTGACTCTCAGATGC
AAGACTGGCTCGAGTCGCATTACCTGACTGAGCAAGTGGAGGCCATCAAGGAGCTTGCTGGTCACTTGAC
CAACCTGAAGCGTGTTGGCCCTGGCTTGGGAGAATTCCAGTTCGACAAGCTCACCCTCGACGACTAGAGG
GGTGCAGGCTGGACTGAGTCTTGAAACCATGGATTGACCTTTAAACCGAAGTAGATCTTATCAACCCTGA
TGTGACACACAGCGCCGCCCTTGTTCACATAAATCGGAATTCATGGCAAGCCTTTGGATAATTCTATCTT
TCCACTCCGAGGCACACTTTGCCGTGCCCTGGTCCCTATTTCATCTAATGGTAAAGGGATCAGTGGAGCC
GTTTTTTGTTAACAAGGGCATCTTTTTTTCTTTTGTTAGACTTGTTTTCTGTAGTGACTAAATAAAAGCA
TTATAAAATCAATGTGATGACTGCTTTATTTTTGTTGAGAAATAAAAACCAGATAACATCCCAACATCCT
GCCCCAAGTTTTCAATATAAAGAAGTCTTTCTTCAGAGCATTGATAATCCCAGGGGGATCGGGATAATAG
GAAATGGAATAAATAGTACTGAGATTAGAAACCTTGGAGATTGCCAAACAACAAGTGGCCACAGGAAAGG
AGGGCACCTACTTTTTTTTTTCTTAATGTGAAGTTGGCATGTACATTAATATGTATTTAACGTCATAGGT
ATAAATTAATTGCTTGATAAAAGACGAATCACACATTT
>NV2t011441001.1 gene=NV2g011441000.1 CDS=473-1333
GAACAAAACAAAATGGCCGTGGCTGTGTCAAAAACAGAATCATTCACAAGTGATGAGATTTATGGATAGT
GACATCAAAAGAGCGGGGCCGCATTTCTGCTCTGAATGCAATTTTCCTCTTGGAAATTTCGCACCATCCT
TGCAAAATCTCGCACAAATTGTCATAGAAAAATACCGCTGTAAACAAGGGTCAATAACTGTAATTATTGA
CGGCCGTGTAGTTTTTCGGAGCATTCAAAATCCAGTACCAATCCACTTGAGTGCAATAAATATTCCTGTA
AGAAAGCACCAGTGGTAGAAAGACCGTAAAACACAAAGAAAGAGTATCCAGAAACTGTGCACGCGACGTG
CAAGGCTTGAAGAGGAGGTAATCTGCACAAGTCGCTAGCGGAGACACAGTGAACACCAACACCAGCAGCG
CGCAACGTCAACCGGCAAACTGGATCCTCAGTACAGAAACCTCGAGATAGAGATGATGGAGCACACGGGG
GTGCCTCCAGCGGCCATGCAAGACCCATCACAAAACCCGCACGAGCTCAAGAAATCCAAGGACAAGGAGA
AAGCGTATCGCCGGAGCTACACGCACGCCAAGCCGCCATATTCATATATCTCACTCATCACGATGGCCAT
TCAACAGAGCCCAAACAAGATGCTCACACTGAGCGAGATCTACCAATTCATCATGGACTTGTTTCCCTAC
TACAGGCAAAACCAACAGCGCTGGCAGAACTCTATCCGGCACAGTTTATCATTCAATGATTGCTTCGTGA
AGGTGCCGCGCTCTCCTGACCGCCCCGGGAAAGGCAGTTACTGGACTCTCCACCCGGACTGCGGTAATAT
GTTCGAGAACGGGTGCTACCTTCGCAGGCAGAAGCGCTTCAAAGCCGAGAAAAAACCGGACCTGAGTCAC
CTTAGCAAGGTGAGCAGTATGACACACAACCCGGTCACAGTACAAAGCATGGCGAAGAGCATGGCGGCAC
AACCTCGCTCAATGGGAACCCCTAGCTTCCTTGCACCGTCTCCGTACGGGACGGCTATGGGCATGGGACA
CGTTGGGAGCATGACCGCCATGGGTATGGCAGGTATGCCCATGAATAAGTCGTTTAATCACCCATTCGCT
ATCAAGAATATCATCGCGCAAGATCACGAAGCTGAGCTTCGAGGCTACGACCCCATGCACTTCAGTCCTT
ATCATCCATCACTCCAATCCATGGGTTCCCTAGGACTCCCTAAATCCGCCTACGAATCGCAACCTATCAC
GACGGATACGAGTCCGTACTACCAGGGTTGCGTCTTCACGCCGTCGAGTTGTGGTATATCAAATCTTTCG
TGAACTTAGCGCAGCACTTAGAACACTAGAAGTAGAAATTATTCTAAGAAAAGCTGAATATTATGATAAA
CCTGTATATATAAACATTGAGGCACTGAAAATTGAGCTATCCGACGAATATCTCGTCTATGTACAATGTG
CTTGAGCTTTCGTATGATTATATTGATCATTTTTTTACTACAAGCACAACTAGATTTCGTAAAGAGCTAG
AAAATTTAATAATTTTATGAACAATCAAAAATGTAAAATAGAAGATGAAAATGTTAGGCTAGGAAAAACA
AAAATGATTAGAGTATTTTGAAGAAAGCAAATCTTCCGAAATCAAAATATTTTGAAATTGAATTAACAGA
CACAAGTGATTATTAGCTAAGAAAATGTTTGTAAAGATTCCTATATACACATATATATTGTTTCGCAAAC
TTGTTCTGTTGAAGCTGACCATTAGATGAAAGAGCGTGCAGTTAATATTTGGCTTTTCGACAGTCGCTTT
CTGTATTCGGGGCATAACATTGACTGTTGCTCAGTTTGTTCTCGTGTCAACTCTGTATTTATCATTGCTT
CACTTTTATTTATACTAGTCAGGCGAGGGAACGGTTTTTATTGTAAATGTACTCTCATGGTCAACGTTTG
TGGATACTTGAAGAAAATAAATTTTTAACGAATC
>NV2t011441002.1 gene=NV2g011441000.1 CDS=1215-2075
GAACAAAACAAAATGGCCGTGGCTGTGTCAAAAACAGAATCATTCACAAGTGATGAGATTTATGGATAGT
GACATCAAAAGAGCGGGGCCGCATTTCTGCTCTGAATGCAATTTTCCTCTTGGAAATTTCGCACCATCCT
TGCAAAATCTCGCACAAATTGTCATAGAAAAATACCGCTGTAAACAAGGGTCAATAACTGTAATTATTGA
CGGCCGTGTAGTTTTTCGGAGCATTCAAAATCCAGTACCAATCCACTTGAGTGCAATAAATATTCCTGTA
AGGTAAGAACTAGCCCTTAAAATACATCGCCGTGATTAAACGTAATTCGGGGGAAAAGTTCTATTTTTTT
TAATTTTGTTTTATTCAAAATTAAGATATTGTTTAAAAAAAAGAGTAGTTGTAGATACATTGGGATACAA
TGATAATTACGTAGGAATTTAAGTACTTGTCTTCCTGTTCCAAACTAAGTTTTAAGATATACACTCACAG
ATATAATTTTATCAAAACTGCTAGGAATTCGACAAGCTTCTAGCATTCGAACGGTATTGCTCGGTTTAAC
ACTTAAACTCGTTAAAAGTAATAACTAAAGTGTTTCTCTTTAAACACAATCTCGATGAAGCGCAAATCTG
CAAATATTTATGTCGCACCATGCGTCAGTTTGTTTACAAGAACAAATGGCGGGGTTTGAAGCACGCCGTG
TGTTATTGCGAAGAGCGAGTGTTTGTCCTTCGCTTATCGTGTGTTTATATGATGGGTCAATATTTATCTG
AGCAAAGGCTATCATATCTGAACGGCATAACTGATCGGGGTCAGTATTCATCGCCCTTCCACGGAAGCAC
AGCTCTAGCAAGTGTGGCCAAAAACTTTTAGGGGATTATAGAACATGTCTTCTTCCTGGAAAAATGGCGC
ACAGTACCTCGAAAAGAGGAAGAATTGTTGGGTACACAGTTGGAATGTCATTCGAACACTGCGAACAATA
CTTTTTTTTATATCAAGATGATCAAGAGGTAGAAAGCAAAATTGAAAGCACCAGTGGTAGAAAGACCGTA
AAACACAAAGAAAGAGTATCCAGAAACTGTGCACGCGACGTGCAAGGCTTGAAGAGGAGGTAATCTGCAC
AAGTCGCTAGCGGAGACACAGTGAACACCAACACCAGCAGCGCGCAACGTCAACCGGCAAACTGGATCCT
CAGTACAGAAACCTCGAGATAGAGATGATGGAGCACACGGGGGTGCCTCCAGCGGCCATGCAAGACCCAT
CACAAAACCCGCACGAGCTCAAGAAATCCAAGGACAAGGAGAAAGCGTATCGCCGGAGCTACACGCACGC
CAAGCCGCCATATTCATATATCTCACTCATCACGATGGCCATTCAACAGAGCCCAAACAAGATGCTCACA
CTGAGCGAGATCTACCAATTCATCATGGACTTGTTTCCCTACTACAGGCAAAACCAACAGCGCTGGCAGA
ACTCTATCCGGCACAGTTTATCATTCAATGATTGCTTCGTGAAGGTGCCGCGCTCTCCTGACCGCCCCGG
GAAAGGCAGTTACTGGACTCTCCACCCGGACTGCGGTAATATGTTCGAGAACGGGTGCTACCTTCGCAGG
CAGAAGCGCTTCAAAGCCGAGAAAAAACCGGACCTGAGTCACCTTAGCAAGGTGAGCAGTATGACACACA
ACCCGGTCACAGTACAAAGCATGGCGAAGAGCATGGCGGCACAACCTCGCTCAATGGGAACCCCTAGCTT
CCTTGCACCGTCTCCGTACGGGACGGCTATGGGCATGGGACACGTTGGGAGCATGACCGCCATGGGTATG
GCAGGTATGCCCATGAATAAGTCGTTTAATCACCCATTCGCTATCAAGAATATCATCGCGCAAGATCACG
AAGCTGAGCTTCGAGGCTACGACCCCATGCACTTCAGTCCTTATCATCCATCACTCCAATCCATGGGTTC
CCTAGGACTCCCTAAATCCGCCTACGAATCGCAACCTATCACGACGGATACGAGTCCGTACTACCAGGGT
TGCGTCTTCACGCCGTCGAGTTGTGGTATATCAAATCTTTCGTGAACTTAGCGCAGCACTTAGAACACTA
GAAGTAGAAATTATTCTAAGAAAAGCTGAATATTATGATAAACCTGTATATATAAACATTGAGGCACTGA
AAATTGAGCTATCCGACGAATATCTCGTCTATGTACAATGTGCTTGAGCTTTCGTATGATTATATTGATC
ATTTTTTTACTACAAGCACAACTAGATTTCGTAAAGAGCTAGAAAATTTAATAATTTTATGAACAATCAA
AAATGTAAAATAGAAGATGAAAATGTTAGGCTAGGAAAAACAAAAATGATTAGAGTATTTTGAAGAAAGC
AAATCTTCCGAAATCAAAATATTTTGAAATTGAATTAACAGACACAAGTGATTATTAGCTAAGAAAATGT
TTGTAAAGATTCCTATATACACATATATATTGTTTCGCAAACTTGTTCTGTTGAAGCTGACCATTAGATG
AAAGAGCGTGCAGTTAATATTTGGCTTTTCGACAGTCGCTTTCTGTATTCGGGGCATAACATTGACTGTT
GCTCAGTTTGTTCTCGTGTCAACTCTGTATTTATCATTGCTTCACTTTTATTTATACTAGTCAGGCGAGG
GAACGGTTTTTATTGTAAATGTACTCTCATGGTCAACGTTTGTGGATACTTGAAGAAAATAAATTTTTAA
CGAATC
>NV2t011441003.1 gene=NV2g011441000.1 CDS=283-1143
GGACTGCGAGCAAGCATGTGTTCTATATAAACGAGGTAGTGATCTATCTTAGTTTATGTAGGGAATAGAA
ACCTACTACATTACCCTTCCAGCACAACGTGCCTATTTGTTCAACGCTGACTCCGTCATCTAGGCAATAA
CAGTAACAAGGTCATTATCTTCCTTCACGAGAAGGAGGTAATCTGCACAAGTCGCTAGCGGAGACACAGT
GAACACCAACACCAGCAGCGCGCAACGTCAACCGGCAAACTGGATCCTCAGTACAGAAACCTCGAGATAG
AGATGATGGAGCACACGGGGGTGCCTCCAGCGGCCATGCAAGACCCATCACAAAACCCGCACGAGCTCAA
GAAATCCAAGGACAAGGAGAAAGCGTATCGCCGGAGCTACACGCACGCCAAGCCGCCATATTCATATATC
TCACTCATCACGATGGCCATTCAACAGAGCCCAAACAAGATGCTCACACTGAGCGAGATCTACCAATTCA
TCATGGACTTGTTTCCCTACTACAGGCAAAACCAACAGCGCTGGCAGAACTCTATCCGGCACAGTTTATC
ATTCAATGATTGCTTCGTGAAGGTGCCGCGCTCTCCTGACCGCCCCGGGAAAGGCAGTTACTGGACTCTC
CACCCGGACTGCGGTAATATGTTCGAGAACGGGTGCTACCTTCGCAGGCAGAAGCGCTTCAAAGCCGAGA
AAAAACCGGACCTGAGTCACCTTAGCAAGGTGAGCAGTATGACACACAACCCGGTCACAGTACAAAGCAT
GGCGAAGAGCATGGCGGCACAACCTCGCTCAATGGGAACCCCTAGCTTCCTTGCACCGTCTCCGTACGGG
ACGGCTATGGGCATGGGACACGTTGGGAGCATGACCGCCATGGGTATGGCAGGTATGCCCATGAATAAGT
CGTTTAATCACCCATTCGCTATCAAGAATATCATCGCGCAAGATCACGAAGCTGAGCTTCGAGGCTACGA
CCCCATGCACTTCAGTCCTTATCATCCATCACTCCAATCCATGGGTTCCCTAGGACTCCCTAAATCCGCC
TACGAATCGCAACCTATCACGACGGATACGAGTCCGTACTACCAGGGTTGCGTCTTCACGCCGTCGAGTT
GTGGTATATCAAATCTTTCGTGAACTTAGCGCAGCACTTAGAACACTAGAAGTAGAAATTATTCTAAGAA
AAGCTGAATATTATGATAAACCTGTATATATAAACATTGAGGCACTGAAAATTGAGCTATCCGACGAATA
TCTCGTCTATGTACAATGTGCTTGAGCTTTCGTATGATTATATTGATCATTTTTTTACTACAAGCACAAC
TAGATTTCGTAAAGAGCTAGAAAATTTAATAATTTTATGAACAATCAAAAATGTAAAATAGAAGATGAAA
ATGTTAGGCTAGGAAAAACAAAAATGATTAGAGTATTTTGAAGAAAGCAAATCTTCCGAAATCAAAATAT
TTTGAAATTGAATTAACAGACACAAGTGATTATTAGCTAAGAAAATGTTTGTAAAGATTCCTATATACAC
ATATATATTGTTTCGCAAACTTGTTCTGTTGAAGCTGACCATTAGATGAAAGAGCGTGCAGTTAATATTT
GGCTTTTCGACAGTCGCTTTCTGTATTCGGGGCATAACATTGACTGTTGCTCAGTTTGTTCTCGTGTCAA
CTCTGTATTTATCATTGCTTCACTTTTATTTATACTAGTCAGGCGAGGGAACGGTTTTTATTGTAAATGT
ACTCTCATGGTCAACGTTTGTGGATACTTGAAGAAAATAAATTTTTAACGAATC

Since NV2g011441000.1 has multiple transcripts, we need to find the region shared by all isoforms, so that the probes detect the gene regardless of which isoform is expressed (i.e. gene-level rather than isoform-level validation).

Find what differs between the transcripts (from the GTF)

gene="NV2g011441000.1"

awk -v g="$gene" '
  $3=="exon" && $0 ~ "gene_id \"" g "\"" {
    if (match($0, /transcript_id "([^"]+)"/, m)) tx=m[1]; else next
    print tx "\t" $1 "\t" $4 "\t" $5 "\t" $7
  }
' NV2g.20240221.gtf | sort -k1,1 -k3,3n > ${gene}.exons.tsv

column -t ${gene}.exons.tsv | head -n 50

Output

NV2t011441001.1 chr2    7110167 7110448 +
NV2t011441001.1 chr2    7114469 7114550 +
NV2t011441001.1 chr2    7131875 7133504 +
NV2t011441002.1 chr2    7110167 7111190 +
NV2t011441002.1 chr2    7114469 7114550 +
NV2t011441002.1 chr2    7131875 7133504 +
NV2t011441003.1 chr2    7127953 7128126 +
NV2t011441003.1 chr2    7131875 7133504 +

Identify exons shared by all isoforms vs unique

gene="NV2g011441000.1"

awk -v g="$gene" '
  $3=="exon" && $0 ~ "gene_id \"" g "\"" {
    if (match($0, /transcript_id "([^"]+)"/, m)) tx=m[1]; else next
    key=$1":"$4"-"$5":"$7
    seen[tx,key]=1
    txs[tx]=1
  }
  END{
    # count transcripts
    for (t in txs) ntx++
    # count exon intervals across tx
    for (k in seen) {
      split(k, a, SUBSEP); tx=a[1]; exon=a[2]
      count[exon]++
    }
    print "Total transcripts:", ntx > "/dev/stderr"
    print "EXON_INTERVAL\tN_TX"
    for (e in count) print e "\t" count[e]
  }
' NV2g.20240221.gtf | sort -k2,2nr > ${gene}.exon_sharedness.tsv

head ${gene}.exon_sharedness.tsv

Output

less NV2g011441000.1.exon_sharedness.tsv

chr2:7131875-7133504:+  3
chr2:7114469-7114550:+  2
chr2:7110167-7110448:+  1
chr2:7110167-7111190:+  1
chr2:7127953-7128126:+  1
EXON_INTERVAL   N_TX

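Interpretation (the header line ends up at the bottom because sort -k2,2nr treats the non-numeric N_TX as 0):

  • N_TX = 3 → exon interval appears in all 3 transcripts (best target for a gene-level probe)
  • N_TX = 2 → exon interval is shared by two of the three transcripts
  • N_TX = 1 → exon interval is isoform-unique (candidate for an isoform-specific probe, although intervals can overlap: chr2:7110167-7110448 lies inside chr2:7110167-7111190)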

Therefore, the chr2:7131875-7133504:+ exon (1630 bp) is shared by all three transcripts and should be used for probe design.
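The same conclusion can be reached arithmetically from the exon table. The sketch below copies the exon coordinates from the output above into a temporary file (`exons.txt` is just a scratch name), lists the intervals present in every transcript with their lengths, and then works out where the shared exon sits within each transcript. The offset calculation assumes a plus-strand gene, as is the case here.

```shell
tmpdir=$(mktemp -d)
# Exon table from the GTF step: transcript, chromosome, start, end, strand.
cat > "$tmpdir/exons.txt" <<'EOF'
NV2t011441001.1 chr2 7110167 7110448 +
NV2t011441001.1 chr2 7114469 7114550 +
NV2t011441001.1 chr2 7131875 7133504 +
NV2t011441002.1 chr2 7110167 7111190 +
NV2t011441002.1 chr2 7114469 7114550 +
NV2t011441002.1 chr2 7131875 7133504 +
NV2t011441003.1 chr2 7127953 7128126 +
NV2t011441003.1 chr2 7131875 7133504 +
EOF

# 1) Exon intervals present in every transcript, with their length.
shared=$(awk '
  {
    key = $2 ":" $3 "-" $4 ":" $5
    count[key]++                 # one row per transcript/exon pair
    len[key] = $4 - $3 + 1       # GTF coordinates are 1-based, inclusive
    if (!($1 in txs)) { txs[$1] = 1; ntx++ }
  }
  END { for (k in count) if (count[k] == ntx) print k "\t" len[k] }
' "$tmpdir/exons.txt")
printf '%s\n' "$shared"

# 2) Position of that exon within each transcript (plus strand only):
#    start = summed length of the upstream exons + 1.
offsets=$(sort -k1,1 -k3,3n "$tmpdir/exons.txt" | awk -v s=7131875 '
  $3 == s { print $1 "\t" (up[$1] + 1) "\t" (up[$1] + $4 - $3 + 1) }
  { up[$1] += $4 - $3 + 1 }
')
printf '%s\n' "$offsets"
```

The shared exon occupies transcript positions 365-1994, 1107-2736 and 175-1804 in the three isoforms. Compared with the CDS ranges in the FASTA headers (473-1333, 1215-2075, 283-1143), the CDS starts 108 nt into the shared exon in every isoform and ends inside it, so the whole coding sequence is covered by this exon.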

Extract the common sequence directly from the transcript FASTA

Isolate just the transcripts for NV2g011441000.1:

awk '
  BEGIN{RS=">"; ORS=""}
  NR>1 && $0 ~ /gene=NV2g011441000\.1/ { print ">" $0 }
' NV2g.20240221.transcripts.fa \
> NV2g011441000.1.isoforms.fa

grep "^>" NV2g011441000.1.isoforms.fa

>NV2t011441001.1 gene=NV2g011441000.1 CDS=473-1333
>NV2t011441002.1 gene=NV2g011441000.1 CDS=1215-2075
>NV2t011441003.1 gene=NV2g011441000.1 CDS=283-1143

Convert FASTA → one-line sequences

awk '
  /^>/ {if (seq) print seq; print; seq=""; next}
  {seq=seq$0}
  END{print seq}
' NV2g011441000.1.isoforms.fa \
> NV2g011441000.1.isoforms.oneline.fa

less NV2g011441000.1.isoforms.oneline.fa

>NV2t011441001.1 gene=NV2g011441000.1 CDS=473-1333
GAACAAAACAAAATGGCCGTGGCTGTGTCAAAAACAGAATCATTCACAAGTGATGAGATTTATGGATAGTGACATCAAAAGAGCGGGGCCGCATTTCTGCTCTGAATGCAATTTTCCTCTTGGAAATTTCGCACCATCCTTGCAAAATCTCGCACAAATTGTCATAGAAAAATACCGCTGTAAACAAGGGTCAATAACTGTAATTATTGACGGCCGTGTAGTTTTTCGGAGCATTCAAAATCCAGTACCAATCCACTTGAGTGCAATAAATATTCCTGTAAGAAAGCACCAGTGGTAGAAAGACCGTAAAACACAAAGAAAGAGTATCCAGAAACTGTGCACGCGACGTGCAAGGCTTGAAGAGGAGGTAATCTGCACAAGTCGCTAGCGGAGACACAGTGAACACCAACACCAGCAGCGCGCAACGTCAACCGGCAAACTGGATCCTCAGTACAGAAACCTCGAGATAGAGATGATGGAGCACACGGGGGTGCCTCCAGCGGCCATGCAAGACCCATCACAAAACCCGCACGAGCTCAAGAAATCCAAGGACAAGGAGAAAGCGTATCGCCGGAGCTACACGCACGCCAAGCCGCCATATTCATATATCTCACTCATCACGATGGCCATTCAACAGAGCCCAAACAAGATGCTCACACTGAGCGAGATCTACCAATTCATCATGGACTTGTTTCCCTACTACAGGCAAAACCAACAGCGCTGGCAGAACTCTATCCGGCACAGTTTATCATTCAATGATTGCTTCGTGAAGGTGCCGCGCTCTCCTGACCGCCCCGGGAAAGGCAGTTACTGGACTCTCCACCCGGACTGCGGTAATATGTTCGAGAACGGGTGCTACCTTCGCAGGCAGAAGCGCTTCAAAGCCGAGAAAAAACCGGACCTGAGTCACCTTAGCAAGGTGAGCAGTATGACACACAACCCGGTCACAGTACAAAGCATGGCGAAGAGCATGGCGGCACAACCTCGCTCAATGGGAACCCCTAGCTTCCTTGCACCGTCTCCGTACGGGACGGCTATGGGCATGGGACACGTTGGGAGCATGACCGCCATGGGTATGGCAGGTATGCCCATGAATAAGTCGTTTAATCACCCATTCGCTATCAAGAATATCATCGCGCAAGATCACGAAGCTGAGCTTCGAGGCTACGACCCCATGCACTTCAGTCCTTATCATCCATCACTCCAATCCATGGGTTCCCTAGGACTCCCTAAATCCGCCTACGAATCGCAACCTATCACGACGGATACGAGTCCGTACTACCAGGGTTGCGTCTTCACGCCGTCGAGTTGTGGTATATCAAATCTTTCGTGAACTTAGCGCAGCACTTAGAACACTAGAAGTAGAAATTATTCTAAGAAAAGCTGAATATTATGATAAACCTGTATATATAAACATTGAGGCACTGAAAATTGAGCTATCCGACGAATATCTCGTCTATGTACAATGTGCTTGAGCTTTCGTATGATTATATTGATCATTTTTTTACTACAAGCACAACTAGATTTCGTAAAGAGCTAGAAAATTTAATAATTTTATGAACAATCAAAAATGTAAAATAGAAGATGAAAATGTTAGGCTAGGAAAAACAAAAATGATTAGAGTATTTTGAAGAAAGCAAATCTTCCGAAATCAAAATATTTTGAAATTGAATTAACAGACACAAGTGATTATTAGCTAAGAAAATGTTTGTAAAGATTCCTATATACACATATATATTGTTTCGCAAACTTGTTCTGTTGAAGCTGACCATTAGATGAAAGAGCGTGCAGTTAATATTTGGCTTTTCGACAGTCGCTTTCTGTATTCGGGGCATAACATTGACTGTTGCTCAGTTTGTTCTCGTGTCAACTCTGTATTTATCATTGCTTCACTTTTATTTATACTAGTCAGGCGAGGGAACGGTTTTTATTGTAAATGTACTCTCATGGTCAACGTTTGTGGATACTTGAAGAAAATAAATTTTTAACGAATC
>NV2t011441002.1 gene=NV2g011441000.1 CDS=1215-2075
GAACAAAACAAAATGGCCGTGGCTGTGTCAAAAACAGAATCATTCACAAGTGATGAGATTTATGGATAGTGACATCAAAAGAGCGGGGCCGCATTTCTGCTCTGAATGCAATTTTCCTCTTGGAAATTTCGCACCATCCTTGCAAAATCTCGCACAAATTGTCATAGAAAAATACCGCTGTAAACAAGGGTCAATAACTGTAATTATTGACGGCCGTGTAGTTTTTCGGAGCATTCAAAATCCAGTACCAATCCACTTGAGTGCAATAAATATTCCTGTAAGGTAAGAACTAGCCCTTAAAATACATCGCCGTGATTAAACGTAATTCGGGGGAAAAGTTCTATTTTTTTTAATTTTGTTTTATTCAAAATTAAGATATTGTTTAAAAAAAAGAGTAGTTGTAGATACATTGGGATACAATGATAATTACGTAGGAATTTAAGTACTTGTCTTCCTGTTCCAAACTAAGTTTTAAGATATACACTCACAGATATAATTTTATCAAAACTGCTAGGAATTCGACAAGCTTCTAGCATTCGAACGGTATTGCTCGGTTTAACACTTAAACTCGTTAAAAGTAATAACTAAAGTGTTTCTCTTTAAACACAATCTCGATGAAGCGCAAATCTGCAAATATTTATGTCGCACCATGCGTCAGTTTGTTTACAAGAACAAATGGCGGGGTTTGAAGCACGCCGTGTGTTATTGCGAAGAGCGAGTGTTTGTCCTTCGCTTATCGTGTGTTTATATGATGGGTCAATATTTATCTGAGCAAAGGCTATCATATCTGAACGGCATAACTGATCGGGGTCAGTATTCATCGCCCTTCCACGGAAGCACAGCTCTAGCAAGTGTGGCCAAAAACTTTTAGGGGATTATAGAACATGTCTTCTTCCTGGAAAAATGGCGCACAGTACCTCGAAAAGAGGAAGAATTGTTGGGTACACAGTTGGAATGTCATTCGAACACTGCGAACAATACTTTTTTTTATATCAAGATGATCAAGAGGTAGAAAGCAAAATTGAAAGCACCAGTGGTAGAAAGACCGTAAAACACAAAGAAAGAGTATCCAGAAACTGTGCACGCGACGTGCAAGGCTTGAAGAGGAGGTAATCTGCACAAGTCGCTAGCGGAGACACAGTGAACACCAACACCAGCAGCGCGCAACGTCAACCGGCAAACTGGATCCTCAGTACAGAAACCTCGAGATAGAGATGATGGAGCACACGGGGGTGCCTCCAGCGGCCATGCAAGACCCATCACAAAACCCGCACGAGCTCAAGAAATCCAAGGACAAGGAGAAAGCGTATCGCCGGAGCTACACGCACGCCAAGCCGCCATATTCATATATCTCACTCATCACGATGGCCATTCAACAGAGCCCAAACAAGATGCTCACACTGAGCGAGATCTACCAATTCATCATGGACTTGTTTCCCTACTACAGGCAAAACCAACAGCGCTGGCAGAACTCTATCCGGCACAGTTTATCATTCAATGATTGCTTCGTGAAGGTGCCGCGCTCTCCTGACCGCCCCGGGAAAGGCAGTTACTGGACTCTCCACCCGGACTGCGGTAATATGTTCGAGAACGGGTGCTACCTTCGCAGGCAGAAGCGCTTCAAAGCCGAGAAAAAACCGGACCTGAGTCACCTTAGCAAGGTGAGCAGTATGACACACAACCCGGTCACAGTACAAAGCATGGCGAAGAGCATGGCGGCACAACCTCGCTCAATGGGAACCCCTAGCTTCCTTGCACCGTCTCCGTACGGGACGGCTATGGGCATGGGACACGTTGGGAGCATGACCGCCATGGGTATGGCAGGTATGCCCATGAATAAGTCGTTTAATCACCCATTCGCTATCAAGAATATCATCGCGCAAGATCACGAAGCTGAGCTTCGAGGCTACGACCCCATGCACTTCAGTCCTTATCATCCATCACTCCAATCCATGGGTTCCCTAGGACTCCCTAAATCCGCCTACGAATCGCAACCTATC
ACGACGGATACGAGTCCGTACTACCAGGGTTGCGTCTTCACGCCGTCGAGTTGTGGTATATCAAATCTTTCGTGAACTTAGCGCAGCACTTAGAACACTAGAAGTAGAAATTATTCTAAGAAAAGCTGAATATTATGATAAACCTGTATATATAAACATTGAGGCACTGAAAATTGAGCTATCCGACGAATATCTCGTCTATGTACAATGTGCTTGAGCTTTCGTATGATTATATTGATCATTTTTTTACTACAAGCACAACTAGATTTCGTAAAGAGCTAGAAAATTTAATAATTTTATGAACAATCAAAAATGTAAAATAGAAGATGAAAATGTTAGGCTAGGAAAAACAAAAATGATTAGAGTATTTTGAAGAAAGCAAATCTTCCGAAATCAAAATATTTTGAAATTGAATTAACAGACACAAGTGATTATTAGCTAAGAAAATGTTTGTAAAGATTCCTATATACACATATATATTGTTTCGCAAACTTGTTCTGTTGAAGCTGACCATTAGATGAAAGAGCGTGCAGTTAATATTTGGCTTTTCGACAGTCGCTTTCTGTATTCGGGGCATAACATTGACTGTTGCTCAGTTTGTTCTCGTGTCAACTCTGTATTTATCATTGCTTCACTTTTATTTATACTAGTCAGGCGAGGGAACGGTTTTTATTGTAAATGTACTCTCATGGTCAACGTTTGTGGATACTTGAAGAAAATAAATTTTTAACGAATC
>NV2t011441003.1 gene=NV2g011441000.1 CDS=283-1143
GGACTGCGAGCAAGCATGTGTTCTATATAAACGAGGTAGTGATCTATCTTAGTTTATGTAGGGAATAGAAACCTACTACATTACCCTTCCAGCACAACGTGCCTATTTGTTCAACGCTGACTCCGTCATCTAGGCAATAACAGTAACAAGGTCATTATCTTCCTTCACGAGAAGGAGGTAATCTGCACAAGTCGCTAGCGGAGACACAGTGAACACCAACACCAGCAGCGCGCAACGTCAACCGGCAAACTGGATCCTCAGTACAGAAACCTCGAGATAGAGATGATGGAGCACACGGGGGTGCCTCCAGCGGCCATGCAAGACCCATCACAAAACCCGCACGAGCTCAAGAAATCCAAGGACAAGGAGAAAGCGTATCGCCGGAGCTACACGCACGCCAAGCCGCCATATTCATATATCTCACTCATCACGATGGCCATTCAACAGAGCCCAAACAAGATGCTCACACTGAGCGAGATCTACCAATTCATCATGGACTTGTTTCCCTACTACAGGCAAAACCAACAGCGCTGGCAGAACTCTATCCGGCACAGTTTATCATTCAATGATTGCTTCGTGAAGGTGCCGCGCTCTCCTGACCGCCCCGGGAAAGGCAGTTACTGGACTCTCCACCCGGACTGCGGTAATATGTTCGAGAACGGGTGCTACCTTCGCAGGCAGAAGCGCTTCAAAGCCGAGAAAAAACCGGACCTGAGTCACCTTAGCAAGGTGAGCAGTATGACACACAACCCGGTCACAGTACAAAGCATGGCGAAGAGCATGGCGGCACAACCTCGCTCAATGGGAACCCCTAGCTTCCTTGCACCGTCTCCGTACGGGACGGCTATGGGCATGGGACACGTTGGGAGCATGACCGCCATGGGTATGGCAGGTATGCCCATGAATAAGTCGTTTAATCACCCATTCGCTATCAAGAATATCATCGCGCAAGATCACGAAGCTGAGCTTCGAGGCTACGACCCCATGCACTTCAGTCCTTATCATCCATCACTCCAATCCATGGGTTCCCTAGGACTCCCTAAATCCGCCTACGAATCGCAACCTATCACGACGGATACGAGTCCGTACTACCAGGGTTGCGTCTTCACGCCGTCGAGTTGTGGTATATCAAATCTTTCGTGAACTTAGCGCAGCACTTAGAACACTAGAAGTAGAAATTATTCTAAGAAAAGCTGAATATTATGATAAACCTGTATATATAAACATTGAGGCACTGAAAATTGAGCTATCCGACGAATATCTCGTCTATGTACAATGTGCTTGAGCTTTCGTATGATTATATTGATCATTTTTTTACTACAAGCACAACTAGATTTCGTAAAGAGCTAGAAAATTTAATAATTTTATGAACAATCAAAAATGTAAAATAGAAGATGAAAATGTTAGGCTAGGAAAAACAAAAATGATTAGAGTATTTTGAAGAAAGCAAATCTTCCGAAATCAAAATATTTTGAAATTGAATTAACAGACACAAGTGATTATTAGCTAAGAAAATGTTTGTAAAGATTCCTATATACACATATATATTGTTTCGCAAACTTGTTCTGTTGAAGCTGACCATTAGATGAAAGAGCGTGCAGTTAATATTTGGCTTTTCGACAGTCGCTTTCTGTATTCGGGGCATAACATTGACTGTTGCTCAGTTTGTTCTCGTGTCAACTCTGTATTTATCATTGCTTCACTTTTATTTATACTAGTCAGGCGAGGGAACGGTTTTTATTGTAAATGTACTCTCATGGTCAACGTTTGTGGATACTTGAAGAAAATAAATTTTTAACGAATC
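A quick check that the one-line conversion worked is to confirm that every record now has exactly one sequence line and to print its length. The sketch below demonstrates this on a toy multi-line FASTA (`toy.fa`, with made-up records `tx1` and `tx2`), reusing the conversion command from above unchanged.

```shell
tmpdir=$(mktemp -d)
# Toy FASTA: tx1 is wrapped over two lines, tx2 is already on one line.
cat > "$tmpdir/toy.fa" <<'EOF'
>tx1 gene=g1
ACGTAC
GTA
>tx2 gene=g1
GGGG
EOF

# Same FASTA -> one-line conversion as above.
awk '
  /^>/ {if (seq) print seq; print; seq=""; next}
  {seq=seq$0}
  END{print seq}
' "$tmpdir/toy.fa" > "$tmpdir/toy.oneline.fa"

# Report "ID <tab> sequence length" for each record.
lengths=$(awk '
  /^>/ { id = substr($1, 2); next }
  { print id "\t" length($0) }
' "$tmpdir/toy.oneline.fa")
printf '%s\n' "$lengths"

# Two lines per record means no wrapped sequence lines remain.
n_lines=$(wc -l < "$tmpdir/toy.oneline.fa")
```

For the real isoform file, the lengths should equal the summed exon lengths of each transcript in the GTF.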

Compute the longest common shared sequence

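  • This finds the longest contiguous nucleotide stretch present in all three isoforms (a brute-force longest-common-substring search that uses the shortest isoform as the reference).

awk '
  /^>/ {
    if (seq) seqs[++n]=seq
    seq=""
    next
  }
  {seq=seq $0}   # append, so a sequence wrapped over several lines is still read in full
  END{
    seqs[++n]=seq

    # use the shortest sequence as the reference
    min=1
    for (i=2;i<=n;i++)
      if (length(seqs[i]) < length(seqs[min])) min=i
    ref=seqs[min]

    # grow a fragment from every start position i of the reference
    best=""
    for (i=1;i<=length(ref);i++) {
      for (j=i;j<=length(ref);j++) {
        frag = substr(ref, i, j-i+1)
        ok=1
        for (k=1;k<=n;k++) {
          if (index(seqs[k], frag) == 0) { ok=0; break }
        }
        if (!ok) break   # a longer fragment starting at i cannot match either
        if (length(frag) > length(best)) best = frag
      }
    }

    print ">NV2g011441000.1_COMMON_SEQUENCE"
    print best
  }
' NV2g011441000.1.isoforms.oneline.fa \
> NV2g011441000.1.common.fa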

Sanity check

grep "^>" NV2g011441000.1.common.fa

>NV2g011441000.1_COMMON_SEQUENCE

grep -v "^>" NV2g011441000.1.common.fa | wc -c

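1633

wc -c also counts the trailing newline, so the common sequence is 1632 nt long: the 1630 bp shared exon plus the 2 nt immediately upstream (AG), which happen to be identical in all three isoforms.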

less NV2g011441000.1.common.fa

>NV2g011441000.1_COMMON_SEQUENCE
AGGAGGTAATCTGCACAAGTCGCTAGCGGAGACACAGTGAACACCAACACCAGCAGCGCGCAACGTCAACCGGCAAACTGGATCCTCAGTACAGAAACCTCGAGATAGAGATGATGGAGCACACGGGGGTGCCTCCAGCGGCCATGCAAGACCCATCACAAAACCCGCACGAGCTCAAGAAATCCAAGGACAAGGAGAAAGCGTATCGCCGGAGCTACACGCACGCCAAGCCGCCATATTCATATATCTCACTCATCACGATGGCCATTCAACAGAGCCCAAACAAGATGCTCACACTGAGCGAGATCTACCAATTCATCATGGACTTGTTTCCCTACTACAGGCAAAACCAACAGCGCTGGCAGAACTCTATCCGGCACAGTTTATCATTCAATGATTGCTTCGTGAAGGTGCCGCGCTCTCCTGACCGCCCCGGGAAAGGCAGTTACTGGACTCTCCACCCGGACTGCGGTAATATGTTCGAGAACGGGTGCTACCTTCGCAGGCAGAAGCGCTTCAAAGCCGAGAAAAAACCGGACCTGAGTCACCTTAGCAAGGTGAGCAGTATGACACACAACCCGGTCACAGTACAAAGCATGGCGAAGAGCATGGCGGCACAACCTCGCTCAATGGGAACCCCTAGCTTCCTTGCACCGTCTCCGTACGGGACGGCTATGGGCATGGGACACGTTGGGAGCATGACCGCCATGGGTATGGCAGGTATGCCCATGAATAAGTCGTTTAATCACCCATTCGCTATCAAGAATATCATCGCGCAAGATCACGAAGCTGAGCTTCGAGGCTACGACCCCATGCACTTCAGTCCTTATCATCCATCACTCCAATCCATGGGTTCCCTAGGACTCCCTAAATCCGCCTACGAATCGCAACCTATCACGACGGATACGAGTCCGTACTACCAGGGTTGCGTCTTCACGCCGTCGAGTTGTGGTATATCAAATCTTTCGTGAACTTAGCGCAGCACTTAGAACACTAGAAGTAGAAATTATTCTAAGAAAAGCTGAATATTATGATAAACCTGTATATATAAACATTGAGGCACTGAAAATTGAGCTATCCGACGAATATCTCGTCTATGTACAATGTGCTTGAGCTTTCGTATGATTATATTGATCATTTTTTTACTACAAGCACAACTAGATTTCGTAAAGAGCTAGAAAATTTAATAATTTTATGAACAATCAAAAATGTAAAATAGAAGATGAAAATGTTAGGCTAGGAAAAACAAAAATGATTAGAGTATTTTGAAGAAAGCAAATCTTCCGAAATCAAAATATTTTGAAATTGAATTAACAGACACAAGTGATTATTAGCTAAGAAAATGTTTGTAAAGATTCCTATATACACATATATATTGTTTCGCAAACTTGTTCTGTTGAAGCTGACCATTAGATGAAAGAGCGTGCAGTTAATATTTGGCTTTTCGACAGTCGCTTTCTGTATTCGGGGCATAACATTGACTGTTGCTCAGTTTGTTCTCGTGTCAACTCTGTATTTATCATTGCTTCACTTTTATTTATACTAGTCAGGCGAGGGAACGGTTTTTATTGTAAATGTACTCTCATGGTCAACGTTTGTGGATACTTGAAGAAAATAAATTTTTAACGAATC
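As a further sanity check, one can confirm that the common sequence really occurs in every isoform and see where it starts. The following is a minimal sketch on made-up toy sequences (`iso.oneline.fa` and `common.fa` are scratch files, not the real ones). It assumes one-line FASTA input, as produced by the conversion step.

```shell
tmpdir=$(mktemp -d)
# Toy isoforms that all end in the same stretch (made-up sequences).
cat > "$tmpdir/iso.oneline.fa" <<'EOF'
>iso1
TTTTACGTACGGA
>iso2
CCACGTACGGA
>iso3
GACGTACGGA
EOF
cat > "$tmpdir/common.fa" <<'EOF'
>COMMON
ACGTACGGA
EOF

# For each isoform print: ID, 1-based start of the common sequence
# (or MISSING if it is absent), and the isoform length.
report=$(awk '
  NR == FNR { if (!/^>/) common = $0; next }   # first file: the common sequence
  /^>/ { id = substr($1, 2); next }
  {
    pos = index($0, common)
    print id "\t" (pos ? pos : "MISSING") "\t" length($0)
  }
' "$tmpdir/common.fa" "$tmpdir/iso.oneline.fa")
printf '%s\n' "$report"
```

Run against NV2g011441000.1.common.fa and NV2g011441000.1.isoforms.oneline.fa, any MISSING entry would indicate a problem with the common-sequence step.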

In summary, here are the two sequences we will use for HCRs:

  • NV2g011441000.1 (FOX1A)

NV2g011441000.1_COMMON_SEQUENCE AGGAGGTAATCTGCACAAGTCGCTAGCGGAGACACAGTGAACACCAACACCAGCAGCGCGCAACGTCAACCGGCAAACTGGATCCTCAGTACAGAAACCTCGAGATAGAGATGATGGAGCACACGGGGGTGCCTCCAGCGGCCATGCAAGACCCATCACAAAACCCGCACGAGCTCAAGAAATCCAAGGACAAGGAGAAAGCGTATCGCCGGAGCTACACGCACGCCAAGCCGCCATATTCATATATCTCACTCATCACGATGGCCATTCAACAGAGCCCAAACAAGATGCTCACACTGAGCGAGATCTACCAATTCATCATGGACTTGTTTCCCTACTACAGGCAAAACCAACAGCGCTGGCAGAACTCTATCCGGCACAGTTTATCATTCAATGATTGCTTCGTGAAGGTGCCGCGCTCTCCTGACCGCCCCGGGAAAGGCAGTTACTGGACTCTCCACCCGGACTGCGGTAATATGTTCGAGAACGGGTGCTACCTTCGCAGGCAGAAGCGCTTCAAAGCCGAGAAAAAACCGGACCTGAGTCACCTTAGCAAGGTGAGCAGTATGACACACAACCCGGTCACAGTACAAAGCATGGCGAAGAGCATGGCGGCACAACCTCGCTCAATGGGAACCCCTAGCTTCCTTGCACCGTCTCCGTACGGGACGGCTATGGGCATGGGACACGTTGGGAGCATGACCGCCATGGGTATGGCAGGTATGCCCATGAATAAGTCGTTTAATCACCCATTCGCTATCAAGAATATCATCGCGCAAGATCACGAAGCTGAGCTTCGAGGCTACGACCCCATGCACTTCAGTCCTTATCATCCATCACTCCAATCCATGGGTTCCCTAGGACTCCCTAAATCCGCCTACGAATCGCAACCTATCACGACGGATACGAGTCCGTACTACCAGGGTTGCGTCTTCACGCCGTCGAGTTGTGGTATATCAAATCTTTCGTGAACTTAGCGCAGCACTTAGAACACTAGAAGTAGAAATTATTCTAAGAAAAGCTGAATATTATGATAAACCTGTATATATAAACATTGAGGCACTGAAAATTGAGCTATCCGACGAATATCTCGTCTATGTACAATGTGCTTGAGCTTTCGTATGATTATATTGATCATTTTTTTACTACAAGCACAACTAGATTTCGTAAAGAGCTAGAAAATTTAATAATTTTATGAACAATCAAAAATGTAAAATAGAAGATGAAAATGTTAGGCTAGGAAAAACAAAAATGATTAGAGTATTTTGAAGAAAGCAAATCTTCCGAAATCAAAATATTTTGAAATTGAATTAACAGACACAAGTGATTATTAGCTAAGAAAATGTTTGTAAAGATTCCTATATACACATATATATTGTTTCGCAAACTTGTTCTGTTGAAGCTGACCATTAGATGAAAGAGCGTGCAGTTAATATTTGGCTTTTCGACAGTCGCTTTCTGTATTCGGGGCATAACATTGACTGTTGCTCAGTTTGTTCTCGTGTCAACTCTGTATTTATCATTGCTTCACTTTTATTTATACTAGTCAGGCGAGGGAACGGTTTTTATTGTAAATGTACTCTCATGGTCAACGTTTGTGGATACTTGAAGAAAATAAATTTTTAACGAATC

  • NV2g019682000.1 (FRIS-like-8)

NV2t019682001.1 gene=NV2g019682000.1 CDS=109-627 AGATCCCCGTTTTCTGCTTCAACAGTGATTGAACCGAAACAGGACTCACACGCGAATTTCCAAAACTTTT CAGTTTTACTTTTATCCTTGCCATAAAACTCGTTCAAGATGTCGCTCTCAGTTTGCCGTCAGAACTATCA CGAGGAGTCCGAGGCGGGCGTCAACAAGCAGATCAACCTCGAGTTGTACGCCAGCTACGTCTACATGTCG ATGGCCTACCATTTCGACCGTGATGATGTAGCTTTGCCTGGATTCCACAAGTACTTTATGAAGGCCTCGC ATGAAGAGCGCGAGCATGCCGAGAAGCTTGCCAAGTTCCAGCTGCAACGTGGAGGCCGCATTGTGCTTCA AGACATCAAGCGCCCTGAGCGCGACGACTGGGGTTGTGGACAGGATGCCATTCAGGCAGCTCTTGACCTG GAAAAACATGTCAACCAGGCCTTACTTGATCTGCACAAGGTCGCCGAGAAGCACGGTGACTCTCAGATGC AAGACTGGCTCGAGTCGCATTACCTGACTGAGCAAGTGGAGGCCATCAAGGAGCTTGCTGGTCACTTGAC CAACCTGAAGCGTGTTGGCCCTGGCTTGGGAGAATTCCAGTTCGACAAGCTCACCCTCGACGACTAGAGG GGTGCAGGCTGGACTGAGTCTTGAAACCATGGATTGACCTTTAAACCGAAGTAGATCTTATCAACCCTGA TGTGACACACAGCGCCGCCCTTGTTCACATAAATCGGAATTCATGGCAAGCCTTTGGATAATTCTATCTT TCCACTCCGAGGCACACTTTGCCGTGCCCTGGTCCCTATTTCATCTAATGGTAAAGGGATCAGTGGAGCC GTTTTTTGTTAACAAGGGCATCTTTTTTTCTTTTGTTAGACTTGTTTTCTGTAGTGACTAAATAAAAGCA TTATAAAATCAATGTGATGACTGCTTTATTTTTGTTGAGAAATAAAAACCAGATAACATCCCAACATCCT GCCCCAAGTTTTCAATATAAAGAAGTCTTTCTTCAGAGCATTGATAATCCCAGGGGGATCGGGATAATAG GAAATGGAATAAATAGTACTGAGATTAGAAACCTTGGAGATTGCCAAACAACAAGTGGCCACAGGAAAGG AGGGCACCTACTTTTTTTTTTCTTAATGTGAAGTTGGCATGTACATTAATATGTATTTAACGTCATAGGT ATAAATTAATTGCTTGATAAAAGACGAATCACACATTT

Written on January 15, 2026