Sequence Folding

SeqFold.jl implements the Zuker and Stiegler (1981) dynamic programming algorithm for predicting nucleic acid secondary structures.

Showcase

julia> seq = "GGGAGGTCAGCAAACCTGAACCTGTTGAGATGTTGACGTCAGGAAACCCT";

julia> # Get the list of possible secondary structures for a sequence

julia> folded = fold(seq) # calculated at 37°C by default, Julia uses 1-based indexing
12-element Vector{SeqFold.Structure}:
    3   40   -1.3  STACK:GA/CT    
    4   39   -1.4  STACK:AGG/TGC  
    6   37   -1.5  STACK:GT/CA    
    7   36   -1.3  STACK:TC/AG    
    8   35   -1.5  STACK:CA/GT    
    9   34   -2.0  STACK:AGC/TTG  
   11   32   -1.5  STACK:CA/GT    
   12   31    2.8  INTERIOR_LOOP:4/2
   16   29   -1.3  STACK:CT/GA    
   17   28   -0.1  STACK:TGA/AGT  
   19   26   -1.0  STACK:AA/TT    
   20   25    3.5  HAIRPIN:AC/TG 

julia> dot_bracket(seq, folded) # get the dot-bracket notation
"..((.((((.((...((.((....)).)).)).)))).)).........."

julia> folded_hot = fold(seq, temp=80)
4-element Vector{SeqFold.Structure}:
    7   19   -0.4  STACK:TC/AG    
    8   18   -0.5  STACK:CA/GT    
    9   17   -0.4  STACK:AG/TC    
   10   16    3.6  HAIRPIN:GC/CC 

julia> dot_bracket(seq, folded_hot) # at a higher temperature number of secondary structures is decreased
"......((((.....))))..............................."

julia> dg(folded) # get the total free energy of the predicted secondary structures
-6.6

julia> dg(folded_hot)
2.3

julia> dg(seq, temp=4) # this method calls the computationally intensive `fold` internally
-16.2

Exported names

SeqFold.foldFunction
fold(seq; temp = 37.0) -> Vector{Structure}

Predict the minimum free energy secondary structure of a nucleic acid sequence using a dynamic programming algorithm based on the Zuker and Stiegler (1981) approach.

Implements the core of the nucleic acid folding algorithm. It calculates the thermodynamically most stable secondary structure for a given single-stranded DNA or RNA sequence. The result is a vector of SeqFold.Structure objects, each element representing a distinct secondary structure (e.g., hairpin loop, stacked pair, bulge, interior loop, multibranch loop) that contributes to the overall folded structure.

An optimization is applied where "isolated" base pairs (those not adjacent to other base pairs) are penalized with a high energy cost (1600.0 kcal/mol) to speed up computation. A base pair (i,j) is considered isolated if neither the pair (i-1, j+1) nor the pair (i+1, j-1) are complementary according to the sequence's complementarity rules. This optimization is applied regardless of sequence length.

Arguments

  • seq::AbstractString: DNA/RNA sequence
  • temp::Real: The temperature (°C) at which the folding is performed (default: 37.0).

Examples

julia> fold("CCAACCGGTTGG")
4-element Vector{SeqFold.Structure}:
    1   12   -1.8  STACK:CC/GG    
    2   11   -1.5  STACK:CA/GT    
    3   10   -1.0  STACK:AA/TT    
    4    9    3.5  HAIRPIN:AC/TG  

julia> fold("CCAACCGGTTGG", temp=70)
2-element Vector{SeqFold.Structure}:
    5   12   -1.6  STACK_DE:CC/GG 
    6   11    3.2  HAIRPIN:CG/GT  

See also

dg, dot_bracket, SeqFold.Structure

source
SeqFold.dot_bracketFunction
dot_bracket(seq, structs) -> String

Generate the dot-bracket notation representation of a predicted nucleic acid secondary structure.

Arguments

  • seq::AbstractString: The original nucleotide sequence that was folded.
  • structs::Vector{Structure}: A vector of Structure objects describing the folded structure, typically obtained from the fold function.

Examples

julia> s = "AATTACGTTAC";

julia> dot_bracket(s, fold(s))
"((.....)).."

julia> seq2 = "GGGAGGTCAGCAAACCTGAACCTGTTGAGATGTTGACGTCAGGAAACCCT";

julia> structs2 = fold(seq2)
12-element Vector{SeqFold.Structure}:
    3   40   -1.3  STACK:GA/CT    
    4   39   -1.4  STACK:AGG/TGC  
    6   37   -1.5  STACK:GT/CA    
    7   36   -1.3  STACK:TC/AG    
    8   35   -1.5  STACK:CA/GT    
    9   34   -2.0  STACK:AGC/TTG  
   11   32   -1.5  STACK:CA/GT    
   12   31    2.8  INTERIOR_LOOP:4/2
   16   29   -1.3  STACK:CT/GA    
   17   28   -0.1  STACK:TGA/AGT  
   19   26   -1.0  STACK:AA/TT    
   20   25    3.5  HAIRPIN:AC/TG   

julia> dbn = dot_bracket(seq2, structs2)
"..((.((((.((...((.((....)).)).)).)))).)).........."

julia> length(dbn) == length(seq2)
true

julia> count(==('('), dbn)  # Count base pairs
12

Notes

  • The function only considers the base pairs explicitly listed in the .ij field of each Structure.
  • It does not validate that the input structures are consistent or represent a physically possible configuration.
  • Pseudoknots (non-nested base pairs) are not handled by this simple representation and will not be correctly shown if present in the input structures.

See also

fold, SeqFold.Structure

source
SeqFold.dgFunction
dg(seq; temp = 37.0) -> Float64
dg(structures) -> Float64

Compute the minimum free energy (ΔG, kcal/mol⁻¹) of a single-stranded nucleic acid sequence at a specified temperature.

The function is a thin wrapper around the more general fold routine, which generates all energetically feasible secondary structures for the sequence. Only the sum of the free-energy contributions of the returned structures is reported, rounded to two decimal places.

Arguments

  • seq::AbstractString – the nucleotide sequence to be folded;
  • temp::Real – the temperature (°C) at which to perform the folding (default: 37.0);
  • structures::Vector{Structure} – result of fold function.

Returns

  • ΔG::Float64 – the total free energy of the predicted structure, rounded to two decimal places.

Examples

julia> seq = "GCGCGCGCGCG";

julia> dg(seq)
-3.0

julia> dg(seq, temp=20)
-4.9

julia> structures = fold(seq);

julia> dg(structures)
-3.0

See also

fold, SeqFold.dg_cache

source

Public names

SeqFold.StructureType
SeqFold.Structure(e, desc, ij)

Represents a structural element within a predicted nucleic acid secondary structure.

A Structure object encapsulates the energetic contribution (e), a descriptive string (desc) detailing the type of structural element (e.g., hairpin, stack, bulge, multibranch), and the base-pairing information (ij).

This type is primarily used internally by the folding algorithm (see fold) to represent and cache the energetics of different structural motifs. It can also be used to inspect the detailed components of a folded structure.

Fields

  • e::Float64: The free energy contribution (in kcal/mol) of this structural element. A value of -Inf typically signifies an uninitialized or invalid structure, while Inf often represents a null or discarded possibility during folding.
  • desc::String: A human-readable description of the structural element. Common descriptions include:
    • "HAIRPIN:<seq>": A hairpin loop closed by specific base pairs.
    • "STACK:<bp1>/<bp2>": A stacked base pair.
    • "BULGE:<size>": A bulge loop of a specific size.
    • "INTERIOR_LOOP:<size1>/<size2>": An interior loop with specified sizes on each side.
    • "BIFURCATION:<unpaired>n/<helices>h": A multibranch loop with a certain number of unpaired nucleotides and helices.
  • ij::Vector{Tuple{Int, Int}}: A list of base-pair indices (i, j) involved in this structural element. For simple elements like a single hairpin or stack, this typically contains one tuple (i, j) indicating the 1-based indices of the paired nucleotides. For complex elements like multibranch loops, it can contain multiple tuples representing the various stems (helices) that constitute the junction.

Examples

julia> s1 = SeqFold.Structure() # Default, uninitialized
   0    0   -Inf

julia> s2 = SeqFold.Structure(-3.2, "HAIRPIN:CG/CG", [(2, 7)])
   2    7   -3.2  HAIRPIN:CG/CG

julia> s3 = SeqFold.Structure(5.1, "BIFURCATION:2n/3h", [(1, 10), (15, 20), (25, 30)])
   1   10    5.1  BIFURCATION:2n/3h

julia> Bool(s1) # Check if structure is valid (not -Inf)
false

julia> Bool(s2)
true

julia> s2 == SeqFold.Structure(-3.2, "HAIRPIN:CG/CG", [(2, 7)]) # Structures can be compared
true

See also

fold, dg, dot_bracket

source
SeqFold.dg_cacheFunction
SeqFold.dg_cache(seq; temp = 37.0) -> Matrix{Float64}

Compute a matrix of free energy values for all possible subsequences of a nucleic acid sequence.

The resulting matrix has element [i, j] representing the minimum free energy (ΔG) of the subsequence from position i to position j. This cache enables efficient energy calculations for various subsequences without redundant computations.

Arguments

  • seq::AbstractString: The nucleic acid sequence to analyze
  • temp::Real: The temperature (°C) at which to perform the energy calculation (default: 37.0).

Returns

A Matrix{Float64} where element [i, j] contains the free energy (in kcal/mol) of the subsequence from position i to position j, inclusive. Elements where j < i contain NaN as they represent invalid ranges, and single-nucleotide subsequences also have NaN as they don't have meaningful energy values.

Examples

julia> SeqFold.dg_cache("ATCAT")
5×5 Matrix{Float64}:
 NaN   Inf   Inf   Inf    4.0
 NaN  NaN    Inf   Inf   Inf
 NaN  NaN   NaN    Inf   Inf
 NaN  NaN   NaN   NaN    Inf
 NaN  NaN   NaN   NaN   NaN

julia> SeqFold.dg_cache("ATCAT", temp=4)
5×5 Matrix{Float64}:
 NaN   Inf   Inf   Inf    3.6
 NaN  NaN    Inf   Inf   Inf
 NaN  NaN   NaN    Inf   Inf
 NaN  NaN   NaN   NaN    Inf
 NaN  NaN   NaN   NaN   NaN

Implementation

The function uses dynamic programming to build a cache of free energy values for all subsequences. The algorithm has O(n²) time and space complexity, where n is the sequence length. This approach avoids redundant calculations when multiple energy values for different subsequences are needed.

Note

The reliability of this function is questionable. For given SeqFold.dg_cache(seq) its [1, end]-th element always equals to the dg(seq) called for the same temp value, but other [i, j] values oftentimes fail to represent exact dg(seq[i:j]). The flaw may be with the original algorithm, as was shown.

See also

fold, SeqFold.tm_cache, SeqFold.gc_cache

source