Sequence Folding
SeqFold.jl implements the Zuker and Stiegler (1981) dynamic programming algorithm for predicting nucleic acid secondary structures.
Showcase
julia> seq = "GGGAGGTCAGCAAACCTGAACCTGTTGAGATGTTGACGTCAGGAAACCCT";
julia> # Get the list of possible secondary structures for a sequence
julia> folded = fold(seq) # calculated at 37°C by default, Julia uses 1-based indexing
12-element Vector{SeqFold.Structure}:
3 40 -1.3 STACK:GA/CT
4 39 -1.4 STACK:AGG/TGC
6 37 -1.5 STACK:GT/CA
7 36 -1.3 STACK:TC/AG
8 35 -1.5 STACK:CA/GT
9 34 -2.0 STACK:AGC/TTG
11 32 -1.5 STACK:CA/GT
12 31 2.8 INTERIOR_LOOP:4/2
16 29 -1.3 STACK:CT/GA
17 28 -0.1 STACK:TGA/AGT
19 26 -1.0 STACK:AA/TT
20 25 3.5 HAIRPIN:AC/TG
julia> dot_bracket(seq, folded) # get the dot-bracket notation
"..((.((((.((...((.((....)).)).)).)))).)).........."
julia> folded_hot = fold(seq, temp=80)
4-element Vector{SeqFold.Structure}:
7 19 -0.4 STACK:TC/AG
8 18 -0.5 STACK:CA/GT
9 17 -0.4 STACK:AG/TC
10 16 3.6 HAIRPIN:GC/CC
julia> dot_bracket(seq, folded_hot) # at a higher temperature number of secondary structures is decreased
"......((((.....))))..............................."
julia> dg(folded) # get the total free energy of the predicted secondary structures
-6.6
julia> dg(folded_hot)
2.3
julia> dg(seq, temp=4) # this method calls the computationally intensive `fold` internally
-16.2Exported names
SeqFold.fold — Functionfold(seq; temp = 37.0) -> Vector{Structure}Predict the minimum free energy secondary structure of a nucleic acid sequence using a dynamic programming algorithm based on the Zuker and Stiegler (1981) approach.
Implements the core of the nucleic acid folding algorithm. It calculates the thermodynamically most stable secondary structure for a given single-stranded DNA or RNA sequence. The result is a vector of SeqFold.Structure objects, each element representing a distinct secondary structure (e.g., hairpin loop, stacked pair, bulge, interior loop, multibranch loop) that contributes to the overall folded structure.
An optimization is applied where "isolated" base pairs (those not adjacent to other base pairs) are penalized with a high energy cost (1600.0 kcal/mol) to speed up computation. A base pair (i,j) is considered isolated if neither the pair (i-1, j+1) nor the pair (i+1, j-1) are complementary according to the sequence's complementarity rules. This optimization is applied regardless of sequence length.
Arguments
seq::AbstractString: DNA/RNA sequencetemp::Real: The temperature (°C) at which the folding is performed (default:37.0).
Examples
julia> fold("CCAACCGGTTGG")
4-element Vector{SeqFold.Structure}:
1 12 -1.8 STACK:CC/GG
2 11 -1.5 STACK:CA/GT
3 10 -1.0 STACK:AA/TT
4 9 3.5 HAIRPIN:AC/TG
julia> fold("CCAACCGGTTGG", temp=70)
2-element Vector{SeqFold.Structure}:
5 12 -1.6 STACK_DE:CC/GG
6 11 3.2 HAIRPIN:CG/GT See also
SeqFold.dot_bracket — Functiondot_bracket(seq, structs) -> StringGenerate the dot-bracket notation representation of a predicted nucleic acid secondary structure.
Arguments
seq::AbstractString: The original nucleotide sequence that was folded.structs::Vector{Structure}: A vector ofStructureobjects describing the folded structure, typically obtained from thefoldfunction.
Examples
julia> s = "AATTACGTTAC";
julia> dot_bracket(s, fold(s))
"((.....)).."
julia> seq2 = "GGGAGGTCAGCAAACCTGAACCTGTTGAGATGTTGACGTCAGGAAACCCT";
julia> structs2 = fold(seq2)
12-element Vector{SeqFold.Structure}:
3 40 -1.3 STACK:GA/CT
4 39 -1.4 STACK:AGG/TGC
6 37 -1.5 STACK:GT/CA
7 36 -1.3 STACK:TC/AG
8 35 -1.5 STACK:CA/GT
9 34 -2.0 STACK:AGC/TTG
11 32 -1.5 STACK:CA/GT
12 31 2.8 INTERIOR_LOOP:4/2
16 29 -1.3 STACK:CT/GA
17 28 -0.1 STACK:TGA/AGT
19 26 -1.0 STACK:AA/TT
20 25 3.5 HAIRPIN:AC/TG
julia> dbn = dot_bracket(seq2, structs2)
"..((.((((.((...((.((....)).)).)).)))).)).........."
julia> length(dbn) == length(seq2)
true
julia> count(==('('), dbn) # Count base pairs
12Notes
- The function only considers the base pairs explicitly listed in the
.ijfield of eachStructure. - It does not validate that the input structures are consistent or represent a physically possible configuration.
- Pseudoknots (non-nested base pairs) are not handled by this simple representation and will not be correctly shown if present in the input structures.
See also
SeqFold.dg — Functiondg(seq; temp = 37.0) -> Float64
dg(structures) -> Float64Compute the minimum free energy (ΔG, kcal/mol⁻¹) of a single-stranded nucleic acid sequence at a specified temperature.
The function is a thin wrapper around the more general fold routine, which generates all energetically feasible secondary structures for the sequence. Only the sum of the free-energy contributions of the returned structures is reported, rounded to two decimal places.
Arguments
seq::AbstractString– the nucleotide sequence to be folded;temp::Real– the temperature (°C) at which to perform the folding (default:37.0);structures::Vector{Structure}– result offoldfunction.
Returns
ΔG::Float64– the total free energy of the predicted structure, rounded to two decimal places.
Examples
julia> seq = "GCGCGCGCGCG";
julia> dg(seq)
-3.0
julia> dg(seq, temp=20)
-4.9
julia> structures = fold(seq);
julia> dg(structures)
-3.0See also
Public names
SeqFold.Structure — TypeSeqFold.Structure(e, desc, ij)Represents a structural element within a predicted nucleic acid secondary structure.
A Structure object encapsulates the energetic contribution (e), a descriptive string (desc) detailing the type of structural element (e.g., hairpin, stack, bulge, multibranch), and the base-pairing information (ij).
This type is primarily used internally by the folding algorithm (see fold) to represent and cache the energetics of different structural motifs. It can also be used to inspect the detailed components of a folded structure.
Fields
e::Float64: The free energy contribution (in kcal/mol) of this structural element. A value of-Inftypically signifies an uninitialized or invalid structure, whileInfoften represents a null or discarded possibility during folding.desc::String: A human-readable description of the structural element. Common descriptions include:"HAIRPIN:<seq>": A hairpin loop closed by specific base pairs."STACK:<bp1>/<bp2>": A stacked base pair."BULGE:<size>": A bulge loop of a specific size."INTERIOR_LOOP:<size1>/<size2>": An interior loop with specified sizes on each side."BIFURCATION:<unpaired>n/<helices>h": A multibranch loop with a certain number of unpaired nucleotides and helices.
ij::Vector{Tuple{Int, Int}}: A list of base-pair indices(i, j)involved in this structural element. For simple elements like a single hairpin or stack, this typically contains one tuple(i, j)indicating the 1-based indices of the paired nucleotides. For complex elements like multibranch loops, it can contain multiple tuples representing the various stems (helices) that constitute the junction.
Examples
julia> s1 = SeqFold.Structure() # Default, uninitialized
0 0 -Inf
julia> s2 = SeqFold.Structure(-3.2, "HAIRPIN:CG/CG", [(2, 7)])
2 7 -3.2 HAIRPIN:CG/CG
julia> s3 = SeqFold.Structure(5.1, "BIFURCATION:2n/3h", [(1, 10), (15, 20), (25, 30)])
1 10 5.1 BIFURCATION:2n/3h
julia> Bool(s1) # Check if structure is valid (not -Inf)
false
julia> Bool(s2)
true
julia> s2 == SeqFold.Structure(-3.2, "HAIRPIN:CG/CG", [(2, 7)]) # Structures can be compared
trueSee also
SeqFold.dg_cache — FunctionSeqFold.dg_cache(seq; temp = 37.0) -> Matrix{Float64}Compute a matrix of free energy values for all possible subsequences of a nucleic acid sequence.
The resulting matrix has element [i, j] representing the minimum free energy (ΔG) of the subsequence from position i to position j. This cache enables efficient energy calculations for various subsequences without redundant computations.
Arguments
seq::AbstractString: The nucleic acid sequence to analyzetemp::Real: The temperature (°C) at which to perform the energy calculation (default:37.0).
Returns
A Matrix{Float64} where element [i, j] contains the free energy (in kcal/mol) of the subsequence from position i to position j, inclusive. Elements where j < i contain NaN as they represent invalid ranges, and single-nucleotide subsequences also have NaN as they don't have meaningful energy values.
Examples
julia> SeqFold.dg_cache("ATCAT")
5×5 Matrix{Float64}:
NaN Inf Inf Inf 4.0
NaN NaN Inf Inf Inf
NaN NaN NaN Inf Inf
NaN NaN NaN NaN Inf
NaN NaN NaN NaN NaN
julia> SeqFold.dg_cache("ATCAT", temp=4)
5×5 Matrix{Float64}:
NaN Inf Inf Inf 3.6
NaN NaN Inf Inf Inf
NaN NaN NaN Inf Inf
NaN NaN NaN NaN Inf
NaN NaN NaN NaN NaNImplementation
The function uses dynamic programming to build a cache of free energy values for all subsequences. The algorithm has O(n²) time and space complexity, where n is the sequence length. This approach avoids redundant calculations when multiple energy values for different subsequences are needed.
The reliability of this function is questionable. For given SeqFold.dg_cache(seq) its [1, end]-th element always equals to the dg(seq) called for the same temp value, but other [i, j] values oftentimes fail to represent exact dg(seq[i:j]). The flaw may be with the original algorithm, as was shown.
See also