DNA sequence compression algorithms fall into two categories: reference-based methods, which exploit the similarity between the sequences and a reference genome, and reference-free methods, which exploit the similarity among the reads themselves.
By default, Lena uses reference-based compression. The general idea is to map the reads to the reference genome and store only the data needed to regenerate each read: a position and a list of differences from the reference.
This approach has been explored by many academic tools and yields good compression ratios, but compression and decompression are usually slow. The lossless compression algorithm developed by Enancio has been carefully optimized and fully multi-threaded in order to achieve both high compression ratios and high compression and decompression speeds.
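The "position plus list of differences" idea can be illustrated with a minimal Python sketch. It is not Lena's implementation: the function names `encode_read` and `decode_read` are hypothetical, the read is assumed to be already mapped, and only substitutions are handled (no insertions, deletions, or reverse-strand reads).

```python
from typing import List, Tuple

# (offset within the read, base observed in the read); illustrative type only
Diff = Tuple[int, str]


def encode_read(reference: str, read: str, position: int) -> Tuple[int, int, List[Diff]]:
    """Encode a read as (position, length, substitutions) against the reference.

    Assumes the read is already mapped at `position` with no indels.
    """
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    diffs = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if ref_base != base
    ]
    return position, len(read), diffs


def decode_read(reference: str, position: int, length: int, diffs: List[Diff]) -> str:
    """Rebuild the read by copying the reference window and applying the substitutions."""
    bases = list(reference[position:position + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTTAGCCGATAGGCT"
read = "CGTTAGACGA"  # matches reference[5:15] except for one base
record = encode_read(reference, read, 5)
restored = decode_read(reference, *record)
```

In this toy case the ten-base read is reduced to one position, one length, and a single substitution, and `decode_read` restores it exactly, which is what makes the scheme lossless.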
Quality scores are compacted and modeled according to their type (4, 8 or 40 distinct quality values, depending on the sequencing platform) and then compressed with an arithmetic encoder.
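As a rough illustration of why the number of distinct quality values matters, the sketch below maps the quality characters to a dense alphabet and computes the cost, in bits, of coding them under a simple order-0 model. The model Lena actually uses is not described here; the order-0 model and the helper names `compact_qualities` and `order0_bits` are assumptions made for the example.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def compact_qualities(quals: str) -> Tuple[List[int], Dict[str, int]]:
    """Map each distinct quality character to a dense index 0..k-1."""
    alphabet = {q: i for i, q in enumerate(sorted(set(quals)))}
    return [alphabet[q] for q in quals], alphabet


def order0_bits(symbols: List[int]) -> float:
    """Shannon cost, in bits, of the symbols under their own empirical distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c * math.log2(c / total) for c in counts.values())


symbols, alphabet = compact_qualities("FF::")
bits = order0_bits(symbols)  # two equally likely values: 1 bit per score
```

An arithmetic encoder driven by the same probabilities approaches this figure, so a platform that emits only 4 distinct values costs at most 2 bits per score under such a model, while 40 values can cost noticeably more.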
Lena also provides a reference-free algorithm. It takes more time and achieves a slightly lower compression ratio, but it is useful when no reference genome is available. In that case, a reference is built from the reads themselves. It is not represented as a set of sequences, but encoded as a probabilistic de Bruijn graph, which is both more compact and faster to build than a full assembly. Reads are then aligned to the graph and encoded as a starting node and a list of bifurcations. This reference-free algorithm derives from academic work performed at INRIA Rennes, published here.
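The "starting node plus list of bifurcations" encoding can be sketched as follows. This is a simplified illustration, not the published algorithm: an exact Python `set` of k-mers stands in for the probabilistic graph structure, reads are assumed to be error-free with respect to the graph, and the names `build_kmer_set`, `successors`, `encode_on_graph`, and `decode_on_graph` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

BASES = "ACGT"


def build_kmer_set(reads: Iterable[str], k: int) -> Set[str]:
    """Collect every k-mer of every read; each k-mer is a node of the graph."""
    return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}


def successors(kmers: Set[str], node: str) -> List[str]:
    """Bases that extend `node` to another k-mer present in the graph."""
    return [b for b in BASES if node[1:] + b in kmers]


def encode_on_graph(kmers: Set[str], read: str, k: int) -> Tuple[str, int, List[str]]:
    """Encode a read as (starting k-mer, length, bases chosen at bifurcations)."""
    node = read[:k]
    choices: List[str] = []
    for base in read[k:]:
        options = successors(kmers, node)
        if base not in options:
            raise ValueError("read leaves the graph; not handled in this sketch")
        if len(options) > 1:  # bifurcation: the path is ambiguous, record the branch
            choices.append(base)
        node = node[1:] + base
    return read[:k], len(read), choices


def decode_on_graph(kmers: Set[str], start: str, length: int, choices: List[str]) -> str:
    """Walk the graph from `start`, consuming one stored base per bifurcation."""
    out = list(start)
    node = start
    pending = iter(choices)
    for _ in range(length - len(start)):
        options = successors(kmers, node)
        base = options[0] if len(options) == 1 else next(pending)
        out.append(base)
        node = node[1:] + base
    return "".join(out)


reads = ["ACGTAC", "CGTTGA"]
kmers = build_kmer_set(reads, 3)
encoded = [encode_on_graph(kmers, r, 3) for r in reads]
decoded = [decode_on_graph(kmers, *e) for e in encoded]
```

Both reads pass through the node `CGT`, which has two successors, so each read stores exactly one base there; every other step follows the only available edge and costs nothing. A probabilistic structure would additionally have to cope with false-positive k-mers, which this sketch ignores.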