100% lossless compression
Compression with Lena is completely lossless: it preserves all the information, byte for byte. The format embeds two checksums: one to validate that the decompressed file is identical to the original data, and one to detect data corruption that occurred during transmission or storage. If corruption is detected, the file format pinpoints the location in the file where the error occurred.
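As an illustration of how two checksums can play these two roles, here is a minimal Python sketch. It is not Lena's actual file format, which is not described here: the block size, the use of CRC32, zlib as the codec, and the names pack, unpack and find_corrupt_blocks are all assumptions made for the example.

```python
import zlib

BLOCK_SIZE = 1 << 16  # hypothetical block size, chosen for the example


def pack(data, block_size=BLOCK_SIZE):
    """Compress data block by block.

    Returns (blocks, original_crc), where each block is a pair
    (compressed_bytes, crc_of_compressed_bytes).
    """
    blocks = []
    for start in range(0, len(data), block_size):
        comp = zlib.compress(data[start:start + block_size])
        # Checksum of the stored bytes: detects corruption in transit or at rest.
        blocks.append((comp, zlib.crc32(comp)))
    # Checksum of the original bytes: validates the decompressed result.
    return blocks, zlib.crc32(data)


def find_corrupt_blocks(blocks):
    """Return the indices of blocks whose stored bytes no longer match their checksum."""
    return [i for i, (comp, crc) in enumerate(blocks) if zlib.crc32(comp) != crc]


def unpack(blocks, original_crc):
    """Decompress all blocks, checking both checksums."""
    bad = find_corrupt_blocks(blocks)
    if bad:
        raise ValueError(f"corruption detected in block(s) {bad}")
    data = b"".join(zlib.decompress(comp) for comp, _ in blocks)
    if zlib.crc32(data) != original_crc:
        raise ValueError("decompressed data does not match the original")
    return data
```

Keeping one checksum per stored block is what makes it possible to report where the damage is, rather than only that the file is damaged.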
High compression ratio
On data generated by the latest NovaSeq sequencer, Lena typically produces files more than 5 times smaller than the gzipped equivalent. On other sequencers, the compression ratio depends mainly on how the quality values are encoded in the FASTQ file. With a HiSeq X Ten, the typical compression ratio is about 3x relative to gzipped data.
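To make these ratios concrete, the short Python sketch below converts a ratio relative to gzip into an estimated file size and a storage saving. The 100 GB input is an illustrative figure, not a measurement, and the function names are hypothetical.

```python
def size_after_lena(gzipped_size, ratio_vs_gzip):
    """Estimated size of the compressed file, in the same unit as gzipped_size."""
    return gzipped_size / ratio_vs_gzip


def saving_vs_gzip(ratio_vs_gzip):
    """Fraction of storage saved relative to the gzipped file."""
    return 1 - 1 / ratio_vs_gzip


# Illustrative: 100 GB of gzipped FASTQ at the two ratios quoted above.
print(size_after_lena(100, 5))       # 20.0
print(round(saving_vs_gzip(3), 2))   # 0.67
```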
The code has been thoroughly optimized to provide very good compression and decompression speed without sacrificing the compression ratio. Speed is crucial for easy integration into an existing workflow, and for faster file transfer.
Compression and decompression are between 2.5 and 5 times faster than gzip, even when measured against pigz, its multi-threaded implementation. Actual execution time may be limited by disk IO speed. However, when using a fast SSD, or when decompressing on the fly in memory to feed a decompressed data stream to analysis software, decompression reaches an impressive 2.1 GB/s on an 8-CPU system.
Direct use by bioinformatics software
When a compressed fastq.lena file is needed for a computation, e.g. for mapping with BWA, the lena file does not have to be decompressed to disk first. Instead, it can be decompressed on the fly and fed directly to the analysis software that expects FASTQ input. This greatly reduces disk reads and writes and achieves much better performance.
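The general pattern of decompressing on the fly into another program's standard input can be sketched in Python as follows. This is an illustration under stated assumptions, not Lena's interface: the gzip module stands in for the Lena decompressor, whose command line and API are not described here, and CONSUMER is a hypothetical stand-in for an analysis tool that reads FASTQ on stdin (it only counts reads).

```python
import gzip
import subprocess
import sys

# Hypothetical stand-in for an analysis tool reading FASTQ on stdin:
# it counts reads, assuming 4 lines per record.
CONSUMER = [sys.executable, "-c",
            "import sys; print(sum(1 for _ in sys.stdin.buffer) // 4)"]


def feed_decompressed(compressed_path, consumer_cmd, chunk_size=1 << 16):
    """Decompress a file chunk by chunk into the consumer's stdin.

    No decompressed copy is ever written to disk. Returns the consumer's
    stdout as text. This simple loop assumes the consumer prints little
    output before its input ends; a tool that writes a lot to stdout
    while reading would need its output drained concurrently.
    """
    proc = subprocess.Popen(consumer_cmd,
                            stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE)
    with gzip.open(compressed_path, "rb") as src:  # stand-in decompressor
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            proc.stdin.write(chunk)
    proc.stdin.close()  # signal end of input to the consumer
    output = proc.stdout.read()
    proc.wait()
    return output.decode().strip()
```

The only data that touches the disk is the compressed input; the decompressed stream lives in the pipe between the two processes.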
Streaming ready / Cloud friendly
When files are stored on a remote server or in the cloud, they sometimes have to be transferred to another server for analysis. Depending on the analysis software used, it is not always necessary to wait for the transfer to complete before starting the analysis. With the lena format, a file can be used as soon as it starts arriving, i.e. in a streaming fashion. The transfer time may then be completely hidden, so that the user only sees the analysis computation time. When the cloud is used for storage, Lena can also upload to and download from it in the same streaming fashion.
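The idea of consuming a compressed file while it is still arriving can be sketched in Python as below. This is only an illustration: gzip (through zlib) stands in for the lena format, the chunked helper simulates a network transfer, and all function names are hypothetical. The sketch also assumes that the FASTQ data ends with a newline.

```python
import gzip
import zlib


def chunked(blob, size=64):
    """Simulate a transfer: hand out the compressed bytes in small pieces."""
    for i in range(0, len(blob), size):
        yield blob[i:i + size]


def decompressed_pieces(chunks):
    """Decompress incrementally, one arriving chunk at a time."""
    # wbits=31 selects the gzip container; gzip is only a stand-in codec here.
    decomp = zlib.decompressobj(wbits=31)
    for chunk in chunks:
        yield decomp.decompress(chunk)
    yield decomp.flush()


def records_from_chunks(chunks):
    """Yield FASTQ records (tuples of 4 lines) as soon as they are complete."""
    pending = b""  # decompressed bytes that do not yet form a full line
    lines = []     # complete lines of the record being assembled
    for piece in decompressed_pieces(chunks):
        pending += piece
        *complete, pending = pending.split(b"\n")
        for line in complete:
            lines.append(line)
            if len(lines) == 4:
                yield tuple(lines)
                lines = []


# Usage: records become available before the whole file has been received.
data = b"".join(b"@r%d\nACGT\n+\nIIII\n" % i for i in range(1000))
first = next(records_from_chunks(chunked(gzip.compress(data))))
print(first[0])
```

Because records_from_chunks is a generator, the analysis can start on the first records while later chunks are still in transit, which is how the transfer time ends up hidden behind the computation.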