An experimental sorting method for improving metagenomic data encoding

January 03, 2024 · Entered Twilight · 🏛 Data Compression Conference

💤 TWILIGHT: Eternal Rest
Repo abandoned since publication

Repo contents: LICENSE, MizaR.sh, Plot_channels.sh, Plot_coverage.sh, Plot_sequences.sh, README.md, RunAll.sh, Simulate.sh, VDB_MT_ALL_REF.fa.lzma

Authors: Diogo Pratas, Armando J. Pinho
arXiv ID: 2401.01786
Category: cs.IT: Information Theory
Cross-listed: q-bio.GN
Citations: 2
Venue: Data Compression Conference
Repository: https://github.com/cobilab/mizar (⭐ 1)
Last Checked: 3 months ago

Abstract
Minimizing data storage poses a significant challenge in large-scale metagenomic projects. In this paper, we present a new method for improving the encoding of FASTQ files generated by metagenomic sequencing. This method incorporates metagenomic classification followed by a recursive filter for clustering reads by DNA sequence similarity to improve the overall reference-free compression. In the results, we show an overall improvement in the compression of several datasets. As hypothesized, we show a progressive compression gain for higher coverage depth and number of identified species. Additionally, we provide an implementation that is freely available at https://github.com/cobilab/mizar and can be customized to work with other FASTQ compression tools.
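
The core idea in the abstract, placing similar reads next to each other before handing the file to a compressor, can be illustrated with a minimal Python sketch. This is not the MizaR implementation: it only groups reads by a classification label and omits the recursive similarity filter entirely. The function names (`parse_fastq`, `reorder_by_label`, `compressed_size`), the `labels` mapping from read ID to species label, and the use of `lzma` as a stand-in for a FASTQ compressor are all assumptions made for illustration.

```python
import lzma
from itertools import islice


def parse_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples from an iterable of lines."""
    it = iter(lines)
    while True:
        record = tuple(line.rstrip("\n") for line in islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def reorder_by_label(records, labels):
    """Stable sort of records by classification label (hypothetical mapping).

    Reads whose ID is missing from `labels` are placed last; the original
    order is preserved within each label.
    """
    def key(record):
        read_id = record[0][1:].split()[0]  # drop '@', keep the first token
        return (read_id not in labels, labels.get(read_id, ""))
    return sorted(records, key=key)


def compressed_size(records):
    """Size in bytes of the records after general-purpose LZMA compression."""
    text = "\n".join("\n".join(record) for record in records)
    return len(lzma.compress(text.encode("ascii")))
```

A comparison could then call `compressed_size(records)` and `compressed_size(reorder_by_label(records, labels))` on the same input. Whether the reordered file is smaller depends on the data and the compressor; the abstract reports that gains grow with coverage depth and the number of identified species.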