๐
๐
The Cartographer
On the Maximal Independent Sets of $k$-mers with the Edit Distance
March 20, 2023 ยท Entered Twilight ยท ๐ ACM International Conference on Bioinformatics, Computational Biology and Biomedicine
Repo contents: .gitignore, findMIS.cpp, findMISHamming.cpp, readme.md
Authors
Leran Ma, Ke Chen, Mingfu Shao
arXiv ID
2303.10926
Category
cs.DS: Data Structures & Algorithms
Citations
2
Venue
ACM International Conference on Bioinformatics, Computational Biology and Biomedicine
Repository
https://github.com/Shao-Group/kmerspace
Last Checked
3 months ago
Abstract
In computational biology, $k$-mers and edit distance are fundamental concepts. However, little is known about the metric space of all $k$-mers equipped with the edit distance. In this work, we explore the structure of the $k$-mer space by studying its maximal independent sets (MISs). An MIS is a sparse sketch of all $k$-mers with nice theoretical properties, and therefore admits critical applications in clustering, indexing, hashing, and sketching large-scale sequencing data, particularly those with high error-rates. Finding an MIS is a challenging problem, as the size of a $k$-mer space grows geometrically with respect to $k$. We propose three algorithms for this problem. The first and the most intuitive one uses a greedy strategy. The second method implements two techniques to avoid redundant comparisons by taking advantage of the locality-property of the $k$-mer space and the estimated bounds on the edit distance. The last algorithm avoids expensive calculations of the edit distance by translating the edit distance into the shortest path in a specifically designed graph. These algorithms are implemented and the calculated MISs of $k$-mer spaces and their statistical properties are reported and analyzed for $k$ up to 15. Source code is freely available at https://github.com/Shao-Group/kmerspace .
Community Contributions
Found the code? Know the venue? Think something is wrong? Let us know!
๐ Similar Papers
In the same crypt โ Data Structures & Algorithms
R.I.P.
๐ป
Ghosted
Route Planning in Transportation Networks
R.I.P.
๐ป
Ghosted
Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration
R.I.P.
๐ป
Ghosted
Hierarchical Clustering: Objective Functions and Algorithms
R.I.P.
๐ป
Ghosted
Graph Isomorphism in Quasipolynomial Time
๐
๐
The Cartographer