Estimating Sequence Similarity from Contig Sets

Ryšavý, Petr and Železný, Filip

A key task in computational biology is to determine mutual similarity of two genomic sequences. Current bio-technologies are usually not able to determine the full sequential content of a genome from biological material, and rather produce a set of large substrings (contigs) whose order and relative mutual positions within the genome are unknown. Here we design a function estimating the sequential similarity (in terms of the inverse Levenshtein distance) of two genomes, given their respective contig-sets. Our approach consists of two steps, based respectively on an adaptation of the tractable Smith-Waterman local alignment algorithm, and a problem reduction to the weighted interval scheduling problem soluble efficiently with dynamic programming. In hierarchical-clustering experiments with Influenza and Hepatitis genomes, our approach outperforms the standard baseline where only the longest contigs are compared. For high-coverage settings, it also outperforms estimates produced by the recent method [8] that avoids contig construction completely.
@inproceedings{ida2017,
  author = {Ryšavý, Petr and Železný, Filip},
  editor = {Adams, Niall and Tucker, Allan and Weston, David},
  title = {Estimating Sequence Similarity from Contig Sets},
  booktitle = {Advances in Intelligent Data Analysis XVI: 16th International Symposium, IDA 2017},
  year = {2017},
  publisher = {Springer International Publishing},
  address = {Cham},
  pages = {272--283},
  isbn = {978-3-319-68765-0},
  doi = {10.1007/978-3-319-68765-0\_23}
}