# Estimating Sequence Similarity from Contig Sets

## Ryšavý, Petr and Železný, Filip

A key task in computational biology is to determine mutual similarity of two genomic sequences. Current bio-technologies are usually not able to determine the full sequential content of a genome from biological material, and rather produce a set of large substrings (contigs) whose order and relative mutual positions within the genome are unknown. Here we design a function estimating the sequential similarity (in terms of the inverse Levenshtein distance) of two genomes, given their respective contig-sets. Our approach consists of two steps, based respectively on an adaptation of the tractable Smith-Waterman local alignment algorithm, and a problem reduction to the weighted interval scheduling problem soluble efficiently with dynamic programming. In hierarchical-clustering experiments with Influenza and Hepatitis genomes, our approach outperforms the standard baseline where only the longest contigs are compared. For high-coverage settings, it also outperforms estimates produced by the recent method [8] that avoids contig construction completely.
@inproceedings{ida2017,
author = {Ryšavý, Petr and Železný, Filip},
editor = {Adams, Niall and Tucker, Allan and Weston, David},
title = {Estimating Sequence Similarity from Contig Sets},
booktitle = {Advances in Intelligent Data Analysis XVI: 16th International Symposium, IDA 2017},
year = {2017},
publisher = {Springer International Publishing},