Antonio A. Ginart, Joseph Hui, Kaiyuan Zhu, Ibrahim Numanagić, T. Courtade, S. C. Sahinalp, David Tse
15 8. 2. 2018.

Optimal compressed representation of high throughput sequence data via light assembly

The most effective genomic data compression methods either assemble reads into contigs or replace them with their alignment positions on a reference genome. Such methods require significant computational resources, but faster alternatives that avoid using explicit or de novo-constructed references fail to match their performance. Here, we introduce a new reference-free compressed representation for genomic data based on light de novo assembly of reads, where each read is represented as a node in a (compact) trie. We show how to efficiently build such tries to compactly represent reads and demonstrate that, among all methods using this representation (including all de novo assembly-based methods), our method achieves the shortest possible output. We also provide a lower bound on the compression rate achievable on uniformly sampled genomic read data, which our method closely approximates. Our method significantly improves on the compression performance of alternatives without compromising speed. The increase in high-throughput sequencing (HTS) data warrants compression methods that facilitate better storage and communication. Here, Ginart et al. introduce Assembltrie, a reference-free compression tool which is guaranteed to achieve optimality for error-free reads.
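To make the idea of representing each read relative to another read more concrete, here is a minimal toy sketch in Python. It is not the Assembltrie algorithm and makes no optimality claim: it greedily links each error-free read to an earlier read whose suffix it extends, and stores only a parent index, a shift, and the new characters. The names `overlap_shift`, `encode`, `decode`, and the `min_overlap` parameter are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple

# One encoded read: (parent index or None, shift into the parent, new characters).
# This layout is an assumption made for illustration, not the paper's format.
Encoded = Tuple[Optional[int], int, str]


def overlap_shift(parent: str, child: str, min_overlap: int) -> Optional[int]:
    """Return the smallest shift s such that parent[s:] is a prefix of child
    and the overlap has at least min_overlap characters, else None."""
    for s in range(0, len(parent) - min_overlap + 1):
        if child.startswith(parent[s:]):
            return s
    return None


def encode(reads: List[str], min_overlap: int = 4) -> List[Encoded]:
    """Greedy toy encoder: attach each read to the earlier read with the
    smallest valid shift (the longest overlap for equal-length reads)."""
    encoded: List[Encoded] = []
    for i, read in enumerate(reads):
        best: Optional[Tuple[int, int]] = None
        for j in range(i):  # only earlier reads may be parents, so decoding works
            s = overlap_shift(reads[j], read, min_overlap)
            if s is not None and (best is None or s < best[1]):
                best = (j, s)
        if best is None:
            encoded.append((None, 0, read))  # root read, stored verbatim
        else:
            j, s = best
            kept = len(reads[j]) - s  # characters shared with the parent
            encoded.append((j, s, read[kept:]))
    return encoded


def decode(encoded: List[Encoded]) -> List[str]:
    """Rebuild every read from its parent's suffix plus the stored characters."""
    reads: List[str] = []
    for parent, shift, suffix in encoded:
        if parent is None:
            reads.append(suffix)
        else:
            reads.append(reads[parent][shift:] + suffix)
    return reads


if __name__ == "__main__":
    reads = ["ACGTTGCA", "TTGCAAGG", "GCAAGGCT"]
    enc = encode(reads)
    print(enc)
    print(decode(enc) == reads)
```

In this toy input, the second and third reads each overlap an earlier read, so only their trailing characters need to be stored. A real compressor would additionally need a compact encoding of the parent links and shifts, handling of sequencing errors, and an efficient overlap search, none of which this sketch attempts.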

