![]() More recently, MinHash has been applied to the relevant problems of genome assembly, 16S rDNA gene clustering, and metagenomic sequence clustering. The MinHash technique is a form of locality-sensitive hashing that has been widely used for the detection of near-duplicate Web pages and images, but has seen limited use in genomics despite initial applications over ten years ago. to triage and cluster sequence data, assign species labels, build large guide trees, identify mis-tracked samples, and search genomic databases. Potential applications include any problem where an approximate, global distance is acceptable, e.g. ![]() This has important applications for large-scale genomic data management and emerging long-read, single-molecule sequencing technologies. Thus, sketches comprising just a few hundred values can be used to approximate the similarity of arbitrarily large datasets. ![]() Importantly, the error of this computation depends only on the size of the sketch and is independent of the genome size. Using only the sketches, which can be thousands of times smaller, the similarity of the original sequences can be rapidly estimated with bounded error. To address this, we consider the general problem of computing an approximate distance between two sequences and describe Mash, a general-purpose toolkit that utilizes the MinHash technique to reduce large sequences (or sequence sets) to compressed sketch representations. New methods are needed that can manage and help organize this scale of data. When BLAST was first published in 1990, there were less than 50 million bases of nucleotide sequence in the public archives now a single sequencing instrument can produce over 1 trillion bases per run.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |