Edlib

Lightweight, super fast library for sequence alignment using edit (Levenshtein) distance.

Popular use cases: aligning DNA sequences, calculating word/text similarity.

{{ Basic code examples will be generated here. }}

Edlib is actually a C/C++ library, and this package is its Python wrapper. Python Edlib has mostly the same API as C/C++ Edlib, so feel free to check out the C/C++ Edlib docs for more code examples, API details, and an explanation of how Edlib works.

Features

  • Calculates edit distance.
  • It can find the optimal alignment path (instructions on how to transform the first sequence into the second sequence).
  • It can find just the start and/or end locations of the alignment path, which is useful when speed is more important than having the exact alignment path.
  • Supports multiple alignment methods: global (NW), prefix (SHW) and infix (HW), each of them useful for different scenarios.
  • You can extend the definition of character equality, e.g. to support wildcard characters, case-insensitive alignment, or degenerate nucleotides.
  • It can easily handle small or very large sequences, even when finding the alignment path.
  • Super fast thanks to Myers's bit-vector algorithm.
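To make the difference between the three alignment methods and the extended equality concrete, here is a minimal pure-Python reference sketch. It is not how Edlib works internally (Edlib uses Myers's bit-vector algorithm); it is a plain dynamic-programming table that is only practical for short inputs. The function name `reference_edit_distance` and its parameters are hypothetical, and treating equality pairs as symmetric is an assumption of this sketch.

```python
def reference_edit_distance(query, target, mode="NW", additional_equalities=None):
    """Slow reference edit distance; illustration only, not Edlib's algorithm.

    NW:  global, the whole query is aligned to the whole target.
    SHW: prefix, the query is aligned to the best prefix of the target.
    HW:  infix, the query is aligned to the best substring of the target.
    """
    if mode not in ("NW", "SHW", "HW"):
        raise ValueError("mode must be 'NW', 'SHW' or 'HW'")

    # Assumption of this sketch: each extra equality pair works in both directions.
    equal_pairs = set()
    for a, b in additional_equalities or []:
        equal_pairs.add((a, b))
        equal_pairs.add((b, a))

    def same(a, b):
        return a == b or (a, b) in equal_pairs

    n = len(target)
    # Row 0 is the cost of aligning an empty query.
    # In HW mode the alignment may start anywhere in the target for free.
    prev = [0] * (n + 1) if mode == "HW" else list(range(n + 1))
    for i, q in enumerate(query, start=1):
        curr = [i] + [0] * n
        for j, t in enumerate(target, start=1):
            curr[j] = min(
                prev[j] + 1,                            # query char without a partner
                curr[j - 1] + 1,                        # target char without a partner
                prev[j - 1] + (0 if same(q, t) else 1)  # match or mismatch
            )
        prev = curr

    # NW must consume the whole target; SHW and HW may stop anywhere in it.
    return prev[n] if mode == "NW" else min(prev)


print(reference_edit_distance("ACT", "GGACTGG", mode="NW"))  # whole target counts
print(reference_edit_distance("ACT", "ACTGGG", mode="SHW"))  # target tail is free
print(reference_edit_distance("ACT", "GGACTGG", mode="HW"))  # both target ends are free
```

Passing `additional_equalities=[("R", "A"), ("R", "G")]` makes `R` behave like a degenerate nucleotide that matches either `A` or `G`.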

NOTE: Alphabet length has to be <= 256 (meaning that query and target together must have <= 256 unique values).
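If you are unsure whether your inputs respect this limit, you can count the unique values before aligning. The helper below is a hypothetical convenience function, not part of Edlib, and assumes both inputs can be iterated more than once or are already materialized (strings, bytes, lists).

```python
def combined_alphabet_size(query, target):
    """Number of distinct values across both sequences (hypothetical helper)."""
    return len(set(query) | set(target))


MAX_ALPHABET = 256  # limit stated in the note above

query, target = "ACTG", "CACTRT"
size = combined_alphabet_size(query, target)
if size > MAX_ALPHABET:
    raise ValueError(f"too many unique values: {size} > {MAX_ALPHABET}")
print(size)
```

The same check works for any iterables of hashable objects, e.g. lists of integers or tuples.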

Installation

pip install edlib

API

Edlib has two functions, align() and getNiceAlignment():

align()

align(query, target, [mode], [task], [k], [additionalEqualities])

Aligns query against target with edit distance.

query and target can be strings, bytes, or any iterables of hashable objects, as long as together they contain no more than 256 unique values.

{{ Content of help(edlib.align) will be generated here. }}

getNiceAlignment()

getNiceAlignment(alignResult, query, target)

Represents alignment from align() in a visually attractive format.

{{ Content of help(edlib.getNiceAlignment) will be generated here. }}
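To give a rough idea of what rendering an alignment involves, the sketch below expands an extended CIGAR string (operations `=`, `X`, `I`, `D`) into three aligned lines. It is an independent illustration, not the code behind getNiceAlignment(): the function name `render_cigar`, the `target_start` parameter, and the symbols used in the middle line are assumptions of this sketch. It assumes `I` consumes a query character and `D` consumes a target character.

```python
import re


def render_cigar(cigar, query, target, target_start=0, gap="-"):
    """Expand an extended CIGAR into (query_line, match_line, target_line).

    Hypothetical helper for illustration only.
    """
    query_line, match_line, target_line = [], [], []
    qi, ti = 0, target_start
    for count, op in re.findall(r"(\d+)([=XID])", cigar):
        for _ in range(int(count)):
            if op in "=X":
                # Both sequences advance; '|' marks a match, '.' a mismatch.
                query_line.append(query[qi])
                target_line.append(target[ti])
                match_line.append("|" if op == "=" else ".")
                qi += 1
                ti += 1
            elif op == "I":
                # Query character with no partner in the target.
                query_line.append(query[qi])
                target_line.append(gap)
                match_line.append(gap)
                qi += 1
            else:
                # op == "D": target character with no partner in the query.
                query_line.append(gap)
                target_line.append(target[ti])
                match_line.append(gap)
                ti += 1
    return "".join(query_line), "".join(match_line), "".join(target_line)


# Alignment of "ACTG" inside "CACTRT", starting at target position 1.
print("\n".join(render_cigar("3=1I", "ACTG", "CACTRT", target_start=1)))
```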

Usage

{{ Additional usage examples will be generated here. }}

Benchmark

I ran a simple benchmark on 7 Feb 2017 (using timeit, on Python3) to get a feel for how Edlib compares to other Python libraries: editdistance and python-Levenshtein.

As input data I used pairs of DNA sequences of different lengths, where each pair has about 90% similarity.

#1: query length: 30, target length: 30
edlib.align(query, target): 1.88µs
editdistance.eval(query, target): 1.26µs
Levenshtein.distance(query, target): 0.43µs

#2: query length: 100, target length: 100
edlib.align(query, target): 3.64µs
editdistance.eval(query, target): 3.86µs
Levenshtein.distance(query, target): 14.1µs

#3: query length: 1000, target length: 1000
edlib.align(query, target): 0.047ms
editdistance.eval(query, target): 5.4ms
Levenshtein.distance(query, target): 1.9ms

#4: query length: 10000, target length: 10000
edlib.align(query, target): 0.0021s
editdistance.eval(query, target): 0.56s
Levenshtein.distance(query, target): 0.2s

#5: query length: 50000, target length: 50000
edlib.align(query, target): 0.031s
editdistance.eval(query, target): 13.8s
Levenshtein.distance(query, target): 5.0s
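The original benchmark script is not reproduced here. The following is a sketch of how a comparable measurement could be set up with timeit; the helper names `make_dna_pair` and `seconds_per_call` are hypothetical, and the input generator assumes that "about 90% similarity" means substitutions at roughly 10% of positions, which may differ from how the original data was produced.

```python
import random
import timeit


def make_dna_pair(length, similarity=0.9, seed=0):
    """Random DNA query plus a target that differs by substitutions only."""
    rng = random.Random(seed)
    query = [rng.choice("ACGT") for _ in range(length)]
    target = list(query)
    num_changes = round(length * (1 - similarity))
    for pos in rng.sample(range(length), num_changes):
        # Always pick a different base so the position really changes.
        target[pos] = rng.choice([b for b in "ACGT" if b != query[pos]])
    return "".join(query), "".join(target)


def seconds_per_call(func, number=1000):
    """Average wall-clock time of one call to func, measured with timeit."""
    return timeit.timeit(func, number=number) / number


query, target = make_dna_pair(1000)

# With the libraries installed, each one would be timed like this:
#   import edlib
#   print(seconds_per_call(lambda: edlib.align(query, target)))
# A trivial stand-in keeps this sketch runnable without extra packages:
print(seconds_per_call(lambda: query == target, number=100))
```

Absolute numbers depend heavily on hardware and library versions, so only the relative ordering on the same machine is meaningful.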

More

Check out the C/C++ Edlib docs for more information about Edlib!

Development

Check out the Edlib Python package on GitHub.