[cwi] finish intro
RagnarGrootKoerkamp committed Jul 13, 2023
1 parent e133a12 commit 3cc065b
Showing 3 changed files with 39 additions and 16 deletions.
7 changes: 4 additions & 3 deletions paper-notes/references/APSP.org
@@ -263,9 +263,10 @@ All are $O(n)$ memory.
- Can we extend to fuzzy matching, allowing some errors?
- Can we A* to efficiently construct a fuzzy string-graph, by only considering
sufficiently good candidates?
- If my reading is correct, [cite/t:@assembly-graph-fm] computes all irreducible
edges of $AlltoAll$ in $O(n)$ using the FM-index. If that is indeed the case,
that pretty much seems like the best we can wish for.
- If my reading is correct, [cite/t:@assembly-graph-fm] computes all
edges of length $\geq \tau$ of $AlltoAll$ in $O(n+output)$ using the FM-index.
It can also directly return all irreducible edges (of length $\geq \tau$) in $O(n)$ total time, which
seems very nice and in a way the best we can wish for.

WIP research proposal is [[../../posts/cwi-proposal.org][here]].

37 changes: 24 additions & 13 deletions posts/cwi-proposal.org
@@ -36,10 +36,10 @@ After additional cleaning of the graph, the assembled genome is found as a set o
through it covering all nodes (for string graphs) or edges (depending on the
exact type of De Bruijn graph used).

In the overlap graph and string graph approach, an important step is to find the longest
suffix-prefix overlap between all pairs of reads $(S_i, S_j)$. Since a full
alignment per pair is slow, this is often sped up using an (approximate) /filter/
[cite:@minimap]:
In the overlap graph and string graph approach, an important step is to find the
longest suffix-prefix overlap between all pairs of reads $(S_i, S_j)$. Since a
full alignment per pair is slow and long reads are often noisy, this is usually
sped up using an (approximate) /filter/ [cite:@minimap]:
1. BLAST stores $k$-mers per read in a hashmap [cite:@blast] and counts matching
$k$-mers.
2. DAligner efficiently finds matching $k$-mers between two (sets of) reads by
@@ -48,19 +48,30 @@ alignment per pair is slow, this is often sped up using an (approximate) /filter
4. Minimap sketches the /minimizers/ in a sequence using MinHash [cite:@minimap].
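
A minimal sketch of the simplest filter idea in the list above, counting shared $k$-mers between reads through a hashmap. This is not the implementation of any of the cited tools; the names `kmers`, `shared_kmer_counts`, `candidate_pairs`, and the threshold `min_shared` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def kmers(read, k):
    """Return the set of distinct k-mers of a read (empty if the read is shorter than k)."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def shared_kmer_counts(reads, k):
    """Count distinct shared k-mers for every pair of reads that shares at least one."""
    index = defaultdict(set)  # k-mer -> indices of the reads containing it
    for idx, read in enumerate(reads):
        for kmer in kmers(read, k):
            index[kmer].add(idx)
    counts = defaultdict(int)
    for ids in index.values():
        # Every pair of reads containing this k-mer gets one more shared k-mer.
        for i, j in combinations(sorted(ids), 2):
            counts[(i, j)] += 1
    return dict(counts)


def candidate_pairs(reads, k, min_shared):
    """Pairs (i, j) with i < j that pass the filter; only these would be aligned."""
    counts = shared_kmer_counts(reads, k)
    return sorted(pair for pair, c in counts.items() if c >= min_shared)
```

Only the pairs returned by `candidate_pairs` would be passed on to a full (and slow) alignment; all other pairs are skipped.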

Alternatively, efficient data structures can be used to compute all exact maximal
pairwise suffix-prefix overlaps in $O(n + k^2)$ time.
- SGA [cite:@sga] uses the FM-index for $O(n)$ [TODO: Confirm] time construction of all
pairwise suffix-prefix overlaps:
- Gusfield [cite:@gusfield] gives an $O(n+k^2)$ suffix-tree based algorithm.
- SGA [cite:@sga] uses the FM-index for $O(n)$ time construction of all
irreducible edges [cite:@assembly-graph-fm].
- Hifiasm [cite:@hifiasm] is not explicit about this, but also seems to use only exact
suffix-prefix matches, given that HiFi reads are accurate enough for exact overlaps.
- [WIP]

As a starting point, we take the paper [cite/t:@suffix-prefix-queries].
Let $R$ be a set of strings $\{S_1, \dots, S_k\}$ with total length $n:= |S_1| +
\dots + |S_k|$. A /suffix-prefix/ query asks for the longest exact overlap between a
suffix of $S_i$ and a prefix of $S_j$.
- [cite/t:@suffix-prefix-queries] takes a different approach: instead of
computing all (irreducible) pairwise overlaps up-front, it introduces multiple queries:
- $OneToOne(i,j)$ computes the longest overlap between $S_i$ and $S_j$ in
$O(\log \log k)$.
- $OneToAll(i)$ computes the longest overlap between $S_i$ and each other
$S_j$ in $O(k)$.
- $Report(i,l)$ reports all overlaps longer than $l$ in $O(\log n +
output)$[fn::This and the methods below can also be done with $\log n / \log
\log n$ complexity instead of $\log n$ using more advanced geometric algorithms.].
- $Count(i,l)$ counts the overlaps longer than $l$ in $O(\log n)$.
- $Top(i,K)$ returns the top $K$ longest overlaps of $S_i$ in $O(\log^2 n + K)$.
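
To pin down what the queries above return, here is a brute-force reference sketch. It only illustrates the semantics and has none of the stated time bounds; it is not the data structure from the cited paper. Assumptions made here: reads are 0-indexed, "overlap with $S_j$" means the longest suffix of $S_i$ matching a prefix of $S_j$ for $j \neq i$, "longer than $l$" is read as strictly longer, and ties in `top` are broken by index.

```python
def longest_overlap(a, b):
    """Length of the longest suffix of a that equals a prefix of b (0 if none)."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def one_to_one(reads, i, j):
    """OneToOne(i, j): longest suffix-prefix overlap between S_i and S_j."""
    return longest_overlap(reads[i], reads[j])


def one_to_all(reads, i):
    """OneToAll(i): longest overlap between S_i and every other S_j."""
    return {j: longest_overlap(reads[i], reads[j])
            for j in range(len(reads)) if j != i}


def report(reads, i, l):
    """Report(i, l): indices j whose longest overlap with S_i is longer than l."""
    return sorted(j for j, length in one_to_all(reads, i).items() if length > l)


def count(reads, i, l):
    """Count(i, l): number of reads reported by Report(i, l)."""
    return len(report(reads, i, l))


def top(reads, i, K):
    """Top(i, K): the K reads with the longest overlaps, as (j, length) pairs."""
    ranked = sorted(one_to_all(reads, i).items(),
                    key=lambda item: (-item[1], item[0]))
    return ranked[:K]
```

A reference like this is mainly useful for testing a faster implementation against small inputs.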

* Plan
WIP; see [[../paper-notes/references/APSP.org][here]] for some ideas.
The plan for this internship is to improve and extend the results of this last
paper, [cite/t:@suffix-prefix-queries]:
1. Improve the complexity of $Top(i,K)$ to $O(n)$ construction time and
$O(\log n+K)$ query time.
2. Extend $Top(i,K)$ to the $AllToAll$ setting, to return the top $K$ overlaps
for each $S_i$.

#+print_bibliography:
11 changes: 11 additions & 0 deletions references.bib
@@ -1555,3 +1555,14 @@ @Article{efficient-qgram-filters
ISBN = 9783540319504,
publisher = {Springer Berlin Heidelberg}
}

@Book{gusfield,
author = {Gusfield, Dan},
title = {Algorithms on Strings, Trees and Sequences},
year = 1997,
month = {May},
doi = {10.1017/cbo9780511574931},
url = {http://dx.doi.org/10.1017/CBO9780511574931},
ISBN = 9780511574931,
publisher = {Cambridge University Press}
}
