[cwi] finish intro

RagnarGrootKoerkamp · Jul 13, 2023 · 3cc065b
1 parent e133a12

Showing 3 changed files with 39 additions and 16 deletions.
diff --git a/paper-notes/references/APSP.org b/paper-notes/references/APSP.org
@@ -263,9 +263,10 @@ All are $O(n)$ memory.
 - Can we extend to fuzzy matching, allowing some errors?
 - Can we A* to efficiently construct a fuzzy string-graph, by only considering
   sufficiently good candidates?
-- If my reading is correct, [cite/t:@assembly-graph-fm] computes all irreducible
-  edges of $AlltoAll$ in $O(n)$ using the FM-index. If that is indeed the case,
-  that pretty much seems like the best we can wish for.
+- If my reading is correct, [cite/t:@assembly-graph-fm] computes all
+  edges of length $\geq \tau$ of $AlltoAll$ in $O(n+output)$ using the FM-index.
+  It can also directly return all irreducible edges (of length $\geq \tau$) in $O(n)$ total time, which
+  seems very nice and in a way the best we can wish for.
 
 WIP research proposal is [[../../posts/cwi-proposal.org][here]].
 

diff --git a/posts/cwi-proposal.org b/posts/cwi-proposal.org
@@ -36,10 +36,10 @@ After additional cleaning of the graph, the assembled genome is found as a set o
 through it covering all nodes (for string graphs) or edges (depending on the
 exact type of De Bruijn graph used).
 
-In the overlap graph and string graph approach, an important step is to find the longest
-suffix-prefix overlap between all pairs of reads $(S_i, S_j)$. Since a full
-alignment per pair is slow, this is often sped up using an (approximate) /filter/
-[cite:@minimap]:
+In the overlap graph and string graph approach, an important step is to find the
+longest suffix-prefix overlap between all pairs of reads $(S_i, S_j)$. Since a
+full alignment per pair is slow and long reads are often noisy, this is usually
+sped up using an (approximate) /filter/ [cite:@minimap]:
 1. BLAST stores $k$-mers per read in a hashmap [cite:@blast] and counts matching
    $k$-mers.
 2. DAligner efficiently finds matching $k$-mers between two (sets of) reads by
@@ -48,19 +48,30 @@ alignment per pair is slow, this is often sped up using an (approximate) /filter
 4. Minimap sketches the /minimizers/ in a sequence using MinHash [cite:@minimap].
 
 Alternatively, efficient datastructures can be used to compute all exact maximal
-pairwise suffix-prefix overlaps in $O(n + k^2)$ time.
-- SGA [cite:@sga] uses the FM-index for $O(n)$ [TODO: Confirm] time construction of all
+pairwise suffix-prefix overlaps:
+- Gusfield [cite:@gusfield] gives an $O(n+k^2)$ suffix-tree based algorithm.
+- SGA [cite:@sga] uses the FM-index for $O(n)$ time construction of all
   irreducible edges [cite:@assembly-graph-fm].
 - Hifiasm [cite:@hifiasm] is unclear but also seems to only use exact
   suffix-prefix matches, given that hifi reads have sufficient quality for exact overlaps.
-- [WIP]
-
-As a starting point, we take the paper [cite/t:@suffix-prefix-queries].
-Let $R$ be a set of strings $\{S_1, \dots, S_k\}$ with total length $n:= |S_1| +
-\dots + |S_k|$. A /suffix-prefix/ query asks for the longest exact overlap between a
-suffix of $S_i$ and a prefix of $S_j$.
+- [cite/t:@suffix-prefix-queries] takes a different approach and instead of
+  computing all (irreducible) pairwise overlaps up-front, it introduces multiple queries:
+  - $OneToOne(i,j)$ computes the longest overlap between $S_i$ and $S_j$ in
+    $O(\log \log k)$.
+  - $OneToAll(i)$ computes the longest overlap between $S_i$ and each other
+    $S_j$ in $O(k)$.
+  - $Report(i,l)$ reports all overlaps longer than $l$ in $O(\log n +
+    output)$[fn::This and the methods below can also be done with $\log n / \log
+    \log n$ complexity instead of $\log n$ using more advanced geometric algorithms.].
+  - $Count(i,l)$ counts the overlaps longer than $l$ in $O(\log n)$.
+  - $Top(i,K)$ returns the top $K$ longest overlaps of $S_i$ in $O(\log^2 n + K)$.
 
 * Plan
-WIP; see [[../paper-notes/references/APSP.org][here]] for some ideas.
+The plan for this internship is to improve and extend the results of this last
+paper, [cite/t:@suffix-prefix-queries]:
+1. Improve the complexity of $Top(i,K)$ to $O(n)$ construction time and
+   $O(\log n+K)$ query time.
+2. Extend $Top(i,K)$ to the $AllToAll$ setting, to return the top $K$ overlaps
+   for each $S_i$.
 
 #+print_bibliography:
diff --git a/references.bib b/references.bib
@@ -1555,3 +1555,14 @@ @Article{efficient-qgram-filters
   ISBN         = 9783540319504,
   publisher    = {Springer Berlin Heidelberg}
 }
+
+@Book{gusfield,
+  author       = {Gusfield, Dan},
+  title        = {Algorithms on Strings, Trees and Sequences},
+  year         = 1997,
+  month        = {May},
+  doi          = {10.1017/cbo9780511574931},
+  url          = {http://dx.doi.org/10.1017/CBO9780511574931},
+  ISBN         = 9780511574931,
+  publisher    = {Cambridge University Press}
+}
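
Note on the queries introduced in posts/cwi-proposal.org: the following is a minimal Python sketch of what $OneToOne(i,j)$ and $Top(i,K)$ compute, assuming exact matching on plain strings. It is a naive quadratic baseline for checking semantics, not the data structure from the cited paper, and the names `longest_overlap` and `top_overlaps` are hypothetical.

```python
from typing import List, Tuple


def longest_overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Naive check of every candidate length, longest first.
    """
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def top_overlaps(reads: List[str], i: int, K: int) -> List[Tuple[int, int]]:
    """Naive stand-in for Top(i, K).

    Returns up to K pairs (overlap_length, j), longest overlap first,
    for the suffix of reads[i] against the prefix of every other reads[j].
    Pairs with no overlap are skipped; ties are broken by smaller j.
    """
    overlaps = [
        (longest_overlap(reads[i], s), j)
        for j, s in enumerate(reads)
        if j != i
    ]
    overlaps = [(length, j) for length, j in overlaps if length > 0]
    overlaps.sort(key=lambda t: (-t[0], t[1]))
    return overlaps[:K]


if __name__ == "__main__":
    reads = ["ACGTAC", "TACGGA", "GTACCA", "CCATTT"]
    print(longest_overlap(reads[0], reads[1]))
    print(top_overlaps(reads, 0, 2))
```

A brute-force reference like this is mainly useful for testing a faster index-based implementation on small inputs.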