diff --git a/paper-notes/references/APSP.org b/paper-notes/references/APSP.org
index 460357c..a74cb10 100644
--- a/paper-notes/references/APSP.org
+++ b/paper-notes/references/APSP.org
@@ -263,9 +263,10 @@ All are $O(n)$ memory.
 - Can we extend to fuzzy matching, allowing some errors?
 - Can we A* to efficiently construct a fuzzy string-graph, by only considering
   sufficiently good candidates?
-- If my reading is correct, [cite/t:@assembly-graph-fm] computes all irreducible
-  edges of $AlltoAll$ in $O(n)$ using the FM-index. If that is indeed the case,
-  that pretty much seems like the best we can wish for.
+- If my reading is correct, [cite/t:@assembly-graph-fm] computes all
+  edges of length $\geq \tau$ of $AlltoAll$ in $O(n+output)$ using the FM-index.
+  It can also directly return all irreducible edges (of length $\geq \tau$) in $O(n)$ total time, which
+  seems very nice and in a way the best we can wish for.
 
 WIP research proposal is [[../../posts/cwi-proposal.org][here]].
 
diff --git a/posts/cwi-proposal.org b/posts/cwi-proposal.org
index 330a161..5a2473a 100644
--- a/posts/cwi-proposal.org
+++ b/posts/cwi-proposal.org
@@ -36,10 +36,10 @@ After additional cleaning of the graph, the assembled genome is found as a set o
 through it covering all nodes (for string graphs) or edges (depending on the
 exact type of De Bruijn graph used).
 
-In the overlap graph and string graph approach, an important step is to find the longest
-suffix-prefix overlap between all pairs of reads $(S_i, S_j)$. Since a full
-alignment per pair is slow, this is often sped up using an (approximate) /filter/
-[cite:@minimap]:
+In the overlap graph and string graph approach, an important step is to find the
+longest suffix-prefix overlap between all pairs of reads $(S_i, S_j)$. Since a
+full alignment per pair is slow and long reads are often noisy, this is usually
+sped up using an (approximate) /filter/ [cite:@minimap]:
 1. BLAST stores $k$-mers per read in a hashmap [cite:@blast] and counts matching
    $k$-mers.
 2. DAligner efficiently finds matching $k$-mers between two (sets of) reads by
@@ -48,19 +48,30 @@ alignment per pair is slow, this is often sped up using an (approximate) /filter
 4. Minimap sketches the /minimizers/ in a sequence using MinHash [cite:@minimap].
 
 Alternatively, efficient datastructures can be used to compute all exact maximal
-pairwise suffix-prefix overlaps in $O(n + k^2)$ time.
-- SGA [cite:@sga] uses the FM-index for $O(n)$ [TODO: Confirm] time construction of all
+pairwise suffix-prefix overlaps:
+- Gusfield [cite:@gusfield] gives an $O(n+k^2)$ suffix-tree based algorithm.
+- SGA [cite:@sga] uses the FM-index for $O(n)$ time construction of all
   irreducible edges [cite:@assembly-graph-fm].
 - Hifiasm [cite:@hifiasm] is unclear but also seems to only use exact
   sufix-prefix matches, given that hifi reads have sufficient quality for exact overlaps.
-- [WIP]
-
-As a starting point, we take the paper [cite/t:@suffix-prefix-queries].
-Let $R$ be a set of strings $\{S_1, \dots, S_k\}$ with total length $n:= |S_1| +
-\dots + |S_k|$. A /suffix-prefix/ query asks for the longest exact overlap between a
-suffix of $S_i$ and a prefix of $S_j$.
+- [cite/t:@suffix-prefix-queries] takes a different approach and instead of
+  computing all (irreducible) pairwise overlaps up-front, it introduces multiple queries:
+  - $OneToOne(i,j)$ computes the longest overlap between $S_i$ and $S_j$ in
+    $O(\log \log k)$.
+  - $OneToAll(i)$ computes the longest overlap between $S_i$ and each other
+    $S_j$ in $O(k)$.
+  - $Report(i,l)$ reports all overlaps longer than $l$ in $O(\log n +
+    output)$[fn::This and the methods below can also be done with $\log n / \log
+    \log n$ complexity instead of $\log n$ using more advanced geometric algorithms.].
+  - $Count(i,l)$ counts the overlaps longer than $l$ in $O(\log n)$.
+  - $Top(i,K)$ returns the top $K$ longest overlaps of $S_i$ in $O(\log^2 n + K)$.
 
 * Plan
-WIP; see [[../paper-notes/references/APSP.org][here]] for some ideas.
+The plan for this internship is to improve and extend the results of this last
+paper, [cite/t:@suffix-prefix-queries]:
+1. Improve the complexity of $Top(i,K)$ to $O(n)$ construction time and
+   $O(\log n+K)$ query time.
+2. Extend $Top(i,K)$ to the $AllToAll$ setting, to return the top $K$ overlaps
+   for each $S_i$.
 
 #+print_bibliography:
diff --git a/references.bib b/references.bib
index 6fdb824..4e6fbce 100644
--- a/references.bib
+++ b/references.bib
@@ -1555,3 +1555,14 @@ @Article{efficient-qgram-filters
   ISBN = 9783540319504,
   publisher = {Springer Berlin Heidelberg}
 }
+
+@Article{gusfield,
+  author = {Gusfield, Dan},
+  title = {Algorithms on Strings, Trees and Sequences},
+  year = 1997,
+  month = {May},
+  doi = {10.1017/cbo9780511574931},
+  url = {http://dx.doi.org/10.1017/CBO9780511574931},
+  ISBN = 9780511574931,
+  publisher = {Cambridge University Press}
+}
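
Note on the queries added in posts/cwi-proposal.org: a brute-force reference can help pin down what $OneToOne$, $OneToAll$, $Report$, $Count$ and $Top$ are meant to return before any index is built. The sketch below is not the data structure from [cite/t:@suffix-prefix-queries] and has none of its complexity bounds; it compares strings directly. All function names are hypothetical, indices are 0-based, "longer than $l$" is read as strictly greater, and $Top$ is assumed to skip empty overlaps and break ties by read index.

```python
from typing import Dict, List, Tuple


def one_to_one(reads: List[str], i: int, j: int) -> int:
    """Length of the longest suffix of reads[i] that equals a prefix of reads[j]."""
    a, b = reads[i], reads[j]
    # Try candidate lengths from longest to shortest; the first hit is the answer.
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def one_to_all(reads: List[str], i: int) -> Dict[int, int]:
    """Longest overlap of reads[i] with every other read, keyed by read index."""
    return {j: one_to_one(reads, i, j) for j in range(len(reads)) if j != i}


def report(reads: List[str], i: int, l: int) -> List[Tuple[int, int]]:
    """All (j, overlap) pairs whose overlap with reads[i] is strictly longer than l."""
    return [(j, ov) for j, ov in sorted(one_to_all(reads, i).items()) if ov > l]


def count(reads: List[str], i: int, l: int) -> int:
    """Number of reads whose overlap with reads[i] is strictly longer than l."""
    return len(report(reads, i, l))


def top(reads: List[str], i: int, k: int) -> List[Tuple[int, int]]:
    """The k longest non-empty overlaps of reads[i], longest first."""
    positive = [(j, ov) for j, ov in one_to_all(reads, i).items() if ov > 0]
    # Sort by decreasing overlap length, then by increasing read index.
    positive.sort(key=lambda pair: (-pair[1], pair[0]))
    return positive[:k]


if __name__ == "__main__":
    example_reads = ["ACGTAC", "TACGGA", "GGACGT", "CGTACG"]
    print(one_to_all(example_reads, 0))
    print(top(example_reads, 0, 1))
```

Such a baseline costs time quadratic in the read length per pair, so it is only useful for checking a faster implementation on small inputs.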