[cwi] add todos
RagnarGrootKoerkamp committed Jul 13, 2023
1 parent 1a61890 commit eb787f1
Showing 2 changed files with 39 additions and 7 deletions.
31 changes: 24 additions & 7 deletions posts/cwi-proposal.org
@@ -67,8 +67,13 @@ pairwise suffix-prefix overlaps:
- $Top(i,K)$ returns the top $K$ longest overlaps of $S_i$ in $O(\log^2 n + K)$.

* Plan
The plan for this internship is to improve and extend the results of this last
paper, [cite/t:@suffix-prefix-queries]:
The plan for this internship is to improve, extend, and apply the results of this last
paper, [cite/t:@suffix-prefix-queries].

[TODO to improve text]

** Improve query performance using Heavy-Light Decomposition

- Improve the complexity of the $Count(i,l)$ and $Top(i,K)$ queries to $O(n)$ construction time and
$O(\log n)$ resp. $O(\log n+K)$ query time:

@@ -81,10 +86,22 @@ paper, [cite/t:@suffix-prefix-queries]:
- Extend $Top(i,K)$ (and other methods) to only return irreducible edges,
ideally reducing the $output$ component of the query complexity.

Furthermore, I would like to implement a fast algorithm to build the string
graph, based on the queries provided about and/or existing $O(n+k^2)$ or
$O(n+output)$ methods. Also, I would like to investigate extensions of these
exact algorithms to allow fuzzy matching in case reads are erroneous with a
small (e.g. $2\%$) error rate.
** Extend to inexact suffix-prefix overlaps that allow for read errors
I would like to investigate extensions of these exact algorithms to allow fuzzy
matching in case reads are erroneous with a small (e.g. $2\%$) error rate.

- Short reads are highly accurate, so exact suffix-prefix overlap detection was sufficient.
- Long reads are very noisy (up to $10\%$ error), so two overlapping reads can
differ by up to $20\%$ in their overlap. This has typically been worked around by using
$k$-mer/minimizer-based filters rather than data-structure-based approaches.
- New HiFi reads are up to $10$ kbp long with error rates as low as $0.1\%$ after
cleaning. This makes data-structure-based algorithms useful again. (TODO:
Investigate exactly what hifiasm does here.)
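
As a point of reference for the fuzzy-matching goal, a brute-force baseline could look roughly like the sketch below. It is not taken from the cited paper. It assumes errors are substitutions only (Hamming distance), whereas real reads also contain insertions and deletions, and the names `longest_fuzzy_overlap`, `max_error_rate`, and `min_len` are hypothetical.

```python
def longest_fuzzy_overlap(a: str, b: str,
                          max_error_rate: float = 0.02,
                          min_len: int = 1) -> int:
    """Length of the longest suffix of `a` that matches a prefix of `b`
    with at most int(max_error_rate * length) mismatches.

    Assumption: substitutions only (Hamming distance), no indels.
    Returns 0 if no overlap of length >= min_len qualifies.
    """
    # Try candidate overlap lengths from longest to shortest.
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        budget = int(max_error_rate * length)  # allowed mismatches
        suffix = a[len(a) - length:]
        mismatches = 0
        for x, y in zip(suffix, b):
            if x != y:
                mismatches += 1
                if mismatches > budget:
                    break  # too many errors for this length
        else:
            # Loop finished without exceeding the mismatch budget.
            return length
    return 0
```

This runs in quadratic time per pair of reads, so it is only useful as a correctness check for a faster data-structure-based method.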

** Implement an algorithm to build string graphs, and possibly a full assembler

I would like to implement a fast algorithm to build the string graph, based on
the queries provided above and/or existing $O(n+k^2)$ or $O(n+output)$ methods.
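
For testing a fast implementation, a naive all-pairs baseline may be handy. The sketch below builds only the overlap graph from exact suffix-prefix overlaps. Removing reducible (transitive) edges to obtain the actual string graph is left out, and the function names and the `min_overlap` parameter are hypothetical.

```python
def longest_exact_overlap(a: str, b: str) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for length in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def overlap_graph(reads: list[str], min_overlap: int = 1) -> dict[tuple[int, int], int]:
    """Map each ordered pair (i, j), i != j, to its longest exact overlap.

    Brute force over all ordered pairs; pairs whose overlap is shorter
    than `min_overlap` are omitted. No transitive reduction is done.
    """
    edges = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            length = longest_exact_overlap(a, b)
            if length >= min_overlap:
                edges[(i, j)] = length
    return edges
```

For example, `overlap_graph(["ACGTAC", "TACGGA", "GGATTT"], min_overlap=3)` keeps only the edges from read 0 to read 1 and from read 1 to read 2.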


#+print_bibliography:
15 changes: 15 additions & 0 deletions references.bib
@@ -1566,3 +1566,18 @@ @Article{gusfield
ISBN = 9780511574931,
publisher = {Cambridge University Press}
}

@Article{rank-select-revisited,
author = {Mäkinen, Veli and Navarro, Gonzalo},
title = {Rank and select revisited and extended},
journal = {Theoretical Computer Science},
year = 2007,
volume = 387,
number = 3,
month = {Nov},
pages = {332--347},
issn = {0304-3975},
doi = {10.1016/j.tcs.2007.07.013},
url = {http://dx.doi.org/10.1016/j.tcs.2007.07.013},
publisher = {Elsevier BV}
}
