Extra large kmer #22

Open
apaytuvi opened this issue Oct 21, 2022 · 6 comments
Labels
enhancement New feature or request

Comments

@apaytuvi

Dear cuttlefish authors,

Thank you for this useful tool. I have a large database of genomes and want to reduce its redundancy so that mapping against it is faster and uses less RAM. I tried cuttlefish and it is useful, but I would like a larger k-mer length, e.g. 1000. Why? Long-read technology requires long sequences for correct mapping, but with low k-mer lengths such as 127, most sequences stay at about that size, which is clearly not enough for long-read mapping. Do you have a suggestion for that?

Thanks,
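
One way to check the observation above on an actual output file is to tally sequence lengths. The following is a minimal sketch, assuming the sequences are stored in FASTA format; `fasta_length_summary` is a hypothetical helper name, not part of cuttlefish.

```shell
# Hypothetical helper (not part of cuttlefish): summarize sequence lengths in a FASTA file.
#   $1: path to a FASTA file (sequences may be wrapped over several lines)
#   $2: the k-mer length that was used
# Prints the number of sequences, how many are no longer than k, and the longest length.
fasta_length_summary() {
    awk -v k="$2" '
        function tally() {
            n++
            if (len <= k) short++
            if (len > longest) longest = len
        }
        /^>/ { if (seen) tally(); seen = 1; len = 0; next }   # header line starts a new record
        { len += length($0) }                                 # sequence line adds to the current record
        END {
            if (seen) tally()
            printf "sequences=%d at_most_k=%d longest=%d\n", n, short, longest
        }
    ' "$1"
}
```

For example, `fasta_length_summary output.fa 127` (with `output.fa` as a placeholder file name) would report how many sequences are no longer than 127 bases.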

@jamshed jamshed added the enhancement New feature or request label Oct 24, 2022
@jamshed
Member

jamshed commented Oct 24, 2022

Hi @apaytuvi,

Thanks for using cuttlefish! I'll incorporate support for extra-large k-mers into cuttlefish, but that might take a little time. In the meantime, I can try posting a hack in a separate branch for you to try out. Would that work for you?

Regards.

@apaytuvi
Author

That would be great. Thank you so much!

@jamshed
Member

jamshed commented Oct 30, 2022

Hi @apaytuvi: we've found some bug(s) in the initial k-mer enumeration phase of cuttlefish that only occur with huge k-values (e.g. k >= 1000), hence the delay! I'll get back to this once we have addressed the issue.

@apaytuvi
Author

apaytuvi commented Oct 31, 2022 via email

@apaytuvi
Author

It seems the bug has been solved and this feature should be available. Could you please confirm that, @jamshed? Thanks a lot!

jamshed added a commit that referenced this issue Dec 5, 2022
support req. #22
@jamshed
Member

jamshed commented Dec 5, 2022

Hi @apaytuvi: sorry for the delay in response!

I've pushed a new branch, extra-large-k, with the required support. This needs to be compiled from source, as instructed here. But the cmake line needs to be replaced with the following:

cmake -DINSTANCE_COUNT=256 -DCMAKE_INSTALL_PREFIX=../ ..

Currently this supports k up to 1023. Let me know if you want to try an even larger k; we can extend the range for that.
But note that the installation takes quite some time with large k (you may use make -j install to make it faster with more threads). The execution is also quite time- and disk-heavy, specifically the initial (k+1)-mer and k-mer enumeration stages.

Let me know if you could test it successfully!
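
The build steps described above can be put together roughly as follows. This is only a sketch: the repository URL is an assumption (the thread does not state it), the clone, `mkdir`, and `cd` steps assume a standard out-of-source CMake build, and `run`, `build_extra_large_k`, `DRY_RUN`, and `JOBS` are hypothetical names introduced for illustration. Only the branch name, the `cmake` line, and `make -j install` come from the comment itself.

```shell
# Sketch only. Assumed: the repository URL and the generic clone/mkdir/cd steps.
REPO_URL="${REPO_URL:-https://github.com/COMBINE-lab/cuttlefish.git}"
BRANCH="extra-large-k"   # branch name given in the comment above
JOBS="${JOBS:-4}"        # number of parallel make jobs; adjust to your machine

# Helper: with DRY_RUN=1, print the command instead of executing it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# The body runs in a subshell, so `cd` and `set -e` do not affect the calling shell.
build_extra_large_k() (
    set -e
    run git clone --branch "$BRANCH" "$REPO_URL" cuttlefish
    run mkdir -p cuttlefish/build
    run cd cuttlefish/build
    # The cmake line quoted in the comment above.
    run cmake -DINSTANCE_COUNT=256 -DCMAKE_INSTALL_PREFIX=../ ..
    run make -j "$JOBS" install
)
```

Running `DRY_RUN=1 build_extra_large_k` prints the commands without executing them, which is a cheap way to review the steps before starting a build that, as noted above, can take a long time.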
