Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug with extra large k-mers #204

Closed
jamshed opened this issue Oct 30, 2022 · 6 comments
Closed

Bug with extra large k-mers #204

jamshed opened this issue Oct 30, 2022 · 6 comments

Comments

@jamshed
Copy link
Contributor

jamshed commented Oct 30, 2022

Hello Marek,

I was wondering if KMC could be used to enumerate very large k-mers, e.g. with k >= 1000. The need arose from this issue: COMBINE-lab/cuttlefish#22.

To support KMC upto k = 1024, I've compiled it using make CFLAGS=-DMAX_K=1024. Now, some test results for the E. Coli genome reference:

  • With k = 31:
Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :      4554269
   No. of unique counted k-mers       :      4554269
   Total no. of k-mers                :      4641622
   Total no. of sequences             :            1
   Total no. of super-k-mers          :       355871
  • With k = 501:
Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :      4606355
   No. of unique counted k-mers       :      4606355
   Total no. of k-mers                :      4641152
   Total no. of sequences             :            1
   Total no. of super-k-mers          :        26810
  • With k = 1001:
Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :      3951266
   No. of unique counted k-mers       :      3951266
   Total no. of k-mers                :      3951267
   Total no. of sequences             :            1
   Total no. of super-k-mers          :        22181

Note that, we can compute the total length of the input sequence from the KMC stats results as the following quantity:
Total no. of k-mers + Total no. of sequences * (k - 1). With k = 31, 501, and 1001, we get the input lengths as 4641652, 4641652, and 3952267. If the input sequences do not have any N / indeterminate characters, then this quantity should be the same for different k's. And we observe a mismatching value with k = 1001.

On top of that, I'm also observing different (and indeterminate) results for k = 1001 if the number of threads is increased beyond 1. For example, the following are two different runs' results with -t16:

Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :       109417
   No. of unique k-mers               :      4028568
   No. of unique counted k-mers       :      3919151
   Total no. of k-mers                : 304178492578579
   Total no. of sequences             :            1
   Total no. of super-k-mers          :        22181
Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :        48691
   No. of unique k-mers               :      4028568
   No. of unique counted k-mers       :      3979877
   Total no. of k-mers                : 135264335877777
   Total no. of sequences             :            1
   Total no. of super-k-mers          :        22181

I wonder if you might have any insights on this behavior?

Regards!

@marekkokot
Copy link
Contributor

Hello Jamshed,

I have tried to reproduce this using the current master branch, but it seems the results are correct:

 KMC/bin/kmc -t16 -k1000 -ci1 -fm  GCF_000005845.2_ASM584v2_genomic.fna o .
**
Stage 1: 100%
Stage 2: 100%
1st stage: 0.225505s
2nd stage: 6.39249s
Total    : 6.618s
Tmp size : 6MB

Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :      4626637
   No. of unique counted k-mers       :      4626637
   Total no. of k-mers                :      4640653
   Total no. of sequences             :            1
   Total no. of super-k-mers          :        22200

Could you please show me the full command line and also confirm if you are using the same code as I do (current master branch).

Best,
Marek

@jamshed
Copy link
Contributor Author

jamshed commented Oct 30, 2022

Hi Marek,

I'm using the latest code from the master branch. I'm using the following command:

k=1001; /usr/bin/time ./KMC/bin/kmc -v -k$k -fm -ci1 -t1 ./ecoli.fna /mnt/scratch4/jamshed/ecoli.k$k.kmc ./temp/

This is the output log:

**

********** Used parameters for Stage 1 : **********
No. of input files           : 1
Output file name             : 
No. of working directories   : 1
Input format                 : MULTI LINE FASTA
Output format                : KMC

k-mer length                 : 1001
Max. k-mer length            : 1024
Signature length             : 9
Both strands                 : true
RAM only mode                : false

******* Stage 1 configuration: *******

No. of bins                  : 512
Bin part size                : 65536
Input buffer size            : 16777216

No. of readers               : 1
No. of splitters             : 1

Max. mem. size               : 12000MB
Max. mem. per storer         :  6192MB
Max. mem. for single package :    24MB

Max. mem. for PMM (bin parts):  9526MB
Max. mem. for PMM (FASTQ)    :  1650MB
Max. mem. for PMM (reads)    :     1MB
Max. mem. for PMM (b. reader):   805MB

Stage 1: 100%

********** Used parameters for Stage 2 : **********
Min. count threshold         : 1
Max. count threshold         : 1000000000
Max. counter value           : 255

******* Stage 2 configuration: *******
No. of threads               : 1

Max. mem. for 2nd stage      :  1098MB

Stage 2: 100%
1st stage: 0.88179s
2nd stage: 7.46185s
Total    : 8.34364s
Tmp size : 6MB

Stats:
   No. of k-mers below min. threshold :            0
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :      3951266
   No. of unique counted k-mers       :      3951266
   Total no. of k-mers                :      3951267
   Total no. of sequences             :            1
   Total no. of super-k-mers          :        22181
8.17user 1.45system 0:08.73elapsed 110%CPU (0avgtext+0avgdata 917660maxresident)k
0inputs+1952800outputs (0major+379890minor)pagefaults 0swaps

@marekkokot
Copy link
Contributor

Hi,

thanks. I think I know the cause that I have different results.
I modified kmc_core/defs.h to support larger k, while you did it by compiler flags.
I'm not sure why your approach does not work, I mean there must be a bug somewhere in the code.
Anyway, could you please alter the appropriate #define in kmc_core/defs.h ?

@jamshed
Copy link
Contributor Author

jamshed commented Oct 31, 2022

Hello Marek,

Thanks. I made a clean installation by directly setting MAX_K inside the source code, and it seems to work correctly! But when I set it using make CFLAGS=-DMAX_K=1024, the behavior becomes buggy.

marekkokot added a commit that referenced this issue Nov 1, 2022
@marekkokot
Copy link
Contributor

Hello Jamshed,

it seems the bug was quite complex to find, but I think I have it now and it's fixed with 726ecbf.
Let me know if it works, and if it does please close the issue.

@jamshed
Copy link
Contributor Author

jamshed commented Nov 9, 2022

Thanks Marek! It works correctly on some example datasets where the earlier commits produced incorrect results.

@jamshed jamshed closed this as completed Nov 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants