fast-minimizers: v4 rescan
RagnarGrootKoerkamp committed Jul 13, 2024
1 parent f721d33 commit 786b413
Showing 1 changed file with 196 additions and 1 deletion.
197 changes: 196 additions & 1 deletion posts/fast-minimizers/fast-minimizers.org
@@ -178,6 +178,7 @@ impl Minimizer for V0NaiveLex {
        j + window
            .windows(self.k)
            .enumerate()
            // min_by_key returns the leftmost minimum.
            .min_by_key(|(_idx, kmer)| *kmer)
            .unwrap()
            .0
@@ -231,6 +232,8 @@
rnd/1_naive_fx time: [69.025 ms 69.032 ms 69.039 ms]
rnd/2_naive_wy time: [99.193 ms 99.203 ms 99.215 ms]
#+end_src
Observe:
- Each method takes 50-100ms to process 1 million characters. That would be
50-100s for 1Gbp.
- Measurements between runs are very stable.
- FxHash is fastest. It's just one multiply-add per 8 bytes of kmer.
- WyHash is actually slower than lexicographic comparison in this case!
@@ -398,7 +401,199 @@
Great! Already very close to the =minimizer-iter= crate, and we didn't even
write much code yet.
From now on, I'll leave out the naive $O(wn)$ implementations.

* TODO Away with the deque
** Profiling what we have
We're already close to the reference implementation, but not quite there yet.
Let's do some profiling. For this, we can pass ~--profile-time 2~ to
~criterion~, so that instead of the usual benchmarking, it just runs the
selected benchmarks for 2 seconds. We start with a flamegraph of the v3 method above.
#+begin_src just
flame test='':
cargo flamegraph --bench bench --open -- --bench --profile-time 2 {{test}}
#+end_src

#+caption: A flamegraph made using =just flame 3_queue= showing that some time is spent in the warm-up, some in the main loop, and that most of the time is spent in the =push= function.
#+attr_html: :class inset
[[file:3_flame.svg][file:3_flame.svg]]

This is not yet super insightful though. It's pretty much expected that most
time is in the =push= function anyway. Let's get some more statistics using
=perf stat=:
#+begin_src just
stat test='':
cargo build --profile bench --benches
perf stat -d cargo criterion -- --profile-time 2 {{test}}
#+end_src
#+begin_src txt
2,380.66 msec task-clock:u # 1.005 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
22,675 page-faults:u # 9.525 K/sec
5,873,141,947 cycles:u # 2.467 GHz
13,624,513,378 instructions:u # 2.32 insn per cycle
1,893,102,104 branches:u # 795.201 M/sec
77,266,703 branch-misses:u # 4.08% of all branches
2,960,654,139 L1-dcache-loads:u # 1.244 G/sec
19,781,179 L1-dcache-load-misses:u # 0.67% of all L1-dcache accesses
1,659,216 LLC-loads:u # 696.957 K/sec
269,546 LLC-load-misses:u # 16.25% of all LL-cache accesses
#+end_src

There is still nothing that stands out as very bad. 2.3 instructions per cycle
is not great, but still reasonable. (It can go up to 4 on my processor, and
above 3 is usually good.) Maybe the $4\%$ branch-miss rate is a problem though.
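As a back-of-the-envelope check: assuming a misprediction penalty of roughly 15
cycles on this machine (an assumed number, not measured), the $77\cdot 10^6$
misses account for about $1.2\cdot 10^9$ cycles, or roughly $20\%$ of the
$5.9\cdot 10^9$ cycles total.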
Let's dive deeper and look at =perf record=:
#+begin_src just
perf test='':
cargo build --profile bench --benches
perf record cargo criterion -- --profile-time 2 {{test}}
perf report
#+end_src
#+caption: =just perf 3_queue=
#+begin_src txt
65.79% <bench::randmini::sliding_min::MonotoneQueue<Val> as bench::randmini::sliding_min::SlidingMin<Val>>::push
22.15% <core::iter::adapters::map::Map<I,F> as core::iter::traits::iterator::Iterator>::fold
...
#+end_src

Now the problem is clear! The =push= function is not inlined. Let's fix that.
(TODO commit)
#+begin_src diff
+#[inline(always)]
fn new(w: usize) -> Self {

+#[inline(always)]
fn push(&mut self, val: Val) -> Elem<Val> {
#+end_src
#+begin_src txt
rnd/ext_minimizer_iter time: [20.023 ms 20.092 ms 20.206 ms]
rnd/3a_queue time: [23.936 ms 23.948 ms 23.964 ms]
rnd/3b_inlined_queue time: [22.786 ms 22.874 ms 22.988 ms]
#+end_src

Ah well, forgive me my optimism. Either way, this is decently close to the
baseline version. Let's look in slightly more detail at the =perf report=:

#+begin_src asm
1.02 │2f0: lea (%rdx,%rax,1),%rdi ; start of the while pop-back loop.
0.58 │ cmp %rcx,%rdi
1.41 │ mov $0x0,%edi
0.67 │ cmovae %rcx,%rdi
0.35 │ shl $0x4,%rdi
0.59 │ mov %rsi,%r8
1.28 │ sub %rdi,%r8
1.74 │ ┌──cmp %r12,(%r8) ; Is back > val?
 1.91 │ ├──jbe 321 ; -> NO, stop popping.
*9.08*│ │ dec %rax ; -> YES, pop it.
3.41 │ │ mov %rax,0x18(%rbx)
0.35 │ │ add $0xfffffffffffffff0,%rsi
0.10 │ │ test %rax,%rax
0.33 │ │↑ jne 2f0 ; jumps back to the top
0.28 │ │ xor %eax,%eax
│ │self.q.push_back(Elem { pos: self.pos, val });
*12.02*│321:└─→mov 0x28(%rbx),%r13 ; stop popping and push the new element.
#+end_src
We can see that a lot of time, $21\%$ of the total, is spent on the two
instructions around the branch: the fall-through ~dec~ that pops an element, and
the branch target at ~321~. Indeed, this branch checks whether the previous
element is larger than the current one, and on random input that is basically a
50-50 coin flip, as bad as it can be for the branch predictor.
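For orientation, here is a minimal sketch of the loop this assembly implements.
This is not the post's actual code: the names =MonotoneQueue=, =Elem=, and
=push= are taken from the symbols in the perf report above, and the window
bookkeeping is my assumption.
#+begin_src rust
use std::collections::VecDeque;

#[derive(Clone, Copy)]
pub struct Elem<Val> {
    pub pos: usize,
    pub val: Val,
}

pub struct MonotoneQueue<Val> {
    w: usize,
    /// Position of the next element.
    pos: usize,
    /// Values are non-decreasing from front to back; the front is the window minimum.
    q: VecDeque<Elem<Val>>,
}

impl<Val: Ord + Copy> MonotoneQueue<Val> {
    fn new(w: usize) -> Self {
        assert!(w > 0);
        Self { w, pos: 0, q: VecDeque::new() }
    }

    fn push(&mut self, val: Val) -> Elem<Val> {
        // The branchy pop-back loop from the assembly: on random input,
        // whether the back is larger than `val` is close to a coin flip.
        // The strict `>` keeps older elements on ties, so the front stays
        // the *leftmost* minimum.
        while self.q.back().is_some_and(|back| back.val > val) {
            self.q.pop_back();
        }
        self.q.push_back(Elem { pos: self.pos, val });
        self.pos += 1;
        // Drop the front once it falls out of the length-w window.
        if self.q[0].pos + self.w < self.pos {
            self.q.pop_front();
        }
        self.q[0]
    }
}
#+end_src
The ~while~ condition is the branch at the =jbe= above, and each iteration
executes the ~dec~ that shrinks the deque.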

Thus, we would like a less branchy and more predictable method for computing sliding-window minima.

* Re-scanning: Away with the queue
One way to simplify all these queue operations is by just dropping the queue
completely. I learned of this method via Daniel Liu's [[https://gist.github.com/Daniel-Liu-c0deb0t/7078ebca04569068f15507aa856be6e8][gist]] for robust winnowing,
but I believe it is folklore[fn::Citation needed, both for it being folklore and
for the original idea.].
Instead of being smart and incrementally keeping track of all candidate elements
in the queue, we can simply track only the minimum after all ;)
But then, when the minimum falls out of the window, we have a problem. We solve
this by simply re-scanning the entire window for the new minimum. Thus, we keep
a cyclic buffer of the last $w$ values and scan it when needed. One point of attention
is that we need the /leftmost/ minimal value, but since the buffer is cyclic, that
is not necessarily the first minimum in the buffer: the slots /after/ the write index
hold the older values. Thus, we partition the scan into two parts, and prefer
minima in the second (older) part.

#+caption: =Rescan= implementation of =SlidingMin=.
#+begin_src rust
// `position_min` below is provided by `itertools::Itertools`;
// `Max` is a helper trait (defined elsewhere in the post) providing `Val::MAX`.
pub struct Rescan<Val: Ord> {
    w: usize,
    /// Position of next element.
    pos: usize,
    /// Index in `vals` of next element.
    idx: usize,
    /// Position of the smallest element in the last window.
    min_pos: usize,
    /// Value of the smallest element in the last window.
    min_val: Val,
    vals: Vec<Val>,
}

impl<Val: Ord + Copy + Max> SlidingMin<Val> for Rescan<Val> {
    #[inline(always)]
    fn new(w: usize) -> Self {
        assert!(w > 0);
        Self {
            w,
            pos: 0,
            idx: 0,
            min_pos: 0,
            min_val: Val::MAX,
            vals: vec![Val::MAX; w],
        }
    }

    #[inline(always)]
    fn push(&mut self, val: Val) -> Elem<Val> {
        self.vals[self.idx] = val;
        if val < self.min_val {
            self.min_val = val;
            self.min_pos = self.pos;
        }
        if self.pos - self.min_pos == self.w {
            // Find the position of the minimum, preferring older elements that come *after* self.idx.
            // Buffer index i maps to text position pos - idx + i for i <= idx,
            // and to pos - idx + i - w for i > idx (the older part).
            let p1 = self.vals[self.idx + 1..].iter().position_min();
            let p2 = self.vals[..=self.idx].iter().position_min().unwrap();
            (self.min_val, self.min_pos) = if let Some(p1) = p1 {
                let p1 = self.idx + 1 + p1;
                if self.vals[p1] <= self.vals[p2] {
                    (self.vals[p1], self.pos - self.idx + p1 - self.w)
                } else {
                    (self.vals[p2], self.pos - self.idx + p2)
                }
            } else {
                (self.vals[p2], self.pos - self.idx + p2)
            };
        }
        self.pos += 1;
        self.idx += 1;
        if self.idx == self.w {
            self.idx = 0;
        }

        return Elem {
            pos: self.min_pos,
            val: self.min_val,
        };
    }
}
#+end_src
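As a small sanity check, here is a (hypothetical, not from the post) usage
example; it assumes a =Max= impl for =u32= and that the =SlidingMin= trait is
in scope:
#+begin_src rust
#[test]
fn rescan_small_example() {
    let vals = [3u32, 1, 4, 1, 5, 9, 2, 6];
    let mut q = Rescan::new(3);
    let mins: Vec<(usize, u32)> = vals
        .iter()
        .map(|&v| {
            let e = q.push(v);
            (e.pos, e.val)
        })
        .collect();
    // During warm-up the prefix minimum is returned; after that, each entry
    // is the (position, value) of the leftmost minimum of the last 3 values.
    assert_eq!(
        mins,
        vec![(0, 3), (1, 1), (1, 1), (1, 1), (3, 1), (3, 1), (6, 2), (6, 2)]
    );
}
#+end_src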

As before, the corresponding minimizer scheme is trivial to implement:
#+caption: v4: rescan
#+begin_src diff
-pub struct V3Queue {..}
+pub struct V4Rescan {..}
...
- let mut q = MonotoneQueue::new(self.w);
+ let mut q = Rescan::new(self.w);
#+end_src
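Spelled out as a sketch (the function below is illustrative, not the post's
exact API; it assumes a =Max= impl for =u64= and uses =fxhash= as the kmer
hash):
#+begin_src rust
/// For each window of `w` consecutive kmers, report the position of the
/// (leftmost) kmer with minimal hash.
fn minimizer_positions(text: &[u8], k: usize, w: usize) -> Vec<usize> {
    let mut q = Rescan::new(w);
    let mut positions = Vec::new();
    for (i, kmer) in text.windows(k).enumerate() {
        // Any cheap kmer hash works here; FxHash was the fastest above.
        let min = q.push(fxhash::hash64(kmer));
        // Only report once the first window of w kmers is full.
        if i + 1 >= w {
            positions.push(min.pos);
        }
    }
    positions
}
#+end_src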
The result is *fast*: almost twice as fast as the previous best!
#+begin_src txt
rnd/ext_minimizer_iter time: [19.992 ms 19.995 ms 19.999 ms]
rnd/3a_queue time: [24.036 ms 24.042 ms 24.048 ms]
rnd/3b_inlined_queue time: [22.586 ms 22.604 ms 22.631 ms]
rnd/4_rescan time: [10.851 ms 10.854 ms 10.858 ms]
#+end_src

* TODO NtHash: a rolling kmer hash

