From b99ebe72eed359e83a884290f79bae8be84f8b24 Mon Sep 17 00:00:00 2001
From: Ragnar Groot Koerkamp
Date: Fri, 6 Oct 2023 12:42:02 +0200
Subject: [PATCH] pthash: notes on inversion and some new ideas

---
 posts/pthash/pthash.org | 64 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 59 insertions(+), 5 deletions(-)

diff --git a/posts/pthash/pthash.org b/posts/pthash/pthash.org
index c84ed72..9d0dea8 100644
--- a/posts/pthash/pthash.org
+++ b/posts/pthash/pthash.org
@@ -205,6 +205,8 @@ Rust, so I converted the few parts I need.
 There is also [[https://crates.io/crates/strength_reduce][=strength_reduce=]],
 which contains a similar but distinct algorithm for ~a % b~ that computes the
 remainder from the quotient.
 
+** TODO Try out =fastdivide= and =reciprocal= crates
+
 ** First benchmark
 I [[https://github.com/RagnarGrootKoerkamp/pthash-rs/commit/c070936558e756bafaae92af5be31ac383f2c3ee][implemented]] these under a generic =Reduce= trait.
@@ -784,8 +786,6 @@ Preliminary results: this seems tricky to get right and tends to be slower.
 It sometimes generates unwanted =gather= instructions, but even when it
 doesn't, it's slow, although I don't know exactly why yet. *Does pipelining
 work with SIMD instructions?*
 
-** TODO Try out =reciprocal= crate
-
 ** Inverting $h(k_i)$
 :PROPERTIES:
@@ -861,14 +861,14 @@ also simplify this inverse? Or can it always be done? I don't know.
 
 - I'm playing with the idea of implementing some kind of interpolation sort
   algorithm that just inserts things directly in the right place in an array
   of =Option<Hash>= of size $(1+\epsilon)n$ or maybe $n + C \cdot
-  \sqrt(n)$ and then runs a collect on this. Should work quite well I think.
+  \sqrt n$ and then runs a collect on this. Should work quite well, I think.
 
 ** TODO Possible sorting algorithms
 - [[https://github.com/mlochbaum/rhsort][Robinhoodsort]]
 - [[https://pvk.ca/Blog/2019/09/29/a-couple-of-probabilistic-worst-case-bounds-for-robin-hood-linear-probing/][Bounds on linear probing]]
 - Flashsort ([[https://en.wikipedia.org/wiki/Flashsort][wikipedia]], [[http://www.neubert.net/Flapaper/9802n.htm][article]])
   - Drawback: bad cache locality when writing out buckets. Maybe just write to
-    $O(\sqrt(n))$ buckets (should fit in L2 cache ideally) and then sort each
+    $O(\sqrt n)$ buckets (should fit in L2 cache ideally) and then sort each
     bucket individually.
@@ -1378,6 +1378,60 @@ This is $O(64)$ and takes around a minute to invert $10^8$ hashes.
   $h_2(k_i)$ maps to a chosen free slot (when $h_2$ is a =FastReduce=
   instance). This should allow us to fill the last slots of the table much
   faster.
 
-** TODO Hash-inversion for faster PTHash construction
+** Hash-inversion for faster PTHash construction
+
+So now we have a fast way to find $k_i$ for the /tail/ of the last $t$ buckets.
+We will assume that these buckets all have size $1$. (Otherwise, decrease $t$.)
+Let $F$ be the set of free positions once the /head/ of the first $m-t$ buckets
+has been processed. We always have $|F| \geq t$, and when $\alpha = 1$ we have
+exactly $|F| = t$. We can then implement two strategies, sketched in code
+below:
+- Lazy :: Iterate over buckets and free slots in parallel, matching each bucket
+  to a slot. Then compute the $k_i$ that sends each bucket to the corresponding
+  free slot. This gives $k_i \sim n$ in expectation, uses $t \cdot \log_2(n)$
+  bits in total, and runs in $O(t)$.
+- Greedy :: For each bucket (in order), compute $k_i(f)$ for each candidate
+  slot $f$, and choose the minimal value. When $\alpha=1$ this gives
+  $k_i \sim n/f$ and runs in $O(t^2)$. The total number of bits is
+  $$
+  \sum_{f=1}^t \log_2(n/f)
+  = t\log_2(n) - \log_2(t!)
+  \sim t (\log_2(n) - \log_2(t)).
+  $$
+  For $t=O(\sqrt n)$, this saves up to half the bits for these numbers.
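+
+Below is a minimal sketch of both strategies in Rust. The names
+=assign_tail_lazy=, =assign_tail_greedy=, and the =invert= closure are
+hypothetical; =invert= stands in for the =FastReduce= inversion described
+above, returning some $k$ that maps the given hash to the given slot.
+
+#+begin_src rust
+/// Lazy: match the i-th tail bucket to the i-th free slot and invert. O(t).
+fn assign_tail_lazy(
+    tail_hashes: &[u64],                // one remaining hash per tail bucket
+    free_slots: &[usize],               // free positions, in increasing order
+    invert: impl Fn(u64, usize) -> u64, // k such that position(hash, k) == slot
+) -> Vec<u64> {
+    assert!(tail_hashes.len() <= free_slots.len());
+    tail_hashes
+        .iter()
+        .zip(free_slots)
+        .map(|(&hx, &slot)| invert(hx, slot))
+        .collect()
+}
+
+/// Greedy: give each bucket the remaining free slot with minimal k_i. O(t^2).
+fn assign_tail_greedy(
+    tail_hashes: &[u64],
+    free_slots: &mut Vec<usize>,
+    invert: impl Fn(u64, usize) -> u64,
+) -> Vec<u64> {
+    tail_hashes
+        .iter()
+        .map(|&hx| {
+            // Try every remaining free slot and keep the one with minimal k_i.
+            let (idx, ki) = free_slots
+                .iter()
+                .enumerate()
+                .map(|(idx, &slot)| (idx, invert(hx, slot)))
+                .min_by_key(|&(_, ki)| ki)
+                .expect("more tail buckets than free slots");
+            free_slots.swap_remove(idx); // this slot is now taken
+            ki
+        })
+        .collect()
+}
+#+end_src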
+
+In some quick experiments with $n=10^8$, the lazy strategy seems to give at
+most around a $15\%$ speedup ($35$ to $30$ seconds for $t=10000$), which is not
+as much as I had hoped. This seems to be because a relatively large share of
+the time is spent finding $k_i$ for the last buckets of size $2$ and $3$.
+
+** Fast path for small buckets
+For small buckets (size $\leq 4$) it pays off to use a code path that knows the
+explicit bucket size and processes a fixed-size =&[Hash; BUCKET_SIZE]= array
+instead of an arbitrarily sized slice =&[Hash]=. This allows for better code
+generation.
+
+** TODO Dictionary encoding
+The dictionary will be quite dense for numbers up to some threshold (say
+$1024$), and sparser afterwards. We can encode the small numbers directly and
+only do the dictionary lookup for larger ones.
+- TODO: Figure out if the branch is worth the savings of the lookup.
+
+** TODO Larger buckets
+The largest bucket should be able to have size $O(\sqrt n)$ without issues.
+From there, bucket sizes should slowly decay (TODO: figure out the math) to
+constant. This could put the elements that are currently in the largest
+$\sim 1\%$ of buckets together in just a few buckets, reducing the average size
+of the remaining buckets. The reduction seems minimal though, so this may not
+give much benefit.
+
+One way of achieving such a skewed distribution might be to replace the
+partitioning of $h \in [0, 2^{64})$ into $m$ chunks by a partitioning of
+$h^2 \in [0, 2^{128})$ into $m$ chunks.
+
+** TODO Prefetching free slots
+Looking up whether the slots in the array for a certain $k_i$ are free is quite
+slow and memory bound. Maybe we can prefetch the values for a few $k_i$ ahead,
+as in the sketch below.
+
+Also, the computation of =position= could be vectorized.
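+
+As a rough sketch (x86-64 only; =find_ki=, the =taken= array, and the fixed
+=LOOKAHEAD= are hypothetical stand-ins for the real implementation),
+prefetching a fixed number of $k_i$ ahead could look like this:
+
+#+begin_src rust
+#[cfg(target_arch = "x86_64")]
+fn find_ki(hx: u64, taken: &[bool], position: impl Fn(u64, u64) -> usize) -> u64 {
+    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
+    const LOOKAHEAD: u64 = 8; // how many k_i to run ahead; needs tuning
+    let mut ki = 0u64;
+    loop {
+        // Prefetch the slot that k_i + LOOKAHEAD will probe, so that by the
+        // time we test it, its cache line is (hopefully) already in L1.
+        // wrapping_add avoids UB if the index runs past the end; prefetching
+        // an invalid address is harmless.
+        let ahead = position(hx, ki + LOOKAHEAD);
+        unsafe { _mm_prefetch::<_MM_HINT_T0>(taken.as_ptr().wrapping_add(ahead) as *const i8) };
+        if !taken[position(hx, ki)] {
+            return ki; // k_i maps this bucket's hash to a free slot
+        }
+        ki += 1;
+    }
+}
+#+end_src
+
 #+print_bibliography: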