
Consider adding another invertible function #23

Open · fp64 opened this issue Sep 14, 2022 · 21 comments
@fp64

fp64 commented Sep 14, 2022

The "xorsquare" function (which I briefly describe here) is of the form:

x=((x+a)&-2)^(x*x);

for any integer a.

Note: alternative version (x^=((x+a)*(x+a))&-2;) produces identical results for values of a that differ in the most significant bit.

It is straightforward to show (e.g. by induction on word size) that it is invertible.

@Marc-B-Reynolds

x^=((x+a)*(x+a))&-2;

FWIW: I spent a little bit of time looking at this function WRT strict avalanche criterion measures (more like a spoonful than a grain of salt for these observations). The constant 'a' doesn't seem to justify itself (adding one to the dependency chain). Continued here: https://twitter.com/marc_b_reynolds/status/1576848952846319616

@fp64

fp64 commented Oct 23, 2022

Somewhat lagged reply.

The constant 'a' doesn't seem to justify itself

This matches my experience trying to use it as a step of a stateless PRNG.

Note that x^=((x+a)*(x+a))&-2; is particularly bad latency-wise, x=((x+a)&-2)^(x*x); may be ok.

I might as well mention (even though it may be obvious), that the inverse may be computed with:

template<typename T>
T inverse(T (&fn)(T),T x)
{
    T ret=0;
    for(T bit=1;bit;bit*=2)
        ret^=bit*(((fn(ret)^x)&(2*bit-1))!=0);
    return ret;
}

(edit: simplified mask)
which should work on any uintN->uintN bijection where high bits (of input) do not affect low bits (of output), e.g. a mix of conventional and bitwise arithmetic, without right shifts.

I don't know of any "really O(1)" inverse for this.

tommyettinger added a commit to tommyettinger/sarong that referenced this issue Nov 28, 2022
Xorsquare was found (probably?) by fp64 here, skeeto/hash-prospector#23 , and this is a variant that has no fixed points.
@tommyettinger

Oh, whoops, didn't mean to link the issue. But, the x ^= x * x | 1 version of xorsquare at least seems to be one-to-one for all 16-bit inputs, and because it always changes the LSB, there's no case where n maps to n itself. Xorsquare is a really neat find!

@fp64

fp64 commented Dec 18, 2022

Actually, I also settled on the XQO form (x ^= x * x | 1) in my PRNGs.

Essentially, once you realize that x^(x*x) produces every even integer exactly twice, there's a whole lot of ways to "fix" that.

Arguably, the well-known triangular numbers offer a similar idea.

@tommyettinger

Another trick with XQO: the constant used with OR can be any odd number, not just 1, and you end up with something on the way to ~x (which is equivalent to x ^= (x * x) | -1; if negative numbers are two's-complement, as in Java and C#).

This makes me wonder about a constant that changes itself, such as an MCG (since those are always odd). x ^= x * x | (c *= 0xf1357aea2e62a9c5ULL); uses a 64-bit MCG constant from https://arxiv.org/abs/2001.05304 , which seems fine to me, though I haven't tested this. 0x93d765ddUL is a 32-bit alternative. The period of the combined construction could be interesting when used as an RNG state transition -- the MCG has a shorter period than some other options, like an additive counter that adds twice an odd number.

The Hamming weight of c determines how little the square of x matters to the result. Using x ^= x * x | (c & (c *= 0xf1357aea2e62a9c5ULL)); still guarantees that the ORed value is odd, but reduces its Hamming weight (the ORed value can never be -1 with this multiplier, either). I'll tinker a bit with this. Right now the machine I use for testing is running two long tests (64TB on the CPU and 100+ PB on the GPU), so I'll have to wait to examine these more thoroughly.

@tommyettinger

OK, new finding; this isn't useful for a unary hash, but for the related field of pseudo-random number generation, it might be?

state = (state ^ (state * state | o5o7)) * m3;

Where (o5o7 & 7) must equal 5 or 7, and (m3 & 3) must equal 3. This appears to step through all states in a random-seeming order for 16-bit states, 32-bit states, and probably 64-bit states, for any o5o7 and any m3 I have tested that meet the mentioned criteria. Like an LCG, it alternates even and odd states, and the low 10 or so bits are very predictable. For 64-bit states, XORing the upper 32 bits with the lower 32 bits (but not storing that in state) seems to make all bits fairly random.

state = -(state ^ (state * state | o5o7)); is a special case here, since negating is the same as multiplying by -1, and -1 fits the criterion for m3. The most extreme limit is state = -(state ^ -1);, which is the same as state = state + 1;, and that's certainly full-period.

The downside to this is that it's much harder computationally to go backwards than forwards with what we know now, though it should be possible. Skipping around in this sequence seems very challenging.

@Marc-B-Reynolds

It's the same isn't it? The final integer product peels off into a second logical step.

@tommyettinger

The difference is the period; repeatedly calling state ^= (state * state) | 1; will cycle before all possible values are reached, with cycles of different lengths (e.g. if state is 0 or 1, then the cycle length is 2). The variant with o5o7 and m3 doesn't have shorter subcycles, which seems unusual. Just multiplying by 3 isn't full-period (it covers a quarter of all possible values), and 7 is even worse (an eighth). Multiplying by -1 works fine inside the XQO construction and is full-period there, but has a period of 2 on its own (of course, since it just flips the sign forever).

@Marc-B-Reynolds

Marc-B-Reynolds commented Dec 23, 2022

Yes. I was just talking about forming the inverse function. That last product is simply reversed by the mod-inverse of m3.

@Marc-B-Reynolds

Here are some mixing functions (based on CRC32-C) which are cheapish and strongly mix the bottom 32 bits. https://gist.github.com/Marc-B-Reynolds/7db198a050c422936be4520fb0655a6f

@fp64

fp64 commented Jan 24, 2023

Another trick with XQO: the constant used with OR can be any odd number, not just 1

Well. This also yields much less conspicuous zeroes, e.g. (x^(x*x|1))==0 happens at x==1, but (x^(x*x|5))==0 happens at x==0x...3EF7226D (lower bits are the same for all wordsizes).
It only works for x^(x*x|1) though, not (x|1)^(x*x) (e.g. (x|5)^(x*x) is not a bijection). I assumed that form is more latency-friendly (though is it even true?).

Where (o5o7 & 7) must equal 5 or 7, and (m3 & 3) must equal 3.

Interesting. The proof eludes me so far (does anyone know?). I did exhaustively verify that it is both necessary and sufficient up to, and including, uint10_t.

XORing the upper 32 bits with the lower 32 bits (but not storing that in state) seems to make all bits fairly random.

This does fail PractRand at 16GB (2^32 outputs). For comparison, high^low of a 64-bit LCG fails at 32MB (2^23 outputs). Still, this probably means that you need a fancier output function, at which point it is not obvious whether there is any advantage over PCG (e.g. with the RS output function the quality seems about the same for this and an LCG).

Speaking of PCG, using XQO instead of the first step in pcg_hash from Hash Functions for GPU Rendering (see also) like this

uint32_t pxq_hash(uint32_t input)
{
    uint32_t state = (input | 1u) ^ (input * input);
    uint32_t word = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
    return (word >> 22u) ^ word;
}

seems to improve PractRand results (from failing at 4MB to failing at 512MB).

Both fail gjrand --tiny badly, though. Both also fail the birthday test, being bijections.

Doesn't seem better than triple32 in either quality or speed. Update: this (and pcg_hash) appears to be massively worse (avalanche-wise) than even lowbias32, with very evident structure in the avalanche matrix.

@fp64

fp64 commented Jan 28, 2023

The (hand-made) function

uint32_t hash(uint32_t x)
{
    x ^= x >> 15;
    x ^= (x * x) | 1u;
    x ^= x >> 17;
    x *= 0x9E3779B9u;
    x ^= x >> 13;
    return x;
}

appears to have better avalanche scores than known murmur3-style two-rounders (but weaker than triple32), and does decently as a PRNG (according to PractRand).

| Function | Linf | RMS | PractRand |
| --- | --- | --- | --- |
| xxHash | 8658392 | 1195645.4 | Fails at 16KB |
| MurmurHash3 | 4044016 | 566904.4 | Fails at 16KB |
| lowbias32 | 2023971 | 372660.4 | Fails at 32KB |
| lowbias32_tib | 1211488 | 231074.2 | Fails at 64KB |
| this | 836260 | 121867.6 | Fails at 1GB |
| triple32 | 167788 | 44857.8 | Fails at 1GB ('mildly suspicious' at 8KB) |

lowbias32_tib is TheIronBorn's version.

Avalanche scores use unscaled abs(counts[i][j]-(1ul<<31)). Looks like RMS is 2^31/1000 times what prospector -E reports.

PractRand is the result of while(1) output(hash(i++)); piped to RNG_test stdin32 -tlmin 1K -te 0 -tf 2 (you may want -te 1 instead).

Would be nice if someone double-checks this.

Might be worthwhile to prospect around a bit for functions of this form.

Update: https://www.shadertoy.com/view/dllSW7 .

@fp64

fp64 commented Jan 28, 2023

I assumed that form is more latency-friendly

It may be, slightly: https://godbolt.org/z/5jznWeoW5 .

So you might be better off writing it as

uint32_t hash(uint32_t x)
{
    x ^= x >> 15;
    x = (x | 1u) ^ (x * x);
    x ^= x >> 17;
    x *= 0x9E3779B9u;
    x ^= x >> 13;
    return x;
}

which gives identical results.

@TheIronBorn

Might be worth comparing some of these to those on https://mostlymangling.blogspot.com/ which uses a similar PractRand technique

@fp64

fp64 commented Jan 30, 2023

https://mostlymangling.blogspot.com/

Unless I'm missing something (and I may well be), measuring things becomes somewhat harder at 64 bits. E.g. I just spent a number of minutes running a Monte Carlo avalanche calculation (exhaustive is not practical for 64 bits), and got... this:

| Function | Emax | Erms |
| --- | --- | --- |
| hash64 | 1.967±0.406 | 0.503±0.011 |
| murmur64 | 1.841±0.378 | 0.498±0.015 |
| lea64 | 1.869±0.298 | 0.499±0.011 |
| degski64 | 1.812±0.259 | 0.500±0.017 |
| nasam | 1.863±0.373 | 0.501±0.016 |
| moremur | 1.924±0.467 | 0.503±0.019 |
| ettingerMixer | 490.445±0.210 | 53.035±0.044 |
| rrmxmx | 1.856±0.239 | 0.499±0.019 |
| rrxmrrxmsx_0 | 1.895±0.415 | 0.498±0.010 |

Not very precise (assuming I haven't botched it in the first place)...
Notes:

  1. Test uses K=8 runs of N=10^6 samples for each function. The "X ± Y" are simply average and 3*σ of the resulting 8 values (edit: without Bessel's correction).
  2. Avalanche scores are "normalized", i.e. divided by sqrt(N).
  3. Not sure what's up with ettingerMixer.
  4. hash64 is an adaptation ([31,35,⌊2^64/φ⌋,27], probably not actually good) of the above function to 64 bit.

Similarly with PractRand: for 32 bits, 32GB is enough, but e.g. RRC-64-42-TF2-0.94 is 4TB, and some of the mentioned tests go up to 128TB.

Someone with access to more computing power might want to give it a try.

@tommyettinger

I (Tommy Ettinger) also wonder what's up with ettingerMixer, but I wouldn't be surprised if it's just bad when measured with specific tests. @fp64 , can you post or link the source of the ettingerMixer you used? It might give us some insight into what makes generators fail, since not many seem to do that right now.

@fp64

fp64 commented Jan 31, 2023

From https://mostlymangling.blogspot.com/2019/01/better-stronger-mixer-and-test-procedure.html (except I changed suffixes to ull):

static inline uint64_t rol64(uint64_t z, int k) {  // left-rotate helper the blog assumes
  return (z << k) | (z >> (64 - k));
}

uint64_t ettingerMixer(uint64_t v) {
  uint64_t z = (v ^ 0xDB4F0B9175AE2165ull) * 0x4823A80B2006E21Bull;
  z ^= rol64(z, 52) ^ rol64(z, 21) ^ 0x9E3779B97F4A7C15ull;
  z *= 0x81383173ull;
  return z ^ z >> 28;
}

Update: heatmap (scaled to 0..9):

64x64
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000001000000000000000000000000000
0000000000000000000000000000000000000200000000100000000000000000
0000000000000000000000000000000000000020000000010000000000000000
0100000000000000000000000000000000000002010000001000000000000000
0010000000000000000000000000000000000000201000000100000000000000
0001000001000000000000000000000000000100020100000010000000000000
0010100100000000000000000000000000000000002010000001000000000000
1001011020010000000000000000000000002001020202010000100000000000
0000101102001000000000000000000000000200102020201000010000000000
1010020210200100000000000000000000001020010202020100001010000000
0101003021020011000000000000000000002102001131402010000101001010
0021202502102111102000000100001000010210211113680201021020100201
1102121360210211100200000010000000006021021110578020102102111020
1230302368021030204021000001001010006802103020478902010230212302
2047030478802603015702000030100102007880260301578990214046020470
1426815267890571524680420115022130326789057152468899052625714368

@fp64

fp64 commented Feb 1, 2023

Not very precise

Some more context on accuracy of Monte Carlo.


Let's take the 32-bit function above (so we can actually do an exhaustive measurement):

| Method | N | Linf | Linf/sqrt(N) | RMS | RMS/sqrt(N) |
| --- | --- | --- | --- | --- | --- |
| Monte Carlo | 1000 | 57 | 1.802 | 15.8 | 0.501 |
| Monte Carlo | 2000 | 71 | 1.588 | 22.3 | 0.498 |
| Monte Carlo | 4000 | 94 | 1.486 | 31.6 | 0.500 |
| Monte Carlo | 8000 | 142 | 1.588 | 42.9 | 0.479 |
| Monte Carlo | 16000 | 250 | 1.976 | 62.0 | 0.490 |
| Monte Carlo | 32000 | 262 | 1.465 | 89.0 | 0.497 |
| Monte Carlo | 64000 | 392 | 1.550 | 125.6 | 0.496 |
| Monte Carlo | 128000 | 540 | 1.509 | 175.8 | 0.491 |
| Monte Carlo | 256000 | 849 | 1.678 | 254.3 | 0.503 |
| Monte Carlo | 512000 | 1118 | 1.562 | 346.5 | 0.484 |
| Monte Carlo | 1024000 | 2015 | 1.991 | 528.5 | 0.522 |
| Monte Carlo | 2048000 | 2360 | 1.649 | 702.6 | 0.491 |
| Monte Carlo | 4096000 | 3475 | 1.717 | 995.9 | 0.492 |
| Monte Carlo | 8192000 | 5122 | 1.790 | 1436.0 | 0.502 |
| Monte Carlo | 16384000 | 8038 | 1.986 | 1915.1 | 0.473 |
| Monte Carlo | 32768000 | 11206 | 1.958 | 2997.3 | 0.524 |
| Monte Carlo | 65536000 | 15361 | 1.897 | 4277.6 | 0.528 |
| Monte Carlo | 131072000 | 26863 | 2.346 | 6721.9 | 0.587 |
| Monte Carlo | 262144000 | 58485 | 3.612 | 10895.5 | 0.673 |
| Monte Carlo | 524288000 | 110290 | 4.817 | 18779.9 | 0.820 |
| Monte Carlo | 1048576000 | 203025 | 6.270 | 34348.4 | 1.061 |
| Monte Carlo | 2097152000 | 422054 | 9.216 | 63478.4 | 1.386 |
| Monte Carlo | 4294967296 | 829977 | 12.664 | 126039.3 | 1.923 |
| Exhaustive | 4294967296 | 836260 | 12.760 | 121867.6 | 1.860 |

Compare to MurmurHash3:

| Method | N | Linf | Linf/sqrt(N) | RMS | RMS/sqrt(N) |
| --- | --- | --- | --- | --- | --- |
| Monte Carlo | 1000 | 54 | 1.708 | 16.3 | 0.515 |
| Monte Carlo | 2000 | 71 | 1.588 | 21.7 | 0.486 |
| Monte Carlo | 4000 | 128 | 2.024 | 31.4 | 0.497 |
| Monte Carlo | 8000 | 143 | 1.599 | 44.4 | 0.497 |
| Monte Carlo | 16000 | 233 | 1.842 | 62.1 | 0.491 |
| Monte Carlo | 32000 | 347 | 1.940 | 94.3 | 0.527 |
| Monte Carlo | 64000 | 415 | 1.640 | 128.6 | 0.508 |
| Monte Carlo | 128000 | 626 | 1.750 | 178.6 | 0.499 |
| Monte Carlo | 256000 | 906 | 1.791 | 259.7 | 0.513 |
| Monte Carlo | 512000 | 1146 | 1.602 | 359.7 | 0.503 |
| Monte Carlo | 1024000 | 1772 | 1.751 | 529.0 | 0.523 |
| Exhaustive | 4294967296 | 4044016 | 61.707 | 566904.4 | 8.650 |

And to pcg_hash:

| Method | N | Linf | Linf/sqrt(N) | RMS | RMS/sqrt(N) |
| --- | --- | --- | --- | --- | --- |
| Monte Carlo | 1000 | 89 | 2.814 | 16.5 | 0.523 |
| Monte Carlo | 2000 | 156 | 3.488 | 23.8 | 0.533 |
| Monte Carlo | 4000 | 289 | 4.569 | 35.5 | 0.561 |
| Monte Carlo | 8000 | 598 | 6.686 | 55.2 | 0.617 |
| Monte Carlo | 16000 | 1209 | 9.558 | 85.9 | 0.679 |
| Monte Carlo | 32000 | 2344 | 13.103 | 148.4 | 0.830 |
| Monte Carlo | 64000 | 4775 | 18.875 | 281.9 | 1.114 |
| Monte Carlo | 128000 | 10078 | 28.169 | 528.2 | 1.476 |
| Monte Carlo | 256000 | 20025 | 39.578 | 1030.4 | 2.037 |
| Monte Carlo | 512000 | 39575 | 55.308 | 2028.7 | 2.835 |
| Monte Carlo | 1024000 | 78691 | 77.763 | 4003.0 | 3.956 |
| Exhaustive | 4294967296 | 331871348 | 75063.955 | 16645540.6 | 253.991 |

As we can see, reasonably good functions look basically indistinguishable under Monte Carlo up to fairly sizeable N, but Monte Carlo does detect sharper deviations a lot earlier.

@fp64

fp64 commented Feb 2, 2023

Speaking of measuring things exactly, some uint8_t->uint8_t hash functions that are optimal in their class look like this:

| Type | Constants | Linf | RMS | Notes |
| --- | --- | --- | --- | --- |
| xorr,mul,xorr,mul,xorr | [ 1,0x0F, 3,0xDD, 4] | 16 | 7.3654599 | Optimal for both; [ 1,0x0F, 3,0x5D, 4] has the same Linf |
| xorr,xqo,xorr,mul,xorr | [ 2, 5,0x93, 5] | 20 | 8.9442719 | Optimal for both; [ 5, 4,0x9F, 2] and [ 3, 6,0x75, 1] have the same Linf |
| xorr,mul,xorr,xqo,xorr | [ 5,0x7B, 4, 4] | 20 | 11.1691540 | Linf-optimal; [ 4,0x43, 3, 4] and [ 4,0xC3, 3, 4] have the same Linf |
| xorr,mul,xorr,xqo,xorr | [ 3,0x3B, 4, 4] | 24 | 9.4339811 | RMS-optimal |
| xorr,mul,xorr,xqo,xorr | [ 3,0xBB, 4, 4] | 24 | 9.4339811 | RMS-optimal |
| xorr,xqo,xorr,xqo,xorr | [ 4, 5, 2] | 24 | 11.8532696 | Linf-optimal |
| xorr,xqo,xorr,xqo,xorr | [ 2, 5, 4] | 32 | 10.5237826 | RMS-optimal |

@fp64

fp64 commented Feb 12, 2023

The function

uint32_t hash(uint32_t x)
{
    x ^= x >> 23;
    x = (x | 1u) ^ (x * x);
    x ^= x >> 13;
    x = (x | 1u) ^ (x * x);
    x ^= x >>  9;
    x = (x | 1u) ^ (x * x);
    x ^= x >> 17;
    return x;
}

appears to produce better avalanche scores than triple32 (actually there's a lot of functions with better Linf scores, like the one below, but lower RMS is harder to obtain).

| Function | Linf | RMS | PractRand |
| --- | --- | --- | --- |
| triple32 | 167788 | 44857.8 | Fails at 1GB ('mildly suspicious' at 8KB) |
| [23,13, 9,17] | 158752 | 44181.3 | Fails at 1GB |
| [14,14,14,14] | 133756 | 46713.3 | Fails at 1GB |

The [14,14,14,14] one is easy to memorize.

@fp64

fp64 commented Sep 20, 2023

@tommyettinger, I just noticed something curious. As mentioned before, 32-bit rng

uint32_t rng32()
{
    static uint64_t state=0ull;
    state=-(state^(state*state|5ull));
    return (uint32_t)(state>>32);
}

performs very smoothly in PractRand, up until it abruptly and harshly fails at 16GB (2^32 outputs). Similarly for the 16-bit rng, which fails at 512KB. This makes sense, since the period of the k-th bit is 2^k (tests fail close to the period of the lowest bit of the output).
However, 64-bit version

uint64_t rng64()
{
    static __uint128_t state=0ull;
    state=-(state^(state*state|5ull));
    return (uint64_t)(state>>64);
}

seems to fail very fast (but not immediately!).

rng=RNG_stdin64, seed=unknown
length= 1 kilobyte (2^10 bytes), time= 0.5 seconds
  Test Name                         Raw       Processed     Evaluation
  DC6-9x1Bytes-1                    R=  +5.7  p = 0.019     unusual          
  ...and 5 test result(s) without anomalies

rng=RNG_stdin64, seed=unknown
length= 2 kilobytes (2^11 bytes), time= 0.7 seconds
  Test Name                         Raw       Processed     Evaluation
  DC6-9x1Bytes-1                    R=  +6.6  p =  8.5e-3   unusual          
  ...and 7 test result(s) without anomalies

rng=RNG_stdin64, seed=unknown
length= 4 kilobytes (2^12 bytes), time= 1.0 seconds
  Test Name                         Raw       Processed     Evaluation
  Gap-16:A                          R= +15.4  p =  1.6e-13    FAIL           
  Gap-16:B                          R= +16.2  p =  1.0e-13    FAIL           
  ...and 22 test result(s) without anomalies

Seeding the state with something like -1ull/5 makes it ok for at least a couple of gigabytes. Some small seeds yield "suspicious" early on, but proceed smoothly afterwards.
Returning (uint64_t)(state^(state>>64)) produces "mildly suspicious" at 2KB, and a few "unusual" results later, but passes for at least a few gigabytes (for the 32-bit version there seems to be little difference between returning state>>32 and state^(state>>32)).
Note: the actual code didn't use __uint128_t, emulating it with uint64_ts instead, but the results should match.

rurban added a commit to rurban/fast-hash that referenced this issue Jan 15, 2024