Suggestions list for future evolutions #458

Cyan4973 · 2020-10-01T18:00:15Z

List updated June 30th 2023 :

Objectives for v0.8.2 - Completed

Add XXH3 specification : completed (Add algorithm specification of XXH3 #750)
Add WASM vector code path : complete (Add WebAssembly SIMD128 implementation and Node.JS support #825)
Select which part(s) of Support SVE with assembly implementation #762 can be merged : so far, only the SVE detection message at CLI prompt.
Consider prefetching for long inputs on XXH64 and maybe XXH32 ? (tentative) (may not benefit) -> benefit is too small, abandoned.

Objectives for v0.8.3

Update cmake minimum version to v3.5 (cmake: bump minimum required to version 3.5.0 #859) - completed
Non-BSD format for XXH3-64bit (XXH3 output format, non-BSD style checksum lines #645) - completed
Settle on which part or successor of Refactor to improve code size, fix GCC 12 #802 to merge
Ship xxhsum with DISPATCH enabled by default ? (tentative)
- Requires a robust x86/x64 detector
- Settle on what to do when AVX2 is enabled at compilation time (Reconsider XXH_X86DISPATCH_ALLOW_AVX check #839)
- Display the detected local vector extension, as an explicit statement of which variant the dispatch will select at runtime.
Add an arm64dispatch variant, for SVE/NEON dispatch (tentative, based on Support SVE with assembly implementation #762)
- Requires a runtime SVE vector width detector

Objectives for v0.9.0

Stabilize XXH_generateSecret()
Remove legacy XXH_OLDNAME (or maybe v1.0 ?)
Split xxhash.h into multiple smaller files, and ship an ability to create a single amalgamated file from them ? (tentative)
- Complete that with an ability to be selective about which part of the library is actually included in the amalgamated artifact, for source and binary size control.
- provide a more direct way to selectively compile each algorithm, such as XXH_ENABLE_XXH32, XXH_ENABLE_XXH64 and XXH_ENABLE_XXH3
  - Currently, it's possible to compile only XXH32 (with XXH_NO_LONG_LONG), or XXH32 + XXH64 (with XXH_NO_XXH3) or everything. It's not possible to compile selectively only XXH64 for example.
Ship libxxhash dynamic library with runtime vector extension detection enabled by default (xxh_x64dispatch) ? (tentative)
- Note : already possible on demand, by providing DISPATCH=1 makefile parameter. In which case, it unconditionally adds xxh_x86dispatch.c to the unit list, which crashes on non-x86 targets. Making this option default requires a safe capability to detect x86 targets at build or compilation time.
- But in this case, the objective would be to remain transparent for user applications. DISPATCH must therefore substitute its symbols to original ones.

The text was updated successfully, but these errors were encountered:

easyaspi314 · 2020-10-01T18:49:17Z

Another one I was thinking was an option to disable streaming.

The streaming API takes up a good chunk of the binary size:

Clang 10.0.1, aarch64, Termux

$ clang -Oz -shared -fPIC xxhash.c -s -o libxxh_stream.so
$ clang -Oz -shared -fPIC xxhash.c -s -o libxxh_nostream.so -DXXH_NO_STREAM
$ size -A *.so | grep -E '(:|Total)'
libxxh_nostream.so  :
Total                 7219
libxxh_stream.so  :
Total                 12848

Edit: -O3:

libxxh_nostream.so  :
Total                 21966
libxxh_stream.so  :
Total                 35067

easyaspi314 · 2020-10-03T03:19:09Z

Another ~~cheat~~ optimization is to use __attribute__((__pure__)) on as many xxHash functions as possible.

It says, roughly:

This function has no side effects
This function will return the same value given that the parameters, memory pointed to by the parameters, and global variables are not modified.
This function can safely have repeated calls eliminated by the compiler as long as nothing else changes.

Basically, __pure__ is the magic behind the strlen optimization. (Excluding the second optimization where the compiler treats it as a built-in function and inlines/const props it)

const char *bad_strchr(const char *s, int c)
{
    for (size_t i = 0; i < strlen(s) /* EEK */; i++) {
        if ((unsigned char)s[i] == (unsigned char)c) {
            return &s[i];
        }
    }
    return NULL;
}

Any decent compiler will change it to this:

const char *bad_strchr(const char *s, int c)
{
    const size_t len = strlen(s); // strlen is pure, we only need to call it once
    for (size_t i = 0; i < len; i++) {
        if ((unsigned char)s[i] == (unsigned char)c) {
            return &s[i];
        }
    }
    return NULL;
}

Although it is easiest to see with this code:

size_t strlenx2(const char *s)
{
     return strlen(s) + strlen(s);
}

Equivalent code with how Clang shuffles registers on x86_64:

// optimized
size_t strlenx2(const char *s)
{
    size_t len = strlen(s);
    len += len;
    return len;
}
// unoptimized
size_t strlenx2(const char *s)
{
    size_t tmp_len; // r14
    const char *tmp_s; // rbx
    tmp_s = s;
    size_t len = strlen(s);
    tmp_len = len;
    s = tmp_s;
    len = strlen(s);
    len += tmp_len;
    return len;
}

This could possibly improve performance on some hash tables depending on how they are used. Primarily thinking of code that looks like this:

    table[key].foo = "Foo";
    table[key].bar = "Bar";

Note that the compiler sometimes can figure this out on its own if xxHash is inlined, but this applies to both inline and extern functions.

I would have an option added to explicitly disable it for a fair benchmark.

Other ideas:

add the overloads to allow at least XXH3_64bits to drop in as a replacement for common std::hash uses in C++
Add type safe overload inlines for C++, which can be opted out of
- Checks if the pointed objects are sane types to hash (std::is_trivial, std::has_unique_object_representations in C++17, check len % sizeof(T))
add a noexcept/__attribute__((__nothrow__)) specifier since the single shot functions will never throw an exception allowing the compiler to leave out unwind tables and shrink code size a bit.

Cyan4973 · 2020-10-03T04:19:59Z

Yes, I like pure functions, so I'm all for it.
Also note the existence of const functions in gcc, with an even stricter set of restrictions.

In general, I would expect -O3 to spend enough cpu to discover which functions are pure, so it's unclear if xxhash will receive a measurable boost to performance with these function qualifiers,
that being said, I like them even if the only impact is to provide better code documentation.
Also, in the context of library linking, this is an information that the linker can't guess from just the function signature, so it could end up being actually useful on the user side.

easyaspi314 · 2020-10-04T20:12:49Z

In general, I would expect -O3 to spend enough cpu to discover which functions are pure, so it's unclear if xxhash will receive a measurable boost to performance with these function qualifiers.

It appears that with XXH_INLINE_ALL, Clang and GCC can't tell that XXH3 is pure without the annotation, but it can figure out XXH32 and XXH64.

easyaspi314 · 2020-10-10T01:46:35Z

A few months ago, you mentioned how XXH3 is threadable. Obviously this would be an opt-in feature, as some programs like compilers (which I know at least Clang and GNU LD use XXH64) are designed to remain on a single thread to parallelize with make.

With some experimentation, it seems to be beneficial to spawn a second thread once you get to ~8-16MB.

On my phone (haven't tested on my PC yet because I have yet to master Windows threads), 6-8 MB seems to be the range where it is beneficial, with a max speed of 7.3 GB/s compared to 5.2 GB/s on one thread.

The implementation would be pretty simple; the most complicated thing here is dealing with the pthread struct and the accumulate loop which should probably be outlined to its own function to avoid copypasta.

I believe we can do a similar thing with CreateThread on Windows.

Note that this would technically conflict with the __attribute__((__pure__)) idea, although if we talk about the end effects, it will still be pure.

Code I was testing

#ifdef XXH_PTHREADS
#include <pthread.h>
/*
 * A data block for pthreads.
 * XXX: Some of these fields can probably be removed.
 */
typedef struct {
    xxh_u64 acc[XXH_ACC_NB];
    const xxh_u8 *input;
    const xxh_u8 *secret;
    size_t secretSize;
    size_t nbStripes;
    size_t block_len;
    size_t nb_blocks;
    XXH3_f_accumulate_512 f_acc512;
    XXH3_f_scrambleAcc f_scramble;
} XXH3_pthreadData_t;

/*
 * The length at which xxHash should spawn a pthread.
 * Spawning threads has significant overhead and is a waste
 * of time for short hashes. 
 *
 * The most optimal value varies significantly on the platform.
 */
#ifndef XXH_PTHREADS_THRESHOLD
#  define XXH_PTHREADS_THRESHOLD (8 * 1048576U) /* 8 MiB */
#endif

XXH_NO_INLINE
void* XXH3_accumulate_pthread(void* d)
{
    XXH3_pthreadData_t* data = (XXH3_pthreadData_t *)d;
    size_t n;
    /* TODO: DRY, this loop is copied multiple times. */
    for (n = 0; n < data->nb_blocks; n++) {
        XXH3_accumulate(data->acc, data->input + n*data->block_len, data->secret, data->nbStripes, data->f_acc512);
        (data->f_scramble)(data->acc, data->secret + data->secretSize - XXH_STRIPE_LEN);
    }
    pthread_exit((void*)data);
}
#endif

/* Note: move XXH_alignedMalloc/XXH_alignedFree above this or give protos */

XXH_FORCE_INLINE void
XXH3_hashLong_internal_loop(xxh_u64* XXH_RESTRICT acc,
                      const xxh_u8* XXH_RESTRICT input, size_t len,
                      const xxh_u8* XXH_RESTRICT secret, size_t secretSize,
                            XXH3_f_accumulate_512 f_acc512,
                            XXH3_f_scrambleAcc f_scramble)
{
    size_t const nbStripesPerBlock = (secretSize - XXH_STRIPE_LEN) / XXH_SECRET_CONSUME_RATE;
    size_t const block_len = XXH_STRIPE_LEN * nbStripesPerBlock;
    size_t const nb_blocks = (len - 1) / block_len;

    size_t n = 0;

    XXH_ASSERT(secretSize >= XXH3_SECRET_SIZE_MIN);
#ifdef XXH_PTHREADS
    /*
     * XXH3 operates in blocks which are added together.
     *
     * Normally, this is constantly added to the acc array on the fly, like so;
     *   acc = acc + sum[0->N] { accumulate(N) };
     *
     * Due to the properties of addition, we can actually calculate blocks in
     * parallel if we start with a second acc starting zeroed:
     *   acc = (acc + sum[0->N/2] { accumulate(N) })
     *       + (  0 + sum[N/2->N] { accumulate(N) })
     *
     * Spawning a single pthread to process half of the data is
     * beneficial after about 8 MiB (*very* platform dependent).
     * There is not much use in spawning any more threads; it already takes
     * hundreds of thousands of iterations for there to be a benefit,
     * and it would get very messy and add even more overhead.
     */
    if (len >= XXH_PTHREADS_THRESHOLD) {
        /*
         * Using malloc is faster for some reason. Likely aliasing rules.
         */
        XXH3_pthreadData_t* threadData = (XXH3_pthreadData_t*)XXH_alignedMalloc(sizeof(XXH3_pthreadData_t), 64);
        /*
         * If malloc succeeds, try to start a thread, otherwise fall back to
         * the single threaded loop after the #endif.
         */
        if (threadData != NULL) {
            pthread_t thread;
            int threadLaunched;
            void* status = (void *)threadData;

            /* Fill the struct and set it to process the second half. */
            memset(threadData->acc, 0, sizeof(threadData->acc));
            threadData->input = input + ((nb_blocks - (nb_blocks / 2)) * block_len);
            threadData->secret = secret;
            threadData->secretSize = secretSize;
            threadData->nbStripes = nbStripesPerBlock;
            threadData->nb_blocks = nb_blocks - (nb_blocks / 2);
            threadData->f_acc512 = f_acc512;
            threadData->f_scramble = f_scramble;

            /*
             * Launch the thread on the second half of the input.
             *
             * We don't care about whether it actually started until later.
             */
            threadLaunched = !pthread_create(&thread, NULL, &XXH3_accumulate_pthread, status);

            /* Process the first half on the main thread */
            for (; n < nb_blocks / 2; n++) {
                XXH3_accumulate(acc, input + n*block_len, secret, nbStripesPerBlock, f_acc512);
                f_scramble(acc, secret + secretSize - XXH_STRIPE_LEN);
            }

            /*
             * If we have launched our thread, finish it, merge its acc with
             " the main thread's acc and mark the section as completed.
             * If it failed, we process the second half after the #endif
             * on the main thread. 
             */
            if (threadLaunched && !pthread_join(thread, &status)) {
                size_t i;
                /* Merge the acc fragments */
                for (i = 0; i < XXH_ACC_NB; i++) {
                    /* associative property */
                    acc[i] += threadData->acc[i];
                }
                /* Mark that the thread successfully processed the second half */
                n += threadData->nb_blocks;
            }
            XXH_alignedFree(threadData);
        }
    }
#endif /* XXH_PTHREADS */
    for (; n < nb_blocks; n++) {
        XXH3_accumulate(acc, input + n*block_len, secret, nbStripesPerBlock, f_acc512);
        f_scramble(acc, secret + secretSize - XXH_STRIPE_LEN);
    }
    /* ... */
}

As I mention in the comment, I don't see any reason to spawn more than one helper thread, as we waste hundreds of thousands of possible accumulate loops by setting up each pthread, meaning 4 threads would likely require a ridiculous 64-128 MB and a much more complicated error handling routine.

easyaspi314 · 2020-10-15T01:43:35Z

So I was wondering if we should start doing Doxygen? We don't necessarily have to set up a server for it.

Especially since xxhash.h is massive now, having a little Doxygen site might help and would probably be easier than writing xxh3_spec.md

It also gives us some opportunity to document the internals because we can group them.

Here are some examples:

Also, didn't we plan on switching XXH64_hash_t to unsigned long on 64-bit Linux?

Cyan4973 · 2020-10-15T04:54:29Z

I was wondering if we should start doing Doxygen?

Yes, that's a good idea.
Moving code comments to Doxygen parsing convention can be done progressively.

didn't we plan on switching XXH64_hash_t to unsigned long on 64-bit Linux?

I don't see a benefit in such a change

easyaspi314 · 2020-10-15T10:45:29Z

I don't see a benefit in such a change

uint64_t is unsigned long on LP64, so it would be consistent.

easyaspi314 · 2020-10-15T10:57:48Z

$ cat test.c
#include <xxhash.h>
#include <stdio.h>
#include <inttypes.h>

int main(void)
{
    printf("%#016" PRIx64 "\n", XXH64("hello", 5, 0));
    return 0;
}
$ gcc -std=gnu99 -O2 -Wall -c test.c -I xxHash
// Ok
$ g++ -std=gnu++11 -O2 -Wall -c -xc++ test.c -I xxHash
// Ok
$ gcc -std=gnu90 -O2 -Wall -c test.c -I xxHash -Wno-long-long
test.c: In function 'main':
test.c:7:12: warning: format '%lx' expects argument of type 'long unsigned int', but argument 2 has type 'XXH64_hash_t' {aka 'long long unsigned int'} [-Wformat=]
    7 |     printf("%#016" PRIx64 "\n", XXH64("Hello", 5, 0));
      |            ^~~~~~~              ~~~~~~~~~~~~~~~~~~~~
      |                                 |
      |                                 XXH64_hash_t {aka long long unsigned int}
In file included from test.c:3:
/usr/include/inttypes.h:127:34: note: format string is defined here
  127 | #define PRIx64   __PRI_64_prefix"x"  /* uint64_t */
$

In gnu90 mode, -Wformat fires off because on LP64, PRIx64 is "lx".

The reverse is true if you do "%llx", it will fire a warning on C++ and C99.

easyaspi314 · 2020-10-15T22:49:02Z

Some Doxygen documentation added in #462

Cyan4973 · 2021-08-09T23:54:09Z

Declaring relevant functions as pure and const would be a good follow up.

easyaspi314 · 2021-12-14T04:16:32Z

XXH_SIZEOPT config option

==0: normal
==1: Disables forceinline and manual unrolling
==2: Reuse streaming API for single shot, other dirty size hacks?

Potential speed boost: We can hash multiple acc blocks at a time until a scramble. So something like this:

accumulate_doubleAcc()
{
    xxh_u64 acc2[8] = {0};
    size_t n = 0;
    // alternative: duffs device but that might harm
    // interleaving
    if (nbStripes % 2 == 1) {
        accumulate_512(acc2);
        n++;
    }
    while (n < nbStripes) {
        accumulate_512(acc);
        n++;
        accumulate_512(acc2);
        n++;
    }
    for (n = 0; n < 8; n++) {
        acc[n] += acc2[n];
    }
}

I didn't see any benefit on NEON AArch64 no matter how many NEON lanes I chose, and ARMv7 and SSE2 don't have enough registers.

However, I think that AVX2 and AVX512 would likely benefit since their loops are tighter. I will benchmark on Ryzen when I get a chance.

Edit: Clang has no difference on Ryzen 5 3600 and GCC clearly gets confused.

Cyan4973 · 2023-06-30T01:29:06Z

Updated list of objectives for v0.8.2

Cyan4973 · 2023-07-12T23:05:38Z

I was considering (as a tentative objective) shipping DISPATCH=1 by default for the generation of xxhsum CLI, but there is still a significant amount of work to do to reach this stage safely, and I'm concerned it would delay v0.8.2 release by too long.
So, instead, I've pushed this objective to an hypothetical v0.8.3 future release.
(will be folded into v0.9.0 if need be).

t-mat · 2023-07-12T23:21:15Z

As for XXH_OLD_NAMES, should we add deprecation message for it in v0.8.2?

Cyan4973 · 2023-07-12T23:25:43Z

As for XXH_OLD_NAMES, should we add deprecation message for it in v0.8.2?

This seems like a good idea.
If I understand properly, XXH_OLD_NAMES is disabled by default, so the warning could be triggered just on detecting if it's set.

t-mat · 2023-07-15T21:51:25Z

We can add "Milestone" via right side pane of issue view. And we also can assign/reuse it to other issues to indicate the milestone. It'll be useful at the future review of issues.
For example, we can assign v0.9.0 milestone to #483.

Cyan4973 added the Long-term label Nov 26, 2021

Cyan4973 mentioned this issue Dec 11, 2021

Silence GCC __attr(pure) missing warning. #655

Closed

Cyan4973 changed the title ~~Suggestions list~~ Suggestions list for future evolutions May 6, 2022

Cyan4973 self-assigned this Jul 12, 2023

t-mat mentioned this issue Jul 13, 2023

release a new version #862

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Suggestions list for future evolutions #458

Suggestions list for future evolutions #458

Cyan4973 commented Oct 1, 2020 •

edited

Loading

easyaspi314 commented Oct 1, 2020 •

edited

Loading

easyaspi314 commented Oct 3, 2020 •

edited

Loading

Cyan4973 commented Oct 3, 2020 •

edited

Loading

easyaspi314 commented Oct 4, 2020

easyaspi314 commented Oct 10, 2020 •

edited

Loading

easyaspi314 commented Oct 15, 2020 •

edited

Loading

Cyan4973 commented Oct 15, 2020 •

edited

Loading

easyaspi314 commented Oct 15, 2020 •

edited

Loading

easyaspi314 commented Oct 15, 2020 •

edited

Loading

easyaspi314 commented Oct 15, 2020

Cyan4973 commented Aug 9, 2021 •

edited

Loading

easyaspi314 commented Dec 14, 2021 •

edited

Loading

Cyan4973 commented Jun 30, 2023

Cyan4973 commented Jul 12, 2023

t-mat commented Jul 12, 2023

Cyan4973 commented Jul 12, 2023

t-mat commented Jul 15, 2023 •

edited

Loading

Suggestions list for future evolutions #458

Suggestions list for future evolutions #458

Comments

Cyan4973 commented Oct 1, 2020 • edited Loading

Objectives for v0.8.2 - Completed

Objectives for v0.8.3

Objectives for v0.9.0

easyaspi314 commented Oct 1, 2020 • edited Loading

easyaspi314 commented Oct 3, 2020 • edited Loading

Cyan4973 commented Oct 3, 2020 • edited Loading

easyaspi314 commented Oct 4, 2020

easyaspi314 commented Oct 10, 2020 • edited Loading

easyaspi314 commented Oct 15, 2020 • edited Loading

Cyan4973 commented Oct 15, 2020 • edited Loading

easyaspi314 commented Oct 15, 2020 • edited Loading

easyaspi314 commented Oct 15, 2020 • edited Loading

easyaspi314 commented Oct 15, 2020

Cyan4973 commented Aug 9, 2021 • edited Loading

easyaspi314 commented Dec 14, 2021 • edited Loading

Cyan4973 commented Jun 30, 2023

Cyan4973 commented Jul 12, 2023

t-mat commented Jul 12, 2023

Cyan4973 commented Jul 12, 2023

t-mat commented Jul 15, 2023 • edited Loading

Cyan4973 commented Oct 1, 2020 •

edited

Loading

easyaspi314 commented Oct 1, 2020 •

edited

Loading

easyaspi314 commented Oct 3, 2020 •

edited

Loading

Cyan4973 commented Oct 3, 2020 •

edited

Loading

easyaspi314 commented Oct 10, 2020 •

edited

Loading

easyaspi314 commented Oct 15, 2020 •

edited

Loading

Cyan4973 commented Oct 15, 2020 •

edited

Loading

easyaspi314 commented Oct 15, 2020 •

edited

Loading

easyaspi314 commented Oct 15, 2020 •

edited

Loading

Cyan4973 commented Aug 9, 2021 •

edited

Loading

easyaspi314 commented Dec 14, 2021 •

edited

Loading

t-mat commented Jul 15, 2023 •

edited

Loading