-
Notifications
You must be signed in to change notification settings - Fork 775
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Suggestions list for future evolutions #458
Comments
Another one I was thinking was an option to disable streaming. The streaming API takes up a good chunk of the binary size:
Edit:
|
Another It says, roughly:
Basically, const char *bad_strchr(const char *s, int c)
{
for (size_t i = 0; i < strlen(s) /* EEK */; i++) {
if ((unsigned char)s[i] == (unsigned char)c) {
return &s[i];
}
}
return NULL;
} Any decent compiler will change it to this: const char *bad_strchr(const char *s, int c)
{
const size_t len = strlen(s); // strlen is pure, we only need to call it once
for (size_t i = 0; i < len; i++) {
if ((unsigned char)s[i] == (unsigned char)c) {
return &s[i];
}
}
return NULL;
} Although it is easiest to see with this code: size_t strlenx2(const char *s)
{
return strlen(s) + strlen(s);
} Equivalent code with how Clang shuffles registers on x86_64: // optimized
size_t strlenx2(const char *s)
{
size_t len = strlen(s);
len += len;
return len;
}
// unoptimized
size_t strlenx2(const char *s)
{
size_t tmp_len; // r14
const char *tmp_s; // rbx
tmp_s = s;
size_t len = strlen(s);
tmp_len = len;
s = tmp_s;
len = strlen(s);
len += tmp_len;
return len;
} This could possibly improve performance on some hash tables depending on how they are used. Primarily thinking of code that looks like this: table[key].foo = "Foo";
table[key].bar = "Bar"; Note that the compiler sometimes can figure this out on its own if xxHash is inlined, but this applies to both inline and extern functions. I would have an option added to explicitly disable it for a fair benchmark. Other ideas:
|
Yes, I like pure functions, so I'm all for it. In general, I would expect |
It appears that with XXH_INLINE_ALL, Clang and GCC can't tell that XXH3 is pure without the annotation, but it can figure out XXH32 and XXH64. |
A few months ago, you mentioned how XXH3 is threadable. Obviously this would be an opt-in feature, as some programs like compilers (which I know at least Clang and GNU LD use XXH64) are designed to remain on a single thread to parallelize with With some experimentation, it seems to be beneficial to spawn a second thread once you get to ~8-16MB. On my phone (haven't tested on my PC yet because I have yet to master Windows threads), 6-8 MB seems to be the range where it is beneficial, with a max speed of 7.3 GB/s compared to 5.2 GB/s on one thread. The implementation would be pretty simple; the most complicated thing here is dealing with the pthread struct and the accumulate loop which should probably be outlined to its own function to avoid copypasta. I believe we can do a similar thing with Note that this would technically conflict with the Code I was testing#ifdef XXH_PTHREADS
#include <pthread.h>
/*
* A data block for pthreads.
* XXX: Some of these fields can probably be removed.
*/
typedef struct {
xxh_u64 acc[XXH_ACC_NB];
const xxh_u8 *input;
const xxh_u8 *secret;
size_t secretSize;
size_t nbStripes;
size_t block_len;
size_t nb_blocks;
XXH3_f_accumulate_512 f_acc512;
XXH3_f_scrambleAcc f_scramble;
} XXH3_pthreadData_t;
/*
* The length at which xxHash should spawn a pthread.
* Spawning threads has significant overhead and is a waste
* of time for short hashes.
*
* The most optimal value varies significantly on the platform.
*/
#ifndef XXH_PTHREADS_THRESHOLD
# define XXH_PTHREADS_THRESHOLD (8 * 1048576U) /* 8 MiB */
#endif
XXH_NO_INLINE
void* XXH3_accumulate_pthread(void* d)
{
XXH3_pthreadData_t* data = (XXH3_pthreadData_t *)d;
size_t n;
/* TODO: DRY, this loop is copied multiple times. */
for (n = 0; n < data->nb_blocks; n++) {
XXH3_accumulate(data->acc, data->input + n*data->block_len, data->secret, data->nbStripes, data->f_acc512);
(data->f_scramble)(data->acc, data->secret + data->secretSize - XXH_STRIPE_LEN);
}
pthread_exit((void*)data);
}
#endif
/* Note: move XXH_alignedMalloc/XXH_alignedFree above this or give protos */
XXH_FORCE_INLINE void
XXH3_hashLong_internal_loop(xxh_u64* XXH_RESTRICT acc,
const xxh_u8* XXH_RESTRICT input, size_t len,
const xxh_u8* XXH_RESTRICT secret, size_t secretSize,
XXH3_f_accumulate_512 f_acc512,
XXH3_f_scrambleAcc f_scramble)
{
size_t const nbStripesPerBlock = (secretSize - XXH_STRIPE_LEN) / XXH_SECRET_CONSUME_RATE;
size_t const block_len = XXH_STRIPE_LEN * nbStripesPerBlock;
size_t const nb_blocks = (len - 1) / block_len;
size_t n = 0;
XXH_ASSERT(secretSize >= XXH3_SECRET_SIZE_MIN);
#ifdef XXH_PTHREADS
/*
* XXH3 operates in blocks which are added together.
*
* Normally, this is constantly added to the acc array on the fly, like so;
* acc = acc + sum[0->N] { accumulate(N) };
*
* Due to the properties of addition, we can actually calculate blocks in
* parallel if we start with a second acc starting zeroed:
* acc = (acc + sum[0->N/2] { accumulate(N) })
* + ( 0 + sum[N/2->N] { accumulate(N) })
*
* Spawning a single pthread to process half of the data is
* beneficial after about 8 MiB (*very* platform dependent).
* There is not much use in spawning any more threads; it already takes
* hundreds of thousands of iterations for there to be a benefit,
* and it would get very messy and add even more overhead.
*/
if (len >= XXH_PTHREADS_THRESHOLD) {
/*
* Using malloc is faster for some reason. Likely aliasing rules.
*/
XXH3_pthreadData_t* threadData = (XXH3_pthreadData_t*)XXH_alignedMalloc(sizeof(XXH3_pthreadData_t), 64);
/*
* If malloc succeeds, try to start a thread, otherwise fall back to
* the single threaded loop after the #endif.
*/
if (threadData != NULL) {
pthread_t thread;
int threadLaunched;
void* status = (void *)threadData;
/* Fill the struct and set it to process the second half. */
memset(threadData->acc, 0, sizeof(threadData->acc));
threadData->input = input + ((nb_blocks - (nb_blocks / 2)) * block_len);
threadData->secret = secret;
threadData->secretSize = secretSize;
threadData->nbStripes = nbStripesPerBlock;
threadData->nb_blocks = nb_blocks - (nb_blocks / 2);
threadData->f_acc512 = f_acc512;
threadData->f_scramble = f_scramble;
/*
* Launch the thread on the second half of the input.
*
* We don't care about whether it actually started until later.
*/
threadLaunched = !pthread_create(&thread, NULL, &XXH3_accumulate_pthread, status);
/* Process the first half on the main thread */
for (; n < nb_blocks / 2; n++) {
XXH3_accumulate(acc, input + n*block_len, secret, nbStripesPerBlock, f_acc512);
f_scramble(acc, secret + secretSize - XXH_STRIPE_LEN);
}
/*
* If we have launched our thread, finish it, merge its acc with
" the main thread's acc and mark the section as completed.
* If it failed, we process the second half after the #endif
* on the main thread.
*/
if (threadLaunched && !pthread_join(thread, &status)) {
size_t i;
/* Merge the acc fragments */
for (i = 0; i < XXH_ACC_NB; i++) {
/* associative property */
acc[i] += threadData->acc[i];
}
/* Mark that the thread successfully processed the second half */
n += threadData->nb_blocks;
}
XXH_alignedFree(threadData);
}
}
#endif /* XXH_PTHREADS */
for (; n < nb_blocks; n++) {
XXH3_accumulate(acc, input + n*block_len, secret, nbStripesPerBlock, f_acc512);
f_scramble(acc, secret + secretSize - XXH_STRIPE_LEN);
}
/* ... */
} As I mention in the comment, I don't see any reason to spawn more than one helper thread, as we waste hundreds of thousands of possible accumulate loops by setting up each pthread, meaning 4 threads would likely require a ridiculous 64-128 MB and a much more complicated error handling routine. |
So I was wondering if we should start doing Doxygen? We don't necessarily have to set up a server for it. Especially since It also gives us some opportunity to document the internals because we can group them. Here are some examples: Also, didn't we plan on switching |
Yes, that's a good idea.
I don't see a benefit in such a change |
|
$ cat test.c
#include <xxhash.h>
#include <stdio.h>
#include <inttypes.h>
int main(void)
{
printf("%#016" PRIx64 "\n", XXH64("hello", 5, 0));
return 0;
}
$ gcc -std=gnu99 -O2 -Wall -c test.c -I xxHash
// Ok
$ g++ -std=gnu++11 -O2 -Wall -c -xc++ test.c -I xxHash
// Ok
$ gcc -std=gnu90 -O2 -Wall -c test.c -I xxHash -Wno-long-long
test.c: In function 'main':
test.c:7:12: warning: format '%lx' expects argument of type 'long unsigned int', but argument 2 has type 'XXH64_hash_t' {aka 'long long unsigned int'} [-Wformat=]
7 | printf("%#016" PRIx64 "\n", XXH64("Hello", 5, 0));
| ^~~~~~~ ~~~~~~~~~~~~~~~~~~~~
| |
| XXH64_hash_t {aka long long unsigned int}
In file included from test.c:3:
/usr/include/inttypes.h:127:34: note: format string is defined here
127 | #define PRIx64 __PRI_64_prefix"x" /* uint64_t */
$ In The reverse is true if you do |
Some Doxygen documentation added in #462 |
Declaring relevant functions as |
accumulate_doubleAcc()
{
xxh_u64 acc2[8] = {0};
size_t n = 0;
// alternative: duffs device but that might harm
// interleaving
if (nbStripes % 2 == 1) {
accumulate_512(acc2);
n++;
}
while (n < nbStripes) {
accumulate_512(acc);
n++;
accumulate_512(acc2);
n++;
}
for (n = 0; n < 8; n++) {
acc[n] += acc2[n];
}
} I didn't see any benefit on NEON AArch64 no matter how many NEON lanes I chose, and ARMv7 and SSE2 don't have enough registers. However, I think that AVX2 and AVX512 would likely benefit since their loops are tighter. I will benchmark on Ryzen when I get a chance. Edit: Clang has no difference on Ryzen 5 3600 and GCC clearly gets confused. |
Updated list of objectives for |
I was considering (as a tentative objective) shipping |
As for |
This seems like a good idea. |
We can add "Milestone" via right side pane of issue view. And we also can assign/reuse it to other issues to indicate the milestone. It'll be useful at the future review of issues. |
List updated June 30th 2023 :
Objectives for v0.8.2 - Completed
SVE
detection message at CLI prompt.XXH64
and maybeXXH32
? (tentative) (may not benefit) -> benefit is too small, abandoned.Objectives for v0.8.3
cmake
minimum version tov3.5
(cmake: bump minimum required to version 3.5.0 #859) - completedxxhsum
withDISPATCH
enabled by default ? (tentative)AVX2
is enabled at compilation time (Reconsider XXH_X86DISPATCH_ALLOW_AVX check #839)arm64dispatch
variant, for SVE/NEON dispatch (tentative, based on Support SVE with assembly implementation #762)Objectives for v0.9.0
XXH_generateSecret()
XXH_OLDNAME
(or maybe v1.0 ?)xxhash.h
into multiple smaller files, and ship an ability to create a single amalgamated file from them ? (tentative)XXH_ENABLE_XXH32
,XXH_ENABLE_XXH64
andXXH_ENABLE_XXH3
XXH32
(withXXH_NO_LONG_LONG
), orXXH32
+XXH64
(withXXH_NO_XXH3
) or everything. It's not possible to compile selectively onlyXXH64
for example.libxxhash
dynamic library with runtime vector extension detection enabled by default (xxh_x64dispatch
) ? (tentative)DISPATCH=1
makefile parameter. In which case, it unconditionally addsxxh_x86dispatch.c
to the unit list, which crashes on non-x86 targets. Making this option default requires a safe capability to detectx86
targets at build or compilation time.DISPATCH
must therefore substitute its symbols to original ones.The text was updated successfully, but these errors were encountered: