Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support base64 encoding #45

Open
steverep opened this issue Mar 29, 2023 · 6 comments
Open

Support base64 encoding #45

steverep opened this issue Mar 29, 2023 · 6 comments

Comments

@steverep
Copy link

Just a feature request to support encodings other than hex natively in the web assembly, particularly base64 and base64url for shorter digests.

Thanks for the package 👍🏻

@jungomi
Copy link
Owner

jungomi commented Mar 30, 2023

As I've mentioned in #27 (comment) the encoding is out of scope for this library, and if you want to have an optimised WebAssembly encoding, it would be much more appropriate to have a specialised library for this.

It seems that you are suggesting that the output of the hash function (= digest) should be base64 encoded, but it does not have any encoding at all in this library. The output of the hash functions are unsigned integers (u32 and u64 respectively) and h**ToString simply gives you the hex representation of that integer, just because that is the most common representation and particularly useful due to the lack of support for (native) unsigned integers in JavaScript. Furthermore, that is a built-in functionality (in JS not WASM), and has been added purely for convenience.

If I understood you correctly, you'd want a WebAssembly implementation of base64, but as I mentioned before, this is not in scope of this library and wouldn't give you any benefit over using an external library to convert the simple integer output. The overhead that occurs from this is marginal, as it is just passing the integer to WASM (not even a bunch of memory, just a single integer) .

@steverep
Copy link
Author

steverep commented Apr 2, 2023

So I created this request for both performance and API conformance...

For performance, I was misreading the Performance section of the README. It sounded on initial read that you were able to encode the hashes 20% faster compared to outside the WASM. My bad.

For conformance, since you are mimicking node's crypto with the create* API, my thought process was that it might be an issue plugging into another package that expects the digest method to accept an encoding type argument. Probably not necessary in many cases, but something to consider.

@jungomi
Copy link
Owner

jungomi commented Apr 3, 2023

Oh that's right, the Node crypto API allows to pass an encoding to the digest. That made me realise there isn't even a way currently to get the hash as a string from the streaming API (there is no equivalent to h**ToString in the streaming API).

Full compatibility with the Node crypto API is not really a goal and as far as I am aware, there is no way to get an integer from crypto, which is the default here, so it's guaranteed to be different.
However, I can see the interest in having the possibility to specify the desired encoding. If people would like to use this functionality, I'd be happy to accept a PR for it.

Base64

For base64 there is btoa() available in JavaScript, so that would be very easy to include.

This would again be purely for convenience and there is no performance benefit over doing it manually afterwards, as it is essentially:

const digest = create32()
  .update("some data")
  .digest();
const digestBase64 = btoa(digest);

// If the digest is a standalone call (after all updates on `hash`)
// it becomes:
const digestBase64 = btoa(hash.digest());

If we really wanted to go into micro-optimisations, this would be faster than any implementation that works with .digest("base64"), at least in the end-user case where you only use one format (or at least know the format beforehand for the specific use-case), because of the extra branching to check which output format was specified as well as just having that extra string allocation. It's very minute and quite silly to be talking about it, but that is also one of the reasons that for example h32 and h32ToString are separate functions, in order to avoid any extra overhead, however small it might be.

All in all, if that functionality should be added, it needs to be considered that the convenience should not come at the expense of the default path. In that case, different methods, such as digestBase64(), would likely be more appropriate rather than mirroring the Node crypto API.

@steverep
Copy link
Author

steverep commented Aug 2, 2023

const digest = create32()
  .update("some data")
  .digest();
const digestBase64 = btoa(digest);

This would not produce the desired result as btoa takes a byte string as input, not an integer. The only way I know to do it right would be to convert the integer to a buffer first:

// Utility methods to output base64 from xxhash-wasm
// Note that the uint hash must be written in big endian for output to be
// platform independent and conform to canonical spec for xxhash

import { Buffer } from "node:buffer";
import xxhash from "xxhash-wasm";

const hasher = await xxhash();

// Node only version using Buffer class' toString()
export const xx64ToString = (data, encoding) => {
  const hashBuffer = Buffer.allocUnsafe(8);
  hashBuffer.writeBigUInt64BE(hasher.h64(data));
  return hashBuffer.toString(encoding);
};

// Native version for browser using DataView and btoa()
export const xx64ToBase64 = (data) => {
  const hashBuffer = new ArrayBuffer(8);
  const view = new DataView(hashBuffer);
  view.setBigUint64(0, hasher.h64(data));
  const hashBytes = new Uint8Array(hashBuffer);
  return btoa(String.fromCodePoint(...hashBytes));
};

The node version is more versatile since it can output any encoding supported by node. For example, outputting base64url in the browser would require an extra step to replace any +, /, or = characters. The browser version would also be expected to be somewhat slower since it always fills the buffer with zeros first.

I am sensitive to arguments for keeping packages scoped properly, but in this case I'd argue that at least handing the user a proper big endian buffer representation of the hash is in scope. For example, it would be just as easy above to create a typed array and pull the buffer from that:

  const hashBuffer = (new BigUint64Array([hasher.h64(data)])).buffer;

However, encoding that would produce an incorrect result on little endian platforms unless the bytes were reversed first.

@marcusdarmstrong
Copy link
Contributor

I am sensitive to arguments for keeping packages scoped properly, but in this case I'd argue that at least handing the user a proper big endian buffer representation of the hash is in scope. For example, it would be just as easy above to create a typed array and pull the buffer from that:

xxhash-wasm never has a buffer of the hash, however, so it's no better at producing a buffer of this data than a user-space function, and as you've demonstrated with your samples, there's not a good portable way of doing so.

Fwiw, practically speaking, a lookup table is going to be the fastest portable solution here given that we have known number of bits to work (meaning padding is easily handled), so something like

const lookup = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".split('');

function u64ToBase64(n) {
	let r = '';
	for (let i = 58n; i >= 0n; i -= 6n) {
		r += lookup[Number(n >> i) & 63];
	}
	r += '=';
	return r;
} 

@steverep
Copy link
Author

steverep commented Aug 2, 2023

xxhash-wasm never has a buffer of the hash, however, so it's no better at producing a buffer of this data than a user-space function, and as you've demonstrated with your samples, there's not a good portable way of doing so.

The latter half of that confuses me. Those methods are portable across architectures, and the 2nd is portable in the sense of JS engine as well. My point was to demonstrate (a) how to do it right, and (b) if left entirely to the user, it's easy to do it wrong (e.g. by making architecture assumptions or using a little endian form of the integer).

Including something in the library to return a buffer or alternate encoding would be purely for user convenience to save the time of having to lookup the canonical format for xxhash and understand byte order in general. Alternatively, you could just demonstrate the right way in the docs and I'd be happy to provide a gist for linking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants