Alternative compressed JSON encoding #225

Open
matthijskooijman opened this issue Jan 15, 2015 · 2 comments

Comments

@matthijskooijman
Collaborator

Right now, we use a custom compressed form for sending reports that relies on both ends of the connection having the same int-to-string mapping, and on knowing the structure of the JSON in advance (to some degree).

However, a more generic binary JSON format might be useful to express arbitrary values more concisely, without requiring any pre-existing agreement over field names etc. (Dropping that agreement will probably increase the size again, so the end result will probably be about as big as what we have now.)
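To make the trade-off concrete, here is a minimal sketch of the shared-mapping idea. It is not the actual report format: the key names and the `KEY_TABLE`, `compress` and `expand` helpers are hypothetical, chosen only to show why both ends need the same table and why field names cost bytes once that table is gone.

```python
import json

# Hypothetical key table. Both ends of the connection must hold an
# identical copy, which is exactly the pre-existing agreement at issue.
KEY_TABLE = ["temperature", "battery", "uptime"]
KEY_TO_ID = {name: i for i, name in enumerate(KEY_TABLE)}


def compress(report):
    """Replace every known key with its integer index in KEY_TABLE."""
    return [[KEY_TO_ID[key], value] for key, value in report.items()]


def expand(pairs):
    """Reverse of compress(); only works with the same KEY_TABLE."""
    return {KEY_TABLE[index]: value for index, value in pairs}


report = {"temperature": 21, "battery": 87, "uptime": 3600}

# Self-describing form: field names travel with every report.
plain = json.dumps(report, separators=(",", ":"))
# Shared-mapping form: only small integers travel.
packed = json.dumps(compress(report), separators=(",", ":"))

print(len(plain), len(packed))
```

A generic binary format shrinks the punctuation and the numbers, but still has to carry the field names, which is why the two approaches may end up at a similar size.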

I had a look at some binary JSON formats today:

  • BSON is a format for JSON plus a bunch of extras (dates, regexes, JavaScript code) we don't need. Arrays are encoded as objects with numerical keys, which isn't very concise (the keys have to be specified explicitly).
  • Universal Binary JSON (UBJSON) is a simple format that uses single ASCII characters to indicate value types. Since it uses a full byte for the type (without putting a length or small value in the unused bits), some space is wasted. OTOH, it does allow specifying a single type specifier for all values in an array or object, which allows homogeneous arrays to be written fairly compactly. The regular data format makes UBJSON somewhat readable. Binary numbers are still not ASCII of course, but at least the type specifiers are readable.
  • BJSON seems like an unfinished draft. It has broken links and the spec isn't really clear or complete.
  • MessagePack is a reasonably complex binary encoding that efficiently uses all bits in every type byte. It fits unsigned integers of 7 bits (0 to 127) or negative integers of 5 bits (so -32 to -1) in a single byte, and only needs a single type byte for strings up to 31 bytes in length. It adds a binary and an "ext" format (basically also an arbitrary binary string), which are not present in JSON. messagepack-c is a fairly complex C and (partly separate) C++ implementation. Documentation isn't great, but it does look rather complete, and the use of templates suggests that it might be able to generate very low overhead code. It was written with STL containers in mind though, so we'd have to remove some stuff and/or perhaps only use the lower level interface. cmp is a lot more basic, but probably easier to get working. Neither supports translating to JSON directly, but it is probably fairly easy to glue cmp to js0n.
  • Smile is a complex binary encoding that maps 100% to JSON. It has quite a few different value types, uses vint and zigzag encoding for numbers, and fits signed integers of 5 bits (so -16 to 15) in a single byte. It assumes strings are either ASCII or UTF-8 (and depends on some bytes never occurring in either). Other strings could be encoded as binary strings (not sure how that maps to JSON). It has string sharing, where you can reference earlier strings in the same datastream, which could be very helpful when strings or keys are repeated.
  • CBOR (specified in RFC 7049) is a compact, but also fairly regular binary format. It is reminiscent of Smile in that it has a 3-bit type and a 5-bit extended value, but it is a lot less complex. This results in some values being unused and some encodings being a bit bigger than strictly needed, but it also results in a simpler parser (which is an explicitly defined goal of CBOR). In a single byte, an unsigned integer from 0 to 23 or a negative integer from -24 to -1 can be encoded (the 5-bit values 24 to 31 signal a longer encoding or are reserved). No C/C++ implementation is available (yet).
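As a rough illustration of the single-byte integer ranges mentioned in the list, the sketch below packs a small integer the way MessagePack, CBOR and Smile each do, as far as I read the respective specs. The function names are made up for this example, and each function returns `None` when the value would need more than one byte. Note that CBOR keeps the 5-bit values 24 to 31 for longer encodings, so only 0 to 23 (or -24 to -1) fit in one byte.

```python
def msgpack_small_int(n):
    """MessagePack fixint: 0..127 as 0xxxxxxx, -32..-1 as 111xxxxx."""
    if 0 <= n <= 127:
        return bytes([n])
    if -32 <= n <= -1:
        # Two's complement in one byte yields the 111xxxxx pattern.
        return bytes([n & 0xFF])
    return None


def cbor_small_int(n):
    """CBOR: major type in the top 3 bits, value 0..23 in the low 5 bits."""
    if 0 <= n <= 23:
        return bytes([(0 << 5) | n])          # major type 0: unsigned
    if -24 <= n <= -1:
        return bytes([(1 << 5) | (-1 - n)])   # major type 1: stores -1 - n
    return None


def zigzag(n):
    """Map signed to unsigned: 0, -1, 1, -2, ... becomes 0, 1, 2, 3, ..."""
    return 2 * n if n >= 0 else -2 * n - 1


def smile_small_int(n):
    """Smile small integer: token 0xC0 plus a 5-bit zigzag value."""
    if -16 <= n <= 15:
        return bytes([0xC0 + zigzag(n)])
    return None


for value in (0, 23, 24, 127, -1, -16, -24, -32):
    print(value, msgpack_small_int(value), cbor_small_int(value),
          smile_small_int(value))
```

MessagePack spends only one bit on the type for positive integers, which is where its larger 0 to 127 range comes from; CBOR and Smile both spend three.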

@quartzjer, have you looked at any of these before? Any other suggestions? I'm leaning towards MessagePack, or perhaps UBJSON (that was before seeing CBOR; CBOR actually seems like the way to go).

@quartzjer
Contributor

Heh, you missed the one that I think has the best long-term potential, and I plan on having full native support for in telehash :)

http://cbor.me

@matthijskooijman
Collaborator Author

@quartzjer Hah, totally right there. CBOR looks great - it's compact enough, but also reasonably easy to parse. Having a proper RFC is also an advantage, and the spec is very well-written and clear. Too bad there is no C implementation yet, but it seems one is forthcoming :-)
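To give a feel for how regular the format is, here is a minimal encoder sketch for a JSON-like subset of CBOR (integers, text strings, arrays, maps, booleans and null), written against my reading of RFC 7049. It is an illustration only, not a proposed implementation: `_head` and `encode` are names made up here, and floats, byte strings, tags and indefinite lengths are left out.

```python
import struct


def _head(major, value):
    """Initial byte (3-bit major type, 5-bit info) plus shortest argument."""
    if value < 24:
        return bytes([(major << 5) | value])
    # Info values 24..27 announce a 1, 2, 4 or 8 byte big-endian argument.
    for info, fmt in ((24, ">B"), (25, ">H"), (26, ">I"), (27, ">Q")):
        if value < (1 << (8 * struct.calcsize(fmt))):
            return bytes([(major << 5) | info]) + struct.pack(fmt, value)
    raise ValueError("integer too large for this sketch")


def encode(obj):
    """Encode a JSON-like Python value as CBOR bytes."""
    # Check the singletons first: bool is a subclass of int in Python.
    if obj is False:
        return b"\xf4"
    if obj is True:
        return b"\xf5"
    if obj is None:
        return b"\xf6"
    if isinstance(obj, int):
        # Major type 0 holds n >= 0, major type 1 holds -1 - n.
        return _head(0, obj) if obj >= 0 else _head(1, -1 - obj)
    if isinstance(obj, str):
        data = obj.encode("utf-8")
        return _head(3, len(data)) + data
    if isinstance(obj, list):
        return _head(4, len(obj)) + b"".join(encode(item) for item in obj)
    if isinstance(obj, dict):
        body = b"".join(encode(k) + encode(v) for k, v in obj.items())
        return _head(5, len(obj)) + body
    raise TypeError("unsupported type: " + type(obj).__name__)


print(encode({"a": 1, "b": [2, 3]}).hex())
```

Every value type goes through the same head logic, so a decoder only needs one routine to read the type and argument before dispatching, which is a large part of why the parser stays small.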
