
feat: add zstd compression support #60

Open · wants to merge 1 commit into `main`
Conversation

shanipribadi

The zstd `experimental` feature is enabled to calculate the upper bound of the `Vec` capacity to allocate while decompressing data, using `zstd::bulk::Decompressor::upper_bound`.

Supported levels are -128~22, with 0 defaulting to level 3 (due to zstd library behaviour).

Benchmark: `Load block from disk`

```
Load block from disk/1024 KiB [no compression] time:   [6.0854 µs 6.2072 µs 6.3423 µs]
Load block from disk/1024 KiB [lz4]            time:   [7.2133 µs 7.3160 µs 7.4288 µs]
Load block from disk/1024 KiB [miniz]          time:   [10.358 µs 10.542 µs 10.740 µs]
Load block from disk/1024 KiB [zstd(-3)]       time:   [9.2772 µs 9.5900 µs 9.9649 µs]
Load block from disk/1024 KiB [zstd(-1)]       time:   [9.0652 µs 9.1867 µs 9.3248 µs]
Load block from disk/1024 KiB [zstd(1)]        time:   [9.0680 µs 9.2127 µs 9.3672 µs]
Load block from disk/1024 KiB [zstd(3)]        time:   [9.0162 µs 9.1445 µs 9.2872 µs]
Load block from disk/1024 KiB [zstd(12)]       time:   [10.432 µs 10.605 µs 10.795 µs]

Load block from disk/131072 KiB [no compression] time:   [150.79 µs 153.20 µs 155.90 µs]
Load block from disk/131072 KiB [lz4]            time:   [193.57 µs 196.56 µs 199.61 µs]
Load block from disk/131072 KiB [miniz]          time:   [232.92 µs 236.82 µs 241.00 µs]
Load block from disk/131072 KiB [zstd(-3)]       time:   [197.10 µs 199.85 µs 203.01 µs]
Load block from disk/131072 KiB [zstd(-1)]       time:   [198.58 µs 201.26 µs 204.22 µs]
Load block from disk/131072 KiB [zstd(1)]        time:   [201.25 µs 203.61 µs 206.26 µs]
Load block from disk/131072 KiB [zstd(3)]        time:   [197.61 µs 199.59 µs 201.84 µs]
Load block from disk/131072 KiB [zstd(12)]       time:   [217.46 µs 220.58 µs 224.06 µs]
```

marvin-j97 added the `enhancement` (New feature or request) and `api` labels on Sep 25, 2024
shanipribadi (Author) commented Sep 26, 2024

There are a few things about this PR that might need some feedback/discussion.

  1. zstd-rs is a binding to the zstd C library, so it's not pure Rust; I'm not sure whether being pure Rust is a design goal for lsm-tree.
  2. I've used `zstd::bulk` because the interface is convenient (it works with `Vec`), and the docs say it can be faster than the streaming interface because it allocates its buffers in memory. This means that if the data being compressed/decompressed is large, the memory that needs to be allocated is larger as well. I don't have a feel for how big the data typically processed for compression/decompression in lsm-tree is. I'm also not sure it's even possible to use the streaming interface given the compression interface lsm-tree currently exposes.
  3. I wanted feedback on the serialization of `Zstd(value)`. zstd itself accepts -(1<<17) ~ 22 as compression levels; so far, with the u8 limit on the serialization, I have only mapped -128 ~ 22. Based on testing (the Load block from disk bench), -128 is faster than lz4 (I forgot to include it in the description above and will have to run it again later). If we want to expose the full range of possible fast levels, we need to figure out how best to map -(1<<17) ~ -1 onto -128 ~ -1 given the u8 limit.
  4. I'd appreciate some guidelines on how the benchmarks in https://fjall-rs.github.io/post/announcing-fjall-2/ were done, so I can replicate them and see whether adding zstd is actually worth it. In my experience, zstd typically provides a better CPU/compression trade-off than DEFLATE, but it doesn't really compete with lz4 on pure decompression performance (unless, I guess, we use extremely fast/negative levels). I'm hoping zstd's better compression ratio and tunability would still be useful.
  5. There are also a few unrelated changes (e.g. the `benches/tree.rs` block -> data_block rename and some renamed tests); please let me know if you'd rather I split those out into a separate PR/commit.
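The single-byte mapping described in point 3 can be sketched roughly as follows (a hypothetical illustration, not the PR's actual serialization code; the function names are made up):

```rust
// zstd accepts levels -(1 << 17)..=22, but a single signed byte in the
// block header can only represent -128..=22 directly as an i8.

fn encode_level(level: i32) -> i8 {
    // Levels below i8::MIN collapse to -128; levels above 22 are not
    // valid zstd levels, so they are clamped down to 22 as well.
    level.clamp(-128, 22) as i8
}

fn decode_level(byte: i8) -> i32 {
    byte as i32
}

fn main() {
    assert_eq!(encode_level(3), 3);
    assert_eq!(encode_level(-(1 << 17)), -128); // fastest levels collapse
    assert_eq!(decode_level(encode_level(-5)), -5);
}
```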

marvin-j97 (Contributor) commented Sep 26, 2024

> zstd-rs is a binding to the zstd library, so it's not pure rust, not sure if it's a design goal for lsm-tree to be pure rust or not

I would like it to be, but there's no production-ready library out there right now. KillingSpark/zstd-rs#65 has some encoding efforts going on, but it's far from usable. Ideally, the binding could just be swapped out if there is ever a worthy contender.

> This does mean that if the data being compressed/uncompressed is large then the memory that needed to be allocated is larger as well

It never is. Blocks tend to be 4-64 KB in size; blobs maybe up to a couple of MB at most. The data also needs to be in memory because its size must be known (for the block header).

> If we want to expose the full range of possible fast levels, then need to figure out how to best map -(1<<17)~-1 to -128-1 due to the u8 limit.

Hmm yeah, interesting. The problem is that the block header needs to be fixed-size, so I went with 2 bytes because I hadn't looked at how many compression levels there tend to be. Miniz just has 10 or so.

With a u8 we can go from 20 down to -234. Most sources tend to recommend something along the lines of -7 to 20, so I'm not sure how important it even is to support negative levels much lower than -200. That would need some benchmarking: if -8000 is barely faster than -127 with much worse space savings, there's no point in supporting it, I think.
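The biased-u8 encoding implied here could be sketched like this (a hypothetical illustration; the exact window and bias are a design choice, not something the PR has settled on):

```rust
// With an offset, a single unsigned byte can cover a contiguous window
// of negative and positive zstd levels; here the window is -233..=22,
// since -233 + 233 = 0 and 22 + 233 = 255 exactly span a u8.
const BIAS: i32 = 233;

fn encode_level(level: i32) -> u8 {
    (level.clamp(-BIAS, 22) + BIAS) as u8
}

fn decode_level(byte: u8) -> i32 {
    byte as i32 - BIAS
}

fn main() {
    assert_eq!(decode_level(encode_level(3)), 3);
    assert_eq!(decode_level(encode_level(-200)), -200);
    assert_eq!(encode_level(-(1 << 17)), 0); // very fast levels clamp to -233
}
```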

> so I could replicate it to see whether it's actually worth it to add zstd

For the benchmarks in the first chapter I used this project: https://gist.github.com/marvin-j97/22dfbe2ae2d9a8b9bcc938c8d48e54c7 - it needs a corpus of text documents on disk (`DOCS_FOLDER`) that it will ingest.

You'll need to use fjall 2.0.1+ because I had to fix a bug.
