
perf(jpeg): faster idct by transposing at zigzag level #157

Open · wants to merge 3 commits into base: dev
Conversation

cyyynthia

Implements some of the optimizations discussed in #155. Marking this PR as a draft, as it currently degrades the quality of the image; probably because the IDCT routine needs to be adapted, which hasn't been done at this time (and I'm not super familiar with the algorithm used here).

Rough testing on my computer shows an approx. 10% performance increase from these changes (x86 AVX2). The output images are almost correct, but some noise is introduced (making the jpeg tests fail at this time).

I'm also not a fan of the quick solution I put together for the scalar code: it doesn't do any transposition, and since I didn't want to introduce checks during the Huffman decode logic, I tucked a transposition into the scalar decode to put everything back where it's supposed to be. A better solution would probably be to change the accesses to in_vector instead, but for now it's mostly here as a PoC.
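The zigzag-level transposition the PR title refers to can be sketched like this; this is an illustration, not the PR's actual code, and the names `UNZIGZAG` / `transposed_unzigzag` are made up for the example:

```rust
// Standard JPEG zig-zag order: UNZIGZAG[i] is the natural-order index
// (row * 8 + col) where the i-th coefficient from the bitstream lands.
const UNZIGZAG: [usize; 64] = [
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63,
];

/// Swap row and column in every target index, so that writing the
/// decoded coefficients through this table produces the transposed
/// block for free, with no separate transpose pass after decode.
fn transposed_unzigzag(table: &[usize; 64]) -> [usize; 64] {
    let mut out = [0usize; 64];
    for (i, &n) in table.iter().enumerate() {
        let (r, c) = (n / 8, n % 8);
        out[i] = c * 8 + r;
    }
    out
}

fn main() {
    let t = transposed_unzigzag(&UNZIGZAG);
    // Transposing the targets twice restores the original table.
    assert_eq!(transposed_unzigzag(&t), UNZIGZAG);
    println!("{:?}", &t[..8]);
}
```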

@cyyynthia cyyynthia changed the title perf: faster idct by transposing at zigzag level perf(jpeg): faster idct by transposing at zigzag level Jan 19, 2024
This currently causes the quality of the image to be degraded; probably due to the need to adapt the IDCT routine but this hasn't been done at this time.
...so that the dequantization process actually works. Doing it this way allows it to be in the same orientation as the DCT coefficients so it is still eligible for painless vectorization.
@cyyynthia
Author

Turns out I was just a bit stupid and didn't take into account that the quantization process also needs to be adapted. 🙃

The jpeg tests still don't pass, but the images look visually identical. I'm not sure my eyeballs are a good enough metric, so I'll let you see what differs for yourself instead.
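Adapting the quantization step amounts to transposing the 8x8 quantization table the same way as the coefficients. A minimal sketch, assuming the table is stored in natural row-major order (the function name is illustrative, not zune's):

```rust
/// Transpose an 8x8 table stored row-major, so each coefficient of a
/// zigzag-transposed block still meets its own quantization value.
fn transpose_qtable(q: &[u16; 64]) -> [u16; 64] {
    let mut out = [0u16; 64];
    for r in 0..8 {
        for c in 0..8 {
            out[c * 8 + r] = q[r * 8 + c];
        }
    }
    out
}

fn main() {
    // Identity-position table: q[i] = i, so the transposition is visible.
    let mut q = [0u16; 64];
    for i in 0..64 {
        q[i] = i as u16;
    }
    let t = transpose_qtable(&q);
    assert_eq!(t[1], 8); // position (0, 1) now holds the value from (1, 0)
    assert_eq!(transpose_qtable(&t), q);
}
```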

The performance improvement with AVX2 is in the neighborhood of 6-8%. However, the changes to the scalar code seem to be introducing a regression, as the compiler output (with SSE4.2 enabled) is ~100 instructions longer. https://rust.godbolt.org/z/sE3f4eoz6

@cyyynthia cyyynthia marked this pull request as ready for review January 20, 2024 17:28
@etemesi254
Owner

The jpeg tests still don't pass, but the images look visually identical. I'm not sure my eyeballs are a good enough metric, so I'll let you see what differs for yourself instead.

A bit tricky; this probably means there is a problem. I'll investigate and report back, but thanks for the PR.

@cyyynthia
Author

cyyynthia commented Jan 21, 2024

My hypothesis is that the order of the IDCT-1D passes being reversed cause rounding errors to ever so slightly vary, they mismatch the rounding errors when passes are done the other way around. In terms of exact mathematics there's no difference so I don't know, haven't tested that either but it should be trivial to test with AVX2 code (just doing a transpose before, in the middle and after which should yield the same behavior as without zigzag-level transposition)
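The hypothesis (per-pass rounding making the two pass orders disagree) can be reproduced with a toy example. The butterfly below is a stand-in for one fixed-point IDCT pass, not the real transform:

```rust
// A toy 1-D "pass": a scaled butterfly with rounding after the pass,
// standing in for one fixed-point IDCT pass.
fn butterfly(a: f32, b: f32) -> (f32, f32) {
    (((a + b) * 0.7).round(), ((a - b) * 0.7).round())
}

fn pass_rows(m: [[f32; 2]; 2]) -> [[f32; 2]; 2] {
    let mut out = m;
    for r in 0..2 {
        let (a, b) = butterfly(m[r][0], m[r][1]);
        out[r] = [a, b];
    }
    out
}

fn pass_cols(m: [[f32; 2]; 2]) -> [[f32; 2]; 2] {
    let mut out = m;
    for c in 0..2 {
        let (a, b) = butterfly(m[0][c], m[1][c]);
        out[0][c] = a;
        out[1][c] = b;
    }
    out
}

fn main() {
    let m = [[1.0, 2.0], [3.0, 5.0]];
    let rows_first = pass_cols(pass_rows(m));
    let cols_first = pass_rows(pass_cols(m));
    // With exact arithmetic the two orders agree; with per-pass
    // rounding they diverge for this input.
    assert_ne!(rows_first, cols_first);
    println!("rows-first: {rows_first:?}\ncols-first: {cols_first:?}");
}
```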

@etemesi254
Owner

One way to test may be to compare the DSSIM between the new images and ImageMagick's output, which uses libjpeg-turbo.

The script I use is

set -e

input="$1"
output="$(mktemp --tmpdir result_XXXXXXXXXXXXX.png)"
trap 'rm -f "$output"' EXIT

if ! yes | /home/caleb/Documents/rust/zune-image/target/release/zune --input "$input" --out "$output" --yes --experimental   2>&1; then
    echo "Failed to decode $input" 1>&2
    exit 1
fi
similarity=$(compare -quiet -metric RMSE "$input[0]" "$output" /dev/null 2>&1) || true
echo "RMSE $similarity $input"


similarity=$(compare -quiet -metric AE   "$input[0]" "$output" /dev/null 2>&1) || true
echo "Absolute error count $similarity $input"


similarity=$(compare -quiet -metric DSSIM  "$input[0]" "$output" /dev/null 2>&1) || true
echo "DSSIM  $similarity $input"

similarity=$(compare -quiet -metric PSNR "$input[0]" "$output" /dev/null 2>&1) || true
echo "PSNR $similarity $input"

And you call it as

sh ./psnr.sh {IMAGE}

But remember to change the path to the zune binary in the script above.

You can compile the binary by running cargo build --release at the top level, in case you didn't know.

@cyyynthia
Author

cyyynthia commented Jan 22, 2024

I did the test for the images in test-images, and first, I can confirm my theory is true: the order in which the IDCT passes are done does have an impact. Using the zigzag-level transpose but with 3 transposes (to "restore" the original IDCT coefficient matrix before doing things the "old" way), the outputs are identical.

Regarding the error values, the change of IDCT pass order has a negative impact across the board 😔 It's unfortunate the 2D DCT introduces these errors; from a mathematical standpoint my approach would, assuming exact arithmetic, produce the same output 🤔

Given these results, I'm unsure whether pursuing this direction is viable with the current IDCT implementation...

DSSIM (before)
DSSIM  0.00136545 ./test-images/jpeg/2029.jpg
DSSIM  0.455216 ./test-images/jpeg/cymk.jpg
DSSIM  7.87632e-05 ./test-images/jpeg/down_sampled_grayscale_prog.jpg
DSSIM  0.436452 ./test-images/jpeg/four_components.jpg
DSSIM  0.000169361 ./test-images/jpeg/huffman_third_index.jpg
DSSIM  0.000273764 ./test-images/jpeg/huge_sof_number.jpg
DSSIM  0.513952 ./test-images/jpeg/Kiara_limited_progressive_four_components.jpg
DSSIM  0.00061322 ./test-images/jpeg/large_horiz_samp_7680_4320.jpg
DSSIM  0.000609311 ./test-images/jpeg/large_no_samp_7680_4320.jpg
DSSIM  0.000589548 ./test-images/jpeg/large_vertical_samp_7680_4320.jpg
DSSIM  0.00260926 ./test-images/jpeg/medium_horiz_samp_2500x1786.jpg
DSSIM  0.00287932 ./test-images/jpeg/medium_no_samp_2500x1786.jpg
DSSIM  0.00287433 ./test-images/jpeg/medium_vertical_samp_2500x1786.jpg
DSSIM  0.00101241 ./test-images/jpeg/mjpeg_huffman.jpg
DSSIM  0.00129616 ./test-images/jpeg/rebuilt_relax_fill_bytes_before_marker.jpg
DSSIM  0.000371049 ./test-images/jpeg/weid_sampling_factors.jpg
DSSIM  0.000514302 ./test-images/jpeg/weird_components.jpg
DSSIM (after)
DSSIM  0.00144355 ./test-images/jpeg/2029.jpg
DSSIM  0.455202 ./test-images/jpeg/cymk.jpg
DSSIM  0.000111749 ./test-images/jpeg/down_sampled_grayscale_prog.jpg
DSSIM  0.436425 ./test-images/jpeg/four_components.jpg
DSSIM  0.000184554 ./test-images/jpeg/huffman_third_index.jpg
DSSIM  0.000338528 ./test-images/jpeg/huge_sof_number.jpg
DSSIM  0.513946 ./test-images/jpeg/Kiara_limited_progressive_four_components.jpg
DSSIM  0.000688253 ./test-images/jpeg/large_horiz_samp_7680_4320.jpg
DSSIM  0.000694474 ./test-images/jpeg/large_no_samp_7680_4320.jpg
DSSIM  0.000660411 ./test-images/jpeg/large_vertical_samp_7680_4320.jpg
DSSIM  0.00278002 ./test-images/jpeg/medium_horiz_samp_2500x1786.jpg
DSSIM  0.00306143 ./test-images/jpeg/medium_no_samp_2500x1786.jpg
DSSIM  0.00306547 ./test-images/jpeg/medium_vertical_samp_2500x1786.jpg
DSSIM  0.0012283 ./test-images/jpeg/mjpeg_huffman.jpg
DSSIM  0.00139533 ./test-images/jpeg/rebuilt_relax_fill_bytes_before_marker.jpg
DSSIM  0.000406538 ./test-images/jpeg/weid_sampling_factors.jpg
DSSIM  0.000553421 ./test-images/jpeg/weird_components.jpg
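The 3-transpose check above rests on the identity that, in exact arithmetic, a column pass equals transpose → row pass → transpose. A small sketch of that identity (the exact butterfly stands in for one IDCT pass, it is not the real transform):

```rust
// With exact arithmetic, a column pass equals
// transpose -> row pass -> transpose, which is why inserting three
// transposes restores the original behavior bit-for-bit.
fn transpose(m: [[f64; 2]; 2]) -> [[f64; 2]; 2] {
    [[m[0][0], m[1][0]], [m[0][1], m[1][1]]]
}

fn row_pass(m: [[f64; 2]; 2]) -> [[f64; 2]; 2] {
    // Exact 1-D butterfly (no rounding), standing in for one IDCT pass.
    let mut out = m;
    for r in 0..2 {
        out[r] = [m[r][0] + m[r][1], m[r][0] - m[r][1]];
    }
    out
}

fn col_pass(m: [[f64; 2]; 2]) -> [[f64; 2]; 2] {
    let mut out = m;
    for c in 0..2 {
        out[0][c] = m[0][c] + m[1][c];
        out[1][c] = m[0][c] - m[1][c];
    }
    out
}

fn main() {
    let m = [[1.0, 2.0], [3.0, 5.0]];
    assert_eq!(col_pass(m), transpose(row_pass(transpose(m))));
    println!("identity holds");
}
```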

@etemesi254
Owner

It's weird that the outputs vary depending on the order; I can't seem to figure out why.

Btw, what were the perf gains?

You can get the difference by running cargo bench "jpeg" to only run the jpeg benchmarks; they take a while though.

Could you do a before and after of the PR? We can then see if there is a way to reduce the divergence, and not be lying to ourselves that we are making anything better.

@cyyynthia
Author

The 6-8% perf improvement was measured using cargo bench, although I was running in a quite noisy environment (IntelliJ, Spotify, and a bunch of Chromiums running; thanks for nothing, Electron). I'll do a clean bench in a relatively quiet environment tomorrow!

It's weird that the outputs vary depending on the order; I can't seem to figure out why.

My theory is that this has to do with the nature of the approximation: if the IDCT is lossy (due to not using exact arithmetic), then the order of operations does matter, because the loss won't be the same and will depend on the input. Since columns are processed first, the intermediate state loses data "differently" than if rows were processed first.

But I'm unsure how to put this theory to the test 😵‍💫

@cyyynthia
Author

Here are the results of the benchmark on my computer, running Arch Linux (6.7.1-arch1-1) on an Intel i7-7700K, with Rust toolchain version 1.75.0.

before.txt
after.txt

@etemesi254
Owner

Hi, sorry for the delay in responding.

The 6-8% perf improvement was measured by using cargo bench

I forgot how much 6% perf is when you are doing 68 ms :)

I am not comfortable with making changes to the IDCT that lead to greater divergence from libjpeg-turbo, but I love the change checking for zeroes and believe that should be merged.
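For context, the zeroes check referred to here can be sketched as follows; this is illustrative, not the PR's actual code, and the DC scaling shown is a placeholder:

```rust
/// If every AC coefficient of a block is zero, its IDCT is flat: every
/// output sample is just the (scaled) DC value, so the full transform
/// can be skipped entirely for that block.
fn idct_if_flat(coeffs: &[i32; 64], out: &mut [i32; 64]) -> bool {
    if coeffs[1..].iter().all(|&c| c == 0) {
        let dc = (coeffs[0] + 4) >> 3; // placeholder scaling, not zune's
        out.fill(dc);
        return true;
    }
    false // caller falls back to the full IDCT path
}

fn main() {
    let mut coeffs = [0i32; 64];
    coeffs[0] = 80; // DC-only block
    let mut out = [0i32; 64];
    assert!(idct_if_flat(&coeffs, &mut out));
    assert!(out.iter().all(|&v| v == (80 + 4) >> 3));
}
```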

An additional place to optimize, btw, may be moving from platform-specific intrinsics to portable SIMD; we may gain more speed in places like wasm, in case you are still up for it.

No pressure ( :) )
