SIMD: Filter SSE4 & AVX2 #8301

Open: wants to merge 12 commits into main

@homm (Member) commented Aug 11, 2024

Depends on #8209.

This ports the Filter acceleration from Pillow-SIMD.

This PR includes the required changes from #8209.

homm and others added 12 commits August 11, 2024 20:55
SIMD Filter. 5x5 implementation

SIMD Filter. fast 3x3 filter

SIMD Filter. a bit faster 5x5 filter

SIMD Filter. improve locality in 5x5 filter

SIMD Filter. rearrange 3x3 filter to match 5x5

SIMD Filter. use macros

SIMD Filter. use macros in 3x3

SIMD Filter. 3x3 SSE4 singleband

SIMD Filter. faster 3x3 singleband SSE4

SIMD Filter. reuse loaded values

SIMD Filter. 3x3 SSE4 singleband: 2 lines

SIMD Filter. First AVX try

SIMD Filter. unroll AVX 2 times

SIMD Filter. Macros for AVX

SIMD Filter. unroll AVX (with no profit)

SIMD Filter. consider last pixel in AVX

SIMD Filter. 5x5 single channel SSE4 (tests failed)

SIMD Filter. fix offset

SIMD Filter. move ImagingFilterxxx functions to separate files

SIMD Filter. 3x3i

SIMD Filter. better macros

SIMD Filter. better loading

SIMD Filter. Rearrange instructions for speedup

SIMD Filter. reduce number of registers

SIMD Filter. rearrange operations

SIMD Filter. avx2 version

SIMD Filter. finish 3x3i_4u8

SIMD Filter. 5x5i_4u8 SSE4

SIMD Filter. advanced 5x5i_4u8 SSE4

SIMD Filter. 5x5i_4u8 AVX2

SIMD Filter. fix memory access for:

3x3f_u8
3x3i_4u8
5x5i_4u8

SIMD Filter. move files

SIMD Filter. Correct offset for 3x3f_u8

# Conflicts:
#	src/libImaging/Filter.c
Comment on lines +30 to +34
/* 5 is number of bits enought to account all kernel coefficients (1<<5 > 25).
8 is number of bits in result.
Any coefficients delta smaller than this precision will have no effect. */
#define PRECISION_BITS (8 + 5)
/* 16 is number of bis required for kernel storage.
Member

Suggested change
- /* 5 is number of bits enought to account all kernel coefficients (1<<5 > 25).
- 8 is number of bits in result.
- Any coefficients delta smaller than this precision will have no effect. */
- #define PRECISION_BITS (8 + 5)
- /* 16 is number of bis required for kernel storage.
+ /* 5 is enough bits to account for all kernel coefficients (1<<5 > 25).
+ 8 is the number of bits in the result.
+ Any coefficients delta smaller than this precision will have no effect. */
+ #define PRECISION_BITS (8 + 5)
+ /* 16 is the number of bits required for kernel storage.
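
The arithmetic in this comment can be checked with a small scalar sketch in C. It assumes, as one reading of the comment, that kernel coefficients are scaled by 2^PRECISION_BITS and rounded to integers, and that "16 bits for kernel storage" means signed 16-bit integers; the comment is cut off at that point, so the signed int16 part is an assumption. The helpers `quantize`, `rounding_error_below_half` and `apply3x3` are hypothetical and are not taken from this PR, Pillow or Pillow-SIMD; `apply3x3` is a plain per-pixel reference, not the SSE4/AVX2 code.

```c
#include <assert.h>
#include <stdint.h>

#define PRECISION_BITS (8 + 5)

/* Hypothetical helper: scale a floating-point coefficient by 2^PRECISION_BITS
   and round half away from zero. Assumption (not confirmed by the PR):
   coefficients are stored as signed 16-bit integers, so |k| must stay just
   below 2^(16 - 1 - PRECISION_BITS) = 4. */
static int32_t quantize(double k) {
    double scaled = k * (1 << PRECISION_BITS);
    /* The cast truncates toward zero, so +/-0.5 gives round-half-away. */
    int32_t q = (int32_t)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
    assert(q >= INT16_MIN && q <= INT16_MAX);
    return q;
}

/* One reading of "1<<5 > 25" and "8 bits in result": each quantized
   coefficient is off by at most 2^-(PRECISION_BITS + 1). Over `taps`
   coefficients and 8-bit pixels (<= 255), the accumulated error in output
   units is at most taps * 255 / 2^(PRECISION_BITS + 1), which is below 0.5
   exactly when taps * 255 < 2^PRECISION_BITS. */
static int rounding_error_below_half(int taps) {
    return (int64_t)taps * 255 < ((int64_t)1 << PRECISION_BITS);
}

/* Hypothetical scalar reference: weighted sum of one single-band 8-bit 3x3
   neighborhood (row-major) with quantized coefficients, rounded and clipped
   to 0..255. */
static uint8_t apply3x3(const uint8_t px[9], const int32_t kq[9]) {
    int64_t acc = (int64_t)1 << (PRECISION_BITS - 1); /* +0.5 for rounding */
    for (int i = 0; i < 9; i++) {
        acc += (int64_t)px[i] * kq[i];
    }
    if (acc < 0) {
        return 0; /* clip first; avoids right-shifting a negative value */
    }
    acc >>= PRECISION_BITS;
    return acc > 255 ? 255 : (uint8_t)acc;
}
```

For example, a 3x3 box blur quantizes each 1/9 coefficient to 910, and nudging a coefficient of 0.5 by 1e-5 still quantizes to 4096, which is what "no effect" means here. With 25 taps the worst-case accumulated rounding error is 25 * 255 / 2^14, about 0.39 of an output unit, so it stays below half a unit of the 8-bit result.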

2 participants