The current algorithm just runs the 128-bit algorithm twice, once per half, and then merges the results at the end.
This can be improved by loading a second shift value for the high 128 bits, which shifts the sign bits into different locations. Then do an add between the two 128-bit halves, then a horizontal add, and a single FPR->GPR element extract.
This would shave the instruction count from 11 to 10, removing a horizontal add and an FPR->GPR transfer.
vmovmskpd is similar in that it is doing two 128-bit algorithms and then extracting. Different algorithm so it would need more noodling.
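The merge proposed above can be sketched as a scalar C model. This is only an illustration, not the project's code: it assumes the packed-single case with eight 32-bit lanes (the issue does not spell out the element size), and `vec128`, `place_sign_bits`, `add_lanes`, `horizontal_add`, `movmask256_split`, and `movmask256_fused` are hypothetical names. The per-lane shift amounts (0..3 for the low half, 4..7 for the high half) are likewise an assumption about what "a second shift value" would hold.

```c
#include <stdint.h>

/* Scalar stand-in for one 128-bit vector register with four 32-bit lanes. */
typedef struct { uint32_t lane[4]; } vec128;

/* Per lane: move the sign bit down to bit 0, then shift it left by shift[i]. */
static vec128 place_sign_bits(vec128 v, const uint32_t shift[4]) {
    vec128 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = (v.lane[i] >> 31) << shift[i];
    return r;
}

/* Lane-wise add of two vectors. */
static vec128 add_lanes(vec128 a, vec128 b) {
    vec128 r;
    for (int i = 0; i < 4; ++i)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

/* Horizontal add: sum all four lanes into one scalar. */
static uint32_t horizontal_add(vec128 v) {
    return v.lane[0] + v.lane[1] + v.lane[2] + v.lane[3];
}

/* Approach described as current: two independent 128-bit reductions,
 * two extracts, then a merge of the two scalar results. */
uint32_t movmask256_split(vec128 lo, vec128 hi) {
    static const uint32_t shifts[4] = {0, 1, 2, 3};
    uint32_t lo_mask = horizontal_add(place_sign_bits(lo, shifts));
    uint32_t hi_mask = horizontal_add(place_sign_bits(hi, shifts));
    return lo_mask | (hi_mask << 4);
}

/* Proposed approach: the high half uses its own shift vector (assumed 4..7),
 * so the halves occupy disjoint bits and can be added before a single
 * horizontal add and a single extract. */
uint32_t movmask256_fused(vec128 lo, vec128 hi) {
    static const uint32_t lo_shifts[4] = {0, 1, 2, 3};
    static const uint32_t hi_shifts[4] = {4, 5, 6, 7};
    vec128 sum = add_lanes(place_sign_bits(lo, lo_shifts),
                           place_sign_bits(hi, hi_shifts));
    return horizontal_add(sum);
}
```

Because every lane contributes to a distinct bit position, the lane-wise add never carries and behaves like an OR, which is why the two halves can be combined before the reduction.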
Sonicadvance1 changed the title from "256-bit vmovmsk can be improved" to "AVX128: 256-bit vmovmsk can be improved" on Jun 30, 2024