Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVX128: 256-bit vmovmsk can be improved #3782

Open
Sonicadvance1 opened this issue Jun 30, 2024 · 0 comments
Open

AVX128: 256-bit vmovmsk can be improved #3782

Sonicadvance1 opened this issue Jun 30, 2024 · 0 comments
Labels
Performance Things that will make FEX faster

Comments

@Sonicadvance1
Copy link
Member

    "vmovmskps rax, ymm0": {
      "ExpectedInstructionCount": 11,
      "Comment": [
        "Map 1 0b00 0x50 256-bit"
      ],
      "ExpectedArm64ASM": [
        "ldr q2, [x28, #16]",
        "ushr v3.4s, v16.4s, #31",
        "ldr q4, [x28, #2512]",
        "ushl v3.4s, v3.4s, v4.4s",
        "addv s3, v3.4s",
        "mov w20, v3.s[0]",
        "ushr v2.4s, v2.4s, #31",
        "ushl v2.4s, v2.4s, v4.4s",
        "addv s2, v2.4s",
        "mov w21, v2.s[0]",
        "orr x4, x20, x21, lsl #4"
      ]

Current algorithm just does two 128-bit algorithms and then merges at the end.

This can be improved by loading a second shift value for the high 128-bits which shifts the signs in to different locations. Then do a add between the two 128-bit halves, then a horizontal add, single FPR->GPR element extract.
Would shave the instructions from 11 to 10, removing a horizontal add and a FPR->GPR transfer.

vmovmskpd is similar in that it is doing two 128-bit algorithms and then extracting. Different algorithm so it would need more noodling.

@Sonicadvance1 Sonicadvance1 changed the title 256-bit vmovmsk can be improved AVX128: 256-bit vmovmsk can be improved Jun 30, 2024
@Sonicadvance1 Sonicadvance1 added the Performance Things that will make FEX faster label Jun 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Things that will make FEX faster
Projects
None yet
Development

No branches or pull requests

1 participant