AVX128: 256-bit vmovmsk can be improved #3782

Sonicadvance1 · 2024-06-30T02:43:55Z

    "vmovmskps rax, ymm0": {
      "ExpectedInstructionCount": 11,
      "Comment": [
        "Map 1 0b00 0x50 256-bit"
      ],
      "ExpectedArm64ASM": [
        "ldr q2, [x28, #16]",
        "ushr v3.4s, v16.4s, #31",
        "ldr q4, [x28, #2512]",
        "ushl v3.4s, v3.4s, v4.4s",
        "addv s3, v3.4s",
        "mov w20, v3.s[0]",
        "ushr v2.4s, v2.4s, #31",
        "ushl v2.4s, v2.4s, v4.4s",
        "addv s2, v2.4s",
        "mov w21, v2.s[0]",
        "orr x4, x20, x21, lsl #4"
      ]

Current algorithm just does two 128-bit algorithms and then merges at the end.

This can be improved by loading a second shift value for the high 128-bits which shifts the signs in to different locations. Then do a add between the two 128-bit halves, then a horizontal add, single FPR->GPR element extract.
Would shave the instructions from 11 to 10, removing a horizontal add and a FPR->GPR transfer.

vmovmskpd is similar in that it is doing two 128-bit algorithms and then extracting. Different algorithm so it would need more noodling.

Sonicadvance1 changed the title ~~256-bit vmovmsk can be improved~~ AVX128: 256-bit vmovmsk can be improved Jun 30, 2024

Sonicadvance1 added the Performance Things that will make FEX faster label Jun 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AVX128: 256-bit vmovmsk can be improved #3782

AVX128: 256-bit vmovmsk can be improved #3782

Sonicadvance1 commented Jun 30, 2024

AVX128: 256-bit vmovmsk can be improved #3782

AVX128: 256-bit vmovmsk can be improved #3782

Comments

Sonicadvance1 commented Jun 30, 2024