Support for software pipelining in pragmas. #126

Open
aidansander opened this issue Jul 17, 2024 · 4 comments

Comments

@aidansander

I'm compiling a simple kernel with Peano. Manually software pipelining the attached kernel (dut_pipelined.cc) yields a considerable speedup over relying on pipelining pragmas (dut_pragma.cc). With pragmas only, the generated assembly is not software pipelined and the kernel runs in ~1800 cycles. With manual pipelining, the kernel runs in ~1000 cycles. The clang loop min_iteration_count and max_iteration_count pragmas have no effect on the generated assembly.
dut_pragma.cc
dut_pipelined.cc
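
For readers without the attachments, here is a minimal sketch of what "manual software pipelining" means in general. It is not the attached kernel: the function names, the two-stage split, and the arithmetic are all hypothetical and chosen only to show the prologue / steady-state / epilogue shape in portable C++.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical example, not the kernel from this issue.
// Plain loop: each iteration loads, then computes and stores.
void scale_plain(const int32_t* in, int32_t* out, std::size_t n, int32_t k) {
    for (std::size_t i = 0; i < n; ++i) {
        int32_t v = in[i];    // stage 1: load
        out[i] = v * k + 1;   // stage 2: compute and store
    }
}

// Hand-pipelined version: the load for iteration i+1 is written next to the
// compute for iteration i, so the two stages can overlap in the schedule.
void scale_pipelined(const int32_t* in, int32_t* out, std::size_t n, int32_t k) {
    if (n == 0) return;
    int32_t cur = in[0];                        // prologue: first load
    for (std::size_t i = 0; i + 1 < n; ++i) {   // steady state
        int32_t next = in[i + 1];               // stage 1 of iteration i+1
        out[i] = cur * k + 1;                   // stage 2 of iteration i
        cur = next;
    }
    out[n - 1] = cur * k + 1;                   // epilogue: last compute
}
```

Both functions compute the same result; the second only restructures the loop so that independent work from neighbouring iterations sits in the same loop body. A software pipeliner in the compiler would ideally derive this form from the first one.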

@konstantinschwarz
Collaborator

Hi @aidansander,
could you also provide the lut_based_ops.h header so that we can reproduce your results?

@aidansander
Author

Sure thing. lut_based_ops.h and lut_based_ops.cpp (which holds the LUT values) are both included from here. I'm running the kernel using llvm-lit on some tests. The manually pipelined and loop pragma tests have the steps I used to run and measure the cycle count.

@konstantinschwarz
Collaborator

Thanks!
One thing to consider: LoopUnroll runs before MachinePipeliner, so if you completely unroll the loop (through #pragma unroll), there is nothing left for the pipeliner to do.
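
A small sketch of that interaction, using only stock clang loop pragmas. The function names and the trip count are hypothetical, and whether the second variant actually gets software pipelined by Peano is not something this snippet can show; it only illustrates which source form leaves a loop in place for a later pass.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kTrip = 16;  // hypothetical compile-time trip count

// Fully unrolled: with a known trip count, `#pragma unroll` asks clang to
// replace the loop with straight-line code, so no loop reaches later passes.
void add_fully_unrolled(const int32_t* a, const int32_t* b, int32_t* out) {
#if defined(__clang__)
#pragma unroll
#endif
    for (std::size_t i = 0; i < kTrip; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Kept rolled: unrolling is disabled, so the loop structure survives and a
// loop-level pass such as a software pipeliner still has something to see.
void add_rolled(const int32_t* a, const int32_t* b, int32_t* out) {
#if defined(__clang__)
#pragma clang loop unroll(disable)
#endif
    for (std::size_t i = 0; i < kTrip; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The pragmas are guarded so the snippet also builds with other compilers; the results of the two functions are identical either way.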

There are a few things we are still lacking though to get this software pipelined:

  • Enable hardware loops by default, to turn an up-counting loop into a down-counting loop. Currently blocked by the next item.
  • Teach the MachinePipeliner to understand our hardware loop construct (we have an open PR for this: "[AIE2] Add support for ZOL loops for SWP", #125).
  • VLDB4X instructions don't carry MachineMemOperands, which pessimizes alias analysis results in the MachinePipeliner and causes it to reject the loop.

With these tweaks, I could get a SWP loop with 23 cycles. It should be possible to further improve on that.
FYI @gbossu @martien-de-jong @andcarminati
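
To illustrate the first item in source terms: a hardware loop generally wants the trip count fixed before entry and a counter that runs down to zero. The sketch below shows the two equivalent forms in plain C++. It is an assumption-level illustration with hypothetical function names; the real transformation happens inside the compiler, not in user code.

```cpp
#include <cstddef>
#include <cstdint>

// Up-counting form, as loops are usually written in source.
int64_t sum_up(const int32_t* data, std::size_t n) {
    int64_t acc = 0;
    for (std::size_t i = 0; i < n; ++i) {
        acc += data[i];
    }
    return acc;
}

// Down-counting form: the trip count is known before the loop starts and a
// single counter is decremented until it reaches zero.
int64_t sum_down(const int32_t* data, std::size_t n) {
    int64_t acc = 0;
    const int32_t* p = data;
    for (std::size_t remaining = n; remaining != 0; --remaining) {
        acc += *p++;
    }
    return acc;
}
```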

@konstantinschwarz
Collaborator

Update: the first two items have been resolved.
The last point, MachineMemOperands for VLDB.4x instructions, still needs to be done.
