
Added support to generate OpenMP parallel construct clauses, at this time for num_threads and proc_bind #2944

Merged

Conversation

AlexandreEichenberger (Collaborator)
Added support to generate an OpenMP parallel construct with num_threads and proc_bind clauses.

First, I added two optional parameters to the krnl.parallel operation:

      %loop_block, %loop_local = krnl.block %0 32 : (!krnl.loop) -> (!krnl.loop, !krnl.loop)
      krnl.parallel(%loop_block), num_threads(%c8_i32) {proc_bind = "spread"} : !krnl.loop
      krnl.iterate(%loop_block) with (%0 -> %arg1 = 0 to 16384){
        %1 = krnl.get_induction_var_value(%loop_block) : (!krnl.loop) -> index
        %2 = vector.load %reshape[%1] : memref<16384xf32>, vector<32xf32>
        %3 = vector.load %reshape_2[%1] : memref<16384xf32>, vector<32xf32>
        %4 = arith.addf %2, %3 : vector<32xf32>
        vector.store %4, %reshape_4[%1] : memref<16384xf32>, vector<32xf32>
      }

This allows the user to associate an optional num_threads and/or proc_bind with parallel loops via the create.krnl.parallel builder.

When lowering to affine (or when generating affine or scf parallel operations), we insert inside the loop a KrnlParallelClauseOp, which takes one mandatory value (the loop index) to identify the parallel loop targeted by the clause, plus the optional num_threads (a value) and proc_bind (a string).

  affine.parallel (%arg1) = (0) to (16384) step (32) {
    %0 = vector.load %reshape[%arg1] : memref<16384xf32>, vector<32xf32>
    %1 = vector.load %reshape_2[%arg1] : memref<16384xf32>, vector<32xf32>
    %2 = arith.addf %0, %1 : vector<32xf32>
    vector.store %2, %reshape_4[%arg1] : memref<16384xf32>, vector<32xf32>
    affine.for %arg2 = 0 to 1 {
    }
    krnl.parallel_clause(%arg1), num_threads(%c8_i32) {proc_bind = "spread"} : index
  }

After the parallel constructs are lowered to OpenMP constructs, a simple pass (createProcessKrnlParallelClausePass) identifies each KrnlParallelClauseOp, locates its enclosing omp.parallel construct, and migrates the clauses to that OpenMP construct.

  omp.parallel num_threads(%c8_i32 : i32) proc_bind(spread) {
    omp.wsloop {
      omp.loop_nest (%arg1) : index = (%c0) to (%c16384) step (%c32) {
        memref.alloca_scope  {
          %0 = vector.load %reshape[%arg1] : memref<16384xf32>, vector<32xf32>
          %1 = vector.load %reshape_2[%arg1] : memref<16384xf32>, vector<32xf32>
          %2 = arith.addf %0, %1 : vector<32xf32>
          vector.store %2, %reshape_4[%arg1] : memref<16384xf32>, vector<32xf32>
        }
        omp.yield
      }
      omp.terminator
    }
    omp.terminator
  }

Added 2 MLIR lit test files.

Signed-off-by: Alexandre Eichenberger <[email protected]>
@tungld (Collaborator) left a comment

LGTM.

needParallelClause = false;
// Current approach: insert after yield, then move before it.
PatternRewriter::InsertionGuard insertGuard(builder);
builder.setInsertionPointAfter(yieldOp);
Collaborator:
Doesn't setInsertionPoint(yieldOp) work for inserting just before yieldOp?

AlexandreEichenberger (Collaborator, Author):

For some reason, if I don't have the moveBefore, I get this error:

flt_orig_model.mlir:18:3: error: operand #0 does not dominate this use
  krnl.iterate(%loop_block) with (%0 -> %arg1 = 0 to 16384){
  ^
flt_orig_model.mlir:18:3: note: see current operation: "krnl.parallel_clause"(%arg1, %0) {proc_bind = "spread"} : (index, i32) -> ()
flt_orig_model.mlir:18:3: note: operand defined as a block argument (block #0 in a child region)

Strangely, with the moveBefore(yieldOp), I get the same result with either setInsertionPointAfter or setInsertionPoint.
There is something fragile about the lowering of Krnl to Affine with respect to "movable".

Since it works as is, I prefer to leave it that way.
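For clarity, the workaround discussed above can be summarized in the following pseudocode sketch (C++ with MLIR-style API names; the KrnlParallelClauseOp builder arguments and surrounding variables are assumptions, not the exact PR code):

    // Sketch of the insert-after-then-move-before workaround.
    PatternRewriter::InsertionGuard insertGuard(builder);
    // First insert the new op *after* the terminator...
    builder.setInsertionPointAfter(yieldOp);
    auto clauseOp = builder.create<KrnlParallelClauseOp>(
        loc, loopIndex, numThreadsVal, procBindAttr); // hypothetical signature
    // ...then move it back *before* the yield so the block stays well-formed.
    // Per the discussion, skipping this moveBefore triggers the
    // "operand #0 does not dominate this use" verifier error.
    clauseOp->moveBefore(yieldOp);

setInsertionPointAfter and moveBefore are real MLIR APIs; the combination here reflects the empirical workaround the author describes, not a documented requirement.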

Collaborator:

This conversion pass traverses the IR by itself and manipulates the graph directly. That might be why it is fragile.

Collaborator:

OK, then the current way is fine.

// Use clause only for the first one (expected to be the outermost one).
// Ideally, we would generate here a single, multi-dimensional
// AffineParallelOp, and we would not need to reset the flag.
needParallelClause = false;
Collaborator:

Is this condition used afterwards?

@AlexandreEichenberger (Collaborator, Author), Sep 18, 2024:

Yes: when we need the parallel clause, only the first iteration of the for (Value loopRef : loopRefs) loop adds the KrnlParallelClauseOp.
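The flag usage being discussed can be sketched as follows (C++ pseudocode; the loop body and the parallelClause builder call are illustrative assumptions based on the quoted snippets, not the exact PR code):

    // needParallelClause is set when the krnl.parallel op carries
    // num_threads and/or proc_bind.
    bool needParallelClause = hasNumThreadsOrProcBind;
    for (Value loopRef : loopRefs) {
      // ... lower this loop to an affine.parallel / scf.parallel ...
      if (needParallelClause) {
        // Attach the clause info only to the first (outermost) parallel loop.
        createKrnlParallelClause(loopIndex, numThreads, procBind); // hypothetical helper
        needParallelClause = false; // subsequent loops skip this
      }
    }

As the quoted comment notes, generating a single multi-dimensional AffineParallelOp would make resetting the flag unnecessary.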

@chentong319 (Collaborator) left a comment

LGTM!

@AlexandreEichenberger AlexandreEichenberger merged commit d03eff2 into onnx:main Sep 19, 2024
7 checks passed
@jenkins-droid: Jenkins Linux s390x Build #15665 [push] Added support to generat... started at 09:54
@jenkins-droid: Jenkins Linux amd64 Build #15662 [push] Added support to generat... started at 08:54
@jenkins-droid: Jenkins Linux ppc64le Build #14692 [push] Added support to generat... started at 10:05
@jenkins-droid: Jenkins Linux amd64 Build #15662 [push] Added support to generat... passed after 1 hr 6 min
@jenkins-droid: Jenkins Linux s390x Build #15665 [push] Added support to generat... passed after 1 hr 39 min
@jenkins-droid: Jenkins Linux ppc64le Build #14692 [push] Added support to generat... passed after 2 hr 3 min

4 participants