
Optimization for Linear Quantization #2954

Merged 12 commits on Sep 26, 2024

Conversation

AlexandreEichenberger (Collaborator)

It turns out that while converting from float to int8 in one go does not generate efficient SIMD code, doing it in two steps (first float32 to int32, then int32 to int8) generates quite good code.

For 64K data points, the runtime went from 157us (original) to 124us (prior version with two loops) to 96us now, a roughly 1.6x overall speedup.
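The two-step conversion can be sketched in scalar C++ for illustration (the function and parameter names here are hypothetical; the actual lowering operates on MLIR values and SIMD vectors, not scalars):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Quantize one float to int8 in two steps: float32 -> int32 (round and
// saturate), then int32 -> int8 (plain narrowing cast). Saturating to the
// int8 range in step 1 guarantees the narrowing in step 2 cannot overflow.
int8_t quantizeTwoStep(float x, float scale, int32_t zeroPoint) {
  // Step 1: float32 -> int32 with rounding, then saturation.
  int32_t q = static_cast<int32_t>(std::lround(x / scale)) + zeroPoint;
  q = std::clamp(q, static_cast<int32_t>(INT8_MIN),
      static_cast<int32_t>(INT8_MAX));
  // Step 2: int32 -> int8, a simple narrowing cast.
  return static_cast<int8_t>(q);
}
```

The intermediate int32 step is what lets the backend pick efficient vector float-to-int and integer-narrowing instructions instead of a single awkward float-to-int8 sequence.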

Interestingly, the z16 float-to-int instructions used here, vcfeb (signed) and vclfeb (unsigned), have a mode that rounds to nearest even. That mode cannot be accessed as-is (only via inline asm). If we were able to exploit it, we might further improve the linear quantization step.
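For reference, round-to-nearest-even (ties go to the even integer) is the behavior of std::nearbyint under the FE_TONEAREST rounding mode; a portable sketch, which of course does not emit the z16 instructions themselves:

```cpp
#include <cfenv>
#include <cmath>

// std::nearbyint honors the current floating-point rounding mode.
// FE_TONEAREST is round-to-nearest-even: ties such as 0.5 and 2.5
// round to the nearest even integer (0 and 2), not away from zero.
double roundNearestEven(double x) {
  std::fesetround(FE_TONEAREST); // the default mode, set here for clarity
  return std::nearbyint(x);
}
```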

Signed-off-by: Alexandre Eichenberger <[email protected]>
@tungld (Collaborator) left a comment

LGTM.


Type inputElementType = inputType.getElementType();
unsigned inputWidth;
if (isa<Float32Type>(inputElementType))
@tungld (Collaborator)

You can use

if (inputElementType.isF32() || inputElementType.isF64())
  inputWidth = mlir::cast<FloatType>(inputElementType).getWidth();
else
  llvm_unreachable("unsupported input type");

@AlexandreEichenberger (Collaborator, Author)

Thanks, will use.

bool isSigned = quantizedIntType.isSignless() || quantizedIntType.isSigned();
Type quantizedElementTypeInputSized;
if (isSigned) {
// Cannot use getIntegerType(inputWidth, true) as it returns signed ints.
@tungld (Collaborator) commented on Sep 26, 2024

If you would like to create i32 or i64, I guess we can use getIntegerType(inputWidth), which has a single parameter. Meanwhile, getIntegerType(inputWidth, true/false) with two parameters will emit si32/ui32 or si64/ui64.

@AlexandreEichenberger (Collaborator, Author)

Thanks, the interface is not very clear; I like your suggestion and will use it.

create.math.cast(quantizedElementTypeInputSized, saturateX);
// Reduce quantized precision.
Value res = create.math.cast(quantizedElementType, qSaturateXInputSized);
@tungld (Collaborator)

Do you think it would be beneficial to do this two-step cast inside a single create.math.cast, that is, when the source type is f32 and the target type is i8, do a two-step cast?

@AlexandreEichenberger (Collaborator, Author)

That would be the right way to do it, will add.

@AlexandreEichenberger (Collaborator, Author)

Migrated the functionality into cast for float -> small int (quantization). Also added the dequantization conversion here (small int -> float); that did not result in performance changes on z.
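The dequantization direction mentioned above can be sketched the same way (again a scalar illustration with hypothetical names, not the MLIR lowering itself):

```cpp
#include <cstdint>

// Dequantize int8 -> float via x = (q - zeroPoint) * scale, done as a
// widening int8 -> int32 cast followed by an int32 -> float32 conversion.
float dequantizeTwoStep(int8_t q, float scale, int32_t zeroPoint) {
  int32_t wide = static_cast<int32_t>(q);               // small int -> int32
  return static_cast<float>(wide - zeroPoint) * scale;  // int32 -> float32
}
```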

Value res = create.math.cast(quantizedElementType, buffVal);
create.krnl.storeIE(res, flatAlloc, {zero}, {loopInd[0]});
});

@tungld (Collaborator)

I see, so now we don't need to use the additional loop for conversion.

@AlexandreEichenberger (Collaborator, Author)

@tungld thanks for the very useful suggestions, much appreciated as always.

@AlexandreEichenberger AlexandreEichenberger merged commit f7d5895 into onnx:main Sep 26, 2024
7 checks passed
@jenkins-droid
Jenkins Linux ppc64le Build #14746 [push] Optimization for Linear ... started at 16:57

@jenkins-droid
Jenkins Linux amd64 Build #15716 [push] Optimization for Linear ... started at 15:45

@jenkins-droid
Jenkins Linux s390x Build #15719 [push] Optimization for Linear ... started at 16:45

@jenkins-droid
Jenkins Linux amd64 Build #15716 [push] Optimization for Linear ... passed after 1 hr 7 min

@jenkins-droid
Jenkins Linux s390x Build #15719 [push] Optimization for Linear ... passed after 1 hr 32 min

@jenkins-droid
Jenkins Linux ppc64le Build #14746 [push] Optimization for Linear ... passed after 2 hr 3 min

3 participants