[ BLAS ] Implement elementwise operations #2474

Merged: 11 commits into nnstreamer:main on Feb 23, 2024

Conversation

skykongkong8 (Member)

Implemented elementwise vector operations:

  • ele_add (previously ewva)
  • ele_sub
  • ele_mul (previously ewvm)
  • ele_div

All come in fp32 and fp16 precision; the current BLAS path uses a raw (loop-based) implementation, while the NEON path uses a SIMD implementation. A sketch of the raw path follows below.
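Here is a minimal sketch (not the PR's exact code) of the raw, non-SIMD fp32 path, assuming the alpha/beta semantics and the __FLT_MIN__ beta guard shown in the diffs later in this thread; the name ele_mul_sketch and the default arguments are illustrative only.

```cpp
#include <cmath>  // std::abs
#include <cfloat> // FLT_MIN, a portable stand-in for __FLT_MIN__

// Z = alpha * (X * Y) + beta * Z; the beta term is skipped when beta == 0
// so an uninitialized or NaN Z cannot poison the result.
void ele_mul_sketch(const unsigned int N, const float *X, const float *Y,
                    float *Z, float alpha = 1.f, float beta = 0.f) {
  for (unsigned int i = 0; i < N; ++i) {
    if (std::abs(beta) > FLT_MIN)
      Z[i] = alpha * X[i] * Y[i] + beta * Z[i];
    else
      Z[i] = alpha * X[i] * Y[i];
  }
}
```

ele_add, ele_sub, and ele_div follow the same pattern with the operator swapped.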

@taos-ci (Collaborator) commented Feb 19, 2024

📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2474. Please follow the 1 commit/1 PR (one commit per PR) policy to get comments quickly from reviewers. Your PR must pass all verification processes of cibot before the review process by reviewers can start. If you are a new member joining this project, please read the manuals in the documentation folder and wiki page. To monitor the progress of your PR in more detail, visit http://ci.nnstreamer.ai/.

@skykongkong8 skykongkong8 changed the title Ele add mul sub div [ BLAS ] Ele add mul sub div Feb 19, 2024
@skykongkong8 skykongkong8 changed the title [ BLAS ] Ele add mul sub div [ BLAS ] Implement elementwise operations Feb 19, 2024
@taos-ci (Collaborator) left a comment

@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.

@djeong20 (Contributor) left a comment

Overall, looks good to me. There are a few things to point out:

  1. We should validate this code by replacing the current std::transform implementation in the Tensor class with this, especially divide().
  2. After 1, we should check the performance difference for non-NEON cases; in other words, check the effect of replacing std::transform() with a for loop.

Also, please resolve the conflict!

@@ -400,12 +381,64 @@ void scopy_int8_to_float16(const unsigned int N, const uint8_t *X,
copy_int8_to_fp16(N, X, incX, Y, incY);
}

void ewvm(const unsigned int N, const _FP16 *X, const _FP16 *Y, _FP16 *Z) {
ewvm_FP16(N, X, Y, Z);
void ele_mul(const unsigned int N, const _FP16 *X, const _FP16 *Y, _FP16 *Z,
Contributor:

Suggested change
void ele_mul(const unsigned int N, const _FP16 *X, const _FP16 *Y, _FP16 *Z,
void elementwise_multiply(const unsigned int N, const _FP16 *X, const _FP16 *Y, _FP16 *Z,

One suggestion: how about renaming it to something clearer?

@skykongkong8 (Member Author):

I asked many contributors for their opinions offline, and the shorter function name is generally preferred. ele_* sounds clear enough to me.

Comment on lines +390 to +394
if (std::abs(beta) > __FLT_MIN__)
Z[i] = static_cast<_FP16>(alpha) * X[i] * Y[i] +
static_cast<_FP16>(beta) * Z[i];
else
Z[i] = static_cast<_FP16>(alpha) * X[i] * Y[i];
Contributor:

Any reason to differentiate the beta == 0 case?

Suggested change
if (std::abs(beta) > __FLT_MIN__)
Z[i] = static_cast<_FP16>(alpha) * X[i] * Y[i] +
static_cast<_FP16>(beta) * Z[i];
else
Z[i] = static_cast<_FP16>(alpha) * X[i] * Y[i];
Z[i] = static_cast<_FP16>(alpha) * X[i] * Y[i] + static_cast<_FP16>(beta) * Z[i];

@skykongkong8 (Member Author) Feb 21, 2024:

What if Z[i] != Z[i]? (i.e., NaN in Z[i], or uninitialized Z)

Contributor:

Regardless of beta, wouldn't it cause an error anyway?

@djeong20 (Contributor) Feb 21, 2024:

If beta is zero, Z = X * Y + beta * Z reduces to Z = X * Y; if not, it stays Z = X * Y + beta * Z.

For the case where NaN is in Z[i] or Z is uninitialized, it would cause an error either way.

Member Author:

Isn't NaN * 0 = NaN?

Contributor:

Yeah, what I mean is that adding a beta != 0 condition just to avoid NaN or uninitialized-memory errors seems offbeat.

Contributor:

Maybe we could come up with a different way to handle these cases (e.g., check whether the tensor is initialized when using beta).

Member Author:

Sounds more reasonable. I will try to fix it while resolving the conflicts.
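For context, a minimal demonstration of the concern discussed above (the snippet is illustrative, not from the PR): since 0 * NaN is NaN, the unguarded formula poisons the output whenever Z holds NaN, even with beta == 0.

```cpp
#include <cmath>   // std::isnan
#include <cstdio>
#include <limits>

int main() {
  float z = std::numeric_limits<float>::quiet_NaN(); // e.g. uninitialized Z
  float unguarded = 1.f * 2.f * 3.f + 0.f * z;       // 0 * NaN = NaN, so the sum is NaN
  float guarded = 1.f * 2.f * 3.f;                   // beta term skipped: 6.0
  std::printf("unguarded is NaN: %d, guarded: %.1f\n",
              (int)std::isnan(unguarded), guarded);
  return 0;
}
```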

* @param[in] X _FP16 * for Vector X
* @param[in] Y _FP16 * for Vector Y
* @param[in] Z _FP16 * for Vector Z
* @param[in] alpha scalar multiplier for input
Contributor:

Quick question: it seems scalars are applied only to Y and Z. Wouldn't there be cases where X also needs a scalar?

Contributor:

Also, why is a scalar added in the first place?

@skykongkong8 (Member Author) Feb 21, 2024:

  1. Good point. Usually, strict BLAS functions only take X and Y, not Z, so it would go like $X = X \cdot \alpha \,(op)\, Y \cdot \beta$. But the current function usage in NNTrainer requires all of $X, Y, Z$. I thought of adding a new scalar multiplier $\gamma$, but there is no such use case for now; we can add it whenever we need it.
  2. In the tensor operations, we decide between std::transform and ele_* based on stride and alpha/beta (a sketch of that dispatch follows below). For broader use of this function, the scalar multipliers are necessary.
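A hypothetical sketch of that dispatch check (the helper name and exact condition are illustrative; the actual Tensor code checks the innermost strides and alpha, as the diff below shows):

```cpp
// The ele_* fast path is only valid for contiguous data (innermost
// stride 1) with the scalar default the kernel expects; otherwise
// fall back to std::transform over strided iterators.
static bool can_use_ele_kernel(int stride_x, int stride_y, int stride_z,
                               float alpha) {
  return stride_x == 1 && stride_y == 1 && stride_z == 1 && alpha == 1.f;
}
```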

Contributor:

Thanks for the clarification :)

@@ -1076,7 +1076,7 @@ Tensor &Tensor::add(Tensor const &m, Tensor &output, float const alpha) const {
auto f = [&](const BroadcastInfo &e, const float *buf, const float *m_buf,
float *out_buf) {
if (e.strides[3] == 1 && strides[3] == 1 && strides[3] == 1 &&
alpha == 0) {
alpha == 1.f) {
Contributor:

Could you explain why this condition has changed?

@skykongkong8 (Member Author) Feb 21, 2024:

We have been using this code incorrectly. alpha is an input scalar multiplier, so considering our total formula,

$$Z = X \,(op)\, Y \cdot \alpha + Z \cdot \beta$$

alpha == 0 is nonsense, isn't it? It would drop the X (op) Y term entirely.

Contributor:

Makes sense. Then, so far, ele_add (previously ewva) hasn't been used?

Member Author:

As far as I can tell, that is correct.

Collaborator:

Nice catch!

@skykongkong8 (Member Author) commented Feb 21, 2024

> Overall, looks good to me. There are a few things to point out:
>
>   1. We should validate this code by replacing the current std::transform implementation in the Tensor class with this, especially divide().
>   2. After 1, we should check the performance difference for non-NEON cases; in other words, check the effect of replacing std::transform() with a for loop.
>
> Also, please resolve the conflict!

I already ran this against the tensor unit tests, and discovered that:

  1. Although we use SIMD, the computational efficiency does not increase linearly (maybe because of loop unrolling? Currently trying to clarify the reason).
  2. The latency of std::transform() is almost the same as a for loop; maybe we can keep it only for the non-contiguous case in later fixes (a minimal timing sketch follows below). And yes, I had some discussions with @jijoongmoon offline about when to apply this to the Tensor operations.

  • Merge conflicts will be resolved after all discussions, to avoid unnecessary conflicts with TensorV2.
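For reference, a minimal timing sketch of such a comparison (this is not the project's benchmark; the size and methodology are illustrative only):

```cpp
#include <algorithm> // std::transform
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
  const size_t N = 1 << 22;
  std::vector<float> x(N, 1.5f), y(N, 2.5f), z(N);

  auto t0 = std::chrono::steady_clock::now();
  std::transform(x.begin(), x.end(), y.begin(), z.begin(),
                 [](float a, float b) { return a * b; });
  auto t1 = std::chrono::steady_clock::now();
  for (size_t i = 0; i < N; ++i) // hand-written loop over the same data
    z[i] = x[i] * y[i];
  auto t2 = std::chrono::steady_clock::now();

  using ms = std::chrono::duration<double, std::milli>;
  std::printf("std::transform: %.3f ms, for loop: %.3f ms\n",
              ms(t1 - t0).count(), ms(t2 - t1).count());
  return 0;
}
```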

…fp32

- Rename ewvm and ewva to ele_mul and ele_add
- Implement the fp32 case as well
- Add scalar multiplier parameters alpha and beta

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- It is quite common to use a scalar multiplier in elementwise addition and multiplication.
- However, in the case of the multiplier beta, if the output vector Z is set to NaN, it might produce invalid values.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- This commit introduces the basic structure of the elementwise subtraction and division functions
- Function implementations will be added in a later commit

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- The latest elementwise functions take alpha and beta as scalar multipliers.
- With those parameters, the formula in the function brief can be described in a more precise way

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- Implement elementwise subtraction and division functions based on the function structures proposed in the previous commit

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- This commit introduces the basic structure of the elementwise subtraction and division functions in blas_interface
- Function implementations will be added in a later commit

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- Implement elementwise subtraction and division functions based on the function structures proposed in the previous commit
- With NEON, we can use the SIMD-accelerated functions from blas_neon (a minimal NEON sketch follows this commit message)

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
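A minimal sketch of the NEON idea (this is not the PR's blas_neon code; the function name is illustrative, and alpha/beta handling is omitted for brevity):

```cpp
#include <arm_neon.h>

// Multiply four fp32 lanes per iteration, then finish the tail scalar-wise.
void ele_mul_neon_sketch(unsigned int N, const float *X, const float *Y,
                         float *Z) {
  unsigned int i = 0;
  for (; i + 4 <= N; i += 4) {
    float32x4_t x = vld1q_f32(&X[i]);
    float32x4_t y = vld1q_f32(&Y[i]);
    vst1q_f32(&Z[i], vmulq_f32(x, y));
  }
  for (; i < N; ++i) // remainder that does not fill a full vector
    Z[i] = X[i] * Y[i];
}
```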
…lar multiplier

- Default value of scalar multiplier alpha should be 1, not 0

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- With the default value of the output scalar multiplier beta, the output Z might contain NaN values.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- Following the discussions in nnstreamer#2473, we found a better way of comparing the float scalar multiplier, using __FLT_MIN__

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- As in commit#7363546, the alpha option in ewva should be set to 1, not 0.
- Change function names: ew* -> ele_*

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
@taos-ci (Collaborator) left a comment

@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.

@jijoongmoon (Collaborator) left a comment

LGTM!

@jijoongmoon jijoongmoon merged commit e66a786 into nnstreamer:main Feb 23, 2024
29 of 31 checks passed