[ BLAS ] Implement elementwise operations #2474

Merged: 11 commits into nnstreamer:main on Feb 23, 2024

Conversation

skykongkong8 (Member)

Implemented elementwise vector operations:

  • ele_add (previously ewva)
  • ele_sub
  • ele_mul (previously ewvm)
  • ele_div

All come in fp32 and fp16 precision; the current BLAS path uses a raw (loop-based) implementation, while the NEON path uses a SIMD implementation. A sketch of the raw path follows below.
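Here is a minimal sketch (not the PR's exact code) of the raw, non-SIMD fp32 path, assuming the alpha/beta semantics and the __FLT_MIN__ beta guard shown in the diffs later in this thread; the name ele_mul_sketch and the default arguments are illustrative only.

```cpp
#include <cmath>  // std::abs
#include <cfloat> // FLT_MIN, a portable stand-in for __FLT_MIN__

// Z = alpha * (X * Y) + beta * Z; the beta term is skipped when beta == 0
// so an uninitialized or NaN Z cannot poison the result.
void ele_mul_sketch(const unsigned int N, const float *X, const float *Y,
                    float *Z, float alpha = 1.f, float beta = 0.f) {
  for (unsigned int i = 0; i < N; ++i) {
    if (std::abs(beta) > FLT_MIN)
      Z[i] = alpha * X[i] * Y[i] + beta * Z[i];
    else
      Z[i] = alpha * X[i] * Y[i];
  }
}
```

ele_add, ele_sub, and ele_div follow the same pattern with the operator swapped.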

@taos-ci (Collaborator) commented Feb 19, 2024

📝 TAOS-CI Version: 1.5.20200925. Thank you for submitting PR #2474. Please follow the 1 commit/1 PR (one commit per PR) policy to get comments quickly from reviewers. Your PR must pass all verification processes of cibot before the review process by reviewers can start. If you are a new member joining this project, please read the manuals in the documentation folder and wiki page. To monitor the progress of your PR in more detail, visit http://ci.nnstreamer.ai/.

@skykongkong8 skykongkong8 changed the title Ele add mul sub div [ BLAS ] Ele add mul sub div Feb 19, 2024
@skykongkong8 skykongkong8 changed the title [ BLAS ] Ele add mul sub div [ BLAS ] Implement elementwise operations Feb 19, 2024
@taos-ci (Collaborator) left a comment

@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.

@djeong20 (Contributor) left a comment

Overall, looks good to me. There are a few things to point out:

  1. We should validate this code by replacing the current std::transform implementation in the Tensor class with this, especially divide().
  2. After 1, we should check the performance difference for non-NEON cases; in other words, check the effect of replacing std::transform() with a for loop.

Also, please resolve the conflict!

@@ -400,12 +381,64 @@ void scopy_int8_to_float16(const unsigned int N, const uint8_t *X,
copy_int8_to_fp16(N, X, incX, Y, incY);
}

void ewvm(const unsigned int N, const _FP16 *X, const _FP16 *Y, _FP16 *Z) {
ewvm_FP16(N, X, Y, Z);
void ele_mul(const unsigned int N, const _FP16 *X, const _FP16 *Y, _FP16 *Z,
Contributor:

Suggested change
void ele_mul(const unsigned int N, const _FP16 *X, const _FP16 *Y, _FP16 *Z,
void elementwise_multiply(const unsigned int N, const _FP16 *X, const _FP16 *Y, _FP16 *Z,

One suggestion: how about renaming it to something clearer?

@skykongkong8 (Member Author):

I asked many contributors for their opinions offline, and the shorter function name is generally preferred. ele_* sounds clear enough to me.

Comment on lines +390 to +394
if (std::abs(beta) > __FLT_MIN__)
Z[i] = static_cast<_FP16>(alpha) * X[i] * Y[i] +
static_cast<_FP16>(beta) * Z[i];
else
Z[i] = static_cast<_FP16>(alpha) * X[i] * Y[i];
Contributor:

Any reason to differentiate the beta == 0 case?

Suggested change
if (std::abs(beta) > __FLT_MIN__)
Z[i] = static_cast<_FP16>(alpha) * X[i] * Y[i] +
static_cast<_FP16>(beta) * Z[i];
else
Z[i] = static_cast<_FP16>(alpha) * X[i] * Y[i];
Z[i] = static_cast<_FP16>(alpha) * X[i] * Y[i] + static_cast<_FP16>(beta) * Z[i];

@skykongkong8 (Member Author) Feb 21, 2024:

What if Z[i] != Z[i]? (i.e., NaN in Z[i], or uninitialized Z)

Contributor:

Regardless of beta, wouldn't it cause an error anyway?

@djeong20 (Contributor) Feb 21, 2024:

If beta is zero, Z = X * Y + beta * Z reduces to Z = X * Y; if not, it stays Z = X * Y + beta * Z.

For the case where NaN is in Z[i] or Z is uninitialized, it would cause an error either way.

Member Author:

Isn't NaN * 0 = NaN?

Contributor:

Yeah, what I mean is that adding a beta != 0 condition just to avoid NaN or uninitialized-memory errors seems offbeat.

Contributor:

Maybe we could come up with a different way to handle these cases (e.g., check whether the tensor is initialized when using beta).

Member Author:

Sounds more reasonable. I will try to fix it while resolving the conflicts.
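For context, a minimal demonstration of the concern discussed above (the snippet is illustrative, not from the PR): since 0 * NaN is NaN, the unguarded formula poisons the output whenever Z holds NaN, even with beta == 0.

```cpp
#include <cmath>   // std::isnan
#include <cstdio>
#include <limits>

int main() {
  float z = std::numeric_limits<float>::quiet_NaN(); // e.g. uninitialized Z
  float unguarded = 1.f * 2.f * 3.f + 0.f * z;       // 0 * NaN = NaN, so the sum is NaN
  float guarded = 1.f * 2.f * 3.f;                   // beta term skipped: 6.0
  std::printf("unguarded is NaN: %d, guarded: %.1f\n",
              (int)std::isnan(unguarded), guarded);
  return 0;
}
```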

* @param[in] X _FP16 * for Vector X
* @param[in] Y _FP16 * for Vector Y
* @param[in] Z _FP16 * for Vector Z
* @param[in] alpha scalar multiplier for input
Contributor:

Quick question: it seems scalars are applied only to Y and Z. Wouldn't there be cases where X also needs a scalar?

Contributor:

Also, why is a scalar added in the first place?

@skykongkong8 (Member Author) Feb 21, 2024:

  1. Good point. Usually, strict BLAS functions only take X and Y, not Z, so it would go like $X = X \cdot \alpha \,(op)\, Y \cdot \beta$. But the current function usage in NNTrainer requires all of $X, Y, Z$. I thought of adding a new scalar multiplier $\gamma$, but there is no such use case for now; we can add it whenever we need it.
  2. In the tensor operations, we decide between std::transform and ele_* based on stride and alpha/beta (a sketch of that dispatch follows below). For broader use of this function, the scalar multipliers are necessary.
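A hypothetical sketch of that dispatch check (the helper name and exact condition are illustrative; the actual Tensor code checks the innermost strides and alpha, as the diff below shows):

```cpp
// The ele_* fast path is only valid for contiguous data (innermost
// stride 1) with the scalar default the kernel expects; otherwise
// fall back to std::transform over strided iterators.
static bool can_use_ele_kernel(int stride_x, int stride_y, int stride_z,
                               float alpha) {
  return stride_x == 1 && stride_y == 1 && stride_z == 1 && alpha == 1.f;
}
```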

Contributor:

Thanks for the clarification :)

@@ -1076,7 +1076,7 @@ Tensor &Tensor::add(Tensor const &m, Tensor &output, float const alpha) const {
auto f = [&](const BroadcastInfo &e, const float *buf, const float *m_buf,
float *out_buf) {
if (e.strides[3] == 1 && strides[3] == 1 && strides[3] == 1 &&
alpha == 0) {
alpha == 1.f) {
Contributor:

Could you explain why this condition has changed?

@skykongkong8 (Member Author) Feb 21, 2024:

We have been using this code incorrectly. alpha is an input scalar multiplier, so considering our total formula,

$$Z = X \,(op)\, Y \cdot \alpha + Z \cdot \beta$$

alpha == 0 is nonsense, isn't it? It would drop the X (op) Y term entirely.

Contributor:

Makes sense. Then, so far, ele_add (previously ewva) hasn't been used?

Member Author:

As far as I can tell, that is correct.

Collaborator:

Nice catch!

@skykongkong8 (Member Author) commented Feb 21, 2024

> Overall, looks good to me. There are a few things to point out:
>
>   1. We should validate this code by replacing the current std::transform implementation in the Tensor class with this, especially divide().
>   2. After 1, we should check the performance difference for non-NEON cases; in other words, check the effect of replacing std::transform() with a for loop.
>
> Also, please resolve the conflict!

I already ran this against the tensor unit tests, and discovered that:

  1. Although we use SIMD, the computational efficiency does not increase linearly (maybe because of loop unrolling? Currently trying to clarify the reason).
  2. The latency of std::transform() is almost the same as a for loop; maybe we can keep it only for the non-contiguous case in later fixes (a minimal timing sketch follows below). And yes, I had some discussions with @jijoongmoon offline about when to apply this to the Tensor operations.

  • Merge conflicts will be resolved after all discussions, to avoid unnecessary conflicts with TensorV2.
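For reference, a minimal timing sketch of such a comparison (this is not the project's benchmark; the size and methodology are illustrative only):

```cpp
#include <algorithm> // std::transform
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
  const size_t N = 1 << 22;
  std::vector<float> x(N, 1.5f), y(N, 2.5f), z(N);

  auto t0 = std::chrono::steady_clock::now();
  std::transform(x.begin(), x.end(), y.begin(), z.begin(),
                 [](float a, float b) { return a * b; });
  auto t1 = std::chrono::steady_clock::now();
  for (size_t i = 0; i < N; ++i) // hand-written loop over the same data
    z[i] = x[i] * y[i];
  auto t2 = std::chrono::steady_clock::now();

  using ms = std::chrono::duration<double, std::milli>;
  std::printf("std::transform: %.3f ms, for loop: %.3f ms\n",
              ms(t1 - t0).count(), ms(t2 - t1).count());
  return 0;
}
```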

…fp32

- Rename ewvm and ewva to ele_mul and ele_add
- Implement the fp32 case as well
- Add scalar multiplier parameters alpha and beta

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- It is quite common to use a scalar multiplier in elementwise addition and multiplication.
- However, in the case of the multiplier beta, if the output vector Z is set to NaN, it might produce invalid values.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- This commit introduces the basic structure of the elementwise subtraction and division functions
- Function implementations will be added in a later commit

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- The latest elementwise functions take alpha and beta as scalar multipliers.
- With those parameters, the formula in the function brief can be described in a more precise way

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- Implement elementwise subtraction and division functions based on the function structures proposed in the previous commit

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- This commit introduces the basic structure of the elementwise subtraction and division functions in blas_interface
- Function implementations will be added in a later commit

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- Implement elementwise subtraction and division functions based on the function structures proposed in the previous commit
- With NEON, we can use the SIMD-accelerated functions from blas_neon (a minimal NEON sketch follows this commit message)

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
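A minimal sketch of the NEON idea (this is not the PR's blas_neon code; the function name is illustrative, and alpha/beta handling is omitted for brevity):

```cpp
#include <arm_neon.h>

// Multiply four fp32 lanes per iteration, then finish the tail scalar-wise.
void ele_mul_neon_sketch(unsigned int N, const float *X, const float *Y,
                         float *Z) {
  unsigned int i = 0;
  for (; i + 4 <= N; i += 4) {
    float32x4_t x = vld1q_f32(&X[i]);
    float32x4_t y = vld1q_f32(&Y[i]);
    vst1q_f32(&Z[i], vmulq_f32(x, y));
  }
  for (; i < N; ++i) // remainder that does not fill a full vector
    Z[i] = X[i] * Y[i];
}
```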
…lar multiplier

- Default value of scalar multiplier alpha should be 1, not 0

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- With the default value of the output scalar multiplier beta, the output Z might contain NaN values.

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- Following the discussions in nnstreamer#2473, we found a better way of comparing the float scalar multiplier, using __FLT_MIN__

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
- As in commit#7363546, the alpha option in ewva should be set to 1, not 0.
- Change function names: ew* -> ele_*

**Self evaluation:**
1. Build test:     [X]Passed [ ]Failed [ ]Skipped
2. Run test:     [X]Passed [ ]Failed [ ]Skipped

Signed-off-by: skykongkong8 <[email protected]>
@taos-ci (Collaborator) left a comment

@skykongkong8, 💯 All CI checkers are successfully verified. Thanks.

@jijoongmoon (Collaborator) left a comment

LGTM!

@jijoongmoon jijoongmoon merged commit e66a786 into nnstreamer:main Feb 23, 2024
29 of 31 checks passed