
[HPC] Proposal: Exclude data movement from timing #507

Open
nvaprodromou opened this issue Feb 7, 2023 · 6 comments

@nvaprodromou (Contributor) commented Feb 7, 2023

Introduction:

After collecting feedback from engineers, clients, and press, NVIDIA presented a list of proposals that aim to improve the popularity of the MLPerf HPC benchmark suite. Please see our slide deck for more information on our feedback gathering process and insights.

Proposal: Exclude data movement from timing (start the clock after data retrieval, before caching; same as MLPerf-T).

Slide 14 in proposals slide deck.

This proposal aims to improve the popularity of the MLPerf HPC benchmark suite in the following ways:

  1. Reduces the high submission overhead and cost [Affects participation and competition]
  2. Isolates benchmarking of compute from the filesystem [Improves RFP interest]
  3. Simplifies the "Throughput" benchmark (renamed from "weak scaling") [Affects participation and competition]

Note: We strongly believe that the filesystem is an extremely important part of a system, and we always advise potential clients to consider the interplay of all its parts (FS + compute + network). However, we received a strong signal from some clients that timing data movement makes it harder to use MLPerf-HPC scores for apples-to-apples comparisons, since the FS and compute are sometimes not purchased at the same time.
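For concreteness, here is a minimal sketch of where the clock would start and stop under this proposal. The helper names (`stage_data`, `train`) are hypothetical placeholders, not part of any reference implementation; the point is only the boundary: staging is reported but unscored, while caching and training fall inside the scored region.

```python
import time

def run_benchmark(stage_data, train):
    """stage_data and train are caller-supplied callables (hypothetical)."""
    staging_start = time.time()
    dataset = stage_data()                   # move data from the parallel FS to node-local storage
    staging_time = time.time() - staging_start   # reported, but excluded from the score

    run_start = time.time()                  # clock starts: after data retrieval, before caching
    model = train(dataset)                   # caching and training are inside the timed region
    time_to_train = time.time() - run_start  # the scored time-to-train metric

    return model, time_to_train, staging_time
```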

Discussion

Pros:

  1. By far the most common feedback we received concerned the unreasonably high submission overhead of MLPerf HPC (cost, engineering resources, and system time)
  2. Submitters no longer need to optimize data movement

Cons:

  1. Reduces the quality of the benchmark since it no longer considers the system as a whole.
  2. This makes the score an “upper bound,” since storage is not timed; MLPerf-T has the same issue, however.
@sparticlesteve (Contributor)

My comments:

  • I admit I would be a little disappointed if we exclude data movement, because as a group we decided that this was an important piece of end-to-end performance.
  • However, I recognize that potential users of this benchmark may want to be able to disentangle compute+storage rather than measure them together.
  • I think the case for this is stronger if MLPerf Storage can fill the gap and characterize HPC storage system performance for our workloads.
  • One potential upside of this rule change is that we would allow the use of systems that can hide data movement completely from the user, e.g. a burst buffer that is prefilled with the data before the training job starts.
  • I wonder if we could still enable the (optional?) reporting of the data movement time while excluding it from the final time-to-train metric (see the sketch after this list).
  • At the very least I think there should be documentation and/or scripts from submitters showing how the data was setup.
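
One possible shape for that optional reporting, sketched against the mlperf_logging mllog API: the `staging_start`/`staging_stop` keys below are assumptions for illustration (only RUN_START/RUN_STOP are standard constants relied on here), and the two callables are hypothetical placeholders.

```python
from mlperf_logging import mllog

def run(stage_dataset, train_to_target_quality):
    """Both arguments are caller-supplied callables (hypothetical)."""
    mllogger = mllog.get_mllogger()

    mllogger.start(key="staging_start")   # assumed key: data movement begins (reported only)
    stage_dataset()                       # e.g., copy the dataset to node-local storage
    mllogger.end(key="staging_stop")      # assumed key: data movement ends

    mllogger.start(key=mllog.constants.RUN_START)  # scored region: caching + training
    train_to_target_quality()
    mllogger.end(key=mllog.constants.RUN_STOP)     # scored region ends
```

Result parsers could then surface the staging interval alongside, but separate from, the scored time-to-train.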

@memani1 (Contributor) commented Feb 28, 2023

I believe it is essential to include data movement in this benchmark suite, to distinguish it from the MLPerf Training benchmarks, since HPC applications typically involve large datasets that stress I/O. In addition, this can be studied in detail in the storage group in the next year or so.

How about including data movement in the time-to-train (strong-scaling) metric and excluding it from the throughput (weak-scaling) metric?

I agree that we should still report the data movement timing, even if it is excluded from the final metric.

@nvaprodromou (Contributor, Author)

Differentiating from MLPerf-T should not, by itself, be sufficient justification for policy development. It is true that this will blur the line between MLPerf-T and MLPerf-HPC even more, but the truth is that the two benchmarks are already very similar.

Including data movement (even partly, e.g., only in the closed division) does not alleviate the high cost of submission, which is the most common feedback we received (by a very wide margin). Please remember that the motivation for these proposals is to increase MLPerf-HPC's popularity and participation.

Having an optional report of data movement timing still adds complexity, this time in parsing the results. Describing data movement optimization strategies in READMEs can be a good middle ground.

The MLPerf-Storage approach might be the best solution, and it also gives the various MLCommons suites cleanly separated scopes.

@TheKanter (Contributor)

Hi @nvaprodromou, are potential submitters willing to guarantee that they would submit to a version without storage?

I have been pondering this, and it is a difficult trade-off: fidelity vs. submission quantity.

Is there a way we can de-risk it?

How would we feel if we dropped storage and data movement and then no additional submitters appeared?

@nvaprodromou (Contributor, Author)

I don't think we can get any formal guarantees on this. I did ask this question myself as well and I can dig into it some more. But I doubt we'll get any commitments. Even if we do, these are still likely to be NVIDIA submissions, which only solves part of the problem (participation and competition need to rise).

Furthermore, changing the rules by itself is not going to change things. We'll need to run some sort of campaign to "advertise" that submissions are now (I'm making numbers up) 100x easier than they used to be, that results (i.e., return on investment) have a guaranteed lifespan, and that (this is primarily for businesses) results are more useful to entities seeking to purchase an HPC system.

Even though no guarantees can be made, easier submissions, guaranteed returns, and a good campaign can't really hurt the existing participation numbers. On the other hand, if we change the rules and no additional submitters appear, I would argue we are in the same place we were before: even though the quality of the benchmark was reduced compared to v2.0, the primary problem remains attracting participation and competition. We can have a shiny thing few care about, or a less shiny thing few care about.

@sparticlesteve (Contributor)

This was accepted and implemented in the rules, so I think it can be closed now. Correct, @nvaprodromou ?
