
Issue with 3d-unet #1845

Closed
Agalakdak opened this issue Sep 10, 2024 · 19 comments

@Agalakdak

Hello everyone. I have already submitted a bug report here, but that thread accumulated a lot of messages, so I decided to open a new one.
This time I ran 3D-UNet using the command below, taken from
https://docs.mlcommons.org/inference/benchmarks/medical_imaging/3d-unet/

The command
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev
--model=3d-unet-99
--implementation=nvidia
--framework=tensorrt
--category=edge
--scenario=Offline
--execution_mode=test
--device=cuda
--docker --quiet
--test_query_count=50

and a brief error report:
0.580 INFO:root: ! cd /home/cmuser/CM/repos/local/cache/5103bc0a39b8472f
0.580 INFO:root: ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/get-git-repo/run.sh from tmp-run.sh
0.584 /home/cmuser/CM/repos/local/cache/5103bc0a39b8472f
0.585 ******************************************************
0.585 Current directory: /home/cmuser/CM/repos/local/cache/5103bc0a39b8472f
0.585
0.585 Cloning inference from https://github.com/mlcommons/inference
0.585
0.585 git clone -b master https://github.com/mlcommons/inference --depth 5 inference
0.585
0.586 Cloning into 'inference'...
38.68 fatal: the remote end hung up unexpectedly
38.69 fatal: early EOF
38.69 fatal: index-pack failed
38.69 Detected version: 3.8.10
38.69 Detected version: 3.8.10
38.69
38.69 CM error: Portable CM script failed (name = get-git-repo, return code = 256)
38.69
38.69
38.69 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
38.69 Note that it is often a portability issue of a third-party tool or a native script
38.69 wrapped and unified by this CM script (automation recipe). Please re-run
38.69 this script with --repro flag and report this issue with the original
38.69 command line, cm-repro directory and full log here:
38.69
38.69 https://github.com/mlcommons/cm4mlops/issues
38.69
38.69 The CM concept is to collaboratively fix such issues inside portable CM scripts
38.69 to make existing tools and native scripts more portable, interoperable
38.69 and deterministic. Thank you!

Full log with the problem

error_3dunet.log

@arjunsuresh (Contributor)

That looks like a GitHub connection issue. Can you please retry?
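
If it keeps failing, a quick way to check whether GitHub is actually reachable from the environment where CM runs the clone (a generic network check, not a CM-specific fix) is something like:

curl -I https://github.com
git clone -b master --depth 1 https://github.com/mlcommons/inference

The second command retries the same clone the failing script attempted, with the shallowest possible history to minimize the transfer size.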

@Agalakdak (Author)

Hi @arjunsuresh, thanks for the advice. I tried several times (about 4-5), and in 4 out of 5 cases I got the error shown above after ~2000-3000 seconds.

I did eventually get some results (for other models). Please answer the questions below.

  1. My results:

    bert-99
    +---------+----------+----------+------------+-----------------+
    | Model   | Scenario | Accuracy | Throughput | Latency (in ms) |
    +---------+----------+----------+------------+-----------------+
    | bert-99 | Offline  | 90.16951 | X 1764.6   | -               |
    +---------+----------+----------+------------+-----------------+

resnet50

+----------+----------+----------+------------+-----------------+
| Model    | Scenario | Accuracy | Throughput | Latency (in ms) |
+----------+----------+----------+------------+-----------------+
| resnet50 | Offline  | 76.034   | 19709.5    | -               |
+----------+----------+----------+------------+-----------------+

How should I interpret these results, and what should I compare them with? I found some tables at https://mlcommons.org/benchmarks/inference-edge/. Did I understand correctly that Throughput is the equivalent of "Samples" there? And what should I do with "Accuracy"?

I also wanted to run resnet50 in SingleStream mode, but I got an error.
The command: cm run script --tags=run-mlperf,inference,_r4.1-dev --model=resnet50 --implementation=nvidia --framework=tensorrt --category=edge --scenario=SingleStream --execution_mode=valid --device=cuda --quiet

I took the command from https://docs.mlcommons.org/inference/benchmarks/language/bert/#__tabbed_59_3

Log with error:
resnet50_error.log

@Agalakdak (Author)

@arjunsuresh Hi. I ran the command inside the container.
cm run script --tags=run-mlperf,inference,_r4.1-dev --model=3d-unet-99 --implementation=nvidia --framework=tensorrt --category=edge --scenario=Offline --execution_mode=valid --device=cuda --quiet

And I got an error
3dunet_error.log

@arjunsuresh (Contributor)

@Agalakdak Can you please open a separate issue for each model related query?

For R50, can you try adding this option: --env.SKIP_POLICIES=1

For the failing 3d-unet run, can you please add --docker_cache=no to rule out any issue with a stale Docker cache?

When running in the "closed" division, accuracy must be above the threshold or the submission checker will fail. For this reason, accuracy is not reported in the official results: the accuracy values of all closed-division submissions are expected to be very close, so only the performance numbers matter.

For throughput: yes, it is "samples per second" for most benchmarks and "tokens per second" for the LLM ones.
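
For reference, the earlier commands with these options appended would look roughly like this (everything else unchanged):

cm run script --tags=run-mlperf,inference,_r4.1-dev --model=resnet50 --implementation=nvidia --framework=tensorrt --category=edge --scenario=SingleStream --execution_mode=valid --device=cuda --quiet --env.SKIP_POLICIES=1

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev --model=3d-unet-99 --implementation=nvidia --framework=tensorrt --category=edge --scenario=Offline --execution_mode=test --device=cuda --docker --docker_cache=no --quiet --test_query_count=50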

@Agalakdak (Author)

Hi @arjunsuresh, sorry for the late reply; I was busy with other things. I tried your advice and unfortunately got errors again, but this time I can provide the full logs of both steps.

The first step is entering the command that takes me into the container:
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev
--model=3d-unet-99
--implementation=nvidia
--framework=tensorrt
--category=edge
--scenario=Offline
--execution_mode=test
--device=cuda
--docker --quiet
--test_query_count=50

The second step is entering the command inside the container itself:
cm run script --tags=run-mlperf,inference,_r4.1-dev
--model=sdxl
--implementation=nvidia
--framework=tensorrt
--category=edge
--scenario=Offline
--execution_mode=valid
--device=cuda
--quiet

3dunet_full_error_second_step.log
3dunet_full_error_first_step.log

@arjunsuresh (Contributor)

Hi @Agalakdak The second command you shared is for sdxl, but the logs are for 3d-unet. Is SDXL working fine? Let me check 3d-unet on my end.

@arjunsuresh (Contributor)

It's working fine for me.

make preprocess_data BENCHMARKS='3d-unet'
/home/cmuser/CM/repos/local/cache/4ea5dceee2464cb7/repo/closed/NVIDIA/code/3d-unet/tensorrt/preprocess_data.py:37: DeprecationWarning: Please use `zoom` from the `scipy.ndimage` namespace, the `scipy.ndimage.interpolation` namespace is deprecated.
  from scipy.ndimage.interpolation import zoom
Preprocessing /home/cmuser/CM/repos/local/cache/b1f8faeaa7384886/data/KiTS19/kits19/data...
Saved /home/cmuser/CM/repos/local/cache/b1f8faeaa7384886/preprocessed_data/KiTS19/reference/case_00012.pkl -- shape (1, 256, 320, 320) mean [-1.8] std [1.05]
Saved /home/cmuser/CM/repos/local/cache/b1f8faeaa7384886/preprocessed_data/KiTS19/reference/case_00044.pkl -- shape (1, 320, 384, 384) mean [-1.86] std [1.05]
Saved /home/cmuser/CM/repos/local/cache/b1f8faeaa7384886/preprocessed_data/KiTS19/reference/case_00024.pkl -- shape (1, 256, 256, 256) mean [-1.66] std [1.17]
...

What I suspect is a failure in the download of the kits19 dataset, as the NVIDIA script below skips the re-download if the file already exists, without checking its validity.

https://github.com/mlcommons/inference_results_v4.0/blob/main/closed/NVIDIA/code/3d-unet/tensorrt/download_data.sh#L20

The command below will give you the path to the NVIDIA scratch space where the data gets downloaded. Manually remove the kits19 data directory from there and then retry the command.

cm run script "get mlperf inference nvidia scratch space _version.4_0" -j

@arjunsuresh (Contributor)

Meanwhile, the kits19 download is slow and can take several hours to complete.

@Agalakdak (Author)

Hello @arjunsuresh. I think I figured out what the problem is: it's a network issue...
Below is the log:
case_00299: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 280551/280552 [00:48<00:00, 5728.77KB/s]
Duplicating KITS19 case_00185 as case_00400...
~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA
Done.
Downloading JSON files describing subset used for inference/calibration...
--2024-09-17 02:52:31-- https://raw.githubusercontent.com/mlcommons/inference/486a629ea4d5c5150f452d0b0a196bf71fd2021e
Connecting to my_proxy_ip:8080... connected.
Proxy request sent, awaiting response... 400 Bad Request
2024-09-17 02:52:32 ERROR 400: Bad Request.

--2024-09-17 02:52:32-- http://92dd3d24cf78d07aa31165f90c636d98c4adddcd/vision/medical_imaging/3d-unet-kits19/meta/inference_cases.json
Connecting to my_proxy_ip:8080 connected.
Proxy request sent, awaiting response... 404 No such domain
2024-09-17 02:52:32 ERROR 404: No such domain.

--2024-09-17 02:52:32-- https://raw.githubusercontent.com/mlcommons/inference/486a629ea4d5c5150f452d0b0a196bf71fd2021e
Connecting to my_proxy_ip:8080... connected.
Proxy request sent, awaiting response... 400 Bad Request
2024-09-17 02:52:32 ERROR 400: Bad Request.

--2024-09-17 02:52:32-- http://92dd3d24cf78d07aa31165f90c636d98c4adddcd/vision/medical_imaging/3d-unet-kits19/meta/calibration_cases.json
Connecting to my_proxy_ip:8080... connected.
Proxy request sent, awaiting response... 404 No such domain
2024-09-17 02:52:33 ERROR 404: No such domain.

Done.
Finished downloading all the datasets!
/home/cmuser/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA/code/3d-unet/tensorrt/preprocess_data.py:37: DeprecationWarning: Please use zoom from the scipy.ndimage namespace, the scipy.ndimage.interpolation namespace is deprecated.
from scipy.ndimage.interpolation import zoom
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/cmuser/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA/code/3d-unet/tensorrt/preprocess_data.py", line 858, in <module>
    main()
  File "/home/cmuser/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA/code/3d-unet/tensorrt/preprocess_data.py", line 842, in main
    kits19tool = KITS19Tool(args)
  File "/home/cmuser/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA/code/3d-unet/tensorrt/preprocess_data.py", line 117, in __init__
    self.INFER_CASES = json.load(open(self.INFERENCE_CASE_FILE))
  File "/usr/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
make: *** [/home/cmuser/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA/Makefile.data:36: preprocess_data] Error 1

CM error: Portable CM script failed (name = app-mlperf-inference-nvidia, return code = 256)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:

https://github.com/mlcommons/cm4mlops/issues

The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
and deterministic. Thank you!
cmuser@85f58939130e

@arjunsuresh (Contributor)

@Agalakdak Actually, that looks like a problem with the download script: it is constructing invalid URLs. It probably worked fine for me because some of the downloaded files were already present. We'll fix this issue in the script.
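
As a possible manual workaround until the script is fixed (this is an assumption, not a verified fix), the two JSON files that the failed wget calls were trying to fetch could be downloaded directly from the mlcommons/inference repository; the raw.githubusercontent.com paths below are inferred from the file names in your log, and the branch or commit to use may differ:

wget https://raw.githubusercontent.com/mlcommons/inference/master/vision/medical_imaging/3d-unet-kits19/meta/inference_cases.json
wget https://raw.githubusercontent.com/mlcommons/inference/master/vision/medical_imaging/3d-unet-kits19/meta/calibration_cases.json

They would then need to be placed wherever preprocess_data.py expects INFERENCE_CASE_FILE and the calibration file to live.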

@Agalakdak (Author)

@arjunsuresh If you need more information about my system, please let me know.

Agalakdak closed this as not planned on Sep 17, 2024
@arjunsuresh (Contributor)

Hi @Agalakdak Can you please do this (inside the container):

cd ~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA
git pull
cm rm cache --tags=_download_data -f

And retry the command?

@Agalakdak (Author)

Agalakdak commented Sep 18, 2024

@arjunsuresh Hi, I tried the advice above, but it didn't help. There aren't many logs, so I just pasted them below.

cmuser@ccaeef79d72e:~$ cd ~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA
cmuser@ccaeef79d72e:~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA$ git pull
remote: Enumerating objects: 21, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 13 (delta 10), reused 0 (delta 0), pack-reused 0 (from 0)
Unpacking objects: 100% (13/13), 2.32 KiB | 339.00 KiB/s, done.
From https://github.com/GATEOverflow/inference_results_v4.0
c032f835c..7abca22ba main -> origin/main
Updating c032f835c..7abca22ba
Fast-forward
closed/NVIDIA/Makefile.build | 2 --
closed/NVIDIA/code/3d-unet/tensorrt/download_data.sh | 4 ++--
2 files changed, 2 insertions(+), 4 deletions(-)
cmuser@ccaeef79d72e:~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA$ cm rm cache --tags=_download_data -f

CM error: artifact(s) not found!
cmuser@ccaeef79d72e:~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA$

...

cmuser@ccaeef79d72e:~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA$ source /cm-venv/bin/activate
(cm-venv) cmuser@ccaeef79d72e:~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA$ cm rm cache --tags=_download_data -f

CM error: artifact(s) not found!

@arjunsuresh (Contributor)

Hi @Agalakdak Can you please try cm rm cache --tags=_preprocess_data -f instead?

@Agalakdak (Author)

@arjunsuresh Hi, I tried to run the commands in this order:

  1. cd ~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA
  2. git pull
  3. cm rm cache --tags=_download_data -f
    On step 3, I got the error "cm: command not found".

I tried to run "cm rm cache --tags=_preprocess_data -f" right after entering the container. And the command completed successfully. But it did not give any result.

unet_error.log

@arjunsuresh (Contributor)

Can you retry the original command? No need to do command number 3.

@Agalakdak (Author)

@arjunsuresh Hi, I am constantly busy with other tasks, so I cannot always collect the necessary logs promptly. The log below covers:

  1. Running the command
    cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev
    --model=3d-unet-99
    --implementation=nvidia
    --framework=tensorrt
    --category=edge
    --scenario=Offline
    --execution_mode=test
    --device=cuda
    --docker --quiet
    --test_query_count=50
    (and getting an error)

  2. Running the command
    cd ~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA

  3. Running the command
    git pull

  4. Running the command

cm run script --tags=run-mlperf,inference,_r4.1-dev
--model=3d-unet-99
--implementation=nvidia
--framework=tensorrt
--category=edge
--scenario=Offline
--execution_mode=valid
--device=cuda
--quiet

(and getting an error)

Full log:
unet_19_09_error.log

@arjunsuresh (Contributor)

No worries. I have added some extra checks for existing stale files. Can you please do cm pull repo and just repeat the 4th command (both inside the container)?
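
That is, roughly (inside the container):

cm pull repo
cm run script --tags=run-mlperf,inference,_r4.1-dev --model=3d-unet-99 --implementation=nvidia --framework=tensorrt --category=edge --scenario=Offline --execution_mode=valid --device=cuda --quiet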

@Agalakdak (Author)

Hi @arjunsuresh, I repeated all the commands as above and got the same result.

unet_24_09_error.log
