
Issue with 3d-unet #1845

Closed
Agalakdak opened this issue Sep 10, 2024 · 19 comments

@Agalakdak

Hello everyone. I have already submitted a bug report here, but that thread accumulated a lot of messages, so I decided to open a new one.
This time I ran 3D-UNet using the command below, taken from
https://docs.mlcommons.org/inference/benchmarks/medical_imaging/3d-unet/

The command
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev
--model=3d-unet-99
--implementation=nvidia
--framework=tensorrt
--category=edge
--scenario=Offline
--execution_mode=test
--device=cuda
--docker --quiet
--test_query_count=50

and a brief error report:
0.580 INFO:root: ! cd /home/cmuser/CM/repos/local/cache/5103bc0a39b8472f
0.580 INFO:root: ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/get-git-repo/run.sh from tmp-run.sh
0.584 /home/cmuser/CM/repos/local/cache/5103bc0a39b8472f
0.585 ******************************************************
0.585 Current directory: /home/cmuser/CM/repos/local/cache/5103bc0a39b8472f
0.585
0.585 Cloning inference from https://github.com/mlcommons/inference
0.585
0.585 git clone -b master https://github.com/mlcommons/inference --depth 5 inference
0.585
0.586 Cloning into 'inference'...
38.68 fatal: the remote end hung up unexpectedly
38.69 fatal: early EOF
38.69 fatal: index-pack failed
38.69 Detected version: 3.8.10
38.69 Detected version: 3.8.10
38.69
38.69 CM error: Portable CM script failed (name = get-git-repo, return code = 256)
38.69
38.69
38.69 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
38.69 Note that it is often a portability issue of a third-party tool or a native script
38.69 wrapped and unified by this CM script (automation recipe). Please re-run
38.69 this script with --repro flag and report this issue with the original
38.69 command line, cm-repro directory and full log here:
38.69
38.69 https://github.com/mlcommons/cm4mlops/issues
38.69
38.69 The CM concept is to collaboratively fix such issues inside portable CM scripts
38.69 to make existing tools and native scripts more portable, interoperable
38.69 and deterministic. Thank you!

Full log with the problem

error_3dunet.log

@arjunsuresh (Contributor)

That looks like a GitHub connection issue. Can you please retry?
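
If it keeps failing, a quick way to check whether GitHub is actually reachable from the environment where CM runs the clone (a generic network check, not a CM-specific fix) is something like:

curl -I https://github.com
git clone -b master --depth 1 https://github.com/mlcommons/inference

The second command retries the same clone the failing script attempted, with the shallowest possible history to minimize the transfer size.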

@Agalakdak (Author)

Hi @arjunsuresh, thanks for the advice. I tried several times (about 4-5), and in 4 out of 5 cases I got the error shown above after ~2000-3000 seconds.

I did eventually get some results (for other models). Please answer the questions below.

  1. My results:

    bert-99
    +---------+----------+----------+------------+-----------------+
    | Model   | Scenario | Accuracy | Throughput | Latency (in ms) |
    +---------+----------+----------+------------+-----------------+
    | bert-99 | Offline  | 90.16951 | X 1764.6   | -               |
    +---------+----------+----------+------------+-----------------+

resnet50

+----------+----------+----------+------------+-----------------+
| Model    | Scenario | Accuracy | Throughput | Latency (in ms) |
+----------+----------+----------+------------+-----------------+
| resnet50 | Offline  | 76.034   | 19709.5    | -               |
+----------+----------+----------+------------+-----------------+

How should I interpret these results, and what should I compare them with? I found some tables at https://mlcommons.org/benchmarks/inference-edge/. Did I understand correctly that Throughput is the equivalent of "Samples" there? And what should I do with "Accuracy"?

I also wanted to run resnet50 in SingleStream mode, but I got an error.
The command: cm run script --tags=run-mlperf,inference,_r4.1-dev --model=resnet50 --implementation=nvidia --framework=tensorrt --category=edge --scenario=SingleStream --execution_mode=valid --device=cuda --quiet

I took the command from https://docs.mlcommons.org/inference/benchmarks/language/bert/#__tabbed_59_3

Log with error:
resnet50_error.log

@Agalakdak (Author)

@arjunsuresh Hi. I ran the command inside the container.
cm run script --tags=run-mlperf,inference,_r4.1-dev --model=3d-unet-99 --implementation=nvidia --framework=tensorrt --category=edge --scenario=Offline --execution_mode=valid --device=cuda --quiet

And I got an error
3dunet_error.log

@arjunsuresh (Contributor)

@Agalakdak Can you please open a separate issue for each model related query?

For R50, can you try adding this option: --env.SKIP_POLICIES=1

For the failing 3d-unet run, can you please add --docker_cache=no to rule out any issue with a stale Docker cache?

When running in the "closed" division, accuracy must be above the threshold or the submission checker will fail. For this reason, accuracy is not reported in the official results: the accuracy values of all closed-division submissions are expected to be very close, so only the performance numbers matter.

For throughput: yes, it is "samples per second" for most benchmarks and "tokens per second" for the LLM ones.
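
For reference, the earlier commands with these options appended would look roughly like this (everything else unchanged):

cm run script --tags=run-mlperf,inference,_r4.1-dev --model=resnet50 --implementation=nvidia --framework=tensorrt --category=edge --scenario=SingleStream --execution_mode=valid --device=cuda --quiet --env.SKIP_POLICIES=1

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev --model=3d-unet-99 --implementation=nvidia --framework=tensorrt --category=edge --scenario=Offline --execution_mode=test --device=cuda --docker --docker_cache=no --quiet --test_query_count=50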

@Agalakdak (Author)

Hi @arjunsuresh, sorry for the late reply; I was busy with other things. I tried your advice and unfortunately got errors again, but this time I can provide the full logs of both steps.

The first step is entering the command that takes me into the container:
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev
--model=3d-unet-99
--implementation=nvidia
--framework=tensorrt
--category=edge
--scenario=Offline
--execution_mode=test
--device=cuda
--docker --quiet
--test_query_count=50

The second step is entering the command inside the container itself:
cm run script --tags=run-mlperf,inference,_r4.1-dev
--model=sdxl
--implementation=nvidia
--framework=tensorrt
--category=edge
--scenario=Offline
--execution_mode=valid
--device=cuda
--quiet

3dunet_full_error_second_step.log
3dunet_full_error_first_step.log

@arjunsuresh (Contributor)

Hi @Agalakdak The second command you shared is for sdxl, but the logs are for 3d-unet. Is SDXL working fine? Let me check 3d-unet on my end.

@arjunsuresh (Contributor)

It's working fine for me.

make preprocess_data BENCHMARKS='3d-unet'
/home/cmuser/CM/repos/local/cache/4ea5dceee2464cb7/repo/closed/NVIDIA/code/3d-unet/tensorrt/preprocess_data.py:37: DeprecationWarning: Please use `zoom` from the `scipy.ndimage` namespace, the `scipy.ndimage.interpolation` namespace is deprecated.
  from scipy.ndimage.interpolation import zoom
Preprocessing /home/cmuser/CM/repos/local/cache/b1f8faeaa7384886/data/KiTS19/kits19/data...
Saved /home/cmuser/CM/repos/local/cache/b1f8faeaa7384886/preprocessed_data/KiTS19/reference/case_00012.pkl -- shape (1, 256, 320, 320) mean [-1.8] std [1.05]
Saved /home/cmuser/CM/repos/local/cache/b1f8faeaa7384886/preprocessed_data/KiTS19/reference/case_00044.pkl -- shape (1, 320, 384, 384) mean [-1.86] std [1.05]
Saved /home/cmuser/CM/repos/local/cache/b1f8faeaa7384886/preprocessed_data/KiTS19/reference/case_00024.pkl -- shape (1, 256, 256, 256) mean [-1.66] std [1.17]
...

What I suspect is a failure in the download of the kits19 dataset, as the NVIDIA script below skips the re-download if the file already exists, without checking its validity.

https://github.com/mlcommons/inference_results_v4.0/blob/main/closed/NVIDIA/code/3d-unet/tensorrt/download_data.sh#L20

The command below will give you the path to the NVIDIA scratch space where the data gets downloaded. Manually remove the kits19 data directory from there and then retry the command.

cm run script "get mlperf inference nvidia scratch space _version.4_0" -j

@arjunsuresh (Contributor)

Meanwhile, the kits19 download is slow and can take several hours to complete.

@Agalakdak (Author)

Hello @arjunsuresh. I think I figured out what the problem is: it's a network issue...
Below is the log:
case_00299: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 280551/280552 [00:48<00:00, 5728.77KB/s]
Duplicating KITS19 case_00185 as case_00400...
~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA
Done.
Downloading JSON files describing subset used for inference/calibration...
--2024-09-17 02:52:31-- https://raw.githubusercontent.com/mlcommons/inference/486a629ea4d5c5150f452d0b0a196bf71fd2021e
Connecting to my_proxy_ip:8080... connected.
Proxy request sent, awaiting response... 400 Bad Request
2024-09-17 02:52:32 ERROR 400: Bad Request.

--2024-09-17 02:52:32-- http://92dd3d24cf78d07aa31165f90c636d98c4adddcd/vision/medical_imaging/3d-unet-kits19/meta/inference_cases.json
Connecting to my_proxy_ip:8080 connected.
Proxy request sent, awaiting response... 404 No such domain
2024-09-17 02:52:32 ERROR 404: No such domain.

--2024-09-17 02:52:32-- https://raw.githubusercontent.com/mlcommons/inference/486a629ea4d5c5150f452d0b0a196bf71fd2021e
Connecting to my_proxy_ip:8080... connected.
Proxy request sent, awaiting response... 400 Bad Request
2024-09-17 02:52:32 ERROR 400: Bad Request.

--2024-09-17 02:52:32-- http://92dd3d24cf78d07aa31165f90c636d98c4adddcd/vision/medical_imaging/3d-unet-kits19/meta/calibration_cases.json
Connecting to my_proxy_ip:8080... connected.
Proxy request sent, awaiting response... 404 No such domain
2024-09-17 02:52:33 ERROR 404: No such domain.

Done.
Finished downloading all the datasets!
/home/cmuser/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA/code/3d-unet/tensorrt/preprocess_data.py:37: DeprecationWarning: Please use zoom from the scipy.ndimage namespace, the scipy.ndimage.interpolation namespace is deprecated.
from scipy.ndimage.interpolation import zoom
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/cmuser/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA/code/3d-unet/tensorrt/preprocess_data.py", line 858, in <module>
    main()
  File "/home/cmuser/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA/code/3d-unet/tensorrt/preprocess_data.py", line 842, in main
    kits19tool = KITS19Tool(args)
  File "/home/cmuser/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA/code/3d-unet/tensorrt/preprocess_data.py", line 117, in __init__
    self.INFER_CASES = json.load(open(self.INFERENCE_CASE_FILE))
  File "/usr/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.8/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
make: *** [/home/cmuser/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA/Makefile.data:36: preprocess_data] Error 1

CM error: Portable CM script failed (name = app-mlperf-inference-nvidia, return code = 256)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:

https://github.com/mlcommons/cm4mlops/issues

The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
and deterministic. Thank you!
cmuser@85f58939130e

@arjunsuresh (Contributor)

@Agalakdak Actually, that looks like a problem with the download script: it is constructing invalid URLs. It probably worked fine for me because some of the downloaded files were already present. We'll fix this issue in the script.
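
As a possible manual workaround until the script is fixed (this is an assumption, not a verified fix), the two JSON files that the failed wget calls were trying to fetch could be downloaded directly from the mlcommons/inference repository; the raw.githubusercontent.com paths below are inferred from the file names in your log, and the branch or commit to use may differ:

wget https://raw.githubusercontent.com/mlcommons/inference/master/vision/medical_imaging/3d-unet-kits19/meta/inference_cases.json
wget https://raw.githubusercontent.com/mlcommons/inference/master/vision/medical_imaging/3d-unet-kits19/meta/calibration_cases.json

They would then need to be placed wherever preprocess_data.py expects INFERENCE_CASE_FILE and the calibration file to live.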

@Agalakdak (Author)

@arjunsuresh If you need more information about my system, please let me know.

Agalakdak closed this as not planned on Sep 17, 2024
@arjunsuresh (Contributor)

Hi @Agalakdak Can you please do this (inside the container):

cd ~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA
git pull
cm rm cache --tags=_download_data -f

And retry the command?

@Agalakdak (Author)

Agalakdak commented Sep 18, 2024

@arjunsuresh Hi, I tried the advice above, but it didn't help. There aren't many logs, so I just pasted them below.

cmuser@ccaeef79d72e:~$ cd ~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA
cmuser@ccaeef79d72e:~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA$ git pull
remote: Enumerating objects: 21, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 13 (delta 10), reused 0 (delta 0), pack-reused 0 (from 0)
Unpacking objects: 100% (13/13), 2.32 KiB | 339.00 KiB/s, done.
From https://github.com/GATEOverflow/inference_results_v4.0
c032f835c..7abca22ba main -> origin/main
Updating c032f835c..7abca22ba
Fast-forward
closed/NVIDIA/Makefile.build | 2 --
closed/NVIDIA/code/3d-unet/tensorrt/download_data.sh | 4 ++--
2 files changed, 2 insertions(+), 4 deletions(-)
cmuser@ccaeef79d72e:~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA$ cm rm cache --tags=_download_data -f

CM error: artifact(s) not found!
cmuser@ccaeef79d72e:~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA$

...

cmuser@ccaeef79d72e:~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA$ source /cm-venv/bin/activate
(cm-venv) cmuser@ccaeef79d72e:~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA$ cm rm cache --tags=_download_data -f

CM error: artifact(s) not found!

@arjunsuresh (Contributor)

Hi @Agalakdak Can you please try cm rm cache --tags=_preprocess_data -f instead?

@Agalakdak (Author)

@arjunsuresh Hi, I tried to run the commands in this order:

  1. cd ~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA
  2. git pull
  3. cm rm cache --tags=_download_data -f
    On step 3, I got the error "cm: command not found".

I tried to run "cm rm cache --tags=_preprocess_data -f" right after entering the container. And the command completed successfully. But it did not give any result.

unet_error.log

@arjunsuresh (Contributor)

Can you retry the original command? No need to do command number 3.

@Agalakdak (Author)

@arjunsuresh Hi, I am constantly busy with other tasks, so I cannot always collect the necessary logs promptly. The log below covers:

  1. Running the command
    cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev
    --model=3d-unet-99
    --implementation=nvidia
    --framework=tensorrt
    --category=edge
    --scenario=Offline
    --execution_mode=test
    --device=cuda
    --docker --quiet
    --test_query_count=50
    (and getting an error)

  2. Running the command
    cd ~/CM/repos/local/cache/28c214f878cb4afe/repo/closed/NVIDIA

  3. Running the command
    git pull

  4. Running the command

cm run script --tags=run-mlperf,inference,_r4.1-dev
--model=3d-unet-99
--implementation=nvidia
--framework=tensorrt
--category=edge
--scenario=Offline
--execution_mode=valid
--device=cuda
--quiet

(and getting an error)

Full log:
unet_19_09_error.log

@arjunsuresh (Contributor)

No worries. I have added some extra checks for existing stale files. Can you please do cm pull repo and just repeat the 4th command (both inside the container)?
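
That is, roughly (inside the container):

cm pull repo
cm run script --tags=run-mlperf,inference,_r4.1-dev --model=3d-unet-99 --implementation=nvidia --framework=tensorrt --category=edge --scenario=Offline --execution_mode=valid --device=cuda --quiet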

@Agalakdak (Author)

Hi @arjunsuresh, I repeated all the commands as above and got the same result.

unet_24_09_error.log
