This repository has been archived by the owner on Sep 6, 2022. It is now read-only.

RAM OOM Problem #16

Open

kjh21212 opened this issue Apr 14, 2020 · 11 comments

Comments

@kjh21212

kjh21212 commented Apr 14, 2020

When I run your code, I get a RAM OOM during the eval part, and I don't know why this happens. My desktop has 128 GB of RAM and I'm using 4 GPUs. Memory usage increases with every eval batch. Also, batch processing with 4 GPUs is slower than with a single GPU.

@prajwaljpj

@kjh21212 I'm facing the same RAM issue. Were you able to solve it?

@NAM-hj

NAM-hj commented Apr 23, 2020

I have the same issue.
My system is:

RAM : 128GB
GPU : GTX 1080ti * 4
OS : ubuntu 18.04
NVIDIA Driver : 440.82
CUDA : 10.1
CUDNN : 7.6.5
python : 3.6.9
tensorflow & tensorflow-gpu : 2.1.0
(And I did not change any parameters in run_common_voice.py.)

When I run run_common_voice.py, I see the following:

  1. At the 0th epoch:
    eval_step runs with a retracing warning, and then I get the OOM error.

  2. With evaluation disabled at the 0th epoch:
    2-1. When there is a retracing warning (slow):
    Epoch: 0, Batch: 60, Global Step: 60, Step Time: 26.0310, Loss: 165.6244
    2-2. When there is no retracing warning (fast):
    Epoch: 0, Batch: 62, Global Step: 62, Step Time: 6.3741, Loss: 164.6387

    Then I get the OOM error after this line:
    Epoch: 0, Batch: 226, Global Step: 226, Step Time: 5.9092, Loss: 142.7257
    ...

I think tf.function (or something related to it) affects the training speed.

Does the retracing warning have a connection with the OOM error?
--> If so, how can I fix the retracing warning?
--> If not, how can I fix the OOM error?
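
For context, a minimal sketch of what typically triggers this kind of retracing warning (the shapes and the eval_step body below are assumptions, not this repo's actual code): a tf.function builds and keeps a new graph for every new input shape it sees, and variable-length audio batches produce many different shapes. Pinning an input_signature with None for the variable dimensions keeps it to a single trace:

import tensorflow as tf

# Hypothetical eval_step, not the repo's implementation. Without an
# input_signature, @tf.function retraces (and keeps another graph in memory)
# for every new batch shape; with None for the variable dimensions it is
# traced once and reused.
@tf.function(input_signature=[
    tf.TensorSpec(shape=[None, None, 80], dtype=tf.float32),  # features: batch x time x mel bins (assumed)
    tf.TensorSpec(shape=[None, None], dtype=tf.int32),        # labels: batch x label length (assumed)
])
def eval_step(features, labels):
    # Placeholder body; the real step would run the model and compute metrics.
    return tf.cast(tf.shape(features)[0], tf.float32)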

@nglehuy

nglehuy commented May 10, 2020

@Nambee Seems like there's something going on with GradientTape, RNN layers, or TFRecords. I implemented DeepSpeech2 with a TFRecord dataset in Keras; when I trained it with the .fit function there was no OOM error, but when I trained it with GradientTape the memory kept going up until OOM. However, when I trained SEGAN (no recurrent network, only Conv) with a generator dataset using GradientTape, it worked fine.
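
For reference, a minimal sketch of the two training styles being compared here (placeholder model and dataset names, not the actual DeepSpeech2 code):

import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(model, features, labels):
    # Custom GradientTape step: the style where memory reportedly keeps
    # growing when fed from a TFRecord dataset.
    with tf.GradientTape() as tape:
        logits = model(features, training=True)
        loss = loss_fn(labels, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# Custom loop:
#     for features, labels in tfrecord_dataset:
#         train_step(model, features, labels)
#
# Keras-managed loop (the case reported to be OOM-free):
#     model.compile(optimizer=optimizer, loss=loss_fn)
#     model.fit(tfrecord_dataset, epochs=1)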

@noahchalifour
Owner

Please try again with the latest commit. I have updated it to use TensorFlow 2.2.0 and solved the retracing issue.

@stefan-falk

@noahchalifour I just executed the current repository code with one GPU and am also running into the OOM error, using a GeForce GTX 1080 Ti card.

@nglehuy

nglehuy commented Jun 15, 2020

I have figured out that if we use tf.data.TFRecordDataset, then wrapping the whole dataset loop with @tf.function can avoid the RAM OOM (and also train faster), like:

@tf.function
def train():
    for batch in train_dataset:
        train_step(batch)

The downside of this trick is that we can't use native Python functions or TF functions that aren't implemented in graph mode (like tf.train.Checkpoint.save()). However, we can use tf.py_function or tf.numpy_function to run them, but then we have to run tf.distribute.Server if we want to train on multiple GPUs; this limitation is mentioned here: https://www.tensorflow.org/api_docs/python/tf/numpy_function?hl=en
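
One simple way to live with that limitation, assuming a checkpoint at epoch boundaries is good enough (the names model, optimizer, train_step, train_dataset, and num_epochs below are placeholders, not this repo's objects): keep only the batch loop inside the tf.function and do the saving in eager Python between epochs:

import tensorflow as tf

@tf.function
def train_epoch(dataset):
    # Whole-epoch loop traced into a single graph, as described above.
    for batch in dataset:
        train_step(batch)

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, "./checkpoints", max_to_keep=3)

for epoch in range(num_epochs):
    train_epoch(train_dataset)
    manager.save()  # runs eagerly here, so checkpoint saving works as usual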

@stefan-falk

@usimarit Are you able to train/use the model? I can only afford a very small batch size (4-8 samples) when running on a single GeForce 1080 Ti (~11 GB RAM), and I am not even sure if it's working.

How long did you have to train your model?

@nglehuy

nglehuy commented Jul 15, 2020

> @usimarit Are you able to train/use the model? I can only afford a very small batch size (4-8 samples) when running on a single GeForce 1080 Ti (~11 GB RAM), and I am not even sure if it's working.
>
> How long did you have to train your model?

I guess a small batch size is normal for ASR models. I trained a CTC model on an RTX 2080 Ti (11 GB) with a dataset of about 300 hours, and it took 3 days for 12 epochs with batch size 4.
But this issue is about RAM OOM, not GPU VRAM OOM :)) I've tested multiple times with TFRecordDataset, and it seems there is some bug when iterating over it with a for loop.

@stefan-falk

@usimarit Oh, I misinterpreted the issue then.

Yeah, that's the batch size I am using too. I didn't expect such a small batch size to work out :)

@malixian

malixian commented Nov 14, 2020

> Please try again with the latest commit. I have updated it to use TensorFlow 2.2.0 and solved the retracing issue.

@noahchalifour But I'm still facing the problem, even when using TensorFlow 2.2.0 and the latest commit.

@malixian

malixian commented Nov 14, 2020

> I have figured out that if we use tf.data.TFRecordDataset, then wrapping the whole dataset loop with @tf.function can avoid the RAM OOM (and also train faster), like:
>
> @tf.function
> def train():
>     for batch in train_dataset:
>         train_step(batch)
>
> The downside of this trick is that we can't use native Python functions or TF functions that aren't implemented in graph mode (like tf.train.Checkpoint.save()). However, we can use tf.py_function or tf.numpy_function to run them, but then we have to run tf.distribute.Server if we want to train on multiple GPUs; this limitation is mentioned here: https://www.tensorflow.org/api_docs/python/tf/numpy_function?hl=en

@usimarit I have tried it, but it still doesn't work.
