This repository has been archived by the owner on Sep 6, 2022. It is now read-only.

RAM OOM Problem #16

Open

kjh21212 opened this issue Apr 14, 2020 · 11 comments

Comments

@kjh21212

kjh21212 commented Apr 14, 2020

When I run your code, I get a RAM OOM during the eval part, and I don't know why this happens. My desktop has 128 GB of RAM and I'm using 4 GPUs. Memory usage increases with every eval batch. Also, batch processing with 4 GPUs is slower than with a single GPU.

@prajwaljpj

@kjh21212 I'm facing the same RAM issue. Were you able to solve it?

@NAM-hj

NAM-hj commented Apr 23, 2020

I have the same issue.
My system is:

RAM : 128GB
GPU : GTX 1080ti * 4
OS : ubuntu 18.04
NVIDIA Driver : 440.82
CUDA : 10.1
CUDNN : 7.6.5
python : 3.6.9
tensorflow & tensorflow-gpu : 2.1.0
(And I did not change any parameters in run_common_voice.py.)

When I run run_common_voice.py, I see the following:

  1. At the 0th epoch:
    eval_step runs with a retracing warning, and then I get the OOM error.

  2. With evaluation disabled at the 0th epoch:
    2-1. When there is a retracing warning (slow):
    Epoch: 0, Batch: 60, Global Step: 60, Step Time: 26.0310, Loss: 165.6244
    2-2. When there is no retracing warning (fast):
    Epoch: 0, Batch: 62, Global Step: 62, Step Time: 6.3741, Loss: 164.6387

    Then I get the OOM error after this line:
    Epoch: 0, Batch: 226, Global Step: 226, Step Time: 5.9092, Loss: 142.7257
    ...

I think tf.function (or something related to it) affects the training speed.

Does the retracing warning have a connection with the OOM error?
--> If so, how can I fix the retracing warning?
--> If not, how can I fix the OOM error?
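
For context, a minimal sketch of what typically triggers this kind of retracing warning (the shapes and the eval_step body below are assumptions, not this repo's actual code): a tf.function builds and keeps a new graph for every new input shape it sees, and variable-length audio batches produce many different shapes. Pinning an input_signature with None for the variable dimensions keeps it to a single trace:

import tensorflow as tf

# Hypothetical eval_step, not the repo's implementation. Without an
# input_signature, @tf.function retraces (and keeps another graph in memory)
# for every new batch shape; with None for the variable dimensions it is
# traced once and reused.
@tf.function(input_signature=[
    tf.TensorSpec(shape=[None, None, 80], dtype=tf.float32),  # features: batch x time x mel bins (assumed)
    tf.TensorSpec(shape=[None, None], dtype=tf.int32),        # labels: batch x label length (assumed)
])
def eval_step(features, labels):
    # Placeholder body; the real step would run the model and compute metrics.
    return tf.cast(tf.shape(features)[0], tf.float32)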

@nglehuy

nglehuy commented May 10, 2020

@Nambee Seems like there's something going on with GradientTape, RNN layers, or TFRecords. I implemented DeepSpeech2 with a TFRecord dataset in Keras; when I trained it with the .fit function there was no OOM error, but when I trained it with GradientTape the memory kept going up until OOM. However, when I trained SEGAN (no recurrent network, only Conv) with a generator dataset using GradientTape, it worked fine.
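
For reference, a minimal sketch of the two training styles being compared here (placeholder model and dataset names, not the actual DeepSpeech2 code):

import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(model, features, labels):
    # Custom GradientTape step: the style where memory reportedly keeps
    # growing when fed from a TFRecord dataset.
    with tf.GradientTape() as tape:
        logits = model(features, training=True)
        loss = loss_fn(labels, logits)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# Custom loop:
#     for features, labels in tfrecord_dataset:
#         train_step(model, features, labels)
#
# Keras-managed loop (the case reported to be OOM-free):
#     model.compile(optimizer=optimizer, loss=loss_fn)
#     model.fit(tfrecord_dataset, epochs=1)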

@noahchalifour
Owner

Please try again with the latest commit. I have updated it to use TensorFlow 2.2.0 and solved the retracing issue.

@stefan-falk

@noahchalifour I just executed the current repository code with one GPU and am also running into the OOM error, using a GeForce GTX 1080 Ti card.

@nglehuy

nglehuy commented Jun 15, 2020

I have figured out that if we use tf.data.TFRecordDataset, then wrapping the whole dataset loop with @tf.function can avoid the RAM OOM (and also train faster), like:

@tf.function
def train():
    for batch in train_dataset:
        train_step(batch)

The downside of this trick is that we can't use native Python functions or TF functions that aren't implemented in graph mode (like tf.train.Checkpoint.save()). However, we can use tf.py_function or tf.numpy_function to run them, but then we have to run tf.distribute.Server if we want to train on multiple GPUs; this limitation is mentioned here: https://www.tensorflow.org/api_docs/python/tf/numpy_function?hl=en
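
One simple way to live with that limitation, assuming a checkpoint at epoch boundaries is good enough (the names model, optimizer, train_step, train_dataset, and num_epochs below are placeholders, not this repo's objects): keep only the batch loop inside the tf.function and do the saving in eager Python between epochs:

import tensorflow as tf

@tf.function
def train_epoch(dataset):
    # Whole-epoch loop traced into a single graph, as described above.
    for batch in dataset:
        train_step(batch)

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, "./checkpoints", max_to_keep=3)

for epoch in range(num_epochs):
    train_epoch(train_dataset)
    manager.save()  # runs eagerly here, so checkpoint saving works as usual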

@stefan-falk

@usimarit Are you able to train/use the model? I can only afford a very small batch size (4-8 samples) when running on a single GeForce 1080 Ti (~11 GB RAM), and I am not even sure if it's working.

How long did you have to train your model?

@nglehuy

nglehuy commented Jul 15, 2020

> @usimarit Are you able to train/use the model? I can only afford a very small batch size (4-8 samples) when running on a single GeForce 1080 Ti (~11 GB RAM), and I am not even sure if it's working.
>
> How long did you have to train your model?

I guess a small batch size is normal for ASR models. I trained a CTC model on an RTX 2080 Ti (11 GB) with a dataset of about 300 hours, and it took 3 days for 12 epochs with batch size 4.
But this issue is about RAM OOM, not GPU VRAM OOM :)) I've tested multiple times with TFRecordDataset, and it seems there is some bug when iterating over it with a for loop.

@stefan-falk

@usimarit Oh, I misinterpreted the issue then.

Yeah, that's the batch size I am using too. I didn't expect such a small batch size to work out :)

@malixian

malixian commented Nov 14, 2020

> Please try again with the latest commit. I have updated it to use TensorFlow 2.2.0 and solved the retracing issue.

@noahchalifour But I'm still facing the problem, even when using TensorFlow 2.2.0 and the latest commit.

@malixian

malixian commented Nov 14, 2020

> I have figured out that if we use tf.data.TFRecordDataset, then wrapping the whole dataset loop with @tf.function can avoid the RAM OOM (and also train faster), like:
>
> @tf.function
> def train():
>     for batch in train_dataset:
>         train_step(batch)
>
> The downside of this trick is that we can't use native Python functions or TF functions that aren't implemented in graph mode (like tf.train.Checkpoint.save()). However, we can use tf.py_function or tf.numpy_function to run them, but then we have to run tf.distribute.Server if we want to train on multiple GPUs; this limitation is mentioned here: https://www.tensorflow.org/api_docs/python/tf/numpy_function?hl=en

@usimarit I have tried it, but it still doesn't work.
