
[Issue]: High VRAM usage during Vae step #3416

Open
zaxwashere opened this issue Sep 10, 2024 · 15 comments
Labels: cannot reproduce (Reported issue cannot be easily reproducible), help wanted (Extra attention is needed)

Comments

@zaxwashere

Issue Description

VRAM usage during the VAE step is inconsistent and will spike to >12 GB for an SDXL model. This is atypical for my usage, where an SDXL model stays at 10 GB or less during the VAE step with all of my settings applied:

  • fp16 mode, vae slicing and vae tiling = true, vae upcast = false (see the sketch below)
  • 1024x1024, 10 steps, DPM++ 2M, SDXL timestep presets, cfg = 3, no attention guidance, no loras applied
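
For reference, here is a minimal diffusers sketch of roughly what those settings correspond to; the model repos, prompt, and wiring are placeholders for illustration, since SD.Next applies the equivalent options through its own settings rather than this code.

```python
# Minimal sketch (not SD.Next's code) of the settings above in plain diffusers.
# Model repos and prompt are placeholders for illustration only.
import torch
from diffusers import AutoencoderKL, StableDiffusionXLPipeline

# fp16-fixed SDXL VAE; its config ships with force_upcast=False ("vae upcast = false")
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",  # placeholder SDXL checkpoint
    vae=vae,
    torch_dtype=torch.float16,                   # "fp16 mode"
).to("cuda")

pipe.enable_vae_slicing()   # "vae slicing = true": decode batched latents one image at a time
pipe.enable_vae_tiling()    # "vae tiling = true": decode large latents in tiles to cap VRAM

image = pipe(
    "test prompt",
    width=1024,
    height=1024,
    num_inference_steps=10,
    guidance_scale=3.0,
).images[0]
image.save("test.png")
```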

Disabling "use cached model config when available" removes the issue, and generation times return to 8-10 seconds.

VRAM usage reported in the console does not reflect the usage seen in task manager or in the webui; attached is a screenshot of the VRAM usage during a run.
sdnext (1).log

[screenshot: VRAM usage during the run]

Version Platform Description

13:30:50-670748 INFO Logger: file="C:\Users\zaxof\OneDrive\Documents\GitHub\nvidia_sdnext\sdnext.log" level=DEBUG
size=65 mode=create
13:30:50-672246 INFO Python version=3.10.6 platform=Windows
bin="C:\Users\zaxof\OneDrive\Documents\GitHub\nvidia_sdnext\venv\Scripts\python.exe"
venv="C:\Users\zaxof\OneDrive\Documents\GitHub\nvidia_sdnext\venv"
13:30:50-859782 INFO Version: app=sd.next updated=2024-09-10 hash=91bdd3b3 branch=dev
url=https://github.com/vladmandic/automatic.git/tree/dev ui=dev
13:30:51-186334 INFO Updating main repository
13:30:52-008006 INFO Upgraded to version: 91bdd3b Tue Sep 10 19:20:49 2024 +0300
13:30:52-015505 INFO Platform: arch=AMD64 cpu=AMD64 Family 25 Model 33 Stepping 2, AuthenticAMD system=Windows
release=Windows-10-10.0.22631-SP0 python=3.10.6
13:30:52-017006 DEBUG Setting environment tuning
13:30:52-018506 INFO HF cache folder: C:\Users\zaxof\.cache\huggingface\hub
13:30:52-019506 DEBUG Torch allocator: "garbage_collection_threshold:0.80,max_split_size_mb:512"
13:30:52-026016 DEBUG Torch overrides: cuda=False rocm=False ipex=False diml=False openvino=False
13:30:52-027513 DEBUG Torch allowed: cuda=True rocm=True ipex=True diml=True openvino=True
13:30:52-037517 INFO nVidia CUDA toolkit detected: nvidia-smi present

Extensions : Extensions all: ['a1111-sd-webui-tagcomplete', 'adetailer', 'OneButtonPrompt',
'sd-civitai-browser-plus_fix', 'sd-webui-infinite-image-browsing', 'sd-webui-inpaint-anything',
'sd-webui-prompt-all-in-one']

Windows 11, RTX 3060 12gb, 5700x3d, 64gb ddr4, dev branch SDNEXT, firefox browser on desktop, chrome on android for remote access.

Relevant log output

No response

Backend

Diffusers

UI

Standard

Branch

Dev

Model

StableDiffusion XL

Acknowledgements

  • I have read the above and searched for existing issues
  • I confirm that this is classified correctly and it's not an extension issue
@vladmandic
Owner

i cannot reproduce. i've added some extra logging; please set the env variable SD_VAE_DEBUG=true and run. note that sdnext should be restarted after changing the "use cached model config" option.
post logs starting with TRACE for both runs.

@vladmandic vladmandic added question Further information is requested cannot reproduce Reported issue cannot be easily reproducible labels Sep 11, 2024
@zaxwashere
Author

re-ran it with the env variable active.

13:51:08-388186 DEBUG    Sampler: sampler="DPM++ 2M" config={'num_train_timesteps': 1000, 'beta_start': 0.00085,
                         'beta_end': 0.012, 'beta_schedule': 'scaled_linear', 'prediction_type': 'epsilon',
                         'thresholding': False, 'sample_max_value': 1.0, 'algorithm_type': 'sde-dpmsolver++',
                         'solver_type': 'midpoint', 'lower_order_final': True, 'use_karras_sigmas': False,
                         'final_sigmas_type': 'zero', 'timestep_spacing': 'leading', 'solver_order': 2}
13:51:08-390185 DEBUG    Sampler: steps=10 timesteps=[999, 845, 730, 587, 443, 310, 193, 116, 53, 13]
13:51:08-392183 DEBUG    Torch generator: device=cuda seeds=[1158966623]
13:51:08-393184 DEBUG    Diffuser pipeline: StableDiffusionXLPipeline task=DiffusersTaskType.TEXT_2_IMAGE batch=1/1x1
                         set={'timesteps': [999, 845, 730, 587, 443, 310, 193, 116, 53, 13], 'prompt_embeds':
                         torch.Size([1, 154, 2048]), 'pooled_prompt_embeds': torch.Size([1, 1280]),
                         'negative_prompt_embeds': torch.Size([1, 154, 2048]), 'negative_pooled_prompt_embeds':
                         torch.Size([1, 1280]), 'guidance_scale': 3, 'num_inference_steps': 20, 'eta': 1.0,
                         'guidance_rescale': 0.7, 'denoising_end': None, 'output_type': 'latent', 'width': 1024,
                         'height': 1024, 'parser': 'Full parser'}
Progress  1.49it/s █████████████████████████████████ 100% 10/10 00:06 00:00 Base
13:51:15-342317 DEBUG    GC: utilization={'gpu': 71, 'ram': 3, 'threshold': 80} gc={'collected': 386, 'saved': 0.66}
                         before={'gpu': 8.55, 'ram': 2.16} after={'gpu': 7.89, 'ram': 2.16, 'retries': 0, 'oom': 0}
                         device=cuda fn=full_vae_decode time=0.22
13:51:15-857910 TRACE    VAE config: FrozenDict([('in_channels', 3), ('out_channels', 3), ('down_block_types',
                         ['DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D']),
                         ('up_block_types', ['UpDecoderBlock2D', 'UpDecoderBlock2D', 'UpDecoderBlock2D',
                         'UpDecoderBlock2D']), ('block_out_channels', [128, 256, 512, 512]), ('layers_per_block', 2),
                         ('act_fn', 'silu'), ('latent_channels', 4), ('norm_num_groups', 32), ('sample_size', 1024),
                         ('scaling_factor', 0.13025), ('shift_factor', None), ('latents_mean', None), ('latents_std',
                         None), ('force_upcast', True), ('use_quant_conv', True), ('use_post_quant_conv', True),
                         ('mid_block_add_attention', True), ('_use_default_values', ['latents_std',
                         'use_post_quant_conv', 'mid_block_add_attention', 'latents_mean', 'shift_factor',
                         'use_quant_conv']), ('_class_name', 'AutoencoderKL'), ('_diffusers_version', '0.20.0.dev0'),
                         ('_name_or_path', '../sdxl-vae/')])
13:51:15-861410 TRACE    VAE memory: defaultdict(<class 'int'>, {'retries': 0, 'oom': 0, 'free': 137363456, 'total':
                         12884377600, 'active': 2672, 'active_peak': 9631253504, 'reserved': 11605639168,
                         'reserved_peak': 11725176832, 'used': 12747014144})
13:51:15-863413 TRACE    VAE decode: name=fixFP16ErrorsSDXLLowerMemoryUse_v10.safetensors dtype=torch.float16
                         upcast=False images=1 latents=torch.Size([1, 4, 128, 128]) time=0.741
13:51:16-051942 DEBUG    Profile: VAE decode: 0.93
13:51:16-298983 DEBUG    GC: utilization={'gpu': 99, 'ram': 3, 'threshold': 80} gc={'collected': 254, 'saved': 3.97}
                         before={'gpu': 11.87, 'ram': 2.16} after={'gpu': 7.9, 'ram': 2.16, 'retries': 0, 'oom': 0}
                         device=cuda fn=vae_decode time=0.25
13:51:16-343487 INFO     Save: image="outputs\text\06720-novaAnimeXL_ponyV40-Score 9 score 8 up score 7 up.jpg"
                         type=JPEG width=1024 height=1024 size=133251
13:51:16-345488 INFO     Processed: images=1 time=7.97 its=1.25 memory={'ram': {'used': 2.16, 'total': 63.9}, 'gpu':
                         {'used': 7.9, 'total': 12.0}, 'retries': 0, 'oom': 0}
13:51:22-375787 INFO     Base: class=StableDiffusionXLPipeline
13:51:22-377290 DEBUG    Sampler: sampler="DPM++ 2M" config={'num_train_timesteps': 1000, 'beta_start': 0.00085,
                         'beta_end': 0.012, 'beta_schedule': 'scaled_linear', 'prediction_type': 'epsilon',
                         'thresholding': False, 'sample_max_value': 1.0, 'algorithm_type': 'sde-dpmsolver++',
                         'solver_type': 'midpoint', 'lower_order_final': True, 'use_karras_sigmas': False,
                         'final_sigmas_type': 'zero', 'timestep_spacing': 'leading', 'solver_order': 2}
13:51:22-379288 DEBUG    Sampler: steps=10 timesteps=[999, 845, 730, 587, 443, 310, 193, 116, 53, 13]
13:51:22-381287 DEBUG    Torch generator: device=cuda seeds=[2858245960]
13:51:22-382287 DEBUG    Diffuser pipeline: StableDiffusionXLPipeline task=DiffusersTaskType.TEXT_2_IMAGE batch=1/1x1
                         set={'timesteps': [999, 845, 730, 587, 443, 310, 193, 116, 53, 13], 'prompt_embeds':
                         torch.Size([1, 154, 2048]), 'pooled_prompt_embeds': torch.Size([1, 1280]),
                         'negative_prompt_embeds': torch.Size([1, 154, 2048]), 'negative_pooled_prompt_embeds':
                         torch.Size([1, 1280]), 'guidance_scale': 3, 'num_inference_steps': 20, 'eta': 1.0,
                         'guidance_rescale': 0.7, 'denoising_end': None, 'output_type': 'latent', 'width': 1024,
                         'height': 1024, 'parser': 'Full parser'}
Progress  1.19it/s █████████████████████████████████ 100% 10/10 00:08 00:00 Base
13:51:31-083449 DEBUG    GC: utilization={'gpu': 71, 'ram': 3, 'threshold': 80} gc={'collected': 385, 'saved': 0.57}
                         before={'gpu': 8.46, 'ram': 2.16} after={'gpu': 7.89, 'ram': 2.16, 'retries': 0, 'oom': 0}
                         device=cuda fn=full_vae_decode time=0.22
13:51:36-590122 TRACE    VAE config: FrozenDict([('in_channels', 3), ('out_channels', 3), ('down_block_types',
                         ['DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D']),
                         ('up_block_types', ['UpDecoderBlock2D', 'UpDecoderBlock2D', 'UpDecoderBlock2D',
                         'UpDecoderBlock2D']), ('block_out_channels', [128, 256, 512, 512]), ('layers_per_block', 2),
                         ('act_fn', 'silu'), ('latent_channels', 4), ('norm_num_groups', 32), ('sample_size', 1024),
                         ('scaling_factor', 0.13025), ('shift_factor', None), ('latents_mean', None), ('latents_std',
                         None), ('force_upcast', True), ('use_quant_conv', True), ('use_post_quant_conv', True),
                         ('mid_block_add_attention', True), ('_use_default_values', ['latents_std',
                         'use_post_quant_conv', 'mid_block_add_attention', 'latents_mean', 'shift_factor',
                         'use_quant_conv']), ('_class_name', 'AutoencoderKL'), ('_diffusers_version', '0.20.0.dev0'),
                         ('_name_or_path', '../sdxl-vae/')])
13:51:36-593621 TRACE    VAE memory: defaultdict(<class 'int'>, {'retries': 0, 'oom': 0, 'free': 3857711104, 'total':
                         12884377600, 'active': 2672, 'active_peak': 9631253504, 'reserved': 7885291520,
                         'reserved_peak': 11366563840, 'used': 9026666496})
13:51:36-595622 TRACE    VAE decode: name=fixFP16ErrorsSDXLLowerMemoryUse_v10.safetensors dtype=torch.float16
                         upcast=False images=1 latents=torch.Size([1, 4, 128, 128]) time=5.727
13:51:36-606121 DEBUG    Profile: VAE decode: 5.74
13:51:36-646633 INFO     Save: image="outputs\text\06721-novaAnimeXL_ponyV40-Score 9 score 8 up score 7 up.jpg"
                         type=JPEG width=1024 height=1024 size=150607
13:51:36-648635 INFO     Processed: images=1 time=14.29 its=0.70 memory={'ram': {'used': 2.16, 'total': 63.9}, 'gpu':
                         {'used': 8.41, 'total': 12.0}, 'retries': 0, 'oom': 0}
                         
It is inconsistent: sometimes the VAE is fast, other times it takes almost as long as the whole generation.


Here is a run after a restart with "cached config" unchecked.


14:01:59-877365 INFO     Base: class=StableDiffusionXLPipeline
14:01:59-879361 DEBUG    Sampler: sampler="DPM++ 2M" config={'num_train_timesteps': 1000, 'beta_start': 0.00085,
                         'beta_end': 0.012, 'beta_schedule': 'scaled_linear', 'prediction_type': 'epsilon',
                         'thresholding': False, 'sample_max_value': 1.0, 'algorithm_type': 'sde-dpmsolver++',
                         'solver_type': 'midpoint', 'lower_order_final': True, 'use_karras_sigmas': False,
                         'final_sigmas_type': 'zero', 'timestep_spacing': 'leading', 'solver_order': 2}
14:01:59-880862 DEBUG    Sampler: steps=10 timesteps=[999, 845, 730, 587, 443, 310, 193, 116, 53, 13]
14:01:59-882861 DEBUG    Torch generator: device=cuda seeds=[1523340005]
14:01:59-883862 DEBUG    Diffuser pipeline: StableDiffusionXLPipeline task=DiffusersTaskType.TEXT_2_IMAGE batch=1/1x1
                         set={'timesteps': [999, 845, 730, 587, 443, 310, 193, 116, 53, 13], 'prompt_embeds':
                         torch.Size([1, 154, 2048]), 'pooled_prompt_embeds': torch.Size([1, 1280]),
                         'negative_prompt_embeds': torch.Size([1, 154, 2048]), 'negative_pooled_prompt_embeds':
                         torch.Size([1, 1280]), 'guidance_scale': 3, 'num_inference_steps': 10, 'eta': 1.0,
                         'guidance_rescale': 0.7, 'denoising_end': None, 'output_type': 'latent', 'width': 1024,
                         'height': 1024, 'parser': 'Full parser'}
Progress ?it/s                                              0% 0/10 00:00 ? Base
14:02:00-413434 DEBUG    Server: alive=True jobs=0 requests=352 uptime=313 memory=1.91/63.9 backend=Backend.DIFFUSERS
                         state=idle
Progress  1.19it/s █████████████████████████████████ 100% 10/10 00:08 00:00 Base
14:02:08-625569 DEBUG    GC: utilization={'gpu': 66, 'ram': 3, 'threshold': 80} gc={'collected': 393, 'saved': 0.0}
                         before={'gpu': 7.9, 'ram': 1.91} after={'gpu': 7.9, 'ram': 1.91, 'retries': 0, 'oom': 0}
                         device=cuda fn=full_vae_decode time=0.22
14:02:12-525005 TRACE    VAE config: FrozenDict([('in_channels', 3), ('out_channels', 3), ('down_block_types',
                         ['DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D']),
                         ('up_block_types', ['UpDecoderBlock2D', 'UpDecoderBlock2D', 'UpDecoderBlock2D',
                         'UpDecoderBlock2D']), ('block_out_channels', [128, 256, 512, 512]), ('layers_per_block', 2),
                         ('act_fn', 'silu'), ('latent_channels', 4), ('norm_num_groups', 32), ('sample_size', 1024),
                         ('scaling_factor', 0.13025), ('shift_factor', None), ('latents_mean', None), ('latents_std',
                         None), ('force_upcast', True), ('use_quant_conv', True), ('use_post_quant_conv', True),
                         ('mid_block_add_attention', True), ('_use_default_values', ['use_quant_conv', 'latents_mean',
                         'mid_block_add_attention', 'use_post_quant_conv', 'latents_std', 'shift_factor']),
                         ('_class_name', 'AutoencoderKL'), ('_diffusers_version', '0.21.0.dev0'), ('_name_or_path',
                         '/home/patrick/.cache/huggingface/hub/models--lykon-models--dreamshaper-8/snapshots/7e855e3f481
                         832419503d1fa18d4a4379597f04b/vae')])
14:02:12-528508 TRACE    VAE memory: defaultdict(<class 'int'>, {'retries': 0, 'oom': 0, 'free': 4344250368, 'total':
                         12884377600, 'active': 2673, 'active_peak': 8134968320, 'reserved': 7390363648,
                         'reserved_peak': 8545894400, 'used': 8540127232})
14:02:12-530504 TRACE    VAE decode: name=fixFP16ErrorsSDXLLowerMemoryUse_v10.safetensors dtype=torch.float16
                         upcast=False images=1 latents=torch.Size([1, 4, 128, 128]) time=4.116
14:02:12-538504 DEBUG    Profile: VAE decode: 4.13
14:02:12-582516 INFO     Save: image="outputs\text\06727-novaAnimeXL_ponyV40-Score 9 score 8 up score 7 up.jpg"
                         type=JPEG width=1024 height=1024 size=166048
14:02:12-584518 INFO     Processed: images=1 time=12.72 its=0.79 memory={'ram': {'used': 1.92, 'total': 63.9}, 'gpu':
                         {'used': 7.95, 'total': 12.0}, 'retries': 0, 'oom': 0}
14:03:31-516232 INFO     Base: class=StableDiffusionXLPipeline
14:03:31-518232 DEBUG    Sampler: sampler="DPM++ 2M" config={'num_train_timesteps': 1000, 'beta_start': 0.00085,
                         'beta_end': 0.012, 'beta_schedule': 'scaled_linear', 'prediction_type': 'epsilon',
                         'thresholding': False, 'sample_max_value': 1.0, 'algorithm_type': 'sde-dpmsolver++',
                         'solver_type': 'midpoint', 'lower_order_final': True, 'use_karras_sigmas': False,
                         'final_sigmas_type': 'zero', 'timestep_spacing': 'leading', 'solver_order': 2}
14:03:31-520232 DEBUG    Sampler: steps=10 timesteps=[999, 845, 730, 587, 443, 310, 193, 116, 53, 13]
14:03:31-522235 DEBUG    Torch generator: device=cuda seeds=[434621457]
14:03:31-523232 DEBUG    Diffuser pipeline: StableDiffusionXLPipeline task=DiffusersTaskType.TEXT_2_IMAGE batch=1/1x1
                         set={'timesteps': [999, 845, 730, 587, 443, 310, 193, 116, 53, 13], 'prompt_embeds':
                         torch.Size([1, 154, 2048]), 'pooled_prompt_embeds': torch.Size([1, 1280]),
                         'negative_prompt_embeds': torch.Size([1, 154, 2048]), 'negative_pooled_prompt_embeds':
                         torch.Size([1, 1280]), 'guidance_scale': 3, 'num_inference_steps': 10, 'eta': 1.0,
                         'guidance_rescale': 0.7, 'denoising_end': None, 'output_type': 'latent', 'width': 1024,
                         'height': 1024, 'parser': 'Full parser'}
Progress  1.19it/s █████████████████████████████████ 100% 10/10 00:08 00:00 Base
14:03:40-225062 DEBUG    GC: utilization={'gpu': 66, 'ram': 3, 'threshold': 80} gc={'collected': 399, 'saved': 0.0}
                         before={'gpu': 7.9, 'ram': 1.9} after={'gpu': 7.9, 'ram': 1.9, 'retries': 0, 'oom': 0}
                         device=cuda fn=full_vae_decode time=0.22
14:03:44-130741 TRACE    VAE config: FrozenDict([('in_channels', 3), ('out_channels', 3), ('down_block_types',
                         ['DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D', 'DownEncoderBlock2D']),
                         ('up_block_types', ['UpDecoderBlock2D', 'UpDecoderBlock2D', 'UpDecoderBlock2D',
                         'UpDecoderBlock2D']), ('block_out_channels', [128, 256, 512, 512]), ('layers_per_block', 2),
                         ('act_fn', 'silu'), ('latent_channels', 4), ('norm_num_groups', 32), ('sample_size', 1024),
                         ('scaling_factor', 0.13025), ('shift_factor', None), ('latents_mean', None), ('latents_std',
                         None), ('force_upcast', True), ('use_quant_conv', True), ('use_post_quant_conv', True),
                         ('mid_block_add_attention', True), ('_use_default_values', ['use_quant_conv', 'latents_mean',
                         'mid_block_add_attention', 'use_post_quant_conv', 'latents_std', 'shift_factor']),
                         ('_class_name', 'AutoencoderKL'), ('_diffusers_version', '0.21.0.dev0'), ('_name_or_path',
                         '/home/patrick/.cache/huggingface/hub/models--lykon-models--dreamshaper-8/snapshots/7e855e3f481
                         832419503d1fa18d4a4379597f04b/vae')])
14:03:44-134241 TRACE    VAE memory: defaultdict(<class 'int'>, {'retries': 0, 'oom': 0, 'free': 4344250368, 'total':
                         12884377600, 'active': 2673, 'active_peak': 8134968320, 'reserved': 7390363648,
                         'reserved_peak': 8545894400, 'used': 8540127232})
14:03:44-136242 TRACE    VAE decode: name=fixFP16ErrorsSDXLLowerMemoryUse_v10.safetensors dtype=torch.float16
                         upcast=False images=1 latents=torch.Size([1, 4, 128, 128]) time=4.121
14:03:44-144242 DEBUG    Profile: VAE decode: 4.13
14:03:44-186753 INFO     Save: image="outputs\text\06728-novaAnimeXL_ponyV40-Score 9 score 8 up score 7 up.jpg"
                         type=JPEG width=1024 height=1024 size=139766
14:03:44-189254 INFO     Processed: images=1 time=12.69 its=0.79 memory={'ram': {'used': 1.92, 'total': 63.9}, 'gpu':
                         {'used': 7.95, 'total': 12.0}, 'retries': 0, 'oom': 0}


   

@vladmandic
Owner

i can see some of the difference with vs without config: 8.1gb vs 9.6gb
but i also see absolutely zero differences in the config itself.
and there is no proof of a vram spike above 12gb as originally reported.

also, no matter what i do, i cannot reproduce this.
if someone has an idea or is able to reproduce separately, i'm really curious.

@vladmandic vladmandic added help wanted Extra attention is needed and removed question Further information is requested labels Sep 11, 2024
@zaxwashere
Author

I did a fresh installation and the issue persisted. I realized that the configs are cached in users/myusername/.cache/huggingface, so I deleted all of that, but are there any other shared locations where cached data might hide and contribute to my problem?

@vladmandic
Owner

vladmandic commented Sep 12, 2024

downloaded config is in users/myusername/.cache/huggingface
the point of the "cached config" option is exactly so this download is not required; it uses the config in configs/ instead (for sdxl, that would be configs/sdxl)
also, you say the issue persists - but none of the logs you've uploaded with SD_VAE_DEBUG enabled show a spike above 10gb.

@zaxwashere
Author

(had to delete my prior comment, formatting got jumbled)

My vram usage spikes above 10 gb per task manager and the webui readout under the preview image (labeled as GPU active). Vram usage is a bit inconsistent overall; there's probably some GC tweaking that I need to do.

My hunch is that vae tiling isn't being applied, but that's based only on the pattern I see: vram usage is identical with it on or off when using the cached configuration (see the sketch after the table below). Let me know if there's anything else I can try.

  • SDNext Dev Branch
  • RTX 3060 12gb, Driver 555.99
  • Windows 11 Pro 23H2, Torch 2.4.1+cu124
  • 64 gb DDR4 3600mhz cl 18, RainponyXL
  • Ryzen 5700x3d, sdxl fp16 fixed vae
3 Run averages, 1024x Resolution

|                   | Cached config off, tiling on | Cached config off, tiling off | Cached config on, tiling on | Cached config on, tiling off |
|-------------------|------------------------------|-------------------------------|-----------------------------|------------------------------|
| Vae decode (secs) | 3.08 | 3.67 | 3.89 | 3.91 |
| Active            | 8103 | 10983 | 11001 | 10935 |
| Reserved          | 7368 | 7357 | 7476 | 7410 |
| Used              | 8452 | 8444 | 8560 | 8494 |
| Free              | 3836 | 3844 | 3728 | 3794 |
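
One way to sanity-check the "tiling isn't being applied" hunch outside SD.Next is to inspect the VAE directly in diffusers; the use_tiling/use_slicing attribute names are diffusers internals and may differ between versions, so treat this as a rough sketch rather than an SD.Next diagnostic.

```python
# Rough sketch: check whether tiling/slicing are actually enabled on a diffusers AutoencoderKL.
# use_tiling / use_slicing are internal attribute names and may change across diffusers versions.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
vae.enable_tiling()
vae.enable_slicing()

print("dtype:", vae.dtype)                                 # expect torch.float16
print("force_upcast:", vae.config.force_upcast)            # expect False for the fp16-fix VAE
print("tiling:", getattr(vae, "use_tiling", "unknown"))    # expect True after enable_tiling()
print("slicing:", getattr(vae, "use_slicing", "unknown"))  # expect True after enable_slicing()
```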

@vladmandic
Owner

ah, i may have found it. seems like the vae was not typecast to fp16 if a config was specified. so even if upcast is disabled, it's pointless since the vae is loaded as fp32.

update and try to reproduce. if the issue persists, update here and i'll reopen.
also upload the full log for both runs, with and without config.
before running the test, set the env variable SD_VAE_DEBUG=true
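
For context, a hypothetical illustration of the failure mode described above: a VAE instantiated from a bare config dict comes up in fp32 and stays there unless it is explicitly cast, regardless of the upcast option. The config path below is made up and this is not the actual SD.Next code path.

```python
# Hypothetical illustration of the bug described above, not SD.Next's actual code:
# a VAE built from a config dict is created in fp32 and must be cast to fp16 explicitly.
import json
import torch
from diffusers import AutoencoderKL

with open("configs/sdxl/vae/config.json") as f:  # made-up cached-config path
    config = json.load(f)

vae = AutoencoderKL.from_config(config)  # fresh module, fp32 by default
print(vae.dtype)                         # torch.float32 -> roughly doubles VAE memory at decode time

vae = vae.to(dtype=torch.float16)        # the missing cast: keep the VAE in fp16
print(vae.dtype)                         # torch.float16
```

In the real pipeline the checkpoint's VAE weights would then be loaded into such a module; without the cast, the fp32 copy is what runs during decode.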

@zaxwashere
Author

cached config OFF.log
cached config ON.log

Issue still persists. I've attached screenshots of the webui generation info + screenshots of task manager during each run. Cached config uses significantly more vram and starts using shared memory.

[screenshot: cached config on]
[screenshot: cached config on webui info]

I used a fresh instance of sdnext dev without extensions. I ran 2 generations and attached the logs with --debug and sd_vae_debug=true env variable.

[screenshot: cached config OFF]
[screenshot: cached config OFF webui info]

@vladmandic vladmandic reopened this Oct 1, 2024
@vladmandic
Owner

vladmandic commented Oct 1, 2024

i've reopened in case someone wants to take a shot at it.
i consider this very low priority since it's not reproducible AND the workaround is well known.

@tampadesignr

my system spiked twice and crashed, just saying he's not the only one.
when i tested the same model in invoke, the system stayed stable.

@vladmandic
Owner

> my system spiked twice and crashed, just saying he's not the only one.
> when i tested the same model in invoke, the system stayed stable.

general statements without logs or any info on platform or settings are not helpful.

@tampadesignr

#3471
couldn't find any info on any of those questions and i feel like the answer to those holds some info related to this.
answer those questions in detail and we'll come back to this.

@vladmandic
Owner

> #3471 couldn't find any info on any of those questions and i feel like the answer to those holds some info related to this. answer those questions in detail and we'll come back to this.

that item is not related at all.

@tampadesignr

there is an issue with how your system is handling diffusers.

@vladmandic
Owner

> there is an issue with how your system is handling diffusers.

maybe there is. create an issue and document it. do not post random comments on completely unrelated issues.
