
Cleanup contexts upon calculation completion, failure #354

Merged · 13 commits · Apr 22, 2023

Conversation

@dotsdl (Member) commented Apr 20, 2023

This adds calls to `ContextCache.empty` at the end of repex execution, or in the case of execution failure, to avoid leaving behind stale contexts. This can be important for long-running processes executing multiple `ProtocolDAG`s in sequence.
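
A minimal sketch of the cleanup pattern this describes; `run_repex_with_cleanup`, `sampler`, and `n_iterations` are illustrative placeholders for whatever the repex unit actually holds, while `ContextCache.empty` is the real openmmtools call:

```python
from openmmtools import cache

def run_repex_with_cleanup(sampler, sampler_context_cache: cache.ContextCache,
                           energy_context_cache: cache.ContextCache,
                           n_iterations: int):
    """Run a repex sampler and always empty its ContextCaches afterwards."""
    try:
        sampler.extend(n_iterations)  # placeholder for the actual repex driver call
    finally:
        # drop all cached OpenMM Contexts whether the run succeeded or failed,
        # so GPU memory is released before the next ProtocolDAG executes
        energy_context_cache.empty()
        sampler_context_cache.empty()
```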
@dotsdl (Member, Author) commented Apr 20, 2023

Worked with @jchodera earlier today on this; he suggested this change due to behavior we were seeing in alchemiscale compute services on Lilac. This problem may be mitigated by relatively short (24hr) jobs on Lilac, but would become a problem for services running indefinitely (such as on PRP).

I'm seeing behavior on my local test box that suggests contexts are not being cleaned up currently; I have yet to test if this PR addresses this.

@dotsdl (Member, Author) commented Apr 20, 2023

Can confirm that we observe GPU memory exhaustion when running many RBFE ProtocolDAGs in sequence:

[2023-04-20 06:16:48,756] [vulkan-cc574698-44f9-4d9f-adbd-a9055d83d534] [INFO] '<ProtocolDAG-8c622c4dca6e677be62fdb5199cbb69e>' : '<ProtocolDAGResult-a060faa7540c0d29a7bceff2e1a5f9c9>' : FAILURE : 'ProtocolUnitFailure(lig_ejm_31 lig_ejm_42 repeat 0 generation 0)' : ('OpenMMException', ('Error creating array posDelta: CUDA_ERROR_OUT_OF_MEMORY (2)',))

Also end up getting many of these:

[2023-04-20 06:19:33,720] [vulkan-1611dd3a-cd5e-4c9f-9661-0930074ec2a1] [INFO] '<ProtocolDAG-623250040607a0675771ae4b3ce4405e>' : '<ProtocolDAGResult-b91a3c75dd6eb5bb56f351f886e44f04>' : FAILURE : 'ProtocolUnitFailure(lig_ejm_42 lig_ejm_43 repeat 0 generation 0)' : ('OpenMMException', ('No compatible CUDA device is available',))

@dotsdl (Member, Author) commented Apr 20, 2023

Running a test now on my local box to see if the changes in this PR give stable memory use over time. Will report back on observations.

@dotsdl (Member, Author) commented Apr 20, 2023

This may not yet be solving the issue, but I can't conclude anything yet; I'll let the test box run overnight and see what state it's in in the morning.

@codecov (bot) commented Apr 20, 2023

Codecov Report

Patch coverage: 86.36% and project coverage change: -0.07 ⚠️

Comparison is base (f88a29f) 90.66% compared to head (b73dfc6) 90.59%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #354      +/-   ##
==========================================
- Coverage   90.66%   90.59%   -0.07%     
==========================================
  Files          92       92              
  Lines        5718     5741      +23     
==========================================
+ Hits         5184     5201      +17     
- Misses        534      540       +6     
Impacted Files Coverage Δ
openfe/protocols/openmm_rfe/equil_rfe_methods.py 89.76% <81.25%> (-2.28%) ⬇️
...enfe/protocols/openmm_rfe/_rfe_utils/multistate.py 61.68% <100.00%> (+1.10%) ⬆️

... and 1 file with indirect coverage changes


@IAlibay (Member) left a comment:

Can you move the try/except down to only encapsulate the problematic section?

@IAlibay (Member) commented Apr 20, 2023

This is rather odd; I've run multi-day-long DAGs before and never saw that behaviour.

Comment on lines 688 to 689
energy_context_cache.empty()
sampler_context_cache.empty()
Contributor:

This feels like it should get (eventually) upstreamed, as these context objects are used via a context-manager pattern.
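
A rough sketch of that context-manager idea; `managed_context_cache` is a hypothetical helper, not part of openmmtools:

```python
from contextlib import contextmanager

from openmmtools import cache

@contextmanager
def managed_context_cache(**kwargs):
    """Yield a ContextCache and guarantee it is emptied on exit."""
    context_cache = cache.ContextCache(**kwargs)
    try:
        yield context_cache
    finally:
        # always release the cached Contexts, even if the block raised
        context_cache.empty()

# usage sketch:
# with managed_context_cache(capacity=None, time_to_live=None) as sampler_context_cache:
#     ...  # hand the cache to the sampler and run
```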

@IAlibay (Member) commented Apr 20, 2023:

I think as a short-term solution it might be good to at least make sure that the context is cleanly removed in MultiStateSampler.__del__
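
A rough sketch of that short-term idea. This is hypothetical, not the actual openmmtools implementation, and it assumes the sampler keeps its caches on `sampler_context_cache` / `energy_context_cache` attributes:

```python
class MultiStateSamplerSketch:
    """Stand-in illustrating cache cleanup on garbage collection."""

    def __init__(self, sampler_context_cache=None, energy_context_cache=None):
        self.sampler_context_cache = sampler_context_cache
        self.energy_context_cache = energy_context_cache

    def __del__(self):
        # empty any caches this sampler holds so their Contexts (and the GPU
        # memory behind them) are released when the sampler itself is collected
        for context_cache in (self.energy_context_cache, self.sampler_context_cache):
            if context_cache is not None:
                context_cache.empty()
```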

Member Author:

Indeed, upstreaming here: choderalab/openmmtools#690


@richardjgowers (Contributor) commented:

@IAlibay re: your multiple long DAGs, were these in different Python invocations? I think this might be surfacing because @dotsdl has a single long-running process issuing these (which shouldn't be bad), whereas if you were quickrunning in succession you'd have been tearing down the memory footprint between each.

@IAlibay (Member) commented Apr 20, 2023

> @IAlibay re: your multiple long DAGs, were these in different Python invocations? I think this might be surfacing because @dotsdl has a single long-running process issuing these (which shouldn't be bad), whereas if you were quickrunning in succession you'd have been tearing down the memory footprint between each.

No, recently it was the one big Python loop, so should all be the same process. I'd say the most I tried is 12 tyk2 complex calculations in a loop, so if it's a large scale thing it might explain why we didn't see it.

@dotsdl (Member, Author) commented Apr 20, 2023

Running another testing round on my local box with the latest changes from the power-hour discussion. Will report back on behaviour.

@dotsdl (Member, Author) commented Apr 20, 2023

These changes still appear insufficient to avoid growing system and GPU memory use. As we saw in discussion, we observe the problem after ~40 successful runs.

@dotsdl (Member, Author) commented Apr 20, 2023

Addressing your comments now @IAlibay.

@dotsdl (Member, Author) commented Apr 20, 2023

Running another round of local tests with latest changes, including what we propose to upstream in choderalab/openmmtools#690.

@IAlibay (Member) commented Apr 20, 2023

@dotsdl - the latest changes seem to be preventing memory leaks on my test loop; if you can test it on your end, that'd be great.

del cache.global_context_cache._lru._data[context]

del sampler_context_cache, energy_context_cache
if not dry:
Member:

This won't delete the integrator on dry; one would hope that alchemiscale-style services wouldn't be running in dry mode anyway?

Member Author:

Yeah, I don't see a scenario where we would use dry on alchemiscale, so I think we're good here.

Member Author:

It seems more of a user-host thing for debugging.

@IAlibay dismissed their stale review April 20, 2023 21:56

now equally guilty of shaking the code until something happens

@dotsdl (Member, Author) commented Apr 20, 2023

@IAlibay haha sometimes it's a good strategy. 😁

@IAlibay (Member) commented Apr 21, 2023

> @IAlibay haha sometimes it's a good strategy. 😁

😅 you mean to say there are other strategies?

In any case, I'll happily merge this & cut a release once we know it's stable.

@dotsdl (Member, Author) commented Apr 21, 2023

I think that was it @IAlibay! Great work! I am no longer seeing rising GPU memory usage, but still observe rising system memory usage. Do you observe the same?

If you don't also observe rising system memory, this may be a symptom of the SynchronousComputeService, not this protocol.

@IAlibay (Member) commented Apr 21, 2023

> I think that was it @IAlibay! Great work! I am no longer seeing rising GPU memory usage, but still observe rising system memory usage. Do you observe the same?
>
> If you don't also observe rising system memory, this may be a symptom of the SynchronousComputeService, not this protocol.

Awesome, that's one thing sorted. Yeah, the host memory leak tracks with what I'm seeing too :/

Off the top of my head it's possibly HTF or the generator objects.

@IAlibay (Member) commented Apr 21, 2023

@dotsdl - so with these latest changes I do see host memory creep up, but it plateaus at ~9 GB of total system memory for three concurrent DAG execution loops on my workstation (with a gc.collect() call at the end of each loop). It's hard to know if this is an actual memory leak or just garbage collection being quite lazy when there's lots of RAM left (I still have another 23 GB of RAM free). For now I'm going to leave it as-is since it seems stable, and honestly, if we manage to get 12 iterations of a complex calculation in 24 h that would be rather amazing performance.
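
For reference, a rough sketch of the kind of test loop described here; `dags` and `execute_dag` are placeholders for however the ProtocolDAGs are built and executed:

```python
import gc

def run_dag_loop(dags, execute_dag):
    """Execute ProtocolDAGs in sequence, forcing a collection between each."""
    results = []
    for dag in dags:
        result = execute_dag(dag)  # run one ProtocolDAG to completion
        results.append(result)
        # force collection between DAGs so host memory is reclaimed promptly
        # rather than whenever the garbage collector eventually gets around to it
        gc.collect()
    return results
```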

Given it's so late in the day I'm going to just hold off on merging this whilst I quickly do a couple of actual simulations overnight. Just want to check we didn't drastically break stuff somewhere here. If everything looks fine I'll merge / release tomorrow in the US morning.

@@ -34,7 +34,7 @@ def __init__(self, *args, hybrid_factory=None, **kwargs):
self._hybrid_factory = hybrid_factory
super(HybridCompatibilityMixin, self).__init__(*args, **kwargs)

def setup(self, reporter, platform, lambda_protocol,
def setup(self, reporter, lambda_protocol,
Member:

Creating a context here was absolutely unnecessary, so it's been removed.

# we won't take any steps, so use a simple integrator
integrator = openmm.VerletIntegrator(1.0)
platform = openmm.Platform.getPlatformByName('CPU')
dummy_cache = cache.DummyContextCache(platform=platform)
Member:

Using a dummy context cache for the convenience of being able to pass a thermodynamic state object, rather than doubling up effort and writing the 3-4 extra lines here ourselves.
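
Roughly, the pattern the snippets in this thread implement, as a hypothetical standalone helper (the function name, argument names, and the 100-iteration default are illustrative):

```python
import openmm
from openmmtools import cache, states

def preminimize(thermodynamic_state: states.ThermodynamicState,
                sampler_state: states.SamplerState,
                max_iterations: int = 100) -> None:
    """Minimize a sampler state in place on the CPU, cleaning up the Context."""
    # DummyContextCache builds a fresh Context directly from the thermodynamic
    # state, saving us from assembling System/positions/box vectors by hand
    integrator = openmm.VerletIntegrator(1.0)  # no steps are taken; any integrator works
    platform = openmm.Platform.getPlatformByName('CPU')
    dummy_cache = cache.DummyContextCache(platform=platform)
    context, integrator = dummy_cache.get_context(thermodynamic_state, integrator)
    try:
        sampler_state.apply_to_context(context, ignore_velocities=True)
        openmm.LocalEnergyMinimizer.minimize(context, maxIterations=max_iterations)
        sampler_state.update_from_context(context)
    finally:
        # being a bit overly cautious: make sure the Context is released
        del context, integrator, dummy_cache
```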

)
sampler_state.update_from_context(context)
finally:
del context, integrator, dummy_cache
Member:

Being a bit overly cautious but it'll do I think.

minimize(compound_thermostate_copy, sampler_state,
max_iterations=minimization_steps)
sampler_state_list.append(copy.deepcopy(sampler_state))

del compound_thermostate, sampler_state
Member:

I think that should be fine.

@dotsdl (Member, Author) commented Apr 22, 2023

All good @IAlibay! Will await your conclusions and openfe 0.7.3 on the alchemiscale end. Let me know if there's anything you'd like me to hit as well.

@dotsdl (Member, Author) commented Apr 22, 2023

Since we're now using the CPU platform for minimization, I need to set OPENMM_CPU_THREADS=1 in typical usage when stacking alchemiscale SynchronousComputeServices on a single host. This is fine, but is minimization generally faster to do on 1 CPU core or on the GPU, accounting for the amortized cost of moving things on and off the GPU? Not critical to address now, but wanted to ask.

@dotsdl (Member, Author) commented Apr 22, 2023

Running another burn test on my host here; will report back observations as well.

@jchodera commented:

> Since we're now using the CPU platform for minimization, I need to set OPENMM_CPU_THREADS=1 in typical usage when stacking alchemiscale SynchronousComputeServices on a single host. This is fine, but is minimization generally faster to do on 1 CPU core or on the GPU, accounting for the amortized cost of moving things on and off the GPU? Not critical to address now, but wanted to ask.

Minimization is extremely slow on the CPU. I highly recommend doing this on the GPU.

Some considerations for minimization of replicas:

  • If you create/destroy Context objects for each minimization, it might incur overhead of a second or so per replica.
  • If you use a ContextCache to manage minimization of replicas, the Contexts may not be reused since the ContextCache requires both System and Integrator to match to re-use the Context. That will just tie up a bunch of resources you don't use again if the Integrator doesn't match (e.g. minimization is done with VerletIntegrator but propagation uses LangevinMiddleIntegrator).
  • The best thing to do here might be to create a ContextCache just for minimization of all replicas, minimize all replicas, and then clean up that cache, being careful that all Contexts are deleted (see the sketch after this list).
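
A rough sketch of that last suggestion, under the assumption that replicas are represented as openmmtools thermodynamic/sampler state pairs (the function and argument names are illustrative, not an actual API):

```python
import openmm
from openmmtools import cache

def minimize_replicas(thermodynamic_states, sampler_states,
                      platform_name: str = 'CUDA', max_iterations: int = 100) -> None:
    """Minimize every replica with one dedicated ContextCache, then clean it up."""
    platform = openmm.Platform.getPlatformByName(platform_name)
    # a cache used only for minimization, so its Contexts never collide with
    # the propagation Contexts (which use a different Integrator)
    minimization_cache = cache.ContextCache(platform=platform,
                                            capacity=None, time_to_live=None)
    integrator = openmm.VerletIntegrator(1.0)  # no dynamics; only needed to build Contexts
    try:
        for thermodynamic_state, sampler_state in zip(thermodynamic_states, sampler_states):
            context, _ = minimization_cache.get_context(thermodynamic_state, integrator)
            sampler_state.apply_to_context(context, ignore_velocities=True)
            openmm.LocalEnergyMinimizer.minimize(context, maxIterations=max_iterations)
            sampler_state.update_from_context(context)
    finally:
        # drop the minimization Contexts once all replicas are done
        minimization_cache.empty()
        del minimization_cache
```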

@dotsdl (Member, Author) commented Apr 22, 2023

Yeah I do observe an idle GPU on my test host for some time while minimizations are happening, which may be a bit of a bottleneck. However for production-length runs, this may not end up mattering much, since the minimization may still be very short compared to production runtime. The tests I'm running now are very short, so the minimization period is quite noticeable.

If it's possible to take @jchodera's suggestions above and run the minimizations on the GPU, that may still be preferable though.

The good news is that so far running 6 workers with a mix of complex and solvent transformation ProtocolDAGs going from tyk2, I see each worker topping out at about 2.45GB, or 14.7GB total memory usage. This suggests that the current state of this protocol may now be memory stable!

@IAlibay (Member) commented Apr 22, 2023

> Minimization is extremely slow on the CPU. I highly recommend doing this on the GPU.

@jchodera - for context here, this is only a 100-step pre-minimization to get out of potentially really bad high-energy states. The reason we're doing this here is that we were seeing cases of silent overflows when minimizing on GPUs.

Admittedly, I think this might have all been tied into the FIRE minimizer issues we recently reported; however, our take was that being a bit extra cautious for now might only cost us an extra couple of minutes? Unless you think this might be unreasonably slow for larger systems (TYK2 seems to be fine, but it's not exactly the largest system)?

> Yeah I do observe an idle GPU on my test host for some time while minimizations are happening, which may be a bit of a bottleneck.

I note that the "proper" minimization (i.e. https://github.com/OpenFreeEnergy/openfe/blob/main/openfe/protocols/openmm_rfe/equil_rfe_methods.py#L652) on should be assigned the GPU platform @dotsdl (or at least whatever platform was chosen by default in settings).

> The best thing to do here might be to create a ContextCache just for minimization of all replicas, minimize all replicas, and then clean up that cache, being careful that all Contexts are deleted.

That makes sense, thanks! For this PR (just so we can quickly deploy on alchemiscale), I'm going to say let's take the extra couple of seconds of performance hit, and then we can follow up with a fix that does exactly as @jchodera suggests? If anything, the performance loss of building/creating contexts is minuscule given we're using a wildly inefficient integrator (see #332).

@dotsdl (Member, Author) commented Apr 22, 2023

Observing a slow climb in host memory usage, but I still think this looks tolerable if it scales with task count, not task length, given how artificially short the test tasks are. I ran 240 tasks, 120 each of complex and solvent Transformations. Of the 6 processes, here is the host memory usage with only a handful of tasks remaining:

2.79G
2.66G
2.53G
2.45G
2.43G
2.29G

This is after running for nearly 4 hours, with an almost unnoticeable climb in memory usage compared to what I observed over 2 hours ago. I believe this is consistent with what you observed as well @IAlibay.

@dotsdl (Member, Author) commented Apr 22, 2023

> That makes sense, thanks! For this PR (just so we can quickly deploy on alchemiscale), I'm going to say let's take the extra couple of seconds of performance hit, and then we can follow up with a fix that does exactly as @jchodera suggests? If anything, the performance loss of building/creating contexts is minuscule given we're using a wildly inefficient integrator (see #332).

Yes, I think we're clear to proceed with this as-is. Thank you for all the solid sleuthing on this @IAlibay! I think in practice we'll get plenty of throughput out of the implementation as it stands, and we can work on greater efficiency when we're not under the current time pressure.

@IAlibay (Member) commented Apr 22, 2023

Aha! I didn't know OpenMM's LocalEnergyMinimizer was this smart: https://github.com/openmm/openmm/blob/master/openmmapi/src/LocalEnergyMinimizer.cpp#L54

Seems like it'll always do a CPU fallback if there's a force clipping issue. In that case let's keep the old behaviour here and just stick to whatever the default platform is for this initial minimization. Once choderalab/openmmtools#672 is merged, we can probably just get rid of this pre-minimization completely.

del reporter

# clean up the analyzer
if not dry:
Member:

This needs a bit of a cleanup, but I'll do it in the upcoming refactor.

@IAlibay (Member) commented Apr 22, 2023

Quick test update - we're getting reproducible results for TYK2's ejm_31_ejm_46 edge - within 0.1 kcal/mol. Everything seems fine.

@IAlibay merged commit 793b817 into OpenFreeEnergy:main, Apr 22, 2023
@IAlibay (Member) commented Apr 22, 2023

Alright, thanks for all the work here @dotsdl - let's move towards a release 🎉

@jchodera commented:

> @jchodera - for context here, this is only a 100-step pre-minimization to get out of potentially really bad high-energy states. The reason we're doing this here is that we were seeing cases of silent overflows when minimizing on GPUs.
> Admittedly, I think this might have all been tied into the FIRE minimizer issues we recently reported; however, our take was that being a bit extra cautious for now might only cost us an extra couple of minutes? Unless you think this might be unreasonably slow for larger systems (TYK2 seems to be fine, but it's not exactly the largest system)?

@IAlibay: There are definitely deficiencies with the FIRE minimizer, and we're hoping to be able to address those at some point in the near future. But PME systems will get very slow on the CPU, even for 100 steps.

Can you elaborate on the kinds of "silent overflows" you ran into? This was with mixed precision? I wonder if a simpler gradient descent minimizer that rejects steps that go uphill would be sufficient for your purposes but could still run on the GPU.

@jchodera commented:

> That makes sense, thanks! For this PR (just so we can quickly deploy on alchemiscale), I'm going to say let's take the extra couple of seconds of performance hit, and then we can follow up with a fix that does exactly as @jchodera suggests? If anything, the performance loss of building/creating contexts is minuscule given we're using a wildly inefficient integrator (see #332).

This is a reasonable solution! The main concern here was ensuring there were no device resource leaks, not performance.

@IAlibay (Member) commented Apr 22, 2023

> There are definitely deficiencies with the FIRE minimizer, and we're hoping to be able to address those at some point in the near future. But PME systems will get very slow on the CPU, even for 100 steps.

I got a bit muddled here and had completely forgotten for this PR: it turns out that LocalEnergyMinimizer will fall back to the CPU by default if the forces overflow, so I ended up sticking with the old behaviour and just passing whatever the user-defined platform is.

> Can you elaborate on the kinds of "silent overflows" you ran into? This was with mixed precision? I wonder if a simpler gradient descent minimizer that rejects steps that go uphill would be sufficient for your purposes but could still run on the GPU.

@jchodera: If I remember correctly, the issue here was that the minimizer would essentially attempt to minimize but fail to do so. However, rather than throwing an exception or directly giving a NaN energy, it would just take the system to an unreasonable energy state; from that point, any further attempts to minimize would lead to a system that yields NaNs. I didn't manage to dig into this too much at the time, but avoiding the default MultiStateSampler.minimize FIRE -> local minimizer scheme by doing the pre-minimization separately with LocalEnergyMinimizer first worked fine.

I suspect the issue with the FIRE minimizer is that, similar to LocalEnergyMinimizer on non-CPU platforms, the forces overflow but not the energy accumulator, leading to this weird behaviour.

(I wish I had taken much better notes when I encountered it; I'll try to reproduce this again once things are a bit less hectic.)
