
Cleanup contexts upon calculation completion, failure #354

Merged · 13 commits · Apr 22, 2023

Conversation

@dotsdl (Member) commented Apr 20, 2023

This adds calls to `ContextCache.empty` at the end of repex execution, or in the case of execution failure, to avoid leaving behind stale contexts. This can be important for long-running processes executing multiple `ProtocolDAG`s in sequence.
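
A minimal sketch of the cleanup pattern this describes; `run_repex_with_cleanup`, `sampler`, and `n_iterations` are illustrative placeholders for whatever the repex unit actually holds, while `ContextCache.empty` is the real openmmtools call:

```python
from openmmtools import cache

def run_repex_with_cleanup(sampler, sampler_context_cache: cache.ContextCache,
                           energy_context_cache: cache.ContextCache,
                           n_iterations: int):
    """Run a repex sampler and always empty its ContextCaches afterwards."""
    try:
        sampler.extend(n_iterations)  # placeholder for the actual repex driver call
    finally:
        # drop all cached OpenMM Contexts whether the run succeeded or failed,
        # so GPU memory is released before the next ProtocolDAG executes
        energy_context_cache.empty()
        sampler_context_cache.empty()
```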
@dotsdl (Member, Author) commented Apr 20, 2023

Worked with @jchodera earlier today on this; he suggested this change due to behavior we were seeing in alchemiscale compute services on Lilac. This problem may be mitigated by relatively short (24hr) jobs on Lilac, but would become a problem for services running indefinitely (such as on PRP).

I'm seeing behavior on my local test box that suggests contexts are not being cleaned up currently; I have yet to test if this PR addresses this.

@dotsdl (Member, Author) commented Apr 20, 2023

Can confirm that we observe GPU memory exhaustion when running many RBFE ProtocolDAGs in sequence:

[2023-04-20 06:16:48,756] [vulkan-cc574698-44f9-4d9f-adbd-a9055d83d534] [INFO] '<ProtocolDAG-8c622c4dca6e677be62fdb5199cbb69e>' : '<ProtocolDAGResult-a060faa7540c0d29a7bceff2e1a5f9c9>' : FAILURE : 'ProtocolUnitFailure(lig_ejm_31 lig_ejm_42 repeat 0 generation 0)' : ('OpenMMException', ('Error creating array posDelta: CUDA_ERROR_OUT_OF_MEMORY (2)',))

Also end up getting many of these:

[2023-04-20 06:19:33,720] [vulkan-1611dd3a-cd5e-4c9f-9661-0930074ec2a1] [INFO] '<ProtocolDAG-623250040607a0675771ae4b3ce4405e>' : '<ProtocolDAGResult-b91a3c75dd6eb5bb56f351f886e44f04>' : FAILURE : 'ProtocolUnitFailure(lig_ejm_42 lig_ejm_43 repeat 0 generation 0)' : ('OpenMMException', ('No compatible CUDA device is available',))

@dotsdl (Member, Author) commented Apr 20, 2023

Running a test now on my local box to see if the changes in this PR give stable memory use over time. Will report back on observations.

@dotsdl (Member, Author) commented Apr 20, 2023

This may not yet be solving the issue, but I can't conclude anything yet; I'll let the test box run overnight and see what state it's in in the morning.

@codecov (bot) commented Apr 20, 2023

Codecov Report

Patch coverage: 86.36% and project coverage change: -0.07 ⚠️

Comparison is base (f88a29f) 90.66% compared to head (b73dfc6) 90.59%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #354      +/-   ##
==========================================
- Coverage   90.66%   90.59%   -0.07%     
==========================================
  Files          92       92              
  Lines        5718     5741      +23     
==========================================
+ Hits         5184     5201      +17     
- Misses        534      540       +6     
Impacted Files Coverage Δ
openfe/protocols/openmm_rfe/equil_rfe_methods.py 89.76% <81.25%> (-2.28%) ⬇️
...enfe/protocols/openmm_rfe/_rfe_utils/multistate.py 61.68% <100.00%> (+1.10%) ⬆️

... and 1 file with indirect coverage changes


@IAlibay (Member) left a comment:

Can you move the try/except down to only encapsulate the problematic section?

@IAlibay (Member) commented Apr 20, 2023

This is rather odd; I've run multi-day-long DAGs before and never saw that behaviour.

Comment on lines 688 to 689
energy_context_cache.empty()
sampler_context_cache.empty()
Contributor:

This feels like it should get (eventually) upstreamed, as these context objects are used via a context-manager pattern.
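
A rough sketch of that context-manager idea; `managed_context_cache` is a hypothetical helper, not part of openmmtools:

```python
from contextlib import contextmanager

from openmmtools import cache

@contextmanager
def managed_context_cache(**kwargs):
    """Yield a ContextCache and guarantee it is emptied on exit."""
    context_cache = cache.ContextCache(**kwargs)
    try:
        yield context_cache
    finally:
        # always release the cached Contexts, even if the block raised
        context_cache.empty()

# usage sketch:
# with managed_context_cache(capacity=None, time_to_live=None) as sampler_context_cache:
#     ...  # hand the cache to the sampler and run
```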

@IAlibay (Member) commented Apr 20, 2023:

I think as a short-term solution it might be good to at least make sure that the context is cleanly removed in MultiStateSampler.__del__
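
A rough sketch of that short-term idea. This is hypothetical, not the actual openmmtools implementation, and it assumes the sampler keeps its caches on `sampler_context_cache` / `energy_context_cache` attributes:

```python
class MultiStateSamplerSketch:
    """Stand-in illustrating cache cleanup on garbage collection."""

    def __init__(self, sampler_context_cache=None, energy_context_cache=None):
        self.sampler_context_cache = sampler_context_cache
        self.energy_context_cache = energy_context_cache

    def __del__(self):
        # empty any caches this sampler holds so their Contexts (and the GPU
        # memory behind them) are released when the sampler itself is collected
        for context_cache in (self.energy_context_cache, self.sampler_context_cache):
            if context_cache is not None:
                context_cache.empty()
```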

Member Author:

Indeed, upstreaming here: choderalab/openmmtools#690


@richardjgowers (Contributor) commented:

@IAlibay re: your multiple long DAGs, were these in different Python invocations? I think this might be surfacing because @dotsdl has a single long-running process issuing these (which shouldn't be bad), whereas if you were quickrunning in succession you'd have been tearing down the memory footprint between each.

@IAlibay (Member) commented Apr 20, 2023

> @IAlibay re: your multiple long DAGs, were these in different Python invocations? I think this might be surfacing because @dotsdl has a single long-running process issuing these (which shouldn't be bad), whereas if you were quickrunning in succession you'd have been tearing down the memory footprint between each.

No, recently it was the one big Python loop, so should all be the same process. I'd say the most I tried is 12 tyk2 complex calculations in a loop, so if it's a large scale thing it might explain why we didn't see it.

@dotsdl (Member, Author) commented Apr 20, 2023

Running another testing round on my local box with the latest changes from the power-hour discussion. Will report back on behaviour.

@dotsdl (Member, Author) commented Apr 20, 2023

These changes still appear insufficient to avoid growing system and GPU memory use. As we saw in discussion, we observe the problem after ~40 successful runs.

@dotsdl (Member, Author) commented Apr 20, 2023

Addressing your comments now @IAlibay.

@dotsdl (Member, Author) commented Apr 20, 2023

Running another round of local tests with latest changes, including what we propose to upstream in choderalab/openmmtools#690.

@IAlibay (Member) commented Apr 20, 2023

@dotsdl - the latest changes seem to be preventing memory leaks on my test loop; if you can test it on your end, that'd be great.

del cache.global_context_cache._lru._data[context]

del sampler_context_cache, energy_context_cache
if not dry:
Member:

This won't delete the integrator on dry; one would hope that alchemiscale-style services wouldn't be running in dry mode anyway?

Member Author:

Yeah, I don't see a scenario where we would use dry on alchemiscale, so I think we're good here.

Member Author:

It seems more of a user-host thing for debugging.

@IAlibay dismissed their stale review April 20, 2023 21:56

now equally guilty of shaking the code until something happens

@dotsdl (Member, Author) commented Apr 20, 2023

@IAlibay haha sometimes it's a good strategy. 😁

@IAlibay (Member) commented Apr 21, 2023

> @IAlibay haha sometimes it's a good strategy. 😁

😅 you mean to say there are other strategies?

In any case, I'll happily merge this & cut a release once we know it's stable.

@dotsdl (Member, Author) commented Apr 21, 2023

I think that was it @IAlibay! Great work! I am no longer seeing rising GPU memory usage, but still observe rising system memory usage. Do you observe the same?

If you don't also observe rising system memory, this may be a symptom of the SynchronousComputeService, not this protocol.

@IAlibay (Member) commented Apr 21, 2023

> I think that was it @IAlibay! Great work! I am no longer seeing rising GPU memory usage, but still observe rising system memory usage. Do you observe the same?
>
> If you don't also observe rising system memory, this may be a symptom of the SynchronousComputeService, not this protocol.

Awesome, that's one thing sorted. Yeah, the host memory leak tracks with what I'm seeing too :/

Off the top of my head it's possibly HTF or the generator objects.

@IAlibay (Member) commented Apr 21, 2023

@dotsdl - so with these latest changes I do see host memory creep up, but it plateaus at ~9 GB of total system memory for three concurrent DAG execution loops on my workstation (with a gc.collect() call at the end of each loop). It's hard to know if this is an actual memory leak or just garbage collection being quite lazy when there's lots of RAM left (I still have another 23 GB of RAM free). For now I'm going to leave it as-is since it seems stable, and honestly, if we manage to get 12 iterations of a complex calculation in 24 h that would be rather amazing performance.
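
For reference, a rough sketch of the kind of test loop described here; `dags` and `execute_dag` are placeholders for however the ProtocolDAGs are built and executed:

```python
import gc

def run_dag_loop(dags, execute_dag):
    """Execute ProtocolDAGs in sequence, forcing a collection between each."""
    results = []
    for dag in dags:
        result = execute_dag(dag)  # run one ProtocolDAG to completion
        results.append(result)
        # force collection between DAGs so host memory is reclaimed promptly
        # rather than whenever the garbage collector eventually gets around to it
        gc.collect()
    return results
```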

Given it's so late in the day I'm going to just hold off on merging this whilst I quickly do a couple of actual simulations overnight. Just want to check we didn't drastically break stuff somewhere here. If everything looks fine I'll merge / release tomorrow in the US morning.

@@ -34,7 +34,7 @@ def __init__(self, *args, hybrid_factory=None, **kwargs):
self._hybrid_factory = hybrid_factory
super(HybridCompatibilityMixin, self).__init__(*args, **kwargs)

def setup(self, reporter, platform, lambda_protocol,
def setup(self, reporter, lambda_protocol,
Member:

Creating a context here was absolutely unnecessary, so it's been removed.

# we won't take any steps, so use a simple integrator
integrator = openmm.VerletIntegrator(1.0)
platform = openmm.Platform.getPlatformByName('CPU')
dummy_cache = cache.DummyContextCache(platform=platform)
Member:

Using a dummy context cache for the convenience of being able to pass a thermodynamic state object, rather than doubling up effort and writing the 3-4 extra lines here ourselves.
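
Roughly, the pattern the snippets in this thread implement, as a hypothetical standalone helper (the function name, argument names, and the 100-iteration default are illustrative):

```python
import openmm
from openmmtools import cache, states

def preminimize(thermodynamic_state: states.ThermodynamicState,
                sampler_state: states.SamplerState,
                max_iterations: int = 100) -> None:
    """Minimize a sampler state in place on the CPU, cleaning up the Context."""
    # DummyContextCache builds a fresh Context directly from the thermodynamic
    # state, saving us from assembling System/positions/box vectors by hand
    integrator = openmm.VerletIntegrator(1.0)  # no steps are taken; any integrator works
    platform = openmm.Platform.getPlatformByName('CPU')
    dummy_cache = cache.DummyContextCache(platform=platform)
    context, integrator = dummy_cache.get_context(thermodynamic_state, integrator)
    try:
        sampler_state.apply_to_context(context, ignore_velocities=True)
        openmm.LocalEnergyMinimizer.minimize(context, maxIterations=max_iterations)
        sampler_state.update_from_context(context)
    finally:
        # being a bit overly cautious: make sure the Context is released
        del context, integrator, dummy_cache
```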

)
sampler_state.update_from_context(context)
finally:
del context, integrator, dummy_cache
Member:

Being a bit overly cautious but it'll do I think.

minimize(compound_thermostate_copy, sampler_state,
max_iterations=minimization_steps)
sampler_state_list.append(copy.deepcopy(sampler_state))

del compound_thermostate, sampler_state
Member:

I think that should be fine.

@dotsdl (Member, Author) commented Apr 22, 2023

All good @IAlibay! Will await your conclusions and openfe 0.7.3 on the alchemiscale end. Let me know if there's anything you'd like me to hit as well.

@dotsdl (Member, Author) commented Apr 22, 2023

Since we're now using the CPU platform for minimization, I need to set OPENMM_CPU_THREADS=1 in typical usage when stacking alchemiscale SynchronousComputeServices on a single host. This is fine, but is minimization generally faster to do on 1 CPU core or on the GPU, accounting for the amortized cost of moving things on and off the GPU? Not critical to address now, but wanted to ask.

@dotsdl (Member, Author) commented Apr 22, 2023

Running another burn test on my host here; will report back observations as well.

@jchodera commented:

> Since we're now using the CPU platform for minimization, I need to set OPENMM_CPU_THREADS=1 in typical usage when stacking alchemiscale SynchronousComputeServices on a single host. This is fine, but is minimization generally faster to do on 1 CPU core or on the GPU, accounting for the amortized cost of moving things on and off the GPU? Not critical to address now, but wanted to ask.

Minimization is extremely slow on the CPU. I highly recommend doing this on the GPU.

Some considerations for minimization of replicas:

  • If you create/destroy Context objects for each minimization, it might incur overhead of a second or so per replica.
  • If you use a ContextCache to manage minimization of replicas, the Contexts may not be reused since the ContextCache requires both System and Integrator to match to re-use the Context. That will just tie up a bunch of resources you don't use again if the Integrator doesn't match (e.g. minimization is done with VerletIntegrator but propagation uses LangevinMiddleIntegrator).
  • The best thing to do here might be to create a ContextCache just for minimization of all replicas, minimize all replicas, and then clean up that cache, being careful that all Contexts are deleted (see the sketch after this list).
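
A rough sketch of that last suggestion, under the assumption that replicas are represented as openmmtools thermodynamic/sampler state pairs (the function and argument names are illustrative, not an actual API):

```python
import openmm
from openmmtools import cache

def minimize_replicas(thermodynamic_states, sampler_states,
                      platform_name: str = 'CUDA', max_iterations: int = 100) -> None:
    """Minimize every replica with one dedicated ContextCache, then clean it up."""
    platform = openmm.Platform.getPlatformByName(platform_name)
    # a cache used only for minimization, so its Contexts never collide with
    # the propagation Contexts (which use a different Integrator)
    minimization_cache = cache.ContextCache(platform=platform,
                                            capacity=None, time_to_live=None)
    integrator = openmm.VerletIntegrator(1.0)  # no dynamics; only needed to build Contexts
    try:
        for thermodynamic_state, sampler_state in zip(thermodynamic_states, sampler_states):
            context, _ = minimization_cache.get_context(thermodynamic_state, integrator)
            sampler_state.apply_to_context(context, ignore_velocities=True)
            openmm.LocalEnergyMinimizer.minimize(context, maxIterations=max_iterations)
            sampler_state.update_from_context(context)
    finally:
        # drop the minimization Contexts once all replicas are done
        minimization_cache.empty()
        del minimization_cache
```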

@dotsdl (Member, Author) commented Apr 22, 2023

Yeah I do observe an idle GPU on my test host for some time while minimizations are happening, which may be a bit of a bottleneck. However for production-length runs, this may not end up mattering much, since the minimization may still be very short compared to production runtime. The tests I'm running now are very short, so the minimization period is quite noticeable.

If it's possible to take @jchodera's suggestions above and run the minimizations on the GPU, that may still be preferable though.

The good news is that so far running 6 workers with a mix of complex and solvent transformation ProtocolDAGs going from tyk2, I see each worker topping out at about 2.45GB, or 14.7GB total memory usage. This suggests that the current state of this protocol may now be memory stable!

@IAlibay (Member) commented Apr 22, 2023

> Minimization is extremely slow on the CPU. I highly recommend doing this on the GPU.

@jchodera - for context here, this is only a 100-step pre-minimization to get out of potentially really bad high-energy states. The reason we're doing this here is that we were seeing cases of silent overflows when minimizing on GPUs.

Admittedly, I think this might have all been tied into the FIRE minimizer issues we recently reported; however, our take was that being a bit extra cautious for now might only cost us an extra couple of minutes? Unless you think this might be unreasonably slow for larger systems (TYK2 seems to be fine, but it's not exactly the largest system)?

> Yeah I do observe an idle GPU on my test host for some time while minimizations are happening, which may be a bit of a bottleneck.

I note that the "proper" minimization (i.e. https://github.com/OpenFreeEnergy/openfe/blob/main/openfe/protocols/openmm_rfe/equil_rfe_methods.py#L652) on should be assigned the GPU platform @dotsdl (or at least whatever platform was chosen by default in settings).

> The best thing to do here might be to create a ContextCache just for minimization of all replicas, minimize all replicas, and then clean up that cache, being careful that all Contexts are deleted.

That makes sense, thanks! For this PR (just so we can quickly deploy on alchemiscale), I'm going to say let's take the extra couple of seconds of performance hit, and then we can follow up with a fix that does exactly as @jchodera suggests? If anything, the performance loss of building/creating contexts is minuscule given we're using a wildly inefficient integrator (see #332).

@dotsdl (Member, Author) commented Apr 22, 2023

Observing a slow climb in host memory usage, but I still think this looks tolerable if it scales with task count, not task length, given how artificially short the test tasks are. I ran 240 tasks, 120 each of complex and solvent Transformations. Of the 6 processes, here is the host memory usage with only a handful of tasks remaining:

2.79G
2.66G
2.53G
2.45G
2.43G
2.29G

This is after running for nearly 4 hours, with an almost unnoticeable climb in memory usage compared to what I observed over 2 hours ago. I believe this is consistent with what you observed as well @IAlibay.

@dotsdl (Member, Author) commented Apr 22, 2023

> That makes sense, thanks! For this PR (just so we can quickly deploy on alchemiscale), I'm going to say let's take the extra couple of seconds of performance hit, and then we can follow up with a fix that does exactly as @jchodera suggests? If anything, the performance loss of building/creating contexts is minuscule given we're using a wildly inefficient integrator (see #332).

Yes, I think we're clear to proceed with this as-is. Thank you for all the solid sleuthing on this @IAlibay! I think in practice we'll get plenty of throughput out of the implementation as it stands, and we can work on greater efficiency when we're not under the current time pressure.

@IAlibay (Member) commented Apr 22, 2023

Aha! I didn't know OpenMM's LocalEnergyMinimizer was this smart: https://github.com/openmm/openmm/blob/master/openmmapi/src/LocalEnergyMinimizer.cpp#L54

Seems like it'll always do a CPU fallback if there's a force clipping issue. In that case let's keep the old behaviour here and just stick to whatever the default platform is for this initial minimization. Once choderalab/openmmtools#672 is merged, we can probably just get rid of this pre-minimization completely.

del reporter

# clean up the analyzer
if not dry:
Member:

This needs a bit of a cleanup, but I'll do it in the upcoming refactor.

@IAlibay (Member) commented Apr 22, 2023

Quick test update - we're getting reproducible results for TYK2's ejm_31_ejm_46 edge - within 0.1 kcal/mol. Everything seems fine.

@IAlibay merged commit 793b817 into OpenFreeEnergy:main, Apr 22, 2023
@IAlibay (Member) commented Apr 22, 2023

Alright, thanks for all the work here @dotsdl - let's move towards a release 🎉

@jchodera commented:

> @jchodera - for context here, this is only a 100-step pre-minimization to get out of potentially really bad high-energy states. The reason we're doing this here is that we were seeing cases of silent overflows when minimizing on GPUs.
> Admittedly, I think this might have all been tied into the FIRE minimizer issues we recently reported; however, our take was that being a bit extra cautious for now might only cost us an extra couple of minutes? Unless you think this might be unreasonably slow for larger systems (TYK2 seems to be fine, but it's not exactly the largest system)?

@IAlibay: There are definitely deficiencies with the FIRE minimizer, and we're hoping to be able to address those at some point in the near future. But PME systems will get very slow on the CPU, even for 100 steps.

Can you elaborate on the kinds of "silent overflows" you ran into? This was with mixed precision? I wonder if a simpler gradient descent minimizer that rejects steps that go uphill would be sufficient for your purposes but could still run on the GPU.

@jchodera commented:

> That makes sense, thanks! For this PR (just so we can quickly deploy on alchemiscale), I'm going to say let's take the extra couple of seconds of performance hit, and then we can follow up with a fix that does exactly as @jchodera suggests? If anything, the performance loss of building/creating contexts is minuscule given we're using a wildly inefficient integrator (see #332).

This is a reasonable solution! The main concern here was ensuring there were no device resource leaks, not performance.

@IAlibay (Member) commented Apr 22, 2023

> There are definitely deficiencies with the FIRE minimizer, and we're hoping to be able to address those at some point in the near future. But PME systems will get very slow on the CPU, even for 100 steps.

I got a bit muddled here and had completely forgotten for this PR: it turns out that LocalEnergyMinimizer will fall back to the CPU by default if the forces overflow, so I ended up sticking with the old behaviour and just passing whatever the user-defined platform is.

> Can you elaborate on the kinds of "silent overflows" you ran into? This was with mixed precision? I wonder if a simpler gradient descent minimizer that rejects steps that go uphill would be sufficient for your purposes but could still run on the GPU.

@jchodera: If I remember correctly, the issue here was that the minimizer would essentially attempt to minimize but fail to do so. However, rather than throwing an exception or directly giving a NaN energy, it would just take the system to an unreasonable energy state; from that point, any further attempts to minimize would lead to a system that yields NaNs. I didn't manage to dig into this too much at the time, but avoiding the default MultiStateSampler.minimize FIRE -> local minimizer scheme by doing the pre-minimization separately with LocalEnergyMinimizer first worked fine.

I suspect the issue with the FIRE minimizer is that, similar to LocalEnergyMinimizer on non-CPU platforms, the forces overflow but not the energy accumulator, leading to this weird behaviour.

(I wish I had taken much better notes when I encountered it; I'll try to reproduce this again once things are a bit less hectic.)
