-
Notifications
You must be signed in to change notification settings - Fork 447
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix memory leak when using OpenSSL and threads #1942
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Norman Ashley <[email protected]>
Signed-off-by: Norman Ashley <[email protected]>
Signed-off-by: Norman Ashley <[email protected]>
This is present in the 0.11.0 release, I guess? Is it critical to fix? Do we have a CI job that tests this configuration? |
For what it's worth, I'm not able to reproduce the leak myself—@ashman-p could you let me know what Linux environment you're running so I can test various configs out in Docker images? |
@SWilson4 , Thanks for checking on this issue. liboqs Build cmake -DBUILD_SHARED_LIBS=ON -DOQS_USE_OPENSSL=ON -DCMAKE_BUILD_TYPE=Debug -GNinja .. |
@SWilson4, the problem occurs when use of OQS_USE_SHA3_OPENSSL=ON and threading is active. The only significance of the use of threads is that the problem went away when threading was disabled. |
OK, I've managed to reproduce the leak, but only when building against OpenSSL >= 3.3.2. In particular, the leak does not occur when building against OpenSSL 3.3.1 with the same configuration. It seems to me that this might actually be an OpenSSL bug rather than a liboqs bug, especially as the fix proposed here should (?) be unnecessary: per the OpenSSL docs:
Looping in @baentsch @romen @beldmit @levitte: any knowledge of a regression going from 3.3.1 to 3.3.2? I will try to isolate the exact commit which introduces the leak. |
At first glance nothing suspicious |
It seems that this leak was introduced in openssl/openssl@83efcfd. When building against the parent of that commit (openssl/openssl@a13df68) the leak does not occur. |
^ fyi @beldmit |
@nhorman could you please take a look? |
FWIW, i don't see anything expressly wrong with the changes, its fine to call OPENSSL_init_crypto in your application rather than having the library do it as needed, but I'm having a hard time understanding:
I'll try reproduce here, but since you already have it set up, can you re-run valgrind with --leak-check=full to show the stack trace of where the leaked memory is being allocated? |
Sure, here's the valgrind output:
OpenSSL is built with cd ~/openssl
git checkout 83efcfdfa1
./config --debug --prefix=`pwd`/../.localopenssl_83efcfdfa1_debug && make -j 4 && make install_sw install_ssldirs liboqs is built with cd ~/liboqs
git checkout 7f4c89b2
mkdir build && cd build
cmake -GNinja -DOPENSSL_ROOT_DIR=../.localopenssl_83efcfdfa1_debug -DOQS_MINIMAL_BUILD="SIG_ml_dsa_65" -DOQS_USE_SHA3_OPENSSL=ON -DCMAKE_BUILD_TYPE=Debug ..
ninja The triggering command is valgrind --leak-check=full ~/liboqs/build/tests/test_sig ML-DSA-65 |
Thank you. Thats interesting, I just tried to recreate with the head of the master branch, and didn't encounter the leak. Given that you reported the issue was introduced with commit 83efcfd (which is a backport of the original commit to the 3.3 branch, that makes me think we fixed something in master that we didn't backport. Let me see if I can find it |
Weird—I'm also getting the leak on the latest master (openssl/openssl@c262cc0). |
hmm, what version of valgrind are you using? I'm on valgrind-3.23.0 |
also, what does your openssl.conf look like? Its possible I'm not loading any modules, and as such not triggering the allocation thats leaking |
In fact, I'm sure its my config, I just ran it through gdb and never hit a breakpoint in ossl_rcu_read_lock |
There we go, if I load the oqs provider (duh, should have thought of that), I see the leak. Now to figure out why |
@nhorman, now getting some time to revisit this today. I missed the significance of your statement here. Anyhow, I will apply your patch and report back. |
@ashman-p I'm not sure, but I'm sufficiently unfamiliar with oqs to fully understand how you would use both together without making use of the liboqs provider. It's certainly possible that there are two issues here, but the fact that the errors.are identical makes me biased towards thinking the problems are simmilar if not identical |
Thanks again for taking the time to look into this @nhorman . The
|
ok, thank you for that. thats interesting, so liboqs is one of those interesting libraries for which applications may use the provider interface to access oqs, and liboqs in turn loads symbols from openssl to do its work. The implication here is that we may have two dependencies on openssl (listed via DT_NEEDED entries in the app and the liboqs provider), and those instances may point to different versions of libcrypto. That gets tricky. to make this a bit more concrete, and pull it back to the issue at hand, what config file does the liboqs instance of libcrypto load? Can you post it here? Dependent on how its built and initalized, it may be that the liboqs instance of libcrypto is loading a "vestigual" provider in this test environment, which is never used, but still triggers the allocation that we are failing to free (which would still point to the thread_stop solution we have above as the 'correct' solution to the problem. |
That absolutely is a problem (known to anyone integrating the whole stack and possibly utilizing distros -- which is not many people :) But it is not at play here as the issue occurs in "standalone" The config (build) and config file question for a reproducer I cannot answer: @ashman-p @SWilson4 ? |
@baentsch thats not entirely true, and was the point I was getting at. Even if the top level application doesn't use openssl, and the oqsprovider to access the liboqs library, there is still a path to load the provider. If:
Then, when all of those conditions are true, libcrypto will still load the oqs provider, even though the library doesn't use it, which will send you down the path here, and you'll get the leak So in short, if you have a config file that references the oqs provider (or any external provider, like fips), then this issue will be the result, even if the application being run doesn't actually make use of it @ashman-p can you identify the openssl.cnf file that you are using for the libcrypto instance that liboqs uses and post it here? That would clarify things greatly. |
Fully agree, @nhorman . It was my strong assumption that such circular setup has been excluded in testing as both @ashman-p and @SWilson4 know about the dependencies (and #1942 (comment) seems to show a standalone build -- and #1942 (comment) absolutely guarantees one). But you're right in checking this explicitly by asking for the config -- particularly considering the problem never materialized in CI. Also helpful may simply be the output of |
In my setup it was the vanilla openssl.cnf with the default provider. |
That would do it. the default provider gets loaded via: which then allocates the lock in question thats leaking I avoided that initially as I didn't have the default provider activated in my config. Once I loaded the oqs provider (which could have been any provider, it was just the one I chose, as I was working with oqs), the problem began to appear |
Here's the configuration in my environment. (Added the .txt extension for GitHub upload only.) |
yeah, that looks to be the same situation |
Ok, thank you @nhorman. So, my next question is ... base my tests with "OPENSSL_thread_stop()". Question: We have 2 work-around options... OPENSSL_thread_stop() and OPENSSL_init_crypto(). |
OK, I made an interesting observation: a similar leak also occurs when liboqs is built with So, to address Norm's question above:
It looks like the leak does indeed occur for SHA2 as well. It's just triggered when the hashing code is actually used. (Note that ML-DSA and basically all of the PQ algorithms make extensive use of SHA3.) |
I suppose this should read "...did NOT work.", right @ashman-p ? |
It's easier indeed -- but it'd really be good to understand why it is working. Without this understanding this may haunt us in the future again, say if some side effect in The "thread_stop" solution is clearly understandable and documented, would require 3 code changes in the test programs and one documentation change pointing to OpenSSLs guidance: Would that be too undesirable, @ashman-p ? |
+1 to @baentsch comment. I think the argument to go with the thread_stop approach is because it works, and seems to be fairly well understood at this point. I still don't really see how the OPENSSL_init_crypto approach solves the issue (despite the fact that it does seem to do so). Thats not to say you can't go with that issue, but if I were making the decision, I would at least want to better understand how it solves the problem before electing to go with it. |
And does it disappear likewise if |
That is correct. It did NOT work. Sorry for the omission. |
Yeah, i hear you. I am fine with us doing that. I wondered is @nhorman had any idea why the init calls seem to affect things? The other concern i have is for pre-existing applications. Seems possible that there would be some fallout? |
I'm actually particularly mystified about that. There should be no reason explicitly calling init from the application does anything special. The only thing I can think is that, in doing so, the dynamic loader subtly changes the resolution order of the libraries, so you get the version linked to the application, rather than the one linked to liboqs. If their built with different configs (one with no-atexit, one with atexit, or something like that) you might see this behavior. you might be able to observe that behavior using LD_DEBUG if that were the case |
My preference would be to go with the "thread_stop" solution, for the reasons above. Also, if the leak only occurs because of a flaw in our test programs and not because of a flaw in the library itself, then that's great news, and I would endorse patching it only there. I do wonder, though, if we ought to protect the user a little more—it seems wrong somehow that somebody who wants to use liboqs in a multithreaded application is required to
The only way I see to avoid this pain for the user is to add an analogous well-documented
Yes, it does. |
Sounds (more) sensible (than pointing to openssl) -- but doesn't alleviate users of the need to call this at the right time.
Thanks for checking/confirming, @SWilson4 !
I don't see which: The problem also is "pre-existing" -- so either it didn't bother folks (so a fix with a new API they call (or don't :) doesn't change anything) or they already properly cleaned up "behind themselves" as per OpenSSL guidance and only the OQS tests didn't do so far (again, not making a difference for other pre-existing apps). |
@SWilson4, I can complete the code changes if you haven't already done them. But can you work out the documentation? |
Signed-off-by: Norman Ashley <[email protected]>
Signed-off-by: Norman Ashley <[email protected]>
…king. Signed-off-by: Norman Ashley <[email protected]>
'run_tests' memory leak tests fails when build options enables OQS_USE_PTHREADS and OQS_USE_OPENSSL.
Fixes #1941.