Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A long delay in CAddrMan::Select() causes p2p messages to be blocked #423

Open
dimxy opened this issue May 6, 2021 · 8 comments
Open

A long delay in CAddrMan::Select() causes p2p messages to be blocked #423

dimxy opened this issue May 6, 2021 · 8 comments

Comments

@dimxy
Copy link
Collaborator

dimxy commented May 6, 2021

There is a problem with addrman locks in komodod.

A delay exists in addrman.Select() function (

const int kRetrySleepInterval = 100; // milliseconds
), which locks the addrman object for a long time (up to 2-3 min). So when the addrman is locked by Select() and if getaddr p2p cmd tries to lock addrman then p2p processing stops for a long time.

This is most noticeable for nspv clients which might timeout waiting for response from a nspv server.

The reason for this delay is explained in this zcash fix zcash#929.
However I checked the current bitcoin code, they do not seem to have such delays in addrman. Suggest removing it.
The addman.Select() is called quite regularly when the node is trying to maintain connections to peers.

@dimxy dimxy changed the title A long delay in CAddrMan::Select() causes p2p messages blocked A long delay in CAddrMan::Select() causes p2p messages to be blocked May 6, 2021
@jmjatlanta
Copy link

jmjatlanta commented May 14, 2021

A possible solution is to move the max retries loop outside the lock. This will slow down the P2P stuff, but not stop it. Note that link is totally untested code.

A note here is that it appears a mock RNG is necessary for testing. A possible option is to make this class a template, and apply the appropriate RNG at compile time. This would move the code for testing into the test and out of the implementation.

@dimxy
Copy link
Collaborator Author

dimxy commented May 15, 2021

I thought we could simply use the Select_() from bitcoin source, it is actually the same but has no any delays

@dimxy
Copy link
Collaborator Author

dimxy commented Jan 11, 2022

more findings on this issue:
basically it may affect connectivity: because of the long delay in the CAddrMan::Select() the komodod makes reconnect tries also with long delays and this may cause network partitioning.
However, the backward effect is when the delay removed (like in other bitcoin forks) there may be extra cpu load if there are unreachable seeds (like unused kmd seeds in assets chain). That is why this fix should be accompanied with disabling kmd seeds for assets chains

@dimxy
Copy link
Collaborator Author

dimxy commented Jan 19, 2022

yet more info about connection management:
actually trying connections occurs until outbound connections number becomes 16, after that OpenConnectionsThread starts waiting on lock (and does not waste resources) until the connection number reduces.
So some extra cpu spending on connection open can be only on small networks (<16 conns) and zcash tried to avoid it by the delay in Select, but I still believe we should not lock the address manager for so long and should remove the delay.
Also for assets chain we should disable kmd chain's dns seeds and fixed seeds, to avoid unneeded connection tries

@dimxy
Copy link
Collaborator Author

dimxy commented Feb 3, 2022

The issue is even worse though:
if a net is small so the number of connections < 16 the address manager object is regularly locked up to several minutes (because of the delay in Select()). When this happens an incoming "getaddr" cmd will be waiting for this lock and the whole received message processing loop stops until it is unlocked!

@DeckerSU
Copy link

DeckerSU commented Feb 3, 2022

Just some info from where it comes:

CAddrInfo CAddrMan::Select_(bool newOnly)
{
    if (size() == 0)
        return CAddrInfo();

    if (newOnly && nNew == 0)
        return CAddrInfo();

    // Use a 50% chance for choosing between tried and new table entries.
    if (!newOnly &&
       (nTried > 0 && (nNew == 0 || RandomInt(2) == 0))) { 
        // use a tried node
        double fChanceFactor = 1.0;
        while (1) {
            int nKBucket = RandomInt(ADDRMAN_TRIED_BUCKET_COUNT);
            int nKBucketPos = RandomInt(ADDRMAN_BUCKET_SIZE);
            while (vvTried[nKBucket][nKBucketPos] == -1) {
                nKBucket = (nKBucket + insecure_rand()) % ADDRMAN_TRIED_BUCKET_COUNT;
                nKBucketPos = (nKBucketPos + insecure_rand()) % ADDRMAN_BUCKET_SIZE;
            }
            int nId = vvTried[nKBucket][nKBucketPos];
            assert(mapInfo.count(nId) == 1);
            CAddrInfo& info = mapInfo[nId];
            if (RandomInt(1 << 30) < fChanceFactor * info.GetChance() * (1 << 30))
                return info;
            fChanceFactor *= 1.2;
        }
    } else {
        // use a new node
        double fChanceFactor = 1.0;
        while (1) {
            int nUBucket = RandomInt(ADDRMAN_NEW_BUCKET_COUNT);
            int nUBucketPos = RandomInt(ADDRMAN_BUCKET_SIZE);
            while (vvNew[nUBucket][nUBucketPos] == -1) {
                nUBucket = (nUBucket + insecure_rand()) % ADDRMAN_NEW_BUCKET_COUNT;
                nUBucketPos = (nUBucketPos + insecure_rand()) % ADDRMAN_BUCKET_SIZE;
            }
            int nId = vvNew[nUBucket][nUBucketPos];
            assert(mapInfo.count(nId) == 1);
            CAddrInfo& info = mapInfo[nId];
            if (RandomInt(1 << 30) < fChanceFactor * info.GetChance() * (1 << 30))
                return info;
            fChanceFactor *= 1.2;
        }
    }
}

I guess we should revert ZCash jl777#929 changes and try to reproduce 100% CPU load when offline and investigate what Bitcoin had to do with that issue.

@dimxy
Copy link
Collaborator Author

dimxy commented Feb 3, 2022

A higher cpu load which described by zcash, I believe, occurs only on small chains. On big chains the connection number quickly reaches the limit and the node stops trying connecting, so I think for bitcoin (or kmd) this is not an issue
I ran this fix on a small asset chain and saw some extra load but I think it happens because for assets chains we do not remove the default seeds (so nodes try and try them without success) - so I suggest removing the seeds for assets chains too

@dimxy
Copy link
Collaborator Author

dimxy commented Apr 9, 2023

I return back to this issue as it is a bit annoying on test chains, preventing nodes from connection.

As it was suggested above I tried to check if the same issue is addressed somehow in bitcoin (namely CConnman::ThreadOpenConnections):
I would say they did not bother about it: bitcoin has a similar loop, with more complicated code for selection of the conn type, but generally the code looks the same as it is in zcash/komodo, with the same 500 ms sleep before starting a new address selection, but with no delay in CAddrMan::Select_().

I also investigated what happens on a small chain with the delay in Select_() removed: I would not say that any extra heady load was seen. BTW I saw an extra load and frequent connections on test asset chains, in past, when we had invalid seeds left in our code. Obviously, we should not any unneeded seeds for komodo or asset chain, to avoid extra conn tries.

I still think we should remove this delay in Select_ anyway, because I think it is in fact dangerous code when p2p pump is stopped for minutes, which probably may be exploited.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants