Embedded swarm DNS does not fail over to secondary properly on RHEL7 #2663

Open
erikanderson opened this issue Jun 24, 2022 · 5 comments


erikanderson commented Jun 24, 2022

OS: RHEL7
Docker Version: 20.10.17

Problem: When the primary DNS server is down, the embedded DNS server returns a timeout even though the secondary is available.

Reproduction (on a RHEL7 host; I got a trial subscription to download RHEL 7: https://access.redhat.com/downloads/content/69/ver=/rhel---7/7.9/x86_64/packages):

  1. Ensure iptables is in use: https://tecadmin.net/install-and-use-iptables-on-centos-rhel-7/
docker swarm init
docker network create -d overlay dns-test-network --attachable
docker run --network dns-test-network -it openjdk:11 /bin/bash

cat > DNSLookup.java <<'EOF'
import java.net.InetAddress;
import java.net.UnknownHostException;

public class DNSLookup
{
    public static void main(String[] args)
    {
        System.out.println("DNS Lookup Test");
        try {
            // Resolve the name and print the resulting address
            System.out.println(InetAddress.getByName("example.com"));
        } catch (UnknownHostException e) {
            // Resolution failed; print the exception instead of a stack trace
            System.err.println(e);
        }
    }
}
EOF

javac DNSLookup.java
java DNSLookup

This works as expected. However, if we simulate a failure of the primary DNS server with iptables, the results are not what we would expect.

Drop traffic to the primary DNS server (e.g. 10.10.10.10):

iptables -I DOCKER-USER -p udp -d 10.10.10.10 --dport 53 -j DROP

Re-run java DNSLookup in the container. Intermittently, but most of the time, we get:

Temporary failure in name resolution
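
Since the failure is intermittent, a single run says little about how often it happens. Below is a minimal sketch of a loop that repeats the lookup and counts failures. The class name LookupLoop and the attempt count are placeholders that are not part of the original report, and the sketch assumes the JVM's DNS caches can be turned off with the two security properties shown (they have to be set before the first lookup in that JVM).

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;

// Hypothetical helper, not part of the original report.
final class LookupLoop {
    // Returns how many of `attempts` lookups of `host` threw UnknownHostException.
    static int countFailures(String host, int attempts) {
        // Turn off the JVM's positive and negative DNS caches so each call goes
        // to the resolver. This only takes effect if set before the first lookup.
        Security.setProperty("networkaddress.cache.ttl", "0");
        Security.setProperty("networkaddress.cache.negative.ttl", "0");
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                InetAddress.getByName(host);
            } catch (UnknownHostException e) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "example.com";
        int attempts = 20; // arbitrary sample size
        int failures = countFailures(host, attempts);
        System.out.println(failures + " of " + attempts + " lookups failed for " + host);
    }
}
```

Running it once with the DROP rule in place and once without should make the difference in failure rate easier to compare than individual runs of DNSLookup.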

The debug logs show an i/o timeout to the primary (replaced here with 10.10.10.10). The resolver then tries the secondary (replaced here with 10.10.10.20) and gets a result, but it continues to query both the primary and the secondary with the search domain appended, which means the successful response was never returned to the container:

level=debug msg="Name To resolve: example.com."
level=debug msg="[resolver] query example.com. (A) from 172.18.0.3:39558, forwarding to udp:10.10.10.10"
level=debug msg="Name To resolve: example.com."
level=debug msg="[resolver] query example.com. (A) from 172.18.0.3:58022, forwarding to udp:10.10.10.10"
level=debug msg="Name To resolve: example.com.search.com."
level=debug msg="[resolver] query example.com.search.com. (A) from 172.18.0.3:56944, forwarding to udp:10.10.10.10"
level=debug msg="[resolver] read from DNS server failed, read udp 172.18.0.3:39558->10.10.10.10:53: i/o timeout"
level=debug msg="[resolver] query example.com. (A) from 172.18.0.3:60164, forwarding to udp:10.10.10.20"
level=debug msg="[resolver] received A record \"10.1.1.1\" for \"example.com\" from udp:10.10.10.20"
level=debug msg="Name To resolve: example.com.search.com."
level=debug msg="[resolver] query example.com.search.com. (A) from 172.18.0.3:51365, forwarding to udp:10.10.10.10"
level=debug msg="[resolver] read from DNS server failed, read udp 172.18.0.3:58022->10.10.10.10:53: i/o timeout"
level=debug msg="[resolver] query example.com. (A) from 172.18.0.3:37294, forwarding to udp:10.10.10.20"
level=debug msg="[resolver] received A record \"10.1.1.1\" for \"example.com\" from udp:10.10.10.20"
level=debug msg="[resolver] read from DNS server failed, read udp 172.18.0.3:56944->10.10.10.10:53: i/o timeout"
level=debug msg="[resolver] query example.com.search.com. (A) from 172.18.0.3:50534, forwarding to udp:10.10.10.20"
level=debug msg="[resolver] external DNS udp:10.10.10.20 responded with NXDOMAIN for \"example.com.search.com.\""
level=debug msg="[resolver] external DNS udp:10.10.10.20 did not return any A records for \"example.com.search.com.\""
level=debug msg="[resolver] read from DNS server failed, read udp 172.18.0.3:51365->10.10.10.10:53: i/o timeout"
level=debug msg="[resolver] query example.com.search.com. (A) from 172.18.0.3:32985, forwarding to udp:10.10.10.20"
level=debug msg="[resolver] external DNS udp:10.10.10.20 responded with NXDOMAIN for \"example.com.search.com.\""
level=debug msg="[resolver] external DNS udp:10.10.10.20 did not return any A records for \"example.com.search.com.\""

So when it got a valid response from the secondary DNS server (lines 8 and 9 of the log above), it should have stopped, and things would have worked:

level=debug msg="[resolver] query example.com. (A) from 172.18.0.3:60164, forwarding to udp:10.10.10.20"
level=debug msg="[resolver] received A record \"10.1.1.1\" for \"example.com\" from udp:10.10.10.20"
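
To check other captures for the same pattern, here is a rough sketch that finds the first "received A record" line in a debug log and counts how many queries were still forwarded upstream after it. The class ResolverLogScan is a hypothetical helper, not a Docker tool, and it only matches the two message fragments that appear in the logs in this issue.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for reading dockerd debug output; not a Docker tool.
final class ResolverLogScan {
    // Index of the first line reporting a received A record, or -1 if none.
    static int firstAnswerIndex(List<String> lines) {
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).contains("received A record")) {
                return i;
            }
        }
        return -1;
    }

    // Number of queries forwarded upstream after the given line index.
    static int forwardsAfter(List<String> lines, int index) {
        int count = 0;
        for (int i = index + 1; i < lines.size(); i++) {
            if (lines.get(i).contains("forwarding to")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Read the log from standard input, one entry per line.
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        int first = firstAnswerIndex(lines);
        if (first < 0) {
            System.out.println("no A record received");
        } else {
            System.out.println("first answer on line " + (first + 1)
                    + ", queries forwarded afterwards: " + forwardsAfter(lines, first));
        }
    }
}
```

Fed the failing log above on standard input, this reports the first answer on line 9 with 4 queries forwarded afterwards.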

We know that replacing 127.0.0.11 (the Docker embedded DNS) with the nameservers from the host's /etc/resolv.conf works, but ideally we would like to find a way forward that still lets us use the Docker embedded DNS.

Edit: It does work from time to time. This is the log from a working run:

level=debug msg="Name To resolve: example.com."
level=debug msg="[resolver] query example.com. (A) from 172.18.0.3:53936, forwarding to udp:10.10.10.10"
level=debug msg="Name To resolve: example.com."
level=debug msg="[resolver] query example.com. (A) from 172.18.0.3:37429, forwarding to udp:10.10.10.10"
level=debug msg="[resolver] read from DNS server failed, read udp 172.18.0.3:53936->10.10.10.10:53: i/o timeout"
level=debug msg="[resolver] query example.com. (A) from 172.18.0.3:46871, forwarding to udp:10.10.10.20"
level=debug msg="[resolver] received A record \"10.1.1.1\" for \"example.com.\" from udp:10.10.10.20"

erikanderson commented Jun 28, 2022

We did some more digging and it appears that the rotate option the container inherits from the host is causing this.

In the container, this /etc/resolv.conf doesn't work:

search example.com
nameserver 127.0.0.11
options rotate timeout:2 ndots:0

This /etc/resolv.conf works:

search example.com
nameserver 127.0.0.11
options timeout=2 ndots:0

It looks like this isn't the first time RHEL has had issues with the rotate option: https://bugzilla.redhat.com/show_bug.cgi?id=841787. So there may be a bug in Docker on RHEL7 when rotate is set while using swarm mode.

Edit: The second one was only working because the timeout syntax in /etc/resolv.conf was wrong (timeout=2 should be timeout:2), so it was reverting to the default timeout of 5 seconds.
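
To illustrate why the misspelled option behaved differently, here is a minimal sketch of how a resolver that only recognizes the timeout:n form would read the two options lines above. ResolvOptions is a hypothetical model for illustration, not glibc's actual parser, and it assumes unrecognized tokens are silently ignored and that the default is the 5 seconds mentioned in this thread.

```java
// Hypothetical model of reading the timeout option; not glibc's actual code.
final class ResolvOptions {
    // Default taken from this thread (5 seconds).
    static final int DEFAULT_TIMEOUT_SECONDS = 5;

    // Returns the timeout a resolver would use for the given "options ..." line,
    // assuming only the "timeout:n" spelling is recognized.
    static int timeoutSeconds(String optionsLine) {
        int timeout = DEFAULT_TIMEOUT_SECONDS;
        for (String token : optionsLine.trim().split("\\s+")) {
            if (token.startsWith("timeout:")) {
                try {
                    timeout = Integer.parseInt(token.substring("timeout:".length()));
                } catch (NumberFormatException e) {
                    // Malformed value: keep the previous setting in this sketch.
                }
            }
            // Anything else, including "timeout=2", is ignored here.
        }
        return timeout;
    }

    public static void main(String[] args) {
        System.out.println(timeoutSeconds("options rotate timeout:2 ndots:0")); // prints 2
        System.out.println(timeoutSeconds("options timeout=2 ndots:0"));        // prints 5
    }
}
```

Under these assumptions, the "working" file was effectively running with a 5 second timeout rather than 2, which is why dropping rotate looked like the fix.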

erikanderson changed the title from "Embedded swarm DNS does not fail over to secondary properly" to "Embedded swarm DNS does not fail over to secondary properly on RHEL7" on Jun 28, 2022
@erikanderson
Author

Reproduced using a clean image of RHEL7. With the primary dropping traffic, the difference between working and not working was the timeout value.
Doesn't work:

timeout:2

Works:

timeout:3

So I'm not sure what kind of weird race condition is happening.
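
One way to reason about a race like this, purely as a sketch: compare how long the client keeps waiting with how long it takes the embedded resolver to give up on the primary and get an answer from the secondary. Everything in FailoverRace below is a hypothetical model. The retry count follows the default of 2 that the resolv.conf man page documents for attempts, while the upstream wait and secondary latency are assumed figures that have not been checked against the Docker source.

```java
// Hypothetical timing model; the upstream wait and latency are assumptions,
// not values taken from Docker or glibc source.
final class FailoverRace {
    // Total time the client keeps waiting, assuming each attempt waits the full timeout.
    static double clientBudgetSeconds(double timeoutSeconds, int attempts) {
        return timeoutSeconds * attempts;
    }

    // True when the client stops waiting before a failover answer could arrive.
    static boolean clientGivesUpFirst(double timeoutSeconds, int attempts,
                                      double upstreamWaitSeconds,
                                      double secondaryLatencySeconds) {
        double answerArrives = upstreamWaitSeconds + secondaryLatencySeconds;
        return clientBudgetSeconds(timeoutSeconds, attempts) < answerArrives;
    }

    public static void main(String[] args) {
        int attempts = 2;               // documented default for the attempts option
        double upstreamWait = 4.0;      // assumed wait on the dropped primary, in seconds
        double secondaryLatency = 0.05; // assumed round trip to the secondary, in seconds
        System.out.println(clientGivesUpFirst(2.0, attempts, upstreamWait, secondaryLatency)); // prints true
        System.out.println(clientGivesUpFirst(3.0, attempts, upstreamWait, secondaryLatency)); // prints false
    }
}
```

If the real numbers are anywhere near these, timeout:2 would leave the client budget almost equal to the failover time, so small scheduling differences could decide whether the answer arrives in time. That would be consistent with the intermittent behavior, but it is only a guess until the actual upstream timeout is confirmed.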


erikanderson commented Jun 29, 2022

Since I was able to reproduce this on vanilla RHEL7, I'm going to reopen.
Broken (timeout:2): [image: rhel7_vanilla_broken]
Working (timeout:3): [image: rhel7_vanilla_working]

erikanderson reopened this on Jun 29, 2022
@olljanat
Contributor

The code from here has mostly been moved to moby/moby (see #2665), so that would probably be a better place to report this as well.

However, what is the default timeout value on RHEL 7?

@erikanderson
Author

Thank you @olljanat, I will crosspost this issue there.

The default timeout is set to 5 seconds.
