WIP / DONT MERGE: Switch from userspace to kernelspace BPF filtering #8
Conversation
Useful when compiling against a library from a custom directory, for instance a modified libpcap. Signed-off-by: Linus Lüssing <[email protected]>
This allows reusing a previously defined filter, for instance:

MDNS;udp port 5353
MDNS4;ip && ${MDNS}
MDNS6;ip6 && ${MDNS}

Signed-off-by: Linus Lüssing <[email protected]>
Previously, if no packets were received on an interface, bpfcountd would hang: it would respond neither to a Ctrl+C (SIGINT) nor to a unix socket client, because it would keep blocking on pcap_dispatch().

Avoid this by introducing epoll for the pcap handler and the unix socket. epoll_wait() will return when there is a new batch of packets from the pcap handler, when there is a connection request on the unix socket, or when a signal handler interrupts it. This should also make the unix socket a lot more responsive.

The epoll infrastructure will also help us move the pcap filtering from userspace to kernelspace later, to improve performance: it allows us to easily create and handle multiple per-filter-rule pcap handles.

Signed-off-by: Linus Lüssing <[email protected]>
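For illustration, a minimal sketch of such an epoll loop, assuming a single pcap handle and unix socket; handle_packet() and accept_unix_client() are hypothetical placeholders, not the actual bpfcountd internals:

```c
/*
 * Minimal sketch of the epoll loop described above, assuming a single
 * pcap handle and unix socket. handle_packet() and accept_unix_client()
 * are hypothetical placeholders, not the actual bpfcountd internals.
 */
#include <sys/epoll.h>
#include <pcap/pcap.h>

void handle_packet(u_char *user, const struct pcap_pkthdr *h,
		   const u_char *bytes);
void accept_unix_client(int unix_sock);

int event_loop(pcap_t *pcap, int unix_sock)
{
	int pcap_fd = pcap_get_selectable_fd(pcap);
	int epfd = epoll_create1(0);
	struct epoll_event ev = { .events = EPOLLIN };
	int i, n;

	ev.data.fd = pcap_fd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, pcap_fd, &ev);
	ev.data.fd = unix_sock;
	epoll_ctl(epfd, EPOLL_CTL_ADD, unix_sock, &ev);

	for (;;) {
		struct epoll_event events[8];

		/* wakes up for packets, unix socket clients or signals */
		n = epoll_wait(epfd, events, 8, -1);
		if (n < 0)	/* e.g. EINTR after SIGINT: shut down */
			return 0;

		for (i = 0; i < n; i++) {
			if (events[i].data.fd == pcap_fd)
				pcap_dispatch(pcap, -1, handle_packet, NULL);
			else
				accept_unix_client(unix_sock);
		}
	}
}
```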
To increase performance, use pcap_setfilter() instead of pcap_offline_filter(): with the latter, all packets are copied to userspace, to bpfcountd, and the filtering is then done within bpfcountd. The former avoids this and applies the filtering in kernelspace already. This, together with a snaplen of zero, allows us to avoid copying the actual packet payload. Furthermore, any traffic which does not match any filter rule of bpfcountd should now be affected less performance-wise, as it does not create any event for bpfcountd at all.

+++ TODO/WIP: +++

This last point actually does not seem to hold in all situations; it only seems to be true when using very few filter rules. In my tests on an Intel i7 laptop with a veth interface pair and iperf3, kernelspace filtering in bpfcountd with n 'arp' filter rules affected my throughput results in the following way: 1 rule: -2.8%, 5 rules: -4.7%, 50 rules: -37.9%, 500 rules: -92.7%.

With userspace filtering, in contrast, I had a roughly constant 43% less throughput until bpfcountd reached 100% CPU usage (with each added rule the CPU usage of bpfcountd increases). Then throughput actually got faster again, probably due to packets dropped in libpcap. With kernelspace filtering, bpfcountd slumbered at 0% CPU usage, as expected.
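As a rough sketch of this approach (not the actual bpfcountd code), each filter rule could get its own pcap handle, with the compiled BPF program installed in kernelspace via pcap_setfilter() and a zero snaplen so that matched packets are counted without copying their payload:

```c
/*
 * Sketch: one pcap handle per filter rule, with the BPF program
 * attached in kernelspace via pcap_setfilter() and a zero snaplen,
 * so matched packets are counted without copying their payload.
 */
#include <pcap/pcap.h>
#include <stdio.h>

static pcap_t *open_rule_handle(const char *ifname, const char *rule)
{
	char errbuf[PCAP_ERRBUF_SIZE];
	struct bpf_program prog;
	pcap_t *p;

	p = pcap_create(ifname, errbuf);
	if (!p) {
		fprintf(stderr, "pcap_create: %s\n", errbuf);
		return NULL;
	}

	pcap_set_snaplen(p, 0);		/* we only count, no payload needed */
	pcap_set_immediate_mode(p, 1);
	if (pcap_activate(p) < 0)
		goto err;

	/* compile the rule and install it in the kernel */
	if (pcap_compile(p, &prog, rule, 1, PCAP_NETMASK_UNKNOWN) < 0 ||
	    pcap_setfilter(p, &prog) < 0)
		goto err;

	pcap_freecode(&prog);
	return p;
err:
	fprintf(stderr, "filter \"%s\": %s\n", rule, pcap_geterr(p));
	pcap_close(p);
	return NULL;
}
```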
pcap_close() is rather slow, it needs about 100 milliseconds. When using many rules and calling pcap_close() for each of them, this quickly accumulates to a noticeable delay on shutdown. Parallelising the pcap_close() calls via pthreads helps.
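A hedged sketch of that idea, assuming a plain handles[] array rather than the actual bpfcountd data structures:

```c
/*
 * Sketch of the shutdown speedup: close all pcap handles in parallel
 * instead of sequentially, since each pcap_close() takes roughly
 * 100 milliseconds.
 */
#include <pcap/pcap.h>
#include <pthread.h>
#include <stdlib.h>

static void *close_one(void *arg)
{
	pcap_close((pcap_t *)arg);	/* the slow part, ~100 ms each */
	return NULL;
}

static void close_all(pcap_t **handles, size_t num)
{
	pthread_t *threads = calloc(num, sizeof(*threads));
	size_t i;

	if (!threads)
		return;

	/* one thread per handle, so the ~100 ms delays overlap */
	for (i = 0; i < num; i++)
		pthread_create(&threads[i], NULL, close_one, handles[i]);
	for (i = 0; i < num; i++)
		pthread_join(threads[i], NULL);

	free(threads);
}
```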
An alternative idea, if we can't figure out why the number of pcap handles has such an impact on unrelated traffic, would be to add a new option -F "BPF expression" to bpfcountd. We would then install a single BPF rule with pcap_setfilter() in kernelspace which matches all the broadcast and mesh protocol traffic, but not the unicast traffic, and keep doing the more specific filtering and counting within bpfcountd via pcap_offline_filter(). That should then have a minimal impact on the unicast traffic.
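A rough sketch of what that could look like; the rule struct, NUM_RULES and the -F handling below are illustrative assumptions, not existing bpfcountd code:

```c
/*
 * Sketch of the proposed -F option: one coarse preselection filter
 * installed in kernelspace, fine-grained per-rule matching kept in
 * userspace via pcap_offline_filter().
 */
#include <pcap/pcap.h>
#include <stddef.h>

#define NUM_RULES 8	/* hypothetical rule count */

struct rule {
	struct bpf_program prog;	/* precompiled via pcap_compile() */
	unsigned long long packets;
	unsigned long long bytes;
};

/* install the single coarse filter (-F "BPF expression") in the kernel */
static int install_coarse_filter(pcap_t *p, const char *coarse_expr)
{
	struct bpf_program prog;
	int ret;

	if (pcap_compile(p, &prog, coarse_expr, 1, PCAP_NETMASK_UNKNOWN) < 0)
		return -1;

	ret = pcap_setfilter(p, &prog);
	pcap_freecode(&prog);
	return ret;
}

/*
 * Per-packet callback: count each preselected packet against every
 * matching rule in userspace, as before.
 */
static void handle_packet(u_char *user, const struct pcap_pkthdr *h,
			  const u_char *bytes)
{
	struct rule *rules = (struct rule *)user;
	size_t i;

	for (i = 0; i < NUM_RULES; i++) {
		if (!pcap_offline_filter(&rules[i].prog, h, bytes))
			continue;
		rules[i].packets++;
		rules[i].bytes += h->len;
	}
}
```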
I suppose the additional overhead is due to the necessary additional syscalls. At least that's what I had in mind when I was developing this daemon. However, at that point I didn't have in mind that a lot of the traffic might not match any of the rules. Fixing this would definitely be a very nice addition. Thanks for demonstrating the break-even point. While I like your idea of having a global, preselecting BPF in the kernel, I'm wondering if it would be better to use eBPF and do the counting in kernelspace as well. This would bring us to zero syscalls per packet. I found an example where someone did something similar: https://www.collabora.com/news-and-blog/blog/2019/04/05/an-ebpf-overview-part-1-introduction/ . They counted TCP, UDP and ICMP packets in their first example. Without having read it thoroughly, their code does not look very complicated. However, I am not sure how easy it will be to combine BPF and eBPF. What do you think?
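For reference, a tiny eBPF sketch in the spirit of the linked article's first example; this is an assumption-laden illustration (libbpf-style, compiled with clang's BPF target), not something bpfcountd does today:

```c
/*
 * Illustration: count packets in a kernel map so that userspace only
 * reads counters and no syscall happens per packet.
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} pkt_count SEC(".maps");

SEC("socket")
int count_pkt(struct __sk_buff *skb)
{
	__u32 key = 0;
	__u64 *val = bpf_map_lookup_elem(&pkt_count, &key);

	if (val)
		(*val)++;	/* per-CPU map, so no atomics needed */

	return 0;	/* pass 0 bytes to userspace: we only count */
}

char _license[] SEC("license") = "GPL";
```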
Hm, thinking about it again, I see your point. None of the unicast packets should be delivered to userspace in B), and therefore no additional syscalls should be necessary. I am curious about the response from the tcpdump team.
Hi,
I did some tests using kernelspace filtering (pcap_setfilter()) instead of the current userspace filtering within bpfcountd (pcap_offline_filter()).
I was hoping to reduce the impact of running bpfcountd on unrelated, unmatched traffic, so that maybe we could run bpfcountd on Freifunk gateways or Gluon mesh nodes to measure the broadcast and mesh routing protocol overhead without affecting the unicast performance / user experience.
However, the results I'm getting on my laptop are quite mixed: kernelspace filtering has less impact on the unrelated, unmatched traffic than the current approach - but only up to about 50 rules. Beyond that, kernelspace filtering actually has a worse effect on the unicast traffic than the current userspace filtering approach (with a single-core CPU the break-even point is probably at a few more rules).
The userspace filtering shows a roughly constant 43% reduction in unicast performance, no matter how many rules I use. Only at about 500 arp rules does bpfcountd use one CPU core at about 100%. And then the unicast throughput actually gets a bit faster, probably because libpcap starts to drop packets.
Nevertheless, the kernelspace filtering is far from having no impact on the unicast traffic, which I did not quite expect. I will probably ask on the tcpdump mailing list for an explanation.
Test protocol
Tests run on a laptop with a veth interface pair:
TX: iperf3 -t 15 -P 16 -c fe80::1%veth-tx (on veth-tx, fe80::2)
RX: iperf3 -s (on veth-rx, fe80::1)
LD_LIBRARY_PATH=/home/linus/dev-priv/libpcap/ ./bpfcountd -i veth-rx -f ./filter-test
Hardware:
Software:
5a89e579 Add support for B.A.T.M.A.N. Advanced
bcd6c3fe Mention DLT_LINUX_SLL2 in INSTALL.md [skip ci]
=> CommitDate: Mon Nov 2 09:54:24 2020 +0000
...
A) userspace filtering / pcap_offline_filter()
53faac0 Fix hangs on quiet interface with epoll for pcap handler + unix socket
...
=> https://github.com/T-X/bpfcountd/tree/test-userspace-filtering
B) kernelspace filtering / pcap_setfilter()
4776054 Speedup shutdown via pthreads
80b455e WIP / DONT MERGE: Switch from userspace to kernelspace BPF filtering
...
=> https://github.com/T-X/bpfcountd/tree/test-kernelspace-filtering
Note 1: iperf3's -P 16 does not seem to help with a veth pair: all threads of the
iperf3 client were still running on the same CPU core.
Note 2: Small variations in throughput occur because it's a laptop and thermal CPU
throttling kicks in quickly. However, short 15-second iperf3 runs seem to be fine;
the hard impact on throughput due to CPU throttling kicks in at about 30 seconds
of iperf3 runtime.
Note 3: The number of pcap handles seems to be limited to 510 on a Linux 5.7.19 kernel.
Therefore, test B / kernelspace results are not available for more than 500 rules.
0x arp / no bpfcountd:
73.8 Gbits/sec (-0.0%), CPU: 100% iperf3 server, 100% iperf3 client, no bpfcountd
1x arp:
A) 43.4 Gbits/sec (-41.2%), CPU: 100% iperf3 server, 100% iperf3 client, 8% bpfcountd
B) 71.7 Gbits/sec (-2.8%), CPU: 100% iperf3 server, 100% iperf3 client, 0% bpfcountd
5x arp:
A) 43.3 Gbits/sec (-41.3%), CPU: 100% iperf3 server, 100% iperf3 client, 8.5% bpfcountd
B) 70.3 Gbits/sec (-4.7%), CPU: 100% iperf3 server, 100% iperf3 client, 0% bpfcountd
10x arp:
A) 42.6 Gbits/sec (-42.3%), CPU: 100% iperf3 server, 100% iperf3 client, 9.5% bpfcountd
B) 65.7 Gbits/sec (-11.0%), CPU: 100% iperf3 server, 100% iperf3 client, 0% bpfcountd
50x arp:
A) 42.0 Gbits/sec (-43.1%), CPU: 100% iperf3 server, 100% iperf3 client, 20% bpfcountd
B) 45.8 Gbits/sec (-37.9%), CPU: 100% iperf3 server, 100% iperf3 client, 0% bpfcountd
100x arp:
A) 42.6 Gbits/sec (-42.3%), CPU: 100% iperf3 server, 100% iperf3 client, 30% bpfcountd
B) 31.0 Gbits/sec (-58.0%), CPU: 100% iperf3 server, 38% iperf3 client, 0% bpfcountd
250x arp:
A) 41.8 Gbits/sec (-43.4%), CPU: 100% iperf3 server, 100% iperf3 client, 70% bpfcountd
B) 11.1 Gbits/sec (-85.0%), CPU: 100% iperf3 server, 18.5% iperf3 client, 0% bpfcountd
500x arp:
A) 45.7 Gbits/sec (-38.1%), CPU: 100% iperf3 server, 100% iperf3 client, 100% bpfcountd
B) 5.39 Gbits/sec (-92.7%), CPU: 100% iperf3 server, 10% iperf3 client, 0% bpfcountd
750x arp:
A) 51.1 Gbits/sec (-30.8%), CPU: 100% iperf3 server, 100% iperf3 client, 100% bpfcountd
B) -
1000x arp:
A) 56.0 Gbits/sec (-24.1%), CPU: 100% iperf3 server, 100% iperf3 client, 100% bpfcountd
B) -
1500x arp:
A) 55.3 Gbits/sec (-25.1%), CPU: 100% iperf3 server, 100% iperf3 client, 100% bpfcountd
B) -
3000x arp:
A) 59.2 Gbits/sec (-19.8%), CPU: 100% iperf3 server, 100% iperf3 client, 100% bpfcountd
B) -