
WIP / DONT MERGE: Switch from userspace to kernelspace BPF filtering #8

Draft: wants to merge 5 commits into master
Conversation

@T-X (Contributor) commented Dec 22, 2020

Hi,

I did some tests using kernelspace filtering (pcap_setfilter()) instead of the current userspace filtering within bpfcountd (pcap_offline_filter()).

I was hoping to reduce the impact of running bpfcountd on unrelated, unmatched traffic, so that we could maybe run bpfcountd on Freifunk gateways or Gluon mesh nodes to measure the broadcast and mesh routing protocol overhead without affecting unicast performance / user experience.

However, the results I'm getting on my laptop are quite mixed. There, kernelspace filtering has less impact on the unrelated, unmatched traffic than the current approach - but only up to about 50 rules. Beyond that, kernelspace filtering actually has a worse effect on the unicast traffic than the current userspace filtering approach (with a single-core CPU the break-even point is probably at a few more rules).

The userspace filtering causes a roughly constant 43% reduction in unicast performance, no matter how many rules I use. Only after about 500 arp rules does bpfcountd use one CPU core at about 100%; the unicast throughput then actually gets a bit faster again, probably because libpcap starts to drop packets.

Nevertheless, the kernelspace filtering is far from having no impact on the unicast traffic, which I did not quite expect. I'll probably ask on the tcpdump mailing list for explanations.


Test protocol

Tests run on a laptop with a veth interface pair:

TX: iperf3 -t 15 -P 16 -c fe80::1%veth-tx (on veth-tx, fe80::2)
RX: iperf3 -s (on veth-rx, fe80::1)
LD_LIBRARY_PATH=/home/linus/dev-priv/libpcap/ ./bpfcountd -i veth-rx -f ./filter-test

Hardware:

  • Thinkpad T480s
  • Intel(R) Core(TM) i7-8550U CPU (4 cores / 8 threads)

Software:

Note 1: iperf3's -P 16 does not seem to help with a veth pair; all threads of the
iperf3 client were still running on the same CPU core.

Note 2: Small variations in throughput occur because it's a laptop and thermal CPU
throttling kicks in quickly. However, short 15-second iperf runs seem to be fine; the
hard impact on throughput due to CPU throttling kicks in at about 30 seconds of iperf3 runtime.

Note 3: The number of pcap handles seems to be limited to 510 on a Linux 5.7.19 kernel.
Therefore, test B / kernelspace results are not available for more than 500 rules.

Legend: A) = userspace filtering (current, pcap_offline_filter()), B) = kernelspace filtering (pcap_setfilter()); percentages are relative to the baseline without bpfcountd.

0x arp / no bpfcountd:

73.8 Gbits/sec (-0.0%), CPU: 100% iperf3 server, 100% iperf3 client, no bpfcountd

1x arp:

A) 43.4 Gbits/sec (-41.2%), CPU: 100% iperf3 server, 100% iperf3 client, 8% bpfcountd
B) 71.7 Gbits/sec (-2.8%), CPU: 100% iperf3 server, 100% iperf3 client, 0% bpfcountd

5x arp:

A) 43.3 Gbits/sec (-41.3%), CPU: 100% iperf3 server, 100% iperf3 client, 8.5% bpfcountd
B) 70.3 Gbits/sec (-4.7%), CPU: 100% iperf3 server, 100% iperf3 client, 0% bpfcountd

10x arp:

A) 42.6 Gbits/sec (-42.3%), CPU: 100% iperf3 server, 100% iperf3 client, 9.5% bpfcountd
B) 65.7 Gbits/sec (-11.0%), CPU: 100% iperf3 server, 100% iperf3 client, 0% bpfcountd

50x arp:

A) 42.0 Gbits/sec (-43.1%), CPU: 100% iperf3 server, 100% iperf3 client, 20% bpfcountd
B) 45.8 Gbits/sec (-37.9%), CPU: 100% iperf3 server, 100% iperf3 client, 0% bpfcountd

100x arp:

A) 42.6 Gbits/sec (-42.3%), CPU: 100% iperf3 server, 100% iperf3 client, 30% bpfcountd
B) 31.0 Gbits/sec (-58.0%), CPU: 100% iperf3 server, 38% iperf3 client, 0% bpfcountd

250x arp:

A) 41.8 Gbits/sec (-43.4%), CPU: 100% iperf3 server, 100% iperf3 client, 70% bpfcountd
B) 11.1 Gbits/sec (-85.0%), CPU: 100% iperf3 server, 18.5% iperf3 client, 0% bpfcountd

500x arp:

A) 45.7 Gbits/sec (-38.1%), CPU: 100% iperf3 server, 100% iperf3 client, 100% bpfcountd
B) 5.39 Gbits/sec (-92.7%), CPU: 100% iperf3 server, 10% iperf3 client, 0% bpfcountd

750x arp:

A) 51.1 Gbits/sec (-30.8%), CPU: 100% iperf3 server, 100% iperf3 client, 100% bpfcountd
B) -

1000x arp:

A) 56.0 Gbits/sec (-24.1%), CPU: 100% iperf3 server, 100% iperf3 client, 100% bpfcountd
B) -

1500x arp:

A) 55.3 Gbits/sec (-25.1%), CPU: 100% iperf3 server, 100% iperf3 client, 100% bpfcountd
B) -

3000x arp:

A) 59.2 Gbits/sec (-19.8%), CPU: 100% iperf3 server, 100% iperf3 client, 100% bpfcountd
B) -

Useful when compiling against a library from a custom directory,
for instance a modified libpcap.

Signed-off-by: Linus Lüssing <[email protected]>
This allows reusing a previously defined filter, for instance:

MDNS;udp port 5353
MDNS4;ip && ${MDNS}
MDNS6;ip6 && ${MDNS}

Signed-off-by: Linus Lüssing <[email protected]>
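
A minimal sketch of how such ${NAME} references could be expanded against previously defined filters; the expand() helper and data layout here are hypothetical illustrations, not the actual bpfcountd parser:

```c
#include <stdio.h>
#include <string.h>

struct named_filter { const char *name; const char *expr; };

/* Replace each ${NAME} in 'in' with the expression of an earlier rule. */
static void expand(const char *in, const struct named_filter *defs, size_t n,
                   char *out, size_t out_len)
{
    size_t used = 0;
    out[0] = '\0';

    while (*in && used + 1 < out_len) {
        const char *end;

        if (in[0] == '$' && in[1] == '{' && (end = strchr(in + 2, '}'))) {
            size_t name_len = (size_t)(end - (in + 2));

            for (size_t i = 0; i < n; i++) {
                if (strlen(defs[i].name) == name_len &&
                    !strncmp(defs[i].name, in + 2, name_len)) {
                    used += snprintf(out + used, out_len - used, "%s",
                                     defs[i].expr);
                    break;
                }
            }
            in = end + 1;   /* skip past the closing brace */
        } else {
            out[used++] = *in++;
            out[used] = '\0';
        }
    }
}

int main(void)
{
    struct named_filter defs[] = { { "MDNS", "udp port 5353" } };
    char buf[256];

    expand("ip && ${MDNS}", defs, 1, buf, sizeof(buf));
    printf("%s\n", buf);    /* -> ip && udp port 5353 */
    return 0;
}
```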
Previously, if no packets were received on an interface, bpfcountd would
hang: it would respond neither to Ctrl+C (SIGINT) nor to a unix socket
client, because it would keep blocking in pcap_dispatch().

Avoid this by introducing epoll for the pcap handle and the unix socket.
epoll_wait() will return when there is a new packet (or batch of packets)
from the pcap handle, when there is a connection request on the unix
socket, or when a signal handler interrupts it.

This should also make the unix socket a lot more responsive.

The epoll infrastructure will also help us to move the pcap filtering from
userspace to kernelspace later, to improve performance. That is, epoll
allows us to easily handle multiple per-filter-rule pcap handles later.

Signed-off-by: Linus Lüssing <[email protected]>
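
A minimal sketch of the epoll-driven loop described above, assuming a single non-blocking pcap handle plus a signalfd for SIGINT (a unix socket listener fd would be added the same way); names and structure are illustrative assumptions, not the actual bpfcountd sources:

```c
#include <signal.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <sys/signalfd.h>
#include <unistd.h>
#include <pcap/pcap.h>

static void handle_packet(u_char *user, const struct pcap_pkthdr *hdr,
                          const u_char *bytes)
{
    (void)user; (void)bytes;
    printf("packet, len %u\n", hdr->len);
}

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    pcap_t *pcap = pcap_open_live("veth-rx", 65535, 1, 1000, errbuf);

    if (!pcap)
        return 1;
    pcap_setnonblock(pcap, 1, errbuf);

    /* Block SIGINT so it is only delivered via the signalfd. */
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGINT);
    sigprocmask(SIG_BLOCK, &mask, NULL);
    int sfd = signalfd(-1, &mask, 0);

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN };

    ev.data.fd = pcap_get_selectable_fd(pcap);
    epoll_ctl(epfd, EPOLL_CTL_ADD, ev.data.fd, &ev);
    ev.data.fd = sfd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, sfd, &ev);

    for (;;) {
        struct epoll_event events[8];
        int n = epoll_wait(epfd, events, 8, -1);

        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == sfd)
                goto out;   /* SIGINT: clean shutdown */
            /* readable pcap fd: process the pending batch of packets */
            pcap_dispatch(pcap, -1, handle_packet, NULL);
        }
    }
out:
    close(sfd);
    close(epfd);
    pcap_close(pcap);
    return 0;
}
```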
To increase performance, use pcap_setfilter() instead of
pcap_offline_filter(): with the latter, all packets are copied to
userspace, to bpfcountd, and the filtering is then done within
bpfcountd. The former avoids this and applies the filtering in
kernelspace already, which, together with a snaplen of zero,
avoids copying the actual packet payload.

Furthermore, any traffic which does not match any filter rule of
bpfcountd should now be affected less, performance-wise, as it does not
create any event for bpfcountd at all.

+++ TODO/WIP: +++

The latter actually does not seem to hold in all situations; it only
seems to be true with very few filter rules. In my tests on an Intel i7
laptop with a veth interface pair and iperf3, the throughput was
affected by a kernelspace-filtering bpfcountd with n 'arp' filter rules
as follows:

1 rule: -2.8%, 5 rules: -4.7%, 50 rules: -37.9%, 500 rules: -92.7%

With userspace filtering, in contrast, I had a roughly constant 43%
throughput reduction until bpfcountd reached 100% CPU usage (CPU usage
of bpfcountd increases with each added rule). Then throughput actually
got faster again, probably due to packets being dropped in libpcap.

For kernelspace filtering, bpfcountd idled at 0% CPU usage, as
expected.
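
For reference, a minimal sketch of the per-rule kernelspace filtering this commit describes, assuming one pcap handle per filter expression; the helper names are illustrative only. Note that a stock libpcap may silently raise a snapshot length of 0 back to the maximum, so the zero-snaplen behaviour may rely on the modified libpcap used in these tests:

```c
#include <stdio.h>
#include <pcap/pcap.h>

static pcap_t *open_rule_handle(const char *dev, const char *expr)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    struct bpf_program prog;
    /* snaplen 0: the kernel still reports every matching packet and its
     * original length, but (ideally) copies none of the payload */
    pcap_t *pcap = pcap_open_live(dev, 0, 1, 1000, errbuf);

    if (!pcap)
        return NULL;

    if (pcap_compile(pcap, &prog, expr, 1, PCAP_NETMASK_UNKNOWN) < 0 ||
        pcap_setfilter(pcap, &prog) < 0) {
        fprintf(stderr, "filter '%s': %s\n", expr, pcap_geterr(pcap));
        pcap_close(pcap);
        return NULL;
    }

    pcap_freecode(&prog);
    pcap_setnonblock(pcap, 1, errbuf);
    return pcap;
}

/* per-rule counting callback for pcap_dispatch(): user points to the
 * rule's counters ([0] = packets, [1] = bytes of the original packet) */
static void count_cb(u_char *user, const struct pcap_pkthdr *hdr,
                     const u_char *bytes)
{
    unsigned long long *counters = (unsigned long long *)user;

    (void)bytes;
    counters[0]++;
    counters[1] += hdr->len;
}
```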
pcap_close() is rather slow; it needs about 100 milliseconds. When using
many rules and calling pcap_close() for each of them, this quickly
accumulates into a noticeable delay on shutdown.

Parallelising the pcap_close() calls via pthreads helps.
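
A minimal sketch of such a parallel shutdown, with one thread per handle so the ~100 ms pcap_close() calls overlap instead of adding up; close_all() is a hypothetical helper name, not from the actual patch:

```c
#include <pthread.h>
#include <stdlib.h>
#include <pcap/pcap.h>

static void *close_one(void *arg)
{
    pcap_close((pcap_t *)arg);
    return NULL;
}

static void close_all(pcap_t **handles, size_t n)
{
    pthread_t *threads = calloc(n, sizeof(*threads));

    if (!threads) {             /* fall back to sequential close */
        for (size_t i = 0; i < n; i++)
            pcap_close(handles[i]);
        return;
    }

    /* one thread per handle; even with hundreds of rules this is cheap
     * compared to hundreds of sequential ~100 ms pcap_close() calls */
    for (size_t i = 0; i < n; i++)
        pthread_create(&threads[i], NULL, close_one, handles[i]);
    for (size_t i = 0; i < n; i++)
        pthread_join(threads[i], NULL);

    free(threads);
}
```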
@T-X (Contributor, Author) commented Dec 22, 2020

An alternative idea, if we can't figure out why the number of pcap handles has such an impact on unrelated traffic, would be to add a new option -F "BPF expression" to bpfcountd.

With it we would install a single BPF rule in kernelspace with pcap_setfilter() which matches all the broadcast and mesh protocol traffic, but not the unicast traffic, and keep doing the more specific filtering and counting within bpfcountd via pcap_offline_filter(). That should have minimal impact on the unicast traffic.
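
A minimal sketch of that idea, assuming hypothetical helper names rather than actual bpfcountd code: one coarse prefilter installed in the kernel with pcap_setfilter(), while the per-rule filters stay compiled in userspace and are evaluated with pcap_offline_filter() on every packet the prefilter lets through.

```c
#include <stddef.h>
#include <pcap/pcap.h>

struct counted_rule {
    struct bpf_program prog;              /* compiled per-rule filter */
    unsigned long long packets, bytes;
};

/* install the -F expression as the kernel-side prefilter */
static int setup_prefilter(pcap_t *pcap, const char *prefilter_expr)
{
    struct bpf_program prog;
    int ret;

    if (pcap_compile(pcap, &prog, prefilter_expr, 1, PCAP_NETMASK_UNKNOWN) < 0)
        return -1;

    ret = pcap_setfilter(pcap, &prog);
    pcap_freecode(&prog);
    return ret;
}

/* fine-grained matching and counting stays in userspace, as today */
static void count_packet(struct counted_rule *rules, size_t n,
                         const struct pcap_pkthdr *hdr, const u_char *bytes)
{
    for (size_t i = 0; i < n; i++) {
        if (pcap_offline_filter(&rules[i].prog, hdr, bytes)) {
            rules[i].packets++;
            rules[i].bytes += hdr->len;
        }
    }
}
```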

@lemoer (Owner) commented Dec 23, 2020

I suppose the additional overhead is due to the necessary additional syscalls. At least that's what I had in mind when I was developing this daemon. However, at that point I didn't consider that a lot of the traffic might not match any of the rules. Fixing this would definitely be a very nice addition. Thanks for demonstrating the break-even point.

While I like your idea of having a global, preselecting BPF filter in the kernel, I'm wondering if it would be better to use eBPF and do the counting in kernelspace as well. This would bring us to zero syscalls per packet. I found an example where someone did something similar: https://www.collabora.com/news-and-blog/blog/2019/04/05/an-ebpf-overview-part-1-introduction/ . They counted TCP, UDP and ICMP packets in their first example. Without having read it thoroughly, their code does not look very complicated. However, I am not sure how easy it will be to combine BPF and eBPF.
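
For illustration, a minimal eBPF socket-filter sketch in the spirit of that article, assuming a libbpf-style build (clang -O2 -g -target bpf); it only counts one global packets/bytes pair in a map that userspace can read periodically, so no packet data needs to cross into userspace at all. This is not a drop-in for bpfcountd's per-rule counting:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct counters {
    __u64 packets;
    __u64 bytes;
};

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, struct counters);
} counts SEC(".maps");

SEC("socket")
int count_pkt(struct __sk_buff *skb)
{
    __u32 key = 0;
    struct counters *c = bpf_map_lookup_elem(&counts, &key);

    if (c) {
        __sync_fetch_and_add(&c->packets, 1);
        __sync_fetch_and_add(&c->bytes, skb->len);
    }

    /* socket filter return value = bytes to pass to the socket;
     * 0 means nothing is copied to userspace */
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```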

What do you think?

@lemoer (Owner) commented Dec 23, 2020

Hm, thinking about it again, I see your point. None of the unicast packets should be delivered to userspace in B), and therefore no additional syscalls should be necessary. I am curious about the response from the tcpdump team.
