1 00:00:00,099 --> 00:00:15,690 *34c3 intro* 2 00:00:15,690 --> 00:00:20,270 Herald: All right, now it's my great pleasure to introduce Paul Emmerich who is 3 00:00:20,270 --> 00:00:26,520 going to talk about "Demystifying Network Cards". Paul is a PhD student at the 4 00:00:26,520 --> 00:00:33,660 Technical University in Munich. He's doing all kinds of network related stuff and 5 00:00:33,660 --> 00:00:37,950 hopefully today he's gonna help us make network cards a bit less of a black box. 6 00:00:37,950 --> 00:00:48,530 So, please give a warm welcome to Paul *applause* 7 00:00:48,530 --> 00:00:50,559 Paul: Thank you and as the introduction 8 00:00:50,559 --> 00:00:54,649 already said I'm a PhD student and I'm researching performance of software packet 9 00:00:54,649 --> 00:00:58,319 processing and forwarding systems. That means I spend a lot of time doing 10 00:00:58,319 --> 00:01:02,559 low-level optimizations and looking into what makes a system fast, what makes it 11 00:01:02,559 --> 00:01:05,980 slow, what can be done to improve it and I'm mostly working on my packet 12 00:01:05,980 --> 00:01:09,770 generator MoonGen I have some cross promotion of a lightning 13 00:01:09,770 --> 00:01:13,490 talk about this on Saturday but here I have this long slot 14 00:01:13,490 --> 00:01:17,550 and I brought a lot of content here so I have to talk really fast so sorry for the 15 00:01:17,550 --> 00:01:20,560 translators and I hope you can mainly follow along 16 00:01:20,560 --> 00:01:24,920 So: this is about Network cards meaning network cards you all have seen. This is a 17 00:01:24,920 --> 00:01:30,369 usual 10G network card with the SFP+ port and this is a faster network card with a 18 00:01:30,369 --> 00:01:35,359 QSFP+ port. This is 20, 40, or 100G and now you bought this fancy network 19 00:01:35,359 --> 00:01:38,229 card, you plug it into your server or your macbook or whatever, 20 00:01:38,229 --> 00:01:41,520 and you start your web server that serves cat pictures and cat videos. 21 00:01:41,520 --> 00:01:45,739 You all know that there's a whole stack of protocols that your cat picture has to go 22 00:01:45,739 --> 00:01:48,089 through until it arrives at a network card at the bottom 23 00:01:48,089 --> 00:01:52,120 and the only thing that I care about are the lower layers. I don't care about TCP, 24 00:01:52,120 --> 00:01:55,520 I have no idea how TCP works. Well I have some idea how it works, but 25 00:01:55,520 --> 00:01:57,701 this is not my research, I don't care about it. 26 00:01:57,701 --> 00:02:01,280 I just want to look at individual packets and the highest thing I look at it's maybe 27 00:02:01,280 --> 00:02:07,729 an IP address or maybe a part of the protocol to identify flows or anything. 28 00:02:07,729 --> 00:02:11,050 Now you might wonder: Is there anything even interesting in these lower layers? 29 00:02:11,050 --> 00:02:15,080 Because people nowadays think that everything runs on top of HTTP, 30 00:02:15,080 --> 00:02:19,160 but you might be surprised that not all applications run on top of HTTP. 
31 00:02:19,160 --> 00:02:23,380 There is a lot of software that needs to run at these lower levels and in 32 00:02:23,380 --> 00:02:26,150 recent years there has been a trend of moving network 33 00:02:26,150 --> 00:02:30,810 infrastructure stuff from specialized hardware black boxes to open software 34 00:02:30,810 --> 00:02:33,220 boxes, and examples for such software that was 35 00:02:33,220 --> 00:02:37,780 hardware in the past are: routers, switches, firewalls, middleboxes and so on. 36 00:02:37,780 --> 00:02:40,420 If you want to look up the relevant buzzwords: It's Network Function 37 00:02:40,420 --> 00:02:45,850 Virtualization, as it's called, and this is a trend of the recent years. 38 00:02:45,850 --> 00:02:50,610 Now let's say we want to build our own fancy application on that low-level thing. 39 00:02:50,610 --> 00:02:55,120 We want to build our firewall router packet forward modifier thing that does 40 00:02:55,120 --> 00:02:59,410 whatever is useful on that lower layer for network infrastructure, 41 00:02:59,410 --> 00:03:03,760 and I will use this application as a demo application for this talk, as everything 42 00:03:03,760 --> 00:03:08,310 will be about this hypothetical router firewall packet forward modifier thing. 43 00:03:08,310 --> 00:03:11,800 What it does: It receives packets on one or multiple network interfaces, it does 44 00:03:11,800 --> 00:03:16,270 stuff with the packets - filter them, modify them, route them - 45 00:03:16,270 --> 00:03:19,980 and send them out to some other port or maybe the same port or maybe multiple 46 00:03:19,980 --> 00:03:23,140 ports - whatever these low-level applications do. 47 00:03:23,140 --> 00:03:27,540 And this means the application operates on individual packets, not a stream of TCP 48 00:03:27,540 --> 00:03:31,300 packets, not a stream of UDP packets, and it has to cope with small packets. 49 00:03:31,300 --> 00:03:34,200 Because that's just the worst case: You get a lot of small packets. 50 00:03:34,200 --> 00:03:37,760 Now you want to build the application. You go to the Internet and you look up: How to 51 00:03:37,760 --> 00:03:41,290 build a packet forwarding application? The internet tells you: There is the 52 00:03:41,290 --> 00:03:46,040 socket API, the socket API is great and it allows you to get packets to your program. 53 00:03:46,040 --> 00:03:50,080 So you build your application on top of the socket API. You are in userspace, you use 54 00:03:50,080 --> 00:03:52,930 your socket, the socket talks to the operating system, 55 00:03:52,930 --> 00:03:56,030 the operating system talks to the driver and the driver talks to the network cards, 56 00:03:56,030 --> 00:03:59,340 and everything is fine, except that it isn't, 57 00:03:59,340 --> 00:04:02,080 because what it really looks like if you build this application: 58 00:04:02,080 --> 00:04:07,460 There is this huge scary big gap between user space and kernel space and you 59 00:04:07,460 --> 00:04:13,170 somehow need your packets to go across that without being eaten. 60 00:04:13,170 --> 00:04:16,359 You might wonder why I said this is a big deal and a huge deal that you have this 61 00:04:16,359 --> 00:04:19,399 gap in there, because you think: "Well, my web server 62 00:04:19,399 --> 00:04:23,120 serving cat pictures is doing just fine on a fast connection." 63 00:04:23,120 --> 00:04:28,890 Well, it is, because it is serving large packets or even large chunks of files that 64 00:04:28,890 --> 00:04:33,930 it sends at once to the kernel, like you
can take your whole 65 00:04:33,930 --> 00:04:36,510 cat video, give it to the kernel and the kernel will handle everything, 66 00:04:36,510 --> 00:04:42,800 from packetizing it to TCP. But what we want to build is an application 67 00:04:42,800 --> 00:04:47,640 that needs to cope with the worst case of lots of small packets coming in, 68 00:04:47,640 --> 00:04:53,600 and then the overhead that you get here from this gap is mostly on a per-packet basis, 69 00:04:53,600 --> 00:04:57,421 not on a per-byte basis. So, lots of small packets are a problem 70 00:04:57,421 --> 00:05:00,690 for this interface. When I say "problem" I'm always talking 71 00:05:00,690 --> 00:05:03,240 about performance, because I mostly care about performance. 72 00:05:03,240 --> 00:05:09,390 So if you look at performance... a few figures to get started: 73 00:05:09,390 --> 00:05:13,250 well, how many packets can you fit over your usual 10G link? That's around fifteen 74 00:05:13,250 --> 00:05:17,810 million. But 10G, that's last year's news, this year 75 00:05:17,810 --> 00:05:21,370 you have multiple hundred G connections even to this location here. 76 00:05:21,370 --> 00:05:28,280 So a 100G link can handle up to 150 million packets per second, and, well, how much time 77 00:05:28,280 --> 00:05:32,819 does that give us on a CPU? Say we have a three gigahertz CPU in 78 00:05:32,819 --> 00:05:37,260 our MacBook running the router, and that means we have around 200 cycles per packet 79 00:05:37,260 --> 00:05:40,400 if we want to handle one 10G link with one CPU core. 80 00:05:40,400 --> 00:05:46,000 Okay, we don't want to handle... we have of course multiple cores. But you also have 81 00:05:46,000 --> 00:05:50,430 multiple links, and faster links than 10G. So the typical performance target that you 82 00:05:50,430 --> 00:05:54,510 would aim for when building such an application is five to ten million packets 83 00:05:54,510 --> 00:05:56,880 per second per CPU core, per thread that you start. 84 00:05:56,880 --> 00:06:00,550 That's like a usual target. And that is just for forwarding, just to receive the 85 00:06:00,550 --> 00:06:05,630 packet and to send it back out. All the rest, that is: all the remaining cycles, 86 00:06:05,630 --> 00:06:09,110 can be used for your application. So we don't want any big overhead just for 87 00:06:09,110 --> 00:06:11,700 receiving and sending them without doing any useful work. 88 00:06:11,700 --> 00:06:20,370 So these figures translate to around 300 to 600 cycles per packet, on a 89 00:06:20,370 --> 00:06:24,380 three gigahertz CPU core. Now, how long does it take to cross that userspace 90 00:06:24,380 --> 00:06:30,860 boundary? Well, very very very long for an individual packet. So in some performance 91 00:06:30,860 --> 00:06:34,620 measurements, if you do single core packet forwarding, with a raw socket you 92 00:06:34,620 --> 00:06:38,920 can maybe achieve 300,000 packets per second, if you use libpcap, you can 93 00:06:38,920 --> 00:06:42,740 achieve a million packets per second. These figures can be tuned. You can maybe 94 00:06:42,740 --> 00:06:46,080 get a factor of two out of that by some tuning, but there are more problems, like 95 00:06:46,080 --> 00:06:50,340 multicore scaling is unnecessarily hard and so on, so this doesn't really seem to 96 00:06:50,340 --> 00:06:54,800 work. So the boundary is the problem, so let's get rid of the boundary by just 97 00:06:54,800 --> 00:06:59,310 moving the application into the kernel.
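As a back-of-the-envelope check of that cycle budget, here is a minimal sketch in C, assuming the 3 GHz clock mentioned above and the roughly 14.88 million packets per second that minimum-sized packets on a 10G link work out to:

    #include <stdio.h>

    int main(void) {
        double cpu_hz     = 3e9;      /* 3 GHz core, as assumed above               */
        double pps_10g    = 14.88e6;  /* 64-byte packets at 10 Gbit/s line rate     */
        double pps_target = 10e6;     /* upper end of the 5-10 Mpps per-core target */

        printf("cycles per packet at 10G line rate: %.0f\n", cpu_hz / pps_10g);    /* ~200 */
        printf("cycles per packet at 10 Mpps:       %.0f\n", cpu_hz / pps_target); /* 300; 5 Mpps gives 600 */
        return 0;
    }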
We rewrite our application as a kernel module 98 00:06:59,310 --> 00:07:04,330 and use it directly. You might think "what an incredibly stupid idea, to write kernel 99 00:07:04,330 --> 00:07:08,580 code for something that clearly should be user space". Well, it's not that 100 00:07:08,580 --> 00:07:11,949 unreasonable, there are lots of examples of applications doing this, like a certain 101 00:07:11,949 --> 00:07:16,850 web server by Microsoft runs as a kernel module, the latest Linux kernel has TLS 102 00:07:16,850 --> 00:07:20,850 offloading, to speed that up. Another interesting use case is Open vSwitch, that 103 00:07:20,850 --> 00:07:24,170 has a fast internal chache, that just caches stuff and does complex processing 104 00:07:24,170 --> 00:07:27,419 in a userspace thing, so it's not completely unreasonable. 105 00:07:27,419 --> 00:07:30,890 But it comes with a lot of drawbacks, like it's very cumbersome to develop, most your 106 00:07:30,890 --> 00:07:34,930 usual tools don't work or don't work as expected, you have to follow the usual 107 00:07:34,930 --> 00:07:38,000 kernel restrictions, like you have to use C as a programming language, what you 108 00:07:38,000 --> 00:07:42,260 maybe don't want to, and your application can and will crash the kernel, which can 109 00:07:42,260 --> 00:07:46,750 be quite bad. But lets not care about the restrictions, we wanted to fix 110 00:07:46,750 --> 00:07:50,530 performance, so same figures again: We have 300 to 600 cycles to receive and sent 111 00:07:50,530 --> 00:07:54,660 a packet. What I did: I tested this, I profiled the Linux kernel to see how long 112 00:07:54,660 --> 00:07:58,840 does it take to receive a packet until I can do some useful work on it. This is an 113 00:07:58,840 --> 00:08:03,550 average cost of a longer profiling run. So on average it takes 500 cycles just to 114 00:08:03,550 --> 00:08:08,010 receive the packet. Well, that's bad but sending it out is slightly faster and 115 00:08:08,010 --> 00:08:11,490 again, we are now over our budget. Now you might think "what else do I need to do 116 00:08:11,490 --> 00:08:15,639 besides receiving and sending the packet?" There is some more overhead, there's you 117 00:08:15,639 --> 00:08:20,710 need some time to the sk_buff, the data structure used in the kernel for all 118 00:08:20,710 --> 00:08:24,910 packet buffers, and this is quite bloated, old, big data structure that is growing 119 00:08:24,910 --> 00:08:29,760 bigger and bigger with each release and this takes another 400 cycles. So if you 120 00:08:29,760 --> 00:08:32,999 measure a real world application, single core packet forwarding with Open vSwitch 121 00:08:32,999 --> 00:08:36,429 with the minimum processing possible: One open flow rule that matches on physical 122 00:08:36,429 --> 00:08:40,529 ports and the processing, I profiled this at around 200 cycles per packet. 123 00:08:40,529 --> 00:08:44,790 And while the overhead of the kernel is another thousand something cycles, so in 124 00:08:44,790 --> 00:08:49,360 the end you achieve two million packets per second - and this is faster than our 125 00:08:49,360 --> 00:08:55,320 user space stuff but still kind of slow, well, we want to be faster, because yeah. 126 00:08:55,320 --> 00:08:59,220 And the currently hottest topic, which I'm not talking about in the Linux kernel is 127 00:08:59,220 --> 00:09:03,040 XDP. This fixes some of these problems but comes with new restrictions. 
I cut that 128 00:09:03,040 --> 00:09:10,079 for my talk for time reasons and so let's just talk about not XDP. So the problem 129 00:09:10,079 --> 00:09:14,439 was that our application - and we wanted to move the application to the kernel 130 00:09:14,439 --> 00:09:17,680 space - and it didn't work, so can we instead move stuff from the kernel to the 131 00:09:17,680 --> 00:09:22,160 user space? Well, yes we can. There a libraries called "user space packet 132 00:09:22,160 --> 00:09:25,660 processing frameworks". They come in two parts: One is a library, you link your 133 00:09:25,660 --> 00:09:29,209 program against, in the user space and one is a kernel module. These two parts 134 00:09:29,209 --> 00:09:34,199 communicate and they setup shared, mapped memory and this shared mapped memory is 135 00:09:34,199 --> 00:09:37,770 used to directly communicate from your application to the driver. You directly 136 00:09:37,770 --> 00:09:41,209 fill the packet buffers that the driver then sends out and this is way faster. 137 00:09:41,209 --> 00:09:44,379 And you might have noticed that the operating system box here is not connected 138 00:09:44,379 --> 00:09:47,349 to anything. That means your operating system doesn't even know that the network 139 00:09:47,349 --> 00:09:51,589 card is there in most cases, this can be quite annoying. But there are quite a few 140 00:09:51,589 --> 00:09:58,000 such frameworks, the biggest examples are netmap PF_RING and pfq and they come with 141 00:09:58,000 --> 00:10:02,170 restrictions, like there is a non-standard API, you can't port between one framework 142 00:10:02,170 --> 00:10:06,180 and the other or one framework in the kernel or sockets, there's a custom kernel 143 00:10:06,180 --> 00:10:10,650 module required, most of these frameworks require some small patches to the drivers, 144 00:10:10,650 --> 00:10:15,699 it's just a mess to maintain and of course they need exclusive access to the network 145 00:10:15,699 --> 00:10:18,970 card, because this one network card is direc- this one application is talking 146 00:10:18,970 --> 00:10:23,540 directly to the network card. Ok, and the next thing is you lose the 147 00:10:23,540 --> 00:10:27,759 access to the usual kernel features, which can be quite annoying and then there's 148 00:10:27,759 --> 00:10:30,970 often poor support for hardware offloading features of the network cards, because 149 00:10:30,970 --> 00:10:33,970 they often found on different parts of the kernel that we no longer have reasonable 150 00:10:33,970 --> 00:10:37,679 access to. And of course these frameworks, we talk directly to a network card, 151 00:10:37,679 --> 00:10:41,529 meaning we need support for each network card individually. Usually they just 152 00:10:41,529 --> 00:10:46,000 support one to two or maybe three NIC families, which can be quite restricting, 153 00:10:46,000 --> 00:10:50,579 if you don't have that specific NIC that is restricted. But can we do an even more 154 00:10:50,579 --> 00:10:54,790 radical approach, because we have all these problems with kernel dependencies 155 00:10:54,790 --> 00:10:59,189 and so on? Well, turns out we can get rid of the kernel entirely and move everything 156 00:10:59,189 --> 00:11:03,650 into one application. 
This means we take our driver put it in the application, the 157 00:11:03,650 --> 00:11:08,050 driver directly accesses the network card and the sets up DMA memory in the user 158 00:11:08,050 --> 00:11:11,579 space, because the network card doesn't care, where it copies the packets from. We 159 00:11:11,579 --> 00:11:14,739 just have to set up the pointers in the right way and we can build this framework 160 00:11:14,739 --> 00:11:17,410 like this, that everything runs in the application. 161 00:11:17,410 --> 00:11:23,459 We remove the driver from the kernel, no kernel driver running and this is super 162 00:11:23,459 --> 00:11:27,649 fast and we can also use this to implement crazy and obscure hardware features and 163 00:11:27,649 --> 00:11:31,420 network cards that are not supported by the standard driver. Now I'm not the first 164 00:11:31,420 --> 00:11:36,200 one to do this, there are two big frameworks that that do that: One is DPDK, 165 00:11:36,200 --> 00:11:41,060 which is quite quite big. This is a Linux Foundation project and it has basically 166 00:11:41,060 --> 00:11:44,709 support by all NIC vendors, meaning everyone who builds a high-speed NIC 167 00:11:44,709 --> 00:11:49,209 writes a driver that works for DPDK and the second such framework is Snabb, which 168 00:11:49,209 --> 00:11:54,139 I think is quite interesting, because it doesn't write the drivers in C but is 169 00:11:54,139 --> 00:11:58,290 entirely written in Lua, in the scripting language, so this is kind of nice to see a 170 00:11:58,290 --> 00:12:02,999 driver that's written in a scripting language. Okay, what problems did we solve 171 00:12:02,999 --> 00:12:06,679 and what problems did we now gain? One problem is we still have the non-standard 172 00:12:06,679 --> 00:12:11,329 API, we still need exclusive access to the network card from one application, because 173 00:12:11,329 --> 00:12:15,189 the driver runs in that thing, so there's some hardware tricks to solve that, but 174 00:12:15,189 --> 00:12:18,329 mainly it's one application that is running. 175 00:12:18,329 --> 00:12:22,459 Then the framework needs explicit support for all the unique models out there. It's 176 00:12:22,459 --> 00:12:26,369 not that big a problem with DPDK, because it's such a big project that virtually 177 00:12:26,369 --> 00:12:31,319 everyone has a driver for DPDK NIC. And yes, limited support for interrupts but 178 00:12:31,319 --> 00:12:34,170 it turns out interrupts are not something that is useful, when you are building 179 00:12:34,170 --> 00:12:37,999 something that processes more than a few hundred thousand packets per second, 180 00:12:37,999 --> 00:12:41,379 because the overhead of the interrupt is just too large, it's just mainly a power 181 00:12:41,379 --> 00:12:44,839 saving thing, if you ever run into low load. But I don't care about the low load 182 00:12:44,839 --> 00:12:50,410 scenario and power saving, so for me it's polling all the way and all the CPU. And 183 00:12:50,410 --> 00:12:55,260 you of course lose all the access to the usual kernel features. And, well, time to 184 00:12:55,260 --> 00:12:59,880 ask "what has the kernel ever done for us?" Well, the kernel has lots of mature 185 00:12:59,880 --> 00:13:03,139 drivers. Okay, what has the kernel ever done for us, except for all these nice 186 00:13:03,139 --> 00:13:07,639 mature drivers? There are very nice protocol implementations that actually 187 00:13:07,639 --> 00:13:10,220 work, like the kernel TCP stack is a work of art. 
188 00:13:10,220 --> 00:13:14,319 It actually works in real world scenarios, unlike all these other TCP stacks that 189 00:13:14,319 --> 00:13:18,410 fail under some things or don't support the features we want, so there is quite 190 00:13:18,410 --> 00:13:22,509 some nice stuff. But what has the kernel ever done for us, except for these mature 191 00:13:22,509 --> 00:13:26,799 drivers and these nice protocol stack implementations? Okay, quite a few things 192 00:13:26,799 --> 00:13:32,870 and we are all throwing them out. And one thing to notice: We mostly don't care 193 00:13:32,870 --> 00:13:37,610 about these features, when building our packet forward modify router firewall 194 00:13:37,610 --> 00:13:44,349 thing, because these are mostly high-level features mostly I think. But it's still a 195 00:13:44,349 --> 00:13:49,199 lot of features that we are losing, like building a TCP stack on top of these 196 00:13:49,199 --> 00:13:52,999 frameworks is kind of an unsolved problem. There are TCP stacks but they all suck in 197 00:13:52,999 --> 00:13:58,409 different ways. Ok, we lost features but we didn't care about the features in the 198 00:13:58,409 --> 00:14:02,640 first place, we wanted performance. Back to our performance figure we want 300 199 00:14:02,640 --> 00:14:06,490 to 600 cycles per packet that we have available, how long does it take in, for 200 00:14:06,490 --> 00:14:10,899 example, DPDK to receive and send a packet? That is around a hundred cycles to 201 00:14:10,899 --> 00:14:15,239 get a packet through the whole stack, from like like receiving a packet, processing 202 00:14:15,239 --> 00:14:19,660 it, well, not processing it but getting it to the application and back to the driver 203 00:14:19,660 --> 00:14:23,080 to send it out. A hundred cycles and the other frameworks typically play in the 204 00:14:23,080 --> 00:14:27,709 same league. DPDK is slightly faster than the other ones, because it's full of magic 205 00:14:27,709 --> 00:14:33,000 SSE and AVX intrinsics and the driver is kind of black magic but it's super fast. 206 00:14:33,000 --> 00:14:37,480 Now in kind of real world scenario, Open vSwitch, as I've mentioned as an example 207 00:14:37,480 --> 00:14:41,689 earlier, that was 2 million packets was the kernel version and Open vSwitch can be 208 00:14:41,689 --> 00:14:45,220 compiled with an optional DPDK backend, so you set some magic flags when compiling, 209 00:14:45,220 --> 00:14:49,729 then it links against DPDK and uses the network card directly, runs completely in 210 00:14:49,729 --> 00:14:54,709 userspace and now it's a factor of around 6 or 7 faster and we can achieve 13 211 00:14:54,709 --> 00:14:58,429 million packets per second with the same, around the same processing step on a 212 00:14:58,429 --> 00:15:03,119 single CPU core. So, great, where does do the performance gains come from? Well, 213 00:15:03,119 --> 00:15:08,129 there are two things: Mainly it's compared to the kernel, not compared to sockets. 214 00:15:08,129 --> 00:15:13,290 What people often say is that this is, zero copy which is a stupid term because 215 00:15:13,290 --> 00:15:18,279 the kernel doesn't copy packets either, so it's not copying packets that was slow, it 216 00:15:18,279 --> 00:15:22,299 was other things. 
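To give a sense of what driving the NIC through such a framework looks like from the application side, here is a minimal sketch of a DPDK-style forwarding loop; it assumes the EAL and both ports have already been initialized elsewhere, and the port and queue numbers are made up:

    #include <stdint.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST_SIZE 32

    /* Forward everything arriving on port 0 out of port 1, one burst at a time. */
    static void forward_loop(void)
    {
        struct rte_mbuf *bufs[BURST_SIZE];

        for (;;) {
            /* receive up to 32 packets in a single call */
            uint16_t rx = rte_eth_rx_burst(0 /* port */, 0 /* queue */, bufs, BURST_SIZE);
            if (rx == 0)
                continue;

            /* ... look at or modify the packets here ... */

            /* hand the whole batch to the transmit queue of the other port */
            uint16_t tx = rte_eth_tx_burst(1 /* port */, 0 /* queue */, bufs, rx);

            /* tx_burst may take fewer packets than offered; free the rest */
            for (uint16_t i = tx; i < rx; i++)
                rte_pktmbuf_free(bufs[i]);
        }
    }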
Mainly it's batching, meaning it's very efficient to process a 217 00:15:22,299 --> 00:15:28,619 relatively large number of packets at once and that really helps, and the other thing is 218 00:15:28,619 --> 00:15:32,509 reduced memory overhead: the sk_buff data structure is really big and if you cut 219 00:15:32,509 --> 00:15:37,319 that down you save a lot of cycles. And these DPDK figures: DPDK, unlike 220 00:15:37,319 --> 00:15:42,679 some other frameworks, has memory management, and this is already included 221 00:15:42,679 --> 00:15:46,549 in these 50 cycles. Okay, now we know that these frameworks 222 00:15:46,549 --> 00:15:52,009 exist and everything, and the next obvious question is: "Can we build our own 223 00:15:52,009 --> 00:15:57,689 driver?" Well, but why? First for fun, obviously, and then to understand how that 224 00:15:57,689 --> 00:16:01,159 stuff works; how these drivers work, how these packet processing frameworks 225 00:16:01,159 --> 00:16:04,679 work. In my work in academia I've 226 00:16:04,679 --> 00:16:07,840 seen a lot of people using these frameworks. It's nice, because they are 227 00:16:07,840 --> 00:16:12,260 fast and they enable a few things that just weren't possible before. But people 228 00:16:12,260 --> 00:16:16,170 often treat these as magic black boxes: you put in your packet and then it magically 229 00:16:16,170 --> 00:16:20,429 is faster, and sometimes I don't blame them. If you look at DPDK source code, 230 00:16:20,429 --> 00:16:24,269 there are more than 20,000 lines of code for each driver. And just for example, 231 00:16:24,269 --> 00:16:28,809 looking at the receive and transmit functions of the ixgbe driver in DPDK, 232 00:16:28,809 --> 00:16:33,769 this is one file with around 3,000 lines of code and they do a lot of magic, just 233 00:16:33,769 --> 00:16:37,950 to receive and send packets. No one wants to read through that, so the question is: 234 00:16:37,950 --> 00:16:40,960 "How hard can it be to write your own driver?" 235 00:16:40,960 --> 00:16:44,850 Turns out: It's quite easy! This was like a weekend project. I have written the 236 00:16:44,850 --> 00:16:48,369 driver called ixy. It's less than a thousand lines of C code. That is the full 237 00:16:48,369 --> 00:16:53,559 driver for 10G network cards and the full framework to get some applications running, and 2 238 00:16:53,559 --> 00:16:58,099 simple example applications. Took me like less than two days to write it completely, 239 00:16:58,099 --> 00:17:00,897 then two more days to debug it and fix performance. 240 00:17:02,385 --> 00:17:08,209 So I've been building this driver on the Intel ixgbe family. This is a family of 241 00:17:08,209 --> 00:17:13,041 network cards that you know of if you ever had a server to test this. Because 242 00:17:13,041 --> 00:17:17,639 almost all servers that have 10G connections have these Intel cards. And 243 00:17:17,639 --> 00:17:22,829 they are also embedded in some Xeon CPUs. They are also onboard chips on many 244 00:17:22,829 --> 00:17:29,480 mainboards, and the nice thing about them is, they have a publicly available data 245 00:17:29,480 --> 00:17:33,620 sheet. Meaning Intel publishes this 1,000-page PDF that describes everything 246 00:17:33,620 --> 00:17:37,140 you ever wanted to know when writing a driver for these. And the next nice thing 247 00:17:37,140 --> 00:17:41,324 is that there is almost no logic hidden behind the black box magic firmware.
Many 248 00:17:41,324 --> 00:17:46,210 newer network cards -especially Mellanox, the newer ones- hide a lot of 249 00:17:46,210 --> 00:17:50,120 functionality behind a firmware and the driver. Mostly just exchanges messages 250 00:17:50,120 --> 00:17:54,169 with the firmware, which is kind of boring, and with this family, it is not 251 00:17:54,169 --> 00:17:58,340 the case, which i think is very nice. So how can we build a driver for this in four 252 00:17:58,340 --> 00:18:02,884 very simple steps? One: We remove the driver that is currently loaded, because 253 00:18:02,884 --> 00:18:07,600 we don't want it to interfere with our stuff. Okay, easy so far. Second, we 254 00:18:07,600 --> 00:18:12,590 memory-map the PCIO memory-mapped I/O address space. This allows us to access 255 00:18:12,590 --> 00:18:16,430 the PCI Express device. Number three: We figure out the physical addresses of our 256 00:18:16,430 --> 00:18:22,750 DMA; of our process per address region and then we use them for DMA. And step four is 257 00:18:22,750 --> 00:18:26,779 slightly more complicated, than the first three steps, as we write the driver. Now, 258 00:18:26,779 --> 00:18:31,849 first thing to do, we figure out, where our network card -let's say we have a 259 00:18:31,849 --> 00:18:35,444 server and be plugged in our network card- then it gets assigned an address and the 260 00:18:35,444 --> 00:18:39,611 PCI bus. We can figure that out with lspci, this is the address. We need it in 261 00:18:39,611 --> 00:18:43,429 a slightly different version with the fully qualified ID, and then we can remove 262 00:18:43,429 --> 00:18:47,775 the kernel driver by telling the currently bound driver to remove that specific ID. 263 00:18:47,775 --> 00:18:52,100 Now the operating system doesn't know, that this is a network card; doesn't know 264 00:18:52,100 --> 00:18:55,870 anything, just notes that some PCI device has no driver. Then we write our 265 00:18:55,870 --> 00:18:59,209 application. This is written in C and we just opened 266 00:18:59,209 --> 00:19:04,207 this magic file in sysfs and this magic file; we just mmap it. Ain't no magic, 267 00:19:04,207 --> 00:19:08,183 just a normal mmap there. But what we get back is a kind of special memory region. 268 00:19:08,183 --> 00:19:12,160 This is the memory mapped I/O memory region of the PCI address configuration 269 00:19:12,160 --> 00:19:17,620 space and this is where all the registers are available. Meaning, I will show you 270 00:19:17,620 --> 00:19:20,960 what that means in just a second. If we if go through the datasheet, there are 271 00:19:20,960 --> 00:19:25,532 hundreds of pages of tables like this and these tables tell us the registers, that 272 00:19:25,532 --> 00:19:29,974 exist on that network card, the offset they have and a link to more detailed 273 00:19:29,974 --> 00:19:34,589 descriptions. And in code that looks like this: For example the LED control register 274 00:19:34,589 --> 00:19:38,090 is at this offset and then the LED control register. 275 00:19:38,090 --> 00:19:42,522 On this register, there are 32 bits, there are some bits offset. Bit 7 is called 276 00:19:42,522 --> 00:19:48,590 LED0_BLINK and if we set that bit in that register, then one of the LEDs will start 277 00:19:48,590 --> 00:19:53,669 to blink. 
And we can just do that via our magic memory region, because all the reads 278 00:19:53,669 --> 00:19:57,682 and writes, that we do to that memory region, go directly over the PCI Express 279 00:19:57,682 --> 00:20:01,568 bus to the network card and the network card does whatever it wants to do with 280 00:20:01,568 --> 00:20:03,128 them. It doesn't have to be a register, 281 00:20:03,128 --> 00:20:08,690 basically it's just a command, to send to a network card and it's just a nice and 282 00:20:08,690 --> 00:20:11,669 convenient interface to map that into memory. This is a very common technique, 283 00:20:11,669 --> 00:20:15,098 that you will also find when you do some microprocessor programming or something. 284 00:20:16,260 --> 00:20:20,110 So, and one thing to note is, since this is not memory: That also means, it can't 285 00:20:20,110 --> 00:20:24,111 be cached. There's no cache in between. Each of these accesses will trigger a PCI 286 00:20:24,111 --> 00:20:29,210 Express transaction and it will take quite some time. Speaking of lots of lots of 287 00:20:29,210 --> 00:20:32,919 cycles, where lots means like hundreds of cycles or hundred cycles which is a lot 288 00:20:32,919 --> 00:20:37,206 for me. So how do we now handle packets? We now 289 00:20:37,206 --> 00:20:42,400 can, we have access to this registers we can read the datasheet and we can write 290 00:20:42,400 --> 00:20:47,250 the driver but we some need some way to get packets through that. Of course it 291 00:20:47,250 --> 00:20:51,470 would be possible to write a network card that does that via this memory-mapped I/O 292 00:20:51,470 --> 00:20:56,800 region but it's kind of annoying. The second way a PCI Express device 293 00:20:56,800 --> 00:21:01,429 communicates with your server or macbook is via DMA ,direct memory access, and a 294 00:21:01,429 --> 00:21:07,536 DMA transfer, unlike the memory-mapped I/O stuff is initiated by the network card and 295 00:21:07,536 --> 00:21:14,046 this means the network card can just write to arbitrary addresses in in main memory. 296 00:21:14,050 --> 00:21:20,200 And this the network card offers so called rings which are queue interfaces and like 297 00:21:20,200 --> 00:21:22,946 for receiving packets and for sending packets, and they are multiple of these 298 00:21:22,946 --> 00:21:26,584 interfaces, because this is how you do multi-core scaling. If you want to 299 00:21:26,584 --> 00:21:30,649 transmit from multiple cores, you allocate multiple queues. Each core sends to one 300 00:21:30,649 --> 00:21:34,269 queue and the network card just merges these queues in hardware onto the link, 301 00:21:34,269 --> 00:21:38,789 and on receiving the network card can either hash on the incoming incoming 302 00:21:38,789 --> 00:21:42,821 packet like hash over protocol headers or you can set explicit filters. 303 00:21:42,821 --> 00:21:46,630 This is not specific to a network card most PCI Express devices work like this 304 00:21:46,630 --> 00:21:52,000 like GPUs have queues, a command queues and so on, a NVME PCI Express disks have 305 00:21:52,000 --> 00:21:56,660 queues and... So let's look at queues on example of the 306 00:21:56,660 --> 00:22:01,480 ixgbe family but you will find that most NICs work in a very similar way. There are 307 00:22:01,480 --> 00:22:04,110 sometimes small differences but mainly they work like this. 308 00:22:04,344 --> 00:22:08,902 And these rings are just circular buffers filled with so-called DMA descriptors. 
A 309 00:22:08,902 --> 00:22:14,180 DMA descriptor is a 16-byte struct and that is eight bytes of a physical pointer 310 00:22:14,180 --> 00:22:18,960 pointing to some location where more stuff is and eight byte of metadata like "I 311 00:22:18,960 --> 00:22:24,389 fetch the stuff" or "this packet needs VLAN tag offloading" or "this packet had a 312 00:22:24,389 --> 00:22:27,124 VLAN tag that I removed", information like that is stored in there. 313 00:22:27,124 --> 00:22:31,200 And what we then need to do is we translate virtual addresses from our 314 00:22:31,200 --> 00:22:34,509 address space to physical addresses because the PCI Express device of course 315 00:22:34,509 --> 00:22:39,198 needs physical addresses. And we can use this, do that using procfs: 316 00:22:39,198 --> 00:22:45,590 In the /proc/self/pagemap we can do that. And the next thing is we now have this 317 00:22:45,590 --> 00:22:51,610 this queue of DMA descriptors in memory and this queue itself is also accessed via 318 00:22:51,610 --> 00:22:57,101 DMA and it's controlled like it works like you expect a circular ring to work. It has 319 00:22:57,101 --> 00:23:00,970 a head and a tail, and the head and tail pointer are available via registers in 320 00:23:00,970 --> 00:23:05,680 memory-mapped I/O address space, meaning in a image it looks kind of like this: We 321 00:23:05,680 --> 00:23:09,650 have this descriptor ring in our physical memory to the left full of pointers and 322 00:23:09,650 --> 00:23:16,000 then we have somewhere else these packets in some memory pool. And one thing to note 323 00:23:16,000 --> 00:23:20,269 when allocating this kind of memory: There is a small trick you have to do because 324 00:23:20,269 --> 00:23:25,059 the descriptor ring needs to be in contiguous memory in your physical memory 325 00:23:25,059 --> 00:23:29,139 and if you use if, you just assume everything that's contiguous in your 326 00:23:29,139 --> 00:23:34,399 process is also in hardware physically: No it isn't, and if you have a bug in there 327 00:23:34,399 --> 00:23:37,919 and then it writes to somewhere else then your filesystem dies as I figured out, 328 00:23:37,919 --> 00:23:43,179 which was not a good thing. So ... we, what I'm doing is I'm using 329 00:23:43,179 --> 00:23:46,789 huge pages, two megabyte pages, that's enough of contiguous memory and that's 330 00:23:46,789 --> 00:23:53,990 guaranteed to not have weird gaps. So, um ... now we see packets we need to 331 00:23:53,990 --> 00:23:58,600 set up the ring so we tell the network car via memory mapped I/O the location and 332 00:23:58,600 --> 00:24:03,070 the size of the ring, then we fill up the ring with pointers to freshly allocated 333 00:24:03,070 --> 00:24:09,820 memory that are just empty and now we set the head and tail pointer to tell the head 334 00:24:09,820 --> 00:24:13,100 and tail pointer that the queue is full, because the queue is at the moment full, 335 00:24:13,100 --> 00:24:16,956 it's full of packets. These packets are just not yet filled with anything. 
And now 336 00:24:16,956 --> 00:24:20,629 what the NIC does, it fetches one of the DNA descriptors and as soon as it receives 337 00:24:20,629 --> 00:24:25,539 a packet it writes the packet via DMA to the location specified in the register and 338 00:24:25,539 --> 00:24:30,299 increments the head pointer of the queue and it also sets a status flag in the DMA 339 00:24:30,299 --> 00:24:33,590 descriptor once it's done like in the packet to memory and this step is 340 00:24:33,590 --> 00:24:39,610 important because reading back the head pointer via MM I/O would be way too slow. 341 00:24:39,610 --> 00:24:43,330 So instead we check the status flag because the status flag gets optimized by 342 00:24:43,330 --> 00:24:47,302 the ... by the cache and is already in cache so we can check that really fast. 343 00:24:48,794 --> 00:24:52,121 Next step is we periodically poll the status flag. This is the point where 344 00:24:52,121 --> 00:24:56,009 interrupts might come in useful. There's some misconception: people 345 00:24:56,009 --> 00:24:59,419 sometimes believe that if you receive a packet then you get an interrupt and the 346 00:24:59,419 --> 00:25:02,420 interrupt somehow magically contains the packet. No it doesn't. The interrupt just 347 00:25:02,420 --> 00:25:05,600 contains the information that there is a new packet. After the interrupt you would 348 00:25:05,600 --> 00:25:12,450 have to poll the status flag anyways. So we now have the packet, we process the 349 00:25:12,450 --> 00:25:16,170 packet or do whatever, then we reset the DMA descriptor, we can either recycle the 350 00:25:16,170 --> 00:25:21,653 old packet or allocate a new one and we set the ready flag on the status register 351 00:25:21,653 --> 00:25:25,529 and we adjust the tail pointer register to tell the network card that we are done 352 00:25:25,529 --> 00:25:28,389 with this and we don't have to do that for any time because we don't have to keep the 353 00:25:28,389 --> 00:25:33,220 queue 100% utilized. We can only update the tail pointer like every hundred 354 00:25:33,220 --> 00:25:37,559 packets or so and then that's not a performance problem. What now, we have a 355 00:25:37,559 --> 00:25:42,020 driver that can receive packets. Next steps, well transmit packets, it basically 356 00:25:42,020 --> 00:25:46,373 works the same. I won't bore you with the details. Then there's of course a lot of 357 00:25:46,373 --> 00:25:50,600 boring boring initialization code and it's just following the datasheet, they are 358 00:25:50,600 --> 00:25:54,070 like: set this register, set that register, do that and I just coded it down 359 00:25:54,070 --> 00:25:58,870 from the datasheet and it works, so big surprise. Then now you know how to write a 360 00:25:58,870 --> 00:26:03,799 driver like this and a few ideas of what ... what I want to do, what maybe you want 361 00:26:03,799 --> 00:26:06,820 to do with a driver like this. One of course want to look at performance to look 362 00:26:06,820 --> 00:26:09,929 at what makes this faster than the kernel, then I want some obscure 363 00:26:09,929 --> 00:26:12,529 hardware/offloading features. In the past I've looked at IPSec 364 00:26:12,529 --> 00:26:15,840 offloading, just quite interesting, because the Intel network cards have 365 00:26:15,840 --> 00:26:19,870 hardware support for IPSec offloading, but none of the Intel drivers had it and it 366 00:26:19,870 --> 00:26:24,200 seems to work just fine. So not sure what's going on there. 
Then security is 367 00:26:24,200 --> 00:26:29,440 interesting. There is the ... there's obvious some security implications of 368 00:26:29,440 --> 00:26:33,399 having the whole driver in a user space process and ... and I'm wondering about 369 00:26:33,399 --> 00:26:37,120 how we can use the IOMMU, because it turns out, once we have set up the memory 370 00:26:37,120 --> 00:26:40,130 mapping we can drop all the privileges, we don't need them. 371 00:26:40,130 --> 00:26:43,659 And if we set up the IOMMU before to restrict the network card to certain 372 00:26:43,659 --> 00:26:48,750 things then we could have a safe driver in userspace that can't do anything wrong, 373 00:26:48,750 --> 00:26:52,264 because has no privileges and the network card has no access because goes through 374 00:26:52,264 --> 00:26:56,046 the IOMMU and there are performance implications of the IOMMU and so on. Of 375 00:26:56,046 --> 00:26:59,889 course, support for other NICs. I want to support virtIO, virtual NICs and other 376 00:26:59,889 --> 00:27:03,564 programming languages for the driver would also be interesting. It's just written in 377 00:27:03,564 --> 00:27:06,686 C because C is the lowest common denominator of programming languages. 378 00:27:06,991 --> 00:27:12,700 To conclude, check out ixy. It's BSD license on github and the main thing to 379 00:27:12,700 --> 00:27:16,094 take with you is that drivers are really simple. Don't be afraid of drivers. Don't 380 00:27:16,094 --> 00:27:20,059 be afraid of writing your drivers. You can do it in any language and you don't even 381 00:27:20,059 --> 00:27:23,139 need to add kernel code. Just map the stuff to your process, write the driver 382 00:27:23,139 --> 00:27:27,019 and do whatever you want. Okay, thanks for your attention. 383 00:27:27,019 --> 00:27:33,340 *Applause* 384 00:27:33,340 --> 00:27:36,079 Herald: You have very few minutes left for 385 00:27:36,079 --> 00:27:40,529 questions. So if you have a question in the room please go quickly to one of the 8 386 00:27:40,529 --> 00:27:46,899 microphones in the room. Does the signal angel already have a question ready? I 387 00:27:46,899 --> 00:27:52,998 don't see anything. Anybody lining up at any microphones? 388 00:28:07,182 --> 00:28:08,950 Alright, number 6 please. 389 00:28:09,926 --> 00:28:15,140 Mic 6: As you're not actually using any of the Linux drivers, is there an advantage 390 00:28:15,140 --> 00:28:19,470 to using Linux here or could you use any open source operating system? 391 00:28:19,470 --> 00:28:24,200 Paul: I don't know about other operating systems but the only thing I'm using of 392 00:28:24,200 --> 00:28:28,649 Linux here is the ability to easily map that. For some other operating systems we 393 00:28:28,649 --> 00:28:32,779 might need a small stub driver that maps the stuff in there. You can check out the 394 00:28:32,779 --> 00:28:36,820 DPDK FreeBSD port which has a small stub driver that just handles the memory 395 00:28:36,820 --> 00:28:41,379 mapping. Herald: Here, at number 2. 396 00:28:41,379 --> 00:28:45,340 Mic 2: Hi, erm, slightly disconnected to the talk, but I just like to hear your 397 00:28:45,340 --> 00:28:50,880 opinion on smart NICs where they're considering putting CPUs on the NIC 398 00:28:50,880 --> 00:28:55,279 itself. So you could imagine running Open vSwitch on the CPU on the NIC. 399 00:28:55,279 --> 00:28:59,530 Paul: Yeah, I have some smart NIC somewhere on some lap and have also done 400 00:28:59,530 --> 00:29:05,639 work with the net FPGA. 
I think that it's very interesting, but it ... it's a 401 00:29:05,639 --> 00:29:09,820 complicated trade-off, because these smart NICs come with new restrictions and they 402 00:29:09,820 --> 00:29:13,820 are not dramatically super fast. So it's ... it's interesting from a performance 403 00:29:13,820 --> 00:29:17,610 perspective to see when it's worth it, when it's not worth it and what I 404 00:29:17,610 --> 00:29:22,100 personally think it's probably better to do everything with raw CPU power. 405 00:29:22,100 --> 00:29:25,200 Mic 2: Thanks. Herald: Alright, before we take the next 406 00:29:25,200 --> 00:29:29,730 question, just for the people who don't want to stick around for the Q&A. If you 407 00:29:29,730 --> 00:29:33,720 really do have to leave the room early, please do so quietly, so we can continue 408 00:29:33,720 --> 00:29:39,440 the Q&A. Number 6, please. Mic 6: So how does the performance of the 409 00:29:39,440 --> 00:29:42,809 userspace driver is compared to the XDP solution? 410 00:29:42,809 --> 00:29:51,190 Paul: Um, it's slightly faster. But one important thing about XDP is, if you look 411 00:29:51,190 --> 00:29:54,910 at this, this is still new work and there is ... there are few important 412 00:29:54,910 --> 00:29:58,340 restrictions like you can write your userspace thing in whatever programming 413 00:29:58,340 --> 00:30:01,522 language you want. Like I mentioned, snap has a driver entirely written in Lua. With 414 00:30:01,522 --> 00:30:06,985 XDP you are restricted to eBPF, meaning usually a restricted subset of C and then 415 00:30:06,985 --> 00:30:09,670 there's bytecode verifier but you can disable the bytecode verifier if you want 416 00:30:09,670 --> 00:30:13,990 to disable it, and meaning, you again have weird restrictions that you maybe don't 417 00:30:13,990 --> 00:30:18,960 want and also XDP requires patched driv ... not patched drivers but requires a new 418 00:30:18,960 --> 00:30:23,550 memory model for the drivers. So at moment DPDK supports more drivers than XDP in the 419 00:30:23,550 --> 00:30:26,740 kernel, which is kind of weird, and they're still lacking many features like 420 00:30:26,740 --> 00:30:31,187 sending back to a different NIC. One very very good use case for XDP is 421 00:30:31,187 --> 00:30:35,340 firewalling for applications on the same host because you can pass on a packet to 422 00:30:35,340 --> 00:30:40,309 the TCP stack and this is a very good use case for XDP. But overall, I think that 423 00:30:40,309 --> 00:30:46,761 ... that both things are very very different and XDP is slightly slower but 424 00:30:46,761 --> 00:30:51,077 it's not slower in such a way that it would be relevant. So it's fast, to 425 00:30:51,077 --> 00:30:54,960 answer the question. Herald: All right, unfortunately we are 426 00:30:54,960 --> 00:30:59,172 out of time. So that was the last question. Thanks again, Paul. 427 00:30:59,172 --> 00:31:07,957 *Applause* 428 00:31:07,957 --> 00:31:12,769 *34c3 outro* 429 00:31:12,769 --> 00:31:30,000 subtitles created by c3subtitles.de in the year 2018. Join, and help us!