1 00:00:00,000 --> 00:00:17,860 *35C3 Intro music* 2 00:00:17,860 --> 00:00:23,065 Herald Angel: OK. So this talk is called "A deep dive into the world of DOS 3 00:00:23,065 --> 00:00:33,500 viruses" and if you happened to be at the 8C3, that is 27 years ago, you would have 4 00:00:33,500 --> 00:00:38,599 seen a very young and awkward, even more awkward than I am of the moment, version 5 00:00:38,599 --> 00:00:46,120 of myself, speaking on basically the same subject. The stage of course was a lot 6 00:00:46,120 --> 00:00:50,491 smaller than this, this would have really intimidated me back then, but I was 7 00:00:50,491 --> 00:00:55,160 talking about a university project that we had run for about 3 years at that point, 8 00:00:55,160 --> 00:01:05,500 and our possibilities were very limited. Meanwhile, 27 years later, our speaker, in 9 00:01:05,500 --> 00:01:13,040 between fighting battleships over the public BGP network and trying to encode 10 00:01:13,040 --> 00:01:18,690 data in dubstep music, was able to actually do all of the stuff that we were 11 00:01:18,690 --> 00:01:25,650 trying to do, with a lot of effort, basically, and I guess 4 hours of CPU time 12 00:01:25,650 --> 00:01:32,610 or something like that. Please help me in welcoming Ben to our stage, to talk about 13 00:01:32,610 --> 00:01:35,820 a bygone era. *Applause* 14 00:01:35,820 --> 00:01:40,920 *Applause* 15 00:01:40,920 --> 00:01:48,340 Ben: Thank you. Hi, I'm Ben Cartwright- Cox, as the slide suggests. So I have an 16 00:01:48,340 --> 00:01:53,100 admission to make: So this is a thing to be aware of. 17 00:01:53,100 --> 00:01:56,970 *Laughter* Ben: And you know, things also to be aware 18 00:01:56,970 --> 00:02:07,110 of. Anyway. So what is DOS? To get straight into it. You can do it in a 19 00:02:07,110 --> 00:02:10,947 bullet points way. 
You know, DOS is an upgrade from CP/M, another very old legacy 20 00:02:10,947 --> 00:02:14,819 system, but another thing to be aware of is that DOS covers a wide range of 21 00:02:14,819 --> 00:02:19,950 vendors. It might not just be those old IBM PCs. Some of the DOSes had 22 00:02:19,950 --> 00:02:23,950 compatibility with each other, meaning that some of the DOSes shared malware 23 00:02:23,950 --> 00:02:31,390 with each other. But to be honest, most people know DOS as these lovely old beige 24 00:02:31,390 --> 00:02:37,709 boxes; the same era gave us our beloved Model M keyboard. Hated by some, loved by 25 00:02:37,709 --> 00:02:42,840 others, for the sound. But, you know, most people's knowledge of DOS came from 26 00:02:42,840 --> 00:02:59,599 computers with a user interface that looked like this. Pretty basic. Okay, so this is 27 00:02:59,599 --> 00:03:04,340 WordStar. Some of you may not know that Game of Thrones was written on WordStar. 28 00:03:04,340 --> 00:03:09,281 George R. R. Martin is apparently not a big fan of modern word processing. He 29 00:03:09,281 --> 00:03:16,340 admitted he disliked how spell checking worked, so he just uses this, 30 00:03:16,340 --> 00:03:18,700 and I also guess it's a good security quality, you know, you can't get hacked, 31 00:03:18,700 --> 00:03:24,680 if it literally has no Internet access. For a lot of people, though, this 32 00:03:24,680 --> 00:03:28,310 was also their first experience of programming, for some of the older 33 00:03:28,310 --> 00:03:36,500 crowd. This is also the invention of QBasic, which, you know, gave a very basic 34 00:03:36,500 --> 00:03:40,940 language to program creatively in DOS. For some people this was the gateway drug into 35 00:03:40,940 --> 00:03:47,160 programming and perhaps the gateway drug into what they started as a career. For 36 00:03:47,160 --> 00:03:52,800 other people the experience of DOS was not so great. 
For example, you know, let's 37 00:03:52,800 --> 00:03:57,640 just say you were doing some work in an infinite loop and at some point stuff like 38 00:03:57,640 --> 00:04:04,001 this happens. Unfortunately I don't have sound for this one, but you can just, in 39 00:04:04,001 --> 00:04:09,200 your head, imagine like our PC speakers playing some small techno music, on like, 40 00:04:09,200 --> 00:04:14,310 you know, but only one frequency at a time. This might get especially incredibly 41 00:04:14,310 --> 00:04:18,589 embarrassing, if you are in an office environment, just slowly beeping away. You 42 00:04:18,589 --> 00:04:22,770 can't exit this. It has to finish fully and if you touch the keyboard it reminds you 43 00:04:22,770 --> 00:04:30,069 not to touch the keyboard, and continues playing this music. So, you know, this would be 44 00:04:30,069 --> 00:04:34,319 fun, but this wouldn't be fun, especially in an office environment. But, you know, 45 00:04:34,319 --> 00:04:40,339 ultimately it's not malicious. And that trend continues. This is another good 46 00:04:40,339 --> 00:04:45,240 example of a DOS virus. This is ambulance, for when you run it, an ambulance just 47 00:04:45,240 --> 00:04:50,589 drives past and then your normal program just continues running. I think this is 48 00:04:50,589 --> 00:04:56,729 amazing, it's an interesting era of viruses. It was all, the history of it was 49 00:04:56,729 --> 00:05:01,270 collected very well by a website called VX heavens, which sort of still lives, but 50 00:05:01,270 --> 00:05:06,629 unfortunately, at one point was raided by the Ukrainian police, for what is the 51 00:05:06,629 --> 00:05:11,469 fantastic wording they used. Basically, someone told them they were distributing 52 00:05:11,469 --> 00:05:16,770 Malware. Unfortunately not malware that operates in this century. But I guess 53 00:05:16,770 --> 00:05:21,710 that's good enough for a raid. 
But luckily for the archivists there are archivists of 54 00:05:21,710 --> 00:05:28,809 archivists, and so we have a saved capture of VX heavens. This is actually an old 55 00:05:28,809 --> 00:05:32,770 snapshot, there are way more modern snapshots, but thankfully the MS-DOS virus 56 00:05:32,770 --> 00:05:38,189 era doesn't move very quickly. But the interesting thing here is, like, there's 57 00:05:38,189 --> 00:05:44,349 66,000 items in this tarball and it's 6.6 gigabytes of code. And these viruses are 58 00:05:44,349 --> 00:05:48,580 like super dense. There's not much to them, like they are just blobs of machine 59 00:05:48,580 --> 00:05:51,520 code. They are not like your Electron app these days that ships an entire Chrome 60 00:05:51,520 --> 00:05:57,219 browser, and normally an out-of-date Chrome browser, you know, this is just 61 00:05:57,219 --> 00:06:00,429 basic, like, you know, how to draw an ambulance and, you know, some infection 62 00:06:00,429 --> 00:06:06,629 routines. The normal distribution also changes with it as well. For example, the 63 00:06:06,629 --> 00:06:11,059 normal lifecycle of an MS-DOS virus is, you know, you download, or for some other 64 00:06:11,059 --> 00:06:17,560 reason run an infected program that presumably does nothing; to you it looks 65 00:06:17,560 --> 00:06:22,129 like it does nothing, so, you know, it remains roughly undetected. Then you go 66 00:06:22,129 --> 00:06:27,830 and run more files, the DOS virus infects more files and at some point you're 67 00:06:27,830 --> 00:06:31,069 probably going to give one of those executables to some other computer, or some 68 00:06:31,069 --> 00:06:35,409 other person, whether it was by giving someone or copying a floppy disk of some 69 00:06:35,409 --> 00:06:38,880 software, maybe some expensive software, so they didn't have to pay for it, or 70 00:06:38,880 --> 00:06:44,900 uploading it to a BBS, where it could be downloaded by many people. 
So the 71 00:06:44,900 --> 00:06:49,689 distribution mechanism is a far cry from the EternalBlues of this era, where, you 72 00:06:49,689 --> 00:06:54,449 know, we can have a strain of malware spread across the world very brutally, 73 00:06:54,449 --> 00:07:01,709 very quickly. So most DOS viruses are pretty simple: They start, they say "have 74 00:07:01,709 --> 00:07:06,839 my payload conditions been met?" If not, then they'll carry on; if they are 75 00:07:06,839 --> 00:07:11,799 met, they'll go and display the payload. And the payloads are definitely more, 76 00:07:11,799 --> 00:07:16,949 I don't know, nice. You know, you have stuff like this, which is pretty and it uses VGA 77 00:07:16,949 --> 00:07:20,580 colors and all sorts of pretty nice stuff. You also get some very demoscene vibes 78 00:07:20,580 --> 00:07:26,270 from this. Another good example is this like VGA, like super trippy thing, which 79 00:07:26,270 --> 00:07:29,909 is really impressive, 'cause this is really small. This is less than 1 kilobyte 80 00:07:29,909 --> 00:07:34,870 of code. It's in fact way less than 1 kilobyte, it's like 64k. Or you just get 81 00:07:34,870 --> 00:07:38,591 like interesting screen effects as well. For example, it's quick, but like, you can 82 00:07:38,591 --> 00:07:43,580 just watch the entire computer just dissolve away, which also might be quite 83 00:07:43,580 --> 00:07:47,929 worrying, if you weren't expecting that. Alternatively, if the payload conditions 84 00:07:47,929 --> 00:07:52,860 are not met, then, you know, you hook syscalls or, alternatively, if you 85 00:07:52,860 --> 00:07:56,870 want to be way more aggressive, as a malware author, you scan for files on the 86 00:07:56,870 --> 00:08:02,649 system to infect proactively. And the way you infect DOS programs is pretty simple: 87 00:08:02,649 --> 00:08:07,219 Imagine you have like one giant tape of all the code you have for the target 88 00:08:07,219 --> 00:08:11,499 program. 
Most of them work like this: They replace the first 3 bytes of the program 89 00:08:11,499 --> 00:08:16,909 with a x86 jump. They append their malware onto the end of the executable, and so the 90 00:08:16,909 --> 00:08:19,779 first thing that you do, when you run the executable, is it jumps to the end of the 91 00:08:19,779 --> 00:08:25,489 file, effectively, runs the malware chunk, and then it optionally will return control 92 00:08:25,489 --> 00:08:33,800 back to the original program. But there's also the thing about hooking syscalls, right? 93 00:08:33,800 --> 00:08:39,219 So, you know, MS-DOS is an operating system, it does have syscalls, 94 00:08:39,219 --> 00:08:43,779 programs can reach out to MS-DOS, to do things like file access and stuff, so as 95 00:08:43,779 --> 00:08:48,990 you expect, you run a software interrupt to get there. Thankfully though, MS-DOS 96 00:08:48,990 --> 00:08:55,829 does also allow you to extend MS-DOS by adding handlers itself, or even 97 00:08:55,829 --> 00:08:59,029 overwriting existing handlers, which is very convenient, if you are trying to 98 00:08:59,029 --> 00:09:02,160 write drivers, but it's also incredibly convenient, if you're trying to write 99 00:09:02,160 --> 00:09:09,410 malware. For some of the examples of the syscalls, most of them relevant towards 100 00:09:09,410 --> 00:09:15,530 DOS virus making. Here's a decent example of the things that DOS will provide you. A lot 101 00:09:15,530 --> 00:09:21,180 of them are just very useful in general for producing functional executables the 102 00:09:21,180 --> 00:09:25,660 end users want to use. This is what an average program looks like. This is almost 103 00:09:25,660 --> 00:09:29,269 the shortest hello world you can make, minus the actual hello world string. In 104 00:09:29,269 --> 00:09:34,870 fact, the hello world string might be the largest part of this binary. It's a pretty 105 00:09:34,870 --> 00:09:40,480 simple binary. 
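The infection layout described above (a three-byte jump patched over the start of the program, with the virus body appended at the end) leaves a recognisable shape in a COM file. As a rough sketch of how one might check a sample for that shape, the Python below decodes a leading near jump and reports where it lands. The function names and the `tail_len` parameter are made up for illustration, and it assumes the jump is the `E9 rel16` form and that the file is a plain COM image loaded at offset 100h; it is not a description of any tool used in the talk.

```python
import struct

COM_LOAD_OFFSET = 0x100  # a .COM image is loaded at offset 100h of its segment


def entry_jump_target(image: bytes):
    """Return the file offset a leading `E9 rel16` jump lands on, or None."""
    if len(image) < 3 or image[0] != 0xE9:
        return None
    (rel16,) = struct.unpack_from("<H", image, 1)
    # The displacement is relative to the instruction after the 3-byte jump
    # and wraps at 16 bits, like the instruction pointer itself.
    target_ip = (COM_LOAD_OFFSET + 3 + rel16) & 0xFFFF
    offset = target_ip - COM_LOAD_OFFSET
    return offset if 0 <= offset < len(image) else None


def jumps_into_tail(image: bytes, tail_len: int) -> bool:
    """True if the program starts by jumping into its last `tail_len` bytes."""
    target = entry_jump_target(image)
    return target is not None and target >= len(image) - tail_len


# Synthetic, inert example: 32 bytes of "host" plus 8 bytes of appended filler.
host = b"\x90" * 32
tail = b"\xC3" * 8
patched = b"\xE9" + struct.pack("<H", len(host) - 3) + host[3:] + tail
```

A file that starts with a jump is not necessarily infected (plenty of clean programs begin that way), so a check like this is only a first filter.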
Here we're moving a pointer to the message we just set. We 106 00:09:40,480 --> 00:09:50,410 then set the AH register to 9, or hex 9. That's the syscall for printing a string, 107 00:09:50,410 --> 00:09:58,300 and then we run a software interrupt, 21h, which is short for 21 hex, and we continue on. 108 00:09:58,300 --> 00:10:06,589 We then set AH again, to 4C, which is exit with a return code, and the program 109 00:10:06,589 --> 00:10:12,439 will return. So, in the meantime, this is roughly the loop that just happened. 110 00:10:12,439 --> 00:10:18,470 You have your program code, that calls an interrupt and that gets passed over to the 111 00:10:18,470 --> 00:10:22,189 interrupt handler. In the process of doing this, the CPU has quickly looked at the 112 00:10:22,189 --> 00:10:28,430 first 100 bytes of memory in the interrupt vector table, IVT, as it's abbreviated, 113 00:10:28,430 --> 00:10:32,300 and then it's effectively a router. If anyone has written like a small piece of 114 00:10:32,300 --> 00:10:36,149 code to route HTTP requests, or anything, it's basically like that, but in the 80s, 115 00:10:36,149 --> 00:10:41,029 with syscalls. So it's just basically saying "Compare this, compare that, jump 116 00:10:41,029 --> 00:10:46,240 there, jump there." Then the thing gets passed to the call handler, it goes and 117 00:10:46,240 --> 00:10:49,740 does the syscall, the thing that was required. Normally it will leave some 118 00:10:49,740 --> 00:10:55,130 registers behind, as state or results of actions it has performed, and it returns 119 00:10:55,130 --> 00:10:59,519 control back to the program. 
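To make the hello-world walkthrough above concrete, here is a minimal sketch in Python that assembles those five instructions by hand into a COM image. The opcode bytes are the standard 16-bit x86 encodings; the function names and the message text are assumptions for illustration, and the slide's actual program may differ in detail. The second helper reflects the real-mode layout of the interrupt vector table, where each of the 256 vectors is a 4-byte offset:segment pair.

```python
import struct

COM_LOAD_OFFSET = 0x100  # DOS loads a .COM image at offset 100h of its segment


def build_hello_com(message: bytes = b"Hello, world!$") -> bytes:
    """Assemble a five-instruction hello world followed by its message."""
    code_len = 3 + 2 + 2 + 2 + 2  # sizes of the five instructions below
    msg_addr = COM_LOAD_OFFSET + code_len
    return (
        b"\xBA" + struct.pack("<H", msg_addr)  # mov dx, msg_addr
        + b"\xB4\x09"                          # mov ah, 09h (print string)
        + b"\xCD\x21"                          # int 21h
        + b"\xB4\x4C"                          # mov ah, 4Ch (exit; AL is not set here)
        + b"\xCD\x21"                          # int 21h
        + message                              # string must end in '$'
    )


def ivt_entry_address(interrupt: int) -> int:
    """Linear address of an interrupt's 4-byte offset:segment vector."""
    return interrupt * 4


image = build_hello_com()
```

With the default message the image is 25 bytes long, 14 of which are the string, and the vector for interrupt 21h sits at linear address 84h.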
So, theoretically speaking, if we wanted to go 120 00:10:59,519 --> 00:11:04,199 and look at what a program actually does we need to set a breakpoint here, because 121 00:11:04,199 --> 00:11:11,030 this is the only place that we can be sure the location exists, because this is way 122 00:11:11,030 --> 00:11:15,760 before the era of ASLR, address space randomisation, and this is way, way before 123 00:11:15,760 --> 00:11:19,819 the era of kernel space randomisation. In fact, MS-DOS has almost no memory 124 00:11:19,819 --> 00:11:24,610 protection whatsoever. Once you run a program you are basically handing full 125 00:11:24,610 --> 00:11:29,430 control of the system to that program, which means you can happily also boot 126 00:11:29,430 --> 00:11:33,870 things like Linux directly from a COM file, which is handy if you want to 127 00:11:33,870 --> 00:11:43,860 upgrade. So, if we look at certain files we can go and see what they do. So in this 128 00:11:43,860 --> 00:11:50,110 case, here is one example. This is a goat file. A goat file is like a sacrificial 129 00:11:50,110 --> 00:11:54,699 goat. It is a file that is purely designed to be infected. So what you do is you 130 00:11:54,699 --> 00:11:59,790 bring a virus into memory in the system and then you run a goat file, in 131 00:11:59,790 --> 00:12:03,879 the vague hope that the virus will infect it, and then you have a nice clean sample 132 00:12:03,879 --> 00:12:08,450 of just that virus and not another program inside the virus, which makes it way 133 00:12:08,450 --> 00:12:12,079 easier to test and reverse engineer. So, we can see things are happening here. For 134 00:12:12,079 --> 00:12:16,600 example, we can see it opening a file, moving where it's looking in the 135 00:12:16,600 --> 00:12:19,770 file, reading some data from the file, just 2 bytes, though, and it closes the 136 00:12:19,770 --> 00:12:23,839 file. 
We see the same sort of thing repeat itself, except at one point it reads a 137 00:12:23,839 --> 00:12:27,529 large amount of data, moves the file pointer, writes another large amount of 138 00:12:27,529 --> 00:12:32,769 data, does some more stuff, and yeah, we pass some filenames, we display a string, 139 00:12:32,769 --> 00:12:39,230 which is almost definitely the goat file message and yeah, we pretty much exit 140 00:12:39,230 --> 00:12:42,860 after that. So, there were a few syscalls here that we would really like to know 141 00:12:42,860 --> 00:12:48,790 more about. So, for that, it's the open files, we'd really like to know what files 142 00:12:48,790 --> 00:12:52,870 were being opened. We would also want to know what, we'd like to know, what data 143 00:12:52,870 --> 00:12:55,950 was being written to the file, rather than having to fish it out of the virtual 144 00:12:55,950 --> 00:13:00,550 machine later, and we'd also, just out of curiosity, really want to know what 145 00:13:00,550 --> 00:13:05,420 filenames it was asking MS-DOS to parse. Display string is also a nice test to 146 00:13:05,420 --> 00:13:08,519 know, whether your code is working. So to do this you're gonna have to look a little 147 00:13:08,519 --> 00:13:14,529 bit deeper into how the MS-DOS runtime and, by proxy, how x86 in 16-bit mode 148 00:13:14,529 --> 00:13:20,250 works, or legacy mode, I guess. This is basically all the registers you have in 149 00:13:20,250 --> 00:13:26,120 16-bit mode, and some nice computations at the bottom, to make it easier to read. 150 00:13:26,120 --> 00:13:33,550 So, as we mentioned, AH is the one that you use to specify, which syscall you want, 151 00:13:33,550 --> 00:13:40,339 and you'll notice it's not there. AH is actually the upper half of AX. AH is a 152 00:13:40,339 --> 00:13:46,320 8-bit register, because sometimes people really just wanted only 8 bits. It's very 153 00:13:46,320 --> 00:13:53,579 obscure that we were saving that much space. 
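The relationship between AX and its 8-bit halves described above is just shifting and masking. A small sketch in Python, with helper names that are purely illustrative:

```python
def split_ax(ax: int) -> tuple:
    """Return (AH, AL): the upper and lower 8 bits of the 16-bit AX."""
    ax &= 0xFFFF
    return (ax >> 8) & 0xFF, ax & 0xFF


def with_ah(ax: int, ah: int) -> int:
    """Return AX with its upper byte replaced, leaving AL untouched."""
    return ((ah & 0xFF) << 8) | (ax & 0x00FF)
```

So an AX value of 4C00h means AH is 4Ch (the exit syscall) and AL is 00h, and selecting a different syscall only ever touches the upper byte.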
And so, this is the 154 00:13:53,579 --> 00:13:57,660 definition of the syscall for printing a string. So AH needs to be set to 155 00:13:57,660 --> 00:14:02,839 9: in order to call the syscall for printing a string, you set AH to 156 00:14:02,839 --> 00:14:09,070 9, and then you need to set DS and DX to a pointer to a string that ends in a dollar. 157 00:14:09,070 --> 00:14:11,890 And that doesn't make a lot of sense, or it didn't make a lot of sense to me, when 158 00:14:11,890 --> 00:14:15,579 I first read that and so, to do this, we need to learn a little bit more about 159 00:14:15,579 --> 00:14:19,730 how memory works, on these old CPUs, or the CPUs that are probably in your 160 00:14:19,730 --> 00:14:25,720 laptops, but running in an older mode. So this is effectively what it looks like. 161 00:14:25,720 --> 00:14:31,839 We have a 16-bit CPU, 2 to the 16 is 64 kilobytes, and we have a 20-bit memory 162 00:14:31,839 --> 00:14:36,350 addressing space. 2 to the 20 is 1 megabyte, so if you ever see an MS-DOS machine 163 00:14:36,350 --> 00:14:39,519 limiting at 1 megabyte, or some old operating system, saying the maximum 164 00:14:39,519 --> 00:14:43,980 memory you can have is 1 megabyte, it's because it's running in 16-bit mode. And 165 00:14:43,980 --> 00:14:50,249 the maximum it can physically see is 20 bits. So the question is: How do we 166 00:14:50,249 --> 00:14:58,580 address anything above 64K, if the CPU can only fundamentally see 16 bits? So, this 167 00:14:58,580 --> 00:15:02,399 is where segment registers come in. We have 4 segment registers, actually we 168 00:15:02,399 --> 00:15:05,899 might have more, but they're the ones we need to care about. There's the code 169 00:15:05,899 --> 00:15:10,819 segment, the data segment, the stack segment and the extra segment, in case you 170 00:15:10,819 --> 00:15:15,420 need just another one. 
So anyway, with that in mind, let's have a quick crash 171 00:15:15,420 --> 00:15:21,419 course on segment registers. So, imagine if you have a very long piece of memory, 172 00:15:21,419 --> 00:15:30,430 and we can only see 16 bits at a time. So, however, we can move the sliding window 173 00:15:30,430 --> 00:15:36,180 around in the memory, to go and see, like, to move our view of where it is. So, we 174 00:15:36,180 --> 00:15:42,410 can do this and put data around the system, and we can use the final pointer 175 00:15:42,410 --> 00:15:48,589 to specify, how far in to the memory segment we should go. So the DS and DX 176 00:15:48,589 --> 00:15:55,360 really just means a multiplier. So, where the data segment is 100, you need to just 177 00:15:55,360 --> 00:16:01,350 move 100 times 16 to get to the correct place in memory, and then DX is the 178 00:16:01,350 --> 00:16:09,170 offset. This continues on, so, where we have a 16 bit cpu, we have a bunch of 179 00:16:09,170 --> 00:16:13,220 general use registers or general purpose registers. They're quite useful for 180 00:16:13,220 --> 00:16:17,379 ensuring, you don't need to touch RAM too often. x86 actually has a fairly small 181 00:16:17,379 --> 00:16:25,240 amount of general purpose registers. Some architectures have way more. I think more 182 00:16:25,240 --> 00:16:32,139 modern chips like GPUs have hundreds, well hundreds, maybe thousands. However, this 183 00:16:32,139 --> 00:16:34,699 doesn't really change over time in x86 because we have to force backwards 184 00:16:34,699 --> 00:16:38,139 compatibility. So, really what actually ends up happening, when we move up the 185 00:16:38,139 --> 00:16:42,709 bittage, is that the same registers just get wider, and we add some more ones for 186 00:16:42,709 --> 00:16:45,499 the programmers, that want them, and the exact same thing happened to 64 bit: The 187 00:16:45,499 --> 00:16:52,970 registers just got wider. 
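The segment arithmetic described above (segment times 16, plus the offset) can be written down in a couple of lines. The sketch below, in Python, also shows how a DS:DX pointer to a dollar-terminated string could be resolved against a flat memory dump. The function names, the `limit` parameter and the flat `memory` buffer are assumptions for illustration; the 20-bit wrap matches the original 8086.

```python
def linear_address(segment: int, offset: int) -> int:
    """Real-mode address: segment * 16 + offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF


def read_dollar_string(memory: bytes, ds: int, dx: int, limit: int = 100) -> bytes:
    """Read the '$'-terminated string at DS:DX, looking at most `limit` bytes."""
    start = linear_address(ds, dx)
    chunk = memory[start:start + limit]
    end = chunk.find(b"$")
    return chunk if end == -1 else chunk[:end]
```

With a data segment of 100 and an offset of 5, the string starts at linear address 100 * 16 + 5 = 1605. Many different segment:offset pairs name the same byte, which is why a tracer generally wants to normalise to linear addresses before comparing pointers.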
So thinking about it, we have a lot of malware now, 188 00:16:52,970 --> 00:16:58,319 what if we want to know everything that's happened in this entire archive? So we 189 00:16:58,319 --> 00:17:01,420 kind of want to trace all of these automatically, but we might not know what 190 00:17:01,420 --> 00:17:04,480 we're looking for, so let's go through the checklist of what we need to do, to trace 191 00:17:04,480 --> 00:17:09,335 all of this malware. We need to set a breakpoint on the syscall handler. When we get 192 00:17:09,335 --> 00:17:13,260 that breakpoint, we need to save all the registers, so we know which syscall was 193 00:17:13,260 --> 00:17:19,880 run and potentially what data is being given to the syscall. Ideally, we're going 194 00:17:19,880 --> 00:17:25,130 to save one hundred bytes from that data pointer, not especially because we need 195 00:17:25,130 --> 00:17:28,149 it, but it's quite handy in a lot of syscalls. It's for 196 00:17:28,149 --> 00:17:34,429 example what you use to get the open file path, when you're opening files. We should 197 00:17:34,429 --> 00:17:37,649 also, probably, record the screen for quick analysis, rather than just staring 198 00:17:37,649 --> 00:17:43,870 at HTML tables, and so we can do that, we burn a lot of CPU time and probably cause 199 00:17:43,870 --> 00:17:51,120 some minor amounts of environmental damage. And we get nothing. We just run a 200 00:17:51,120 --> 00:17:55,080 bunch of stuff and most of them don't return anything. At best they return a 201 00:17:55,080 --> 00:18:02,770 goat file string. They just do nothing. So, if we look deeper into the reason why, 202 00:18:02,770 --> 00:18:05,490 there's sort of a smoking gun here, so we can see the syscalls that run on this file 203 00:18:05,490 --> 00:18:09,840 that does nothing, and the smoking gun here is the date. 
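The tracing checklist above (stop at the syscall handler, save the registers, grab about a hundred bytes from the data pointer) maps naturally onto a small record type. The Python below is only a sketch of what such a record might look like: the class, the `capture` function and the shape of the `regs` dictionary are hypothetical, not the tracer built for the talk. The function numbers in the table are the standard INT 21h ones.

```python
from dataclasses import dataclass

# A few INT 21h function numbers (the value in AH) relevant to the talk.
SYSCALL_NAMES = {
    0x09: "display string",
    0x29: "parse filename",
    0x2A: "get date",
    0x2C: "get time",
    0x3D: "open file",
    0x3E: "close file",
    0x3F: "read file",
    0x40: "write file",
    0x42: "move file pointer",
    0x4C: "exit with return code",
}


@dataclass
class SyscallEvent:
    ax: int
    bx: int
    cx: int
    dx: int
    ds: int
    data: bytes = b""  # bytes captured at DS:DX

    @property
    def function(self) -> int:
        return (self.ax >> 8) & 0xFF  # AH selects the syscall

    @property
    def name(self) -> str:
        return SYSCALL_NAMES.get(self.function, f"unknown ({self.function:02X}h)")


def capture(regs: dict, memory: bytes, grab: int = 100) -> SyscallEvent:
    """Build an event from a register snapshot taken at the breakpoint."""
    start = ((regs["ds"] << 4) + regs["dx"]) & 0xFFFFF
    return SyscallEvent(regs["ax"], regs["bx"], regs["cx"], regs["dx"],
                        regs["ds"], memory[start:start + grab])
```

For an open-file call, the captured bytes start with the NUL-terminated path, which is how the file names end up in the trace without digging through the virtual machine afterwards.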
So it's asking for the 204 00:18:09,840 --> 00:18:15,190 date from the system, and this sort of flags out the first issue, is that a lot 205 00:18:15,190 --> 00:18:18,750 of MS-DOS viruses don't really have a lot to go on, because they have no internet 206 00:18:18,750 --> 00:18:24,180 connection, and there's not really any other state they can decide to activate on. 207 00:18:24,180 --> 00:18:28,600 So the date syscall is pretty simple. The get date and get time just return all 208 00:18:28,600 --> 00:18:34,360 of their values as registers. And, you know, some using the 8-bit halves, to save 209 00:18:34,360 --> 00:18:44,970 space. So, a naive way of doing this, is what we do, is we would run the sample, 210 00:18:44,970 --> 00:18:50,030 we'd wait for the syscall for date or time, we would just fiddle the values, 211 00:18:50,030 --> 00:18:53,240 'cause in this case we're using a debugger, so we can automatically change, what the 212 00:18:53,240 --> 00:18:56,760 state registers are, and we can then observe to see, if any of the syscalls 213 00:18:56,760 --> 00:18:59,580 that the program ran changed, which is a pretty good indication that you've hit 214 00:18:59,580 --> 00:19:04,330 some behavior that is different. And then, you know, we can say "Hooray, we found a 215 00:19:04,330 --> 00:19:08,330 new test case!" The downside is: running every one of these samples takes 15 216 00:19:08,330 --> 00:19:13,940 seconds of CPU-time because MS-DOS, well, 15 seconds of wall-time, which, 217 00:19:13,940 --> 00:19:18,080 when you are emulating MS-DOS is 15 seconds of CPU-time because of the fact 218 00:19:18,080 --> 00:19:20,610 that MS-DOS doesn't have power saving mode, so when it's not doing anything, it 219 00:19:20,610 --> 00:19:27,120 just goes into a busy loop which makes it very hard to optimize. Or we could take a 220 00:19:27,120 --> 00:19:33,350 cleverer look. 
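As a sketch of what "fiddling the values" involves for the date and time calls discussed above: INT 21h function 2Ah returns the year in CX, the month in DH, the day in DL and the day of the week in AL (0 is Sunday), and function 2Ch returns the hour in CH, the minute in CL, the second in DH and hundredths in DL. The Python below packs a chosen date or time into those registers. The function names and the dictionary-of-registers shape are assumptions for illustration.

```python
from datetime import date, time


def get_date_registers(d: date) -> dict:
    """Register values INT 21h/AH=2Ah would return for the given date."""
    return {
        "cx": d.year,
        "dx": (d.month << 8) | d.day,   # DH = month, DL = day
        "al": (d.weekday() + 1) % 7,    # DOS counts from Sunday = 0
    }


def get_time_registers(t: time) -> dict:
    """Register values INT 21h/AH=2Ch would return for the given time."""
    return {
        "cx": (t.hour << 8) | t.minute,                  # CH = hour, CL = minute
        "dx": (t.second << 8) | (t.microsecond // 10000),  # DH = second, DL = 1/100 s
    }
```

A debugger script could write values like these into the guest's registers as the handler returns, so the sample believes it is running on whatever day is being tested.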
So when we think about it, we are in the interrupt handler where all 221 00:19:33,350 --> 00:19:36,830 we ever see is the insides of the interrupt handler because we don't know 222 00:19:36,830 --> 00:19:40,990 where the program code is. The interrupt handler is the only place that we know is 223 00:19:40,990 --> 00:19:45,450 consistent because MS-DOS could potentially load the code for the malware 224 00:19:45,450 --> 00:19:50,610 or the program anywhere. But we want to know where the code is. It would be really 225 00:19:50,610 --> 00:19:54,250 handy to know what the code is that we'd be about to run. So for this we need to 226 00:19:54,250 --> 00:19:59,190 look towards the stack. Just like the DS and DX registers, the stack is located by a 227 00:19:59,190 --> 00:20:02,970 stack segment and a stack pointer. Luckily, the first two values are the 228 00:20:02,970 --> 00:20:07,130 instruction pointer and the code segment, so we can use that to grab 229 00:20:07,130 --> 00:20:10,779 exactly what code will be run afterwards. So we just need to add a few 230 00:20:10,779 --> 00:20:14,440 things to our checklist. We need to grab 4 bytes from the stack pointer and then 231 00:20:14,440 --> 00:20:18,370 using that, we can calculate the destination that the syscall will return 232 00:20:18,370 --> 00:20:22,549 to. And if we look at some of them - we can look at an example here - well, this 233 00:20:22,549 --> 00:20:27,243 is a piece of what one of the calls returns to. So we see we're running a compare 234 00:20:27,243 --> 00:20:36,640 on DL against hex 0x1E. And then if that comparison is equal it will 235 00:20:36,640 --> 00:20:43,171 jump to one memory address. And if not it will jump to another. So if we look back 236 00:20:43,171 --> 00:20:52,560 at the definition of those syscalls we can see that DL is the day. 
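The return-address lookup described above can be sketched as follows. When a software interrupt fires in real mode, the CPU pushes FLAGS, then CS, then IP, so at the handler's entry the four bytes at SS:SP hold the return IP followed by the return CS. The Python below reads them from a flat memory dump; the function name and the `memory` buffer are assumptions for illustration, and it only holds if the breakpoint sits at the very start of the handler, before anything else has been pushed.

```python
import struct


def return_address(memory: bytes, ss: int, sp: int) -> tuple:
    """Return (CS, IP, linear address) the interrupt handler will return to."""
    top = ((ss << 4) + sp) & 0xFFFFF
    ip, cs = struct.unpack_from("<HH", memory, top)  # IP is on top, CS below it
    return cs, ip, ((cs << 4) + ip) & 0xFFFFF
```

The linear address is where the calling program's next instruction lives, so dumping a few dozen bytes from there gives the small code fragment that decides what to do with the syscall's result.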
So with this we 237 00:20:52,560 --> 00:21:01,150 can conclude that, if 0x1E is 30 and DL is the day, this malware effectively is 238 00:21:01,150 --> 00:21:07,120 saying: if the day of the month is 30 we need to go down a different path. If we run 239 00:21:07,120 --> 00:21:11,950 these all over time across the whole dataset what we see is roughly this as a 240 00:21:11,950 --> 00:21:21,740 polydome bar chart. We see that out of the 17,500 samples we have, around 4,700 of them 241 00:21:21,740 --> 00:21:24,330 checked for the date and time and these are the ones that are really tricky 242 00:21:24,330 --> 00:21:27,590 because they're really hard to activate. They're also the most interesting though, because 243 00:21:27,590 --> 00:21:33,900 those are the ones trying to hide. So, with that in mind, we have the code 244 00:21:33,900 --> 00:21:38,100 segment that we're about to run, when we return and we can't really brute force 245 00:21:38,100 --> 00:21:43,730 because it takes a little CPU-time and we can't brute force it inside a 'real' or 246 00:21:43,730 --> 00:21:47,419 emulated machine but we can brute force it in a significantly more interesting way. 247 00:21:47,419 --> 00:21:53,960 We need to build something: we need to build the world's worst x86 emulator, so 248 00:21:53,960 --> 00:22:02,019 dubbed BenX86. It's 16-bit only. Any attempt to access memory effectively ends 249 00:22:02,019 --> 00:22:06,029 the simulation. It's got a fake stack: if you try and push something onto the stack 250 00:22:06,029 --> 00:22:09,640 it says "sure, fine"; if you try and pop it, it's like "oh, actually I never held any of 251 00:22:09,640 --> 00:22:13,690 that data anyway, so we are ending the simulation". 80 opcodes, most of them are 252 00:22:13,690 --> 00:22:18,900 jumps. Because that's the primary purpose, comparing and jumps. 
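To illustrate the idea of an emulator that understands little more than compares and jumps, here is a deliberately tiny sketch in Python. It is not BenX86: it handles only `cmp dl, imm8` (`80 FA ib`), `je rel8` (`74`) and `jne rel8` (`75`), stops at anything else, and records the offsets it executed. Sweeping the day register over a range of dates and grouping by the recorded path then shows which dates lead to different behaviour. All names here are hypothetical.

```python
from datetime import date, timedelta


def trace_path(code: bytes, dl: int, max_steps: int = 64) -> tuple:
    """Follow a compare-and-jump fragment; return the offsets executed."""
    ip, zero_flag, path = 0, False, []
    while 0 <= ip < len(code) and len(path) < max_steps:
        path.append(ip)
        op = code[ip]
        if code[ip:ip + 2] == b"\x80\xFA" and ip + 2 < len(code):
            zero_flag = dl == code[ip + 2]         # cmp dl, imm8
            ip += 3
        elif op in (0x74, 0x75) and ip + 1 < len(code):
            rel = code[ip + 1]
            rel = rel - 256 if rel > 127 else rel   # sign-extend rel8
            taken = zero_flag if op == 0x74 else not zero_flag
            ip += 2 + (rel if taken else 0)         # je / jne rel8
        else:
            break  # any other opcode ends the simulation
    return tuple(path)


def representative_dates(code: bytes, start: date, end: date) -> dict:
    """Map each distinct execution path to the first date that triggers it."""
    found = {}
    day = start
    while day <= end:
        found.setdefault(trace_path(code, day.day), day)
        day += timedelta(days=1)
    return found


# cmp dl, 1Eh / je +1 / nop / nop
fragment = bytes([0x80, 0xFA, 0x1E, 0x74, 0x01, 0x90, 0x90])
paths = representative_dates(fragment, date(1980, 1, 1), date(2005, 12, 31))
```

For this fragment the sweep finds two paths, one for ordinary days and one first reached on 30 January 1980, so only two full runs of the sample would be needed instead of one per date.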
The 253 00:22:18,900 --> 00:22:23,630 difference is it logs every opcode and every address that it went through and it can be 254 00:22:23,630 --> 00:22:29,210 run with just a small x86 code segment and a register snapshot. This means that we 255 00:22:29,210 --> 00:22:34,909 can test all dates from 1980 to 2005 in roughly 100 milliseconds, and most 256 00:22:34,909 --> 00:22:40,860 programs ended up having just 3 different code paths on average, so that leaves us 257 00:22:40,860 --> 00:22:48,019 with 17,000 virus samples and about 10,000 samples that had date variations, as in: 258 00:22:48,019 --> 00:22:53,539 once you exploit the complexity. So I'm going to now use my final remaining time 259 00:22:53,539 --> 00:22:59,769 to go through some of my favorites. So this is an example of a virus that just 260 00:22:59,769 --> 00:23:04,440 doesn't do anything on the 1st of 1980. However if you happened to be running this 261 00:23:04,440 --> 00:23:08,477 on New Year's Day you would get this. *Laughter* 262 00:23:08,477 --> 00:23:10,610 No matter what you do, every program you can't 263 00:23:10,610 --> 00:23:14,940 exit out of this, your machine is hung. This might be great, right? You might be like: 264 00:23:14,940 --> 00:23:19,040 'Oh cool, I don't need to do work anymore because my computer will literally not let me' 265 00:23:19,040 --> 00:23:21,049 This also might be terrible, because you might need to do some work on New 266 00:23:21,049 --> 00:23:28,100 Year's day. Here's another example. This does nothing as well, just another innocent 267 00:23:28,100 --> 00:23:33,600 .com file. Of course, a reminder: these pieces of malware would be wrapped around 268 00:23:33,600 --> 00:23:37,620 something else. Almost anything could be infected in here. In this case though 269 00:23:37,620 --> 00:23:46,880 this binary is nice and shaved down. 
However instead we get this, which I think 270 00:23:46,880 --> 00:23:53,564 is super interesting and is basically the author is aware - they're telling you, they 271 00:23:53,564 --> 00:23:57,110 are actually like self-disclosing in saying the previous year I've infected 272 00:23:57,110 --> 00:24:04,800 your computer. And for some reason it's being nice. They're just saying: Actually 273 00:24:04,800 --> 00:24:11,580 you have been infected. And as a - I guess a pity - I'm just going to remove myself now. 274 00:24:11,580 --> 00:24:17,120 I don't really. For some reason it's also encouraging you to buy McAfee. This is 275 00:24:17,120 --> 00:24:26,179 back in the day when John McAfee himself actually wrote McAfee. Interesting times. 276 00:24:26,179 --> 00:24:33,059 Definitely interesting times. Here is another example. This one I found 277 00:24:33,059 --> 00:24:41,450 particularly obscure. On the 8th of November 1980, or any year I think, actually, 278 00:24:41,450 --> 00:24:51,110 it turns all zeroes on the system into tiny little glyphs that say "hate". If 279 00:24:51,110 --> 00:24:54,760 anyone understands this I'd really like to know, like I've been thinking about this a 280 00:24:54,760 --> 00:25:01,950 lot. What does it mean? Is it an artistic statement? Is it. I wish I knew. 281 00:25:01,950 --> 00:25:05,669 Someone in the audience: it says MATE Ben: There could be a CCC variant that says 282 00:25:05,669 --> 00:25:12,630 MATE. Another good one, in that it's the last thing I ever want to see any program 283 00:25:12,630 --> 00:25:19,669 tell me, is this one here where you run it and it says "error eating drive C:". I 284 00:25:19,669 --> 00:25:25,070 never ever want an error in any program that unexpectedly just says 'Sorry, I almost 285 00:25:25,070 --> 00:25:30,159 failed to remove your root file system, don't know why, could you like change your 286 00:25:30,159 --> 00:25:35,940 settings so I can remove it?' Cheers. 
And finally this is one of my absolute 287 00:25:35,940 --> 00:25:41,420 favorites, in that it's just brilliant, in that it also stops you from running the 288 00:25:41,420 --> 00:25:46,490 program you want to run: it exits prematurely. This is the virus version of 289 00:25:46,490 --> 00:25:50,607 the Navy SEAL copypasta. It says "I am an assassin. I want to and I shall kill you." 290 00:25:50,607 --> 00:25:59,809 "I also hate Aladdin and I also will kill it. I will eliminate you with ...". You know where 291 00:25:59,809 --> 00:26:04,880 this is going. It says: fear the virus that is more powerful than God. 292 00:26:04,880 --> 00:26:10,830 It only activates on one day though, so it's fine. Thank you for your time. I know 293 00:26:10,830 --> 00:26:15,480 it's late and I will happily take any questions or corrections if you know this 294 00:26:15,480 --> 00:26:27,029 topic better than me. *applause* 295 00:26:27,029 --> 00:26:33,410 Herald: This totally brings tears to my eyes with nostalgia. So if there are any 296 00:26:33,410 --> 00:26:37,970 questions, we have microphones distributed around the room, there is like 1, 2, 3, 4 and 297 00:26:37,970 --> 00:26:42,630 one in the back. We also have questions perhaps from the internet. If you want to 298 00:26:42,630 --> 00:26:47,980 ask a question, come up to the microphone, ask the question. Just as a reminder: a 299 00:26:47,980 --> 00:26:53,789 question is one or two sentences with a question mark behind it and not a life 300 00:26:53,789 --> 00:27:00,840 story attached. So let's see what we have. I'm going to start with microphone number 301 00:27:00,840 --> 00:27:04,470 1 just because I can see it easiest, let's go for it. 302 00:27:04,470 --> 00:27:09,559 Microphone 1: Hi Ben, thanks for the talk. Really interesting. My question would be: 303 00:27:09,559 --> 00:27:16,297 did you do any analysis on what ratio of the viruses was more artistic 304 00:27:16,297 --> 00:27:20,690 and which ones actually did damage?
Ben: So most of them surprisingly don't do 305 00:27:20,690 --> 00:27:26,450 damage. I actually really struggled to find a date-varying sample that 306 00:27:26,450 --> 00:27:30,140 specifically activated on a certain day and decided to delete every file. There 307 00:27:30,140 --> 00:27:35,259 are some very good ones; some of them are like virus scanning utilities that just 308 00:27:35,259 --> 00:27:37,990 don't do anything on certain dates, and on one day, while they're telling you all 309 00:27:37,990 --> 00:27:41,120 the files they are scanning, they are actually telling you all the files they're 310 00:27:41,120 --> 00:27:46,120 deleting. So that's particularly cruel, but it's actually surprisingly hard to find a 311 00:27:46,120 --> 00:27:50,480 virus sample that actually was brutally malicious. There were some that would just, 312 00:27:50,480 --> 00:27:53,910 you know, infect binaries, but it's very hard to find one that I think was brutally 313 00:27:53,910 --> 00:27:58,100 malicious, which is a far cry from the days, well, from the days that we live in right 314 00:27:58,100 --> 00:28:03,549 now, where we're taking down hospitals with Windows bugs. 315 00:28:03,549 --> 00:28:09,210 Herald: As everybody is leaving the room, please do it quietly. I see a question at 316 00:28:09,210 --> 00:28:12,200 (microphone) 3, on that side. Microphone 3: Yes. Since a lot of 317 00:28:12,200 --> 00:28:19,970 industrial control systems still run DOS, what's the threat from DOS malware that 318 00:28:19,970 --> 00:28:27,150 might be written today? Ben: It's probably unlikely that an 319 00:28:27,150 --> 00:28:31,009 Industrial Control System that's running DOS would come into contact with DOS malware. 320 00:28:31,009 --> 00:28:36,010 The only way I can think is if one vendor, or a factory, or a supplier, or 321 00:28:36,010 --> 00:28:41,049 whatever it was, was basically downloading, basically, warez onto industrial control 322 00:28:41,049 --> 00:28:47,419 boxes.
I wouldn't be surprised, but it would be pretty irresponsible. But it 323 00:28:47,419 --> 00:28:52,510 would be quite surprising to find MS-DOS malware today on industrial controllers 324 00:28:52,510 --> 00:28:57,110 that was installed recently and not just a lingering infection from the last 20 325 00:28:57,110 --> 00:29:00,029 years. Herald: Microphone 2. 326 00:29:00,029 --> 00:29:05,000 Microphone 2: Did you find any conditions that weren't date based? Ben: Some of them do 327 00:29:05,000 --> 00:29:09,610 attempt to, some of them try and circumvent the date recognition. Unfortunately it's 328 00:29:09,610 --> 00:29:12,809 very hard to brute force those. Some of them install themselves as what's called a 329 00:29:12,809 --> 00:29:19,710 TSR, or Terminate and Stay Resident, which basically means that they will exit out, 330 00:29:19,710 --> 00:29:23,750 run in the background and continuously ask the actual system timer what time it is. 331 00:29:23,750 --> 00:29:27,639 It's a bit of a more risky strategy because the system timer might not exist, 332 00:29:27,639 --> 00:29:31,650 which would be unfortunate for the virus. So definitely there are viruses that have 333 00:29:31,650 --> 00:29:38,340 way more complicated execution conditions. I observed one sample that only activated 334 00:29:38,340 --> 00:29:43,850 after, I believe it was something silly like 100 keypresses, which is very hard to 335 00:29:43,850 --> 00:29:49,770 automatically test. Those sorts of viruses require static analysis, and statically 336 00:29:49,770 --> 00:29:54,480 analyzing 17.000 samples is a time-consuming task. 337 00:29:54,480 --> 00:30:02,009 Herald: So we have a question from the Internet. Signal Angel: Do you have the source? What 338 00:30:02,009 --> 00:30:07,990 is the source of the malware that you analyzed here, is it published somewhere?
339 00:30:07,990 --> 00:30:13,400 Ben: You can still find dumps of VX Heavens, and more modern dumps of VX 340 00:30:13,400 --> 00:30:17,990 Heavens, on popular torrent websites. But I'm sure there are also copies 341 00:30:17,990 --> 00:30:21,399 floating about on non-popular torrent websites. 342 00:30:21,399 --> 00:30:24,810 *Laughter* Herald: Over to microphone 1. 343 00:30:24,810 --> 00:30:32,240 Microphone 1: Hi Ben. I'm Jope. Thank you for your talk. I was wondering: did you 344 00:30:32,240 --> 00:30:36,639 learn anything from your studies of these viruses that should be taught in modern 345 00:30:36,639 --> 00:30:42,820 day computer science classes, like a more efficient sorting algorithm or some hidden 346 00:30:42,820 --> 00:30:47,080 gem that actually should be part of computing these days? 347 00:30:47,080 --> 00:30:53,570 Ben: My primary takeaway was x86 was a mistake. 348 00:30:53,570 --> 00:31:01,320 *Laughter & applause* Herald: So I'm not seeing any more 349 00:31:01,320 --> 00:31:04,480 questions. Oh no, there is. OK, one more question from the internet. 350 00:31:04,480 --> 00:31:11,389 Signal angel: Have you found malware samples that did, like, try to detect dummy 351 00:31:11,389 --> 00:31:14,617 binaries or whatever, to avoid easy analysis? 352 00:31:14,617 --> 00:31:20,007 Ben: Oh actually, that's a really good question. So, it's complicated: 353 00:31:20,007 --> 00:31:24,580 So some viruses would... so, maybe let's be 354 00:31:25,027 --> 00:31:29,770 dangerous, let's try and go backwards on my home-written presentation software. So 355 00:31:29,770 --> 00:31:41,160 *humming* Too many slides. I have regrets. Yes. OK. Here we are. This slide. 356 00:31:41,160 --> 00:31:45,450 OK. So you know, here I'm saying that the malware infection goes to the end. Well, 357 00:31:45,450 --> 00:31:49,850 some samples are really cool. They don't change the size of the file.
They just 358 00:31:49,850 --> 00:31:54,590 find areas in the files that are full of null bytes and just say: this is probably 359 00:31:54,590 --> 00:32:00,230 fine, I'm just going to put myself here. Which may have unintended consequences. It 360 00:32:00,230 --> 00:32:04,960 may mean, if a program has like a statically typed, statically defined byte array of 361 00:32:04,960 --> 00:32:10,039 like a certain size and the program is relying on it being zeros when it accesses 362 00:32:10,039 --> 00:32:14,440 it for the first time, it may get very surprised to find some malware code in 363 00:32:14,440 --> 00:32:20,159 there. But generally speaking, as far as I'm aware, this deployment 364 00:32:20,159 --> 00:32:26,220 procedure works pretty well and actually is very good at avoiding antivirus of the 365 00:32:26,220 --> 00:32:30,390 era, which would just be checking like common system files and their size. And you 366 00:32:30,390 --> 00:32:35,059 know, if the size of COMMAND.COM increases, then that's clearly bad news. 367 00:32:35,059 --> 00:32:38,450 Herald: We have a question on microphone 1. 368 00:32:38,450 --> 00:32:45,620 Microphone 1: Are there any viruses that try to eliminate or manipulate virus 369 00:32:45,620 --> 00:32:48,970 scanners of the day? Ben: Oh yeah. So a lot of the samples will 370 00:32:48,970 --> 00:32:52,960 actively go and look for files of other anti-viruses. 371 00:32:52,960 --> 00:32:57,159 But I am generally under the impression that it's kind of hard to find them. There 372 00:32:57,159 --> 00:33:01,750 weren't actually that many antivirus products back in the day. 373 00:33:01,750 --> 00:33:06,410 I feel like it was a bit of a niche thing to be running. Microsoft did for a while ship 374 00:33:06,410 --> 00:33:14,330 their own antivirus with MS-DOS. So I guess, you know, what's new is old. So there 375 00:33:14,330 --> 00:33:17,860 were antiviruses out there. I don't think many of them were very effective.
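The size-preserving infection Ben describes above can be illustrated from the defender's side. The following is a minimal sketch, not taken from the talk's tooling: the helper name `find_null_runs`, the toy `host` image and the placeholder filler bytes are all assumptions made for illustration. It locates runs of null bytes in a file image and shows why a check that only compares file sizes would miss an overwrite of such a run, while a content hash would not.

```python
import hashlib


def find_null_runs(data: bytes, min_len: int) -> list[tuple[int, int]]:
    """Return (offset, length) for every run of zero bytes of at least min_len."""
    runs = []
    start = None
    for i, b in enumerate(data):
        if b == 0:
            if start is None:
                start = i          # a new run of zeros begins here
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i - start))
            start = None
    # Handle a run that extends to the end of the data.
    if start is not None and len(data) - start >= min_len:
        runs.append((start, len(data) - start))
    return runs


# Hypothetical 24-byte "host" image: 4 code bytes, 16 zero bytes, 4 data bytes.
host = b"\xB4\x4C\xCD\x21" + bytes(16) + b"DATA"

# Overwrite 8 of the zero bytes with harmless placeholder filler (0x90).
patched = host[:4] + b"\x90" * 8 + host[12:]

size_check_passes = len(host) == len(patched)
hash_check_passes = (
    hashlib.sha256(host).digest() == hashlib.sha256(patched).digest()
)

print(find_null_runs(host, 8))   # zero-filled regions in the original image
print(size_check_passes)         # the size-only check cannot tell them apart
print(hash_check_passes)         # the content hash can
```

A scanner that records a hash of each system file rather than only its size would notice the change, which is one plausible reason size-only checks of the era were easy to evade.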
376 00:33:17,860 --> 00:33:27,260 Herald: Any more questions? There, where? Oh right. Another one from the Internet. 377 00:33:27,260 --> 00:33:32,049 It's interesting that the internet is querying MS-DOS all the time. Go ahead. 378 00:33:32,049 --> 00:33:38,000 Signal angel: Did you do the diagrams by hand or do you have a tool? 379 00:33:38,000 --> 00:33:42,559 Ben: So many hours. No. So there's a couple of good tools to do it. 380 00:33:42,559 --> 00:33:46,429 asciiflow.org, I think, is a fantastic tool. I would highly recommend it. I think 381 00:33:46,429 --> 00:33:52,779 it's not maintained very well, though. Herald: Microphone 1. 382 00:33:52,779 --> 00:33:55,519 Microphone 1: Are you publishing the tools you wrote? 383 00:33:55,519 --> 00:34:02,429 Ben: I will be publishing the tools at some point when they are less... when they 384 00:34:02,429 --> 00:34:08,320 are less ugly. I will be publishing all of the automatic malware runs and the gifs 385 00:34:08,320 --> 00:34:12,929 generated by them so that people can easily search Google for the virus names 386 00:34:12,929 --> 00:34:16,890 and get like actual real time versions. The hardest thing that I found when 387 00:34:16,890 --> 00:34:21,710 looking at virus names was literally just finding any information about them, and one 388 00:34:21,710 --> 00:34:25,220 of the things I really wish existed at the time of writing this talk was being able 389 00:34:25,220 --> 00:34:29,580 to just query a name and be like: oh yeah, this virus, it looks like it does this. 390 00:34:29,580 --> 00:34:33,420 Herald: Since I saw microphone 1 first, let's go with that. 391 00:34:33,420 --> 00:34:40,260 Microphone 1: Did you find any viruses that had signage in them, not signage of 392 00:34:40,260 --> 00:34:43,520 today but the name of the author? Like he was very proud of what he wrote. 393 00:34:43,520 --> 00:34:47,450 Ben: Yeah, there are some notable examples.
Quite a few of them will try and 394 00:34:47,450 --> 00:34:52,870 name - so DOS viruses do like have [incomprehensible] sample names in the same way 395 00:34:52,870 --> 00:34:57,470 that we'd still today give viruses names. A lot of the time you will just encode a 396 00:34:57,470 --> 00:35:01,131 string that you want the virus to be named, you know, somewhere in the file, 397 00:35:01,131 --> 00:35:04,472 just a random string doing nothing. It's like: oh, ok, they clearly wanted the virus 398 00:35:04,472 --> 00:35:11,430 to be called Tempest. So that does happen. One of the favorite examples is the Brain 399 00:35:11,430 --> 00:35:16,750 malware, which literally encodes an address and phone number of the author, I believe 400 00:35:16,750 --> 00:35:22,720 in Pakistan, and there's a fantastic mini documentary by F-Secure where they go and 401 00:35:22,720 --> 00:35:25,850 visit the people who wrote it. It's a super interesting watch and I would really 402 00:35:25,850 --> 00:35:29,990 recommend it. Herald: Indeed it is. Microphone 2? 403 00:35:29,990 --> 00:35:36,260 Microphone 2: Did you have any chance to look at any kind of viruses that did not 404 00:35:36,260 --> 00:35:42,330 modify the files themselves? For example, one of the largest virus infections at the time was a 405 00:35:42,330 --> 00:35:46,080 virus called [incomprehensible] which modified the master boot record. 406 00:35:46,080 --> 00:35:51,060 Ben: Yes, master boot record, I did consider. It was more of a time problem 407 00:35:51,060 --> 00:35:55,320 that I had in getting to the point where you could brute force time and date 408 00:35:55,320 --> 00:36:01,020 combinations and look for master boot record changes. It was really hard. I am 409 00:36:01,020 --> 00:36:06,610 super interested in reviewing what are, in effect, the rootkits of the era. But yes, that's 410 00:36:06,610 --> 00:36:10,220 definitely something I will look into in the future.
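The embedded name strings Ben mentions a little earlier (a plain string such as "Tempest" sitting unused somewhere in the file) are the kind of thing a `strings`-style scan surfaces. The following is a minimal sketch under stated assumptions: the function name `printable_strings`, the minimum length of 4 and the sample bytes are all made up for illustration and are not from the talk's tooling.

```python
import re


def printable_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII (0x20-0x7e) at least min_len bytes long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


# Hypothetical sample: a few opcode-like bytes around an embedded name string.
sample = b"\xeb\x10\x90Tempest\x00\xcd\x21\xb4\x4c"

print(printable_strings(sample))
```

The minimum length matters: short printable runs occur by chance in machine code, so a threshold of four or more characters filters most of that noise at the cost of missing very short names.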
411 00:36:10,220 --> 00:36:14,410 Herald: And we have yet another question from the Internet. 412 00:36:14,410 --> 00:36:17,400 Signal angel: And it's even from the same guy. 413 00:36:17,400 --> 00:36:22,830 Ben: Oh damn. Signal angel: Is the BenX86 software open- 414 00:36:22,830 --> 00:36:25,530 source, or can it be found on the web somewhere? 415 00:36:25,530 --> 00:36:29,870 Ben: It probably will be. I wouldn't expect it to work in, well, in any use-case 416 00:36:29,870 --> 00:36:36,360 though. It's effectively designed to like not work correctly, right? Like, what 417 00:36:36,360 --> 00:36:40,880 was the spec? It basically like fails at every single awkward thing. I just went 418 00:36:40,880 --> 00:36:46,660 like: oh, that's fine. We're probably far enough down there anyway. Are we? Be aware, 419 00:36:46,660 --> 00:36:50,740 this is the feature list. Herald: So is that a follow-up question 420 00:36:50,740 --> 00:36:57,010 from the internet? Signal angel: No, it's a new one. I don't 421 00:36:57,010 --> 00:37:02,660 know how serious it is, but would it be possible or a good idea to use machine 422 00:37:02,660 --> 00:37:09,500 learning to create new DOS malware from the existing samples? 423 00:37:09,500 --> 00:37:17,021 *Laughter & applause* Ben: It would not be a good idea. But I 424 00:37:17,021 --> 00:37:24,230 like how you think. Herald: Actually I saw somebody trying to 425 00:37:24,230 --> 00:37:27,640 use NLP to generate viruses, but ok, that's enough for now. 426 00:37:27,640 --> 00:37:32,400 Ben: You could probably do Markov chains with x86, to be honest. Please don't do 427 00:37:32,400 --> 00:37:34,530 that, please! Herald: Don't try this at home. 428 00:37:34,530 --> 00:37:37,480 Ben: I have seen things I've seen. Just please don't do that. 429 00:37:37,480 --> 00:37:43,461 Herald: So I think we've run out of questions. Going once, going twice. Let's 430 00:37:43,461 --> 00:37:49,520 thank Ben for this marvelous retrospective talk.
*Big applause* 431 00:37:49,520 --> 00:37:58,785 *35C3 postroll music* 432 00:37:58,785 --> 00:38:12,000 subtitles created by c3subtitles.de in the year 2020. Join, and help us!