Hi everyone, thanks for coming. It's so good to see familiar faces again at this time of the year. Today I'm going to continue our annual tradition of stylometry talks. I'm from Princeton University; my name is Aylin, as he introduced me, and I'm currently a postdoctoral research associate.

We have been presenting at CCC for a few years now, and the talks have been mostly about stylometry. The first one was the alternative keynote on the first day, and then there was another stylometry talk. I'm going to keep the tradition alive and talk about stylometry and machine learning today.

So what happened since last year? Last year I talked about de-anonymizing programmers for about 15 minutes. This year, de-anonymization just became easier. And that's equivalent to saying that there are now more privacy concerns for programmers, and also for open source software developers.

Today we're going to talk about stylometry and machine learning, and at the same time we released our most recent paper on this. It's on arXiv and on my website. And if you want to read a summary of this talk or the paper, you can also check our blog, Freedom to Tinker.

Let's start talking about stylistic fingerprints. If you're not familiar with stylometry, I'll give you a brief introduction. Stylometry is the study of individual style. Most of the time it has been researched in writing style, but we can see stylometry in fine arts.
For example, artists can be identified by their brushstrokes, and in music, musicians can be identified by the tones or rhythms they're using. Three years ago, we presented that stylometry is also present in unconventional text, and by unconventional text I mean underground forums where sometimes cybercriminals, or a variety of people, engage with each other. We can identify them as well. And we have looked at translated text to see if translation can anonymize you. We saw that even when you take your English writing, translate it to German, then to Japanese, then back to English, we can still identify you. And that sounds like a serious concern for someone who would like to remain anonymous.

Now we have started investigating source code, because if you're going to investigate style in language, we can think of source code as another type of language: it's a programming language. Today I will show you the improvements in our source code authorship attribution method, and at the end of this talk we will see that style that's expressed in code can be quantified and characterized. That's kind of the answer to our research question.

So what happens with supervised stylometry? I say supervised stylometry because I'm going to talk about machine learning today. We can identify style in some type of personal data or writing by using machine learning methods, and I will give you a very common setting. Let's say you have a set of documents with known authors and some anonymous ones, and you would like to find out who these anonymous documents belong to. What you do is take a machine learning classifier and train it on the documents whose authorship is known.
Then you create a model for everyone, for the people with known-authorship documents. After that, you can use your machine learning classifier to test and see who an anonymous document was written by.

Let's think about a common scenario for this. There's Alice the anonymous blogger and Bob the abusive employer. Alice is blogging about abuses in Bob's company. And Bob, as he is abusive, is going to go ahead and collect the writing of everyone in his company, and then train a classifier so that he can identify who this anonymous blogger Alice is. Bob can do this by using stylometry and machine learning.

I will give some other motivating, or scary, examples. For example, there was this case with a person whose username on Twitter was theconnor. She tweeted: "Cisco just offered me a job! Now I have to weigh the utility of a fatty paycheck against the daily commute to San Jose and hating the work." Then Tim Levad, a channel partner advocate for Cisco, saw this, and that wasn't very good, because you can then identify who theconnor is, and her job offer might be in danger at that moment. So what if Cisco took all the cover letters that were submitted to Cisco, trained a classifier, and tried to find out who theconnor is by looking at her tweets? Because you can also identify people from their tweets. But that wasn't even necessary in this case: since you can find cached information online, she was identified as Connor Riley, and unfortunately she lost the job offer after this. So this is one example where this might have been applied.
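To make the supervised workflow above concrete, here is a minimal sketch in Python, assuming scikit-learn. The documents, author labels, and feature choice (character n-grams, a common stylometric feature) are placeholders for illustration, not the setup used in the talk:

```python
# Train on documents of known authorship, then attribute an anonymous one.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# Documents whose authorship is known (toy examples).
known_docs = [
    "We shipped the release on Friday, as promised.",
    "Per my last email, the deadline has moved again.",
    "honestly the commute to the office is brutal",
    "gonna grab coffee, anyone want some??",
]
known_authors = ["alice", "alice", "bob", "bob"]

# Character n-grams capture low-level stylistic habits.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
X_train = vectorizer.fit_transform(known_docs)

# One model over all candidate authors, one class per author.
clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_train, known_authors)

# Test: whose style does the anonymous document match best?
X_test = vectorizer.transform(["the office commute is honestly brutal"])
print(clf.predict(X_test))  # e.g. ['bob']
```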
So you need to understand that when we're talking about machine learning methods that make it possible to de-anonymize people, there might be some dangers associated with this, and you might want to be more aware of how you're sharing your information online, keeping in mind that you can always be re-identified.

And what happens with source code? For example, there was this recent tweet, and it says: "I just heard from an Apple recruiter that they disallow her from contributing to open source on her own time. That's illegal, right?" It's probably illegal, but Apple could probably find out if someone is contributing to open source code repositories, by looking at the code they have at Apple and then comparing any suspicious code to it, to maybe re-identify who this contributor is.

Because of that, we're going to talk about de-anonymizing programmers with code stylometry today. This has been joint work with my great collaborators, and some of them are here with us today.

Why do we want to do source code stylometry, and how can we even start doing this? First of all, we know that any language, and source code is a programming language, is learned on an individual basis. As a result, you develop a unique coding style, and that can potentially make you identifiable. So we want to investigate whether that's really possible: do we leave any fingerprints in source code that might make us identifiable? Why else would we do it? Maybe we want to gain some software engineering insights. For example, we might want to analyze how coding style changes over the years, or the differences between the coding styles of more advanced and less advanced programmers. Or does your coding style change when you're trying to implement more sophisticated functionality?
And the main goal, or the most motivating goal here, would be to identify malicious programmers, who are maybe trying to contribute malicious code or backdoors to open source software.

Let's think about a common scenario: analyzing a library. There is malicious code in the library, and Bob has a source code collection and he knows the authors. So Bob is going to search his collection with machine learning to find out who Alice, his adversary, is. And a second scenario, about plagiarism: you're a college student and you get an extension on your programming assignment. Bob, your professor, wants to know whether you have plagiarized or not. So he's going to train a classifier on all the submissions by the other students, and then he can check whether there are extreme similarities to a coding style that doesn't really match your former coding style.

These two examples are kind of security-enhancing examples, but source code stylometry could also be very privacy-infringing, so you have to be very careful when you want to use it. For example, maybe some of you remember this man from last year's talk. He was sentenced to death because he was identified as the website programmer of a porn site, and unfortunately the Iranian government found out about it. He managed to get out of this entire thing because he was also a resident of another country. But this is a dangerous case: under oppressive regimes, your code might put you in a dangerous situation.
So I'll start talking about the more technical stuff and show you how our work improves the state of the art and brings some novel contributions. First of all, this is a comparison to related work, and there is not much related work, as you can see. The main difference between the features previously used to represent coding style and ours is syntactic features: we use structural features to represent your coding style. We don't just use your function names, or variable names, or the spaces and tabs that you use. And as a classifier we use a random forest, and I'm going to show you why these choices make very big differences in the results.

In the past, the highest accuracy that had been reached in de-anonymizing programmers was 97 percent, on a much smaller set of programmers. But we can de-anonymize 250 programmers with 98 percent accuracy. So we are beating the highest accuracy from the past, and with a much more difficult machine learning problem, because we have a much larger data set of 250 programmers.

The largest dataset that had been used in the past was 46 programmers, at 75 percent accuracy. But after last year's talk, we were able to scale our approach to 1,600 programmers, and we get 94 percent accuracy in correctly identifying the 14,400 source code samples of these 1,600 programmers. This is large-scale authorship attribution; this is large-scale de-anonymization.

How do we do that? We have our general machine learning setup. First of all, we need data with ground truth, so we go to Google Code Jam. It's an international annual programming competition.
We collected a data set in C++, because that was the most commonly used language in this competition, and we had about 100,000 users from different years. So we have our data set. Now what we have to do is find the features and properties that are going to represent the coding styles of these people. We preprocess the data set, the source code, and we get the abstract syntax tree of the source code using a fuzzy AST parser. After that we extract the features that represent coding style, and then we feed these properties that represent individual style into a random forest machine learning classifier. And then we do our classification with the majority vote of all of these, like, 300 random forest trees.

Why did we use Code Jam? We have data from 2008 to 2014; now we actually have the 2015 data as well. And the most important point is that everyone in this competition is implementing a solution to the same programming tasks. They're implementing the same algorithmic functionality, so the only thing, or the most prevalent thing, that can differentiate the source code samples is the writing, the coding styles, of these programmers. At the same time, they have to implement this functionality in a very limited time, which means they don't have a chance to go back to the code, improve it, make it nicer, or copy-paste some stuff from Stack Overflow. And as contestants complete rounds, the problems get harder.
So we kind of have control over when someone is implementing more sophisticated functionality, and as a result we can infer which programmers are maybe more advanced. And as I said, C++ was the most common language, so we decided to go ahead with C++ in our experiments.

How do we represent personal coding style? First of all, we have a source code sample here; this is just five lines of code. From that, we can look at lexical features. For example, the variable names and function names, like 'foo', are lexical features, because you chose those names yourself. So those are lexical features that come from personal input. Then we have layout features, like the spaces and tabs and where you put the curly brackets, and things like that.

But the important thing that was able to represent coding style in a very strong manner in our experiments is that we use structural features. We can get those by converting the source code sample to an abstract syntax tree: then you get the grammar and structure of the source code, and from that you can extract a rich set of features. These are also much more difficult to change, as opposed to just renaming 'foo', because they are kind of embedded in the code. So we extract features such as edges, nodes, term frequency-inverse document frequency, or the average depth of a node, a statement node for example. And then we built our feature set to represent programming style.

Why did we use random forests? First of all, random forests by nature are multiclass classifiers, as opposed to, for example, the common support vector machine, which is a two-class classifier.
A random forest is more successful at classifying many classes, whereas a support vector machine classifies just two classes, making it a binary classification problem. Also, since random forests use decision trees and information gain during the training process, they avoid overfitting. We want to make sure that we are not overfitting to a bias in the dataset or to someone's very peculiar property.

What we do is take our data set, extract all the features, and do what's called cross-validation. That means, for example: for each programmer we have nine source code samples; we train on eight source code samples from each programmer, and then we test on the ninth one from each of them and see who it was written by. And then we validate our method on a different dataset to see whether the features we obtained really make sense.

Let's talk about the general cases, and here I will show how we were able to improve the method. There is this general case: who is this anonymous programmer? This is programmer authorship attribution, or de-anonymization. Maybe this could be applied to Satoshi Nakamoto, who was the founder of Bitcoin, and we don't really know who he is. So what happens here: we have 1,600 programmers, each with nine code samples. We do a ninefold cross-validation and extract features from their code samples. Once we train our classifier and test on the 14,400 samples, we get 94 percent accuracy.
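For illustration, a toy version of this pipeline is sketched below. The real system parses C++ with a fuzzy AST parser and uses a much richer feature set; as a stand-in here, Python's built-in ast module supplies the syntax trees, node-type counts serve as the structural features, and the corpus is a placeholder:

```python
# Toy pipeline: AST-based structural features + a 300-tree random forest,
# evaluated with cross-validation (ninefold per programmer in the talk).
import ast
from collections import Counter
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score

def ast_node_counts(source):
    # Count AST node types: a crude structural fingerprint of the code.
    return Counter(type(n).__name__ for n in ast.walk(ast.parse(source)))

# Placeholder corpus: a few solutions per programmer. In the real
# experiments there are nine Google Code Jam solutions per programmer.
corpus = [
    ("alice", "def solve(n):\n    return sum(i * i for i in range(n))\n"),
    ("alice", "def solve(n):\n    return [x ** 2 for x in range(n)]\n"),
    ("alice", "def solve(s):\n    return ''.join(reversed(s))\n"),
    ("bob", "def solve(n):\n    t = 0\n    for i in range(n):\n        t += i\n    return t\n"),
    ("bob", "def solve(n):\n    out = []\n    for i in range(n):\n        out.append(i)\n    return out\n"),
    ("bob", "def solve(s):\n    r = ''\n    for c in s:\n        r = c + r\n    return r\n"),
]
labels = [author for author, _ in corpus]
X = DictVectorizer().fit_transform(ast_node_counts(src) for _, src in corpus)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
print(cross_val_score(clf, X, labels, cv=3).mean())  # cv=9 with nine samples
```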
How could we do this with Satoshi? If we had a suspect set for Satoshi, we would take the suspects' previous code samples and train a classifier on them. After that, as test data, we would take Bitcoin's initial git commit, the first original Bitcoin code, and try to see who in the suspect set it was written by. Many people ask us: so who is Satoshi? The thing is, we have a suspect set, but unfortunately the main suspect in our set doesn't have any former code samples. So we are just leaving it like this.

What happens if someone tries to obfuscate code? Why do people obfuscate? They do it to make their code unrecognizable. Maybe they're plagiarizing, maybe it's malicious code, or maybe they're trying to be anonymous. But we are going to show that this is not going to make them anonymous: our authorship attribution technique is impervious to off-the-shelf source code obfuscators.

Here is one example. This is a commercial off-the-shelf obfuscator called Stunnix. It's available, you can go online and buy it, and it works for many languages. If you look here, it takes all the lexical features, the function names, variable names, all the comments, and also all the spaces, and refactors them. But it does not make any difference to the structure of the program. So all the spaces are stripped, everything is refactored, characters are replaced with hexadecimal ASCII representations, and the same goes for the numbers in the code. But since the structure of the program remains unchanged, we still get the same accuracy in de-anonymizing programmers.
That is, once we obfuscate the code with such a common off-the-shelf obfuscator, one that doesn't change the structure of the program. In this example we took 20 C++ programmers and obfuscated their code. We are able to de-anonymize them with 99 percent accuracy from their original code, and we are still able to de-anonymize them with 99 percent accuracy from their obfuscated code, because we have the structural features, which are very powerful.

But what happens if we try to use a more sophisticated obfuscator? Here our example is Tigress. It's a virtualizer, and it lets you apply many different kinds of obfuscation methods. So we have code, it's like 14 lines, and we obfuscate it. It becomes like 800 lines, and it's completely unreadable. If you're an open source software developer, I'm not sure the people in your project would be happy if you contributed code like this.
But this works better at anonymizing your coding style. We took C programmers for this experiment, again 20 programmers. We are able to de-anonymize them with 96 percent accuracy. But when we obfuscate with Tigress, which changes the structure of the program as well while the functionality remains the same, we get 67 percent accuracy. So there's almost a 30 percent drop in accuracy. But compare that to random chance, which is five percent, the chance of correctly identifying someone just randomly: 67 percent is a very high number, which shows that your code is certainly not completely anonymized once you apply this obfuscator. So that kind of answers it: obfuscation is not the solution to anonymization in source code.

What about coding style throughout the years? We wanted to see whether coding style is consistent, because if it is, we could take someone's code from ten years ago and then test on code from this year. For this, we took 25 programmers from 2012 and trained to classify them, and then our test data came from 2014. We were able to correctly identify them with 96 percent accuracy. When we did the same within 2014, the accuracy was 98 percent. So there's a two percent change in accuracy, which shows that coding style somehow persists throughout the years.

We also wanted to generalize our approach, and we wanted to do this very quickly, because we would like to see how feasible it is: is this a general approach that can be applied to other programming languages, and how easy would it be for someone else to use our code for a different programming language? For this we only used structural features, and we used the AST generator that comes with Python to generate the abstract syntax trees of Python source code. We were able to de-anonymize 229 programmers, just from these abstract syntax tree features, the structural features, with 54 percent accuracy. And there is also top-five relaxed classification, which means that the classifier returns probabilities, saying this is the most probable person, this is the second most probable person, and so on; if you look at the first five probabilities and the correct programmer is within that set, then that's considered correct.
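Here is a sketch of this relaxed, top-k evaluation, assuming a scikit-learn-style classifier with a predict_proba method (the helper below is illustrative, not the talk's actual code):

```python
# Relaxed (top-k) classification: a prediction counts as correct if the
# true programmer is among the k most probable classes.
import numpy as np

def topk_accuracy(clf, X_test, y_test, k=5):
    proba = clf.predict_proba(X_test)         # one probability per class
    topk = np.argsort(proba, axis=1)[:, -k:]  # indices of the k best classes
    hits = [y in clf.classes_[row] for y, row in zip(y_test, topk)]
    return float(np.mean(hits))

# Usage, given a fitted classifier: topk_accuracy(clf, X_test, y_test, k=5)
```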
In that case, we are able to increase the accuracy to 76 percent. And if you do this with 23 programmers, we get 88 percent accuracy, and with top-five relaxed classification we get close to 100 percent accuracy. Why would you do relaxed classification? Let's say you have a huge dataset and you're willing to do some manual analysis, but first you would like to reduce your suspect set size. You can do relaxed classification and then maybe look at the top ten manually to understand it better.

In our results, we see that we are bringing a new principled method, with a robust syntactic feature set, for de-anonymizing programmers. And this shows that there is a serious concern for anonymity when you're an open source software developer or just a programmer, because we will soon talk about executable binaries, where there is no source code. For future work, we are planning to look at multiple-authorship detection, for example in git repositories: can we find the multiple authors, and can we identify exactly which part was written by whom? We would also like to look into anonymizing source code, because we saw that obfuscation is not the answer to that.

Now, what about stylometry in executable binaries? Executable binaries are compiled code, and when you compile code, the coding style features still persist into the compiled version. So this is what happens: we have source code, it's like 20 lines, I don't have all of it here. Once you compile it, you get binary, zeros and ones, thousands of them. I don't think I can personally understand anything from this, and I don't think we can de-anonymize it by just looking at it like this.
So now I'm going to talk about the second part of this talk: what happens when you compile code and you try to de-anonymize programmers from their executable binaries. This is the paper that just went public today; if you want, you can look at it on my website. Why would we want to do that? First of all, the research question: does coding style exist in binary code? Is there a threat to privacy and anonymity, can programmers be de-anonymized from compiled code? And maybe, in the end, can we use this for malware family classification?

This is the approach used in related work. Since I've shown you our machine learning workflow, I think you're getting the idea of how machine learning works here: you have your dataset, you extract features that are going to represent a class, you feed them to the classifier, and then you test to see which class a sample belongs to. Related work took executable binaries and disassembled them with reverse engineering methods, and they obtained the control flow graphs as well. They extracted features from the assembly instructions and also from the control flow graphs. They used information gain methods to find the most prevalent stylistic features, and then they used a support vector machine. I would like to remind you that we don't use support vector machines that much for multiclass classification problems; that's why we use random forests. And then they de-anonymized programmers.

What we do is take our dataset, and we use the same dataset as them so that we can make a real comparison. We disassemble it with reverse engineering.
We also decompile it, and from the decompilation we can get a source code representation of the binary. To this we can again apply all the source code feature extraction methods, such as abstract syntax tree generation. We also get the control flow graphs, and then we run information gain on all of these to see which features belong to an author, instead of just being a random property of the code. And then we do classification to de-anonymize the programmer.

Some features for this would be, for example, our usual AST features, coming from the abstract syntax tree that's generated from the decompiled code: structural properties. These are things like node unigrams in the abstract syntax tree, AST bigrams, or edges. And we do similar things for the control flow graphs, where we get the unigrams and bigrams. But remember that since this is decompiled code, the abstract syntax tree and control flow graph are like 10 or 20 times larger than the original ones. So we get a lot of features, and they look very similar to each other, because they have been reverse engineered with the same tools. For example, when we have 100 programmers, we extract features from their 900 binary executable samples and we get about 200,000 features. Once we have all of our features and run information gain on them, we see that only 426 of them, in this particular dataset, represent coding style. So we focus on these 426 features.
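Here is a sketch of that selection step, assuming scikit-learn's mutual_info_classif as the information-gain measure; the data and shapes below are random placeholders, not the real binary features:

```python
# Rank candidate features by information gain (mutual information with
# the author label) and keep only those that carry any signal.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.random((900, 500))     # 900 binaries x candidate features (toy size)
y = rng.integers(0, 100, 900)  # author label per binary (100 programmers)

mi = mutual_info_classif(X, y, random_state=0)
keep = np.flatnonzero(mi > 0)  # in the talk: 426 of ~200,000 survive
print(X.shape, "->", X[:, keep].shape)
```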
What happens when we try to de-anonymize 100 programmers? First we wanted to see how much training data we need: how many binary samples do I need to accurately de-anonymize a programmer? With one binary sample and 100 programmers, you can still re-identify them with 20 percent accuracy. Once we used eight training samples, we got 78 percent accuracy in de-anonymizing 100 programmers. It seems like with more binary samples we would be able to increase the accuracy further, but our dataset wasn't really letting us do this with 100 programmers.

Then there is relaxed classification, which I mentioned towards the end of the first part. With 100 programmers, when we relax the classification to a set of size ten, we get 95 percent accuracy in reducing our suspect set size. So let's say you start with 100 programmers and their binaries and you want to focus on ten of them: with 95 percent accuracy, the correct programmer will be within that set of ten people.

We also wanted to see what happens with a smaller dataset. Here, with relaxed classification, we get close to 100 percent accuracy after relaxing to a suspect set size of four, and we get a full 100 percent accuracy at suspect set size eight.

And we wanted to see what happens when we use just one training sample per programmer. So, with 20 programmers: if you have just one sample from each of 20 programmers, you train on 20 files to generate the 20 classes. Then, once samples are given to you, you can correctly identify them with 75 percent accuracy. And that's kind of scary.
So if you have just one binary out there that is known to belong to you, and you're in a suspect set of size 20, there is a 75 percent chance that your anonymous binary will be identified as belonging to you.

We wanted to scale this up, so we went from 100 programmers to 600 programmers. In this case we see that the accuracy gradually decreases, and with 600 programmers we get 52 percent accuracy. Here I would like to mention that in the previous part we had 1,600 programmers, but since we are using the same dataset, we had to compile the source code so that we could use it in a controlled setting with the same compilation options. And we couldn't obtain 1,600 binaries after that compilation: some code just didn't compile, and some programmers did not have enough code samples, so we had to leave all of those programmers out.

There hasn't been much work done in this area. There is one major paper, published by Rosenblum, and it's a great paper; it's the previous workflow that I showed you at the beginning of this second section. For example, with 20 programmers they get 77 percent accuracy, and they use more training samples than us. It's not completely clear, but the smallest number of training samples they use is eight, and it goes up to 16. And we know that when we use more samples, we are going to get higher accuracy. In our case, with 100 programmers, we get 78 percent accuracy. When we look at their 100-author dataset, we see that they get 61 percent accuracy, while we can get 78 percent accuracy.
And again, they are using more training samples. In the end, we are able to scale our approach to 600 programmers, while their largest dataset is almost 200 programmers, and their 200-programmer dataset gets the same accuracy as our 600-programmer dataset, which was a more difficult machine learning problem. So this is a great improvement in accuracy.

What happens if we optimize code? Is that kind of like the equivalent of obfuscation for binaries: is it going to anonymize the code? For that, for the first time in the literature, we wanted to try compiler optimization and stripping the symbols. We saw that with 100 programmers, without any optimizations, simply compiling the code, we can de-anonymize the 100 programmers with 78 percent accuracy. But after compilation, once we stripped the symbols from the binaries, we got 66 percent accuracy. And once we start applying more optimizations, like the common level-one optimization and the level-two optimization, which is cumulative, meaning it includes the previous optimization levels as well, and makes the program more efficient and maybe faster and smaller, we see that the accuracy is not decreasing in a tragic way. With the highest optimization level that we tried, level three, we get 60 percent accuracy in correctly de-anonymizing these 100 programmers. So compiler optimization is not a solution for anonymizing binaries.
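The binary variants discussed here can be reproduced for experimentation with a sketch along these lines; it assumes gcc and strip are on the PATH, and "hello.c" is a hypothetical input file, not part of the talk's dataset:

```python
# Build the same C source at optimization levels -O0 through -O3, and also
# produce a symbol-stripped copy of each binary, mirroring the talk's
# stripping experiment.
import subprocess

SOURCE = "hello.c"  # hypothetical source file

for level in ["0", "1", "2", "3"]:
    out = f"hello_O{level}"
    subprocess.run(["gcc", f"-O{level}", SOURCE, "-o", out], check=True)
    # Write a stripped copy (symbol table removed) alongside the original.
    subprocess.run(["strip", "-o", f"{out}_stripped", out], check=True)
```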
We also wanted to see how we can find the features that remain in binaries and represent your coding style, because binaries are so cryptic that it's difficult to tell what's going on and what kind of transformations happen after compilation. For this, we came up with a machine learning setting where we have the same code samples, and we have numeric representations of these code samples for both the original code and the compiled code. What we tried to do was: by taking the decompiled code, can we predict the features in the original code, which has not been compiled at all? Once we did that, we generated predictions for the feature values in the original code, and we wanted to see how similar they are. There is no very simple way to make a direct comparison between these predictions, so we looked at cosine similarity, and we saw that the new predictions were 81 percent similar to the original code feature set.

We also did one more experiment: we took the original code features and the decompiled code features and looked at the similarity between those two directly, and the cosine similarity was 0.35, so about 35 percent similarity, which is much less. This kind of shows that coding style properties certainly get transformed in compiled code, but the transformation is not wiping away all coding style features. Somehow they still remain embedded in the binary, and that might be a concern for de-anonymization and for remaining anonymous.
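A minimal sketch of the cosine-similarity comparison being described, with toy placeholder vectors standing in for the talk's numeric feature representations:

```python
# Compare (a) features predicted from decompiled code against the original
# source features, and (b) raw decompiled-code features against the original.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

original   = np.array([4.0, 1.0, 0.0, 2.0, 7.0])  # features of original source
predicted  = np.array([3.5, 1.2, 0.3, 2.1, 6.4])  # predicted from decompiled code
decompiled = np.array([1.0, 0.1, 3.0, 0.2, 2.0])  # raw decompiled-code features

print(cosine_similarity(original, predicted))   # high on this toy data
print(cosine_similarity(original, decompiled))  # noticeably lower
```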
We also wanted to see if we could gain any insights, so we looked at the differences between the binaries of more advanced programmers, the ones who were able to advance to more difficult rounds. And we saw that even in the binaries, you can tell when a programmer is more advanced, as opposed to other programmers with a smaller skill set. To do this, we generated two subsets of the dataset. The first one was people who were only able to complete the same seven problems, and the second one was people who were able to complete 14 problems, including the seven in the first subset. We used only the seven samples that were shared between these subsets and saw how well we could de-anonymize these programmers. For the more advanced programmers, we got 88 percent accuracy in correctly de-anonymizing their binaries, and for the less advanced programmers, we got 80 percent accuracy. So somehow there is more coding style present in the source code of the more advanced programmers, and it gets transferred better through compilation.

To validate this, we tried the same setting with six problems: one subset that was only able to complete six problems, and one subset that was able to complete 12 problems, including the six in the first subset. And again, we see the same result. The accuracy is lower overall because there are fewer training samples, so our machine learning model might be less accurate than when using more samples. We get 87 percent accuracy in correctly identifying the more advanced programmers, whereas we get 78 percent accuracy in de-anonymizing the less advanced programmers.
So we have been working on Google Code Jam, which is a very controlled environment for running these experiments, for the reasons I explained to you, and we get the question from many people: is the de-anonymization so successful because of your Google Code Jam dataset? We wanted to see what the difference would be if we tried de-anonymization in the wild, so we tried to collect a dataset from GitHub. For this, we went to GitHub and found single-author repositories. These repositories had at least 500 lines of code, they had to have at least 10 stars, and their owners had to have several repositories on GitHub. So we had some requirements, and after all of our restrictions we ended up with 439 programmers; you can refer to our paper for the details of these datasets. We had 1,117 repositories, but unfortunately GitHub code is sometimes very difficult to compile, so after compiling these we ended up with 1,250 binaries.

On this dataset, we are able to de-anonymize the programmers with 62 percent accuracy. And when we generated the exact same kind of dataset from Google Code Jam, using the exact same number of binaries for each programmer, we were getting 68 percent accuracy. So this is a very promising result for running programmer de-anonymization in the wild.
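A minimal sketch of the dataset filter just described; the repository records and field names here are hypothetical, since the talk only fixes the criteria themselves (single author, at least 500 lines of code, at least 10 stars, several repositories per programmer):

```python
# Keep only repositories that satisfy the talk's stated selection criteria.
def keep_repo(repo, min_loc=500, min_stars=10, min_repos_per_author=2):
    return (
        repo["n_authors"] == 1                       # single-author repo
        and repo["lines_of_code"] >= min_loc         # enough code to learn from
        and repo["stars"] >= min_stars               # some community vetting
        and repo["author_repo_count"] >= min_repos_per_author
    )

repos = [
    {"n_authors": 1, "lines_of_code": 812, "stars": 23, "author_repo_count": 4},
    {"n_authors": 3, "lines_of_code": 9000, "stars": 150, "author_repo_count": 7},
]
print([keep_repo(r) for r in repos])  # [True, False]
```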
For future work, we would again like to look at anonymizing executable binaries, because what we are showing here is a privacy problem: we are able to de-anonymize programmers with very high accuracy in a very simple machine learning setting, and we can do this at a large scale. And for the executable binaries, we showed that compiler optimizations are not the solution to anonymization. We would also like to look at de-anonymizing collaborative binaries that have been written by multiple people. And we would really like to find out if we can extend this to malware family classification, but for that problem we need a dataset with some ground truth. So if anyone in the audience is interested in that and has a ground-truth dataset that we can work with, please come and talk to me at the end; that would be amazing for us.

We also have some tools available from these projects. You can find all of them online, and you can send us emails if you want to run things in different settings; we have announced these in our previous talks as well. The source code programmer de-anonymization tool is on my GitHub account, so you can just Google for it, find it, and run it. We also have an authorship attribution framework, JStylo, where you can take documents with known authors and then try to identify an anonymous document. It gives you many machine learning options and different ways to generate features, so that you get hands-on experience running this machine learning setting and seeing how de-anonymization works. And on top of JStylo we have built Anonymouth, which uses JStylo as the back-end engine; it's a framework that will help you anonymize your writing style.
So once you give it a suspect set, and you're also in the suspect set, and you want to make sure that within that suspect set you are anonymized, you can use Anonymouth: it will identify all the authors in the suspect set, and then it will give you certain recommendations and suggestions so that you can make your writing more anonymous. This might be a very helpful tool for people in oppressive regimes, for example, who really want to make sure that they can write anonymously so that they don't get in trouble. Because, like in the example of Alice and the abusive employer Bob, even if you are writing or blogging through Tor and you think that you're anonymous, your writing style would make you identifiable.

And I would like to thank all my collaborators for this great work; without them it wouldn't have been possible.

If you like, I have some backup slides, but if you think that you have a lot of questions, we can just go to the Q&A now. Thanks for coming, and thank you.

Thank you very much. We have about 20 minutes for Q&A, so there's plenty of room for questions. The first two questions go to the Internet, and the people who feel that they have to leave, please do so quietly.

OK, the first question is whether your technique also works on shell commands, so that I can see who actually did something on my system.

Can you repeat the question? It also works on what? Can you speak a little louder?

Oh, I'm sorry. Does it also work on commands?
So if you take a look at a session of someone logged in to a computer, can you analyze it with this?

I'm not sure I understand the question. Are you asking about sessions?

No: if you have a lot of unique command lines that you see someone has entered somewhere, can you also try to find who is the author of these command lines?

Oh, just command lines. As long as you find the correct features for that, I believe you can do it, but this current research does not exactly apply to that. Since we can do this on various kinds of textual data, and we have shown that we can do it on many other things besides those mentioned in this presentation, I believe it might be possible to do that.

OK, next question from the Internet: do you also look at comments, like the style of writing comments or the position of comments, or did you scrub these datasets to make sure that there was no personally identifiable information left that would just bias the classifier?

We removed the comments, so we don't have comments in this. But we also looked at it with comments, and that usually just increases the de-anonymization accuracy.

Before moving to questions from the room, please do remember that questions are short sentences ending with a question mark. First question over here.

What do you think the chances are of having something like a compiler switch to automatically anonymize code by doing structural refactoring, something like your Anonymouth tool?

That's a very good point.
It's kind of a similar thing to Anonymouth, because you're just trying to convert the features that make you identifiable. That might be a good experiment to try in the future, so I will keep it in mind when we are starting our anonymization work.

Next question, over there.

Thank you for the talk. Let's say I'm in Iran and I wrote code for a porn website, and obviously complete anonymization is not possible. Can I maximize my chances of not being identified and executed?

So, first of all, in the Saeed Malekpour case, for example, he was identified because his name was in the code, so that was a direct identification. But as you suggest, if you are very careful about being anonymous, maybe you can try to follow very strict conventions and make sure that everyone in that project is also following the same conventions, so that all of you look very similar and you cannot be identified, you cannot be distinguished from each other. So that might be one solution, or it might at least help you.

OK, next question up there.

My question is: I assume, for all of those numbers, you had a set of people that you had trained into the system, and you were giving it one sample from a person that you knew was in that set. Have you ever tried giving it a completely unrelated code sample? Is the software able to tell that it's from someone who is not part of the reference set? And if so, with what probability?
So if you look at this slide: this is kind of a verification problem in machine learning, and this is a one-class/two-class kind of classifier that might help you verify whether the anonymous code sample comes from someone in your suspect set, that is, in the set of programmers that you trained the classifier on. For this setting, what we did was: we have Mallory, and Mallory claims that this code has been written by her, and we want to find out if it was really written by her. So we have two classes. The first one has the samples from Mallory, and the second one has samples from random people, which represents the outside world: anyone but Mallory. In this case, we can take the code that Mallory is claiming to have written and see if it was really written by her: is it going to be attributed to the outside world, which is not Mallory, or is it really going to be attributed to Mallory? For this case, we get 91 percent accuracy across repetitions of such a setting. If you want more details about the probabilities and how we can threshold the verification, look at our first paper; we have the details there, or we can chat later.

So, but you haven't tried it when you don't know who it could have been? If you say, OK, we have a suspect list of 50 people, you haven't tried seeing if it would be possible to determine that it's not one of those fifty people?

So with verification, instead of this two-class setting, let's think about the 50-class setting with 50 programmers. If a sample is attributed to one programmer with a probability below a certain threshold, then you might be able to say: this looks like this person, but it's sketchy; maybe it's not this person, because this is not a very confident classification.
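A minimal sketch of that confidence-threshold idea, assuming a random forest over toy features; the threshold value is illustrative, not the paper's:

```python
# Attribute a sample only when the classifier's top-class probability clears
# a threshold; otherwise report "possibly none of the suspects".
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def attribute_or_reject(clf, x, threshold=0.5):
    proba = clf.predict_proba(x.reshape(1, -1))[0]
    best = int(np.argmax(proba))
    if proba[best] < threshold:
        return None, proba[best]   # low confidence: maybe not in the suspect set
    return clf.classes_[best], proba[best]

# Toy suspect set: 50 programmers, five samples each.
rng = np.random.default_rng(1)
X = rng.normal(size=(250, 40))
y = np.repeat(np.arange(50), 5)
clf = RandomForestClassifier(n_estimators=300, random_state=1).fit(X, y)

print(attribute_or_reject(clf, rng.normal(size=40)))  # likely rejected on random data
```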
Does that answer your question?

Oh, I think I'll just talk to you in person.

OK, let's go to the next question over here.

What about coding style guides? Most languages have a default coding style guide, and you have code formatters you can run automatically. Would that help?

Oh, yes; I think the previous question was similar to that. If you follow strict conventions, and everyone does that, that should normalize your writing style, or coding style, to some degree. But we saw with the compilation case that compilation is also kind of a normalization of your coding style, because it converts everything according to a set of rules and it all becomes very similar, and even in that case we are able to de-anonymize programmers, just with lower accuracy. So that might help you be more anonymous, but I'm not sure it would be the exact solution, and I have no numbers on how much it would change. The problem with that experimental setting is that we don't have such a dataset to test this. But if someone from industry has a dataset where the programmers follow strict conventions, we should be able to get more answers to this question.

I think we may have a few more questions.

Yes. Can you use your technique to actually forge code, to make it look like it was written by a particular programmer?
Uh, so, yes, this is possible with machine learning. For example, take the Anonymouth case, where you anonymize your writing style rather than coding style: you are within a suspect set and you try to anonymize yourself in that suspect set, and what you do is try to bring in features that do not represent your style. But if you instead bring in features that belong to someone else specifically, and you can see what those features are, then your style would become more similar to that person's. So once you have this kind of framework for source code, that should be possible.

Next question up here.

Hey, you said there's a difference between advanced and less advanced programmers. So my question is: when you have a given set of binaries, can you tell which ones were written by advanced programmers?

That's a very good question. We haven't tried that, but I think you could come up with an experimental setting to test it; we just haven't done that yet.

OK, thank you.

Hi, two questions. First question: have you considered using skip-grams, where, like with skip-n-grams with n greater than 3, you get more support? Second question: how about unsupervised learning?

Oh, yes, I'll answer both of them. For the skip-grams, or longer n-grams: you can do that, and we tried it. It just takes a very long time to extract those features and then try to classify with them. And we saw that when you add more n-grams or skip-grams, or a variety of powerful features, it doesn't really help the source code authorship attribution accuracy much, because it's already so high.
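For flavor, a minimal sketch of token-level n-grams and skip-grams as raised in this question; the talk's actual feature set is richer (lexical, layout, and AST-derived), so this only illustrates why extraction cost grows quickly:

```python
# Contiguous n-grams and windowed skip-grams over a token stream.
from itertools import combinations

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skipgrams(tokens, n, max_skip):
    """All n-token subsequences within a window of n + max_skip tokens."""
    grams = set()
    window = n + max_skip
    for i in range(len(tokens) - n + 1):
        for idx in combinations(range(i, min(i + window, len(tokens))), n):
            grams.add(tuple(tokens[j] for j in idx))
    return grams

tokens = ["for", "(", "int", "i", "=", "0", ";", "i", "<", "n", ";", "++i", ")"]
print(ngrams(tokens, 2)[:3])
print(len(skipgrams(tokens, 2, max_skip=2)))  # feature count grows fast with the window
```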
We would also like to avoid overfitting to the dataset by generating very detailed features that might bias the classifier. At the same time, we're very lucky to have an Amazon research grant that lets us run our experiments on EC2, which makes things much faster. But other than that, I wouldn't want to extract thousands and maybe millions of features if you're going to go into such detail. And for the second question: yes, you can do unsupervised learning. You can just try to cluster these samples based on certain properties found in the code, and these properties could also be coding style properties. That's possible, but we haven't gone into that setting.

Thanks.

Yeah.
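A minimal sketch of that unsupervised direction, assuming toy stylistic feature vectors and an arbitrary cluster count; it is only the shape of the idea, not an experiment from the talk:

```python
# Cluster code samples by stylistic feature vectors instead of training a
# supervised classifier on known authors.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
style_features = rng.normal(size=(60, 30))   # 60 code samples, 30 style features

kmeans = KMeans(n_clusters=5, n_init=10, random_state=2).fit(style_features)
print(kmeans.labels_[:10])  # samples grouped by stylistic similarity
```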
Hello. Have you thought about researching the use of metadata of code to classify it? So, for example, Git commit messages, or workflow, or even the set of used libraries, or something like that. And, of course, how to protect against that?

So in the beginning, I started this project by using exactly the kind of dataset you describe, with all the Git commit messages, and it's possible, yes: you can de-anonymize those programmers, and it probably makes the classifier more powerful. But then, with GitHub, it kind of becomes a multi-author problem. So it relates to our future work, where we would like to look at multi-author source code authorship attribution, and the same for binaries. We are working on that currently.

Thank you. Yep.

Using your approach, do you think it's possible to abstract these models even more, so that it becomes possible to train on datasets constructed from one programming language and then do classification on samples from another programming language? Did you do experiments in that direction? Thank you.

That's a very good question. We haven't done experiments on that. Again, it's difficult to find ground truth data; for instance, even in Google Code Jam some programmers write in multiple languages, but that's a very small set. So we haven't looked at it yet. But if you compare, for example, C and C++, yes, you can do cross training and testing, because the nature of the two languages is very similar when you are coding.

Next question goes to the Internet.

Yes. Would that also work on assembly? Because there you do not really have something like variable names.

So with assembly, I will go back to the slide where we... oops, I just closed it, just a second. We have a bunch of important features that come from assembly, and it's very difficult to understand what they actually mean because... oh, it's closed. Oh, I see. OK, here we go. With the assembly features: we get assembly features from two different disassemblers, to make the representation stronger and richer. We don't get that many, but we have close to a hundred and something. And it's very difficult to tell what those features exactly mean, because some of them are just n-grams, and it looks very much like overfitting.
But since we show in our reconstruction experiments that coding style somehow survives, it is preserved in there somewhere. Does that answer the question?

Another question from online: do you think it would be possible to combine your research with, for example, social graph analysis to find cooperation between programmers or groups?

I guess you can do that. You can just add that as an extra machine learning feature and improve your classifier. But in this research we are particularly interested in finding coding style and quantifying coding style, so we wanted to exclude any other, unrelated information, even though it might make people more identifiable.

Did you do any research in the direction of different compilers for the same language? So, for example, is GCC better than LLVM? Or can it help anonymize my code if I use different compilers for different binaries or projects?

Mixing and matching might help a little. We haven't investigated that question in particular, because when we look at related work, we see that compiler detection is kind of a small problem. So as long as we know the compiler, because it can be detected, we can come up with a setting where the source code has been compiled with that setting, or try to get rid of the properties that that compiler would bring in. But as you said, if you start mixing and matching, that might help you anonymize a little, and it might be a good way to start our anonymization experiments.

Thank you. Thanks.

Hi, I have a question regarding the real-world application you talked about. What is the statistical probability that this is just based on pure luck?
I mean, you have something like 60 or 70 percent de-anonymization accuracy for those programmers. How high is the probability that this is just random?

Yes, it's difficult to talk about statistical significance in this case, because we have a smaller dataset. But at least we can compare it to the Google Code Jam dataset, where we know there is statistical significance; that's why we think it might be possible. And it's in our future work to apply this to larger real-world datasets where we can talk about statistical significance for sure. That's a very good question that we have been thinking about.

Thank you.

Maybe the last question goes to the Internet.

Oh, well, yes. So the question is: which variables were the most important in the random forest analysis?

Let me open that slide; we should have a slide for it. The most important ones are bigrams, but what do you mean by variables? Are you talking about features? So, for source code authorship attribution, the most important features for the random forest are word unigrams, first of all. Word unigrams are things like your function names, or your choices of integer versus double, things that come from your source code. And then we have the bigrams that come from the abstract syntax tree: even though their share of the selected feature set was smaller than that of the word unigrams, their information gain is almost equivalent to the entire information gain of that set. So in that sense, the AST bigrams are the most important.

OK, thank you very much. Time's up.
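A minimal sketch of reading per-feature importances out of a random forest, in the spirit of this final answer; the feature names here are placeholders, not the paper's actual feature set:

```python
# Fit a random forest on toy data and rank features by importance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 2 * X[:, 3] > 0).astype(int)   # features 0 and 3 are informative

names = ["word_unigram_a", "word_unigram_b", "layout_depth",
         "ast_bigram_a", "ast_bigram_b", "branching_factor"]
clf = RandomForestClassifier(n_estimators=300, random_state=3).fit(X, y)

for name, score in sorted(zip(names, clf.feature_importances_),
                          key=lambda pair: -pair[1]):
    print(f"{name:>18}: {score:.3f}")   # the two informative features rank highest
```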