0 00:00:00,000 --> 00:00:30,000 Dear viewer, these subtitles were generated by a machine via the service Trint and therefore are (very) buggy. If you are capable, please help us to create good quality subtitles: https://c3subtitles.de/talk/50 Thanks! 1 00:00:09,330 --> 00:00:11,789 Now we have Agnes Meyder 2 00:00:11,790 --> 00:00:13,949 here, mel. She's going 3 00:00:13,950 --> 00:00:15,329 to talk to us a little bit about parallel 4 00:00:15,330 --> 00:00:17,519 programming, because we know 5 00:00:17,520 --> 00:00:19,679 that serial programming is totally last 6 00:00:19,680 --> 00:00:21,869 decade. Rock 'em parallel 7 00:00:21,870 --> 00:00:22,870 graphics cards. 8 00:00:32,360 --> 00:00:33,769 Thanks for having me, and I would 9 00:00:33,770 --> 00:00:35,839 especially like to welcome our 10 00:00:35,840 --> 00:00:37,909 stream watchers, especially the ones in 11 00:00:37,910 --> 00:00:38,960 China, in Beijing. 12 00:00:40,010 --> 00:00:41,119 But they are probably watching it 13 00:00:41,120 --> 00:00:42,120 tomorrow. 14 00:00:43,640 --> 00:00:45,979 Hey, come on. I don't know if it's 15 00:00:45,980 --> 00:00:47,569 just in the wrong position. 16 00:00:47,570 --> 00:00:48,889 Probably so. 17 00:00:48,890 --> 00:00:50,439 Should I put it on my online 18 00:00:53,000 --> 00:00:54,000 at. 19 00:00:58,810 --> 00:01:00,219 Who better? 20 00:01:06,240 --> 00:01:08,219 So we are going to start again, and now 21 00:01:08,220 --> 00:01:10,409 apparently I hear more of myself, 22 00:01:10,410 --> 00:01:11,759 which is perfect. 23 00:01:11,760 --> 00:01:13,919 And again, I would like to greet the 24 00:01:13,920 --> 00:01:15,959 Chinese peace mission in Beijing. 25 00:01:17,820 --> 00:01:20,039 And I'm very happy to see you here.
26 00:01:20,040 --> 00:01:21,989 I'm going to give you a short overview 27 00:01:21,990 --> 00:01:24,269 about what's happened until now 28 00:01:24,270 --> 00:01:26,429 in parallelism and what 29 00:01:26,430 --> 00:01:29,429 is going to happen in the next few years. 30 00:01:29,430 --> 00:01:31,589 And I hope you will get a 31 00:01:31,590 --> 00:01:33,689 firm reality check if it's worth the 32 00:01:33,690 --> 00:01:35,609 work, actually, or if you actually have 33 00:01:35,610 --> 00:01:37,919 the problem for investing some time 34 00:01:37,920 --> 00:01:40,259 and energy into changing your code 35 00:01:40,260 --> 00:01:42,389 to fit 36 00:01:42,390 --> 00:01:45,159 the needs of parallel programming. 37 00:01:45,160 --> 00:01:47,229 And because the battery ran flat on my 38 00:01:47,230 --> 00:01:50,009 presenter, I do some dancing around. 39 00:01:50,010 --> 00:01:52,019 First, we do a little bit of motivation. 40 00:01:52,020 --> 00:01:54,929 Then I will explain 41 00:01:54,930 --> 00:01:57,059 some basic words 42 00:01:57,060 --> 00:01:59,069 about parallelism so that we know we are 43 00:01:59,070 --> 00:02:01,199 on the same page when we talk about 44 00:02:01,200 --> 00:02:03,329 the old standards, old 45 00:02:03,330 --> 00:02:05,579 in means of single- and multicore 46 00:02:05,580 --> 00:02:06,599 programming. 47 00:02:06,600 --> 00:02:08,939 And then I move over to accelerator cards 48 00:02:08,940 --> 00:02:11,069 and hopefully you will have a new picture 49 00:02:11,070 --> 00:02:12,139 of the whole story. 50 00:02:13,770 --> 00:02:15,569 In the end, there will be ten minutes Q&A 51 00:02:15,570 --> 00:02:17,759 hopefully, and my help will make sure 52 00:02:17,760 --> 00:02:20,729 that I will get to these ten minutes. 53 00:02:20,730 --> 00:02:22,199 So first, motivation. 54 00:02:22,200 --> 00:02:24,779 Why should we bother doing parallel 55 00:02:24,780 --> 00:02:27,029 work besides playing computer 56 00:02:27,030 --> 00:02:28,030 games?
57 00:02:28,620 --> 00:02:30,149 Well, I'm in research. 58 00:02:30,150 --> 00:02:32,429 So what we tend to do is 59 00:02:32,430 --> 00:02:35,609 probably analyze the so-called big data. 60 00:02:35,610 --> 00:02:37,739 I'm not a huge fan of big data 61 00:02:37,740 --> 00:02:39,929 because it's basically a buzzword 62 00:02:39,930 --> 00:02:42,239 for pattern recognition in large 63 00:02:42,240 --> 00:02:43,169 datasets. 64 00:02:43,170 --> 00:02:44,789 And some people think it's the new 65 00:02:44,790 --> 00:02:46,769 crystal ball. 66 00:02:46,770 --> 00:02:49,079 Yeah, OK, you could do big data 67 00:02:49,080 --> 00:02:51,509 with huge parallelism. 68 00:02:51,510 --> 00:02:53,039 You could also do something, in my 69 00:02:53,040 --> 00:02:54,029 opinion, more meaningful. 70 00:02:54,030 --> 00:02:55,499 You could do physics. 71 00:02:55,500 --> 00:02:57,389 In this example, this is astrophysics, 72 00:02:57,390 --> 00:02:59,249 which always creates wonderful pictures, 73 00:03:00,930 --> 00:03:03,539 but they do some Higgs boson analyzing 74 00:03:03,540 --> 00:03:05,429 at CERN also, so they have huge 75 00:03:05,430 --> 00:03:07,499 machines working on the 76 00:03:07,500 --> 00:03:09,089 large amount of data. 77 00:03:09,090 --> 00:03:11,429 They have kids. 78 00:03:11,430 --> 00:03:13,649 But I'm not talking about... you 79 00:03:13,650 --> 00:03:15,359 can also do some weather calculations 80 00:03:15,360 --> 00:03:17,309 with it, but I'm not talking about them 81 00:03:17,310 --> 00:03:18,989 because here at the University of Hamburg, 82 00:03:18,990 --> 00:03:21,059 they don't like accelerator cards because 83 00:03:21,060 --> 00:03:23,429 they're very energy intensive. 84 00:03:23,430 --> 00:03:25,110 So they are against them. 85 00:03:26,700 --> 00:03:29,129 But what I do in my work is actually, 86 00:03:29,130 --> 00:03:30,719 we are trying to simulate protein 87 00:03:30,720 --> 00:03:31,679 flexibility.
88 00:03:31,680 --> 00:03:33,149 I'm a bioinformatician. 89 00:03:33,150 --> 00:03:34,799 I'm a student at the university 90 00:03:34,800 --> 00:03:37,019 where I work, and we try to figure out 91 00:03:37,020 --> 00:03:39,419 how molecules move. 92 00:03:39,420 --> 00:03:41,309 This is a molecule which is very relevant 93 00:03:41,310 --> 00:03:42,689 for your heart. 94 00:03:42,690 --> 00:03:45,449 This is very important 95 00:03:45,450 --> 00:03:47,249 to repolarize 96 00:03:49,210 --> 00:03:51,239 your membrane so that your heart actually 97 00:03:51,240 --> 00:03:54,239 beats. If this thing is 98 00:03:54,240 --> 00:03:56,579 broken by medicine you take, 99 00:03:56,580 --> 00:03:58,659 you will die of 100 00:03:58,660 --> 00:03:59,939 a heart attack. 101 00:03:59,940 --> 00:04:02,159 So to understand 102 00:04:02,160 --> 00:04:04,289 this protein, this channel, 103 00:04:04,290 --> 00:04:05,489 it's very important to know its 104 00:04:05,490 --> 00:04:07,739 structure, which is difficult 105 00:04:07,740 --> 00:04:10,709 to determine, or to 106 00:04:10,710 --> 00:04:12,419 simulate its movement. 107 00:04:12,420 --> 00:04:14,609 Actually, you can see 108 00:04:14,610 --> 00:04:16,679 a picture from above. The 109 00:04:16,680 --> 00:04:18,778 red colors mean that 110 00:04:18,779 --> 00:04:20,849 it's in the closed state: nothing can 111 00:04:20,850 --> 00:04:21,778 pass through. 112 00:04:21,779 --> 00:04:24,149 The green side means it's in the open state: 113 00:04:24,150 --> 00:04:26,609 now the ions can move through and then 114 00:04:26,610 --> 00:04:29,159 repolarize the membrane. 115 00:04:29,160 --> 00:04:31,649 This is a very essential stage in 116 00:04:31,650 --> 00:04:32,729 your heart. 117 00:04:32,730 --> 00:04:34,199 So you really don't want to create 118 00:04:34,200 --> 00:04:36,389 medicine that works against this one.
119 00:04:36,390 --> 00:04:38,099 And if you could make the prediction 120 00:04:38,100 --> 00:04:40,469 methods better so that 121 00:04:40,470 --> 00:04:42,539 we can sort out medicine that 122 00:04:42,540 --> 00:04:45,329 works against this stuff, 123 00:04:45,330 --> 00:04:47,519 we could make medicine a little bit 124 00:04:47,520 --> 00:04:48,419 cheaper. 125 00:04:48,420 --> 00:04:50,639 That's in development, which means 126 00:04:50,640 --> 00:04:52,709 also that my university 127 00:04:52,710 --> 00:04:54,779 sends me to summer school so 128 00:04:54,780 --> 00:04:56,819 that I would get a better overview about 129 00:04:56,820 --> 00:04:58,739 the future trends and the current trends, 130 00:04:58,740 --> 00:05:00,239 which means I now have to show you my 131 00:05:00,240 --> 00:05:02,489 sponsor logos, you know, so I'm funded 132 00:05:02,490 --> 00:05:04,979 by the federal ministry and 133 00:05:04,980 --> 00:05:07,379 a third party, which means by a science. 134 00:05:07,380 --> 00:05:09,419 And I'm happy to say that I'm at the 135 00:05:09,420 --> 00:05:11,369 University of Hamburg, which also means I'm a member 136 00:05:11,370 --> 00:05:12,779 of the CCCHH. 137 00:05:15,230 --> 00:05:17,299 So let's get to the 138 00:05:17,300 --> 00:05:19,369 real topic, let's talk about 139 00:05:19,370 --> 00:05:20,370 parallelism. 140 00:05:22,790 --> 00:05:25,340 So that everybody can follow 141 00:05:26,540 --> 00:05:27,540 the. 142 00:05:30,840 --> 00:05:33,389 I created a wonderful, colorful picture 143 00:05:33,390 --> 00:05:35,729 and we will go through the details, 144 00:05:35,730 --> 00:05:37,949 we will reconstruct this whole picture 145 00:05:37,950 --> 00:05:40,439 in the talk, and then hopefully 146 00:05:40,440 --> 00:05:43,319 you can see how it all fits together. 147 00:05:43,320 --> 00:05:45,509 But we will start on a very simple 148 00:05:45,510 --> 00:05:46,529 level.
149 00:05:46,530 --> 00:05:47,530 We will just 150 00:05:48,690 --> 00:05:51,389 start with the headline 151 00:05:51,390 --> 00:05:53,609 because we need to discuss what data- 152 00:05:53,610 --> 00:05:55,679 based and task-based parallelism in my 153 00:05:55,680 --> 00:05:56,789 definition is. 154 00:05:56,790 --> 00:05:59,309 These may not be the best fitting words, 155 00:05:59,310 --> 00:06:01,379 but these may be words that 156 00:06:01,380 --> 00:06:03,809 you can associate with something. 157 00:06:03,810 --> 00:06:05,909 And in my talk, in my mind, 158 00:06:05,910 --> 00:06:08,159 these tend to associate with food because 159 00:06:08,160 --> 00:06:10,499 I'm a very food-oriented person. 160 00:06:10,500 --> 00:06:12,419 When we talk about data parallelism, it 161 00:06:12,420 --> 00:06:13,420 could be 162 00:06:16,050 --> 00:06:18,329 associated with frying a lot 163 00:06:18,330 --> 00:06:19,319 of omelets. 164 00:06:19,320 --> 00:06:21,479 Your data is the respective egg you need 165 00:06:21,480 --> 00:06:23,639 to fry the omelet, and you have to do 166 00:06:23,640 --> 00:06:26,309 a lot of omelets to feed a lot of people. 167 00:06:26,310 --> 00:06:28,529 Task parallelism means that you would like 168 00:06:28,530 --> 00:06:30,749 to cook a five-course 169 00:06:30,750 --> 00:06:32,189 menu for one person. 170 00:06:32,190 --> 00:06:34,889 You have a lot of tasks you need to do, 171 00:06:34,890 --> 00:06:37,199 but you have only to deliver 172 00:06:37,200 --> 00:06:38,759 them in a certain way. 173 00:06:38,760 --> 00:06:40,379 But you have to cook them in a different 174 00:06:40,380 --> 00:06:41,380 arrangement. 175 00:06:42,300 --> 00:06:44,219 You can definitely mix them up, 176 00:06:44,220 --> 00:06:46,739 obviously, by 177 00:06:46,740 --> 00:06:48,509 increasing the number of people for whom 178 00:06:48,510 --> 00:06:49,769 you're cooking.
179 00:06:49,770 --> 00:06:52,589 So data-based and task-based parallelism 180 00:06:52,590 --> 00:06:53,999 are not a contradiction. 181 00:06:54,000 --> 00:06:55,320 They are a combination. 182 00:06:59,300 --> 00:07:01,549 Good, but obviously you tend 183 00:07:01,550 --> 00:07:03,679 to have problems 184 00:07:03,680 --> 00:07:05,749 when you start to cook for 8000 185 00:07:05,750 --> 00:07:07,909 people a five-course menu from one 186 00:07:07,910 --> 00:07:10,299 menu is a monument to peace is 187 00:07:10,300 --> 00:07:11,849 the name of it. 188 00:07:11,850 --> 00:07:13,149 First of all, your kitchen could be just too 189 00:07:13,150 --> 00:07:14,569 small. A thousand people, 190 00:07:14,570 --> 00:07:16,849 it's not so fun. Also, 191 00:07:16,850 --> 00:07:19,039 you may not have chef cooks all the time. 192 00:07:19,040 --> 00:07:20,959 You may have apprentices. 193 00:07:20,960 --> 00:07:23,149 You also have probably not enough pans 194 00:07:23,150 --> 00:07:25,219 and you don't have enough space 195 00:07:25,220 --> 00:07:27,709 to move to and from the fridges. 196 00:07:27,710 --> 00:07:29,839 You also have just probably one 197 00:07:29,840 --> 00:07:32,169 recipe book and, 198 00:07:32,170 --> 00:07:33,889 I mean, it's crowded when you want to 199 00:07:33,890 --> 00:07:34,969 read from it. 200 00:07:34,970 --> 00:07:37,369 You also need to deliver your stuff 201 00:07:37,370 --> 00:07:39,439 to the kitchen, to a specific spot. 202 00:07:39,440 --> 00:07:40,819 You probably have some problems 203 00:07:40,820 --> 00:07:42,139 transporting the eggs. 204 00:07:42,140 --> 00:07:44,479 And obviously in the end, you 205 00:07:44,480 --> 00:07:46,219 somehow need to serve the courses in the 206 00:07:46,220 --> 00:07:47,899 correct order and it should be hot.
207 00:07:49,400 --> 00:07:51,319 But you can translate these problems into 208 00:07:51,320 --> 00:07:54,079 more techie language, which means 209 00:07:54,080 --> 00:07:56,209 you can say that your kitchen 210 00:07:56,210 --> 00:07:58,489 has a global capacity limit. 211 00:07:58,490 --> 00:08:01,159 Your apprentices are actually 212 00:08:01,160 --> 00:08:02,929 related to the complexity of the 213 00:08:02,930 --> 00:08:05,029 process, which in our case means a 214 00:08:05,030 --> 00:08:07,639 single or double precision computation. 215 00:08:07,640 --> 00:08:09,319 You could have probably not enough frying 216 00:08:09,320 --> 00:08:11,029 pans, which means your card is just too 217 00:08:11,030 --> 00:08:12,889 small to cook everything 218 00:08:12,890 --> 00:08:15,019 at the same time. You can also 219 00:08:15,020 --> 00:08:17,089 have some bandwidth limitations. 220 00:08:17,090 --> 00:08:19,219 You have read/write access limitations 221 00:08:19,220 --> 00:08:21,439 you need to think about, and then you 222 00:08:21,440 --> 00:08:23,719 end up with probably coalesced 223 00:08:23,720 --> 00:08:25,789 memory access problems, for your 224 00:08:25,790 --> 00:08:27,919 memory works better when you access 225 00:08:27,920 --> 00:08:30,199 it at the same time with a large request 226 00:08:30,200 --> 00:08:32,089 instead of accessing it all the time with 227 00:08:32,090 --> 00:08:33,439 small requests. 228 00:08:33,440 --> 00:08:35,689 So you should probably assemble the 229 00:08:35,690 --> 00:08:38,359 requests for eggs and transport one big 230 00:08:38,360 --> 00:08:41,119 package, you know. And in the end, 231 00:08:41,120 --> 00:08:42,769 you know, cooking, you need to 232 00:08:42,770 --> 00:08:45,469 synchronize and keep the meal hot, 233 00:08:45,470 --> 00:08:46,470 so.
234 00:08:48,150 --> 00:08:49,229 These problems 235 00:08:50,910 --> 00:08:53,609 were here all the time, 236 00:08:53,610 --> 00:08:56,099 so they developed some standards, 237 00:08:56,100 --> 00:08:58,019 some old standards which were used by 238 00:08:58,020 --> 00:08:59,969 single and multicore clusters and 239 00:08:59,970 --> 00:09:02,099 research where products 240 00:09:02,100 --> 00:09:04,229 to use 241 00:09:04,230 --> 00:09:06,299 their computational resources the 242 00:09:06,300 --> 00:09:08,519 best way. The old 243 00:09:08,520 --> 00:09:10,679 standards, which are not so old, but we 244 00:09:10,680 --> 00:09:12,659 are living in a fast developing age, 245 00:09:12,660 --> 00:09:14,849 are OpenMP and Open MPI, 246 00:09:14,850 --> 00:09:16,649 and you can see with the different colors 247 00:09:16,650 --> 00:09:19,199 and the different shading, they are 248 00:09:19,200 --> 00:09:20,789 on two sides. 249 00:09:20,790 --> 00:09:23,129 So data-based and task-based. 250 00:09:23,130 --> 00:09:25,229 And the one of 251 00:09:25,230 --> 00:09:27,449 my man, the red 252 00:09:27,450 --> 00:09:29,579 one, is on the implicit side, which I 253 00:09:29,580 --> 00:09:31,829 would say is on the high 254 00:09:31,830 --> 00:09:33,719 level, and you will see what I mean when 255 00:09:33,720 --> 00:09:35,850 you see the first line of code. 256 00:09:39,180 --> 00:09:41,309 First of all, we would like to cook our 257 00:09:41,310 --> 00:09:43,439 menu, in general five courses. How do we 258 00:09:43,440 --> 00:09:44,489 do that? 259 00:09:44,490 --> 00:09:46,739 Well, for this we use Open MPI because 260 00:09:46,740 --> 00:09:49,289 we need to dispatch a lot of tasks 261 00:09:49,290 --> 00:09:51,179 and Open MPI lets us do this.
262 00:09:51,180 --> 00:09:53,639 Just imagine 263 00:09:53,640 --> 00:09:55,709 that you have a classroom of 264 00:09:55,710 --> 00:09:57,779 many single-core machines, 265 00:09:57,780 --> 00:10:00,239 like 2008, probably 266 00:10:00,240 --> 00:10:01,619 2005. 267 00:10:01,620 --> 00:10:03,359 And you would like to 268 00:10:03,360 --> 00:10:05,549 cook a five- 269 00:10:05,550 --> 00:10:06,989 course menu. 270 00:10:06,990 --> 00:10:09,059 First of all, you install 271 00:10:09,060 --> 00:10:10,949 it, which is what I'm not talking 272 00:10:10,950 --> 00:10:13,379 about. But then you have this 273 00:10:13,380 --> 00:10:15,479 line, the last one, that says how 274 00:10:15,480 --> 00:10:17,669 to run the code, and this 275 00:10:17,670 --> 00:10:19,859 says mpirun, and then you have 276 00:10:19,860 --> 00:10:21,939 -np, 277 00:10:21,940 --> 00:10:24,329 an abbreviation for the number of processes 278 00:10:24,330 --> 00:10:26,459 to launch. You could set it to 279 00:10:26,460 --> 00:10:28,769 one or, if you have five computers, 280 00:10:28,770 --> 00:10:30,899 five cores to use over your 281 00:10:30,900 --> 00:10:32,549 distributed net, you could set it to 282 00:10:32,550 --> 00:10:35,069 five or eight, whatever you can think of. 283 00:10:35,070 --> 00:10:37,379 And then you have a -hostfile, 284 00:10:37,380 --> 00:10:39,689 which will read a text 285 00:10:39,690 --> 00:10:41,819 file in which you write down the 286 00:10:41,820 --> 00:10:44,459 names or addresses or whatever of the 287 00:10:44,460 --> 00:10:46,829 computers you would like to use 288 00:10:46,830 --> 00:10:48,629 for your task.
289 00:10:48,630 --> 00:10:51,539 And then Open MPI will 290 00:10:51,540 --> 00:10:53,699 execute the program on the 291 00:10:53,700 --> 00:10:56,539 specified computers 292 00:10:56,540 --> 00:10:59,759 with the specified amount of processes, 293 00:10:59,760 --> 00:11:02,009 and it will loop over the names in this 294 00:11:02,010 --> 00:11:03,419 host file. 295 00:11:03,420 --> 00:11:05,819 So you can probably have more processes 296 00:11:05,820 --> 00:11:07,889 running on one core than on the 297 00:11:07,890 --> 00:11:09,749 others because it has to probably loop 298 00:11:09,750 --> 00:11:10,750 over. 299 00:11:11,820 --> 00:11:14,249 This is different to the POSIX thread 300 00:11:14,250 --> 00:11:16,919 and Boost thread library because Open MPI 301 00:11:16,920 --> 00:11:18,689 gives you an interface for message 302 00:11:18,690 --> 00:11:19,709 passing. 303 00:11:19,710 --> 00:11:21,299 So when you execute this, when you 304 00:11:21,300 --> 00:11:23,819 start it, it actually starts 305 00:11:23,820 --> 00:11:26,219 to create a message queuing system 306 00:11:26,220 --> 00:11:28,350 so that the respective processes can 307 00:11:29,430 --> 00:11:31,499 send data and exchange 308 00:11:31,500 --> 00:11:32,500 data with each other. 309 00:11:34,320 --> 00:11:37,109 Actually, when you think about it, 310 00:11:37,110 --> 00:11:38,879 the code to do this is actually quite 311 00:11:38,880 --> 00:11:41,039 short because 312 00:11:41,040 --> 00:11:43,589 you just eat it. 313 00:11:43,590 --> 00:11:44,590 De de de de de. 314 00:11:46,490 --> 00:11:47,490 You are. 315 00:11:47,900 --> 00:11:48,900 You 316 00:11:50,090 --> 00:11:51,090 know, 317 00:11:52,220 --> 00:11:54,979 I have provisions so I can be there 318 00:11:54,980 --> 00:11:56,329 to do this in a different way.
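For readers without the slides, a minimal sketch of such a message-passing program might look like the following. It is not the speaker's slide code, so any line numbers mentioned in the talk refer to her slide and not to this listing. The file name hello_mpi.c and the host file hosts.txt are hypothetical, and the example assumes an MPI implementation such as Open MPI is installed.

```c
/* Build and launch, assuming Open MPI is installed
 * (hello_mpi.c and hosts.txt are hypothetical names):
 *   mpicc hello_mpi.c -o hello_mpi
 *   mpirun -np 5 -hostfile hosts.txt ./hello_mpi
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_size, world_rank, name_len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                      /* start the message-passing runtime */
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);  /* how big is the world? */
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);  /* where is this process in it? */
    MPI_Get_processor_name(host, &name_len);     /* which machine runs this process? */

    printf("process %d of %d runs on %s\n", world_rank, world_size, host);

    /* ... the actual work (cooking the menu) would go here ... */

    MPI_Finalize();                              /* shut the runtime down again */
    return 0;
}
```

Each of the launched processes runs the same program and only learns from its rank which part of the work it should take on.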
319 00:11:56,330 --> 00:11:59,149 First, MPI_Init 320 00:11:59,150 --> 00:12:01,609 will start the queuing, 321 00:12:01,610 --> 00:12:03,859 the message queuing system, and start 322 00:12:03,860 --> 00:12:05,119 with a lot of overhead. 323 00:12:05,120 --> 00:12:07,189 And then you realize, in 324 00:12:07,190 --> 00:12:09,349 line 10 and 11, how 325 00:12:09,350 --> 00:12:11,569 big your world is and where your position 326 00:12:11,570 --> 00:12:13,459 in this world is. 327 00:12:13,460 --> 00:12:16,009 And then you can start communicating, 328 00:12:16,010 --> 00:12:18,409 for example, informing the user where 329 00:12:18,410 --> 00:12:21,019 this thread was actually launched. 330 00:12:21,020 --> 00:12:23,239 And then you can probably cook your menu. 331 00:12:23,240 --> 00:12:26,059 This is how you do a task-based 332 00:12:26,060 --> 00:12:28,579 model. And I don't think there is 333 00:12:28,580 --> 00:12:30,799 yet a new implementation, a real one, 334 00:12:30,800 --> 00:12:33,199 in C/C++ for this. 335 00:12:33,200 --> 00:12:35,419 So this is still summer and high 336 00:12:35,420 --> 00:12:37,849 tech and very often used. 337 00:12:37,850 --> 00:12:40,039 You finish, obviously, with MPI_Finalize. 338 00:12:40,040 --> 00:12:42,529 And after that your MPI program 339 00:12:42,530 --> 00:12:43,909 has switched off its message 340 00:12:43,910 --> 00:12:45,189 communication, etc. 341 00:12:45,190 --> 00:12:47,059 and will just run on your single 342 00:12:47,060 --> 00:12:48,060 core again. 343 00:12:49,500 --> 00:12:51,249 So this is short and this is sweet. 344 00:12:52,440 --> 00:12:55,469 So let's look now how we fry an omelet. 345 00:12:55,470 --> 00:12:57,419 Let's go over to the old standard. 346 00:12:57,420 --> 00:12:58,709 How do you do data
347 00:12:58,710 --> 00:13:00,779 parallelism? Obviously, when you fry 348 00:13:00,780 --> 00:13:03,149 an omelet, you need eggs and the pan 349 00:13:03,150 --> 00:13:05,279 and then you need to apply heat to get 350 00:13:05,280 --> 00:13:07,589 an omelet, which is 351 00:13:07,590 --> 00:13:09,900 basically a matrix multiplication. 352 00:13:14,850 --> 00:13:16,949 I mean, it's obvious, right, you 353 00:13:16,950 --> 00:13:17,950 apply some heat. 354 00:13:19,020 --> 00:13:21,299 So if you're not so familiar 355 00:13:21,300 --> 00:13:23,129 with matrix multiplication because you 356 00:13:23,130 --> 00:13:25,409 don't do a lot of this, this is 357 00:13:25,410 --> 00:13:26,699 our help slide. 358 00:13:26,700 --> 00:13:29,189 And I also uploaded these slides into the 359 00:13:29,190 --> 00:13:31,409 system already, so be sure 360 00:13:31,410 --> 00:13:33,749 to download them and look 361 00:13:33,750 --> 00:13:35,099 back if you're lost. 362 00:13:35,100 --> 00:13:37,979 This really could 363 00:13:37,980 --> 00:13:40,299 be tricky for some people. 364 00:13:40,300 --> 00:13:42,509 So matrix multiplication actually means 365 00:13:42,510 --> 00:13:44,939 that you produce 366 00:13:44,940 --> 00:13:46,409 the product. 367 00:13:46,410 --> 00:13:48,719 That's a sum of products of parts 368 00:13:48,720 --> 00:13:51,229 of the matrices, and I colored 369 00:13:51,230 --> 00:13:53,489 the respective necessary 370 00:13:53,490 --> 00:13:55,979 rows and columns to compute the element 371 00:13:55,980 --> 00:13:58,499 in C, the resulting element in C. 372 00:13:58,500 --> 00:14:00,749 And you do that by picking the first 373 00:14:00,750 --> 00:14:03,149 element in A and multiplying it 374 00:14:03,150 --> 00:14:05,189 with the first element in B in the 375 00:14:05,190 --> 00:14:06,659 colored area.
376 00:14:06,660 --> 00:14:08,909 So you move in both matrices 377 00:14:08,910 --> 00:14:10,979 over k while 378 00:14:10,980 --> 00:14:13,559 i and j are fixed. 379 00:14:13,560 --> 00:14:14,609 This is important. 380 00:14:14,610 --> 00:14:16,799 So you realize that actually, to 381 00:14:16,800 --> 00:14:19,229 calculate this specific spot, 382 00:14:21,200 --> 00:14:22,200 you are not, 383 00:14:23,690 --> 00:14:25,159 you know, not connected to the 384 00:14:25,160 --> 00:14:28,519 calculations for other positions, 385 00:14:28,520 --> 00:14:30,319 so actually it doesn't matter when 386 00:14:30,320 --> 00:14:33,019 you calculate the respective C part, 387 00:14:33,020 --> 00:14:35,269 you just have to do it some time 388 00:14:35,270 --> 00:14:37,279 and you need a specific amount of data. 389 00:14:37,280 --> 00:14:38,689 OK, but it doesn't matter. 390 00:14:38,690 --> 00:14:40,699 You can do this in parallel and actually 391 00:14:40,700 --> 00:14:42,799 in image processing or in 392 00:14:42,800 --> 00:14:44,569 scientific computing, they do that a lot 393 00:14:44,570 --> 00:14:45,570 in parallel. 394 00:14:47,760 --> 00:14:48,760 And you can 395 00:14:51,240 --> 00:14:53,309 obviously code it. 396 00:14:53,310 --> 00:14:55,529 Just for reference, I'm showing 397 00:14:55,530 --> 00:14:56,999 you the test of C code, 398 00:14:58,680 --> 00:15:01,070 which is just for loops, like you need to. 399 00:15:02,380 --> 00:15:04,199 Yeah, exactly. So you need to define your 400 00:15:04,200 --> 00:15:06,479 matrices. And in line 401 00:15:06,480 --> 00:15:08,759 three there is a known 402 00:15:08,760 --> 00:15:11,109 prerequisite for multiplying matrices. 403 00:15:11,110 --> 00:15:12,110 So please 404 00:15:13,350 --> 00:15:15,479 note that. And then you just go over 405 00:15:15,480 --> 00:15:17,909 the loops, fix i and j, and then 406 00:15:17,910 --> 00:15:20,069 go over k and produce your 407 00:15:20,070 --> 00:15:21,070 sum product.
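As a point of reference, the serial version described above can be sketched as follows. Each result element is c[i][j] = sum over k of a[i][k] * b[k][j]. The function name matmul_serial and the flat row-major storage are assumptions made for this sketch and are not taken from the slide.

```c
#include <assert.h>
#include <stddef.h>

/* C (n x p) = A (n x m) * B (m x p), all stored row-major in flat arrays.
 * Prerequisite for multiplying: the number of columns of A equals the
 * number of rows of B. Here both are the single parameter m. */
void matmul_serial(const double *a, const double *b, double *c,
                   size_t n, size_t m, size_t p)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < p; j++) {
            double sum = 0.0;               /* i and j are fixed here */
            for (size_t k = 0; k < m; k++)  /* only k moves */
                sum += a[i * m + k] * b[k * p + j];
            c[i * p + j] = sum;             /* no other element of C is touched */
        }
    }
}
```

Because every c[i][j] depends only on row i of A and column j of B, the iterations of the two outer loops are independent of each other, which is what makes this loop nest easy to run in parallel.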
408 00:15:22,650 --> 00:15:24,809 So we will now transform, most 409 00:15:24,810 --> 00:15:26,579 of the time, this code into the respective 410 00:15:26,580 --> 00:15:27,599 parallel versions. 411 00:15:28,860 --> 00:15:31,049 First of all, to run OpenMP 412 00:15:31,050 --> 00:15:33,719 on the multicore processor, 413 00:15:33,720 --> 00:15:35,799 you should probably test if it's working. 414 00:15:35,800 --> 00:15:37,979 So you write a hello world here 415 00:15:37,980 --> 00:15:38,980 and you can ask 416 00:15:40,110 --> 00:15:42,929 the system if it actually did work. 417 00:15:42,930 --> 00:15:45,089 As you probably can already see, I 418 00:15:45,090 --> 00:15:47,279 included a header there 419 00:15:47,280 --> 00:15:49,439 for OpenMP, and then there is not 420 00:15:49,440 --> 00:15:50,429 so much code. 421 00:15:50,430 --> 00:15:52,739 There's just this weird pragma in there. 422 00:15:52,740 --> 00:15:55,379 This is how you code OpenMP, and 423 00:15:55,380 --> 00:15:57,689 OpenMP is translated 424 00:15:57,690 --> 00:16:00,179 by the compiler to actual code, 425 00:16:00,180 --> 00:16:02,609 so the compiler decides how to parallelize 426 00:16:02,610 --> 00:16:03,610 it. 427 00:16:04,080 --> 00:16:06,159 So you inform it: hey, now comes 428 00:16:06,160 --> 00:16:07,679 some parallel stuff, please 429 00:16:07,680 --> 00:16:10,169 execute it in parallel on 430 00:16:10,170 --> 00:16:12,059 the amount of threads and processes you 431 00:16:12,060 --> 00:16:14,609 have in your system, and the compiler 432 00:16:14,610 --> 00:16:16,889 just will decide how many processes 433 00:16:16,890 --> 00:16:18,809 there are and how to distribute it.
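A hello-world test of this kind could be sketched as below. The function name hello_threads is hypothetical, and the `_OPENMP` guards are an addition so that the sketch also compiles (and then runs serially) when OpenMP support is switched off.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>   /* the OpenMP header mentioned in the talk */
#endif

/* Prints one greeting per thread and returns how many threads took part.
 * Without OpenMP support the pragmas are ignored and one greeting is printed. */
int hello_threads(void)
{
    int count = 0;
#pragma omp parallel
    {
        int tid = 0, total = 1;
#ifdef _OPENMP
        tid = omp_get_thread_num();     /* id of this thread */
        total = omp_get_num_threads();  /* size of the thread team */
#endif
        printf("hello from thread %d of %d\n", tid, total);
#pragma omp atomic
        count++;                        /* count is shared, so update it atomically */
    }
    return count;
}
```

With GCC this is built with the `-fopenmp` flag. The number of threads is chosen at run time and can be set through the `OMP_NUM_THREADS` environment variable.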
434 00:16:20,510 --> 00:16:22,579 But it's very short and cute and you can 435 00:16:22,580 --> 00:16:25,099 see the code, instead of focusing 436 00:16:25,100 --> 00:16:27,109 on all the overhead stuff you should 437 00:16:27,110 --> 00:16:29,419 probably, you could do. And we will see how 438 00:16:29,420 --> 00:16:30,420 it could look like 439 00:16:31,700 --> 00:16:34,459 if you do some real matrix multiplication 440 00:16:34,460 --> 00:16:36,679 with OpenMP. I 441 00:16:36,680 --> 00:16:37,680 will. 442 00:16:38,000 --> 00:16:39,889 First, you have to probably fill your 443 00:16:39,890 --> 00:16:42,139 matrix. This is why this less 444 00:16:42,140 --> 00:16:44,210 important slide you have to analyze. 445 00:16:45,740 --> 00:16:47,599 I got these words. 446 00:16:47,600 --> 00:16:49,819 You have to fill your matrix 447 00:16:49,820 --> 00:16:52,099 with data. The important part 448 00:16:52,100 --> 00:16:54,199 comes now, because now you have 449 00:16:54,200 --> 00:16:55,599 to compute. 450 00:16:55,600 --> 00:16:57,409 As you can see, it's basically the same 451 00:16:57,410 --> 00:17:00,049 code as in a matrix multiplication. 452 00:17:00,050 --> 00:17:02,479 I just wrote an omp parallel 453 00:17:02,480 --> 00:17:03,480 above it. 454 00:17:05,109 --> 00:17:07,809 Yeah, now 455 00:17:07,810 --> 00:17:08,880 we're seeing the flaw. 456 00:17:11,609 --> 00:17:13,739 OK, and 457 00:17:13,740 --> 00:17:15,450 you free it, then 458 00:17:16,859 --> 00:17:18,209 that's the end of it, you can print it 459 00:17:18,210 --> 00:17:19,318 out and free it. 460 00:17:19,319 --> 00:17:21,629 But actually what it does is 461 00:17:21,630 --> 00:17:23,699 give you a printout and it will 462 00:17:23,700 --> 00:17:26,219 tell you: hey, come on, I have four threads. 463 00:17:26,220 --> 00:17:27,779 This is thread two and I'm 464 00:17:27,780 --> 00:17:28,679 computing this and that.
465 00:17:28,680 --> 00:17:30,989 Well, then the next thread will say, 466 00:17:30,990 --> 00:17:32,439 oh, I'm computing that 467 00:17:32,440 --> 00:17:33,449 value too. 468 00:17:33,450 --> 00:17:35,699 Oh, I do that too. 469 00:17:35,700 --> 00:17:37,049 This is not what you want. 470 00:17:37,050 --> 00:17:39,509 So just writing the omp parallel will execute 471 00:17:39,510 --> 00:17:41,339 the code on every core. 472 00:17:41,340 --> 00:17:43,229 Apparently this is a quote quote. 473 00:17:43,230 --> 00:17:44,230 And so. 474 00:17:45,250 --> 00:17:47,789 We should invest some more 475 00:17:47,790 --> 00:17:50,489 letters into our pragma, 476 00:17:50,490 --> 00:17:51,490 and we did that. 477 00:17:52,800 --> 00:17:54,929 We define some variables 478 00:17:54,930 --> 00:17:57,569 here, for example the chunk size and the 479 00:17:57,570 --> 00:17:59,669 k, and then we extended 480 00:17:59,670 --> 00:18:01,019 our pragma. 481 00:18:01,020 --> 00:18:03,149 Now we tell it that, first of all, 482 00:18:03,150 --> 00:18:05,219 this next code block is in 483 00:18:05,220 --> 00:18:07,379 parallel and some 484 00:18:07,380 --> 00:18:09,569 of the information, like the matrices, 485 00:18:09,570 --> 00:18:10,919 is shared. 486 00:18:10,920 --> 00:18:13,559 Everybody in this 487 00:18:13,560 --> 00:18:15,869 whole compute unit is going to use 488 00:18:15,870 --> 00:18:18,299 them. But j and k are private. 489 00:18:19,670 --> 00:18:22,429 And then we inform it that 490 00:18:22,430 --> 00:18:25,369 here, there's a second pragma in line 11, 491 00:18:25,370 --> 00:18:27,799 that there is a for loop now coming, 492 00:18:27,800 --> 00:18:30,319 and this for is important because 493 00:18:30,320 --> 00:18:32,479 the i that defines the spot, 494 00:18:32,480 --> 00:18:34,549 that runs through this loop, is 495 00:18:34,550 --> 00:18:36,319 defined as shared.
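The extended pragma described here might look roughly like the following sketch. The matrix size N, the function name matmul_omp and the `schedule(dynamic, chunk)` clause are assumptions chosen for illustration and are not copied from the slide. Note that in OpenMP the loop variable of a work-sharing `for` is private to each thread; what is shared out among the threads are the iterations of that loop.

```c
#include <assert.h>

#define N 4   /* hypothetical matrix size for this sketch */

/* c = a * b for N x N matrices; the rows of c are distributed over the threads. */
void matmul_omp(double a[N][N], double b[N][N], double c[N][N], int chunk)
{
    int i, j, k;
    assert(chunk > 0);
#pragma omp parallel shared(a, b, c, chunk) private(i, j, k)
    {
        /* The iterations over i are handed out in pieces of `chunk` rows,
         * so every row i is computed by exactly one thread. */
#pragma omp for schedule(dynamic, chunk)
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                double sum = 0.0;       /* declared inside, so private per thread */
                for (k = 0; k < N; k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
        }
    }
}
```

Compiled without OpenMP support the pragmas are ignored and the function computes the same result serially, which makes it easy to compare timings for different chunk sizes.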
496 00:18:36,320 --> 00:18:38,599 So this means that the 497 00:18:38,600 --> 00:18:40,849 four threads that are running and compute 498 00:18:40,850 --> 00:18:43,159 these values will not 499 00:18:43,160 --> 00:18:45,319 calculate, they don't have 500 00:18:45,320 --> 00:18:47,329 the same i. So they will be different. 501 00:18:47,330 --> 00:18:48,829 And this is what we want. 502 00:18:48,830 --> 00:18:50,959 But they will iterate over the j 503 00:18:50,960 --> 00:18:53,170 and they will not change anything here. 504 00:18:54,200 --> 00:18:56,319 And you can also define how OpenMP 505 00:18:56,320 --> 00:18:57,920 schedules all these things. 506 00:18:59,060 --> 00:19:00,109 So this is way better. 507 00:19:00,110 --> 00:19:01,250 This is what we want. 508 00:19:02,670 --> 00:19:05,460 And one word for scheduling: 509 00:19:07,200 --> 00:19:09,929 it could be dynamic, and you can also... 510 00:19:09,930 --> 00:19:12,059 dynamic means it checks if 511 00:19:12,060 --> 00:19:14,219 the respective thread is done and then it 512 00:19:14,220 --> 00:19:16,409 will load another part of data 513 00:19:16,410 --> 00:19:18,749 on it for the computation, and 514 00:19:18,750 --> 00:19:21,029 it will load the amount of chunk 515 00:19:21,030 --> 00:19:21,929 onto it. 516 00:19:21,930 --> 00:19:24,569 So it will compute, in our 517 00:19:24,570 --> 00:19:25,829 situation, four, 518 00:19:25,830 --> 00:19:27,450 well, four i's 519 00:19:29,920 --> 00:19:32,139 via this loop and then request 520 00:19:32,140 --> 00:19:33,339 new data. 521 00:19:33,340 --> 00:19:35,349 You can change this; you have a little 522 00:19:35,350 --> 00:19:36,879 bit of overhead when you launch new 523 00:19:36,880 --> 00:19:39,309 processes here, so you should choose 524 00:19:39,310 --> 00:19:41,639 the value wisely and especially check 525 00:19:41,640 --> 00:19:44,259 the time. You may be a bit irritated 526 00:19:44,260 --> 00:19:45,999 how different
it could be, changing the 527 00:19:46,000 --> 00:19:47,619 chunk size. 528 00:19:47,620 --> 00:19:50,349 This also is relevant for CUDA. 529 00:19:50,350 --> 00:19:52,149 Obviously you have to install it first 530 00:19:52,150 --> 00:19:54,189 and you have to use, in this case, 531 00:19:54,190 --> 00:19:56,559 GCC to link it. 532 00:19:56,560 --> 00:19:58,749 And these 533 00:19:58,750 --> 00:20:00,519 were the old standards and now we are 534 00:20:00,520 --> 00:20:01,780 going to accelerator cards. 535 00:20:04,310 --> 00:20:06,469 We are back to the picture and now we are 536 00:20:06,470 --> 00:20:08,499 going over to the explicit models. 537 00:20:08,500 --> 00:20:10,509 There you really have to look into your data 538 00:20:10,510 --> 00:20:12,679 and move the data along and stuff and 539 00:20:12,680 --> 00:20:14,869 so on. So this is the really key way 540 00:20:14,870 --> 00:20:17,419 of doing it, and I'm perfectly 541 00:20:17,420 --> 00:20:18,920 on time, which is great. 542 00:20:22,590 --> 00:20:24,719 And most wanted for this, 543 00:20:24,720 --> 00:20:26,819 we have to make sure that we don't 544 00:20:26,820 --> 00:20:28,919 miss the words: we are 545 00:20:28,920 --> 00:20:31,049 always thinking about 546 00:20:31,050 --> 00:20:32,639 a host and a device. 547 00:20:32,640 --> 00:20:34,799 The host is, in most cases, the 548 00:20:34,800 --> 00:20:37,529 CPU sending off work to the device, 549 00:20:37,530 --> 00:20:39,959 the accelerator card. In most cases, 550 00:20:39,960 --> 00:20:41,279 which you are probably used to, 551 00:20:41,280 --> 00:20:43,259 these are graphics cards. 552 00:20:43,260 --> 00:20:45,359 And then when you send something 553 00:20:45,360 --> 00:20:47,909 to the device, you send over 554 00:20:47,910 --> 00:20:49,859 a kernel, which is a function which is 555 00:20:49,860 --> 00:20:51,269 launched on the device.
556 00:20:51,270 --> 00:20:53,489 So if I am now switching to kernel, 557 00:20:53,490 --> 00:20:55,359 this is just a name for a function. 558 00:20:56,370 --> 00:20:58,049 Yes. And when you are running Linux, you 559 00:20:58,050 --> 00:21:00,209 need to install the proprietary graphics 560 00:21:00,210 --> 00:21:02,429 drivers, which actually worked 561 00:21:02,430 --> 00:21:04,020 out very fine on my system. 562 00:21:05,410 --> 00:21:06,549 Just not on my work system. 563 00:21:08,200 --> 00:21:09,969 First of all, we can do this now in 564 00:21:09,970 --> 00:21:12,309 CUDA. CUDA is a programming 565 00:21:12,310 --> 00:21:14,559 language developed by NVIDIA. 566 00:21:14,560 --> 00:21:16,719 They do invest a lot of marketing money 567 00:21:16,720 --> 00:21:18,909 and share it 568 00:21:18,910 --> 00:21:19,569 with all the world. 569 00:21:19,570 --> 00:21:21,669 They have a lot of great documentation 570 00:21:21,670 --> 00:21:22,670 about it. 571 00:21:23,410 --> 00:21:25,719 Yeah, but you should choose 572 00:21:25,720 --> 00:21:27,219 wisely, you know. 573 00:21:27,220 --> 00:21:29,109 But you will see, it's actually 574 00:21:29,110 --> 00:21:30,579 quite nice, this one. 575 00:21:30,580 --> 00:21:32,079 What you can see here is, first of all, 576 00:21:32,080 --> 00:21:33,279 the kernel. 577 00:21:33,280 --> 00:21:35,169 It's going to the device, which means it 578 00:21:35,170 --> 00:21:37,689 has this __global__ keyword 579 00:21:37,690 --> 00:21:39,819 in front of its definition, 580 00:21:39,820 --> 00:21:42,249 with the double underscores there.
581 00:21:42,250 --> 00:21:44,769 And then in line eight and nine, 582 00:21:44,770 --> 00:21:47,079 it somehow tries to calculate 583 00:21:47,080 --> 00:21:48,190 a row and a column, 584 00:21:49,780 --> 00:21:52,269 and it does this and requests 585 00:21:52,270 --> 00:21:55,179 something like a block ID and 586 00:21:55,180 --> 00:21:57,429 block dimension and the thread 587 00:21:57,430 --> 00:21:58,430 ID. 588 00:21:59,350 --> 00:22:01,509 Obviously, we are talking later about 589 00:22:01,510 --> 00:22:03,759 this. You have graphics cards with 590 00:22:03,760 --> 00:22:06,609 a lot of parallel processing units. 591 00:22:06,610 --> 00:22:07,569 There are a lot of them. 592 00:22:07,570 --> 00:22:09,879 So you need to number them somehow. 593 00:22:09,880 --> 00:22:12,189 And there are different ways of doing 594 00:22:12,190 --> 00:22:14,739 that. You can do this in one dimension 595 00:22:14,740 --> 00:22:16,309 or in two or in three dimensions. 596 00:22:16,310 --> 00:22:19,149 And this is how you compute 597 00:22:19,150 --> 00:22:21,669 the global index 598 00:22:21,670 --> 00:22:24,039 of your respective 599 00:22:24,040 --> 00:22:25,479 kernel launch. 600 00:22:25,480 --> 00:22:27,609 And then you may need to consider what 601 00:22:27,610 --> 00:22:29,319 you're going to do depending on 602 00:22:29,320 --> 00:22:31,839 which position you are in this process, 603 00:22:31,840 --> 00:22:33,879 which happens here because it uses this 604 00:22:33,880 --> 00:22:36,069 computed row and col 605 00:22:36,070 --> 00:22:38,169 to define which part of 606 00:22:38,170 --> 00:22:40,299 the matrix C it is going to 607 00:22:40,300 --> 00:22:41,300 compute.
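
The row and column arithmetic described above can be tried out on the host in plain C. This is a sketch under stated assumptions, not the kernel from the slide: `idx3`, `global_row`, `global_col` and `matmul_element` are hypothetical names, and the three arguments stand in for the values a CUDA kernel reads from its built-in `blockIdx`, `blockDim` and `threadIdx` variables.

```c
#include <assert.h>

/* Stand-in for the three-component index values a CUDA kernel sees. */
typedef struct { unsigned x, y, z; } idx3;

/* Global row: blocks before this one in y, times the block height,
 * plus the thread's own offset inside its block. */
unsigned global_row(idx3 block_idx, idx3 block_dim, idx3 thread_idx)
{
    assert(thread_idx.y < block_dim.y);
    return block_idx.y * block_dim.y + thread_idx.y;
}

/* Global column: the same arithmetic along x. */
unsigned global_col(idx3 block_idx, idx3 block_dim, idx3 thread_idx)
{
    assert(thread_idx.x < block_dim.x);
    return block_idx.x * block_dim.x + thread_idx.x;
}

/* Work of one thread: a single element of c = a * b (n x n, row-major). */
void matmul_element(const double *a, const double *b, double *c,
                    unsigned n, unsigned row, unsigned col)
{
    double sum = 0.0;
    if (row >= n || col >= n)
        return; /* this thread lies outside the matrix */
    for (unsigned k = 0; k < n; k++)
        sum += a[row * n + k] * b[k * n + col];
    c[row * n + col] = sum;
}
```

Swapping `x` and `y` here is exactly the kind of index mix-up that pen and paper help to catch, because the result still looks plausible for square matrices.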
608 00:22:42,630 --> 00:22:45,449 So now I have to warn you, my work PC 609 00:22:45,450 --> 00:22:48,089 didn't like my linker, 610 00:22:48,090 --> 00:22:51,179 so I wasn't able to test this, 611 00:22:51,180 --> 00:22:53,249 but you can find a better example on 612 00:22:53,250 --> 00:22:55,499 CUDA, so I may have switched 613 00:22:55,500 --> 00:22:57,569 the indices. I can't guarantee. 614 00:22:57,570 --> 00:22:58,589 This is really tricky. 615 00:22:58,590 --> 00:23:01,049 This is the worst part of parallel programming. 616 00:23:03,120 --> 00:23:05,579 OK, but we also 617 00:23:05,580 --> 00:23:08,309 have to initialize, 618 00:23:08,310 --> 00:23:10,379 on our host, in our main function, our 619 00:23:10,380 --> 00:23:12,569 matrix, and then we have to, 620 00:23:12,570 --> 00:23:14,789 in line nine to 11, we have to 621 00:23:14,790 --> 00:23:16,949 allocate memory on the device and 622 00:23:16,950 --> 00:23:18,909 then we have to move the, you know, 623 00:23:18,910 --> 00:23:21,149 the matrices A and B to 624 00:23:21,150 --> 00:23:22,799 the device by hand. 625 00:23:22,800 --> 00:23:24,239 This is CUDA style. 626 00:23:24,240 --> 00:23:26,459 You know, we are explicit here. 627 00:23:26,460 --> 00:23:29,039 And then you have to define, 628 00:23:29,040 --> 00:23:31,319 in line 17, 18, your 629 00:23:31,320 --> 00:23:34,019 way how to figure out 630 00:23:34,020 --> 00:23:36,599 what each kernel launch has to do. 631 00:23:36,600 --> 00:23:38,759 So you need to say in 632 00:23:38,760 --> 00:23:40,889 how many dimensions you 633 00:23:40,890 --> 00:23:43,169 define the indices you're working 634 00:23:43,170 --> 00:23:45,389 with. And this is done with this dimGrid 635 00:23:45,390 --> 00:23:47,459 and dimBlock thing. And as 636 00:23:47,460 --> 00:23:49,919 you can see, my dimGrid is just 637 00:23:49,920 --> 00:23:51,989 one, one, one, but my dimBlock is 638 00:23:51,990 --> 00:23:54,179 exactly the size of my 639 00:23:54,180 --> 00:23:55,380 output C.
640 00:23:56,620 --> 00:23:58,809 Just in two dimensions, 641 00:23:58,810 --> 00:24:00,009 not so nice. 642 00:24:00,010 --> 00:24:02,409 We'll talk about that a little bit. 643 00:24:02,410 --> 00:24:04,569 So this code is not safe 644 00:24:04,570 --> 00:24:07,839 for work if your array is too big. 645 00:24:07,840 --> 00:24:09,939 OK, but when you have decided 646 00:24:09,940 --> 00:24:12,189 how you would like to 647 00:24:12,190 --> 00:24:14,169 number the threads that are running on 648 00:24:14,170 --> 00:24:16,119 your graphics card, for example, 649 00:24:16,120 --> 00:24:18,369 you have to launch the kernel and inform 650 00:24:18,370 --> 00:24:20,529 the kernel how it's going to 651 00:24:20,530 --> 00:24:23,019 be addressed. You know, you tell it, 652 00:24:23,020 --> 00:24:25,319 how is the numbering of the respective 653 00:24:25,320 --> 00:24:26,439 kernels going to be? 654 00:24:26,440 --> 00:24:28,059 And you just put in the input, the 655 00:24:28,060 --> 00:24:30,429 matrices to be multiplied, and then 656 00:24:30,430 --> 00:24:31,959 the kernel starts to compute the 657 00:24:31,960 --> 00:24:33,939 respective index and so on. 658 00:24:33,940 --> 00:24:35,589 I think I have a summary slide of that, 659 00:24:35,590 --> 00:24:37,719 too. OK, after you've done 660 00:24:37,720 --> 00:24:39,759 your computation, you need to synchronize 661 00:24:39,760 --> 00:24:40,989 again. You need to 662 00:24:42,280 --> 00:24:44,349 fetch the memory from the device 663 00:24:44,350 --> 00:24:46,449 to the host and then you do something 664 00:24:46,450 --> 00:24:48,339 with it for your stuff. 665 00:24:48,340 --> 00:24:49,569 Yeah. 666 00:24:49,570 --> 00:24:51,459 So if you summarize, the important parts 667 00:24:51,460 --> 00:24:52,899 are here in CUDA.
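
One common way around the "too big for a single block" problem mentioned above is to fix the block edge and launch enough blocks to cover the matrix. The host-side arithmetic can be sketched in plain C as below. This is an illustration, not the talk's code: `blocks_needed` and `surplus_threads` are hypothetical helper names and the block edge of 16 in the usage note is an arbitrary choice. NVIDIA GPUs cap the number of threads per block (1024 on recent generations), which is why one block the size of C stops working for larger matrices.

```c
#include <assert.h>

/* Number of blocks needed to cover n elements with block_size threads
 * per block along one dimension (ceiling division). */
unsigned blocks_needed(unsigned n, unsigned block_size)
{
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;
}

/* Threads launched beyond the edge of an n x n matrix when a square grid
 * of square blocks is used. A bounds check in the kernel must skip them. */
unsigned long surplus_threads(unsigned n, unsigned block_size)
{
    unsigned long side = (unsigned long)blocks_needed(n, block_size) * block_size;
    return side * side - (unsigned long)n * n;
}
```

For a 1000 x 1000 matrix and 16 x 16 blocks this gives 63 blocks per dimension, so the grid and block dimensions passed to the kernel launch would be 63 x 63 and 16 x 16.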
668 00:24:52,900 --> 00:24:54,789 First of all, you have your kernel 669 00:24:54,790 --> 00:24:57,849 there with your indices computation: 670 00:24:57,850 --> 00:25:00,189 very relevant and very 671 00:25:00,190 --> 00:25:01,190 problematic. 672 00:25:02,110 --> 00:25:05,079 And then you have in your main function 673 00:25:05,080 --> 00:25:07,059 the question how you define your 674 00:25:07,060 --> 00:25:09,189 dimensions, how many threads you need to 675 00:25:09,190 --> 00:25:11,589 launch and how you address 676 00:25:11,590 --> 00:25:14,049 them. So you have to 677 00:25:14,050 --> 00:25:17,109 fit line 19, 20 to line 678 00:25:17,110 --> 00:25:18,110 three and four. 679 00:25:18,990 --> 00:25:21,149 This must fit, and this is 680 00:25:21,150 --> 00:25:23,249 a case for pen and paper, 681 00:25:23,250 --> 00:25:25,139 and then you launch your threads and the 682 00:25:25,140 --> 00:25:26,939 kernel, and then you synchronize the 683 00:25:26,940 --> 00:25:28,799 threads to make sure that you are done 684 00:25:28,800 --> 00:25:29,969 with the computation. 685 00:25:29,970 --> 00:25:31,830 This is basically CUDA style. 686 00:25:34,640 --> 00:25:37,130 But there is 687 00:25:38,300 --> 00:25:40,339 this is the installation. I cannot 688 00:25:40,340 --> 00:25:41,959 guarantee you for the linker command; it 689 00:25:41,960 --> 00:25:43,729 should work like this, but I couldn't 690 00:25:43,730 --> 00:25:45,949 reach my sysadmin at work, work 691 00:25:45,950 --> 00:25:46,879 I have heard. 692 00:25:46,880 --> 00:25:49,129 And at home I have an OpenCL card 693 00:25:49,130 --> 00:25:51,529 and my work configuration 694 00:25:51,530 --> 00:25:53,249 of I think so. 695 00:25:53,250 --> 00:25:55,369 Now comes the other version of programming 696 00:25:55,370 --> 00:25:57,499 accelerator products, which is called 697 00:25:57,500 --> 00:25:58,549 OpenCL.
698 00:25:58,550 --> 00:26:00,679 For this I used the nice code 699 00:26:00,680 --> 00:26:02,839 from the source I linked 700 00:26:02,840 --> 00:26:04,819 here, but he can't come today. 701 00:26:04,820 --> 00:26:06,979 Very sad. I asked him about it. 702 00:26:06,980 --> 00:26:09,079 Um, and it's actually just a 703 00:26:09,080 --> 00:26:11,209 vector addition, but you will not 704 00:26:11,210 --> 00:26:12,619 really miss it. 705 00:26:12,620 --> 00:26:14,689 OpenCL is a language that is 706 00:26:14,690 --> 00:26:17,329 developed by NVIDIA, ARM, 707 00:26:17,330 --> 00:26:19,399 AMD, et cetera, so 708 00:26:19,400 --> 00:26:21,859 that it will fit to a lot of devices, 709 00:26:21,860 --> 00:26:24,679 basically devices that have a core. 710 00:26:24,680 --> 00:26:27,769 So it is very flexible 711 00:26:27,770 --> 00:26:29,869 about the device it has to run on, 712 00:26:29,870 --> 00:26:32,059 which means you have to tell everything 713 00:26:32,060 --> 00:26:34,369 to the programming language, 714 00:26:34,370 --> 00:26:36,499 to the kernel, so 715 00:26:36,500 --> 00:26:39,199 that it knows how to execute. 716 00:26:39,200 --> 00:26:41,509 First of all, obviously, you have to 717 00:26:41,510 --> 00:26:43,729 allocate your memory and put some stuff 718 00:26:43,730 --> 00:26:44,839 in it. 719 00:26:44,840 --> 00:26:47,779 Then, interestingly, 720 00:26:47,780 --> 00:26:49,669 you have to read the file that your 721 00:26:49,670 --> 00:26:52,609 kernel is programmed in 722 00:26:52,610 --> 00:26:54,319 into your system. 723 00:26:55,760 --> 00:26:59,119 And then on line 14, 724 00:26:59,120 --> 00:27:01,519 you have to figure out on which device 725 00:27:01,520 --> 00:27:03,319 you are actually running and, 726 00:27:04,660 --> 00:27:06,759 yeah, the ID 727 00:27:06,760 --> 00:27:09,309 stuff, and then you have this device 728 00:27:09,310 --> 00:27:11,439 ID
thing, which is actually kind 729 00:27:11,440 --> 00:27:13,569 of interesting in line twenty. I 730 00:27:13,570 --> 00:27:15,519 will repeat this line. 731 00:27:15,520 --> 00:27:17,949 They don't talk to more senses 732 00:27:17,950 --> 00:27:18,789 about us. 733 00:27:18,790 --> 00:27:21,429 And then you have to create an OpenCL 734 00:27:21,430 --> 00:27:24,069 context, put everything together, 735 00:27:24,070 --> 00:27:27,069 and then you have to 736 00:27:27,070 --> 00:27:28,419 start a command queue. 737 00:27:28,420 --> 00:27:30,759 You have to then write 738 00:27:30,760 --> 00:27:32,889 over the memory and memory for 739 00:27:32,890 --> 00:27:33,890 it. 740 00:27:34,760 --> 00:27:36,919 Then you have 741 00:27:36,920 --> 00:27:39,439 to create the program, 742 00:27:40,820 --> 00:27:43,009 then you have to, I have to read 743 00:27:43,010 --> 00:27:44,599 it up, you have to build the program and 744 00:27:44,600 --> 00:27:47,809 then you have to create the OpenCL kernel, 745 00:27:47,810 --> 00:27:50,809 you know, I mean, don't mix it up, 746 00:27:50,810 --> 00:27:53,299 then you have to set the arguments 747 00:27:53,300 --> 00:27:55,369 of the kernel because it doesn't know them, 748 00:27:55,370 --> 00:27:56,479 apparently. 749 00:27:56,480 --> 00:27:58,669 And then you have to inform 750 00:27:58,670 --> 00:28:00,739 the system how big 751 00:28:00,740 --> 00:28:02,899 of chunks, probably, you 752 00:28:02,900 --> 00:28:05,179 have to calculate. 753 00:28:05,180 --> 00:28:07,939 So you have to inform it how 754 00:28:07,940 --> 00:28:10,279 big the overall size of your vector 755 00:28:10,280 --> 00:28:12,529 in this case is, or of your matrix, 756 00:28:12,530 --> 00:28:14,149 and you have to tell it how many 757 00:28:14,150 --> 00:28:16,249 pieces one thread should execute 758 00:28:16,250 --> 00:28:17,509 when it's started. 759 00:28:17,510 --> 00:28:18,510 And then you launch the kernel. 760 00:28:19,610 --> 00:28:20,610 Great.
761 00:28:23,960 --> 00:28:26,509 Now we have launched the kernel, 762 00:28:26,510 --> 00:28:28,849 we have to get the memory back, 763 00:28:28,850 --> 00:28:31,669 so you enqueue read buffer 764 00:28:31,670 --> 00:28:34,099 and then you need to delete 765 00:28:34,100 --> 00:28:36,109 everything and release everything, which 766 00:28:36,110 --> 00:28:37,269 is also a lot of code. 767 00:28:38,510 --> 00:28:39,510 So. 768 00:28:40,500 --> 00:28:42,540 Yep, as you can see. 769 00:28:44,390 --> 00:28:46,579 OpenCL needs a lot of code. 770 00:28:47,860 --> 00:28:49,989 But because you can configure so 771 00:28:49,990 --> 00:28:52,149 much, you can use it 772 00:28:52,150 --> 00:28:53,979 on a lot more platforms than just 773 00:28:53,980 --> 00:28:54,980 NVIDIA graphics cards. 774 00:28:55,990 --> 00:28:58,299 The very interesting thing about it is 775 00:28:58,300 --> 00:29:00,459 actually that you, in line 776 00:29:00,460 --> 00:29:02,769 seven, can switch this device 777 00:29:02,770 --> 00:29:04,899 type GPU to CPU and 778 00:29:04,900 --> 00:29:06,789 then you can execute your program without 779 00:29:06,790 --> 00:29:09,339 any problems on your multicore system. 780 00:29:09,340 --> 00:29:11,499 And then you can use your normal debugger 781 00:29:11,500 --> 00:29:13,599 for debugging it, which 782 00:29:13,600 --> 00:29:14,949 is quite nice. 783 00:29:14,950 --> 00:29:16,749 And you just have to change this line. 784 00:29:20,970 --> 00:29:23,069 So, so. 785 00:29:25,230 --> 00:29:27,449 You obviously again have to 786 00:29:27,450 --> 00:29:29,759 install the Khronos OpenCL. 787 00:29:29,760 --> 00:29:31,949 Khronos is the group in which 788 00:29:31,950 --> 00:29:34,469 NVIDIA and ARM and AMD 789 00:29:34,470 --> 00:29:36,779 and others are developing this OpenCL 790 00:29:36,780 --> 00:29:38,849 language, and you have 791 00:29:38,850 --> 00:29:40,930 to compile it; it is already in the GCC. 792 00:29:42,660 --> 00:29:44,430 Yeah, so.
793 00:29:46,030 --> 00:29:47,640 The next slide is going to be 794 00:29:51,310 --> 00:29:53,379 this one. So in summary, we always 795 00:29:53,380 --> 00:29:55,509 have the same code, but in 796 00:29:55,510 --> 00:29:57,459 reality it may look a little bit 797 00:29:57,460 --> 00:29:59,889 different. So you have to define your 798 00:29:59,890 --> 00:30:02,259 accelerator device either explicitly 799 00:30:02,260 --> 00:30:04,449 or implicitly, and then you have to 800 00:30:04,450 --> 00:30:06,639 move the memory, launch the kernel 801 00:30:06,640 --> 00:30:08,709 and tell how many launches it should 802 00:30:08,710 --> 00:30:11,829 have and how much data it should process 803 00:30:11,830 --> 00:30:13,959 in one launch. And then you have to clean 804 00:30:13,960 --> 00:30:15,999 up with more or less code. 805 00:30:16,000 --> 00:30:18,429 You can wrap a lot of OpenCL in special 806 00:30:18,430 --> 00:30:20,559 functions so it stays 807 00:30:20,560 --> 00:30:22,359 flexible and looks a little bit 808 00:30:22,360 --> 00:30:23,349 cleaner. 809 00:30:23,350 --> 00:30:25,120 But when it's about cleanliness, 810 00:30:26,170 --> 00:30:28,539 we could also look at 811 00:30:28,540 --> 00:30:30,639 OpenACC, and 812 00:30:30,640 --> 00:30:32,709 this is the new standard situated 813 00:30:32,710 --> 00:30:34,179 next to OpenMP. 814 00:30:34,180 --> 00:30:36,039 So if you can remember, OpenMP was 815 00:30:36,040 --> 00:30:37,749 about this pragma stuff. 816 00:30:37,750 --> 00:30:39,249 OpenACC: 817 00:30:39,250 --> 00:30:41,359 it's a similar thing, just 818 00:30:41,360 --> 00:30:42,819 for our accelerators. 819 00:30:42,820 --> 00:30:45,189 They are aiming to fuse 820 00:30:45,190 --> 00:30:47,529 it with OpenMP.
821 00:30:47,530 --> 00:30:49,629 But it's not really working 822 00:30:49,630 --> 00:30:52,149 right now. But because it's short 823 00:30:52,150 --> 00:30:54,969 and you can then see your actual code, 824 00:30:54,970 --> 00:30:56,919 I will show it to you as well. 825 00:30:56,920 --> 00:30:59,079 This was the matrix multiplication in 826 00:30:59,080 --> 00:31:00,080 OpenMP. 827 00:31:00,960 --> 00:31:03,219 And you can see there is this parallel 828 00:31:03,220 --> 00:31:05,649 shared, and this for 829 00:31:05,650 --> 00:31:07,929 loop pragma thing. 830 00:31:07,930 --> 00:31:09,549 And then we take this one 831 00:31:10,900 --> 00:31:13,419 and we convert it to OpenACC. 832 00:31:14,890 --> 00:31:17,139 We just, yeah, switch the omp 833 00:31:17,140 --> 00:31:18,140 to acc. 834 00:31:19,650 --> 00:31:21,749 And we are pretty much 835 00:31:21,750 --> 00:31:23,909 done here, again, 836 00:31:23,910 --> 00:31:26,729 in this code piece, 837 00:31:26,730 --> 00:31:28,859 like in OpenMP. And OpenACC 838 00:31:28,860 --> 00:31:30,989 is actually compiled 839 00:31:30,990 --> 00:31:33,059 to OpenCL code or, depending 840 00:31:33,060 --> 00:31:34,319 on the compiler, to CUDA code 841 00:31:35,340 --> 00:31:36,479 when you compile it. 842 00:31:36,480 --> 00:31:38,669 So this thing will be translated 843 00:31:38,670 --> 00:31:40,739 into OpenCL code, hopefully 844 00:31:40,740 --> 00:31:42,869 fitting to 845 00:31:42,870 --> 00:31:44,939 your accelerator device, which you 846 00:31:44,940 --> 00:31:47,219 have to define on your compiler line, 847 00:31:47,220 --> 00:31:49,319 which means you can compile it for 848 00:31:49,320 --> 00:31:51,449 a lot of devices instead 849 00:31:51,450 --> 00:31:53,609 of for the one that 850 00:31:53,610 --> 00:31:55,679 you had in mind, which is in most 851 00:31:55,680 --> 00:31:57,959 cases better because you can be more 852 00:31:57,960 --> 00:31:59,339 flexible in your development.
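
The OpenMP to OpenACC switch described above might look like the following in C. This is a minimal sketch rather than the slide's code: the function name `matmul_acc` and the flat row-major layout are assumptions, and the `copyin`/`copyout` clauses are added because with plain pointers the compiler needs to be told how much data to move to and from the device.

```c
#include <assert.h>

/* Hypothetical helper: c = a * b for square n x n matrices stored
 * row-major in flat arrays. */
void matmul_acc(const double *a, const double *b, double *c, int n)
{
    assert(n > 0);

    /* Offload the outer loop; copy a and b to the device before the loop
     * and c back to the host afterwards. */
    #pragma acc parallel loop copyin(a[0:n*n], b[0:n*n]) copyout(c[0:n*n])
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

A compiler without OpenACC support ignores the pragma and runs the loop on the host, so the same source can be checked for correctness first and offloaded later.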
853 00:32:01,930 --> 00:32:04,089 After OpenCL, I think this is 854 00:32:04,090 --> 00:32:05,440 kind of relaxing, 855 00:32:07,780 --> 00:32:10,089 and it's quite interesting: 856 00:32:10,090 --> 00:32:12,519 you can get OpenACC 857 00:32:12,520 --> 00:32:14,919 currently only in 858 00:32:14,920 --> 00:32:16,989 compilers for which you have to buy 859 00:32:16,990 --> 00:32:18,429 some licenses. 860 00:32:18,430 --> 00:32:20,559 But since November, there is an 861 00:32:20,560 --> 00:32:22,029 e-mail exchange on the GCC 862 00:32:22,030 --> 00:32:24,009 mailing list that apparently OpenACC 863 00:32:24,010 --> 00:32:26,709 is moving into GCC 864 00:32:26,710 --> 00:32:29,409 in the next 12 to 15 months; 865 00:32:29,410 --> 00:32:31,179 the first patch is already submitted. 866 00:32:31,180 --> 00:32:33,249 So we can hope that it is available 867 00:32:33,250 --> 00:32:34,269 for, 868 00:32:34,270 --> 00:32:36,249 yeah, people who don't have a lot of 869 00:32:36,250 --> 00:32:38,559 money for buying all these compiler 870 00:32:38,560 --> 00:32:40,299 licenses that we apparently need. 871 00:32:41,820 --> 00:32:43,619 OK, so. 872 00:32:45,900 --> 00:32:47,459 This is OpenACC. 873 00:32:48,660 --> 00:32:51,119 But the world does not only consist 874 00:32:51,120 --> 00:32:53,669 of stupid matrix multiplication 875 00:32:53,670 --> 00:32:55,829 code like the one I showed you; 876 00:32:55,830 --> 00:32:58,259 it actually is a bit more difficult 877 00:32:58,260 --> 00:33:00,449 because your matrix could be too 878 00:33:00,450 --> 00:33:01,919 big for the device. 879 00:33:01,920 --> 00:33:03,209 So you actually don't load the whole 880 00:33:03,210 --> 00:33:04,299 matrix into it.
881 00:33:04,300 --> 00:33:05,849 You just load tiles into it 882 00:33:07,680 --> 00:33:10,019 and then you have to walk through 883 00:33:10,020 --> 00:33:12,089 your matrix like 884 00:33:12,090 --> 00:33:14,309 this, so you move from one place to another 885 00:33:14,310 --> 00:33:16,499 place and 886 00:33:16,500 --> 00:33:18,659 be a bit more brandee about 887 00:33:18,660 --> 00:33:19,660 it. 888 00:33:21,920 --> 00:33:23,139 Exactly. 889 00:33:23,140 --> 00:33:25,249 Um, so you have to be a 890 00:33:25,250 --> 00:33:27,469 bit more specific about your threads 891 00:33:27,470 --> 00:33:28,669 and be careful. 892 00:33:28,670 --> 00:33:30,229 And this is also a part where you 893 00:33:30,230 --> 00:33:31,579 need pen and paper when you start to 894 00:33:31,580 --> 00:33:33,319 program it, because you can mess up so 895 00:33:33,320 --> 00:33:34,879 many things. 896 00:33:34,880 --> 00:33:36,949 And there's another 897 00:33:36,950 --> 00:33:37,950 problem with it. 898 00:33:42,170 --> 00:33:44,269 And this is, in CUDA terms, it's 899 00:33:44,270 --> 00:33:45,589 called warp divergence, 900 00:33:46,760 --> 00:33:49,099 because actually, if you 901 00:33:49,100 --> 00:33:51,409 at least look at the graphics cards by 902 00:33:51,410 --> 00:33:53,840 AMD and NVIDIA, you can see that... 903 00:33:57,330 --> 00:33:59,489 You may say that they are 904 00:33:59,490 --> 00:34:00,490 hardware... 905 00:34:01,760 --> 00:34:04,409 How do you say that? They are threads 906 00:34:04,410 --> 00:34:06,149 combined by hardware. 907 00:34:06,150 --> 00:34:08,488 So you define the amount 908 00:34:08,489 --> 00:34:10,769 of threads that work together, that 909 00:34:10,770 --> 00:34:13,319 launch simultaneously, by 910 00:34:13,320 --> 00:34:15,569 hand, in your code, 911 00:34:15,570 --> 00:34:17,999 in your main program, when you define the 912 00:34:18,000 --> 00:34:20,069 dimensions of the grid and the block.
913 00:34:20,070 --> 00:34:22,169 But actually, NVIDIA as well 914 00:34:22,170 --> 00:34:24,388 as AMD also have a hardware 915 00:34:24,389 --> 00:34:26,879 encoding of the sizes of these. 916 00:34:26,880 --> 00:34:30,149 And in NVIDIA it's 32. 917 00:34:30,150 --> 00:34:32,729 This means that they will always 918 00:34:32,730 --> 00:34:35,789 be executed in 919 00:34:35,790 --> 00:34:37,529 blocks of 32 threads. 920 00:34:37,530 --> 00:34:39,599 And you should choose a block size or 921 00:34:39,600 --> 00:34:41,369 a grid dimension size of thirty-two or, 922 00:34:41,370 --> 00:34:43,948 you know, multiples of thirty-two. 923 00:34:43,949 --> 00:34:45,988 This also means that probably your tile 924 00:34:45,989 --> 00:34:48,059 doesn't fit when you do a tiled 925 00:34:48,060 --> 00:34:50,339 matrix multiplication to your matrix. 926 00:34:50,340 --> 00:34:51,928 So you have an overlap and now you need 927 00:34:51,929 --> 00:34:53,638 to think about what you do when you have 928 00:34:53,639 --> 00:34:55,919 an overlap, because 929 00:34:55,920 --> 00:34:58,079 when you have an if-else branch in 930 00:34:58,080 --> 00:35:00,179 a warp, this means 931 00:35:00,180 --> 00:35:02,339 that 932 00:35:02,340 --> 00:35:04,889 every thread in this warp will execute 933 00:35:04,890 --> 00:35:07,089 first the first 934 00:35:07,090 --> 00:35:09,029 branch, the if branch, and then the else 935 00:35:09,030 --> 00:35:10,229 branch. 936 00:35:10,230 --> 00:35:12,299 So the ones that have something to 937 00:35:12,300 --> 00:35:14,249 do in this branch will work, the others 938 00:35:14,250 --> 00:35:16,409 will sleep, and then the others 939 00:35:16,410 --> 00:35:17,969 that have something to do in the 940 00:35:17,970 --> 00:35:19,799 else branch, they will work and the others, 941 00:35:19,800 --> 00:35:22,139 again, will sleep. You will lose 942 00:35:22,140 --> 00:35:24,269 some efficiency.
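
The cost of an if-else inside a warp can be illustrated with a toy model in plain C. This is a simplified sketch, not how real hardware is measured: `branch_passes` and `lane_utilization` are hypothetical names, each side of the branch is assumed to take the same time, and `WARP_SIZE` is the value of 32 quoted above for NVIDIA.

```c
#include <assert.h>

enum { WARP_SIZE = 32 }; /* threads executed together, as quoted in the talk */

/* Passes a warp needs for one if-else: 1 if all lanes take the same side,
 * 2 if both sides are taken by at least one lane each. */
int branch_passes(const int takes_if[WARP_SIZE])
{
    int if_lanes = 0;
    assert(takes_if != 0);
    for (int lane = 0; lane < WARP_SIZE; lane++)
        if (takes_if[lane])
            if_lanes++;
    return (if_lanes == 0 || if_lanes == WARP_SIZE) ? 1 : 2;
}

/* Fraction of lane slots doing useful work over those passes, under the
 * assumption that both sides of the branch cost the same. */
double lane_utilization(const int takes_if[WARP_SIZE])
{
    return 1.0 / branch_passes(takes_if);
}
```

In this model a single lane taking the other side is enough to halve the utilization of the whole warp, which is why the edge tiles of a matrix deserve special attention.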
943 00:35:24,270 --> 00:35:25,799 So you should think about 944 00:35:25,800 --> 00:35:28,379 probably, in this case, adding zeros 945 00:35:28,380 --> 00:35:30,779 in the pure 946 00:35:30,780 --> 00:35:32,909 blue parts, or 947 00:35:32,910 --> 00:35:35,069 somehow figuring 948 00:35:35,070 --> 00:35:37,259 out warp divergence, which means 949 00:35:37,260 --> 00:35:39,329 if branches are not nice; 950 00:35:39,330 --> 00:35:42,059 it's better to recompute than to do if 951 00:35:42,060 --> 00:35:43,279 branches. 952 00:35:43,280 --> 00:35:45,599 This also holds true for not just 953 00:35:45,600 --> 00:35:47,999 warps but for the whole block. 954 00:35:48,000 --> 00:35:50,159 They can just work through one 955 00:35:50,160 --> 00:35:52,229 instruction with a lot of data, so 956 00:35:52,230 --> 00:35:54,059 they have to work through each line and 957 00:35:54,060 --> 00:35:55,649 some of them will be idle. 958 00:35:55,650 --> 00:35:58,049 So if you choose the block size 959 00:35:58,050 --> 00:35:59,999 not very wisely, 960 00:36:00,000 --> 00:36:02,129 you may have a lot of sleeping 961 00:36:02,130 --> 00:36:03,989 threads the whole time, and your device 962 00:36:03,990 --> 00:36:05,369 seems to be busy. 963 00:36:05,370 --> 00:36:06,749 You should look out for that. 964 00:36:06,750 --> 00:36:08,489 And there's another thing which I 965 00:36:08,490 --> 00:36:11,329 didn't cover in my code, actually. 966 00:36:11,330 --> 00:36:11,969 For example, 967 00:36:11,970 --> 00:36:14,699 NVIDIA GPUs have 968 00:36:14,700 --> 00:36:16,979 a global, constant and texture 969 00:36:16,980 --> 00:36:18,989 memory, which is accessible for 970 00:36:18,990 --> 00:36:20,069 everybody. 971 00:36:20,070 --> 00:36:21,929 And then they have shared memory, which 972 00:36:21,930 --> 00:36:23,819 is just accessible for the ones which are 973 00:36:23,820 --> 00:36:26,759 organized in one big block.
974 00:36:26,760 --> 00:36:28,199 They are grouped together, the warps to 975 00:36:28,200 --> 00:36:30,299 blocks, so they can access the same 976 00:36:30,300 --> 00:36:31,259 shared memory. 977 00:36:31,260 --> 00:36:33,209 And obviously it's way faster to access 978 00:36:33,210 --> 00:36:35,159 the shared memory than to access every time 979 00:36:35,160 --> 00:36:36,569 the global memory. 980 00:36:36,570 --> 00:36:38,939 So you first load the 981 00:36:38,940 --> 00:36:40,649 piece of the matrix into your global 982 00:36:40,650 --> 00:36:42,629 memory and then you load the specific 983 00:36:42,630 --> 00:36:44,429 tile of your matrix into your shared 984 00:36:44,430 --> 00:36:46,559 memory. And this is the speed-up you 985 00:36:46,560 --> 00:36:48,029 should aim for. 986 00:36:48,030 --> 00:36:50,189 But this can be, depending on what 987 00:36:50,190 --> 00:36:51,989 you actually do instead of matrix 988 00:36:51,990 --> 00:36:54,509 multiplication, tricky, 989 00:36:54,510 --> 00:36:56,639 and how 990 00:36:56,640 --> 00:36:58,379 much speed-up you can get 991 00:36:58,380 --> 00:37:00,329 changes with every generation of, for 992 00:37:00,330 --> 00:37:01,799 example, the NVIDIA cards. 993 00:37:01,800 --> 00:37:03,569 So probably the speed-up you got with the 994 00:37:03,570 --> 00:37:05,699 last generation card is not going to 995 00:37:05,700 --> 00:37:07,109 work with the next one because they have 996 00:37:07,110 --> 00:37:09,359 better memory access, which 997 00:37:09,360 --> 00:37:11,399 is why I am actually rooting for 998 00:37:11,400 --> 00:37:12,400 OpenACC. 999 00:37:14,080 --> 00:37:16,449 So just 1000 00:37:16,450 --> 00:37:18,819 an information for the ones 1001 00:37:18,820 --> 00:37:20,650 that are looking into these programs: 1002 00:37:22,840 --> 00:37:25,299 they do not have very nice 1003 00:37:25,300 --> 00:37:26,300 naming.
1004 00:37:27,190 --> 00:37:29,829 They have decided that in CUDA, a thread 1005 00:37:29,830 --> 00:37:32,139 is a thread and you can combine 1006 00:37:32,140 --> 00:37:33,610 threads into a block. 1007 00:37:34,850 --> 00:37:37,549 Actually, blocks consist of warps, 1008 00:37:37,550 --> 00:37:40,099 based on the hardware situation, but, 1009 00:37:40,100 --> 00:37:42,409 you know, threads in a group, 1010 00:37:42,410 --> 00:37:44,509 these are blocks, and then you can order 1011 00:37:44,510 --> 00:37:46,819 blocks into a grid. Fine. 1012 00:37:46,820 --> 00:37:49,099 OpenCL has decided that 1013 00:37:49,100 --> 00:37:50,269 threads are actually work-items, 1014 00:37:50,270 --> 00:37:52,429 and work-items can 1015 00:37:52,430 --> 00:37:54,709 be grouped in work-groups. 1016 00:37:54,710 --> 00:37:57,199 They don't know a grid. And OpenACC 1017 00:37:57,200 --> 00:38:00,049 thought: yeah, well, we need to somehow 1018 00:38:00,050 --> 00:38:02,239 accept every definition, so 1019 00:38:02,240 --> 00:38:04,159 we do a vector, which is a thread. 1020 00:38:04,160 --> 00:38:06,289 Fine. And then we somehow say 1021 00:38:06,290 --> 00:38:08,329 there are workers and then there is a 1022 00:38:08,330 --> 00:38:09,799 gang. 1023 00:38:09,800 --> 00:38:11,779 And you should at the moment never use 1024 00:38:11,780 --> 00:38:13,879 all three ways of 1025 00:38:13,880 --> 00:38:15,619 defining how your kernels should launch, 1026 00:38:15,620 --> 00:38:17,839 just two, please. 1027 00:38:17,840 --> 00:38:18,840 So. 1028 00:38:20,010 --> 00:38:22,409 It could get a bit tricky 1029 00:38:22,410 --> 00:38:24,389 if you switch between the various 1030 00:38:24,390 --> 00:38:26,609 documentations. Just imagine you have 1031 00:38:26,610 --> 00:38:28,529 to count through the amount of threads 1032 00:38:28,530 --> 00:38:30,299 you are launching and you have to address 1033 00:38:30,300 --> 00:38:31,259 and find them again.
1034 00:38:31,260 --> 00:38:32,820 And so you have to 1035 00:38:34,440 --> 00:38:36,449 use pen and paper and find the fitting 1036 00:38:36,450 --> 00:38:38,210 words in the respective documentation. 1037 00:38:40,380 --> 00:38:41,380 So. 1038 00:38:42,350 --> 00:38:44,569 But we should cover... and how 1039 00:38:44,570 --> 00:38:45,679 many minutes do I have left? 1040 00:38:46,850 --> 00:38:47,929 Wonderful. 1041 00:38:47,930 --> 00:38:49,550 Um, this is great. 1042 00:38:50,570 --> 00:38:52,189 Hardware: you have heard a lot about 1043 00:38:52,190 --> 00:38:53,389 graphics cards. 1044 00:38:53,390 --> 00:38:55,639 And there's also 1045 00:38:55,640 --> 00:38:57,349 the talk about this Xeon Phi thing. 1046 00:38:58,400 --> 00:39:00,319 And then something a friend of mine 1047 00:39:00,320 --> 00:39:01,609 actually went ahead and recommended to 1048 00:39:01,610 --> 00:39:03,559 me, which is called Parallella. 1049 00:39:03,560 --> 00:39:06,049 And I will not cover FPGAs, 1050 00:39:06,050 --> 00:39:08,389 but to everybody who looks into 1051 00:39:08,390 --> 00:39:10,609 parallel 1052 00:39:10,610 --> 00:39:13,249 computation of algorithms 1053 00:39:13,250 --> 00:39:14,869 that they can encode into hardware: 1054 00:39:14,870 --> 00:39:17,209 FPGAs are so fast, you should 1055 00:39:17,210 --> 00:39:19,459 think about it, and they are cheap. 1056 00:39:19,460 --> 00:39:21,949 But if you need a CPU, 1057 00:39:21,950 --> 00:39:24,109 a flexible one, you need to look 1058 00:39:24,110 --> 00:39:25,819 into the three other ones. 1059 00:39:25,820 --> 00:39:28,039 First of all, you probably all know what 1060 00:39:28,040 --> 00:39:29,809 these graphics cards look like. 1061 00:39:31,430 --> 00:39:33,589 This is a very interesting picture 1062 00:39:33,590 --> 00:39:34,849 of a Tesla card.
1063 00:39:34,850 --> 00:39:36,979 Tesla cards could cost around two 1064 00:39:36,980 --> 00:39:38,689 thousand euros, so if you don't have 1065 00:39:38,690 --> 00:39:40,219 the money, I'm sorry, but you could 1066 00:39:40,220 --> 00:39:42,289 probably use the graphics card that you 1067 00:39:42,290 --> 00:39:43,909 already have in your computer, you know. 1068 00:39:44,960 --> 00:39:47,149 AMD cards look somehow similar. 1069 00:39:47,150 --> 00:39:49,549 They all have these green vector units, 1070 00:39:49,550 --> 00:39:51,949 they all have, which is 1071 00:39:51,950 --> 00:39:52,950 blue, 1072 00:39:55,290 --> 00:39:57,279 right, the caches, then you have a 1073 00:39:57,280 --> 00:39:59,189 dispatcher unit, and then they 1074 00:39:59,190 --> 00:40:01,079 are somehow communicating with the global 1075 00:40:01,080 --> 00:40:03,389 memory and fetching some memory. 1076 00:40:03,390 --> 00:40:05,579 And there I'm not talking about 1077 00:40:05,580 --> 00:40:07,499 the possibilities of reaching this and 1078 00:40:07,500 --> 00:40:09,629 that many gigaflops and so 1079 00:40:09,630 --> 00:40:12,029 many computations per second, because 1080 00:40:12,030 --> 00:40:14,159 reality shows that you actually need to 1081 00:40:14,160 --> 00:40:16,379 have a fitting problem to get to these 1082 00:40:16,380 --> 00:40:17,369 numbers. 1083 00:40:17,370 --> 00:40:19,859 If your problem doesn't fit, you 1084 00:40:19,860 --> 00:40:21,599 just don't need to care about the 1085 00:40:21,600 --> 00:40:23,399 gigaflops that your card could maximally 1086 00:40:23,400 --> 00:40:25,110 reach because you cannot reach them. 1087 00:40:26,160 --> 00:40:28,319 Because, for example, my protein is 1088 00:40:28,320 --> 00:40:29,549 not so big. 1089 00:40:29,550 --> 00:40:31,289 Then we have the Xeon Phi.
1090 00:40:31,290 --> 00:40:33,479 This is a very, very special, 1091 00:40:33,480 --> 00:40:35,699 different architecture. You don't 1092 00:40:35,700 --> 00:40:37,799 need CUDA or OpenCL 1093 00:40:37,800 --> 00:40:40,379 and you just need simple C or C++ 1094 00:40:40,380 --> 00:40:42,979 code because it's actually 1095 00:40:42,980 --> 00:40:45,239 consisting of sixty-one or more, 1096 00:40:45,240 --> 00:40:46,889 depending on what kind, 1097 00:40:48,180 --> 00:40:50,429 normal cores, each connected 1098 00:40:50,430 --> 00:40:51,869 with a cache. 1099 00:40:51,870 --> 00:40:53,459 And then, as you can see in 1100 00:40:53,460 --> 00:40:56,519 the above picture, they are all connected 1101 00:40:56,520 --> 00:40:57,520 over a ring. 1102 00:40:58,830 --> 00:41:01,199 So they access 1103 00:41:01,200 --> 00:41:03,929 this ring via a bus, which means, 1104 00:41:03,930 --> 00:41:06,089 if you do not have enough threads to 1105 00:41:06,090 --> 00:41:08,130 use every core in this loop, 1106 00:41:09,360 --> 00:41:11,519 your memory bandwidth will 1107 00:41:11,520 --> 00:41:12,520 drop. 1108 00:41:13,770 --> 00:41:15,899 So think about it. 1109 00:41:15,900 --> 00:41:17,219 It's really nice that you can use your 1110 00:41:17,220 --> 00:41:19,469 normal C code, your normal OpenMP and 1111 00:41:19,470 --> 00:41:21,359 Open MPI code on this, so you don't have 1112 00:41:21,360 --> 00:41:22,499 to rewrite a lot. 1113 00:41:22,500 --> 00:41:24,119 But you need to think about whether you really 1114 00:41:24,120 --> 00:41:25,769 can reach the memory bandwidth that you 1115 00:41:25,770 --> 00:41:26,699 are aiming for. 1116 00:41:26,700 --> 00:41:28,229 You need enough threads. 1117 00:41:28,230 --> 00:41:30,299 If you then look into the processor, 1118 00:41:30,300 --> 00:41:32,429 the lower picture, you will probably 1119 00:41:32,430 --> 00:41:34,530 realize that this is quite an old 1120 00:41:35,730 --> 00:41:37,049 core design.
1121 00:41:37,050 --> 00:41:38,700 I think it's a Pentium design, 1122 00:41:40,080 --> 00:41:41,909 and you can see it can execute four 1123 00:41:41,910 --> 00:41:44,729 threads, on the top there, in parallel. 1124 00:41:44,730 --> 00:41:46,679 It has two pipelines to actually 1125 00:41:48,120 --> 00:41:50,309 pipeline the whole stuff, but there 1126 00:41:50,310 --> 00:41:52,379 is just one pipeline with a 1127 00:41:52,380 --> 00:41:54,539 connection to your vector unit, and your 1128 00:41:54,540 --> 00:41:56,639 vector unit is the one that executes code in 1129 00:41:56,640 --> 00:41:58,019 parallel. 1130 00:41:58,020 --> 00:42:00,449 So, again, you need to put enough work 1131 00:42:00,450 --> 00:42:02,519 into this vector unit to make up for 1132 00:42:02,520 --> 00:42:03,629 it. 1133 00:42:03,630 --> 00:42:05,879 So it's not perfect, but 1134 00:42:05,880 --> 00:42:08,009 it's great because this card runs Linux. 1135 00:42:08,010 --> 00:42:10,079 You can open a shell on it 1136 00:42:10,080 --> 00:42:11,969 and you can work with it 1137 00:42:11,970 --> 00:42:13,619 like a normal cluster. 1138 00:42:13,620 --> 00:42:15,869 So your migration efforts are 1139 00:42:15,870 --> 00:42:17,939 way smaller than you 1140 00:42:17,940 --> 00:42:18,940 might think. 1141 00:42:19,590 --> 00:42:20,789 And then there's this one. 1142 00:42:20,790 --> 00:42:22,259 Oh, yeah. By the way, a Xeon Phi, it's 1143 00:42:22,260 --> 00:42:24,449 probably 2000 euros, but I haven't 1144 00:42:24,450 --> 00:42:26,110 met anybody who actually owns one. 1145 00:42:28,530 --> 00:42:30,509 There's this sweet thing called Parallella. 1146 00:42:30,510 --> 00:42:32,459 As you can see, it's ninety-nine dollars, 1147 00:42:32,460 --> 00:42:34,260 which is nice, 1148 00:42:35,280 --> 00:42:37,589 and it has a combination 1149 00:42:37,590 --> 00:42:39,659 of an accelerator on 1150 00:42:39,660 --> 00:42:42,299 it. The accelerator is the Epiphany 1151 00:42:42,300 --> 00:42:44,729 here.
There, and it has 16 1152 00:42:44,730 --> 00:42:47,519 to 64 microprocessors 1153 00:42:47,520 --> 00:42:48,629 on it. 1154 00:42:48,630 --> 00:42:50,789 It can understand C, 1155 00:42:50,790 --> 00:42:51,819 C++ and OpenCL. 1156 00:42:51,820 --> 00:42:54,479 So, not so bad. 1157 00:42:54,480 --> 00:42:56,819 It also has an FPGA, 1158 00:42:56,820 --> 00:42:58,679 and I read that if you don't use the HDMI 1159 00:42:58,680 --> 00:43:00,629 output of this small little card, because 1160 00:43:00,630 --> 00:43:02,819 actually this is a complete 1161 00:43:02,820 --> 00:43:05,129 computer, then you have around 1162 00:43:05,130 --> 00:43:07,199 70 percent of this FPGA 1163 00:43:07,200 --> 00:43:09,479 for your own use, which is 1164 00:43:09,480 --> 00:43:11,339 interesting to figure out. 1165 00:43:11,340 --> 00:43:13,439 And then it also has a processor, a dual 1166 00:43:13,440 --> 00:43:15,719 core processor, and it's just ninety 1167 00:43:15,720 --> 00:43:16,769 nine dollars. 1168 00:43:16,770 --> 00:43:18,689 So I would say this is something for the 1169 00:43:18,690 --> 00:43:19,679 maker community. 1170 00:43:19,680 --> 00:43:22,129 You can do something with it. 1171 00:43:22,130 --> 00:43:24,809 The only thing is, at the moment, 1172 00:43:24,810 --> 00:43:26,819 I think they have already sold some 1173 00:43:26,820 --> 00:43:29,009 pieces, but the preordering 1174 00:43:29,010 --> 00:43:31,499 for the new model is closed, 1175 00:43:31,500 --> 00:43:33,869 so probably they will open up in January 1176 00:43:33,870 --> 00:43:36,119 and then you can shop and try 1177 00:43:36,120 --> 00:43:37,979 out your own stuff and have a 1178 00:43:37,980 --> 00:43:39,179 small cluster at home. 1179 00:43:40,410 --> 00:43:41,410 Would be cool. 1180 00:43:43,700 --> 00:43:44,700 So.
1181 00:43:45,810 --> 00:43:48,149 Now, the picture is again complete. 1182 00:43:48,150 --> 00:43:50,259 I just added, in the middle of it, 1183 00:43:50,260 --> 00:43:52,559 C++ AMP, 1184 00:43:52,560 --> 00:43:55,349 because I was talking a lot about Nvidia 1185 00:43:55,350 --> 00:43:57,449 and AMD, but 1186 00:43:57,450 --> 00:44:00,269 you should never leave out Microsoft, 1187 00:44:00,270 --> 00:44:02,399 because there are people who 1188 00:44:02,400 --> 00:44:04,379 need to develop on Windows even for 1189 00:44:04,380 --> 00:44:05,429 parallel applications. 1190 00:44:05,430 --> 00:44:08,159 And Microsoft has developed 1191 00:44:08,160 --> 00:44:09,809 an extension for C++. 1192 00:44:09,810 --> 00:44:11,429 They are shipping it with Visual Studio 1193 00:44:11,430 --> 00:44:12,629 2012. 1194 00:44:12,630 --> 00:44:14,969 Apparently it is somewhere 1195 00:44:14,970 --> 00:44:16,529 situated in the middle because it 1196 00:44:16,530 --> 00:44:18,659 means that you can go 1197 00:44:18,660 --> 00:44:20,699 towards explicit programming. 1198 00:44:20,700 --> 00:44:21,869 You can really force 1199 00:44:21,870 --> 00:44:24,359 which threads should execute, 1200 00:44:24,360 --> 00:44:26,699 in which combination, in which size, 1201 00:44:26,700 --> 00:44:28,049 on your device. 1202 00:44:28,050 --> 00:44:29,099 But you don't have to. 1203 00:44:30,400 --> 00:44:32,559 So if you are 1204 00:44:32,560 --> 00:44:34,929 going towards Windows, you can use C++ 1205 00:44:34,930 --> 00:44:36,669 AMP and look into it. I don't have a code 1206 00:44:36,670 --> 00:44:38,619 snippet there, but it's online, the 1207 00:44:38,620 --> 00:44:39,620 documentation.
1208 00:44:42,460 --> 00:44:44,219 Exactly, I would say, 1209 00:44:45,400 --> 00:44:47,979 and then we are coming 1210 00:44:47,980 --> 00:44:50,409 already to the end, I think, yes, 1211 00:44:50,410 --> 00:44:51,609 the end, which is great, 1212 00:44:53,290 --> 00:44:55,009 because I did not mention, for example, 1213 00:44:55,010 --> 00:44:57,009 POSIX threads. I did mention them in a 1214 00:44:57,010 --> 00:44:58,719 small sentence. 1215 00:44:58,720 --> 00:45:01,059 I did not mention Boost threads. 1216 00:45:01,060 --> 00:45:03,879 You know, these are all more 1217 00:45:03,880 --> 00:45:05,829 thread libraries for your 1218 00:45:05,830 --> 00:45:08,649 system. They are not aimed at running 1219 00:45:08,650 --> 00:45:10,779 on a huge cluster with a lot of nodes. 1220 00:45:10,780 --> 00:45:12,939 So they don't bring this 1221 00:45:12,940 --> 00:45:15,039 whole message passing interface for easy 1222 00:45:15,040 --> 00:45:16,209 communication with them. 1223 00:45:16,210 --> 00:45:17,319 So I don't need to cover them. 1224 00:45:17,320 --> 00:45:19,629 There's also something like Hossack, 1225 00:45:19,630 --> 00:45:21,989 I think it's by Intel, and there is also 1226 00:45:21,990 --> 00:45:22,990 building. 1227 00:45:23,290 --> 00:45:24,339 I have just seen it. 1228 00:45:24,340 --> 00:45:26,619 So probably if you're searching for 1229 00:45:26,620 --> 00:45:29,139 different alternatives, 1230 00:45:29,140 --> 00:45:30,159 you should look into that. 1231 00:45:30,160 --> 00:45:32,499 And I didn't talk about 1232 00:45:32,500 --> 00:45:35,079 scheduling programs, even though I love them, 1233 00:45:35,080 --> 00:45:37,239 because... and I only have 1234 00:45:37,240 --> 00:45:38,769 two minutes left, so I can talk about 1235 00:45:38,770 --> 00:45:39,770 them, which is great. 1236 00:45:42,160 --> 00:45:44,229 Task
Scheduling programs are the 1237 00:45:44,230 --> 00:45:46,419 ones that you would like to have for 1238 00:45:46,420 --> 00:45:48,729 deploying your programs 1239 00:45:48,730 --> 00:45:50,559 on the workstations of your users. 1240 00:45:50,560 --> 00:45:52,479 If you do not know the workstation of 1241 00:45:52,480 --> 00:45:54,309 your user, you can 1242 00:45:56,110 --> 00:45:58,239 write a program based on StarPU, 1243 00:45:58,240 --> 00:46:00,279 which is developed by the University of 1244 00:46:00,280 --> 00:46:02,859 Bordeaux, and 1245 00:46:02,860 --> 00:46:05,019 StarPU will first benchmark the 1246 00:46:05,020 --> 00:46:07,209 PC and the device, the graphics 1247 00:46:07,210 --> 00:46:08,949 card, probably, that's in it, and then 1248 00:46:08,950 --> 00:46:11,169 decide how much data it will send off 1249 00:46:11,170 --> 00:46:13,299 to the device and how much data 1250 00:46:13,300 --> 00:46:15,729 will be computed on the CPU. 1251 00:46:15,730 --> 00:46:18,459 So your customers, probably 1252 00:46:18,460 --> 00:46:20,709 with a graphics card, with a high-end 1253 00:46:20,710 --> 00:46:22,869 graphics card, could offload 1254 00:46:22,870 --> 00:46:24,939 a lot of computation onto the 1255 00:46:24,940 --> 00:46:25,839 graphics card. 1256 00:46:25,840 --> 00:46:28,089 People with a very weak graphics card 1257 00:46:28,090 --> 00:46:30,219 will have most of the computation 1258 00:46:30,220 --> 00:46:33,039 on their CPU, which is nice and flexible. 1259 00:46:33,040 --> 00:46:35,229 All you have to do is actually 1260 00:46:35,230 --> 00:46:37,299 provide StarPU, or the program that 1261 00:46:37,300 --> 00:46:39,219 you're writing, with multiple 1262 00:46:39,220 --> 00:46:41,439 implementations, which will work 1263 00:46:41,440 --> 00:46:43,719 then for the graphics cards 1264 00:46:43,720 --> 00:46:44,859 and for the CPU.
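To make the benchmarking idea concrete, here is a minimal sketch of the kind of decision such a scheduler has to make: given measured throughputs for the device and the CPU, split the work in proportion. This is an assumption-laden illustration in plain C++ and not the API of the scheduling library mentioned in the talk; `Split`, `split_work` and the rate parameters are hypothetical names.

```cpp
#include <cstddef>

// How many work items go to the accelerator and how many stay on the CPU.
struct Split {
    std::size_t device_items;
    std::size_t cpu_items;
};

// Rates are hypothetical benchmark results in items per second.
// A rate of zero (or less) means that side is treated as unusable.
inline Split split_work(std::size_t total, double device_rate, double cpu_rate) {
    Split result;
    if (device_rate <= 0.0) {
        // No usable device: everything stays on the CPU.
        result.device_items = 0;
        result.cpu_items = total;
        return result;
    }
    if (cpu_rate <= 0.0) {
        // No usable CPU measurement: everything goes to the device.
        result.device_items = total;
        result.cpu_items = 0;
        return result;
    }
    // Share of the work proportional to the device's measured speed.
    const double share = device_rate / (device_rate + cpu_rate);
    result.device_items =
        static_cast<std::size_t>(share * static_cast<double>(total));
    result.cpu_items = total - result.device_items;
    return result;
}
```

With a high-end card the device rate dominates and most items are offloaded; with a weak card the same program keeps most of the work on the CPU. A real scheduler would also have to account for transfer costs, which this sketch ignores.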
1265 00:46:44,860 --> 00:46:47,499 So probably one OpenMP implementation 1266 00:46:47,500 --> 00:46:50,589 for your CPU and one CUDA 1267 00:46:50,590 --> 00:46:52,899 implementation for your GPU, to give you 1268 00:46:52,900 --> 00:46:53,900 an example. 1269 00:46:54,950 --> 00:46:57,439 This is actually very interesting because 1270 00:46:57,440 --> 00:46:59,839 as these students forth 1271 00:46:59,840 --> 00:47:01,939 for a company, we 1272 00:47:01,940 --> 00:47:04,189 should deploy software for 1273 00:47:04,190 --> 00:47:06,379 a different set of customers, and if our 1274 00:47:06,380 --> 00:47:08,270 software can actually somehow 1275 00:47:10,160 --> 00:47:12,829 meet the needs or the abilities 1276 00:47:12,830 --> 00:47:15,439 of the respective customer's PC, 1277 00:47:15,440 --> 00:47:16,870 I think this is quite interesting. 1278 00:47:18,210 --> 00:47:20,969 With this, I'm done, 1279 00:47:20,970 --> 00:47:22,709 and I would say it's time for questions 1280 00:47:22,710 --> 00:47:23,850 now. Thank you for listening. 1281 00:47:31,560 --> 00:47:33,749 That was awesome. So for questions, yeah, 1282 00:47:33,750 --> 00:47:36,809 just queue up at the mics, if you like, 1283 00:47:36,810 --> 00:47:38,370 and we could take some questions. 1284 00:47:41,430 --> 00:47:43,280 All right. We have a question over there. 1285 00:47:44,370 --> 00:47:45,689 I want to know whether 1286 00:47:46,770 --> 00:47:49,019 MOSIX and openMosix still exist. 1287 00:47:49,020 --> 00:47:50,669 And do they know how to talk to graphics 1288 00:47:50,670 --> 00:47:51,869 cards today, if they do? 1289 00:47:53,730 --> 00:47:56,099 You are probably talking about 1290 00:47:56,100 --> 00:47:58,919 this pre-, um, 1291 00:47:58,920 --> 00:48:01,079 these versions that were 1292 00:48:01,080 --> 00:48:03,389 before they tried to make this programming 1293 00:48:03,390 --> 00:48:05,789 easier.
And I don't know about it because 1294 00:48:05,790 --> 00:48:07,919 this would be so difficult for us to 1295 00:48:07,920 --> 00:48:09,149 use. 1296 00:48:09,150 --> 00:48:11,369 So we moved into this field 1297 00:48:11,370 --> 00:48:12,450 when CUDA came, 1298 00:48:13,680 --> 00:48:15,839 because, yeah, I think they are still 1299 00:48:15,840 --> 00:48:17,969 alive, or somehow something like this is 1300 00:48:17,970 --> 00:48:19,829 still alive, but I think they are losing 1301 00:48:19,830 --> 00:48:22,129 market share, especially with OpenCL. 1302 00:48:22,130 --> 00:48:23,339 Thank you. 1303 00:48:23,340 --> 00:48:24,869 All right. We have another question at 1304 00:48:24,870 --> 00:48:26,129 this mic. 1305 00:48:26,130 --> 00:48:28,169 OK, so it looks to me that, 1306 00:48:28,170 --> 00:48:30,569 roughly speaking, of course, OpenCL 1307 00:48:30,570 --> 00:48:32,639 and CUDA, for example, compare 1308 00:48:32,640 --> 00:48:34,739 like Java and C a bit, because 1309 00:48:34,740 --> 00:48:36,689 both Java and OpenCL are more bloated 1310 00:48:36,690 --> 00:48:38,759 in code, and what you get out of it when 1311 00:48:38,760 --> 00:48:41,099 you compile it is more general code. 1312 00:48:41,100 --> 00:48:43,229 But then we all know that Java may be 1313 00:48:43,230 --> 00:48:44,119 slower than C. 1314 00:48:44,120 --> 00:48:46,019 So I wonder, can you say something about 1315 00:48:46,020 --> 00:48:47,579 how the performance compares, maybe in 1316 00:48:47,580 --> 00:48:49,709 terms of some benchmarks, maybe 1317 00:48:49,710 --> 00:48:51,809 also in comparison with 1318 00:48:51,810 --> 00:48:53,979 the vector processing instructions like 1319 00:48:53,980 --> 00:48:56,489 SSE that we have on modern CPU 1320 00:48:56,490 --> 00:48:57,490 architectures?
1321 00:48:58,400 --> 00:49:00,899 Um, actually I heard a bit 1322 00:49:00,900 --> 00:49:03,209 about a benchmark because somebody ran 1323 00:49:03,210 --> 00:49:05,189 a very parallel application, a fully 1324 00:49:05,190 --> 00:49:07,139 parallel application, a Monte Carlo 1325 00:49:07,140 --> 00:49:09,499 simulation, on a Xeon Phi and a CUDA 1326 00:49:09,500 --> 00:49:11,579 card. And on the side of it there was 1327 00:49:11,580 --> 00:49:13,859 a sixteen-core Intel processor, 1328 00:49:13,860 --> 00:49:15,569 as far as I know, and I think we actually 1329 00:49:15,570 --> 00:49:16,619 have one in our cluster. 1330 00:49:17,850 --> 00:49:20,729 In the fully parallel version, 1331 00:49:20,730 --> 00:49:23,309 the GPU would perform best, 1332 00:49:23,310 --> 00:49:25,469 but if there would be just one iteration, 1333 00:49:25,470 --> 00:49:26,470 one 1334 00:49:27,750 --> 00:49:29,849 that would not be parallel, 1335 00:49:29,850 --> 00:49:31,949 you know, if there 1336 00:49:31,950 --> 00:49:33,250 were one serial 1337 00:49:34,500 --> 00:49:37,169 iteration in this piece of code, 1338 00:49:37,170 --> 00:49:39,719 the Sandy Bridge would perform best 1339 00:49:39,720 --> 00:49:42,779 and the GPU, 1340 00:49:42,780 --> 00:49:44,799 even though most of the code 1341 00:49:44,800 --> 00:49:47,039 would be perfectly parallel, would not 1342 00:49:47,040 --> 00:49:49,469 perform well enough. 1343 00:49:49,470 --> 00:49:51,659 So instead of investing so much 1344 00:49:51,660 --> 00:49:53,879 time into optimizing for 1345 00:49:53,880 --> 00:49:55,799 this CUDA card, I think they 1346 00:49:55,800 --> 00:49:56,939 had a Tesla.
1347 00:49:56,940 --> 00:49:58,289 They said: no, probably I should 1348 00:49:59,370 --> 00:50:01,589 skip it and run on my Sandy Bridge, 1349 00:50:01,590 --> 00:50:03,739 which could be busy, though. 1350 00:50:03,740 --> 00:50:06,179 Which is also something: if you 1351 00:50:06,180 --> 00:50:08,249 optimize your kernel for OpenCL, 1352 00:50:08,250 --> 00:50:11,039 it will not run as well as for CUDA, 1353 00:50:11,040 --> 00:50:13,409 and the other way around. 1354 00:50:13,410 --> 00:50:14,739 So it depends on the problem. 1355 00:50:14,740 --> 00:50:18,029 Basically, this is one thing, and 1356 00:50:18,030 --> 00:50:20,069 it depends on you, how much time you 1357 00:50:20,070 --> 00:50:23,139 invest in optimizing for a specific card. 1358 00:50:23,140 --> 00:50:24,489 Thank you. 1359 00:50:24,490 --> 00:50:26,129 OK, so we could take another question 1360 00:50:26,130 --> 00:50:27,130 from there. 1361 00:50:28,290 --> 00:50:29,249 All right. 1362 00:50:29,250 --> 00:50:31,439 Did you also look into other programming 1363 00:50:31,440 --> 00:50:33,809 languages that 1364 00:50:33,810 --> 00:50:36,059 make parallel programming possible or 1365 00:50:36,060 --> 00:50:38,309 even easier than with 1366 00:50:38,310 --> 00:50:40,859 C, C++, CUDA, OpenCL and stuff 1367 00:50:40,860 --> 00:50:42,959 like, for example, Haskell 1368 00:50:42,960 --> 00:50:45,659 with the libraries 1369 00:50:45,660 --> 00:50:46,619 that they provide? 1370 00:50:46,620 --> 00:50:48,929 Because I'm bound through my project 1371 00:50:48,930 --> 00:50:51,199 to C, I don't... 1372 00:50:51,200 --> 00:50:53,639 there is another language... I 1373 00:50:53,640 --> 00:50:55,049 don't know anything about Haskell.
1374 00:50:55,050 --> 00:50:57,599 I have just seen some of the languages 1375 00:50:57,600 --> 00:50:59,849 that Google developed, 1376 00:50:59,850 --> 00:51:02,219 Go, which is 1377 00:51:02,220 --> 00:51:04,319 somehow a replacement for C++, 1378 00:51:04,320 --> 00:51:06,539 which also, I think, aims 1379 00:51:06,540 --> 00:51:08,429 to kind of make parallel programming 1380 00:51:08,430 --> 00:51:09,329 easier. 1381 00:51:09,330 --> 00:51:11,429 And also all of the standards are 1382 00:51:11,430 --> 00:51:13,229 also developed for Fortran. 1383 00:51:13,230 --> 00:51:15,009 And I would really be happy if this 1384 00:51:15,010 --> 00:51:17,429 would just stay stable, 1385 00:51:17,430 --> 00:51:18,430 really. 1386 00:51:19,290 --> 00:51:21,359 And so this also 1387 00:51:21,360 --> 00:51:23,249 works for Fortran, which is what most 1388 00:51:23,250 --> 00:51:25,739 physicists and mathematicians use. 1389 00:51:25,740 --> 00:51:28,079 I don't know a lot further 1390 00:51:28,080 --> 00:51:29,080 than that. 1391 00:51:29,940 --> 00:51:31,199 All right. One from there. 1392 00:51:31,200 --> 00:51:33,089 Now, again, I... 1393 00:51:33,090 --> 00:51:35,460 I have a question regarding OpenACC. 1394 00:51:37,020 --> 00:51:39,149 Obviously, there is a lot of magic 1395 00:51:39,150 --> 00:51:41,429 implemented regarding data flow 1396 00:51:41,430 --> 00:51:42,389 analysis. 1397 00:51:42,390 --> 00:51:44,579 And I want to know 1398 00:51:44,580 --> 00:51:47,939 whether there's also some magic 1399 00:51:47,940 --> 00:51:50,219 in case your 1400 00:51:50,220 --> 00:51:52,379 input or output doesn't fit into 1401 00:51:52,380 --> 00:51:54,589 memory, like this tiling 1402 00:51:54,590 --> 00:51:56,819 case regarding matrix multiplication 1403 00:51:56,820 --> 00:51:59,579 you talked about, 1404 00:51:59,580 --> 00:52:01,199 or do you have to handle it manually?
1405 00:52:02,380 --> 00:52:04,529 Um, I think they 1406 00:52:04,530 --> 00:52:06,689 do tiling, and it depends on 1407 00:52:06,690 --> 00:52:08,429 the compiler. The good thing is 1408 00:52:08,430 --> 00:52:09,599 that OpenACC, 1409 00:52:09,600 --> 00:52:11,969 with the PGI compiler, 1410 00:52:11,970 --> 00:52:14,039 which I saw, for Nvidia 1411 00:52:14,040 --> 00:52:15,449 cards is compiled into CUDA. 1412 00:52:15,450 --> 00:52:17,429 So you can actually look into it. 1413 00:52:17,430 --> 00:52:19,709 And for Radeon 1414 00:52:19,710 --> 00:52:21,659 cards, it's compiled into OpenCL, so 1415 00:52:21,660 --> 00:52:23,669 you can actually read it before you 1416 00:52:23,670 --> 00:52:25,989 launch it. I think they do, for example, 1417 00:52:25,990 --> 00:52:27,719 tiling, because otherwise you can't get to 1418 00:52:27,720 --> 00:52:29,759 this matrix speed-up. 1419 00:52:29,760 --> 00:52:30,769 OpenACC doesn't 1420 00:52:30,770 --> 00:52:32,959 give you enough pragma 1421 00:52:32,960 --> 00:52:35,419 possibilities to tell it to tile 1422 00:52:35,420 --> 00:52:37,489 the matrix, so it needs to do this 1423 00:52:37,490 --> 00:52:39,829 on its own, and 1424 00:52:39,830 --> 00:52:41,959 the more trained the 1425 00:52:41,960 --> 00:52:44,149 compiler gets, the better it 1426 00:52:44,150 --> 00:52:45,589 will do it. 1427 00:52:45,590 --> 00:52:47,659 So it has to grow, will grow with the 1428 00:52:47,660 --> 00:52:48,679 community, I would say. 1429 00:52:51,560 --> 00:52:53,089 All right, cool. 1430 00:52:53,090 --> 00:52:55,309 I think we have one question from there, 1431 00:52:55,310 --> 00:52:56,689 and we have time for one from here, I 1432 00:52:56,690 --> 00:52:59,299 guess. OK, um, 1433 00:52:59,300 --> 00:53:01,789 I'm an advanced 1434 00:53:01,790 --> 00:53:03,289 C/C++ programmer.
1435 00:53:03,290 --> 00:53:05,709 A single example is 1436 00:53:05,710 --> 00:53:07,879 a little confusing because 1437 00:53:07,880 --> 00:53:10,429 you are mixing 1438 00:53:10,430 --> 00:53:12,699 an integer and a pointer. 1439 00:53:12,700 --> 00:53:15,019 And the question I'm talking 1440 00:53:15,020 --> 00:53:17,209 about is that it's fine 1441 00:53:17,210 --> 00:53:19,559 on a, uh, 1442 00:53:19,560 --> 00:53:20,600 32-bit 1443 00:53:21,680 --> 00:53:23,899 architecture in particular. 1444 00:53:23,900 --> 00:53:25,669 But the problem is, if you're switching 1445 00:53:25,670 --> 00:53:28,069 to 64, 1446 00:53:28,070 --> 00:53:30,259 then you get a problem because 1447 00:53:30,260 --> 00:53:32,569 the pointer will double and 1448 00:53:32,570 --> 00:53:35,239 the pointer size is now eight, 1449 00:53:35,240 --> 00:53:37,759 and integer, in all 1450 00:53:37,760 --> 00:53:39,859 standards I know in C and 1451 00:53:39,860 --> 00:53:41,989 C++, uh, is defined as a 1452 00:53:41,990 --> 00:53:44,249 four-byte type. 1453 00:53:44,250 --> 00:53:46,579 So the question is, 1454 00:53:46,580 --> 00:53:48,769 in CUDA or OpenCL, 1455 00:53:48,770 --> 00:53:51,049 uh, is the size of integer 1456 00:53:51,050 --> 00:53:53,299 defined? Is it also defined 1457 00:53:53,300 --> 00:53:56,059 that it should be only 32? 1458 00:53:56,060 --> 00:53:57,060 But... 1459 00:53:58,600 --> 00:54:00,169 I actually cannot answer your question. 1460 00:54:00,170 --> 00:54:02,539 I just switched to an 1461 00:54:02,540 --> 00:54:04,789 integer so that my written output 1462 00:54:04,790 --> 00:54:06,969 in my example would be more readable. 1463 00:54:06,970 --> 00:54:09,049 So it's like getting rid 1464 00:54:09,050 --> 00:54:10,219 of the digits. 1465 00:54:11,270 --> 00:54:13,849 Because of the size of 1466 00:54:13,850 --> 00:54:16,069 int. And you say then 1467 00:54:16,070 --> 00:54:18,289 void star. 1468 00:54:18,290 --> 00:54:21,379 And this is a problem from a C++ perspective.
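A short sketch of how the concern raised in this question is commonly sidestepped in C++: use fixed-width integer types for buffer sizes and never store a pointer in a plain `int`. It is not the code from the slides; `buffer_bytes` and `pointer_round_trips` are hypothetical names, and the `static_assert` assumes the usual 8-bit byte. Note that the C and C++ standards only guarantee a minimum range for `int`, which is why a fixed-width type is used here.

```cpp
#include <cstddef>
#include <cstdint>

// Holds on platforms with 8-bit bytes, where int32_t occupies four bytes.
static_assert(sizeof(std::int32_t) == 4, "expects 8-bit bytes");

// Bytes needed for a buffer of `count` 32-bit integers. The result does not
// change between 32-bit and 64-bit builds because it never involves a pointer.
inline std::size_t buffer_bytes(std::size_t count) {
    return count * sizeof(std::int32_t);
}

// A pointer survives a round trip through std::uintptr_t, which is wide
// enough for it by definition; a plain int gives no such guarantee.
inline bool pointer_round_trips(const void* p) {
    const std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(p);
    return reinterpret_cast<const void*>(bits) == p;
}
```

Sizing device buffers with `sizeof` of the element type, as in `buffer_bytes`, keeps the element width and the pointer width separate, which is the distinction the question is about.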
1469 00:54:21,380 --> 00:54:23,869 OK, for the real-world programmer. 1470 00:54:23,870 --> 00:54:26,129 Yeah, actually I do 1471 00:54:26,130 --> 00:54:28,249 program with floats, so 1472 00:54:28,250 --> 00:54:29,689 it's just for this use case. 1473 00:54:29,690 --> 00:54:31,280 I'm sorry. It didn't 1474 00:54:32,420 --> 00:54:35,299 break my computer until now 1475 00:54:35,300 --> 00:54:37,459 in my examples, so I'm sorry. 1476 00:54:37,460 --> 00:54:38,460 So please don't 1477 00:54:40,490 --> 00:54:42,889 make too big integer stuff and 1478 00:54:42,890 --> 00:54:43,999 be careful. 1479 00:54:44,000 --> 00:54:45,379 All right. I think we have one more from 1480 00:54:45,380 --> 00:54:46,380 here. 1481 00:54:47,210 --> 00:54:49,069 I haven't worked with graphics cards, but I 1482 00:54:49,070 --> 00:54:50,839 worked with parallel execution. 1483 00:54:50,840 --> 00:54:53,089 And there, most of the time I spent 1484 00:54:53,090 --> 00:54:55,339 defining what are immutable values, what 1485 00:54:55,340 --> 00:54:57,709 are mutable values 1486 00:54:57,710 --> 00:55:00,079 and what needs to be shared and stuff. 1487 00:55:00,080 --> 00:55:02,149 And in the code you showed, I've hardly 1488 00:55:02,150 --> 00:55:04,219 seen any definitions 1489 00:55:04,220 --> 00:55:06,439 of what needs to be shared among which 1490 00:55:06,440 --> 00:55:08,479 and stuff. Can you say something about 1491 00:55:08,480 --> 00:55:10,169 how that works on a graphics card? 1492 00:55:10,170 --> 00:55:12,769 Um, I actually 1493 00:55:12,770 --> 00:55:15,049 left it out so that it wouldn't be too 1494 00:55:15,050 --> 00:55:16,549 complicated.
1495 00:55:16,550 --> 00:55:18,679 And what you do is you 1496 00:55:18,680 --> 00:55:20,869 just use a specific keyword, in 1497 00:55:20,870 --> 00:55:23,989 the case of CUDA, in front of your variable 1498 00:55:23,990 --> 00:55:26,239 in your kernel when you need to load, 1499 00:55:26,240 --> 00:55:28,069 for example, a small tile into your shared 1500 00:55:28,070 --> 00:55:30,199 memory. So you use a 1501 00:55:30,200 --> 00:55:33,019 specific keyword. 1502 00:55:33,020 --> 00:55:35,299 I would say, I think, in OpenCL, 1503 00:55:35,300 --> 00:55:37,399 as in CUDA, you have to do this 1504 00:55:37,400 --> 00:55:40,069 the same way. You have to do this by hand. 1505 00:55:40,070 --> 00:55:41,779 And then OpenACC: 1506 00:55:41,780 --> 00:55:43,879 it should figure 1507 00:55:43,880 --> 00:55:45,229 it out by itself. 1508 00:55:45,230 --> 00:55:47,479 But OpenACC also allows you, a little 1509 00:55:47,480 --> 00:55:49,759 bit like OpenMP, to define 1510 00:55:49,760 --> 00:55:51,919 for the system what is shared and what 1511 00:55:51,920 --> 00:55:53,119 is private. 1512 00:55:53,120 --> 00:55:54,289 So. 1513 00:55:54,290 --> 00:55:55,759 All right. That has to be our last 1514 00:55:55,760 --> 00:55:57,919 question. And so, please, on 1515 00:55:57,920 --> 00:55:59,689 your way out, first of all, a round of 1516 00:55:59,690 --> 00:56:00,690 applause.