*35C3 preroll music*

Herald Angel: Welcome to our introduction to deep learning with Teubi. Deep learning, often also called machine learning, is a hype word which we hear in the media all the time. It's nearly as bad as blockchain. It's a solution for everything. Today we'll get a sneak peek into the internals of this mystical black box everyone is talking about. And Teubi will show us why people who know what machine learning is really about have to facepalm so often when they read the news. So please welcome Teubi with a big round of applause!

*Applause*

Teubi: Alright! Good morning and welcome to Introduction to Deep Learning. The title already tells you what this talk is about. I want to give you an introduction to how deep learning works, what happens inside this black box. But, first of all, who am I? I'm Teubi. It's a German nickname; it has nothing to do with toys or bees. You might have heard my voice before, because I host the Nussschale podcast, where I explain scientific topics in under 10 minutes. I'll have to use a little more time today, and you'll also have fancy animations which hopefully will help. In my day job I'm a research scientist at an institute for computer vision. I analyze microscopy images of bone marrow blood cells and try to find ways to teach the computer to understand what it sees. Namely, to differentiate between certain cells or, first of all, to find cells in an image, which is a task that is more complex than it might sound.

Let me start with the introduction to deep learning. We all know how to code. We code in a very simple way: we have some input for our computer algorithm, then we have an algorithm which says "do this, do that; if this, then that", and in that way we generate some output. This is not how machine learning works. Machine learning assumes you have some input, and you also have some output. What you also have is some statistical model. This statistical model is flexible. It has certain parameters, which it can learn from the distribution of inputs and outputs you give it for training. So you basically teach the statistical model to generate the desired output from the given input. Let me give you a really simple example of how this might work. Let's say we have two animals. Well, we have two kinds of animals: unicorns and rabbits. And now we want to find an algorithm that tells us whether the animal we have right now as an input is a rabbit or a unicorn.
We can write a simple algorithm to do that, but we can also do it with machine learning. The first thing we need is some input. I chose two features that are able to tell me whether this animal is a rabbit or a unicorn: speed and size. We call these features, and they describe something about what we want to classify. The class in this case is our animal. The first thing I need is some training data, some input. The input here is just pairs of speed and size. What I also need is information about the desired output, the desired output of course being the class: either unicorn or rabbit, here denoted by yellow and red X's. So let's try to find a statistical model which we can use to separate this feature space into two halves: one for the rabbits, one for the unicorns. Looking at this, we can actually find a really simple statistical model, and our statistical model in this case is just a straight line. The learning process is then to find where in this feature space the line should be. Ideally, for example, right in the middle between the two classes rabbit and unicorn.

Of course this is an overly simplified example. Real-world applications have feature distributions which look much more like this: we have a gradient, we don't have a perfect separation between those two classes, and those two classes are definitely not separable by a line. If we look again at some training samples — training samples are the data points we use for the machine learning process, that is, to find the parameters of our statistical model — and we look at the line again, then it will not be able to separate this training set. We will have a line that makes some errors: some unicorns which will be classified as rabbits, some rabbits which will be classified as unicorns. This is what we call underfitting. Our model is just not able to express what we want it to learn. There is also the opposite case: we just learn all the training samples by heart. This happens if we have a very complex model and just a few training samples to teach the model what it should learn. In this case we have a perfect separation of unicorns and rabbits, at least for the few data points we have. If we draw another example from the real world, some other data points, they will most likely be classified wrongly. And this is what we call overfitting.
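To make that concrete, here is a minimal sketch of the rabbit/unicorn example in Python. This is not code from the talk: the numbers are made up, and scikit-learn's LogisticRegression simply stands in for the "straight line" model.

import numpy as np
from sklearn.linear_model import LogisticRegression

# training data: one row per animal, the two features being (speed, size)
X = np.array([
    [2.0, 0.3], [3.5, 0.4], [1.5, 0.5], [4.0, 0.6],      # rabbits (slow, small)
    [60.0, 1.8], [55.0, 2.0], [70.0, 1.6], [65.0, 2.1],  # unicorns (fast, large)
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # desired output: 0 = rabbit, 1 = unicorn

model = LogisticRegression()   # the "statistical model": a linear decision boundary
model.fit(X, y)                # the learning process: place the line in feature space

print(model.predict([[3.0, 0.4], [58.0, 1.9]]))  # -> [0 1], i.e. rabbit, unicorn

The fit call is the learning step: it places the line in the speed/size feature space, and predict just checks on which side of that line a new animal falls.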
The perfect scenario would be something like this: a classifier which is really close to the distribution we have in the real world. Machine learning is tasked with finding this perfect model and its parameters.

Let me show you a different kind of model, something you probably all have heard about: neural networks. Neural networks are inspired by the brain, or more precisely, by the neurons in our brain. Neurons are tiny objects, tiny cells in our brain, that take some input and generate some output. Sounds familiar, right? The inputs usually come in the form of electrical signals, and if they are strong enough, the neuron will also send out an electrical signal. This is something we can model in a computer-engineering way. So, what we do is: we take a neuron. The neuron is just a simple mapping from input to output. The input here is just three input nodes, which we denote by i1, i2 and i3, and the output is denoted by o. And now you will actually see some mathematical equations. There are not many of these in this foundations talk, don't worry, and it's really simple. There's one more thing we need first, though, if we want to map input to output the way a neuron does: the weights. The weights are just some arbitrary numbers for now; let's call them w1, w2 and w3. We take those weights and multiply them with the input: input 1 times weight 1, input 2 times weight 2, and so on. And this sum will just be our output. Well, not quite. We make it a little bit more complicated. We also use something called an activation function. The activation function is just a mapping from one scalar value to another scalar value — in this case from what we got as an output, the sum, to something that more closely fits what we need. This could for example be something binary, where all the negative numbers are mapped to zero and all the positive numbers are mapped to one. And then this zero and one can encode something, for example: rabbit or unicorn. So, let me give you an example of how we can make the previous example with the rabbits and unicorns work with such a simple neuron. We just use speed, size, and the arbitrarily chosen number 10 as our inputs, and the weights 1, 1, and -1. If we look at the equations, then we get a 0 for all negative numbers — so, speed plus size being less than 10 — and a 1 for all positive numbers — speed plus size being greater than 10. This way we again have a separating line between unicorns and rabbits.
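As a minimal sketch (not from the talk's slides, but using the same numbers), the neuron just described fits in a few lines of Python, with the binary step function as activation and the weights 1, 1 and -1:

def step(x):
    return 1 if x > 0 else 0          # activation: negative sum -> 0, positive sum -> 1

def neuron(inputs, weights):
    weighted_sum = sum(i * w for i, w in zip(inputs, weights))
    return step(weighted_sum)

def classify(speed, size):
    # inputs: speed, size and the constant 10; weights: 1, 1 and -1,
    # so the output is 1 exactly when speed + size > 10
    return neuron([speed, size, 10], [1, 1, -1])

print(classify(3, 1))    # 0 -> rabbit  (3 + 1 is not greater than 10)
print(classify(60, 2))   # 1 -> unicorn (60 + 2 is greater than 10)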
But again we have this really simplistic model. We want the model to become more and more complex in order to express more complex tasks. So what do we do? We take more neurons. We take our three input values and put them into one neuron, and into a second neuron, and into a third neuron. And we take the output of those three neurons as input for another neuron. We also call this a multilayer perceptron — perceptron just being a different name for the neuron we have there — and the whole thing is also called a neural network. So now the question: how do we train this? How do we learn what this network should encode? Well, we want a mapping from input to output, and what we can change are the weights. First, we take a training sample, some input, put it through the network, and get an output. But this might not be the desired output, which we know. In the binary case there are four possible combinations of computed output and expected output, each taking the values 0 and 1. The best case would be: we want a 0 and get a 0, or we want a 1 and get a 1. But there is also the opposite case, and in those two cases we can learn something about our model — namely, in which direction to change the weights. It's a little bit simplified, but in principle you just raise the weights if you need a higher number as output, and you lower the weights if you need a lower number as output. To tell you by how much, we have two terms. The first term is the error, in this case just the difference between the desired and the computed output — also often called a loss function, especially in deep learning and more complex applications. The second term is what we call the learning rate, and the learning rate tells us how quickly we should change the weights, how quickly we should adapt them. Okay, this is how we learn a model. This is almost everything you need to know. There are mathematical equations that tell you how much to change based on the error and the learning rate, and this is the entire learning process. Let's get back to the terminology. We have the input layer. We have the output layer, which somehow encodes our output, either in one value or in several values if we have multiple classes. We also have the hidden layers, which are actually what makes our model deep. What we can change, what we can learn, are the weights: the parameters of this model.
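Here is a minimal sketch of that training idea, assuming the classic perceptron-style rule "new weight = old weight + learning rate × error × input" (the data, the starting weights and the learning rate are made up for illustration):

def step(x):
    return 1 if x > 0 else 0

def train(samples, weights, learning_rate=0.01, epochs=100):
    for _ in range(epochs):                        # go over the training data repeatedly
        for inputs, desired in samples:
            computed = step(sum(i * w for i, w in zip(inputs, weights)))
            error = desired - computed             # the error term in this binary case
            # raise or lower each weight, scaled by the learning rate and the input
            weights = [w + learning_rate * error * i for w, i in zip(weights, inputs)]
    return weights

# (speed, size, constant 1 as a bias input) -> desired class: 0 = rabbit, 1 = unicorn
samples = [([3, 1, 1], 0), ([2, 0.5, 1], 0), ([60, 2, 1], 1), ([55, 1.8, 1], 1)]
print(train(samples, weights=[0.0, 0.0, 0.0]))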
But what we also need to keep in mind are the number of layers, the number of neurons per layer, the learning rate, and the activation function. These are called hyperparameters, and they determine how complex our model is and how well it is suited to solve the task at hand. I quite often spoke about solving tasks, so the question is: what can we actually do with neural networks? Mostly classification tasks, for example: tell me, is this animal a rabbit or a unicorn? Is this text message spam or legitimate? Is this patient healthy or ill? Is this image a picture of a cat or a dog? We already saw for the animal that we need something called features, which somehow encode information about what we want to classify, something we can use as input for the neural network — some kind of number that is meaningful. For the animal it could be speed, size, or something like color. Color, of course, is more complex again, because we have, for example, RGB, so three values. The text message is a more complex case again, because we somehow need to encode the sender, and whether the sender is legitimate. Same for the recipient, or the number of hyperlinks, or where the hyperlinks refer to, or whether certain words are present in the text. It gets more and more complicated. Even more so for a patient: how do we encode medical history in a proper way for the network to learn? I mean, temperature is simple — it's a scalar value, we just have a number. But how do we encode whether certain symptoms are present? And the image, which is actually what I work with every day, is again quite complex. We have values, we have numbers, but only pixel values, which are difficult to use as input for a neural network. Why? I'll show you. I'll actually show you with this picture. It's a very famous picture, and everybody uses it in computer vision. They will tell you it's because there is a multitude of different characteristics in this image: shapes, edges, whatever you desire. The truth is, it's a crop from the centerfold of the Playboy, and in earlier years computer vision engineers were a mostly male audience. Anyway, let's take five by five pixels. Let's assume this is a five-by-five-pixel image, a really small image. If we take those 25 pixels and use them as input for a neural network, you already see that we have many connections — many weights — which means a very complex model. A complex model, of course, is prone to overfitting.
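A quick back-of-the-envelope sketch of that parameter explosion — the layer sizes here are made up; the point is only how quickly the number of weights grows once every pixel is connected to every neuron of the next layer:

def fully_connected_weights(inputs, neurons):
    return inputs * neurons                      # one weight per input-to-neuron connection

print(fully_connected_weights(5 * 5, 10))            # tiny 5x5 image:        250
print(fully_connected_weights(256 * 256, 10))        # small grayscale photo: 655,360
print(fully_connected_weights(1920 * 1080 * 3, 10))  # one HD color frame:    62,208,000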
But there are more problems. The first is that we have disconnected a pixel from its neighbors. We can't encode information about the neighborhood anymore, and that really sucks. If we just take the whole picture and move it to the left or to the right by just one pixel, the network will see something completely different, even though to us it is exactly the same. But we can solve that with some very clever engineering, something we call a convolutional layer. It is again a hidden layer in a neural network, but it does something special. It actually is a very simple neuron again, just four input values and one output value. But the four input values look at two by two pixels and encode one output value. Then the same neuron is shifted to the right and encodes another pixel, and another pixel, and the next row of pixels, and in this way it creates another 2D image. We have preserved information about the neighborhood, and we just have a very low number of weights, not the huge number of parameters we saw earlier. We can use this once, or twice, or several hundred times. And this is actually where we go deep. Deep means we have several layers, and having layers that don't need thousands or millions of connections, but only a few, is what allows us to go really deep. In this fashion we can encode an entire image in just a few meaningful values. What these values look like, and what they encode, is learned through the learning process. We can then, for example, use these few values as input for a classification network — the fully connected network we saw earlier. Or we can do something more clever. We can do the inverse operation and create an image again, for example the same image, which is then called an autoencoder. Autoencoders are tremendously useful, even though they don't appear that way. For example, imagine you want to check whether something has a defect or not — a picture of a fabric, or of something similar. You just train the network with normal pictures. Then, if you have a defective picture, the network is not able to reproduce the defect, and so the difference between the reproduced picture and the real picture will show you where the errors are. If it works properly, I have to admit. But we can go even further. Let's say we want to encode something else entirely. Well, let's encode the image, the information in the image, but in another representation.
For example, let's say we have three classes again: the background class in grey, a class called hat or headwear in blue, and person in green. We can also use this for other applications than just pictures of humans. For example, we have a picture of a street and want to encode: where is the car, where is the pedestrian? Tremendously useful. Or we have an MRI scan of a brain: where in the brain is the tumor? Can we somehow learn this? Yes, we can do this with methods like these, if they are trained properly. More about that later. Well, we expect something like this to come out, but the truth looks rather like this — especially if it's not properly trained. We don't get the real shape we want, but something distorted. So here is again where we need to do learning. First we take a picture, put it through the network, and get our output representation. And we have the information about how we want it to look. We again compute some kind of loss value, this time for example being the overlap between the shape we get out of the model and the shape we want to have. And we use this error, this loss function, to update the weights of our network. Again — even though it's more complicated here, even though we have more layers, and even though the layers look slightly different — it is the same process all over again as with the binary case. And we need lots of training data. This is something that you'll hear often in connection with deep learning: you need lots of training data to make this work. Images are complex things, and in order to meaningfully extract knowledge from them, the network needs to see a multitude of different images.

Well, now I have already shown you some building blocks we use in network architectures: the fully convolutional encoder, which takes an image and produces a few meaningful values out of this image, and its counterpart, the fully convolutional decoder — fully convolutional meaning, by the way, that we only have these convolutional layers with a few parameters that somehow encode spatial information and keep it for the next layers. The decoder takes a few meaningful numbers and reproduces an image — either the same image or another representation of the information encoded in the image. We also already saw the fully connected network, fully connected meaning every neuron is connected to every neuron in the next layer. This of course can be dangerous, because this is where we actually get most of our parameters.
If we have a fully connected network, this is where most of the parameters will be present, because connecting every node to every node simply means a high number of connections. We can also do other things, for example something called a pooling layer. A pooling layer is basically the same as one of those convolutional layers, except that we don't have parameters we need to learn. This works without parameters because the neuron just chooses whichever value is the highest and takes that value as output. This is really great for reducing the size of your image and also for getting rid of information that might not be that important. We can also use some clever techniques like adding a dropout layer. A dropout layer is just a normal layer in a neural network where we remove some connections: in one training step these connections, in the next training step some other connections. This way we teach the remaining connections to become more resilient against errors.

I would like to start with something I call the "Model Show" now, and show you some models and how we train them. I will start with the fully convolutional decoder we saw earlier: this thing that takes a number and creates a picture. I would like to take this model, put in some number, and get out a picture — a picture of a horse, for example. If I put in a different number, I also want to get a picture of a horse, but of a different horse. So what I want is a mapping from some numbers, some features that encode something about the horse picture, to a horse picture. You might already see why this is problematic. It is problematic because we don't have a mapping from features to horses or from horses to features. So we don't have a truth value we can use to learn how to generate this mapping. Well, computer vision engineers — or deep learning professionals — are smart and have clever ideas. Let's just assume we have such a network and let's call it a generator. Let's take some numbers, put them into the generator, and get some horses. Well, it doesn't work yet — we still have to train it. So there are probably not only horses but also some very special unicorns among the horses, which might be nice for other applications, but I wanted pictures of horses right now. So I can't train with this data directly. But what I can do is create a second network.
This network is called a discriminator, and I can give it the input generated by the generator as well as the real data I have: the real horse pictures. And then I can teach the discriminator to distinguish between those — tell me whether it is a real horse or not a real horse. There I know what the truth is, because I either take real horse pictures or fake horse pictures from the generator. So I have a truth value for this discriminator. But in doing this I also have a truth value for the generator, because I want the generator to work against the discriminator. So I can also use the information about how well the discriminator does to train the generator to become better at fooling it. This is called a generative adversarial network, and it can be used to generate pictures of an arbitrary distribution.

Let's do this with numbers, and I will actually show you the training process. Before I start the video, I'll tell you what I did. I took some handwritten digits — there is a database of handwritten digits called MNIST, so the digits 0 to 9 — and used them as training data. I trained a generator in the way I showed you on the previous slide, and then I just took some random numbers, put those random numbers into the network, and stored the image of what came out of the network. Here in the video you'll see how the network improved with ongoing training. You will see that we start basically with just noisy images, and then after some epochs — training iterations — the network is able to almost perfectly generate handwritten digits just from noise. Which I find truly fascinating. Of course this is an example where it works. It highly depends on your data set and on how you train the model whether it is a success or not. But if it works, you can use it to generate fonts, characters, 3D objects, pictures of animals — whatever you want, as long as you have training data.

Let's go more crazy. Let's take two of those, and let's say we have pictures of horses and pictures of zebras. I want to convert those pictures of horses into pictures of zebras, and I want to convert pictures of zebras into pictures of horses. So I want to have the same picture, just with the other animal. But I don't have training data of the same situation once with a horse and once with a zebra. Doesn't matter. We can train a network that does that for us.
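Before the horses and zebras, here is a minimal PyTorch sketch of the generator-versus-discriminator training loop just described. This is not the code behind the demo; the layer sizes, learning rates and the flattened 28x28 MNIST-style images are illustrative assumptions.

import torch
import torch.nn as nn

latent_dim = 100

generator = nn.Sequential(                # maps random numbers to a 28x28 image
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 28 * 28), nn.Tanh(),
)
discriminator = nn.Sequential(            # maps an image to "real or fake"
    nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

def training_step(real_images):            # real_images: (batch, 784), scaled to [-1, 1]
    batch = real_images.size(0)
    real_label = torch.ones(batch, 1)
    fake_label = torch.zeros(batch, 1)

    # 1) train the discriminator: real pictures -> 1, generated pictures -> 0
    fake_images = generator(torch.randn(batch, latent_dim))
    loss_d = bce(discriminator(real_images), real_label) + \
             bce(discriminator(fake_images.detach()), fake_label)
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # 2) train the generator: try to make the discriminator answer "real"
    loss_g = bce(discriminator(fake_images), real_label)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()

Each step first teaches the discriminator to tell real images from generated ones, and then updates the generator so that its output is more likely to be classified as real — the "min-max game" mentioned again in the Q&A below.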
Again we just have a network — we call it the generator — and we have two of those: one that converts horses to zebras and one that converts zebras to horses. And then we also have two discriminators that tell us: real horse, fake horse, real zebra, fake zebra. And then we again need to perform some training. So we need to somehow encode: did what we wanted to do work? A very simple way to do this is: we take a picture of a horse, put it through the generator that generates a zebra, take this fake picture of a zebra, and put it through the generator that generates a picture of a horse. If this is the same picture we put in, then our model worked. And if it isn't, we can use that information to update the weights. I just took a random picture of a horse from a free image library on the Internet and generated a zebra, and it worked remarkably well — I actually didn't even do the training. It also doesn't need to be a picture. You can also convert text to images: you describe something in words and generate images. You can age your face, or age a cell, or make a patient healthy or sick — or rather the image of a patient, not the patient themselves, unfortunately. You can do style transfer, like taking a painting by Van Gogh and applying its style to your own picture. Stuff like that.

Something else that we can do with neural networks: let's assume we have a classification network, we have a picture of a toothbrush, and the network tells us: well, this is a toothbrush. Great! But how resilient is this network? Does it really work in every scenario? There's a second network we can apply — we call it an adversarial network — and that network is trained to do one thing: look at the network, look at the picture, and then find the one weak spot in the picture. Just change one pixel slightly so that the network will tell me this toothbrush is an octopus. Works remarkably well. It also works by just changing the picture slightly — changing all the pixels, but with slight, minute changes that we don't perceive, while the classification network is completely thrown off. Well, that sounds bad. It is bad if you don't consider it. But you can also, for example, use this for training your network and make it resilient. So there's always an upside and a downside.

Something else entirely: now I'd like to show you something about text, a word-level language model. I want to generate sentences for my podcast.
I have a network that gives me a word, and if I want to somehow get the next word in the sentence, I also need to consider this word. So another network architecture — quite interestingly — just takes the hidden states of the network and uses them as input for the same network, so that in the next iteration we still know what we did in the previous step. I tried to train a network that generates podcast episodes for my podcast. Didn't work. What I learned is that I don't have enough training data. I really need to produce more podcast episodes in order to train a model to do my job for me.

And this is a very important, a very crucial point: training data. We need shitloads of training data. And actually, the more complicated our model and our training process become, the more training data we need. I started with the supervised case — the really simple case where we have a picture and a label that corresponds to that picture, or a representation of that picture encoding exactly what I want to learn. But we also saw a more complex task, where I had two kinds of pictures — horses and zebras — that are from two different domains, but domains with no direct mapping. What can also happen — and actually happens quite a lot — is weakly annotated data, data that is not precisely annotated, where we can't rely on the information we get. Or, even more complicated, something called reinforcement learning, where we perform a sequence of actions and only at the end are told "yeah, that was great" — which is often not enough information to really perform proper training. But of course there are also methods for that, as well as for the unsupervised case, where we don't have annotations, no labeled data, no ground truth at all — just the picture itself.

Well, I talked about pictures. I told you that we can learn features and create images from them, and that we can use them for classification. And for this there exist many databases. There are public data sets we can use. Often they refer to, for example, Flickr — they're just hyperlinks, which is also why I didn't show you many pictures right here, because I am honestly not sure about the copyright in those cases. But there are also challenge datasets where you can just sign up, get, for example, medical data sets, and then compete against other researchers. And of course there are those companies that just have lots of data.
And those companies also have the means, the capacity, to perform intense computations. Those are also often the companies you hear from in terms of innovation for deep learning. Well, this was mostly to tell you that you can process images quite well with deep learning if you have enough training data, if you have a proper training process, and also, a little, if you know what you're doing. But you can also process text, audio, and time series like prices or stock exchange data — stuff like that. You can process almost everything if you can encode it for your network. Sounds like a dream come true. But, as I already told you, you need data, a lot of it. I told you about those companies that have lots of data, and about the publicly available data sets which you can actually use to get started with your own experiments. But it is also a little dangerous, because deep learning still is a black box to us. I told you what happens inside the black box on a level that teaches you how we learn and how the network is structured, but not really what the network has learned. For us computer vision engineers it is really nice that we can visualize the first layers of a neural network and see what is actually encoded in those first layers, what information the network looks at. But you can't really mathematically prove what happens in a network, which is one major downside. So if you want to use it: the numbers may look really great, but be sure to properly evaluate them. In summary, I call that "easy to learn": every single one of you can just start with deep learning right away. You don't need to do much work, you don't need to do much learning — the model learns for you. But these models are hard to master in a way that makes them useful, for production use cases for example. So if you want to use deep learning for something — if you really want to seriously use it — make sure that it really does what you want it to do and doesn't learn something else, which also happens. I'm pretty sure you saw some talks about deep learning fails, which is not what this talk is about. They're quite funny to look at. Just make sure that they don't happen to you! If you do that, though, you'll achieve great things with deep learning, I'm sure. And that was the introduction to deep learning. Thank you!

*Applause*

Herald Angel: So now it's question and answer time. If you have a question, please line up at the mics.
We have eight in total, so one shouldn't be far from you. They are here in the corridors and on these sides. Please line up! For everybody: a question consists of one sentence with a question mark at the end — not three minutes of rambling. And if you go to the microphone, speak into the microphone, so you really get close to it. Okay. Where do we have… Number 7! We start with mic number 7.

Question: Hello. My question is: how did you compute the example with the fonts, the numbers? I didn't really understand it; you just said it was made from white noise.

Teubi: I'll give you a really brief recap of what I did. I showed you that we have a model that maps an image to some meaningful values, that an image can be encoded in just a few values. What happens here is exactly the other way round. We have some values, just some arbitrary values we actually know nothing about, and we can generate pictures out of those. So I trained this model to just take some random values and show the pictures generated from the model. The training process was this "min-max game", as it's called. We have two networks that compete against each other: one network trying to distinguish whether a picture it sees is real or one of those fake pictures, and the network that actually generates those pictures. And in training the network that is able to distinguish between those, we can also get information for the training of the network that generates the pictures. So the videos you saw were just animations of what happens during this training process. At first, if we input noise we get noise. But as the network becomes better and better at recreating the images from the dataset we used as input — in this case pictures of handwritten digits — the output also starts to look more and more like those numbers, those handwritten digits. Hope that helped.

Herald Angel: Now we go to the Internet. Can we get sound for the Signal Angel, please?

Teubi: Sounded so great, "now we go to the Internet."

Herald Angel: Yeah, that sounds like "yeeaah".

Signal Angel: And now we're finally ready to go to the interwebs. "Schorsch" is asking: do you have any recommendations for a beginner regarding the framework or the software?

Teubi: I am, of course, very biased to recommend what I use every day, but I also think that it is a great start. Basically, use Python and use PyTorch. Many people will disagree with me and tell you "TensorFlow is better."
It might be; in my opinion not for getting started, though. There are also some nice tutorials on the PyTorch website. What you can also do is look at websites like OpenAI, where they have a gym to get you started with some training exercises, where you already have datasets. So basically my recommendation is: get used to Python and start with a PyTorch tutorial, and see where to go from there. Often there are also some GitHub repositories linked, with many examples of already established network architectures like the CycleGAN, or the GAN itself, or basically everything else. There will be a repo you can use to get started.

Herald Angel: OK, we stay with the Internet. There are some more questions, I heard.

Signal Angel: Yes. Rubin8 is asking: have you ever come across an example of a neural network that deals with audio instead of images?

Teubi: Me personally, no. At least not directly. I've heard about examples, like where you can change a voice to sound like another person, but there is not much I can reliably tell you about that. My expertise really is in image processing, I'm sorry.

Herald Angel: And I think we have time for one more question. We have one at number 8. Microphone number 8.

Question: Is the current face recognition technology in, for example, the iPhone X also a deep learning algorithm, or is it something more simple? Do you have any idea about that?

Teubi: As far as I know, yes. That's all I can reliably tell you about it, but it is not only based on images; it also uses other information — I think distance information encoded with some infrared signals. I don't know exactly how it works, but iPhones already have a neural network processing engine built in, a chip dedicated to just doing those computations. You saw that many of those things can be parallelized, and this is what those hardware architectures make use of. So I'm pretty confident in saying, yes, they also do it there. How exactly, no clue.

Herald Angel: OK. I myself have a last, completely unrelated question: did you create the design of the slides yourself?

Teubi: I had some help. We have a really great Congress design, and I used that as an inspiration to create those slides, yes.

Herald Angel: OK, yeah, because those are really amazing. I love them.

Teubi: Thank you!

Herald Angel: OK, thank you very much, Teubi.
*35C3 outro music*

Subtitles created by c3subtitles.de in the year 2019. Join, and help us!