*35C3 preroll music*

Herald Angel: Welcome to our introduction to deep learning with Teubi. Deep learning, often also called machine learning, is a hype word which we hear in the media all the time. It's nearly as bad as blockchain. It's a solution for everything. Today we'll get a sneak peek into the internals of this mystical black box everyone is talking about. And Teubi will show us why people who know what machine learning is really about have to facepalm so often when they read the news. So please welcome Teubi with a big round of applause!

*Applause*

Teubi: Alright! Good morning and welcome to Introduction to Deep Learning. The title already tells you what this talk is about. I want to give you an introduction to how deep learning works, what happens inside this black box. But, first of all, who am I? I'm Teubi. It's a German nickname; it has nothing to do with toys or bees. You might have heard my voice before, because I host the Nussschale podcast, where I explain scientific topics in under 10 minutes. I'll have to use a little more time today, and you'll also have fancy animations which hopefully will help. In my day job I'm a research scientist at an institute for computer vision. I analyze microscopy images of bone marrow blood cells and try to find ways to teach the computer to understand what it sees. Namely, to differentiate between certain cells or, first of all, to find cells in an image, which is a task that is more complex than it might sound.

Let me start with the introduction to deep learning. We all know how to code. We code in a very simple way: we have some input for our computer algorithm, then we have an algorithm which says "do this, do that; if this, then that", and in that way we generate some output. This is not how machine learning works. Machine learning assumes you have some input, and you also have some output. What you also have is some statistical model. This statistical model is flexible. It has certain parameters, which it can learn from the distribution of inputs and outputs you give it for training. So you basically teach the statistical model to generate the desired output from the given input. Let me give you a really simple example of how this might work. Let's say we have two animals. Well, we have two kinds of animals: unicorns and rabbits. And now we want to find an algorithm that tells us whether the animal we have right now as an input is a rabbit or a unicorn.
We can write a simple algorithm to do that, but we can also do it with machine learning. The first thing we need is some input. I chose two features that are able to tell me whether this animal is a rabbit or a unicorn: speed and size. We call these features, and they describe something about what we want to classify. The class in this case is our animal. The first thing I need is some training data, some input. The input here is just pairs of speed and size. What I also need is information about the desired output, the desired output of course being the class: either unicorn or rabbit, here denoted by yellow and red X's. So let's try to find a statistical model which we can use to separate this feature space into two halves: one for the rabbits, one for the unicorns. Looking at this, we can actually find a really simple statistical model, and our statistical model in this case is just a straight line. The learning process is then to find where in this feature space the line should be. Ideally, for example, right in the middle between the two classes rabbit and unicorn.

Of course this is an overly simplified example. Real-world applications have feature distributions which look much more like this: we have a gradient, we don't have a perfect separation between those two classes, and those two classes are definitely not separable by a line. If we look again at some training samples — training samples are the data points we use for the machine learning process, that is, to find the parameters of our statistical model — and we look at the line again, then it will not be able to separate this training set. We will have a line that makes some errors: some unicorns which will be classified as rabbits, some rabbits which will be classified as unicorns. This is what we call underfitting. Our model is just not able to express what we want it to learn. There is also the opposite case: we just learn all the training samples by heart. This happens if we have a very complex model and just a few training samples to teach the model what it should learn. In this case we have a perfect separation of unicorns and rabbits, at least for the few data points we have. If we draw another example from the real world, some other data points, they will most likely be classified wrongly. And this is what we call overfitting.
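To make that concrete, here is a minimal sketch of the rabbit/unicorn example in Python. This is not code from the talk: the numbers are made up, and scikit-learn's LogisticRegression simply stands in for the "straight line" model.

import numpy as np
from sklearn.linear_model import LogisticRegression

# training data: one row per animal, the two features being (speed, size)
X = np.array([
    [2.0, 0.3], [3.5, 0.4], [1.5, 0.5], [4.0, 0.6],      # rabbits (slow, small)
    [60.0, 1.8], [55.0, 2.0], [70.0, 1.6], [65.0, 2.1],  # unicorns (fast, large)
])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # desired output: 0 = rabbit, 1 = unicorn

model = LogisticRegression()   # the "statistical model": a linear decision boundary
model.fit(X, y)                # the learning process: place the line in feature space

print(model.predict([[3.0, 0.4], [58.0, 1.9]]))  # -> [0 1], i.e. rabbit, unicorn

The fit call is the learning step: it places the line in the speed/size feature space, and predict just checks on which side of that line a new animal falls.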
The perfect scenario would be something like this: a classifier which is really close to the distribution we have in the real world. Machine learning is tasked with finding this perfect model and its parameters.

Let me show you a different kind of model, something you probably all have heard about: neural networks. Neural networks are inspired by the brain, or more precisely, by the neurons in our brain. Neurons are tiny objects, tiny cells in our brain, that take some input and generate some output. Sounds familiar, right? The inputs usually come in the form of electrical signals, and if they are strong enough, the neuron will also send out an electrical signal. This is something we can model in a computer-engineering way. So, what we do is: we take a neuron. The neuron is just a simple mapping from input to output. The input here is just three input nodes, which we denote by i1, i2 and i3, and the output is denoted by o. And now you will actually see some mathematical equations. There are not many of these in this foundations talk, don't worry, and it's really simple. There's one more thing we need first, though, if we want to map input to output the way a neuron does: the weights. The weights are just some arbitrary numbers for now; let's call them w1, w2 and w3. We take those weights and multiply them with the input: input 1 times weight 1, input 2 times weight 2, and so on. And this sum will just be our output. Well, not quite. We make it a little bit more complicated. We also use something called an activation function. The activation function is just a mapping from one scalar value to another scalar value — in this case from what we got as an output, the sum, to something that more closely fits what we need. This could for example be something binary, where all the negative numbers are mapped to zero and all the positive numbers are mapped to one. And then this zero and one can encode something, for example: rabbit or unicorn. So, let me give you an example of how we can make the previous example with the rabbits and unicorns work with such a simple neuron. We just use speed, size, and the arbitrarily chosen number 10 as our inputs, and the weights 1, 1, and -1. If we look at the equations, then we get a 0 for all negative numbers — so, speed plus size being less than 10 — and a 1 for all positive numbers — speed plus size being greater than 10. This way we again have a separating line between unicorns and rabbits.
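As a minimal sketch (not from the talk's slides, but using the same numbers), the neuron just described fits in a few lines of Python, with the binary step function as activation and the weights 1, 1 and -1:

def step(x):
    return 1 if x > 0 else 0          # activation: negative sum -> 0, positive sum -> 1

def neuron(inputs, weights):
    weighted_sum = sum(i * w for i, w in zip(inputs, weights))
    return step(weighted_sum)

def classify(speed, size):
    # inputs: speed, size and the constant 10; weights: 1, 1 and -1,
    # so the output is 1 exactly when speed + size > 10
    return neuron([speed, size, 10], [1, 1, -1])

print(classify(3, 1))    # 0 -> rabbit  (3 + 1 is not greater than 10)
print(classify(60, 2))   # 1 -> unicorn (60 + 2 is greater than 10)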
But again we have this really simplistic model. We want the model to become more and more complex in order to express more complex tasks. So what do we do? We take more neurons. We take our three input values and put them into one neuron, and into a second neuron, and into a third neuron. And we take the output of those three neurons as input for another neuron. We also call this a multilayer perceptron — perceptron just being a different name for the neuron we have there — and the whole thing is also called a neural network. So now the question: how do we train this? How do we learn what this network should encode? Well, we want a mapping from input to output, and what we can change are the weights. First, we take a training sample, some input, put it through the network, and get an output. But this might not be the desired output, which we know. In the binary case there are four possible combinations of computed output and expected output, each taking the values 0 and 1. The best case would be: we want a 0 and get a 0, or we want a 1 and get a 1. But there is also the opposite case, and in those two cases we can learn something about our model — namely, in which direction to change the weights. It's a little bit simplified, but in principle you just raise the weights if you need a higher number as output, and you lower the weights if you need a lower number as output. To tell you by how much, we have two terms. The first term is the error, in this case just the difference between the desired and the computed output — also often called a loss function, especially in deep learning and more complex applications. The second term is what we call the learning rate, and the learning rate tells us how quickly we should change the weights, how quickly we should adapt them. Okay, this is how we learn a model. This is almost everything you need to know. There are mathematical equations that tell you how much to change based on the error and the learning rate, and this is the entire learning process. Let's get back to the terminology. We have the input layer. We have the output layer, which somehow encodes our output, either in one value or in several values if we have multiple classes. We also have the hidden layers, which are actually what makes our model deep. What we can change, what we can learn, are the weights: the parameters of this model.
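Here is a minimal sketch of that training idea, assuming the classic perceptron-style rule "new weight = old weight + learning rate × error × input" (the data, the starting weights and the learning rate are made up for illustration):

def step(x):
    return 1 if x > 0 else 0

def train(samples, weights, learning_rate=0.01, epochs=100):
    for _ in range(epochs):                        # go over the training data repeatedly
        for inputs, desired in samples:
            computed = step(sum(i * w for i, w in zip(inputs, weights)))
            error = desired - computed             # the error term in this binary case
            # raise or lower each weight, scaled by the learning rate and the input
            weights = [w + learning_rate * error * i for w, i in zip(weights, inputs)]
    return weights

# (speed, size, constant 1 as a bias input) -> desired class: 0 = rabbit, 1 = unicorn
samples = [([3, 1, 1], 0), ([2, 0.5, 1], 0), ([60, 2, 1], 1), ([55, 1.8, 1], 1)]
print(train(samples, weights=[0.0, 0.0, 0.0]))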
But what we also need to keep in mind are the number of layers, the number of neurons per layer, the learning rate, and the activation function. These are called hyperparameters, and they determine how complex our model is and how well it is suited to solve the task at hand. I quite often spoke about solving tasks, so the question is: what can we actually do with neural networks? Mostly classification tasks, for example: tell me, is this animal a rabbit or a unicorn? Is this text message spam or legitimate? Is this patient healthy or ill? Is this image a picture of a cat or a dog? We already saw for the animal that we need something called features, which somehow encode information about what we want to classify, something we can use as input for the neural network — some kind of number that is meaningful. For the animal it could be speed, size, or something like color. Color, of course, is more complex again, because we have, for example, RGB, so three values. The text message is a more complex case again, because we somehow need to encode the sender, and whether the sender is legitimate. Same for the recipient, or the number of hyperlinks, or where the hyperlinks refer to, or whether certain words are present in the text. It gets more and more complicated. Even more so for a patient: how do we encode medical history in a proper way for the network to learn? I mean, temperature is simple — it's a scalar value, we just have a number. But how do we encode whether certain symptoms are present? And the image, which is actually what I work with every day, is again quite complex. We have values, we have numbers, but only pixel values, which are difficult to use as input for a neural network. Why? I'll show you. I'll actually show you with this picture. It's a very famous picture, and everybody uses it in computer vision. They will tell you it's because there is a multitude of different characteristics in this image: shapes, edges, whatever you desire. The truth is, it's a crop from the centerfold of the Playboy, and in earlier years computer vision engineers were a mostly male audience. Anyway, let's take five by five pixels. Let's assume this is a five-by-five-pixel image, a really small image. If we take those 25 pixels and use them as input for a neural network, you already see that we have many connections — many weights — which means a very complex model. A complex model, of course, is prone to overfitting.
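A quick back-of-the-envelope sketch of that parameter explosion — the layer sizes here are made up; the point is only how quickly the number of weights grows once every pixel is connected to every neuron of the next layer:

def fully_connected_weights(inputs, neurons):
    return inputs * neurons                      # one weight per input-to-neuron connection

print(fully_connected_weights(5 * 5, 10))            # tiny 5x5 image:        250
print(fully_connected_weights(256 * 256, 10))        # small grayscale photo: 655,360
print(fully_connected_weights(1920 * 1080 * 3, 10))  # one HD color frame:    62,208,000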
But there are more problems. The first is that we have disconnected a pixel from its neighbors. We can't encode information about the neighborhood anymore, and that really sucks. If we just take the whole picture and move it to the left or to the right by just one pixel, the network will see something completely different, even though to us it is exactly the same. But we can solve that with some very clever engineering, something we call a convolutional layer. It is again a hidden layer in a neural network, but it does something special. It actually is a very simple neuron again, just four input values and one output value. But the four input values look at two by two pixels and encode one output value. Then the same neuron is shifted to the right and encodes another pixel, and another pixel, and the next row of pixels, and in this way it creates another 2D image. We have preserved information about the neighborhood, and we just have a very low number of weights, not the huge number of parameters we saw earlier. We can use this once, or twice, or several hundred times. And this is actually where we go deep. Deep means we have several layers, and having layers that don't need thousands or millions of connections, but only a few, is what allows us to go really deep. In this fashion we can encode an entire image in just a few meaningful values. What these values look like, and what they encode, is learned through the learning process. We can then, for example, use these few values as input for a classification network — the fully connected network we saw earlier. Or we can do something more clever. We can do the inverse operation and create an image again, for example the same image, which is then called an autoencoder. Autoencoders are tremendously useful, even though they don't appear that way. For example, imagine you want to check whether something has a defect or not — a picture of a fabric, or of something similar. You just train the network with normal pictures. Then, if you have a defective picture, the network is not able to reproduce the defect, and so the difference between the reproduced picture and the real picture will show you where the errors are. If it works properly, I have to admit. But we can go even further. Let's say we want to encode something else entirely. Well, let's encode the image, the information in the image, but in another representation.
For example, let's say we have three classes again: the background class in grey, a class called hat or headwear in blue, and person in green. We can also use this for other applications than just pictures of humans. For example, we have a picture of a street and want to encode: where is the car, where is the pedestrian? Tremendously useful. Or we have an MRI scan of a brain: where in the brain is the tumor? Can we somehow learn this? Yes, we can do this with methods like these, if they are trained properly. More about that later. Well, we expect something like this to come out, but the truth looks rather like this — especially if it's not properly trained. We don't get the real shape we want, but something distorted. So here is again where we need to do learning. First we take a picture, put it through the network, and get our output representation. And we have the information about how we want it to look. We again compute some kind of loss value, this time for example being the overlap between the shape we get out of the model and the shape we want to have. And we use this error, this loss function, to update the weights of our network. Again — even though it's more complicated here, even though we have more layers, and even though the layers look slightly different — it is the same process all over again as with the binary case. And we need lots of training data. This is something that you'll hear often in connection with deep learning: you need lots of training data to make this work. Images are complex things, and in order to meaningfully extract knowledge from them, the network needs to see a multitude of different images.

Well, now I have already shown you some building blocks we use in network architectures: the fully convolutional encoder, which takes an image and produces a few meaningful values out of this image, and its counterpart, the fully convolutional decoder — fully convolutional meaning, by the way, that we only have these convolutional layers with a few parameters that somehow encode spatial information and keep it for the next layers. The decoder takes a few meaningful numbers and reproduces an image — either the same image or another representation of the information encoded in the image. We also already saw the fully connected network, fully connected meaning every neuron is connected to every neuron in the next layer. This of course can be dangerous, because this is where we actually get most of our parameters.
If we have a fully connected network, this is where most of the parameters will be present, because connecting every node to every node simply means a high number of connections. We can also do other things, for example something called a pooling layer. A pooling layer is basically the same as one of those convolutional layers, except that we don't have parameters we need to learn. This works without parameters because the neuron just chooses whichever value is the highest and takes that value as output. This is really great for reducing the size of your image and also for getting rid of information that might not be that important. We can also use some clever techniques like adding a dropout layer. A dropout layer is just a normal layer in a neural network where we remove some connections: in one training step these connections, in the next training step some other connections. This way we teach the remaining connections to become more resilient against errors.

I would like to start with something I call the "Model Show" now, and show you some models and how we train them. I will start with the fully convolutional decoder we saw earlier: this thing that takes a number and creates a picture. I would like to take this model, put in some number, and get out a picture — a picture of a horse, for example. If I put in a different number, I also want to get a picture of a horse, but of a different horse. So what I want is a mapping from some numbers, some features that encode something about the horse picture, to a horse picture. You might already see why this is problematic. It is problematic because we don't have a mapping from features to horses or from horses to features. So we don't have a truth value we can use to learn how to generate this mapping. Well, computer vision engineers — or deep learning professionals — are smart and have clever ideas. Let's just assume we have such a network and let's call it a generator. Let's take some numbers, put them into the generator, and get some horses. Well, it doesn't work yet — we still have to train it. So there are probably not only horses but also some very special unicorns among the horses, which might be nice for other applications, but I wanted pictures of horses right now. So I can't train with this data directly. But what I can do is create a second network.
This network is called a discriminator, and I can give it the input generated by the generator as well as the real data I have: the real horse pictures. And then I can teach the discriminator to distinguish between those — tell me whether it is a real horse or not a real horse. There I know what the truth is, because I either take real horse pictures or fake horse pictures from the generator. So I have a truth value for this discriminator. But in doing this I also have a truth value for the generator, because I want the generator to work against the discriminator. So I can also use the information about how well the discriminator does to train the generator to become better at fooling it. This is called a generative adversarial network, and it can be used to generate pictures of an arbitrary distribution.

Let's do this with numbers, and I will actually show you the training process. Before I start the video, I'll tell you what I did. I took some handwritten digits — there is a database of handwritten digits called MNIST, so the digits 0 to 9 — and used them as training data. I trained a generator in the way I showed you on the previous slide, and then I just took some random numbers, put those random numbers into the network, and stored the image of what came out of the network. Here in the video you'll see how the network improved with ongoing training. You will see that we start basically with just noisy images, and then after some epochs — training iterations — the network is able to almost perfectly generate handwritten digits just from noise. Which I find truly fascinating. Of course this is an example where it works. It highly depends on your data set and on how you train the model whether it is a success or not. But if it works, you can use it to generate fonts, characters, 3D objects, pictures of animals — whatever you want, as long as you have training data.

Let's go more crazy. Let's take two of those, and let's say we have pictures of horses and pictures of zebras. I want to convert those pictures of horses into pictures of zebras, and I want to convert pictures of zebras into pictures of horses. So I want to have the same picture, just with the other animal. But I don't have training data of the same situation once with a horse and once with a zebra. Doesn't matter. We can train a network that does that for us.
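Before the horses and zebras, here is a minimal PyTorch sketch of the generator-versus-discriminator training loop just described. This is not the code behind the demo; the layer sizes, learning rates and the flattened 28x28 MNIST-style images are illustrative assumptions.

import torch
import torch.nn as nn

latent_dim = 100

generator = nn.Sequential(                # maps random numbers to a 28x28 image
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 28 * 28), nn.Tanh(),
)
discriminator = nn.Sequential(            # maps an image to "real or fake"
    nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCELoss()

def training_step(real_images):            # real_images: (batch, 784), scaled to [-1, 1]
    batch = real_images.size(0)
    real_label = torch.ones(batch, 1)
    fake_label = torch.zeros(batch, 1)

    # 1) train the discriminator: real pictures -> 1, generated pictures -> 0
    fake_images = generator(torch.randn(batch, latent_dim))
    loss_d = bce(discriminator(real_images), real_label) + \
             bce(discriminator(fake_images.detach()), fake_label)
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # 2) train the generator: try to make the discriminator answer "real"
    loss_g = bce(discriminator(fake_images), real_label)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()

Each step first teaches the discriminator to tell real images from generated ones, and then updates the generator so that its output is more likely to be classified as real — the "min-max game" mentioned again in the Q&A below.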
Again we just have a network — we call it the generator — and we have two of those: one that converts horses to zebras and one that converts zebras to horses. And then we also have two discriminators that tell us: real horse, fake horse, real zebra, fake zebra. And then we again need to perform some training. So we need to somehow encode: did what we wanted to do work? A very simple way to do this is: we take a picture of a horse, put it through the generator that generates a zebra, take this fake picture of a zebra, and put it through the generator that generates a picture of a horse. If this is the same picture we put in, then our model worked. And if it isn't, we can use that information to update the weights. I just took a random picture of a horse from a free image library on the Internet and generated a zebra, and it worked remarkably well — I actually didn't even do the training. It also doesn't need to be a picture. You can also convert text to images: you describe something in words and generate images. You can age your face, or age a cell, or make a patient healthy or sick — or rather the image of a patient, not the patient themselves, unfortunately. You can do style transfer, like taking a painting by Van Gogh and applying its style to your own picture. Stuff like that.

Something else that we can do with neural networks: let's assume we have a classification network, we have a picture of a toothbrush, and the network tells us: well, this is a toothbrush. Great! But how resilient is this network? Does it really work in every scenario? There's a second network we can apply — we call it an adversarial network — and that network is trained to do one thing: look at the network, look at the picture, and then find the one weak spot in the picture. Just change one pixel slightly so that the network will tell me this toothbrush is an octopus. Works remarkably well. It also works by just changing the picture slightly — changing all the pixels, but with slight, minute changes that we don't perceive, while the classification network is completely thrown off. Well, that sounds bad. It is bad if you don't consider it. But you can also, for example, use this for training your network and make it resilient. So there's always an upside and a downside.

Something else entirely: now I'd like to show you something about text, a word-level language model. I want to generate sentences for my podcast.
I have a network that gives me a word, and if I want to somehow get the next word in the sentence, I also need to consider this word. So another network architecture — quite interestingly — just takes the hidden states of the network and uses them as input for the same network, so that in the next iteration we still know what we did in the previous step. I tried to train a network that generates podcast episodes for my podcast. Didn't work. What I learned is that I don't have enough training data. I really need to produce more podcast episodes in order to train a model to do my job for me.

And this is a very important, a very crucial point: training data. We need shitloads of training data. And actually, the more complicated our model and our training process become, the more training data we need. I started with the supervised case — the really simple case where we have a picture and a label that corresponds to that picture, or a representation of that picture encoding exactly what I want to learn. But we also saw a more complex task, where I had two kinds of pictures — horses and zebras — that are from two different domains, but domains with no direct mapping. What can also happen — and actually happens quite a lot — is weakly annotated data, data that is not precisely annotated, where we can't rely on the information we get. Or, even more complicated, something called reinforcement learning, where we perform a sequence of actions and only at the end are told "yeah, that was great" — which is often not enough information to really perform proper training. But of course there are also methods for that, as well as for the unsupervised case, where we don't have annotations, no labeled data, no ground truth at all — just the picture itself.

Well, I talked about pictures. I told you that we can learn features and create images from them, and that we can use them for classification. And for this there exist many databases. There are public data sets we can use. Often they refer to, for example, Flickr — they're just hyperlinks, which is also why I didn't show you many pictures right here, because I am honestly not sure about the copyright in those cases. But there are also challenge datasets where you can just sign up, get, for example, medical data sets, and then compete against other researchers. And of course there are those companies that just have lots of data.
And those companies also have the means, the capacity, to perform intense computations. Those are also often the companies you hear from in terms of innovation for deep learning. Well, this was mostly to tell you that you can process images quite well with deep learning if you have enough training data, if you have a proper training process, and also, a little, if you know what you're doing. But you can also process text, audio, and time series like prices or stock exchange data — stuff like that. You can process almost everything if you can encode it for your network. Sounds like a dream come true. But, as I already told you, you need data, a lot of it. I told you about those companies that have lots of data, and about the publicly available data sets which you can actually use to get started with your own experiments. But it is also a little dangerous, because deep learning still is a black box to us. I told you what happens inside the black box on a level that teaches you how we learn and how the network is structured, but not really what the network has learned. For us computer vision engineers it is really nice that we can visualize the first layers of a neural network and see what is actually encoded in those first layers, what information the network looks at. But you can't really mathematically prove what happens in a network, which is one major downside. So if you want to use it: the numbers may look really great, but be sure to properly evaluate them. In summary, I call that "easy to learn": every single one of you can just start with deep learning right away. You don't need to do much work, you don't need to do much learning — the model learns for you. But these models are hard to master in a way that makes them useful, for production use cases for example. So if you want to use deep learning for something — if you really want to seriously use it — make sure that it really does what you want it to do and doesn't learn something else, which also happens. I'm pretty sure you saw some talks about deep learning fails, which is not what this talk is about. They're quite funny to look at. Just make sure that they don't happen to you! If you do that, though, you'll achieve great things with deep learning, I'm sure. And that was the introduction to deep learning. Thank you!

*Applause*

Herald Angel: So now it's question and answer time. If you have a question, please line up at the mics.
We have eight in total, so one shouldn't be far from you. They are here in the corridors and on these sides. Please line up! For everybody: a question consists of one sentence with a question mark at the end — not three minutes of rambling. And if you go to the microphone, speak into the microphone, so you really get close to it. Okay. Where do we have… Number 7! We start with mic number 7.

Question: Hello. My question is: how did you compute the example with the fonts, the numbers? I didn't really understand it; you just said it was made from white noise.

Teubi: I'll give you a really brief recap of what I did. I showed you that we have a model that maps an image to some meaningful values, that an image can be encoded in just a few values. What happens here is exactly the other way round. We have some values, just some arbitrary values we actually know nothing about, and we can generate pictures out of those. So I trained this model to just take some random values and show the pictures generated from the model. The training process was this "min-max game", as it's called. We have two networks that compete against each other: one network trying to distinguish whether a picture it sees is real or one of those fake pictures, and the network that actually generates those pictures. And in training the network that is able to distinguish between those, we can also get information for the training of the network that generates the pictures. So the videos you saw were just animations of what happens during this training process. At first, if we input noise we get noise. But as the network becomes better and better at recreating the images from the dataset we used as input — in this case pictures of handwritten digits — the output also starts to look more and more like those numbers, those handwritten digits. Hope that helped.

Herald Angel: Now we go to the Internet. Can we get sound for the Signal Angel, please?

Teubi: Sounded so great, "now we go to the Internet."

Herald Angel: Yeah, that sounds like "yeeaah".

Signal Angel: And now we're finally ready to go to the interwebs. "Schorsch" is asking: do you have any recommendations for a beginner regarding the framework or the software?

Teubi: I am, of course, very biased to recommend what I use every day, but I also think that it is a great start. Basically, use Python and use PyTorch. Many people will disagree with me and tell you "TensorFlow is better."
It might be; in my opinion not for getting started, though. There are also some nice tutorials on the PyTorch website. What you can also do is look at websites like OpenAI, where they have a gym to get you started with some training exercises, where you already have datasets. So basically my recommendation is: get used to Python and start with a PyTorch tutorial, and see where to go from there. Often there are also some GitHub repositories linked, with many examples of already established network architectures like the CycleGAN, or the GAN itself, or basically everything else. There will be a repo you can use to get started.

Herald Angel: OK, we stay with the Internet. There are some more questions, I heard.

Signal Angel: Yes. Rubin8 is asking: have you ever come across an example of a neural network that deals with audio instead of images?

Teubi: Me personally, no. At least not directly. I've heard about examples, like where you can change a voice to sound like another person, but there is not much I can reliably tell you about that. My expertise really is in image processing, I'm sorry.

Herald Angel: And I think we have time for one more question. We have one at number 8. Microphone number 8.

Question: Is the current face recognition technology in, for example, the iPhone X also a deep learning algorithm, or is it something more simple? Do you have any idea about that?

Teubi: As far as I know, yes. That's all I can reliably tell you about it, but it is not only based on images; it also uses other information — I think distance information encoded with some infrared signals. I don't know exactly how it works, but iPhones already have a neural network processing engine built in, a chip dedicated to just doing those computations. You saw that many of those things can be parallelized, and this is what those hardware architectures make use of. So I'm pretty confident in saying, yes, they also do it there. How exactly, no clue.

Herald Angel: OK. I myself have a last, completely unrelated question: did you create the design of the slides yourself?

Teubi: I had some help. We have a really great Congress design, and I used that as an inspiration to create those slides, yes.

Herald Angel: OK, yeah, because those are really amazing. I love them.

Teubi: Thank you!

Herald Angel: OK, thank you very much, Teubi.
*35C3 outro music*

Subtitles created by c3subtitles.de in the year 2019. Join, and help us!