1 00:00:00,000 --> 00:00:20,510 *36C3 preroll music* 2 00:00:20,510 --> 00:00:24,750 Daniel: Good morning! I'm glad you all made it here this early on the last day. I 3 00:00:24,750 --> 00:00:32,439 know it can't be easy, it wasn't easy for me. I have to warn you that the way I 4 00:00:32,439 --> 00:00:36,160 prepared for this talk is a bit experimental. I didn't make a slide set, I 5 00:00:36,160 --> 00:00:44,559 just made a mind map and I'll just click through it while I talk to you. So, 6 00:00:44,559 --> 00:00:51,180 this talk is about modernizing Wikipedia. As you have probably noticed, visiting 7 00:00:51,180 --> 00:00:58,500 Wikipedia can feel a bit like visiting a website from 10-15 years ago, but before I 8 00:00:58,500 --> 00:01:05,280 talk about any problems or things to improve, I first want to revisit that the 9 00:01:05,280 --> 00:01:11,619 software and the infrastructure we build around it has been running Wikipedia 10 00:01:11,619 --> 00:01:20,160 and its sister sites for the last... well nearly 19 years now and it's extremely 11 00:01:20,160 --> 00:01:32,200 successful. We serve 17 billion page views a month, yes? 12 00:01:32,200 --> 00:01:40,870 Person in the audience: Could you make it louder or speak up and also make the image 13 00:01:40,870 --> 00:01:42,870 bigger? 14 00:01:42,870 --> 00:01:43,870 *inaudible dialogue* 15 00:01:43,870 --> 00:01:45,870 Daniel: Is this better? Like if I speak up I will lose my voice in 10 minutes, it's 16 00:01:45,870 --> 00:01:55,720 already in it, no it's fine. We have technology for this. I can... the light 17 00:01:55,720 --> 00:02:05,490 doesn't help, yeah the contrast could be better. Is it better like this? Okay cool. 18 00:02:05,490 --> 00:02:13,840 All right so yeah we are serving 17 billion page views a month, which is quite 19 00:02:13,840 --> 00:02:19,560 a lot. Wikipedia exists in about 100 languages. 
If you attended the talk about 20 00:02:19,560 --> 00:02:24,250 the Wikimedia infrastructure yesterday, we talked about 300 languages. We actually 21 00:02:24,250 --> 00:02:29,989 support 300 languages for localization but we have Wikipedia in about 100, if I'm not 22 00:02:29,989 --> 00:02:38,689 completely off. I find this picture quite fascinating. This is a visualization of 23 00:02:38,689 --> 00:02:43,719 all the places in the world that are described on Wikipedia and sister projects 24 00:02:43,719 --> 00:02:49,319 and I find this quite impressive although it's also a nice display of cultural bias 25 00:02:49,319 --> 00:03:00,790 of course. We, that is Wikimedia Foundation, run about 900 to 1000 wikis 26 00:03:00,790 --> 00:03:06,680 depending on how you count, but there are many many more MediaWiki installations 27 00:03:06,680 --> 00:03:11,459 out there, some of them big and many many of them small. We have actually no idea 28 00:03:11,459 --> 00:03:17,150 how many small instances there are. So it's a very powerful very flexible and 29 00:03:17,150 --> 00:03:23,730 versatile piece of software but, you know, but sometimes it can feel like... you can do a 30 00:03:23,730 --> 00:03:28,329 lot of things with it, right, but sometimes it feels like it's a bit 31 00:03:28,329 --> 00:03:42,180 overburdened and maybe you should look at improving the foundations. So one of the 32 00:03:42,180 --> 00:03:47,829 things that make MediaWiki great but also sometimes hard to use is that kind of 33 00:03:47,829 --> 00:03:52,609 everything is text, everything is markup, everything is done with wikitext, 34 00:03:52,609 --> 00:04:02,529 which has grown in complexity over the years so if you look at the anatomy of a 35 00:04:02,529 --> 00:04:09,159 wiki page it can be a bit daunting. 
You have different syntax for markup and 36 00:04:09,159 --> 00:04:16,150 different kinds of transclusion or templates and media and some things 37 00:04:16,150 --> 00:04:21,739 actually, you know, get displayed in place, some things show up in a completely 38 00:04:21,739 --> 00:04:26,340 different place on the page. It can be rather confusing and daunting for 39 00:04:26,340 --> 00:04:31,720 newcomers. And also things like having a conversation just talking to people like, 40 00:04:31,720 --> 00:04:35,540 you know, having a conversation thread looks like this. You open the page you 41 00:04:35,540 --> 00:04:40,510 look through the markup and you indent to make a conversation thread and then you 42 00:04:40,510 --> 00:04:43,480 get confused about the indenting and someone messes with the formatting and 43 00:04:43,480 --> 00:04:52,120 it's all excellent. There have been many attempts over the years to improve the 44 00:04:52,120 --> 00:05:00,290 situation, we have things like Echo which notifies you, for instance when someone 45 00:05:00,290 --> 00:05:09,130 mentions your name or someone... It is also used to welcome people and do this 46 00:05:09,130 --> 00:05:12,400 kind of achievement unlocked notifications: hey, you did your first 47 00:05:12,400 --> 00:05:19,900 edit, this is great welcome! To make people a bit more engaged with the system 48 00:05:19,900 --> 00:05:24,380 but it's really mostly improvements around the fringes. We have had a system called 49 00:05:24,380 --> 00:05:31,350 Flow for a while to improve the way conversations work. 
So you have more like 50 00:05:31,350 --> 00:05:37,960 a thread structure that the software actually knows about but then there are 51 00:05:37,960 --> 00:05:42,160 many, well quite a few people who have been around for a while that are very used 52 00:05:42,160 --> 00:05:46,900 to the manual system and also there's a lot of tools to support this manual system 53 00:05:46,900 --> 00:05:52,780 which of course are incompatible with making things more modern. So we use this 54 00:05:52,780 --> 00:05:56,250 for instance on MediaWiki.org which is a site which is basically a self 55 00:05:56,250 --> 00:06:03,000 documentation site of MediaWiki but on most Wikipedias this is not enabled or at 56 00:06:03,000 --> 00:06:14,530 least not used by default everywhere. The biggest attempt to move away from the text 57 00:06:14,530 --> 00:06:23,050 only approach is Wikidata, which we started in 2012. The idea of Wikidata of 58 00:06:23,050 --> 00:06:29,580 course, if you didn't attend the many great talks we had about it here over the 59 00:06:29,580 --> 00:06:36,470 course of the Congress, is a way to basically model the world using structured 60 00:06:36,470 --> 00:06:45,470 data, using a semantic approach instead of natural language which has its own 61 00:06:45,470 --> 00:06:50,740 complexities but at least it's a way to represent the knowledge of the world in a 62 00:06:50,740 --> 00:06:56,790 way that machines can understand. So this would be an alternative to wikitext but 63 00:06:56,790 --> 00:07:09,389 still the vast majority of things especially on Wikipedia are just markup. 64 00:07:09,389 --> 00:07:13,800 And this markup is pretty powerful and there's lots of ways to extend it and to 65 00:07:13,800 --> 00:07:21,050 do things with it. So a lot of things on MediaWiki are just DIY, just do it 66 00:07:21,050 --> 00:07:29,250 yourself. Templates are a great example of this. 
Infoboxes of course, the nice blue 67 00:07:29,250 --> 00:07:34,730 boxes here on the right side of pages, are done using templates but these templates 68 00:07:34,730 --> 00:07:39,090 are just for formatting, there is no data processing, there's no database or 69 00:07:39,090 --> 00:07:47,530 structured data backing them. It's just basically, you know, it's still just 70 00:07:47,530 --> 00:07:56,630 markup. It's still... you have a predefined layout but you're still feeding it text, not 71 00:07:56,630 --> 00:08:04,520 data. You have parameters but the values of the parameters are still again maybe 72 00:08:04,520 --> 00:08:11,610 templates or links or you have markup in them, like you know HTML line breaks and 73 00:08:11,610 --> 00:08:18,860 stuff. So it's kind of semi structured. And this of course is also used to do 74 00:08:18,860 --> 00:08:24,100 things like workflow. The template... Oh no, this was actually an infobox, wrong 75 00:08:24,100 --> 00:08:34,229 picture, wrong capture. This is also used to do workflows, so if a page on Wikipedia 76 00:08:34,229 --> 00:08:39,789 gets nominated for deletion you manually put a template on the page that defines 77 00:08:39,789 --> 00:08:44,870 why this is supposed to be deleted and then you have to go to a different page 78 00:08:44,870 --> 00:08:49,390 and put a different template there, giving more explanation and this again is used 79 00:08:49,390 --> 00:08:55,149 for discussion. It's a lot of structure created by the community and maintained by 80 00:08:55,149 --> 00:09:02,730 the community, using conventions and tools built on top of what is essentially just a 81 00:09:02,730 --> 00:09:10,620 pile of markup. And because doing all this manually is kind of painful, early on 82 00:09:10,620 --> 00:09:17,360 we created a system to allow people to add JavaScript to the site, which is then 83 00:09:17,360 --> 00:09:27,019 maintained on wiki pages by the community and it can tweak and automate. 
But again, 84 00:09:27,019 --> 00:09:30,589 it doesn't really have much to work with, right? It basically messes with whatever 85 00:09:30,589 --> 00:09:35,470 it can, it directly interacts with the DOM of the page, so whenever the layout of the 86 00:09:35,470 --> 00:09:41,040 software changes, things break. So this is not great for compatibility but it's 87 00:09:41,040 --> 00:09:54,730 used a lot and it is very important for the community to have this power. Sorry, I 88 00:09:54,730 --> 00:10:00,110 wish there was a better way to show these pictures. Okay, that's just to give you an 89 00:10:00,110 --> 00:10:05,220 idea of what kind of thing is implemented that way and maintained by the community 90 00:10:05,220 --> 00:10:10,189 on their site. One of the problems we have with that is: these are bound to a wiki 91 00:10:10,189 --> 00:10:19,410 and I just told you that we run over 900 of these, not over 9,000, and it would be 92 00:10:19,410 --> 00:10:26,300 great if you could just share them between wikis but we can't. And again, there have 93 00:10:26,300 --> 00:10:30,790 been... we have been talking about it a lot and it seems like it shouldn't be so 94 00:10:30,790 --> 00:10:36,759 hard, but you kind of need to write these tools differently, if you want to share 95 00:10:36,759 --> 00:10:39,899 them across sites, because different sites use different conventions, they use 96 00:10:39,899 --> 00:10:45,529 different templates. Then it just doesn't work and you actually have to write decent 97 00:10:45,529 --> 00:10:50,970 software that uses internationalization if you want to use it across wikis. While 98 00:10:50,970 --> 00:10:55,019 these are usually just, you know, one-off hacks with everything hard-coded, we would 99 00:10:55,019 --> 00:10:58,450 have to put in place an internationalization system and it's 100 00:10:58,450 --> 00:11:02,910 actually a lot of effort and there's a lot of things that are actually unclear about 101 00:11:02,910 --> 00:11:15,260 it. 
So, before I dive more deeply into the different things that make it hard to 102 00:11:15,260 --> 00:11:20,529 improve on the current situation and the things that we are doing to improve it, do 103 00:11:20,529 --> 00:11:27,309 we have any questions or do you have any other - do you have any things you may 104 00:11:27,309 --> 00:11:34,519 find particularly, well, annoying or particularly outdated, when interacting 105 00:11:34,519 --> 00:11:40,920 with Wikipedia? Any thoughts on that? Beyond what I just said? 106 00:11:40,920 --> 00:11:48,769 Microphone: The strict separation, just in Wikipedia, between mobile layout and 107 00:11:48,769 --> 00:11:54,259 desktop layout. Daniel: Yeah. So, actually having a 108 00:11:54,259 --> 00:12:02,069 reactive layout system that would just work for mobile and desktop in the same 109 00:12:02,069 --> 00:12:09,130 way and allowing the designers and UX experts, who work on the system to just do 110 00:12:09,130 --> 00:12:15,180 this once and not two or maybe even three times - because of course we also have 111 00:12:15,180 --> 00:12:20,550 native applications for different platforms - would be great and it's 112 00:12:20,550 --> 00:12:24,360 something that we're looking into at the moment. But it's not, you know, it's not 113 00:12:24,360 --> 00:12:29,519 that easy. We could build a completely new system that does this but then again you 114 00:12:29,519 --> 00:12:33,249 would be telling people: "You can no longer use the old system", but now they 115 00:12:33,249 --> 00:12:39,019 have built all these tools that rely on how the old system works and you have to 116 00:12:39,019 --> 00:12:52,089 port all of this over so there's a lot of inertia. Any other thoughts? Everyone is 117 00:12:52,089 --> 00:13:03,720 still asleep, that's excellent. So I can continue. 
So, another thing that makes it 118 00:13:03,720 --> 00:13:10,879 difficult to change how MediaWiki works or to improve it is that we are trying to do, 119 00:13:10,879 --> 00:13:19,180 well, to be at least two things at once: on the one hand we are running a top 5 120 00:13:19,180 --> 00:13:24,360 website and serving over 100,000 requests per second using the system and on the 121 00:13:24,360 --> 00:13:30,540 other hand, at least until now, we have always made sure that you can just 122 00:13:30,540 --> 00:13:33,800 download MediaWiki and install it on a shared hosting platform. You don't even 123 00:13:33,800 --> 00:13:38,920 need root on the system, right? You don't even need administrative privileges, you 124 00:13:38,920 --> 00:13:44,769 can just set it up and run it in your web space and it will work. And, having the 125 00:13:44,769 --> 00:13:51,779 same piece of software do both, run in a minimal environment and run at scale, is 126 00:13:51,779 --> 00:13:55,040 rather difficult and it also means that there's a lot of things that we can't 127 00:13:55,040 --> 00:14:02,110 easily do, right? All this modern microservice architecture, separate front-end 128 00:14:02,110 --> 00:14:09,309 and back-end systems, all of that means that it's a lot more complicated to set up 129 00:14:09,309 --> 00:14:15,720 and needs more knowledge or more infrastructure to set up and so far that 130 00:14:15,720 --> 00:14:19,500 meant we can't do it, because so far there was this requirement that you should 131 00:14:19,500 --> 00:14:23,569 really be able to just run it on your shared hosting. And we are currently 132 00:14:23,569 --> 00:14:29,639 considering to what extent we can continue this, I mean, container based hosting is 133 00:14:29,639 --> 00:14:34,620 picking up. Maybe this is an alternative, it's still unclear but it seems like this 134 00:14:34,620 --> 00:14:45,999 is something that we need to reconsider. 
Yeah, but if we make this harder to do 135 00:14:45,999 --> 00:14:52,739 then a lot of current users of MediaWiki would maybe not, well, maybe no longer 136 00:14:52,739 --> 00:14:57,230 exist or at least would not exist as they do now, right. You probably have seen 137 00:14:57,230 --> 00:15:05,259 this nice MediaWiki instance the Congress wiki. Which - with a completely customized 138 00:15:05,259 --> 00:15:09,689 skin and a lot of extensions installed to allow people to define their sessions 139 00:15:09,689 --> 00:15:14,410 there and making sure these sessions automatically get listed and get put into 140 00:15:14,410 --> 00:15:20,660 a calendar - this is all done using extensions, like Semantic MediaWiki, that 141 00:15:20,660 --> 00:15:34,279 allow you to basically define queries in the wiki text markup. Yeah, another thing 142 00:15:34,279 --> 00:15:42,079 that, of course, slows down development is that Wikimedia does engineering on a, 143 00:15:42,079 --> 00:15:48,130 well, comparatively a shoestring budget, right? The budget of the Wikimedia 144 00:15:48,130 --> 00:15:52,199 Foundation, the annual budget is something like a hundred million dollars, that 145 00:15:52,199 --> 00:15:58,009 sounds like a lot of money, but if you compare it to other companies running a 146 00:15:58,009 --> 00:16:03,209 top five or top ten website it's like two percent of their budget or something like 147 00:16:03,209 --> 00:16:10,769 that, right? It's really, I mean, 100 million is not peanuts but compared to 148 00:16:10,769 --> 00:16:16,699 what other companies invest to achieve this kind of goal it kind of is, so , what 149 00:16:16,699 --> 00:16:22,230 this budget translates into is something like 300, depending on how you count, 150 00:16:22,230 --> 00:16:28,800 between three hundred and four hundred staff. 
So, this is the people who run all 151 00:16:28,800 --> 00:16:32,189 of this, including all the community outreach all the social aspects all the 152 00:16:32,189 --> 00:16:40,920 administrative aspects. Less than half of these are the engineers who do all this. 153 00:16:40,920 --> 00:16:50,989 And we have like, something like 2,500 servers, bare-metal, so, which is not a 154 00:16:50,989 --> 00:16:57,619 lot for this kind of thing. Which also means that we have to design the software 155 00:16:57,619 --> 00:17:07,079 to be not just scalable but also quite efficient. The modern approach to scaling 156 00:17:07,079 --> 00:17:11,640 is usually scale horizontally make it so you can just spin up another virtual 157 00:17:11,640 --> 00:17:19,280 machine in some cloud service, but, yeah, we run our own service, we run our own 158 00:17:19,280 --> 00:17:24,440 servers, so we can design to scale horizontally, but it means ordering 159 00:17:24,440 --> 00:17:32,070 hardware and setting it up and it's going to take half a year or so. And we don't 160 00:17:32,070 --> 00:17:38,390 actually have that many people who do this, so, scalability and performance are 161 00:17:38,390 --> 00:17:49,000 also important factors when designing the software. Okay. Before I dive into what we 162 00:17:49,000 --> 00:18:03,860 are actually doing - any questions? This one in the back. Wait for the mic, please. 163 00:18:03,860 --> 00:18:07,330 In the very... Q: Hi! 164 00:18:07,330 --> 00:18:12,950 Daniel: Hello. Q: So, you said you don't have that many 165 00:18:12,950 --> 00:18:22,990 people, but how many do you actually have? Daniel: For... it's something like 150 engineers 166 00:18:22,990 --> 00:18:27,170 worldwide. It always depends on what you count, right? 
So you count the people, who 167 00:18:27,170 --> 00:18:32,260 - do you count engineers, who work on the native apps, do you count engineers, who 168 00:18:32,260 --> 00:18:36,980 work on the Wikimedia cloud services - actually we do have cloud services, we 169 00:18:36,980 --> 00:18:41,190 offer them to the community to run their own things, but we don't run our stuff on 170 00:18:41,190 --> 00:18:45,560 other people's cloud. Yeah, so depending on how you count or something and whether 171 00:18:45,560 --> 00:18:50,210 you count the people working here in Germany for Wikimedia Germany, which is a 172 00:18:50,210 --> 00:18:57,760 separate organization technically - it's something like 150 engineers. 173 00:18:57,760 --> 00:19:08,210 Q: Thanks! Q: I'm interested: What are the reasons 174 00:19:08,210 --> 00:19:13,880 that you don't run on other people's services like on the cloud? I mean, then 175 00:19:13,880 --> 00:19:17,090 it will be easy to scale horizontally, right? 176 00:19:17,090 --> 00:19:25,330 Daniel: There's, well, one reason is being independent, right? If we, yeah, imagine 177 00:19:25,330 --> 00:19:32,350 we ran all our stuff on Amazon's infrastructure and then maybe Amazon 178 00:19:32,350 --> 00:19:38,060 doesn't like the way that the Wikipedia article about Amazon is written - what do 179 00:19:38,060 --> 00:19:42,050 we do, right? Maybe they shut us down, maybe they make things very expensive, 180 00:19:42,050 --> 00:19:47,360 maybe they make things very painful for us, maybe there is at least some kind of 181 00:19:47,360 --> 00:19:54,070 self-censorship mechanism happening and we want to avoid that. 
There are 182 00:19:54,070 --> 00:19:58,440 thoughts about this, there are thoughts like maybe we can do this at least for 183 00:19:58,440 --> 00:20:04,270 development infrastructure and CI, not for production or maybe we can make it so that 184 00:20:04,270 --> 00:20:12,200 we run stuff in the cloud services by more than one vendor, so we basically spread 185 00:20:12,200 --> 00:20:17,860 out so we are not reliant on a single company. We are thinking about these 186 00:20:17,860 --> 00:20:21,820 things but so far the way to actually stay independent has been to run our own 187 00:20:21,820 --> 00:20:28,300 servers. Q: You've been talking about scalability 188 00:20:28,300 --> 00:20:35,490 and changing the architecture, that kind of seems to imply to me that there's a 189 00:20:35,490 --> 00:20:42,270 problem with scaling at the moment or that it's foreseeable that things are not gonna 190 00:20:42,270 --> 00:20:46,580 work out if you just keep doing what you're doing at the moment. Can you maybe 191 00:20:46,580 --> 00:20:52,480 elaborate on that? Daniel: So, there's, I think there's two sides 192 00:20:52,480 --> 00:20:56,850 to this. On the one hand the reason I mentioned it is just that a lot of things 193 00:20:56,850 --> 00:21:01,610 that are really easy to do basically for me, right, "works on my machine", are really 194 00:21:01,610 --> 00:21:08,920 hard to do if you want to do them at scale. That's one aspect. The other aspect 195 00:21:08,920 --> 00:21:16,670 is MediaWiki is pretty much a PHP monolith and that means scaling it always means 196 00:21:16,670 --> 00:21:23,680 copying the monolith, rather than breaking it down so you have smaller units that you can 197 00:21:23,680 --> 00:21:29,040 scale and just say, yeah, I don't know, I need more instances for authentication 198 00:21:29,040 --> 00:21:33,910 handling or something like that. 
That would be more efficient, right, because 199 00:21:33,910 --> 00:21:40,730 you have higher granularity, you can just scale the things that you actually need 200 00:21:40,730 --> 00:21:47,530 but that of course needs rearchitecting. It's not like things are going to explode 201 00:21:47,530 --> 00:21:52,910 if we don't do that very soon, it's not, so there's not like an urgent problem 202 00:21:52,910 --> 00:21:58,400 there. The reason for us to rearchitect is more, to gain more flexibility in 203 00:21:58,400 --> 00:22:03,330 development, because if you have a monolith that is pretty entangled, code 204 00:22:03,330 --> 00:22:16,130 changes are risky and take a long time. Q: How many people work on product design 205 00:22:16,130 --> 00:22:25,460 or like user experience research to, like, sit down with users and try to understand 206 00:22:25,460 --> 00:22:28,440 what their needs are and from there proceed? 207 00:22:28,440 --> 00:22:33,230 Daniel: Across... I don't have an exact number, something like five. 208 00:22:33,230 --> 00:22:37,930 Audience: Do you think that's sufficient? Herald: The question was, whether it's 209 00:22:37,930 --> 00:22:46,800 sufficient. So just... Daniel: Probably not? But it's more than, 210 00:22:46,800 --> 00:22:50,310 that's more people than we have for database administration, and that's also 211 00:22:50,310 --> 00:23:06,040 not sufficient. Herald: Are there further questions? I don't 212 00:23:06,040 --> 00:23:16,270 think so. Daniel: Okay. So, one of the things, that 213 00:23:16,270 --> 00:23:20,320 holds us back a bit, is that there's literally thousands of extensions for 214 00:23:20,320 --> 00:23:26,870 MediaWiki and the extension mechanism is heavily reliant on hooks, so basically on 215 00:23:26,870 --> 00:23:39,600 callbacks. And, we have - I don't have a picture, I have a link here - we have a 216 00:23:39,600 --> 00:23:44,500 great number of these. 
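For readers who have not worked with hook-based extension systems, here is a minimal sketch of the general pattern in Python. It is not MediaWiki's actual code (MediaWiki is written in PHP); the `HookRegistry` class, the `ExamplePageSave` hook name, and both handlers are hypothetical, and the rule that a handler returning `False` aborts the operation is an assumption made for this illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

# A handler receives the page being saved and returns False to abort the save.
Handler = Callable[[Dict[str, str]], bool]


class HookRegistry:
    """Hypothetical registry mapping a hook name to its registered callbacks."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def register(self, hook_name: str, handler: Handler) -> None:
        self._handlers[hook_name].append(handler)

    def run(self, hook_name: str, page: Dict[str, str]) -> bool:
        # Call handlers in registration order; stop as soon as one returns False.
        for handler in self._handlers[hook_name]:
            if handler(page) is False:
                return False
        return True


def block_empty_pages(page: Dict[str, str]) -> bool:
    # Extension A (hypothetical): refuse to save pages with no content.
    return page["text"].strip() != ""


def add_footer(page: Dict[str, str]) -> bool:
    # Extension B (hypothetical): modify the page text before it is saved.
    page["text"] += "\n-- footer added by extension"
    return True


hooks = HookRegistry()
hooks.register("ExamplePageSave", block_empty_pages)
hooks.register("ExamplePageSave", add_footer)

page = {"title": "Sandbox", "text": "Hello"}
saved = hooks.run("ExamplePageSave", page)
print(saved, repr(page["text"]))
```

The point of the sketch is that the hook name and the shape of the argument handed to the handlers form an interface: once external extensions rely on them, the core can no longer change either one freely.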
So, you see, each paragraph is basically documenting one 217 00:23:44,500 --> 00:23:51,970 callback that you can use to modify the behavior of the software and, I mean, 218 00:23:51,970 --> 00:23:59,240 there's, I have never counted, but something like a thousand? And all of them 219 00:23:59,240 --> 00:24:07,520 are of course interfaces to extra - to software that is maintained externally, so 220 00:24:07,520 --> 00:24:12,611 they have to be kept stable and if you have a large chunk of software that you 221 00:24:12,611 --> 00:24:16,730 want to restructure but you have a thousand fixed points that you can't 222 00:24:16,730 --> 00:24:22,960 change, things become rather difficult. It's basi.. yeah, these hook points kind 223 00:24:22,960 --> 00:24:27,640 of, like, they act like nails in the architecture and then you kind of have to 224 00:24:27,640 --> 00:24:36,650 wiggle around them - it's fun. We are working to change that. We want to 225 00:24:36,650 --> 00:24:43,950 architect it so the interface that is exposed to these hooks becomes much more 226 00:24:43,950 --> 00:24:51,360 narrow and the things that these hooks or these callback functions can do are much 227 00:24:51,360 --> 00:24:58,690 more restricted. There's currently an RFC open for this, has been open for a while 228 00:24:58,690 --> 00:25:04,690 actually. The problem is that in order to assess whether the proposal is actually 229 00:25:04,690 --> 00:25:11,530 viable you have to survey all the current users of these hooks and make sure that 230 00:25:11,530 --> 00:25:15,660 the use cases are still covered in the new system and, yeah, we have like a 231 00:25:15,660 --> 00:25:21,030 thousand hook points and we have like a thousand extensions, that's quite a bit of 232 00:25:21,030 --> 00:25:31,060 work. Another thing that I'm currently working on is establishing a stable 233 00:25:31,060 --> 00:25:36,990 interface policy. 
This may sound pretty obvious - it has a lot of pretty obvious 234 00:25:36,990 --> 00:25:42,430 things like, yeah, if you have a class and there's a public method then that's a 235 00:25:42,430 --> 00:25:46,410 stable interface, it will not just change without notice, we have a deprecation policy 236 00:25:46,410 --> 00:25:53,020 and all that. But if you have worked with extensible systems that rely on the 237 00:25:53,020 --> 00:25:58,350 mechanisms of object-oriented programming, you may have come across the question 238 00:25:58,350 --> 00:26:05,040 whether a protected method is part of this stable interface of the software or not, 239 00:26:05,040 --> 00:26:10,010 or maybe the constructor? I don't know, if you have worked in environments that use 240 00:26:10,010 --> 00:26:15,860 dependency injection the idea is basically that the constructor signature should be 241 00:26:15,860 --> 00:26:21,270 able to change at any time but then you have extensions that are subclassing and 242 00:26:21,270 --> 00:26:25,640 things break. So, this is why we are trying to establish a much more 243 00:26:25,640 --> 00:26:32,750 restrictive stable interface policy, that would make explicit things like 244 00:26:32,750 --> 00:26:36,650 constructor signatures actually not being stable and that gives us a lot more wiggle 245 00:26:36,650 --> 00:26:51,030 room to restructure the software. MediaWiki itself has grown as a software 246 00:26:51,030 --> 00:26:58,750 for the last 18 years or so and, at least in the beginning, was mostly created by 247 00:26:58,750 --> 00:27:06,330 volunteers. And in a monolithic architecture there's a great tendency to 248 00:27:06,330 --> 00:27:11,070 just, you know, find and grab the thing that you want to use and just use it. 249 00:27:11,070 --> 00:27:19,100 Which leads to, well, structures like this one: everything depends on everything. And 250 00:27:19,100 --> 00:27:26,360 if you change one bit of code everything else may or may not break. And with, yeah. 
And with, yeah. 251 00:27:26,360 --> 00:27:31,350 And if you don't have great test coverage at the same time this just makes it so 252 00:27:31,350 --> 00:27:35,312 that any change becomes very risky and you have to do a lot of manual testing a lot 253 00:27:35,312 --> 00:27:43,690 of manual digging around, touching a lot of files and we are for the last year, 254 00:27:43,690 --> 00:27:50,510 year and a half we have started a concerted effort to tie the worst - to cut 255 00:27:50,510 --> 00:27:57,760 the worst ties, to decouple these things that are, basically that have most impact 256 00:27:57,760 --> 00:28:03,320 there's a few objects in the software that rep... - for instance one that represents 257 00:28:03,320 --> 00:28:08,280 the user and one that represents a title that are used everywhere and the way 258 00:28:08,280 --> 00:28:14,240 they're implemented currently also means that they depend on everything and that of 259 00:28:14,240 --> 00:28:29,620 course is not a good situation. On a, well, a similar idea on a higher level is 260 00:28:29,620 --> 00:28:34,400 decomposition of the software so the decoupling was about the software 261 00:28:34,400 --> 00:28:39,990 architecture this is about the system architecture breaking up the 262 00:28:39,990 --> 00:28:45,490 monolith itself into multiple services that serve different purposes. The specifics of 263 00:28:45,490 --> 00:28:50,281 this diagram are not really relevant to this talk. This is more to, you know, give 264 00:28:50,281 --> 00:28:57,710 you an impression of the complexity and the sort of work we are doing there. 
The 265 00:28:57,710 --> 00:29:05,580 idea is that perhaps we could split out certain functionality into its own service 266 00:29:05,580 --> 00:29:11,160 into a separate application, like maybe move all the search functionality into 267 00:29:11,160 --> 00:29:17,150 something separate and self-contained, but then the question is how do you, again, 268 00:29:17,150 --> 00:29:23,280 compose this into the final user interface - at some point these things have to get 269 00:29:23,280 --> 00:29:28,420 composed together again - and again this is a very trivial issue if you 270 00:29:28,420 --> 00:29:32,470 only want this to work on your machine or you only need to serve a 271 00:29:32,470 --> 00:29:39,680 hundred users or something. But doing this at scale, doing it at the rate of something 272 00:29:39,680 --> 00:29:45,230 like 10,000 page views a second, I said a hundred thousand requests earlier but that 273 00:29:45,230 --> 00:29:51,790 includes resources, icons, CSS and all that. So, yeah, then you have to think 274 00:29:51,790 --> 00:29:58,470 pretty hard about what you can cache and, thank you, how you can recombine things 275 00:29:58,470 --> 00:30:02,760 without having to recompute everything and this is something that we are currently 276 00:30:02,760 --> 00:30:08,580 looking into - coming up with an architecture that allows us to compose and 277 00:30:08,580 --> 00:30:23,220 recombine the output of different background services. Okay. Before I 278 00:30:23,220 --> 00:30:27,600 started this talk I said I would probably roughly use half of my time going through 279 00:30:27,600 --> 00:30:33,310 the presentation and I guess I just hit that spot on. So, this is all I have 280 00:30:33,310 --> 00:30:41,070 prepared but I'm happy to talk to you more about the things I said or maybe any other 281 00:30:41,070 --> 00:30:48,050 aspects of this that you may be interested in. Any comments or questions? Oh! 
282 00:30:48,050 --> 00:30:56,800 Three already. Q: First of all thanks a lot for the 283 00:30:56,800 --> 00:31:03,150 presentation, such a really interesting case of a legacy system and thanks for the 284 00:31:03,150 --> 00:31:10,130 honesty. It was really interesting as a, you know, software engineer to see how 285 00:31:10,130 --> 00:31:15,101 that works. I have a question about decoupling, so, I mean, I kind of, you 286 00:31:15,101 --> 00:31:23,190 have like, probably your system is enormous and how do you find, so to say, 287 00:31:23,190 --> 00:31:29,100 the most evil, you know, parts which sort of have to be decoupled? Do you use other 288 00:31:29,100 --> 00:31:34,820 software, with, you know, like, metrics and stuff or do you just know, 289 00:31:34,820 --> 00:31:38,370 kind of intuitively.. Daniel: Yeah, it's actually, this is quite 290 00:31:38,370 --> 00:31:44,970 interesting and maybe I can, maybe we can talk about it a bit more in depth later. 291 00:31:44,970 --> 00:31:49,020 Very quickly: it's a combination. On the one hand you just have the anecdotal 292 00:31:49,020 --> 00:31:53,280 experience of what is actually annoying when you work with the software and you 293 00:31:53,280 --> 00:31:59,111 try to fix it and on the other hand I try to find good tooling for this and the 294 00:31:59,111 --> 00:32:05,440 existing tooling tends to die when you just run it against our code base. So, one 295 00:32:05,440 --> 00:32:09,930 of the things that you are looking for are cyclic dependencies but the number of 296 00:32:09,930 --> 00:32:15,080 possible cycles in a graph grows exponentially with the number of nodes. And 297 00:32:15,080 --> 00:32:17,710 if you have a pretty tightly knit graph that number quickly goes into the 298 00:32:17,710 --> 00:32:26,580 millions. And, yeah, the tool just goes to 100% CPU and never returns. 
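To illustrate why listing every cycle blows up while a coarser question stays cheap, here is a minimal Python sketch that groups classes into strongly connected components using Tarjan's algorithm, which runs in time linear in the number of nodes and edges. This is an illustration only, not the heuristic or tooling used on the MediaWiki code base; the dependency map below is invented, with `User` and `Title` borrowed from the talk as example class names.

```python
from typing import Dict, List, Set


def strongly_connected_components(graph: Dict[str, List[str]]) -> List[Set[str]]:
    """Tarjan's algorithm: group nodes that can all reach each other."""
    index_of: Dict[str, int] = {}
    lowlink: Dict[str, int] = {}
    stack: List[str] = []
    on_stack: Set[str] = set()
    components: List[Set[str]] = []
    counter = 0

    def visit(node: str) -> None:
        # Recursive for brevity; a very deep graph would need an iterative version.
        nonlocal counter
        index_of[node] = lowlink[node] = counter
        counter += 1
        stack.append(node)
        on_stack.add(node)
        for dep in graph.get(node, []):
            if dep not in index_of:
                visit(dep)
                lowlink[node] = min(lowlink[node], lowlink[dep])
            elif dep in on_stack:
                lowlink[node] = min(lowlink[node], index_of[dep])
        if lowlink[node] == index_of[node]:
            # node is the root of a component: pop its members off the stack.
            component: Set[str] = set()
            while True:
                member = stack.pop()
                on_stack.discard(member)
                component.add(member)
                if member == node:
                    break
            components.append(component)

    for node in graph:
        if node not in index_of:
            visit(node)
    return components


def tangled_groups(graph: Dict[str, List[str]]) -> List[Set[str]]:
    # Components with more than one member contain at least one dependency cycle.
    # (A class that only depends on itself is ignored in this sketch.)
    return [c for c in strongly_connected_components(graph) if len(c) > 1]


# Invented example: "A depends on B" is written as A -> [B].
dependencies = {
    "User": ["Title"],
    "Title": ["Database", "User"],
    "Database": [],
    "Parser": ["Title"],
}
print(tangled_groups(dependencies))
```

In this example `User` and `Title` end up in one group because each depends on the other, while `Parser` and `Database` stay outside it. The result says which classes are entangled without enumerating each individual cycle, which is the part that grows exponentially.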
So, I spent 299 00:32:26,580 --> 00:32:33,600 quite a bit of time trying to find heuristics to get around that - was a lot 300 00:32:33,600 --> 00:32:41,550 of fun. I can, yeah, we can talk about that later, if you like. Okay, thanks. 301 00:32:41,550 --> 00:32:49,221 Q: So what exactly is this Wikidata you mentioned before? Is it like an extension 302 00:32:49,221 --> 00:32:55,580 or is it a completely different project? Daniel: Wiki - so there's an extension called 303 00:32:55,580 --> 00:33:04,630 Wikibase, that implements this, well I would say, ontological modeling interface 304 00:33:04,630 --> 00:33:11,980 for MediaWiki and that is used to run a website called Wikidata which has 305 00:33:11,980 --> 00:33:19,500 something like 30 million items modeled that describe the world and serve as a 306 00:33:19,500 --> 00:33:25,610 machine-readable data back-end to other wiki projects, other Wikimedia projects. 307 00:33:25,610 --> 00:33:32,890 Yeah, I used to work on that project for Wikimedia Germany. I moved on to do 308 00:33:32,890 --> 00:33:41,150 different things now for a couple of years. Lukas here in front is probably the 309 00:33:41,150 --> 00:33:50,190 person most knowledgeable about the latest and greatest in the Wikidata development. 310 00:33:50,190 --> 00:33:56,240 Q: You've briefly talked about test coverage. I would be interested.. 311 00:33:56,240 --> 00:33:58,650 Daniel: Sorry? Q: You talked about test coverage. 312 00:33:58,650 --> 00:34:02,010 Daniel: Yes. Q: I would be interested in if you amped 313 00:34:02,010 --> 00:34:07,660 your efforts to help you modernize it and how your current situation is with test 314 00:34:07,660 --> 00:34:11,809 coverage. Daniel: Test coverage for MediaWiki core is below 315 00:34:11,809 --> 00:34:21,809 50%. In some parts it's below 10% which is very worrying.
One thing that we started 316 00:34:21,809 --> 00:34:30,050 to look into, like half a year ago, is instead of writing unit tests for all the 317 00:34:30,050 --> 00:34:36,010 code that we actually want to throw away, before we touch it, we tried to improve 318 00:34:36,010 --> 00:34:40,900 the test coverage using integration tests on the API level. So we are currently in 319 00:34:40,900 --> 00:34:48,240 the process of writing a suite of tests, not just for the API modules, but for all 320 00:34:48,240 --> 00:34:54,540 the functionality, all the application logic behind the API. And that will 321 00:34:54,540 --> 00:35:01,070 hopefully cover most of the relevant code paths and will give us confidence when we 322 00:35:01,070 --> 00:35:12,420 refactor the code. Q: Thanks. 323 00:35:12,420 --> 00:35:26,280 Herald: Other questions? Q: So you said that you have this legacy 324 00:35:26,280 --> 00:35:32,240 system and eventually you have to move away from it but are there any, like, I 325 00:35:32,240 --> 00:35:39,820 don't know, plans for the near future to, I don't know. At some point you have to 326 00:35:39,820 --> 00:35:47,310 cut the current infrastructure to your extensions and so on and it's a hard cut, I 327 00:35:47,310 --> 00:35:53,330 see. But are there any plans to build it up from scratch or what are the plans? 328 00:35:53,330 --> 00:35:58,060 Daniel: Yeah, we are not going to rewrite from scratch - that's a pretty surefire way to 329 00:35:58,060 --> 00:36:05,370 just kill the system. We will have to make some tough decisions about backwards 330 00:36:05,370 --> 00:36:11,340 compatibility and probably reconsider some of the requirements and constraints we 331 00:36:11,340 --> 00:36:17,100 have, well, with respect to the platforms we run on and also the platforms we serve.
332 00:36:17,100 --> 00:36:21,130 One of the things that we have been very careful to do in the past for instance is 333 00:36:21,130 --> 00:36:26,530 to make sure that you can do pretty much everything with MediaWiki with no 334 00:36:26,530 --> 00:36:32,800 JavaScript on the client side. And that requirement is likely to drop. You will 335 00:36:32,800 --> 00:36:40,010 still be able to read of course, without any JavaScript or anything, but the extent 336 00:36:40,010 --> 00:36:45,910 of functionality you will have without JavaScript on the client side is likely to 337 00:36:45,910 --> 00:36:51,140 be greatly reduced - that kind of thing. Also we will probably end up breaking 338 00:36:51,140 --> 00:36:57,660 compatibility with at least some of the user-created tools. Hopefully we can offer 339 00:36:57,660 --> 00:37:02,390 good alternatives, good APIs, good libraries that people can actually port 340 00:37:02,390 --> 00:37:11,070 to, that are less brittle. I hope that will motivate people and maybe repay them 341 00:37:11,070 --> 00:37:15,950 a bit for the pain of having their tool broken, if we can give them something that 342 00:37:15,950 --> 00:37:21,119 is more stable, more reliable, and hopefully even nicer to use. Yeah, so, 343 00:37:21,119 --> 00:37:25,930 it's small increments, bits, and pieces all over the system; there's no, you know, 344 00:37:25,930 --> 00:37:32,550 no great master plan, no big change to point to really. 345 00:37:32,550 --> 00:37:45,470 Herald: Okay, okay, further questions? Daniel: I plan to just sit outside here at 346 00:37:45,470 --> 00:37:54,800 the table later if you just want to come and chat so we can also do that there. 347 00:37:54,800 --> 00:38:01,250 Herald: Okay, so, last call: are there any other questions? It does not appear so, 348 00:38:01,250 --> 00:38:08,110 so, I'd like to ask for a huge applause for Daniel for this talk.
349 00:38:08,110 --> 00:38:12,627 *Applause* 350 00:38:12,627 --> 00:38:14,730 *36C3 postroll music* 351 00:38:14,730 --> 00:38:38,320 Subtitles created by c3subtitles.de in the year 2020. Join, and help us!