1 00:00:06,303 --> 00:00:07,362 (Lydia) Thank you so much. 2 00:00:07,362 --> 00:00:11,244 So, this conference, one of the big themes is languages. 3 00:00:14,220 --> 00:00:18,508 I want to give you an overview of where we actually are currently 4 00:00:18,508 --> 00:00:19,812 when it comes to languages 5 00:00:20,264 --> 00:00:22,167 and where we can go from here. 6 00:00:29,036 --> 00:00:32,580 Wikidata is all about giving more people more access to more knowledge, 7 00:00:32,580 --> 00:00:37,168 and language is such an important part of making that a reality, 8 00:00:38,205 --> 00:00:43,291 especially since more and more of our lives depends on technology. 9 00:00:44,114 --> 00:00:48,873 And as our keynote speaker earlier today was talking, 10 00:00:49,723 --> 00:00:51,588 some of the technology leaves people behind 11 00:00:51,588 --> 00:00:55,020 simply because they can't speak a certain language, 12 00:00:55,320 --> 00:00:57,573 and that's not okay. 13 00:00:58,633 --> 00:01:02,097 So we want to do something about that. 14 00:01:02,927 --> 00:01:05,841 And in order to change that, you need at least two things. 15 00:01:06,411 --> 00:01:11,270 One is you need to provide content to the people in their language, 16 00:01:11,270 --> 00:01:12,955 and the second thing you need 17 00:01:12,955 --> 00:01:15,910 is to provide them with interaction in their language 18 00:01:15,910 --> 00:01:19,189 in those applications or whatever it is you have. 19 00:01:20,367 --> 00:01:25,277 And Wikidata helps with both of those. 20 00:01:25,277 --> 00:01:28,408 And the first thing, *content in your language*, 21 00:01:28,408 --> 00:01:30,879 that is basically what we have in items and properties, 22 00:01:31,319 --> 00:01:33,082 how we describe the world. 23 00:01:33,082 --> 00:01:35,085 Now, this is certainly not everything you need, 24 00:01:35,085 --> 00:01:39,294 but it gets you quite far ahead. 
25 00:01:39,764 --> 00:01:41,847 The other thing is *interaction in your language*, 26 00:01:41,847 --> 00:01:46,389 and that's where lexemes come into play, 27 00:01:46,389 --> 00:01:49,382 if you want to talk to your digital personal assistant 28 00:01:49,382 --> 00:01:54,918 or if you want to have your device translate a text and things like that. 29 00:01:56,404 --> 00:01:59,254 Alright, let's look into *content in your language.* 30 00:01:59,254 --> 00:02:03,396 So what we have in *items* and *properties*. 31 00:02:05,406 --> 00:02:09,696 For this, the labels in those items and properties are crucial. 32 00:02:10,236 --> 00:02:14,866 We need to know what this entity is called that we're talking about. 33 00:02:15,656 --> 00:02:19,987 And instead of talking about Q5, 34 00:02:19,987 --> 00:02:22,180 someone who speaks English knows that's a "human," 35 00:02:22,180 --> 00:02:24,706 someone who speaks German knows that's a "Mensch," 36 00:02:24,706 --> 00:02:26,374 and similar things. 37 00:02:26,374 --> 00:02:29,742 So those labels on items and properties 38 00:02:29,742 --> 00:02:33,619 are bridging the gap between humans and machines. 39 00:02:33,619 --> 00:02:35,439 And humans and humans 40 00:02:35,439 --> 00:02:40,115 making more existing knowledge accessible to them. 41 00:02:43,270 --> 00:02:46,290 Now, that's a nice aspiration. 42 00:02:46,290 --> 00:02:48,342 What does it actually look like? 43 00:02:48,342 --> 00:02:49,607 It looks like this. 44 00:02:50,947 --> 00:02:52,416 What you're seeing here 45 00:02:52,416 --> 00:02:58,496 is that most of the items on Wikidata have two labels, 46 00:02:58,496 --> 00:03:00,767 so labels in two languages. 47 00:03:01,697 --> 00:03:03,851 And after that, it's one, and then three, 48 00:03:03,851 --> 00:03:06,115 and then it becomes very sad. 49 00:03:06,781 --> 00:03:08,581 (quiet laughter) 50 00:03:10,047 --> 00:03:12,713 I think we need to do better than this.
51 00:03:14,185 --> 00:03:15,319 But, on the other hand, 52 00:03:15,319 --> 00:03:17,478 I was actually expecting this to be even worse. 53 00:03:17,478 --> 00:03:19,560 I was expecting the average to be one. 54 00:03:19,560 --> 00:03:22,503 So I was quite happy to see two. (chuckles) 55 00:03:24,921 --> 00:03:26,186 Alright. 56 00:03:27,156 --> 00:03:29,527 But it's not just interesting to know 57 00:03:29,527 --> 00:03:33,742 how many labels our items and properties have. 58 00:03:33,742 --> 00:03:36,565 It's also interesting to see in which languages. 59 00:03:38,045 --> 00:03:43,764 Here you see a graph of the languages 60 00:03:43,764 --> 00:03:46,838 that we have labels for on *Items*. 61 00:03:46,838 --> 00:03:50,669 So the biggest part there is *Other*. 62 00:03:51,229 --> 00:03:53,863 So I just took the top 100 languages 63 00:03:54,533 --> 00:03:58,902 and everything else is *Other* to make this graph readable. 64 00:03:59,542 --> 00:04:02,142 And then there's English and Dutch, 65 00:04:03,002 --> 00:04:04,254 French, 66 00:04:05,924 --> 00:04:09,129 and not to forget, Asturian. 67 00:04:09,659 --> 00:04:11,889 - (person 1) Whoo! - Whoo-hoo, yes! 68 00:04:13,899 --> 00:04:16,954 So what you see here is quite an imbalance 69 00:04:16,954 --> 00:04:20,114 and still quite a lot of focus on English. 70 00:04:21,236 --> 00:04:24,367 Another thing is if you look at the same thing for *Properties*, 71 00:04:24,367 --> 00:04:25,999 it's actually looking better. 72 00:04:27,399 --> 00:04:32,750 And I think part of that is due to there just being way fewer properties. 73 00:04:32,750 --> 00:04:36,770 So even smaller communities have a chance to keep up with that. 74 00:04:36,770 --> 00:04:39,173 But it's also a pretty important part of Wikidata 75 00:04:39,173 --> 00:04:41,159 to localize into your language. 76 00:04:41,159 --> 00:04:42,384 So that's good.
77 00:04:45,752 --> 00:04:47,842 What I want to highlight here with Asturian 78 00:04:47,842 --> 00:04:53,698 is that a small community can really make a huge difference 79 00:04:54,448 --> 00:04:57,085 with some dedication and work, 80 00:04:57,085 --> 00:04:58,420 and that's really cool. 81 00:05:01,846 --> 00:05:03,530 A small quiz for you. 82 00:05:03,530 --> 00:05:05,493 If you take all the properties on Wikidata 83 00:05:05,493 --> 00:05:07,687 that are not external identifiers, 84 00:05:07,687 --> 00:05:10,358 which one has the most labels, like the most languages? 85 00:05:10,977 --> 00:05:13,847 (audience) [inaudible] 86 00:05:13,847 --> 00:05:16,786 I hear some agreement on *instance of*? 87 00:05:17,506 --> 00:05:19,443 You would be wrong. 88 00:05:19,983 --> 00:05:22,210 It's *image*. (chuckles) 89 00:05:23,230 --> 00:05:26,366 So, yeah, that tells you, if you speak one of the languages 90 00:05:26,366 --> 00:05:28,621 where *instance of* doesn't yet have a label, 91 00:05:28,621 --> 00:05:30,190 you might want to add it. 92 00:05:32,102 --> 00:05:35,676 So it has 148 labels currently. 93 00:05:37,688 --> 00:05:41,249 But that's just another slide. 94 00:05:42,631 --> 00:05:44,162 This graph tells us something 95 00:05:44,162 --> 00:05:49,321 about how much content we are making available in a certain language 96 00:05:49,321 --> 00:05:52,042 and how much of that content is actually used. 97 00:05:52,042 --> 00:05:55,448 So what you're seeing is basically a curve 98 00:05:55,448 --> 00:06:00,987 with most content having English labels, being available in English, 99 00:06:01,507 --> 00:06:04,295 and being used a lot. 100 00:06:04,295 --> 00:06:06,449 And then it kind of goes down. 101 00:06:06,449 --> 00:06:09,436 But, again, what you can see are outliers 102 00:06:09,436 --> 00:06:15,333 who have a lot more content than you would necessarily expect, 103 00:06:16,903 --> 00:06:19,539 and that is really, really good. 
104 00:06:20,839 --> 00:06:24,945 The problem still is it's not used a lot. 105 00:06:25,565 --> 00:06:28,742 Asturian and Dutch should be higher, 106 00:06:28,742 --> 00:06:31,994 and I think helping those communities 107 00:06:33,266 --> 00:06:35,563 increase the use of the data they collected 108 00:06:35,563 --> 00:06:37,682 is a really useful thing to do. 109 00:06:42,910 --> 00:06:48,110 What this analysis and others showed us, which is also a good thing, 110 00:06:48,300 --> 00:06:51,378 is that highly used items 111 00:06:51,378 --> 00:06:55,295 also tend to have more labels 112 00:06:55,295 --> 00:06:58,188 or the other way around-- it's not entirely clear. 113 00:07:02,513 --> 00:07:04,376 And then the question is, 114 00:07:04,806 --> 00:07:07,009 are we serving just the powerful languages? 115 00:07:07,899 --> 00:07:11,147 Or are we serving everyone? 116 00:07:12,757 --> 00:07:17,743 And what you see here is a grouping of languages. 117 00:07:17,743 --> 00:07:21,832 The languages that are grouped together tend to have labels together. 118 00:07:26,042 --> 00:07:28,599 And you see it clustering. 119 00:07:28,599 --> 00:07:34,065 Now here's a similar clustering, colored, 120 00:07:34,065 --> 00:07:39,475 based on how alive, how used, 121 00:07:40,455 --> 00:07:43,156 how endangered the language is. 122 00:07:43,156 --> 00:07:44,642 And a good thing you're seeing here 123 00:07:44,642 --> 00:07:49,566 is that safe languages and endangered languages 124 00:07:49,566 --> 00:07:53,773 do not form two different clusters. 125 00:07:53,773 --> 00:07:58,872 But they're all mixed together, 126 00:08:00,262 --> 00:08:04,625 which is much better than it would be the other way around 127 00:08:04,625 --> 00:08:09,377 where the safe languages, the powerful languages 128 00:08:10,197 --> 00:08:12,164 are just helping each other out. 129 00:08:12,744 --> 00:08:14,356 No, that's not the case.
130 00:08:14,356 --> 00:08:17,417 And it's a really good thing. 131 00:08:17,417 --> 00:08:20,042 When I saw this, I thought this was very good. 132 00:08:23,474 --> 00:08:25,169 Here's a similar thing 133 00:08:26,239 --> 00:08:28,800 where we looked at 134 00:08:30,230 --> 00:08:34,222 the languages' status 135 00:08:34,222 --> 00:08:36,225 and how many labels it has. 136 00:08:39,367 --> 00:08:42,937 What you're seeing is a clear win for safe languages, 137 00:08:42,937 --> 00:08:44,248 as is expected. 138 00:08:45,508 --> 00:08:46,693 But what you're also seeing 139 00:08:46,693 --> 00:08:54,407 is that the languages in category 2 and 3 and maybe even 4 140 00:08:54,407 --> 00:08:59,280 are not that bad, actually, 141 00:08:59,280 --> 00:09:02,367 in terms of their representation in Wikidata and others. 142 00:09:03,287 --> 00:09:06,408 It's a really good thing to find. 143 00:09:07,646 --> 00:09:09,129 Now, if you look at the same thing 144 00:09:09,129 --> 00:09:12,418 for how much of that content of those labels 145 00:09:12,418 --> 00:09:15,495 is actually used on Wikipedia, for example, 146 00:09:17,455 --> 00:09:22,563 then we see a similar picture emerging again. 147 00:09:23,603 --> 00:09:29,813 And it tells us that those communities are actually making good use of their time 148 00:09:29,813 --> 00:09:34,504 by filling in labels for higher used items, for example. 149 00:09:36,410 --> 00:09:40,493 There are outliers where I think we can help, 150 00:09:41,683 --> 00:09:48,202 to help those communities find the places where their work would be most valuable. 151 00:09:49,312 --> 00:09:52,663 But, overall, I'm happy with this picture. 152 00:09:54,823 --> 00:09:59,844 Now, that was the items and properties part of Wikidata. 153 00:10:00,714 --> 00:10:03,033 Now, let's look at interaction in your languages. 
154 00:10:03,033 --> 00:10:05,203 So the lexeme parts of Wikidata 155 00:10:05,203 --> 00:10:09,394 where we describe words and their forms and their meanings. 156 00:10:10,167 --> 00:10:13,301 We've been doing this now since May last year, 157 00:10:16,461 --> 00:10:19,127 and content has been growing. 158 00:10:20,114 --> 00:10:22,149 You can see here in blue the lexemes, 159 00:10:22,149 --> 00:10:25,938 and then in red, the forms on those lexemes 160 00:10:25,938 --> 00:10:29,910 and yellow, the senses on those lexemes. 161 00:10:30,991 --> 00:10:34,451 So some communities-- we'll get to that later-- 162 00:10:34,451 --> 00:10:39,793 have spent a lot of time creating forms and senses for their lexemes, 163 00:10:39,793 --> 00:10:42,753 which is really useful 164 00:10:42,753 --> 00:10:48,243 because that builds the core of the data set that you need. 165 00:10:50,562 --> 00:10:55,133 Now, we looked at all the languages 166 00:10:55,133 --> 00:10:57,906 that have lexemes on Wikidata. 167 00:10:57,906 --> 00:11:01,003 So words we have, 168 00:11:01,713 --> 00:11:04,404 those are right now 310 languages. 169 00:11:04,884 --> 00:11:08,290 Now, what do you think is the top language 170 00:11:08,290 --> 00:11:11,949 when it comes to the number of lexemes currently in Wikidata? 171 00:11:12,933 --> 00:11:14,700 (audience) [inaudible] 172 00:11:19,183 --> 00:11:20,216 Huh? 173 00:11:20,216 --> 00:11:21,741 (person 2) German. 174 00:11:21,741 --> 00:11:24,252 Sorry, I've heard it before. 175 00:11:24,252 --> 00:11:25,651 It's Russian. 176 00:11:28,011 --> 00:11:29,754 Russian is quite ahead. 
177 00:11:31,897 --> 00:11:33,832 And just to give you some perspective, 178 00:11:35,652 --> 00:11:36,816 there's different opinions 179 00:11:36,816 --> 00:11:42,231 but I've read, for example, that 1,000 to 3,000 words 180 00:11:42,231 --> 00:11:45,450 gets you to conversation level, roughly, in another language, 181 00:11:45,450 --> 00:11:49,461 and 4,000 to 10,000 words to an advanced level. 182 00:11:51,591 --> 00:11:55,282 So, we still have a bit to catch up there. 183 00:11:58,483 --> 00:12:03,279 One thing I want you to pay attention to is Basque here 184 00:12:03,279 --> 00:12:07,744 with 10,000, roughly, lexemes. 185 00:12:09,244 --> 00:12:13,003 Now, if you look at the number of forms for those lexemes, 186 00:12:14,163 --> 00:12:16,497 Basque is way up there, 187 00:12:18,257 --> 00:12:20,006 which is really cool, 188 00:12:20,006 --> 00:12:24,930 and you should go to a talk that explains to you why that is the case. 189 00:12:27,341 --> 00:12:31,175 Now, if you look at the number of senses, so what do words mean, 190 00:12:32,015 --> 00:12:35,081 Basque even gets to the top of the list. 191 00:12:35,081 --> 00:12:37,102 I think that deserves an applause. 192 00:12:37,102 --> 00:12:38,921 (applause) 193 00:12:45,678 --> 00:12:47,118 Another short quiz. 194 00:12:47,118 --> 00:12:50,181 What's the lexeme with the most translations currently? 195 00:12:50,651 --> 00:12:55,414 (audience) Cats, cats, [inaudible], Douglas Adams, [inaudible] 196 00:12:56,766 --> 00:13:00,014 All good guesses, but no. 197 00:13:01,012 --> 00:13:04,137 It's this, the Russian word for "water." 198 00:13:09,571 --> 00:13:12,253 Alright, so now we talked a lot 199 00:13:12,253 --> 00:13:16,412 about how many lexemes, forms, and senses we have, 200 00:13:16,412 --> 00:13:20,493 but that's just one thing you need. 
201 00:13:20,493 --> 00:13:21,515 The other thing you need 202 00:13:21,515 --> 00:13:25,161 is actually describing those lexemes, forms, and senses 203 00:13:25,161 --> 00:13:27,647 in a machine-readable way. 204 00:13:27,647 --> 00:13:30,039 And for that you have statements, like on items. 205 00:13:31,479 --> 00:13:36,362 And one of the properties you use is usage example. 206 00:13:36,362 --> 00:13:38,582 So whoever is using that data 207 00:13:38,582 --> 00:13:42,089 can understand how to use that word in context, 208 00:13:42,089 --> 00:13:44,158 so that could be a quote, for example. 209 00:13:45,396 --> 00:13:47,113 And here, Polish rocks. 210 00:13:47,900 --> 00:13:49,764 Good job, Polish speakers. 211 00:13:54,219 --> 00:13:57,680 Another property that's really useful is IPA, 212 00:13:57,680 --> 00:14:00,186 so how do you pronounce this word. 213 00:14:00,876 --> 00:14:07,497 Russian apparently needs lots of IPA statements. 214 00:14:10,419 --> 00:14:13,314 But, again, Polish, second. 215 00:14:17,148 --> 00:14:20,753 And last but not least we have pronunciation audio. 216 00:14:20,753 --> 00:14:23,372 So that is links to files on Commons 217 00:14:23,372 --> 00:14:25,959 where someone speaks the word, 218 00:14:25,959 --> 00:14:29,913 so you can hear a native speaker pronounce the word 219 00:14:29,913 --> 00:14:32,871 in case you can't read IPA, for example. 220 00:14:34,959 --> 00:14:39,205 And there's actually a really nice Wikibase-powered project 221 00:14:39,205 --> 00:14:40,474 called Lingua Libre 222 00:14:40,884 --> 00:14:45,173 where you can go and help record words in your language 223 00:14:45,173 --> 00:14:47,836 that then can be added to lexemes on Wikidata, 224 00:14:48,446 --> 00:14:52,103 so other people can understand how to pronounce your words.
225 00:14:53,663 --> 00:14:55,694 (person 2) [inaudible] 226 00:14:55,694 --> 00:14:57,665 If you search for "Lingua Libre," 227 00:14:57,665 --> 00:15:00,981 and I'm sure someone can post it in the Telegram channel. 228 00:15:03,138 --> 00:15:04,621 Those guys rock. 229 00:15:04,621 --> 00:15:06,726 They did really cool stuff with Wikibase. 230 00:15:09,416 --> 00:15:10,617 Alright. 231 00:15:12,706 --> 00:15:17,285 Then the question is, where do we go from here? 232 00:15:19,165 --> 00:15:22,010 Based on the numbers I've just shown you, 233 00:15:23,030 --> 00:15:25,172 we've come a long way 234 00:15:25,172 --> 00:15:28,430 towards giving more people more access to more knowledge 235 00:15:28,430 --> 00:15:31,240 when looking at languages on Wikidata. 236 00:15:32,530 --> 00:15:36,392 But there is also still a lot of work ahead of us. 237 00:15:38,992 --> 00:15:42,341 Some of the things you can do to help, for example, 238 00:15:42,341 --> 00:15:44,921 is run label-a-thons 239 00:15:44,921 --> 00:15:50,124 like get people together to label items in Wikidata 240 00:15:50,914 --> 00:15:55,121 or do an edit-a-thon around lexemes in your language 241 00:15:55,121 --> 00:15:59,212 to get the most used words in your language into Wikidata. 242 00:16:00,773 --> 00:16:03,285 Or you can use a tool like Terminator 243 00:16:03,285 --> 00:16:08,493 that helps you find the most important items in your language 244 00:16:08,493 --> 00:16:11,549 that are still missing a label. 245 00:16:13,274 --> 00:16:18,359 Most important being measured by how often it is used 246 00:16:18,359 --> 00:16:22,553 in other Wikidata items as links in statements. 
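The ranking idea described here (items that lack a label in a language, ordered by how often other items link to them) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Terminator itself is implemented: the function name rank_missing_labels, the in-memory data shape, and the item IDs X1 to X3 are all hypothetical; only Q5 with its "human" and "Mensch" labels comes from the talk.

```python
from collections import Counter


def rank_missing_labels(items, language):
    """Return IDs of items lacking a label in `language`, most-linked first.

    `items` maps an item ID to {"labels": {lang: text}, "links": [target IDs]},
    where "links" lists the items its statements point to (hypothetical shape).
    """
    # Count how often each item is the target of a statement in some item.
    incoming = Counter(
        target for item in items.values() for target in item["links"]
    )
    missing = [
        item_id for item_id, item in items.items()
        if language not in item["labels"]
    ]
    # Most incoming links first; ties are broken by ID for a stable order.
    return sorted(missing, key=lambda item_id: (-incoming[item_id], item_id))


# Toy data: X1, X2 and X3 are made-up IDs, not real Wikidata items.
toy_items = {
    "Q5": {"labels": {"en": "human", "de": "Mensch"}, "links": []},
    "X1": {"labels": {"en": "first"}, "links": ["Q5", "X3"]},
    "X2": {"labels": {"en": "second"}, "links": ["Q5", "X3"]},
    "X3": {"labels": {"en": "third"}, "links": ["X1"]},
}
ranking = rank_missing_labels(toy_items, "de")
```

With the toy data, X3 is linked twice, X1 once and X2 never, so a German label for X3 would be the first one to add.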
247 00:16:25,768 --> 00:16:30,022 And, of course, for the lexeme part, 248 00:16:31,342 --> 00:16:35,169 now that we've got a basic coverage of those lexemes, 249 00:16:35,169 --> 00:16:41,163 it's also about building them out, adding more statements to them 250 00:16:41,163 --> 00:16:44,401 so that they can actually form the base 251 00:16:44,401 --> 00:16:47,421 for meaningful applications to build on. 252 00:16:48,141 --> 00:16:50,795 Because we're getting closer to that critical mass, 253 00:16:50,795 --> 00:16:53,616 but we're still some way from the point 254 00:16:53,616 --> 00:16:56,624 where you can build serious applications on top of it. 255 00:16:58,277 --> 00:17:01,680 And I hope all of you will join us in doing that. 256 00:17:02,583 --> 00:17:07,103 And that already brings me 257 00:17:07,103 --> 00:17:09,843 to a little help from our friends, 258 00:17:09,843 --> 00:17:12,812 and Bruno, do you want to come over 259 00:17:13,882 --> 00:17:16,854 and talk to us about lexical masks? 260 00:17:17,541 --> 00:17:18,567 (Bruno) Thank you, Lydia, 261 00:17:18,567 --> 00:17:21,519 thank you for giving me this short period of time 262 00:17:21,519 --> 00:17:24,150 to present this work that we are doing at Google 263 00:17:24,150 --> 00:17:29,635 with Denny, whom most of you probably have heard of or know. 264 00:17:30,126 --> 00:17:32,030 At Google, I'm a linguist, 265 00:17:32,030 --> 00:17:36,150 so I'm very happy to be here amongst other language enthusiasts. 266 00:17:36,620 --> 00:17:39,278 We are also building some lexicons, 267 00:17:39,278 --> 00:17:41,766 and we have built this technology 268 00:17:41,766 --> 00:17:45,589 or this approach that we think can be useful for you. 269 00:17:46,369 --> 00:17:48,455 Just to give you a little bit of background, 270 00:17:48,455 --> 00:17:52,068 this is my lexicographic background talking here.
271 00:17:52,788 --> 00:17:54,347 When we build lexicon databases, 272 00:17:54,347 --> 00:17:58,623 it is very hard to maintain them, to keep them consistent 273 00:17:58,623 --> 00:18:00,125 and to exchange data, 274 00:18:00,125 --> 00:18:02,027 as you probably know. 275 00:18:02,517 --> 00:18:05,927 There are several attempts to unify the features and the properties 276 00:18:05,927 --> 00:18:09,184 that are describing those lexemes and those forms, 277 00:18:09,184 --> 00:18:10,936 and it's not a solved problem, 278 00:18:10,936 --> 00:18:13,958 but there are some unification attempts on that side. 279 00:18:13,958 --> 00:18:15,209 But what is really missing-- 280 00:18:15,209 --> 00:18:18,732 and this is a problem we had at the beginning of our project at Google-- 281 00:18:18,732 --> 00:18:21,607 is to try to have an internal structure 282 00:18:22,197 --> 00:18:25,910 that describes what a lexical entry should look like, 283 00:18:25,910 --> 00:18:28,581 what kind of data or what kind of information we have 284 00:18:28,581 --> 00:18:32,237 and the specifications that are expected. 285 00:18:32,237 --> 00:18:38,187 So, what we came up with is this thing called a lexicon mask. 286 00:18:38,897 --> 00:18:44,841 A lexicon mask is describing what is expected for an entry, 287 00:18:44,841 --> 00:18:47,329 a lexicographic entry, to be complete, 288 00:18:47,329 --> 00:18:51,436 both in terms of the number of forms you expect for a lexeme, 289 00:18:51,436 --> 00:18:55,607 and the number of features you expect for each of those forms. 290 00:18:56,397 --> 00:18:58,329 Here is an example for Italian adjectives. 291 00:18:58,329 --> 00:19:02,002 You expect, in Italian, to have four forms for your adjectives, 292 00:19:02,002 --> 00:19:05,383 and each of these forms has a specific combination 293 00:19:05,383 --> 00:19:07,946 of gender and number features. 294 00:19:08,606 --> 00:19:12,672 This is what we expect for the Italian adjectives.
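To make the Italian adjective example concrete, here is a minimal Python sketch of a mask as a set of expected gender and number combinations, with a check that reports what an entry is missing. It assumes a simple in-memory representation; the names ITALIAN_ADJECTIVE_MASK and check_entry are hypothetical, and this is neither Google's internal mask format nor the ShEx schemas mentioned later in the talk.

```python
# One expected feature combination per form: four forms in total.
ITALIAN_ADJECTIVE_MASK = frozenset({
    frozenset({"masculine", "singular"}),
    frozenset({"masculine", "plural"}),
    frozenset({"feminine", "singular"}),
    frozenset({"feminine", "plural"}),
})


def check_entry(forms, mask):
    """Compare an entry's forms with a mask.

    `forms` is a list of (written form, set of features) pairs.
    Returns (missing, unexpected): combinations the mask expects but the
    entry lacks, and combinations the entry has but the mask does not expect.
    """
    present = {frozenset(features) for _, features in forms}
    return mask - present, present - mask


# The adjective "rosso" (red) with one form left out on purpose.
rosso = [
    ("rosso", {"masculine", "singular"}),
    ("rossi", {"masculine", "plural"}),
    ("rossa", {"feminine", "singular"}),
]
missing, unexpected = check_entry(rosso, ITALIAN_ADJECTIVE_MASK)
```

Here missing contains the feminine plural combination (the form "rosse" has not been entered yet) and unexpected is empty. Forms are kept as pairs rather than as a dictionary keyed by spelling because two forms of one adjective can be written identically.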
295 00:19:12,672 --> 00:19:16,176 Of course, you can have extremely complex masks, 296 00:19:16,176 --> 00:19:20,783 like the French verbs conjugation, which is quite extensive, 297 00:19:20,783 --> 00:19:23,487 and I don't show you any other Russian mask 298 00:19:23,487 --> 00:19:25,378 because it doesn't fit the screen. 299 00:19:26,308 --> 00:19:29,531 And we also have some detailed specifications 300 00:19:29,531 --> 00:19:33,421 because we distinguish what is at the form level. 301 00:19:33,421 --> 00:19:37,544 So here you have Russian nouns that have three numbers 302 00:19:37,544 --> 00:19:40,048 and a number of cases with different forms, 303 00:19:40,048 --> 00:19:43,086 but they also have an entry level specification 304 00:19:43,086 --> 00:19:45,590 that says a noun particularly has 305 00:19:45,590 --> 00:19:50,133 an inherent gender and an inherent animacy feature 306 00:19:50,133 --> 00:19:52,488 that is also specified in the mask. 307 00:19:54,518 --> 00:19:58,779 We also want to distinguish that a mask gives a specification 308 00:19:58,779 --> 00:20:01,874 for, in general, what an entry should look like. 309 00:20:01,874 --> 00:20:07,158 But you can have smaller masks for defective aspects of the form 310 00:20:07,158 --> 00:20:11,282 or defective aspects of the lexeme that happen in language. 311 00:20:11,282 --> 00:20:14,537 So here is the simplest version of French verbs 312 00:20:14,537 --> 00:20:19,729 that have only the 3rd person singular for all the weather verbs, 313 00:20:19,729 --> 00:20:23,969 like "it rains" or "it snows," like in English. 314 00:20:24,537 --> 00:20:26,493 So we distinguish these two levels. 
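The two levels described here, features required on the entry itself and a smaller mask for defective lexemes, could be sketched as follows. This is a toy illustration under stated assumptions: the Mask class, the slot helper, matching_masks and both example masks are hypothetical, and the "full" verb mask is cut down to three singular forms, far smaller than a real French verb mask.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mask:
    name: str
    entry_features: frozenset  # categories required on the entry, e.g. "gender"
    form_slots: frozenset      # one frozenset of features per expected form


def slot(*features):
    """Build one expected feature combination."""
    return frozenset(features)


def matching_masks(entry, masks):
    """Return the names of all masks whose expectations the entry meets."""
    present = frozenset(frozenset(features) for _, features in entry["forms"])
    return [
        mask.name
        for mask in masks
        if mask.entry_features <= set(entry["entry_features"])
        and present == mask.form_slots
    ]


# Toy masks (assumption): person and number only, singular only.
FULL_VERB = Mask("verb (toy)", frozenset(), frozenset({
    slot("first person", "singular"),
    slot("second person", "singular"),
    slot("third person", "singular"),
}))
WEATHER_VERB = Mask("weather verb (defective)", frozenset(), frozenset({
    slot("third person", "singular"),
}))

# "pleuvoir" (to rain) only has a third person singular form.
pleuvoir = {
    "entry_features": {},
    "forms": [("pleut", {"third person", "singular"})],
}
matches = matching_masks(pleuvoir, [FULL_VERB, WEATHER_VERB])
```

With these toy masks, pleuvoir satisfies only the defective weather-verb mask. A noun mask would additionally list categories such as "gender" and "animacy" in entry_features, so an entry without an inherent gender would match no mask at all.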
315 00:20:26,923 --> 00:20:29,962 And how we use this at Google 316 00:20:29,962 --> 00:20:32,643 is that when we have a lexicon that we want to use, 317 00:20:33,063 --> 00:20:38,309 we use the mask to quite literally throw the lexicon, 318 00:20:38,309 --> 00:20:40,163 all the entries, through the mask 319 00:20:40,163 --> 00:20:44,303 and see which entry has a problem in terms of structure. 320 00:20:44,303 --> 00:20:46,523 Are we missing a form? Are we missing a feature? 321 00:20:46,523 --> 00:20:51,497 And when there is a problem, we do some human validation 322 00:20:51,497 --> 00:20:53,751 or just see if it passes the mask. 323 00:20:53,751 --> 00:20:57,924 So it's an extremely powerful tool to check the quality of the structure. 324 00:20:59,427 --> 00:21:01,964 So what we are happy to announce today 325 00:21:01,964 --> 00:21:05,408 is that we got the green light to open source our masks. 326 00:21:05,948 --> 00:21:07,573 So this is a schema 327 00:21:07,573 --> 00:21:09,477 that we can release, if you want it, 328 00:21:09,477 --> 00:21:13,483 and that we will provide to Wikidata as ShEx files. 329 00:21:13,483 --> 00:21:16,688 This is a ShEx file for German nouns, 330 00:21:16,688 --> 00:21:20,428 and Denny is working on the conversion from our internal specification 331 00:21:20,428 --> 00:21:23,666 to a more open-source specification. 332 00:21:23,666 --> 00:21:27,522 We currently cover more than 25 languages. 333 00:21:27,522 --> 00:21:29,225 So we expect to grow on our side, 334 00:21:29,225 --> 00:21:34,350 but we also look for this opportunity to collaborate on other languages. 335 00:21:34,350 --> 00:21:40,728 And one of the ongoing collaborations is also the one that Denny has with Lukas. 336 00:21:40,728 --> 00:21:45,052 Lukas has these great tools that provide a UI 337 00:21:45,052 --> 00:21:51,061 to help the user or the contributor to add more forms.
338 00:21:51,061 --> 00:21:54,151 So if you want to add an adjective in French, 339 00:21:54,151 --> 00:21:59,057 the UI is telling you how many forms are expected 340 00:21:59,057 --> 00:22:01,562 and what kind of features this form should have. 341 00:22:01,562 --> 00:22:06,268 So our mask will help the tool to be defined and expanded. 342 00:22:07,238 --> 00:22:08,385 That's it. 343 00:22:08,791 --> 00:22:10,358 (Lydia) Thank you so much. 344 00:22:10,358 --> 00:22:11,993 (applause) 345 00:22:14,249 --> 00:22:16,891 Alright. Are there questions? 346 00:22:16,891 --> 00:22:19,381 Do you want to talk more about lexemes? 347 00:22:19,817 --> 00:22:21,475 - (person 3) Yes. - Yes. (chuckles) 348 00:22:33,485 --> 00:22:35,380 (person 3) My question, because you were talking 349 00:22:35,380 --> 00:22:39,106 about giving more access to more people in more languages. 350 00:22:39,106 --> 00:22:42,444 But there are a lot of languages that can't be used in Wikidata. 351 00:22:42,444 --> 00:22:44,588 So what solution do you have for that? 352 00:22:45,889 --> 00:22:47,686 When you say that can't use Wikidata, 353 00:22:47,686 --> 00:22:50,308 are you talking about entering labels? 354 00:22:50,308 --> 00:22:52,578 - (person 3) Labels, descriptions. - Right. 355 00:22:52,578 --> 00:22:55,498 So, for lexemes, it's a bit different 356 00:22:55,498 --> 00:22:57,793 because there we don't have that restriction. 357 00:22:58,923 --> 00:23:05,003 For labels on items and properties, there is some restriction 358 00:23:05,433 --> 00:23:12,411 because we wanted to make sure that it's not completely 359 00:23:12,411 --> 00:23:14,229 anyone does anything, 360 00:23:14,229 --> 00:23:17,769 and it becomes unmanageable. 361 00:23:19,349 --> 00:23:23,328 Even a small community who wants one language and wants to work on that, 362 00:23:23,898 --> 00:23:26,787 come talk to us, we will make it happen. 
363 00:23:26,787 --> 00:23:29,202 (person 3) I mean, we did this at the Prague Hackathon in May, 364 00:23:29,202 --> 00:23:32,459 and it took us until almost August in order to be able to use our language. 365 00:23:32,459 --> 00:23:35,135 - Yeah. - (person 3) So, it's very slow. 366 00:23:35,135 --> 00:23:37,854 Yeah, it is, unfortunately, very slow. 367 00:23:37,854 --> 00:23:39,883 We're currently working with the Language Committee 368 00:23:39,883 --> 00:23:46,048 on solving some fundamental... 369 00:23:49,537 --> 00:23:55,447 Like, getting agreement on what kind of languages are actually "allowed," 370 00:23:56,047 --> 00:23:59,398 and that has taken too long, 371 00:23:59,988 --> 00:24:04,178 which is the reason why your request probably took longer than it should have. 372 00:24:04,778 --> 00:24:05,963 (person 3) Thanks. 373 00:24:06,815 --> 00:24:07,950 (person 4) Thank you. 374 00:24:07,950 --> 00:24:10,938 Lydia, if you remember the statistics that you showed, 375 00:24:10,938 --> 00:24:12,886 the number of lexemes per language. 376 00:24:12,886 --> 00:24:17,599 So, did you count all the forms as a data point 377 00:24:17,599 --> 00:24:20,034 or only lexemes? 378 00:24:21,289 --> 00:24:22,941 (Lydia) Do you mean this? 379 00:24:22,941 --> 00:24:24,053 Which one do you mean? 380 00:24:24,053 --> 00:24:25,529 (person 4) Yes, exactly. 381 00:24:25,797 --> 00:24:28,341 If you remember, does this number [inaudible] 382 00:24:28,341 --> 00:24:31,954 all the forms for all the lexemes or just how many lexemes there are? 383 00:24:31,954 --> 00:24:33,585 No, this is just a number of lexemes. 384 00:24:33,585 --> 00:24:35,395 (person 4) Just a number of lexemes, okay.
385 00:24:35,395 --> 00:24:36,797 So then it is a just statistic 386 00:24:36,797 --> 00:24:39,390 because if it would then comprise the forms-- 387 00:24:39,390 --> 00:24:40,614 that's why I'm asking-- 388 00:24:40,614 --> 00:24:42,817 then all the languages with inflectional morphology, 389 00:24:42,817 --> 00:24:45,027 like Russian, Serbian, Slovenian, et cetera, 390 00:24:45,027 --> 00:24:47,616 they have a natural advantage because they have so many. 391 00:24:47,616 --> 00:24:51,990 So, this kind of kicks in here on this number of forms. 392 00:24:51,990 --> 00:24:53,851 (person 4) Yeah, that was this one. Thank you. 393 00:24:56,546 --> 00:25:00,224 (person 5) So, I had a quick question about the... 394 00:25:00,644 --> 00:25:06,824 When we're talking about the actual items and properties. 395 00:25:07,124 --> 00:25:08,901 Like as far as I understand, 396 00:25:08,901 --> 00:25:11,955 there is currently no way to give an actual source 397 00:25:11,955 --> 00:25:14,726 to any of the labels and descriptions that are given. 398 00:25:14,726 --> 00:25:18,047 So, for example, because when you're talking 399 00:25:18,047 --> 00:25:20,920 about an item property, 400 00:25:20,920 --> 00:25:24,509 like, for example, you can get conflicting labels. 401 00:25:24,509 --> 00:25:25,739 Yes. 402 00:25:25,739 --> 00:25:27,662 (person 5) So this person is like... 403 00:25:28,402 --> 00:25:30,781 We were talking about indigenous things before, for example. 404 00:25:30,781 --> 00:25:35,965 So this person is a Norwegian artist according to this source, 405 00:25:35,965 --> 00:25:38,750 and a Sami artist, according to this source.
406 00:25:39,550 --> 00:25:42,883 Or, for example, in Estonian, we had an issue 407 00:25:42,883 --> 00:25:47,729 where we had to change terminology to the official use terminology 408 00:25:47,729 --> 00:25:49,482 in official lexicons, 409 00:25:49,482 --> 00:25:52,262 but we have no way to indicate really why, 410 00:25:52,262 --> 00:25:53,596 like what was the source of this 411 00:25:53,596 --> 00:25:55,561 and why this was better and what was there before. 412 00:25:55,561 --> 00:25:57,150 It was just me as a random person 413 00:25:57,150 --> 00:25:59,615 just switching the thing to anyone who sees it. 414 00:25:59,615 --> 00:26:02,520 So is there a plan to make this possible in any way 415 00:26:02,520 --> 00:26:06,355 so that we can actually have proper sources for the language data? 416 00:26:07,045 --> 00:26:11,568 So, it is partially possible. 417 00:26:11,568 --> 00:26:15,958 So, for example, when you have an item for a person, 418 00:26:16,968 --> 00:26:22,720 you have a statement, first name, last name, and so on, of that person, 419 00:26:22,720 --> 00:26:26,226 and then you can provide the reference for that there. 420 00:26:28,211 --> 00:26:32,544 I'm quite hesitant to add more complexity 421 00:26:32,544 --> 00:26:35,557 for references on labels and descriptions, 422 00:26:35,557 --> 00:26:38,624 but if people really, really think 423 00:26:38,624 --> 00:26:44,939 this is something that isn't covered by any reference on the statement, 424 00:26:44,939 --> 00:26:46,803 then let's talk about it. 425 00:26:49,079 --> 00:26:53,303 But I fear it will add a lot of complexity 426 00:26:53,303 --> 00:26:56,523 for what I hope are few cases, 427 00:26:57,393 --> 00:27:00,188 but I'm willing to be convinced otherwise 428 00:27:00,188 --> 00:27:04,087 if people really feel very strongly about this. 
429 00:27:04,087 --> 00:27:08,177 (person 5) I mean, if it's added it probably shouldn't be the default, 430 00:27:08,177 --> 00:27:12,452 shown to all the users in a beginner interface, in any case. 431 00:27:12,452 --> 00:27:16,190 More like, "Click here if you need to say a specific thing about this." 432 00:27:17,632 --> 00:27:23,368 Do we have a sense of how many times that would actually matter? 433 00:27:24,520 --> 00:27:26,423 (person 5) In Estonian, for example-- 434 00:27:26,423 --> 00:27:28,844 I expect this is true of other languages as well-- 435 00:27:29,274 --> 00:27:34,203 for example, there is an official name that is the actual legitimate translation, 436 00:27:34,203 --> 00:27:36,206 for example, into English, 437 00:27:36,206 --> 00:27:40,314 of, say, a specific kind of municipality. 438 00:27:40,614 --> 00:27:42,182 That was my use case, for example, 439 00:27:42,182 --> 00:27:44,409 where we were using the word "parish" 440 00:27:45,159 --> 00:27:50,885 because the original Estonian word meant something like church parish, 441 00:27:50,885 --> 00:27:51,899 and that was the origin, 442 00:27:51,899 --> 00:27:54,809 but that's not the official translation Estonia gets right now. 443 00:27:55,189 --> 00:27:58,993 In this case, I would just add it as an official name statement 444 00:27:58,993 --> 00:28:00,817 and add the reference there. 445 00:28:02,032 --> 00:28:03,158 (person 5) Okay. 446 00:28:05,186 --> 00:28:06,572 More questions, yes? 447 00:28:07,682 --> 00:28:10,044 (person 6) I have two quick comments. 448 00:28:10,044 --> 00:28:13,934 You specifically called out Asturian as a language that does well, 449 00:28:13,934 --> 00:28:16,455 and I think that's a false artifact. 450 00:28:16,455 --> 00:28:17,724 Tell me about it.
451 00:28:17,724 --> 00:28:19,748 (person 6) I think it's just a bot 452 00:28:19,748 --> 00:28:24,068 that pasted person names, like proper names, 453 00:28:24,068 --> 00:28:27,172 and said, "Well, this is exactly like in French or Spanish," 454 00:28:27,172 --> 00:28:28,558 and just massively copied it. 455 00:28:28,558 --> 00:28:33,316 One point of evidence is that you don't see that energy in Asturian 456 00:28:33,316 --> 00:28:37,205 in things that actually require translation, like property names, 457 00:28:37,205 --> 00:28:39,648 or names of items that are not proper names. 458 00:28:39,648 --> 00:28:41,219 Asaf, you break my heart. 459 00:28:41,219 --> 00:28:43,198 (person 6) I know, I like raining on parades, 460 00:28:43,198 --> 00:28:48,458 but I have good news as well, which is about the pronunciation numbers. 461 00:28:49,408 --> 00:28:53,515 As you probably know, Commons is full of pronunciation files, 462 00:28:53,515 --> 00:28:54,668 and, for example, 463 00:28:54,668 --> 00:29:01,102 Dutch has no less than 300,000 pronunciation files already on Commons 464 00:29:01,912 --> 00:29:05,051 that just need to somehow be ingested. 465 00:29:05,051 --> 00:29:07,697 So if anyone's looking for a side project, 466 00:29:07,697 --> 00:29:08,997 there's tons and tons 467 00:29:08,997 --> 00:29:13,280 of classified, categorized pronunciation files on Commons 468 00:29:13,280 --> 00:29:16,893 under the category "Pronunciation" by language. 469 00:29:16,893 --> 00:29:22,840 So that's just waiting to be matched to lexemes and put on Lexeme. 470 00:29:23,180 --> 00:29:25,484 And I was wondering if you could say something 471 00:29:25,484 --> 00:29:26,585 about the road map, 472 00:29:26,585 --> 00:29:28,757 something about how much investment 473 00:29:28,757 --> 00:29:31,995 or what can we expect from Lexeme in the coming year, 474 00:29:31,995 --> 00:29:34,020 because I, for one, can't wait. 475 00:29:34,949 --> 00:29:37,044 You can't wait? 
(chuckles) 476 00:29:37,044 --> 00:29:39,118 - (person 6) For more. - Yes. (chuckles) 477 00:29:44,541 --> 00:29:49,523 Right now, we're concentrating more on Wikibase and data quality 478 00:29:51,493 --> 00:29:55,087 to see how much traction this gets 479 00:29:55,087 --> 00:30:01,676 and then getting more feedback on where the pain points are next, 480 00:30:01,676 --> 00:30:06,003 and then going back to improving lexicographical data further. 481 00:30:06,903 --> 00:30:09,790 And one of the things I'd love to hear from you 482 00:30:09,790 --> 00:30:14,136 is where exactly do you see the next steps, 483 00:30:14,136 --> 00:30:15,966 where do you want to see improvements 484 00:30:15,966 --> 00:30:20,340 so that we can then figure out how to make that happen. 485 00:30:21,125 --> 00:30:22,810 But, of course, you're right, 486 00:30:22,810 --> 00:30:25,712 there's still so much to do also on the technical side. 487 00:30:30,573 --> 00:30:35,848 (person 7) Okay, as we were uploading the Basque words with forms, 488 00:30:35,848 --> 00:30:37,768 and you'll see some of these kinds of things, 489 00:30:37,768 --> 00:30:41,329 we were both like, last week we said, "Oh, we are the first one in something." 490 00:30:42,919 --> 00:30:44,928 It appears in press, and it's like, 491 00:30:44,928 --> 00:30:49,488 "Oh, Basque are the first time in some-- they are the first in something, okay." 492 00:30:49,488 --> 00:30:50,606 (laughs) 493 00:30:50,606 --> 00:30:53,318 And then people ask, "Okay, but what is this for?" 494 00:30:54,678 --> 00:30:56,849 We don't have a real good answer. 495 00:30:56,849 --> 00:30:57,888 I mean it's like, okay, 496 00:30:57,888 --> 00:31:01,841 this will help computers to understand our language better, yes, 497 00:31:01,841 --> 00:31:05,279 but what kind of tools can we make in the future? 498 00:31:05,279 --> 00:31:07,467 And we don't have a good answer for this.
499 00:31:07,467 --> 00:31:10,625 So I don't know if you have a good answer for this. 500 00:31:10,625 --> 00:31:12,742 (chuckles) I don't know if I have a good answer, 501 00:31:12,742 --> 00:31:14,746 but I have an answer. 502 00:31:15,480 --> 00:31:20,425 So I think right now as I was telling [inaudible], 503 00:31:20,425 --> 00:31:21,924 we haven't reached that critical mass 504 00:31:21,924 --> 00:31:25,529 where you can build a lot of the really interesting tools. 505 00:31:25,529 --> 00:31:27,707 But there are already some tools. 506 00:31:28,267 --> 00:31:31,912 Just the other day, Esther [Pandelia], for example, 507 00:31:31,912 --> 00:31:33,817 released a tool where you can see, 508 00:31:35,837 --> 00:31:38,889 I think it was the words on a globe 509 00:31:38,889 --> 00:31:41,901 where they're spoken, where they're coming from. 510 00:31:42,631 --> 00:31:44,090 I'm probably wrong about this, 511 00:31:44,090 --> 00:31:46,346 but she had answered on the Project chat on Wikidata-- 512 00:31:46,346 --> 00:31:48,984 you can look it up there. 513 00:31:49,574 --> 00:31:51,805 So we have seen these first tools, 514 00:31:51,805 --> 00:31:55,696 just like we've seen back when Wikidata started. 515 00:31:56,846 --> 00:31:59,602 First some--like just a network, 516 00:31:59,602 --> 00:32:03,424 and like, "Hey, look, there's this thing that connects to this other thing." 517 00:32:04,824 --> 00:32:07,059 And as we have more data, 518 00:32:07,059 --> 00:32:10,352 and as we've reached some critical mass, 519 00:32:11,852 --> 00:32:14,747 more powerful applications become possible, 520 00:32:15,677 --> 00:32:17,516 things like Histropedia, 521 00:32:19,126 --> 00:32:21,988 things like question and answering 522 00:32:21,988 --> 00:32:26,663 in your digital personal assistant, Platypus, and so on. 523 00:32:26,663 --> 00:32:29,668 And we're seeing a similar thing with lexemes. 
524 00:32:31,198 --> 00:32:34,650 We're at the stage where you can build like these little, 525 00:32:34,650 --> 00:32:37,464 hey, look, there's a connection between the two things, 526 00:32:37,864 --> 00:32:42,738 and there's a translation of this word into that language stage, 527 00:32:42,738 --> 00:32:47,747 and as we build it out and as we describe more words, 528 00:32:47,747 --> 00:32:49,533 more becomes possible. 529 00:32:49,533 --> 00:32:51,795 Now, what becomes possible? 530 00:32:53,482 --> 00:32:59,483 As Ben, our keynote speaker earlier was talking about translations, 531 00:33:00,103 --> 00:33:03,455 being able to translate from one language to another. 532 00:33:03,455 --> 00:33:07,929 And Jens, my colleague, he's always talking about 533 00:33:07,929 --> 00:33:11,452 the European Union looking for a translator 534 00:33:11,452 --> 00:33:17,439 who can translate from I think it was Maltese to Swedish-- 535 00:33:17,439 --> 00:33:19,436 - (person 8) Estonian. - Estonian. 536 00:33:22,016 --> 00:33:26,211 And that is not a usual combination. 537 00:33:27,211 --> 00:33:31,735 But once you have all these languages in one machine-readable place, 538 00:33:31,735 --> 00:33:33,143 you can do that, 539 00:33:33,143 --> 00:33:36,857 you can get a dictionary 540 00:33:36,857 --> 00:33:41,735 from Estonian to Maltese and back. 541 00:33:42,935 --> 00:33:45,607 So covering language combinations in dictionaries 542 00:33:45,607 --> 00:33:47,911 that just haven't been covered before 543 00:33:47,911 --> 00:33:51,050 because there wasn't enough demand for it, for example, 544 00:33:51,050 --> 00:33:55,540 to make it financially viable and to justify the work. 545 00:33:55,540 --> 00:33:57,147 Now we can do that. 546 00:33:59,797 --> 00:34:02,318 Then text generation. 
547 00:34:02,318 --> 00:34:03,653 Lucie was earlier talking 548 00:34:03,653 --> 00:34:10,136 about how she's working with Hattie on generating text 549 00:34:10,136 --> 00:34:14,673 to get Wikipedia articles in minority languages started, 550 00:34:15,423 --> 00:34:19,512 and that needs data about words, 551 00:34:19,512 --> 00:34:22,589 and you need to understand the language to do that. 552 00:34:23,769 --> 00:34:28,133 Yeah, and those are just some that come to my mind right now. 553 00:34:28,693 --> 00:34:30,494 Maybe our audience has more ideas 554 00:34:30,494 --> 00:34:34,353 what they want to do when we have all the glorious data. 555 00:34:37,693 --> 00:34:40,892 (person 9) Okay, I will deviate from the lexemes topic. 556 00:34:40,892 --> 00:34:42,666 I will ask the question, 557 00:34:42,666 --> 00:34:45,634 how can I as a member of community 558 00:34:45,634 --> 00:34:50,135 influence that priority is put on task, 559 00:34:50,135 --> 00:34:56,644 that a new user comes, and he can indicate what languages he wants to see and edit 560 00:34:56,644 --> 00:35:01,135 without some secret verbal template knowledge. 561 00:35:02,145 --> 00:35:05,053 Maybe there will be this year this technical wish list 562 00:35:05,053 --> 00:35:07,040 without Wikipedia topics. 563 00:35:07,040 --> 00:35:10,119 Maybe there's a hope we can all vote about 564 00:35:10,119 --> 00:35:14,218 this thing we didn't fix for seven years. 565 00:35:14,218 --> 00:35:17,607 So do you have any ideas and comments about this? 566 00:35:18,217 --> 00:35:20,328 So you're talking about the fact 567 00:35:20,328 --> 00:35:23,518 that someone who is not logged into Wikidata 568 00:35:23,518 --> 00:35:25,971 can't change their language easily? 569 00:35:25,971 --> 00:35:27,839 (person 9) No, for [inaudible] users. 
570 00:35:28,309 --> 00:35:30,689 So, if they are logged in, 571 00:35:30,689 --> 00:35:34,871 they can just change their language at the top of the page, 572 00:35:35,891 --> 00:35:38,099 and then it will appear 573 00:35:39,769 --> 00:35:42,013 where the labels' description [inaudible] are, 574 00:35:42,013 --> 00:35:43,483 and they can edit it. 575 00:35:45,657 --> 00:35:49,009 (person 9) Well, actually, usually many times the workflow 576 00:35:49,009 --> 00:35:52,447 is that if you want to have multiple languages, they are available, 577 00:35:52,447 --> 00:35:55,419 and it's not always the case. 578 00:35:55,419 --> 00:35:58,584 Okay, maybe we should sit down after this talk and you show me. 579 00:36:01,562 --> 00:36:04,089 Cool. More questions? 580 00:36:05,534 --> 00:36:06,536 Yes. 581 00:36:11,595 --> 00:36:13,196 (person 10) Thanks for the presentation. 582 00:36:14,106 --> 00:36:15,127 Can you comment 583 00:36:15,127 --> 00:36:19,307 on the state of the collaboration with the Wiktionary community? 584 00:36:19,307 --> 00:36:22,296 As far as I've seen, there were some discussions 585 00:36:22,296 --> 00:36:26,051 about importing some elements of the work, 586 00:36:26,051 --> 00:36:30,843 but there seem to be licensing issues and some disagreements, et cetera. 587 00:36:30,843 --> 00:36:31,848 Right. 588 00:36:31,848 --> 00:36:36,330 So, Wiktionary communities have spent a lot of time 589 00:36:37,320 --> 00:36:39,473 building Wiktionary. 590 00:36:39,473 --> 00:36:42,643 They have built 591 00:36:43,193 --> 00:36:47,554 amazingly complicated and complex templates 592 00:36:47,554 --> 00:36:53,614 to build pretty tables that automatically generate forms for you 593 00:36:53,614 --> 00:36:56,392 and all kinds of really impressive, 594 00:36:56,392 --> 00:37:00,683 and kind of crazy stuff, if you think about it. 595 00:37:02,311 --> 00:37:07,994 And, of course, they have invested a lot of time and effort into that.
596 00:37:09,364 --> 00:37:11,801 And understandably, 597 00:37:11,801 --> 00:37:17,116 they don't just want that to be grabbed, 598 00:37:18,046 --> 00:37:19,102 just like that. 599 00:37:19,102 --> 00:37:21,791 So there's some of that coming from there. 600 00:37:22,761 --> 00:37:25,137 And that's fine, that's okay. 601 00:37:25,737 --> 00:37:32,092 Now, the first Wiktionary communities are talking about turning out 602 00:37:32,092 --> 00:37:34,329 and importing some of their data into Wikidata. 603 00:37:34,329 --> 00:37:39,095 Russian, you have seen, for example, is one of those cases 604 00:37:40,375 --> 00:37:42,355 And I expect more of that to happen. 605 00:37:43,635 --> 00:37:46,800 But it will be a slow process, 606 00:37:46,800 --> 00:37:49,383 just like adoption of Wikidata's data on Wikipedia 607 00:37:49,383 --> 00:37:51,909 has been a rather slow process. 608 00:37:52,849 --> 00:37:56,183 On the other side of making it actually easier 609 00:37:56,183 --> 00:37:59,132 to use the data that is in lexemes, 610 00:37:59,132 --> 00:38:02,209 on Wiktionary, so that they can make use of that 611 00:38:02,209 --> 00:38:05,531 and share data between the language Wiktionaries 612 00:38:05,531 --> 00:38:08,853 which is super hard to impossible right now, 613 00:38:08,853 --> 00:38:11,560 which is crazy, just like it was on Wikipedia. 614 00:38:13,860 --> 00:38:16,325 Wait for the birthday present. (chuckles) 615 00:38:20,038 --> 00:38:21,182 Yes. 616 00:38:22,599 --> 00:38:24,827 (person 11) When I was thinking the other way around it, 617 00:38:24,827 --> 00:38:28,168 I actually didn't want to say it because I think this will be super silly, 618 00:38:28,168 --> 00:38:32,003 but I think that Wiktionary already has some content, 619 00:38:32,003 --> 00:38:34,978 and I know that we can't transfer it to Wikidata 620 00:38:34,978 --> 00:38:37,048 because there's a difference in licenses. 
621 00:38:37,048 --> 00:38:39,631 But I was thinking maybe we can do something about that. 622 00:38:40,321 --> 00:38:45,913 Maybe, I don't know, we can obtain the communities' permission 623 00:38:45,913 --> 00:38:51,205 after like, I don't know, having like a public voting 624 00:38:52,075 --> 00:38:55,642 and for the community, the active members of the community 625 00:38:55,642 --> 00:39:02,523 to vote and say if they would like or accept or to transfer the content 626 00:39:02,523 --> 00:39:05,528 for which they may do the Wikidata lexemes. 627 00:39:06,238 --> 00:39:08,537 Because I just think it is such a waste. 628 00:39:09,568 --> 00:39:14,443 So, that's definitely a conversation those people 629 00:39:14,443 --> 00:39:18,249 who are in Wiktionary communities are very welcome to bring up there. 630 00:39:18,249 --> 00:39:24,647 I think it would be a bit presumptuous for us to go and force that. 631 00:39:25,917 --> 00:39:31,142 But, yeah, I think it's definitely worth having a conversation. 632 00:39:31,142 --> 00:39:33,898 But I think it's also important to understand 633 00:39:33,898 --> 00:39:39,082 that there's a distinction between what is actually legally allowed 634 00:39:39,082 --> 00:39:43,147 and what we should be doing 635 00:39:43,147 --> 00:39:45,426 and what those people want or do not want. 636 00:39:45,736 --> 00:39:47,329 So even if it's legally allowed, 637 00:39:47,329 --> 00:39:50,640 if some other Wiktionary communities do not want that, 638 00:39:50,640 --> 00:39:53,537 I would be careful, at least. 639 00:39:58,886 --> 00:40:02,489 I think you need the mic for the stream. 640 00:40:04,540 --> 00:40:07,299 (person 12) So, obviously, it's all very exciting, 641 00:40:07,979 --> 00:40:12,319 and I immediately think how can I take that to my students 642 00:40:12,319 --> 00:40:15,558 and how can I incorporate it with the courses, 643 00:40:15,558 --> 00:40:18,531 the work that we're doing, educational settings. 
644 00:40:18,531 --> 00:40:22,271 And I don't have, at the moment, 645 00:40:22,871 --> 00:40:24,116 first of all, enough knowledge, 646 00:40:24,116 --> 00:40:27,278 but I think the documentation that we do have 647 00:40:27,808 --> 00:40:30,082 could be maybe improved. 648 00:40:30,082 --> 00:40:33,437 So that's a kind of request to make cool videos 649 00:40:33,437 --> 00:40:35,898 that explain how it works 650 00:40:35,898 --> 00:40:39,948 because if we have it, we can then use it, 651 00:40:39,948 --> 00:40:41,985 and we can have students on board, 652 00:40:41,985 --> 00:40:47,072 and we can make people understand how awesome it all is. 653 00:40:47,072 --> 00:40:52,001 And yeah, just think about documentation and think about education, please. 654 00:40:52,001 --> 00:40:54,480 Because I think a lot could be done. 655 00:40:54,480 --> 00:40:58,585 These are like many tasks that could be done even with... 656 00:41:00,125 --> 00:41:02,033 well, I wouldn't say primary schools, 657 00:41:02,033 --> 00:41:05,495 but certainly, even younger students. 658 00:41:05,915 --> 00:41:10,866 And so I would really like to see that potential being tapped into, 659 00:41:10,866 --> 00:41:15,272 and, as of now, I personally don't understand enough 660 00:41:15,272 --> 00:41:19,500 to be able to create tasks or to create like... 661 00:41:20,430 --> 00:41:22,155 to do something practical with it. 662 00:41:22,155 --> 00:41:25,772 So any help, any thoughts anyone here has about that, 663 00:41:25,772 --> 00:41:29,648 I would be very happy to hear your thoughts, and yours as well. 664 00:41:30,508 --> 00:41:32,129 Yeah, let's talk about that. 665 00:41:35,473 --> 00:41:37,139 More questions? 666 00:41:37,809 --> 00:41:39,195 Someone else raised a hand. 667 00:41:39,195 --> 00:41:40,495 I forgot where it was. 
668 00:41:45,739 --> 00:41:49,996 (person 13) So, if we can't import from Wiktionary, 669 00:41:49,996 --> 00:41:55,772 is there some concerted effort to find other public domain sources, 670 00:41:55,772 --> 00:41:57,459 maybe all the data, 671 00:41:58,769 --> 00:42:03,167 and kind of prefilter it, organize it 672 00:42:03,167 --> 00:42:08,470 so that it's easy to be checked by people for import? 673 00:42:09,093 --> 00:42:11,181 So there are first efforts. 674 00:42:11,181 --> 00:42:14,769 My understanding is that Basque is one of those efforts. 675 00:42:14,769 --> 00:42:17,474 Maybe you want to say a bit more about it? 676 00:42:18,426 --> 00:42:20,130 (person 14) [inaudible] 677 00:42:23,166 --> 00:42:27,148 Okay, the actual answer is paying for that... 678 00:42:28,374 --> 00:42:33,381 I mean, we have an agreement with a contractor we usually work with. 679 00:42:34,801 --> 00:42:38,725 They do dictionaries-- 680 00:42:40,315 --> 00:42:42,458 lots of stuff, but they do dictionaries. 681 00:42:42,458 --> 00:42:47,473 So we agreed with them to make the students' dictionary free, 682 00:42:47,473 --> 00:42:52,782 we would [cast] the most common words and start uploading it 683 00:42:52,782 --> 00:42:55,590 with an external identifier and the scheme of things. 684 00:42:56,420 --> 00:43:02,902 But there was some discussion about releasing it under CC0 685 00:43:03,212 --> 00:43:05,322 because they have the dictionary under CC BY, 686 00:43:06,537 --> 00:43:10,326 and they understood what the difference was. 687 00:43:10,326 --> 00:43:13,866 So there was some discussion.
688 00:43:13,866 --> 00:43:19,709 But I think that we can provide some tools or some examples in the future, 689 00:43:19,709 --> 00:43:21,761 and I think that there will be other dictionaries 690 00:43:21,761 --> 00:43:24,016 that we can handle, 691 00:43:24,016 --> 00:43:29,274 and also I think Wiktionary should start moving in that direction, 692 00:43:29,274 --> 00:43:32,260 but that's another great discussion. 693 00:43:33,285 --> 00:43:34,487 And on top of that, 694 00:43:34,487 --> 00:43:38,839 Lea is also in contact with people from Occitan 695 00:43:38,839 --> 00:43:41,827 who work on Occitan dictionaries, 696 00:43:41,827 --> 00:43:45,138 and they're currently working on a similar collaboration. 697 00:43:51,644 --> 00:43:53,363 More questions? 698 00:44:01,487 --> 00:44:05,349 (person 15) Hi! We are the people who want to import Occitan data. 699 00:44:05,349 --> 00:44:06,585 Aha! Perfect! 700 00:44:06,585 --> 00:44:08,368 (person 15) And we have a small problem. 701 00:44:09,188 --> 00:44:14,215 We don't know how to represent the variety of all lexemes. 702 00:44:14,215 --> 00:44:17,893 We have six dialects, 703 00:44:17,893 --> 00:44:24,014 and we want to indicate for Lexeme in which dialect it's used, 704 00:44:24,014 --> 00:44:27,285 and we don't have a proper C0 statement to do that. 705 00:44:27,285 --> 00:44:31,105 So as long as the statement doesn't exist, 706 00:44:31,635 --> 00:44:34,465 it prevents us from [inaudible] 707 00:44:34,465 --> 00:44:37,603 because we will need to do it again 708 00:44:37,603 --> 00:44:42,076 when we will be able to [export] the statement. 709 00:44:42,076 --> 00:44:44,551 And it's complicated because it's a statement 710 00:44:44,551 --> 00:44:47,802 which won't be asked by many people 711 00:44:47,802 --> 00:44:53,444 because it's a statement which concerns mostly minority languages. 712 00:44:53,444 --> 00:44:56,933 So you will have one person to ask this.
713 00:44:56,933 --> 00:45:00,022 But as our colleagues Basque, 714 00:45:00,022 --> 00:45:06,082 it can be one person who will power thousands of others, 715 00:45:06,082 --> 00:45:10,884 so it might not be asking a lot, 716 00:45:10,884 --> 00:45:14,136 but it will be very important for us. 717 00:45:14,874 --> 00:45:17,600 Do you already have a new property proposal up, 718 00:45:17,600 --> 00:45:19,470 or do you need help creating it? 719 00:45:21,524 --> 00:45:24,300 (person 15) We asked four months ago. 720 00:45:24,720 --> 00:45:28,755 Alright, then let's get some people to help out with this property proposal. 721 00:45:30,159 --> 00:45:33,092 I'm sure there are enough people in this room to make this happen. 722 00:45:33,360 --> 00:45:35,452 (person 15) Property proposal [speaking in French]. 723 00:45:35,452 --> 00:45:36,965 (person 16) We didn't have an answer. 724 00:45:36,965 --> 00:45:39,769 (person 15) We didn't have any answer, and we don't know how to do this 725 00:45:39,769 --> 00:45:42,953 because we aren't in the Wikidata community. 726 00:45:44,694 --> 00:45:48,817 Yup, so there are people here who can help you. 727 00:45:48,817 --> 00:45:52,134 Maybe someone raises their hand to take-- 728 00:45:52,574 --> 00:45:53,644 (person 14) I'm for that. 729 00:45:53,644 --> 00:45:55,512 But I think this is quite interesting 730 00:45:55,512 --> 00:45:59,059 that only the variant of form 731 00:45:59,059 --> 00:46:02,607 also can handle it geographically, 732 00:46:02,607 --> 00:46:04,995 with coordinates or some kind of mapping. 733 00:46:05,595 --> 00:46:07,815 Also having different pronunciations, 734 00:46:07,815 --> 00:46:11,837 and I think this is something that happens in lots of languages. 735 00:46:12,607 --> 00:46:16,262 We should start making it happen [inaudible], 736 00:46:16,262 --> 00:46:18,865 and I'm going to search for the property. 737 00:46:19,782 --> 00:46:20,933 Cool. 
738 00:46:20,933 --> 00:46:24,446 So you will get backing for your property proposal. 739 00:46:26,136 --> 00:46:27,297 Thank you. 740 00:46:28,153 --> 00:46:30,261 Alright, more questions? 741 00:46:32,410 --> 00:46:33,474 Finn. 742 00:46:33,974 --> 00:46:35,055 Finn is one of those people 743 00:46:35,055 --> 00:46:38,031 who builds stuff on top of lexicographical data. 744 00:46:38,031 --> 00:46:40,085 (Finn) It's just a small question, 745 00:46:40,405 --> 00:46:44,226 and that's about spelling variations. 746 00:46:44,896 --> 00:46:48,002 It seems to be difficult to put them in... 747 00:46:48,532 --> 00:46:53,368 You could, of course, have multiple forms for the same word. 748 00:46:56,327 --> 00:46:58,448 I don't know, it seems to be... 749 00:46:59,558 --> 00:47:03,535 If you don't do it that way, it seems to be difficult to specify... 750 00:47:04,771 --> 00:47:05,888 or I don't know whether 751 00:47:05,888 --> 00:47:09,731 this is just a minor technical issue or whether... 752 00:47:09,731 --> 00:47:11,252 Let's look at it together. 753 00:47:11,642 --> 00:47:15,230 I would love to see an example. 754 00:47:17,478 --> 00:47:18,478 Asaf. 755 00:47:26,886 --> 00:47:28,396 (Asaf) Thank you. 756 00:47:29,386 --> 00:47:33,685 I can give a very concrete example from my mother tongue, Hebrew. 757 00:47:34,205 --> 00:47:38,845 Hebrew has two main variants 758 00:47:38,845 --> 00:47:42,786 for expressing almost every word 759 00:47:42,786 --> 00:47:47,640 because the traditional spelling 760 00:47:47,640 --> 00:47:50,044 leaves out many of the vowels. 761 00:47:50,934 --> 00:47:55,207 And, therefore, in modern editions of the Bible and of poetry, 762 00:47:55,207 --> 00:47:57,461 diacritics are used. 763 00:47:57,461 --> 00:48:02,670 However, those diacritics are never used for modern prose 764 00:48:02,670 --> 00:48:05,974 or newspaper writing or street signs. 
765 00:48:05,974 --> 00:48:11,209 So the average daily casual use puts in extra vowels 766 00:48:12,169 --> 00:48:13,519 and doesn't use the diacritics 767 00:48:13,519 --> 00:48:15,607 because they are, of course, more cumbersome 768 00:48:15,607 --> 00:48:17,893 and have all kinds of rules and nobody knows the rules. 769 00:48:18,633 --> 00:48:20,531 So there are basically two variants. 770 00:48:20,531 --> 00:48:25,322 There's the everyday casual prose variant, 771 00:48:25,322 --> 00:48:27,827 and there's the Bible or poetry, 772 00:48:27,827 --> 00:48:32,200 which always come in this traditional diacriticized text. 773 00:48:32,200 --> 00:48:33,302 To be useful, 774 00:48:33,302 --> 00:48:37,428 Lexeme would have to recognize both varieties of every single word 775 00:48:37,428 --> 00:48:39,747 and every single form of every single word. 776 00:48:40,677 --> 00:48:43,391 So that's a very comprehensive use case 777 00:48:43,391 --> 00:48:46,340 for official stable variants. 778 00:48:46,340 --> 00:48:48,942 It's not dialect, it's not regions, 779 00:48:49,332 --> 00:48:53,627 it's basically two coexisting morphological systems. 780 00:48:54,537 --> 00:48:58,926 And I too don't know exactly how to express that in Lexeme today, 781 00:48:58,926 --> 00:49:02,800 which is one thing that is keeping me in partial answer to Magnus' question 782 00:49:02,800 --> 00:49:05,238 from uploading the parts that are ready 783 00:49:05,238 --> 00:49:09,394 from the biggest Hebrew dictionary, which is public domain 784 00:49:09,394 --> 00:49:13,141 and which I have been digitizing for several years now. 785 00:49:13,141 --> 00:49:14,803 A good portion of it is ready, 786 00:49:14,803 --> 00:49:16,549 but I'm not putting it on Lexeme right now 787 00:49:16,549 --> 00:49:20,245 because I don't know exactly how to solve this problem. 788 00:49:20,245 --> 00:49:23,387 Alright, let's solve this problem here. (chuckles) 789 00:49:24,503 --> 00:49:26,021 That has to be possible. 
790 00:49:30,045 --> 00:49:32,047 Alright, more questions? 791 00:49:37,173 --> 00:49:39,735 If not, then thank you so much. 792 00:49:40,605 --> 00:49:42,675 (applause)