1 00:00:05,945 --> 00:00:09,476 Hello everyone to the Data Quality panel. 2 00:00:10,288 --> 00:00:13,671 Data quality matters because more and more people out there 3 00:00:13,672 --> 00:00:19,289 rely on our data being in good shape, so we're going to talk about data quality, 4 00:00:20,029 --> 00:00:26,000 and there will be four speakers who will give short introductions 5 00:00:26,000 --> 00:00:29,539 on topics related to data quality and then we will have a Q and A. 6 00:00:30,130 --> 00:00:32,234 And the first one is Lucas. 7 00:00:34,385 --> 00:00:35,385 Thank you. 8 00:00:35,901 --> 00:00:39,899 Hi, I'm Lucas, and I'm going to start with an overview 9 00:00:39,899 --> 00:00:43,806 of data quality tools that we already have on Wikidata 10 00:00:43,807 --> 00:00:46,109 and also some things that are coming up soon. 11 00:00:46,932 --> 00:00:50,623 And I've grouped them into some general themes 12 00:00:50,623 --> 00:00:53,761 of making errors more visible, making problems actionable, 13 00:00:53,762 --> 00:00:56,322 getting more eyes on the data so that people notice the problems, 14 00:00:56,945 --> 00:01:02,616 fix some common sources of errors, maintain the quality of the existing data 15 00:01:02,616 --> 00:01:03,966 and also human curation. 16 00:01:05,063 --> 00:01:09,874 And the ones that are currently available start with property constraints. 17 00:01:10,388 --> 00:01:12,421 So you've probably seen this if you're on Wikidata. 18 00:01:12,422 --> 00:01:14,029 You can sometimes get these icons 19 00:01:14,530 --> 00:01:17,241 which check the internal consistency of the data. 20 00:01:17,242 --> 00:01:20,800 For example, if one event follows the other, 21 00:01:20,801 --> 00:01:23,760 then the other event should also be followed by this one, 22 00:01:23,761 --> 00:01:27,161 which on the WikidataCon item was apparently missing. 23 00:01:27,162 --> 00:01:29,360 I'm not sure, this feature is a few days old. 
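As a rough illustration of the inverse check just described: a minimal sketch, assuming statements are held as plain (subject, property, value) triples. The function name `missing_inverses` and the event labels are hypothetical, and this is not how the property constraint system itself is implemented.

```python
def missing_inverses(statements, prop, inverse):
    """Return the inverse statements that should exist but do not.

    `statements` is a set of (subject, property, value) triples.
    This toy layout is an assumption made for the illustration.
    """
    missing = []
    for subject, p, value in sorted(statements):
        # For every "X prop Y" we expect a matching "Y inverse X".
        if p == prop and (value, inverse, subject) not in statements:
            missing.append((value, inverse, subject))
    return missing


# Hypothetical events; "Event B" lacks its "followed by" statement.
statements = {
    ("Event B", "follows", "Event A"),
    ("Event A", "followed by", "Event B"),
    ("Event C", "follows", "Event B"),
}

violations = missing_inverses(statements, "follows", "followed by")
print(violations)
```

Each returned triple is a statement a human could add to make the two items consistent, which is the kind of actionable hint the constraint icons give.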
24 00:01:30,040 --> 00:01:34,681 And there's also, if this is too limited or simple for you, 25 00:01:34,682 --> 00:01:38,080 you can write any checks you want using the Query Service 26 00:01:38,081 --> 00:01:39,842 which is useful for lots of things of course, 27 00:01:39,843 --> 00:01:44,543 but you can also use it for finding errors. 28 00:01:44,544 --> 00:01:46,974 Like if you've noticed one occurrence of a mistake, 29 00:01:46,975 --> 00:01:49,709 then you can check if there are other places 30 00:01:49,710 --> 00:01:51,958 where people have made a very similar error 31 00:01:51,958 --> 00:01:53,438 and find that with the Query Service. 32 00:01:53,439 --> 00:01:54,559 You can also combine the two 33 00:01:54,560 --> 00:01:57,874 and search for constraint violations in the Query Service, 34 00:01:57,875 --> 00:02:01,240 for example, only the violations in some area 35 00:02:01,241 --> 00:02:03,762 or WikiProject that's relevant to you, 36 00:02:03,762 --> 00:02:06,828 although the results are currently not complete, sadly. 37 00:02:08,422 --> 00:02:09,877 There is revision scoring. 38 00:02:10,690 --> 00:02:12,666 That's... I think this is from the recent changes 39 00:02:12,667 --> 00:02:16,217 you can also get it on your watch list an automatic assessment 40 00:02:16,217 --> 00:02:20,249 of is this edit likely to be in good faith or in bad faith 41 00:02:20,250 --> 00:02:22,312 and is it likely to be damaging or not damaging, 42 00:02:22,313 --> 00:02:24,205 I think those are the two dimensions. 43 00:02:24,206 --> 00:02:25,686 So you can, if you want, 44 00:02:25,687 --> 00:02:29,898 focus on just looking through the damaging but good faith edits. 45 00:02:29,899 --> 00:02:32,523 If you're feeling particularly friendly and welcoming 46 00:02:32,524 --> 00:02:37,121 you can tell these editors, "Thank you for your contribution, 47 00:02:37,122 --> 00:02:40,560 here's how you should have done it but thank you, still." 
48 00:02:40,561 --> 00:02:42,186 And if you're not feeling that way, 49 00:02:42,187 --> 00:02:44,452 you can go through the bad faith, damaging edits, 50 00:02:44,453 --> 00:02:45,573 and revert the vandals. 51 00:02:47,544 --> 00:02:49,761 There's also, similar to that, entity scoring. 52 00:02:49,762 --> 00:02:52,590 So instead of scoring an edit, the change that it made, 53 00:02:52,591 --> 00:02:53,904 you score the whole revision, 54 00:02:53,904 --> 00:02:56,483 and I think that is the same quality measure 55 00:02:56,483 --> 00:02:59,863 that Lydia mentioned at the beginning of the conference. 56 00:03:00,372 --> 00:03:04,569 There's a user script up here that gives you a score of like one to five, 57 00:03:04,570 --> 00:03:08,176 I think it was, of what the quality of the current item is. 58 00:03:10,043 --> 00:03:15,528 The primary sources tool is for any database that you want to import, 59 00:03:15,528 --> 00:03:18,364 but that's not high enough quality to directly add to Wikidata, 60 00:03:18,374 --> 00:03:20,335 so you add it to the primary sources tool instead, 61 00:03:20,336 --> 00:03:22,956 and then humans can decide 62 00:03:22,956 --> 00:03:26,024 whether they should add these individual statements or not. 63 00:03:28,595 --> 00:03:31,901 Showing coordinates as maps is mainly a convenience feature 64 00:03:31,901 --> 00:03:33,588 but it's also useful for quality control. 65 00:03:33,588 --> 00:03:36,937 Like if you see this is supposed to be the office of Wikimedia Germany 66 00:03:36,938 --> 00:03:39,400 and if the coordinates are somewhere in the Indian Ocean, 67 00:03:39,401 --> 00:03:41,529 then you know that something is not right there 68 00:03:41,530 --> 00:03:44,790 and you can see it much more easily than if you just had the numbers.
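The same sanity check a map gives at a glance can be approximated in code. This is only a sketch under stated assumptions: the bounding box for Germany is a rough, hand-picked rectangle, and the item labels, the coordinates and the `suspicious` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        """True if the point lies inside the rectangle (edges included)."""
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


# Rough rectangle around Germany; an assumption for illustration only.
GERMANY = BoundingBox(min_lat=47.0, max_lat=55.5, min_lon=5.5, max_lon=15.5)


def suspicious(items, box):
    """Return the names of items whose coordinates fall outside `box`."""
    return [name for name, (lat, lon) in items.items()
            if not box.contains(lat, lon)]


items = {
    "office (plausible)": (52.5, 13.4),     # roughly Berlin
    "office (implausible)": (-20.0, 80.0),  # somewhere in the Indian Ocean
}

flagged = suspicious(items, GERMANY)
print(flagged)
```

A rectangle is a crude stand-in for a country outline, so a check like this can only point a human at candidates to look at on the map.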
69 00:03:46,382 --> 00:03:49,576 This is a gadget called the relative completeness indicator 70 00:03:49,577 --> 00:03:52,480 which shows you this little icon here 71 00:03:53,007 --> 00:03:55,652 telling you how complete it thinks this item is 72 00:03:55,652 --> 00:03:57,613 and also which properties are most likely missing, 73 00:03:57,614 --> 00:03:59,769 which is really useful if you're editing an item 74 00:03:59,769 --> 00:04:03,172 and you're in an area that you're not very familiar with 75 00:04:03,172 --> 00:04:05,661 and you don't know what the right properties to use are, 76 00:04:05,662 --> 00:04:08,230 then this is a very useful gadget to have. 77 00:04:09,604 --> 00:04:11,401 And we have Shape Expressions. 78 00:04:11,402 --> 00:04:15,624 I think Andra or Jose are going to talk more about those 79 00:04:15,624 --> 00:04:19,757 but basically, a very powerful way of comparing the data you have 80 00:04:19,758 --> 00:04:20,758 against the schema, 81 00:04:20,759 --> 00:04:22,680 like what statement should certain entities have, 82 00:04:22,681 --> 00:04:25,677 what other entities should they link to and what should those look like, 83 00:04:26,229 --> 00:04:29,374 and then you can find problems that way. 84 00:04:30,366 --> 00:04:32,361 I think... No there is still more. 85 00:04:32,362 --> 00:04:34,321 Integraality or property dashboard. 86 00:04:34,322 --> 00:04:36,773 It gives you a quick overview of the data you already have. 87 00:04:36,774 --> 00:04:39,147 For example, this is from the WikiProject Red Pandas, 88 00:04:39,657 --> 00:04:41,681 and you can see that we have a sex or gender 89 00:04:41,682 --> 00:04:43,561 for almost all of the red pandas, 90 00:04:43,561 --> 00:04:46,854 the date of birth varies a lot by which zoo they come from 91 00:04:46,854 --> 00:04:50,255 and we have almost no dead pandas which is wonderful, 92 00:04:51,437 --> 00:04:52,600 because they're so cute. 93 00:04:53,699 --> 00:04:55,654 So this is also useful. 
94 00:04:56,377 --> 00:04:59,185 There we go, OK, now for the things that are coming up. 95 00:04:59,889 --> 00:05:03,784 Wikidata Bridge, formerly known as client editing, 96 00:05:03,785 --> 00:05:07,076 so editing Wikidata from Wikipedia infoboxes 97 00:05:07,675 --> 00:05:11,725 which will on the one hand get more eyes on the data 98 00:05:11,725 --> 00:05:13,441 because more people can see the data there 99 00:05:13,441 --> 00:05:18,841 and it will hopefully encourage more use of Wikidata in the Wikipedias 100 00:05:18,841 --> 00:05:20,920 and that means that more people can notice 101 00:05:20,921 --> 00:05:23,389 if, for example, some data is outdated and needs to be updated 102 00:05:23,857 --> 00:05:27,000 instead of if they would only see it on Wikidata itself. 103 00:05:28,630 --> 00:05:30,656 There is also tainted references. 104 00:05:30,657 --> 00:05:33,959 The idea here is that if you edit a statement value, 105 00:05:34,683 --> 00:05:37,279 you might want to update the references as well, 106 00:05:37,280 --> 00:05:39,373 unless it was just a typo or something. 107 00:05:39,897 --> 00:05:43,662 And this tainted references tells editors that 108 00:05:43,663 --> 00:05:49,756 and also lets other editors see which other edits were made 109 00:05:49,756 --> 00:05:52,471 that edited a statement value and didn't update a reference 110 00:05:52,472 --> 00:05:56,766 then you can clean up after that and decide should that be... 111 00:05:57,737 --> 00:05:59,566 Do you need to do anything more of that 112 00:05:59,566 --> 00:06:02,796 or is that actually fine and you don't need to update the reference. 113 00:06:03,543 --> 00:06:09,336 That's related to signed statements which is coming from a concern, I think, 114 00:06:09,336 --> 00:06:12,355 that some data providers have that like...
115 00:06:14,131 --> 00:06:17,231 There's a statement that's referenced through the UNESCO or something 116 00:06:17,232 --> 00:06:19,872 and then suddenly, someone vandalizes the statement 117 00:06:19,873 --> 00:06:21,836 and they are worried that it will look like 118 00:06:22,827 --> 00:06:26,992 this organization, like UNESCO, still set this vandalism value 119 00:06:26,993 --> 00:06:28,706 and so, with signed statements, 120 00:06:28,706 --> 00:06:31,488 they can cryptographically sign this reference 121 00:06:31,488 --> 00:06:33,562 and that doesn't prevent any edits to it, 122 00:06:34,169 --> 00:06:37,744 but at least, if someone vandalizes the statement 123 00:06:37,744 --> 00:06:40,255 or edits it in any way, then the signature is no longer valid, 124 00:06:40,255 --> 00:06:43,401 and you can tell this is not exactly what the organization said, 125 00:06:43,402 --> 00:06:47,064 and perhaps it's a good edit and they should re-sign the new statement, 126 00:06:47,065 --> 00:06:49,851 but also perhaps it should be reverted. 127 00:06:51,203 --> 00:06:54,166 And also, this is going to be very exciting, I think, 128 00:06:54,166 --> 00:06:56,846 Citoid is this amazing system they have on Wikipedia 129 00:06:57,379 --> 00:07:01,340 where you can paste a URL, or an identifier, or an ISBN 130 00:07:01,340 --> 00:07:04,759 or Wikidata ID or basically anything into the Visual Editor, 131 00:07:05,260 --> 00:07:08,241 and it spits out a reference that is nicely formatted 132 00:07:08,242 --> 00:07:11,049 and has all the data you want and it's wonderful to use. 
133 00:07:11,049 --> 00:07:14,337 And by comparison, on Wikidata, if I want to add a reference 134 00:07:14,338 --> 00:07:18,801 I typically have to add a reference URL, title, author name string, 135 00:07:18,802 --> 00:07:20,449 published in, publication date, 136 00:07:20,450 --> 00:07:25,141 retrieve dates, at least those, and that's annoying, 137 00:07:25,141 --> 00:07:29,261 and integrating Citoid into Wikibase will hopefully help with that. 138 00:07:30,245 --> 00:07:33,604 And I think that's all the ones I had, yeah. 139 00:07:33,604 --> 00:07:36,400 So now, I'm going to pass to Cristina. 140 00:07:37,788 --> 00:07:42,339 (applause) 141 00:07:43,780 --> 00:07:45,471 Hi, I'm Cristina. 142 00:07:45,472 --> 00:07:47,672 I'm a research scientist from the University of Zürich, 143 00:07:47,673 --> 00:07:51,417 and I'm also an active member of the Swiss Community. 144 00:07:52,698 --> 00:07:57,901 When Claudia Müller-Birn and I submitted this to the WikidataCon, 145 00:07:57,902 --> 00:08:00,410 what we wanted to do is continue our discussion 146 00:08:00,411 --> 00:08:02,424 that we started in the beginning of the year 147 00:08:02,424 --> 00:08:07,442 with a workshop on data quality and also some sessions in Wikimania. 148 00:08:07,442 --> 00:08:10,535 So the goal of this talk is basically to bring some thoughts 149 00:08:10,536 --> 00:08:14,432 that we have been collecting from the community and ourselves 150 00:08:14,432 --> 00:08:16,560 and continue discussion. 151 00:08:16,561 --> 00:08:20,065 So what we would like is to continue interacting a lot with you. 
152 00:08:21,557 --> 00:08:23,371 So what we think is very important 153 00:08:23,372 --> 00:08:27,580 is that we continuously ask all types of users in the community 154 00:08:27,581 --> 00:08:32,240 about what they really need, what problems they have with data quality, 155 00:08:32,240 --> 00:08:35,000 not only editors but also the people who are coding, 156 00:08:35,000 --> 00:08:36,241 or consuming the data, 157 00:08:36,242 --> 00:08:39,494 and also researchers who are actually using all the edit history 158 00:08:39,494 --> 00:08:40,800 to analyze what is happening. 159 00:08:42,367 --> 00:08:48,431 So we did a review of around 80 tools that are existing in Wikidata 160 00:08:48,431 --> 00:08:52,380 and we aligned them to the different data quality dimensions. 161 00:08:52,380 --> 00:08:54,360 And what we saw was that actually, 162 00:08:54,361 --> 00:08:57,681 many of them were looking at, monitoring completeness, 163 00:08:57,682 --> 00:09:02,820 but actually... and also some of them are also enabling interlinking. 164 00:09:02,820 --> 00:09:08,442 But there is a big need for tools that are looking into diversity, 165 00:09:08,443 --> 00:09:12,824 which is one of the things that we actually can have in Wikidata, 166 00:09:12,824 --> 00:09:15,958 especially this design principle of Wikidata 167 00:09:15,959 --> 00:09:17,901 where we can have plurality 168 00:09:17,902 --> 00:09:20,308 and different statements with different values 169 00:09:21,034 --> 00:09:22,236 coming from different sources. 170 00:09:22,236 --> 00:09:24,921 Because it's a secondary source, we don't have really tools 171 00:09:24,922 --> 00:09:27,750 that actually tell us how many plural statements there are, 172 00:09:27,751 --> 00:09:30,889 and how many we can improve and how, 173 00:09:30,890 --> 00:09:32,833 and we also don't know really 174 00:09:32,833 --> 00:09:35,538 what are all the reasons for plurality that we can have. 
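To make the question "how many plural statements are there" concrete, here is a minimal sketch. It assumes statements are available as (item, property, value, source) tuples; that layout, the function name `plural_statements` and the sample data are all hypothetical.

```python
from collections import defaultdict


def plural_statements(statements):
    """Map (item, property) to its values when more than one distinct value exists."""
    values = defaultdict(set)
    for item, prop, value, _source in statements:
        values[(item, prop)].add(value)
    # Keep only the pairs where different values coexist.
    return {key: sorted(vals) for key, vals in values.items() if len(vals) > 1}


# Hypothetical data: two sources disagree on the population of "City X".
statements = [
    ("City X", "population", 1000, "Census A"),
    ("City X", "population", 1050, "Register B"),
    ("City X", "country", "Country Y", "Census A"),
]

plural = plural_statements(statements)
print(len(plural), plural)
```

A count like this says nothing about why the values differ, which is the open question raised here about the reasons for plurality.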
175 00:09:36,491 --> 00:09:39,201 So from these community meetings, 176 00:09:39,201 --> 00:09:43,084 what we discussed was the challenges that still need attention. 177 00:09:43,084 --> 00:09:47,249 For example, that having all these crowdsourcing communities 178 00:09:47,249 --> 00:09:49,613 is very good because different people attack different parts 179 00:09:49,613 --> 00:09:51,833 of the data or the graph, 180 00:09:51,834 --> 00:09:54,615 and we also have different background knowledge 181 00:09:54,616 --> 00:09:59,161 but actually, it's very difficult to align everything in something homogeneous 182 00:09:59,162 --> 00:10:04,920 because different people are using different properties in different ways 183 00:10:04,920 --> 00:10:08,401 and they are also expecting different things from entity descriptions. 184 00:10:09,003 --> 00:10:12,721 People also said that they also need more tools 185 00:10:12,722 --> 00:10:16,000 that give a better overview of the global status of things. 186 00:10:16,000 --> 00:10:20,733 So what entities are missing in terms of completeness, 187 00:10:20,733 --> 00:10:26,121 but also like what are people working on right now most of the time, 188 00:10:26,121 --> 00:10:30,516 and they also mention many times a tighter collaboration 189 00:10:30,517 --> 00:10:33,311 across not only languages but the WikiProjects 190 00:10:33,311 --> 00:10:35,571 and the different Wikimedia platforms. 191 00:10:35,571 --> 00:10:38,859 And we published all the transcribed comments 192 00:10:38,860 --> 00:10:42,959 from all these discussions in those links here in the Etherpads 193 00:10:42,959 --> 00:10:46,162 and also in the wiki page of Wikimania. 
194 00:10:46,162 --> 00:10:48,481 Some solutions that appeared actually 195 00:10:48,481 --> 00:10:53,001 were going into the direction of sharing more of the best practices 196 00:10:53,001 --> 00:10:55,762 that are being developed in different WikiProjects, 197 00:10:55,762 --> 00:11:01,238 but also people want tools that help organize work in teams 198 00:11:01,239 --> 00:11:03,845 or at least understand who is working on that, 199 00:11:03,845 --> 00:11:07,815 and they were also mentioning that they want more showcases 200 00:11:07,816 --> 00:11:12,019 and more templates that help them create things in a better way. 201 00:11:12,946 --> 00:11:15,161 And from the contact that we have 202 00:11:15,162 --> 00:11:18,721 with Open Governmental Data Organizations, 203 00:11:18,722 --> 00:11:20,068 and in particular, 204 00:11:20,068 --> 00:11:23,102 I am in contact with the canton and the city of Zürich, 205 00:11:23,102 --> 00:11:26,207 they are very interested in working with Wikidata 206 00:11:26,207 --> 00:11:29,896 because they want their data to be accessible for everyone 207 00:11:29,897 --> 00:11:33,681 in the place where people go and consult or access data. 208 00:11:33,682 --> 00:11:36,550 So for them, something that would be really interesting 209 00:11:36,551 --> 00:11:38,600 is to have some kind of quality indicators 210 00:11:38,600 --> 00:11:41,082 both in the wiki, which is already happening, 211 00:11:41,082 --> 00:11:42,801 but also in SPARQL results, 212 00:11:42,802 --> 00:11:46,066 to know whether or not they can trust that data from the community. 213 00:11:46,067 --> 00:11:48,230 And then, they also want to know 214 00:11:48,230 --> 00:11:51,417 what parts of their own data sets are useful for Wikidata 215 00:11:51,418 --> 00:11:56,040 and they would love to have a tool that can help them assess that automatically.
216 00:11:56,041 --> 00:11:59,066 They also need some kind of methodology or tool 217 00:11:59,067 --> 00:12:03,894 that helps them decide whether they should import or link their data 218 00:12:03,894 --> 00:12:04,894 because in some cases, 219 00:12:04,895 --> 00:12:07,137 they also have their own linked open data sets, 220 00:12:07,138 --> 00:12:09,746 so they don't know whether to just ingest the data 221 00:12:09,747 --> 00:12:13,424 or to keep on creating links from the data sets to Wikidata 222 00:12:13,425 --> 00:12:14,425 and the other way around. 223 00:12:14,950 --> 00:12:20,043 And they also want to know where their websites are referred in Wikidata. 224 00:12:20,044 --> 00:12:23,361 And when they run such a query in the query service, 225 00:12:23,362 --> 00:12:24,848 they often get timeouts, 226 00:12:24,849 --> 00:12:28,181 so maybe we should really create more tools 227 00:12:28,181 --> 00:12:32,240 that help them get these answers for their questions. 228 00:12:33,148 --> 00:12:36,208 And, besides that, 229 00:12:36,208 --> 00:12:39,361 we wiki researchers also sometimes 230 00:12:39,362 --> 00:12:42,023 lack some information in the edit summaries. 231 00:12:42,024 --> 00:12:44,953 So I remember that when we were doing some work 232 00:12:44,954 --> 00:12:48,919 to understand the different behavior of editors 233 00:12:48,919 --> 00:12:53,403 with tools or bots or anonymous users and so on, 234 00:12:53,403 --> 00:12:56,154 we were really lacking, for example, 235 00:12:56,154 --> 00:13:01,112 a standard way of tracing that tools were being used. 236 00:13:01,113 --> 00:13:03,154 And there are some tools that are already doing that 237 00:13:03,155 --> 00:13:05,230 like PetScan and many others, 238 00:13:05,230 --> 00:13:07,720 but maybe we should in the community 239 00:13:07,721 --> 00:13:13,531 discuss more about how to record these for fine-grained provenance. 
240 00:13:14,169 --> 00:13:15,321 And further on, 241 00:13:15,322 --> 00:13:20,801 we think that we need to think of more concrete data quality dimensions 242 00:13:20,802 --> 00:13:24,961 that are related to linked data but not all the types of data, 243 00:13:24,962 --> 00:13:30,721 so we worked on some measures to assess actually the information gain 244 00:13:30,722 --> 00:13:33,881 enabled by the links, and what we mean by that 245 00:13:33,882 --> 00:13:36,681 is that when we link Wikidata to other data sets, 246 00:13:36,682 --> 00:13:38,201 we should also be thinking 247 00:13:38,202 --> 00:13:41,921 how much the entities are actually gaining in the classification, 248 00:13:41,922 --> 00:13:45,601 also in the description but also in the vocabularies they use. 249 00:13:45,602 --> 00:13:51,041 So just to give a very simple example of what I mean with this 250 00:13:51,042 --> 00:13:54,269 is we can think of-- in this case, would be Wikidata 251 00:13:54,270 --> 00:13:57,771 or the external data set that is linking to Wikidata, 252 00:13:57,772 --> 00:14:00,487 we have the entity for a person that is called Natasha Noy, 253 00:14:00,487 --> 00:14:02,601 we have the affiliation and other things, 254 00:14:02,602 --> 00:14:05,239 and then we say OK, we link to an external place, 255 00:14:05,240 --> 00:14:08,919 and that entity also has that name, but we actually have the same value. 256 00:14:08,920 --> 00:14:12,889 So what would be better is that we link to something that has a different name, 257 00:14:12,889 --> 00:14:16,881 that is still valid because this person has two ways of writing the name, 258 00:14:16,882 --> 00:14:19,714 and also other information that we don't have in Wikidata 259 00:14:19,715 --> 00:14:21,760 or that we don't have in the other data set.
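One simple way to picture the gain from a link, as a sketch only: compare the values the linked record carries with the ones already held locally. The dictionary layout, the function name `link_gain` and the placeholder values in angle brackets are assumptions for illustration, not the measures that were actually developed.

```python
def link_gain(local, remote):
    """For each property, list the values the remote record adds to the local one."""
    gain = {}
    for prop, remote_values in remote.items():
        new_values = set(remote_values) - set(local.get(prop, ()))
        if new_values:
            gain[prop] = sorted(new_values)
    return gain


# Local record; only the name comes from the talk, the rest is placeholder data.
local = {"name": {"Natasha Noy"}, "type": {"person"}}

# A link whose target only repeats what is already known adds nothing.
remote_same = {"name": {"Natasha Noy"}}

# A link whose target has another spelling and another class adds information.
remote_richer = {
    "name": {"Natasha Noy", "<alternate spelling>"},
    "type": {"person", "<extra class>"},
}

no_gain = link_gain(local, remote_same)
some_gain = link_gain(local, remote_richer)
print(no_gain, some_gain)
```

An empty result for the first link and a non-empty one for the second mirrors the distinction drawn in the example: both links are valid, but only one of them tells us something new.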
260 00:14:22,390 --> 00:14:24,652 But also, what is even better 261 00:14:24,653 --> 00:14:27,770 is that we are actually looking in the target data set 262 00:14:27,770 --> 00:14:31,392 that they also have new ways of classifying the information. 263 00:14:31,393 --> 00:14:35,354 So not only is this a person, but in the other data set, 264 00:14:35,355 --> 00:14:39,525 they also say it's a female or anything else that they classify with. 265 00:14:39,526 --> 00:14:43,401 And if in the other data set, they are using many other vocabularies 266 00:14:43,402 --> 00:14:46,588 that is also helping in their whole information retrieval thing. 267 00:14:47,371 --> 00:14:51,233 So with that, I also would like to say 268 00:14:51,234 --> 00:14:55,809 that we think that we can showcase federated queries better 269 00:14:55,810 --> 00:15:00,448 because when we look at the query log provided by Malyshev et al., 270 00:15:01,285 --> 00:15:04,301 we see actually that from the organic queries, 271 00:15:04,302 --> 00:15:06,921 we have only very few federated queries. 272 00:15:06,922 --> 00:15:12,801 And actually, federation is one of the key advantages of having linked data, 273 00:15:12,802 --> 00:15:16,903 so maybe the community or the people using Wikidata 274 00:15:16,903 --> 00:15:18,898 also need more examples on this. 275 00:15:18,898 --> 00:15:22,666 And if we look at the list of endpoints that are being used, 276 00:15:22,667 --> 00:15:25,401 this is not a complete list and we have many more. 277 00:15:25,402 --> 00:15:30,479 Of course, this data was analyzed from queries until March 2018, 278 00:15:30,480 --> 00:15:34,807 but we should look into the list of federated endpoints that we have 279 00:15:34,808 --> 00:15:37,048 and see whether we are really using them or not.
280 00:15:37,813 --> 00:15:40,441 So two questions that I have for the audience 281 00:15:40,442 --> 00:15:43,001 that maybe we can use afterwards for the discussion are: 282 00:15:43,001 --> 00:15:46,001 what data quality problems should be addressed in your opinion, 283 00:15:46,002 --> 00:15:47,412 because of the needs that you have, 284 00:15:47,412 --> 00:15:50,401 but also, where do you need more automation 285 00:15:50,402 --> 00:15:52,943 to help you with editing or patrolling. 286 00:15:53,866 --> 00:15:55,146 That's all, thank you very much. 287 00:15:55,779 --> 00:15:57,527 (applause) 288 00:16:06,030 --> 00:16:08,595 (Jose Emilio Labra) OK, so what I'm going to talk about 289 00:16:08,595 --> 00:16:14,715 is some tools that we were developing related with Shape Expressions. 290 00:16:15,536 --> 00:16:19,371 So this is what I want to talk... I am Jose Emilio Labra, 291 00:16:19,371 --> 00:16:23,215 but this has... all these tools have been done by different people, 292 00:16:23,920 --> 00:16:28,480 mainly related with W3C ShEx, Shape Expressions Community Group. 293 00:16:28,481 --> 00:16:29,481 ShEx Community Group. 294 00:16:30,144 --> 00:16:36,081 So the first tool that I want to mention is RDFShape, this is a general tool, 295 00:16:36,082 --> 00:16:40,681 because Shape Expressions is not only for Wikidata, 296 00:16:40,682 --> 00:16:44,168 Shape Expressions is a language to validate RDF in general. 297 00:16:44,168 --> 00:16:47,568 So this tool was developed mainly by me 298 00:16:47,568 --> 00:16:50,880 and it's a tool to validate RDF in general. 299 00:16:50,881 --> 00:16:55,139 So if you want to learn about RDF or you want to validate RDF 300 00:16:55,140 --> 00:16:58,621 or SPARQL endpoints not only in Wikidata, 301 00:16:58,622 --> 00:17:00,891 my advice is that you can use this tool. 302 00:17:00,891 --> 00:17:03,255 Also for teaching. 
303 00:17:03,255 --> 00:17:05,640 I am a teacher in the university 304 00:17:05,641 --> 00:17:09,151 and I use it in my semantic web course to teach RDF. 305 00:17:09,161 --> 00:17:12,121 So if you want to learn RDF, I think it's a good tool. 306 00:17:13,033 --> 00:17:17,598 For example, this is just a visualization of an RDF graph with the tool. 307 00:17:18,587 --> 00:17:22,643 But before coming here, in the last month, 308 00:17:22,643 --> 00:17:28,441 I started a fork of rdfshape specifically for Wikidata, because I thought... 309 00:17:28,443 --> 00:17:33,082 It's called WikiShape, and yesterday, I presented it as a present for Wikidata. 310 00:17:33,082 --> 00:17:34,441 So what I took is... 311 00:17:34,442 --> 00:17:39,898 What I did is to remove all the stuff that was not related with Wikidata 312 00:17:39,898 --> 00:17:44,801 and to put several things, hard-coded, for example, the Wikidata SPARQL endpoint, 313 00:17:44,802 --> 00:17:49,041 but now, someone asked me if I could do it also for Wikibase. 314 00:17:49,042 --> 00:17:52,000 And it is very easy to do it for Wikibase also. 315 00:17:52,760 --> 00:17:56,280 So this tool, WikiShape, is quite new. 316 00:17:57,015 --> 00:17:59,843 I think it works, most of the features, 317 00:17:59,844 --> 00:18:02,468 but there are some features that maybe don't work, 318 00:18:02,469 --> 00:18:06,281 and if you try it and you want to improve it, please tell me. 319 00:18:06,281 --> 00:18:12,680 So this is [inaudible] captures, but I think I can even try so let's try. 320 00:18:15,385 --> 00:18:16,945 So let's see if it works. 321 00:18:16,953 --> 00:18:20,070 First, I have to go out of the... 322 00:18:22,453 --> 00:18:23,453 Here. 323 00:18:24,226 --> 00:18:28,324 Alright, yeah. So this is the tool here. 324 00:18:28,324 --> 00:18:29,844 Things that you can do with the tool, 325 00:18:29,845 --> 00:18:35,275 for example, is that you can check schemas, entity schemas. 
326 00:18:35,276 --> 00:18:38,611 You know that there is a new namespace which is "E whatever," 327 00:18:38,612 --> 00:18:44,805 so here, if you start for example, write for example "human"... 328 00:18:44,806 --> 00:18:48,812 As you are writing, its autocomplete allows you to check, 329 00:18:48,812 --> 00:18:52,001 for example, this is the Shape Expressions of a human, 330 00:18:52,790 --> 00:18:55,937 and this is the Shape Expressions here. 331 00:18:55,938 --> 00:18:59,841 And as you can see, this editor has syntax highlighting, 332 00:18:59,842 --> 00:19:04,559 this is... well, maybe it's very small, the screen. 333 00:19:05,676 --> 00:19:07,590 I can try to make it bigger. 334 00:19:09,194 --> 00:19:10,973 Maybe you see it better now. 335 00:19:10,973 --> 00:19:14,241 So... and this is the editor with syntax highlighting and also has... 336 00:19:14,241 --> 00:19:17,851 I mean, this editor comes from the same source code 337 00:19:17,851 --> 00:19:19,641 as the Wikidata query service. 338 00:19:19,642 --> 00:19:23,960 So for example, if you hover with the mouse here, 339 00:19:23,961 --> 00:19:27,961 it shows you the labels of the different properties. 340 00:19:27,962 --> 00:19:31,298 So I think it's very helpful because now, 341 00:19:32,588 --> 00:19:38,601 the entity schemas that is in the Wikidata is just a plain text area, 342 00:19:38,602 --> 00:19:42,493 and I think this editor is much better because it has autocomplete 343 00:19:42,494 --> 00:19:43,743 and it also has... 344 00:19:43,744 --> 00:19:48,241 I mean, if you, for example, wanted to add a constraint, 345 00:19:48,241 --> 00:19:51,570 you say "wdt:" 346 00:19:51,570 --> 00:19:56,884 You start writing "author" and then you press *Ctrl+Space* 347 00:19:56,884 --> 00:19:58,922 and it suggests the different things.
348 00:19:58,922 --> 00:20:02,388 So this is similar to the Wikidata query service 349 00:20:02,389 --> 00:20:06,445 but specifically for Shape Expressions 350 00:20:06,445 --> 00:20:11,975 because my feeling is that creating Shape Expressions 351 00:20:11,976 --> 00:20:15,841 is not more difficult than writing SPARQL queries. 352 00:20:15,842 --> 00:20:21,255 So some people think that it's at the same level, 353 00:20:22,278 --> 00:20:26,296 It's probably easier, I think, because Shape Expressions was, 354 00:20:26,296 --> 00:20:31,241 when we designed it, we were doing it to be easier to work. 355 00:20:31,242 --> 00:20:35,001 OK, so this is one of the first things, that you have this editor 356 00:20:35,001 --> 00:20:36,620 for Shape Expressions. 357 00:20:37,371 --> 00:20:41,467 And then you also have the possibility, for example, to visualize. 358 00:20:41,468 --> 00:20:44,801 If you have a Shape Expression, use for example... 359 00:20:44,802 --> 00:20:49,386 I think, "written work" is a nice Shape Expression 360 00:20:49,386 --> 00:20:53,300 because it has some relationships between different things. 361 00:20:54,823 --> 00:20:58,160 And this is the UML visualization of written work. 362 00:20:58,161 --> 00:21:02,090 In a UML, this is easy to see the different properties. 363 00:21:02,790 --> 00:21:06,794 When you do this, I realized when I tried with several people, 364 00:21:06,795 --> 00:21:09,216 they find some mistakes in their Shape Expressions 365 00:21:09,217 --> 00:21:12,988 because it's easy to detect which are the missing properties or whatever. 366 00:21:13,588 --> 00:21:15,771 Then there is another possibility here 367 00:21:15,772 --> 00:21:19,520 is that you can also validate, I think I have it here, the validation. 368 00:21:20,496 --> 00:21:25,285 I think I had it in some label, maybe I closed it. 
369 00:21:26,267 --> 00:21:30,988 OK, but you can, for example, you can click here, *Validate entities.* 370 00:21:32,308 --> 00:21:34,232 You, for example, 371 00:21:35,404 --> 00:21:41,921 "q42" with "e42" which is author. 372 00:21:42,818 --> 00:21:46,180 With "human," I think we can do it with "human." 373 00:21:49,050 --> 00:21:50,050 And then it's... 374 00:21:50,688 --> 00:21:56,365 And it's taking a little while to do it because this is doing the SPARQL queries 375 00:21:56,365 --> 00:21:59,134 and now, for example, it's failing by the network but... 376 00:21:59,657 --> 00:22:01,580 So you can try it. 377 00:22:02,759 --> 00:22:07,026 OK, so let's go continue with the presentation, with other tools. 378 00:22:07,026 --> 00:22:12,353 So my advice is that if you want to try it and you want any feedback let me know. 379 00:22:13,133 --> 00:22:15,540 So to continue with the presentation... 380 00:22:18,923 --> 00:22:20,233 So this is WikiShape. 381 00:22:23,800 --> 00:22:26,509 Then, I already said this, 382 00:22:27,681 --> 00:22:34,157 the Shape Expressions Editor is an independent project in GitHub. 383 00:22:35,605 --> 00:22:37,472 You can use it in your own project. 384 00:22:37,472 --> 00:22:41,036 If you want to do a Shape Expressions tool, 385 00:22:41,036 --> 00:22:45,635 you can just embed it in any other project, 386 00:22:45,636 --> 00:22:48,235 so this is in GitHub and you can use it. 387 00:22:48,868 --> 00:22:51,970 Then the same author, it's one of my students, 388 00:22:52,684 --> 00:22:55,704 he also created an editor for Shape Expressions, 389 00:22:55,704 --> 00:22:57,799 also inspired by the Wikidata query service 390 00:22:57,800 --> 00:23:00,681 where, in a column, 391 00:23:00,682 --> 00:23:05,103 you have this more visual editor of SPARQL queries 392 00:23:05,104 --> 00:23:07,135 where you can put this kind of things. 393 00:23:07,136 --> 00:23:09,123 So this is a screen capture. 
394 00:23:09,123 --> 00:23:12,662 You can see that that's the Shape Expression in text 395 00:23:12,662 --> 00:23:17,822 but this is a form-based Shape Expression where it would probably take a bit longer 396 00:23:18,595 --> 00:23:23,400 where you can put the different rows on the different fields. 397 00:23:23,401 --> 00:23:25,800 OK, then there is ShExEr. 398 00:23:26,879 --> 00:23:31,882 We have... it's done by one PhD student at the University of Oviedo 399 00:23:31,883 --> 00:23:34,080 and he's here, so you can present ShExEr. 400 00:23:38,147 --> 00:23:40,024 (Danny) Hello, I am Danny Fernández, 401 00:23:40,025 --> 00:23:43,800 I am a PhD student at the University of Oviedo working with Labra. 402 00:23:44,710 --> 00:23:47,725 Since we are running out of time, let's make this quick, 403 00:23:47,726 --> 00:23:52,641 so let's not go for any actual demo, but just show some screenshots. 404 00:23:52,642 --> 00:23:57,897 OK, so the usual way to work with Shape Expressions or any shape language 405 00:23:57,897 --> 00:23:59,521 is that you have a domain expert 406 00:23:59,522 --> 00:24:02,313 that defines a priori how the graph should look, 407 00:24:02,314 --> 00:24:03,555 defines some structures, 408 00:24:03,556 --> 00:24:06,983 and then you use these structures to validate the actual data against them. 409 00:24:08,124 --> 00:24:11,641 This tool, which, like the ones that Labra has been presenting, 410 00:24:11,642 --> 00:24:14,441 is a general-purpose tool for any RDF source, 411 00:24:14,442 --> 00:24:17,375 is designed to work the other way around. 412 00:24:17,376 --> 00:24:18,758 You already have some data, 413 00:24:18,759 --> 00:24:23,165 you select what nodes you want to get the shape about 414 00:24:23,165 --> 00:24:26,718 and then you automatically extract or infer the shape.
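The "other way around" workflow described here (start from data, pick some nodes, infer a shape) can be illustrated with a very small sketch. This is not ShExEr's actual algorithm or API; it assumes, purely for illustration, that triples are plain `(subject, predicate, object)` tuples and that a candidate constraint is simply "this predicate is used by at least a given share of the selected nodes". The function name `infer_shape` and the `min_ratio` parameter are hypothetical.

```python
from collections import defaultdict

def infer_shape(triples, focus_nodes, min_ratio=0.5):
    """Rank predicates by the share of focus nodes that use them.

    Hypothetical helper: returns (predicate, ratio) pairs, most common
    first, keeping only predicates used by at least min_ratio of the nodes.
    """
    focus = set(focus_nodes)
    if not focus:
        return []
    users = defaultdict(set)  # predicate -> focus nodes that use it
    for subject, predicate, _obj in triples:
        if subject in focus:
            users[predicate].add(subject)
    ranked = sorted(
        ((p, len(nodes) / len(focus)) for p, nodes in users.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return [(p, ratio) for p, ratio in ranked if ratio >= min_ratio]

# Toy data with made-up item identifiers.
triples = [
    ("Q1", "P31", "Q5"), ("Q1", "P569", "1952"),
    ("Q2", "P31", "Q5"), ("Q2", "P569", "1979"), ("Q2", "P106", "Q36180"),
    ("Q3", "P31", "Q5"),
]
shape = infer_shape(triples, ["Q1", "Q2", "Q3"])
for predicate, ratio in shape:
    print(f"{predicate}: used by {ratio:.0%} of the selected nodes")
```

Raising `min_ratio` keeps only the predicates shared by most of the selected nodes; lowering it keeps rarer ones as well.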
415 00:24:26,719 --> 00:24:29,791 So even if this is a general-purpose tool, 416 00:24:29,791 --> 00:24:34,063 what we did for this WikidataCon is this fancy button 417 00:24:34,884 --> 00:24:37,081 that if you click it, essentially what happens 418 00:24:37,081 --> 00:24:42,079 is that there are so many configuration params 419 00:24:42,080 --> 00:24:46,251 and it configures it to work against the Wikidata endpoint 420 00:24:46,251 --> 00:24:47,971 and it will end soon, sorry. 421 00:24:48,733 --> 00:24:52,883 So, once you press this button what you get is essentially this. 422 00:24:52,884 --> 00:24:55,126 After having selected what kind of nodes, 423 00:24:55,127 --> 00:24:59,360 what kind of instances of a class, whatever you are looking for, 424 00:24:59,361 --> 00:25:01,321 you get an automatic schema. 425 00:25:02,319 --> 00:25:07,111 All the constraints are sorted by how many nodes actually conform to them, 426 00:25:07,112 --> 00:25:09,772 you can filter the less common ones, etc. 427 00:25:09,772 --> 00:25:12,126 So there is a poster downstairs about this stuff 428 00:25:12,127 --> 00:25:14,595 and well, I will be downstairs and upstairs 429 00:25:14,596 --> 00:25:16,454 and all over the place all day, 430 00:25:16,455 --> 00:25:19,081 so if you have any further interest in this tool, 431 00:25:19,082 --> 00:25:21,476 just speak to me during the day. 432 00:25:21,477 --> 00:25:24,624 And now, I'll give the mic back to Labra, thank you. 433 00:25:24,625 --> 00:25:29,265 (applause) 434 00:25:29,812 --> 00:25:32,578 (Jose) So let's continue with the other tools. 435 00:25:32,579 --> 00:25:34,984 The other tool is the ShapeDesigner. 436 00:25:34,984 --> 00:25:37,241 Andra, do you want to do the ShapeDesigner now 437 00:25:37,242 --> 00:25:39,287 or maybe later or in the workshop? 438 00:25:39,287 --> 00:25:40,603 There is a workshop... 439 00:25:40,603 --> 00:25:44,437 This afternoon, there is a workshop specifically for Shape Expressions, and...
440 00:25:45,265 --> 00:25:47,939 The idea is that it was going to be more hands-on, 441 00:25:47,940 --> 00:25:52,324 and if you want to practice some ShEx, you can do it there. 442 00:25:52,875 --> 00:25:55,720 This tool is ShEx... and there is Eric here, 443 00:25:55,721 --> 00:25:56,890 so you can present it. 444 00:25:57,969 --> 00:26:00,687 (Eric) So just super quick, the thing that I want to say 445 00:26:00,687 --> 00:26:05,711 is that you've probably already seen the ShEx interface 446 00:26:05,711 --> 00:26:07,601 that's tailored for Wikidata. 447 00:26:07,602 --> 00:26:12,930 That's effectively stripped down and tailored specifically for Wikidata 448 00:26:12,930 --> 00:26:17,937 because the generic one has more features, but I thought I'd mention it 449 00:26:17,937 --> 00:26:19,977 because one of those features is particularly useful 450 00:26:19,978 --> 00:26:23,201 for debugging Wikidata schemas, 451 00:26:23,201 --> 00:26:29,224 which is if you go and you select the slurp mode, 452 00:26:29,225 --> 00:26:31,444 what it does is it says while I'm validating, 453 00:26:31,445 --> 00:26:34,694 I want to pull all the triples down and that means 454 00:26:34,695 --> 00:26:36,274 if I get a bunch of failures, 455 00:26:36,275 --> 00:26:39,586 I can go through and start looking at those failures and saying, 456 00:26:39,587 --> 00:26:41,800 OK, what are the triples that are in here, 457 00:26:41,801 --> 00:26:44,120 sorry, I apologize, the triples are down there, 458 00:26:44,121 --> 00:26:45,647 this is just a log of what went by. 459 00:26:46,327 --> 00:26:49,180 And then you can just sit there and fiddle with it in real time 460 00:26:49,181 --> 00:26:51,033 like you play with something and it changes. 461 00:26:51,033 --> 00:26:54,160 So it's a quicker version for doing all that stuff.
462 00:26:55,361 --> 00:26:56,481 This is a ShExC form, 463 00:26:56,482 --> 00:26:59,455 this is something [Joachim] had suggested 464 00:27:00,035 --> 00:27:04,631 could be useful for populating Wikidata documents 465 00:27:04,631 --> 00:27:07,338 based on a Shape Expression for that document. 466 00:27:08,095 --> 00:27:11,681 This is not tailored for Wikidata, 467 00:27:11,682 --> 00:27:14,081 but this is just to say that you can have a schema 468 00:27:14,082 --> 00:27:15,402 and you can have some annotations 469 00:27:15,403 --> 00:27:17,518 to say specifically how I want that schema rendered 470 00:27:17,519 --> 00:27:19,031 and then it just builds a form, 471 00:27:19,031 --> 00:27:21,191 and if you've got data, it can even populate the form. 472 00:27:24,517 --> 00:27:26,164 PyShEx [inaudible]. 473 00:27:28,025 --> 00:27:31,080 (Jose) I think this is the last one. 474 00:27:31,821 --> 00:27:34,080 Yes, so the last one is PyShEx. 475 00:27:34,675 --> 00:27:38,151 PyShEx is a Python implementation of Shape Expressions, 476 00:27:39,193 --> 00:27:42,680 you can also play with Jupyter Notebooks if you want those kinds of things. 477 00:27:42,680 --> 00:27:44,432 OK, so that's all for this. 478 00:27:44,433 --> 00:27:47,170 (applause) 479 00:27:52,916 --> 00:27:57,073 (Andra) So I'm going to talk about a specific project that I'm involved in 480 00:27:57,074 --> 00:27:58,074 called Gene Wiki, 481 00:27:58,075 --> 00:28:04,596 where we are also dealing with quality issues. 482 00:28:04,597 --> 00:28:06,684 But before going into the quality, 483 00:28:06,685 --> 00:28:09,229 maybe a quick introduction about what Gene Wiki is, 484 00:28:09,855 --> 00:28:15,175 and we just released a pre-print of a paper that we have recently written 485 00:28:15,175 --> 00:28:18,160 that explains the details of the project.
486 00:28:19,821 --> 00:28:23,839 I see people taking pictures, but basically, what Gene Wiki does, 487 00:28:23,846 --> 00:28:28,027 it's trying to get biomedical data, public data into Wikidata, 488 00:28:28,028 --> 00:28:32,200 and we follow a specific pattern to get that data into Wikidata. 489 00:28:33,130 --> 00:28:36,809 So when we have a new repository or a new data set 490 00:28:36,810 --> 00:28:39,600 that is eligible to be included in Wikidata, 491 00:28:39,601 --> 00:28:41,293 the first step is community engagement. 492 00:28:41,294 --> 00:28:43,784 It is not necessarily directly with a Wikidata community 493 00:28:43,785 --> 00:28:46,120 but with a local research community, 494 00:28:46,121 --> 00:28:50,286 and we meet in person or online or on any platform 495 00:28:50,286 --> 00:28:52,881 and try to come up with a data model 496 00:28:52,882 --> 00:28:56,197 that bridges their data with the Wikidata model. 497 00:28:56,197 --> 00:28:59,944 So here I have a picture of a workshop that happened here last year 498 00:28:59,945 --> 00:29:02,663 which was trying to look at a specific data set 499 00:29:02,663 --> 00:29:05,280 and, well, you see a lot of discussions, 500 00:29:05,281 --> 00:29:09,780 then aligning it with schema.org and other ontologies that are out there. 501 00:29:10,320 --> 00:29:15,508 And then, at the end of the first step, we have a whiteboard drawing of the schema 502 00:29:15,509 --> 00:29:17,336 that we want to implement in Wikidata. 503 00:29:17,337 --> 00:29:20,440 What you see over there, this is just plain, 504 00:29:20,441 --> 00:29:21,766 we have it in the back there 505 00:29:21,767 --> 00:29:25,240 so we can make some schemas within this panel today even.
506 00:29:26,560 --> 00:29:28,399 So once we have the schema in place, 507 00:29:28,400 --> 00:29:31,320 the next thing is to try to make that schema machine readable 508 00:29:32,358 --> 00:29:36,841 because you want to have actionable models to bridge the data that you're bringing in 509 00:29:36,842 --> 00:29:39,690 from any biomedical database into Wikidata. 510 00:29:40,393 --> 00:29:45,182 And here we are applying Shape Expressions. 511 00:29:46,471 --> 00:29:52,518 And we use that because Shape Expressions allow you to test 512 00:29:52,518 --> 00:29:57,040 whether the data set is actually-- no, to first see 513 00:29:57,041 --> 00:30:01,782 if already existing data in Wikidata follows the same data model 514 00:30:01,783 --> 00:30:04,718 that was achieved in the previous process. 515 00:30:04,719 --> 00:30:06,641 So then with the Shape Expression we can check: 516 00:30:06,642 --> 00:30:10,926 OK, the data that are on this topic in Wikidata, do they need some cleaning up 517 00:30:10,926 --> 00:30:15,013 or do we need to adapt our model to the Wikidata model or vice versa. 518 00:30:15,937 --> 00:30:19,867 Once that is in place, we start writing bots, 519 00:30:20,670 --> 00:30:23,801 and bots are seeding the information 520 00:30:23,802 --> 00:30:27,308 that is in the primary sources into Wikidata. 521 00:30:27,846 --> 00:30:29,303 And when the bots are ready, 522 00:30:29,304 --> 00:30:33,001 we write these bots with a platform called-- 523 00:30:33,002 --> 00:30:36,201 with a Python library called Wikidata Integrator 524 00:30:36,202 --> 00:30:38,167 that came out of our project. 525 00:30:38,698 --> 00:30:42,921 And once we have our bots, we use a platform called Jenkins 526 00:30:42,921 --> 00:30:44,540 for continuous integration. 527 00:30:44,540 --> 00:30:45,762 And with Jenkins, 528 00:30:45,762 --> 00:30:51,160 we continuously update the primary sources with Wikidata.
529 00:30:52,178 --> 00:30:55,889 And this is a diagram from the paper I previously mentioned. 530 00:30:55,890 --> 00:30:57,241 This is our current landscape. 531 00:30:57,242 --> 00:31:02,059 So every orange box out there is a primary resource on drugs, 532 00:31:02,060 --> 00:31:07,827 proteins, genes, diseases, chemical compounds with interaction, 533 00:31:07,827 --> 00:31:10,870 and this model is too small to read now 534 00:31:10,870 --> 00:31:17,472 but this is the database, the sources that we manage in Wikidata 535 00:31:17,473 --> 00:31:20,560 and bridge with the primary sources. 536 00:31:20,561 --> 00:31:22,355 Here is such a workflow. 537 00:31:22,870 --> 00:31:25,312 So one of our partners is the Disease Ontology. 538 00:31:25,312 --> 00:31:27,672 The Disease Ontology is a CC0 ontology, 539 00:31:28,179 --> 00:31:31,990 and the CC0 ontology has a curation cycle of its own, 540 00:31:32,756 --> 00:31:35,736 and they just continuously update the Disease Ontology 541 00:31:35,737 --> 00:31:39,687 to reflect the disease space or the interpretation of diseases. 542 00:31:40,336 --> 00:31:44,361 And there is the Wikidata curation cycle also on diseases 543 00:31:44,362 --> 00:31:49,844 where the Wikidata community constantly monitors what's going on on Wikidata. 544 00:31:50,406 --> 00:31:51,601 And then we have two roles, 545 00:31:51,602 --> 00:31:55,477 we call them colloquially the gatekeeper curator, 546 00:31:56,009 --> 00:31:59,561 and this was me and a colleague five years ago 547 00:31:59,562 --> 00:32:03,414 where we just sit on our computers and we monitor Wikipedia and Wikidata, 548 00:32:03,415 --> 00:32:08,601 and if there is an issue that was reported back to the primary community, 549 00:32:08,602 --> 00:32:11,765 the primary resources, they looked at the implementation and decided: 550 00:32:11,765 --> 00:32:14,240 OK, do we trust the Wikidata input?
551 00:32:14,850 --> 00:32:18,555 Yes--then it's considered, it goes into the cycle, 552 00:32:18,555 --> 00:32:22,686 and the next iteration is part of the Disease Ontology 553 00:32:22,687 --> 00:32:25,411 and fed back into Wikidata. 554 00:32:27,419 --> 00:32:31,480 We're doing the same for WikiPathways. 555 00:32:31,481 --> 00:32:36,601 WikiPathways is a MediaWiki-inspired pathway repository. 556 00:32:36,602 --> 00:32:40,901 Same story, there are different pathway resources on Wikidata already. 557 00:32:41,463 --> 00:32:44,713 There might be conflicts between those pathway resources 558 00:32:44,722 --> 00:32:46,701 and these conflicts are reported back 559 00:32:46,702 --> 00:32:49,521 by the gatekeeper curators to that community, 560 00:32:49,522 --> 00:32:53,715 and you maintain the individual curation cycles. 561 00:32:53,715 --> 00:32:57,068 But if you remember the previous cycle, 562 00:32:57,069 --> 00:33:03,041 here I mentioned only two cycles, two resources, 563 00:33:03,566 --> 00:33:06,300 we have to do that for every single resource that we have 564 00:33:06,300 --> 00:33:08,061 and we have to manage what's going on 565 00:33:08,062 --> 00:33:09,185 because when I say curation, 566 00:33:09,185 --> 00:33:11,377 I really mean going to the Wikipedia talk pages, 567 00:33:11,377 --> 00:33:14,544 going into the Wikidata talk pages and trying to do that. 568 00:33:14,545 --> 00:33:19,316 That doesn't scale for the two gatekeeper curators we had. 569 00:33:19,860 --> 00:33:22,777 So when I was at a conference in 2016 570 00:33:22,778 --> 00:33:26,933 where Eric gave a presentation on Shape Expressions, 571 00:33:26,934 --> 00:33:29,277 I jumped on the bandwagon and said OK, 572 00:33:29,278 --> 00:33:34,240 Shape Expressions can help us detect differences in Wikidata 573 00:33:34,240 --> 00:33:41,159 and so that allows the gatekeepers to have some more efficient reporting.
574 00:33:42,275 --> 00:33:46,019 So this year, I was delighted by the schema entity 575 00:33:46,020 --> 00:33:50,765 because now, we can store those entity schemas on Wikidata, 576 00:33:50,765 --> 00:33:53,183 on Wikidata itself, whereas before, it was on GitHub, 577 00:33:53,860 --> 00:33:56,815 and this aligns with the Wikidata interface, 578 00:33:56,816 --> 00:33:59,350 so you have things like document discussions 579 00:33:59,350 --> 00:34:00,762 but you also have revisions. 580 00:34:00,763 --> 00:34:05,261 So you can leverage the talk pages and the revisions in Wikidata 581 00:34:05,262 --> 00:34:12,255 to use that to discuss what is in Wikidata 582 00:34:12,255 --> 00:34:14,060 and what is in the primary resources. 583 00:34:14,966 --> 00:34:19,686 So this is what Eric just presented, this is already quite a benefit. 584 00:34:19,686 --> 00:34:24,335 So here, we made up a Shape Expression for the human gene, 585 00:34:24,336 --> 00:34:30,225 and then we ran it through simple ShEx, and as you can see, 586 00:34:30,225 --> 00:34:32,428 we just got already ni-- 587 00:34:32,429 --> 00:34:34,641 There is one issue that needs to be monitored 588 00:34:34,642 --> 00:34:37,316 which is that there is an item that doesn't fit that schema, 589 00:34:37,316 --> 00:34:43,139 and then you can sort of already create schema entities curation reports 590 00:34:43,140 --> 00:34:46,240 based on... and send that to the different curation reports. 591 00:34:48,058 --> 00:34:52,788 But ShEx.js is a built interface, 592 00:34:52,788 --> 00:34:55,860 and if I can show back here, I only do ten, 593 00:34:55,860 --> 00:35:00,362 but we have tens of thousands, and so that again doesn't scale. 594 00:35:00,362 --> 00:35:04,654 So the Wikidata Integrator now supports ShEx as well, 595 00:35:05,168 --> 00:35:07,431 and then we can just loop over items 596 00:35:07,431 --> 00:35:11,494 where we say yes-no, yes-no, true-false, true-false.
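The "yes-no, true-false" loop over many items can be pictured with a deliberately simplified sketch. It does not use ShEx, ShEx.js, or the Wikidata Integrator API; it assumes a toy "schema" that only states how many values each property may have, and items given as plain dictionaries. The names `conforms` and `curation_report`, the property choices, and the item identifiers are all hypothetical.

```python
def conforms(claims, schema):
    """Check one item against a toy cardinality schema.

    claims: {property: [values]}; schema: {property: (min, max or None)}.
    Returns (ok, reasons) so that failures can go into a curation report.
    """
    reasons = []
    for prop, (lowest, highest) in schema.items():
        count = len(claims.get(prop, []))
        if count < lowest:
            reasons.append(f"{prop}: expected at least {lowest}, found {count}")
        elif highest is not None and count > highest:
            reasons.append(f"{prop}: expected at most {highest}, found {count}")
    return (not reasons, reasons)

def curation_report(items, schema):
    """Loop over all items and keep only the ones that do not conform."""
    report = {}
    for item_id, claims in items.items():
        ok, reasons = conforms(claims, schema)
        if not ok:
            report[item_id] = reasons
    return report

# Illustrative schema and items; property and item IDs are placeholders.
gene_schema = {"P31": (1, None), "P351": (1, 1), "P703": (1, 1)}
items = {
    "item-A": {"P31": ["gene"], "P351": ["1017"], "P703": ["human"]},
    "item-B": {"P31": ["gene"], "P351": ["1", "2"]},
}
report = curation_report(items, gene_schema)
for item_id, reasons in report.items():
    print(item_id, "->", "; ".join(reasons))
```

Collecting the reasons, and not only a true or false flag, is what would let such a loop feed a report that can be sent back to a curation community.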
597 00:35:11,495 --> 00:35:12,495 So again, 598 00:35:13,065 --> 00:35:16,514 increasing a bit the efficiency of dealing with the reports. 599 00:35:17,256 --> 00:35:22,662 But now, recently, that builds on the Wikidata Query Service, 600 00:35:23,181 --> 00:35:24,998 and well, we recently have been throttled, 601 00:35:24,999 --> 00:35:26,560 so again, that doesn't scale. 602 00:35:26,561 --> 00:35:31,391 So it's still an ongoing process, how to deal with models on Wikidata. 603 00:35:32,202 --> 00:35:36,682 And so again, ShEx is not only intimidating 604 00:35:36,683 --> 00:35:40,356 but also the scale is just too big to deal with. 605 00:35:41,068 --> 00:35:46,081 So I started working, this is my first proof of concept or exercise 606 00:35:46,082 --> 00:35:47,680 where I used a tool called yEd, 607 00:35:48,184 --> 00:35:52,590 and I started to draw those Shape Expressions and because... 608 00:35:52,591 --> 00:35:58,098 and then regenerate this schema 609 00:35:58,099 --> 00:36:01,279 into this JSON format of the Shape Expressions, 610 00:36:01,280 --> 00:36:04,520 so that would open up already to the audience 611 00:36:04,521 --> 00:36:07,432 that is intimidated by the Shape Expressions language. 612 00:36:07,961 --> 00:36:12,308 But actually, there is a problem with those visual descriptions 613 00:36:12,309 --> 00:36:18,229 because this is also a schema that was actually drawn in yEd by someone. 614 00:36:18,230 --> 00:36:23,838 And here is another one which is beautiful. 615 00:36:23,838 --> 00:36:29,414 I would love to have this on my wall, but it is still not interoperable. 616 00:36:30,281 --> 00:36:32,131 So I want to end my talk with, 617 00:36:32,131 --> 00:36:35,732 and the first time, I've been stealing this slide, using this slide.
618 00:36:35,732 --> 00:36:37,594 It's an honor to have him in the audience 619 00:36:37,595 --> 00:36:39,423 and I really like this: 620 00:36:39,424 --> 00:36:42,362 "People think RDF is a pain because it's complicated. 621 00:36:42,362 --> 00:36:43,985 The truth is even worse, it's so simple, 622 00:36:45,581 --> 00:36:48,133 because you have to work with real-world data problems 623 00:36:48,134 --> 00:36:50,031 that are horribly complicated. 624 00:36:50,031 --> 00:36:51,451 While you can avoid RDF, 625 00:36:51,451 --> 00:36:55,760 it is harder to avoid complicated data and complicated computer problems." 626 00:36:55,761 --> 00:36:59,535 This is about RDF, but I think this so applies to modeling as well. 627 00:37:00,112 --> 00:37:02,769 So my point of discussion is should we really... 628 00:37:03,387 --> 00:37:05,882 How do we get modeling going? 629 00:37:05,882 --> 00:37:10,826 Should we discuss ShEx or visual models or... 630 00:37:11,426 --> 00:37:13,271 How do we continue? 631 00:37:13,474 --> 00:37:14,840 Thank you very much for your time. 632 00:37:15,102 --> 00:37:17,787 (applause) 633 00:37:20,001 --> 00:37:21,188 (Lydia) Thank you so much. 634 00:37:21,692 --> 00:37:24,001 Would you come to the front 635 00:37:24,002 --> 00:37:27,741 so that we can open the questions from the audience. 636 00:37:28,610 --> 00:37:30,203 Are there questions? 637 00:37:31,507 --> 00:37:32,507 Yes. 638 00:37:34,253 --> 00:37:36,890 And I think, for the camera, we need to... 639 00:37:38,835 --> 00:37:40,968 (Lydia laughing) Yeah. 640 00:37:43,094 --> 00:37:46,273 (man3) So a question for Cristina, I think. 641 00:37:47,366 --> 00:37:51,641 So you mentioned exactly the term "information gain" 642 00:37:51,642 --> 00:37:53,689 from linking with other systems. 643 00:37:53,690 --> 00:37:55,619 There is an information theoretic measure 644 00:37:55,620 --> 00:37:58,001 using statistic and probability called information gain. 
645 00:37:58,002 --> 00:37:59,541 Do you have the same... 646 00:37:59,542 --> 00:38:01,736 I mean did you mean exactly that measure, 647 00:38:01,736 --> 00:38:04,173 the information gain from the probability theory 648 00:38:04,174 --> 00:38:05,240 from information theory 649 00:38:05,241 --> 00:38:09,024 or just use this conceptual thing to measure information gain some way? 650 00:38:09,025 --> 00:38:13,016 No, so we actually defined and implemented measures 651 00:38:13,695 --> 00:38:20,161 that are using the Shannon entropy, so it's meant as that. 652 00:38:20,162 --> 00:38:22,696 I didn't want to go into details of the concrete formulas... 653 00:38:22,697 --> 00:38:24,977 (man3) No, no, of course, that's why I asked the question. 654 00:38:24,978 --> 00:38:26,698 - (Cristina) But yeah... - (man3) Thank you. 655 00:38:33,091 --> 00:38:35,047 (man4) Make more of a comment than a question. 656 00:38:35,048 --> 00:38:36,241 (Lydia) Go for it. 657 00:38:36,242 --> 00:38:39,840 (man4) So there's been a lot of focus at the item level 658 00:38:39,840 --> 00:38:42,547 about quality and completeness, 659 00:38:42,547 --> 00:38:47,374 one of the things that concerns me is that we're not applying the same to hierarchies 660 00:38:47,374 --> 00:38:51,480 and I think we have an issue is that our hierarchy often isn't good. 661 00:38:51,481 --> 00:38:53,463 We're seeing this is going to be a real problem 662 00:38:53,464 --> 00:38:55,774 with Commons searching and other things. 663 00:38:56,771 --> 00:39:00,601 One of the abilities that we can do is to import external-- 664 00:39:00,602 --> 00:39:04,842 The way that external thesauruses structure their hierarchies, 665 00:39:04,842 --> 00:39:10,291 using the P4900 broader concept qualifier. 666 00:39:11,037 --> 00:39:16,167 But what I think would be really helpful would be much better tools for doing that 667 00:39:16,168 --> 00:39:21,212 so that you can import an external... 
thesaurus's hierarchy 668 00:39:21,212 --> 00:39:24,111 and map that onto our Wikidata items. 669 00:39:24,111 --> 00:39:28,199 Once it's in place with those P4900 qualifiers, 670 00:39:28,200 --> 00:39:31,494 you can actually do some quite good querying through SPARQL 671 00:39:32,490 --> 00:39:37,534 to see where our hierarchy diverges from that external hierarchy. 672 00:39:37,534 --> 00:39:41,346 For instance, [Paula Morma], user PKM, you may know, 673 00:39:41,346 --> 00:39:43,533 does a lot of work on fashion. 674 00:39:43,533 --> 00:39:50,524 So we use that to pull in the Europeana Fashion Thesaurus's hierarchy 675 00:39:50,524 --> 00:39:53,812 and the Getty AAT fashion thesaurus hierarchy, 676 00:39:53,812 --> 00:39:57,957 and then see where the gaps were in our higher level items, 677 00:39:57,957 --> 00:40:00,511 which is a real problem for us because often, 678 00:40:00,511 --> 00:40:04,355 these are things that only exist as disambiguation pages on Wikipedia, 679 00:40:04,356 --> 00:40:09,270 so we have a lot of higher level items in our hierarchies missing 680 00:40:09,271 --> 00:40:14,480 and this is something that we must address in terms of quality and completeness, 681 00:40:14,480 --> 00:40:15,971 but what would really help 682 00:40:16,643 --> 00:40:20,871 would be better tools than the jungle of Perl scripts that I wrote... 683 00:40:20,872 --> 00:40:26,010 If somebody could put that into a PAWS notebook in Python 684 00:40:26,561 --> 00:40:31,972 to be able to take an external thesaurus, take its hierarchy, 685 00:40:31,973 --> 00:40:34,595 which may well be available as linked data or may not, 686 00:40:35,379 --> 00:40:40,580 to then put those into QuickStatements to put in P4900 values.
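The comparison asked for here (take an external thesaurus's hierarchy, map it onto local items, and see where the two diverge) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the external hierarchy is a plain "concept to broader concept" dictionary, the mapping to local items already exists, and local subclass links are given as a dictionary of parents. The function names and all identifiers are hypothetical, and nothing here reads or writes P4900 qualifiers.

```python
def reaches(start, target, parents):
    """True if target can be reached from start by following parent links."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(parents.get(node, ()))
    return False

def hierarchy_gaps(external_broader, mapping, local_parents):
    """Compare an external hierarchy with the local one.

    Returns (missing, divergent): external concepts with no local item,
    and external (narrower, broader) pairs whose local counterparts are
    not connected by any chain of subclass links.
    """
    concepts = set(external_broader) | set(external_broader.values())
    missing = sorted(c for c in concepts if c not in mapping)
    divergent = []
    for narrower, broader in sorted(external_broader.items()):
        if narrower in mapping and broader in mapping:
            if not reaches(mapping[narrower], mapping[broader], local_parents):
                divergent.append((narrower, broader))
    return missing, divergent

# Toy fashion thesaurus; the item IDs are placeholders, not real items.
external_broader = {"gown": "dress", "dress": "garment",
                    "kilt": "skirt", "skirt": "garment"}
mapping = {"gown": "Q1", "dress": "Q2", "garment": "Q3", "kilt": "Q4"}
local_parents = {"Q1": ["Q3"], "Q2": ["Q3"]}

missing, divergent = hierarchy_gaps(external_broader, mapping, local_parents)
print("no local item yet:", missing)
print("broader link not reflected locally:", divergent)
```

In this toy data "skirt" has no local item at all, and "gown" sits directly under "garment" locally while the external thesaurus places it under "dress".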
687 00:40:41,165 --> 00:40:42,165 And then later, 688 00:40:42,166 --> 00:40:44,527 when our representation gets more complete, 689 00:40:44,528 --> 00:40:49,691 to update those P4900s because as our representation gets updated, 690 00:40:49,691 --> 00:40:51,590 becomes more dense, 691 00:40:51,590 --> 00:40:55,377 the values of those qualifiers need to change 692 00:40:56,230 --> 00:40:59,526 to represent that we've got more of their hierarchy in our system. 693 00:40:59,526 --> 00:41:03,728 If somebody could do that, I think that would be very helpful, 694 00:41:03,728 --> 00:41:07,121 and we do need to also look at other approaches 695 00:41:07,122 --> 00:41:10,762 to improve quality and completeness at the hierarchy level 696 00:41:10,763 --> 00:41:12,378 not just at the item level. 697 00:41:13,308 --> 00:41:14,840 (Andra) Can I add to that? 698 00:41:16,362 --> 00:41:19,901 Yes, and we actually do that, 699 00:41:19,911 --> 00:41:23,551 and I can recommend looking at the Shape Expression that Finn made 700 00:41:23,552 --> 00:41:27,330 with the lexical data where he creates Shape Expressions 701 00:41:27,330 --> 00:41:29,640 and then builds on other Shape Expressions 702 00:41:29,641 --> 00:41:32,528 so you have this concept of linked Shape Expressions in Wikidata, 703 00:41:32,529 --> 00:41:35,005 and specifically, the use case, if I understand correctly, 704 00:41:35,006 --> 00:41:37,183 is exactly what we are doing in Gene Wiki. 705 00:41:37,184 --> 00:41:40,841 So you have the Disease Ontology which is put into Wikidata 706 00:41:40,842 --> 00:41:44,681 and then disease data comes in and we apply the Shape Expressions 707 00:41:44,682 --> 00:41:47,247 to see if that fits with this thesaurus.
708 00:41:47,248 --> 00:41:50,919 And there are other thesauruses or other ontologies for controlled vocabularies 709 00:41:50,920 --> 00:41:52,559 that still need to go into Wikidata, 710 00:41:52,559 --> 00:41:55,401 and that's exactly why Shape Expressions is so interesting 711 00:41:55,402 --> 00:41:57,963 because you can have a Shape Expression for the Disease Ontology, 712 00:41:57,964 --> 00:41:59,644 you can have a Shape Expression for MeSH, 713 00:41:59,645 --> 00:42:01,761 you can say: OK, now I want to check the quality. 714 00:42:01,762 --> 00:42:04,059 Because you also have in Wikidata the context 715 00:42:04,060 --> 00:42:09,567 of when you have a controlled vocabulary, you say the quality is according to this, 716 00:42:09,568 --> 00:42:11,636 but you might have a disagreeing community. 717 00:42:11,636 --> 00:42:16,081 So the tooling is indeed in place but the task now is to create those models 718 00:42:16,082 --> 00:42:18,144 and apply them to the different use cases. 719 00:42:18,811 --> 00:42:20,921 (man4) Shape Expressions are very useful 720 00:42:20,922 --> 00:42:25,928 once you have the external ontology mapped into Wikidata, 721 00:42:25,929 --> 00:42:29,474 but my problem is getting to that stage, 722 00:42:29,475 --> 00:42:34,881 it's working out how much of the external ontology isn't yet in Wikidata 723 00:42:34,882 --> 00:42:36,256 and where the gaps are, 724 00:42:36,257 --> 00:42:40,660 and that's where I think that having much more robust tools 725 00:42:40,660 --> 00:42:44,286 to see what's missing from external ontologies 726 00:42:44,286 --> 00:42:45,537 would be very helpful. 727 00:42:47,678 --> 00:42:49,062 The biggest problem there 728 00:42:49,062 --> 00:42:51,201 is not so much tooling but more licensing.
729 00:42:51,803 --> 00:42:55,249 So getting the ontologies into Wikidata is actually a piece of cake 730 00:42:55,250 --> 00:42:59,295 but most of the ontologies have, how can I say that politely, 731 00:42:59,965 --> 00:43:03,256 restrictive licensing, so they are not compatible with Wikidata. 732 00:43:04,068 --> 00:43:06,678 (man4) There's a huge number of public sector thesauruses 733 00:43:06,678 --> 00:43:08,209 in cultural fields. 734 00:43:08,210 --> 00:43:10,851 - (Andra) Then we need to talk. - (man4) Not a problem. 735 00:43:10,852 --> 00:43:12,384 (Andra) Then we need to talk. 736 00:43:13,624 --> 00:43:19,192 (man5) Just... the comment I want to make is actually an answer to James, 737 00:43:19,192 --> 00:43:22,401 so the thing is that hierarchies make graphs, 738 00:43:22,374 --> 00:43:24,041 and when you want to... 739 00:43:24,579 --> 00:43:28,888 I want to basically talk about... a common problem in hierarchies 740 00:43:28,889 --> 00:43:30,820 is circular hierarchies, 741 00:43:30,821 --> 00:43:33,796 so they come back to each other when there's a problem, 742 00:43:33,796 --> 00:43:35,920 which you should not have in hierarchies. 743 00:43:37,022 --> 00:43:41,295 This, funnily enough, happens in categories in Wikipedia a lot, 744 00:43:41,295 --> 00:43:42,990 we have a lot of cycles in categories, 745 00:43:43,898 --> 00:43:46,612 but the good news is that this is... 746 00:43:47,713 --> 00:43:51,582 Technically, it's an NP-complete problem, so you cannot find these 747 00:43:51,583 --> 00:43:53,414 easily if you build a graph of that, 748 00:43:54,473 --> 00:43:57,046 but there are lots of ways that have been developed 749 00:43:57,047 --> 00:44:00,624 to find problems in these hierarchy graphs. 750 00:44:00,625 --> 00:44:04,860 Like there is a paper called *Finding Cycles*...
751 00:44:04,861 --> 00:44:07,955 *Breaking Cycles in Noisy Hierarchies,* 752 00:44:07,956 --> 00:44:12,671 and it's been used to help categorization of English Wikipedia. 753 00:44:12,672 --> 00:44:17,141 You can just take this and apply it to these hierarchies in Wikidata, 754 00:44:17,142 --> 00:44:19,540 and then you can find things that are problematic 755 00:44:19,541 --> 00:44:22,481 and just remove the ones that are causing issues 756 00:44:22,482 --> 00:44:24,593 and find the issues, actually. 757 00:44:24,594 --> 00:44:26,960 So this is just an idea, just so you... 758 00:44:28,780 --> 00:44:29,930 (man4) That's all very well 759 00:44:29,931 --> 00:44:34,402 but I think you're underestimating the number of bad subclass relations 760 00:44:34,402 --> 00:44:35,402 that we have. 761 00:44:35,403 --> 00:44:39,680 It's like having a city in completely the wrong country, 762 00:44:40,250 --> 00:44:44,874 and there are tools for geography to identify that, 763 00:44:44,875 --> 00:44:49,201 and we need to have much better tools in hierarchies 764 00:44:49,202 --> 00:44:53,477 to identify where the equivalent of the item for the country 765 00:44:53,478 --> 00:44:57,673 is missing entirely, or where it's actually been subclassed 766 00:44:57,674 --> 00:45:01,804 to something that means something completely different. 767 00:45:02,804 --> 00:45:07,165 (Lydia) Yeah, I think you're getting to something 768 00:45:07,166 --> 00:45:12,024 that my team and I keep hearing from people who reuse our data 769 00:45:12,025 --> 00:45:13,991 quite a bit as well, right. 770 00:45:15,002 --> 00:45:16,638 An individual data point might be great 771 00:45:16,639 --> 00:45:20,163 but if you have to look at the ontology and so on, 772 00:45:20,164 --> 00:45:21,857 then it gets very...
773 00:45:22,388 --> 00:45:26,437 And I think one of the big problems why this is happening 774 00:45:26,437 --> 00:45:30,736 is that a lot of editing on Wikidata 775 00:45:30,736 --> 00:45:34,544 happens on the basis of an individual item, right, 776 00:45:34,545 --> 00:45:36,201 you make an edit on that item, 777 00:45:37,653 --> 00:45:42,075 without realizing that this might have very global consequences 778 00:45:42,075 --> 00:45:44,245 on the rest of the graph, for example. 779 00:45:44,245 --> 00:45:50,040 And if people have ideas around how to make this more visible, 780 00:45:50,041 --> 00:45:53,185 the consequences of an individual local edit, 781 00:45:54,005 --> 00:45:56,537 I think that would be worth exploring, 782 00:45:57,550 --> 00:46:01,583 to show people better what the consequence of their edit 783 00:46:01,584 --> 00:46:03,434 that they might do in very good faith, 784 00:46:04,481 --> 00:46:05,481 what that is. 785 00:46:06,939 --> 00:46:12,237 Whoa! OK, let's start with, yeah, you, then you, then you, then you. 786 00:46:12,237 --> 00:46:13,921 (man5) Well, after the discussion, 787 00:46:13,922 --> 00:46:18,262 just to express my agreement with what James was saying. 788 00:46:18,263 --> 00:46:22,467 So essentially, it seems the most dangerous thing is the hierarchy, 789 00:46:22,468 --> 00:46:23,910 not the hierarchy, but generally 790 00:46:23,911 --> 00:46:28,022 the semantics of the subclass relations seen in Wikidata, right. 791 00:46:28,022 --> 00:46:32,561 So I've been studying languages recently, just for the purposes of this conference, 792 00:46:32,562 --> 00:46:35,257 and for example, you find plenty of cases 793 00:46:35,257 --> 00:46:39,463 where a language is a part of and subclass of the same thing, OK. 794 00:46:39,463 --> 00:46:43,577 So you know, you can say we have a flexible ontology. 795 00:46:43,577 --> 00:46:46,256 Wikidata gives you freedom to express that, sometimes. 
796 00:46:46,256 --> 00:46:47,257 Because, for example, 797 00:46:47,258 --> 00:46:50,721 that ontology of languages is also politically complicated, right? 798 00:46:50,722 --> 00:46:55,038 It is even good to be in a position to express a level of uncertainty. 799 00:46:55,038 --> 00:46:57,983 But imagine anyone who wants to do machine reading from that. 800 00:46:57,984 --> 00:46:59,468 So that's really problematic. 801 00:46:59,468 --> 00:47:00,468 And then again, 802 00:47:00,469 --> 00:47:03,686 I don't think that ontology was ever imported from somewhere, 803 00:47:03,687 --> 00:47:05,490 that's something which is originally ours. 804 00:47:05,491 --> 00:47:08,321 It was harvested from Wikipedia in the very beginning, I would say. 805 00:47:08,322 --> 00:47:11,324 So I wonder... this Shape Expressions thing is great, 806 00:47:11,325 --> 00:47:15,575 and also validating and fixing, if you like, the Wikidata ontology 807 00:47:15,576 --> 00:47:18,191 by external resources, beautiful idea. 808 00:47:19,026 --> 00:47:20,026 In the end, 809 00:47:20,027 --> 00:47:25,440 will we end up reflecting the external ontologies in Wikidata? 810 00:47:25,441 --> 00:47:28,651 And also, what do we do with the core part of our ontology 811 00:47:28,652 --> 00:47:30,642 which was never harvested from external resources, 812 00:47:30,643 --> 00:47:31,978 how do we go and fix that? 813 00:47:31,979 --> 00:47:35,276 And I really think that that will be a problem on its own. 814 00:47:35,277 --> 00:47:39,010 We will have to focus on that independently of the idea 815 00:47:39,010 --> 00:47:41,046 of validating ontology with something external. 816 00:47:49,353 --> 00:47:53,379 (man6) OK, and constraints and shapes are very impressive 817 00:47:53,380 --> 00:47:54,495 what we can do with it, 818 00:47:55,205 --> 00:47:58,481 but the main point is not being really made clear-- 819 00:47:58,482 --> 00:48:03,229 it's because now we can make more explicit what we expect from the data.
820 00:48:03,229 --> 00:48:06,893 Before, everyone had to write their own tools and scripts 821 00:48:06,894 --> 00:48:10,601 and so it's more visible and we can discuss it. 822 00:48:10,602 --> 00:48:13,641 But because it's not about what's wrong or right, 823 00:48:13,642 --> 00:48:15,870 it's about an expectation, 824 00:48:15,870 --> 00:48:18,105 and you will have different expectations and discussions 825 00:48:18,106 --> 00:48:20,737 about how we want to model things in Wikidata, 826 00:48:21,246 --> 00:48:23,095 and this... 827 00:48:23,096 --> 00:48:26,280 The current state is just one step in the direction 828 00:48:26,281 --> 00:48:28,041 because now you need 829 00:48:28,042 --> 00:48:31,041 a lot of technical expertise to get into this, 830 00:48:31,042 --> 00:48:35,721 and we need better ways to visualize these constraints, 831 00:48:35,722 --> 00:48:39,995 to transform them maybe into natural language so people can better understand, 832 00:48:40,939 --> 00:48:43,768 but it's less about what's wrong or right. 833 00:48:44,925 --> 00:48:45,925 (Lydia) Yeah. 834 00:48:50,986 --> 00:48:53,893 (man7) So for quality issues, I just want to echo it like... 835 00:48:53,894 --> 00:48:57,010 I've definitely found a lot of the issues I've encountered have been 836 00:48:58,838 --> 00:49:02,330 differences in opinion between *instance of* versus *subclass*. 837 00:49:02,331 --> 00:49:05,963 I would say errors in those situations 838 00:49:05,963 --> 00:49:11,521 and trying to find those has been a very time-consuming process. 839 00:49:11,522 --> 00:49:14,840 What I've found is like: "Oh, if I find very high-impression items 840 00:49:14,840 --> 00:49:16,051 that are something... 841 00:49:16,052 --> 00:49:21,628 and then use all the subclass instances to find all derived statements of this," 842 00:49:21,628 --> 00:49:26,215 this is a very useful way of looking for these errors.
843 00:49:26,215 --> 00:49:28,067 But I was curious if Shape Expressions, 844 00:49:29,841 --> 00:49:31,582 if there is... 845 00:49:31,583 --> 00:49:36,934 If this can be used as a tool to help resolve those issues but, yeah... 846 00:49:40,514 --> 00:49:42,555 (man8) If it has a structural footprint... 847 00:49:45,910 --> 00:49:49,310 If it has a structural footprint that you can...that's sort of falsifiable, 848 00:49:49,310 --> 00:49:51,191 you can look at that and say well, that's wrong, 849 00:49:51,192 --> 00:49:52,670 then yeah, you can do that. 850 00:49:52,671 --> 00:49:56,921 But if it's just sort of trying to map it to real-world objects, 851 00:49:56,922 --> 00:49:59,082 then you're just going to need lots and lots of brains. 852 00:50:05,768 --> 00:50:08,631 (man9) Hi, Pablo Mendes from Apple Siri Knowledge. 853 00:50:09,154 --> 00:50:12,770 We're here to find out how to help the project and the community 854 00:50:12,770 --> 00:50:15,645 but Cristina made the mistake of asking what we want. 855 00:50:16,471 --> 00:50:20,052 (laughing) So I think one thing I'd like to see 856 00:50:20,958 --> 00:50:23,521 is a lot around verifiability 857 00:50:23,522 --> 00:50:26,372 which is one of the core tenets of the project in the community, 858 00:50:27,062 --> 00:50:28,590 and trustworthiness. 859 00:50:28,590 --> 00:50:32,412 Not every statement is the same, some of them are heavily disputed, 860 00:50:32,413 --> 00:50:33,653 some of them are easy to guess, 861 00:50:33,654 --> 00:50:35,541 like somebody's date of birth can be verified, 862 00:50:36,071 --> 00:50:39,082 as you saw today in the Keynote, gender issues are a lot more complicated. 863 00:50:40,205 --> 00:50:42,130 Can you discuss a little bit what you know 864 00:50:42,131 --> 00:50:47,271 in this area of data quality around trustworthiness and verifiability? 865 00:50:55,442 --> 00:50:58,138 If there isn't a lot, I'd love to see a lot more. 
(laughs) 866 00:51:00,646 --> 00:51:01,646 (Lydia) Yeah. 867 00:51:03,314 --> 00:51:06,548 Apparently, we don't have a lot to say on that. (laughs) 868 00:51:08,024 --> 00:51:12,299 (Andra) I think we can do a lot, but I had a discussion with you yesterday. 869 00:51:12,300 --> 00:51:15,774 My favorite example I learned yesterday that's already deprecated 870 00:51:15,774 --> 00:51:20,281 is if you go to Q2, which is Earth, 871 00:51:20,282 --> 00:51:23,343 there is a statement that claims that the Earth is flat. 872 00:51:24,183 --> 00:51:26,055 And I love that example 873 00:51:26,056 --> 00:51:28,391 because there is a community out there that claims that 874 00:51:28,392 --> 00:51:30,417 and they have verifiable resources. 875 00:51:30,418 --> 00:51:32,254 So I think it's a genuine case, 876 00:51:32,255 --> 00:51:34,641 it shouldn't be deprecated, it should be in Wikidata. 877 00:51:34,642 --> 00:51:40,385 And I think Shape Expressions can be really instrumental there, 878 00:51:40,386 --> 00:51:41,832 because what you can say, 879 00:51:41,833 --> 00:51:44,856 OK, I'm really interested in this use case, 880 00:51:44,857 --> 00:51:47,129 or this is a use case where you disagree, 881 00:51:47,130 --> 00:51:51,059 but there can also be a use case where you say OK, I'm interested. 882 00:51:51,059 --> 00:51:53,449 So there is this example you say, I have glucose. 883 00:51:53,449 --> 00:51:55,841 And glucose when you're a biologist, 884 00:51:55,842 --> 00:52:00,176 you don't care for the chemical constraints of the glucose molecule, 885 00:52:00,177 --> 00:52:03,201 you just... everything glucose is the same. 886 00:52:03,202 --> 00:52:05,973 But if you're a chemist, you cringe when you hear that, 887 00:52:05,973 --> 00:52:08,191 you have 200 something... 888 00:52:08,191 --> 00:52:10,443 So then you can have multiple Shape Expressions, 889 00:52:10,443 --> 00:52:12,721 OK, I'm coming in with...
I'm at a chemist view, 890 00:52:12,722 --> 00:52:13,887 I'm applying that. 891 00:52:13,887 --> 00:52:16,691 And then you say I'm from a biological use case, 892 00:52:16,691 --> 00:52:18,524 I'm applying that Shape Expression. 893 00:52:18,524 --> 00:52:20,358 And then when you want to collaborate, 894 00:52:20,358 --> 00:52:22,784 yes, well you should talk to Eric about ShEx maps. 895 00:52:23,910 --> 00:52:28,873 And so... but this journey is just starting. 896 00:52:28,873 --> 00:52:32,238 But I personally believe that it's quite instrumental in that area. 897 00:52:34,292 --> 00:52:35,535 (Lydia) OK. Over there. 898 00:52:37,949 --> 00:52:39,168 (laughs) 899 00:52:40,597 --> 00:52:46,035 (woman2) I had several ideas from some points in the discussions, 900 00:52:46,035 --> 00:52:50,902 so I will try not to lose... I had three ideas so... 901 00:52:52,394 --> 00:52:55,201 Based on what James said a while ago, 902 00:52:55,202 --> 00:52:59,001 we have a very, very big problem on Wikidata since the beginning 903 00:52:59,002 --> 00:53:01,574 for the upper ontology. 904 00:53:02,363 --> 00:53:05,339 We talked about that two years ago at WikidataCon, 905 00:53:05,340 --> 00:53:07,432 and we talked about that at Wikimania. 906 00:53:07,432 --> 00:53:09,818 Well, whenever we have a Wikidata meeting 907 00:53:09,818 --> 00:53:11,656 we are talking about that, 908 00:53:11,656 --> 00:53:15,782 because it's a very big problem at a very, very high level 909 00:53:15,783 --> 00:53:23,118 what entity is, what work is, what genre is, art, 910 00:53:23,118 --> 00:53:25,461 are really the biggest concepts.
911 00:53:26,195 --> 00:53:33,117 And that's actually a very weak point on global ontology 912 00:53:33,118 --> 00:53:37,453 because people try to clean up regularly 913 00:53:38,017 --> 00:53:41,047 and break everything down the line, 914 00:53:42,516 --> 00:53:48,649 because yes, I think some of you may remember the guy who in good faith 915 00:53:48,649 --> 00:53:51,785 broke absolutely all cities in the world. 916 00:53:51,785 --> 00:53:57,537 They were not geographical items anymore, so constraint violations everywhere. 917 00:53:58,720 --> 00:54:00,278 And it was in good faith 918 00:54:00,278 --> 00:54:03,623 because he was really correcting a mistake in an item, 919 00:54:04,170 --> 00:54:05,732 but everything broke down. 920 00:54:06,349 --> 00:54:09,373 And I'm not sure how we can solve that 921 00:54:10,216 --> 00:54:15,709 because there is actually no external institution we could just copy 922 00:54:15,710 --> 00:54:18,490 because everyone is working on... 923 00:54:19,154 --> 00:54:22,041 Well, if I am a performing arts database, 924 00:54:22,042 --> 00:54:24,601 I will just go to the performing arts level, 925 00:54:24,601 --> 00:54:29,361 I won't go to the philosophical concept of what an entity is, 926 00:54:29,362 --> 00:54:31,201 and that's actually... 927 00:54:31,202 --> 00:54:34,561 I don't know any database which is working at this level, 928 00:54:34,562 --> 00:54:36,827 but that's the weakest point of Wikidata. 929 00:54:37,936 --> 00:54:40,812 And probably, when we are talking about data quality, 930 00:54:40,812 --> 00:54:44,034 that's actually a big part of it, so... 931 00:54:44,034 --> 00:54:48,569 And I think it's the same we have stated in...
932 00:54:48,569 --> 00:54:50,452 Oh, I am sorry, I am changing the subject, 933 00:54:51,401 --> 00:54:55,774 but we have stated in different sessions about quality, 934 00:54:55,774 --> 00:54:59,398 which is that actually some of us are doing a good modeling job, 935 00:54:59,399 --> 00:55:01,240 are doing ShEx, are doing things like that. 936 00:55:01,967 --> 00:55:07,655 People don't see it on Wikidata, they don't see the ShEx, 937 00:55:07,655 --> 00:55:10,392 they don't see the WikiProject on the discussion page, 938 00:55:10,393 --> 00:55:11,393 and sometimes, 939 00:55:11,394 --> 00:55:14,958 they don't even see the talk pages of properties, 940 00:55:14,958 --> 00:55:19,628 which are explicitly stating, a), this property is used for that. 941 00:55:19,628 --> 00:55:23,887 Like last week, I added constraints to a property. 942 00:55:23,888 --> 00:55:26,324 The constraint was explicitly written 943 00:55:26,325 --> 00:55:28,690 in the discussion of the creation of the property. 944 00:55:28,690 --> 00:55:34,548 I just created the technical part of adding the constraint, and someone: 945 00:55:34,548 --> 00:55:37,182 "What! You broke down all my edits!" 946 00:55:37,183 --> 00:55:41,542 And he was using the property wrongly for the last two years. 947 00:55:41,542 --> 00:55:46,868 And the property was actually very clear, but there were no warnings or anything, 948 00:55:46,869 --> 00:55:49,922 and so, it's the same as the Pink Pony we said at Wikimania 949 00:55:49,922 --> 00:55:54,719 to make WikiProjects more visible or to make ShEx more visible, but... 950 00:55:54,719 --> 00:55:56,917 And that's what Cristina said. 951 00:55:56,917 --> 00:56:02,368 We have a visibility problem of what the existing solutions are. 952 00:56:02,368 --> 00:56:04,242 And at this session, 953 00:56:04,242 --> 00:56:06,862 we are all talking about how to create more ShEx, 954 00:56:06,863 --> 00:56:10,727 or to facilitate the jobs of the people who are doing the cleanup.
955 00:56:11,605 --> 00:56:15,835 But we have been cleaning up since the first day of Wikidata, 956 00:56:15,836 --> 00:56:20,921 and globally, we are losing, and we are losing because, well, 957 00:56:20,922 --> 00:56:22,960 if I know names are complicated 958 00:56:22,961 --> 00:56:26,162 but I am the only one doing the cleaning up job, 959 00:56:26,662 --> 00:56:29,671 the guy who added Latin script names 960 00:56:29,672 --> 00:56:31,584 to all Chinese researchers, 961 00:56:32,088 --> 00:56:35,616 I will take months to clean that and I can't do it alone, 962 00:56:35,616 --> 00:56:38,777 and he did one massive batch. 963 00:56:38,777 --> 00:56:40,241 So we really need... 964 00:56:40,242 --> 00:56:44,158 we have a visibility problem more than a tool problem, I think, 965 00:56:44,158 --> 00:56:45,733 because we have many tools. 966 00:56:45,733 --> 00:56:50,255 (Lydia) Right, so unfortunately, I've been shown a sign, (laughs), 967 00:56:50,256 --> 00:56:52,121 so we need to wrap this up. 968 00:56:52,122 --> 00:56:53,563 Thank you so much for your comments, 969 00:56:53,563 --> 00:56:56,611 I hope you will continue discussing during the rest of the day, 970 00:56:56,611 --> 00:56:57,840 and thanks for your input. 971 00:56:58,359 --> 00:56:59,944 (applause)