1
00:00:05,945 --> 00:00:09,476
Hello everyone, welcome to the Data Quality panel.

2
00:00:10,288 --> 00:00:13,671
Data quality matters because
more and more people out there

3
00:00:13,672 --> 00:00:19,289
rely on our data being in good shape,
so we're going to talk about data quality,

4
00:00:20,029 --> 00:00:26,000
and there will be four speakers
who will give short introductions

5
00:00:26,000 --> 00:00:29,539
on topics related to data quality
and then we will have a Q and A.

6
00:00:30,130 --> 00:00:32,234
And the first one is Lucas.

7
00:00:34,385 --> 00:00:35,385
Thank you.

8
00:00:35,901 --> 00:00:39,899
Hi, I'm Lucas, and I'm going
to start with an overview

9
00:00:39,899 --> 00:00:43,806
of data quality tools
that we already have on Wikidata

10
00:00:43,807 --> 00:00:46,109
and also some things
that are coming up soon.

11
00:00:46,932 --> 00:00:50,623
And I've grouped them
into some general themes

12
00:00:50,623 --> 00:00:53,761
of making errors more visible,
making problems actionable,

13
00:00:53,762 --> 00:00:56,322
getting more eyes on the data
so that people notice the problems,

14
00:00:56,945 --> 00:01:02,616
fix some common sources of errors,
maintain the quality of the existing data

15
00:01:02,616 --> 00:01:03,966
and also human curation.

16
00:01:05,063 --> 00:01:09,874
And the ones that are currently available
start with property constraints.

17
00:01:10,388 --> 00:01:12,421
So you've probably seen this
if you're on Wikidata.

18
00:01:12,422 --> 00:01:14,029
You can sometimes get these icons

19
00:01:14,530 --> 00:01:17,241
which check
the internal consistency of the data.

20
00:01:17,242 --> 00:01:20,800
For example,
if one event follows the other,

21
00:01:20,801 --> 00:01:23,760
then the other event should
also be followed by this one,

22
00:01:23,761 --> 00:01:27,161
which on the WikidataCon item
was apparently missing.

23
00:01:27,162 --> 00:01:29,360
I'm not sure,
this feature is a few days old.

24
00:01:30,040 --> 00:01:34,681
And there's also,
if this is too limited or simple for you,

25
00:01:34,682 --> 00:01:38,080
you can write any checks you want
using the Query Service

26
00:01:38,081 --> 00:01:39,842
which is useful for
lots of things of course,

27
00:01:39,843 --> 00:01:44,543
but you can also use it
for finding errors.

28
00:01:44,544 --> 00:01:46,974
Like if you've noticed
one occurrence of a mistake,

29
00:01:46,975 --> 00:01:49,709
then you can check
if there are other places

30
00:01:49,710 --> 00:01:51,958
where people have made
a very similar error

31
00:01:51,958 --> 00:01:53,438
and find that with the Query Service.

32
00:01:53,439 --> 00:01:54,559
You can also combine the two

33
00:01:54,560 --> 00:01:57,874
and search for constraint violations
in the Query Service,

34
00:01:57,875 --> 00:02:01,240
for example,
only the violations in some area

35
00:02:01,241 --> 00:02:03,762
or WikiProject that's relevant to you,

36
00:02:03,762 --> 00:02:06,828
although the results are currently
not complete, sadly.

37
00:02:08,422 --> 00:02:09,877
There is revision scoring.

38
00:02:10,690 --> 00:02:12,666
That's... I think this is
from the recent changes;

39
00:02:12,667 --> 00:02:16,217
you can also get it on your watchlist:
an automatic assessment

40
00:02:16,217 --> 00:02:20,249
of is this edit likely to be
in good faith or in bad faith

41
00:02:20,250 --> 00:02:22,312
and is it likely to be
damaging or not damaging,

42
00:02:22,313 --> 00:02:24,205
I think those are the two dimensions.

43
00:02:24,206 --> 00:02:25,686
So you can, if you want,

44
00:02:25,687 --> 00:02:29,898
focus on just looking through
the damaging but good faith edits.

45
00:02:29,899 --> 00:02:32,523
If you're feeling particularly
friendly and welcoming

46
00:02:32,524 --> 00:02:37,121
you can tell these editors,
"Thank you for your contribution,

47
00:02:37,122 --> 00:02:40,560
here's how you should have done it
but thank you, still."

48
00:02:40,561 --> 00:02:42,186
And if you're not feeling that way,

49
00:02:42,187 --> 00:02:44,452
you can go through
the bad faith, damaging edits,

50
00:02:44,453 --> 00:02:45,573
and revert the vandals.

51
00:02:47,544 --> 00:02:49,761
There's also, similar to that,
entity scoring.

52
00:02:49,762 --> 00:02:52,590
So instead of scoring an edit,
the change that it made,

53
00:02:52,591 --> 00:02:53,904
you score the whole revision,

54
00:02:53,904 --> 00:02:56,483
and I think that is
the same quality measure

55
00:02:56,483 --> 00:02:59,863
that Lydia mentioned
at the beginning of the conference.

56
00:03:00,372 --> 00:03:04,569
That gives a user script up here
and gives you a score of like one to five,

57
00:03:04,570 --> 00:03:08,176
I think it was, of what the quality
of the current item is.

58
00:03:10,043 --> 00:03:15,528
The primary sources tool is for
any database that you want to import,

59
00:03:15,528 --> 00:03:18,364
but that's not high enough quality
to directly add to Wikidata,

60
00:03:18,374 --> 00:03:20,335
so you add it
to the primary sources tool instead,

61
00:03:20,336 --> 00:03:22,956
and then humans can decide

62
00:03:22,956 --> 00:03:26,024
should they add
these individual statements or not.

63
00:03:28,595 --> 00:03:31,901
Showing coordinates as maps
is mainly a convenience feature

64
00:03:31,901 --> 00:03:33,588
but it's also useful for quality control.

65
00:03:33,588 --> 00:03:36,937
Like if you see this is supposed to be
the office of Wikimedia Germany

66
00:03:36,938 --> 00:03:39,400
and if the coordinates
are somewhere in the Indian Ocean,

67
00:03:39,401 --> 00:03:41,529
then you know that
something is not right there

68
00:03:41,530 --> 00:03:44,790
and you can see it much more easily
than if you just had the numbers.

69
00:03:46,382 --> 00:03:49,576
This is a gadget called
the relative completeness indicator

70
00:03:49,577 --> 00:03:52,480
which shows you this little icon here

71
00:03:53,007 --> 00:03:55,652
telling you how complete
it thinks this item is

72
00:03:55,652 --> 00:03:57,613
and also which properties
are most likely missing,

73
00:03:57,614 --> 00:03:59,769
which is really useful
if you're editing an item

74
00:03:59,769 --> 00:04:03,172
and you're in an area
that you're not very familiar with

75
00:04:03,172 --> 00:04:05,661
and you don't know what
the right properties to use are,

76
00:04:05,662 --> 00:04:08,230
then this is a very useful gadget to have.

77
00:04:09,604 --> 00:04:11,401
And we have Shape Expressions.

78
00:04:11,402 --> 00:04:15,624
I think Andra or Jose
are going to talk more about those

79
00:04:15,624 --> 00:04:19,757
but basically, a very powerful way
of comparing the data you have

80
00:04:19,758 --> 00:04:20,758
against the schema,

81
00:04:20,759 --> 00:04:22,680
like what statement should
certain entities have,

82
00:04:22,681 --> 00:04:25,677
what other entities should they link to
and what should those look like,

83
00:04:26,229 --> 00:04:29,374
and then you can find problems that way.

84
00:04:30,366 --> 00:04:32,361
I think... No, there is still more.

85
00:04:32,362 --> 00:04:34,321
Integraality or property dashboard.

86
00:04:34,322 --> 00:04:36,773
It gives you a quick overview
of the data you already have.

87
00:04:36,774 --> 00:04:39,147
For example, this is from
the WikiProject Red Pandas,

88
00:04:39,657 --> 00:04:41,681
and you can see that
we have a sex or gender

89
00:04:41,682 --> 00:04:43,561
for almost all of the red pandas,

90
00:04:43,561 --> 00:04:46,854
the date of birth varies a lot
by which zoo they come from

91
00:04:46,854 --> 00:04:50,255
and we have almost
no dead pandas which is wonderful,

92
00:04:51,437 --> 00:04:52,600
because they're so cute.

93
00:04:53,699 --> 00:04:55,654
So this is also useful.

94
00:04:56,377 --> 00:04:59,185
There we go, OK,
now for the things that are coming up.

95
00:04:59,889 --> 00:05:03,784
Wikidata Bridge, or also known,
formerly known as client editing,

96
00:05:03,785 --> 00:05:07,076
so editing Wikidata
from Wikipedia infoboxes

97
00:05:07,675 --> 00:05:11,725
which will on the one hand
get more eyes on the data

98
00:05:11,725 --> 00:05:13,441
because more people can see the data there

99
00:05:13,441 --> 00:05:18,841
and it will hopefully encourage
more use of Wikidata in the Wikipedias

100
00:05:18,841 --> 00:05:20,920
and that means that more
people can notice

101
00:05:20,921 --> 00:05:23,389
if, for example, some data is outdated
and needs to be updated

102
00:05:23,857 --> 00:05:27,000
instead of if they would
only see it on Wikidata itself.

103
00:05:28,630 --> 00:05:30,656
There is also tainted references.

104
00:05:30,657 --> 00:05:33,959
The idea here is that
if you edit a statement value,

105
00:05:34,683 --> 00:05:37,279
you might want to update
the references as well,

106
00:05:37,280 --> 00:05:39,373
unless it was just a typo or something.

107
00:05:39,897 --> 00:05:43,662
And this tainted references
tells editors that

108
00:05:43,663 --> 00:05:49,756
and also that other editors
see which other edits were made

109
00:05:49,756 --> 00:05:52,471
that edited a statement value
and didn't update a reference

110
00:05:52,472 --> 00:05:56,766
then you can clean up after that
and decide should that be...

111
00:05:57,737 --> 00:05:59,566
Do you need to do anything more about that

112
00:05:59,566 --> 00:06:02,796
or is that actually fine and
you don't need to update the reference.

113
00:06:03,543 --> 00:06:09,336
That's related to signed statements
which is coming from a concern, I think,

114
00:06:09,336 --> 00:06:12,355
that some data providers have that like...

115
00:06:14,131 --> 00:06:17,231
There's a statement that's referenced
through the UNESCO or something

116
00:06:17,232 --> 00:06:19,872
and then suddenly,
someone vandalizes the statement

117
00:06:19,873 --> 00:06:21,836
and they are worried
that it will look like

118
00:06:22,827 --> 00:06:26,992
this organization, like UNESCO,
still said this vandalism value

119
00:06:26,993 --> 00:06:28,706
and so, with signed statements,

120
00:06:28,706 --> 00:06:31,488
they can cryptographically
sign this reference

121
00:06:31,488 --> 00:06:33,562
and that doesn't prevent any edits to it,

122
00:06:34,169 --> 00:06:37,744
but at least, if someone
vandalizes the statement

123
00:06:37,744 --> 00:06:40,255
or edits it in any way,
then the signature is no longer valid,

124
00:06:40,255 --> 00:06:43,401
and you can tell this is not exactly
what the organization said,

125
00:06:43,402 --> 00:06:47,064
and perhaps it's a good edit
and they should re-sign the new statement,

126
00:06:47,065 --> 00:06:49,851
but also perhaps it should be reverted.

127
00:06:51,203 --> 00:06:54,166
And also, this is going
to be very exciting, I think,

128
00:06:54,166 --> 00:06:56,846
Citoid is this amazing system
they have on Wikipedia

129
00:06:57,379 --> 00:07:01,340
where you can paste a URL,
or an identifier, or an ISBN

130
00:07:01,340 --> 00:07:04,759
or Wikidata ID or basically
anything into the Visual Editor,

131
00:07:05,260 --> 00:07:08,241
and it spits out a reference
that is nicely formatted

132
00:07:08,242 --> 00:07:11,049
and has all the data you want
and it's wonderful to use.

133
00:07:11,049 --> 00:07:14,337
And by comparison, on Wikidata,
if I want to add a reference

134
00:07:14,338 --> 00:07:18,801
I typically have to add a reference URL,
title, author name string,

135
00:07:18,802 --> 00:07:20,449
published in, publication date,

136
00:07:20,450 --> 00:07:25,141
retrieve dates,
at least those, and that's annoying,

137
00:07:25,141 --> 00:07:29,261
and integrating Citoid into Wikibase
will hopefully help with that.

138
00:07:30,245 --> 00:07:33,604
And I think
that's all the ones I had, yeah.

139
00:07:33,604 --> 00:07:36,400
So now, I'm going to pass to Cristina.

140
00:07:37,788 --> 00:07:42,339
(applause)

141
00:07:43,780 --> 00:07:45,471
Hi, I'm Cristina.

142
00:07:45,472 --> 00:07:47,672
I'm a research scientist
from the University of Zürich,

143
00:07:47,673 --> 00:07:51,417
and I'm also an active member
of the Swiss Community.

144
00:07:52,698 --> 00:07:57,901
When Claudia Müller-Birn
and I submitted this to the WikidataCon,

145
00:07:57,902 --> 00:08:00,410
what we wanted to do
is continue our discussion

146
00:08:00,411 --> 00:08:02,424
that we started
in the beginning of the year

147
00:08:02,424 --> 00:08:07,442
with a workshop on data quality
and also some sessions in Wikimania.

148
00:08:07,442 --> 00:08:10,535
So the goal of this talk
is basically to bring some thoughts

149
00:08:10,536 --> 00:08:14,432
that we have been collecting
from the community and ourselves

150
00:08:14,432 --> 00:08:16,560
and continue the discussion.

151
00:08:16,561 --> 00:08:20,065
So what we would like is to continue
interacting a lot with you.

152
00:08:21,557 --> 00:08:23,371
So what we think is very important

153
00:08:23,372 --> 00:08:27,580
is that we continuously ask
all types of users in the community

154
00:08:27,581 --> 00:08:32,240
about what they really need,
what problems they have with data quality,

155
00:08:32,240 --> 00:08:35,000
not only editors
but also the people who are coding,

156
00:08:35,000 --> 00:08:36,241
or consuming the data,

157
00:08:36,242 --> 00:08:39,494
and also researchers who are
actually using all the edit history

158
00:08:39,494 --> 00:08:40,800
to analyze what is happening.

159
00:08:42,367 --> 00:08:48,431
So we did a review of around 80 tools
that are existing in Wikidata

160
00:08:48,431 --> 00:08:52,380
and we aligned them to the different
data quality dimensions.

161
00:08:52,380 --> 00:08:54,360
And what we saw was that actually,

162
00:08:54,361 --> 00:08:57,681
many of them were looking at,
monitoring completeness,

163
00:08:57,682 --> 00:09:02,820
but actually... and also some of them
are also enabling interlinking.

164
00:09:02,820 --> 00:09:08,442
But there is a big need for tools
that are looking into diversity,

165
00:09:08,443 --> 00:09:12,824
which is one of the things
that we actually can have in Wikidata,

166
00:09:12,824 --> 00:09:15,958
especially
this design principle of Wikidata

167
00:09:15,959 --> 00:09:17,901
where we can have plurality

168
00:09:17,902 --> 00:09:20,308
and different statements
with different values

169
00:09:21,034 --> 00:09:22,236
coming from different sources.

170
00:09:22,236 --> 00:09:24,921
Because it's a secondary source,
we don't have really tools

171
00:09:24,922 --> 00:09:27,750
that actually tell us how many
plural statements there are,

172
00:09:27,751 --> 00:09:30,889
and how many we can improve and how,

173
00:09:30,890 --> 00:09:32,833
and we also don't know really

174
00:09:32,833 --> 00:09:35,538
what are all the reasons
for plurality that we can have.

175
00:09:36,491 --> 00:09:39,201
So from these community meetings,

176
00:09:39,201 --> 00:09:43,084
what we discussed was the challenges
that still need attention.

177
00:09:43,084 --> 00:09:47,249
For example, that having
all these crowdsourcing communities

178
00:09:47,249 --> 00:09:49,613
is very good because different people
attack different parts

179
00:09:49,613 --> 00:09:51,833
of the data or the graph,

180
00:09:51,834 --> 00:09:54,615
and we also have
different background knowledge

181
00:09:54,616 --> 00:09:59,161
but actually, it's very difficult to align
everything in something homogeneous

182
00:09:59,162 --> 00:10:04,920
because different people are using
different properties in different ways

183
00:10:04,920 --> 00:10:08,401
and they are also expecting
different things from entity descriptions.

184
00:10:09,003 --> 00:10:12,721
People also said that
they also need more tools

185
00:10:12,722 --> 00:10:16,000
that give a better overview
of the global status of things.

186
00:10:16,000 --> 00:10:20,733
So what entities are missing
in terms of completeness,

187
00:10:20,733 --> 00:10:26,121
but also like what are people
working on right now most of the time,

188
00:10:26,121 --> 00:10:30,516
and they also mention many times
a tighter collaboration

189
00:10:30,517 --> 00:10:33,311
across not only languages
but the WikiProjects

190
00:10:33,311 --> 00:10:35,571
and the different Wikimedia platforms.

191
00:10:35,571 --> 00:10:38,859
And we published
all the transcribed comments

192
00:10:38,860 --> 00:10:42,959
from all these discussions
in those links here in the Etherpads

193
00:10:42,959 --> 00:10:46,162
and also in the wiki page of Wikimania.

194
00:10:46,162 --> 00:10:48,481
Some solutions that appeared actually

195
00:10:48,481 --> 00:10:53,001
were going into the direction
of sharing more the best practices

196
00:10:53,001 --> 00:10:55,762
that are being developed
in different WikiProjects,

197
00:10:55,762 --> 00:11:01,238
but also people want tools
that help organize work in teams

198
00:11:01,239 --> 00:11:03,845
or at least understanding
who is working on that,

199
00:11:03,845 --> 00:11:07,815
and they were also mentioning
that they want more showcases

200
00:11:07,816 --> 00:11:12,019
and more templates that help them
create things in a better way.

201
00:11:12,946 --> 00:11:15,161
And from the contact that we have

202
00:11:15,162 --> 00:11:18,721
with Open Governmental Data Organizations,

203
00:11:18,722 --> 00:11:20,068
and in particular,

204
00:11:20,068 --> 00:11:23,102
I am in contact with the canton
and the city of Zürich,

205
00:11:23,102 --> 00:11:26,207
they are very interested
in working with Wikidata

206
00:11:26,207 --> 00:11:29,896
because they want their data
to be accessible for everyone

207
00:11:29,897 --> 00:11:33,681
in the place where people go
and consult or access data.

208
00:11:33,682 --> 00:11:36,550
So for them, something that
would be really interesting

209
00:11:36,551 --> 00:11:38,600
is to have some kind of quality indicators

210
00:11:38,600 --> 00:11:41,082
both in the wiki,
which is already happening,

211
00:11:41,082 --> 00:11:42,801
but also in SPARQL results,

212
00:11:42,802 --> 00:11:46,066
to know whether they can trust
or not that data from the community.

213
00:11:46,067 --> 00:11:48,230
And then, they also want to know

214
00:11:48,230 --> 00:11:51,417
what parts of their own data sets
are useful for Wikidata

215
00:11:51,418 --> 00:11:56,040
and they would love to have a tool that
can help them assess that automatically.

216
00:11:56,041 --> 00:11:59,066
They also need
some kind of methodology or tool

217
00:11:59,067 --> 00:12:03,894
that helps them decide whether
they should import or link their data

218
00:12:03,894 --> 00:12:04,894
because in some cases,

219
00:12:04,895 --> 00:12:07,137
they also have their own
linked open data sets,

220
00:12:07,138 --> 00:12:09,746
so they don't know whether
to just ingest the data

221
00:12:09,747 --> 00:12:13,424
or to keep on creating links
from the data sets to Wikidata

222
00:12:13,425 --> 00:12:14,425
and the other way around.

223
00:12:14,950 --> 00:12:20,043
And they also want to know where
their websites are referred to in Wikidata.

224
00:12:20,044 --> 00:12:23,361
And when they run such a query
in the query service,

225
00:12:23,362 --> 00:12:24,848
they often get timeouts,

226
00:12:24,849 --> 00:12:28,181
so maybe we should
really create more tools

227
00:12:28,181 --> 00:12:32,240
that help them get these answers
for their questions.

228
00:12:33,148 --> 00:12:36,208
And, besides that,

229
00:12:36,208 --> 00:12:39,361
we wiki researchers also sometimes

230
00:12:39,362 --> 00:12:42,023
lack some information
in the edit summaries.

231
00:12:42,024 --> 00:12:44,953
So I remember that when
we were doing some work

232
00:12:44,954 --> 00:12:48,919
to understand
the different behavior of editors

233
00:12:48,919 --> 00:12:53,403
with tools or bots
or anonymous users and so on,

234
00:12:53,403 --> 00:12:56,154
we were really lacking, for example,

235
00:12:56,154 --> 00:13:01,112
a standard way of tracing
that tools were being used.

236
00:13:01,113 --> 00:13:03,154
And there are some tools
that are already doing that

237
00:13:03,155 --> 00:13:05,230
like PetScan and many others,

238
00:13:05,230 --> 00:13:07,720
but maybe we should in the community

239
00:13:07,721 --> 00:13:13,531
discuss more about how to record these
for fine-grained provenance.

240
00:13:14,169 --> 00:13:15,321
And further on,

241
00:13:15,322 --> 00:13:20,801
we think that we need to think
of more concrete data quality dimensions

242
00:13:20,802 --> 00:13:24,961
that are related to linked data
but not all the types of data,

243
00:13:24,962 --> 00:13:30,721
so we worked on some measures
to actually assess the information gain

244
00:13:30,722 --> 00:13:33,881
enabled by the links,
and what we mean by that

245
00:13:33,882 --> 00:13:36,681
is that when we link
Wikidata to other data sets,

246
00:13:36,682 --> 00:13:38,201
we should also be thinking

247
00:13:38,202 --> 00:13:41,921
how much the entities are actually
gaining in the classification,

248
00:13:41,922 --> 00:13:45,601
also in the description
but also in the vocabularies they use.

249
00:13:45,602 --> 00:13:51,041
So just to give a very simple
example of what I mean with this

250
00:13:51,042 --> 00:13:54,269
is we can think of--
in this case, would be Wikidata

251
00:13:54,270 --> 00:13:57,771
or the external data set
that is linking to Wikidata,

252
00:13:57,772 --> 00:14:00,487
we have the entity for a person
that is called Natasha Noy,

253
00:14:00,487 --> 00:14:02,601
we have the affiliation and other things,

254
00:14:02,602 --> 00:14:05,239
and then we say OK,
we link to an external place,

255
00:14:05,240 --> 00:14:08,919
and that entity also has that name,
but we actually have the same value.

256
00:14:08,920 --> 00:14:12,889
So what it would be better is that we link
to something that has a different name,

257
00:14:12,889 --> 00:14:16,881
that is still valid because this person
has two ways of writing the name,

258
00:14:16,882 --> 00:14:19,714
and also other information
that we don't have in Wikidata

259
00:14:19,715 --> 00:14:21,760
or that we don't have
in the other data set.

260
00:14:22,390 --> 00:14:24,652
But also, what is even better

261
00:14:24,653 --> 00:14:27,770
is that we are actually
looking in the target data set

262
00:14:27,770 --> 00:14:31,392
that they also have new ways
of classifying the information.

263
00:14:31,393 --> 00:14:35,354
So not only is this a person,
but in the other data set,

264
00:14:35,355 --> 00:14:39,525
they also say it's a female
or anything else that they classify with.

265
00:14:39,526 --> 00:14:43,401
And if in the other data set,
they are using many other vocabularies

266
00:14:43,402 --> 00:14:46,588
that is also helping in their whole
information retrieval thing.

267
00:14:47,371 --> 00:14:51,233
So with that, I also would like to say

268
00:14:51,234 --> 00:14:55,809
that we think that we can
showcase federated queries better

269
00:14:55,810 --> 00:15:00,448
because when we look at the query log
provided by Malyshev et al.,

270
00:15:01,285 --> 00:15:04,301
we see actually that
from the organic queries,

271
00:15:04,302 --> 00:15:06,921
we have only very few federated queries.

272
00:15:06,922 --> 00:15:12,801
And actually, federation is one
of the key advantages of having linked data,

273
00:15:12,802 --> 00:15:16,903
so maybe the community
or the people using Wikidata

274
00:15:16,903 --> 00:15:18,898
also need more examples on this.

275
00:15:18,898 --> 00:15:22,666
And if we look at the list
of endpoints that are being used,

276
00:15:22,667 --> 00:15:25,401
this is not a complete list
and we have many more.

277
00:15:25,402 --> 00:15:30,479
Of course, this data was analyzed
from queries until March 2018,

278
00:15:30,480 --> 00:15:34,807
but we should look into the list
of federated endpoints that we have

279
00:15:34,808 --> 00:15:37,048
and see whether
we are really using them or not.

280
00:15:37,813 --> 00:15:40,441
So two questions that
I have for the audience

281
00:15:40,442 --> 00:15:43,001
that maybe we can use
afterwards for the discussion are:

282
00:15:43,001 --> 00:15:46,001
what data quality problems
should be addressed in your opinion,

283
00:15:46,002 --> 00:15:47,412
because of the needs that you have,

284
00:15:47,412 --> 00:15:50,401
but also, where do you need
more automation

285
00:15:50,402 --> 00:15:52,943
to help you with editing or patrolling.

286
00:15:53,866 --> 00:15:55,146
That's all, thank you very much.

287
00:15:55,779 --> 00:15:57,527
(applause)

288
00:16:06,030 --> 00:16:08,595
(Jose Emilio Labra) OK,
so what I'm going to talk about

289
00:16:08,595 --> 00:16:14,715
is some tools that we were developing
related with Shape Expressions.

290
00:16:15,536 --> 00:16:19,371
So this is what I want to talk...
I am Jose Emilio Labra,

291
00:16:19,371 --> 00:16:23,215
but this has... all these tools
have been done by different people,

292
00:16:23,920 --> 00:16:28,480
mainly related with W3C ShEx,
Shape Expressions Community Group.

293
00:16:28,481 --> 00:16:29,481
ShEx Community Group.

294
00:16:30,144 --> 00:16:36,081
So the first tool that I want to mention
is RDFShape, this is a general tool,

295
00:16:36,082 --> 00:16:40,681
because Shape Expressions
is not only for Wikidata,

296
00:16:40,682 --> 00:16:44,168
Shape Expressions is a language
to validate RDF in general.

297
00:16:44,168 --> 00:16:47,568
So this tool was developed mainly by me

298
00:16:47,568 --> 00:16:50,880
and it's a tool
to validate RDF in general.

299
00:16:50,881 --> 00:16:55,139
So if you want to learn about RDF
or you want to validate RDF

300
00:16:55,140 --> 00:16:58,621
or SPARQL endpoints not only in Wikidata,

301
00:16:58,622 --> 00:17:00,891
my advice is that you can use this tool.

302
00:17:00,891 --> 00:17:03,255
Also for teaching.

303
00:17:03,255 --> 00:17:05,640
I am a teacher in the university

304
00:17:05,641 --> 00:17:09,151
and I use it in my semantic web course
to teach RDF.

305
00:17:09,161 --> 00:17:12,121
So if you want to learn RDF,
I think it's a good tool.

306
00:17:13,033 --> 00:17:17,598
For example, this is just a visualization
of an RDF graph with the tool.

307
00:17:18,587 --> 00:17:22,643
But before coming here, in the last month,

308
00:17:22,643 --> 00:17:28,441
I started a fork of RDFShape specifically
for Wikidata, because I thought...

309
00:17:28,443 --> 00:17:33,082
It's called WikiShape, and yesterday,
I presented it as a present for Wikidata.

310
00:17:33,082 --> 00:17:34,441
So what I took is...

311
00:17:34,442 --> 00:17:39,898
What I did is to remove all the stuff
that was not related with Wikidata

312
00:17:39,898 --> 00:17:44,801
and to put several things, hard-coded,
for example, the Wikidata SPARQL endpoint,

313
00:17:44,802 --> 00:17:49,041
but now, someone asked me
if I could do it also for Wikibase.

314
00:17:49,042 --> 00:17:52,000
And it is very easy
to do it for Wikibase also.

315
00:17:52,760 --> 00:17:56,280
So this tool, WikiShape, is quite new.

316
00:17:57,015 --> 00:17:59,843
I think it works, most of the features,

317
00:17:59,844 --> 00:18:02,468
but there are some features
that maybe don't work,

318
00:18:02,469 --> 00:18:06,281
and if you try it and you want
to improve it, please tell me.

319
00:18:06,281 --> 00:18:12,680
So this is [inaudible] captures,
but I think I can even try so let's try.

320
00:18:15,385 --> 00:18:16,945
So let's see if it works.

321
00:18:16,953 --> 00:18:20,070
First, I have to go out of the...

322
00:18:22,453 --> 00:18:23,453
Here.

323
00:18:24,226 --> 00:18:28,324
Alright, yeah. So this is the tool here.

324
00:18:28,324 --> 00:18:29,844
Things that you can do with the tool,

325
00:18:29,845 --> 00:18:35,275
for example, is that you can
check schemas, entity schemas.

326
00:18:35,276 --> 00:18:38,611
You know that there is
a new namespace which is "E whatever,"

327
00:18:38,612 --> 00:18:44,805
so here, if you start for example,
write for example "human"...

328
00:18:44,806 --> 00:18:48,812
As you are writing,
its autocomplete allows you to check,

329
00:18:48,812 --> 00:18:52,001
for example,
this is the Shape Expressions of a human,

330
00:18:52,790 --> 00:18:55,937
and this is the Shape Expressions here.

331
00:18:55,938 --> 00:18:59,841
And as you can see,
this editor has syntax highlighting,

332
00:18:59,842 --> 00:19:04,559
this is... well,
maybe it's very small, the screen.

333
00:19:05,676 --> 00:19:07,590
I can try to do it bigger.

334
00:19:09,194 --> 00:19:10,973
Maybe you see it better now.

335
00:19:10,973 --> 00:19:14,241
So... and this is the editor
with syntax highlighting and also has...

336
00:19:14,241 --> 00:19:17,851
I mean, this editor
comes from the same source code

337
00:19:17,851 --> 00:19:19,641
as the Wikidata query service.

338
00:19:19,642 --> 00:19:23,960
So for example,
if you hover with the mouse here,

339
00:19:23,961 --> 00:19:27,961
it shows you the labels
of the different properties.

340
00:19:27,962 --> 00:19:31,298
So I think it's very helpful because now,

341
00:19:32,588 --> 00:19:38,601
the entity schemas that is
in the Wikidata is just a plain text area,

342
00:19:38,602 --> 00:19:42,493
and I think this editor is much better
because it has autocomplete

343
00:19:42,494 --> 00:19:43,743
and it also has...

344
00:19:43,744 --> 00:19:48,241
I mean, if you, for example,
wanted to add a constraint,

345
00:19:48,241 --> 00:19:51,570
you say "wdt:"

346
00:19:51,570 --> 00:19:56,884
You start writing "author"
and then you click *Ctrl+Space*

347
00:19:56,884 --> 00:19:58,922
and it suggests the different things.

348
00:19:58,922 --> 00:20:02,388
So this is similar
to the Wikidata query service

349
00:20:02,389 --> 00:20:06,445
but specifically for Shape Expressions

350
00:20:06,445 --> 00:20:11,975
because my feeling is that
creating Shape Expressions

351
00:20:11,976 --> 00:20:15,841
is not more difficult
than writing SPARQL queries.

352
00:20:15,842 --> 00:20:21,255
So some people think
that it's at the same level,

353
00:20:22,278 --> 00:20:26,296
It's probably easier, I think,
because Shape Expressions was,

354
00:20:26,296 --> 00:20:31,241
when we designed it,
we were doing it to be easier to work.

355
00:20:31,242 --> 00:20:35,001
OK, so this is one of the first things,
that you have this editor

356
00:20:35,001 --> 00:20:36,620
for Shape Expressions.

357
00:20:37,371 --> 00:20:41,467
And then you also have the possibility,
for example, to visualize.

358
00:20:41,468 --> 00:20:44,801
If you have a Shape Expression,
use for example...

359
00:20:44,802 --> 00:20:49,386
I think, "written work" is
a nice Shape Expression

360
00:20:49,386 --> 00:20:53,300
because it has some relationships
between different things.

361
00:20:54,823 --> 00:20:58,160
And this is the UML visualization
of written work.

362
00:20:58,161 --> 00:21:02,090
In UML, it is easy to see
the different properties.

363
00:21:02,790 --> 00:21:06,794
When you do this, I realized
when I tried with several people,

364
00:21:06,795 --> 00:21:09,216
they find some mistakes
in their Shape Expressions

365
00:21:09,217 --> 00:21:12,988
because it's easy to detect which are
the missing properties or whatever.

366
00:21:13,588 --> 00:21:15,771
Then there is another possibility here

367
00:21:15,772 --> 00:21:19,520
is that you can also validate,
I think I have it here, the validation.

368
00:21:20,496 --> 00:21:25,285
I think I had it in some label,
maybe I closed it.

369
00:21:26,267 --> 00:21:30,988
OK, but you can, for example,
you can click here, *Validate entities.*

370
00:21:32,308 --> 00:21:34,232
You, for example,

371
00:21:35,404 --> 00:21:41,921
"Q42" with "E42" which is author.

372
00:21:42,818 --> 00:21:46,180
With "human,"
I think we can do it with "human."

373
00:21:49,050 --> 00:21:50,050
And then it's...

374
00:21:50,688 --> 00:21:56,365
And it's taking a little while to do it
because this is doing the SPARQL queries

375
00:21:56,365 --> 00:21:59,134
and now, for example,
it's failing by the network but...

376
00:21:59,657 --> 00:22:01,580
So you can try it.

377
00:22:02,759 --> 00:22:07,026
OK, so let's continue
with the presentation, with other tools.

378
00:22:07,026 --> 00:22:12,353
So my advice is that if you want to try it
and you want any feedback let me know.

379
00:22:13,133 --> 00:22:15,540
So to continue with the presentation...

380
00:22:18,923 --> 00:22:20,233
So this is WikiShape.

381
00:22:23,800 --> 00:22:26,509
Then, I already said this,

382
00:22:27,681 --> 00:22:34,157
the Shape Expressions Editor
is an independent project in GitHub.

383
00:22:35,605 --> 00:22:37,472
You can use it in your own project.

384
00:22:37,472 --> 00:22:41,036
If you want to do
a Shape Expressions tool,

385
00:22:41,036 --> 00:22:45,635
you can just embed it
in any other project,

386
00:22:45,636 --> 00:22:48,235
so this is in GitHub and you can use it.

387
00:22:48,868 --> 00:22:51,970
Then the same author,
it's one of my students,

388
00:22:52,684 --> 00:22:55,704
he also created
an editor for Shape Expressions,

389
00:22:55,704 --> 00:22:57,799
also inspired by
the Wikidata query service

390
00:22:57,800 --> 00:23:00,681
where, in a column,

391
00:23:00,682 --> 00:23:05,103
you have this more visual editor
of SPARQL queries

392
00:23:05,104 --> 00:23:07,135
where you can put this kind of things.

393
00:23:07,136 --> 00:23:09,123
So this is a screen capture.

394
00:23:09,123 --> 00:23:12,662
You can see that
that's the Shape Expressions in text

395
00:23:12,662 --> 00:23:17,822
but this is a form-based Shape Expressions
where it would probably take a bit longer

396
00:23:18,595 --> 00:23:23,400
where you can put the different rows
on the different fields.

397
00:23:23,401 --> 00:23:25,800
OK, then there is ShExEr.

398
00:23:26,879 --> 00:23:31,882
We have... it's done by one PhD student
at the University of Oviedo

399
00:23:31,883 --> 00:23:34,080
and he's here, so you can present ShExEr.

400
00:23:38,147 --> 00:23:40,024
(Danny) Hello, I am Danny Fernández,

401
00:23:40,025 --> 00:23:43,800
I am a PhD student in University of Oviedo
working with Labra.

402
00:23:44,710 --> 00:23:47,725
Since we are running out of time,
let's make this quick,

403
00:23:47,726 --> 00:23:52,641
so let's not go for any actual demo,
but just print some screenshots.

404
00:23:52,642 --> 00:23:57,897
OK, so the usual way to work with
Shape Expressions or any shape language

405
00:23:57,897 --> 00:23:59,521
is that you have a domain expert

406
00:23:59,522 --> 00:24:02,313
that defines a priori
what the graph should look like,

407
00:24:02,314 --> 00:24:03,555
defines some structures,

408
00:24:03,556 --> 00:24:06,983
and then you use these structures
to validate the actual data against it.

409
00:24:08,124 --> 00:24:11,641
This tool, which is as well as the ones
that Labra has been presenting,

410
00:24:11,642 --> 00:24:14,441
this is a general purpose tool
for any RDF source,

411
00:24:14,442 --> 00:24:17,375
is designed to do the other way around.

412
00:24:17,376 --> 00:24:18,758
You already have some data,

413
00:24:18,759 --> 00:24:23,165
you select what nodes
you want to get the shape about

414
00:24:23,165 --> 00:24:26,718
and then you automatically
extract or infer the shape.

415
00:24:26,719 --> 00:24:29,791
So even if this is a general purpose tool,

416
00:24:29,791 --> 00:24:34,063
what we did for this WikidataCon
is this fancy button

417
00:24:34,884 --> 00:24:37,081
that if you click it,
essentially what happens

418
00:24:37,081 --> 00:24:42,079
is that there are
so many configuration params

419
00:24:42,080 --> 00:24:46,251
and it configures it to work
against the Wikidata endpoint

420
00:24:46,251 --> 00:24:47,971
and it will end soon, sorry.

421
00:24:48,733 --> 00:24:52,883
So, once you press this button
what you get is essentially this.

422
00:24:52,884 --> 00:24:55,126
After having selected what kind of nodes,

423
00:24:55,127 --> 00:24:59,360
what kind of instances of a class,
whatever you are looking for,

424
00:24:59,361 --> 00:25:01,321
you get an automatic schema.

425
00:25:02,319 --> 00:25:07,111
All the constraints are sorted
by how many nodes actually conform to it,

426
00:25:07,112 --> 00:25:09,772
you can filter the less common ones, etc.

427
00:25:09,772 --> 00:25:12,126
So there is a poster downstairs
about this stuff

428
00:25:12,127 --> 00:25:14,595
and well,
I will be downstairs and upstairs

429
00:25:14,596 --> 00:25:16,454
and all over the place all day,

430
00:25:16,455 --> 00:25:19,081
so if you have any further
interest in this tool,

431
00:25:19,082 --> 00:25:21,476
just speak to me during this journey.

432
00:25:21,477 --> 00:25:24,624
And now, I'll give back
the micro to Labra, thank you.

433
00:25:24,625 --> 00:25:29,265
(applause)

434
00:25:29,812 --> 00:25:32,578
(Jose) So let's continue
with the other tools.

435
00:25:32,579 --> 00:25:34,984
The other tool is the ShapeDesigner.

436
00:25:34,984 --> 00:25:37,241
Andra, do you want to do
the ShapeDesigner now

437
00:25:37,242 --> 00:25:39,287
or maybe later or in the workshop?

438
00:25:39,287 --> 00:25:40,603
There is a workshop...

439
00:25:40,603 --> 00:25:44,437
This afternoon, there is a workshop
specifically for Shape Expressions, and...

440
00:25:45,265 --> 00:25:47,939
The idea is that it was going to be
more hands-on,

441
00:25:47,940 --> 00:25:52,324
and if you want to practice
some ShEx, you can do it there.

442
00:25:52,875 --> 00:25:55,720
This tool is ShEx...
and there is Eric here,

443
00:25:55,721 --> 00:25:56,890
so you can present it.

444
00:25:57,969 --> 00:26:00,687
(Eric) So just super quick,
the thing that I want to say

445
00:26:00,687 --> 00:26:05,711
is that you've probably
already seen the ShEx interface

446
00:26:05,711 --> 00:26:07,601
that's tailored for Wikidata.

447
00:26:07,602 --> 00:26:12,930
That's effectively stripped down
and tailored specifically for Wikidata

448
00:26:12,930 --> 00:26:17,937
because the generic one has more features
but it turns out I thought I'd mention it

449
00:26:17,937 --> 00:26:19,977
because one of those features
is particularly useful

450
00:26:19,978 --> 00:26:23,201
for debugging Wikidata schemas,

451
00:26:23,201 --> 00:26:29,224
which is if you go
and you select the slurp mode,

452
00:26:29,225 --> 00:26:31,444
what it does is it says
while I'm validating,

453
00:26:31,445 --> 00:26:34,694
I want to pull all the triples down
and that means

454
00:26:34,695 --> 00:26:36,274
if I get a bunch of failures,

455
00:26:36,275 --> 00:26:39,586
I can go through and start looking
at those failures and saying,

456
00:26:39,587 --> 00:26:41,800
OK, what are the triples
that are in here,

457
00:26:41,801 --> 00:26:44,120
sorry, I apologize,
the triples are down there,

458
00:26:44,121 --> 00:26:45,647
this is just a log of what went by.

459
00:26:46,327 --> 00:26:49,180
And then you can just sit there
and fiddle with it in real time

460
00:26:49,181 --> 00:26:51,033
like you play with something
and it changes.

461
00:26:51,033 --> 00:26:54,160
So it's a quicker version
for doing all that stuff.

462
00:26:55,361 --> 00:26:56,481
This is a ShExC form,

463
00:26:56,482 --> 00:26:59,455
this is something [Joachim] had suggested

464
00:27:00,035 --> 00:27:04,631
could be useful for populating
Wikidata documents

465
00:27:04,631 --> 00:27:07,338
based on a Shape Expression
for that document.

466
00:27:08,095 --> 00:27:11,681
This is not tailored for Wikidata,

467
00:27:11,682 --> 00:27:14,081
but this is just to say
that you can have a schema

468
00:27:14,082 --> 00:27:15,402
and you can have some annotations

469
00:27:15,403 --> 00:27:17,518
to say specifically how I want
that schema rendered

470
00:27:17,519 --> 00:27:19,031
and then it just builds a form,

471
00:27:19,031 --> 00:27:21,191
and if you've got data,
it can even populate the form.

472
00:27:24,517 --> 00:27:26,164
PyShEx [inaudible].

473
00:27:28,025 --> 00:27:31,080
(Jose) I think this is the last one.

474
00:27:31,821 --> 00:27:34,080
Yes, so the last one is PyShEx.

475
00:27:34,675 --> 00:27:38,151
PyShEx is a Python implementation
of Shape Expressions,

476
00:27:39,193 --> 00:27:42,680
you can play also with Jupyter Notebooks
if you want those kind of things.

477
00:27:42,680 --> 00:27:44,432
OK, so that's all for this.

478
00:27:44,433 --> 00:27:47,170
(applause)

479
00:27:52,916 --> 00:27:57,073
(Andra) So I'm going to talk about
a specific project that I'm involved in

480
00:27:57,074 --> 00:27:58,074
called Gene Wiki,

481
00:27:58,075 --> 00:28:04,596
and where we are also
dealing with quality issues.

482
00:28:04,597 --> 00:28:06,684
But before going into the quality,

483
00:28:06,685 --> 00:28:09,229
maybe a quick introduction
about what Gene Wiki is,

484
00:28:09,855 --> 00:28:15,175
and we just released a pre-print
of a paper that we recently have written

485
00:28:15,175 --> 00:28:18,160
that explains the details of the project.

486
00:28:19,821 --> 00:28:23,839
I see people taking pictures,
but basically, what Gene Wiki does,

487
00:28:23,846 --> 00:28:28,027
it's trying to get biomedical data,
public data into Wikidata,

488
00:28:28,028 --> 00:28:32,200
and we follow a specific pattern
to get that data into Wikidata.

489
00:28:33,130 --> 00:28:36,809
So when we have a new repository
or a new data set

490
00:28:36,810 --> 00:28:39,600
that is eligible
to be included into Wikidata,

491
00:28:39,601 --> 00:28:41,293
the first step is community engagement.

492
00:28:41,294 --> 00:28:43,784
It is not necessarily
directly to a Wikidata community

493
00:28:43,785 --> 00:28:46,120
but a local research community,

494
00:28:46,121 --> 00:28:50,286
and we meet in person
or online or on any platform

495
00:28:50,286 --> 00:28:52,881
and try to come up with a data model

496
00:28:52,882 --> 00:28:56,197
that bridges their data
with the Wikidata model.

497
00:28:56,197 --> 00:28:59,944
So here I have a picture of a workshop
that happened here last year

498
00:28:59,945 --> 00:29:02,663
which was trying to look
at a specific data set

499
00:29:02,663 --> 00:29:05,280
and, well, you see a lot of discussions,

500
00:29:05,281 --> 00:29:09,780
then aligning it with schema.org
and other ontologies that are out there.

501
00:29:10,320 --> 00:29:15,508
And then, at the end of the first step,
we have a whiteboard drawing of the schema

502
00:29:15,509 --> 00:29:17,336
that we want to implement in Wikidata.

503
00:29:17,337 --> 00:29:20,440
What you see over there,
this is just plain,

504
00:29:20,441 --> 00:29:21,766
we have it in the back there

505
00:29:21,767 --> 00:29:25,240
so we can make some schemas
within this panel today even.

506
00:29:26,560 --> 00:29:28,399
So once we have the schema in place,

507
00:29:28,400 --> 00:29:31,320
the next thing is try to make
that schema machine readable

508
00:29:32,358 --> 00:29:36,841
because you want to have actionable models
to bridge the data that you're bringing in

509
00:29:36,842 --> 00:29:39,690
from any biomedical database
into Wikidata.

510
00:29:40,393 --> 00:29:45,182
And here we are applying
Shape Expressions.

511
00:29:46,471 --> 00:29:52,518
And we use that because
Shape Expressions allow you to test

512
00:29:52,518 --> 00:29:57,040
whether the data set
is actually-- no, to first see

513
00:29:57,041 --> 00:30:01,782
if already existing data in Wikidata
follows the same data model

514
00:30:01,783 --> 00:30:04,718
that was achieved in the previous process.

515
00:30:04,719 --> 00:30:06,641
So then with the Shape Expression
we can check:

516
00:30:06,642 --> 00:30:10,926
OK the data that are on this topic
in Wikidata, does it need some cleaning up

517
00:30:10,926 --> 00:30:15,013
or do we need to adapt our model
to the Wikidata model or vice versa.

518
00:30:15,937 --> 00:30:19,867
Once that is in place,
we start writing bots,

519
00:30:20,670 --> 00:30:23,801
and bots are seeding the information

520
00:30:23,802 --> 00:30:27,308
that is in the primary sources
into Wikidata.

521
00:30:27,846 --> 00:30:29,303
And when the bots are ready,

522
00:30:29,304 --> 00:30:33,001
we write these bots
with a platform called--

523
00:30:33,002 --> 00:30:36,201
with a Python library
called Wikidata Integrator

524
00:30:36,202 --> 00:30:38,167
that came out of our project.

525
00:30:38,698 --> 00:30:42,921
And once we have our bots,
we use a platform called Jenkins

526
00:30:42,921 --> 00:30:44,540
for continuous integration.

527
00:30:44,540 --> 00:30:45,762
And with Jenkins,

528
00:30:45,762 --> 00:30:51,160
we continuously update
the primary sources with Wikidata.

529
00:30:52,178 --> 00:30:55,889
And this is a diagram from the paper
I previously mentioned.

530
00:30:55,890 --> 00:30:57,241
This is our current landscape.

531
00:30:57,242 --> 00:31:02,059
So every orange box out there
is a primary resource on drugs,

532
00:31:02,060 --> 00:31:07,827
proteins, genes, diseases,
chemical compounds with interaction,

533
00:31:07,827 --> 00:31:10,870
and this model is too small to read now

534
00:31:10,870 --> 00:31:17,472
but this is the database,
the sources that we manage in Wikidata

535
00:31:17,473 --> 00:31:20,560
and bridge with the primary sources.

536
00:31:20,561 --> 00:31:22,355
Here is such a workflow.

537
00:31:22,870 --> 00:31:25,312
So one of our partners
is the Disease Ontology.

538
00:31:25,312 --> 00:31:27,672
The Disease Ontology is a CC0 ontology,

539
00:31:28,179 --> 00:31:31,990
and the CC0 Ontology
has a curation cycle on its own,

540
00:31:32,756 --> 00:31:35,736
and they just continuously
update the Disease Ontology

541
00:31:35,737 --> 00:31:39,687
to reflect the disease space
or the interpretation of diseases.

542
00:31:40,336 --> 00:31:44,361
And there is the Wikidata
curation cycle also on diseases

543
00:31:44,362 --> 00:31:49,844
where the Wikidata community constantly
monitors what's going on on Wikidata.

544
00:31:50,406 --> 00:31:51,601
And then we have two roles,

545
00:31:51,602 --> 00:31:55,477
we call them colloquially
the gatekeeper curator,

546
00:31:56,009 --> 00:31:59,561
and this was me
and a colleague five years ago

547
00:31:59,562 --> 00:32:03,414
where we just sit on our computers
and we monitor Wikipedia and Wikidata,

548
00:32:03,415 --> 00:32:08,601
and if there is an issue, that was
reported back to the primary community,

549
00:32:08,602 --> 00:32:11,765
the primary resources, they looked
at the implementation and decided:

550
00:32:11,765 --> 00:32:14,240
OK, do we trust the Wikidata input?

551
00:32:14,850 --> 00:32:18,555
Yes--then it's considered,
it goes into the cycle,

552
00:32:18,555 --> 00:32:22,686
and the next iteration
is part of the Disease Ontology

553
00:32:22,687 --> 00:32:25,411
and fed back into Wikidata.

554
00:32:27,419 --> 00:32:31,480
We're doing the same for WikiPathways.

555
00:32:31,481 --> 00:32:36,601
WikiPathways is a MediaWiki-inspired
pathway repository.

556
00:32:36,602 --> 00:32:40,901
Same story, there are different
pathway resources on Wikidata already.

557
00:32:41,463 --> 00:32:44,713
There might be conflicts
between those pathway resources

558
00:32:44,722 --> 00:32:46,701
and these conflicts are reported back

559
00:32:46,702 --> 00:32:49,521
by the gatekeeper curators
to that community,

560
00:32:49,522 --> 00:32:53,715
and you maintain
the individual curation cycles.

561
00:32:53,715 --> 00:32:57,068
But if you remember the previous cycle,

562
00:32:57,069 --> 00:33:03,041
here I mentioned
only two cycles, two resources,

563
00:33:03,566 --> 00:33:06,300
we have to do that
for every single resource that we have

564
00:33:06,300 --> 00:33:08,061
and we have to manage what's going on

565
00:33:08,062 --> 00:33:09,185
because when I say curation,

566
00:33:09,185 --> 00:33:11,377
I really mean going
to the Wikipedia talk pages,

567
00:33:11,377 --> 00:33:14,544
going into the Wikidata talk pages
and trying to do that.

568
00:33:14,545 --> 00:33:19,316
That doesn't scale for
the two gatekeeper curators we had.

569
00:33:19,860 --> 00:33:22,777
So when I was in a conference in 2016

570
00:33:22,778 --> 00:33:26,933
where Eric gave a presentation
on Shape Expressions,

571
00:33:26,934 --> 00:33:29,277
I jumped on the bandwagon and said OK,

572
00:33:29,278 --> 00:33:34,240
Shape Expressions can help us
detect what differs in Wikidata

573
00:33:34,240 --> 00:33:41,159
and so that allows the gatekeepers to have
some more efficient reporting to report.

574
00:33:42,275 --> 00:33:46,019
So this year,
I was delighted by the schema entity

575
00:33:46,020 --> 00:33:50,765
because now, we can store
those entity schemas on Wikidata,

576
00:33:50,765 --> 00:33:53,183
on Wikidata itself,
whereas before, it was on GitHub,

577
00:33:53,860 --> 00:33:56,815
and this aligns
with the Wikidata interface,

578
00:33:56,816 --> 00:33:59,350
so you have things
like document discussions

579
00:33:59,350 --> 00:34:00,762
but you also have revisions.

580
00:34:00,763 --> 00:34:05,261
So you can leverage the talk pages
and the revisions in Wikidata

581
00:34:05,262 --> 00:34:12,255
to use that to discuss
about what is in Wikidata

582
00:34:12,255 --> 00:34:14,060
and what is in the primary resources.

583
00:34:14,966 --> 00:34:19,686
So this is what Eric just presented,
this is already quite a benefit.

584
00:34:19,686 --> 00:34:24,335
So here, we made up a Shape Expression
for the human gene,

585
00:34:24,336 --> 00:34:30,225
and then we ran it through simple ShEx,
and as you can see,

586
00:34:30,225 --> 00:34:32,428
we just got already ni--

587
00:34:32,429 --> 00:34:34,641
There is one issue
that needs to be monitored

588
00:34:34,642 --> 00:34:37,316
which is that there is an item
that doesn't fit that schema,

589
00:34:37,316 --> 00:34:43,139
and then you can sort of already
create schema entities curation reports

590
00:34:43,140 --> 00:34:46,240
based on... and send that
to the different curation reports.

591
00:34:48,058 --> 00:34:52,788
But the ShEx.js is a built interface,

592
00:34:52,788 --> 00:34:55,860
and if I can show back here,
I only do ten,

593
00:34:55,860 --> 00:35:00,362
but we have tens of thousands,
and so that again doesn't scale.

594
00:35:00,362 --> 00:35:04,654
So the Wikidata Integrator now
supports ShEx as well,

595
00:35:05,168 --> 00:35:07,431
and then we can just do item loops

596
00:35:07,431 --> 00:35:11,494
where we say yes-no,
yes-no, true-false, true-false.

597
00:35:11,495 --> 00:35:12,495
So again,

598
00:35:13,065 --> 00:35:16,514
increasing a bit of the efficiency
of dealing with the reports.

599
00:35:17,256 --> 00:35:22,662
But now, recently, that builds
on the Wikidata Query Service,

600
00:35:23,181 --> 00:35:24,998
and well, we recently have been throttled

601
00:35:24,999 --> 00:35:26,560
so again, that doesn't scale.

602
00:35:26,561 --> 00:35:31,391
So it's still an ongoing process,
how to deal with models on Wikidata.

603
00:35:32,202 --> 00:35:36,682
And so again,
ShEx is not only intimidating

604
00:35:36,683 --> 00:35:40,356
but also the scale is just
too big to deal with.

605
00:35:41,068 --> 00:35:46,081
So I started working, this is my first
proof of concept or exercise

606
00:35:46,082 --> 00:35:47,680
where I used a tool called yEd,

607
00:35:48,184 --> 00:35:52,590
and I started to draw
those Shape Expressions and because...

608
00:35:52,591 --> 00:35:58,098
and then regenerate this schema

609
00:35:58,099 --> 00:36:01,279
into this JSON format
of the Shape Expressions,

610
00:36:01,280 --> 00:36:04,520
so that would open up already
to the audience

611
00:36:04,521 --> 00:36:07,432
that are intimidated
by the Shape Expressions language.

612
00:36:07,961 --> 00:36:12,308
But actually, there is a problem
with those visual descriptions

613
00:36:12,309 --> 00:36:18,229
because this is also a schema
that was actually drawn in yEd by someone.

614
00:36:18,230 --> 00:36:23,838
And here is another one
which is beautiful.

615
00:36:23,838 --> 00:36:29,414
I would love to have this on my wall,
but it is still not interoperable.

616
00:36:30,281 --> 00:36:32,131
So I want to end my talk with,

617
00:36:32,131 --> 00:36:35,732
and the first time, I've been
stealing this slide, using this slide.

618
00:36:35,732 --> 00:36:37,594
It's an honor to have him in the audience

619
00:36:37,595 --> 00:36:39,423
and I really like this:

620
00:36:39,424 --> 00:36:42,362
"People think RDF is a pain
because it's complicated.

621
00:36:42,362 --> 00:36:43,985
The truth is even worse, it's so simple,

622
00:36:45,581 --> 00:36:48,133
because you have to work
with real-world data problems

623
00:36:48,134 --> 00:36:50,031
that are horribly complicated.

624
00:36:50,031 --> 00:36:51,451
While you can avoid RDF,

625
00:36:51,451 --> 00:36:55,760
it is harder to avoid complicated data
and complicated computer problems."

626
00:36:55,761 --> 00:36:59,535
This is about RDF, but I think
this also applies to modeling as well.

627
00:37:00,112 --> 00:37:02,769
So my point of discussion
is should we really...

628
00:37:03,387 --> 00:37:05,882
How do we get modeling going?

629
00:37:05,882 --> 00:37:10,826
Should we discuss ShEx
or visual models or...

630
00:37:11,426 --> 00:37:13,271
How do we continue?

631
00:37:13,474 --> 00:37:14,840
Thank you very much for your time.

632
00:37:15,102 --> 00:37:17,787
(applause)

633
00:37:20,001 --> 00:37:21,188
(Lydia) Thank you so much.

634
00:37:21,692 --> 00:37:24,001
Would you come to the front

635
00:37:24,002 --> 00:37:27,741
so that we can open
the questions from the audience.

636
00:37:28,610 --> 00:37:30,203
Are there questions?

637
00:37:31,507 --> 00:37:32,507
Yes.

638
00:37:34,253 --> 00:37:36,890
And I think, for the camera, we need to...

639
00:37:38,835 --> 00:37:40,968
(Lydia laughing) Yeah.

640
00:37:43,094 --> 00:37:46,273
(man3) So a question
for Cristina, I think.

641
00:37:47,366 --> 00:37:51,641
So you mentioned exactly
the term "information gain"

642
00:37:51,642 --> 00:37:53,689
from linking with other systems.

643
00:37:53,690 --> 00:37:55,619
There is an information theoretic measure

644
00:37:55,620 --> 00:37:58,001
using statistic and probability
called information gain.

645
00:37:58,002 --> 00:37:59,541
Do you have the same...

646
00:37:59,542 --> 00:38:01,736
I mean did you mean exactly that measure,

647
00:38:01,736 --> 00:38:04,173
the information gain
from the probability theory

648
00:38:04,174 --> 00:38:05,240
from information theory

649
00:38:05,241 --> 00:38:09,024
or just use this conceptual thing
to measure information gain some way?

650
00:38:09,025 --> 00:38:13,016
No, so we actually defined
and implemented measures

651
00:38:13,695 --> 00:38:20,161
that are using the Shannon entropy,
so it's meant as that.

652
00:38:20,162 --> 00:38:22,696
I didn't want to go into
details of the concrete formulas...

653
00:38:22,697 --> 00:38:24,977
(man3) No, no, of course,
that's why I asked the question.

654
00:38:24,978 --> 00:38:26,698
- (Cristina) But yeah...
- (man3) Thank you.

655
00:38:33,091 --> 00:38:35,047
(man4) Maybe more
of a comment than a question.

656
00:38:35,048 --> 00:38:36,241
(Lydia) Go for it.

657
00:38:36,242 --> 00:38:39,840
(man4) So there's been
a lot of focus at the item level

658
00:38:39,840 --> 00:38:42,547
about quality and completeness,

659
00:38:42,547 --> 00:38:47,374
one of the things that concerns me is that
we're not applying the same to hierarchies

660
00:38:47,374 --> 00:38:51,480
and I think an issue we have
is that our hierarchy often isn't good.

661
00:38:51,481 --> 00:38:53,463
We're seeing
this is going to be a real problem

662
00:38:53,464 --> 00:38:55,774
with Commons searching and other things.

663
00:38:56,771 --> 00:39:00,601
One of the abilities that we can do
is to import external--

664
00:39:00,602 --> 00:39:04,842
The way that external thesauruses
structure their hierarchies,

665
00:39:04,842 --> 00:39:10,291
using the P4900
broader concept qualifier.

666
00:39:11,037 --> 00:39:16,167
But what I think would be really helpful
would be much better tools for doing that

667
00:39:16,168 --> 00:39:21,212
so that you can import an
external... thesaurus's hierarchy

668
00:39:21,212 --> 00:39:24,111
map that onto our Wikidata items.

669
00:39:24,111 --> 00:39:28,199
Once it's in place
with those P4900 qualifiers,

670
00:39:28,200 --> 00:39:31,494
you can actually do some
quite good querying through SPARQL

671
00:39:32,490 --> 00:39:37,534
to see where our hierarchy
diverges from that external hierarchy.

672
00:39:37,534 --> 00:39:41,346
For instance, [Paula Morma],
user PKM, you may know,

673
00:39:41,346 --> 00:39:43,533
does a lot of work on fashion.

674
00:39:43,533 --> 00:39:50,524
So we use that to pull in the Europeana
Fashion Thesaurus's hierarchy

675
00:39:50,524 --> 00:39:53,812
and the Getty AAT
fashion thesaurus hierarchy,

676
00:39:53,812 --> 00:39:57,957
and then see where the gaps
were in our higher level items,

677
00:39:57,957 --> 00:40:00,511
which is a real problem for us
because often,

678
00:40:00,511 --> 00:40:04,355
these are things that only exist
as disambiguation pages on Wikipedia,

679
00:40:04,356 --> 00:40:09,270
so we have a lot of higher level items
in our hierarchies missing

680
00:40:09,271 --> 00:40:14,480
and this is something that we must address
in terms of quality and completeness,

681
00:40:14,480 --> 00:40:15,971
but what would really help

682
00:40:16,643 --> 00:40:20,871
would be better tools than
the jungle of Perl scripts that I wrote...

683
00:40:20,872 --> 00:40:26,010
If somebody could put that
into a PAWS notebook in Python

684
00:40:26,561 --> 00:40:31,972
to be able to take an external thesaurus,
take its hierarchy,

685
00:40:31,973 --> 00:40:34,595
which may well be available
as linked data or may not,

686
00:40:35,379 --> 00:40:40,580
to then put those into
QuickStatements to put in P4900 values.

687
00:40:41,165 --> 00:40:42,165
And then later,

688
00:40:42,166 --> 00:40:44,527
when our representation
gets more complete,

689
00:40:44,528 --> 00:40:49,691
to update those P4900s
because as our representation gets updated,

690
00:40:49,691 --> 00:40:51,590
becomes more dense,

691
00:40:51,590 --> 00:40:55,377
the values of those qualifiers
need to change

692
00:40:56,230 --> 00:40:59,526
to represent that we've got more
of their hierarchy in our system.

693
00:40:59,526 --> 00:41:03,728
If somebody could do that,
I think that would be very helpful,

694
00:41:03,728 --> 00:41:07,121
and we do need to also
look at other approaches

695
00:41:07,122 --> 00:41:10,762
to improve quality and completeness
at the hierarchy level

696
00:41:10,763 --> 00:41:12,378
not just at the item level.

697
00:41:13,308 --> 00:41:14,840
(Andra) Can I add to that?

698
00:41:16,362 --> 00:41:19,901
Yes, and we actually do that,

699
00:41:19,911 --> 00:41:23,551
and I can recommend looking at
the Shape Expression that Finn made

700
00:41:23,552 --> 00:41:27,330
with the lexical data
where he creates Shape Expressions

701
00:41:27,330 --> 00:41:29,640
and then builds on other Shape Expressions

702
00:41:29,641 --> 00:41:32,528
so you have this concept
of linked Shape Expressions in Wikidata,

703
00:41:32,529 --> 00:41:35,005
and specifically, the use case,
if I understand correctly,

704
00:41:35,006 --> 00:41:37,183
is exactly what we are doing in Gene Wiki.

705
00:41:37,184 --> 00:41:40,841
So you have the Disease Ontology
which is put into Wikidata

706
00:41:40,842 --> 00:41:44,681
and then disease data comes in
and we apply the Shape Expressions

707
00:41:44,682 --> 00:41:47,247
to see if that fits with this thesaurus.

708
00:41:47,248 --> 00:41:50,919
And there are other thesauruses or other
ontologies for controlled vocabularies

709
00:41:50,920 --> 00:41:52,559
that still need to go into Wikidata,

710
00:41:52,559 --> 00:41:55,401
and that's exactly why
Shape Expression is so interesting

711
00:41:55,402 --> 00:41:57,963
because you can have a Shape Expression
for the Disease Ontology,

712
00:41:57,964 --> 00:41:59,644
you can have a Shape Expression for MeSH,

713
00:41:59,645 --> 00:42:01,761
you can say: OK,
now I want to check the quality.

714
00:42:01,762 --> 00:42:04,059
Because you also have
in Wikidata the context

715
00:42:04,060 --> 00:42:09,567
of when you have a controlled vocabulary,
you say the quality is according to this,

716
00:42:09,568 --> 00:42:11,636
but you might have
a disagreeing community.

717
00:42:11,636 --> 00:42:16,081
So the tooling is indeed in place,
but the task now is to create those models

718
00:42:16,082 --> 00:42:18,144
and apply them
on the different use cases.

719
00:42:18,811 --> 00:42:20,921
(man4) Shape Expressions are very useful

720
00:42:20,922 --> 00:42:25,928
once you have the external ontology
mapped into Wikidata,

721
00:42:25,929 --> 00:42:29,474
but my problem is that
it's getting to that stage,

722
00:42:29,475 --> 00:42:34,881
it's working out how much of the
external ontology isn't yet in Wikidata

723
00:42:34,882 --> 00:42:36,256
and where the gaps are,

724
00:42:36,257 --> 00:42:40,660
and that's where I think that
having much more robust tools

725
00:42:40,660 --> 00:42:44,286
to see what's missing
from external ontologies

726
00:42:44,286 --> 00:42:45,537
would be very helpful.

727
00:42:47,678 --> 00:42:49,062
The biggest problem there

728
00:42:49,062 --> 00:42:51,201
is not so much tooling
but more licensing.

729
00:42:51,803 --> 00:42:55,249
So getting the ontologies
into Wikidata is actually a piece of cake

730
00:42:55,250 --> 00:42:59,295
but most of the ontologies have,
how can I say that politely,

731
00:42:59,965 --> 00:43:03,256
restrictive licensing,
so they are not compatible with Wikidata.

732
00:43:04,068 --> 00:43:06,678
(man4) There's a huge number
of public sector thesauruses

733
00:43:06,678 --> 00:43:08,209
in cultural fields.

734
00:43:08,210 --> 00:43:10,851
- (Andra) Then we need to talk.
- (man4) Not a problem.

735
00:43:10,852 --> 00:43:12,384
(Andra) Then we need to talk.

736
00:43:13,624 --> 00:43:19,192
(man5) Just... the comment I want to make
is actually an answer to James,

737
00:43:19,192 --> 00:43:22,401
so the thing is that
hierarchies make graphs,

738
00:43:22,374 --> 00:43:24,041
and when you want to...

739
00:43:24,579 --> 00:43:28,888
I want to basically talk about...
a common problem in hierarchies

740
00:43:28,889 --> 00:43:30,820
is circular hierarchies,

741
00:43:30,821 --> 00:43:33,796
so they come back to each other
when there's a problem,

742
00:43:33,796 --> 00:43:35,920
which you should not
have in hierarchies.

743
00:43:37,022 --> 00:43:41,295
This, funnily enough,
happens in categories in Wikipedia a lot;

744
00:43:41,295 --> 00:43:42,990
we have a lot of circles in categories,

745
00:43:43,898 --> 00:43:46,612
but the good news is that this is...

746
00:43:47,713 --> 00:43:51,582
Technically, it's an NP-complete problem,
so you cannot find this

747
00:43:51,583 --> 00:43:53,414
easily if you build a graph of that,

748
00:43:54,473 --> 00:43:57,046
but there are lots of ways
that have been developed

749
00:43:57,047 --> 00:44:00,624
to find problems
in these hierarchy graphs.

750
00:44:00,625 --> 00:44:04,860
Like there is a paper
called *Finding Cycles*...

751
00:44:04,861 --> 00:44:07,955
*Breaking Cycles in Noisy Hierarchies,*

752
00:44:07,956 --> 00:44:12,671
and it's been used to help
categorization of English Wikipedia.

753
00:44:12,672 --> 00:44:17,141
You can just take this
and apply it to these hierarchies in Wikidata,

754
00:44:17,142 --> 00:44:19,540
and then you can find
things that are problematic

755
00:44:19,541 --> 00:44:22,481
and just remove the ones
that are causing issues

756
00:44:22,482 --> 00:44:24,593
and find the issues, actually.

757
00:44:24,594 --> 00:44:26,960
So this is just an idea, just so you...

758
00:44:28,780 --> 00:44:29,930
(man4) That's all very well

759
00:44:29,931 --> 00:44:34,402
but I think you're underestimating
the number of bad subclass relations

760
00:44:34,402 --> 00:44:35,402
that we have.

761
00:44:35,403 --> 00:44:39,680
It's like having a city
in completely the wrong country,

762
00:44:40,250 --> 00:44:44,874
and there are tools for geography
to identify that,

763
00:44:44,875 --> 00:44:49,201
and we need to have
much better tools in hierarchies

764
00:44:49,202 --> 00:44:53,477
to identify where the equivalent
of the item for the country

765
00:44:53,478 --> 00:44:57,673
is missing entirely,
or where it's actually been subclassed

766
00:44:57,674 --> 00:45:01,804
to something that means
something completely different.

767
00:45:02,804 --> 00:45:07,165
(Lydia) Yeah, I think
you're getting to something

768
00:45:07,166 --> 00:45:12,024
that my team and I keep hearing
from people who reuse our data

769
00:45:12,025 --> 00:45:13,991
quite a bit as well, right.

770
00:45:15,002 --> 00:45:16,638
An individual data point might be great

771
00:45:16,639 --> 00:45:20,163
but if you have to look
at the ontology and so on,

772
00:45:20,164 --> 00:45:21,857
then it gets very...

773
00:45:22,388 --> 00:45:26,437
And I think one of the big problems
why this is happening

774
00:45:26,437 --> 00:45:30,736
is that a lot of editing on Wikidata

775
00:45:30,736 --> 00:45:34,544
happens on the basis
of an individual item, right,

776
00:45:34,545 --> 00:45:36,201
you make an edit on that item,

777
00:45:37,653 --> 00:45:42,075
without realizing that this
might have very global consequences

778
00:45:42,075 --> 00:45:44,245
on the rest of the graph, for example.

779
00:45:44,245 --> 00:45:50,040
And if people have ideas around
how to make this more visible,

780
00:45:50,041 --> 00:45:53,185
the consequences
of an individual local edit,

781
00:45:54,005 --> 00:45:56,537
I think that would be worth exploring,

782
00:45:57,550 --> 00:46:01,583
to show people better
what the consequence of their edit

783
00:46:01,584 --> 00:46:03,434
that they might do in very good faith,

784
00:46:04,481 --> 00:46:05,481
what that is.

785
00:46:06,939 --> 00:46:12,237
Whoa! OK, let's start with, yeah, you,
then you, then you, then you.

786
00:46:12,237 --> 00:46:13,921
(man5) Well, after the discussion,

787
00:46:13,922 --> 00:46:18,262
just to express my agreement
with what James was saying.

788
00:46:18,263 --> 00:46:22,467
So essentially, it seems
the most dangerous thing is the hierarchy,

789
00:46:22,468 --> 00:46:23,910
not the hierarchy, but generally

790
00:46:23,911 --> 00:46:28,022
the semantics of the subclass relations
seen in Wikidata, right.

791
00:46:28,022 --> 00:46:32,561
So I've been studying languages recently,
just for the purposes of this conference,

792
00:46:32,562 --> 00:46:35,257
and for example, you find plenty of cases

793
00:46:35,257 --> 00:46:39,463
where a language is *part of*
and *subclass of* the same thing, OK.

794
00:46:39,463 --> 00:46:43,577
So you know, you can say
we have a flexible ontology.

795
00:46:43,577 --> 00:46:46,256
Wikidata gives you freedom
to express that, sometimes.

796
00:46:46,256 --> 00:46:47,257
Because, for example,

797
00:46:47,258 --> 00:46:50,721
that ontology of languages
is also politically complicated, right?

798
00:46:50,722 --> 00:46:55,038
It is even good to be in a position
to express a level of uncertainty.

799
00:46:55,038 --> 00:46:57,983
But imagine anyone who wants
to do machine reading from that.

800
00:46:57,984 --> 00:46:59,468
So that's really problematic.

801
00:46:59,468 --> 00:47:00,468
And then again,

802
00:47:00,469 --> 00:47:03,686
I don't think that ontology
was ever imported from somewhere,

803
00:47:03,687 --> 00:47:05,490
that's something which is originally ours.

804
00:47:05,491 --> 00:47:08,321
It's harvested from Wikipedia
in the very beginning, I would say.

805
00:47:08,322 --> 00:47:11,324
So I wonder...
this Shape Expressions thing is great,

806
00:47:11,325 --> 00:47:15,575
and also validating and fixing,
if you like, the Wikidata ontology

807
00:47:15,576 --> 00:47:18,191
by external resources, beautiful idea.

808
00:47:19,026 --> 00:47:20,026
In the end,

809
00:47:20,027 --> 00:47:25,440
will we end up reflecting
the external ontologies in Wikidata?

810
00:47:25,441 --> 00:47:28,651
And also, what do we do with
the core part of our ontology

811
00:47:28,652 --> 00:47:30,642
which was never harvested
from external resources,

812
00:47:30,643 --> 00:47:31,978
how do we go and fix that?

813
00:47:31,979 --> 00:47:35,276
And I really think that
that will be a problem on its own.

814
00:47:35,277 --> 00:47:39,010
We will have to focus on that
independently of the idea

815
00:47:39,010 --> 00:47:41,046
of validating ontology
with something external.

816
00:47:49,353 --> 00:47:53,379
(man6) OK, and constraints
and shapes are very impressive

817
00:47:53,380 --> 00:47:54,495
what we can do with it,

818
00:47:55,205 --> 00:47:58,481
but the main point is not
being really made clear--

819
00:47:58,482 --> 00:48:03,229
it's because now we can make more explicit
what we expect from the data.

820
00:48:03,229 --> 00:48:06,893
Before, everyone had to write
their own tools and scripts

821
00:48:06,894 --> 00:48:10,601
and so it's more visible
and we can discuss it.

822
00:48:10,602 --> 00:48:13,641
But because it's not about
what's wrong or right,

823
00:48:13,642 --> 00:48:15,870
it's about an expectation,

824
00:48:15,870 --> 00:48:18,105
and you will have different
expectations and discussions

825
00:48:18,106 --> 00:48:20,737
about how we want
to model things in Wikidata,

826
00:48:21,246 --> 00:48:23,095
and this...

827
00:48:23,096 --> 00:48:26,280
The current state is just
one step in the direction

828
00:48:26,281 --> 00:48:28,041
because now you need

829
00:48:28,042 --> 00:48:31,041
a lot of technical expertise
to get into this,

830
00:48:31,042 --> 00:48:35,721
and we need better ways
to visualize these constraints,

831
00:48:35,722 --> 00:48:39,995
to transform them maybe into natural language
so people can better understand,

832
00:48:40,939 --> 00:48:43,768
but it's less about what's wrong or right.

833
00:48:44,925 --> 00:48:45,925
(Lydia) Yeah.

834
00:48:50,986 --> 00:48:53,893
(man7) So for quality issues,
I just want to echo it like...

835
00:48:53,894 --> 00:48:57,010
I've definitely found a lot of the issues
I've encountered have been

836
00:48:58,838 --> 00:49:02,330
differences in opinion
between *instance of* versus *subclass*.

837
00:49:02,331 --> 00:49:05,963
I would say errors in those situations

838
00:49:05,963 --> 00:49:11,521
and trying to find those
has been a very time-consuming process.

839
00:49:11,522 --> 00:49:14,840
What I've found is like:
"Oh, if I find very high-impression items

840
00:49:14,840 --> 00:49:16,051
that are something...

841
00:49:16,052 --> 00:49:21,628
and then use all the subclass instances
to find all derived statements of this,"

842
00:49:21,628 --> 00:49:26,215
this is a very useful way
of looking for these errors.

843
00:49:26,215 --> 00:49:28,067
But I was curious if Shape Expressions,

844
00:49:29,841 --> 00:49:31,582
if there is...

845
00:49:31,583 --> 00:49:36,934
If this can be used as a tool
to help resolve those issues but, yeah...

846
00:49:40,514 --> 00:49:42,555
(man8) If it has a structural footprint...

847
00:49:45,910 --> 00:49:49,310
If it has a structural footprint
that you can... that's sort of falsifiable,

848
00:49:49,310 --> 00:49:51,191
you can look at that
and say well, that's wrong,

849
00:49:51,192 --> 00:49:52,670
then yeah, you can do that.

850
00:49:52,671 --> 00:49:56,921
But if it's just sort of
trying to map it to real-world objects,

851
00:49:56,922 --> 00:49:59,082
then you're just going to need
lots and lots of brains.

852
00:50:05,768 --> 00:50:08,631
(man9) Hi, Pablo Mendes
from Apple Siri Knowledge.

853
00:50:09,154 --> 00:50:12,770
We're here to find out how to help
the project and the community

854
00:50:12,770 --> 00:50:15,645
but Cristina made the mistake
of asking what we want.

855
00:50:16,471 --> 00:50:20,052
(laughing) So I think
one thing I'd like to see

856
00:50:20,958 --> 00:50:23,521
is a lot around verifiability

857
00:50:23,522 --> 00:50:26,372
which is one of the core tenets
of the project and the community,

858
00:50:27,062 --> 00:50:28,590
and trustworthiness.

859
00:50:28,590 --> 00:50:32,412
Not every statement is the same,
some of them are heavily disputed,

860
00:50:32,413 --> 00:50:33,653
some of them are easy to guess,

861
00:50:33,654 --> 00:50:35,541
like somebody's
date of birth can be verified,

862
00:50:36,071 --> 00:50:39,082
as you saw today in the Keynote,
gender issues are a lot more complicated.

863
00:50:40,205 --> 00:50:42,130
Can you discuss a little bit what you know

864
00:50:42,131 --> 00:50:47,271
in this area of data quality around
trustworthiness and verifiability?

865
00:50:55,442 --> 00:50:58,138
If there isn't a lot,
I'd love to see a lot more. (laughs)

866
00:51:00,646 --> 00:51:01,646
(Lydia) Yeah.

867
00:51:03,314 --> 00:51:06,548
Apparently, we don't have
a lot to say on that. (laughs)

868
00:51:08,024 --> 00:51:12,299
(Andra) I think we can do a lot,
but I had a discussion with you yesterday.

869
00:51:12,300 --> 00:51:15,774
My favorite example I learned yesterday
that's already deprecated

870
00:51:15,774 --> 00:51:20,281
is if you go to Q2, which is Earth,

871
00:51:20,282 --> 00:51:23,343
there is a statement
that claims that the Earth is flat.

872
00:51:24,183 --> 00:51:26,055
And I love that example

873
00:51:26,056 --> 00:51:28,391
because there is a community
out there that claims that

874
00:51:28,392 --> 00:51:30,417
and they have verifiable resources.

875
00:51:30,418 --> 00:51:32,254
So I think it's a genuine case,

876
00:51:32,255 --> 00:51:34,641
it shouldn't be deprecated,
it should be in Wikidata.

877
00:51:34,642 --> 00:51:40,385
And I think Shape Expressions
can be really instrumental there,

878
00:51:40,386 --> 00:51:41,832
because you can say,

879
00:51:41,833 --> 00:51:44,856
OK, I'm really interested
in this use case,

880
00:51:44,857 --> 00:51:47,129
or this is a use case where you disagree,

881
00:51:47,130 --> 00:51:51,059
but there can also be a use case
where you say OK, I'm interested.

882
00:51:51,059 --> 00:51:53,449
So there is this example you say,
I have glucose.

883
00:51:53,449 --> 00:51:55,841
And with glucose, when you're a biologist,

884
00:51:55,842 --> 00:52:00,176
you don't care for the chemical
constraints of the glucose molecule,

885
00:52:00,177 --> 00:52:03,201
you just... everything glucose
is the same.

886
00:52:03,202 --> 00:52:05,973
But if you're a chemist,
you cringe when you hear that,

887
00:52:05,973 --> 00:52:08,191
you have 200 something...

888
00:52:08,191 --> 00:52:10,443
So then you can have
multiple Shape Expressions,

889
00:52:10,443 --> 00:52:12,721
OK, I'm coming in with...
I have a chemist's view,

890
00:52:12,722 --> 00:52:13,887
I'm applying that.

891
00:52:13,887 --> 00:52:16,691
And then you say
I'm from a biological use case,

892
00:52:16,691 --> 00:52:18,524
I'm applying that Shape Expression.

893
00:52:18,524 --> 00:52:20,358
And then when you want to collaborate,

894
00:52:20,358 --> 00:52:22,784
yes, well you should talk
to Eric about ShEx maps.

895
00:52:23,910 --> 00:52:28,873
And so...
but this journey is just starting.

896
00:52:28,873 --> 00:52:32,238
But I personally believe
that it's quite instrumental in that area.

897
00:52:34,292 --> 00:52:35,535
(Lydia) OK. Over there.

898
00:52:37,949 --> 00:52:39,168
(laughs)

899
00:52:40,597 --> 00:52:46,035
(woman2) I had several ideas
from some points in the discussions,

900
00:52:46,035 --> 00:52:50,902
so I will try not to lose...
I had three ideas so...

901
00:52:52,394 --> 00:52:55,201
Based on what James said a while ago,

902
00:52:55,202 --> 00:52:59,001
we have had a very, very big problem
on Wikidata since the beginning

903
00:52:59,002 --> 00:53:01,574
with the upper ontology.

904
00:53:02,363 --> 00:53:05,339
We talked about that
two years ago at WikidataCon,

905
00:53:05,340 --> 00:53:07,432
and we talked about that at Wikimania.

906
00:53:07,432 --> 00:53:09,818
Well, whenever we have a Wikidata meeting

907
00:53:09,818 --> 00:53:11,656
we talk about that,

908
00:53:11,656 --> 00:53:15,782
because it's a very big problem
at a very, very high level:

909
00:53:15,783 --> 00:53:23,118
what an entity is, what a work is,
what a genre is, art;

910
00:53:23,118 --> 00:53:25,461
these are really the biggest concepts.

911
00:53:26,195 --> 00:53:33,117
And that's actually
a very weak point of the global ontology

912
00:53:33,118 --> 00:53:37,453
because people try to clean up regularly

913
00:53:38,017 --> 00:53:41,047
and break everything down the line,

914
00:53:42,516 --> 00:53:48,649
because yes, I think some of you
may remember the guy who in good faith

915
00:53:48,649 --> 00:53:51,785
broke absolutely all cities in the world.

916
00:53:51,785 --> 00:53:57,537
They were not geographical items anymore,
so constraint violations everywhere.

917
00:53:58,720 --> 00:54:00,278
And it was in good faith

918
00:54:00,278 --> 00:54:03,623
because he was really
correcting a mistake in an item,

919
00:54:04,170 --> 00:54:05,732
but everything broke down.

920
00:54:06,349 --> 00:54:09,373
And I'm not sure how we can solve that

921
00:54:10,216 --> 00:54:15,709
because there is actually
no external institution we could just copy

922
00:54:15,710 --> 00:54:18,490
because everyone is working on...

923
00:54:19,154 --> 00:54:22,041
Well, if I am a performing arts database,

924
00:54:22,042 --> 00:54:24,601
I will just go
to the performing arts level;

925
00:54:24,601 --> 00:54:29,361
I won't go to the philosophical concept
of what an entity is,

926
00:54:29,362 --> 00:54:31,201
and that's actually...

927
00:54:31,202 --> 00:54:34,561
I don't know any database
which is working at this level,

928
00:54:34,562 --> 00:54:36,827
but that's the weakest point of Wikidata.

929
00:54:37,936 --> 00:54:40,812
And probably,
when we are talking about data quality,

930
00:54:40,812 --> 00:54:44,034
that's actually a big part of it, so...

931
00:54:44,034 --> 00:54:48,569
And I think it's the same
we have stated in...

932
00:54:48,569 --> 00:54:50,452
Oh, I am sorry, I am changing the subject,

933
00:54:51,401 --> 00:54:55,774
but we have stated
in different sessions about quality,

934
00:54:55,774 --> 00:54:59,398
which is that some of us
are actually doing a good modeling job,

935
00:54:59,399 --> 00:55:01,240
are doing ShEx,
are doing things like that.

936
00:55:01,967 --> 00:55:07,655
People don't see it on Wikidata,
they don't see the ShEx,

937
00:55:07,655 --> 00:55:10,392
they don't see the WikiProject
on the discussion page,

938
00:55:10,393 --> 00:55:11,393
and sometimes,

939
00:55:11,394 --> 00:55:14,958
they don't even see
the talk pages of properties,

940
00:55:14,958 --> 00:55:19,628
which explicitly state:
this property is used for that.

941
00:55:19,628 --> 00:55:23,887
Like last week,
I added constraints to a property.

942
00:55:23,888 --> 00:55:26,324
The constraint was explicitly written

943
00:55:26,325 --> 00:55:28,690
in the discussion
of the creation of the property.

944
00:55:28,690 --> 00:55:34,548
I just created the technical part
of adding the constraint, and someone said:

945
00:55:34,548 --> 00:55:37,182
"What! You broke down all my edits!"

946
00:55:37,183 --> 00:55:41,542
And he was using the property
wrongly for the last two years.

947
00:55:41,542 --> 00:55:46,868
And the property was actually very clear,
but there were no warnings and everything,

948
00:55:46,869 --> 00:55:49,922
and so, it's the same as at the Pink Pony:
we said at Wikimania

949
00:55:49,922 --> 00:55:54,719
to make WikiProjects more visible
or to make ShEx more visible, but...

950
00:55:54,719 --> 00:55:56,917
And that's what Cristina said.

951
00:55:56,917 --> 00:56:02,368
We have a visibility problem
of what the existing solutions are.

952
00:56:02,368 --> 00:56:04,242
And at this session,

953
00:56:04,242 --> 00:56:06,862
we are all talking about
how to create more ShEx,

954
00:56:06,863 --> 00:56:10,727
or to facilitate the jobs
of the people who are doing the cleanup.

955
00:56:11,605 --> 00:56:15,835
But we have been cleaning up
since the first day of Wikidata,

956
00:56:15,836 --> 00:56:20,921
and globally, we are losing,
and we are losing because, well,

957
00:56:20,922 --> 00:56:22,960
if I know names are complicated

958
00:56:22,961 --> 00:56:26,162
but I am the only one
doing the cleaning up job,

959
00:56:26,662 --> 00:56:29,671
the guy who added
Latin-script names

960
00:56:29,672 --> 00:56:31,584
to all Chinese researchers,

961
00:56:32,088 --> 00:56:35,616
it will take me months to clean that up
and I can't do it alone,

962
00:56:35,616 --> 00:56:38,777
and he did one massive batch.

963
00:56:38,777 --> 00:56:40,241
So we really need...

964
00:56:40,242 --> 00:56:44,158
we have a visibility problem
more than a tool problem, I think,

965
00:56:44,158 --> 00:56:45,733
because we have many tools.

966
00:56:45,733 --> 00:56:50,255
(Lydia) Right, so unfortunately,
I've been shown a sign (laughs),

967
00:56:50,256 --> 00:56:52,121
so we need to wrap this up.

968
00:56:52,122 --> 00:56:53,563
Thank you so much for your comments,

969
00:56:53,563 --> 00:56:56,611
I hope you will continue discussing
during the rest of the day,

970
00:56:56,611 --> 00:56:57,840
and thanks for your input.

971
00:56:58,359 --> 00:56:59,944
(applause)