\chapter{Introduction}
In the cultural heritage sector there is a long tradition of building catalogues. Over the centuries museums, archives and libraries developed different systems to record their collections.
There is no single good definition of quality, but much of the literature agrees that quality should somehow be in line with `fitness for purpose', i.e. the quality of an object should be measured by how well the object supports a given purpose. The main purposes of cultural heritage metadata are registering the collection and helping users in discovery. The functional analysis of the MARC 21 format (the most popular metadata schema for bibliographic records) goes further and sets up functional groups, such as search, identify, select, manage and process, and classifies the underlying schema elements into these categories~\cite{frbr1998, delsey2003, loc2006}. So by analysing the fields of the individual records, we can tell more precisely which aspects of quality are good or bad.
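To make this concrete, the following minimal sketch shows one way such a functional analysis could be operationalised: each field is mapped to the user tasks it supports, and a record's support for a task is computed as the share of the mapped fields that are actually present. The field--task mapping below is purely illustrative; it is not the mapping defined in the cited analyses.
\begin{verbatim}
# A minimal sketch of functional-group scoring; the field-to-task
# mapping is illustrative, not the one defined in the cited analyses.
FIELD_TO_TASKS = {
    "245$a": {"search", "identify", "select"},  # title proper
    "100$a": {"search", "identify"},            # main entry, personal name
    "650$a": {"search", "select"},              # topical subject heading
    "008":   {"manage", "process"},             # fixed-length control data
}

def task_support(record_fields):
    """Share of task-relevant fields that are present in the record."""
    scores = {}
    for field, tasks in FIELD_TO_TASKS.items():
        for task in tasks:
            present, total = scores.get(task, (0, 0))
            scores[task] = (present + (field in record_fields), total + 1)
    return {task: present / total
            for task, (present, total) in scores.items()}

print(task_support({"245$a", "100$a"}))
# approximately: search 0.67, identify 1.0, select 0.5,
#                manage 0.0, process 0.0
\end{verbatim}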
These records are not only for registration and for helping the discovery of the materials; they are also sources for further research in the humanities. The catalogue contains a wealth of factual information which is not available in other sources (or not in an organised way), and therefore, before the age of digitisation, one could find the printed catalogues of the most important collections (e.g. British Library, Library of Congress etc.) in the reading rooms of research institutions. In the past two decades several research projects attached existing library metadata to different types of full-text datasets (optical character recognised or XML-encoded versions) to provide additional facets for the analysis process, such as personal or institutional names (creators, publishers), geographical information (places of publication), time spans and so on.
Just a few examples: KOLIMO (Corpus of Literary Modernism)\footnote{\url{https://kolimo.uni-goettingen.de/index.html}} uses TEI headers containing catalogue information as well as other metadata for extracting literature and language features specific to a given time period or to a particular author. OmniArt~\cite{strezoski2017} is a research project based on the metadata of the Rijksmuseum (Amsterdam), the Metropolitan Museum of Art (New York) and the Web Gallery of Art\footnote{\url{https://www.wga.hu/}}. They collected 432,217 digital images with curated metadata (the largest collection of its kind) to run categorical analyses. Benjamin Schmidt uses the HathiTrust\footnote{\url{https://www.hathitrust.org/}} digital library and its metadata records to test machine learning classification algorithms, comparing the results with the Library of Congress subject headings available in the metadata records~\cite{smith2017}. The common feature of these projects is that they use cultural heritage institutions' catalogue data as primary sources in their own research. It is self-evident that the quality of those data might have an effect on the conclusions of the research; on the other hand, it is beyond the responsibilities and possibilities of a researcher (or even a research group) to validate the records one by one and fix them as needed.
This third use case of cultural heritage data has recently become so frequent that two years ago it led to the coining of a new phrase: ``collections as data''. As the Santa Barbara Statement on Collections as Data~\cite{santabarbarastatement2017} summarises: ``For decades, cultural heritage institutions have been building digital collections. Simultaneously, researchers have drawn upon computational means to ask questions and look for patterns. This work goes under a wide variety of names including but not limited to text mining, data visualisation, mapping, image analysis, audio analysis, and network analysis. With notable exceptions [...], cultural heritage institutions have rarely built digital collections or designed access with the aim to support computational use. Thinking about collections as data signals an intention to change that.'' While the collections as data movement emphasises the importance of the re-usability of cultural heritage data, and we expect that this great and important movement will help organisations to think more about the scientific usage of their metadata,\footnote{A 2016 report which analyses the usage of two important British cultural heritage collections mentions that ``The citation evidence that is available shows a growing literature that mentions using EEBO [Early English Books Online] or HCPP [House of Commons Parliamentary Papers]'', and ``Shifts to humanities data science and data-driven research are of growing interest to scholars''.~\cite{meyer2016}} its principles focus on access and on getting rid of current barriers, and miss the aspect of quality. The quality assessment aspect we propose in this project would be a complementary element next to the other principles.
%%%
\section{Metadata quality}
\begin{quote}
``We know it [i.e. metadata quality] when we see it, but conveying the full bundle of assumptions and experience that allow us to identify it is a different matter.'' (Bruce and Hillmann)~\cite{bruce-hillmann2004}
\end{quote}
The (US) National Information Standards Organization (NISO) provides a definition for metadata: ``structured information that describes, explains, locates, or otherwise represents something else.''~\cite{framework2007} The interesting thing in this definition is the list of verbs: describes, explains, locates, and represents. Metadata is not a static entity; it has multiple functions and should be seen in the context of other entities. That is in harmony with the famous quality assurance slogan `fitness for purpose'. There are different definitions of the slogan, some of them being:
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item fulfilment of a specification or stated outcomes
\item measured against what is seen to be the goal of the unit
\item achieving institutional mission and objectives
\end{itemize}
From these definitions we can draw two important conclusions:
1) an object's quality is not an absolute value; it depends on the context of the object, on what goal(s) the agents in the current context would like to achieve with the help of the object;
2) quality is a multi-faceted value: as the object might have different functions, we should evaluate the fulfilment of each of them independently.
NISO's definition of metadata fits nicely into this framework, as it highlights the multi-faceted and contextual nature of metadata.
In an aggregated metadata collection such as Europeana, the main purpose of the metadata is to provide access points to the objects which the metadata describe (and which are stored remotely at the providing cultural heritage institutions, outside of Europeana). If the metadata stored in Europeana are of low quality or missing, the service will not be able to provide access points, and the user will not find and use the object.
% more explanation:
% Data on the Web Best Practices
% W3C Working Draft, https://www.w3.org/TR/dwbp/
As Bruce and Hillmann state, an expert can recognise whether a given metadata record is ``good'' or ``bad''. What we would like to achieve is to formalise this knowledge by setting up the dimensions of quality, and establishing metrics and measurement methods.
\section{Metrics in the literature}
In the literature of metadata quality assessment (see Appendix A) one can find a number of metric definitions. In this section I review some of those which proved to be relevant in my research.
Regarding the cultural heritage context, Bruce and Hillmann's above-cited seminal paper~\cite{bruce-hillmann2004} defines the data quality metrics. Palavitsinis, in his PhD thesis~\cite{palavitsinis2014}, summarises them as follows:
\emph{Completeness}: Number of metadata elements filled out by the annotator in comparison to the total number of elements in the application profile
\emph{Accuracy}: In an accurate metadata record, the data contained in the fields correspond to the resource that is being described
\emph{Consistency}: Consistency measures the degree to which the metadata values provided are compliant with what is defined by the metadata application profile
\emph{Objectiveness}: Degree to which the metadata values provided describe the resource in an unbiased way, without undermining or promoting the resource
\emph{Appropriateness}: Degree to which the metadata values provided facilitate the deployment of search mechanisms on top of the repositories
\emph{Correctness}: The degree to which the language used in the metadata is syntactically and grammatically correct
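Of these, completeness is the most straightforward to operationalise. The following minimal sketch computes it as the share of application profile elements that carry a non-empty value; the profile and the sample record are hypothetical.
\begin{verbatim}
# A minimal sketch of Bruce and Hillmann's completeness; the
# application profile and the sample record are hypothetical.
APPLICATION_PROFILE = ["title", "creator", "date", "subject", "rights"]

def completeness(record):
    """Fraction of profile elements with a non-empty value."""
    filled = sum(1 for field in APPLICATION_PROFILE if record.get(field))
    return filled / len(APPLICATION_PROFILE)

record = {"title": "Moby-Dick", "creator": "Herman Melville", "date": ""}
print(completeness(record))  # 0.4 -- two of the five elements are filled
\end{verbatim}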
The same author -- analysing the metadata quality literature, focusing mainly on Learning Object Repositories metadata -- lists the following additional dimensions proposed by different authors: accessibility, conformance, currency, intelligibility, objectiveness, presentation, provenance, relevancy and timeliness. He also repeats the categorisation of Lee et al.~\cite{lee2002} regarding the quality dimensions:
\emph{Intrinsic Metadata Quality}: represents dimensions that recognise that metadata may have innate correctness regardless of the context in which it is being used. For example, metadata for a digital object may be more or less `accurate' or `unbiased' in its own right,
\emph{Contextual Metadata Quality}: recognises that perceived quality may vary according to the particular task at hand, and that quality must be relevant, timely, complete, and appropriate in terms of amount, so as to add value to the purpose for which the information will be used,
\emph{Representational Metadata Quality}: addresses the degree to which the metadata being assessed is easy to understand and is presented in a clear manner that is concise and consistent,
\emph{Accessibility Metadata Quality}: references the ease with which the metadata is obtained, including the availability of the metadata and timeliness of its receipt.
Amrapali Zaveri and her colleagues surveyed the Linked Data quality literature in 2015~\cite{zaveri2015}. Their work became the most cited paper on data quality. They investigated what quality dimensions and metrics were suggested by other authors and grouped individual metrics into the following dimensions:
\emph{Accessibility dimensions}
Availability -- the extent to which data (or some portion of it) is present, obtainable, and ready for use. The metrics of this dimension are:
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item A1 accessibility of the SPARQL endpoint and the server
\item A2 accessibility of the RDF dumps
\item A3 dereferenceability of the URI
\item A4 no misreported content types
\item A5 dereferenced forward-links
\end{itemize}
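Several of these metrics translate into simple automated checks. As a rough sketch, A3 (dereferenceability) and A4 (content types) could be probed as below; this assumes the Python \texttt{requests} package, and a production implementation would need rate limiting and proper content negotiation:
\begin{verbatim}
# A rough sketch of checking A3 (dereferenceability) and A4 (content
# type) for a single URI; assumes the `requests` package.
import requests

def dereference(uri, accept="application/rdf+xml"):
    """Return (is_dereferenceable, reported_content_type)."""
    try:
        response = requests.head(uri, headers={"Accept": accept},
                                 allow_redirects=True, timeout=10)
        return response.ok, response.headers.get("Content-Type")
    except requests.RequestException:
        return False, None

ok, content_type = dereference("http://example.org/resource/1")
print(ok, content_type)
\end{verbatim}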
Licensing -- the granting of permission for a customer to reuse a dataset under defined conditions.
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item L1 machine-readable indication of a license
\item L2 human-readable indication of a license
\item L3 specifying the correct license
\end{itemize}
Interlinking -- the degree to which entities that represent the same concept are linked to each other, be it within or between two or more data sources.
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item I1 detection of good quality interlinks
\item I2 existence of links to external data providers
\item I3 dereferenced back-links
\end{itemize}
Security -- the extent to which data is protected against alteration and misuse.
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item S1 usage of digital signatures
\item S2 authenticity of the dataset
\end{itemize}
Performance -- the efficiency of a system that binds to a large dataset.
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item P1 usage of slash-URIs
\item P2 low latency
\item P3 high throughput
\item P4 scalability of a data source
\end{itemize}
\emph{Intrinsic dimensions}
Syntactic validity -- the degree to which an RDF document conforms to the specification of the serialization format
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item SV1 no syntax errors of the documents
\item SV2 syntactically accurate values
\item SV3 no malformed datatype literals
\end{itemize}
Semantic accuracy -- the degree to which data values correctly represent the real-world facts
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item SA1 no outliers
\item SA2 no inaccurate values
\item SA3 no inaccurate annotations, labellings or classifications
\item SA4 no misuse of properties
\item SA5 detection of valid rules
\end{itemize}
Consistency -- a knowledge base is free of (logical/formal) contradictions with respect to particular knowledge representation and inference mechanisms
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item CS1 no use of entities as members of disjoint classes
\item CS2 no misplaced classes or properties
\item CS3 no misuse of owl:DatatypeProperty or owl:ObjectProperty
\item CS4 members of owl:DeprecatedClass or owl:DeprecatedProperty not used
\item CS5 valid usage of inverse-functional properties
\item CS6 absence of ontology hijacking
\item CS7 no negative dependencies/correlation among properties
\item CS8 no inconsistencies in spatial data
\item CS9 correct domain and range definition
\item CS10 no inconsistent values
\end{itemize}
Conciseness -- the minimization of redundancy of entities at the schema and the data level
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item CN1 high intensional conciseness
\item CN2 high extensional conciseness
\item CN3 usage of unambiguous annotations/labels
\end{itemize}
Completeness -- the degree to which all required information is present in a particular dataset
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item CM1 schema completeness
\item CM2 property completeness
\item CM3 population completeness
\item CM4 interlinking completeness
\end{itemize}
\emph{Contextual dimensions}
Relevancy -- the provision of information which is in accordance with the task at hand and important to the users’ query
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item R1 relevant terms within metainformation attributes
\item R2 coverage
\end{itemize}
Trustworthiness -- the degree to which the information is accepted to be correct, true, real, and credible
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item T1 trustworthiness of statements
\item T2 trustworthiness through reasoning
\item T3 trustworthiness of statements, datasets and rules
\item T4 trustworthiness of a resource
\item T5 trustworthiness of the information provider
\item T6 trustworthiness of information provided (content trust)
\item T7 reputation of the dataset
\end{itemize}
Understandability -- the ease with which data can be comprehended without ambiguity and be used by a human information consumer
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item U1 human-readable labelling of classes, properties and entities as well as presence of metadata
\item U2 indication of one or more exemplary URIs
\item U3 indication of a regular expression that matches the URIs of a dataset
\item U4 indication of an exemplary SPARQL query
\item U5 indication of the vocabularies used in the dataset
\item U6 provision of message boards and mailing lists
\end{itemize}
Timeliness -- how up-to-date data is relative to a specific task
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item TI1 freshness of datasets based on currency and volatility
\item TI2 freshness of datasets based on their data source
\end{itemize}
\emph{Representational dimensions}
Representational conciseness -- the representation of the data, which is compact and well formatted on the one hand and clear and complete on the other hand
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item RC1 keeping URIs short
\item RC2 no use of prolix RDF features
\end{itemize}
Interoperability -- the degree to which the format and structure of the information conform to previously returned information as well as data from other sources
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item IO1 re-use of existing terms
\item IO2 re-use of existing vocabularies
\end{itemize}
Interpretability -- technical aspects of the data, that is, whether information is represented using an appropriate notation and whether the machine is able to process the data
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item IN1 use of self-descriptive formats
\item IN2 detecting the interpretability of data
\item IN3 invalid usage of undefined classes and properties
\item IN4 no misinterpretation of missing values
\end{itemize}
Versatility -- the availability of the data in different representations and in an internationalized way
\vspace{0mm}
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item V1 provision of the data in different serialization formats
\item V2 provision of the data in various languages
\end{itemize}
Some of these metrics are relevant only in a Linked Data context (those which are LD technology specific, such as the ones about the SPARQL endpoint or the RDF dump). On the other hand, there are lots of metrics which are useful for non-linked metadata as well. For example, we will see in Chapter 2 that there is a tendency to add misinterpretable ad-hoc values into a placeholder (``+++EMPTY+++'', to quote an extreme case) when the value is missing. `V2 provision of the data in various languages' is a similar concept to the multilinguality I will describe in Chapter 3. Downloadable dumps are also very useful even if they are not in a specific (e.g. RDF) format.
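As a minimal sketch of how such placeholder values could be detected automatically (the pattern list is illustrative only, not an exhaustive catalogue of observed values):
\begin{verbatim}
# A minimal sketch of detecting ad-hoc placeholder values; the
# patterns are illustrative, not an exhaustive catalogue.
import re
from collections import Counter

PLACEHOLDER_PATTERNS = [
    re.compile(r"^\W*(empty|unknown|n/?a|none|null|-+)\W*$", re.IGNORECASE),
    re.compile(r"^\+{3}.*\+{3}$"),  # e.g. "+++EMPTY+++"
]

def placeholder_frequencies(values):
    """Count field values that look like placeholders, not real data."""
    counter = Counter()
    for value in values:
        if any(p.match(value.strip()) for p in PLACEHOLDER_PATTERNS):
            counter[value] += 1
    return counter

titles = ["A catalogue of birds", "+++EMPTY+++", "unknown", "n/a"]
print(placeholder_frequencies(titles))
# Counter({'+++EMPTY+++': 1, 'unknown': 1, 'n/a': 1})
\end{verbatim}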
\subsection{FAIR metrics}
One of the main recent developments regarding research data management was the formulation of the FAIR principles~\cite{wilkinson2016}. ``The FAIR Principles provide guidelines for the publication of digital resources such as datasets, code, workflows, and research objects, in a manner that makes them Findable, Accessible, Interoperable, and Reusable.'' It became the starting point of many different projects which either implement the principles or investigate further extensions. One of them is FAIRMetrics~\cite{wilkinson2018, fairmetrics}. It concentrates on the measurement aspects of the FAIR principles: how we can set up metrics upon which we can validate the ``fairness'' of research data.
The authors suggested that good metrics in general should have the following properties:
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item clear
\item realistic
\item discriminating
\item measurable
\item universal
\end{itemize}
There are 14 FAIR principles, and for each there is a metric. Each metric answers questions such as `What is being measured?', `Why should we measure it?', `How do we measure it?', `What is a valid result?', `For which digital resource(s) is this relevant?' etc.
The creators published the individual metrics as nanopublications, and they are working on an implementation. Besides the metrics they defined `Maturity Indicator tests', which are available as a REST API backed by a Ruby-based software called FAIR Evaluator. Maturity Indicators are an open set of metrics. Beyond the core set (which is presented by FAIRMetrics), the creators invited the research communities to create their own indicators. As they emphasise: ``we view FAIR as a continuum of `behaviors' exhibited by a data resource that increasingly enable machine discoverability and (re)use.''
The FAIRmetrics are as follows:
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item F1: Identifier Uniqueness (Whether there is a scheme to uniquely identify the digital resource.)
\item F1: Identifier persistence (Whether there is a policy that describes what the provider will do in the event an identifier scheme becomes deprecated.)
\item F2: Machine-readability of metadata (The availability of machine-readable metadata that describes a digital resource.)
\item F3: Resource Identifier in Metadata (Whether the metadata document contains the globally unique and persistent identifier for the digital resource.)
\item F4: Indexed in a searchable resource (The degree to which the digital resource can be found using web-based search engines.)
\item A1.1: Access Protocol (The nature and use limitations of the access protocol.)
\item A1.2: Access authorization (Specification of a protocol to access restricted content.)
\item A2: Metadata Longevity (The existence of metadata even in the absence/removal of data.)
\item I1: Use a Knowledge Representation Language (Use of a formal, accessible, shared, and broadly applicable language for knowledge representation.)
\item I2: Use FAIR Vocabularies (The metadata values and qualified relations should themselves be FAIR, for example, terms from open, community-accepted vocabularies published in an appropriate knowledge-exchange format.)
\item I3: Use Qualified References (Relationships within (meta)data, and between local and third-party data, have explicit and `useful' semantic meaning)
\item R1.1: Accessible Usage License (The existence of a license document, for both (independently) the data and its associated metadata, and the ability to retrieve those documents)
\item R1.2: Detailed Provenance (There is provenance information associated with the data, covering at least two primary types of provenance information: -- Who/what/When produced the data (i.e. for citation); -- Why/How was the data produced (i.e. to understand context and relevance of the data))
\item R1.3: Meets Community Standards (Certification, from a recognized body, of the resource meeting community standards.)
\end{itemize}
Most of these metrics measure the data repository rather than individual research datasets. In this thesis I do not work with research data (it is among my future plans), but it is good to note that FAIRmetrics does not cover classical metadata quality metrics (such as completeness, accuracy etc.), so even when it has a robust implementation, there will be space left for future research on research (meta)data quality. On the other hand, some of these metrics are applicable to cultural heritage data (e.g. persistent identifiers would help the ingestion process of Europeana, so the \emph{Identifier persistence} metric would be a useful indicator in this workflow).
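To illustrate, an \emph{Identifier persistence}-style indicator could be approximated in such an ingestion workflow by testing whether an identifier follows a recognised persistent identifier scheme. The scheme list and patterns below are my own simplifications, not the official FAIRMetrics tests:
\begin{verbatim}
# A simplified sketch of an F1-style check: does an identifier follow
# a recognised persistent-identifier scheme? Patterns are illustrative.
import re

PID_PATTERNS = {
    "doi":    re.compile(r"^10\.\d{4,9}/\S+$"),
    "handle": re.compile(r"^\d+(\.\d+)*/\S+$"),
    "urn":    re.compile(r"^urn:[a-z0-9][a-z0-9-]{0,31}:\S+$", re.IGNORECASE),
}

def identifier_scheme(identifier):
    """Return the first matching PID scheme, or None."""
    for scheme, pattern in PID_PATTERNS.items():
        if pattern.match(identifier):
            return scheme
    return None

print(identifier_scheme("10.1038/sdata.2016.18"))  # 'doi'
print(identifier_scheme("local-id-42"))            # None: not a known PID
\end{verbatim}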
\subsection{Vocabularies for validating Linked Data}
The domain of Linked Data (or semantic web) is based on the `Open World assumption', which means that objects (entities) and statements about them are separated: different agents can create statements about an object. In practice this means that there is no such concept as a ``record'', since the object does not have clear boundaries. Traditional record-based systems have schemas, which describe what kinds of statements can be made about an entity. For example, the Dublin Core Metadata Element Set consists of 15 metadata elements. If we would like to record the colour of a book in this schema, we cannot do it directly. Of course, we can put this information into a semantically more generic field, such as ``format'', but then we lose specificity, and colour will be stored together with other features such as size, dimensions, etc. In a Linked Data context the situation is different: we can easily introduce a new property and create a statement; however, we lose control of the schema. We cannot tell whether the new property is valid or not.
To solve this problem the W3C set up the RDF Data Shapes working group ``to produce a language for defining structural constraints on RDF graphs''\footnote{\url{https://www.w3.org/2014/data-shapes/charter}}. One of the results of this approach is the Shapes Constraint Language (SHACL)\footnote{\url{https://www.w3.org/TR/shacl/}. We should note that there is another approach for the same problem: Shape Expressions (ShEx), available at \url{http://shex.io}.}.
SHACL defines a vocabulary (see Table \ref{table:shacl}) from which one can create validation rules. It does not define metrics directly, but these constraint definitions are very useful building blocks of a data quality measurement system. The implementation of SHACL is based on Linked Data, but the definitions are meaningful in other contexts as well.
\begin{table}[ht]
\caption{Core constraints in SHACL}
\label{table:shacl}
\centering
\begin{tabular}{l|l}
category & constraints \\
\hline
Cardinality & minCount, maxCount \\
Types of values & class, datatype, nodeKind \\
Shapes & node, property, in, hasValue \\
Range of values & minInclusive, maxInclusive,\\
& minExclusive, maxExclusive \\
String based & minLength, maxLength, pattern, stem,\\
& uniqueLang \\
Logical constraints & not, and, or, xone \\
Closed shapes & closed, ignoredProperties \\
Property pair constraints & equals, disjoint, lessThan,\\
& lessThanOrEquals \\
Non-validating constraints & name, value, defaultValue \\
Qualified shapes & qualifiedValueShape, qualifiedMinCount,\\ & qualifiedMaxCount \\
\end{tabular}
\end{table}
Within the Europeana Data Quality Committee we plan to define frequently occurring metadata problems (or `anti-patterns') with SHACL.
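As an illustration, the following minimal sketch (assuming the Python \texttt{rdflib} and \texttt{pySHACL} packages) expresses one such anti-pattern -- a record without a title -- as a SHACL cardinality constraint and validates a sample record against it:
\begin{verbatim}
# A minimal sketch: validating a record against a SHACL shape.
# Assumes the rdflib and pyshacl packages; names are hypothetical.
from rdflib import Graph
from pyshacl import validate

# Shape: every ex:Record must have at least one dc:title.
shapes_ttl = """
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix ex: <http://example.org/> .

ex:RecordShape a sh:NodeShape ;
    sh:targetClass ex:Record ;
    sh:property [ sh:path dc:title ; sh:minCount 1 ] .
"""

# A record exhibiting the anti-pattern (no dc:title).
data_ttl = """
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix ex: <http://example.org/> .

ex:record1 a ex:Record ;
    dc:creator "Unknown" .
"""

data_graph = Graph().parse(data=data_ttl, format="turtle")
shapes_graph = Graph().parse(data=shapes_ttl, format="turtle")

conforms, _, results_text = validate(data_graph, shacl_graph=shapes_graph)
print(conforms)      # False: record1 has no dc:title
print(results_text)  # human-readable validation report
\end{verbatim}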
\subsection{Organising issues per responsible actors}
Christopher Groskopf, who wrote a guide for data journalists on how to recognise data issues~\cite{groskopf2015}, followed a different approach. He wrote a practical guide, not an academic paper, so he organised issues based on who can fix them. His main take-away messages are:
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item be sceptical about the data
\item check it with exploratory data analysis (see the sketch after this list)
\item check it early, check it often
\end{itemize}
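The second of these take-aways can be made routine with a handful of cheap checks. The sketch below (assuming the Python \texttt{pandas} package; the sample data are invented) prints a few signals corresponding to items in Groskopf's catalogue, such as duplicated rows, missing values and the 65536-row spreadsheet truncation:
\begin{verbatim}
# A minimal sketch of early exploratory checks; assumes the pandas
# package, and the sample data are invented.
import pandas as pd

def quick_audit(df):
    """Print a few cheap data-quality signals for a tabular dataset."""
    print("rows:", len(df))
    print("duplicated rows:", df.duplicated().sum())
    print("missing values per column:")
    print(df.isna().sum())
    if len(df) == 65536:
        print("warning: exactly 65536 rows -- possible Excel truncation")

quick_audit(pd.DataFrame({"title": ["A", "A", None],
                          "year":  [1900, 1900, 2020]}))
\end{verbatim}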
His categorisation is the following:
\emph{Issues that your source should solve}
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item Values are missing
\item Zeros replace missing values
\item Data are missing you know should be there
\item Rows or values are duplicated
\item Spelling is inconsistent
\item Name order is inconsistent
\item Date formats are inconsistent
\item Units are not specified
\item Categories are badly chosen
\item Field names are ambiguous
\item Provenance is not documented
\item Suspicious numbers are present
\item Data are too coarse
\item Totals differ from published aggregates
\item Spreadsheet has 65536 rows
\item Spreadsheet has dates in 1900 or 1904
\item Text has been converted to numbers
\end{itemize}
\emph{Issues that you should solve}
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item Text is garbled
\item Data are in a PDF
\item Data are too granular
\item Data was entered by humans
\item Aggregations were computed on missing values
\item Sample is not random
\item Margin-of-error is too large
\item Margin-of-error is unknown
\item Sample is biased
\item Data has been manually edited
\item Inflation skews the data
\item Natural/seasonal variation skews the data
\item Timeframe has been manipulated
\item Frame of reference has been manipulated
\end{itemize}
\emph{Issues a third-party expert should help you solve}
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item Author is untrustworthy
\item Collection process is opaque
\item Data asserts unrealistic precision
\item There are inexplicable outliers
\item An index masks underlying variation
\item Results have been p-hacked
\item Benford’s Law fails
\item It’s too good to be true
\end{itemize}
\emph{Issues a programmer should help you solve}
\begin{itemize}
\setlength{\parskip}{0pt}
\setlength{\itemsep}{0pt plus 1pt}
\item Data are aggregated to the wrong categories or geographies
\item Data are in scanned documents
\end{itemize}
Groskopf's list is not a definition of general metrics; it is a catalogue of anti-patterns. It was created in reflection of the data journalism context, and it implies that -- compared to cultural heritage data -- these projects are smaller in both the number of contributors and the number of records. On the other hand, the sole purpose of these data is to be used in data analysis, so during the data cleaning process the maintainer has more freedom than a librarian, who should keep multiple data reuse scenarios in mind. Despite these differences, cultural heritage projects can also draw inspiration from Groskopf's list.
\subsection{Conclusion about the metrics}
% pull all the red line arguments of the previous pages together
% - different approaches for the different nature of the data (fitness for purpose)
% - general metrics, format specific metrics, data and service metrics
% - finding individual issues
% - this is not a comprehensive overview
In the previous sections I reviewed some of the metrics and approaches. This is not a comprehensive overview (for those who would like to read a general review of metadata quality metrics I suggest the already cited thesis of Palavitsinis~\cite{palavitsinis2014}). What I wanted to show is that in different research areas or domains of activity there are quite different approaches to the measurement of metadata quality and the detection of individual issues. There are general metrics, such as completeness, and format-specific metrics, such as the ones for Linked Data collected by Zaveri and her colleagues or those I will discuss in Chapter 4 for MARC records. Some metrics measure the data, but there are also metrics which focus on the services that help users access the data (such as the existence of different API endpoints, or downloadable data dumps --- we could put most of the FAIRmetrics into this category). In one of the early papers on metadata quality~\cite{stvilia2007} Stvilia and his co-authors emphasised that the information quality (IQ) framework they created (which contains ``typologies of IQ variance, the activities affected, a comprehensive taxonomy of IQ dimensions along with general metric functions, and methods of framework operationalization'') should be applied to a data source by selecting the relevant IQ dimensions. In other words, not all metrics are useful in all situations; we should select the appropriate ones for each and every use case.
\section{Research objectives}
% outline and introduce your thesis
In this thesis I would like to answer the following questions:
Q1: What kinds of quality dimensions are meaningful in the context of two different cultural heritage data sources: the collection of Europeana and library catalogues in the MARC 21 format?
Q2: How can these be implemented in a flexible way, so that the solution remains easily extensible to measure the same metrics on data sources in other formats?
Since Europeana could be qualified as Big Data (at least in the cultural heritage domain), two more questions arose regarding scalability:
Q3: How can these measurements be implemented in a scalable way?
Q4: How can Big Data analysis be conducted with limited computational resources?
\subsection{The outline of this thesis}
In Chapter 2 I describe the main metrics for Europeana. I also give an overview of the tool I developed for implementing the measurements. Chapter 3 describes a new set of metrics, \emph{multilinguality}, which measures how users with different language backgrounds can access Europeana's data. Chapter 4 concentrates on traditional library metadata and shows the results of the validation of 16 catalogues. Chapter 5 sheds light on the questions of flexibility: how the tool abstracts measurements in order to support different metadata schemas. Chapter 6 concentrates on resource optimisation: how the tool (or other tools which use the same underlying technique, namely Apache Spark) should be optimised for speed in a multi-tenant environment with limited resources. Finally, Chapter 7 provides a conclusion and shows future plans.