Introduction
Metadata by definition is simply "data about data", information
about the objects stored within our collections, whether these
are in traditional or electronic formats. In the standard
library world, catalogue records are metadata, as they contain
information about the library's collection of "data",
i.e. the books, journals, and electronic resources that make up its collections. Metadata
records in the traditional library fulfil several functions:
they allow users to find items and to assess their usefulness,
and they allow librarians to administer them correctly.
The same principles apply to objects within the
digital library.
Types of metadata
Metadata can take several forms, some of which will be visible to the
user of a digital library system, while others operate behind
the scenes. The Digital Library Federation (DLF), a coalition of 15 major research
libraries in the USA, defines three types of metadata which
can apply to objects in a digital library:
- descriptive metadata: information describing the intellectual
content of the object, such as MARC cataloguing records,
finding aids or similar schemes
- administrative metadata: information necessary to allow a repository
to manage the object: this can include information on
how it was scanned, its storage format etc. (often called
technical metadata), copyright and licensing information,
and information necessary for the long-term preservation
of the digital objects (preservation metadata)
- structural metadata: information that ties each object to others
to make up logical units (for example, information that
relates individual images of pages from a book to the
others that make up the book itself)
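The three types listed above can be pictured as separate compartments of a single record. The following is a minimal sketch only: the class name, field names and sample values are hypothetical and are not taken from the DLF definitions or from any ODL system. It also assumes that a user-facing view exposes the descriptive compartment alone.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalObjectMetadata:
    """Hypothetical container grouping the three metadata types for one object."""
    # Descriptive: intellectual content, as a catalogue record would hold.
    descriptive: dict = field(default_factory=dict)
    # Administrative: technical, rights and preservation information.
    administrative: dict = field(default_factory=dict)
    # Structural: how this object fits into a larger logical unit.
    structural: dict = field(default_factory=dict)

    def user_visible(self) -> dict:
        """Return a copy of the descriptive metadata only (an assumption)."""
        return dict(self.descriptive)


# Invented sample data for one page image of a digitised book.
page = DigitalObjectMetadata(
    descriptive={"title": "An example book", "creator": "An example author"},
    administrative={"format": "image/tiff", "rights": "hypothetical licence"},
    structural={"part_of": "book-001", "sequence": 12},
)
print(page.user_visible())
```

Keeping the compartments apart in this way means that an interface can read the structural part, and a repository manager the administrative part, without either being mixed into what users search.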
In general, only descriptive metadata is visible to the users
of a system, who search and browse it to find and assess the
value of items in the collection. Administrative metadata
is usually only used by those who maintain the collection,
and structural metadata is generally used by the interface
which compiles individual digital objects into more meaningful
units (such as journal volumes) for the user.
Any metadata system for the Oxford Digital Library (ODL) will have to
handle all relevant types of metadata, and will have to be
compatible with that used in traditional libraries, at least
to the extent of allowing integrated searching. Clearly, any
system employed will have to be more powerful than the standard
mechanisms (such as the MARC record) currently in use in libraries,
which generally limit themselves to the first two types of
metadata.
XML: the language for metadata systems
A decision was reached very early in the planning of metadata
for the ODL that it should be expressed in the eXtensible
Markup Language (XML). This is a language designed initially
for marking up electronic text, but which has since then been
used for a wide variety of metadata applications. Its advantages
for metadata encoding are many: they include its robustness,
its software independence and hence its ready interchangeability
between systems, and the way in which its structure maps neatly
to that of many digital objects.1
An XML system can be expressed in two ways. The first, and longer
established, is the Document Type Definition (DTD),
which lists what tags may be employed within an XML document,
and also their content and relationships to each other. A
much newer method of encoding an XML system is XML Schema,
which expresses the rules an XML document has to follow in
a further, separate XML document. XML Schema and its related schema languages are much more
powerful than a DTD and, because a schema is itself just another type of XML data, allow for easier development and maintenance.
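The point that a schema is itself XML data can be illustrated with a short sketch. The `record`, `title` and `creator` element names below are invented for illustration. The code does not validate anything (Python's standard library has no schema validator); it only shows that an ordinary XML parser can read a schema, which is not true of the DTD rule shown beside it.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# A DTD rule is written in its own, non-XML syntax (shown for comparison only).
DTD_RULE = "<!ELEMENT record (title, creator)>"

# A comparable XML Schema rule is a well-formed XML document in its own right.
SCHEMA = f"""
<xs:schema xmlns:xs="{XS}">
  <xs:element name="record">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="creator" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

# Because the schema is XML, a general-purpose parser can inspect it.
schema_root = ET.fromstring(SCHEMA.strip())
declared = [el.get("name") for el in schema_root.iter(f"{{{XS}}}element")]
print(declared)
```

The same tools used to edit, store and transform metadata records can therefore be applied to the schema that governs them.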
Some specific metadata standards
The following is a selection of key metadata standards
which are currently being used or being assessed for their potential use
in the Oxford Digital Library. All conform to, or can be used
with, the XML encoding language, which is generally acknowledged
as the most robust and easily transferable system for holding
metadata.
A newly devised standard, which refines and extends the earlier
Making of America II (MOA) system, METS (the Metadata Encoding and Transmission Standard) is designed specifically
to encode descriptive, administrative, and structural metadata
for objects within a digital library. One of the few systems
designed specifically for digital libraries, it can fulfil
all basic requirements of electronic collections. METS is widely used in the digital library world
and is the standard underpinning the ODL's digital collections. It is often used in tandem with related standards, such as the Metadata Object Description Schema (MODS) for handling descriptive metadata, or MIX (NISO Metadata for Images in XML) for expressing technical metadata.
METS is written in XML Schema and so requires software that can handle this format, e.g. XML editors such as oXygen, and XML databases such as eXist or Tamino. As
METS depends on an elaborate system of cross references within
documents, it is better generated automatically. The ODL has developed a database-driven system for the creation of METS that allows
content producers to create, verify, and output their metadata (via a Web frontend) in the METS standard for the purpose of long-term preservation,
shareability with scholars and other interested parties, and use in any system supporting these standards or transformation for systems not yet supporting the standard.
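The reason automatic generation is preferable becomes clearer from a small sketch of the cross references involved. This is not the ODL's database-driven system, whose internals are not described here. The function names, identifiers and URLs are hypothetical, the output is a heavily reduced METS-like document, and it has not been validated against the METS schema.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("mets", METS)
ET.register_namespace("xlink", XLINK)
ET.register_namespace("dc", DC)


def m(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS}}}{tag}"


def build_mets(book_id: str, title: str, page_urls: list) -> ET.Element:
    """Build a reduced METS-like record for a book made of page images."""
    root = ET.Element(m("mets"), {"OBJID": book_id})

    # Descriptive metadata section, referenced later through its ID.
    dmd = ET.SubElement(root, m("dmdSec"), {"ID": "DMD1"})
    wrap = ET.SubElement(dmd, m("mdWrap"), {"MDTYPE": "DC"})
    xml_data = ET.SubElement(wrap, m("xmlData"))
    ET.SubElement(xml_data, f"{{{DC}}}title").text = title

    file_grp = ET.SubElement(ET.SubElement(root, m("fileSec")), m("fileGrp"))
    struct = ET.SubElement(root, m("structMap"), {"TYPE": "physical"})
    book = ET.SubElement(struct, m("div"), {"TYPE": "book", "DMDID": "DMD1"})

    for n, url in enumerate(page_urls, start=1):
        file_id = f"FILE{n}"  # generated once, then used in both places
        f = ET.SubElement(file_grp, m("file"), {"ID": file_id})
        ET.SubElement(f, m("FLocat"), {"LOCTYPE": "URL", f"{{{XLINK}}}href": url})
        page = ET.SubElement(book, m("div"), {"TYPE": "page", "ORDER": str(n)})
        ET.SubElement(page, m("fptr"), {"FILEID": file_id})
    return root


def unresolved_references(root: ET.Element) -> list:
    """List FILEID/DMDID values that point at no ID in the document."""
    ids = {el.get("ID") for el in root.iter() if el.get("ID")}
    refs = [el.get(a) for el in root.iter() for a in ("FILEID", "DMDID") if el.get(a)]
    return [r for r in refs if r not in ids]


root = build_mets(
    "book-001",
    "An example book",
    ["http://example.org/p1.tif", "http://example.org/p2.tif"],
)
print(unresolved_references(root))
```

Every page needs an identifier in the file section and a matching pointer in the structural map, so a hand-edited record of several hundred pages is easy to break. Generating both from one loop, and checking the result, avoids that.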
EAD (Encoded Archival Description) is an XML DTD used throughout the archival community for the
encoding of finding aids (collection-level descriptions).
Because of its extensive facilities to link to digital objects,
it is able to describe digital collections as well as their
more traditional counterparts. It is also designed to map
closely to key standards such as MARC, which allows EAD records
to be searched in tandem with those in longer established
formats.
EAD is capable of describing a digital collection and its internal
structure, from the topmost collection-level, down to individual
items: its item-level descriptions are, however, somewhat
limited, so making it unlikely to meet all metadata needs
at the item level. It can, however, be easily used in tandem
with other systems which have more extensive item-level facilities.
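The descent from collection level to item level can be sketched as follows. The fragment uses genuine EAD element names (`archdesc`, `did`, `unittitle`, `dsc`, `c01`, `c02`) but is deliberately incomplete: it omits the header and would not validate against the EAD DTD. The titles are invented.

```python
import xml.etree.ElementTree as ET

# Simplified, non-validating EAD-like fragment with invented content.
FINDING_AID = """
<ead>
  <archdesc level="collection">
    <did><unittitle>Example family papers</unittitle></did>
    <dsc>
      <c01 level="series">
        <did><unittitle>Correspondence</unittitle></did>
        <c02 level="item">
          <did><unittitle>Letter, 1 May 1850</unittitle></did>
        </c02>
      </c01>
    </dsc>
  </archdesc>
</ead>
"""


def outline(node, depth=0):
    """Yield (depth, level, title) for each described component, top down."""
    title = node.findtext("did/unittitle")
    if title is not None:
        yield depth, node.get("level"), title
        depth += 1
    for child in node:
        if child.tag != "did":  # the <did> holds the description itself
            yield from outline(child, depth)


for depth, level, title in outline(ET.fromstring(FINDING_AID.strip())):
    print("  " * depth + f"{level}: {title}")
```

Each level carries the same small descriptive block, which is why richer item-level detail is usually supplied by a second standard used alongside EAD.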
Many digital library projects use EAD; among the most significant
are MALVINE, the Online
Archive of California, the American
Heritage Project and ILEJ.
The TEI (Text Encoding Initiative) is the de facto standard for the encoding of most types of electronic
texts, and as such is used by almost all of the world's e-text
centres. A modular system, it incorporates a set of base tags,
to which can be added specialized sets for use in particular
applications, such as linguistic corpora, transcriptions of
manuscripts, or critical editions. It also includes extensive
facilities for descriptive metadata, most of which are located
in the TEI header, a section of every document which holds
information on the electronic text file itself and on the
source from which it is taken. The header is designed to map
closely to MARC for the purpose of creating a library catalogue entry for the electronic text file.
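The two roles of the header, describing the electronic file and describing its source, can be seen in a small sketch. The header below is a reduced, invented example in the namespace used by current versions of the TEI Guidelines; the `describe` function and the keys of its result are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Reduced TEI header with invented content.
HEADER = """
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt>
      <title>An example text: electronic edition</title>
    </titleStmt>
    <publicationStmt>
      <publisher>Example e-text centre</publisher>
    </publicationStmt>
    <sourceDesc>
      <bibl>Transcribed from a hypothetical printed edition.</bibl>
    </sourceDesc>
  </fileDesc>
</teiHeader>
"""


def describe(header_xml: str) -> dict:
    """Separate information on the electronic file from that on its source."""
    root = ET.fromstring(header_xml.strip())
    file_desc = root.find("tei:fileDesc", TEI_NS)
    return {
        "electronic_file": {
            "title": file_desc.findtext("tei:titleStmt/tei:title", namespaces=TEI_NS),
            "publisher": file_desc.findtext(
                "tei:publicationStmt/tei:publisher", namespaces=TEI_NS
            ),
        },
        "source": file_desc.findtext("tei:sourceDesc/tei:bibl", namespaces=TEI_NS),
    }


print(describe(HEADER))
```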
TEI has been used in a large number of projects, mostly those
with an extensive textual component, although a few, such
as the Bodleian's Toyota
Project, use it to render metadata for images only: a complete
list may be found at http://www.tei-c.org/Applications/.
The TEI is readily usable for item-level descriptions within
a digital library, although for those without a textual component
it is likely to be unnecessarily complex and to require a
large amount of redundant tagging.
The Dublin Core (DC) is a list of 15 basic fields designed initially
to describe web-based resources sufficiently to allow their
discovery by search engines. It is not an XML application
as such, but designates elements which might be incorporated
into such an application (as may be done in METS, for example).
Because the DC elements are so broad, they may be qualified
to limit their semantic range, which limits their functionality
for cross-searching but increases their precision.
Dublin Core has great potential as a basic set of metadata for digital
objects, but will often have to be supplemented by more detailed
information specific to the needs of these objects. As it
is not in itself a DTD or XML Schema, it needs to be used
in conjunction with, or embedded in, another XML application.
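The trade-off between qualified and unqualified elements, and the need for a host XML application, can be sketched together. The refinements in `REFINES` are genuine DCMI terms, but the table is a small excerpt, and the `record` wrapper element and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

# A few DCMI refinements and the broad element that each one narrows.
REFINES = {"created": "date", "abstract": "description", "alternative": "title"}


def dumb_down(qualified: dict) -> dict:
    """Collapse qualified terms to the broad elements for cross-searching."""
    simple = {}
    for term, value in qualified.items():
        simple.setdefault(REFINES.get(term, term), []).append(value)
    return simple


def to_xml(simple: dict) -> ET.Element:
    """Embed the elements in a host document (hypothetical <record> wrapper)."""
    record = ET.Element("record")
    for element, values in simple.items():
        for value in values:
            ET.SubElement(record, f"{{{DC}}}{element}").text = value
    return record


qualified = {"title": "An example title", "alternative": "Another title", "created": "1850"}
simple = dumb_down(qualified)
print(ET.tostring(to_xml(simple), encoding="unicode"))
```

Collapsing `alternative` into `title` loses the distinction between the two, which is the loss of precision accepted in return for wider searchability.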
Preservation metadata supports activities intended to ensure the long-term usability of
a digital resource. It is "the information a
repository uses to support the digital preservation process." PREMIS stands for "PREservation Metadata: Implementation Strategies", which is the
name of an international working group sponsored by OCLC and RLG from 2003 to 2005. That working group produced a report called the PREMIS Data Dictionary for
Preservation Metadata, which includes both a data dictionary and a substantial amount of
narrative about preservation metadata. An updated second version was issued in
March 2008. The Library of Congress maintains a schema for representing PREMIS
in XML.
There is an active PREMIS Maintenance Activity sponsored by the Library of
Congress. This includes a website linking to all sorts of official and unofficial
PREMIS information. The
Maintenance Activity also tries to promote awareness of PREMIS, sponsors tutorials
in using PREMIS, and commissions studies and publications related to PREMIS.
Usually, when people refer to "PREMIS" they mean the Data Dictionary.
Occasionally they may be referring to the XML schema, to the working group, or to
the entire effort including the Maintenance Activity.
RDF (the Resource Description Framework) is not a metadata scheme per se, but a system for encoding
such schemes within a standardized framework. Designed initially
for describing electronic resources on the internet, it provides
a standard way of describing element names, their content
and their relationships, so making it easier to find these
resources and to exchange information on them. RDF is usually
expressed in XML, and can be used as a framework for any metadata
scheme listed here. For further information see An Idiot's Guide to the
Resource Description Framework by Renato Iannella.
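As a rough illustration of RDF acting as a framework rather than a scheme, the sketch below holds Dublin Core statements as subject, predicate and object triples and writes them out in RDF/XML. The resource URI and values are invented, and the helper function is hypothetical. It assumes every predicate namespace ends in a slash, which holds for Dublin Core but not for all vocabularies.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)

# Each statement is (resource, property, value); the data is invented.
triples = [
    ("http://example.org/objects/1", DC + "title", "An example manuscript"),
    ("http://example.org/objects/1", DC + "creator", "An example scribe"),
]


def to_rdf_xml(statements: list) -> str:
    """Serialise triples as RDF/XML, one rdf:Description per resource."""
    root = ET.Element(f"{{{RDF}}}RDF")
    descriptions = {}
    for subject, predicate, value in statements:
        if subject not in descriptions:
            descriptions[subject] = ET.SubElement(
                root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": subject}
            )
        # Split the property URI into namespace and local name (slash assumed).
        namespace, _, local = predicate.rpartition("/")
        ET.SubElement(descriptions[subject], f"{{{namespace}/}}{local}").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = to_rdf_xml(triples)
print(xml_text)
```

The framework contributes only the `rdf:RDF` and `rdf:Description` wrappers; the element names inside come from whichever metadata scheme is being carried.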
ONIX is an XML application designed for use within the book trade
to enable publishers and booksellers to exchange essential
metadata. As a consequence, it has very good facilities for
describing key bibliographic, pricing and stock information,
but is very limited in terms of structural and administrative
metadata. It also has limited capabilities to describe anything
other than printed books. ONIX is therefore unlikely to be
of much value for a digital library.
MARC is the established standard for the creation of machine readable
cataloguing records, and underlies virtually all online library
catalogues. It consequently has extensive features for describing
bibliographic and copy-specific information, but its structural
facilities are very limited and its administrative metadata
is heavily biased towards the needs of traditional library operations.
It is of limited use for incunabula, manuscripts, and other
objects which may be included in a digital collection.
Mappings to MARC are incorporated into most metadata systems, so that
MARC records can be readily generated to allow linking from
these to library catalogues. This will allow library users
to find electronic versions of library materials in conjunction
with their traditional counterparts.
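A mapping of this kind can be sketched as a simple lookup from item-level fields to MARC tags. The tags used (100 for a personal name main entry, 245 for the title statement, 856 for electronic location, with the URI in subfield u) are standard MARC 21, but the input field names, the plain-text output and the function itself are hypothetical, and indicators are omitted for brevity.

```python
def to_marc_lines(item: dict) -> list:
    """Render a few item-level fields as simplified MARC-style lines."""
    fields = [
        ("100", "a", item.get("creator")),  # main entry: personal name
        ("245", "a", item.get("title")),    # title statement
        ("856", "u", item.get("url")),      # link to the electronic version
    ]
    # Skip any field for which the item supplies no value.
    return [f"{tag} ${code} {value}" for tag, code, value in fields if value]


item = {
    "creator": "Example, Author",
    "title": "An example book",
    "url": "http://example.org/objects/book-001",
}
for line in to_marc_lines(item):
    print(line)
```

The 856 field is what lets a catalogue record point the reader at the digital version of an item alongside its printed counterpart.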
Devised by the Art Information Task Force (AITF), CDWA (Categories for the Description of Works of Art) attempts to
define a set of core fields for the description of works of
art. In effect, it has a similar aim to Dublin Core, but is
much more specialised in its scope and function: it distinguishes
between information intrinsic to the work (art object, architecture,
or group) and information extrinsic to the work (such as information
about persons, places, and concepts related to the work).
Like DC, it is not tied to any given DTD, but may be incorporated
into other XML systems.
A similar project to CDWA is the Visual
Resources Association Core Categories, which similarly
attempts to define core fields for the description of visual
resources, and also adds information on their surrogates (such
as digital images). This is still in its early testing stage,
however, and is certain to undergo further revision as it
is evaluated.
IMS is a metadata system devised for the management of online
learning resources, which could include objects within a digital
library. Published as an XML DTD, it includes components to
provide descriptive and administrative metadata, and is designed
to map to Dublin Core. While undoubtedly powerful, it has
often been criticized as over-complex, and has been little
taken up by digital libraries.
CEDARS is a key system to encode metadata necessary for the long-term
preservation of digital materials. It aims "to promote
awareness of the importance of digital preservation, to produce
strategic frameworks for digital collection management policies
and to promote methods appropriate for long-term preservation."
Among its deliverables is a draft specification
for preservation metadata, which can readily be incorporated
into an XML system.
Metadata Tools
Numerous software tools exist for the encoding of metadata, ranging
from freeware packages to highly complex and expensive integrated
systems. Those designed specifically for XML encoding range
from free packages such as Emacs
to commercial packages such as XMetaL or oXygen.
Some further references
An extensive body of material on metadata for digital libraries
is available on the internet. A very limited selection of
important resources is listed below.
- Library Digital Initiative: Metadata
- A basic guide to metadata standards for digital libraries,
and specific metadata systems, compiled for Harvard's
Library Digital Initiative
- IFLANET: Digital Library Metadata Resources
- An extensive collection of links to information on digital
library metadata resources, compiled by IFLA.
- Library of Congress Standards
- A gateway page to information on standards maintained by
the Library of Congress, and others used by the Library
in its digital library projects.
- UKOLN Metadata
- The UK Office of Library Networking aims to be a national
focus for digital information management, and is involved
in several key metadata projects listed on this page.
- Digital Library Standards and Practices
- The Digital Library Federation's home page for standards and
practices lists key metadata systems as well as providing
information on topics such as benchmarking, quality evaluation
etc.
- METS and TEI
- PowerPoint presentation for the 2005 Association for Computing
in the Humanities conference, at the University of Victoria, British
Columbia, 15th June 2005. Given by Richard Gartner.