|
INDEXING & METADATA FOR DIGITAL OBJECTS
Issues of Importance
Indexing for content description will be an important activity in your
digital project. The indexed terms add to the ability to search and retrieve
digital objects from your database. Indexing will enhance the researcher's
interest in and awareness of the digital objects you offer, and it adds to
the body of textual information about each object. You will need to select
an indexing standard and format to guide
the creation of your indexing. Two commonly used indexing standards are
USMARC cataloging and the Dublin Core metadata schema. It will be useful
to create a set of protocols and guidelines for the use of project staff
who will be doing content indexing. Consistency is essential for producing
valid data. For example, a common standard for expressing dates numerically
is the form YYYYMMDD.
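A date protocol like this can be checked at data entry. The following is a minimal sketch, assuming dates arrive in a small number of known formats; the function name `to_yyyymmdd` and the list of accepted input formats are illustrative assumptions, not part of any indexing standard.

```python
from datetime import datetime

# Input formats the project expects to encounter (an assumption for this sketch).
ACCEPTED_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y", "%Y%m%d"]

def to_yyyymmdd(raw: str) -> str:
    """Return the date in YYYYMMDD form, or raise ValueError if unrecognized."""
    text = raw.strip()
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y%m%d")
        except ValueError:
            continue  # try the next accepted format
    raise ValueError(f"Unrecognized date: {raw!r}")

print(to_yyyymmdd("July 4, 1923"))  # prints 19230704
```

Rejecting unrecognized input, rather than guessing, sends ambiguous dates back to the indexer for review.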
Metadata is described as information about information.
This "metadata" consists of:
- Administrative metadata - information about rights, authorship, ownership.
- Structural metadata - used by viewing software.
- Content metadata - description, title, etc.
Metadata chart from http://www.getty.edu/gri/standard/intrometadata/
Why do we need metadata?
It is useful for resource identification, resource discovery, authentication,
rights management, provenance, version control, resource system use and
tracking of users. It will become important as we develop library systems
that are interoperable.
Costs of indexing:
The cost to create useful finding aids and metadata is far higher than
the cost to create machine-scanned files because it requires skilled,
human attention. Where such finding aids, metadata and resource descriptions
are in place, costs for these factors will be significantly reduced. The
cost to assign values to index attributes is dependent upon how much work
is needed to determine what information to post. If the index is essentially
prepared ahead of scanning, such as on a filled out form, then adding
index records to the database is a data entry effort. By contrast, if
the information must be derived from a reading of the document or an analysis
of photographs, it could be quite costly.
A fifteen-element index record with 500 characters of entry may take between
30 seconds and a few minutes to complete, assuming the values are provided to
the indexer. At an average of four minutes to index each item in a 5,000-image
collection, a total of about 334 hours will be required. At minimum wage ($6.20
per hour) this will cost approx. $2,070.
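The arithmetic behind this estimate can be reproduced for other collection sizes and wage rates. The sketch below uses the figures from the text; the function name `indexing_cost` and the choice to round up to whole hours are assumptions made for illustration.

```python
import math

def indexing_cost(items: int, minutes_per_item: float, hourly_wage: float) -> tuple:
    """Return (whole hours needed, labor cost) for indexing a collection."""
    hours = math.ceil(items * minutes_per_item / 60)  # round up to whole hours
    return hours, round(hours * hourly_wage, 2)

# Figures from the text: 5,000 images, 4 minutes each, $6.20 per hour.
hours, cost = indexing_cost(items=5000, minutes_per_item=4, hourly_wage=6.20)
print(hours, cost)  # prints 334 2070.8
```

The estimate covers indexing labor only; training, review, and quality control would add to it.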
Typical indexing applications that run on PCs include powerful functions
to aid the operator. These
include head-down entry without the need to visually advance the cursor,
controlled vocabularies, double-click selection of dictionary values to
avoid key entry, spell checking, repetitive value entry, and automatic
assignment of sequential numbering.
Options to consider:
Once the images have been amassed into a database, discovery and retrieval
become critical functions in digitization. Indexing each image, or set
of images, is used to aid in the location of the images during a database
search. The digital resource will only be useful if it contains metadata
and other meaningful "finding aids" to allow users to navigate
the digital file. Typically this includes a general description of the object
using free text and values from controlled vocabularies.
Several metadata format and syntax standards have been created and are
in use throughout the world.
Similar to USMARC, these standards govern the choice of data elements,
the syntax (structure) of how their values are constructed, and the use
of controlled vocabularies or limits to the values.
Metadata standards (EAD, Dublin Core, USMARC, etc.)
Resource description is essentially indexing and cataloging of resources.
Use of standard content description formats based on metadata is essential.
One standard is the MARC format. Other useful metadata formats are Dublin
Core (http://purl.oclc.org/dc/), the WAGILS attribute set, and
Encoded Archival Description (EAD) (http://lcweb.loc.gov/ead/).
Crosswalks, which map the elements of one standard to their equivalents in
another, have been developed between these standards.
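To make the idea of a crosswalk concrete, the sketch below maps a few Dublin Core elements to MARC field tags and subfields. The pairings shown follow commonly published Dublin Core to MARC mappings but are only an illustrative subset; verify each one against the crosswalk your project adopts. The function name `crosswalk` and the flat record layout are assumptions.

```python
# Illustrative subset of a Dublin Core -> MARC crosswalk: (field tag, subfield).
# Verify each pairing against the published crosswalk before relying on it.
DC_TO_MARC = {
    "title": ("245", "a"),
    "creator": ("720", "a"),
    "subject": ("653", "a"),
    "description": ("520", "a"),
    "rights": ("540", "a"),
}

def crosswalk(dc_record: dict) -> list:
    """Convert a flat Dublin Core dict into (tag, subfield, value) triples."""
    marc_fields = []
    for element, value in dc_record.items():
        if element not in DC_TO_MARC:
            continue  # no mapping defined in this sketch; skip rather than guess
        tag, subfield = DC_TO_MARC[element]
        marc_fields.append((tag, subfield, value))
    return marc_fields

record = {"title": "Main Street, 1910", "subject": "Streets"}
print(crosswalk(record))
```

Elements without a mapping are skipped here; a production crosswalk would more likely log them for review.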
For digitized images, a few metadata schemas are often used. These include
EAD, Dublin Core, and TEI. While they use similar concepts, these schemas
have different terms, syntax, and rules for assigning values. A goal of
metadata creators is to choose a popular and persistent standard that offers
interoperability with other metadata databases and one that is simple enough
for public use.
Index placement:
For digitized documents, the descriptive metadata are often found
in an "external" database. That is, the terms and their values
are retained in a database apart from where the images are stored. While
this provides efficiencies in managing the index, it remains critical
that the location information in the index agrees with the physical placement
of the image; otherwise the index becomes corrupt.
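One way to catch disagreement between the index and the image store is a periodic check that every location recorded in the index resolves to a real file. This is a minimal sketch; the record keys `identifier` and `location`, the function name `find_broken_links`, and file-system storage are all assumptions.

```python
from pathlib import Path

def find_broken_links(index_records: list, image_root: Path) -> list:
    """Return identifiers of index records whose image file is missing."""
    broken = []
    for record in index_records:
        # 'location' is assumed to be a path relative to the image store root.
        if not (image_root / record["location"]).is_file():
            broken.append(record["identifier"])
    return broken
```

Running a check like this after images are moved or renamed helps surface broken links before users encounter them.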
Indexing process
Several indexing models have been developed for indexing digitized
objects. In the first model, metadata information is entered by the scanner
operator during scanning. The index so created is linked to the object
by "frame number" or its digital address in the storage medium.
In the second model, objects are scanned or filmed to create an intermediate
database of images.
Then an operator displays each object on the screen while entering index
information. The index is later married to the object database to establish
links to the physical image location.
Indexing may be accomplished with desktop personal computers equipped with
graphical viewing capabilities. The raw images are presented to indexing
staff for assignment of index values. These values are temporarily stored
on the PC for ultimate transmission to the final production facility. In
Washington State some organizations are using the CONTENTdm Digital Asset
Management software (http://www.contentdm.com/) to construct metadata using
the Dublin Core schema.
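For readers unfamiliar with what a Dublin Core record looks like, the sketch below serializes a few elements as XML using the Dublin Core element namespace. It does not represent the output of any particular software package; the wrapper element `record` and the function name `build_dc_record` are assumptions.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)

def build_dc_record(fields: dict) -> str:
    """Serialize a flat dict of Dublin Core elements as a small XML record."""
    root = ET.Element("record")  # wrapper element name is an assumption
    for element, value in fields.items():
        child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
        child.text = value
    return ET.tostring(root, encoding="unicode")

xml_text = build_dc_record({"title": "Main Street, 1910", "date": "19100000"})
print(xml_text)
```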
Centralized or decentralized indexing
Index values are best assigned by trained operators, with minimal handling
of the source documents. In large-scale document imaging operations, indexing
is almost exclusively done centrally while viewing the displayed object
on a monitor. However, with locally collected, and perhaps locally scanned
objects, decentralized indexing may be justified. At issue is the amount
of training required for successful assignment of values to the metadata
index.
Indexing services
Creating an index of values to describe digitized objects is a vital component
of the digitization process. The effort required to produce an index varies
depending upon the nature of the objects and the level of retrieval desired.
At some point in the capture process, human determination of values for
index terms is required. Indexing effort varies from simple keying of
batch control information to analysis of content and application of cataloging
/ indexing standards. Volunteer assistance is often applied where possible.
The index can be built after the conversion of an object into digital
form. For example, objects can be scanned and the images produced can
be presented to a local historical society or library staff to complete
the assignment of index values to the record. Finally, the index and image
are linked to allow direct retrieval.
Using indexing software to automatically create a collection record is one
option to consider. An alternative to this process is to contract with
indexing services to determine and assign terms to the index. The use of
controlled vocabularies and indexing standards materially improves indexing
quality.
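A controlled vocabulary can be enforced with a simple lookup at data entry. The sketch below is illustrative only: the vocabulary terms, the function name `check_terms`, and the case-insensitive matching rule are assumptions.

```python
# Hypothetical subject vocabulary for a local history collection.
SUBJECT_VOCABULARY = {"Logging", "Railroads", "Schools", "Fishing"}

def check_terms(terms: list, vocabulary: set) -> tuple:
    """Split terms into (accepted, rejected), matching case-insensitively."""
    canonical = {term.lower(): term for term in vocabulary}
    accepted, rejected = [], []
    for term in terms:
        match = canonical.get(term.strip().lower())
        if match is not None:
            accepted.append(match)  # store the vocabulary's preferred form
        else:
            rejected.append(term)   # flag for review by a trained indexer
    return accepted, rejected

print(check_terms(["railroads", "Trains"], SUBJECT_VOCABULARY))
# prints (['Railroads'], ['Trains'])
```

Storing the vocabulary's preferred form, not the indexer's keystrokes, keeps the index consistent across staff and volunteers.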
Project checklist:
- Choose the metadata model that best fits your materials and establish
a template for data entry.
- Establish the administrative metadata before you begin scanning material.
- Develop a training plan and workbook of examples for all staff or
volunteers who will complete indexing of digital objects.
- Develop a list of indexing protocols to assure consistency in use of
indexing terms.
- Establish a system of review and quality control for all indexing tasks.
- Decide if you will use structural metadata. This could be very important
in relating scanned pages from text documents to each other.
- Establish set file names to provide a unique identifier for each digital
object and match the unique identifier to the corresponding indexing record.
- Set file names, attribute fields, order and tagging.
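The file-naming and matching items in the checklist can be sketched as follows. The naming pattern (collection code, item number, page number), and the function names `make_identifier` and `unmatched`, are assumptions chosen for illustration; any scheme that yields one unique identifier per digital object would serve.

```python
def make_identifier(collection: str, item: int, page: int) -> str:
    """Build an identifier such as 'hist_00042_003' (the pattern is an assumption)."""
    return f"{collection}_{item:05d}_{page:03d}"

def unmatched(file_names: list, index_ids: list) -> tuple:
    """Return (files with no index record, index records with no file)."""
    file_ids = {name.rsplit(".", 1)[0] for name in file_names}  # drop extension
    ids = set(index_ids)
    return sorted(file_ids - ids), sorted(ids - file_ids)

print(make_identifier("hist", 42, 3))  # prints hist_00042_003
```

Zero-padded page numbers keep the pages of one document in order when file names are sorted, which provides a simple form of structural metadata.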
|