|
1
|
- Richard Pearce-Moses
- Director of Digital Government Information
- Arizona State Library, Archives and Public Records
- Phoenix
|
|
2
|
- The web realized and expanded Vannevar Bush’s vision of the memex
- The web catalyzed the information explosion
- The web challenges the traditional definition of publication
|
|
3
|
- Reassess criteria for collecting agency publications
- Scale workflows to enormous numbers of documents
- Adapt curatorial practices
- Retool core skill sets
- Recognize profound impact on preservation programs
|
|
4
|
- Activities (Means)
- Identification and selection
- Acquisition
- Description
- Reference
- Preservation
- Principles and traditions (Objectives)
|
|
5
|
- Bibliocentric Model
- Item-level control, highly selective
- No significant difference between traditional print and digital materials
- Technocentric Model
- Captures everything, patrons select
- Relies on full-text searching
- If it’s not online, it doesn’t exist
|
|
6
|
- An archival approach
- Designed to manage large numbers of documents with limited resources
- Based on observation that websites are similar to archival collections
- Documents share the same provenance
- Creators organize documents into groups (series / directories)
- Aggregates (provenance and series) serve as the basic units of curation
|
|
7
|
- Domain is a clue to provenance
- adobe.com
- adot.state.az.us (Transportation)
- azgovernor.gov (Governor)
- phoenixvis.net (Environmental Quality)
- Scalable
- 1,500 distinct domains on 50,000 pages
- 500 not obvious, needed to be checked
- 160 domains have in-scope content
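The domain survey above can be sketched in a few lines. This is a minimal illustration only, not the project's actual tool: the function name `distinct_domains` and the sample paths are assumptions, and stripping a leading `www.` is a simplification a real survey might not want.

```python
from urllib.parse import urlparse


def distinct_domains(urls):
    """Return the sorted, de-duplicated host names found in a list of URLs."""
    hosts = set()
    for url in urls:
        host = urlparse(url).hostname  # None for relative or malformed URLs
        if not host:
            continue
        host = host.lower()
        if host.startswith("www."):
            host = host[4:]  # treat www.example.com and example.com as one domain
        hosts.add(host)
    return sorted(hosts)


# Hypothetical links found on seed pages; only the host names come from the slide.
links = [
    "http://www.adobe.com/reader/",
    "http://adot.state.az.us/index.html",
    "http://www.azgovernor.gov/",
    "http://adot.state.az.us/maps/roads.html",
]
print(distinct_domains(links))
```

A curator would then mark each returned domain as in-scope or out-of-scope; the code only produces the list to be reviewed.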
|
|
8
|
- Based on aggregates, not items
- Directories contain similar materials
- Reflects webmasters’ sense of order
- Analysis of URLs reveals hierarchy
- _derived (scripts)
- Browse (many directories)
- ContactUs (ephemera)
- AnnualReports (blank forms)
- Publications (nearly empty)
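One way to see a site's hierarchy is to count documents under each top-level directory. The sketch below assumes a flat list of harvested URLs; the host `water.example.gov`, the file names, and the function name `count_by_directory` are hypothetical, and only the directory names echo the slide.

```python
from collections import Counter
from urllib.parse import urlparse


def count_by_directory(urls):
    """Count documents under each top-level directory of a site."""
    counts = Counter()
    for url in urls:
        path = urlparse(url).path
        segments = [s for s in path.split("/") if s]
        if not path.endswith("/"):
            segments = segments[:-1]  # the last segment is a file name, not a directory
        directory = segments[0] if segments else "(root)"
        counts[directory] += 1
    return counts


sample = [
    "http://water.example.gov/index.html",
    "http://water.example.gov/Browse/DRTF/plan.doc",
    "http://water.example.gov/Browse/AMA/phoenix.html",
    "http://water.example.gov/ContactUs/staff.html",
]
for directory, total in count_by_directory(sample).most_common():
    print(directory, total)
```

The counts only locate where documents cluster. As the examples on this slide show, a directory name can mislead, so a person still has to look inside before deciding what a series contains.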
|
|
9
|
- Browse/
- DRTF (Drought Task Force)
- AMA (Area management plans)
- Surface (Surface water agreements)
- WQARF (Water Quality Assurance Fund)
- Website had ~5,000 documents
- ~50 directories and subdirectories
- ~25 were in-scope
|
|
10
|
- Interpolation not transcription
- Collections, series have no title page
- Many web documents have little or no publication information
- Many web documents have poor quality publication information
- Mapping the forest
- Humans describe the abstract
- Computers describe the concrete
|
|
11
|
- Series titles
- Series/directory names transformed from URLs to English
- Scope and contents note
- Why are these documents valuable?
- What are they about?
- Who’s responsible for them?
- Access points
- Associates controlled vocabulary with series
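Turning a directory name into a draft series title might look like the sketch below. It is an assumption about how such a step could be automated, not a description of the project's tools: the function name and the lookup table are hypothetical, with expansions taken from the directory examples earlier in the slides. The result is only a starting point for a curator to edit.

```python
import re

# Hypothetical lookup table; expansions follow the examples given in the slides.
KNOWN_ABBREVIATIONS = {
    "DRTF": "Drought Task Force",
    "AMA": "Area management plans",
    "WQARF": "Water Quality Assurance Fund",
}


def draft_series_title(directory_name):
    """Suggest a readable title for a directory name taken from a URL."""
    if directory_name in KNOWN_ABBREVIATIONS:
        return KNOWN_ABBREVIATIONS[directory_name]
    words = directory_name.replace("_", " ").replace("-", " ")
    # Insert a space at each lower-to-upper case boundary: AnnualReports -> Annual Reports
    words = re.sub(r"([a-z])([A-Z])", r"\1 \2", words)
    return " ".join(words.split())


for name in ["AnnualReports", "ContactUs", "DRTF"]:
    print(name, "->", draft_series_title(name))
```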
|
|
12
|
- Largely automated
- Can create a pick list
- Many repositories will accept default
- Harvesting software restricted to
- In-scope domains
- Selected series (directories)
- Specified file types (.doc, .html)
- Packages documents
- With supplied and internal metadata
- As complete units
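The three harvesting restrictions above amount to a filter applied to each candidate URL. The following is a minimal sketch under stated assumptions: the function name `in_scope`, its parameters, and the host `water.example.gov` are hypothetical, and real harvesting software would expose its own configuration for the same rules.

```python
import posixpath
from urllib.parse import urlparse


def in_scope(url, domains, series, extensions):
    """Return True if a URL passes the domain, series, and file-type restrictions."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host not in domains:
        return False  # not an in-scope domain
    if not any(parsed.path.startswith(prefix) for prefix in series):
        return False  # not inside a selected series (directory)
    extension = posixpath.splitext(parsed.path)[1].lower()
    return extension in extensions  # only the specified file types


# Hypothetical selection decisions for one agency site.
domains = {"water.example.gov"}
series = ["/Browse/DRTF/", "/Browse/AMA/"]
extensions = {".doc", ".html"}

print(in_scope("http://water.example.gov/Browse/DRTF/plan.doc", domains, series, extensions))
print(in_scope("http://water.example.gov/ContactUs/staff.html", domains, series, extensions))
```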
|
|
13
|
- Google defines patrons’ expectations
- Classification can improve retrieval
- Classes eliminate noise by grouping similar documents
- Headings refine queries by suggesting related concepts
|
|
14
|
- Classification based on controlled vocabulary assigned to series
|
|
15
|
- Conservation
- Long-term accessibility
- Much work needs to be done
- Pear-Preserves
- To capture what would otherwise be lost, until the documents can be conserved
|
|
16
|
- Library of Congress, National Digital Information Infrastructure and Preservation Program (NDIIPP)
- University of Illinois at Urbana-Champaign
- OCLC
- Five state libraries (AZ, CT, IL, NC, WI)
- Other content providers
- Practical tools and a workbench
|
|
17
|
- Domain tool
- Returns a list of distinct domains on seed sites
- Tracks decisions: in-scope, out-of-scope, new
- Entity tool
- Links domains to agencies and subordinates
- Organizes agencies, subordinates into a taxonomy
- Records information about agencies, subordinates
- Version 0.1 being tested now
|
|
18
|
- Analysis tool
- Shows organization of website series
- Records selection decisions about series
- Associates supplied metadata with series
- Title, Scope note, Dates, Access points
- Harvester
- Downloads documents
- Packages documents with supplied and internal metadata
- Delivers METS packages to be loaded on another system
- Due for testing January 2006
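To make the packaging step concrete, here is a rough sketch of a series-level METS wrapper built with Python's standard library. It is not the harvester's actual output and has not been validated against the METS schema. The function name `build_package`, the choice of Dublin Core for the supplied metadata, and the sample values are all assumptions.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
DC_NS = "http://purl.org/dc/elements/1.1/"


def q(namespace, tag):
    """Build a namespace-qualified tag name in ElementTree's {uri}tag form."""
    return "{%s}%s" % (namespace, tag)


def build_package(series_title, scope_note, file_urls):
    """Wrap supplied series metadata and a list of file locations in one METS document."""
    for prefix, uri in (("mets", METS_NS), ("xlink", XLINK_NS), ("dc", DC_NS)):
        ET.register_namespace(prefix, uri)
    mets = ET.Element(q(METS_NS, "mets"))
    # Supplied (curator-written) metadata, expressed here as Dublin Core.
    dmd = ET.SubElement(mets, q(METS_NS, "dmdSec"), {"ID": "DMD1"})
    wrap = ET.SubElement(dmd, q(METS_NS, "mdWrap"), {"MDTYPE": "DC"})
    data = ET.SubElement(wrap, q(METS_NS, "xmlData"))
    ET.SubElement(data, q(DC_NS, "title")).text = series_title
    ET.SubElement(data, q(DC_NS, "description")).text = scope_note
    group = ET.SubElement(ET.SubElement(mets, q(METS_NS, "fileSec")), q(METS_NS, "fileGrp"))
    div = ET.SubElement(ET.SubElement(mets, q(METS_NS, "structMap")),
                        q(METS_NS, "div"), {"DMDID": "DMD1"})
    for number, url in enumerate(file_urls, start=1):
        file_id = "FILE%d" % number
        entry = ET.SubElement(group, q(METS_NS, "file"), {"ID": file_id})
        ET.SubElement(entry, q(METS_NS, "FLocat"),
                      {"LOCTYPE": "URL", q(XLINK_NS, "href"): url})
        ET.SubElement(div, q(METS_NS, "fptr"), {"FILEID": file_id})
    return ET.tostring(mets, encoding="unicode")


print(build_package(
    "Drought Task Force",
    "Hypothetical scope note for the series.",
    ["http://water.example.gov/Browse/DRTF/plan.doc"],
))
```

Describing the series once and pointing to every file from a single package follows the aggregate-level approach: the supplied metadata is written once per series rather than once per document.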
|
|
19
|
- Comments and criticisms welcome
- Richard Pearce-Moses, rpm@lib.az.us
- Paper and slides available at
- www.lib.az.us/diggovt/azmodel/GODORT/
|