1
An Arizona Model for
Web Preservation and Access
  • Richard Pearce-Moses
  • Director of Digital Government Information
  • Arizona State Library, Archives and Public Records
  • Phoenix
2
The World Has Changed
  • The web realized and expanded Vannevar Bush’s vision of the memex
  • The web catalyzed the information explosion
  • The web challenges the traditional definition of publication


3
Libraries Must Respond
  • Reassess criteria for collecting agency publications
  • Scale workflows to enormous numbers
  • Adapt curatorial practices
  • Retool core skill sets
  • Recognize profound impact on preservation programs
4
Curation
  • Activities (Means)
    • Identification and selection
    • Acquisition
    • Description
    • Reference
    • Preservation
  • Principles and traditions (Objectives)



5
Curatorial Responses
  • Bibliocentric Model
    • Item-level control, highly selective
    • No significant difference between traditional print and digital materials
  • Technocentric Model
    • Captures everything, patrons select
    • Relies on full-text searching
    • If it’s not online, it doesn’t exist

6
An Arizona Model
  • An archival approach
    • Designed to manage large numbers of documents with limited resources
  • Based on observation that websites are similar to archival collections
    • Documents share the same provenance
    • Creators organize documents into groups (series / directories)
    • Aggregates (provenance and series) serve as the basic units of curation
7
Identification
  • Domain is a clue to provenance
    • adobe.com
    • adot.state.az.us (Transportation)
    • azgovernor.gov (Governor)
    • phoenixvis.net (Environmental Quality)
  • Scalable
    • 1,500 distinct domains on 50,000 pages
    • 500 not obvious, needed to be checked
    • 160 domains have in-scope content
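The domain-listing step above might be sketched as follows. This is only an illustration, not the project's actual tooling: the function name `distinct_domains` and the sample paths are hypothetical, and only the host names come from the slide.

```python
from urllib.parse import urlsplit


def distinct_domains(urls):
    """Return the sorted, de-duplicated host names found in a list of URLs."""
    hosts = set()
    for url in urls:
        # .hostname is lower-cased, and is None when the URL has no host
        # (for example, a mailto: link).
        host = urlsplit(url).hostname
        if host:
            hosts.add(host)
    return sorted(hosts)


# Hypothetical links, as might be found on a seed page.
links = [
    "http://adot.state.az.us/reports/index.html",
    "http://azgovernor.gov/",
    "http://AZGOVERNOR.GOV/news/",
    "http://adobe.com/reader/",
    "mailto:webmaster@example.com",
]

domains = distinct_domains(links)
```

A curator would then review the resulting list once per domain, rather than once per page, which is what makes the approach scale.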
8
Selection/Appraisal
  • Based on aggregates, not items
  • Directories contain similar materials
    • Reflects webmasters’ sense of order
  • Analysis of URLs reveals hierarchy
    • _derived (scripts)
    • Browse (many directories)
    • ContactUs (ephemera)
    • AnnualReports (blank forms)
    • Publications (nearly empty)
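One way to picture the URL analysis above is to group a site's URLs by their top-level directory and count them. The sketch below assumes a hypothetical host (`agency.example`) and hypothetical file names; the directory names are taken from the slide.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_level_directory(url):
    """Return the first directory in a URL's path, or "" for files at the root."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    if not path.endswith("/"):
        # The last segment is treated as a file name, not a directory.
        segments = segments[:-1]
    return segments[0] if segments else ""


def count_by_directory(urls):
    """Count how many URLs fall under each top-level directory."""
    return Counter(top_level_directory(u) for u in urls)


# Hypothetical crawl results for a single website.
urls = [
    "http://agency.example/index.html",
    "http://agency.example/Browse/DRTF/plan.pdf",
    "http://agency.example/Browse/AMA/phoenix.pdf",
    "http://agency.example/Browse/WQARF/sites.html",
    "http://agency.example/ContactUs/staff.html",
    "http://agency.example/_derived/nav.js",
]

counts = count_by_directory(urls)
```

The counts give a rough map of where the site's documents live, so that appraisal decisions can be made per directory.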
9
Selection/Appraisal – 2
  • Browse/
    • DRTF (Drought Task Force)
    • AMA (Area management plans)
    • Surface (Surface water agreements)
    • WQARF (Water Quality Assurance Fund)

  • Website had ~5,000 documents
    • ~50 directories and subdirectories
    • ~25 were in-scope


10
Description
  • Interpolation not transcription
    • Collections, series have no title page
    • Many web documents have little or no publication information
    • Many web documents have poor quality publication information

  • Mapping the forest
    • Humans describe the abstract
    • Computers describe the concrete
11
Description – 2
  • Series titles
    • Series/directory names transformed from URLs to English
  • Scope and contents note
    • Why are these documents valuable?
    • What are they about?
    • Who’s responsible for them?
  • Access points
    • Associates controlled vocabulary with series
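Turning directory names into English series titles can be partly mechanical. The following is a minimal sketch under assumed naming conventions (CamelCase, underscores, hyphens). The function names are hypothetical, and the acronym table uses only the two expansions given on the Selection/Appraisal slide; in practice a person would supply or review these titles.

```python
import re

# Expansions taken from the slides; a real table would be maintained by staff.
KNOWN_ACRONYMS = {
    "DRTF": "Drought Task Force",
    "WQARF": "Water Quality Assurance Fund",
}


def directory_to_title(name):
    """Turn a directory name such as 'AnnualReports' into 'Annual Reports'."""
    # Treat underscores and hyphens as word separators.
    text = re.sub(r"[_\-]+", " ", name).strip()
    # Insert a space where a lower-case letter or digit meets an upper-case letter.
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)


def series_title(name, acronyms=KNOWN_ACRONYMS):
    """Prefer a known acronym expansion; otherwise split the name into words."""
    return acronyms.get(name, directory_to_title(name))
```

For example, `series_title("AnnualReports")` yields "Annual Reports", while `series_title("DRTF")` yields "Drought Task Force" from the lookup table. An unknown acronym is left unchanged for a human to expand.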
12
Acquisition
  • Largely automated
    • Can create a pick list
    • Many repositories will accept default
  • Harvesting software restricted to
    • In-scope domains
    • Selected series (directories)
    • Specified file types (.doc, .html)
  • Packages documents
    • With supplied and internal metadata
    • As complete units
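The three restrictions on the harvester can be read as a single scope test applied to each candidate URL. The sketch below is an assumption about how such a filter might look, not a description of the actual harvesting software; the host `agency.example`, the function name `in_scope`, and the sample file names are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def in_scope(url, domains, directories, extensions):
    """Return True if a URL passes the domain, series, and file-type tests."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host not in domains:
        return False
    # Directories carry a trailing slash so "/Browse/DRTF/" does not match
    # a sibling such as "/Browse/DRTFold/".
    if not any(parts.path.startswith(d) for d in directories):
        return False
    suffix = PurePosixPath(parts.path).suffix.lower()
    return suffix in extensions


DOMAINS = {"agency.example"}
DIRECTORIES = ("/Browse/DRTF/", "/Browse/AMA/")
EXTENSIONS = {".doc", ".html"}

candidates = [
    "http://agency.example/Browse/DRTF/report.doc",
    "http://agency.example/Browse/DRTF/map.gif",
    "http://agency.example/ContactUs/staff.html",
    "http://adobe.com/Browse/DRTF/report.doc",
]

harvest = [u for u in candidates if in_scope(u, DOMAINS, DIRECTORIES, EXTENSIONS)]
```

Only the first candidate survives: the second fails the file-type test, the third is outside the selected series, and the fourth is on an out-of-scope domain.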
13
Reference
  • Google defines patrons’ expectations


  • Classification can improve retrieval
    • Classes eliminate noise by grouping similar documents
    • Headings refine queries by suggesting related concepts


14
Reference – 2
  • Classification based on controlled vocabulary assigned to series
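Because headings are assigned to series rather than to individual documents, each document can inherit the headings of the series it belongs to. A minimal sketch of that inheritance follows; the headings, paths, and function names are all hypothetical and stand in for whatever controlled vocabulary a repository actually uses.

```python
from collections import defaultdict

# Series (directory) path -> controlled-vocabulary headings. Hypothetical values.
SERIES_HEADINGS = {
    "/Browse/DRTF/": ["Droughts"],
    "/Browse/Surface/": ["Surface water", "Water rights"],
}


def headings_for(doc_path, series_headings=SERIES_HEADINGS):
    """Return the headings of the series that contains doc_path, if any."""
    for series, headings in series_headings.items():
        if doc_path.startswith(series):
            return headings
    return []


def build_index(doc_paths):
    """Map each heading to the documents that inherit it from their series."""
    index = defaultdict(list)
    for path in doc_paths:
        for heading in headings_for(path):
            index[heading].append(path)
    return dict(index)


docs = [
    "/Browse/DRTF/plan.pdf",
    "/Browse/Surface/agreement1.pdf",
    "/ContactUs/staff.html",
]
index = build_index(docs)
```

Documents outside any described series, such as the contact page here, simply receive no headings and remain reachable only through full-text search.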
15
Preservation: Two senses
  • Conservation
    • Long-term accessibility
    • Much work needs to be done


  • Pear-Preserves
    • To capture what will be lost until the documents can be conserved


16
Web Archives Workbench
  • Library of Congress
    National Digital Information Infrastructure and Preservation Program (NDIIPP)
    • University of Illinois at Urbana-Champaign
    • OCLC
    • Five state libraries (AZ, CT, IL, NC, WI)
    • Other content providers
  • Practical tools and a workbench


17
Tools – Being Tested
  • Domain tool
    • Returns a list of distinct domains on seed sites
    • Tracks decisions: in-scope, out-of-scope, new


  • Entity tool
    • Links domains to agencies and subordinates
    • Organizes agencies, subordinates into a taxonomy
    • Records information about agencies, subordinates


  • Version 0.1 being tested now
18
Tools – Coming Soon
  • Analysis tool
    • Shows organization of website series
    • Records selection decisions about series
    • Associates supplied metadata with series
      • Title, Scope note, Dates, Access points

  • Harvester
    • Downloads documents
    • Packages documents with supplied and internal metadata
    • Delivers METS packages to be loaded on another system

  • Due for testing January 2006
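To give a rough idea of what a METS package for one series might contain, the sketch below builds a skeletal METS document with Python's standard library. It is an illustration only: the function name `build_package`, the use of a single Dublin Core title as the supplied metadata, and the sample URLs are assumptions, and the output has not been validated against the METS schema or against what the harvester will actually produce.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
XLINK = "http://www.w3.org/1999/xlink"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("mets", METS)
ET.register_namespace("xlink", XLINK)
ET.register_namespace("dc", DC)


def build_package(title, file_urls):
    """Build a skeletal METS element tree for one series of documents."""
    root = ET.Element(f"{{{METS}}}mets")

    # Descriptive metadata: the supplied series title, wrapped as Dublin Core.
    dmd = ET.SubElement(root, f"{{{METS}}}dmdSec", ID="DMD1")
    wrap = ET.SubElement(dmd, f"{{{METS}}}mdWrap", MDTYPE="DC")
    data = ET.SubElement(wrap, f"{{{METS}}}xmlData")
    ET.SubElement(data, f"{{{DC}}}title").text = title

    # File inventory and a flat structural map pointing at each file.
    file_sec = ET.SubElement(root, f"{{{METS}}}fileSec")
    group = ET.SubElement(file_sec, f"{{{METS}}}fileGrp")
    struct = ET.SubElement(root, f"{{{METS}}}structMap")
    div = ET.SubElement(struct, f"{{{METS}}}div", DMDID="DMD1")
    for n, url in enumerate(file_urls, start=1):
        file_id = f"FILE{n}"
        file_el = ET.SubElement(group, f"{{{METS}}}file", ID=file_id)
        ET.SubElement(
            file_el,
            f"{{{METS}}}FLocat",
            {"LOCTYPE": "URL", f"{{{XLINK}}}href": url},
        )
        ET.SubElement(div, f"{{{METS}}}fptr", FILEID=file_id)
    return root


# Hypothetical series with two harvested documents.
package = build_package(
    "Drought Task Force",
    [
        "http://agency.example/Browse/DRTF/plan.pdf",
        "http://agency.example/Browse/DRTF/minutes.html",
    ],
)
xml_text = ET.tostring(package, encoding="unicode")
```

Keeping the series-level metadata and the file list in one package is what lets the receiving system load the series as a complete unit.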
19
Stay tuned!
  • Comments and criticisms welcome
    • Richard Pearce-Moses
      rpm@lib.az.us


  • Paper and slides available at
    • www.lib.az.us/diggovt/azmodel/GODORT/