tingletech edited this page Dec 29, 2011 · 32 revisions

EAD Noun-Obscured Specimen Collection

Project Objective

This project aims to build a collection of EAD 2002 XML specimens, or sample files, that are valid against the DTD or the W3C schema and are made available under a CC0 Public Domain Dedication.

The collection seeks to capture specimens that reflect a wide diversity of valid encoding practices. The primary use case of the collection is the design and testing of systems that need to handle a variety of EAD markup features.

Greeking

As in typography, greeking involves inserting nonsense text or, commonly, Greek or Latin text in prototypes of visual media projects (such as in graphic and web design) to check the layout of the final version before the actual text is available, or to enhance layout assessment by eliminating the distraction of readable text. -- http://en.wikipedia.org/wiki/Greeking

Typical EAD files contain many names and paragraphs of text. EAD systems often convert EAD files to HTML and make the results available to search engines such as Google. Specimens are systematically obscured by scrambling nouns in XML text nodes (and certain XML attributes?), so that the donated specimens don't show up in common Google searches or get confused with the original collection description.

The greeking process is not cryptographically secure. A dedicated person probably could recover the original EAD file from the greeked version (but why would they?).
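As a rough illustration of working on XML text nodes only, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It is not the project's greeking script; the function name `greek_text_nodes` and the `transform` callback are hypothetical, and the sample fragment is made up. Attributes are left alone, and a real EAD file with a DOCTYPE, entities, or namespaces would need more care than this shows.

```python
import xml.etree.ElementTree as ET


def greek_text_nodes(xml_string, transform):
    """Apply `transform` to every non-blank text node; markup is untouched.

    `transform` is a hypothetical callback standing in for the real
    noun-obscuring step.
    """
    root = ET.fromstring(xml_string)
    for element in root.iter():
        # Text directly inside the element, before its first child.
        if element.text and element.text.strip():
            element.text = transform(element.text)
        # Text following the element's end tag, inside its parent.
        if element.tail and element.tail.strip():
            element.tail = transform(element.tail)
    return ET.tostring(root, encoding="unicode")


# Made-up fragment; str.upper stands in for a real greeking function.
sample = "<did><unittitle>Smith family papers</unittitle><unitdate>1901</unitdate></did>"
result = greek_text_nodes(sample, str.upper)
print(result)
```

Passing the transformation in as a callback keeps the tree walking separate from whatever rule decides how a piece of text is obscured.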

Noun Obscurification

So that the end product maintains some readability (it reads sort of like a mad lib), only nouns are replaced by the systematic process. Noun inflection and capitalization are preserved by the process.

The Python Natural Language Toolkit (NLTK) is used to identify nouns.

(Question: should a "stop word" list of common archival terms be exempted from noun obscuring?)
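The sketch below shows one way noun replacement could preserve capitalization and a simple plural inflection. It is an illustration under stated assumptions, not the logic of the actual greeking script: the replacement pool, the hash-based choice, and the names `POOL`, `match_case`, `pick_replacement`, and `obscure_nouns` are all hypothetical. The input is a list of `(word, tag)` pairs with Penn Treebank tags, the shape that `nltk.pos_tag(nltk.word_tokenize(text))` returns; the pairs here are tagged by hand so the example runs without NLTK or its data files.

```python
import hashlib

# Hypothetical pool of singular replacement nouns; not taken from the project.
POOL = ["lantern", "meadow", "pebble", "harbor", "thimble", "walnut"]


def match_case(original, replacement):
    """Copy the capitalization pattern of `original` onto `replacement`."""
    if original.isupper() and len(original) > 1:
        return replacement.upper()
    if original[:1].isupper():
        return replacement.capitalize()
    return replacement


def pick_replacement(word):
    """Choose a pool word deterministically, so the same noun always maps the same way."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).hexdigest()
    return POOL[int(digest, 16) % len(POOL)]


def obscure_nouns(tagged_tokens):
    """Replace tokens whose tag starts with NN; leave all other tokens alone."""
    out = []
    for word, tag in tagged_tokens:
        if tag.startswith("NN") and word.isalpha():
            replacement = pick_replacement(word)
            if tag in ("NNS", "NNPS"):  # crude plural handling, an assumption
                replacement += "s"
            out.append(match_case(word, replacement))
        else:
            out.append(word)
    return out


# Hand-tagged tokens in the (word, tag) shape produced by nltk.pos_tag.
tagged = [("The", "DT"), ("Letters", "NNS"), ("describe", "VBP"), ("a", "DT"),
          ("farm", "NN"), ("in", "IN"), ("Ohio", "NNP"), (".", ".")]
greeked = obscure_nouns(tagged)
print(" ".join(greeked))
```

A stop word list of archival terms, if adopted, could be checked in `obscure_nouns` before a token is replaced.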

Phone numbers, ZIP codes, and other Arabic numerals

Digits are ignored by the systematic noun greeking, which leaves them unaltered. Phone numbers, ZIP codes, and other identifiable numbers may nevertheless also be greeked. Dates expressed in numerals will not be changed, but spelled-out month names are obscured if identified as nouns.
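If identifiable numbers are greeked, one possible approach is to scramble only digits that match a phone number or ZIP code pattern and leave every other numeral, such as a year, alone. The following is a sketch under that assumption; the patterns (US-style phone numbers and ZIP codes), the digit mapping, and the names `PHONE`, `ZIP_CODE`, `scramble_digits`, and `greek_numbers` are hypothetical and do not describe the project's script.

```python
import re

# Hypothetical patterns: US-style phone numbers and 5-digit or ZIP+4 codes.
PHONE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")
ZIP_CODE = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def scramble_digits(match):
    """Map each digit d to (7*d + 3) mod 10; punctuation is kept as-is."""
    return "".join(
        str((int(ch) * 7 + 3) % 10) if ch.isdigit() else ch
        for ch in match.group(0)
    )


def greek_numbers(text):
    """Scramble phone numbers and ZIP codes; other numerals are untouched."""
    text = PHONE.sub(scramble_digits, text)
    return ZIP_CODE.sub(scramble_digits, text)


print(greek_numbers("Call 555-123-4567, Berkeley, CA 94720, founded 1868."))
# -> Call 888-074-1852, Berkeley, CA 61273, founded 1868.
```

Like the rest of the greeking, a fixed digit mapping such as this one is easy to reverse and is not meant to be secure.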

Acquisition/Submission Process

The collection of noun-obscured EAD specimens is maintained in a revision control system repository (specifically, a git repository on GitHub: https://github.com/tingletech/ead-test-col ).

An EAD file "in the wild" is submitted to a specimen processor. The submitter asserts they have the right to submit the original specimen for the purpose of it being processed and included in the collection. The specimen processor then conducts the noun-obscuring "greeking" transformation procedure on the file and commits the transformed file to the GitHub repository or a fork. In the commit message, the specimen processor references the source of the original file, including metadata identifying the correspondence indicating the submitter's agreement to donate the specimen.

[Question: keep source URLs to the un-greeked files in the official repo, or keep URLs to the original EAD XML confidential?]

Misc.

( the greeking script: https://github.com/tingletech/greeker.py )

Original post to the EAD listserv about the project: http://bit.ly/rPV1hJ ( http://listserv.loc.gov/cgi-bin/wa?A2=ind1112&L=ead&T=0&P=1437 )
