tingletech edited this page Dec 29, 2011 · 32 revisions

EAD Noun-Obscured Specimen Collection

Project Objective

The goal of this project is to build a collection of EAD 2002 XML specimens (sample files), each valid against the DTD or the W3C schema, that are made available under an open source license.

The collection seeks to capture specimens that follow a wide diversity of valid encoding practices. The primary use case of the collection is the design and testing of systems that need to handle a variety of EAD markup features.

Greeking

As in typography, greeking involves inserting nonsense text or, commonly, Greek or Latin text in prototypes of visual media projects (such as in graphic and web design) to check the layout of the final version before the actual text is available, or to enhance layout assessment by eliminating the distraction of readable text. -- http://en.wikipedia.org/wiki/Greeking

Typical EAD files contain many names and paragraphs of text. EAD systems often convert EAD files to HTML and make them available to search engines such as Google. Specimens are therefore systematically obscured by scrambling nouns in XML text nodes and certain XML attributes, so that the donated specimens don't show up in common Google searches or get confused with the original collection descriptions.
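As a rough illustration of what "text nodes and certain XML attributes" could mean in practice, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It is not taken from the greeking script. The function name `transform_tree` and the attribute set `TEXT_ATTRS` are assumptions made for this example, since this page does not list which attributes are obscured.

```python
import xml.etree.ElementTree as ET

# Assumption: attributes that hold human-readable text. This page does not
# say which attributes are actually obscured; these names are illustrative.
TEXT_ATTRS = {"label", "title"}

def transform_tree(root, fn, attrs=TEXT_ATTRS):
    """Apply fn in place to every non-blank text node and selected attribute."""
    for el in root.iter():
        # Text directly inside the element, before its first child.
        if el.text and el.text.strip():
            el.text = fn(el.text)
        # Text that follows the element's closing tag (mixed content).
        if el.tail and el.tail.strip():
            el.tail = fn(el.tail)
        # Only the named attributes are touched; ids and the like are left alone.
        for name in attrs:
            if name in el.attrib:
                el.set(name, fn(el.get(name)))
    return root
```

Any string-to-string function can be passed as `fn`; a noun-replacing function would go there. Handling `tail` as well as `text` matters for EAD because descriptive elements often contain mixed content.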

The greeking process is not cryptographically secure, and it is possible that noun identification will not be 100% accurate.

Noun Obscurification

So that the end product maintains some readability (it reads somewhat like a Mad Lib), only nouns are replaced by the systematic process. Noun inflection and capitalization are preserved.

(Use a "stop word" list of common archival terms to be exempted from noun obscurification?)
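The idea of replacing nouns while keeping inflection and capitalization can be sketched as follows. This is an illustration under stated assumptions, not the project's actual script: the input is assumed to be already tagged with Penn Treebank part-of-speech tags (`NN` singular, `NNS` plural), the replacement list `POOL` is made up, pluralization is naive, and the `stop_words` parameter simply shows how the exemption list asked about above might fit in.

```python
import random

# Hypothetical replacement pool; a real run would draw on a larger word list.
POOL = ["lantern", "meadow", "harbor", "ribbon"]

def match_case(original, replacement):
    """Copy the capitalization pattern of `original` onto `replacement`."""
    if len(original) > 1 and original.isupper():
        return replacement.upper()
    if original[:1].isupper():
        return replacement.capitalize()
    return replacement

def obscure(tagged, rng, stop_words=frozenset()):
    """Replace tokens tagged NN or NNS, keeping number and capitalization.

    `tagged` is a list of (word, tag) pairs; tagging itself is assumed to
    have been done elsewhere.
    """
    out = []
    for word, tag in tagged:
        if tag not in ("NN", "NNS") or word.lower() in stop_words:
            out.append(word)  # non-nouns and exempted terms pass through
            continue
        new = rng.choice(POOL)
        if tag == "NNS":
            new += "s"  # naive plural, adequate for this small pool
        out.append(match_case(word, new))
    return out
```

For example, `obscure([("Papers", "NNS"), ("of", "IN"), ("smith", "NN")], random.Random())` returns a capitalized plural, then `of` unchanged, then a lowercase singular.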

Phone numbers, ZIP codes, and other Arabic numerals

The systematic greeking ignores digits. Phone numbers, ZIP codes, and other identifiable numbers may also be greeked. Dates expressed in numerals are not changed, but spelled-out month names are obscured if they are identified as nouns.
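Because the systematic process leaves digits alone, obscuring a phone number or ZIP code would be a separate, selective step. One possible way to do that is sketched below; the helper `scramble_digits` is hypothetical and is not described on this page. It should only be applied to values a preparer has singled out, since applying it to a whole file would also alter numeric dates, which are meant to stay unchanged.

```python
import random
import re

DIGIT = re.compile(r"\d")

def scramble_digits(text, rng):
    """Replace each digit with a random digit, keeping length and punctuation."""
    return DIGIT.sub(lambda m: str(rng.randrange(10)), text)
```

Passing `"(510) 555-0100"` yields a string of the same shape with different digits, so the layout of the value is still exercised by any system under test.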

Collection License

The noun-obscured specimen collection is made available under a CC0 Public Domain Dedication.

Acquisition/Submission Process

The collection of noun-obscured EAD specimens is maintained in a revision control repository (specifically, a Git repository on GitHub: https://github.com/tingletech/ead-test-col ).

An EAD file "in the wild" is submitted to a specimen preparer. The submitter asserts that they have the right to submit the original specimen for the purpose of it being processed and included in the collection. The specimen preparer then runs the noun-obscuring "greeking" transformation on the file and commits the transformed file to the GitHub repository or a fork. In the commit message, the specimen preparer references the source of the original file, including metadata identifying the correspondence that records the submitter's agreement to submit the specimen.

Misc.

( the greeking script: https://github.com/tingletech/greeker.py )

Original post to the EAD listserv about the project: http://bit.ly/rPV1hJ ( http://listserv.loc.gov/cgi-bin/wa?A2=ind1112&L=ead&T=0&P=1437 )
