Home

EAD Noun-Obscured Specimen Collection

Contributors

Project Objective

The goal of this project is to build a collection of EAD 2002 XML specimens (sample files), each valid against the EAD DTD or W3C XML Schema, that are made available under a CC0 Public Domain Dedication.

The collection seeks to capture specimens that reflect a wide diversity of valid encoding practices. Its primary use case is the design and testing of systems that need to handle a variety of EAD markup features.

Greeking

As in typography, greeking involves inserting nonsense text or, commonly, Greek or Latin text in prototypes of visual media projects (such as in graphic and web design) to check the layout of the final version before the actual text is available, or to enhance layout assessment by eliminating the distraction of readable text. -- http://en.wikipedia.org/wiki/Greeking

Typical EAD files contain many names and paragraphs of text. EAD systems often convert EAD files to HTML and make the results available to search engines such as Google. Specimens are systematically obscured by scrambling nouns in XML text nodes, so that the donated specimens don't show up in common Google searches or get confused with the original collection descriptions.

The greeking process is not cryptographically secure. A dedicated person probably could recover the original EAD file from the greeked version (but why would they?).

Noun Obscurification

So that the end product maintains some readability (it reads sort of like a mad lib), only nouns are replaced by the systematic process. Noun inflection and capitalization are preserved by the process.

The Python Natural Language Toolkit (NLTK) is used to identify nouns.

(Question: in a later phase, should a "stop word" list of common archival terms be exempted from noun obscurification?)
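As a rough illustration of the idea (not the project's actual code), the sketch below replaces only the tokens tagged as nouns and carries the original capitalization over to the replacement. It assumes the tokens have already been tagged with Penn Treebank part-of-speech tags, which is what NLTK's `pos_tag` returns by default; the tags here are written by hand so the example runs without NLTK. The names `match_case` and `obscure_nouns`, and the `substitute` callback, are hypothetical. Inflection handling is not shown.

```python
def match_case(original, replacement):
    """Give `replacement` the same capitalization pattern as `original`."""
    if len(original) > 1 and original.isupper():
        return replacement.upper()          # SMITH -> ZORK
    if original[:1].isupper():
        return replacement.capitalize()     # Papers -> Zork
    return replacement.lower()              # papers -> zork


def obscure_nouns(tagged_tokens, substitute):
    """Replace nouns only; every other token passes through unchanged.

    `tagged_tokens` is a list of (word, tag) pairs using Penn Treebank
    tags, where all noun tags start with "NN" (NN, NNS, NNP, NNPS).
    `substitute` is a hypothetical callback mapping a lowercased noun
    to its replacement.
    """
    result = []
    for word, tag in tagged_tokens:
        if tag.startswith("NN"):
            result.append(match_case(word, substitute(word.lower())))
        else:
            result.append(word)
    return result


# Hand-tagged tokens stand in for the output of a real tagger.
tagged = [("The", "DT"), ("Papers", "NNS"), ("of", "IN"), ("SMITH", "NNP")]
print(obscure_nouns(tagged, lambda noun: "zork"))
```

Passing the substitution in as a callback keeps the noun detection separate from whatever scrambling scheme is used.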

Phone numbers, ZIP codes, and other Arabic numerals

Digits are ignored by the systematic greeking and are left unaltered. Phone numbers, ZIP codes, and other identifiable numbers could also be greeked (all phone numbers to the 555 exchange?), but in the current data set they are left unaltered. Dates expressed in numbers are not changed, but spelled-out month names are obscured if identified as nouns.
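One way to get this behavior, shown here only as a sketch under assumptions, is to apply the transformation to runs of letters and leave everything else (digits, punctuation, whitespace) exactly as it was. The function name `greek_text` is hypothetical, and `str.upper` stands in for the real noun scrambling so the effect is easy to see; unlike the real process, this stand-in changes every word, not just nouns.

```python
import re

# One or more letters: word characters excluding digits and underscore.
LETTER_RUN = re.compile(r"[^\W\d_]+")


def greek_text(text, transform):
    """Apply `transform` to each run of letters; leave digits untouched."""
    return LETTER_RUN.sub(lambda match: transform(match.group(0)), text)


sample = "Call 510-555-0100 in March 1999"
print(greek_text(sample, str.upper))
# Call -> CALL, in -> IN, March -> MARCH; the phone number and year are unchanged.
```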

XML Attributes and XML Comments

In the current data set, data in XML attributes and XML comments is not obscured.
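The following is a minimal sketch of how text nodes alone might be transformed with Python's standard `xml.etree.ElementTree`, leaving attributes and comments as they are. It is an assumption about one possible approach, not the project's implementation. The function name `greek_tree` is hypothetical, and `str.upper` again stands in for the noun scrambling. Note that ElementTree drops comments unless the tree builder is told to keep them, and it does not preserve a DOCTYPE declaration.

```python
import xml.etree.ElementTree as ET


def greek_tree(xml_string, transform):
    """Transform element text and tails; skip attributes and comments."""
    # insert_comments=True keeps comments in the tree (Python 3.8+).
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_string, parser=parser)
    for node in root.iter():
        # Comment nodes carry ET.Comment as their tag; leave their text alone.
        if node.tag is not ET.Comment and node.text:
            node.text = transform(node.text)
        # The tail is ordinary text following the node, so it is transformed
        # even when the node itself is a comment.
        if node.tail:
            node.tail = transform(node.tail)
    return ET.tostring(root, encoding="unicode")


sample = (
    '<c01 level="series"><!-- keep me -->'
    "<unittitle>letters</unittitle> and diaries</c01>"
)
print(greek_tree(sample, str.upper))
```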

Greeking and Authority Control

The greeking algorithm will always return the same result for a given noun, so it should still be possible to test building interfaces that browse or facet on controlled access terms.
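To illustrate why a deterministic mapping keeps controlled access terms usable, here is a small sketch, assuming a scramble that shuffles a noun's letters with a random generator seeded from the noun itself. This is a hypothetical scheme chosen for illustration; the project's actual algorithm may differ. Because the seed depends only on the input, the same noun always produces the same output, so two records sharing a term still share it after greeking.

```python
import hashlib
import random


def scramble(noun):
    """Deterministically shuffle the letters of `noun` (hypothetical scheme)."""
    # Derive the seed from the noun so the result is repeatable across runs.
    digest = hashlib.sha256(noun.encode("utf-8")).digest()
    seed = int.from_bytes(digest, "big")
    letters = list(noun)
    random.Random(seed).shuffle(letters)
    return "".join(letters)


# Two records with the same subject term still match after scrambling.
record_a = scramble("photographs")
record_b = scramble("photographs")
print(record_a == record_b)
```

Like the process described above, this is not cryptographically secure, and a shuffle can occasionally return a short word unchanged.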

Why CC0

Why use the Creative Commons Public Domain Dedication, rather than retaining copyright but allowing anyone to use the collection? Retaining copyright is common even in some of the most permissive open source software licenses. For some reason (that I can't remember/don't know), software licenses are not appropriate for content, and vice versa. Creative Commons Zero is the least restrictive content license, imposing no restrictions on the use of the systematically obscured content in the collection. Contributors' copyright in the original files is fully retained.

Acquisition/Submission Process

The collection of noun-obscured EAD specimen files is maintained in a revision control system repository (specifically, a git repository on GitHub: https://github.com/tingletech/ead-test-col ).

An EAD file "in the wild" is submitted to a specimen processor. The submitter asserts they have the right to submit the original specimen for the purpose of it being processed and included in the collection. The specimen processor then conducts the noun-obscuring "greeking" transformation procedure on the file and commits the transformed file to the GitHub repository or a fork. In the commit message, the specimen processor references the source of the original file.

Misc.

Original post to the EAD listserv about the project: http://bit.ly/rPV1hJ → http://listserv.loc.gov/cgi-bin/wa?A2=ind1112&L=ead&T=0&P=1437
