
The 'application alto xml' Media Type

Jukka Kervinen edited this page Mar 3, 2017 · 1 revision

The 'application/alto+xml' Media Type

Abstract

This document defines the 'application/alto+xml' media type for the standardized XML format ALTO (Analyzed Layout and Text Object). ALTO stores the layout information and OCR-recognized text of pages of any kind of printed document, such as books, journals, and newspapers.

Status of This Memo

This document is not an Internet Standards Track specification; it is published for informational purposes.

This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Not all documents approved by the IESG are a candidate for any level of Internet Standard; see Section 2 of RFC 5741.

Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at http://www.rfc-editor.org/info/rfc6129.

Copyright Notice

Copyright (c) 2011 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (http://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

1. Introduction

ALTO (Analyzed Layout and Text Object) is an international and interdisciplinary standard that is widely used by digital libraries and archive repositories to represent the physical layout information of documents and handle word positions [ALTO].

This document defines the 'application/alto+xml' media type in accordance with [RFC3023] in order to enable generic processing of such documents on the Internet using eXtensible Markup Language (XML) [W3C.REC-xml-20081126] technologies.

2. Recognizing ALTO Files

ALTO files are XML documents or fragments having the root element (as defined in [W3C.REC-xml-20081126]) in an ALTO namespace. ALTO namespace names are defined as Uniform Resource Identifiers (URIs) [RFC3986] in accordance with [W3C.REC-xml-names-20091208] and begin with http://www.loc.gov/standards/alto/ns followed by the version number of the namespace. The current namespace is http://www.loc.gov/standards/alto/ns-v3#

The only root element name for ALTO documents is: <alto>

The global structure of the ALTO file is as follows:

<alto>
 <Description>
   <MeasurementUnit/>
   <sourceImageInformation/>
   <Processing/>
 </Description>
 <Styles>
   <TextStyle/>
   <ParagraphStyle/>
 </Styles>
 <Layout>
   <Page>
     <TopMargin/>
     <LeftMargin/>
     <RightMargin/>
     <BottomMargin/>
     <PrintSpace/>
   </Page>
 </Layout>
</alto>

ALTO files are generally given the extension .xml.
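Because the '.xml' extension does not distinguish ALTO from other XML formats, a consumer has to inspect the root element. The following is a minimal Python sketch of the recognition rule described in this section, not a normative algorithm. The function name is_alto and the constant ALTO_NS_PREFIX are illustrative names chosen here, and the sketch assumes that matching the namespace prefix is sufficient for the caller's purposes.

```python
import xml.etree.ElementTree as ET

# Prefix shared by all ALTO namespace names, per Section 2.
ALTO_NS_PREFIX = "http://www.loc.gov/standards/alto/ns"


def is_alto(document):
    """Return True if the root element is <alto> in an ALTO namespace.

    `document` is the XML source as str or bytes. Input that is not
    well-formed XML is reported as "not ALTO" rather than raising.
    """
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        return False
    # ElementTree spells qualified names as "{namespace}localname".
    if not root.tag.startswith("{"):
        return False  # root element is in no namespace
    namespace, _, local_name = root.tag[1:].partition("}")
    return local_name == "alto" and namespace.startswith(ALTO_NS_PREFIX)
```

A caller might run this check before dispatching a file with a generic '.xml' extension to ALTO-specific processing.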

3. Fragment Identifier

Documents having the media type 'application/alto+xml' use the fragment identifier notation as specified in [RFC3023] for the media type 'application/xml'.

4. Security Considerations

An XML resource does not in itself compromise data security. When a resource is available on a network simply through the dereferencing of an Internationalized Resource Identifier (IRI) [RFC3987] or a URI, care must be taken to properly interpret the data to prevent unintended access. Hence the security issues of [RFC3986], Section 7, apply. In addition, as this media type uses the "+xml" convention, it shares the same security considerations as described in RFC 3023 [RFC3023], Section 10. In general, security issues related to the use of XML in IETF protocols are treated in RFC 3470 [RFC3470], Section 7. We will not try to duplicate this material, but review some aspects that are important for document-centric XML as applied to text encoding.

4.1. Harmful Content

Any application that accepts submitted ALTO XML, or retrieves it for processing, has to be aware of the risks connected with the injection of harmful scripts and executable XML. XML inclusion [W3C.REC-xinclude-20061115] and the use of external entities are vulnerable to various forms of spoofing, and can also reveal aspects of a service in a way that may compromise its security. Any vulnerabilities of these kinds are, however, application specific. The ALTO namespaces do not contain such elements.
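One conservative way for an application to avoid external-entity risks is to refuse any submitted document that carries a DOCTYPE declaration, since entities cannot be declared without one. The sketch below illustrates that policy with Python's standard expat binding. It is an assumption about one possible deployment choice, not a requirement of this registration, and the names UnsafeXMLError and reject_doctype_documents are hypothetical.

```python
import xml.parsers.expat


class UnsafeXMLError(ValueError):
    """Raised when a document uses a construct this policy refuses."""


def reject_doctype_documents(data):
    """Parse `data` (str or bytes) and refuse any DOCTYPE declaration.

    Without a DOCTYPE there can be no internal or external entity
    declarations, so entity-based attacks are ruled out up front.
    Input that is not well-formed raises xml.parsers.expat.ExpatError.
    """
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        raise UnsafeXMLError("DOCTYPE declarations are not accepted")

    parser.StartDoctypeDeclHandler = on_doctype
    # The second argument marks this as the final chunk of input.
    parser.Parse(data, True)
```

The check only validates the input. A document that passes can then be handed to whichever XML parser the application normally uses.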

4.2. Intellectual Property Rights

ALTO documents often arise in digitization of cultural heritage materials. Texts made accessible in ALTO format may be unrestricted in the sense that their distribution may be unlimited by Digital Rights Management [DRM] or Intellectual Property Rights [IPR] constraints. However, ALTO documents are heterogeneous. Some parts of a document may be unrestricted, whereas others, such as editorial text and annotations, may be subject to DRM restrictions.

The ALTO format provides means for highly granular attribution, down to the content of individual XML elements. Software agents participating in the exchange or processing of ALTO may be required to honour markup of this kind. Even when there are no IPR constraints, intellectual property attribution alone requires that document users be able to tell the difference between content from different sources.

4.3. Authenticity and Confidentiality

Historical archival records are often encoded in ALTO, and legal documents may be binding centuries after they were written. Digitization and encoding of legal texts may require technologies for assuring authenticity, such as cryptographic checksums and electronic signatures.
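As an illustration of the checksum approach mentioned above, the following Python sketch computes a SHA-256 digest of an ALTO file's bytes and compares it with a previously recorded value. The choice of SHA-256 and the function names are assumptions made for this example. A checksum alone detects accidental or unauthorized modification only if the recorded digest itself is stored or signed in a trustworthy way.

```python
import hashlib
import hmac


def sha256_hex(data):
    """Return the SHA-256 digest of `data` (bytes) as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def matches_checksum(data, expected_hex):
    """Return True if `data` hashes to the recorded digest.

    hmac.compare_digest performs a constant-time comparison, which
    avoids leaking how many leading characters matched.
    """
    return hmac.compare_digest(sha256_hex(data), expected_hex.lower())
```

An archive could record the digest at ingest time and call matches_checksum whenever the file is delivered or migrated.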

Similarly, historical documents may in part or in their entirety be confidential. This may be required by law or by the terms and conditions, such as in the case of donated or deposited text from private sources. A text archive may need content filtering or cryptographic technologies to meet such requirements.

5. IANA Considerations

5.1. Registration of MIME Type 'application/alto+xml'

  • MIME media type name: application
  • MIME subtype name: alto+xml
  • Required parameters: None
  • Optional parameters: charset. This parameter has identical semantics to the charset parameter of the "application/xml" media type as specified in RFC 3023 [RFC3023].
  • Encoding considerations: Identical to those for 'application/xml'. See RFC 3023 [RFC3023], Section 3.2.
  • Security considerations: See Security Considerations (Section 4) in this specification.
  • Interoperability considerations: ALTO documents are most often given the extension '.xml', which is not uncommon for other XML document formats.
  • Published specification: This media type registration is for ALTO documents [ALTO] as described here. ALTO syntax is defined in a schema [ALTOschema].
  • Applications which use this media type: IIIF (International Image Interoperability Framework, http://iiif.io) uses the alto+xml media type in its Presentation API.
  • Fragment Identifier Considerations: See Fragment Identifier Considerations (Section 3) in this specification.
  • Restriction on Usage: ?????? https://tools.ietf.org/html/rfc6838#section-4.9
  • Provisional Registrations: ???? https://tools.ietf.org/html/rfc6838#section-5.2.1
  • Additional information: Magic number(s): There is no single initial octet sequence that is always present in ALTO documents.
  • File extension(s): The common extension is '.xml'. See Recognizing ALTO Files (Section 2) in this specification.
  • Macintosh File Type Code(s): TEXT
  • Object Identifier(s) or OID(s): Not applicable
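To show how the registered type and its optional charset parameter might appear in practice, the sketch below builds and recognizes a Content-Type header value in Python. The helper names are hypothetical, and the parsing relies on the standard library's email.message.Message, which lowercases the type and subtype before comparison.

```python
from email.message import Message

ALTO_MEDIA_TYPE = "application/alto+xml"


def content_type_header(charset=None):
    """Build a Content-Type value, adding the optional charset parameter."""
    if charset is None:
        return ALTO_MEDIA_TYPE
    return f"{ALTO_MEDIA_TYPE}; charset={charset}"


def is_alto_media_type(header_value):
    """Return True if a Content-Type value names the ALTO media type.

    Parameters such as charset are ignored, and the comparison is
    case-insensitive because get_content_type() returns lowercase.
    """
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_type() == ALTO_MEDIA_TYPE
```

For example, a server might send content_type_header("utf-8") with an ALTO response, and a client might use is_alto_media_type on the received header before choosing a parser.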

6. References

6.1. Normative References

[RFC3023] Murata, M., St. Laurent, S., and D. Kohn, "XML Media Types", RFC 3023, January 2001.

[RFC3470] Hollenbeck, S., Rose, M., and L. Masinter, "Guidelines for the Use of Extensible Markup Language (XML) within IETF Protocols", BCP 70, RFC 3470, January 2003.

[RFC3986] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform Resource Identifier (URI): Generic Syntax", STD 66, RFC 3986, January 2005.

[RFC3987] Duerst, M. and M. Suignard, "Internationalized Resource Identifiers (IRIs)", RFC 3987, January 2005.

[ALTO] "ALTO Standard", http://www.loc.gov/standards/alto

[ALTOschema] http://www.loc.gov/standards/alto/news.html#3-1-released

[W3C.REC-xml-20081126] Paoli, J., Yergeau, F., Sperberg-McQueen, C., Maler, E., and T. Bray, "Extensible Markup Language (XML) 1.0 (Fifth Edition)", World Wide Web Consortium Recommendation REC-xml-20081126, November 2008, http://www.w3.org/TR/2008/REC-xml-20081126.

[W3C.REC-xml-names-20091208] Bray, T., Hollander, D., Layman, A., Tobin, R., and H. Thompson, "Namespaces in XML 1.0 (Third Edition)", World Wide Web Consortium Recommendation REC-xml-names-20091208, December 2009, http://www.w3.org/TR/2009/REC-xml-names-20091208.

6.2. Informative References

[DRM] "Digital rights management", <http://en.wikipedia.org/w/index.php?title=Digital_rights_management&oldid=412653591>.

[IPR] "Intellectual property", <http://en.wikipedia.org/w/index.php?title=Intellectual_property&oldid=411690322>.

[W3C.REC-xinclude-20061115] Marsh, J., Orchard, D., and D. Veillard, "XML Inclusions (XInclude) Version 1.0 (Second Edition)", World Wide Web Consortium Recommendation REC-xinclude-20061115, November 2006, http://www.w3.org/TR/2006/REC-xinclude-20061115.

Authors' Addresses