feat: add relaton-doi blog post, fixes #61
ronaldtse committed Mar 3, 2024
1 parent a5f7950 commit cb78429
377 changes: 377 additions & 0 deletions _posts/2022-12-28-relaton-doi.adoc
---
layout: post
title: Support for DOI bibliographic information auto-fetch through CrossRef
date: 2022-12-28
categories: relaton
authors:
  -
    name: Nick Nicholas
    email: [email protected]
    social_links:
      - https://github.com/opoudjis
  -
    name: Andrei Kislichenko
    email: [email protected]
    social_links:
      - https://github.com/andrew2net

excerpt: >-
  Relaton now supports auto-fetching Crossref bibliographic information via DOI
  identifiers.
---

== Introduction

=== Digital Object Identifier and the DOI system

Since 1997, the https://www.doi.org[Digital Object Identifier (DOI)] service has
been part of the essential infrastructure of digital publishing, registering and
managing a unique identifier (DOI identifier) for all publications.

The DOI identifier scheme was ratified as an ISO standard as ISO 26324, with the
actual implementation managed by the DOI Foundation, which delegates
registration matters to DOI registration agencies. DOI identifiers are lodged
with a DOI registration agency.

As publishers and agencies have embraced the DOI, these identifiers have become
ubiquitous, and have been adopted not only for most new publications, whether
electronic or hardcopy, but increasingly for past publications as well,
especially as those publications are reissued online.

One of the main advantages to a widely used identifier for publications is that
value-added services can use that identifier to provide information about the
publication.

=== Crossref

Among the wide range of online services built on the DOI identifier,
https://www.crossref.org[Crossref] is of particular interest for the
bibliographic data it collects and offers.

Crossref, now a DOI registration agency, was originally established by a
consortium of member publishers to facilitate stable cross-references
(hence, "cross-ref") and citations across academic journal articles.

It is therefore in Crossref's interest not only to register DOIs for
publications and use them in online publication, but also to associate
those DOIs with bibliographic information that can be used in citations.

As a result, Crossref has one of the largest online bibliographic databases in
existence, counting more than 141 million records today.

NOTE: It is worth noting that out of the millions of DOI identifiers,
*still only a fraction* of them have bibliographic information on Crossref. So
if Crossref doesn't return information for a particular DOI identifier, don't
fret.

=== Crossref bibliographic API

Crossref has made this database publicly available
in recent years through an
https://www.crossref.org/documentation/retrieve-metadata/rest-api/[API],
and is committed to making citation information openly available.

The Crossref API data is natively in a Crossref JSON Schema, but it can readily
be exported into popular bibliographic export formats.

In fact, a search on the Crossref home page for DOI `10.1515/9783110889406.257`
returns
https://search.crossref.org/?from_ui=yes&q=10.1515/9783110889406.257[on its landing page]
links not only to the
https://api.crossref.org/v1/works/10.1515/9783110889406.257[Native Crossref JSON record]
and the
https://www.degruyter.com/document/doi/10.1515/9783110889406.257/html[URI of the associated resource],
but also to exports of the JSON record (via JavaScript, as "Cite"), both in
machine-readable formats (BibTeX, RIS) and as human-readable renderings (APA,
Harvard, MLA, Vancouver, Chicago).


== Relaton integration with Crossref

Relaton-DOI (the https://github.com/relaton/relaton-doi/[relaton-doi] gem)
integrates the Crossref bibliographic database through the Crossref API, and
Crossref bibliographic records are now available to the Relaton ecosystem.

In the following examples, we show the method of access to a Crossref
bibliographic record given its DOI identifier, and how to use it in your chosen
environment.

We also outline some of the enhancements we have brought to Crossref data, and
some of the caveats to bear in mind while using it.


== Fetching Crossref data from Relaton CLI

As with all Relaton gems, references can be fetched from `relaton-doi` through the
https://github.com/relaton/relaton-cli/[Relaton CLI].

.Install the Relaton CLI (which gives access to Relaton-DOI) at the command line
[source,console]
----
$ gem install relaton-cli
----

.Fetching bibliographic data using a DOI identifier in Relaton XML format
[source,console]
----
$ relaton fetch doi:10.1515/9783110889406.257
# returns record for 10.1515/9783110889406.257 in Relaton XML
----

.Fetching bibliographic data using a DOI identifier in Relaton YAML format
[source,console]
----
$ relaton fetch doi:10.1515/9783110889406.257 -f yaml
# returns record for 10.1515/9783110889406.257 in Relaton YAML
----

.Fetching bibliographic data using a DOI identifier in BibTeX format
[source,console]
----
$ relaton fetch doi:10.1515/9783110889406.257 -f bibtex
# returns record for 10.1515/9783110889406.257 in BibTeX
----

As with any command line tool, the output of these commands can be redirected to
a text file and stored.

The command line tool consumes identifiers from a range of sources, and it
requires `doi:` to be prefixed to each DOI, so they can be recognized as such.

So, if you wish to build up a BibTeX database of bibliographic entries from a
list of DOIs, you might write a shell script that loops through the DOIs and
appends each fetched BibTeX entry to a file, like this:

.Bash script using Relaton to fetch Crossref data using DOI identifiers into a BibTeX file
[source,sh]
----
#!/bin/bash
# DOIs to fetch, listed without the "doi:" prefix (the loop adds it)
dois="10.1515/9783110889406.257 10.6028/nist.ir.8245 10.1215/9781478007609-047"
# Start from an empty output file
: > my.bibtex
for doi in $dois; do
  relaton fetch "doi:$doi" -f bibtex >> my.bibtex
done
----
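
Because the command line tool expects exactly one `doi:` prefix, it can help to
normalize a mixed list of DOIs before fetching them. The following is a minimal
Ruby sketch of that step, not part of Relaton: `doi_reference` is a hypothetical
helper, and the input forms it accepts (a bare DOI, a `doi:`-prefixed DOI, or a
`doi.org` URL) are assumptions made for illustration.

.Ruby sketch: normalizing DOIs to the `doi:`-prefixed form (hypothetical helper)
```ruby
# Hypothetical helper, not part of relaton-cli or relaton-doi.
# Returns the DOI with exactly one leading "doi:" prefix.
def doi_reference(raw)
  bare = raw.strip
            .sub(/\Adoi:/i, "")                        # drop an existing prefix
            .sub(%r{\Ahttps?://(dx\.)?doi\.org/}i, "") # drop a resolver URL
  "doi:#{bare}"
end

inputs = [
  "10.1515/9783110889406.257",
  "doi:10.1215/9781478007609-047",
  "http://dx.doi.org/10.5962/bhl.title.124254",
]

references = inputs.map { |raw| doi_reference(raw) }
puts references
# doi:10.1515/9783110889406.257
# doi:10.1215/9781478007609-047
# doi:10.5962/bhl.title.124254
```

Each resulting string can then be passed to `relaton fetch` unchanged.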

== Ruby

The `relaton-doi` gem can be used natively in your Ruby code, although this
presupposes that you are already coding within the Relaton Ruby ecosystem.

The objects returned are native Relaton objects, which will eventually need
to be exported to XML, YAML, or BibTeX.

.Ruby code that fetches Relaton bibliographic objects using DOI identifiers
[source,ruby]
----
> require 'relaton_doi'
=> true
# get NIST standard
> RelatonDoi::Crossref.get "doi:10.6028/nist.ir.8245"
[relaton-doi] ["doi:10.6028/nist.ir.8245"] fetching...
[relaton-doi] ["doi:10.6028/nist.ir.8245"] found 10.6028/nist.ir.8245
=> #<RelatonNist::NistBibliographicItem:0x00007ff22420d820
...
----


== Enhancements and caveats

=== Challenges with Crossref data

Crossref is an unprecedented boon to users of electronic bibliographies: it is
a centralised source offering access to a huge and growing number of
bibliographic records, with a universal and broadly discoverable identifier.

That said, there are downsides to the bibliographic metadata from Crossref:

* Quality of bibliographic data is highly variable;
* Metadata provided by Crossref is not normalized;
* Encoding of metadata is often inconsistent;
* Critical data is often missing.

These caveats stem from the fact that Crossref metadata is entered by the
publishers themselves, and no effort is made to normalize the data provided
by the different publishers.

We encourage users to use Crossref data as a first pass at populating
bibliographic information quickly and at scale, but we also recommend that users
review the bibliographic data fetched, and fill in any gaps they can identify.


== Relaton repairs Crossref data

Relaton-DOI implements a number of enhancement mechanisms on top of the Crossref
API data to remedy some of the clearly missing information.


=== Missing host items

Crossref does not normally include full details of the *host item* in the
fetched record -- the edited volume or proceedings that a paper may be included in.

The Crossref record will include the title of the volume the chapter or
paper is included in, as well as the publisher and place of publication.
However, it will not include the editors of the edited volume or proceedings.

So a search for `10.1515/9783110889406.257` natively returns, in BibTeX:

.Host item missing in Crossref bibliographic data
[source,bibtex]
----
@incollection{Heller,
doi = {10.1515/9783110889406.257},
url = {https://www.degruyter.com/document/doi/10.1515/9783110889406.257/html},
publisher = {{DE} {GRUYTER} {MOUTON}},
author = {Monica Heller},
title = {Gender and public space in a bilingual school},
booktitle = {Multilingualism, Second Language Learning, and Gender}
}
----

But the editors of the volume
_Multilingualism, Second Language Learning, and Gender_
are *not* included in the Crossref record. Relaton-DOI makes a best-effort
search, and will in most cases return the editors as well as the author.

So in the foregoing case, Relaton-DOI will look up the record for the edited
volume
_Multilingualism, Second Language Learning, and Gender_,
and the record it returns will include its editors.

.Host item restored to Crossref bibliographic data
[source,bibtex]
----
@inbook{heller-a,
title = {Gender and public space in a bilingual school},
author = {Heller, Monica},
editor = {Pavlenko, Aneta and Blackledge, Adrian and Piller, Ingrid and Teutsch-Dwyer, Marya},
booktitle = {Multilingualism, Second Language Learning, and Gender},
publisher = {DE GRUYTER MOUTON},
address = {Berlin},
timestamp = {2023-01-15},
doi = {10.1515/9783110889406.257},
url = {https://www.degruyter.com/document/doi/10.1515/9783110889406.257/html}
}
----

=== Stray delimiters in data fields

Some Crossref records, particularly from older bibliographic sources, have messy
data incorporating delimiters.

For example, the record for `10.5962/bhl.title.124254`, a publication from the
year 1852, includes trailing punctuation in its title, place of publication, and
publisher fields. These issues should not exist in the data.

.Crossref fields containing stray delimiters
[source,bibtex]
----
@book{kuster1852a,
title = {Die Gattungen Pupa, Megaspira, Balea und Tornatellina : in Abbildungen nach der Natur mit Beschreibungen /},
author = {H. C. Kuster and Johann Hieronymus Chemnitz and Friedrich Heinrich Wilhelm Martini},
publisher = {Verlag von Bauer und Raspe (Julius Merz),},
year = {1852},
address = {Nürnberg :},
doi = {http://dx.doi.org/10.5962/bhl.title.124254},
url = {http://www.biodiversitylibrary.org/bibliography/124254}
}
----

Relaton-DOI cleans up the fields to the extent reasonable:

.Crossref data in Relaton cleaned up of stray delimiters
[source,bibtex]
----
@book{kuster1852a,
title = {Die Gattungen Pupa, Megaspira, Balea und Tornatellina : in Abbildungen nach der Natur mit Beschreibungen},
author = {Kuster, H. C. and Chemnitz, Johann Hieronymus and Martini, Friedrich Heinrich Wilhelm},
publisher = {Verlag von Bauer und Raspe (Julius Merz)},
year = {1852},
address = {Nürnberg},
timestamp = {2023-02-02},
doi = {http://dx.doi.org/10.5962/bhl.title.124254},
url = {http://www.biodiversitylibrary.org/bibliography/124254}
}
----
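
A cleanup of this kind can be approximated by stripping trailing delimiter
characters from each field. The sketch below is illustrative only and is not
Relaton-DOI's actual code: the helper name `strip_trailing_delimiters` and the
set of characters treated as delimiters (whitespace, comma, colon, semicolon,
slash) are assumptions.

.Ruby sketch: stripping trailing delimiters from field values (hypothetical helper)
```ruby
# Hypothetical helper, not Relaton-DOI's actual implementation.
# Removes trailing whitespace, commas, colons, semicolons and slashes,
# leaving punctuation inside the value untouched.
def strip_trailing_delimiters(value)
  value.sub(/[\s,:;\/]+\z/, "")
end

raw = {
  title: "Die Gattungen Pupa, Megaspira, Balea und Tornatellina : " \
         "in Abbildungen nach der Natur mit Beschreibungen /",
  publisher: "Verlag von Bauer und Raspe (Julius Merz),",
  address: "Nürnberg :",
}

cleaned = raw.transform_values { |value| strip_trailing_delimiters(value) }
cleaned.each { |field, value| puts "#{field} = {#{value}}" }
```

Only the end of each value is touched, so the " : " inside the title survives,
as it does in the cleaned-up record above.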

=== Capitalization of names

On occasion, author and editor names appear in all capital letters.

Relaton-DOI will change these to *title case*, so long as the name is more than
two letters long.

You may need to review the results, to catch *camel case* exceptions such as
"MacDonald".
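
The rule described above can be approximated as follows. This is a minimal
sketch, not Relaton-DOI's actual code: `normalize_name_case` is a hypothetical
name, and applying the rule to each space- or hyphen-separated part of a name
is an assumption.

.Ruby sketch: title-casing all-capital names (hypothetical helper)
```ruby
# Hypothetical helper, not Relaton-DOI's actual implementation.
# Title-cases each all-capital name part that has more than two letters;
# shorter parts (such as initials) and mixed-case parts are left as they are.
def normalize_name_case(name)
  # The capturing group keeps the separators (spaces, hyphens) in the result.
  name.split(/(\s+|-)/).map do |part|
    letters = part.scan(/\p{Alpha}/).length
    letters > 2 && part == part.upcase ? part.capitalize : part
  end.join
end

puts normalize_name_case("MONICA HELLER")  # Monica Heller
puts normalize_name_case("TEUTSCH-DWYER")  # Teutsch-Dwyer
puts normalize_name_case("H. C. KUSTER")   # H. C. Kuster
puts normalize_name_case("MACDONALD")      # Macdonald
```

The last line shows why a manual review is still needed: the sketch has no way
of knowing that the intended form is "MacDonald".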


=== Uncorrectable and irrecoverable Crossref data (caveats)

There are, however, areas where Crossref is missing information, and where
nothing can be done but to emend the record after fetching it.

In particular:

==== Page numbers of journal articles missing

Crossref quite often omits the page numbers of a journal article, even though it
retains the volume and issue of the article.

NOTE: Page numbers are no longer essential for online access; but if a journal
is published in print, or even in a medium emulating print (PDF), page numbers
are still expected in citations.

==== Missing series information

Crossref usually does not include the series of a monograph in its data.

==== Undifferentiated volume numbering for restarted journals

Some journals restart their volume numbering while keeping the same title; in a
few cases, they do so multiple times.

Bibliographies indicate this where applicable, before the volume number ("New
Series", "3rd Series"). Crossref does not differentiate between different runs
(i.e. numberings) of journals, so that information cannot be included in any
citations.

==== Missing host item

The details of the host item may be impractical to retrieve, particularly if the
host item is a large reference work, containing many items each with their own
DOI.

==== Non-normalized place of publication

The place of publication is free text, and is not systematically broken down
into city, region, country.

==== Author names not broken down properly

Occasionally, a personal name is not broken down into forename and surname, but
is presented in its entirety as the surname.

==== Non-normalized organization names and their abbreviations

Records may use abbreviations of organisations instead of the full names, and
some records may mix the two in different fields (e.g. both "IEEE" and
"Institute of Electrical and Electronics Engineers").
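
When emending fetched records by hand, one option is to keep a small lookup
table of known variants. The sketch below illustrates that idea only; it is not
a Relaton feature, and both the helper `canonical_organization` and the alias
table are hypothetical.

.Ruby sketch: mapping organization name variants to one form (hypothetical helper)
```ruby
# Hypothetical helper for post-fetch review; not part of Relaton.
# Looks up a name (ignoring case and surrounding whitespace) in a table of
# known variants and returns the preferred form, or the trimmed input if
# the name is unknown.
def canonical_organization(name, aliases)
  aliases.fetch(name.strip.downcase, name.strip)
end

ieee = "Institute of Electrical and Electronics Engineers"
aliases = {
  "ieee" => ieee,
  ieee.downcase => ieee,
}

puts canonical_organization("IEEE", aliases)
puts canonical_organization(" institute of electrical and electronics engineers ", aliases)
puts canonical_organization("De Gruyter Mouton", aliases)
```

Names missing from the table pass through unchanged, so the table can be grown
gradually as variants are spotted during review.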

==== Missing publication identifiers

The identifiers of standards and reports are often not included in the record.


== Conclusion

Relaton now supports bibliographic information fetching using DOI identifiers
through the Crossref service.

Despite the inherent data deficiencies of the Crossref database, it is
nonetheless a boon for users who wish to retrieve citable information.

As the old adage has it, *something is better than nothing*. And to keep us
feeling warm and fuzzy, any improvement in Crossref metadata by the publishers
will benefit all Relaton users!
