-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
feat: add relaton-doi blog post, fixes #61
- Loading branch information
Showing
1 changed file
with
377 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,377 @@ | ||
--- | ||
layout: post | ||
title: Support for DOI bibliographic information auto-fetch through CrossRef | ||
date: 2022-12-28 | ||
categories: relaton | ||
authors: | ||
- | ||
name: Nick Nicholas | ||
email: [email protected] | ||
social_links: | ||
- https://github.com/opoudjis | ||
- | ||
name: Andrei Kislichenko | ||
email: [email protected] | ||
social_links: | ||
- https://github.com/andrew2net | ||
|
||
excerpt: >- | ||
Relaton now supports auto-fetching Crossref bibliographic information via DOI | ||
identifiers. | ||
--- | ||
|
||
== Introduction | ||
|
||
=== Digital Object Identifier and the DOI system | ||
|
||
Since 1997, the https://www.doi.org[Digital Object Identifier (DOI)] service has | ||
been part of the essential infrastructure of digital publishing, registering and | ||
managing a unique identifier (DOI identifier) for all publications. | ||
|
||
The DOI identifier scheme was ratified as an ISO standard as ISO 26324, with the | ||
actual implementation managed by the DOI Foundation, which delegates | ||
registration matters to DOI registration agencies. DOI identifiers are lodged | ||
with a DOI registration agency. | ||
|
||
As publishers and agencies have embraced the DOI, these identifiers have become | ||
ubiquitous, and have been adopted not only for most new publications, whether | ||
electronic or hardcopy, but increasingly for past publications as well, | ||
especially as those publications are reissued online. | ||
|
||
One of the main advantages to a widely used identifier for publications is that | ||
value-added services can use that identifier to provide information about the | ||
publication. | ||
|
||
=== Crossref | ||
|
||
Among a wide range of online services that operate on the DOI identifier, | ||
https://www.crossref.org[Crossref] is of particular interest in the data it | ||
collects and offers in the topic of bibliography. | ||
|
||
Crossref, now a DOI registration agency, was originally established by a | ||
consortium of member publishers intended to facilitate stable cross-references | ||
(hence, "cross-ref") and citations across academic journal articles. | ||
|
||
Therefore it is in the interests of Crossref, not only to register DOIs | ||
for publications, and use them in online publication, but also to associate | ||
those DOIs with bibliographic information that can be used in citations. | ||
|
||
As a result, Crossref has one of the largest online bibliographic databases in | ||
existence, counting more than 141 million records today. | ||
|
||
NOTE: It is worth noting that out of the millions of DOI identifiers, | ||
*still only a fraction* of them have bibliographic information on Crossref. So | ||
if Crossref doesn't return information for a particular DOI identifier, don't | ||
fret. | ||
|
||
=== Crossref bibliographic API | ||
|
||
Crossref has made this database publicly available | ||
in recent years through an | ||
https://www.crossref.org/documentation/retrieve-metadata/rest-api/[API], | ||
and is committed to making citation information openly available. | ||
|
||
The Crossref API data is natively in a Crossref JSON Schema, but it can readily | ||
be exported into popular bibliographic export formats. | ||
|
||
In fact, a search on the Crossref home page for DOI `10.1515/9783110889406.257` | ||
returns | ||
https://search.crossref.org/?from_ui=yes&q=10.1515/9783110889406.257[on its landing page] | ||
links not only to the | ||
https://api.crossref.org/v1/works/10.1515/9783110889406.257[Native Crossref JSON record] | ||
and the | ||
https://www.degruyter.com/document/doi/10.1515/9783110889406.257/html[URI of the associated resource], | ||
but also exports of the JSON record (via Javascript, as "Cite"), both | ||
machine-readable (BibTeX, RIS), and as human-readable renderings (APA, Harvard, | ||
MLA, Vancouver, Chicago). | ||
|
||
|
||
== Relaton integration with Crossref | ||
|
||
Relaton-DOI (the https://github.com/relaton/relaton-doi/[relaton-doi] gem) | ||
integrates the Crossref bibliographic database through the Crossref API, and | ||
Crossref bibliographic records are now available to the Relaton ecosystem. | ||
|
||
In the following examples, we show the method of access to a Crossref | ||
bibliographic record given its DOI identifier, and how to use it in your chosen | ||
environment. | ||
|
||
We also outline some of the enhancements we have brought to Crossref data, and | ||
some of the caveats to bear in mind while using it. | ||
|
||
|
||
== Fetching Crossref data from Relaton CLI | ||
|
||
Like all Relaton gems, a reference can be fetched from `relaton-doi` through the | ||
https://github.com/relaton/relaton-cli/[Relaton CLI]. | ||
|
||
.Install Relaton-DOI at the command line | ||
[source,console] | ||
---- | ||
$ gem install relaton-cli | ||
---- | ||
|
||
.Fetching bibliographic data using a DOI identifier in Relaton XML format | ||
[source,console] | ||
---- | ||
$ relaton fetch doi:10.1515/9783110889406.257 | ||
# returns record for 10.1515/9783110889406.257 in Relaton XML | ||
---- | ||
|
||
.Fetching bibliographic data using a DOI identifier in Relaton YAML format | ||
[source,console] | ||
---- | ||
$ relaton fetch doi:10.1515/9783110889406.257 -f yaml | ||
# returns record for 10.1515/9783110889406.257 in Relaton YAML | ||
---- | ||
|
||
.Fetching bibliographic data using a DOI identifier in BibTeX format | ||
[source,console] | ||
---- | ||
$ relaton fetch doi:10.1515/9783110889406.257 -f bibtex | ||
# returns record for 10.1515/9783110889406.257 in Bibtex | ||
---- | ||
|
||
As with any command line tool, the output of these commands can be redirected to | ||
a text file and stored. | ||
|
||
The command line tool consumes identifiers from a range of sources, and it | ||
requires `doi:` to be prefixed to each DOI, so they can be recognized as such. | ||
|
||
So, if you wish to build up a BibTeX database of bibliographic entries given some DOIs, you | ||
might write a shell script to loop through the DOIs, and write each BibTeX object fetched into a file, | ||
like this: | ||
|
||
.bash script using Relaton to fetch Crossref data using DOI identifiers into a BibTeX file | ||
[source,sh] | ||
---- | ||
#!/bin/bash | ||
StringVal="10.1515/9783110889406.257 10.6028/nist.ir.8245 doi:10.1215/9781478007609-047" | ||
echo "" > my.bibtex | ||
for doi in $StringVal; do | ||
relaton fetch doi:$doi -f bibtex >> my.bibtex | ||
done | ||
---- | ||
|
||
== Ruby | ||
|
||
The `relaton-doi` gem can be used natively in your Ruby code, although this | ||
presupposes that you are already coding within the relaton Ruby ecosystem. | ||
|
||
The received objects will be native Relaton objects, which will eventually need | ||
to be exported to XML, YAML, or BibTeX. | ||
|
||
.Ruby code that fetches Relaton bibliographic objects using DOI identifiers | ||
[source,ruby] | ||
---- | ||
> require 'relaton_doi' | ||
=> true | ||
# get NIST standard | ||
> RelatonDoi::Crossref.get "doi:10.6028/nist.ir.8245" | ||
[relaton-doi] ["doi:10.6028/nist.ir.8245"] fetching... | ||
[relaton-doi] ["doi:10.6028/nist.ir.8245"] found 10.6028/nist.ir.8245 | ||
=> #<RelatonNist::NistBibliographicItem:0x00007ff22420d820 | ||
... | ||
---- | ||
|
||
|
||
== Enhancements and caveats | ||
|
||
=== Challenges with Crossref data | ||
|
||
Crossref is an unprecedented boon to users of electronic bibliographies: it is | ||
centralised source offering access to a huge and growing number of bibliographic | ||
records, with a universal and broadly discoverable identifier. | ||
|
||
That said, there are downsides to the bibliographic metadata from Crossref: | ||
|
||
* Quality of bibliographic data is highly variable; | ||
* Metadata provided by Crossref is not normalized; | ||
* Encoding of metadata is often inconsistent; | ||
* Critical data is often missing. | ||
|
||
These caveats stem from the fact that Crossref metadata is entered by the | ||
publishers themselves, and there is no effort in normalizing data provided | ||
by the different publishers. | ||
|
||
We encourage users to use Crossref data as a first pass to populating | ||
bibliographic information quickly and at scale, but we also recommend that users | ||
review the bibliographic data fetched, and fill in any gaps they can identify. | ||
|
||
|
||
== Relaton repairs Crossref data | ||
|
||
Relaton-DOI implements a number of enhancement mechanisms on top of the Crossref | ||
API data to remedy some of the clearly missing information. | ||
|
||
|
||
=== Missing host items | ||
|
||
Crossref does not normally in the fetched record any details of the *host item* | ||
-- the edited volume or proceedings that a paper may be included in. | ||
|
||
So the Crossref record will include the title of the volume the chapter or | ||
paper is included in, as well as the publisher and place of publication. However, it will not include the editors | ||
of the edited volume or proceedings. | ||
|
||
So a search for `10.1515/9783110889406.257` natively returns, in BibteX: | ||
|
||
.Host item missing in Crossref bibliographic data | ||
[source,bibtex] | ||
---- | ||
@incollection{Heller, | ||
doi = {10.1515/9783110889406.257}, | ||
url = {https://www.degruyter.com/document/doi/10.1515/9783110889406.257/html}, | ||
publisher = {{DE} {GRUYTER} {MOUTON}}, | ||
author = {Monica Heller}, | ||
title = {Gender and public space in a bilingual school}, | ||
booktitle = {Multilingualism, Second Language Learning, and Gender} | ||
} | ||
---- | ||
|
||
But the editors of the volume | ||
_Multilingualism, Second Language Learning, and Gender_ | ||
are *not* included in the Crossref record. Relaton-DOI makes a best-effort | ||
search, and will in most cases return the editors as well as the author. | ||
|
||
So in the foregoing case, Relaton-DOI will look up the record for the edited | ||
volume | ||
_Multilingualism, Second Language Learning, and Gender_, | ||
and the record it returns will include its editors. | ||
|
||
.Host item restored to Crossref bibliographic data | ||
[source,bibtex] | ||
---- | ||
@inbook{heller-a, | ||
title = {Gender and public space in a bilingual school}, | ||
author = {Heller, Monica}, | ||
editor = {Pavlenko, Aneta and Blackledge, Adrian and Piller, Ingrid and Teutsch-Dwyer, Marya}, | ||
booktitle = {Multilingualism, Second Language Learning, and Gender}, | ||
publisher = {DE GRUYTER MOUTON}, | ||
address = {Berlin}, | ||
timestamp = {2023-01-15}, | ||
doi = {10.1515/9783110889406.257}, | ||
url = {https://www.degruyter.com/document/doi/10.1515/9783110889406.257/html} | ||
} | ||
---- | ||
|
||
=== Stray delimiters in data fields | ||
|
||
Some Crossref records, particularly from older bibliographic sources, have messy | ||
data incorporating delimiters. | ||
|
||
For example, the record for `10.5962/bhl.title.124254`, a publication from the | ||
year 1852, includes trailing punctuation in its title, place of publication, and | ||
publisher fields. These issues should not exist in the data. | ||
|
||
.Crossref fields containing stray delimiters | ||
[source,bibtex] | ||
---- | ||
@book{kuster1852a, | ||
title = {Die Gattungen Pupa, Megaspira, Balea und Tornatellina : in Abbildungen nach der Natur mit Beschreibungen /}, | ||
author = {H. C. Kuster and Johann Hieronymus Chemnitz and Friedrich Heinrich Wilhelm Martini}, | ||
publisher = {Verlag von Bauer und Raspe (Julius Merz),}, | ||
year = {1852}, | ||
address = {Nürnberg :}, | ||
doi = {http://dx.doi.org/10.5962/bhl.title.124254}, | ||
url = {http://www.biodiversitylibrary.org/bibliography/124254} | ||
} | ||
---- | ||
|
||
Relaton-DOI cleans up the fields to the extent reasonable: | ||
|
||
.Crossref data in Relaton cleaned up of stray delimiters | ||
[source,bibtex] | ||
---- | ||
@book{kuster1852a, | ||
title = {Die Gattungen Pupa, Megaspira, Balea und Tornatellina : in Abbildungen nach der Natur mit Beschreibungen}, | ||
author = {Kuster, H. C. and Chemnitz, Johann Hieronymus and Martini, Friedrich Heinrich Wilhelm}, | ||
publisher = {Verlag von Bauer und Raspe (Julius Merz)}, | ||
year = {1852}, | ||
address = {Nürnberg}, | ||
timestamp = {2023-02-02}, | ||
doi = {http://dx.doi.org/10.5962/bhl.title.124254}, | ||
url = {http://www.biodiversitylibrary.org/bibliography/124254} | ||
} | ||
---- | ||
|
||
=== Capitalization of names | ||
|
||
On occasion, author and editor names appear in all capital letters. | ||
|
||
Relaton-DOI will change these to *title case*, so long as the name is more than | ||
two letters long. | ||
|
||
You may need to review the results, to catch *camel case* exceptions such as | ||
"MacDonald". | ||
|
||
|
||
=== Uncorrectable and irrecoverable Crossref data (caveats) | ||
|
||
There are however areas where Crossref is missing information, and for which | ||
nothing can be done but to emend the record after fetching it. | ||
|
||
In particular: | ||
|
||
==== Page numbers of journal articles missing | ||
|
||
Crossref quite often omits the page numbers of a journal article, even though it | ||
retains the volume and issue of the article. | ||
|
||
NOTE: Page numbers are no longer essential for online access; but if a journal | ||
is published in print, or even in a medium emulating print (PDF), page numbers | ||
are still expected in citations. | ||
|
||
==== Missing series information | ||
|
||
Crossref usually does not include the series of a monograph in its data. | ||
|
||
==== Undifferentiated volume numbering for restarted journals | ||
|
||
Some journals restart their volume numbering while keeping the same title; in a | ||
few cases, they do so multiple times. | ||
|
||
Bibliographies indicate this where applicable, before the volume number ("New | ||
Series", "3rd Series"). Crossref does not differentiate between different runs | ||
(i.e. numberings) of journals, so that information cannot be included in any | ||
citations. | ||
|
||
==== Missing host item | ||
|
||
The details of the host item may be impractical to retrieve, particularly if the | ||
host item is a large reference work, containing many items each with their own | ||
DOI. | ||
|
||
==== Non-normalized place of publication | ||
|
||
The place of publication is free text, and is not systematically broken down | ||
into city, region, country. | ||
|
||
==== Author names not broken down properly | ||
|
||
Occasionally, a personal name is not broken down into forename and surname, but | ||
is presented in its entirety as the surname. | ||
|
||
==== Non-normalized organization names and their abbreviations | ||
|
||
Records may use abbreviations of organisations instead of the full names, and | ||
some records may mix the two in different fields (e.g. both "IEEE" and | ||
"Institute of Electric and Electronic Engineers"). | ||
|
||
==== Missing publication identifiers | ||
|
||
The identifiers of standards and reports are often not included in the record. | ||
|
||
|
||
== Conclusion | ||
|
||
Relaton now supports bibliographic information fetching using DOI identifiers | ||
through the Crossref service. | ||
|
||
Despite the inherent data deficiencies of the Crossref database, it is | ||
nonetheless a boon for users who wish to retrieve citable information. | ||
|
||
As well said in old adage, *something is better than nothing*. And to keep us | ||
feeling warm and fuzzy, any improvement in Crossref metadata by the publishers | ||
will benefit all Relaton users! |