Commit

Merge branch 'scripts'
Conflicts:
	README.md
	jats-to-mediawiki.xsl
wrought committed May 28, 2014
2 parents 89d1b21 + 6fb3135 commit a3eb042
Showing 6 changed files with 254 additions and 21 deletions.
4 changes: 4 additions & 0 deletions .gitignore
@@ -0,0 +1,4 @@
build/
env/
tmp/
*.pyc
74 changes: 56 additions & 18 deletions README.md
@@ -13,45 +13,83 @@ it to be tightly integrated with the [open-access-media-importer][5].

[5]: http://en.wikiversity.org/wiki/User:OpenScientist/Open_grant_writing/Wissenswert_2011/Documentation

-## Set up your environment
+## Usage

-The following command should work in a `bash` shell.
+The following commands should work in a `bash` shell.

### Clone this repository
```
# Clone this repository
git clone https://github.com/Klortho/JATS-to-Mediawiki.git
cd JATS-to-Mediawiki
```

### Usage Scripts

Choose a wrapper script to use the JATS-to-Mediawiki conversion library:

#### python
This Python script provides a robust and human-friendly interface, including streaming via stdin, stdout, and stderr. Article IDs can be passed to the script on stdin, listed one per line in an input file (`-i`), or given as arguments to the `-a` or `--articleids` flag.
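The ID-gathering behavior can be sketched roughly as follows — a Python 3 rewrite of what the (Python 2) script does, with illustrative names; the real script also decodes each ID to Unicode before de-duplicating:

```python
import argparse

def collect_article_ids(argv=None, stdin_lines=None):
    """Gather article IDs from the -a flag, an input file, or stdin,
    then de-duplicate them (mirroring jats-to-mediawiki.py)."""
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', '--infile', nargs='?',
                        type=argparse.FileType('r'), default=None)
    parser.add_argument('-a', '--articleids', nargs='+', default=None)
    args = parser.parse_args(argv)

    ids = []
    if args.articleids:                 # IDs given directly on the command line
        ids.extend(args.articleids)
    if args.infile:                     # IDs listed one per line in a file
        ids.extend(line.strip() for line in args.infile)
    elif stdin_lines:                   # or piped in on stdin
        ids.extend(line.strip() for line in stdin_lines)
    return sorted(set(ids))            # de-duplicate via a set
```

Passing the same ID both ways, e.g. `collect_article_ids(['-a', 'PMC2873961', 'PMC123'], stdin_lines=['PMC123\n'])`, yields each ID once.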

##### Setup

```
virtualenv env/
source env/bin/activate
pip install -r requirements.txt
```

##### Run

-# Make sure you have xsltproc
-which xsltproc # should return with the location of your xsltproc command
+For command-line usage, run the script with the `--help` flag:

```
python jats-to-mediawiki.py --help
```

#### bash
[Incomplete] This is the beginning of a bash script to provide a minimal interface.

`bash jats-to-mediawiki.sh`

#### other scripts
Fork this repository to add new scripts, then submit a 'pull request'.


### Manual

#### Set up environment
```
# Check for xsltproc; abort if it is not installed
command -v xsltproc >/dev/null 2>&1 || { echo >&2 "xsltproc is required but not installed. Aborting."; exit 1; }
# Set up XML catalog file
export XML_CATALOG_FILES=`pwd`/dtd/catalog-test-jats-v1.xml
```

#### (Optional) Check the JATS dtd Version

Run this command to display the modified date of the DTD archive:

```
wget -q -O - http://ftp.ncbi.nlm.nih.gov/pub/jats/archiving/1.0/ | grep "jats-archiving-dtd-1.0.zip"
```

If the modified date is after "12-Oct-2012 08:36", replace the contents of `dtd/` (and submit an issue to [this repository](https://github.com/Klortho/JATS-to-Mediawiki/issues/new) so the bundled copy can be updated):

-# (Optional) Check the JATS dtd Version
-wget http://ftp.ncbi.nlm.nih.gov/pub/jats/archiving/1.0/ | grep "jats-archiving-dtd-1.0.zip"
-# if date modified is after "10/11/12 5:00:00 pm" then,
```
rm -rf dtd/*
cd dtd
wget ftp://ftp.ncbi.nlm.nih.gov/pub/jats/archiving/1.0/jats-archiving-dtd-1.0.zip
unzip *.zip
```

## Convert an article

### Automatic

Driver scripts for the JATS-to-Mediawiki conversion library are forthcoming, to be listed here.

### Manual

#### Convert an Article
The following are manual instructions for converting a single article, given its DOI.
It would be fairly easy to script, if you want.

First, you need to find the PMCID for the article. If you have the DOI (for example,
`10.1371/journal.pone.0010676`) the easiest way to do this is with the [PMC ID converter
API](http://www.ncbi.nlm.nih.gov/pmc/tools/id-converter-api/). Point your browser at
-[http://www.pubmedcentral.nih.gov/utils/idconv/v1.0/?ids=10.1371/journal.pone.0010676&format=json](),
+[http://www.pubmedcentral.nih.gov/utils/idconv/v1.0/?ids=10.1371/journal.pone.0010676&format=json](http://www.pubmedcentral.nih.gov/utils/idconv/v1.0/?ids=10.1371/journal.pone.0010676&format=json),
and make a note of the `pmcid` value (in this example, `PMC2873961`).
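That lookup can be sketched in Python 3 without a network call. The `example` string below mimics the JSON response shape returned by the converter (an assumption here, abridged to the fields used):

```python
import json
from urllib.parse import urlencode

IDCONV = 'http://www.pubmedcentral.nih.gov/utils/idconv/v1.0/'

def idconv_url(doi):
    """Build the PMC ID converter query URL for a single DOI."""
    return IDCONV + '?' + urlencode({'ids': doi, 'format': 'json'})

def pmcid_from_response(body):
    """Pull the pmcid out of the converter's JSON response."""
    records = json.loads(body).get('records', [])
    return records[0]['pmcid'] if records else None

# Hypothetical example response, abridged to the fields used here:
example = '{"records": [{"doi": "10.1371/journal.pone.0010676", "pmcid": "PMC2873961"}]}'
```

For the example DOI, `pmcid_from_response(example)` returns `PMC2873961`, matching the value you would note in the browser.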

Next, find the location of the gzip archive file for this article, using the [PMC OA web
174 changes: 174 additions & 0 deletions jats-to-mediawiki.py
@@ -0,0 +1,174 @@
import sys, os, traceback, re
import argparse
import requests
from bs4 import BeautifulSoup
import wget
import urllib
import tarfile
import subprocess
import glob

'''
Helper functions
'''

# Quote a string for safe use in a shell command.
# Single-quoting covers parentheses and other metacharacters;
# embedded single quotes are closed, escaped, and reopened.
def shellquote(s):
    return "'" + s.replace("'", "'\\''") + "'"

# Unicode handling
# (decode to unicode early, use unicode everywhere, encode late to string such as when
# writing to disk or print)

# Use this function to decode early
def to_unicode_or_bust( obj, encoding='utf-8-sig'):
if isinstance(obj, basestring):
if not isinstance(obj, unicode):
obj = unicode(obj, encoding)
return obj
# use .encode('utf-8') to encode late

'''
Main function
'''

def main():
try:

# parse command line options
try:
# standard flags
parser = argparse.ArgumentParser(description='Command-line interface to jats-to-mediawiki.xslt, a script to manage conversion of articles (documents) from JATS xml format to MediaWiki markup, based on DOI or PMCID')
parser.add_argument('-t', '--tmpdir', default='tmp/', help='path to temporary directory for purposes of this script')
parser.add_argument('-x', '--xmlcatalogfiles',
default='dtd/catalog-test-jats-v1.xml', help='path to xml catalog files for xsltproc')

# includes arbitrarily long list of keywords, or an input file
parser.add_argument('-i', '--infile', nargs='?', type=argparse.FileType('r'), default=sys.stdin, help='path to input file', required=False)
parser.add_argument('-o', '--outfile', nargs='?', type=argparse.FileType('w'), default=sys.stdout, help='path to output file', required=False)
parser.add_argument('-a', '--articleids', nargs='+', default=None, help='an article ID or article IDs, either as DOIs or PMCIDs')

args = parser.parse_args()

# print args #debug

except:
print 'Unable to parse options, use the --help flag for usage information'
sys.exit(-1)

# Handle and convert input values
tmpdir = args.tmpdir
xmlcatalogfiles = args.xmlcatalogfiles
infile = args.infile
outfile = args.outfile
articleids = []
# add articleids if passed as option values
if args.articleids:
articleids.extend([to_unicode_or_bust(articleid) for articleid in args.articleids])
# add articleids from file or STDIN
if not sys.stdin.isatty() or infile.name != "<stdin>":
articleids.extend([to_unicode_or_bust(line.strip()) for line in infile.readlines()])
# De-duplicate by converting to set (unique) then back to list again
articleids = list(set(articleids))

# set environment variable for xsltproc and jats dtd
try:
cwd = to_unicode_or_bust(os.getcwd())
os.environ["XML_CATALOG_FILES"] = cwd + to_unicode_or_bust("/") + to_unicode_or_bust(xmlcatalogfiles)
except:
print 'Unable to set XML_CATALOG_FILES environment variable'
sys.exit(-1)

# create temporary directory for zips
tmpdir = cwd + "/" + to_unicode_or_bust(tmpdir)
try:
if not os.path.exists(tmpdir):
os.makedirs(tmpdir)
except:
print 'Unable to find or create temporary directory'
sys.exit(-1)
# print "\n" + os.environ.get('XML_CATALOG_FILES') + "\n" #debug

# separate DOIs and PMCIDs
articledois = [i for i in articleids if re.match(r'^10\.', i)]  # DOIs begin with "10."
articlepmcids = [i for i in articleids if re.match('^PMC', i)]

articlepmcidsfromdois = []

# Send DOIs through PMC ID converter API:
# http://www.ncbi.nlm.nih.gov/pmc/tools/id-converter-api/
if articledois:

articledois = ",".join(articledois)
idpayload = {'ids' : articledois, 'format' : 'json'}
idconverter = requests.get('http://www.pubmedcentral.nih.gov/utils/idconv/v1.0/', params=idpayload)
print idconverter.text
records = idconverter.json()['records']
if records:
articlepmcidsfromdois = [i['pmcid'] for i in records]

# Extend PMCIDs with those from converted DOIs
articlepmcids.extend(articlepmcidsfromdois)

# De-duplicate with set to list conversion
articlepmcids = list(set(articlepmcids))

print "\nArticle IDs to convert:\n" #debug
print articlepmcids #debug

# Main loop to grab the archive file, get the .nxml file, and convert
for articlepmcid in articlepmcids:

# @TODO make flag an alternative to .tar.gz archive download
# use instead the regular API for xml document
# http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&id=PMC2953622
# unclear if this front-facing XML is updated frequently
# I recall from plos that updates are made via packaged archives

# request archive file location
archivefilepayload = {'id' : articlepmcid}
archivefilelocator = requests.get('http://www.pubmedcentral.nih.gov/utils/oa/oa.fcgi', params=archivefilepayload)
record = BeautifulSoup(archivefilelocator.content)

# parse response for archive file location
archivefileurl = record.oa.records.record.find(format='tgz')['href']

# download the file
print "\nDownloading file..."
archivefilename = wget.filename_from_url(archivefileurl)
urllib.urlretrieve(archivefileurl, archivefilename)

# @TODO For some reason, wget hangs and doesn't finish, using
# urllib.urlretrieve() instead for this for now.
# archivefile = wget.download(archivefileurl, wget.bar_thermometer)

# open the archive
archivedirectoryname, archivefileextension = archivefilename.split('.tar.gz')
print archivedirectoryname
tfile = tarfile.open(archivefilename, 'r:gz')
tfile.extractall('.')


# run xsltproc
# @TODO use list comprehension instead
for n in glob.glob(archivedirectoryname + "/*.nxml"):
nxmlfilepath = n
print "\nConverting... "
print nxmlfilepath
xsltcommand = "xsltproc jats-to-mediawiki.xsl " + shellquote(cwd + "/" + nxmlfilepath) + " > " + articlepmcid + ".xml.mw"
xsltprocess = subprocess.Popen(xsltcommand, stdout=subprocess.PIPE, shell=True)
print "\nReturning results..."
(output, err) = xsltprocess.communicate()
if output:
print "\nXSLT output..."
print output


except KeyboardInterrupt:
print "Killed script with keyboard interrupt, exiting..."
except Exception:
traceback.print_exc(file=sys.stdout)
sys.exit(0)

if __name__ == "__main__":
main()
11 changes: 11 additions & 0 deletions jats-to-mediawiki.sh
@@ -0,0 +1,11 @@
#!/bin/bash

#####
# Provides simple bash interface to jats-to-mediawiki.xslt
#####

# Make sure you have xsltproc
command -v xsltproc >/dev/null 2>&1 || { echo >&2 "xsltproc is required but not installed. Aborting."; exit 1; }

# Set up XML catalog file
export XML_CATALOG_FILES=`pwd`/dtd/catalog-test-jats-v1.xml
6 changes: 3 additions & 3 deletions jats-to-mediawiki.xsl
@@ -4,7 +4,8 @@
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:ex="http://exslt.org/dates-and-times"
xmlns:str="http://exslt.org/strings"
xmlns:mml="http://www.w3.org/1998/Math/MathML"
xmlns:mml="http://www.w3.org/1998/Math/MathML"
xmlns="http://www.mediawiki.org/xml/export-0.8/"
extension-element-prefixes="ex str"
version="1.0">

@@ -38,7 +39,6 @@

<!-- Start MediaWiki document -->
<xsl:element name="mediawiki">
-<xsl:attribute name="xmlns">http://www.mediawiki.org/xml/export-0.8/</xsl:attribute>
<xsl:attribute name="xsi:schemaLocation">http://www.mediawiki.org/xml/export-0.8/ http://www.mediawiki.org/xml/export-0.8.xsd</xsl:attribute>
<xsl:attribute name="version">0.8</xsl:attribute>
<xsl:attribute name="xml:lang"><xsl:value-of select="/article/@xml:lang"/></xsl:attribute>
@@ -280,7 +280,7 @@
<xsl:template match="break">&amp;#xA;</xsl:template>

<xsl:template match="underline">
-<span style="text-decoration: underline;"><xsl:apply-templates/></span>
+&amp;lt;span style="text-decoration: underline;"&amp;gt;<xsl:apply-templates/>&amp;lt;/span&amp;gt;
</xsl:template>
<xsl:template match="underline-start">
<!-- double-escape the entity refs so the resulting XML contains '&lt;' instead of '<' and therefore remains well-formed -->
6 changes: 6 additions & 0 deletions requirements.txt
@@ -0,0 +1,6 @@
argparse==1.2.1
beautifulsoup4==4.3.2
distribute==0.6.24
requests==2.2.1
wget==2.0
wsgiref==0.1.2
