Bot Specification
Matt Senate edited this page Jun 13, 2014
This is a simple specification (spec) for the design and implementation of our new OA-signal bot. This spec is intended to remain more or less human readable. Key terms should be defined or easily checked via Wikipedia, etc.
## Flow
- A DOI is cited somewhere on Wikipedia, or a manual request to update a DOI citation is made (see Producers below).
- We get the full-text and media from Pubmed Central (PMC).
- We use the JATS-to-MediaWiki conversion library to convert article XML to wikitext.
- We upload full wikitext to Wikisource.
- We upload images and other media files (limited to accepted file types) to Wikimedia Commons.
- We start a Wikidata item with article metadata and suitable statements.
- Lastly, we signal availability of the Wikisource, Wikimedia Commons, and Wikidata materials in references cited elsewhere on Wikimedia projects, starting with the English Wikipedia.
  - Using a template that looks like this mockup
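The flow above can be sketched as a single pass over one cited DOI. This is a minimal, hypothetical outline: every function is a stub standing in for a real step, and the names are illustrative, not the project's actual API.

```python
# A minimal, hypothetical sketch of the flow for one cited DOI.
# Every function is a stub standing in for a real step; names are
# illustrative, not the project's actual API.

def fetch_from_pmc(doi):
    # Stand-in for the PMC full-text + media download.
    return "<article/>", ["figure1.jpg"]

def jats_to_mediawiki(xml):
    # Stand-in for the JATS-to-MediaWiki XSLT conversion.
    return "== Converted article =="

def process_doi(doi):
    """Run the signalling flow for one cited DOI."""
    xml, media = fetch_from_pmc(doi)
    wikitext = jats_to_mediawiki(xml)
    # The remaining steps (Wikisource, Commons, and Wikidata uploads,
    # then reference signalling) would follow the same pattern.
    return {"doi": doi, "wikitext": wikitext, "media": media}

result = process_doi("10.1371/journal.pone.0000000")
```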
We're making some assumptions here about what tools we're going to use:
- PubMedCentral API - archive of article source files and meta-data (making use of existing OAMI code as appropriate).
- CrossRef API - article license data by DOI.
- Linux (Unix-like) server
- Python programming language
- Virtual Environment (`virtualenv`) for managing packages with `pip` and a `requirements.txt` file
- Modular development (`python` style)
- Object-oriented development
- Multi-threading inside a single python process.
- Core python data structures (`shelve`, `pickle`, etc.)
  - If we have any trouble with `shelve`, an alternative is to use `mongodb` instead
- `deque` python core module for queue system (enables working on both ends of the stack)
- Publisher/Subscriber (Pub/Sub) paradigm (internally referred to as Producer/Consumer for clarity)
- `PyWikiBot` for MediaWiki (Wikimedia project) interface
- Other various appropriate libraries (`python` modules), specified in the `requirements.txt` file
- `JATS-to-MediaWiki` XSLT converter
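As a sketch of the `shelve` assumption above: it pickles values into an on-disk dbm file, so simple per-article state can survive between bot runs without a database. The path and keys here are made up for illustration.

```python
import os
import shelve
import tempfile

# Hypothetical on-disk state store keyed by DOI; shelve pickles the
# values into a dbm file, so state survives between bot runs.
path = os.path.join(tempfile.mkdtemp(), "bot-state")

with shelve.open(path) as state:
    state["10.1371/journal.pone.0000000"] = {"status": "queued"}

# Re-open to show the value persisted.
with shelve.open(path) as state:
    status = state["10.1371/journal.pone.0000000"]["status"]
```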
For the application itself, we'll use the following layers of abstraction:
- **Data**
  - Store as plain text or in-memory (`shelve`, `pickle`, `deque`)
- **Logging and Error-handling**
  - Log useful, fully specified messages.
  - Handle errors gracefully with try/except, timeouts, max attempts, etc.
- **Queue**
  - Use a "Double-ended Queue" (`deque`)
  - The Queue manages the stack of Articles (merely by some ID) to be handled by the application.
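The double-ended queue is what lets normal work enter at one end while urgent requests jump to the other. A quick illustration, with made-up article IDs:

```python
from collections import deque

# The double-ended queue lets normal work enter at the right while
# urgent "jump the queue" requests enter at the left; the article IDs
# here are made up.
queue = deque()

queue.append("PMC111")       # producer: normal stream
queue.append("PMC222")
queue.appendleft("PMC999")   # producer: user-requested jump

first = queue.popleft()      # consumer always takes from the left
```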
- **Producers**
  - Run multiple threads to handle input streams, feed into the Queue
  - Primary stream - "Listen for New Citations", probably by making a regular, narrow WMFLabs SQL-replica query for:
    - New uses of the {{cite doi}} template, and perhaps other new citations of DOIs
    - Updates to existing uses of the {{cite doi}} template.
    - Probably on this sub-stream: https://en.wikipedia.org/wiki/Special:RecentChangesLinked/Category:Cite_doi_templates
  - Secondary stream - "Jump the Queue" by user-submitted `POST` request (or similar), e.g. through an on-wiki web form requesting a pass over a particular citation.
    - Best practice at this point may be to run a simple `REST` API (such as with `flask-shelve`), host a simple `html` form for submission, and include the form in a Wiki page using an `<iframe>` or similar.
    - Alternatively, if we move to `mongodb` instead of `shelve`, we may want to use the REST API framework `eve`.
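A minimal sketch of the producer side, assuming one shared `deque` (the poll function and article IDs are hypothetical stand-ins for the SQL-replica query):

```python
import threading
from collections import deque

# deque.append and deque.appendleft are atomic, so producer threads
# can share one queue without an explicit lock.
queue = deque()

def poll_recent_changes():
    # Stand-in for the WMFLabs SQL-replica query; the IDs are made up.
    return ["PMC123", "PMC456"]

def primary_producer():
    for article_id in poll_recent_changes():
        queue.append(article_id)        # normal stream: right end

def jump_the_queue(article_id):
    queue.appendleft(article_id)        # user request: front of queue

t = threading.Thread(target=primary_producer)
t.start()
t.join()
jump_the_queue("PMC999")
```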
- **Consumers**
  - Run multiple threads to source from the Queue and handle output streams (only one planned for now).
  - Primary stream - "Publish Article Reference, Source Content, and Meta-data" to MediaWiki instances (namely WikiSource, Wikipedia, etc.). Requires the following distinct functions:
    - **Download**
      - Use the JATS-to-MediaWiki handler script, port to an internal class.
    - **Convert**
      - Use the JATS-to-MediaWiki converter, port the handler script to an internal class.
    - **Upload**
      - Use the OAMI bot (or a custom fork) to upload media to Commons.
      - Upload to WikiSource and extend the upload script by @notconfusing
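The consumer side ties together the Queue and the Logging/Error-handling layers. A sketch under the same assumptions as above (`publish` is a hypothetical stand-in for the download/convert/upload sequence):

```python
import logging
from collections import deque

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("oa-bot")

queue = deque(["PMC123", "PMC456"])
MAX_ATTEMPTS = 3

def publish(article_id):
    # Stand-in for download -> convert -> upload of one article.
    return "published " + article_id

def consume():
    results = []
    while queue:
        article_id = queue.popleft()
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                results.append(publish(article_id))
                break                    # success: stop retrying
            except Exception:
                log.exception("attempt %d failed for %s",
                              attempt, article_id)
    return results

results = consume()
```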
- The WikiProject Open Access Signalling team is funded to maintain the bot through September 2014. Thereafter our commitment to Open Access will drive us to maintain the bot in an unofficial capacity. We will also do our best to document the bot and make it easier for other volunteer developers to maintain it.