rss_parse is a module for Python 3.4.2 or newer. It takes an RSS feed URL and a dictionary object that contains xpaths to the relevent data as input, fetchs the RSS feed data, parses it, and returns it as an iterable object where each element contains the following details from each <item>
in the RSS feed: title, body, url, publication date, and image resource URL.
from rss_parse import RSSParser
rss_url = 'http://www.jpl.nasa.gov/multimedia/rss/news.xml'
xpath_configuration = { 'xpathParse': {
'stripHTML': True,
'item': '/rss/channel/item',
'namespace': {'re': 'http://exslt.org/regular-expressions'},
'title': './/title/text()',
'url': './/link/text()',
'body': './/description/text()',
'date': './/pubDate/text()',
'image': '((re:match(.//description/text(), '
'\'www.jpl.nasa.gov/images/[^\\">]+\', '
"'g')/text()) | /rss/channel/image/url/text())[1]"
}}
parsed_feed = RSSParser(rss_url, xpath_configuration)
print(parsed_feed[0].title)
rss_parse.RSSParser uses XPaths to identify the various parts of a news article in an RSS feed. XPaths are an entire separate topic not covered in this documentation. However, you can generally think of them as being like a directory structure where the first item in the path encapsulates the subsequent items. So given the XML Hi!, the XPath /foo/bar/baz2 would point us at the data in the baz2 item and /foo/bar/baz2/text() would give us just the text Hi!
NOTE: Except for the XPath for the
item
key, all XPaths are relative to the<item>
tag.
####In top-down ordering, we see the following:
Value: The value is a dictionary containing the following key:value pairs.
Value: This will either be true
or false
depending on if the RSS feed has undesired HTML content in the main body (description/summary) text. Generally it's a good idea to simply set this to true
. However, some RSS feeds, such as Google News, add links to recommended stories. Stripping HTML in those cases can make the summary text confusing to read. A future version of xkcd_news will have an additional option to fine-tune what content should be stripped from the feed.
Value: This is a fully specified XPath to news items (headlines/articles) in the feed. Generally, this will never need to be changed. The exception might be for Atom feeds wich use a slightly different specification that is similar to RSS.
Value: Namespaces are a part of XML and deserve their own section that won't be covered here. In rss_parse, they're generally used to help specify the XPath to an image associated with a specific news item in the RSS feed. If you are unsure what to use here, simply leave the value as an empty dictionary (e.g. {}
)
Value: This value is a relative XPath where the specific item in the XPath /rss/channel/item
is handled for you. This is the effectively the headline of the news article. It is unlikely you will need to change this.
Value: This is the relative XPath that specifies a link to the full news article. It is unlikely you will need to change this.
Value: This is the relative XPath that specifies the summary/description text of the news article. It is unlikely you will need to change this.
Value: This is the relative XPath that specifies the publication date of the news article. It is unlikely you will need to change this. This date value determines the order of the final output.
Value: An image is not part of the default RSS specification. The result is that this value will likely need to be changed for any given RSS feed. In the example, we use the re
namespace to use a regular expression to parse the image URL from the body
content. See the xkcd_news project for additional examples.
The output from creating the RSSParser can be treated as a list. Each item in that list contains the values retreived by the associated XPaths (as described above). To build on the above example, we could do the following with the parsed_feed variable.
for item in parsed_feed:
print(item.url) # the URL to the specific <item> in the RSS feed. (e.g. a link to a news story)
print(item.title) # the title of the <item> (e.g. the headline of a news article)
print(item.body) # the main body text of the <item> (e.g. the summary text of a news article)
print(item.date) # the date the <item> was added or updated in the RSS feed (e.g. the publication date of a news article)
print(item.image) # the URL to an image associated with <item>. This is sometimes None. (e.g. the logo of a news service)
NOTE: You must convert these into a Python dictionary before passing them to RSSParser(). The below is for formatting reference.
xpathParse:
stripHTML: true
item: '/rss/channel/item'
namespace:
re: http://exslt.org/regular-expressions
title: .//title/text()
url: .//link/text()
body: .//description/text()
date: .//pubDate/text()
image: ((re:match(.//description/text(), 'www.jpl.nasa.gov/images/[^\">]+', 'g')/text()) | /rss/channel/image/url/text())[1]
{
"xpathParse": {
"item": "/rss/channel/item",
"url": ".//link/text()",
"body": ".//description/text()",
"date": ".//pubDate/text()",
"stripHTML": true,
"namespace": {
"re": "http://exslt.org/regular-expressions"
},
"title": ".//title/text()",
"image": "((re:match(.//description/text(), 'www.jpl.nasa.gov/images/[^\\\">]+', 'g')/text()) | /rss/channel/image/url/text())[1]"
}
}