Media Citation Crawler and Tree Generator #8
Oh, and the SO and I thought of the name Argus--after Argus Panoptes, a many-eyed giant from Greek mythology who observed those he watched with the utmost scrutiny.
I realize that four doesn't seem like many, but apparently the count varies.
FWIW, I love this idea. You should be aware of a great Python 3 library called newspaper that is really good at extracting the text from news site pages. I've used it on occasion. Also, if you happen to be focusing on the Washington Post (at least as part of this), you should be aware that the Post has what amounts to an undocumented article metadata API: if you take a typical article and change the URL, you can get structured metadata back. It used to have full text, which is insane, but I think me publicizing that caused them to drop it.
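A minimal sketch of pulling article text with newspaper, assuming the library's usual `Article` workflow (the URL here is just a placeholder):

```python
from newspaper import Article

# Placeholder URL; swap in any article you want to extract
url = "https://www.washingtonpost.com/politics/some-article/"

article = Article(url)
article.download()   # fetch the raw HTML
article.parse()      # extract title, authors, body text, etc.

print(article.title)
print(article.authors)
print(article.text[:500])  # first 500 chars of the extracted body
```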
This is a great idea and I'd love to contribute. By the way, you should be aware that Argus is also the name of a company that collects and sells US credit card usage information, so it's not necessarily the most appropriate name 😊
Just to lend some inspiration from the internal-displacement project: it could also be interesting and useful to use some NLP to extract text that isn't a hyperlink but does reference some outside source, and then see if there is a matching source in your database. Maybe that's a bit of a phase 2 idea.
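One hedged sketch of that phase-2 idea, using named-entity recognition (spaCy is just one option; the `known_sources` set is a hypothetical stand-in for the real source database):

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

# Hypothetical table of sources we already track
known_sources = {"Washington Post", "Reuters", "ProPublica"}

text = ("According to a Reuters report, officials confirmed the "
        "figures first published by the Washington Post.")

doc = nlp(text)
# Keep ORG entities that match a source we already know about
matches = [ent.text for ent in doc.ents
           if ent.label_ == "ORG" and ent.text in known_sources]
print(matches)  # e.g. ['Reuters', 'Washington Post']
```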
That's an interesting thought and probably not too difficult to implement.
@dwillis - I was told about newspaper, and it'll definitely be great to take a closer look at that WaPo data too. I haven't considered leveraging article metadata, and a combination of sites which publish it well and …

@amtias Aww, well there's nothing wrong with …

@georgerichardson - That would be awesome! I'm fairly new to NLP myself, so I'm looking forward to what can be learned from the really successful projects like internal-displacement.

I'm really happy to see that other developers here like the idea!! I'm going to ride the motivation you're all giving me and try to get a GitHub project and a more delineated working plan going these next couple days so we can get everyone who wants to be involved.
Seems like after the initial enthusiasm, life caught up with everyone and it got kinda silent around here :) If we want to get this done, we might need to break it down into clear tasks that we can divvy up between interested members.
Hey @amtias! That's an on-point assumption. I really underestimated how busy my move to Orlando and getting my next job lined up would be. It all settled down these last few days, and I've been looking to start getting this in shape to be divvied up appropriately into tractable issues under its own project. I'd honestly love to have someone else take the helm with me in coordinating this.

For starters, I thought we could break down grabbing the links and other text data from each entity in the form of Jupyter Notebooks. I saw this had some reasonable success over in internal-displacement.

To prioritize the sites for starting on this, I was thinking of taking the whitelist of journalistic entities and similar sources from over at /r/politics, generating a CSV, then running a script over the subreddit itself to determine the most commonly linked sources (a rough sketch of this follows below). These could be our starting point, and we can bring on other entities as they become leaves in our expanding reference trees. Aside from this, Reddit and Twitter integration seem like pretty tractable problems given the available APIs.

I believe the only really difficult part of this, apart from the breadth of the problem itself, is going to be determining what constitutes a true call to a journalistic reference in English text. Some links aren't necessarily references so much as they are definitions, and some references/links don't necessarily support the body of the argument. Figuring out a tractable scope for this is going to be our biggest point of discussion. I believe we can incorporate part-of-speech and primary-subject tagging to some extent in whittling our search down.

What are your thoughts? I'd also be down to talk this over on Hangouts sometime soon as well, as I mentioned on the Slack.
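That frequency count could look something like this, assuming PRAW for Reddit API access (the credentials and submission limit are placeholders):

```python
from collections import Counter
from urllib.parse import urlparse

import praw

# Placeholder credentials; register an app at reddit.com/prefs/apps
reddit = praw.Reddit(client_id="YOUR_ID",
                     client_secret="YOUR_SECRET",
                     user_agent="argus-source-counter")

domains = Counter()
for submission in reddit.subreddit("politics").top(limit=1000):
    if not submission.is_self:  # skip text posts; we want outbound links
        domains[urlparse(submission.url).netloc] += 1

# Most commonly linked sources, as a starting ordering for the whitelist
for domain, count in domains.most_common(25):
    print(f"{count:5d}  {domain}")
```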
@josephpd3 I'm interested in this topic in general--especially as it pertains to the trustworthiness of a news article. It looks like work stalled, but I wanted to see whether similar work is being done elsewhere.
Proposed itinerary at bottom :)
I realized my last description on the Slack left a bit to be desired, so I wanted to flesh it out:
What I'm proposing is a media citation and reference crawler which can produce reference trees for analysis and determining the strength of a source (with respect to how well it backs itself up with citations, at least).
Let's say, for instance, that you take a Washington Post article...
You would then grab only the body content of the article itself with a web scraper and grab every `<a href="...">` tag from it. You could also save the context of the tag--all the paragraph text surrounding it--tagging the words and content used to frame the reference as the source/description/lead-up/etc. This could be done with something like Python's lxml package and a little tree traversal, but let's forget those implementation details for now.
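(Implementation details aside, here's a minimal sketch of that link-and-context grab with lxml; the `//article` container is an assumption, and every site will differ.)

```python
import lxml.html
import requests

# Placeholder URL; each site needs its own body selector
url = "https://www.washingtonpost.com/politics/some-article/"
tree = lxml.html.fromstring(requests.get(url).text)

references = []
# "//article" is an assumed container element; real pages will vary
for body in tree.xpath("//article"):
    for anchor in body.xpath(".//a[@href]"):
        paragraph = anchor.getparent()
        references.append({
            "href": anchor.get("href"),
            "link_text": anchor.text_content(),
            # The surrounding paragraph text is the framing context
            "context": paragraph.text_content(),
        })

for ref in references[:5]:
    print(ref["href"], "|", ref["context"][:80])
```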
Imagine that this article itself is a node in a larger n-ary tree. Its children and parents are tweets, articles on other sites, government releases, comments and text posts on Reddit, and maybe some other internal articles from Washington Post--all the way down to collected transcripts. Let's call these `media nodes` for reference. All the articles and their references out there are just hanging out in one big, generally acyclic graph.

You could start from any of these potential media nodes and build a tree of sources from a given root media node. You could even allow a user to submit an article, tweet, post, or whatever on a web frontend to generate a tree of a certain (probably limited) depth which they can have visualized.
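A sketch of what a media node and a depth-limited tree build might look like. All the names here are illustrative assumptions, and `extract_references` is a hypothetical hook (it would wrap the lxml logic above), not settled design:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

def extract_references(url: str) -> List[Tuple[str, str]]:
    """Hypothetical scraper hook: return (url, node_type) pairs for
    each outbound citation found at `url`."""
    return []  # stub so the sketch runs

@dataclass
class MediaNode:
    """One article/tweet/post/transcript in the citation graph."""
    url: str
    node_type: str                      # e.g. "article", "tweet", "reddit_post"
    children: List["MediaNode"] = field(default_factory=list)

def build_tree(url: str, node_type: str,
               depth: int = 0, max_depth: int = 3) -> MediaNode:
    """Expand references recursively down to a limited depth."""
    node = MediaNode(url=url, node_type=node_type)
    if depth < max_depth:
        for child_url, child_type in extract_references(url):
            node.children.append(
                build_tree(child_url, child_type, depth + 1, max_depth))
    return node

root = build_tree("https://example.com/article", "article")
```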
Scaled up, this could also permit analysis of citations from bodies of sources themselves. How often does WaPo (or other media entities) cite external sources? How often does WaPo cite themselves as an entity? What can be said for the authors of articles and comments? What can be said for users on Twitter? Do sources from certain entities tend to fall back on government documents and corroborated sources, tweets from the horse's mouth, or just good-old-fashioned hot air two levels down? How deep does a certain rabbit hole go?

The context of references can be used for both pruning and natural language analysis in the context of research as well.
You could store trees in a growing document database, as well as in specialized graph databases like Neo4j and text-search databases like ElasticSearch.
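For the Neo4j side, a hedged sketch with the official neo4j Python driver (connection details are placeholders, and the `MediaNode` label and `CITES` relationship are assumptions, not a settled schema):

```python
from neo4j import GraphDatabase

# Placeholder connection details
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

def record_citation(tx, citing_url, cited_url):
    # MERGE creates nodes/relationships only if they don't exist yet,
    # so re-crawling the same citation won't duplicate data
    tx.run(
        "MERGE (a:MediaNode {url: $citing}) "
        "MERGE (b:MediaNode {url: $cited}) "
        "MERGE (a)-[:CITES]->(b)",
        citing=citing_url, cited=cited_url)

with driver.session() as session:
    session.write_transaction(record_citation,
                              "https://example.com/article-1",
                              "https://example.com/source-doc")
driver.close()
```

One nice property of the MERGE approach is idempotence: repeated crawls of the same pages just converge on the same graph.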
The main catch I see with this is how to work with each specific site. HTML traversal logic can be generified to some extent, but utilities and crawlers for each variety of media node will likely be necessary at some level (one possible shape for these is sketched below). Wrappers for Reddit and Twitter could be useful as well. The silver lining with respect to non-API sites is that citations/references seem to just hang out in `<a>` tags in the middle of `<p>`, `<span>`, `<em>`, and other tags with text content.

I'd like this to grow to include even media such as Breitbart, Sputnik, and other intensely polarized sites and sources. Scrutiny doesn't need to see party lines or extremes [unless you want to prune or tag those branches (; ].
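As promised above, one possible shape for those per-site utilities: a generic extractor with per-domain overrides in a small registry. The classes and selectors here are illustrative assumptions, not a committed design:

```python
from urllib.parse import urlparse

class BaseExtractor:
    """Generic fallback: pull href values from anywhere in the body."""
    body_xpath = "//body"

    def hrefs(self, tree):
        # `tree` is an lxml.html element, as in the earlier sketch
        return tree.xpath(f"{self.body_xpath}//a[@href]/@href")

class WaPoExtractor(BaseExtractor):
    # Assumed container; the real selector would need checking per site
    body_xpath = "//article"

# Per-domain registry; unknown domains fall back to the generic logic
EXTRACTORS = {"www.washingtonpost.com": WaPoExtractor()}

def extractor_for(url: str) -> BaseExtractor:
    return EXTRACTORS.get(urlparse(url).netloc, BaseExtractor())
```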
I imagine this project could also impact existing D4D projects such as assemble and the collaboration with propublica if implemented with a well-documented web API.
Proposed Itinerary for Base API:

- `<a>` tag farming (this will likely evolve through each phase with the media node varieties)