Media Citation Crawler and Tree Generator #8

Open
josephpd3 opened this issue Jul 21, 2017 · 9 comments

@josephpd3

josephpd3 commented Jul 21, 2017

Proposed itinerary at bottom :)

I realized my last description on the Slack left a bit to be desired, so I wanted to flesh it out:

What I'm proposing is a media citation and reference crawler which can produce reference trees for analysis and determining the strength of a source (with respect to how well it backs itself up with citations, at least).

Let's say, for instance, that you take a Washington Post article...

You would then grab only the body content of the article itself with a web scraper and grab every <a href="..."> tag from it. You could also save the context of the tag--all the paragraph text surrounding it--tagging the words and content used to frame the reference as the source/description/lead-up/etc. This could be done with something like Python's lxml package and a little tree traversal, but let's forget those implementation details for now.
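For anyone who does want a feel for the implementation, here is a minimal sketch of the link-plus-context idea. It swaps in the standard library's html.parser in place of lxml so it runs without extra dependencies. The names LinkContextParser, extract_links, and BLOCK_TAGS are made up for illustration, and treating p, li, and blockquote as the "context" of a link is an assumption, not a settled design.

```python
from html.parser import HTMLParser

# Assumption: these block-level tags define the "context" around a link.
BLOCK_TAGS = {"p", "li", "blockquote"}


class LinkContextParser(HTMLParser):
    """Collect href, anchor text, and surrounding block text for each <a> tag.

    Sketch only: it assumes block tags are explicitly closed.
    """

    def __init__(self):
        super().__init__()
        self.links = []        # finished records
        self._block_text = []  # text pieces of the current block
        self._pending = []     # links seen inside the current block
        self._anchor = None    # link currently being read, if any
        self._depth = 0        # nesting depth inside block tags

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._depth += 1
        elif tag == "a" and self._depth > 0:
            href = dict(attrs).get("href")
            if href:
                self._anchor = {"href": href, "text": ""}

    def handle_data(self, data):
        if self._depth > 0:
            self._block_text.append(data)
            if self._anchor is not None:
                self._anchor["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor is not None:
            self._pending.append(self._anchor)
            self._anchor = None
        elif tag in BLOCK_TAGS and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse whitespace and attach the block text to each link.
                context = " ".join("".join(self._block_text).split())
                for link in self._pending:
                    link["context"] = context
                    self.links.append(link)
                self._block_text, self._pending = [], []


def extract_links(html):
    """Return a list of {'href', 'text', 'context'} dicts for links in blocks."""
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Links that sit outside any block tag (navigation bars, footers) are skipped on purpose, which is a crude way of keeping only links from the article body.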

Imagine that this article itself is a node in a larger n-ary tree. Its children and parents are tweets, articles on other sites, government releases, comments and text posts on Reddit, and maybe some other internal articles from Washington Post--all the way down to collected transcripts. Let's call these media nodes for reference. All the articles and their references out there are just hanging out in one big, generally acyclic graph.

You could start from any of these potential media nodes and build a tree of sources from a given root media node. You could even allow a user to submit an article, tweet, post, or whatever on a web frontend to generate a tree of a certain (probably limited) depth which they can have visualized.
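A rough sketch of the depth-limited tree building described here, under some assumptions: MediaNode and build_tree are hypothetical names, and fetch_links stands in for whatever scraper or API wrapper returns the URLs a given media node cites. A seen set keeps the result a tree even where the underlying graph has cycles or shared sources.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MediaNode:
    url: str
    children: list = field(default_factory=list)


def build_tree(root_url, fetch_links, max_depth=2):
    """Expand references breadth-first from root_url, at most max_depth levels.

    fetch_links is a caller-supplied (hypothetical) function mapping a URL
    to the list of URLs it cites.
    """
    seen = {root_url}  # guards against cycles and repeated sources
    root = MediaNode(root_url)
    frontier = deque([(root, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth >= max_depth:
            continue  # do not expand nodes at the depth limit
        for link in fetch_links(node.url):
            if link in seen:
                continue
            seen.add(link)
            child = MediaNode(link)
            node.children.append(child)
            frontier.append((child, depth + 1))
    return root
```

With this design each source appears once, under the first node that cites it. Keeping every citation edge instead would mean storing a graph rather than a tree.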

Scaled up, this could also permit analysis of citations across whole bodies of sources. How often does WaPo (or any other media entity) cite external sources? How often does WaPo cite itself? What can be said for the authors of articles and comments? What can be said for users on Twitter? Do sources from certain entities tend to fall back on government documents and corroborated sources, tweets from the horse's mouth, or just good-old-fashioned hot air two levels down? How deep does a certain rabbit hole go?
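As one concrete example of such a question, the self-citation rate of a single article could be computed roughly as below. This is a sketch under assumptions: site_of and self_citation_rate are hypothetical names, and using the host name (minus a leading "www.") as a stand-in for the "entity" is a simplification that would miss subdomains and link shorteners.

```python
from urllib.parse import urlparse


def site_of(url):
    """Return the lower-cased host without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def self_citation_rate(article_url, cited_urls):
    """Fraction of absolute cited_urls that point back to the article's own site."""
    # Relative links have no host and are ignored here.
    cited_sites = [site_of(u) for u in cited_urls if site_of(u)]
    if not cited_sites:
        return 0.0
    own = site_of(article_url)
    return sum(site == own for site in cited_sites) / len(cited_sites)
```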

The saved context around each reference can also be used to prune branches and to feed natural language analysis for research.

You could store trees in a growing document database as well as specialized graph databases like Neo4j and text-search databases like Elasticsearch.

The main catch I see with this is how to work with each specific site. HTML traversal logic can be generalized to some extent, but utilities and crawlers for each variety of media node will likely be necessary at some level. Wrappers for Reddit and Twitter could be useful as well. The silver lining with respect to non-API sites is that citations/references seem to just hang out in <a> tags in the middle of <p>, <span>, <em>, and other tags with text content.
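One way the mix of generic and site-specific logic could be organized is a small registry keyed by host name, with a generic fallback. This is only a sketch: EXTRACTORS, extractor_for, and the placeholder extractor bodies are hypothetical, and the real extractors would do the scraping or API calls.

```python
from urllib.parse import urlparse

EXTRACTORS = {}  # host name -> site-specific extractor function


def extractor_for(*hosts):
    """Decorator that registers an extractor for one or more host names."""
    def register(func):
        for host in hosts:
            EXTRACTORS[host] = func
        return func
    return register


def generic_extractor(url):
    # Placeholder: generic HTML traversal would go here.
    return {"kind": "generic", "url": url}


@extractor_for("reddit.com", "old.reddit.com")
def reddit_extractor(url):
    # Placeholder: a Reddit API wrapper would go here.
    return {"kind": "reddit", "url": url}


def extract(url):
    """Dispatch to a site-specific extractor, falling back to the generic one."""
    host = urlparse(url).netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return EXTRACTORS.get(host, generic_extractor)(url)
```

New media node varieties could then be added one at a time without touching the dispatch code.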

I'd like this to grow to include even media such as Breitbart, Sputnik, and other intensely polarized sites and sources. Scrutiny doesn't need to see party lines or extremes [unless you want to prune or tag those branches (; ].

I imagine this project could also impact existing D4D projects such as Assemble and the collaboration with ProPublica if implemented with a well-documented web API.

Proposed Itinerary for Base API:


  1. Build base web scraper
  2. Write logic for generic HTML tree traversal and <a> tag farming (this will likely evolve through each phase with the media node varieties)
  3. Write logic for scraping mainstream media nodes
  4. Write logic for working with Reddit, Twitter, and other APIs as media nodes
  5. Write logic for progressively less and less mainstream media nodes--opening the floor to each as an issue and eventually PR that can be integrated

A frontend and database solution can begin happening once the API and node structure are reasonably solidified, respectively. This will also likely evolve as the project grows.


@josephpd3
Author

Oh, and the SO and I thought of the name Argus--after Argus Panoptes, a many-eyed giant from Greek mythology who kept watch over his charges with the utmost scrutiny.

> And set a watcher upon her, great and strong Argus, who with four eyes looks every way. And the goddess stirred in him unwearying strength: sleep never fell upon his eyes; but he kept sure watch always.

I realize that four doesn't seem like many, but apparently the count varies.

> According to Ovid, to commemorate her faithful watchman, Hera had the hundred eyes of Argus preserved forever, in a peacock's tail.

@dwillis

dwillis commented Jul 22, 2017

FWIW, I love this idea. You should be aware of a great Python 3 library called newspaper that is really good at extracting the text from news site pages. I've used it on occasion.

Also, if you happen to be focusing on the Washington Post (at least as part of this), you should be aware that the Post has what amounts to an undocumented article metadata API. If you take a typical article and change the story.html part to json.html (yes, I know) you'll get a JSON representation of the article metadata.

It used to have full text, which is insane, but I think me publicizing that caused them to drop it.
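The story.html to json.html swap can be sketched as a small helper. The function name metadata_url is hypothetical, the sketch makes no network request, and whether the endpoint still responds or what fields it returns is not something this thread confirms.

```python
def metadata_url(article_url):
    """Swap a trailing 'story.html' for 'json.html', leaving any query string alone."""
    base, sep, query = article_url.partition("?")
    if not base.endswith("story.html"):
        raise ValueError("expected a URL whose path ends in story.html")
    return base[: -len("story.html")] + "json.html" + sep + query
```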

@amtias

amtias commented Jul 22, 2017

This is a great idea and I'd love to contribute.
I could help to build a scoring algorithm based on the scraped data.

By the way, you should be aware that Argus is also the name of a company that collects and sells US credit card usage information, so it's not necessarily the most appropriate name 😊

@georgerichardson

Just to lend some inspiration from the internal-displacement project, it could also be interesting and useful to use some NLP to also extract text that isn't a hyperlink, but does reference some outside source and then see if there is a matching source in your database. Maybe that's a bit of a phase 2 idea.

@amtias

amtias commented Jul 24, 2017

> Just to lend some inspiration from the internal-displacement project, it could also be interesting and useful to use some NLP to also extract text that isn't a hyperlink, but does reference some outside source and then see if there is a matching source in your database. Maybe that's a bit of a phase 2 idea.

That's an interesting thought and probably not too difficult to implement.
Though I do agree that it should be more of a phase 2 goal, as it would require a persistent document database to verify the claims made in the text, while the basic scraping of hyperlinks can work without one. The alternative would be to use some API to search Google for a matching document, but that would also add significant work.

@josephpd3
Author

@dwillis - I was told about newspaper by a couple of other D4D developers as well. It doesn't seem fantastic for the link extraction, but I love the full-text extraction features. I think a combination of that and lxml may be just what we need.

It'll definitely be great to take a closer look at that WaPo data too. I haven't considered leveraging article metadata, and a combination of sites which publish it well and newspaper seems like it could really help with the analysis stage.

@amtias Aww, well there's nothing wrong with media-crawler for lack of any fancy names that may not have...interesting connotations 😄 What kind of scoring algorithm comes to mind for this?

@georgerichardson - That would be awesome! I'm fairly new to NLP myself, so I'm looking forward to what can be learned from the really successful projects like internal-displacement. Are there some examples of tagging and extraction we could look at there to see what could be done in this context?


I'm really happy to see that other developers here like the idea!! I'm going to ride the motivation you're all giving me and try to get a GitHub project and a more clearly delineated working plan going over the next couple of days so we can bring in everyone who wants to be involved.

@amtias

amtias commented Aug 9, 2017

Seems like after the initial enthusiasm life caught up with everyone and it got kinda silent around here :)

If we want to get this done, we might need to break it down to clear tasks that we can divvy up between interested members.
Since this is your project, I'll let you do the honors, but I'm willing to help organize a bit if you wish. (as long as life doesn't catch up again until I get around to doing it :P )

@josephpd3
Author

Hey @amtias! That's an on-point assumption. I really underestimated how busy my move to Orlando and getting my next job lined up would be. It all settled down these last few days, and I've been looking to start getting this in shape to be divvied up appropriately into tractable issues under its own project.

I'd honestly love to have someone else take the helm with me in coordinating this.

For starters, I thought we could break down grabbing the links and other text data from each entity in the form of Jupyter Notebooks. I saw this had some reasonable success over in Assemble.

To prioritize the sites for starting on this, I was thinking of taking the whitelist of journalistic entities and similar sources from over at /r/politics, generating a CSV, then running a script over the subreddit itself to determine the most commonly linked sources. These could be our starting point, and we can bring on other entities as they become leaves in our expanding reference trees.
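The counting step could look roughly like this. It is a sketch under assumptions: load_whitelist and rank_sources are hypothetical names, the CSV is assumed to hold one domain per row in its first column, and the submission URLs are passed in as a plain list rather than pulled from the Reddit API.

```python
import csv
import io
from collections import Counter
from urllib.parse import urlparse


def load_whitelist(csv_text):
    """Read one domain per row from the first column of a CSV."""
    return {row[0].strip().lower() for row in csv.reader(io.StringIO(csv_text)) if row}


def rank_sources(submission_urls, whitelist):
    """Count submissions per whitelisted domain, most common first."""
    counts = Counter()
    for url in submission_urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host in whitelist:
            counts[host] += 1
    return counts.most_common()
```

The top few domains from rank_sources would then be the first candidates for site-specific notebooks.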

Aside from this, Reddit and Twitter integration seem like pretty tractable problems given the available APIs.

I believe the only really difficult part of this, apart from the breadth of the problem itself, is going to be determining what constitutes a true call to a journalistic reference in English text. Some links aren't necessarily references so much as they are definitions, and some references/links don't necessarily support the body of the argument. Figuring out a tractable scope for this is going to be our biggest point of discussion. I believe we can incorporate part-of-speech and primary subject tagging to some extent in whittling our search down.

What are your thoughts?

I'd also be down to talk this over on Hangouts sometime soon, as I mentioned in general chat two weeks ago.

@aullrich2013

@josephpd3 I'm interested in this topic in general--especially as it pertains to trustworthiness of a news article. It looks like work stalled but I wanted to see whether similar work is being done elsewhere.
